When Python web crawlers come up, do you think of requests plus bs4, or of Scrapy? There is another crawler library that has collected 3k+ stars on GitHub: MechanicalSoup.


This article covers the package from three angles:

  • What MechanicalSoup's characteristics are

  • Which scenarios MechanicalSoup suits

  • How the MechanicalSoup workflow looks in code

Introducing MechanicalSoup

MechanicalSoup is a Python library that does more than an ordinary crawler package: besides crawling data from a website, it can interact with the website automatically through simple commands. Under the hood it uses BeautifulSoup (that is, bs4) and the requests library, so if you already know those two libraries it will be easy to pick up.

So if your work requires constant interaction with a website, such as clicking a button or filling out a form, MechanicalSoup will be very useful. Next, let's see in code how this crawler package works.

Installing MechanicalSoup

# Install the released version directly
pip install mechanicalsoup
# Download and install the development version from GitHub
pip install git+

MechanicalSoup in code

We will use two cases to show how MechanicalSoup handles fetching web content and interacting with a website. The first one is crawling the hot posts on Hupu.

Open the home page of the Hupu community and you will see several posts with red titles. We want to crawl the titles of these posts and save them. First, create a browser instance:

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()

Now we open the Hupu BBS site in the browser instance by passing its URL to browser.open(). A response of 200 means the visit succeeded:

<Response [200]>

Our browser instance is now on the Hupu BBS home page. Next we need the list of posts on this page. This part is a little tricky, because we have to find an attribute that uniquely identifies the tag containing the list. With the developer tools of a browser such as Chrome, though, it is easy to do:

Inspecting the element shows that the posts sit in a ul tag, and that this ul is inside a div whose class is list. Looking further, the hot posts carry class="red", so we can use bs4-style lookups to find the titles we need.

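result = browser.get_current_page().find('div', class_="list")
result = list(result.find('ul'))
# Drop the newline strings between the list items
bbs_list = [item for item in result if item != '\n']
# Collect the red (hot) title tags; the loop body and the name hot_titles are reconstructed
hot_titles = []
for i in bbs_list:
    red = i.find(class_="red")
    if red is not None:
        hot_titles.append(red)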


Looking at the result, the titles have been saved successfully, still wrapped in their tags. To get the bare title, just read each tag's .text attribute.

[<span class="red">Heartbreaking: Jennifer Hudson sang before the game, For Kobe and Gigi</span>,
<span class="red">[Subtitled] Magic Johnson pays tribute to Kobe and Stern: Stern was my savior, and there will never be another player like Kobe</span>,
<span class="red">[Gossip board] Mambas Forever! If January 27 could come again, maybe ......</span>,
<span class="red">No matter who wins today, mamba never out</span>,
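The lookup described above can be tried without touching the live site. This is a minimal sketch using bs4 alone; the HTML snippet and the helper name `extract_hot_titles` are made up for illustration and only mimic the structure the article describes (a div with class list wrapping a ul whose hot posts use class red).

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described in the article
SAMPLE_HTML = """
<div class="list">
  <ul>
    <li><span class="red">Hot post one</span></li>
    <li><span>Ordinary post</span></li>
    <li><span class="red">Hot post two</span></li>
  </ul>
</div>
"""


def extract_hot_titles(html):
    """Return the text of every class="red" tag inside div.list > ul."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="list")
    red_tags = container.find("ul").find_all(class_="red")
    # .text strips the surrounding tag and leaves only the title
    return [tag.text.strip() for tag in red_tags]


titles = extract_hot_titles(SAMPLE_HTML)
print(titles)  # ['Hot post one', 'Hot post two']
```

Using `find_all(class_="red")` skips the ordinary posts in one step, so no separate filtering of newline strings or missing tags is needed.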


The next example shows how mechanicalsoup interacts with a website. This time we pick a simpler task: using mechanicalsoup to run a Baidu search.

As before, we first create a browser instance and open the Baidu home page.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Then open the Baidu home page by passing its URL to browser.open()

<Response [200]>

Once the response comes back successfully, we select the form we need to submit with browser.select_form() and print a summary of its fields with the form's print_summary() method:

<input name="bdorz_come" type="hidden" value="1"/>

<input name="ie" type="hidden" value="utf-8"/>

<input name="f" type="hidden" value="8"/>

<input name="rsv_bp" type="hidden" value="1"/>

<input name="rsv_idx" type="hidden" value="1"/>

<input name="tn" type="hidden" value="baidu"/>

<input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>

<input autofocus="" class="bg s_btn" id="su" type="submit" value=" use Baidu Search "/>

The field we need to fill in is the one on the second-to-last line, the input named wd, so we can set it as follows:

browser["wd"] = ' Get up early python'

Then browser.launch_browser() opens a local page with the same content as the original page, with the form filled in using the values we provided.

You can see that the search box now contains the text to search for. Next, just let the browser instance we created click the button for us by running:

browser.submit_selected()
<Response [200]>

A return value of 200 means the request succeeded, and with that one simulated click is done. You can then view the content of the returned page through browser.get_current_page().


The two examples above are simple, but they show the basic routine of working with mechanicalsoup: first create a browser instance, then use that browser to carry out the operations you want. You can even open a local page to preview the form contents before you submit them. So go ahead and give it a try!

