When Python web scraping comes up, do you think of requests plus bs4, or maybe Scrapy? There is a crawler library on GitHub that has quietly collected 3k+ stars: MechanicalSoup.
This article covers the package along the following dimensions:
What makes MechanicalSoup distinctive
Which scenarios MechanicalSoup is suited for
A code walkthrough of the MechanicalSoup workflow
MechanicalSoup is not just an ordinary crawler package that scrapes data from websites: it is a Python library that can also interact with a website automatically through simple commands. Under the hood it uses BeautifulSoup (that is, bs4) and the requests library, so if you are already familiar with those two, it will be easy to pick up.
So if you need to interact with a website repeatedly during development, say clicking a button or filling out a form, MechanicalSoup will serve you well. Next, let's show in code how this handy crawler package works.
Install directly:

pip install mechanicalsoup

Or download and install the development version from GitHub:

pip install git+https://github.com/MechanicalSoup/MechanicalSoup
A code walkthrough of MechanicalSoup
We will walk through two cases showing how MechanicalSoup fetches web content and interacts with a website. Let's look at the first one: crawling the hot posts on Hupu.
Let's open the Hupu community home page first. You can see several posts with red titles; I want to crawl the titles of those posts and save them. The first step is to create a browser instance.
Next we open the Hupu BBS site in that browser instance; a status code of 200 (OK) means the visit succeeded.
Our browser instance is now sitting on the Hupu BBS home page, and we need to get the list of articles on this page. This part is a little tricky, because we have to find an attribute that uniquely identifies the tag containing the article list. With the developer tools of a browser like Chrome, though, this is easy to do:
Inspecting the element, we find the list inside a ul tag, and that ul tag sits inside a div with class "list". Further inspection shows that the hot posts have class="red", so we can use the familiar bs4 calls to find the titles we need.
result = browser.get_current_page().find('div', class_="list")
bbs_list = result.find_all('span', class_="red")
print(bbs_list)
Looking at the result, the titles have been saved successfully, tags included; next we just need .text to pull the title text out.
[<span class="red">Heartbreaking: Jennifer Hudson sang before the game, For Kobe and Gigi</span>,
<span class="red">[Subtitled] Magic Johnson salutes Kobe and Stern: Stern is my savior, and there will never be another player like Kobe</span>,
<span class="red">[Gossip board] Mamba Forever! If January 27 could come again, maybe......</span>,
<span class="red">No matter who wins today, mamba never out</span>,
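Extracting the text with .text can be sketched on a made-up fragment shaped like the Hupu list; the tag names and classes mirror what we found above, but the post titles here are invented:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like Hupu's post list
html = """
<div class="list">
  <ul>
    <li><span class="red">Hot post A</span></li>
    <li><span>Ordinary post</span></li>
    <li><span class="red">Hot post B</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Same search as before: the div with class "list", then the red spans
bbs_list = soup.find("div", class_="list").find_all("span", class_="red")
titles = [span.text for span in bbs_list]
print(titles)  # ['Hot post A', 'Hot post B']
```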
Now for the second example: how MechanicalSoup interacts with a website. This time we pick something simpler and use MechanicalSoup to run a Baidu search.
As before, we first create a browser instance and open the Baidu home page.
Once the response comes back successfully, we extract the form we need to submit:
<input name="bdorz_come" type="hidden" value="1"/>
<input name="ie" type="hidden" value="utf-8"/>
<input name="f" type="hidden" value="8"/>
<input name="rsv_bp" type="hidden" value="1"/>
<input name="rsv_idx" type="hidden" value="1"/>
<input name="tn" type="hidden" value="baidu"/>
<input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
<input autofocus="" class="bg s_btn" id="su" type="submit" value="Baidu Search"/>
As you can see, the field to fill is on the penultimate line, the input named wd, so we can fill it in as follows:
browser["wd"] = ' Get up early python'
MechanicalSoup can also open a local page with the same content as the original, with the form filled in using the values we provided.
As you can see, the search box has been filled with the query to search. Next, just let the browser we created do the clicking for us:
A return code of 200 means the response succeeded, and the simulated click is done. Then browser.get_current_page() shows the content of the page that came back!
The two examples above are simple, but they capture MechanicalSoup's basic routine: create a browser instance, then let that browser carry out the operations you want; you can even open a local visualization page to preview the form you are about to submit before actually submitting it! What are we waiting for? Go give it a try!