Do you know the Python crawler library with 3k+ stars on GitHub? Meet MechanicalSoup

osc_ 4qu6doqx 2021-01-23 12:43:47


When Python crawlers come up, do you think of requests, bs4, or Scrapy? There is another crawler library that has quietly collected 3k+ stars on GitHub: MechanicalSoup.


This article covers this crawler package from the following angles:

  • What MechanicalSoup's features are

  • Which scenarios MechanicalSoup suits

  • A code walkthrough of the MechanicalSoup workflow


Introduction to MechanicalSoup

MechanicalSoup is more than an ordinary crawler package that scrapes data from a website: it is a Python library that lets you automate interaction with a website through simple commands. Under the hood it uses BeautifulSoup (that is, bs4) and the requests library, so if you already know those two libraries it will be easy to pick up.

So if your work requires constant interaction with a website, such as clicking buttons or filling out forms, MechanicalSoup will be very useful. Next, let's see how this crawler package works in code.


Installing MechanicalSoup

# Install directly
pip install mechanicalsoup
# Download and install the development version from GitHub
pip install git+https://github.com/MechanicalSoup/MechanicalSoup

MechanicalSoup in code

We will use two cases to show how MechanicalSoup can fetch web content and interact with a website. The first is crawling the hot posts on Hupu.

Let's open the home page of the Hupu community first. Several posts have red titles, and we want to crawl those titles and save them. First, create a browser instance:

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()

Now we open the Hupu bbs site in the browser instance. A response of 200 means the visit succeeded:

browser.open('https://bbs.hupu.com/')
<Response [200]>

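Rather than reading the status off the printed response, a script can check it programmatically. The following is a minimal sketch, not part of the original walkthrough: `ensure_ok` is a hypothetical helper name, and it relies only on the fact that `browser.open()` returns a `requests.Response`.

```python
import requests


def ensure_ok(response: requests.Response) -> requests.Response:
    """Return the response unchanged if it succeeded, otherwise raise.

    Hypothetical helper for illustration; not part of MechanicalSoup.
    """
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
    # and does nothing for successful responses such as 200.
    response.raise_for_status()
    return response
```

With the browser instance created earlier, this could be used as ensure_ok(browser.open('https://bbs.hupu.com/')), so that a failed visit stops the script instead of passing unnoticed.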
Our browser instance is now on the Hupu bbs home page. Next we need to get the list of articles on this page. This part is a little tricky, because we need an attribute that uniquely identifies the tag containing the article list. With the developer tools of a browser such as Chrome, though, this is easy to do:

Inspecting the element shows that the list sits in a ul tag, and that this ul is inside a div whose class is list. Looking further, the hot posts have class="red", so we can use bs4-style lookups to find the titles we need.

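# Locate the <div class="list"> that wraps the post list
result = browser.get_current_page().find('div', class_="list")
# Walk the children of the <ul>, dropping the bare newline strings between tags
bbs_list = [item for item in result.find('ul') if item != '\n']
# For each list item, look for the red (hot) title; find() returns None if there is none
bbs_top = [item.find('span', class_="red") for item in bbs_list]
bbs_top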

The result shows that the titles have been saved together with their tags. The None entries are list items with no red title. For the remaining entries, .text extracts the title.

[<span class="red"> It's heartbreaking , Jennifer - Hudson sang before the game ,For Kobe and Gigi</span>,
<span class="red">[ subtitle ] The magician salutes corbistern : Stern is my Savior , There will never be another player like Kobe </span>,
<span class="red">[ Gossip board ]Mambas Forever! If 1 month 27 The day can come again , Maybe ......</span>,
<span class="red"> No matter who wins today ,mamba never out</span>,
None,
None,
None,
None,
None,
None]
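The .text step can be sketched offline as follows. This is only an illustration under stated assumptions: SAMPLE_HTML is made-up markup that mimics the structure described above (a div with class list, a ul inside it, and red spans for hot posts), not the real Hupu page, and the variable names are my own.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup imitating the structure described in the text;
# the real page is larger and may differ.
SAMPLE_HTML = """
<div class="list">
  <ul>
    <li><span class="red">Hot post A</span></li>
    <li><span class="red">Hot post B</span></li>
    <li><span>Ordinary post</span></li>
  </ul>
</div>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same idea as the walkthrough: one find() per list item, which gives
# None for items that have no red span.
items = page.find("div", class_="list").find("ul").find_all("li")
spans = [li.find("span", class_="red") for li in items]

# Skip the None entries before reading .text, otherwise .text would fail.
titles = [span.text.strip() for span in spans if span is not None]

# Alternative: a CSS selector returns only the matching spans,
# so there are no None entries to filter out.
css_titles = [span.text.strip() for span in page.select("div.list ul span.red")]

print(titles)
```

Because browser.get_current_page() returns a BeautifulSoup object, the same calls should work on the page fetched by the browser instance.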


Now for the next example: how MechanicalSoup interacts with a website. This time we pick something simpler and use MechanicalSoup to search Baidu.

As before, we first create a browser instance and open the Baidu home page.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()

browser.open('https://www.baidu.com/')

<Response [200]>

Once the response shows success, we select the form we need to submit:

browser.select_form()

browser.get_current_form().print_summary()

<input name="bdorz_come" type="hidden" value="1"/>

<input name="ie" type="hidden" value="utf-8"/>

<input name="f" type="hidden" value="8"/>

<input name="rsv_bp" type="hidden" value="1"/>

<input name="rsv_idx" type="hidden" value="1"/>

<input name="tn" type="hidden" value="baidu"/>

<input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>

<input autofocus="" class="bg s_btn" id="su" type="submit" value=" use Baidu Search "/>

The field to fill in is the one on the second-to-last line, the input named wd, so we can fill it in as follows:

browser["wd"] = ' Get up early python'

Then the following command opens a local copy of the page in your browser, with the form filled in with the values we provided.

browser.launch_browser()


As you can see, the search box already contains the search text. Next, we just let the browser instance we created do the click for us:

browser.submit_selected()
<Response [200]>

A 200 response means the request succeeded, and that completes one simulated click. You can then inspect the returned page with browser.get_current_page()!


Conclusion


Although the two examples above are simple, they show the basic MechanicalSoup routine: first create a browser instance, then use that browser to perform the operations you want. You can even open a local page to preview the form before submitting it. What are you waiting for? Go and give it a try!


Copyright notice
This article was written by [osc_ 4qu6doqx]. Please include a link to the original when reposting. Thank you.
https://pythonmana.com/2021/01/20210123123954590o.html
