Quick guide: how to create Python-based crawlers

Chen Python 2020-11-16 20:32:13

The use of web scraping is growing rapidly, especially in large e-commerce companies, where it is a way to collect data on competitors, analyze rivals, and research new products. Web scraping is a way to extract information from a website. In this article, you will learn how to create a Python-based scraper and dig into the code to see how it works.


In today's big-data world, it is hard to keep track of everything that is going on, and the situation is even more complicated for companies that need large amounts of information to succeed. But first of all, they have to collect this data somehow, which means working with thousands of sources.

There are two ways to collect data. You can use the API services that many websites provide; this is the best way to get, say, all the news from a site, and APIs are very easy to use. Unfortunately, not every website offers this service. That leaves the second way: web scraping.

What is web scraping?

It is a technique for extracting information from a website. An HTML page is just a collection of nested tags. The tags form a tree whose root is the <html> tag, and they divide the page into logical parts. Each tag can have its own descendants (children) and a parent.

For example, an HTML page tree can look like this:
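The original article shows a figure of such a tree here. As a stand-in, here is a minimal sketch of the parent/child structure using a trivial hand-written page (my own example, not the article's figure) and Beautiful Soup:

```python
from bs4 import BeautifulSoup

# A tiny illustrative page: html -> body -> div -> p
html = "<html><body><div><p>Hello</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
print(p.parent.name)                                  # div (the parent of <p>)
print([child.name for child in soup.body.children])   # ['div'] (children of <body>)
```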

To process this HTML, you can work with the text or with the tree. Walking this tree is web scraping: among all this variety, we find only the nodes we need and extract the information from them. This approach turns unstructured HTML data into structured, easy-to-use information for a database or spreadsheet. Scraping requires a bot that collects the information and connects to the Internet over HTTP or through a web browser. In this guide, we will build a scraper with Python.

What we need to do:

  •   Get the URL of the page we want to scrape data from
  •   Copy or download the HTML content of that page
  •   Process the HTML content and extract the required data

This sequence lets us open the required URL, obtain the HTML data, and then process it to extract the data we need. Sometimes, though, we first have to sign in to the website and then navigate to a specific address to reach the data. In that case, we add one more step: logging in to the site.
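The three steps above can be sketched with only the standard library. The naive `<title>` extraction below is just an illustration of "process the HTML"; the rest of the article uses Beautiful Soup for that step instead:

```python
from urllib.request import urlopen

def fetch_html(url):
    # Steps 1-2: open the URL and download the page's HTML content
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_title(html):
    # Step 3: process the HTML and pull out one piece of data (the <title>)
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end]

# Offline demonstration of step 3 on a hand-written page:
print(extract_title("<html><head><title>Demo</title></head></html>"))  # Demo
```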


We will use the Beautiful Soup library to analyze the HTML content and extract all the necessary data. It is an excellent Python package for scraping HTML and XML documents.

The Selenium library will help the crawler sign in to the website and navigate to the required URL. Selenium with Python lets you perform actions such as clicking buttons and entering text.

Let's delve into the code

First, let's import the libraries we are going to use.

  # Import the libraries
  from selenium import webdriver
  from bs4 import BeautifulSoup

Then we need to tell Selenium's browser driver how to start a web browser (we will use Google Chrome here). If we don't want the bot to display the browser's graphical interface, we add the "headless" option to Selenium.

A web browser without a graphical interface (a headless browser) can automatically manage web pages in an environment very similar to that of all the popular browsers, but in this case all activity is driven through a command-line interface or over network communication.

  # Path to the chromedriver executable
  chromedriver = '/usr/local/bin/chromedriver'
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')  # open a headless browser
  browser = webdriver.Chrome(executable_path=chromedriver,
                             chrome_options=options)

With the browser set up, the libraries installed, and the environment created, we can start working with the HTML. Let's go to the login page and find the identifiers, classes, or field names of the elements where the user must enter an email address and a password.

  # Go to the login page
  browser.get('http://playsports365.com/default.aspx')
  # Search for the tags by name
  email = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_UserName')
  password = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_Password')
  login = browser.find_element_by_name('ctl00$MainContent$ctlLogin$BtnSubmit')

Then we send the login credentials into these HTML tags and press the action button to submit the data to the server.

  # Add the login credentials
  email.send_keys('********')
  password.send_keys('*******')
  # Click the submit button
  login.click()

After signing in successfully, we go to the required page and collect its HTML content.

  # After a successful login, go to the "OpenBets" page
  browser.get('http://playsports365.com/wager/OpenBets.aspx')
  # Get the HTML content
  requiredHtml = browser.page_source

Now that we have the HTML content, the only thing left is to process the data. We will do this with the help of the Beautiful Soup and html5lib libraries.

html5lib is a Python package that implements the HTML5 parsing algorithm, heavily influenced by how modern web browsers parse pages. Once the content has a normalized structure, you can search for data in any child element of an HTML tag. The information we are looking for sits in a table tag, so that is what we look for.

  soup = BeautifulSoup(requiredHtml, 'html5lib')
  table = soup.findChildren('table')
  my_table = table[0]

We find the parent tag once, then recursively iterate over its child tags and print out the values.

  # Receive the tags and print the values
  rows = my_table.findChildren(['th', 'tr'])
  for row in rows:
      cells = row.findChildren('td')
      for cell in cells:
          value = cell.text
          print(value)
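The same traversal can be tried offline on a small hand-written table (a made-up example; it uses the built-in html.parser so no network or extra parser is needed):

```python
from bs4 import BeautifulSoup

# A tiny illustrative table, stand-in for the page's real HTML
html = """
<table>
  <tr><th>Team</th><th>Odds</th></tr>
  <tr><td>Alpha</td><td>2.10</td></tr>
  <tr><td>Beta</td><td>1.85</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
my_table = soup.findChildren("table")[0]

# Collect every <td> value, row by row (header <th> cells are skipped)
values = []
for row in my_table.findChildren("tr"):
    for cell in row.findChildren("td"):
        values.append(cell.text)

print(values)  # ['Alpha', '2.10', 'Beta', '1.85']
```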

To run this program, you will need to install selenium, beautifulsoup4, and html5lib with pip. Once the libraries are installed, run the script:

  python <program name>

The values will be printed to the console, and that is how you scrape any website.

If we scrape a site that is constantly updated (for example, a sports scoreboard), we should create a cron task to start the program at specific intervals.
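For example, a crontab entry that runs the scraper every 30 minutes might look like this (the interpreter and script paths are placeholders to adapt to your own setup):

```
# minute hour day-of-month month day-of-week command
*/30 * * * * /usr/bin/python3 /home/user/scraper.py >> /home/user/scraper.log 2>&1
```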

So far so good: everything works, the content is scraped, and the data is collected. There is just one more thing to worry about: the number of requests we need to make to get the data.

Sometimes the server gets tired of the same person making a pile of requests and bans it. Unfortunately, servers' patience is limited.

In that case, you have to disguise yourself. The most common sign of a ban is a 403 error, triggered by frequent requests sent from a blocked IP: the server is available and able to process the request but refuses to do so for its own reasons. The first problem can be solved by disguising ourselves with fake user agents; a library such as fake-useragent can generate a random combination of operating system and browser to pass along with each request. In most cases, this is a good way to keep collecting the information you are interested in.
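A minimal sketch of the idea, assuming a small hand-picked pool of User-Agent strings (a library like fake-useragent automates this with a much larger pool):

```python
import random

# A few illustrative User-Agent strings; a real pool would be much larger
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0",
]

def random_user_agent():
    # Pick a random OS/browser combination for each session
    return random.choice(USER_AGENTS)

# With Selenium, the string could then be set on the Chrome options, e.g.:
# options.add_argument('user-agent=' + random_user_agent())
print(random_user_agent())
```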

But sometimes just putting time.sleep() in the right places and filling in the request headers is not enough, so you need more robust ways to change your IP. To scrape large amounts of data, you can:

– develop your own IP address infrastructure;

– use Tor (several long articles could be devoted to this topic, and indeed have been);

– use a commercial proxy network.
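With the requests library, for instance, routing traffic through such a proxy comes down to a per-scheme mapping (the endpoint and credentials below are placeholders, not a real provider):

```python
# Placeholder proxy endpoint and credentials; substitute your provider's values
PROXY = "http://username:password@proxy.example.com:8080"

proxies = {
    "http": PROXY,
    "https": PROXY,
}

# A real request would then be routed through the proxy, e.g.:
# import requests
# resp = requests.get("http://playsports365.com/", proxies=proxies, timeout=10)
print(sorted(proxies))  # ['http', 'https']
```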

For web-scraping beginners, the best option is to contact a proxy provider, such as Infatica, who can help set up proxies and handle all the difficulties of proxy-server management. Collecting a lot of data takes a lot of resources, so there is no need to "reinvent the wheel" by developing your own internal proxy infrastructure. Even many of the largest e-commerce companies outsource proxy management to proxy network services, because the number one priority for most companies is the data, not proxy management.

This article was created by Chen Python. Please include a link to the original when reposting. Thanks.
