The use of web scraping is growing rapidly, especially among large e-commerce companies, where it is a way to gather data, analyze competitors and research new products. Web scraping is a way to extract information from a website. In this article you will learn how to build a scraper in Python and dig into the code to see how it works.
In today's big data world it is hard to keep track of everything that is going on, and the situation is even more complicated for companies that need large amounts of information to succeed. First of all, they need to collect this data somehow, which means dealing with thousands of sources.
There are two ways to collect data. You can use the API services that media websites provide: this is the best way to get all the news, and APIs are very easy to use. Unfortunately, not every website offers such a service. That leaves the second way: web scraping.
What is web scraping?
It is a way to extract information from a website. An HTML page is just a collection of nested tags. The tags form a tree whose root is the <html> tag, and they divide the page into logical sections. Each tag can have its own descendants (children) and a parent.
For example, an HTML page tree might have <html> at the root, with <head> and <body> as its children, and headings, paragraphs, links and other elements nested inside the body.
To process this HTML you can work with the text or with the tree. Walking this tree is web scraping: we find just the nodes we need among all this diversity and extract the information from them. This approach turns unstructured HTML data into structured, easy-to-use information in a database or spreadsheet. Scraping requires a bot that collects the information and connects to the Internet over HTTP or through a web browser. In this guide we will build such a scraper with Python.
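To make the tree idea concrete, here is a minimal sketch (the page content is invented purely for illustration) that parses a tiny HTML document with the Beautiful Soup library introduced below and pulls two nodes out of the tree:
- from bs4 import BeautifulSoup
- # A made-up page: <html> is the root, <head> and <body> are its children, the rest is nested inside
- html_doc = '<html><head><title>Shop</title></head><body><h1>Products</h1><p class="price">19.99</p></body></html>'
- soup = BeautifulSoup(html_doc, 'html.parser')
- print(soup.title.text)                       # -> Shop
- print(soup.find('p', class_='price').text)   # -> 19.99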
What we need to do:
- Get the URL of the page we want to scrape data from
- Copy or download the HTML content of that page
- Process the HTML content and extract the data we need
This sequence lets us request the required URL, obtain the HTML and then process it to extract the data we need. Sometimes, though, we first have to log in to the website and only then navigate to a specific address to collect the data. In that case we have to add one more step: logging in to the site.
The libraries we will use
We will use the Beautiful Soup library to parse the HTML content and extract all the data we need. It is an excellent Python package for scraping HTML and XML documents.
The Selenium library will help the scraper log in to the website and navigate to the required URL. Selenium with Python lets you perform actions such as clicking buttons and entering text.
Let's delve into the code
First, let's import the libraries we are going to use.
- # Import library
- from selenium import webdriver
- from bs4 import BeautifulSoup
Then we need to tell Selenium which browser driver to use to start the web browser (here we will use Google Chrome). If we don't want the bot to display the browser's graphical interface, we add the "headless" option in Selenium.
A web browser with no graphical interface (a headless browser) can automate the management of web pages in an environment very similar to that of all the popular browsers; in this case, though, all activity is driven through a command-line interface or over network communication.
- # Path to the chromedriver executable
- chromedriver = '/usr/local/bin/chromedriver'
- options = webdriver.ChromeOptions()
- options.add_argument('headless')  # open a headless browser
- browser = webdriver.Chrome(executable_path=chromedriver, chrome_options=options)
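The executable_path and chrome_options arguments belong to the Selenium 3 API. If your environment happens to run Selenium 4, a sketch of the equivalent setup (same chromedriver path) looks like this:
- from selenium import webdriver
- from selenium.webdriver.chrome.service import Service
- options = webdriver.ChromeOptions()
- options.add_argument('headless')  # run Chrome without a graphical interface
- browser = webdriver.Chrome(service=Service(chromedriver), options=options)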
With the browser set up, the libraries installed and the environment created, we can start working with the HTML. Let's go to the login page and find the name attributes of the fields where the user has to enter an email address and a password, as well as the name of the submit button.
- # Go to the login page
- browser.get('http://playsports365.com/default.aspx')
- # Find the form fields by their name attribute
- email = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_UserName')
- password = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_Password')
- login = browser.find_element_by_name('ctl00$MainContent$ctlLogin$BtnSubmit')
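Note that find_element_by_name also comes from the Selenium 3 API. If you are on Selenium 4, where these helpers were removed, the equivalent look-ups (a sketch, same element names) use find_element with a By locator:
- from selenium.webdriver.common.by import By
- email = browser.find_element(By.NAME, 'ctl00$MainContent$ctlLogin$_UserName')
- password = browser.find_element(By.NAME, 'ctl00$MainContent$ctlLogin$_Password')
- login = browser.find_element(By.NAME, 'ctl00$MainContent$ctlLogin$BtnSubmit')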
Then we send the login credentials to these HTML tags and press the action button to submit the data to the server.
- # Add login credentials
- email.send_keys('********')
- password.send_keys('*******')
- # Click Submit button
- login.click()
After logging in successfully, we navigate to the required page and collect the HTML content.
- # After a successful login, go to the "OpenBets" page
- browser.get('http://playsports365.com/wager/OpenBets.aspx')
- # Get the HTML content
- requiredHtml = browser.page_source
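One caveat the walkthrough glosses over: if the page builds its table with JavaScript, page_source may be captured before the data has appeared. An optional safeguard, sketched here with Selenium's explicit wait, is to wait for a table element before reading the source:
- from selenium.webdriver.common.by import By
- from selenium.webdriver.support.ui import WebDriverWait
- from selenium.webdriver.support import expected_conditions as EC
- # Wait up to 10 seconds for at least one <table> to be present before grabbing the HTML
- WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
- requiredHtml = browser.page_source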
Now that we have the HTML content, the only thing left is to process the data. We will do this with the help of the Beautiful Soup and html5lib libraries.
html5lib is a Python package that implements the HTML5 parsing algorithm, heavily influenced by the way modern web browsers parse pages. Once the content has been parsed into a standardized structure, you can search for data in any child element of an HTML tag. The information we are looking for sits in a table tag, so that is what we look for.
- soup = BeautifulSoup(requiredHtml, 'html5lib')
- table = soup.findChildren('table')   # all <table> elements on the page
- my_table = table[0]                  # the first table holds the data we need
We find the parent tag once, then recursively iterate over its children and print out the values.
- # Iterate over the rows and print the cell values
- rows = my_table.findChildren(['th', 'tr'])
- for row in rows:
-     cells = row.findChildren('td')
-     for cell in cells:
-         value = cell.text
-         print(value)
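Printing to the console is fine for a quick check, but as noted above the point of scraping is usually to land the data in a spreadsheet or database. As a small optional sketch (reusing the my_table variable from the code above and a made-up output file name), the same cells can be written to a CSV file with Python's standard csv module:
- import csv
- with open('open_bets.csv', 'w', newline='') as f:   # hypothetical output file name
-     writer = csv.writer(f)
-     for row in my_table.findChildren('tr'):
-         cells = [cell.text.strip() for cell in row.findChildren('td')]
-         if cells:                                    # skip header rows without <td> cells
-             writer.writerow(cells)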
To run this program you will need to install selenium, beautifulsoup4 and html5lib with pip. After installing the libraries, launch the scraper from the command line:
- # python <program name>
The values will be printed to the console, and that is how you scrape any website.
If we are scraping a site that is constantly updated (for example, a sports score table), we should create a cron task to start the program at specific intervals.
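For example, a crontab entry along the following lines (the interpreter and script paths are placeholders for your own environment) would run the scraper every 15 minutes and append its output to a log file:
- # crontab -e
- */15 * * * * /usr/bin/python3 /home/user/scraper.py >> /home/user/scraper.log 2>&1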
Great: everything works, the content is scraped, and the data comes in. Apart from that there is only one issue left, and that is the number of requests we need to make to get the data.
Sometimes the server gets tired of the same person making a bunch of requests and blocks them. Unfortunately, even a server's patience is limited.
In that case you have to disguise yourself. The most common result of a ban is a 403 error, which you see when an IP sends frequent requests to a server that has blocked it: the server is available and able to process the request, but for its own "personal" reasons it refuses to do so. The first problem can be solved by disguising ourselves with a fake user agent, so that a random combination of operating system, platform and browser is passed to the server along with our request. In most cases this works well for accurately collecting the information you are interested in.
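The original text does not say which package generates these fake user agents, so treat the following as one hedged possibility: the fake-useragent library can produce a random user agent string, which we then hand to Chrome through its user-agent switch.
- from fake_useragent import UserAgent
- ua = UserAgent()                                   # source of random real-world user agent strings
- options = webdriver.ChromeOptions()
- options.add_argument('headless')
- options.add_argument('user-agent=' + ua.random)    # pretend to be a random OS/browser combination
- browser = webdriver.Chrome(executable_path=chromedriver, chrome_options=options)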
But sometimes just putting time.sleep() in the right places and filling in the request headers is not enough, so you need to find more powerful ways to change the IP. To scrape a large amount of data you can:
– Build your own IP address infrastructure;
– Use Tor (a topic that deserves several large articles of its own, and that has actually been done);
– Use a commercial proxy network;
For web scraping beginners the best option is to contact a proxy provider, such as Infatica, who can help you set up proxies and take care of all the difficulties of proxy server management. Collecting a lot of data requires a lot of resources, so there is no need to "reinvent the wheel" by building your own internal proxy infrastructure. Even many of the largest e-commerce companies outsource proxy management to proxy network services, because the number one priority for most companies is the data, not proxy management.
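However you obtain them, plugging a proxy into the scraper above is a one-line change, since Chrome accepts a --proxy-server switch. The address below is only a placeholder; substitute whatever endpoint your provider gives you.
- options = webdriver.ChromeOptions()
- options.add_argument('headless')
- options.add_argument('--proxy-server=http://203.0.113.10:8080')  # placeholder proxy address
- browser = webdriver.Chrome(executable_path=chromedriver, chrome_options=options)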