Welcome to "Python: From Zero to One". In this series I plan to share about 200 Python articles, so we can study together, have fun, and explore this interesting Python world. Every article combines cases, code, and the author's experience; I truly want to share my nearly ten years of programming experience with you, and I hope it helps, though the articles surely have shortcomings.
The overall plan of the Python series includes basic syntax (10 articles), web crawlers (30), visual analysis (10), machine learning (20), big data analysis (20), image recognition (30), artificial intelligence (40), Python security (20), and other skills (10). Your follows, likes, and shares are the greatest support for Xiuzhang. Knowledge is priceless and people have warmth; I hope we can all be happy and grow together on the road of life.
This article refers to the author's CSDN articles; the links are as follows:
Meanwhile, the author's new public account "Nazhang AI Security House" will focus on Python and security technology, mainly sharing web penetration, system security, artificial intelligence, big data analysis, image recognition, malicious code detection, CVE reproduction, threat intelligence analysis, and so on. Although the author is still a beginner in many areas, I will make sure every article is carefully written; I hope these basic articles help you, and that we can make progress together on the road of Python and security.
With the rapid development of the Internet, the World Wide Web has become the carrier of a vast amount of information, and more and more users obtain the information they need through the web; how to effectively extract and use that information has become a huge challenge. Search engines such as Google, Yahoo, Baidu, and Sogou help people retrieve information and have become the gateway through which users access the web. However, these general-purpose search engines have limitations: the results they return contain large numbers of pages the user does not care about; they are based on keyword search and lack semantic understanding, which leads to inaccurate results; and they cannot process unstructured data such as images, audio, video, and other complex data types.
To solve these problems, web crawlers that fetch related web resources came into being. The figure below shows the architecture of the Google search engine: it crawls data from the web, analyzes text and links, sorts the results, and finally returns relevant search results to the browser. The now-popular knowledge graph was also proposed to address similar problems.
A web crawler, also known as a web spider or web robot, is a program or script that automatically fetches information from the World Wide Web according to certain rules. A focused (directional) crawler fetches according to a fixed target, selectively visiting pages and related links to obtain the required information. Unlike a general-purpose crawler, a focused crawler does not pursue broad coverage; its goal is to fetch the pages related to a particular topic and prepare data resources for topic-oriented user queries.
By system structure and implementation technology, crawlers can be roughly divided into the following types: general purpose web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers. A real crawler system is usually a combination of several of these techniques.
Data analysis usually involves six steps: preparation, data crawling, data preprocessing, data analysis, visualization, and evaluation, as shown in the figure below. Data crawling itself is divided into four steps:
I hope you can start from the basics of Python and eventually be able to fetch the datasets you need and analyze them in depth. Let's go!
Regular expressions are powerful tools for handling strings, usually used to retrieve or replace text that conforms to certain rules. This article first introduces the basic concepts of regular expressions, then explains their common methods, covers the common Python modules for crawling web data and common ways to analyze websites with regular expressions, and finally uses regular expressions to crawl a personal blog site.
A regular expression (Regular Expression, abbreviated Regex or RE), also known as a normal or conventional representation, is often used to retrieve or replace text that matches a pattern. It defines special characters and character combinations and builds a "rule string" to filter text, obtaining or matching the specific content we want. Regular expressions are flexible, logical, and powerful, and can quickly find the required information in a string, but for newcomers they can be rather obscure.
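To make the idea concrete, here is a minimal sketch of how a rule string filters the content we want out of ordinary text (the sample sentence is invented for illustration):

```python
import re

# The "rule string": one or more consecutive digits.
pattern = r"\d+"

# findall() returns every substring of the text that matches the rule.
result = re.findall(pattern, "room 404, floor 3")
print(result)  # ['404', '3']
```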
Because the main application object of regular expressions is text, they are used in all kinds of text editors, from EditPlus to Microsoft Word, Visual Studio, and beyond, all of which can use regular expressions to process text content.
Python provides regular expression support through the re module; before using regular expressions you need to import re in order to call the module's functions.
The basic workflow is to compile the string form of a regular expression into a Pattern instance, use the Pattern instance to process the text and obtain a Match instance, and then use the Match instance to extract the required information. A commonly used function is findall, whose prototype is re.findall(pattern, string, flags=0).
This function searches the string and returns all matching substrings as a list. The flags parameter has three common values, each followed in parentheses by its full written form: re.I (re.IGNORECASE), re.M (re.MULTILINE), and re.S (re.DOTALL).
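A quick sketch of how a flag changes what findall returns (the sample text is made up for illustration):

```python
import re

text = "Hello World\nhello python"

# Case-sensitive by default: only the lowercase line matches.
print(re.findall(r"hello \w+", text))        # ['hello python']

# re.I (re.IGNORECASE) makes the match case-insensitive.
print(re.findall(r"hello \w+", text, re.I))  # ['Hello World', 'hello python']
```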
A Pattern object is a compiled regular expression; it provides a series of methods to match and search text. Pattern cannot be instantiated directly; it must be constructed with re.compile().
The re module also includes some common operation functions, such as the compile() function, whose prototype is re.compile(pattern, flags=0).
This function creates a pattern object from a string containing a regular expression and returns that Pattern object. The flags parameter is the matching mode; multiple modes can take effect at once using the bitwise OR operator "|", or can be specified inside the regular expression string itself. Pattern objects cannot be instantiated directly; they can only be obtained through the compile method.
For example, the following uses a regular expression to extract the numeric content of a string:
```python
>>> import re
>>> string = "A1.45,b5,6.45,8.82"
>>> regex = re.compile(r"\d+\.?\d*")
>>> print(regex.findall(string))
['1.45', '5', '6.45', '8.82']
```
The match method tries to match pattern starting at subscript pos of the string. If pattern matches to its end, a Match object is returned; if pattern fails during matching, or the match reaches endpos before pattern is finished, None is returned. The prototype of this method is as follows:
The search method looks for a substring of the string that matches successfully. It tries to match pattern starting at subscript pos of the string; if pattern still matches at the end, a Match object is returned. If pattern does not match there, pos is incremented by 1 and matching is retried; if no match is found by the time pos reaches endpos, None is returned. The function prototype is as follows:
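A small sketch of the difference between the two methods (the sample string is invented):

```python
import re

text = "abc123def"

# match() only succeeds if the pattern matches at the start of the string.
print(re.match(r"\d+", text))  # None, because the string does not start with a digit

# search() scans forward until the pattern matches somewhere in the string.
m = re.search(r"\d+", text)
print(m.group())               # 123
```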
The group([group1, …]) method returns one or more substrings captured by groups. When more than one argument is given, the result is returned as a tuple; a group that captured no string returns None, and a group that captured multiple times returns the last captured substring. The groups([default]) method returns all group captures as a tuple, equivalent to calling group once per group; its default parameter is the value substituted for groups that captured nothing, and defaults to None.
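A brief sketch of group() and groups() in action (the sample email address is invented for illustration):

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "contact: eastmount@csdn.com")
print(m.group())   # eastmount@csdn.com, the whole match
print(m.group(1))  # eastmount, the first captured group
print(m.group(2))  # csdn, the second captured group
print(m.groups())  # ('eastmount', 'csdn'), all groups as a tuple
```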
This section introduces Python's common modules for crawling network data, mainly the urlparse, urllib, urllib2, and requests modules (in Python 3, urlparse and urllib2 have been merged into urllib). The functions in these modules are basic knowledge, but they are also very important.
This book first introduces urllib, the simplest and most widely used library for crawling network data in Python. urllib is Python's library for fetching URLs (Uniform Resource Locators); it can be used to fetch and save remote data, and can even set headers, proxies, timeouts, and authentication.
The high-level interface provided by the urllib module lets us read data on www or ftp just as if it were a local file, more conveniently than in languages such as C++ or C#. Its common methods are as follows:
This method creates a file-like object for a remote URL, which can then be operated on like a local file to fetch remote data. The url parameter is the path to the remote data, usually a web address; the data parameter is the data submitted to the url with the POST method; the proxies parameter is used to set proxies. urlopen returns a file-like object that provides the methods listed in the following table.
Note that in Python we can import the relevant packages and view their documentation with the help function, as shown in the figure below.
Next, an example of crawling the Baidu homepage with the urllib library:
```python
# -*- coding:utf-8 -*-
import urllib.request
import webbrowser as web

url = "http://www.baidu.com"
content = urllib.request.urlopen(url)

print(content.info())     # header information
print(content.geturl())   # requested url
print(content.getcode())  # http status code

# Save the page locally and open it in a browser
open("baidu.html", "wb").write(content.read())
web.open_new_tab("baidu.html")
```
This code calls the urllib.request.urlopen(url) function to open the Baidu link and outputs the message headers, url, http status code, and other information, as shown in the figure below.
The line import webbrowser as web imports the standard-library webbrowser module, after which its functions can be called in the form "module_name.method". open().write() creates the static local file baidu.html, reads the already-opened Baidu page, and writes it to the file. web.open_new_tab("baidu.html") opens the downloaded static page in a browser. The downloaded static "baidu.html" file of the Baidu homepage is shown in the figure below.
You can also use web.open_new_tab("http://www.baidu.com") to open the online page directly in the browser.
The urlretrieve method downloads remote data to the local machine. The filename parameter specifies the local save path; if it is omitted, urllib automatically generates a temporary file to save the data. The reporthook parameter is a callback function triggered when the server connection is established and each data block is transferred, usually used to display the current download progress. The data parameter is the data passed to the server. The following example demonstrates fetching the Sina homepage, saving it in the file "D:/sina.html", and displaying the download progress.
```python
# -*- coding:utf-8 -*-
import urllib.request

# Function: download a file locally and show progress
# a - blocks downloaded so far, b - block size, c - size of the remote file
def Download(a, b, c):
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f' % per)

url = 'http://www.sina.com.cn'
local = 'd://sina.html'
urllib.request.urlretrieve(url, local, Download)
```
The above covers two common methods of the urllib module: urlopen() opens a web page, and urlretrieve() downloads remote data locally, mainly used for crawling images. Note that Python 2 can call these directly, while Python 3 needs to call them through urllib.request.
```python
# -*- coding:utf-8 -*-
import urllib.request

url = 'https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png'
local = 'baidu.png'
urllib.request.urlretrieve(url, local)
```
The fetched Baidu logo image is shown in the figure below:
The urlparse module mainly parses urls; its main operations are splitting and merging the various parts of a url. It can split a url into six parts and return a tuple, or join the parts back into a new url. Its main functions are urljoin, urlsplit, urlunsplit, and urlparse.
This function resolves urlstring into six parts and returns the tuple (scheme, netloc, path, params, query, fragment). It can be used to determine the network protocol (HTTP, FTP, etc.), server address, file path, and so on. Example code:
```python
# coding=utf-8
from urllib.parse import urlparse

url = urlparse('http://www.eastmount.com/index.asp?id=001')
print(url)         # the url split into six parts
print(url.netloc)  # output the host
```
The output, shown below, includes the six parts scheme, netloc, path, params, query, and fragment.
```
>>> ParseResult(scheme='http', netloc='www.eastmount.com', path='/index.asp',
                params='', query='id=001', fragment='')
www.eastmount.com
>>>
```
You can also call the urlunparse() function to assemble a tuple back into a url. The function is as follows:
It accepts a tuple like the one urlparse returns, (scheme, netloc, path, params, query, fragment), and recombines it into a correctly formatted url for other HTML-parsing modules in Python to use. Sample code:
```python
# coding=utf-8
import urllib.parse

url = urllib.parse.urlparse('http://www.eastmount.com/index.asp?id=001')
print(url)         # the url split into six parts
print(url.netloc)  # output the host

# reassemble the URL
u = urllib.parse.urlunparse(url)
print(u)
```
The output is shown in the figure below.
Next, some techniques for crawling network data with regular expressions. These all come from the author's project experience in natural language processing and data crawling; they may not be very systematic, but I hope they give readers some ideas for crawling data and help solve practical problems.
HTML pages are written with pairs of tags, each with a start tag and an end tag, such as <head></head>, <tr></tr>, and <script></script>. The following shows how to grab the text between tag pairs, such as the "Python" content between the tag pair <title>Python</title>.
1. Grabbing the content between title tags
First, you can use a regular expression to grab the content between the start tag <title> and the end tag </title>; "(.*?)" represents the content we want to capture. The following code crawls the title of the Baidu homepage, namely "use Baidu Search, you will know".
```python
# coding=utf-8
import re
import urllib.request

url = "http://www.baidu.com/"
content = urllib.request.urlopen(url).read()
title = re.findall(r'<title>(.*?)</title>', content.decode('utf-8'))
print(title)  # the title: use Baidu Search, you will know
```
The code calls the urllib library's urlopen() function to open the hyperlink and the re library's findall() function to search for the content between the title tags. Because findall() returns all text satisfying the regular expression, here you just output its first value to get the title. Note that Python 3 needs to decode the bytes as utf-8, otherwise an error is raised.
Here is another way to get the content between the title start tag (<title>) and end tag (</title>); it outputs the same Baidu homepage title, "use Baidu Search, you will know".
```python
# coding=utf-8
import re
import urllib.request

url = "http://www.baidu.com/"
content = urllib.request.urlopen(url).read()
pat = r'(?<=<title>).*?(?=</title>)'
ex = re.compile(pat, re.M | re.S)
obj = re.search(ex, content.decode('utf-8'))
title = obj.group()
print(title)  # use Baidu Search, you will know
```
2. Grabbing the content between hyperlink tags
In HTML, <a href=url>hyperlink title</a> identifies a hyperlink. The following code gets the complete hyperlinks and also the title content between <a> and </a>.
```python
# coding=utf-8
import re
import urllib.request

url = "http://www.baidu.com/"
content = urllib.request.urlopen(url).read()

# Get the complete hyperlinks
res = r"<a.*?href=.*?<\/a>"
urls = re.findall(res, content.decode('utf-8'))
for u in urls:
    print(u)

# Get the content between <a> and </a>
res = r'<a .*?>(.*?)</a>'
texts = re.findall(res, content.decode('utf-8'), re.S | re.M)
for t in texts:
    print(t)
```
The output of the "print(u)" and "print(t)" statements is shown below.
3. Grabbing the content between tr and td tags
Common web page layouts include table layout and div layout. In table layout, common tags include tr, th, and td: a table row is tr (table row), table data is td (table data), and a table header cell is th (table heading). So how do we grab the content between these tags? The code below gets the content between them. Suppose the HTML is as follows:
```html
<html>
<head><title>form</title></head>
<body>
<table border=1>
  <tr><th>Student ID</th><th>Name</th></tr>
  <tr><td>1001</td><td>Yang Xiuzhang</td></tr>
  <tr><td>1002</td><td>Yan Na</td></tr>
</table>
</body>
</html>
```
The rendered result is shown in the figure below:
The Python code for crawling the content between the tr, th, and td tags with regular expressions is as follows.
```python
# coding=utf-8
import re

# Read the local file (urlopen requires a URL scheme, so open() is used here)
content = open("test.html", encoding='utf-8').read()

# Get the content between <tr></tr>
res = r'<tr>(.*?)</tr>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    print(m)

# Get the content between <th></th>
for m in texts:
    res_th = r'<th>(.*?)</th>'
    m_th = re.findall(res_th, m, re.S | re.M)
    for t in m_th:
        print(t)

# Directly get the content between <td></td>
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    print(m[0], m[1])
```
The output is as follows: first the content between tr tags is obtained, then the values between <th> and </th> within it, namely "Student ID" and "Name", and finally the content between the two pairs of <td> and </td>. Note that Python 3 may raise errors when parsing local files; what matters most is mastering the method.
If the cells contain attribute values, modify the regular expression to "<td id=.*?>(.*?)</td>". Similarly, if the attribute does not necessarily start with id, you can use the regular expression "<td .*?>(.*?)</td>".
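A quick sketch of both attribute patterns on an invented table row:

```python
import re

# A hypothetical row whose cells carry attributes.
content = '<td id="t1">1001</td><td class="name">Yang</td>'

# Only cells whose attributes start with id=
print(re.findall(r'<td id=.*?>(.*?)</td>', content))  # ['1001']

# Cells with any (or no) leading attributes
print(re.findall(r'<td.*?>(.*?)</td>', content))      # ['1001', 'Yang']
```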
1. Grabbing the url in a hyperlink tag
The basic format of an HTML hyperlink is "<a href=url>link content</a>". Now we need to get the url address in it, as follows:
```python
# coding=utf-8
import re

content = '''
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">News</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">Map</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">Video</a>
'''
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
urls = re.findall(res, content, re.I | re.S | re.M)
for url in urls:
    print(url)
```
The output is as follows:
2. Grabbing the url in an image tag
In HTML, all kinds of images appear with the basic tag format "<img src=image_address />"; only by grabbing these images' original addresses can we download the corresponding images locally. So how do we get the original address inside an image tag? The following code shows how.
```python
# coding=utf-8
import re

content = '''<img alt="Python" src="http://www.yangxiuzhang.com/eastmount.jpg" />'''
urls = re.findall('src="(.*?)"', content, re.I | re.S | re.M)
print(urls)
# ['http://www.yangxiuzhang.com/eastmount.jpg']
```
The original address "http://…/eastmount.jpg" corresponds to an image stored on the "www.yangxiuzhang.com" web server; the field after the last "/" is the image name, namely "eastmount.jpg". So how do we get the last parameter in a url?
3. Getting the last parameter of a url
When crawling images with Python, you often need to name each image with the last field of its url, like "eastmount.jpg" above; this requires parsing out the part of the url after the last "/".
```python
# coding=utf-8
content = '''<img alt="Python" src="http://www.csdn.net/eastmount.jpg" />'''
urls = 'http://www.csdn.net/eastmount.jpg'
name = urls.split('/')[-1]
print(name)
# eastmount.jpg
```
Here urls.split('/')[-1] splits the string on the character "/" and takes the last value obtained, which is the image name "eastmount.jpg".
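When the image url carries a query string, splitting the raw url leaves extra characters in the name; a sketch (with an invented example.com address) of a more robust variant using the standard urllib.parse module:

```python
from urllib.parse import urlparse

url = "http://www.example.com/images/eastmount.jpg?size=large"

# Naive split keeps the query string in the name.
print(url.split('/')[-1])                 # eastmount.jpg?size=large

# Parsing the path first drops the query string.
print(urlparse(url).path.split('/')[-1])  # eastmount.jpg
```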
When crawling web page text with regular expressions, you usually need to call the find() function first to locate a specified position and then crawl further, for example locating the table whose class attribute is "infobox" before crawling it.
```python
start = content.find(r'<table class="infobox"')  # start position
end = content.find(r'</table>')                  # end position
infobox = content[start:end]
print(infobox)
```
Meanwhile, irrelevant content may be crawled along the way and needs to be filtered out; the replace function and regular expressions are recommended. For example, suppose the crawled content is as follows:
```python
# coding=utf-8
import re

content = '''
<tr><td>1001</td><td>Yang Xiuzhang&nbsp;<br /></td></tr>
<tr><td>1002</td><td>Yan Na</td></tr>
<tr><td>1003</td><td><B>Python</B></td></tr>
'''
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    print(m[0], m[1])
```
The output is as follows:
At this point you need to filter out the extra strings, such as newlines (<br />), spaces (&nbsp;), and bold markup (<B></B>). The filtering code is as follows:
```python
# coding=utf-8
import re

content = '''
<tr><td>1001</td><td>Yang Xiuzhang&nbsp;<br /></td></tr>
<tr><td>1002</td><td>Yan Na</td></tr>
<tr><td>1003</td><td><B>Python</B></td></tr>
'''
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    value0 = m[0].replace('<br />', '').replace('&nbsp;', '')
    value1 = m[1].replace('<br />', '').replace('&nbsp;', '')
    if '<B>' in value1:
        m_value = re.findall(r'<B>(.*?)</B>', value1, re.S | re.M)
        print(value0, m_value[0])
    else:
        print(value0, value1)
```
replace substitutes the strings "<br />" and "&nbsp;" with empty strings to filter them out, while the bold markup (<B></B>) is filtered with a regular expression. The output is as follows:
Keep in mind: this example may not be perfect, but it works well as an introduction combined with regular expressions. If you are just starting to learn Python, don't worry about web crawlers; only through training like this will you be able to handle similar problems later and better capture the data you need.
Having covered regular expressions, common network data crawling modules, and common methods for crawling data with regular expressions, we now walk through a simple example of crawling a website with regular expressions: the author's personal blog site, from which we will get the required content.
The author's personal site "http://www.eastmountyxz.com/" opens as shown in the figure. Suppose the content we now need to crawl is as follows:
Step 1: Locate the source code in the browser
First, locate the source code of the elements to be crawled in the browser, such as article titles, hyperlinks, and images, and find the patterns these elements follow in the HTML source; this is called DOM tree document node analysis. Open the page in a browser, select the content you want to crawl, right-click and choose "Inspect Element" or "Inspect", and you will find the HTML source code of the node to crawl, as shown in the figure.
The title "Goodbye North Tech: Remembering My Graduate Programming Days in Beijing" sits under the node <div class="essay"></div>, which contains an <h1></h1> recording the title and a <p></p> recording the summary, namely:
Here we mark the nodes to crawl by the tag's attributes and attribute values: finding the div whose class attribute is "essay" locates the first article. Similarly, the other three articles are <div class="essay1"></div>, <div class="essay2"></div>, and <div class="essay3"></div>, which locate those nodes.
Step 2: Crawl the title with a regular expression
A site's title is usually located between <head><title>…</title></head>; the HTML code of this site's title is as follows:
```html
<head>
<meta charset="utf-8">
<title>Xiuzhang studies heaven and earth</title>
....
</head>
```
The blog site's title "Xiuzhang studies heaven and earth" is obtained with the regular expression "<title>(.*?)</title>". The code below first visits the blog URL with the urlopen() function and then applies the regular expression.
import re import urllib.request url = "http://www.eastmountyxz.com/" content = urllib.request.urlopen(url).read() title = re.findall(r'<title>(.*?)</title>', content.decode('utf-8')) print(title)
The output is shown in the figure below:
Step 3: Crawl all image addresses with regular expressions
Because the HTML image tag has the format "<img src=image_address />", the way to get an image address with a regular expression is to match what starts after src=" and ends at the closing double quote. The code is as follows:
```python
import re
import urllib.request

url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
urls = re.findall(r'src="(.*?)"', content.decode('utf-8'))
for url in urls:
    print(url)
```
The output is as follows, six images in total.
Note that each image address here omits the blog address:
We need to splice the crawled image addresses, prepending the original blog address to form complete image addresses before downloading, and such an address can be opened directly in a browser, for example:
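The splicing can be done with simple string concatenation, or with urljoin from the standard urllib.parse module; a sketch (the relative image paths are invented for illustration):

```python
from urllib.parse import urljoin

base = "http://www.eastmountyxz.com/"

# Relative src values as they might appear in the crawled page.
for src in ["images/love.png", "images/csdn.png"]:
    print(urljoin(base, src))  # full downloadable address
```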
Step 4: Crawl the blog content with regular expressions
Step 1 described how to locate the four article titles: the first article sits between the tags <div class="essay"> and </div>, the second between <div class="essay1"> and </div>, and so on. However, this HTML code has a mistake: a class attribute usually denotes a category of tags and should share the same value, so all four articles' class attributes should be "essay", while a name or id attribute is what uniquely identifies a tag.
Here find('<div class="essay">') locates the start of the first article and find('<div class="essay1">') locates its end, giving the content from <div class="essay"> to </div>. For example, the code to get the first article's title and hyperlink is as follows:
```python
import re
import urllib.request

url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
data = content.decode('utf-8')
start = data.find(r'<div class="essay">')
end = data.find(r'<div class="essay1">')
print(data[start:end])
```
The output below is the HTML source code of the first blog post.
This part of the code is divided into three steps: (1) open the URL with urlopen() and decode the result as utf-8; (2) locate the start and end positions with find(); (3) slice the string to get the content in between.
After locating this section, the specific content is extracted with regular expressions, as follows:
```python
import re
import urllib.request

url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
data = content.decode('utf-8')
start = data.find(r'<div class="essay">')
end = data.find(r'<div class="essay1">')
page = data[start:end]

res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print(t1)
t2 = re.findall(r'<a .*?>(.*?)</a>', page)  # title
print(t2)
t3 = re.findall('<p style=.*?>(.*?)</p>', page, re.M | re.S)  # abstract
print(t3)
```
The regular expressions retrieve each piece separately. Because the crawled paragraph (<p>) contains line breaks, re.M and re.S must be added to support matching across newlines. The results are as follows:
The complete code is as follows:
```python
# coding:utf-8
import re
import urllib.request

url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
data = content.decode('utf-8')

# Crawl the title
title = re.findall(r'<title>(.*?)</title>', data)
print(title)

# Crawl the image addresses
urls = re.findall(r'src="(.*?)"', data)
for url in urls:
    print(url)

# Crawl the content of the first article
start = data.find(r'<div class="essay">')
end = data.find(r'<div class="essay1">')
page = data[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print(t1)
t2 = re.findall(r'<a .*?>(.*?)</a>', page)  # title
print(t2)
t3 = re.findall('<p style=.*?>(.*?)</p>', page, re.M | re.S)  # abstract
print(t3)
print('')

# Crawl the content of the second article
start = data.find(r'<div class="essay1">')
end = data.find(r'<div class="essay2">')
page = data[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print(t1)
t2 = re.findall(r'<a .*?>(.*?)</a>', page)  # title
print(t2)
t3 = re.findall('<p style=.*?>(.*?)</p>', page, re.M | re.S)  # abstract
print(t3)
```
The output is shown in the figure.
From the code above, readers will find that crawling websites with regular expressions is tedious, especially when locating web page nodes. Later articles will cover common third-party Python packages whose functions make targeted crawling much easier.
A regular expression filters text with a "rule string" expression, matching the desired information out of complex content. Its main object is text: it is well suited to matching plain-text patterns such as URLs and email addresses, but it cannot match the meaning of text. Regular expressions are available in virtually all programming languages, such as C#, Java, and Python.
Regular expression crawlers are often used to get some part of a string, for example extracting blog read counts and comment counts, intercepting a URL's domain or one of its parameters, filtering out specific characters, checking whether fetched data follows a certain logic, or validating URLs and date types. Their flexibility, logic, and power make it possible to achieve a matching goal from complex strings in a very simple way.
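A short sketch of two of these uses, validation and domain extraction (the sample values are invented):

```python
import re

# A loose date-format check: only the YYYY-MM-DD shape, not calendar validity.
date_pat = re.compile(r"^\d{4}-\d{2}-\d{2}$")
print(bool(date_pat.match("2020-09-30")))  # True
print(bool(date_pat.match("30/09/2020")))  # False

# Intercept the domain part of a URL.
url = "https://blog.csdn.net/Eastmount"
print(re.findall(r"https?://([^/]+)", url))  # ['blog.csdn.net']
```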
However, regular expressions are obscure for newcomers, and extracting particular texts from HTML with them is also difficult, especially when end tags are missing or unclear in the page's HTML source. Next, the author will cover more powerful and intelligent third-party crawler packages, mainly the BeautifulSoup and Selenium technologies.
Finally, thank you for following the "Nazhang's Home" official account, and thank CSDN for so many years of company. I will always insist on sharing; I hope my articles can accompany your growth, and that I can keep moving forward on the road of technology. If an article helps you or gives you an insight, that is the best reward for me. The account was established on August 18, 2020; thank you again for following, and please help promote "Nazhang's Home". I'm new here, so please give me your advice.
(By: Nazhang's Home, Eastmount, 2020-09-30 night at Wuhan University, https://blog.csdn.net/Eastmount)
The references are as follows: