[Python from Zero to One] 4. Introduction to web crawlers and a regular-expression blog-crawling case

Eastmount 2020-11-13 00:00:53
Tags: Python, zero to one, introduction, web crawler


Welcome to "Python from Zero to One". Here I plan to share about 200 Python articles and take everyone through this interesting Python world together. Every article combines cases, code, and the author's experience. I really want to share my nearly ten years of programming experience with you and hope it will be of some help, though the articles will surely still have shortcomings.

The overall framework of the Python series includes basic syntax (10 articles), web crawlers (30), visual analysis (10), machine learning (20), big data analysis (20), image recognition (30), artificial intelligence (40), Python security (20), and other skills (10). Your attention, likes, and shares are the greatest support for Xiuzhang. Knowledge is priceless and people have love; I hope we can all be happy and grow together on the road of life.

This article draws on the author's earlier CSDN articles; the links are as follows:

Meanwhile, the author's new public account "Na Zhang AI Safe House" will focus on Python and security technology, mainly sharing web penetration, system security, artificial intelligence, big data analysis, image recognition, malicious code detection, CVE reproduction, threat intelligence analysis, and so on. Although the author is still a novice in security, every article will be written carefully; I hope these basic articles help you, and that we make progress together on the road of Python and security.



One. What is a web crawler

With the rapid development of the Internet, the World Wide Web has become the carrier of a huge amount of information, and more and more people get the information they need through it; how to extract and use this information effectively has become a huge challenge. Search engines (Google, Yahoo, Baidu, Sogou, etc.) help people retrieve information and have become the gateway through which users access the web. However, these general-purpose search engines have limitations: the results they return contain a large number of pages the user does not care about; they are based on keyword search and lack semantic understanding, which makes the feedback inaccurate; and they cannot handle unstructured data such as pictures, audio, video, and other complex types of data.

To address these problems, web crawlers that grab related web resources came into being. The picture below shows the architecture of the Google search engine: it crawls data from the World Wide Web, analyzes the text and links, sorts the results, and finally returns the relevant search results to the browser. The now-popular knowledge graph was also proposed to solve similar problems.

 Insert picture description here

A web crawler, also known as a web spider or web robot, is a program or script that automatically grabs information from the World Wide Web according to certain rules. A focused crawler grabs toward a fixed target: it selectively visits web pages and related links to obtain the required information. Unlike a general crawler, a focused crawler does not pursue broad coverage; its goal is to capture pages related to a particular topic and prepare data resources for topic-oriented user queries.

By system structure and implementation technology, crawlers can be roughly divided into the following types: general-purpose web crawlers (General Purpose Web Crawler), focused crawlers (Focused Web Crawler), incremental web crawlers (Incremental Web Crawler), and deep web crawlers (Deep Web Crawler). A real-world crawler system is usually a combination of several of these techniques.

Data analysis usually involves six steps: preparation, data crawling, data preprocessing, data analysis, visualization, and analysis and evaluation, as shown in the figure below. Data crawling itself is divided into four steps:

  • Requirements analysis. First analyze the requirements of the crawl: understand the URLs and content distribution of the topic to be crawled, and the fields and atlas of the corpus to be acquired.
  • Technology selection. Web crawling can be done in Python, Java, C++, C#, and other languages; the main techniques involved include the urllib library, regular expressions, Selenium, BeautifulSoup, and Scrapy.
  • Web crawling. After choosing the technique, analyze the DOM tree structure of the page, locate the nodes to crawl with XPath, and grab the data; some sites also involve page jumps, login verification, and so on.
  • Storage. Data storage is mainly about saving the crawled information, mainly in SQL databases, plain text, CSV/XLS files, and so on.

I hope you will learn Python from the basics and eventually be able to grab the data sets you need and analyze them in depth. Come on!

 Insert picture description here


Two. Regular expressions

Regular expressions are a powerful tool for handling strings, usually used to retrieve and replace text that matches certain rules. This article first introduces the basic concepts of regular expressions, then explains their common methods, combines them with Python's common modules for crawling network data and common ways of analyzing websites with regular expressions, and finally uses regular expressions to crawl a personal blog site.

A regular expression (Regular Expression, abbreviated Regex or RE), also called a normal or conventional representation, is often used to retrieve and replace text that matches a pattern. It first defines some special characters and character combinations, and the combined "rule string" filters expressions so as to obtain or match exactly the content we want. Regular expressions are very flexible, logical, and powerful, and can quickly find the required information in a string, but for people who have just met them they are rather obscure.

Because the main target of regular expressions is text, they are used in all kinds of text editors, from small well-known editors such as EditPlus to large tools such as Microsoft Word and Visual Studio, all of which can use regular expressions to process text.

1. The re module

Python provides regular-expression support through the re module; before using regular expressions you need to import re in order to call the module's functions.

  • import re

The basic steps are to compile the string form of the regular expression into a Pattern instance, use the Pattern instance to process the text and obtain a match instance, and then use the match instance to get the required information. A commonly used function is findall, whose prototype is as follows:

  • findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags])

This function searches the string string and returns all matching substrings as a list. The flags parameter has three common values (the full name of each is given in parentheses); a short sketch of their effect follows the list.

  • re.I (re.IGNORECASE): ignore case when matching
  • re.M (re.MULTILINE): allow multi-line matching, affecting ^ and $
  • re.S (re.DOTALL): make . match any character, including newlines
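
As a small illustration of how these flags change matching behavior, here is a minimal sketch (the sample string is made up for illustration):

import re

text = "Python\npython"
# Without flags: '.' does not cross the newline and case matters
print(re.findall(r'python.*', text))             # ['python']
# With re.I and re.S: case is ignored and '.' also matches the newline
print(re.findall(r'python.*', text, re.I|re.S))  # ['Python\npython']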

A Pattern object is a compiled regular expression; it provides a series of methods for matching and searching text. Pattern cannot be instantiated directly; it must be constructed with re.compile().


2. The compile method

The re module includes some common operation functions, such as the compile() function. Its prototype is as follows:

  • compile(pattern[,flags] )

This function creates a pattern object from a string containing a regular expression and returns it. The flags parameter is a matching mode; several modes can be combined with the bitwise OR "|" so that they take effect at the same time, and they can also be specified inside the regular expression string. Pattern objects cannot be instantiated directly; they can only be obtained through the compile method.

For example, use a regular expression to get the numeric content of a string, as shown below:

>>> import re
>>> string="A1.45,b5,6.45,8.82"
>>> regex = re.compile(r"\d+\.?\d*")
>>> print(regex.findall(string))
['1.45', '5', '6.45', '8.82']
>>>

3. The match method

The match method tries to match pattern starting from position pos of the string. If the pattern matches at that position, it returns a match object; if it fails during matching, or the match reaches endpos before the pattern is finished, it returns None. The prototype is as follows:

  • match(string[, pos[, endpos]]) | re.match(pattern, string[, flags])
    The parameter string is the string to match; pos is the starting index, and the defaults of pos and endpos are 0 and len(string); the flags parameter is used when compiling pattern to specify the matching mode.
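
A minimal sketch of match follows (the sample string is made up for illustration); note that match only succeeds when the pattern matches from the starting position:

import re

m = re.match(r'\d+', '123abc456')
print(m.group())                          # '123' - the pattern matches at the start
print(re.match(r'[a-z]+', '123abc456'))   # None - the string does not start with letters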

4. The search method

The search method finds a substring of the string that matches successfully. It tries to match pattern starting from index pos; if the pattern matches, a match object is returned. If it does not match at that position, pos is increased by 1 and the match is tried again, until pos = endpos; if it still cannot match, None is returned. The prototype is as follows:

  • search(string[, pos[, endpos]]) | re.search(pattern, string[, flags])
    The parameter string is the string to search; pos is the starting index, and the defaults of pos and endpos are 0 and len(string); the flags parameter is used when compiling pattern to specify the matching mode.
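
A minimal sketch contrasting search with match (the sample string is made up for illustration): search scans forward through the string until the pattern matches.

import re

print(re.match(r'[a-z]+', '123abc456'))           # None - match only tries the start
print(re.search(r'[a-z]+', '123abc456').group())  # 'abc' - search scans the whole string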

5. The group and groups methods

The group([group1, ...]) method returns one or more of the substrings captured by groups. When multiple arguments are given, the result is returned as a tuple; a group that captured nothing returns None, and a group that matched multiple times returns the last substring it captured. The groups([default]) method returns all captured substrings as a tuple, equivalent to calling group once for every group; its default parameter is the value used for groups that captured nothing, and it defaults to None.
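
A minimal sketch of group and groups (the sample string is made up for illustration): group() is the whole match, group(n) is the n-th parenthesized group, and groups() returns all groups as a tuple.

import re

m = re.match(r'(\w+)@(\w+)\.com', 'eastmount@csdn.com')
print(m.group())    # 'eastmount@csdn.com' - the whole match
print(m.group(1))   # 'eastmount'           - the first group
print(m.group(2))   # 'csdn'                - the second group
print(m.groups())   # ('eastmount', 'csdn') - all groups as a tuple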


Three. Common Python modules for crawling network data

This section introduces Python's common modules for crawling over the network, mainly the urllib module, the urlparse module, the urllib2 module, and the requests module (in Python 3, urlparse and urllib2 have been merged into urllib.parse and urllib.request). The functions of these modules are basic knowledge, but they are also very important.

1. The urllib module

This part first introduces the simplest and most widely used library for Python network data crawling: urllib. urllib is Python's standard library for working with URLs (Uniform Resource Locators); it can grab remote data and save it, and it also supports setting headers, proxies, timeouts, authentication, and so on.

The high-level interface provided by the urllib module lets us read data on the www or ftp just like a local file, and it is more convenient to use than languages such as C++ or C#. Its common methods are as follows:

  • urlopen
    urllib.request.urlopen(url, data=None, timeout=...)   # Python 3; the old Python 2 form was urllib.urlopen(url, data=None, proxies=None)

This method creates a file-like object for a remote URL, which can then be operated on like a local file to obtain the remote data. The parameter url is the path of the remote data, usually a web address; data is the data submitted to the url with a POST request; timeout is an optional timeout in seconds (the Python 2 version took a proxies argument for setting a proxy instead). urlopen returns a file-like object whose common methods are listed in the table below.

 Insert picture description here

Note that in Python we can import the relevant packages and view their documentation with the help function, as shown in the figure below.

 Insert picture description here

Let's illustrate with an example that uses the urllib library to crawl the Baidu home page.

# -*- coding:utf-8 -*-
import urllib.request
import webbrowser as web
url = "http://www.baidu.com"
content = urllib.request.urlopen(url)
print(content.info()) # Header information 
print(content.geturl()) # request url
print(content.getcode()) #http Status code 
# Save the web page locally and open it through a browser 
open("baidu.html","wb").write(content.read())
web.open_new_tab("baidu.html")

This snippet calls urllib.request.urlopen(url) to open the Baidu link and outputs the message headers, the URL, the HTTP status code, and other information, as shown in the figure below.

 Insert picture description here

The line import webbrowser as web imports the standard webbrowser module, after which its functions can be called as "module_name.method". open().write() creates the static local file baidu.html and writes the content read from the opened Baidu page into it. web.open_new_tab("baidu.html") opens the downloaded static page in a browser. The downloaded static page "baidu.html" of the Baidu home page is shown in the figure below.

 Insert picture description here

You can also use web.open_new_tab("http://www.baidu.com") to open the online page directly in the browser.


  • urlretrieve
    urlretrieve(url, filename=None, reporthook=None, data=None)

The urlretrieve method downloads remote data to the local machine. The parameter filename specifies the local save path (if omitted, urllib generates a temporary file to hold the data); reporthook is a callback function that is triggered when the server connection is made and each data block is transferred, and is usually used to display the current download progress; data is the data passed to the server. The following example captures the Sina home page to the local file "D:/sina.html" while displaying the download progress.

# -*- coding:utf-8 -*-
import urllib.request

# Function: download a file to the local machine and show the progress
# a - number of blocks downloaded so far, b - block size, c - size of the remote file
def Download(a, b, c):
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f' % per)

url = 'http://www.sina.com.cn'
local = 'D:/sina.html'
urllib.request.urlretrieve(url, local, Download)

These are two common methods in the urllib module: urlopen() opens a web page, while urlretrieve() downloads remote data to the local machine and is mainly used for crawling images. Note that in Python 2 they are referenced directly from urllib, while in Python 3 they are called through urllib.request.

# -*- coding:utf-8 -*-
import urllib.request
url = 'https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png'
local = 'baidu.png'
urllib.request.urlretrieve(url, local)

The captured Baidu logo image is shown in the figure below:

 Insert picture description here


2. The urlparse module

The urlparse module (urllib.parse in Python 3) mainly analyzes a URL; its main operations are splitting a URL into its parts and merging parts back into a URL. It can split a URL into 6 parts and return them as a tuple, and it can also combine the parts into a new URL. Its main functions are urljoin, urlsplit, urlunsplit, urlparse, and so on (a short urljoin sketch is given at the end of this subsection).

  • urlparse
    urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

This function parses the value of urlstring into 6 parts and returns the tuple (scheme, netloc, path, params, query, fragment). It can be used to determine the network protocol (HTTP, FTP, etc.), the server address, the file path, and so on. Example code is as follows.

# coding=utf-8
from urllib.parse import urlparse

url = urlparse('http://www.eastmount.com/index.asp?id=001')
print(url)         # the url is split into six parts
print(url.netloc)  # output the network location (netloc)

The output is as follows and includes the six parts scheme, netloc, path, params, query, and fragment.

>>>
ParseResult(scheme='http', netloc='www.eastmount.com', path='/index.asp', params='', query='id=001', fragment='')
www.eastmount.com
>>>

You can also call the urlunparse() function to reassemble the tuple into a URL. The function is as follows:

  • urlunparse
    urllib.parse.urlunparse(parts)

The tuple it accepts has the same form as the one produced by urlparse: after receiving (scheme, netloc, path, params, query, fragment), it reassembles them into a correctly formatted URL that can be used by Python's other HTML parsing modules. Sample code is as follows:

# coding=utf-8
import urllib.parse

url = urllib.parse.urlparse('http://www.eastmount.com/index.asp?id=001')
print(url)         # the url is split into six parts
print(url.netloc)  # output the network location (netloc)
# Reassemble the URL
u = urllib.parse.urlunparse(url)
print(u)

The output is shown in the following figure .

 Insert picture description here
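
urljoin, listed above among the module's main functions, splices a base address and a relative path into a complete URL; here is a minimal sketch (the base address reuses the earlier urlparse example, and the relative path is made up for illustration):

# coding=utf-8
from urllib.parse import urljoin

base = 'http://www.eastmount.com/blog/index.asp'
print(urljoin(base, 'images/11.gif'))    # http://www.eastmount.com/blog/images/11.gif
print(urljoin(base, '/images/11.gif'))   # http://www.eastmount.com/images/11.gif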


Four. Common methods for grabbing network data with regular expressions

This section introduces some techniques for grabbing network data with regular expressions. They come from the author's project experience in natural language processing and data capture and may not be very systematic, but I hope they give readers some ideas for grabbing data and help solve practical problems.

1. Grabbing the content between tags

HTML pages are written with pairs of tags, each consisting of a start tag and an end tag, such as <head></head>, <tr></tr>, and <script></script>. The following shows how to grab the text between a tag pair, for example the "Python" between the <title>Python</title> pair.

(1) Grabbing the content between title tags

'<title>(.*?)</title>'

First, this regular expression grabs the content between the start tag <title> and the end tag </title>; "(.*?)" marks the content we want to capture. The following code crawls the title of the Baidu home page, namely "Baidu it, and you will know".

# coding=utf-8 
import re
import urllib.request
url = "http://www.baidu.com/"
content = urllib.request.urlopen(url).read()
title = re.findall(r'<title>(.*?)</title>', content.decode('utf-8'))
print(title[0])
# Baidu it, and you will know

The code calls the urlopen() function of the urllib library to open the hyperlink and the findall() function of the re library to search for the content between the title tags. Because findall() returns all text satisfying the regular expression, only the first value, title[0], needs to be printed here. Note that in Python 3 the bytes must be decoded as UTF-8, otherwise an error is reported.

Here is another way to get the content between the title start tag (<title>) and end tag (</title>), which likewise outputs the Baidu home page title "Baidu it, and you will know".

# coding=utf-8 
import re
import urllib.request
url = "http://www.baidu.com/"
content = urllib.request.urlopen(url).read()
pat = r'(?<=<title>).*?(?=</title>)'
ex = re.compile(pat, re.M|re.S)
obj = re.search(ex, content.decode('utf-8'))
title = obj.group()
print(title)
# Baidu it, and you will know

(2) Grabbing the content between hyperlink tags

In HTML, <a href=url>link text</a> identifies a hyperlink. The following code first gets the complete hyperlinks, then gets the text between <a> and </a>.

# coding=utf-8
import re
import urllib.request

url = "http://www.baidu.com/"
content = urllib.request.urlopen(url).read()

# Get the complete hyperlinks
res = r"<a.*?href=.*?<\/a>"
urls = re.findall(res, content.decode('utf-8'))
for u in urls:
    print(u)

# Get the content between <a> and </a>
res = r'<a .*?>(.*?)</a>'
texts = re.findall(res, content.decode('utf-8'), re.S|re.M)
for t in texts:
    print(t)

The output is shown below; the print(u) and print(t) statements output the matched results directly.

 Insert picture description here


(3) Grabbing the content between tr and td tags

Common web page layouts include table layout and div layout. The common tags in a table layout are tr, th, and td: a table row is tr (table row), table data is td (table data), and a table header cell is th (table heading). So how do we grab the content between these tags? Below is the code for getting the content between them; suppose the HTML is as follows:

<html>
<head><title> form </title></head>
<body>
<table border=1>
<tr><th> Student number </th><th> full name </th></tr>
<tr><td>1001</td><td> Yang xiuzhang </td></tr>
<tr><td>1002</td><td> Yanna </td></tr>
</table>
</body>
</html>

The results are shown in the following figure :

 Insert picture description here

The Python code for crawling the content between the tr, th, and td tags with regular expressions is as follows.

# coding=utf-8
import re

# urlopen cannot open a bare local path, so read the local file with open() instead
content = open("test.html", encoding='utf-8').read()

# Get the content between <tr></tr>
res = r'<tr>(.*?)</tr>'
texts = re.findall(res, content, re.S|re.M)
for m in texts:
    print(m)

# Get the content between <th></th>
for m in texts:
    res_th = r'<th>(.*?)</th>'
    m_th = re.findall(res_th, m, re.S|re.M)
    for t in m_th:
        print(t)

# Get the content between <td></td> directly
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S|re.M)
for m in texts:
    print(m[0], m[1])

The output is as follows: first the content between the tr tags is obtained, then the values between <th> and </th> within it, i.e. "Student number" and "full name", and finally the content between the two pairs of <td> and </td>. Note that urlopen in Python 3 cannot open a bare local path, which is why the code above reads test.html with open(); what matters most is mastering the method.

 Insert picture description here

If the cells carry attribute values, the regular expression is modified to "<td id=.*?>(.*?)</td>". Similarly, if the cell does not necessarily start with the id attribute, you can use the regular expression "<td .*?>(.*?)</td>".
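
A minimal sketch of these two patterns (the HTML snippet is made up for illustration):

# coding=utf-8
import re

content = '<tr><td id="no">1001</td><td class="name">Yang Xiuzhang</td></tr>'
# Only cells whose first attribute is id
print(re.findall(r'<td id=.*?>(.*?)</td>', content))   # ['1001']
# Cells carrying any attributes
print(re.findall(r'<td .*?>(.*?)</td>', content))      # ['1001', 'Yang Xiuzhang']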


2. Crawling parameters inside tags

(1) Grabbing the url of a hyperlink tag

The basic format of an HTML hyperlink is "<a href=url>link text</a>". To get the url link address in it, the method is as follows:

# coding=utf-8
import re

content = '''
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">News</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">Map</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">Video</a>
'''
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
urls = re.findall(res, content, re.I|re.S|re.M)
for url in urls:
    print(url)

The output is as follows :

 Insert picture description here


(2) Capturing the url in an image tag

In HTML we see all kinds of pictures; the basic format of an image tag is "<img src=image_address />". Only by grabbing the original addresses of these images can the corresponding pictures be downloaded locally. So how do we get the original address inside the image tag? The following code shows how to get the image link address.

import re
content = '''<img alt="Python" src="http://www.yangxiuzhang.com/eastmount.jpg" />'''
urls = re.findall('src="(.*?)"', content, re.I|re.S|re.M)
print(urls)
# ['http://www.yangxiuzhang.com/eastmount.jpg']

The original address here is "http://…/eastmount.jpg"; it corresponds to a picture stored on the server of "www.yangxiuzhang.com", and the field after the last "/" is the image name, i.e. "eastmount.jpg". So how do we get the last field of the url?


(3) Getting the last field of a url

When crawling images with Python, you often want to name the saved picture after the last field of its url, such as "eastmount.jpg" above; this requires splitting the url on "/" and taking the part after the last one.

content = '''<img alt="Python" src="http://www.csdn.net/eastmount.jpg" />'''
urls = 'http://www.csdn.net/eastmount.jpg'
name = urls.split('/')[-1]
print(name)
# eastmount.jpg

Here urls.split('/')[-1] splits the string on the character "/" and takes the last element, which is the image name "eastmount.jpg".


3. String processing and replacement

When crawling web page text with regular expressions, you usually need to call the find() function to locate a specific position first and then crawl within it, for example locating the table whose class attribute is "infobox" and then crawling inside it.

start = content.find(r'<table class="infobox"')   # start position
end = content.find(r'</table>', start)            # end position
infobox = content[start:end]
print(infobox)

Meanwhile, irrelevant content may be crawled along the way and needs to be filtered out; the replace function and regular expressions are recommended for this. For example, suppose the crawled content is as follows:

# coding=utf-8
import re

content = '''
<tr><td>1001</td><td>Yang Xiuzhang<br /></td></tr>
<tr><td>1002</td><td>Yan&nbsp;Na</td></tr>
<tr><td>1003</td><td><B>Python</B></td></tr>
'''
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S|re.M)
for m in texts:
    print(m[0], m[1])

The output is as follows :

 Insert picture description here

At this point the extra strings need to be filtered out, such as the newline (<br />), the space entity (&nbsp;), and the bold markup (<B></B>). The filtering code is as follows:

# coding=utf-8
import re

content = '''
<tr><td>1001</td><td>Yang Xiuzhang<br /></td></tr>
<tr><td>1002</td><td>Yan&nbsp;Na</td></tr>
<tr><td>1003</td><td><B>Python</B></td></tr>
'''
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S|re.M)
for m in texts:
    value0 = m[0].replace('<br />', '').replace('&nbsp;', '')
    value1 = m[1].replace('<br />', '').replace('&nbsp;', '')
    if '<B>' in value1:
        m_value = re.findall(r'<B>(.*?)</B>', value1, re.S|re.M)
        print(value0, m_value[0])
    else:
        print(value0, value1)

The replace calls substitute the strings "<br />" and "&nbsp;" with the empty string to filter them out, while the bold markup (<B></B>) is filtered with a regular expression. The output is as follows:

 Insert picture description here


Five. A personal blog crawling example

Bear in mind: this example may not be impressive, but it works well as an introduction combined with regular expressions. If you are just starting to learn Python web crawling, don't worry; only through training like this will you be able to handle similar problems in the future and capture the data you need.

1. The analysis process

Having covered regular expressions, the common modules for crawling network data, and the common methods for crawling data with regular expressions, we now walk through a simple example of using regular expressions to crawl a website: the author's personal blog site, from which we extract the content we need.

The author's personal site "http://www.eastmountyxz.com/" opens as shown in the figure. Assume that what we now need to crawl is the following:

  • The content of the blog's title (the title tag)
  • The hyperlinks of all images, e.g. the "xxx.jpg" in <img src="xxx.jpg" />
  • The title, hyperlink, and abstract of each of the four articles on the blog's front page, for example the title "Goodbye, Beijing Institute of Technology: remembering my graduate programming days in Beijing".

 Insert picture description here


Step 1: Locate the source in the browser

First locate the source code of the elements to crawl in the browser, such as the article titles, hyperlinks, and images, and find the pattern of the corresponding HTML source; this is called DOM-tree document node analysis. Open the page in a browser, select the content you want to crawl, right-click, and choose "Inspect element" or "Inspect"; the HTML source of the node to crawl is then located, as shown in the figure.

 Insert picture description here

The title "Goodbye, Beijing Institute of Technology: remembering my graduate programming days in Beijing" is located under the <div class="essay"></div> node, which contains an <h1></h1> recording the title and a <p></p> recording the abstract, namely:

 Insert picture description here
Here the node to crawl is identified by the tag's attribute and attribute value: finding the div whose class attribute is "essay" locates the first article. In the same way, the other three articles are <div class="essay1"></div>, <div class="essay2"></div>, and <div class="essay3"></div>, which locate those nodes.

Step 2: Crawl the title with a regular expression

The title of a website is usually located between <head><title>…</title></head>; the HTML of this site's title is as follows:

<head>
<meta charset="utf-8">
<title>Xiuzhang studies in heaven and earth</title>
....
</head>

The way to get the blog title "Xiuzhang studies in heaven and earth" is with the regular expression "<title>(.*?)</title>". The code is as follows: it first visits the blog URL with the urlopen() function and then applies the regular expression.

import re
import urllib.request
url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
title = re.findall(r'<title>(.*?)</title>', content.decode('utf-8'))
print(title[0])

The output result is shown in the figure below :

 Insert picture description here

Step 3: Crawl all image addresses with a regular expression

Because the HTML image tag has the format "<img src=image_address />", the way to get an image address with a regular expression is to capture what starts after src=" and ends at the next double quote. The code is as follows:

import re
import urllib.request

url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
urls = re.findall(r'src="(.*?)"', content.decode('utf-8'))
for url in urls:
    print(url)

The output is as follows; there are 6 images in total.

 Insert picture description here

Note that each image path here omits the blog's base address:

  • http://www.eastmountyxz.com/

We need to splice the crawled image path with the original blog address to form a complete image address before downloading it, and that address can also be opened directly in a browser. For example (a short splicing sketch follows the example address below):

  • http://www.eastmountyxz.com/images/11.gif
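
A minimal splicing-and-download sketch follows; the relative path images/11.gif comes from the example address above, and naming the saved file after the last url field follows the earlier split('/') trick.

# coding=utf-8
import urllib.request

base = 'http://www.eastmountyxz.com/'
src = 'images/11.gif'                 # relative address captured by the regular expression
full_url = base + src                 # splice into the complete image address
urllib.request.urlretrieve(full_url, src.split('/')[-1])   # download and save as 11.gif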

Step 4: Crawl the blog content with regular expressions

Step 1 showed how to locate the titles of the four articles: the first article lies between the <div class="essay"> and </div> tags, the second between <div class="essay1"> and </div>, and so on. However, this HTML code has a flaw: a class attribute usually denotes a category of tags and should have the same value for all of them, so the class attribute of all four articles should be "essay", while a name or id attribute would be the unique identifier of a tag.

Here find('<div class="essay">') locates the start of the first article and find('<div class="essay1">') locates its end, which yields the content from <div class="essay"> to </div>. For example, the code for getting the title and hyperlink of the first article is as follows:

import re
import urllib.request
url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
data = content.decode('utf-8')
start = data.find(r'<div class="essay">')
end = data.find(r'<div class="essay1">')
print(data[start:end])

The output is as follows: the HTML source of the first blog post.

 Insert picture description here

This code works in three steps:

  • Call the urlopen() function of the urllib library to open the blog address and read the content into the content variable.
  • Call the find() function to locate specific content, such as the div tags whose class attributes are "essay" and "essay1", which give the start and end positions in turn.
  • Analyze the located segment further to get the hyperlink and title in the source code.

After locating this segment, the specific content is then extracted with regular expressions, as follows:

import re
import urllib.request
url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
data = content.decode('utf-8')
start = data.find(r'<div class="essay">')
end = data.find(r'<div class="essay1">')
page = data[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page) # Hyperlinks 
print(t1[0])
t2 = re.findall(r'<a .*?>(.*?)</a>', page) # title 
print(t2[0])
t3 = re.findall('<p style=.*?>(.*?)</p>', page, re.M|re.S) # Abstract 
print(t3[0])

The regular expressions extract each piece of content in turn. Because the crawled abstract (the <p> element) contains line breaks, re.M and re.S are added to support matching across lines. The result is as follows:

 Insert picture description here

2. Code implementation

The complete code is as follows :

#coding:utf-8
import re
import urllib.request

url = "http://www.eastmountyxz.com/"
content = urllib.request.urlopen(url).read()
data = content.decode('utf-8')

# Crawl the title
title = re.findall(r'<title>(.*?)</title>', data)
print(title[0])

# Crawl the image addresses
urls = re.findall(r'src="(.*?)"', data)
for url in urls:
    print(url)

# Crawl the first article
start = data.find(r'<div class="essay">')
end = data.find(r'<div class="essay1">')
page = data[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print(t1[0])
t2 = re.findall(r'<a .*?>(.*?)</a>', page)  # title
print(t2[0])
t3 = re.findall('<p style=.*?>(.*?)</p>', page, re.M|re.S)  # abstract
print(t3[0])
print('')

# Crawl the second article
start = data.find(r'<div class="essay1">')
end = data.find(r'<div class="essay2">')
page = data[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print(t1[0])
t2 = re.findall(r'<a .*?>(.*?)</a>', page)  # title
print(t2[0])
t3 = re.findall('<p style=.*?>(.*?)</p>', page, re.M|re.S)  # abstract
print(t3[0])

The output result is shown in the figure .

 Insert picture description here

From the code above, readers will find that crawling websites with regular expressions is fairly tedious, especially when locating page nodes. Later articles will introduce the common third-party extension packages Python provides and use their functions for targeted crawling.

Six. Summary

A regular expression filters expressions through a "rule string" to match the desired information out of complex content. Its main target is text: it is well suited to matching literal string content such as URLs and email addresses, but not to matching textual meaning. Regular expressions can be used in virtually all programming languages, such as C#, Java, and Python.

Regular-expression crawling is often used to get a specific part of a string, for example extracting the read and comment counts of a blog, intercepting the domain name of a URL or one of its parameters, filtering out specific characters, checking whether the obtained data fits a certain pattern, or validating a URL or a date. Its flexibility, logic, and expressive power let it achieve the goal of matching within complex strings in a very simple way.
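
A short sketch of those typical uses (the sample strings and patterns below are made up for illustration):

# coding=utf-8
import re

# Extract the read and comment counts from a sentence
text = 'Reads: 1052, Comments: 36'
print(re.findall(r'\d+', text))                             # ['1052', '36']

# Intercept the domain name of a URL
url = 'http://blog.csdn.net/Eastmount/article/details/1001'
print(re.findall(r'://(.*?)/', url)[0])                     # blog.csdn.net

# Check whether a string looks like a date of the form 2020-11-13
print(bool(re.match(r'\d{4}-\d{2}-\d{2}$', '2020-11-13')))  # True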

But for newcomers, regular expressions are rather obscure, and extracting a particular piece of text from HTML with them is also difficult, especially when closing tags are missing or unclear in the page source. Next, the author will introduce more powerful and intelligent third-party crawler packages, mainly BeautifulSoup and Selenium.

Appreciation:

Finally, thank you for following the "Na Zhang's Home" public account, and thank CSDN for so many years of company. I will keep sharing; I hope my articles can accompany you as you grow and that I keep moving forward on the road of technology. If an article helps you or gives you an insight, that is the best reward for me; let us meet and cherish it! The public account was created on August 18, 2020. Thank you again for your attention, and please help promote "Na Zhang's Home". I am new at this, so please give me your advice.

 Insert picture description here

 Insert picture description here

(By: Na Zhang's Home, Eastmount, 2020-09-30, night at Wuhan University, https://blog.csdn.net/Eastmount)



Copyright notice: this article was created by [Eastmount]; please include a link to the original when reposting. Thank you.
