Python crawler: from introduction to abandonment

SunriseCai 2020-11-13 11:31:52


This blog is just where I record articles in my spare time and publish them for readers. If anything here infringes on your rights, please let me know and I will delete it.
This article is entirely original; it does not reference or plagiarize anyone else's work. Insist on originality!!

Preface

Hello. This is the Python Crawler: From Introduction to Abandonment series of articles. I am SunriseCai.

Writing a Python crawler takes three steps, and each step corresponds to one article:

  • Request the web page
  • Get the web response and parse the data (the web page)
  • Save the data (still to come)

This article covers the second step of a Python crawler: parsing the web page.

  • Parsing a web page mainly means extracting data from the HTML file. This article introduces the following three ways of doing that:
  1. BeautifulSoup
  2. XPath
  3. Regular expressions (re)

Here is a quick look at the differences among the three parsing methods introduced in this article.

Method                     Description
BeautifulSoup              A Python library for extracting data from HTML or XML files
XPath                      A language for finding information in XML documents
Regular expressions (re)   A special sequence of characters that can easily check whether a string matches a given pattern

This article mainly covers their basic use; for more detail, follow the documentation links given in each section and study systematically.
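To make the comparison concrete, here is a minimal sketch that extracts the same link text with all three approaches. The tiny html string is invented for illustration, and it assumes the bs4 and lxml modules installed in the next step:

import re
from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><a href='www.animal.html'> puppy </a></body></html>"

print(BeautifulSoup(html, 'lxml').a.string)           # puppy  (BeautifulSoup)
print(etree.HTML(html).xpath('//a/text()')[0])        # puppy  (XPath)
print(re.search(r'<a .*?>(.*?)</a>', html).group(1))  # puppy  (regular expression)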


Install the module

First, open a cmd window and run the following commands to install the bs4 and lxml modules used in this article.

pip install beautifulsoup4
pip install lxml


1) BeautifulSoup

BeautifulSoup is a very powerful data-processing tool. It defines four object types and supports traversing the document tree, searching the document, CSS selectors, and more. Below is a brief introduction to its four object types and its basic use.

1.1 The four object types of BeautifulSoup4

BeautifulSoup transforms a complex HTML document into a tree structure. Every node is a Python object, and all objects can be grouped into four types:

Type              Description
Tag               An HTML tag in the document
NavigableString   The text inside a tag
BeautifulSoup     The entire content of the document
Comment           The text inside a tag with the comment markers filtered out
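A minimal, self-contained sketch showing all four types at once (the markup strings here are made up for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='x'>hello</p>", 'lxml')
print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(type(soup.p))         # <class 'bs4.element.Tag'>
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

comment_soup = BeautifulSoup("<b><!--a comment--></b>", 'lxml')
print(type(comment_soup.b.string))  # <class 'bs4.element.Comment'>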

1.2 BeautifulSoup Basic usage examples

First, import the BeautifulSoup module:

from bs4 import BeautifulSoup

Here is the HTML code used in the examples that follow:

html_doc = '''
<html>
<body>
<div id='nothing'>
<ul>
<li class="animal">
<a href="www.animal.html" class='one'> puppy </a>
</li>
<li class="fruits">
<a href="www.fruits.html" class='two'> Fruits </a>
</li>
<li class="vegetable">
<a href="www.vegetable.html class='three'"> Chinese cabbage </a>
</li>
</ul>
</div>
</body>
</html>
'''

Create a BeautifulSoup object:

soup = BeautifulSoup(html_doc,'lxml')

Basic usage examples :

soup = BeautifulSoup(open("index.html")) # Create a BeautifulSoup object from a file handle
soup = BeautifulSoup("<html>data</html>") # Create a BeautifulSoup object from a string
print(soup.prettify()) # Pretty-print the soup object
print(soup.li) # Get the first <li> tag
print(soup.div) # Get the first <div> tag
print(soup.find('a')) # Get the first <a> tag in the document
print(soup.find_all('a')) # Get all <a> tags in the document
print(soup.a.string) # Get the text of the <a> tag
print(soup.get_text()) # Get all the text in the document
......
' There are too many examples to illustrate one by one; even ten thousand words would not be enough. '
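The capability list above mentioned CSS selectors; BeautifulSoup exposes them through select() and select_one(). A short sketch, re-creating soup from the html_doc above:

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('li.fruits a'))       # [<a class="two" href="www.fruits.html"> Fruits </a>]
print(soup.select_one('a.one').string)  # puppy
print([a['href'] for a in soup.select('ul a')])
# ['www.animal.html', 'www.fruits.html', 'www.vegetable.html']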

1.3 Examples of the four object types

The following examples are adapted from the BeautifulSoup official documentation. They are not complete; reading the official documentation is recommended for systematic study.

1.3.1 Tag

Tag:

soup = BeautifulSoup(html_doc, 'lxml')
tag = soup.a
type(tag) # <class 'bs4.element.Tag'>

  • Tag has many methods and properties. Here are its two most important attributes: name and attributes.

name:

# Use .name to get the tag's name
print(tag.name) # 'a'
# The tag's name can also be changed
tag.name = "blockquote"
print(tag) # <blockquote class="one" href="www.animal.html"> puppy </blockquote>
tag.name = "a" # change the name back for the examples below

attributes:

# A tag can have any number of attributes. The tag <a href="www.animal.html" class='one'> has a "class" attribute whose value is ['one']
print(tag['class']) # ['one']
# You can also access all attributes at once, via .attrs
print(tag.attrs) # {'href': 'www.animal.html', 'class': ['one']}
--------------------------------------------------------------------
# Tag attributes can be added, deleted, or modified; a tag's attributes work just like a dictionary
tag['class'] = 'verybold'
tag['id'] = 1
print(tag) # <a class="verybold" href="www.animal.html" id="1"> puppy </a>
del tag['class'] # Delete 'class'
del tag['id'] # Delete 'id'
print(tag) # <a href="www.animal.html"> puppy </a>
print(tag['class']) # KeyError: 'class'
print(tag.get('class')) # None

1.3.2 NavigableString

  • The operations above fetch the whole tag; now use NavigableString to get just the text contained in a tag.
# Use .string to get the text inside the tag
soup = BeautifulSoup(html_doc,'lxml')
print(soup.a.string) # puppy
print(type(soup.a.string)) # <class 'bs4.element.NavigableString'>

1.3.3 BeautifulSoup

  • A BeautifulSoup object represents the content of the document as a whole; most of the time you can treat it as a Tag object.
soup = BeautifulSoup(html_doc,'lxml')
print(soup.name) # [document]

1.3.4 Comment

  • A Comment object is a special type of NavigableString; when printed, the comment markers are filtered out.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,'lxml')
comment = soup.b.string
print(comment) # Hey, buddy. Want to buy a used parser?
print(type(comment)) # <class 'bs4.element.Comment'>

Is this everything you need? I can guarantee it is not, but don't worry: the official documentation is much more complete. See the BeautifulSoup official documentation.

2) XPath(XML Path Language)

2.1 Introduction to XPath

XPath, the XML Path Language, is a language for finding information in XML documents. This section introduces only its basic use; for systematic study, please move on to an XPath tutorial.

Here are the most useful path expressions:

Expression   Usage             Description
nodename     xpath('//div')    Selects all child nodes of the named node.
/            xpath('/div')     Selects from the root node.
//           xpath('//div')    Selects matching nodes anywhere in the document, regardless of their location.
.            xpath('./div')    Selects relative to the current node.
..           xpath('..')       Selects the parent of the current node.
@            xpath('/@class')  Selects an attribute.

2.2 XPath Basic usage examples

First, import the etree module from lxml (installed above); it performs the HTML parsing behind XPath.

from lxml import etree

As an aside, the etree module can also repair incomplete markup and serialize the result with etree.tostring() or etree.tostringlist(), though that is not the focus here.
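For example, a minimal sketch of this repair behavior (the broken string is invented for illustration):

from lxml import etree

broken = "<div><ul><li>one<li>two"  # deliberately incomplete HTML
root = etree.HTML(broken)           # the HTML parser repairs the markup
print(etree.tostring(root, encoding='unicode'))
# roughly: <html><body><div><ul><li>one</li><li>two</li></ul></div></body></html>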
The following example code uses XPath to parse the web page:

html_doc = '''
<html>
<body>
<div id='nothing'>
<ul>
<li class="animal">
<a href="www.animal.html" class='one'> puppy </a>
</li>
<li class="fruits">
<a href="www.fruits.html" class='two'> Fruits </a>
</li>
<li class="vegetable">
<a href="www.vegetable.html class='three'"> Chinese cabbage </a>
</li>
</ul>
</div>
</body>
</html>
'''
# First parse the HTML text, constructing an element tree that supports XPath queries
parse_html = etree.HTML(html_doc)

2.2.1 Getting attributes

  • Example: get the href attribute of the <a> tag under each <li>
# Get the href attribute of the <a> under the first <li>
href_content = parse_html.xpath('//li[1]/a/@href')
print(href_content) # ['www.animal.html']
# Get the href attribute of the <a> under every <li>
href_content = parse_html.xpath('//li/a/@href')
print(href_content) # ['www.animal.html', 'www.fruits.html', 'www.vegetable.html']

2.2.2 Getting text, including from a specified node

  • Example: get the text of <a> tags, including from a specified node
# Get the text content of the first <a> tag
text = parse_html.xpath('//li[1]/a/text()')
print(text) # [' puppy ']
# Get the text content of every <a> tag
text = parse_html.xpath('//li/a/text()')
print(text) # [' puppy ', ' Fruits ', ' Chinese cabbage ']
# Get the text content of the <a> under a specified node
text = parse_html.xpath('//*[@class="fruits"]/a/text()')
print(text) # [' Fruits ']

2.2.3 Getting all nodes

  • Example: get all <li> and <a> nodes
# Get all <li> nodes
node = parse_html.xpath('//li')
print(node) # [<Element li at 0x2af2b3b3a88>, <Element li at 0x2af2b3b3a48>, <Element li at 0x2af2b3b3b48>]
# Get all <a> nodes
node = parse_html.xpath('//a')
print(node) # [<Element a at 0x26978ab5a48>, <Element a at 0x26978ab5a08>, <Element a at 0x26978ab5b08>]

2.2.4 Getting child nodes

  • Example: get the text content of the <a> children of every <li> node
child_node = parse_html.xpath('//li/a/text()')
print(child_node) # [' puppy ', ' Fruits ', ' Chinese cabbage ']

2.2.5 Getting the parent node

  • Example: get the class attribute of the <li> parent of every <a> node
parent_node = parse_html.xpath('//a/../@class')
print(parent_node) # ['animal', 'fruits', 'vegetable']
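One more pattern worth showing, because real crawlers use it constantly, is relative XPath: select each <li> node first, then query inside it with a path that starts from the node itself. A sketch using the parse_html object from above:

# Iterate over every <li>, extracting its class plus the text and href of its <a>
for li in parse_html.xpath('//li'):
    text = li.xpath('./a/text()')[0]
    href = li.xpath('./a/@href')[0]
    print(li.get('class'), text.strip(), href)
# animal puppy www.animal.html
# fruits Fruits www.fruits.html
# vegetable Chinese cabbage www.vegetable.html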

Is this everything you need? I guess not, but don't worry: the official documentation is much more complete. See the XPath tutorial and the XPath common syntax reference.

3) Regular expressions (re)

First, it is recommended to move on to the official documentation for systematic study of regular expressions: https://docs.python.org/3/library/re.html
There is far too much to regular expressions to cover here, and I'm afraid of misleading you, so please rely on the official documentation for systematic study.

3.1 Common re functions

  • Common methods:
Method       Syntax                                            Description
re.sub       re.sub(pattern, repl, string, count=0, flags=0)   Replaces matches in a string
re.compile   re.compile(pattern, flags=0)                      Compiles a regular expression into a pattern object, for use with match(), search(), and the other functions
re.match     re.match(pattern, string, flags=0)                Matches the pattern at the beginning of the string; returns None if there is no match
re.search    re.search(pattern, string, flags=0)               Scans the string and returns the first successful match, or None if there is none
re.findall   re.findall(pattern, string, flags=0)              Returns all matches of the pattern as a list; an empty list if there are none
re.finditer  re.finditer(pattern, string, flags=0)             Like findall, but returns the matches as an iterator
  • Several parameters recur across these methods, such as pattern, string, and flags=0. Here is what they mean.
Parameter   Description
pattern     The regular expression to match
string      The string to be matched
flags       Flags that control how the regular expression is matched
  • What can flags be? See the table below.
Flag   Description
re.I   Ignore case
re.L   Locale-aware matching
re.M   Multi-line matching; affects ^ and $
re.S   Makes . match any character, including newlines
re.U   Parses characters according to the Unicode character set; affects \w, \W, \b, \B
re.X   Verbose mode, which allows regular expressions to be written in a more readable format
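To make the flags concrete, here is a small sketch (the sample string is invented for illustration) contrasting matches with and without re.S and re.I:

import re

text = "first line\nSecond LINE"
print(re.findall('line.*', text))        # ['line'] - '.' stops at the newline
print(re.findall('line.*', text, re.S))  # ['line\nSecond LINE'] - '.' now crosses newlines
print(re.findall('line', text, re.I))    # ['line', 'LINE'] - case-insensitive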
  • (A table of regular-expression syntax examples, quoted from the Novice tutorial, appeared here as an image.)

3.2 Regular expression (re) usage examples

Import the module first. re ships with Python, so no additional installation is required.

import re

The following examples use regular expressions to parse the web page:

html_doc = '''
<html>
<body>
<div id='nothing'>
<ul>
<li class="animal">
<a href="www.animal.html" class='one'> puppy </a>
</li>
<li class="fruits">
<a href="www.fruits.html" class='two'> Fruits </a>
</li>
<li class="vegetable">
<a href="www.vegetable.html" class='three'> Chinese cabbage </a>
</li>
</ul>
</div>
</body>
</html>
'''

3.2.1 re.sub Example

  • Example: replace every digit in 2020SunriseCai with 0
result = re.sub(r'\d', '0', '2020SunriseCai')
print(result) # 0000SunriseCai

3.2.2 re.compile Example


  • Example: match the first <...> tag
pattern = re.compile('<.*?>')
result = pattern.search(html_doc)
print(result) # <re.Match object; span=(1, 7), match='<html>'>
print(result.group()) # <html>

3.2.3 re.match Example

  • Example: match the string 2020SunriseCai.
# A successful match
result = re.match(r"\d+", '2020SunriseCai')
print(result) # <re.Match object; span=(0, 4), match='2020'>
print(result.group()) # 2020
# A failed match returns None
result = re.match(r"\s+", '2020SunriseCai')
print(result) # None

3.2.4 re.search Example

  • Example: match the first <a> tag
result = re.search("(<a .*?</a>)", html_doc, re.M)
print(result.group())
# <a href="www.animal.html" class='one'> puppy </a>

3.2.5 re.findall Example

  • Example: match all <a> tags, returning a list
result = re.findall("(<a .*</a>)", html_doc, re.M)
print(result)
# ['<a href="www.animal.html" class=\'one\'> puppy </a>',
#  '<a href="www.fruits.html" class=\'two\'> Fruits </a>',
#  '<a href="www.vegetable.html" class=\'three\'> Chinese cabbage </a>']

3.2.6 re.finditer Example

  • Example: match all <a> tags, returning an iterator
result = re.finditer("(<a .*</a>)", html_doc, re.M)
for data in result:
    print(data.group())
# <a href="www.animal.html" class='one'> puppy </a>
# <a href="www.fruits.html" class='two'> Fruits </a>
# <a href="www.vegetable.html" class='three'> Chinese cabbage </a>

Is this everything you need? I guess not, but don't worry: the official documentation is much more complete. See:
Regular expressions: https://docs.python.org/3/library/re.html
An excellent blog post: Python regular expressions in detail: the re library


Admittedly, this article is only a sketch; you are encouraged to follow the official documentation links above for systematic study.
Among these several web-page parsers there is always one for you; pick the one you like and learn it systematically.


Finally, a summary of this chapter:

  1. Introduced several web-page parsers and their differences
  2. Introduced the basic use of BeautifulSoup
  3. Introduced the basic use of XPath
  4. Introduced the basic use of regular expressions (re)

sunrisecai

  • Thank you for your patience in reading. Follow me so you don't get lost.
  • To make it easier for fellow beginners to help each other, you are welcome to join the QQ group: 648696280

The next article is titled "Python Crawler: From Introduction to Abandonment 06 | The crawler's first shot at saving data".

Copyright notice
This article was created by [SunriseCai]. Please include a link to the original when reposting. Thank you.

  60. Python link MySQL database