Crawler 120 Examples series: learn Python's BeautifulSoup4 module, a 7000-word blog post plus a crawl of the 9th Workshop site

Dream eraser 2021-10-25 21:52:55

Today brings a new day in the Crawler 120 Examples series. The next 3 articles will focus on BeautifulSoup4.

BeautifulSoup4 basics

BeautifulSoup4 is a Python parsing library, used mainly to parse HTML and XML. In crawler work, parsing HTML is the more common case. The library is installed with the following command:

pip install beautifulsoup4

When parsing data, BeautifulSoup relies on a third-party parser. Common parsers and their strengths are as follows:

  • Python standard library html.parser: built into Python, with reasonable fault tolerance;
  • lxml parser: fast, with strong fault tolerance;
  • html5lib: the most fault tolerant, parsing pages the same way a browser does.
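The parser is chosen by the second argument when instantiating BeautifulSoup. A minimal sketch of switching between them (lxml and html5lib are extra dependencies, installed with pip install lxml html5lib):

from bs4 import BeautifulSoup

markup = "<p>hello bs4</p>"
# Built-in parser: no extra installation required
print(BeautifulSoup(markup, "html.parser").p)
# Fast parser: uncomment after pip install lxml
# print(BeautifulSoup(markup, "lxml").p)
# Browser-grade fault tolerance: uncomment after pip install html5lib
# print(BeautifulSoup(markup, "html5lib").p)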

Next, a custom HTML snippet is used to demonstrate basic use of the beautifulsoup4 library. The test code is as follows:

<html>
<head>
<title>test bs4 module script</title>
</head>
<body>
<h1>Eraser crawler course</h1>
<p>Use a custom HTML snippet for the demo</p>
</body>
</html>

Let's run some simple operations with BeautifulSoup: instantiating a BS object, printing page tags, and so on.

from bs4 import BeautifulSoup
text_str = """<html> <head> <title>test bs4 module script</title> </head> <body> <h1>Eraser crawler course</h1> <p>Demo paragraph 1 of custom HTML</p> <p>Demo paragraph 2 of custom HTML</p> </body> </html> """
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The line above parses a string; you can also parse straight from a file
# soup = BeautifulSoup(open('test.html'))
print(soup)
# Print the page title tag
print(soup.title)
# Print the head tag
print(soup.head)
# Print a paragraph tag p
print(soup.p)  # only the first one is returned by default

Through the BeautifulSoup object we can call page tags directly, but there is a catch: calling a tag through the BS object only returns the tag in the first matching position. In the code above, only one p tag was retrieved; if you want more than that, read on.
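As a quick preview of the search functions covered later in this article, find_all() returns every match instead of only the first; a minimal sketch against the soup object built above:

print(soup.find_all("p"))  # a list containing both p tags, not just the first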

With that settled, you need to understand the 4 built-in objects in BeautifulSoup:

  • BeautifulSoup: the basic object, representing the whole HTML document; it can generally be treated like a Tag object;
  • Tag: a tag object; tags are the nodes of a web page, for example title, head, p;
  • NavigableString: the string inside a tag;
  • Comment: a comment object, rarely needed in crawler work.

The following code shows these objects in context; pay attention to the comments in the code.

from bs4 import BeautifulSoup
text_str = """<html> <head> <title>test bs4 module script</title> </head> <body> <h1>Eraser crawler course</h1> <p>Demo paragraph 1 of custom HTML</p> <p>Demo paragraph 2 of custom HTML</p> </body> </html> """
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The line above parses a string; you can also parse straight from a file
# soup = BeautifulSoup(open('test.html'))
print(soup)
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the page title tag
print(soup.title)
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
# Print the head tag
print(soup.head)
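The snippet above covers BeautifulSoup, Tag, and NavigableString, but not Comment. A minimal sketch of a Comment object, using a hypothetical one-line document:

from bs4 import BeautifulSoup

comment_soup = BeautifulSoup("<p><!-- a hidden note --></p>", "html.parser")
print(comment_soup.p.string)        # a hidden note
print(type(comment_soup.p.string))  # <class 'bs4.element.Comment'>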

A Tag object has two important attributes: name and attrs.

from bs4 import BeautifulSoup
text_str = """<html> <head> <title>test bs4 module script</title> </head> <body> <h1>Eraser crawler course</h1> <p>Demo paragraph 1 of custom HTML</p> <p>Demo paragraph 2 of custom HTML</p> <a href="http://www.csdn.net">CSDN Website</a> </body> </html> """
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
print(soup.name)  # [document]
print(soup.title.name)  # the tag name: title
print(soup.html.body.a)  # lower-level tags can be reached through the tag hierarchy
print(soup.body.a)  # html is a special root tag and can be omitted
print(soup.p.a)  # None -- the p tag contains no a tag
print(soup.a.attrs)  # get all attributes as a dict

The code above demonstrates the name attribute and the attrs attribute. attrs returns a dictionary, so you can fetch any attribute value by its key.

In BeautifulSoup, you can also get a tag's attribute value in the following ways:

print(soup.a["href"])
print(soup.a.get("href"))
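Two details worth knowing: get() accepts a default value for absent attributes, and multi-valued attributes such as class come back as a list. A minimal sketch, using a hypothetical one-line document for the second case:

print(soup.a.get("target", "no target set"))  # the default is returned when the attribute is missing
multi_soup = BeautifulSoup('<a class="btn small" href="#">x</a>', "html.parser")
print(multi_soup.a["class"])  # ['btn', 'small'] -- class is multi-valued, so a list comes back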

Getting a NavigableString object
Once you have a page tag, the next step is to get the text inside it; the following code does exactly that.

print(soup.a.string)

Besides that, you can use the text attribute and the get_text() method to get a tag's content.

print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())
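get_text() also accepts a separator and a strip flag, which helps when a tag contains several strings; a minimal sketch against the soup object above:

print(soup.body.get_text("|", strip=True))  # joins all strings with | and strips whitespace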

You can also get all the text inside a tag; just use the strings and stripped_strings attributes.

print(list(soup.body.strings))  # keeps whitespace and line breaks
print(list(soup.body.stripped_strings))  # whitespace and line breaks removed

Extension: node selectors for traversing the document tree

Direct child nodes

The direct children of a Tag object can be obtained through the contents and children attributes.

from bs4 import BeautifulSoup
text_str = """<html> <head> <title>test bs4 module script</title> </head> <body> <div id="content"> <h1>Eraser crawler course <span>best</span></h1> <p>Demo paragraph 1 of custom HTML</p> <p>Demo paragraph 2 of custom HTML</p> <a href="http://www.csdn.net">CSDN Website</a> </div> <ul class="nav"> <li>Home page</li> <li>Blog</li> <li>Column courses</li> </ul> </body> </html> """
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The contents attribute gets the node's direct children as a list
print(soup.div.contents)  # returns a list
# The children attribute also gets the direct children, but as an iterator
print(soup.div.children)  # returns <list_iterator object at 0x00000111EE9B6340>

Note that both attributes above return direct children only; for example, the span nested inside the h1 tag is not returned as a separate item.

If you want every descendant, use the descendants attribute. It returns a generator that yields all tags, including the strings inside them, as separate items.

print(list(soup.div.descendants))
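If you want only the tag nodes and not the loose strings, you can filter the generator with isinstance; a minimal sketch:

from bs4.element import Tag

tags_only = [node for node in soup.div.descendants if isinstance(node, Tag)]
print(tags_only)  # every descendant tag of the div, without the text nodes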

Other node accessors (just understand these; look them up when needed)

  • parent and parents: the direct parent node and all ancestor nodes;
  • next_sibling, next_siblings, previous_sibling, previous_siblings: the next sibling, all following siblings, the previous sibling, and all preceding siblings; since a newline also counts as a node, watch out for line breaks when using these attributes;
  • next_element, next_elements, previous_element, previous_elements: the next or previous parsed node; these are not hierarchical but walk all nodes in document order. In the code above, the next element of the div (after a whitespace text node) is the h1 inside it, while the div's next sibling leads to the ul (see the sketch below).
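A minimal sketch of the difference, using the soup object from the traversal example above (note the whitespace-only text nodes):

print(repr(soup.div.next_sibling))         # ' ' -- a whitespace text node, not the ul yet
print(soup.div.next_sibling.next_sibling)  # the ul tag itself
print(repr(soup.div.next_element))         # ' ' -- next_element first steps inside the div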

Document-tree search functions

The first function to learn is find_all(); its prototype is shown below:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
  • name: the tag name to find, e.g. find_all('p') finds all p tags; it accepts a tag-name string, a regular expression, or a list;
  • attrs: the attributes to match, passed in as a dictionary, e.g. attrs={'class': 'nav'}; the result is a list of Tag objects;

Usage examples for these two parameters:

import re

print(soup.find_all('li'))  # get all li tags
print(soup.find_all(attrs={'class': 'nav'}))  # pass attributes via attrs
print(soup.find_all(re.compile("p")))  # pass a regex; the measured results are not ideal
print(soup.find_all(['a', 'p']))  # pass a list
  • recursive: when find_all() is called, BeautifulSoup searches all descendants of the current tag; if you only want to search the tag's direct children, pass recursive=False. Test code:
print(soup.body.div.find_all(['a', 'p'], recursive=False))  # pass a list
  • text: matches text string content in the document; like the name parameter, text accepts a string, a regular expression, or a list;
print(soup.find_all(text='Home page'))  # ['Home page']
print(soup.find_all(text=re.compile("^Home")))  # ['Home page']
print(soup.find_all(text=["Home page", re.compile('course')]))  # ['Eraser crawler course', 'Home page', 'Column courses']
  • limit: limits the number of results returned;
  • kwargs: if a keyword argument's name is not one of the built-in search parameter names, it is treated as a tag attribute to search on. Here we search by the class attribute; since class is a Python reserved word, write class_ instead. When searching with class_, matching any single CSS class name is enough; if you pass several CSS class names, their order must match the order in the tag.
print(soup.find_all(class_='nav'))
print(soup.find_all(class_='nav li'))

Also note that some attributes cannot be used as kwargs, such as HTML5 data-* attributes; these have to be matched through the attrs parameter.
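A minimal sketch of matching an HTML5 data-* attribute through attrs, using a hypothetical one-line document:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
# data-foo is not a valid Python identifier, so it cannot be a keyword argument
print(data_soup.find_all(attrs={"data-foo": "value"}))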

Other methods whose usage is essentially the same as find_all() are listed below:

  • find(): prototype find(name, attrs, recursive, text, **kwargs); returns a single matching element;
  • find_parents(), find_parent(): prototype find_parent(self, name=None, attrs={}, **kwargs); return the ancestors / direct parent of the current node;
  • find_next_siblings(), find_next_sibling(): prototype find_next_sibling(self, name=None, attrs={}, text=None, **kwargs); return the following siblings / next sibling of the current node;
  • find_previous_siblings(), find_previous_sibling(): same as above, but return the preceding siblings / previous sibling of the current node;
  • find_all_next(), find_next(), find_all_previous(), find_previous(): prototype find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs); search the nodes that come after or before the current node in document order (a short sketch follows).
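A minimal sketch of a few of these variants, run against the soup object from the traversal example above:

print(soup.find("p"))                    # only the first p tag
print(soup.a.find_parent("div"))         # the nearest div ancestor of the a tag
print(soup.div.find_next_sibling("ul"))  # the ul that follows the div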

CSS selectors
The knowledge in this section overlaps quite a bit with the pyquery library; the core is the select() method, which returns a list of matching elements.

  • find by tag name: soup.select("title");
  • find by class name: soup.select(".nav");
  • find by id: soup.select("#content");
  • find by combined selector: soup.select("div#content");
  • find by attribute: soup.select("div[id='content']"), soup.select("a[href]").

When searching by attribute, a few extra operators are available, for example:

  • ^=: matches nodes whose attribute value starts with the given prefix:
print(soup.select('ul[class^="na"]'))
  • *=: matches nodes whose attribute value contains the given substring:
print(soup.select('ul[class*="li"]'))
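When only the first match is needed, select_one() saves the indexing step; a minimal sketch combining it with the selectors above:

print(soup.select_one("#content h1"))  # first h1 inside the content div
print(soup.select("ul.nav li")[0])     # descendant combinator: first li under ul.nav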

Crawling the 9th Workshop

Once the BeautifulSoup basics are mastered, writing the crawler case is very simple. This time the target is http://www.9thws.com/#p2. The site hosts a large number of artistic QR codes, which can serve as references for designers.

The following code applies the BeautifulSoup module's tag retrieval and attribute retrieval; the complete code is as follows:

from bs4 import BeautifulSoup
import requests
import logging
import os

logging.basicConfig(level=logging.NOTSET)


def get_html(url, headers) -> None:
    res = None
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.debug("Request exception: %s", e)
    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("Number of items fetched:", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name},{src}")
            # Collect the name/src pairs
            datas.append((name, src))
        save(datas, headers)


def save(datas, headers) -> None:
    if datas is not None:
        os.makedirs("./imgs", exist_ok=True)  # make sure the output directory exists
        for item in datas:
            res = None
            try:
                # Fetch the image
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.debug(e)
            if res is not None:
                img_data = res.content
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None


if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
    url_format = "http://www.9thws.com/#p{}"
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)

The output of this test code is produced with the logging module, as shown above.
The test only collected 1 page of data; to widen the collection range, just modify the page-number rule in the main block.
During coding, the site's data request turned out to be a POST returning JSON, so this case serves only as a BeautifulSoup starter example.

Code repository: https://codechina.csdn.net/hihell/python120. A follow or a Star would be appreciated.

Written at the end

The journey of learning the bs4 module has officially begun; keep it up.

Today is day 238 / 365 of continuous writing.
Looking forward to your follows, likes, comments, and bookmarks.

More highlights

Crawler 100 Examples: a paid column; after purchase you can study the whole series.

Copyright notice
This article was created by [Dream eraser]. Please include the original link when reposting. Thanks.
https://pythonmana.com/2021/10/20211012201245349i.html
