They say Python can do anything — this time, let's take a look at the "Liyang photography circle" forum with Python

Dream eraser 2021-10-29 06:24:12

This blog continues our study of BeautifulSoup. The target site this time is "Liyang photography circle", a local forum.

Target site analysis

The target site paginates according to the following rule:

http://www.jsly001.com/thread-htm-fid-45-page-{page_number}.html
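Given that rule, the list-page URLs can be generated with a simple format string. A minimal sketch (the `page_urls` helper is my own naming, not from the original code):

```python
# Build the list-page URLs for the forum section from the pagination rule.
BASE = "http://www.jsly001.com/thread-htm-fid-45-page-{}.html"

def page_urls(n):
    """Return the URLs for pages 1..n."""
    return [BASE.format(page) for page in range(1, n + 1)]

print(page_urls(2))
```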

The code is written with the threading module, the requests module, and the BeautifulSoup module.

Collection follows the usual pattern: list page → detail page.

Liyang photography circle picture collection code

This is a hands-on case. The relevant bs4 knowledge was covered in the last blog, so the complete code is shown first, followed by comments on the key functions.

import random
import threading
import logging

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.NOTSET)  # set the log output level (emit everything)


# Declare the LiYang class, inheriting from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)  # initialise the thread object
        self._headers = self._get_headers()  # pick a random UA
        self._timeout = 5  # request timeout in seconds

    # Each thread fetches from the shared target
    def run(self):
        # while True:  # enable this loop to keep the thread crawling continuously
        res = None  # initialise so the check below is safe if the request fails
        try:
            res = requests.get(
                url="http://www.jsly001.com/thread-htm-fid-45-page-1.html",
                headers=self._headers,
                timeout=self._timeout,
            )  # fetch the first list page as a test
        except Exception as e:
            logging.error(e)
        if res is not None:
            html_text = res.text
            self._format_html(html_text)  # call the HTML parsing function

    def _format_html(self, html):
        # parse with the lxml parser
        soup = BeautifulSoup(html, 'lxml')
        # locate the divider row of the section topics,
        # mainly to avoid picking up the pinned (sticky) threads
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})
        if part_tr is not None:
            # detail-page links that follow the divider
            items = part_tr.find_all_next(attrs={"name": "readlink"})
        else:
            items = soup.find_all(attrs={"name": "readlink"})
        # extract (title, url) pairs
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # visit each detail page
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """Resolve the picture addresses on a detail page."""
        res = None
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        # picture extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            content = origin_div2 if origin_div2 else origin_div1
            if content is not None:
                imgs = content.find_all('img')
                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs)  # save the pictures

    def _save_img(self, name, imgs):
        """Save the pictures."""
        for img in imgs:
            url = img.get("src")
            if url.find('http') < 0:
                continue
            # read the id attribute from the parent tag
            id_ = img.find_parent('span').get("id")
            res = None
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)
            if res is not None:
                name = name.replace("/", "_")
                # note: create the imgs folder in the working directory before running
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f:
                    f.write(res.content)

    def _get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers


if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.run()
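Note that the main block calls run() directly, which executes the crawl on the main thread. To actually crawl in parallel, create several instances and call start() and join() instead. A minimal sketch of that pattern, using a trivial stand-in class of my own (the real LiYangThread hits the network, so it is not reproduced here):

```python
import threading

# Stand-in for a crawler thread: each instance records its result in a
# shared dict keyed by its index.
class DemoThread(threading.Thread):
    def __init__(self, results, index):
        threading.Thread.__init__(self)
        self.results = results
        self.index = index

    def run(self):
        self.results[self.index] = f"thread-{self.index} done"

results = {}
threads = [DemoThread(results, i) for i in range(3)]
for t in threads:
    t.start()   # start() spawns a new OS thread, which then calls run()
for t in threads:
    t.join()    # wait for every thread to finish before reading results
```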

This case has the BeautifulSoup module use the lxml parser to parse the HTML. This parser will be used again in later posts; note that the lxml package must be installed beforehand (e.g. pip install lxml).

The data extraction relies on the soup.find() and soup.find_all() functions. The code also uses the find_parent() function, which reads the id attribute from a parent tag.

# read the id attribute from the parent tag
id_ = img.find_parent('span').get("id")
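The lookup above can be demonstrated on a small self-contained snippet. The markup here is hypothetical, mirroring the forum's structure of an img wrapped in a span with an id:

```python
from bs4 import BeautifulSoup

# Hypothetical markup resembling the forum's image wrapper.
html = '<span id="att_1"><img src="http://example.com/a.jpg"></span>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
# find_parent walks up the tree to the nearest enclosing <span>
id_ = img.find_parent("span").get("id")
print(id_)  # att_1
```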

While the code is running it prints DEBUG information; control this through the logging output level.
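For instance, with level=logging.NOTSET (as in the code above) the root logger emits everything, including urllib3's DEBUG messages from requests; raising the level silences that noise. A minimal sketch:

```python
import logging

# WARNING and above will be emitted; DEBUG and INFO are suppressed.
logging.basicConfig(level=logging.WARNING)

logging.debug("connection details")   # suppressed: below WARNING
logging.warning("request retried")    # emitted
```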

Code repository: https://codechina.csdn.net/hihell/python120 — feel free to follow or star it.

Afterword

This blog is a bs4 application piece; if needed, review and practice it repeatedly to consolidate the material.

Today is day 239 / 365 of continuous writing.
Looking forward to your follows, likes, comments, and favorites.


Copyright notice
This article was written by [Dream eraser]. Please include the original link when reposting. Thanks.
https://pythonmana.com/2021/10/20211013181453117w.html
