Python web crawler - Jingdong Mall product list

be hard by · 2021-09-15 13:15:56
Tags: python, web crawler, jingdong mall




I have been expanding my knowledge recently and wanted to learn another programming language. After much deliberation I settled on Python. Since its release, Python has held its own in programming with a huge user base: it gets the most work done with the least code, and it has a rich ecosystem of libraries to learn from. Today Python spans big data, machine learning, web development, artificial intelligence, and many other fields.

What is a web crawler

A web crawler is a program that obtains the data it needs directly from web resources, rather than through an API access interface provided by the website.
Web crawling, also called web data extraction, is a data-acquisition technique: the program communicates with a web resource, parses the returned document to pull out the required data, organizes that data into information, and converts it into the desired format.
Put more simply: we fetch the data we want through the website's web interface and store it in some format. Each crawl is a request for a web resource; from the returned page, a parsing tool locates the data we want, which is then saved for later use.

A web crawler generally works through the following steps:

  1. Determine the address of the target site
  2. Analyze the HTML to find the target tags
  3. Send a request to fetch the resource
  4. Parse the HTML page with a parsing tool
  5. Extract the data you need
  6. Save the extracted data
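The steps above can be sketched end to end with only the standard library. The HTML fragment below is made up for illustration, and step 3 (the network request) is replaced by a hard-coded string so the sketch runs offline:

```python
from html.parser import HTMLParser

# Steps 2 and 4-5: find the target tag (a div with class "p-name",
# mirroring the class names used later in this article) and keep its text.
class NameExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Step 2: the "target tag" is a div carrying the class we identified
        if tag == "div" and ("class", "p-name") in attrs:
            self.in_name = True

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())  # step 5: extract the data
            self.in_name = False

# Step 3 stand-in: a hypothetical response body instead of a real request
sample = ('<div class="p-name">Phone A</div>'
          '<div class="p-price">999</div>'
          '<div class="p-name">Phone B</div>')
parser = NameExtractor()
parser.feed(sample)
print(parser.names)  # ['Phone A', 'Phone B']
```

The real crawler below swaps `html.parser` for BeautifulSoup, which makes these tag lookups far more convenient.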

What can web crawlers do

Web crawlers serve many purposes: gathering data sources for big-data analysis, collecting articles, or, simply to satisfy curiosity, downloading portrait photos and the like.
Web crawling can raise copyright disputes, so developers should exercise judgment and hold to a bottom line of their own; cross it, and the law will define that bottom line for you.

The actual case of web crawler —— Get the list of Jingdong Mall

  1. Determine the request URL
    The case study is the Jingdong Mall search listing
    [Screenshot: Jingdong Mall search listing]

Analyze the site URL: https://search.jd.com/Search?keyword=手机&page=1
From the URL we can identify the parameters to pass in:
keyword: the search keyword (here 手机, "mobile phone")
page: the page number
Items per page: 30

JD's page numbering is a bit confusing: when you switch to page `page` in the browser, the URL parameter takes the value page*2-1. Each browser page actually loads a second page's worth of items when the user scrolls to the bottom, so the browser appears to show 60 items per page while each request really returns 30.
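This observed mapping (an observation about the site, not a documented contract) can be written down as a small helper: browser page n corresponds to request page 2n-1, with 30 items per request.

```python
# Observed mapping between the browser's page switcher and the `page`
# URL parameter: two requests of 30 items make up one browser page of 60.
ITEMS_PER_REQUEST = 30

def browser_to_request_page(browser_page):
    return 2 * browser_page - 1

print(browser_to_request_page(1))  # 1
print(browser_to_request_page(2))  # 3
print(browser_to_request_page(3))  # 5
```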

  2. Analyze the site's HTML to find the target tags
    In Chrome, press F12 or right-click and choose Inspect, then locate the tags you need
    [Screenshot: locating the tags]

The packages the program needs to import:

import requests as rq
from bs4 import BeautifulSoup as bfs
import json
import time
  3. Send a request to fetch the resource
    Write a function that generates the list of page URLs from a page count:

def get_urls(num):
    URL = "https://search.jd.com/Search"
    Param = "?keyword=手机&page={}"
    return [URL + Param.format(index) for index in range(1, num + 1)]
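A variant of the function above that percent-encodes the Chinese keyword explicitly with the standard library; `requests` does this automatically when it sends the URL, but building it by hand makes the final form of the URL visible:

```python
from urllib.parse import urlencode

def get_urls_encoded(num, keyword="手机"):
    base = "https://search.jd.com/Search"
    # urlencode percent-encodes the keyword as UTF-8,
    # e.g. 手机 -> %E6%89%8B%E6%9C%BA
    return [base + "?" + urlencode({"keyword": keyword, "page": page})
            for page in range(1, num + 1)]

urls = get_urls_encoded(2)
print(urls[0])  # https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1
```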

Send the request, attaching a request header so the site sees a browser-like client:

# Visit the site
def get_requests(url):
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    return rq.get(url, headers=header)
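Sites like JD tend to reject requests that do not look like they come from a browser, which is why the user-agent header matters. A standard-library sketch showing that the header really is attached (constructing a `Request` object makes no network call):

```python
from urllib.request import Request

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36")
req = Request("https://search.jd.com/Search?page=1",
              headers={"User-Agent": ua})
# urllib normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent").startswith("Mozilla/5.0"))  # True
```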
  4. Parse the HTML page with a parsing tool
    Convert the response into the format we need:

def get_soup(r):
    if r.status_code == rq.codes.ok:
        soup = bfs(r.text, "lxml")
    else:
        print("Network request failed...")
        soup = None
    return soup
  5. Extract the data you need
    Find the required information on the page: for each item in the product list, its name, price, shop name, shop link, and comment count

# Extract product information
def get_goods(soup):
    goods = []
    if soup is not None:
        tab_div = soup.find('div', id="J_goodsList")
        tab_goods = tab_div.find_all('div', class_="gl-i-wrap")
        for good in tab_goods:
            name = good.find('div', class_="p-name").text
            price = good.find('div', class_="p-price").text
            comment = good.find('div', class_="p-commit").find('strong').select_one("a").text
            shop = good.find('div', class_="p-shop").find('span').find('a').text
            shop_url = good.find('div', class_="p-shop").find('span').find('a')['href']
            goods.append({"name": name, "price": price, "comment": comment, "shop": shop, "shop_url": shop_url})
    return goods
  6. Save the extracted data
    Save the results to a JSON file; this could be swapped for another format, or for writes to a database
def save_to_json(goods, file):
    with open(file, "w", encoding="utf-8") as fp:
        json.dump(goods, fp, indent=2, sort_keys=True, ensure_ascii=False)
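A quick round-trip check of this save function against a temporary file; the sample records are made up for illustration. Note that `ensure_ascii=False` keeps Chinese product names readable in the output file:

```python
import json
import os
import tempfile

def save_to_json(goods, file):
    with open(file, "w", encoding="utf-8") as fp:
        json.dump(goods, fp, indent=2, sort_keys=True, ensure_ascii=False)

sample = [{"name": "Phone A", "price": "999.00", "shop": "Example Store"}]
path = os.path.join(tempfile.mkdtemp(), "jd_phone_test.json")
save_to_json(sample, path)

# Read it back and confirm nothing was lost or re-escaped
with open(path, encoding="utf-8") as fp:
    restored = json.load(fp)
print(restored == sample)  # True
```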

The program's main block ties the functions above together and crawls the full listing:

if __name__ == "__main__":
    goods = []
    a = 0
    for url in get_urls(30):
        a += 1
        print("Fetching page {} at URL: {}".format(a, url))
        response = get_requests(url)
        soup = get_soup(response)
        page_goods = get_goods(soup)
        goods += page_goods
        print("Waiting 10 seconds before the next page...")
        time.sleep(10)
    for good in goods:
        print(good)
    save_to_json(goods, "jd_phone2.json")

Run the program and check that the saved file has the expected format.
[Screenshot: the generated data]

Copyright notice
This article was written by [be hard by]. When reposting, please include a link to the original. Thanks.
https://pythonmana.com/2021/09/20210909163417742M.html
