Python web crawler: Jingdong Mall product list

I have been broadening my knowledge recently and wanted to learn another programming language. After much consideration I finally chose Python. Since its release, Python has held a firm place in programming and has built up a huge user base. It gets a lot of work done with very little code, and it has a rich set of libraries to learn from. Today Python is used in big data, machine learning, web development, artificial intelligence, and many other areas.

What is a web crawler

Web crawling is the process of obtaining the data you need from web resources: you take the information directly from the web pages instead of going through an API access interface that the website provides.

Web crawling, also called web data acquisition, is a data collection technique that gets the required data directly from a website's HTML. It covers communicating with the web resource, parsing the returned document to extract the required data, organizing that data into information, and converting it to the required format.

Put more simply: we get the data we want through the website's web interface and store it in some format. Each acquisition is a request for a web resource. From the page that comes back, we use a parsing tool to find the data we want and keep it for later use.

Web crawling generally breaks down into the following steps:

  1. Determine the address of the site to access
  2. Analyze the HTML to find the target tags
  3. Send a request to get the resource
  4. Use a tool to parse the HTML page
  5. Extract the data you need
  6. Save the extracted data
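
The steps above can be sketched as one small loop. This is only an illustration under stated assumptions: `crawl`, `fetch`, and `extract` are hypothetical names, and the stand-in functions replace the real request and HTML parsing so the sketch runs offline. Saving (step 6) is left out.

```python
def crawl(urls, fetch, extract):
    """Run steps 3 to 5 for every URL and return all extracted records."""
    records = []
    for url in urls:
        html = fetch(url)              # step 3: send the request
        records.extend(extract(html))  # steps 4 and 5: parse, then pick out the data
    return records


# Stand-ins so the sketch runs without network access (not real pages).
fake_pages = {
    "page-1": "Phone A|Phone B",
    "page-2": "Phone C",
}

records = crawl(
    urls=list(fake_pages),                 # step 1: the addresses to visit
    fetch=fake_pages.get,                  # stand-in for an HTTP request
    extract=lambda html: html.split("|"),  # stand-in for HTML parsing
)
print(records)  # ['Phone A', 'Phone B', 'Phone C']
```

Passing the fetch and extract steps in as functions keeps the loop itself independent of any particular site or parsing library.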

What can web crawlers do

Web crawlers are used in many areas: collecting network data sources for big data analysis, gathering articles, or, to satisfy one's curiosity, downloading pictures, and so on.

Web crawling can lead to copyright disputes, so developers need good judgment and a clear sense of where the line is. If you overstep, it will be the law that draws the line for you.

A practical web crawler example: getting the Jingdong Mall product list

  1. Determine the request URL

    The practical example is the Jingdong Mall search result list.

Analyze the site URL: mobile phone &page=1

From the URL you can work out the parameters that need to be passed in:

keyword: the search keyword

page: the page number

Number of products per page: 30

Page numbers on Jingdong Mall behave in a confusing way: when you switch pages in the browser, the page parameter in the URL is page*2-1, where page is the page number shown. That is because each displayed page loads a second page when the user scrolls to the bottom, so a displayed page appears to hold 60 items while the actual page size is 30.
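
The arithmetic can be sketched as follows. The function names are hypothetical, and the second value returned by `request_pages` rests on an assumption: the text only gives page*2-1 for the first half of a displayed page, so the lazily loaded half is assumed to be the next page value.

```python
PAGE_SIZE = 30  # actual number of products per requested page


def request_pages(display_page):
    """Return the page parameter values behind one page shown in the browser."""
    first = display_page * 2 - 1  # from the text: page*2-1
    second = first + 1            # assumption: the half loaded on scroll is the next value
    return first, second


def item_range(page):
    """Return the 1-based positions of the first and last product on a requested page."""
    start = (page - 1) * PAGE_SIZE + 1
    return start, page * PAGE_SIZE


print(request_pages(1))  # (1, 2)
print(request_pages(3))  # (5, 6)
print(item_range(2))     # (31, 60)
```

Under these assumptions, crawling the page parameter from 1 upward in steps of one, as the program below does, covers every product without having to imitate the scrolling behavior.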

  2. Analyze the site's HTML to find the target tags

    In Google Chrome, press F12, or right-click and choose Inspect, to find the tags you want.

The packages the program needs to import:

import requests as rq
from bs4 import BeautifulSoup as bfs
import json
import time

  3. Send a request to get resources

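    Write a function that generates the product list URLs from the parameters:

def get_urls(num):
    # Base URL of the search page (left empty in the source)
    URL = ""
    Param = "?keyword= mobile phone &page={}"
    # One URL per page number, from 1 to num
    return [URL + Param.format(index) for index in range(1, num + 1)]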
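
A keyword that contains spaces or non-ASCII characters has to be encoded before it goes into a URL. One way to do that, shown here only as a sketch, is `urllib.parse.urlencode` from the standard library. `BASE_URL` is a placeholder, not the real search address, and `build_urls` is a hypothetical variant of `get_urls`.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.com/search"  # placeholder, not the real search address


def build_urls(keyword, num_pages):
    """Build one encoded search URL per page, for pages 1 to num_pages."""
    return [
        BASE_URL + "?" + urlencode({"keyword": keyword, "page": page})
        for page in range(1, num_pages + 1)
    ]


urls = build_urls("mobile phone", 2)
print(urls[0])  # https://example.com/search?keyword=mobile+phone&page=1
```

The requests library also accepts a `params` dictionary on `get` and does the same encoding itself, so building the query string by hand is a matter of preference.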

Send the request, adding a request header to the call:

# Visit the website
def get_requests(url):
    # Send a browser-like user-agent header with the request
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    return rq.get(url, headers=header)

  4. Use a tool to parse the HTML page

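    Convert the response into the format needed for parsing:

def get_soup(r):
    if r.status_code == 200:
        # "html.parser" is an assumed choice; the source does not show which parser was used
        soup = bfs(r.text, "html.parser")
    else:
        print("Network request failed ...")
        soup = None
    return soup

  5. Get the resources you need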

    Find the required information in the page. For each item in the product list that is the name, price, shop name, shop link, and comments.

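# Get product information
def get_goods(soup):
    goods = []
    if soup is not None:
        # Assumed, because these lines are missing from the source:
        # the J_goodsList container id and the p-commit / p-shop selectors
        tab_div = soup.find('div', id="J_goodsList")
        tab_goods = tab_div.find_all('div', class_="gl-i-wrap")
        for good in tab_goods:
            name = good.find('div', class_="p-name").text
            price = good.find('div', class_="p-price").text
            comment = good.find('div', class_="p-commit").text
            shop_link = good.find('div', class_="p-shop").find('a')
            shop, shop_url = shop_link.text, shop_link['href']
            goods.append({"name": name, "price": price, "comment": comment, "shop": shop, "shop_url": shop_url})
    return goods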
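
To try the extraction logic without network access or third-party packages, here is a dependency-free sketch that uses `html.parser` from the standard library instead of BeautifulSoup. The sample HTML is made up and only imitates the `gl-i-wrap`, `p-name`, and `p-price` class names used above. `GoodsParser` is a hypothetical name, and the sketch assumes each div carries exactly one class.

```python
from html.parser import HTMLParser

# Made-up sample that imitates the class names used above; not real site output.
SAMPLE_HTML = """
<div class="gl-i-wrap">
  <div class="p-price"><strong><i>1999.00</i></strong></div>
  <div class="p-name"><a href="#"><em>Phone A</em></a></div>
</div>
<div class="gl-i-wrap">
  <div class="p-price"><strong><i>2999.00</i></strong></div>
  <div class="p-name"><a href="#"><em>Phone B</em></a></div>
</div>
"""


class GoodsParser(HTMLParser):
    """Collect the text of p-name and p-price divs inside each gl-i-wrap div."""

    FIELDS = {"p-name": "name", "p-price": "price"}

    def __init__(self):
        super().__init__()
        self.goods = []
        self.open_divs = []  # class attribute of every <div> that is currently open

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class") or ""
        self.open_divs.append(css_class)
        if css_class == "gl-i-wrap":
            # A new product starts here
            self.goods.append({"name": "", "price": ""})

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open div; keep it only for wanted fields.
        if "gl-i-wrap" not in self.open_divs:
            return
        field = self.FIELDS.get(self.open_divs[-1])
        if field:
            self.goods[-1][field] += data.strip()


parser = GoodsParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.goods)
```

For real pages BeautifulSoup is much more convenient, since it copes with multiple classes per tag and broken markup. The point here is only to show what "find the div by class and take its text" amounts to.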
  6. Save the retrieved resources

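    Save the data to a file in JSON format. You can save in another format instead, or write the data to a business database.

def save_to_json(goods, file):
    with open(file, "w", encoding="utf-8") as fp:
        # ensure_ascii=False keeps non-ASCII product names readable in the file
        json.dump(goods, fp, ensure_ascii=False)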
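
As an example of saving in another format, here is a minimal sketch that writes the same records as CSV with the standard `csv` module. `save_to_csv` is a hypothetical name, the column list is taken from the keys used in the product dictionaries above, and the sample record is made up.

```python
import csv
import os
import tempfile

FIELDS = ["name", "price", "comment", "shop", "shop_url"]


def save_to_csv(goods, file):
    # newline="" lets the csv module control line endings itself
    with open(file, "w", encoding="utf-8", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(goods)


# Made-up record, written to a temporary folder and read back.
sample = [{"name": "Phone A", "price": "1999.00", "comment": "100+",
           "shop": "Shop A", "shop_url": "//example.com/shop-a"}]
path = os.path.join(tempfile.mkdtemp(), "goods.csv")
save_to_csv(sample, path)
with open(path, encoding="utf-8", newline="") as fp:
    rows = list(csv.DictReader(fp))
print(rows == sample)  # True
```

CSV opens directly in spreadsheet tools, while JSON keeps nested structures intact, so the choice depends on what will read the file afterwards.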

The main program combines the functions above to crawl the whole result list:

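if __name__ == "__main__":
    goods = []
    for a, url in enumerate(get_urls(30), start=1):
        print("Current page number: {}, URL: {}".format(a, url))
        soup = get_soup(get_requests(url))
        page_goods = get_goods(soup)
        goods.extend(page_goods)
        print("Waiting 10 seconds before the next page ...")
        time.sleep(10)
    for good in goods:
        print(good)
    # The output file name is a placeholder; the source does not give one
    save_to_json(goods, "goods.json")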

Run the program and check that the saved file is in the expected format.
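
One way to make that check repeatable is a small script that loads the file and confirms that every record has the expected keys. This is a sketch under assumptions: `check_goods_file` is a hypothetical helper, the key set is taken from the product dictionaries built above, and the demo writes made-up records to a temporary folder.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"name", "price", "comment", "shop", "shop_url"}


def check_goods_file(path):
    """Return a list of problems found in the file; an empty list means it looks fine."""
    with open(path, encoding="utf-8") as fp:
        data = json.load(fp)
    if not isinstance(data, list):
        return ["top-level value is not a list"]
    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append("item {} is not an object".format(i))
            continue
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append("item {} is missing {}".format(i, sorted(missing)))
    return problems


# Demo with made-up records: the second one lacks three keys on purpose.
sample = [
    {"name": "Phone A", "price": "1999.00", "comment": "100+",
     "shop": "Shop A", "shop_url": "//example.com/shop-a"},
    {"name": "Phone B", "price": "2999.00"},
]
path = os.path.join(tempfile.mkdtemp(), "goods.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(sample, fp, ensure_ascii=False)

problems = check_goods_file(path)
print(problems)  # ["item 1 is missing ['comment', 'shop', 'shop_url']"]
```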

