Python Web Crawler: Jingdong Mall Product List


I've been expanding my knowledge recently and wanted to pick up another programming language. After much consideration I settled on Python. Since its first release, Python has held a firm place in programming with a huge user base: it gets the most work done with the least code and has a rich library ecosystem to learn from. Today Python is used in big data, machine learning, web development, artificial intelligence, and many other fields.

What is a web crawler

Web crawling is the process of obtaining the required data directly from web resources, rather than going through the API access interfaces that the website provides.

Web crawling, also called web data acquisition, is a data-collection technique: it obtains the required data directly from a website's HTML. This involves communicating with the web resource, parsing the document to pull out the needed data, organizing it into information, and converting it into the required data format.

Put more simply: we fetch the data we want through the website's web interface and store it in some format. Each acquisition is a request for a web resource; based on the information the page returns, we locate the data we want with a parsing tool and keep it for later use.

A web crawler generally works in the following steps (a minimal sketch follows the list):

  1. Determine the address of the target site
  2. Analyze the HTML to find the target tags
  3. Send a request to fetch the resource
  4. Use a tool to parse the HTML page
  5. Extract the required data
  6. Save the extracted data
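As a rough illustration of these six steps, here is a minimal sketch; the URL example.com, the file name result.txt, and the choice of the built-in html.parser are placeholders of mine, not part of the original tutorial:

import requests
from bs4 import BeautifulSoup

# Steps 1-3: determine the address and send the request
response = requests.get("https://example.com")
# Step 4: parse the returned HTML
soup = BeautifulSoup(response.text, "html.parser")
# Step 5: extract the piece of data we need (here, just the page title)
title = soup.title.text if soup.title else ""
# Step 6: save the extracted data
with open("result.txt", "w", encoding="utf-8") as fp:
    fp.write(title)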

What can web crawlers do

Web crawlers are used in many areas: gathering data sources for big data analysis, collecting articles, or grabbing pictures to satisfy one's curiosity, and so on.

Web crawling carries certain copyright and legal risks, so developers should exercise judgment and keep a clear bottom line in mind; if you cross the line, the law will define that bottom line for you.

A practical web crawler case: fetching the Jingdong Mall product list

  1. Determine the request URL

    The case study targets the Jingdong Mall search listing.

Analyze the site URL: https://search.jd.com/Search?keyword=手机&page=1 (the keyword 手机 means "mobile phone")

The parameters to pass can be read directly from the URL (a short sketch follows the list):

keyword: the search keyword

page: the page number

Products per page: 30
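As an aside, requests can also build this query string from a dict instead of hand-formatting it. A minimal sketch under the same assumptions (keyword 手机, page 1):

import requests

# Let requests encode the query parameters identified above.
resp = requests.get("https://search.jd.com/Search",
                    params={"keyword": "手机", "page": 1})
print(resp.url)  # prints the fully encoded request URL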

Jingdong's page numbering is confusing at first: when you switch to page n in the browser, the page parameter actually sent is 2n-1. Each visible page loads a second underlying page once the user scrolls to the bottom, so the browser appears to show 60 items while each underlying page holds only 30.
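To make that mapping concrete, here is a tiny hypothetical helper (browser_page_to_params is my own name, not part of the crawler) that converts a browser-visible page number into the two underlying page parameters described above:

# Browser page n is served by request pages 2n-1 and 2n (30 items each).
def browser_page_to_params(n):
    return (2 * n - 1, 2 * n)

print(browser_page_to_params(1))  # (1, 2): the 60 items shown on browser page 1
print(browser_page_to_params(3))  # (5, 6)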

  2. Analyze the site's HTML to find the target tags

    Use Chrome's developer tools (F12, or right-click and choose Inspect) to locate the tags that hold the data you want.

The packages the program needs to import:

import requests as rq
from bs4 import BeautifulSoup as bfs
import json
import time

  3. Send a request to fetch the resource

    Write a function that builds the list of search URLs from the parameters:

def get_urls(num):
    URL = "https://search.jd.com/Search"
    Param = "?keyword=手机&page={}"
    return [URL + Param.format(index) for index in range(1, num + 1)]
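As a quick check, calling the function with a small number should yield the expected URLs:

print(get_urls(2))
# expected: ['https://search.jd.com/Search?keyword=手机&page=1',
#            'https://search.jd.com/Search?keyword=手机&page=2']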

Send the request, attaching a request header so the site sees a normal browser User-Agent:

# Visit the website
def get_requests(url):
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    return rq.get(url, headers=header)
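A hardened variant is worth considering (this is my own sketch, not part of the original): a timeout prevents a stalled connection from hanging the crawler, and raise_for_status surfaces HTTP errors immediately instead of leaving them to the parsing step:

# Hypothetical hardened variant of get_requests
def get_requests_safe(url):
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    resp = rq.get(url, headers=header, timeout=10)  # give up after 10 seconds
    resp.raise_for_status()                         # raise on 4xx/5xx responses
    return resp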

  4. Use a tool to parse the HTML page

    Convert the response into a parseable form (a BeautifulSoup object):
def get_soup(r):
    if r.status_code == rq.codes.ok:
        soup = bfs(r.text, "lxml")
    else:
        print("Network request failed...")
        soup = None
    return soup
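Note that the "lxml" parser is a third-party dependency (pip install lxml). If it is unavailable, the standard-library parser is a drop-in alternative:

soup = bfs(r.text, "html.parser")  # built-in parser, no extra package needed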

  5. Extract the required data

    Find the required information on the page: each item's name, price, shop name, shop link, and comment count.

# Extract product information
def get_goods(soup):
    goods = []
    if soup is not None:
        tab_div = soup.find('div', id="J_goodsList")
        tab_goods = tab_div.find_all('div', class_="gl-i-wrap")
        for good in tab_goods:
            name = good.find('div', class_="p-name").text
            price = good.find('div', class_="p-price").text
            comment = good.find('div', class_="p-commit").find('strong').select_one("a").text
            shop = good.find('div', class_="p-shop").find('span').find('a').text
            shop_url = good.find('div', class_="p-shop").find('span').find('a')['href']
            goods.append({"name": name, "price": price, "comment": comment,
                          "shop": shop, "shop_url": shop_url})
    return goods
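Page layouts change over time, and each chained find(...) above returns None when a tag is missing, which makes the following .text raise AttributeError. A small defensive sketch (safe_text is a hypothetical helper of mine, not in the original):

# Return "" instead of crashing when a tag is absent.
def safe_text(node):
    return node.text.strip() if node is not None else ""

# Usage inside the loop, e.g.:
# name = safe_text(good.find('div', class_="p-name"))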

  6. Save the extracted data

    Save the data as a JSON file; this could equally be another format, or a write to a database:
def save_to_json(goods, file):
    with open(file, "w", encoding="utf-8") as fp:
        json.dump(goods, fp, indent=2, sort_keys=True, ensure_ascii=False)
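As the text above notes, other formats work just as well. For instance, a hedged CSV variant (save_to_csv is my own addition) using the same keys that get_goods builds:

import csv

def save_to_csv(goods, file):
    fields = ["name", "price", "comment", "shop", "shop_url"]
    with open(file, "w", encoding="utf-8", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=fields)
        writer.writeheader()      # column header row
        writer.writerows(goods)   # one row per product dict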

The main entry point combines the functions above and crawls the whole listing:

if __name__=="__main__":
goods=[]
a=0
for url in get_urls(30):
a+=1
print(" Current access page number {}, The website is :{}".format(a,url))
response=get_requests(url)
soup=get_soup(response)
page_goods= get_goods(soup)
goods+=page_goods
print(" wait for 10 Seconds to the next page ...")
time.sleep(10) for good in goods:
print(good)
save_to_json(goods,"jd_phone2.json")

Run the program and check that the saved file matches the expected format.
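A quick sanity check can reload the file (assuming the run above produced jd_phone2.json):

import json

with open("jd_phone2.json", encoding="utf-8") as fp:
    saved = json.load(fp)
print("records saved:", len(saved))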
