Advanced Python crawler: what's so special about front-end/back-end separation? A super detailed walkthrough!

AirPython 2021-01-20 14:44:43
Python crawler, advanced: front-end/back-end separation


We're going to scrape the list of all the books on this website:

https://www.epubit.com/books

1) Exploration and research

Create a new Python file and write the following code:

import requests
url = 'https://www.epubit.com/books'
res = requests.get(url)
print(res.text)

The printed result is as follows:

There is no book information in it. But if you inspect the page with the browser's developer tools, you can see the book information:

What we have run into is a website with separated front and back ends, in other words a website that fetches its data with JavaScript. The data flow of this kind of website looks like this:

  • The first request returns only the basic skeleton of the web page, with no data in it. That is what you see in the screenshot.
  • But the page skeleton contains JavaScript code, and this code issues one or more additional requests for the data. We call these follow-up requests.

There are two ways to scrape such a website:

  1. Analyze the address and parameters of the follow-up requests, and write code that issues the same requests.
  2. Use browser-automation technology such as Selenium, which drives a real browser and therefore triggers the follow-up requests automatically (a minimal sketch of this approach follows below).
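This article follows approach 1, but for completeness here is a minimal sketch of approach 2, assuming Selenium and a matching Chrome driver are installed. The CSS selector '.book-item' is a hypothetical placeholder; you would have to inspect the real page to find the right one:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()        # requires Chrome plus a chromedriver on the PATH
driver.get('https://www.epubit.com/books')
time.sleep(5)                      # crude wait for the JavaScript follow-up requests to finish
# '.book-item' is a guessed selector; replace it with the one you find in the inspector
for card in driver.find_elements(By.CSS_SELECTOR, '.book-item'):
    print(card.text)
driver.quit()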

2) Analyze subsequent requests

Open Chrome's developer tools (the inspector) and follow the steps in the figure:

  1. Click Network; here you can see all the web requests sent by the browser.
  2. Select XHR to view only the requests sent by the browser's JavaScript.
  3. Below you can see many requests. We will look through them one by one to find the request that contains the list of books.

Let's take a look at how a browser opens a web page. Generally, not all of the content is returned by a single request; it takes many steps:

  1. The first request fetches the HTML file, which may contain text, data, the addresses of images, the addresses of style sheets, and so on. The HTML file does not contain the images themselves.
  2. Based on the links in the HTML, the browser sends further requests to fetch the images, the style sheets, the JavaScript-based data, and so on.

That is why we see so many different types of requests: XHR, JS, CSS, Img, Font, Doc, etc.

The website we are scraping sends quite a few XHR requests, asking for the book list, the site menu, advertising information, footers, and so on. We need to find the requests that carry the books.

The specific steps are shown in the figure:

  1. Select a request on the left.
  2. Select Response on the right.
  3. Below, you can see the data returned by that request; from the data we can judge whether it contains the book information.

The data returned by JavaScript requests is usually in JSON format, a JavaScript data format made up of key-value pairs separated by colons. It is easy to read, and JSON is quite similar to a Python dictionary.

Among the many requests, you can often make a rough judgment from the request name, which speeds things up. For example, getUBookList in the figure above looks like it fetches a list of books. Click it to check: what it returns is indeed a list of books.

Please remember the address and format of this link; we will need it later:

https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&=&startPrice=&endPrice=&tagId=

Looking at it, you can see:

  1. The address is: https://www.epubit.com/pubcloud/content/front/portal/getUbookList
  2. page=1 means page 1; we can pass in 2, 3, 4 and so on in turn.
  3. row=20 means 20 books per page.
  4. startPrice and endPrice are price filters; their values are empty here, meaning no price limit.
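As a side note, instead of splicing the query string by hand, the same request can be built by passing the parameters as a dict to requests. This is only a sketch of that idea; the odd empty-named parameter (&=&) from the captured URL is dropped here, which may or may not matter to the server:

import requests

# The same follow-up request, with the query parameters passed as a dict
url = 'https://www.epubit.com/pubcloud/content/front/portal/getUbookList'
params = {
    'page': 1,          # which page
    'row': 20,          # books per page
    'startPrice': '',   # no lower price limit
    'endPrice': '',     # no upper price limit
    'tagId': '',
}
res = requests.get(url, params=params)
print(res.text)         # the server will most likely still refuse us, as the next section shows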

3) Use Postman to test the conjecture

To test this idea, open Chrome and enter the following address in the address bar:

https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&=&startPrice=&endPrice=&tagId=

But we get the following result:

{
    "code": "-7",
    "data": null,
    "msg": "The system is temporarily unavailable, please try again later ~",
    "success": false
}

It is not that something is wrong with the system; rather, the system has detected that ours is an abnormal request and refuses to return data to us.

This means that besides sending the URL, we need to send additional information to the server. This information is called a Header, i.e. a request header.

As you can see in the figure below, a normal request carries multiple request headers:

  1. Select the request you want to inspect.
  2. Choose Headers on the right.
  3. Scroll down and you can see Request Headers. Here is the data:
    • Accept: application/json, text/plain, */*
    • Accept-Encoding: gzip, deflate, br
    • ....

In order for the server to process the request normally, we simulate a normal request and add the corresponding headers. If all the headers we send are identical to a browser's, the server has no way to tell that we are a crawler. Later we will learn how to add headers when sending a request.

But usually the server does not check all of the headers; often adding just one or two key headers is enough to coax the server into giving us data. Still, we have to test, one by one, which headers are required.

Headers cannot be added in the browser's address bar. To send an HTTP request with custom headers, we will use another piece of software called Postman, one of the most commonly used tools among API developers and crawler engineers.

First download Postman from www.postman.com and follow the prompts to install it step by step; no additional settings are needed along the way.

After opening Postman you will see the following interface:

  1. Click the plus sign at the top to add a new request.
  2. Fill in the request URL in the middle.
  3. Click Headers to enter the header settings and add headers.

The names and values of these headers can be copied from the inspector. If you type them yourself, be careful not to make mistakes.

Let's look at some common headers (a small sketch of sending them from Python follows this list):

  • User-Agent: this header tells the server who the requester is. It is usually a browser name with detailed version information, for example: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36. If a crawler does not add this header, the server may recognize the request as abnormal and reject it. Of course, whether it is rejected depends on the back-end programmer's code logic.
  • Cookie: if a website requires login, the login information is stored in the Cookie. Through this header the server determines whether you are logged in and who you are. Suppose we want to place an order on JD.com automatically: we can log in manually first, copy the Cookie value, and include that Cookie when sending the request with Python, so the server believes we are already logged in and allows us to place orders or perform other operations. If a timer is added to the program so that the order is placed at a specific moment, you get a flash-sale (seckill) program. This is also a common way to crawl websites that require login.
  • Accept: what data format the browser accepts. For example, application/json, text/plain, */* means it accepts JSON, plain text, or any data.
  • Origin-Domain: which domain the requester comes from; in this case it is www.epubit.com.
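For reference, here is a minimal sketch of sending such headers from Python with requests. The User-Agent and Origin-Domain values come from the article; the Cookie is a placeholder you would copy from your own logged-in browser session. Commenting the headers out one at a time is a quick way to discover which ones this server actually checks:

import requests

url = 'https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&=&startPrice=&endPrice=&tagId='
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Origin-Domain': 'www.epubit.com',
    # 'Cookie': 'copied-from-your-browser',   # only needed for sites that require login
}
res = requests.get(url, headers=headers)
# Comment headers out one by one and resend to see which ones are required
print(res.status_code, res.text[:200])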

To learn more about HTTP headers, you can search the web for "HTTP Headers".

I added the common headers one by one, but the server never returned data until I added the Origin-Domain header. That means this header is required.

The back end of a web page may check no headers at all, a single header, or several headers; we have to find out by trying.

Since Origin-Domain is the key, maybe the back end checks only this one header. Using the checkboxes on the left, remove all the other headers and keep only Origin-Domain: the request still succeeds, which means the back end checks only this one header:

Then change the page parameter in the address bar to fetch other pages. For example, change it to 3 as in the screenshot and resend the request: the server returns new data (another 20 books). So our request process works.

4) Write the scraping program

When developing a crawler, most of the time goes into analysis. Once the analysis is clear, the scraping code is not complicated:

import requests

def get_page(page=1):
    '''Fetch the data of the specified page; defaults to page 1'''
    # Use page to build the URL dynamically
    url = f'https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page={page}&row=20&=&startPrice=&endPrice=&tagId='
    headers = {'Origin-Domain': 'www.epubit.com'}
    # Pass the headers along with the request
    res = requests.get(url, headers=headers)
    print(res.text)

get_page(5)

Here we tested fetching the data of page 5. Comparing the printed JSON with the data shown on page 5 of the website, they match.

Now let's analyze the structure of the JSON data, and then we can improve the program.

5) Analyze the JSON data

JSON is similar to a Python dictionary: it stores data inside braces and uses colons to separate keys and values. Below is the JSON data, abridged:

{
    "code": "0",
    "data": {
        "current": 1,        // the current page
        "pages": 144,        // how many pages there are in total
        "records": [         // the books' information sits inside these square brackets
            {
                "authors": "[US] Stephen Prata (Stephen Prata)",   // author
                "code": "UB7209840d845c9",                         // code
                "collectCount": 416,                               // number of likes
                "commentCount": 64,                                // number of comments
                "discountPrice": 0,                                // discounted price
                "downebookFlag": "N",
                "fileType": "",
                ...
            },
            {
                "authors": "Stupid uncle",
                "code": "UB7263761464b35",
                "collectCount": 21,
                "commentCount": 3,
                "discountPrice": 0,
                "downebookFlag": "N",
                "fileType": "",
                ...
            },
            ...
        ],
        "size": 20,
        "total": 2871
    },
    "msg": "success",
    "success": true
}

Let's get familiar with this JSON format:

  1. The outermost layer is a pair of braces containing four pieces of information: code, data, msg and success. This format was designed by the programmer who developed this web page; other pages may be different.
  2. Among them, code, msg and success indicate the status code of the request, the message returned, and whether the request succeeded. The real data is inside data.
  3. The colon after data is followed by another pair of braces, representing a data object. It contains the current page number (current), the total number of pages (pages), the book information (records), and so on.
  4. records holds many books, so it is represented by square brackets; inside the brackets there are many objects wrapped in braces, and each pair of braces represents one book:
{
    "authors": "[US] Stephen Prata (Stephen Prata)",   // author
    "code": "UB7209840d845c9",                         // code
    "collectCount": 416,                               // number of likes
    "commentCount": 64,                                // number of comments
    "discountPrice": 0,                                // discounted price; 0 means no discount
    ...
    "forSaleCount": 3,                                 // quantity on sale
    ...
    "logo": "https://cdn.ptpress.cn/pubcloud/bookImg/A20190961/20200701F892C57D.jpg",
    "name": "C++ Primer Plus, 6th Edition (Chinese edition)",   // title
    ...
    "price": 100.30,                                   // price
    ...
}

Each book's information contains many fields; many of them are omitted here, and the important ones are annotated.
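A quick tip for this kind of inspection (not from the original article): Python's built-in json module can pretty-print the response, which makes the nesting much easier to read than the raw one-line string:

import json
import requests

url = 'https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&=&startPrice=&endPrice=&tagId='
res = requests.get(url, headers={'Origin-Domain': 'www.epubit.com'})
data = json.loads(res.text)
# Pretty-print with indentation; ensure_ascii=False keeps Chinese characters readable
print(json.dumps(data, indent=4, ensure_ascii=False))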

6) Complete the program

Now let's flesh out the program above and parse the data we need out of the JSON. To keep things simple, we only grab the title, author, code and price.

The program skeleton:

import requests
import json
import time

class Book:
    # -- omitted --

def get_page(page=1):
    # -- omitted --
    books = parse_book(res.text)
    return books

def parse_book(json_text):
    # -- omitted --

all_books = []
for i in range(1, 10):
    print(f'====== Fetching page {i} ======')
    books = get_page(i)
    for b in books:
        print(b)
    all_books.extend(books)
    print('Fetched one page, resting 5 seconds...')
    time.sleep(5)

  1. A Book class is defined to represent a book.
  2. A parse_book function is added that is responsible for parsing the data; it returns a list containing the 20 books of the current page.
  3. At the bottom, a for loop fetches the data and collects it into one big list; pass the number of pages to fetch into range. From the previous analysis we already know how many pages there are.
  4. After fetching each page, you must sleep for a few seconds, both to avoid putting too much pressure on the website and to keep the website from blocking your IP. It is for their good and for your own.
  5. The code that saves the captured information to a file is left for you to complete yourself (a possible sketch follows this list).
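As one possible way to complete that exercise, here is a minimal sketch that writes the collected books to a CSV file with Python's built-in csv module; the file name and column order are arbitrary choices:

import csv

def save_books(books, filename='books.csv'):
    '''Write a list of Book objects (e.g. all_books from the loop above) to a CSV file.'''
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'author', 'code', 'price'])   # header row
        for b in books:
            writer.writerow([b.name, b.author, b.code, b.price])

# For example, after the loop above:
# save_books(all_books)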

Now let's look at the omitted parts.

The Book class:

class Book:
    def __init__(self, name, code, author, price):
        self.name = name
        self.code = code
        self.author = author
        self.price = price

    def __str__(self):
        return f'Title: {self.name}, Author: {self.author}, Price: {self.price}, Code: {self.code}'

The __str__ function here is a magic method: when we print a Book object, Python calls it automatically.
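For example, reusing the sample values from the JSON shown earlier:

b = Book('C++ Primer Plus', 'UB7209840d845c9', 'Stephen Prata', 100.30)
print(b)   # __str__ is called automatically and prints:
           # Title: C++ Primer Plus, Author: Stephen Prata, Price: 100.3, Code: UB7209840d845c9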

The parse_book function:

import json

def parse_book(json_text):
    '''Parse the list of books from the returned JSON string'''
    books = []
    # Convert the JSON string into a dictionary (dict)
    book_json = json.loads(json_text)
    records = book_json['data']['records']
    for r in records:
        author = r['authors']
        name = r['name']
        code = r['code']
        price = r['price']
        book = Book(name, code, author, price)
        books.append(book)
    return books

  1. The json module is imported at the top. It ships with Python, so no installation is needed.
  2. The key line uses json.loads to turn the JSON string into a dictionary; the rest is ordinary dictionary manipulation and easy to understand.

When scraping a JavaScript-based web page, the complexity lies mainly in the analysis. Once the analysis is done, the scraping code is even simpler and cleaner than for a plain HTML page!

This article comes from the WeChat official account AirPython.


Original publication time: 2021-01-14


