What's so great about front-end/back-end separation? A hands-on guide to crawling such a site with Python!

Liu Zaoqi 2021-01-21 10:11:32


Hello everyone, Zaoqi here.

This article is a detailed, intermediate-level crawler tutorial. It walks through the full process of thinking and trial and error, so if you are serious about learning web crawling, I suggest you read it carefully.

We are going to grab the list of all the books on this website:

https://www.epubit.com/books

1) Exploration and research

Create a new Python file and write the following code:

import requests
url = 'https://www.epubit.com/books'
res = requests.get(url)
print(res.text)

The printed result looks like this:

There is no book information in it. But with the browser's developer tools (the inspector) you can see the book information:

We have run into a website built with front-end/back-end separation, in other words a website that fetches its data with JavaScript. The data flow of this kind of site looks like this:

  • The first request only returns the basic skeleton of the page, with no data in it. That is what you see in the screenshot above.
  • But that skeleton contains JavaScript code, and this code issues one or more further requests to fetch the data. We call these follow-up requests.

To crawl such a website, there are two approaches:

  1. Analyze the address and parameters of the follow-up requests, then write code that issues the same requests.
  2. Use browser automation, such as Selenium. This technique lets a real browser issue the follow-up requests automatically and hands you the rendered data. A minimal sketch follows this list.
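Here is a minimal sketch of approach 2, assuming the selenium package and a matching ChromeDriver are installed (pip install selenium). The rest of this article uses approach 1, so this is only to show the idea:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.epubit.com/books')
# The browser executes the page's JavaScript; in practice you may need
# to wait (time.sleep or WebDriverWait) for the data requests to finish.
time.sleep(3)
# The rendered HTML now contains the book list that plain requests could not see.
html = driver.page_source
print(html[:1000])
driver.quit()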

2) Analyze the follow-up requests

Open Chrome's developer tools and follow the steps in the figure:

  1. Click Network, where you can see all the web requests the browser has sent.
  2. Select XHR to view only the requests sent by JavaScript.
  3. A long list of requests appears below. We need to go through them one by one and find the request that contains the book list.

Let's first look at how a browser opens a web page. Generally, not everything is returned in a single request; it takes several steps:

  1. The first request fetches the HTML file, which may contain text, data, the addresses of images and style sheets, and so on. The images themselves are not embedded in the HTML file.
  2. Based on the links in the HTML, the browser sends further requests to fetch images, style sheets, JavaScript-fetched data, and so on.

That is why we see so many different types of requests: XHR, JS, CSS, Img, Font, Doc, etc.

The website we are crawling sends many XHR requests: for the book list, the site menu, advertisements, the footer, and so on. We need to find the request for the books among them.

The specific steps are shown in the figure:

  1. Select a request on the left.
  2. Select Response on the right.
  3. The data returned by that request appears below; from this data we can judge whether it contains the book information.

The data returned by these JavaScript requests is usually in JSON format. JSON is a JavaScript data format made of key-value pairs separated by colons, and it is fairly easy to read. JSON is very similar to a Python dictionary; a small illustration follows.
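As a quick illustration (the keys and values below are made up for demonstration), here is how a JSON string maps onto a Python dictionary:

import json

json_text = '{"name": "C++ Primer Plus", "price": 100.3, "onSale": true}'
book = json.loads(json_text)        # JSON string -> Python dict
print(type(book))                   # <class 'dict'>
print(book['name'], book['price'])  # C++ Primer Plus 100.3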

Among the many requests, you can often make a rough guess from the request name, which saves time. For example, getUbookList in the screenshot above looks like it fetches a book list. Click it to view, and indeed it returns a list of books.

Note the address and format of this link; we will need it later:

https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&=&startPrice=&endPrice=&tagId=

Looking at it, you can see:

  1. The base address is: https://www.epubit.com/pubcloud/content/front/portal/getUbookList
  2. page=1 means page 1; we can pass in 2, 3, 4 and so on in turn.
  3. row=20 means 20 books per page.
  4. startPrice and endPrice are price filters; their values are empty here, meaning no price limit (see the sketch after this list).
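As a sketch of what these parameters mean in code, the same follow-up request can be built with a params dictionary instead of hand-writing the query string (parameter names taken from the captured URL). As the next section shows, the server refuses this request until the right header is also sent:

import requests

url = 'https://www.epubit.com/pubcloud/content/front/portal/getUbookList'
params = {
    'page': 1,         # which page to fetch
    'row': 20,         # books per page
    'startPrice': '',  # empty: no lower price limit
    'endPrice': '',    # empty: no upper price limit
    'tagId': '',
}
# Without the right headers this returns the "-7" error shown below.
res = requests.get(url, params=params)
print(res.text)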

3) Test the idea with Postman

To test this idea, open Chrome and enter the following address in the address bar:

https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&=&startPrice=&endPrice=&tagId=

But we get the following result:

{
    "code": "-7",
    "data": null,
    "msg": "The system is temporarily unavailable, please try again later ~",
    "success": false
}

This does not mean the system is broken. Rather, the server has detected that ours is an abnormal request and refuses to return data.

It means that besides the URL itself, extra information must be sent to the server. This information is carried in the request headers.

As you can see in the figure below, a normal request carries multiple request headers:

  1. Select the request you want to inspect.
  2. Choose Headers on the right.
  3. Scroll down to Request Headers, where you can see data like this:
    • Accept: application/json, text/plain, */*
    • Accept-Encoding: gzip, deflate, br
    • ....

For the server to process our request normally, we need to mimic a normal request and send the corresponding headers along with it. If every header we send is identical to the browser's, the server has no way to recognize us as a crawler. Later we will see how to attach headers when sending a request.

Usually, though, the server does not check every header; adding just one or two key headers may be enough to convince it to hand over the data. So we will test the headers one by one to find out which ones are required; a sketch of such a loop follows.
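As a rough sketch of that trial-and-error process (the header values are copied from the inspector, and the check on the response text is only a heuristic), you could try each candidate header on its own:

import requests

url = ('https://www.epubit.com/pubcloud/content/front/portal/getUbookList'
       '?page=1&row=20&=&startPrice=&endPrice=&tagId=')

candidate_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/87.0.4280.88 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Origin-Domain': 'www.epubit.com',
}

# Send the request with one header at a time and see whether the server
# returns real data ("success": true) instead of the "-7" error.
for name, value in candidate_headers.items():
    res = requests.get(url, headers={name: value})
    ok = '"success":true' in res.text.replace(' ', '')
    print(f'{name}: {"data returned" if ok else "refused"}')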

You cannot add headers in the browser's address bar, so to send an HTTP request with custom headers we will use another tool called Postman. It is one of the most common tools among API developers and crawler engineers.

First, download Postman from www.postman.com and follow the installer step by step; no extra settings are needed along the way.

After opening Postman you will see the following interface:

  1. Click the plus sign at the top to add a new request.
  2. Fill in the request URL in the middle.
  3. Click Headers to open the header settings and add headers.

The names and values of these headers can be copied from the inspector. If you type them by hand, be careful not to make mistakes.

Let's look at some common headers (an example combining them follows the list):

  • User-Agent: says who is making the request. It is usually a browser name with detailed version information, for example: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36. If a crawler does not send this header, the server may recognize the request as abnormal and reject it. Of course, whether it is rejected depends on the server-side code.
  • Cookie: if a website requires login, the login information is stored in the Cookie. Through this header the server knows whether you are logged in and who you are. Suppose we want to place an order on JD.com automatically: we can log in manually first, copy the Cookie value, and have Python send requests carrying this Cookie, so the server treats us as logged in and lets us place orders or perform other actions. Add a timer that fires at a specific moment and you have a flash-sale (seckill) bot. This is a common way to crawl websites that require login.
  • Accept: which data formats the browser accepts; for example, application/json, text/plain, */* means it accepts JSON, plain text, or any other data.
  • Origin-Domain: which domain the request comes from; in this case it is www.epubit.com.
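Put together, a headers dictionary for requests combining these common headers might look like the sketch below; the Cookie line is only a placeholder, since this site does not require login to view the book list:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/87.0.4280.88 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Origin-Domain': 'www.epubit.com',
    # 'Cookie': 'copied from a logged-in browser session, if the site requires login',
}
# res = requests.get(url, headers=headers)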

To learn more about HTTP headers, you can search the web for "HTTP Headers".

I added the common headers one by one, but the server never returned data until I added the Origin-Domain header. That means this header is required.

The back-end program behind a page may check no headers at all, just one, or several; we have to experiment to find out.

Since Origin-Domain is the key one, perhaps the back end checks only this header. Using the checkboxes on the left, disable all the other headers and keep only Origin-Domain: the request still succeeds, which means the back end checks only this one header:

Then change the page parameter in the address bar to fetch other pages. For example, in the screenshot it is changed to 3; resend the request and the server returns new data (another 20 books). Our request flow works.

4) Write the crawler

When developing a crawler, most of the time goes into analysis. Once the analysis is clear, the crawling code itself is not complicated:

import requests

def get_page(page=1):
    '''Grab the data of the given page (defaults to page 1).'''
    # Use page to build the URL dynamically
    url = f'https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page={page}&row=20&=&startPrice=&endPrice=&tagId='
    headers = {'Origin-Domain': 'www.epubit.com'}
    # Pass the headers along with the request
    res = requests.get(url, headers=headers)
    print(res.text)

get_page(5)

Here we test grabbing page 5. Comparing the printed JSON with page 5 of the website, the data matches.

Now let's analyze the structure of the JSON data, and then we can improve the program.

5) Analyze the JSON data

JSON is similar to a Python dictionary: braces hold the data, and colons separate keys from values. Below is the JSON data with some parts omitted:

{
    "code": "0",
    "data": {
        "current": 1,       // current page
        "pages": 144,       // total number of pages
        "records": [        // the book records live inside these square brackets
            {
                "authors": "[US] Stephen Prata",   // author
                "code": "UB7209840d845c9",         // book code
                "collectCount": 416,               // number of favorites
                "commentCount": 64,                // number of comments
                "discountPrice": 0,                // discounted price
                "downebookFlag": "N",
                "fileType": "",
                ...
            },
            {
                "authors": "Stupid Uncle",
                "code": "UB7263761464b35",
                "collectCount": 21,
                "commentCount": 3,
                "discountPrice": 0,
                "downebookFlag": "N",
                "fileType": "",
                ...
            },
            ...
        ],
        "size": 20,
        "total": 2871
    },
    "msg": "success",
    "success": true
}

Let's walk through this JSON format:

  1. The outermost layer is a pair of braces containing four items: code, data, msg, and success. This layout was designed by the programmer who built this site; other sites may differ.
  2. Among them, code, msg and success are the status code, the message, and whether the request succeeded. The real data is inside data.
  3. After the colon of data comes another pair of braces, a data object. It contains the current page number (current), the total number of pages (pages), the book records (records), and so on.
  4. records holds many books, so it is a square-bracketed list; inside it are many brace-wrapped objects, each representing one book:
{
    "authors": "[US] Stephen Prata",   // author
    "code": "UB7209840d845c9",         // book code
    "collectCount": 416,               // number of favorites
    "commentCount": 64,                // number of comments
    "discountPrice": 0,                // discounted price; 0 means no discount
    ...
    "forSaleCount": 3,                 // quantity on sale
    ...
    "logo": "https://cdn.ptpress.cn/pubcloud/bookImg/A20190961/20200701F892C57D.jpg",
    "name": "C++ Primer Plus (6th Edition, Chinese)",   // title
    ...
    "price": 100.30,                   // price
    ...
}

Each book record has many fields; most are omitted here, with the important ones annotated.
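As a quick sketch of how to reach these fields once the response has been parsed (the field names are taken from the response shown above):

import json
import requests

url = ('https://www.epubit.com/pubcloud/content/front/portal/getUbookList'
       '?page=1&row=20&=&startPrice=&endPrice=&tagId=')
res = requests.get(url, headers={'Origin-Domain': 'www.epubit.com'})

data = json.loads(res.text)          # JSON text -> nested dicts and lists
records = data['data']['records']    # the list of book objects
first = records[0]
print(first['name'], first['authors'], first['price'])
print('total books:', data['data']['total'])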

6) Complete the program

Now let's finish the program by parsing the data we need out of the JSON. To keep things simple, we only grab the title, author, code and price.

The program skeleton:

import requests
import json
import time

class Book:
    # -- omitted --

def get_page(page=1):
    # -- omitted --
    books = parse_book(res.text)
    return books

def parse_book(json_text):
    # -- omitted --

all_books = []
for i in range(1, 10):
    print(f'====== Grabbing page {i} ======')
    books = get_page(i)
    for b in books:
        print(b)
    all_books.extend(books)
    print('Page done, resting 5 seconds...')
    time.sleep(5)
  1. A Book class is defined to represent a book.
  2. A parse_book function is added; it parses the data and returns a list of the 20 books on the current page.
  3. At the bottom a for loop grabs the data and collects it into one big list; set the number of pages to grab in range. We know how many pages there are from the earlier analysis.
  4. After grabbing each page, you must sleep a few seconds: first to avoid putting too much pressure on the site, second to keep the site from blocking your IP. It is good for them and good for you.
  5. The code that saves the captured books to a file is left for you to complete (a possible sketch is shown after this list).
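For step 5, one possible sketch is to write the books to a CSV file; it assumes all_books holds Book objects as in the skeleton above:

import csv

def save_books(books, path='books.csv'):
    '''Write the grabbed books to a CSV file.'''
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'author', 'code', 'price'])  # header row
        for b in books:
            writer.writerow([b.name, b.author, b.code, b.price])

# save_books(all_books)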

Now let's look at the omitted parts.

The Book class:

class Book:
    def __init__(self, name, code, author, price):
        self.name = name
        self.code = code
        self.author = author
        self.price = price

    def __str__(self):
        return f'Title: {self.name}, Author: {self.author}, Price: {self.price}, Code: {self.code}'

The __str__ method here is a magic method: when we print a Book object, Python calls it automatically.
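A quick check of this, using values taken from the sample record earlier:

b = Book('C++ Primer Plus', 'UB7209840d845c9', 'Stephen Prata', 100.30)
print(b)  # Title: C++ Primer Plus, Author: Stephen Prata, Price: 100.3, Code: UB7209840d845c9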

The parse_book function:

import json

def parse_book(json_text):
    '''Parse the list of books from the returned JSON string.'''
    books = []
    # Turn the JSON string into a dict
    book_json = json.loads(json_text)
    records = book_json['data']['records']
    for r in records:
        author = r['authors']
        name = r['name']
        code = r['code']
        price = r['price']
        book = Book(name, code, author, price)
        books.append(book)
    return books
  1. The json module is imported at the top; it ships with Python, so there is nothing to install.
  2. The key line uses json.loads to turn the JSON string into a dictionary; the rest is ordinary dictionary manipulation and easy to follow.

When crawling JavaScript-based pages, the complexity lies mainly in the analysis. Once the analysis is done, the crawling code is often even simpler and cleaner than scraping plain HTML.

-END-

This article is from the WeChat official account Zaoqi Python (zaoqi-python).


Original publication time: 2021-01-15


Copyright notice
This article was written by Liu Zaoqi. Please include the original link when reprinting. Thanks.
https://pythonmana.com/2021/01/20210121095345085D.html
