Python Crawler from Beginner to Giving Up 04 | Python Crawler Makes Its First Page Request

SunriseCai 2020-11-13 11:31:56


This blog is just a record of my spare-time study, published for readers only. If anything here infringes your rights, please let me know and I will delete it.
This article is entirely original, with no reference to or plagiarism of anyone else's work. Insist on originality!!

Preface

Hello. This is the Python Crawler from Beginner to Giving Up series of articles. I am SunriseCai.

Writing a Python crawler boils down to three steps, and each step gets its own article:

  • Request the web page
  • Get the response and parse the data (the web page) (to be continued)
  • Save the data (to be continued)

This article covers the first step of a Python crawler: requesting the web page.

requests is a Python HTTP client library. A Python crawler can hardly do without it, and it is the highlight of this chapter.

  • requests is a very simple module to use, although after my description it may no longer deserve the word "simple". If so, feel free to skip this article and go straight to the official documentation.

Basic usage of requests

Installing the requests module

First, run the following command in a cmd window to install the requests module for making network requests.

pip install requests

Only the basics of requests are covered here; for more information, see the requests official documentation.


The requests module supports many request methods. Only the two most commonly used, GET and POST, are introduced here.

Method            Description
requests.get()    Requests the specified page and returns the response body
requests.post()   Submits data to the specified resource for processing (e.g. submitting a form)

The very first step is to import the requests module.

import requests

1) requests.get()

  • Requesting a page via GET is very simple: just one line of code.

A successful example:

resp = requests.get('https://www.baidu.com')
print(resp.status_code)  # 200: a status code of 200 means the request succeeded

A failing example:

resp = requests.get('https://www.douban.com/')  # Douban homepage
print(resp.status_code)  # 418: the request obviously did not succeed; how to deal with this is covered below
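Before moving on, it helps to see how requests builds the URL it actually sends. This is a minimal offline sketch (no request goes out): requests.Request(...).prepare() constructs the final URL, showing how a params dict becomes a query string. The parameter names are illustrative only.

```python
import requests

# Build (but do not send) a GET request with query parameters.
req = requests.Request('GET', 'https://movie.douban.com/top250',
                       params={'start': 25, 'filter': ''})
prepared = req.prepare()

# The params dict has been encoded into the query string.
print(prepared.url)  # https://movie.douban.com/top250?start=25&filter=
```

In real code you would normally just call requests.get(url, params=...) and let requests do this for you.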

2) requests.post()

  • To request a page via POST, just change get() to post() and pass in the data to be submitted.
  • This request also carries an extra headers parameter; what that is will be explained shortly.

Example :

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
}
data = {
    'name': 'xxx',      # account
    'password': 'xxx'   # password
}
# Send the request with headers and data attached
resp = requests.post('https://accounts.douban.com/j/mobile/login/basic', data=data, headers=headers)
print(resp.status_code)  # 200: the request succeeded
print(resp.text)  # text is the response body as text: {"status":"success","message":"success","description":"Processed successfully"...}

3) customized headers( Request header )

  • resp = requests.get('https://www.douban.com/'): why did this request fail?
  • Some sites inspect every request; if it looks like it was sent by Python rather than a browser, it is treated as a crawler and simply blocked.
  • So we should add request headers to our requests, disguising the visit as coming from a browser.
  • To add request headers, simply pass a dict to the headers parameter.

Let's look at the result of the request after adding request headers (disguising our identity):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
}  # This is the request header of the Chrome browser
r = requests.get('https://www.douban.com/', headers=headers)  # Douban homepage
print(r.status_code)  # 200: with headers attached, the visit succeeds

You can also add cookie, referer and other parameters. They are not introduced here; later articles will show how to use them.
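As a small preview, cookies and a Referer header ride along in exactly the same way as User-Agent. The sketch below is offline (prepare() builds the request without sending it), and the cookie name and value are made up purely for illustration.

```python
import requests

req = requests.Request(
    'GET', 'https://www.douban.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
             'Referer': 'https://www.douban.com/'},
    cookies={'bid': 'abc123'})  # hypothetical cookie, for illustration only
prepared = req.prepare()

# Both extras end up as ordinary headers in the final request.
print(prepared.headers['Referer'])  # https://www.douban.com/
print(prepared.headers['Cookie'])   # bid=abc123
```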

4) Response content

There are three main methods:

r = requests.get('http://www.xxx.com')

Method     Description                                        Typical use
r.text     The response body decoded as text                  Text content
r.content  The raw response body as bytes                     Pictures, music, video, etc.
r.json()   The response body parsed as JSON key-value pairs   JSON-format pages
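To see the three methods side by side without touching the network, the sketch below fills in a Response object by hand. Setting the private _content attribute is purely a demonstration trick; in normal use requests fills the response in for you.

```python
import requests

# Simulate a response carrying a small JSON body (no network involved).
resp = requests.Response()
resp.status_code = 200
resp.encoding = 'utf-8'
resp._content = b'{"status": "success"}'  # private attribute, demo only

print(resp.content)           # b'{"status": "success"}'  (raw bytes)
print(resp.text)              # {"status": "success"}     (decoded text)
print(resp.json()['status'])  # success                   (parsed JSON)
```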

5) Response status codes

They fall into five classes:

Status code  Description
1**          Informational: the request has been received; continue processing
2**          Success: the request was successfully received, understood and accepted
3**          Redirection: further action must be taken to complete the request
4**          Client error: the request has a syntax error or cannot be fulfilled
5**          Server error: the server failed to fulfill a valid request
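The class of a status code is just its first digit, so it can be checked with integer division. This small helper is not part of requests; it only restates the table above.

```python
def status_class(code):
    """Map an HTTP status code to one of the five classes above."""
    classes = {1: 'informational', 2: 'success', 3: 'redirect',
               4: 'client error', 5: 'server error'}
    return classes.get(code // 100, 'unknown')

print(status_class(200))  # success
print(status_class(418))  # client error
print(status_class(503))  # server error
```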

6) Viewing response headers

View the response headers:

resp = requests.get('https://www.baidu.com')
print(resp.headers)
# {'Accept-Ranges': 'bytes', 'Cache-Control': 'no-cache'...} the returned data is a dict

Check the Cache-Control field of the response headers:

resp = requests.get('https://www.baidu.com')
print(resp.headers['Cache-Control'])  # no-cache

Other response header fields can be viewed in the same way.
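One detail worth knowing: resp.headers is a case-insensitive dictionary, so 'Cache-Control', 'cache-control' and 'CACHE-CONTROL' all find the same field. The sketch below demonstrates this offline using the same structure requests uses for response headers.

```python
from requests.structures import CaseInsensitiveDict

# The same dict type that requests uses for resp.headers.
headers = CaseInsensitiveDict({'Cache-Control': 'no-cache',
                               'Content-Type': 'text/html'})

print(headers['cache-control'])  # no-cache
print(headers['CONTENT-TYPE'])   # text/html
```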

7) Errors and exceptions

The following is quoted from the requests official documentation.

  • On network problems (e.g. DNS failure, refused connection), requests raises a ConnectionError exception.
  • If the HTTP request returns an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.
  • If the request times out, a Timeout exception is raised.
  • If the request exceeds the configured maximum number of redirections, a TooManyRedirects exception is raised.
  • All exceptions that requests explicitly raises inherit from requests.exceptions.RequestException.
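A minimal sketch of catching these exceptions in practice. The hostname below uses the reserved .invalid top-level domain, so the DNS lookup is guaranteed to fail and requests raises a ConnectionError, which the common base class RequestException catches.

```python
import requests

try:
    requests.get('http://nonexistent.invalid/', timeout=3)
except requests.exceptions.RequestException as e:
    # ConnectionError, HTTPError, Timeout and TooManyRedirects
    # all inherit from RequestException.
    print('request failed:', type(e).__name__)
```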

Example: requesting Douban Movie with requests

The target page is the Douban Movie Top 250 list.
Request code :

import requests

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
}
resp = requests.get(url=url, headers=headers)
print(resp.text)

The return value is as follows :

  • What requests fetches is basically the raw HTML document; the next article will show how to extract data from an HTML document.
<!DOCTYPE html>
<html lang="zh-cmn-Hans" class="">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="renderer" content="webkit">
<meta name="referrer" content="always">
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<title>
Douban Movie Top 250
</title>
<body>
<ol class="grid_view">
<li>
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="The Shawshank Redemption" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">The Shawshank Redemption</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
<span class="other">&nbsp;/&nbsp;月黑高飞 (HK) / 刺激1995 (TW)</span>
</a>
<span class="playable">[ Playable ]</span>
</div>
<div class="bd">
<p class="">
Director: Frank Darabont&nbsp;&nbsp;&nbsp;Starring: Tim Robbins /...<br>
1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>1758111 people rated</span>
</div>
<p class="quote">
<span class="inq"> Hope makes people free .</span>
</p>
</div>
</div>
</div>
</li>
...(rest omitted)...
</body>
</html>

Frankly, this article is rather muddled; please bear with it, and do read the requests official documentation.

  • What the author's pen describes often falls short.

Finally, a summary of this chapter:

  1. Introduced the GET and POST request methods of requests
  2. Introduced adding request headers to a request
  3. Introduced the different kinds of response content requests can return
  4. Introduced the different response status codes
  5. Introduced viewing response headers, and errors and exceptions
  6. Introduced requesting Douban Movie with requests

sunrisecai

  • Thank you for your patient reading; follow me so you don't get lost.
  • So we can help each other along, you are welcome to join the QQ group: 648696280 (there are no learning materials inside, it is just for questions).

The next article is Python Crawler from Beginner to Giving Up 05 | Python Crawler Parses Its First Page.

Copyright notice
This article was created by [SunriseCai]. Please include a link to the original when reposting. Thank you.
