After covering the basics and crawler fundamentals in the previous articles, we're now going to start learning about web requests.

List of articles

    • Introduction to urllib
    • The four modules of the urllib library
    • Case study
    • Send a request
    • Code example
    • Send a request - the Request object
    • IP proxies
    • Using cookies
    • Exception handling


Let's take a first look at urllib.


Introduction to urllib

urllib is Python's standard library for network requests. It needs no installation; you just import it directly.
It is mainly used for crawler development, API data acquisition, and testing.

The four modules of the urllib library:

  • urllib.request: for opening and reading URLs
  • urllib.error: contains the exceptions raised by urllib.request
  • urllib.parse: for parsing URLs
  • urllib.robotparser: for parsing robots.txt
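The first three modules all appear later in this article; urllib.robotparser does not, so here is a minimal sketch of it (the robots.txt URL is just an illustrative choice):

import urllib.robotparser

# Download and parse a site's robots.txt, then ask whether a path may be fetched
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()  # fetches and parses the file
print(rp.can_fetch('*', 'https://www.baidu.com/s'))  # True/False per the robots.txt rules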

Case study

# author: Internet veteran Xin
# Development time: 2021/4/5 8:23
import urllib.parse

kw = {'wd': '互联网老辛'}  # the author's name, 'Internet veteran Xin'
result = urllib.parse.urlencode(kw)
print(result)

# decode
res = urllib.parse.unquote(result)
print(res)
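urlencode works on a dict of parameters; for encoding a single bare string, urllib.parse also provides quote. A small sketch:

import urllib.parse

# quote() percent-encodes a bare string (here the Chinese search term)
s = urllib.parse.quote('互联网老辛')
print(s)                        # %E4%BA%92%E8%81%94%E7%BD%91%E8%80%81%E8%BE%9B
print(urllib.parse.unquote(s))  # back to the original string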

In the browser, the search term 互联网老辛 ('Internet veteran Xin') is converted into this non-Chinese, percent-encoded form.

I searched for the same term in my browser and then copied the URL from the address bar:

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=%E4%BA%92%E8%81%94%E7%BD%91%E8%80%81%E8%BE%9B&fenlei=256&oq=%25E7%25BE%258E%25E5%259B%25A2&rsv_pq=aa5b8079001eec3e&rsv_t=9ed1VMqcHzdaH7l2O1E8kMBcAS8OfSAGWHaXNgUYsfoVtGNbNVzHRatL1TU&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_btype=t&inputT=3542&rsv_sug2=0&rsv_sug4=3542

Look closely at the wd= parameter of that URL: it is exactly the value our code produced.
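Going the other way, urllib.parse can pull the wd parameter back out of that long URL; a minimal sketch (the URL is shortened here to its first two parameters):

import urllib.parse

url = 'https://www.baidu.com/s?ie=utf-8&wd=%E4%BA%92%E8%81%94%E7%BD%91%E8%80%81%E8%BE%9B'
query = urllib.parse.urlparse(url).query  # the part after '?'
params = urllib.parse.parse_qs(query)     # dict of lists, values already decoded
print(params['wd'][0])                    # the original search term, 互联网老辛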

Send a request

  • urllib.request library
    Simulates a browser initiating an HTTP request and obtains the response to that request
  • The signature of urllib.request.urlopen:
    urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameter description:
url: the address to visit, as a str, for example https://www.baidu.com
data: defaults to None; when data is supplied, the request is sent as a POST instead of a GET
The urlopen function returns an http.client.HTTPResponse object
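A minimal sketch of the data parameter in action, using httpbin.org as an assumed test endpoint; note that data must be bytes:

import urllib.parse
import urllib.request

# urlencode the form fields, then encode to bytes as urlopen requires
data = urllib.parse.urlencode({'wd': 'python'}).encode('utf-8')
resp = urllib.request.urlopen('https://httpbin.org/post', data=data)  # POST, because data is set
print(resp.status)  # 200 on success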

Code example

GET request

# author: Internet veteran Xin
# Development time: 2021/4/5 8:23
import urllib.request

url = "http://www.geekyunwei.com/"
resp = urllib.request.urlopen(url)
html = resp.read().decode('utf-8')  # convert the bytes response into a utf-8 string
print(html)

Why decode as utf-8 rather than gbk? Look at the page source to see which charset the page declares:
(Screenshot: the page source declares charset=utf-8.)
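Instead of reading the source by eye, you can also ask the response which charset the server declared in its Content-Type header; a small sketch that falls back to utf-8 when none is declared:

import urllib.request

resp = urllib.request.urlopen('http://www.geekyunwei.com/')
# resp.headers is an email.message.Message; get_content_charset()
# returns the charset from the Content-Type header, or None
charset = resp.headers.get_content_charset() or 'utf-8'
html = resp.read().decode(charset)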

Send a request - the Request object

Next, let's crawl Douban:

# author: Internet veteran Xin
# Development time: 2021/4/5 8:23
import urllib.request

url = "https://movie.douban.com/"
resp = urllib.request.urlopen(url)
print(resp)

Douban has an anti-crawler strategy and will respond directly with a 418 error.
To get around this, we need to disguise the request headers.
We find our browser's User-Agent:

User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400

# author: Internet veteran Xin
# Development time: 2021/4/5 8:23
import urllib.request

url = "https://movie.douban.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'
}
# Build the request object
req = urllib.request.Request(url, headers=headers)
# Use urlopen to open the request
resp = urllib.request.urlopen(req)
# Read the data from the response
html = resp.read().decode('utf-8')
print(html)

With that, Python has successfully disguised itself as a browser and fetched the data.
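Headers can also be attached after the Request is built; a sketch using add_header(), which is equivalent to passing the headers dict (the shortened User-Agent is just for readability):

import urllib.request

req = urllib.request.Request('https://movie.douban.com/')
# add_header() sets one header at a time on an existing Request
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')
resp = urllib.request.urlopen(req)
print(resp.status)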

IP proxies

Using an opener: build your own opener to send requests.

# author: Internet veteran Xin
# Development time: 2021/4/5 8:23
import urllib.request

url = "https://www.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'
}
# Build the request object
req = urllib.request.Request(url, headers=headers)
opener = urllib.request.build_opener()
resp = opener.open(req)
print(resp.read().decode())

If you keep sending requests from the same address, the site may ban your IP, so we switch to a different IP proxy every so often.

Types of IP proxy:

  • Transparent proxy: the target site knows you are using a proxy and knows your source IP address; this kind of proxy clearly does not suit our purpose
  • Anonymous proxy: the target site knows you are using a proxy but does not know your source IP
  • High-anonymity proxy: the safest kind; the target site does not even know you are using a proxy

Where to get proxy IPs:
Free: https://www.xicidaili.com/nn/
Paid: providers such as Elephant Proxy, Kuai Proxy, and Zhima (Sesame) Proxy

# author: Internet veteran Xin
# Development time: 2021/4/5 8:23
from urllib.request import build_opener
from urllib.request import ProxyHandler

proxy = ProxyHandler({'https': '222.184.90.241:4278'})
opener = build_opener(proxy)
url = 'https://www.baidu.com/'
resp = opener.open(url)
print(resp.read().decode('utf-8'))

Baidu also has anti-crawling measures; even a high-anonymity proxy cannot bypass them 100% of the time.
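Putting the pieces together, here is a sketch that picks a random proxy from a small pool for each request (the addresses are placeholders; substitute live proxies from one of the providers above):

import random
import urllib.request

# Placeholder proxy addresses; replace with live ones
proxies = [
    {'https': '222.184.90.241:4278'},
    {'https': '113.121.22.92:9999'},
]

def fetch(url):
    # Pick a proxy at random and build a one-off opener around it
    handler = urllib.request.ProxyHandler(random.choice(proxies))
    opener = urllib.request.build_opener(handler)
    return opener.open(url, timeout=10).read()

html = fetch('https://www.baidu.com/')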

Using cookies

Why use cookies?
Mainly to work around the statelessness of HTTP.

Usage steps:

  • Instantiate MozillaCookieJar (to save the cookies)
  • Create a handler object (the cookie processor)
  • Create an opener object
  • Open the web page (send the request and get the response)
  • Save the cookie file

Case study: get the Baidu Tieba cookies and store them

import urllib.request
from http import cookiejar

filename = 'cookie.txt'

def get_cookie():
    cookie = cookiejar.MozillaCookieJar(filename)
    # Create the handler object (cookie processor)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    # Request the URL
    url = 'https://tieba.baidu.com/f?kw=python3&fr=index'
    resp = opener.open(url)
    # Save the cookies
    cookie.save()

# Read the data back
def use_cookie():
    # Instantiate MozillaCookieJar
    cookie = cookiejar.MozillaCookieJar()
    # Load the cookie file
    cookie.load(filename)
    print(cookie)

if __name__ == '__main__':
    # Run get_cookie() first so that cookie.txt exists
    use_cookie()
    # get_cookie()
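If the cookies only need to survive within one program run, an in-memory CookieJar is enough; a minimal sketch where the same opener carries cookies across requests:

import urllib.request
from http import cookiejar

# In-memory jar: cookies live only for this process
jar = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Cookies set by the first response are sent back on later requests
opener.open('https://www.baidu.com/')
for c in jar:
    print(c.name, c.value)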

Exception handling

We crawl a website we can't access in order to trigger and catch an exception:

# author: Internet veteran Xin
# Development time: 2021/4/6 7:38
import urllib.request
import urllib.error

url = 'https://www.google.com'
try:
    resp = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)

You can see that the exception was caught.
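urllib.error actually defines two main exceptions: HTTPError, raised when the server answers with an error status (such as Douban's 418), and its parent class URLError, raised for network-level failures. Catch the more specific one first; a sketch:

import urllib.error
import urllib.request

try:
    resp = urllib.request.urlopen('https://movie.douban.com/')
except urllib.error.HTTPError as e:
    # The server responded, but with an error status code
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:
    # The server could not be reached at all
    print('URL error:', e.reason)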
That wraps up web requests. Later we'll learn a few commonly used libraries, and then you'll be able to crawl data.