So-called "crawling" means reading the network resource specified by a URL from the network stream and saving it locally.
It is much like using a program to simulate a browser: the URL is sent to the server as the content of an HTTP request, and the resource in the server's response is then read back.

1、The process of browsing a web page

In fact, fetching a web page works much the same way as a user browsing with a browser such as IE.

For example, you can enter the address www.baidu.com in the browser's address bar.

Opening a web page is actually the browser, acting as the "client", sending a request to the server, "fetching" the server-side files to the local machine, and then interpreting and displaying them.

HTML is a markup language: it tags the content so that it can be parsed and distinguished.

The browser's job is to fetch the HTML code, parse it, and turn the raw code into the page we see.

2、URL and URI

2.1 URI

Every resource available on the Web, such as an HTML document, image, video clip, or program, is located by a Universal Resource Identifier (URI).

A URI typically consists of three parts:

① the naming scheme used to access the resource;

② the host name of the resource;

③ the name of the resource itself, represented by a path.

Take the following URI as an example:
http://www.why.com.cn/myhtml/html1223/

We can read it as follows:

① this is a resource accessed via the HTTP protocol,

② located on the host www.why.com.cn,

③ and reached through the path "/myhtml/html1223/".
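These three parts can be pulled apart programmatically. A minimal sketch using the standard library's urllib.parse (the Python 3 home of Python 2's urlparse module):

```python
from urllib.parse import urlparse

# Split the example URI from above into its three parts.
parts = urlparse('http://www.why.com.cn/myhtml/html1223/')

print(parts.scheme)  # the naming scheme: 'http'
print(parts.netloc)  # the host name: 'www.why.com.cn'
print(parts.path)    # the resource path: '/myhtml/html1223/'
```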

 

2.2 URL

In a nutshell, a URL is the string, such as http://www.baidu.com, that you enter in the browser.

A URL is a subset of URI. URL is the abbreviation of Uniform Resource Locator, usually translated as "uniform resource locator".

In layman's terms, a URL is a string describing an information resource on the Internet, used mainly by all kinds of WWW client and server programs.

With URLs, various information resources, including files, server addresses, and directories, can be described in one uniform format.

The general format of a URL is (parts in square brackets [] are optional):

protocol://hostname[:port]/path/[;parameters][?query]#fragment

A URL is composed of three parts:

① The first part is the protocol (or service scheme).

② The second part is the IP address or domain name of the host that stores the resource (sometimes including a port number).

③ The third part is the specific address of the resource on the host, such as a directory and file name.

The first and second parts are separated by the "://" symbol,

and the second and third parts are separated by the "/" symbol.

The first and second parts are indispensable; the third part can sometimes be omitted.
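The pieces of the general format can likewise be inspected with urllib.parse. A small sketch (the host, port, query, and fragment here are made-up illustration values):

```python
from urllib.parse import urlparse

url = 'http://www.example.com:8080/path/page.html?key=value#section'
parts = urlparse(url)

print(parts.scheme)    # protocol: 'http'
print(parts.hostname)  # host: 'www.example.com'
print(parts.port)      # optional port: 8080
print(parts.path)      # path: '/path/page.html'
print(parts.query)     # query: 'key=value'
print(parts.fragment)  # fragment: 'section'
```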

2.3 Comparison

URI is the more general concept: simply a standard for identifying resources as strings of text.

In other words, URI is the parent class and URL is a subclass of URI; every URL is a URI.

URI is defined as: uniform resource identifier;

URL is defined as: uniform resource locator.

The difference between them is that a URI merely identifies a resource (the path requested on the server),

while a URL also states how to access that resource (e.g. via http://).

2.4 URL Examples

The http scheme

Uses the HyperText Transfer Protocol (HTTP) to serve hypertext resources.

Example: http://www.peopledaily.com.cn/channel/welcome.htm

Its domain name is www.peopledaily.com.cn.

The hypertext file (file type .html) is welcome.htm, in the directory /channel.

This is a computer at China's People's Daily.

Example: http://www.rol.cn.net/talk/talk1.htm

Its domain name is www.rol.cn.net.

The hypertext file (file type .html) is talk1.htm, in the directory /talk.

This is the address of a chat room; it opens room 1 of the chat room's second section.

The file scheme

A URL for a file uses file as the scheme, followed by the host IP address, the path to the file (that is, its directory), and the file name.

The directory and file name can sometimes be omitted, but the "/" symbol cannot.

Example: file://ftp.yoyodyne.com/pub/files/foobar.txt

This URL represents the file foobar.txt in the pub/files/ directory on the host ftp.yoyodyne.com.

Example: file://ftp.yoyodyne.com/pub

Represents the directory /pub on the host ftp.yoyodyne.com.

Example: file://ftp.yoyodyne.com/

Represents the root directory of the host ftp.yoyodyne.com.

3、urllib2

urllib2 is a Python component for fetching URLs (Uniform Resource Locators).

It provides a very simple interface through the urlopen function.

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html  # prints the source code of the page

urllib2 uses a Request object to represent the HTTP request you are making.

In its simplest form, you create a Request object with the address you want to fetch.

Calling urlopen with the Request object returns a response object for the URL requested.

This response object behaves like a file object, so you can call .read() on it.

import urllib2

req = urllib2.Request('http://www.baidu.com')  # an http URL
req2 = urllib2.Request('ftp://example.com/')   # an ftp URL; urllib2 uses the same interface for all URL schemes
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

When making an HTTP request, you are allowed to do two extra things.

(1) Sending data: form data

Sometimes you want to send data to a URL (often to a CGI [Common Gateway Interface] script, or some other web application hook).

In HTTP, this is usually done with the familiar POST request.

This is what your browser does when you submit an HTML form.

Not all POSTs come from forms: you can use POST to submit arbitrary data to your own program.

For an ordinary HTML form, the data needs to be encoded in the standard form encoding, and then passed to the Request object as the data argument.

The encoding is done with urllib, not urllib2.

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # send the request together with the form data
response = urllib2.urlopen(req)   # receive the response
the_page = response.read()        # read the response

If the data argument is not passed, urllib2 uses a GET request.

One difference between GET and POST requests is that POST requests often have "side effects":

they change the state of the system in some way (for example, placing an order to be delivered to your door).

Data can also be transmitted by encoding it into the URL of a GET request itself.
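Encoding data onto a GET URL can be sketched like this (shown with Python 3's urllib.parse.urlencode, the successor of Python 2's urllib.urlencode; the server address is the made-up one from the form example above):

```python
from urllib.parse import urlencode

values = {'name': 'WHY', 'location': 'SDU', 'language': 'Python'}
query = urlencode(values)  # 'name=WHY&location=SDU&language=Python'

# Append the encoded data to the URL itself instead of sending it as a POST body.
url = 'http://www.someserver.com/register.cgi?' + query
print(url)
```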

(2) Setting Headers on the HTTP request

Some sites dislike being visited by programs (non-human visits), or send different versions of their content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, for example Python-urllib/2.7).

This identity may confuse the site, or simply not work.

A browser identifies itself through the User-Agent header; when you create a Request object, you can give it a dictionary containing the header data.
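A minimal sketch of attaching a User-Agent header to a request (written against Python 3's urllib.request, where urllib2's classes now live; the user-agent string is just an example, and no request is actually sent here):

```python
from urllib.request import Request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = Request('http://www.baidu.com', headers=headers)

# The header is stored on the Request object; urlopen(req) would send it.
# Note that urllib normalizes header names to 'Capitalized-lowercase' form.
print(req.get_header('User-agent'))
```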

The response object returned by urlopen (or an HTTPError instance) has two very useful methods: info() and geturl().

3.1 geturl()

This returns the real URL of the page fetched. It is useful because urlopen (or the opener object used) may have followed a redirect, so the URL obtained may differ from the URL requested.

from urllib2 import Request, urlopen

old_url = 'http://i.baidu.com/?from=image'
req = Request(old_url)
response = urlopen(req)
print 'Old url :' + old_url
print 'Real url :' + response.geturl()

Output:
Old url :http://i.baidu.com/?from=image
Real url :http://i.baidu.com/welcome/

3.2 info()

This returns a dictionary-like object describing the page fetched, typically the specific headers sent by the server. It is currently an httplib.HTTPMessage instance.

Classic headers include "Content-length", "Content-type", and so on.

from urllib2 import Request, urlopen

old_url = 'http://www.baidu.com'
req = Request(old_url)
response = urlopen(req)
print 'Info():'
print response.info()

Output:
Info():
Date: Tue, 14 Nov 2017 07:55:18 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: Close
Vary: Accept-Encoding
Set-Cookie: BAIDUID=6AD1DC5A0A2202A069705254C2CE007E:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=6AD1DC5A0A2202A069705254C2CE007E; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1510646118; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1450_21102_24880_22075; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Cache-Control: private
Cxy_all: baidu+164eb539e96e551026e6b365b859f30e
Expires: Tue, 14 Nov 2017 07:54:57 GMT
X-Powered-By: HPHP
Server: BWS/1.1
X-UA-Compatible: IE=Edge,chrome=1
BDPAGETYPE: 1
BDQID: 0xb02af5550000a934
BDUSERID: 0

3.3 Openers / Handlers

Openers:

When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector).

Normally we use the default opener, via urlopen.

But you can create customised openers.

Handlers:

Openers use handlers; all the "heavy lifting" is done by the handlers.

Each handler knows how to open URLs for a particular protocol, or how to handle some aspect of URL opening,

for example HTTP redirects or HTTP cookies.

You will want to create an opener with specific handlers installed if you need to fetch URLs with special behaviour, for example an opener that handles cookies, or an opener that does not follow redirects.

To create an opener, you can instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, a more convenient function for creating opener objects in a single function call.
build_opener adds several handlers by default, but provides a quick way to add more or override the defaults.

Other handlers you might want can deal with proxies, authentication, and other common but slightly special situations.

install_opener can be used to make an opener the (global) default; this means calls to urlopen will use the opener you installed.

Opener objects have an open method, which can be used directly to fetch URLs just like the urlopen function; there is usually no need to call install_opener except for convenience.
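As a sketch, building a cookie-handling opener with build_opener looks like this (shown with Python 3's urllib.request; nothing is fetched here, the opener is only constructed):

```python
import urllib.request
from http.cookiejar import CookieJar

cookie_jar = CookieJar()
# build_opener installs the default handlers plus any extra ones passed in.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

# opener.open(url) could now be used like urlopen, with cookies handled;
# urllib.request.install_opener(opener) would make it the global default.
print(type(opener).__name__)  # OpenerDirector
```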

4、Exception handling

When urlopen cannot handle a response, it raises URLError
(though, as with normal Python APIs, exceptions such as ValueError and TypeError may also be raised).
HTTPError is a subclass of URLError, raised in the specific case of HTTP URLs.

4.1 URLError
Usually, URLError is raised when there is no network connection (no route to the given server), or the server does not exist.

In this case, the exception also has a "reason" attribute, which is a tuple (think of it as an immutable array)

containing an error number and an error message.

import urllib2

req = urllib2.Request('http://www.baibai.com')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.reason

Output:
[Errno 11002] getaddrinfo failed  # the error number is 11002, the message is getaddrinfo failed

4.2 HTTPError

Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server cannot fulfil the request. The default handlers will deal with some of these responses for you.

For example, if the response is a "redirect" asking the client to fetch the document from a different address, urllib2 will handle it for you.

For the responses it cannot handle, urlopen raises an HTTPError.

Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

The HTTP status code indicates the status of the response returned by the HTTP protocol.

For example, when a client sends a request to the server, if the requested resource is fetched successfully, the returned status code is 200, indicating a successful response.

If the requested resource does not exist, a 404 error is usually returned.

HTTP status codes fall into 5 classes, beginning with the digits 1 through 5 respectively; each code is a 3-digit integer:

200
The request succeeded.
Handling: get the response content and process it.
201
The request completed and resulted in a new resource being created; the new resource's URI is available in the response entity.
Handling: you will not normally encounter this.
202
The request was accepted, but processing is not yet complete.
Handling: block and wait.
204
The server has fulfilled the request but returned no new information; if the client is a user agent, it need not update its document view.
Handling: discard.
300
This status code is not used directly by HTTP/1.0 applications; it is only the default interpretation for 3XX responses. Multiple representations of the requested resource are available.
Handling: process further if the program can; otherwise discard.
301
The requested resource has been assigned a permanent URL, through which it can be accessed in the future.
Handling: redirect to the assigned URL.
302
The requested resource is temporarily stored at a different URL.
Handling: redirect to the temporary URL.
304
The requested resource has not been updated.
Handling: discard.
400
Illegal request.
Handling: discard.
401
Unauthorized.
Handling: discard.
403
Forbidden.
Handling: discard.
404
Not found.
Handling: discard.
5XX
Status codes beginning with "5" indicate that the server found itself in error and cannot continue with the request.
Handling: discard.
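The class of a status code is simply its first digit; a small sketch using the standard http module to illustrate:

```python
from http import HTTPStatus

def status_class(code):
    """Return the class (1-5) of a three-digit HTTP status code."""
    return code // 100

print(status_class(200))       # 2: success
print(status_class(404))       # 4: client error
print(HTTPStatus(404).phrase)  # 'Not Found'
print(HTTPStatus(301).phrase)  # 'Moved Permanently'
```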
After an HTTPError instance is raised, it has an integer 'code' attribute, which is the error number sent by the server.

Error codes
Because the default handlers deal with redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes, listing all the response codes used by the HTTP protocol.

When an error code is raised, the server returns an HTTP error number and an error page.

You can use the HTTPError instance as a response object for the page returned.

This means that, in addition to the code attribute, it also has the read, geturl, and info methods.

import urllib2

req = urllib2.Request('http://bbs.csdn.net/callmewhy')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.code
    #print e.read()

Output:
404

4.3 Wrapping it up

So if you want to be prepared for HTTPError or URLError, there are two basic approaches. The second one is recommended.

The first approach:

from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://bbs.csdn.net/callmewhy')
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    print 'No exception was raised.'

Output:
The server couldn't fulfill the request.
Error code: 404
As in other languages, the try block catches the exception and prints its contents.
One thing to note here: except HTTPError must come first, otherwise except URLError would also catch the HTTPError,
because HTTPError is a subclass of URLError; if URLError came first, it would capture every URLError (including HTTPError).
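The subclass relationship can be checked directly (shown with Python 3's urllib.error, where these exception classes now live):

```python
from urllib.error import HTTPError, URLError

# HTTPError inherits from URLError, which is why the more specific
# except clause has to come before the more general one.
print(issubclass(HTTPError, URLError))  # True
```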
The second approach:
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://bbs.csdn.net/callmewhy')
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    print 'No exception was raised.'

Output:
The server couldn't fulfill the request.
Error code: 404
 
 
 
 
 
 
