Many routines recur when developing crawlers. Here is a summary of them, so we can reuse these pieces later.
1、 Basic web page fetching
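The article's snippet for this step is not shown; a minimal sketch looks like the following. The article targets Python 2's urllib2, whose API lives in urllib.request in Python 3; the URL below is a placeholder.

```python
import urllib.request

def fetch(url, timeout=10):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

# Hypothetical usage:
# html = fetch("http://example.com")
```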
2、 Using proxy IPs
While developing crawlers we often run into the situation where our IP gets banned; at that point we need to use a proxy IP.
The urllib2 package contains a ProxyHandler class, through which you can set up a proxy to visit web pages. The following code snippet:
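A sketch of the ProxyHandler pattern, written against Python 3's urllib.request (the Python 2 urllib2 calls are the same names); the proxy address is a made-up placeholder.

```python
import urllib.request

# Route HTTP traffic through a proxy (placeholder address).
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8087"})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # later urlopen() calls now use the proxy

# Hypothetical usage:
# resp = urllib.request.urlopen("http://example.com")
```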
3、 Cookies processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify users and track sessions. Python provides the cookielib module for handling cookies; its main job is to supply objects that store cookies, so that the urllib2 module can use them when accessing Internet resources.
Code snippet:
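A sketch of the cookie-handling setup; in Python 3 cookielib is renamed http.cookiejar and urllib2's HTTPCookieProcessor lives in urllib.request.

```python
import http.cookiejar
import urllib.request

# CookieJar keeps cookies in memory; the processor feeds them
# into outgoing requests and captures Set-Cookie from responses.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Hypothetical usage:
# resp = opener.open("http://example.com")  # cookies from the response land in cj
```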
The key is CookieJar(), an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. All cookies are kept in memory; once a CookieJar instance is garbage-collected, its cookies are lost too, so none of this needs to be managed by hand.
Manually adding a cookie:
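One way to add a cookie by hand is to set the Cookie header on the request yourself, as sketched below (Python 3's urllib.request; the cookie name and value are made-up placeholders).

```python
import urllib.request

# Attach a cookie manually by setting the Cookie header directly.
req = urllib.request.Request("http://example.com")
req.add_header("Cookie", "sessionid=placeholder-value")

# Hypothetical usage:
# resp = urllib.request.urlopen(req)
```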
4、 Pretending to be a browser
Some websites dislike crawler visits and reject every request that comes from a crawler, so accessing such sites directly with urllib2 often produces HTTP Error 403: Forbidden.
Pay special attention to certain headers; the server checks these:
1. User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
2. Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content in the HTTP body.
You can get around this by modifying the headers in the http package. The code snippet is as follows:
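A sketch of the header trick, again using Python 3's urllib.request; the User-Agent string is just an illustrative browser-like value.

```python
import urllib.request

# Disguise the crawler as a browser by sending a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
req = urllib.request.Request("http://example.com", headers=headers)

# Hypothetical usage:
# resp = urllib.request.urlopen(req)  # less likely to be rejected as a bare script
```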
5、 Handling CAPTCHAs
Simple CAPTCHAs can be recognized with simple techniques, and that is all we have done ourselves. But some CAPTCHAs are hostile even to humans, such as 12306's; for those you can send the images to a human-powered captcha-solving platform, which of course charges a fee.
6、 Gzip compression
Have you ever hit a web page that stays garbled no matter what encoding you transcode it with? Ha, that means you don't yet know that many web services can send compressed data, which can cut the data transferred over the wire by more than 60%. This is especially true for XML web services, since XML data compresses at a very high ratio.
But the server generally won't send you compressed data unless you tell it that you can handle it.
So you need to modify the code like this:
This is the key: create a Request object and add an Accept-Encoding header that tells the server you can accept gzip-compressed data.
Then decompress the data:
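Both steps together can be sketched as follows, assuming Python 3's urllib.request plus the standard gzip module (the helper name fetch_gzipped is our own).

```python
import gzip
import io
import urllib.request

def fetch_gzipped(url):
    """Request a URL with gzip allowed and decompress the reply if needed."""
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "gzip")  # tell the server we handle gzip
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            data = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
        return data
```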
7、 Multithreaded concurrent fetching
If a single thread is too slow, you need multithreading. Here is a simple thread-pool template; the program just prints 1-10, but you can see that it runs concurrently.
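The article's Python 2 pool template is not shown; a sketch of the same idea using Python 3's concurrent.futures (a swap for the Queue-plus-threading pool typical of the era):

```python
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # Stand-in job: just print the number. In a real crawler this
    # would fetch one URL instead.
    print(n)
    return n

# Four worker threads pull jobs 1..10 off the pool concurrently;
# map() still returns the results in submission order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(1, 11)))
```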
Although Python multithreading is often dismissed as nearly useless, crawlers spend most of their time waiting on the network, so for them it can still improve efficiency to a certain extent.