Python anti-crawlers and anti-anti-crawlers

Small ao dog 2022-09-09 00:25:04 Views: 146




You can think of a crawler as a spider crawling across the Internet. The Internet is like a huge web, and the crawler is a spider moving around on that web. If it encounters prey (a resource it needs), it grabs it. For example, while crawling one web page it may discover a path, which is actually a link to another site; it can then follow that link to fetch data from the other site.


An anti-crawler is any technical means a website uses to prevent others from harvesting its information in bulk.

Without anti-crawler measures, people would keep issuing requests to scrape data, the site's servers would see large amounts of abnormal or unexpected traffic, and bandwidth would be wasted on delivering data to scrapers (programmers/organizations) rather than to real users.

This has a significant negative impact on website operators, hence anti-crawler measures.

3. Anti-crawler techniques

3.1 Based on request headers

Anti-crawler systems first look at request headers. A crawler's request headers usually differ from those sent by a user's browser, so a large portion of programmatic requests can be filtered out by inspecting the headers.
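As a hypothetical sketch of the server's side of this (the function name, signature list, and logic here are made up for illustration, not taken from any real system), header-based filtering can be as simple as checking the User-Agent for signatures of programmatic clients:

```python
# Hypothetical server-side sketch: reject requests whose User-Agent
# matches a known programmatic client, or is missing entirely.
BLOCKED_SIGNATURES = ("python-requests", "python-urllib", "curl", "scrapy")

def looks_like_bot(headers):
    """Return True if the request headers suggest a non-browser client."""
    ua = headers.get("User-Agent", "").lower()
    return ua == "" or any(sig in ua for sig in BLOCKED_SIGNATURES)

print(looks_like_bot({"User-Agent": "python-requests/2.31.0"}))  # True
print(looks_like_bot({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}))  # False
```

This is exactly why the default headers sent by crawler libraries give the game away: they announce the library's own name instead of a browser's.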

3.2 Based on user behavior

Anti-crawler systems can also be based on user behavior. Faced with abnormal behavior, such as dozens of requests submitted within one second, the backend can conclude that the requester is probably not human and block that user's IP, achieving the anti-crawler effect.
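To illustrate the idea, here is a minimal sliding-window sketch of how a backend might flag such behavior (the class name, window size, and request limit are all assumptions made up for this example):

```python
import time
from collections import deque

class RateWatcher:
    """Track request timestamps per IP and flag clients that exceed
    a request limit inside a sliding time window."""

    def __init__(self, window=1.0, limit=20):
        self.window = window   # window length in seconds
        self.limit = limit     # max requests allowed per window
        self.hits = {}         # ip -> deque of request timestamps

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(ip, deque())
        q.append(now)
        # Drop timestamps that have fallen out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) <= self.limit

watcher = RateWatcher(window=1.0, limit=3)
results = [watcher.allow("1.2.3.4", now=t) for t in (0.0, 0.1, 0.2, 0.3)]
print(results)  # the 4th request inside one second is rejected
```

A real backend would sit behind a reverse proxy or rate-limiting middleware rather than a hand-rolled class, but the principle — count requests per IP per time window — is the same.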


4. Anti-anti-crawler techniques

4.1 Setting request headers

By setting request headers, our crawler can masquerade as a browser and avoid drawing the suspicion of the anti-crawler system.

We can use the fake_useragent library, which contains a UserAgent class that generates different request headers for us.

fake_useragent is a third-party library, so the first step is to install it:

pip install fake_useragent

Then import the fake_useragent library:

from fake_useragent import UserAgent

Using it is also very simple (which browser attributes are available can vary with the library version):

from fake_useragent import UserAgent

# Create a UserAgent object
ua = UserAgent()
# Get a random User-Agent
print(ua.random)
# Random IE User-Agent
print(ua.ie)
# Random Opera User-Agent
print(ua.opera)
# Random Chrome User-Agent
print(ua.chrome)
# Random Firefox User-Agent
print(ua.firefox)
# Random Safari User-Agent
print(ua.safari)

Passing a request header containing the generated User-Agent as a parameter when sending the request lets us masquerade as a browser.
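For instance, with the standard library's urllib (a hardcoded Chrome-style string stands in here for one produced by fake_useragent's ua.random, and example.com is a placeholder URL):

```python
import urllib.request

# A realistic browser User-Agent (stand-in for fake_useragent's ua.random)
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0.0.0 Safari/537.36")

# Attach the header to the request; no network call is made yet
req = urllib.request.Request("https://example.com",
                             headers={"User-Agent": user_agent})

# urllib stores header names capitalized, e.g. "User-agent"
print(req.get_header("User-agent"))
# To actually send it: urllib.request.urlopen(req)
```

The same headers dictionary works with the requests library (`requests.get(url, headers={"User-Agent": user_agent})`) if you prefer it over urllib.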

4.2 Setting request intervals

High-frequency requests will trigger the anti-crawler system, so we need to disguise the bot to look more like a real person. One way to do that is to reduce the request frequency.

We usually reduce the frequency by inserting an interval between requests, using the time library.

time is a built-in Python standard library and can be imported directly:

import time

The time library provides a sleep() method, which pauses the program for the length of time given as its argument:

import time

# Pause the program for 3 seconds
time.sleep(3)

The unit used by the sleep() method is seconds (s).

Try to make sure our crawler leaves some time between each request, to avoid being flagged by the anti-crawler system.

We can also combine this with the random library to make the intervals look more natural:

import time
import random

for i in range(10):
    # Sleep for a random interval between 0 and 3 seconds
    time.sleep(random.random() * 3)


5. The robots protocol

The robots protocol, also called the crawler protocol or crawler rules, lets a website provide a robots.txt file telling others which pages may be crawled and which may not; crawlers in turn read the robots.txt file to determine whether a page is allowed to be fetched.
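Python's standard library can read such a file for you via urllib.robotparser. A small sketch (the rules and URLs below are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everything is allowed except /private/
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/index.html"))    # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

For a live site you would instead call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` to fetch and parse the real file.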

The robots protocol is a common code of ethics in the international Internet community, established on the following principles:

  1. Search technology should serve humanity, while respecting the wishes of information providers and protecting their privacy.
  2. Websites are obligated to protect their users' personal information and privacy from infringement.

However, the robots protocol is not a firewall and has no enforcement power; a bot can completely ignore the robots.txt file and crawl snapshots of the pages anyway.

The robots protocol is not a formal standard, only a convention, so it cannot actually guarantee a site's privacy. It is just a gentlemen's agreement of the Internet world, and we should abide by it voluntarily.

Copyright notice: This article was written by [Small ao dog]. Please include a link to the original when reposting. Thanks.