Python Scrapy framework learning

Zh1z3ven 2020-11-15 23:51:43


The Scrapy framework

Scrapy installation

A normal installation will report an error, for two main reasons.

0x01 Upgrade the pip3 package

python -m pip install -U pip

0x02 Manually install dependencies

Manually install wheel, lxml, Twisted and pywin32.

pip3 install wheel
pip3 install lxml
pip3 install Twisted
pip3 install pywin32

0x03 Install Scrapy

pip3 install scrapy

Scrapy Project management

0x01 Use scrapy to create a new crawler project

mkdir Scrapy
scrapy startproject myfirstpjt
cd myfirstpjt


0x02 Scrapy-related commands

There are two kinds of commands: global commands and project commands.

Global commands do not depend on a Scrapy project and can be executed directly, while project commands must be run inside a project.

Run scrapy -h outside the directory of a Scrapy project to show all global commands:

C:\Users\LENOVO>scrapy -h
Scrapy 2.4.0 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

fetch

The fetch command downloads a given URL with the Scrapy downloader and displays the crawling process.

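For example (example.com is only an illustrative URL), fetching a page prints the crawl log together with the page body; --nolog keeps only the body and --headers prints the response headers instead:

scrapy fetch http://example.com
scrapy fetch --nolog http://example.com
scrapy fetch --headers http://example.com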

0x03 Selectors

Both XPath and CSS selectors are supported.

Selectors also have a .re() method for extracting data with regular expressions.

Unlike .xpath() or .css(), the .re() method returns a list of unicode strings, so nested .re() calls cannot be constructed.
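A minimal sketch of combining .xpath() with .re() in the scrapy shell (the four-digit year pattern is hypothetical, used only for illustration):

# extract a four-digit year from hrefs such as .../2020/11/... (hypothetical pattern)
years = response.xpath('//a/@href').re(r'/(\d{4})/')
# the result is a plain list of strings, so no further .xpath()/.css()/.re() chaining is possible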

Creating a Scrapy project

Writing a small script or project from scratch is like writing an essay on blank paper; the framework integrates the commonly used parts and turns the essay question into a fill-in-the-blank question, greatly reducing the workload.

scrapy startproject <project name> (for example todayMovie)

tree todayMovie

D:\pycharm\code\Scrapy>scrapy startproject todayMovie
New Scrapy project 'todayMovie', using template directory 'c:\python3.7\lib\site-packages\scrapy\templates\project', created in:
D:\pycharm\code\Scrapy\todayMovie
You can start your first spider with:
cd todayMovie
scrapy genspider example example.com
D:\pycharm\code\Scrapy>tree todayMovie
Folder PATH list
The volume serial number is 6858-7249
D:\PYCHARM\CODE\SCRAPY\TODAYMOVIE
└─todayMovie
└─spiders
D:\pycharm\code\Scrapy>

0x01 Use genspider to create a basic crawler

Create a crawler script named wuHanMovieSpider whose crawl domain is mtime.com

scrapy genspider wuHanMovieSpider mtime.com

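The generated wuHanMovieSpider.py roughly follows the default genspider template (the exact class name and start URL depend on the template and Scrapy version):

import scrapy

class WuhanmoviespiderSpider(scrapy.Spider):
    name = 'wuHanMovieSpider'
    allowed_domains = ['mtime.com']
    start_urls = ['http://mtime.com/']

    def parse(self, response):
        pass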

0x02 About the files under the framework

scrapy.cfg

Mainly declares that the default settings file is the settings file (settings.py) under the todayMovie module, and defines the project name as todayMovie.

items.py defines which items the crawler ultimately needs to extract.

pipelines.py takes care of the tail end: after the Scrapy crawler has crawled the page content, how that content is handled depends on how pipelines.py is set up.

Only 4 files need to be modified or filled in: items.py, settings.py, pipelines.py and wuHanMovieSpider.py.

Among them, items.py decides what to crawl, wuHanMovieSpider.py decides how to crawl, settings.py decides who handles the crawled content, and pipelines.py decides what to do with the crawled content.
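As a minimal sketch for the todayMovie project (the field name movieName is hypothetical, used only for illustration), items.py could look like:

import scrapy

class TodaymovieItem(scrapy.Item):
    # hypothetical field: the movie name the crawler should extract
    movieName = scrapy.Field()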

0x03 XPath selectors

selector = response.xpath('/html/body/div[@id="homeContentRegion"]//text()')[0].extract()

extract() returns the selected content as a Unicode string.

Traversing the document tree with XPath

Symbol    Purpose
/    Select the document root, usually html
//    Select all descendant nodes from the current position
./    Extract from the current node; used when extracting data a second time
.    Select the current node (relative path)
..    Select the parent of the current node (relative path)
ELEMENT    Select all ELEMENT element nodes among the child nodes
//ELEMENT    Select all ELEMENT element nodes among the descendant nodes
*    Select all element child nodes
text()    Select all existing text child nodes
@ATTR    Select the attribute node named ATTR
@*    Select all attribute nodes
/@ATTR    Get the value of a node attribute

Method    Purpose
contains    a[contains(@href,"test")] finds a tags whose href attribute contains the characters "test"
starts-with    a[starts-with(@href,"http")] finds a tags whose href attribute starts with "http"

Examples

response.xpath('//a/text()')  # select the text of all a tags
response.xpath('//div/*/img')  # select all img grandchild nodes of div
response.xpath('//p[contains(@class,"song")]')  # select p elements whose class attribute contains 'song'
response.xpath('//a[contains(@data-pan,"M18_Index_review_short_movieName")]/text()')
response.xpath('//div/a | //div/p')  # "or": matches either a or p
selector = response.xpath('//a[contains(@href,"http://movie.mtime.com")]/text()').extract()

Reference articles

https://www.cnblogs.com/master-song/p/8948210.html

https://blog.csdn.net/loner_fang/article/details/81017202

Example: crawling the weather forecast

0x01 Create the weather project and a basic crawler

cd Scrapy\code
scrapy startproject weather
scrapy genspider beiJingSpider www.weather.com.cn/weather/101010100.shtml

0x02 Modify items.py

import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()      # city and date
    week = scrapy.Field()          # day of the week
    temperature = scrapy.Field()   # temperature
    weather = scrapy.Field()       # weather
    wind = scrapy.Field()          # wind


0x03 scrapy shell

First use the scrapy shell command to test the selectors; the main point is to see whether the website has an anti-crawling mechanism.

scrapy shell https://www.tianqi.com/beijing/


For example, a 403 means access is forbidden; it does not mean the page doesn't exist.

A common bypass is to add a UA header and control the request frequency.
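A quick way to test a different User-Agent from the command line (the UA string below is only an example) is to override the USER_AGENT setting with -s:

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://www.tianqi.com/beijing/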

0x04 Simple bypass

Prepare a batch of User-Agent strings in resource.py and use random to pick one each time.

step1: put the prepared resource.py in the same directory as settings.py

resource.py

#-*- coding:utf-8 -*-
UserAgents = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0",
]


step2: modify middlewares.py

Import random, UserAgents, and UserAgentMiddleware.

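Assuming resource.py sits next to settings.py inside the weather package (as in step 1), the imports at the top of middlewares.py would look roughly like this:

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from weather.resource import UserAgents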

Add a new class at the bottom; the new class inherits from UserAgentMiddleware.

The class picks a random UA header from the list to use for requests.

class CustomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent='Scrapy'):
        # ua = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0"
        # pick a random User-Agent from the prepared list
        ua = random.choice(UserAgents)
        self.user_agent = ua


step3: modify settings.py

Use CustomUserAgentMiddleware to replace UserAgentMiddleware.

In settings.py, find the DOWNLOADER_MIDDLEWARES option and modify it as shown below.


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # 'weather.middlewares.WeatherDownloaderMiddleware': 543,
    'weather.middlewares.CustomUserAgentMiddleware': 542,
}

step4: change the request interval

The interval between Scrapy requests is set with DOWNLOAD_DELAY. If anti-crawling is not a concern, the smaller the better; a value of 30 means the website is asked for a page once every 30 seconds.

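For example, a modest delay in settings.py (3 seconds is just an illustrative value):

DOWNLOAD_DELAY = 3  # wait 3 seconds between consecutive requests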

PS: for most websites, simply adding a UA header is enough to bypass.


0x05 Modify beiJingSpider.py

The content to be obtained is under the div with class=day7, so anchor on this div.


# Note that the url here must end with /, otherwise the content cannot be obtained
scrapy shell https://tianqi.com/beijing/
selector = response.xpath('//div[@class="day7"]')
selector1 = selector.xpath('ul[@class="week"]/li')


beiJingSpider.py

import scrapy
from weather.items import WeatherItem

class BeijingspiderSpider(scrapy.Spider):
    name = 'beiJingSpider'
    allowed_domains = ['tianqi.com']  # allowed_domains takes bare domains, not full URLs
    start_urls = ['https://www.tianqi.com/beijing/']

    def parse(self, response):
        items = []
        city = response.xpath('//dd[@class="name"]/h2/text()').extract()
        Selector = response.xpath('//div[@class="day7"]')
        date = Selector.xpath('ul[@class="week"]/li/b/text()').extract()
        week = Selector.xpath('ul[@class="week"]/li/span/text()').extract()
        wind = Selector.xpath('ul[@class="txt"]/li/text()').extract()
        weather = Selector.xpath('ul[@class="txt txt2"]/li/text()').extract()
        temperature1 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/span/text()').extract()
        temperature2 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/b/text()').extract()
        for i in range(7):
            item = WeatherItem()
            try:
                item['cityDate'] = city[0] + date[i]
                item['week'] = week[i]
                item['temperature'] = temperature1[i] + ',' + temperature2[i]
                item['weather'] = weather[i]
                item['wind'] = wind[i]
            except IndexError:
                # fewer than 7 entries were extracted; stop
                exit()
            items.append(item)
        return items

0x06 Modify pipelines.py to handle the Spider results

import time
import codecs

class WeatherPipeline:
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        with codecs.open(fileName, 'a', 'utf-8') as fp:
            fp.write("%s \t %s \t %s \t %s \t %s \r\n"
                     % (item['cityDate'],
                        item['week'],
                        item['temperature'],
                        item['weather'],
                        item['wind']))
        return item

0x07 Modify settings.py

Find ITEM_PIPELINES and remove the comment in front of it.

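After uncommenting, the option should look roughly like this (300 is the priority value from the project template):

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
}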

0x08 Crawl the content

Go back to the weather project directory and execute the command

scrapy crawl beiJingSpider


Copyright notice
This article was written by Zh1z3ven. Please include a link to the original when reposting. Thanks.
