Scrapy Framework
Installing Scrapy
A straightforward installation often reports errors, mainly for two reasons.
0x01 Upgrade the pip3 package
python -m pip install -U pip
0x02 Manually install dependencies
The dependencies wheel, lxml, Twisted, and pywin32 need to be installed manually
pip3 install wheel
pip3 install lxml
pip3 install Twisted
pip3 install pywin32
0x03 Install Scrapy
pip3 install scrapy
Scrapy Project Management
0x01 Use scrapy to create a new crawler project
mkdir Scrapy
scrapy startproject myfirstpjt
cd myfirstpjt
0x02 Scrapy commands
There are two kinds of commands: global commands and project commands
Global commands can be run directly without a Scrapy project, while project commands must be run inside a project
Running scrapy -h outside a Scrapy project directory lists all the global commands
C:\Users\LENOVO>scrapy -h
Scrapy 2.4.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
commands
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
fetch
The fetch command is mainly used to display the crawling process: it downloads a URL with the Scrapy downloader and prints the response
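For example (these are standard fetch/global options; any reachable URL will do):
scrapy fetch http://example.com
scrapy fetch --nolog http://example.com      # suppress the crawl log and print only the response body
scrapy fetch --headers http://example.com    # print the HTTP headers instead of the body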
0x03 Selectors
Both XPath and CSS selectors are supported
Selectors also provide a .re() method for extracting data with regular expressions
Unlike .xpath() or .css(), .re() returns a list of unicode strings, so nested .re() calls cannot be constructed
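A minimal sketch in scrapy shell (the URL and regular expression here are only illustrative):
scrapy shell http://example.com
response.xpath('//title/text()').re(r'(\w+)')    # returns a list of strings, not a SelectorList
response.css('title::text').re_first(r'\w+')     # re_first() returns only the first match (or None)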
Creating a Scrapy project
Writing a small script or project is like writing an essay on a blank sheet of paper; a framework bundles the pieces that are used all the time, turning the essay question into a fill-in-the-blank question and greatly reducing the workload
scrapy startproject <project name> (for example todayMovie)
tree todayMovie
D:\pycharm\code\Scrapy>scrapy startproject todayMovie
New Scrapy project 'todayMovie', using template directory 'c:\python3.7\lib\site-packages\scrapy\templates\project', created in:
D:\pycharm\code\Scrapy\todayMovie
You can start your first spider with:
cd todayMovie
scrapy genspider example example.com
D:\pycharm\code\Scrapy>tree todayMovie
Folder PATH listing
Volume serial number is 6858-7249
D:\PYCHARM\CODE\SCRAPY\TODAYMOVIE
└─todayMovie
└─spiders
D:\pycharm\code\Scrapy>
0x01 Use the genspider parameter to create a new basic crawler
Create a new crawler script called wuHanMovieSpider whose search domain is mtime.com
scrapy genspider wuHanMovieSpider mtime.com
0x02 The files generated by the framework
scrapy.cfg
It mainly declares that the default settings file is the settings module (settings.py) under the todayMovie package, and defines the project name as todayMovie
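For reference, the generated scrapy.cfg looks roughly like this (details may vary slightly between Scrapy versions):
[settings]
default = todayMovie.settings

[deploy]
#url = http://localhost:6800/
project = todayMovie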
items.py defines which items the crawler ultimately needs to extract
pipelines.py takes care of the back end: after the Scrapy crawler has scraped the page content, how that content is handled depends on how pipelines.py is set up
Only 4 files need to be modified or "filled in": items.py, settings.py, pipelines.py, and wuHanMovieSpider.py.
Among them, items.py decides what to crawl, wuHanMovieSpider.py decides how to crawl, settings.py decides who handles the crawled content, and pipelines.py decides what to do with the crawled content
0x03 XPath selectors
selector = response.xpath('/html/body/div[@id="homeContentRegion"]//text()')[0].extract()
extract() returns the selected content as a unicode string.
Traversing the document tree with XPath
Symbol | Purpose |
---|---|
/ | Select the root of the document, usually html |
// | Select all descendant nodes from the current position |
./ | Extract relative to the current node; used when extracting data a second time from a sub-selector |
. | Select the current node (relative path) |
.. | Select the parent of the current node (relative path) |
ELEMENT | Select all child ELEMENT element nodes |
//ELEMENT | Select all descendant ELEMENT element nodes |
* | Select all element child nodes |
text() | Select all text child nodes |
@ATTR | Select the attribute node named ATTR |
@* | Select all attribute nodes |
/@ATTR | Get the value of the ATTR attribute |

Method | Purpose |
---|---|
contains | a[contains(@href,"test")] finds a tags whose href attribute contains "test" |
starts-with | a[starts-with(@href,"http")] finds a tags whose href attribute starts with "http" |
Examples
response.xpath('//a/text()')  # select the text of all a tags
response.xpath('//div/*/img')  # select all img grandchildren of div
response.xpath('//p[contains(@class,"song")]')  # select p elements whose class attribute contains "song"
response.xpath('//a[contains(@data-pan,"M18_Index_review_short_movieName")]/text()')
response.xpath('//div/a | //div/p')  # "or": matches either //div/a or //div/p
selector = response.xpath('//a[contains(@href,"http://movie.mtime.com")]/text()').extract()
References
https://www.cnblogs.com/master-song/p/8948210.html
https://blog.csdn.net/loner_fang/article/details/81017202
Example: crawling the weather forecast
0x01 Create the weather project and a basic crawler
cd Scrapy\code
scrapy startproject weather
scrapy genspider beiJingSpider www.weather.com.cn/weather/101010100.shtml
0x02 Modify items.py
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()     # city and date
    week = scrapy.Field()         # day of the week
    temperature = scrapy.Field()  # temperature
    weather = scrapy.Field()      # weather
    wind = scrapy.Field()         # wind
0x03 scrapy shell
First use the scrapy shell command to test whether the selectors can be obtained; the main thing is to check whether the site has an anti-crawling mechanism
scrapy shell https://www.tianqi.com/beijing/
A response such as 403 means access is forbidden, not that the page does not exist.
The usual bypass is to add a UA header and control the visit frequency
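Inside the shell you can quickly confirm that the page came back and that a selector matches; for example (the selector shown is the city selector used later in beiJingSpider.py, and the exact output depends on the site):
response.status                                              # 200 means the request went through, 403 means it was blocked
response.xpath('//dd[@class="name"]/h2/text()').extract()    # should return a non-empty list if the page is usable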
0x04 Common bypass
Prepare a pool of User-Agent strings in resource.py and use random to pick one each time.
step1: put the prepared resource.py in the same directory as settings.py
resource.py
#-*- coding:utf-8 -*-
UserAgents = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0",
]
step2: modify middlewares.py
Import random, UserAgents, and UserAgentMiddleware
Add a new class at the bottom that inherits from the UserAgentMiddleware class
The new class picks a random UA header to use for the requests
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from weather.resource import UserAgents

class CustomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent='Scrapy'):
        # ua = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0"
        ua = random.choice(UserAgents)   # pick one UA from the pool when the middleware is created
        self.user_agent = ua
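Note that, as written, the class only picks a UA once, when the middleware is instantiated. A minimal sketch of a variant (hypothetical class name, same resource.py pool) that picks a fresh UA for every request would override process_request instead:

class RandomUserAgentMiddleware(UserAgentMiddleware):
    def process_request(self, request, spider):
        # choose a different User-Agent from the pool for each outgoing request
        request.headers.setdefault(b'User-Agent', random.choice(UserAgents))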
step3: modify settings.py
Use CustomUserAgentMiddleware to replace the default UserAgentMiddleware.
In settings.py find the DOWNLOADER_MIDDLEWARES option and modify it as follows
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
#'weather.middlewares.WeatherDownloaderMiddleware': 543,
'weather.middlewares.CustomUserAgentMiddleware': 542,
}
step4: Change the request interval
The delay between Scrapy requests is set with DOWNLOAD_DELAY; if anti-crawling is not a concern, the smaller the better. A value of 30 means one page is requested from the site every 30 seconds.
ps: for most sites, adding a UA header is enough to bypass the protection
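A sketch of the corresponding settings.py line (the value 3 is only an example):
DOWNLOAD_DELAY = 3   # wait about 3 seconds between requests; RANDOMIZE_DOWNLOAD_DELAY is True by default, so the real delay varies between 0.5x and 1.5x this value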
0x05 Modify beiJingSpider.py
The content we want is under the div with class=day7, so anchor on this div
# Note that the url must end with / here, otherwise the content cannot be fetched
scrapy shell https://tianqi.com/beijing/
selector = response.xpath('//div[@class="day7"]')
selector1 = selector.xpath('ul[@class="week"]/li')
beiJingSpider.py
import scrapy
from weather.items import WeatherItem

class BeijingspiderSpider(scrapy.Spider):
    name = 'beiJingSpider'
    allowed_domains = ['tianqi.com']   # allowed_domains takes bare domains, not full URLs
    start_urls = ['https://www.tianqi.com/beijing/']

    def parse(self, response):
        items = []
        city = response.xpath('//dd[@class="name"]/h2/text()').extract()
        Selector = response.xpath('//div[@class="day7"]')
        date = Selector.xpath('ul[@class="week"]/li/b/text()').extract()
        week = Selector.xpath('ul[@class="week"]/li/span/text()').extract()
        wind = Selector.xpath('ul[@class="txt"]/li/text()').extract()
        weather = Selector.xpath('ul[@class="txt txt2"]/li/text()').extract()
        temperature1 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/span/text()').extract()
        temperature2 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/b/text()').extract()
        for i in range(7):
            item = WeatherItem()
            try:
                item['cityDate'] = city[0] + date[i]   # extract() returns a list, so take the first element
                item['week'] = week[i]
                item['temperature'] = temperature1[i] + ',' + temperature2[i]
                item['weather'] = weather[i]
                item['wind'] = wind[i]
            except IndexError:
                break   # stop once any of the lists runs out of elements
            items.append(item)
        return items
0x06 Modify pipelines.py to process the spider results
import time
import codecs

class WeatherPipeline:
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())   # current date as YYYYMMDD
        fileName = today + '.txt'
        with codecs.open(fileName, 'a', 'utf-8') as fp:
            fp.write("%s \t %s \t %s \t %s \t %s \r\n"
                     % (item['cityDate'],
                        item['week'],
                        item['temperature'],
                        item['weather'],
                        item['wind']))
        return item
0x07 Modify settings.py
Find ITEM_PIPELINES and remove the comment in front of it
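After uncommenting, it should look roughly like this (the template generated by startproject already points at WeatherPipeline; 300 is the default priority):
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
}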
0x08 Crawl the content
Go back to the weather project directory and run the command
scrapy crawl beiJingSpider
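The results are appended to a date-named .txt file by WeatherPipeline. For a quick look at the items without going through the pipeline, Scrapy's built-in feed export can also dump them directly, for example:
scrapy crawl beiJingSpider -o weather.json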