1 Requirements analysis

We want a crawler that crawls the job detail pages of Lagou.com and extracts details such as the company name, job title, salary, degree requirement and job requirements. The crawler should be configurable with a search keyword and a search city, so that it can fetch details for different positions in different cities, and the crawled information should be stored in a database.


2 Target site analysis

Target site: https://www.lagou.com/. In the upper-left corner you can switch the search city, and in the center you can enter a job keyword. After selecting a city and entering a keyword, click the search button to jump to the list page of the corresponding position. Each list page holds 15 detail items (the last page may have fewer). Clicking a detail item jumps to the corresponding company's detail page, and that is where the data to be crawled lives.

Tips: the same company may have different HR accounts publish the same posting. Search for Python crawler, for example, and you will find that the company Eigen has published two identical postings, issued separately by Yunshan and Casey.


3 Process analysis

To review Scrapy and Selenium, this crawler is implemented without the requests library. The overall process:

1. Switch the city and enter the search keyword: use Selenium to drive the browser, simulate clicking the city switcher in the upper-left corner, enter the search keyword, and finally click the search button to jump to the list page of the corresponding position.

2. Parse the list pages and simulate page turning: parse the first list page to get the total number of list pages, use Selenium to simulate page turning, and collect the detail-page urls of each page while turning.

3. Parse the detail pages and extract the data: parse each company's detail page, use Scrapy's ItemLoader to extract every field, and do the corresponding data cleaning.

4. Store the data in MongoDB: save the detailed information of each company into a MongoDB database.


4 Code implementation

The whole project is organized with the Scrapy framework. The overall program flow chart:

As the flow chart shows, crawling Lagou requires no cookies; adding cookies actually gets the request identified as a crawler. The "simulated login, save cookies locally, load cookies from the local file" part of the flow chart is merely practice reinforcing https://www.cnblogs.com/strivepy/p/9233389.html.

4.1 How a Request passes through the downloader middlewares in Scrapy

The Scrapy documentation (https://doc.scrapy.org/en/master/topics/settings.html#std:setting-DOWNLOADER_MIDDLEWARES_BASE) describes DOWNLOADER_MIDDLEWARES_BASE as follows:

 {
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

The smaller the number, the closer the middleware is to the engine; the larger the number, the closer it is to the downloader.

Each middleware can implement three methods: process_request(request, spider), process_response(request, response, spider) and process_exception(request, exception, spider) (none of them is mandatory).

When a request is scheduled and travels through the downloader middlewares (assuming no exceptions), it passes through each middleware's process_request() in ascending order of the numbers above until it reaches the downloader; once the downloader obtains a response, the response passes back through each middleware's process_response() in descending order until it reaches the engine.

The middlewares customized in DOWNLOADER_MIDDLEWARES in the project's settings.py are automatically merged with DOWNLOADER_MIDDLEWARES_BASE when the project runs, producing the complete list of enabled downloader middlewares.
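As a quick illustration (a minimal sketch, not taken from this project; the class name and priority value are made up), a custom downloader middleware implements whichever of the three hooks it needs and is then registered in DOWNLOADER_MIDDLEWARES with a number that positions it among the base middlewares listed above:

 # A hypothetical skeleton, e.g. in LagouCrawler/middlewares.py
class ExampleDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Return None to let the request continue towards the downloader
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (pass it on) or a Request (re-schedule it)
        return response

    def process_exception(self, request, exception, spider):
        # Return None to let other middlewares handle the exception
        return None

# In settings.py: 543 places the middleware between UserAgentMiddleware (500)
# and RetryMiddleware (550) in the merged list
DOWNLOADER_MIDDLEWARES = {
    'LagouCrawler.middlewares.ExampleDownloaderMiddleware': 543,
}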

4.2 Switching the city and entering the search keyword

Basic idea: start_requests() issues the initial request with a flag that tells the middleware this is the original request; then, in process_request(), the middleware performs the simulated login, loads the local cookies, switches the city, enters the search keyword and clicks the search button to jump to the position list page.

4.2.1 start_requests() issues the flagged initial request

The index_flag attribute is set in meta so that the middleware can pick out the initial request; brower will receive the Chrome instance, wait the WebDriverWait instance, and pagenumber the total number of list pages.

 # Location: LagouCrawler.spider.lagoucrawler.LagouCrawlerSpder
def start_requests(self):
    base_url = 'https://www.lagou.com'
    index_flag = {'index_flag': 'fetch index page', 'brower': None, 'wait': None, 'pagenumber': None}
    yield scrapy.Request(url=base_url, callback=self.parse_index, meta=index_flag, dont_filter=True)

4.2.2 process_request() picks out the initial request

process_request() picks out the initial request and checks whether the user is already logged in. If so, it decides whether the city needs to be switched, then enters the search keyword and clicks search to jump to the list page. If not, it checks whether a local cookie file exists: if it does, the local cookies are loaded; otherwise a simulated login is performed and the cookies are saved to a local file. After that, the city switch, keyword input and search click proceed as before.

 # Location: LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware
def process_request(self, request, spider):
    """
    Core method of the middleware; every request passes through it. It distinguishes the
    initial request from the detail-page requests: for the initial request it performs the
    login / cookies handling and finally returns the index-page response; detail-page
    requests are left untouched.
    :param request:
    :param spider:
    :return:
    """
    # Pick out the initial login / city-switch index request
    if 'index_flag' in request.meta.keys():
        # Check whether we are already logged in; if not, check whether a cookies file exists
        if not self.is_logined(request, spider):
            path = os.getcwd() + '/cookies/lagou.txt'
            # If the cookies file exists, load it; otherwise log in
            if os.path.exists(path):
                self.load_cookies(path)
            else:
                # Log in to lagou.com
                self.login_lagou(spider)
        # Response body of the index page after a successful login; without logging in,
        # requests for the detail-page urls would be redirected to the login page
        response = self.fetch_index_page(request, spider)
        return response

When Chrome is driven, a city-selection window always pops up first, so it needs to be closed before the login-status element in the upper-right corner can be read to determine whether the user is logged in.

 # Location: LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware
def is_logined(self, request, spider):
    """
    On the initial request a city-selection window always pops up, so close it first,
    then decide whether the user is logged in by checking whether a username is shown
    in the upper-right corner. This also initializes the brower instance used by the
    whole program.
    :param request: the initial request, whose meta contains the index_flag attribute
    :param spider:
    :return: True if already logged in, otherwise False
    """
    self.brower.get(request.url)
    try:
        # Close the city-selection window
        box_close = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="cboxClose"]')))
        box_close.click()
        # Get the login-status element in the upper-right corner
        login_status = self.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="lg_tbar"]/div/ul/li[1]/a')))
        # If the upper-right corner still shows the login link, the user has not logged in yet
        if login_status.text == ' Sign in ':
            return False
        else:
            return True
    except TimeoutException as e:
        # On subsequent requests there is no city box, so this part needs to be redesigned
        spider.logger.info('Locate Username Element Failed:%s' % e.msg)
        return False

Load the local cookie file into the Chrome instance:

 # Location: LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware
def load_cookies(self, path):
    """
    Load the local cookies file into the browser to skip the login step
    :param path: path of the local cookies file
    :return:
    """
    with open(path, 'r') as f:
        cookies = json.loads(f.read())
        for cookie in cookies:
            cookies_dict = {'name': cookie['name'], 'value': cookie['value']}
            self.brower.add_cookie(cookies_dict)

If there is no local cookie file, log in to Lagou and save the cookies to a local file for later use:

 # Location: LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware
def login_lagou(self, spider):
    """
    Simulate the login process with selenium and save the cookies to a local file
    after a successful login.
    :param spider:
    :return:
    """
    try:
        # Wait a moment, otherwise the login element cannot be located
        time.sleep(2)
        # Click to enter the login page
        login_status = self.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="loginToolBar"]//a[@class="button bar_login passport_login_pop"]')))
        login_status.click()
        # Enter the username
        username = self.wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@data-propertyname="username"]/input')))
        username.send_keys(self.username)
        # Enter the password
        password = self.wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@data-propertyname="password"]/input')))
        password.send_keys(self.password)
        # Click the login button
        submit_button = self.wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@data-propertyname="submit"]/input')))
        submit_button.click()
        # time.sleep(1)
        # Get the cookies
        cookies = self.brower.get_cookies()
        # Save the cookies after a successful login
        self.save_cookies(cookies)
    except TimeoutException as e:
        spider.logger.info('Locate Login Element Failed: %s' % e.msg)

After a successful login, save_cookies() saves the cookies to a local file:

 # Location: LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware
@staticmethod
def save_cookies(cookies):
    """
    After a successful login, save the cookies to a local file for the next run
    :param cookies:
    :return:
    """
    path = os.getcwd() + '/cookies/'
    if not os.path.exists(path):
        os.mkdir(path)
    with open(path + 'lagou.txt', 'w') as f:
        f.write(json.dumps(cookies))

Finally, after all the cookie handling is done, switch the city, enter the search keyword and click the search button. Even with WebDriverWait, some elements still raise "element is not clickable" exceptions, so a time.sleep(1) is added in front of them:

 # Location: LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware
def fetch_index_page(self, request, spider):
    """
    Use selenium to switch the city, type the search keyword and click the search button.
    If the page does not jump after the search button is clicked, the wait on the list
    container below throws a NoSuchElementException, and load_cookies() later reports a
    "NoneType has no get_cookies()" error, because the response is empty.
    :param request:
    :param spider:
    :return:
    """
    try:
        # Check whether the city needs to be switched
        city_location = self.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="lg_tnav"]/div/div/div/strong')))
        if city_location.text != self.city:
            time.sleep(1)
            city_change = self.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="changeCity_btn"]')))
            city_change.click()
            # Locate the element of the target city and click it
            # time.sleep(1)
            city_choice = self.wait.until(EC.presence_of_element_located((By.LINK_TEXT, self.city)))
            city_choice.click()
            time.sleep(1)
        # Locate the keyword input box and type the keyword
        keywords_input = self.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="search_input"]')))
        keywords_input.send_keys(self.job_keywords)
        # time.sleep(1)
        # Locate the search button and click it; sometimes the page does not jump because the request was redirected
        keywords_submit = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="search_button"]')))
        keywords_submit.click()
        # Wait for the list-page content to load; if the request was redirected, the page
        # never appears and an exception is raised
        self.wait.until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="s_position_list"]')))
        pagenumber = self.wait.until(EC.presence_of_element_located((
            By.XPATH, '//*[@id="s_position_list"]/div[@class="item_con_pager"]/div/span[@class="pager_next "]/preceding-sibling::span[1]'
        )))
        # Total number of list pages, passed on via the response for later page turning
        request.meta['pagenumber'] = pagenumber.text
        # Pass brower and wait to the spider via the response for later page turning
        request.meta['brower'] = self.brower
        request.meta['wait'] = self.wait
        body = self.brower.page_source
        # Return the first list page; it is parsed in the parse_index callback
        response = HtmlResponse(
            url=self.brower.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )
        return response
    except TimeoutException:
        spider.logger.info('Locate Index Element Failed And Use Proxy Request Again')
        # The page was redirected instead of jumping normally, so re-issue the request
        return request

After the jump to the list page, brower (the Chrome instance), wait (the WebDriverWait instance) and pagenumber are put into request.meta, and an HtmlResponse is returned. That response goes to parse_index(), the callback specified in start_requests(); there the brower, wait and pagenumber stored in the meta are taken out of response.meta for the subsequent page-turning operations.

4.3 Parsing the list pages and simulating page turning

Once Selenium has driven Chrome to switch the city, enter the search keyword, click the search button and jump to the list page, the first list page is passed as a response to the parse_index() callback for parsing:

 # Location: LagouCrawler.spider.lagoucrawler.LagouCrawlerSpder
def parse_index(self, response):
    """
    Parse the first list page, extract the url of every posting's detail page and request
    it; then turn the pages and do the same for every list page. Note: taking the Python
    crawler positions in Hangzhou as an example (4 pages of 15 postings, 60 postings in
    total at the time of crawling), after roughly 55 detail-page requests the last 5 were
    always redirected to the page where the search keyword is first entered, even with
    DOWNLOAD_DELAY set -- the server had apparently recognized the crawler. The first
    idea was to use the middleware's process_response(): check the response status_code
    and re-issue redirected requests through a proxy. That idea was not realized (it
    requires a deeper understanding of the framework), so the dynamic proxy of Abuyun is
    used instead, and every request goes through the proxy server.
    :param response: the first list-page response produced and filtered by the middleware
    :return:
    """
    self.pagenumber = response.meta['pagenumber']
    # Initialize brower and wait on the spider
    self.brower = response.meta['brower']
    self.wait = response.meta['wait']
    # Parse the detail-page urls out of the first list page
    for url in self.parse_url(response):
        yield scrapy.Request(url=url, callback=self.parse_detail, dont_filter=True)
    # Turn the pages and parse each of them
    for pagenumber in range(2, int(self.pagenumber) + 1):
        response = self.next_page()
        for url in self.parse_url(response):
            yield scrapy.Request(url=url, callback=self.parse_detail, dont_filter=True)

parse_index() relies on two helper functions: parse_url(), which extracts the detail-item urls from a single list page, and next_page(), which simulates page turning. After the detail urls are parsed, requests are issued for them. Scrapy deduplicates request urls (RFPDupeFilter); setting dont_filter to True tells Scrapy not to apply deduplication to these urls.

parse_url() parses the detail urls and returns them as a list:

 # Location: LagouCrawler.spider.lagoucrawler.LagouCrawlerSpder
@staticmethod
def parse_url(response):
    """
    Parse the detail-page urls of all postings on one list page
    :param response: response of a list page
    :return: list of detail-page urls on that list page
    """
    url_selector = response.xpath('//*[@id="s_position_list"]/ul/li')
    url_list = []
    for selector in url_selector:
        url = selector.xpath('.//div[@class="p_top"]/a/@href').extract_first()
        url_list.append(url)
    return url_list

next_page() simulates page turning and returns the response of the next list page. The page-turning speed is throttled to 2 seconds, and there is a pitfall when locating the "next page" element, noted in the code:

 # Location: LagouCrawler.spider.lagoucrawler.LagouCrawlerSpder
def next_page(self):
    """
    Simulate page turning with selenium. Locating next_page_button with xpath took a long
    time, because the span element's class is "pager_next " -- note the trailing space
    before the closing quote!
    :return:
    """
    try:
        # The trailing space in class="pager_next " cost half a day of xpath debugging
        next_page_button = self.wait.until(EC.presence_of_element_located((
            By.XPATH, '//*[@id="s_position_list"]/div[@class="item_con_pager"]/div/span[@class="pager_next "]'
        )))
        next_page_button.click()
        self.wait.until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="s_position_list"]')))
        # Throttle the page-turning speed
        time.sleep(2)
        body = self.brower.page_source
        response = HtmlResponse(url=self.brower.current_url, body=body, encoding='utf-8')
        return response
    except TimeoutException:
        pass

After a list page is parsed, requests are issued for the extracted urls to fetch the detail pages.

4.4 Parsing the detail pages and extracting the data

The response returned for each detail-item url is parsed by parse_detail(), which uses an ItemLoader to extract and format the data:

 # Location: LagouCrawler.spider.lagoucrawler.LagouCrawlerSpder
@staticmethod
def parse_detail(response):
    """
    Parse the detail page of a single job posting
    :param response: HtmlResponse of a detail page
    :return: the recruitment details of the company
    """
    item_loader = CompanyItemLoader(item=CompanyItem(), response=response)
    item_loader.add_xpath('company_name', '//*[@id="job_company"]/dt/a/div/h2/text()')
    item_loader.add_xpath('company_location', 'string(//*[@id="job_detail"]/dd[@class="job-address clearfix"]/div[@class="work_addr"])')
    item_loader.add_xpath('company_website', '//*[@id="job_company"]/dd/ul/li[5]/a/@href')
    item_loader.add_xpath('company_figure', '//*[@id="job_company"]/dd/ul//i[@class="icon-glyph-figure"]/parent::*/text()')
    item_loader.add_xpath('company_square', '//*[@id="job_company"]/dd/ul//i[@class="icon-glyph-fourSquare"]/parent::*/text()')
    item_loader.add_xpath('company_trend', '//*[@id="job_company"]/dd/ul//i[@class="icon-glyph-trend"]/parent::*/text()')
    item_loader.add_xpath('invest_organization', '//*[@id="job_company"]/dd/ul//p[@class="financeOrg"]/text()')
    item_loader.add_xpath('job_position', '//*[@class="position-content-l"]/div[@class="job-name"]/span/text()')
    item_loader.add_xpath('job_salary', '//*[@class="position-content-l"]/dd[@class="job_request"]/p/span[@class="salary"]/text()')
    item_loader.add_xpath('work_experience', '//*[@class="position-content-l"]/dd[@class="job_request"]/p/span[3]/text()')
    item_loader.add_xpath('degree', '//*[@class="position-content-l"]/dd[@class="job_request"]/p/span[4]/text()')
    item_loader.add_xpath('job_category', '//*[@class="position-content-l"]/dd[@class="job_request"]/p/span[5]/text()')
    item_loader.add_xpath('job_lightspot', '//*[@id="job_detail"]/dd[@class="job-advantage"]/p/text()')
    item_loader.add_xpath('job_description', 'string(//*[@id="job_detail"]/dd[@class="job_bt"]/div)')
    item_loader.add_xpath('job_publisher', '//*[@id="job_detail"]//div[@class="publisher_name"]/a/span/text()')
    item_loader.add_xpath('resume_processing', 'string(//*[@id="job_detail"]//div[@class="publisher_data"]/div[2]/span[@class="tip"])')
    item_loader.add_xpath('active_time', 'string(//*[@id="job_detail"]//div[@class="publisher_data"]/div[3]/span[@class="tip"])')
    item_loader.add_xpath('publish_date', '//*[@class="position-content-l"]/dd[@class="job_request"]/p[@class="publish_time"]/text()')
    item = item_loader.load_item()
    yield item

The Item fields and the ItemLoader are defined as follows; the definitions also format the extracted fields (stripping spaces, line breaks and so on):

 # Location: LagouCrawler.items
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import datetime

from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join


def formate_date(value):
    """
    If the extracted publish time is in H:M format (published today), return today's
    date instead; if it is already in Y-M-D format, return it unchanged.
    :param value: the extracted time string
    :return: the formatted date
    """
    if ':' in value:
        now = datetime.datetime.now()
        publish_date = now.strftime('%Y-%m-%d')
        publish_date += '( today )'
        return publish_date
    else:
        return value


class CompanyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()


class CompanyItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Company name
    company_name = Field(
        input_processor=MapCompose(lambda x: x.replace(' ', ''), lambda x: x.strip())
    )
    # Company address
    company_location = Field(
        input_processor=MapCompose(lambda x: x.replace(' ', ''), lambda x: x.replace('\n', ''), lambda x: x[:-4])
    )
    # Company website
    company_website = Field()
    # Company size
    company_figure = Field(
        input_processor=MapCompose(lambda x: x.replace(' ', ''), lambda x: x.replace('\n', '')),
        output_processor=Join('')
    )
    # Company area
    company_square = Field(
        input_processor=MapCompose(lambda x: x.replace(' ', ''), lambda x: x.replace('\n', '')),
        output_processor=Join('')
    )
    # Funding stage
    company_trend = Field(
        input_processor=MapCompose(lambda x: x.replace(' ', ''), lambda x: x.replace('\n', '')),
        output_processor=Join('')
    )
    # Investment institution
    invest_organization = Field()
    # Job title
    job_position = Field()
    # Salary
    job_salary = Field(
        input_processor=MapCompose(lambda x: x.strip())
    )
    # Experience requirement
    work_experience = Field(
        input_processor=MapCompose(lambda x: x.replace(' /', ''))
    )
    # Degree requirement
    degree = Field(
        input_processor=MapCompose(lambda x: x.replace(' /', ''))
    )
    # Job type
    job_category = Field()
    # Job highlights
    job_lightspot = Field()
    # Job description
    job_description = Field(
        input_processor=MapCompose(lambda x: x.replace('\xa0\xa0\xa0\xa0', '').replace('\xa0', ''), lambda x: x.replace('\n', '').replace(' ', ''))
    )
    # Job publisher
    job_publisher = Field()
    # Publish date
    publish_date = Field(
        input_processor=MapCompose(lambda x: x.replace('\xa0 ', '').strip(), lambda x: x[:-6], formate_date)
    )
    # # Willingness to chat
    # chat_will = Field()
    # Resume processing
    resume_processing = Field(
        input_processor=MapCompose(lambda x: x.replace('\xa0', '').strip())
    )
    # Active period
    active_time = Field(
        input_processor=MapCompose(str.strip)
    )

For more on ItemLoader, see https://blog.csdn.net/zwq912318834/article/details/79530828
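As a small standalone illustration of how these processors behave (a sketch, not project code; the sample strings are invented): MapCompose applies each function in turn to every extracted value, Join concatenates the values, and TakeFirst, used above as the default output processor, returns the first non-empty value.

 from scrapy.loader.processors import MapCompose, TakeFirst, Join

clean = MapCompose(lambda x: x.replace('\n', ''), str.strip)
print(clean([' 15k-30k \n', ' Hangzhou ']))   # ['15k-30k', 'Hangzhou']
print(Join('')(['2000', ' people']))          # '2000 people'
print(TakeFirst()(['', '15k-30k']))           # '15k-30k' -- empty values are skipped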

4.5 Store data to MongoDB

An item pipeline stores the scraped data in MongoDB:

 # Location: LagouCrawler.pipelines
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from pymongo import MongoClient


class LagoucrawlerPipeline(object):

    def __init__(self, host=None, db=None, collection=None):
        self.mongo_uri = host
        self.mongo_db = db
        self.mongo_collection = collection
        self.client = None
        self.db = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        """
        Fetch the Mongodb address, database name and collection name defined in settings.py
        :param crawler:
        :return:
        """
        return cls(
            host=crawler.settings.get('MONGO_URI'),
            db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('MONGO_COLLECTION')
        )

    def open_spider(self, spider):
        """
        Initialize the mongodb connection when the spider opens.
        :param spider:
        :return:
        """
        self.client = MongoClient(host=self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.mongo_collection]

    def process_item(self, item, spider):
        """
        Core method of the pipeline: store (and otherwise process) the item.
        :param item: the item parsed by parse_detail
        :param spider: the spider that scraped the item
        :return: the processed item, for further pipelines (if any)
        """
        # Use the company name as the query condition
        condition = {'company_name': item.get('company_name')}
        # With upsert=True, an insert is performed if no matching record exists;
        # at the same time, update_one() also deduplicates the records.
        result = self.collection.update_one(condition, {'$set': item}, upsert=True)
        spider.logger.debug('The Matched Item is: {} And The Modified Item is: {}'.format(result.matched_count, result.modified_count))
        return item

    def close_spider(self, spider):
        """
        Close the mongodb connection when the spider closes.
        :param spider:
        :return:
        """
        self.client.close()

Tips: when inspecting the data in MongoDB you will find fewer documents than items crawled. The reason is that the data is inserted with MongoDB's update_one(), using the company name as the query condition, while the crawled details contain postings for the same position published by different HR accounts of the same company.
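A quick way to verify this (a sketch, assuming the MONGO_URI, MONGO_DB and MONGO_COLLECTION values from the settings shown below) is to compare the number of stored documents with the number of distinct company names; the two should be equal, because company_name is the upsert condition:

 from pymongo import MongoClient

client = MongoClient('localhost')        # MONGO_URI
collection = client['job']['works']      # MONGO_DB / MONGO_COLLECTION
total = collection.count_documents({})
companies = len(collection.distinct('company_name'))
print('documents: {}, distinct companies: {}'.format(total, companies))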

Then configure the pipeline in settings.py:

 # Location: LagouCrawler.settings
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'LagouCrawler.pipelines.LagoucrawlerPipeline': 300,
}

# Mongodb address
MONGO_URI = 'localhost'
# Mongodb database name
MONGO_DB = 'job'
# Mongodb collection name
MONGO_COLLECTION = 'works'

At this point, running the crawler should, in theory, complete the city switch, keyword input, page turning on the index pages and field extraction from the detail items. In reality it fetches some data and then starts receiving 302 redirects and assorted 40X and 50X responses, caused by Lagou's anti-crawler measures.

4.6 Coping with Lagou's IP bans

As mentioned above, once the program framework was finished it did not fetch all the expected data: after some data was retrieved the requests started being redirected, and the IP was eventually banned, so some countermeasures are needed.

4.6.1 Do not send cookies to Lagou

It is not merely that cookies are unnecessary for crawling Lagou; there must be no cookies at all, because Lagou uses them to detect crawlers. Cookies are therefore disabled in settings.py:

 # Location: LagouCrawler.settings
# Disable cookies (enabled by default)
COOKIES_ENABLED = False

4.6.2 Request headers for Lagou

The request headers need to be configured in settings.py to disguise the crawler as a browser (User-Agent is not configured here):

 # Location: LagouCrawler.settings
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Host': 'www.lagou.com',
'Referer': 'https://www.lagou.com/',
'Connection': 'keep-alive'
}

4.6.3 Configure a random User-Agent for each request

To prevent the IP from being banned because of the User-Agent, another downloader middleware uses the fake_useragent package to add a random User-Agent to every request:

 # Location: LagouCrawler.middlewares
class RandomUserAgentMiddleware(object):
    """
    Add a random User-Agent to every request
    """

    def __init__(self, ua_type=None):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = ua_type

    @classmethod
    def from_crawler(cls, crawler):
        """
        Fetch RANDOM_UA_TYPE configured in settings.py; if it is not configured,
        the default value 'random' is used
        :param crawler:
        :return:
        """
        return cls(
            ua_type=crawler.settings.get('RANDOM_UA_TYPE', 'random')
        )

    def process_request(self, request, spider):
        """
        Core method of the middleware. getattr(A, B) is equivalent to A.B, i.e. it fetches
        attribute B of object A; here it amounts to ua.random
        :param request:
        :param spider:
        :return:
        """
        request.headers.setdefault('User-Agent', getattr(self.ua, self.ua_type))
        # Do not follow redirects for any request
        request.meta['dont_redirect'] = True
        request.meta['handle_httpstatus_list'] = [301, 302]
        spider.logger.debug('The <{}> User Agent Is: {}'.format(request.url, getattr(self.ua, self.ua_type)))

Configure the downloader middlewares in settings.py; note that Scrapy's built-in UserAgentMiddleware must be disabled:

 # Location: LagouCrawler.settings
DOWNLOADER_MIDDLEWARES = {
    # Enable the Abuyun proxy middleware
    'LagouCrawler.middlewares.AbuYunProxyMiddleware': 1,
    # Enable the custom RandomUserAgentMiddleware before LagoucrawlerDownloaderMiddleware
    'LagouCrawler.middlewares.RandomUserAgentMiddleware': 542,
    # Disable the UserAgentMiddleware enabled by default by the framework
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'LagouCrawler.middlewares.LagoucrawlerDownloaderMiddleware': 543,
}

Tips: while adding the random User-Agent, each request is also told not to follow redirects. The reason is that once the proxy server is used later, a request that fails through the proxy gets redirected, and the program ends up in an endless loop on the redirect page.
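Because redirects are disabled, 301/302 responses are no longer followed and travel back up the middleware chain instead. One possible way to handle them, which was considered but not implemented in this project (the class below is only a hypothetical sketch), is a process_response() hook that re-schedules the redirected request:

 # Hypothetical middleware, not part of the project
class RedirectRetryMiddleware(object):

    def process_response(self, request, response, spider):
        # With dont_redirect set, a redirected request arrives here carrying its
        # original 301/302 status instead of being followed automatically
        if response.status in (301, 302):
            spider.logger.debug('Redirected: {}, re-scheduling'.format(request.url))
            # Returning a Request from process_response re-schedules it
            return request.replace(dont_filter=True)
        return response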

4.6.4 Send requests through the Abuyun proxy server

Compared with maintaining a proxy pool yourself, the dynamic version of the Abuyun proxy (a paid service aimed at crawler traffic), which assigns a random IP to every request, is more convenient and flexible.

Abuyun official website: https://center.abuyun.com

Abuyun integration tutorial: https://www.jianshu.com/p/90d57e7a545a?spm=a2c4e.11153940.blogcont629314.15.59f8319fWrMVQK

Another downloader middleware routes every request through the Abuyun proxy:

 # Location: LagouCrawler.middlewares
class AbuYunProxyMiddleware(object):
    """
    Route requests through the Abuyun proxy server. The dynamic-IP plan allows at most
    5 requests per second, so a download delay has to be set in settings.
    """

    def __init__(self, settings):
        self.proxy_server = settings.get('PROXY_SERVER')
        self.proxy_user = settings.get('PROXY_USER')
        self.proxy_pass = settings.get('PROXY_PASS')
        self.proxy_authorization = 'Basic ' + base64.urlsafe_b64encode(
            bytes((self.proxy_user + ':' + self.proxy_pass), 'ascii')).decode('utf8')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            settings=crawler.settings
        )

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_server
        request.headers['Proxy-Authorization'] = self.proxy_authorization
        spider.logger.debug('The {} Use AbuProxy'.format(request.url))

The proxy server address and the Abuyun username and password (about 1 yuan buys one hour; the default limit is 5 requests per second) are all configured in settings.py:

 # Location: LagouCrawler.settings
# Abuyun proxy server address
PROXY_SERVER = "http://http-dyn.abuyun.com:9020"
# Abuyun proxy tunnel credentials, obtained after registering and buying the service
PROXY_USER = 'H9470L5HEARXXXXX'
PROXY_PASS = '02E02D1D773XXXXX'
# Enable the throttling settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.2  # Initial download delay

The downloader middlewares are also configured in settings.py, as shown in 4.6.3.


5 Scrapy Breakpoint debugging

To make breakpoint debugging of the Scrapy project convenient, create a run.py file in the same directory as scrapy.cfg:

 from scrapy.cmdline import execute


def main():
    spider_name = 'lagoucrawler'
    cmd_string = 'scrapy crawl {spider_name}'.format(spider_name=spider_name)
    execute(cmd_string.split())


if __name__ == '__main__':
    main()

You can then run run.py under the debugger to set breakpoints anywhere in the Scrapy project.
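An alternative (a sketch, not used in the project) is to run the spider in-process with CrawlerProcess, which also works under a debugger:

 from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())
    process.crawl('lagoucrawler')   # spider name, as in run.py above
    process.start()                 # blocks until the crawl finishes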


6 Project code

Full code on GitHub: https://github.com/StrivePy/LaGouCrawler/tree/recode/LagouCrawler


