The first day of Python crawler learning

Entering code... 2022-09-09 01:24:02


Algorithms are too hard, so let's go straight to learning crawlers.


Goal: crawl every movie on this site and collect its title, score, categories, synopsis, cover (just a URL), and release date.

Scrape | Movie

The website is above

A crawler, simply put, fetches the pages of a website. Start with the URLs: this site has two kinds of pages, list pages and detail pages. So we need one function that fetches the HTML behind either kind of URL, plus code that parses that HTML to get the results we want.

So the first thing to do is fetch a page. Here is the code:

import logging
import requests

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')

# Page crawl method
def scrape_page(url):
    logging.info('scraping %s...', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        logging.error('get invalid status code %s while scraping %s', response.status_code, url)
    # Exception handling
    except requests.RequestException:
        # exc_info=True adds the traceback to the log output
        logging.error('error occurred while scraping %s', url, exc_info=True)

This function takes a URL and fetches its HTML with a plain GET request. If the status code is 200, it returns the HTML of that URL. Otherwise it writes an error to the log (and returns None).

Next we need functions that crawl specific pages. Define the list page first:
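One detail worth checking by hand is the exc_info=True argument to logging.error. The sketch below uses only the standard library. The logger name demo, the failing function risky, and the example.com URL are all made up for illustration. It shows that the traceback really ends up in the log output.

```python
import io
import logging

# Hypothetical logger that writes into an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
logger = logging.getLogger('demo')
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(buffer))


def risky(url):
    # Stand-in for a failing network call; no real request is made.
    raise ValueError(f'cannot fetch {url}')


def safe_call(url):
    try:
        return risky(url)
    except ValueError:
        # exc_info=True appends the current exception's traceback to the log record.
        logger.error('error occurred while scraping %s', url, exc_info=True)
        return None


result = safe_call('https://example.com/page/1')
log_text = buffer.getvalue()
print(log_text)
```

The call returns None instead of raising, which matches what scrape_page does when a request fails.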

# Crawl a list page
# BASE_URL is assumed to be defined earlier as the root URL of the site
# page: the page number of the list page
def scrape_index(page):
    index_url = f'{BASE_URL}/page/{page}'
    return scrape_page(index_url)

List page URLs follow a fixed format, so we can build the URL we need by string splicing and then call scrape_page to get the HTML of that page.

Next, parse each list page to get the detail page URLs:
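To see the splicing on its own, here is a small sketch without any network access. The root URL https://example.com and the page count of 3 are placeholders I made up, and build_index_url is a hypothetical helper name. The real BASE_URL and TOTAL_PAGE of the site are not shown in this post.

```python
# Placeholder values, assumed for illustration only.
BASE_URL = 'https://example.com'
TOTAL_PAGE = 3


def build_index_url(page):
    # Same f-string splicing as scrape_index, minus the request.
    return f'{BASE_URL}/page/{page}'


index_urls = [build_index_url(page) for page in range(1, TOTAL_PAGE + 1)]
for url in index_urls:
    print(url)
```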

import re
from urllib.parse import urljoin

# Parse a list page
def parse_index(html):
    # Example of a link to match: <a data-v-7f856186 href="/detail/1" class="name">
    pattern = re.compile('<a.*?href="(.*?)".*?class="name">')
    items = re.findall(pattern, html)  # find everything on the page that matches pattern
    if not items:
        return []
    for item in items:
        detail_url = urljoin(BASE_URL, item)  # join into a complete detail page URL
        # logging.info('get detail url %s', detail_url)
        yield detail_url

The pattern uses non-greedy wildcard matching (.*?). Open the developer tools with F12 and you can see that each detail page link sits in the href attribute of an <a> tag, so the part to extract is wrapped in parentheses as a capture group. The regular expression therefore matches those hyperlinks, findall collects every captured href, and urljoin turns each one into a complete URL. That gives us the detail page URLs we need.

Next, crawl the detail pages.

Looking at a detail page, each one holds the movie's title, score, categories, synopsis, cover (just a URL), and release date. So we first fetch the HTML and then use a regular expression to match each piece of information.
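The difference between non-greedy and greedy matching is easy to try on a small sample. This is only a sketch: SAMPLE_HTML is a made-up fragment modelled on the link shown in the code comment, and https://example.com is a placeholder root URL.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://example.com'  # placeholder, not the real site

# Made-up fragment with two detail links on one line.
SAMPLE_HTML = (
    '<a data-v-7f856186 href="/detail/1" class="name"><h2>Movie A</h2></a>'
    '<a data-v-7f856186 href="/detail/2" class="name"><h2>Movie B</h2></a>'
)

# Non-greedy: each .*? stops as early as possible, so every link is matched separately.
lazy = re.compile('<a.*?href="(.*?)".*?class="name">')
# Greedy: .* runs as far as possible, so the first link is swallowed by the match.
greedy = re.compile('<a.*href="(.*)".*class="name">')

lazy_hits = lazy.findall(SAMPLE_HTML)
greedy_hits = greedy.findall(SAMPLE_HTML)
detail_urls = [urljoin(BASE_URL, href) for href in lazy_hits]

print(lazy_hits)
print(greedy_hits)
print(detail_urls)
```

With the greedy version only one link survives, which is why the pattern in parse_index uses .*? everywhere.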

# Crawl the data of the detail page
def scrape_detail(url):
    return scrape_page(url)

def parse_detail(html):
    # re.compile turns a regular expression into a pattern object,
    # so it does not have to be written out again every time
    # Cover
    cover_pattern = re.compile('class="item.*?<img.*?src="(.*?)".*?class="cover">', re.S)
    # Name
    name_pattern = re.compile('<h2.*?>(.*?)</h2>')
    # Categories
    categories_pattern = re.compile('<button.*?category.*?<span>(.*?)</span>.*?</button>', re.S)
    # Release date (上映 is the "released" label that follows the date on the page)
    published_at_pattern = re.compile(r'(\d{4}-\d{2}-\d{2})\s?上映')
    # Synopsis of the movie
    drama_pattern = re.compile('<div.*?drama.*?>.*?<p.*?>(.*?)</p>', re.S)
    # Score
    score_pattern = re.compile('<p.*?score.*?>(.*?)</p>', re.S)
    # Now match each piece of information
    # Unless there is a special case, search is enough
    # strip removes the surrounding whitespace
    cover = re.search(cover_pattern, html).group(1).strip() if re.search(cover_pattern, html) else None
    name = re.search(name_pattern, html).group(1).strip() if re.search(name_pattern, html) else None
    # There may be several categories, so use findall, which returns a list
    categories = re.findall(categories_pattern, html) if re.findall(categories_pattern, html) else []
    published_at = re.search(published_at_pattern, html).group(1) if re.search(published_at_pattern, html) else None
    drama = re.search(drama_pattern, html).group(1).strip() if re.search(drama_pattern, html) else None
    # Note that score is a floating point number, so it has to be converted
    score = float(re.search(score_pattern, html).group(1).strip()) if re.search(score_pattern, html) else None
    # Keys: cover, name, categories, release date, synopsis, score
    return {
        '封面': cover,
        '名字': name,
        '类别': categories,
        '上映时间': published_at,
        '内容简介': drama,
        '评分': score
    }

The comments in this part are already detailed, so I won't repeat them here.

Finally, of course, the data has to be stored.

I haven't learned how to write to a database yet, so for now I just save JSON files, which can be opened with good old Notepad.
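To try the search and findall logic without hitting the site, here is a sketch on a made-up HTML fragment. SAMPLE_HTML only loosely imitates a detail page and the real markup may differ. first_group is a hypothetical helper, not part of the code above. It runs search once instead of twice. The date pattern is simplified for the demo.

```python
import re

# Made-up fragment; the real detail page markup may differ.
SAMPLE_HTML = '''
<h2 class="m-b-sm">Movie A</h2>
<div class="categories">
  <button class="category"><span>Drama</span></button>
  <button class="category"><span>Romance</span></button>
</div>
<p class="score m-t-md">
  9.5</p>
'''

name_pattern = re.compile('<h2.*?>(.*?)</h2>')
categories_pattern = re.compile('<button.*?category.*?<span>(.*?)</span>.*?</button>', re.S)
score_pattern = re.compile('<p.*?score.*?>(.*?)</p>', re.S)
date_pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')  # simplified; the sample has no date


def first_group(pattern, html):
    # Hypothetical helper: search once, return group 1 stripped, or None if nothing matched.
    match = pattern.search(html)
    return match.group(1).strip() if match else None


name = first_group(name_pattern, SAMPLE_HTML)
categories = categories_pattern.findall(SAMPLE_HTML)  # findall returns every match as a list
score_text = first_group(score_pattern, SAMPLE_HTML)
score = float(score_text) if score_text is not None else None
published_at = first_group(date_pattern, SAMPLE_HTML)

print(name, categories, score, published_at)
```

A field that is missing from the page comes back as None instead of raising an AttributeError, which is the same guard the if ... else None expressions provide.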

import json
import multiprocessing
from os import makedirs
from os.path import exists

RESULTS_DIR = 'results'
# Create the results directory if it does not exist yet
exists(RESULTS_DIR) or makedirs(RESULTS_DIR)

# ensure_ascii=False keeps Chinese characters readable in the file
# indent=2 indents nested levels by two spaces
def save_data(data):
    name = data.get('名字')
    data_path = f'{RESULTS_DIR}/{name}.json'
    with open(data_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
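The effect of ensure_ascii=False can be checked with a throwaway record. This is a sketch only: save_record is a hypothetical variant of save_data that takes the target directory as a parameter, the record is made up, and everything is written to a temporary directory so nothing is left on disk.

```python
import json
import tempfile
from pathlib import Path


def save_record(data, results_dir):
    # Hypothetical variant of save_data; the directory is passed in instead of being global.
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    data_path = results_dir / f"{data['name']}.json"
    with open(data_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data_path


# Made-up record; the key names here are for the demo only.
record = {'name': 'demo', 'categories': ['剧情'], 'score': 9.5}

with tempfile.TemporaryDirectory() as tmp:
    path = save_record(record, tmp)
    text = path.read_text(encoding='utf-8')
    loaded = json.loads(text)

# For comparison: the default ensure_ascii=True escapes non-ASCII characters.
escaped = json.dumps(record)

print(text)
print(escaped)
```

The file keeps the Chinese text as written, while the default settings replace it with \u escape sequences.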


There are two ways to write the main program. The first is the unoptimized version, which crawls one page at a time until it has the data for every movie. The second is the optimized version, which speeds things up with multiple processes: each page number is handed to a process pool so the work is spread across CPU cores. For example, on a 4-core machine, multiprocessing.Pool() by default runs 4 worker processes at the same time.


# Version 1: crawl one page at a time
# TOTAL_PAGE is assumed to be defined earlier as the number of list pages
def main():
    for page in range(1, TOTAL_PAGE + 1):
        index_html = scrape_index(page)  # HTML of the list page
        detail_urls = parse_index(index_html)  # detail page URLs
        # Go through every detail page URL, extract its data, then save it
        for detail_url in detail_urls:
            detail_html = scrape_detail(detail_url)
            data = parse_detail(detail_html)
            logging.info('get detail data %s', data)
            logging.info('saving data to json file')
            save_data(data)
            logging.info('data saved successfully')
        # logging.info('detail urls %s', list(detail_urls))

if __name__ == '__main__':
    main()


# Version 2: one list page per task in a process pool
def main(page):
    index_html = scrape_index(page)
    detail_urls = parse_index(index_html)
    for detail_url in detail_urls:
        detail_html = scrape_detail(detail_url)
        data = parse_detail(detail_html)
        logging.info('get detail data %s', data)
        logging.info('saving data to json file')
        save_data(data)
        logging.info('data saved successfully')

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pages = range(1, TOTAL_PAGE + 1)
    pool.map(main, pages)
    pool.close()
    pool.join()
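The process pool idea can also be tried in isolation, without any crawling. In this sketch fake_scrape is a made-up stand-in for main(page) and run_pool is a hypothetical wrapper. Nothing is fetched. The worker count is whatever Pool() picks by default on your machine.

```python
import multiprocessing
import os


def fake_scrape(page):
    # Stand-in for main(page): no network access, just returns a label.
    return f'page {page} handled'


def run_pool(pages):
    # Pool() with no argument uses the CPU count reported by the OS as the number of workers.
    with multiprocessing.Pool() as pool:
        # map hands one page number to each task and returns the results in input order.
        return pool.map(fake_scrape, pages)


if __name__ == '__main__':
    print('CPUs reported:', os.cpu_count())
    for line in run_pool(range(1, 4)):
        print(line)
```

The if __name__ == '__main__' guard matters here: on platforms that start workers by re-importing the script, it keeps the workers from creating pools of their own.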

That's my first crawler program.

If there is a problem with the code, let me know and we can learn together.


Copyright notice: this article was written by [Entering code...]. Please include a link to the original when reposting. Thanks.