Using Python to write a general Douban crawler with visual analysis

Big sai 2020-11-13 06:03:40


Original tech official account: bigsai. This article was released on 1024 (Programmer's Day); reply "bigsai" to the account to get the "take the architect to the next level" PDF resources. Happy holidays, may all your wishes come true. After receiving the blessing, a one-click like would be a nice way to give back, thanks!

Preface

In one of my classes, the teacher gave every group a task: introduce and work through a small module or a piece of tool knowledge. What my group drew happened to be the little topic of Python crawlers.

I thought: isn't that fairly simple? Since there probably wouldn't be enough time and energy to come up with something brand new, I decided to do something with Douban movie comments (the short reviews).

I've written about Nezha before, but today I want to write it up in full detail. This article implements crawling the (hot) short comments of any movie plus a visual analysis of them. In other words, you only need to provide the link and some basic information, and the script takes care of the rest.

Analysis

For Douban, what should we consider, and how do we analyze it? Start from the Douban movie page.

Try it first: open any movie; here we use Jiang Ziya as an example. Open it and you will find that it is not a dynamically rendered page; it uses traditional server-side rendering, so requesting the URL directly already returns the data. But flip through the comment pages and you will notice: users who are not logged in can only see the first pages, and you have to log in to reach the later ones.

So the process should be: sign in ——> crawl ——> store ——> visual analysis.
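
Before diving in, a quick check confirms that the comment page really is server-rendered. This is a minimal sketch; the subject id is the one used for Jiang Ziya later in this article, and the user-agent is the same one used throughout.

import requests

# Minimal check that the comments page is rendered on the server: the comment
# markup should appear directly in the returned HTML.
url = 'https://movie.douban.com/subject/25907124/comments?start=0&limit=20&status=P&sort=new_score'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
req = requests.get(url, headers=header)
print('comment-item' in req.text)  # True means the comments sit in the static HTML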

Here are the environment and installation requirements. The environment is Python 3, and the code runs on both Windows and Linux. If friends on Mac or Linux run into garbled-font problems, please message me privately. The pip packages used are listed below; use the Tsinghua mirror directly, otherwise downloads are very slow.

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install wordcloud -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

Sign in

Douban's login address

There is a password login option once you get in. To see what actually happens during login, opening the F12 console may not be enough; we can also use Fiddler to capture the packets.

Open the F12 console and then click login. After several attempts I found that the login interface is actually very simple.

Looking at the request parameters, it is an ordinary request with no encryption; of course you can also capture it with Fiddler. Here I ran a simple test with a wrong password. If login fails for you, try logging in and logging out manually once, then run the program again.

Write the login module code like this :

import requests
import urllib.parse

url = 'https://accounts.douban.com/j/mobile/login/basic'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
    'Origin': 'https://accounts.douban.com',
    'content-Type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'connection': 'keep-alive',
    'Host': 'accounts.douban.com'
}
data = {
    'ck': '',
    'name': '',
    'password': '',
    'remember': 'false',
    'ticket': ''
}
def login(username, password):
    global data
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)  # form-encode the login parameters
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)  # keep the session cookies
    print(cookies)
    return cookies

After this, the overall execution flow is roughly: call login() with the account and password, and carry the returned cookies on every later request.
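
A minimal usage sketch (the account, password and subject id below are placeholders; the id is the one used for Jiang Ziya later in this article):

# Log in once, then reuse the returned cookies for every later request.
cookies = login('your_account', 'your_password')
req = requests.get('https://movie.douban.com/subject/25907124/comments?start=0&limit=20&status=P&sort=new_score',
                   cookies=cookies)
print(req.status_code)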

Crawling

After logging in successfully, we can carry the login information while visiting the site and crawl whatever we need. Although the page uses traditional rendering, every time you switch comment pages there is also an Ajax request.

From this interface we can get the comment data directly, so there is no need to request the entire page and then extract that part. The URL follows the same rules as analyzed before; only the start parameter, which marks the offset of the current batch, changes, so we just need to build the URLs ourselves.

That is, keep building URLs with this logic until a request no longer returns valid data.

https://movie.douban.com/subject/25907124/comments?percent_type=&start=0& Other parameters are omitted
https://movie.douban.com/subject/25907124/comments?percent_type=&start=20& Other parameters are omitted
https://movie.douban.com/subject/25907124/comments?percent_type=&start=40& Other parameters are omitted
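
A minimal sketch of generating these paginated URLs (the extra parameters follow the examples above; only start changes):

# Build the paginated comment URLs by stepping `start` in increments of 20.
base = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start={}&limit=20&status=P&sort=new_score'
for page in range(3):
    print(base.format(page * 20))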

How do we extract the information after visiting each URL?
We filter the data with CSS selectors, because every comment uses the same style and sits in the HTML like an item in a list.

Looking at the data returned by the Ajax interface we just found, it is exactly the comment block we want, so we can simply select by class to get the comment nodes as a group and work on them one by one.

For the concrete implementation, we use requests to send the request and fetch the result, and BeautifulSoup to parse the returned HTML.
The data we need is then easy to pick out of the corresponding parts.

The implementation code is :

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&limit=20&status=P&sort=new_score&comments_only=1&ck=C7di'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
req = requests.get(url, headers=header, verify=False)
res = req.json()   # the interface returns JSON
res = res['html']  # the comment markup sits under the 'html' key
soup = BeautifulSoup(res, 'lxml')
node = soup.select('.comment-item')  # one node per comment
for va in node:
    name = va.a.get('title')  # reviewer name
    star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]  # rating digit
    comment = va.select_one('.short').text  # comment text
    votes = va.select_one('.votes').text    # number of upvotes
    print(name, star, votes, comment)
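
The star extraction above relies on the rating span carrying a class such as allstar40 (40 meaning 4 stars); the second-to-last character of that class name is the star digit. A minimal illustration (the class name here is just an example):

cls = 'allstar40'   # example rating class as it appears on the rating span
print(cls[-2])      # '4'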

A test run prints one line per comment: the reviewer, the star rating, the number of upvotes and the comment text.

Store

After crawling the data we should consider storage; here we store the data in an Excel (.xls) file.

We use xlwt to write data into the Excel file. A basic xlwt example:

import xlwt
# Create a writable workbook object
workbook = xlwt.Workbook(encoding='utf-8')
# Create a sheet
worksheet = workbook.add_sheet('sheet1')
# Write into the table: first parameter row, second parameter column, third parameter content
worksheet.write(0, 0, 'bigsai')
# Save the table as test.xls (xlwt writes the old .xls format)
workbook.save('test.xls')

We use xlrd to read the Excel file back. A basic xlrd example:

import xlrd
# Open the file named test.xls
workbook = xlrd.open_workbook('test.xls')
# Get the first sheet
table = workbook.sheets()[0]
# Each row comes back as a list of cell values
nrows = table.nrows
for i in range(nrows):
    print(table.row_values(i))  # print every row

At this point we have the login module + crawl module + storage module and can save the data locally. The integrated code is:

import requests
from bs4 import BeautifulSoup
import urllib.parse
import xlwt
import xlrd

# Log in with the account and password, return the session cookies
def login(username, password):
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
        'Origin': 'https://accounts.douban.com',
        'content-Type': 'application/x-www-form-urlencoded',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Host': 'accounts.douban.com'
    }
    # Parameters needed for login
    data = {
        'ck': '',
        'name': '',
        'password': '',
        'remember': 'false',
        'ticket': ''
    }
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies

def getcomment(cookies, mvid):  # cookies from a successful login (the backend identifies the user by them); mvid is the movie id
    start = 0
    w = xlwt.Workbook(encoding='ascii')  # create a writable workbook object
    ws = w.add_sheet('sheet1')           # create a sheet
    index = 1  # the row number to write to in the xls file
    while True:
        # Pretend to be a browser
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        }
        # try/except: keep going while requests succeed, stop on the first error
        try:
            # Build the url; start grows by 20 each time
            url = 'https://movie.douban.com/subject/' + str(mvid) + '/comments?start=' + str(
                start) + '&limit=20&sort=new_score&status=P&comments_only=1'
            start += 20
            # Send the request
            req = requests.get(url, cookies=cookies, headers=header)
            # The response is a JSON string; req.json() parses it
            res = req.json()
            res = res['html']  # the data we need sits under the 'html' key
            soup = BeautifulSoup(res, 'lxml')    # parse this html fragment with BeautifulSoup
            node = soup.select('.comment-item')  # every comment has class comment-item, 20 per url
            for va in node:  # loop over the comments
                name = va.a.get('title')  # reviewer name
                star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]  # star rating
                votes = va.select_one('.votes').text    # number of upvotes
                comment = va.select_one('.short').text  # comment text
                print(name, star, votes, comment)
                ws.write(index, 0, index)    # row index, column 0: running index
                ws.write(index, 1, name)     # column 1: reviewer
                ws.write(index, 2, star)     # column 2: star rating
                ws.write(index, 3, votes)    # column 3: number of upvotes
                ws.write(index, 4, comment)  # column 4: comment text
                index += 1
        except Exception as e:  # exit on any exception
            print(e)
            break
    w.save('test.xls')  # save as test.xls

if __name__ == '__main__':
    username = input('Enter account: ')
    password = input('Enter password: ')
    cookies = login(username, password)
    mvid = input('Movie id: ')
    getcomment(cookies, mvid)

After execution, the data is successfully stored in test.xls.

Visual analysis

We need to count the scores and do word-frequency statistics, and also generate a word-cloud display. The corresponding libraries are matplotlib and WordCloud.

The implementation logic: read the xls file, use word segmentation to count word frequencies, turn the most frequent words into a bar chart and a word cloud, and draw a pie chart of the star ratings. The main code is commented. A minimal sketch of the word-frequency step is shown right below, followed by the full script.
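
The core of the word-frequency step is jieba segmentation plus collections.Counter; a minimal sketch (the sample sentence is made up for illustration):

import jieba
from collections import Counter

c = Counter()
for word in jieba.cut('这部电影的画面很好，但是故事有点平'):  # segment one sample comment
    if len(word) > 1:  # skip single characters and punctuation
        c[word] += 1
print(c.most_common(5))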

The full code is:

import matplotlib.pyplot as plt
import matplotlib
import jieba
import jieba.analyse
import xlwt
import xlrd
from wordcloud import WordCloud
import numpy as np
from collections import Counter

# Set the font; some Linux systems have trouble showing Chinese labels otherwise
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False

# comment holds the review rows, e.g. [['1', 'name', 'star', 'votes', 'comment text'], ...]
def anylasescore(comment):
    score = [0, 0, 0, 0, 0, 0]  # score[i] counts how many reviews gave i stars (0-5)
    count = 0                   # total number of ratings
    for va in comment:          # each row looks like ['1', 'name', 'star', 'votes', 'comment text']
        try:
            score[int(va[2])] += 1  # the 3rd column is the star rating, cast to int
            count += 1
        except Exception as e:
            continue
    print(score)
    label = '1 star', '2 stars', '3 stars', '4 stars', '5 stars'
    color = 'blue', 'orange', 'yellow', 'green', 'red'  # one color per slice
    size = [0, 0, 0, 0, 0]     # percentages, they add up to 100
    explode = [0, 0, 0, 0, 0]  # how far each slice sits from the center
    for i in range(1, 6):      # map the 1-5 star counts onto the five slices
        size[i - 1] = score[i] * 100 / count
        explode[i - 1] = score[i] / count / 10
    pie = plt.pie(size, colors=color, explode=explode, labels=label, shadow=True, autopct='%1.1f%%')
    for font in pie[1]:
        font.set_size(8)
    for digit in pie[2]:
        digit.set_size(8)
    plt.axis('equal')  # equal axes so the pie is a circle
    plt.title(u'Percentage of each score', fontsize=12)  # title
    plt.legend(loc=0, bbox_to_anchor=(0.82, 1))  # legend
    # Set the legend font size
    leg = plt.gca().get_legend()
    ltext = leg.get_texts()
    plt.setp(ltext, fontsize=6)
    plt.savefig("score.png")
    # Show the figure
    plt.show()
def getzhifang(map):  # bar chart: needs x and y coordinates
    x = []
    y = []
    for k, v in map.most_common(15):  # take the 15 most frequent words
        x.append(k)
        y.append(v)
    Xi = np.array(x)  # convert to numpy arrays
    Yi = np.array(y)
    width = 0.6
    plt.rcParams['font.sans-serif'] = ['SimHei']  # show Chinese labels correctly
    plt.figure(figsize=(8, 6))  # image ratio 8:6
    plt.bar(Xi, Yi, width, color='blue', label='Popular word frequency', alpha=0.8)
    plt.xlabel("word")
    plt.ylabel("frequency")
    plt.savefig('zhifang.png')
    plt.show()
    return
def getciyun_most(map):  # build the word clouds
    # x holds the words, y their counts
    x = []
    y = []
    for k, v in map.most_common(300):  # the 300 most common words
        x.append(k)
        y.append(v)
    xi = x[0:150]       # take the first 150 words
    xi = ' '.join(xi)   # join with spaces, the format WordCloud expects
    print(xi)
    # backgroud_Image = plt.imread('')  # for a custom word cloud shape
    # Word cloud size, font and other basic settings
    wc = WordCloud(background_color="white",
                   width=1500, height=1200,
                   # min_font_size=40,
                   # mask=backgroud_Image,
                   font_path="simhei.ttf",
                   max_font_size=150,  # maximum font size
                   random_state=50,    # number of random color schemes
                   )
    # font_path is a common pitfall: without it Chinese words show up as small boxes
    my_wordcloud = wc.generate(xi)   # generate the cloud from the first 150 words
    plt.imshow(my_wordcloud)         # display
    my_wordcloud.to_file("img.jpg")  # save
    xi = ' '.join(x[150:300])        # then the next 150 words into a second cloud
    my_wordcloud = wc.generate(xi)
    my_wordcloud.to_file("img2.jpg")
    plt.axis("off")
def anylaseword(comment):
    # Filter words: common Chinese words that carry little meaning in a review
    stopwords = ['这', '一个', '不少', '起来', '没有', '就是', '不是', '那个', '还是', '剧情', '这样', '那样', '这种', '那种', '故事', '人物', '什么']
    print(stopwords)
    commnetstr = ''  # all comment text joined together
    c = Counter()    # counter used as a word -> count dictionary
    index = 0
    for va in comment:
        seg_list = jieba.cut(va[4], cut_all=False)  # jieba word segmentation on the comment text
        index += 1
        for x in seg_list:
            if len(x) > 1 and x != '\r\n':  # skip single characters and line breaks
                try:
                    c[x] += 1  # one more occurrence of this word
                except:
                    continue
        commnetstr += va[4]
    for (k, v) in c.most_common():  # drop words seen fewer than 5 times or in the filter list
        if v < 5 or k in stopwords:
            c.pop(k)
            continue
        # print(k, v)
    print(len(c), c)
    getzhifang(c)     # draw the bar chart from this data
    getciyun_most(c)  # and the word clouds
    # print(commnetstr)

def anylase():
    data = xlrd.open_workbook('test.xls')  # open the xls file
    table = data.sheets()[0]               # the first sheet
    nrows = table.nrows                    # number of rows
    comment = []
    for i in range(nrows):
        comment.append(table.row_values(i))  # collect every row
    # print(comment)
    anylasescore(comment)
    anylaseword(comment)

if __name__ == '__main__':
    anylase()

Let's take a look at the results.

Here I picked some data from Jiang Ziya and Spirited Away. Comparing the two films' score distributions, we can see that Spirited Away gets higher praise; most people are willing to give it five stars, and it is basically one of the best animations.

Looking at the bar chart, it is obvious that the creator of Spirited Away is more famous and more influential, so many people mention him.

As for the two word clouds: Hayao Miyazaki, the White Dragon, the granny... it really brings back memories. Okay, that's it; anything you want to say is welcome in the discussion!

If this feels helpful, a thumbs-up and a one-click triple would be appreciated. Original official account: bigsai, sharing knowledge and practical content!

Copyright notice
This article was created by [Big sai]. Please include a link to the original when reprinting. Thanks.
