In one of my classes, the teacher gave each group a task: introduce and demonstrate a small module or tool. My group drew a little topic on Python crawlers.
I figured that couldn't be too hard. Since I didn't have much spare time and energy for something brand new, I decided to do something around Douban movie reviews (short comments).
I've crawled Nezha's reviews before, but today I want to write it up in full, hand-holding detail. This article implements crawling the popular short comments of any movie and visualizing the results: you only need to provide a link and a little basic information, and it does the analysis for you.
So for Douban, what do we need to consider, and how do we analyze it? Let's start from a Douban movie page.
Try it first: open any movie; here I use Jiang Ziya as an example. Open it and you will find the page is not dynamically rendered but traditionally rendered, so requesting the URL directly gets you the data. But flip through the pages and you will notice: users who are not logged in can only access the first few pages of comments; you have to log in to reach the later pages.
So the process is: log in → crawl → store → visualize and analyze.
Here is the environment and what needs installing. The environment is Python 3; the code runs on both Windows and Linux. If you are on Mac or Linux and run into garbled fonts, feel free to message me. The pip packages used are listed below; direct downloads are very slow, so the commands go through the Tsinghua mirror:
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install wordcloud -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple
The site presents a username/password login form. To see what happens during login, opening the F12 console alone is not enough; we also capture packets with Fiddler.
Open the F12 console, click login, and after a few attempts you will find the login endpoint is actually quite simple:
Looking at the request parameters, it is a plain request with no encryption, and of course you can confirm this by capturing it with Fiddler. Here I ran a quick test using a wrong password. If login fails for you, try logging in and out manually in the browser, then run the program again.
The login module can be written like this:
import requests
import urllib.parse

url = 'https://accounts.douban.com/j/mobile/login/basic'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
    'Origin': 'https://accounts.douban.com',
    'content-Type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'connection': 'keep-alive',
    'Host': 'accounts.douban.com'
}
data = {
    'ck': '',
    'name': '',
    'password': '',
    'remember': 'false',
    'ticket': ''
}

def login(username, password):
    global data
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)  # encode the form fields
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    # the session cookies returned on success identify the logged-in user
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies
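A quick usage sketch (the credentials are placeholders; the follow-up request reuses only the user-agent, since the Host header above is specific to accounts.douban.com):

# Hypothetical usage: log in once, then carry the returned cookies
# on every later request to movie.douban.com.
cookies = login('your_account', 'your_password')
resp = requests.get('https://movie.douban.com/subject/25907124/comments',
                    cookies=cookies, headers={'user-agent': header['user-agent']})
print(resp.status_code)  # 200 means the cookies were accepted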
With that done, the overall execution flow looks roughly like this:
After a successful login, we can carry the login cookies with each request and crawl whatever we like. Although the page itself uses traditional rendering, each page switch fires an ajax request, and that interface returns the comment data directly, so there is no need to request the whole page and extract this part from it. The URL follows the pattern from the earlier analysis: only the start parameter, which marks the offset of the current batch, changes. So we just keep assembling URLs, incrementing start, until a request stops working.
https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&...  (other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=20&...  (other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=40&...  (other parameters omitted)
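A minimal sketch of that assembly, using the movie id 25907124 and page size 20 from the URLs above:

# Build the paginated comment URLs: only `start` changes, in steps of 20.
base = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start={}'
for page in range(3):
    print(base.format(page * 20))
# .../comments?percent_type=&start=0
# .../comments?percent_type=&start=20
# .../comments?percent_type=&start=40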
After visiting each URL, how do we extract the information?
We filter the data with CSS selectors: every comment uses the same style and sits in the HTML like an item in a list. Looking at the data the ajax interface just returned, the comments are exactly the highlighted block below, so we can simply select them in groups by class and process each one.
For the concrete implementation, we use requests to send the request and BeautifulSoup to parse the returned HTML; the fields we need are then easy to pick out of the corresponding elements.
The implementation code is:
import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&limit=20&status=P&sort=new_score&comments_only=1&ck=C7di'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
req = requests.get(url, headers=header, verify=False)
res = req.json()  # the endpoint returns JSON
res = res['html']  # the comment markup sits under the `html` key
soup = BeautifulSoup(res, 'lxml')
node = soup.select('.comment-item')  # one element per comment
for va in node:
    name = va.a.get('title')  # reviewer name
    star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]  # star rating
    comment = va.select_one('.short').text  # comment text
    votes = va.select_one('.votes').text  # upvote count
    print(name, star, votes, comment)
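A note on the slightly cryptic [-2]: the rating span carries a class like allstar40, so its second-to-last character is the star count. A self-contained sketch (the HTML fragment here is my simplified assumption of Douban's markup, not captured from the site):

from bs4 import BeautifulSoup

# Simplified stand-in for one comment block (assumed markup).
html = '''<div class="comment-item">
  <a title="some_user"></a>
  <span class="comment-info">
    <a>some_user</a>
    <span>watched</span>
    <span class="allstar40 rating" title="recommended"></span>
    <span class="comment-time">2020-10-01</span>
  </span>
</div>'''
item = BeautifulSoup(html, 'lxml').select_one('.comment-item')
star_class = item.select_one('.comment-info').select('span')[1].get('class')[0]
print(star_class)      # allstar40
print(star_class[-2])  # '4' -> the star count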
The result of this test run is:
With the data crawled, we need to think about storage; here we store it in an Excel (.xls) file.
We use xlwt to write data into the Excel file. A basic xlwt example:
import xlwt

# create a writable workbook object
workbook = xlwt.Workbook(encoding='utf-8')
# create a worksheet
worksheet = workbook.add_sheet('sheet1')
# write to the sheet: first parameter is the row, second the column, third the content
worksheet.write(0, 0, 'bigsai')
# save the workbook as test.xls (xlwt writes the old .xls format, not .xlsx)
workbook.save('test.xls')
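Writing several rows is just the same call in a loop; a minimal sketch with made-up rows:

import xlwt

workbook = xlwt.Workbook(encoding='utf-8')
worksheet = workbook.add_sheet('sheet1')
rows = [('Alice', 5), ('Bob', 3)]  # made-up sample rows
for i, (name, star) in enumerate(rows):
    worksheet.write(i, 0, name)  # row i, column 0: name
    worksheet.write(i, 1, star)  # row i, column 1: star rating
workbook.save('sample.xls')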
We use xlrd to read the Excel file back. A basic xlrd example:
import xlrd

# open the file named test.xls
workbook = xlrd.open_workbook('test.xls')
# get the first sheet
table = workbook.sheets()[0]
# each row comes back as a list of cell values
nrows = table.nrows
for i in range(nrows):
    print(table.row_values(i))  # print each row
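Besides whole rows, xlrd can read single cells with cell_value(row, col), which is handy when you only need one column. A small sketch, assuming the test.xls from above exists:

import xlrd

workbook = xlrd.open_workbook('test.xls')
table = workbook.sheets()[0]
# cell_value(row, col) reads one cell; here, column 0 of every row
for i in range(table.nrows):
    print(table.cell_value(i, 0))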
At this point we have a login module, a crawl module, and a storage module, and can save the data locally. The integrated code is:
import requests
from bs4 import BeautifulSoup
import urllib.parse
import xlwt
import xlrd

# log in with account and password, returning the session cookies
def login(username, password):
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
        'Origin': 'https://accounts.douban.com',
        'content-Type': 'application/x-www-form-urlencoded',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Host': 'accounts.douban.com'
    }
    # the form fields the login endpoint expects
    data = {
        'ck': '',
        'name': '',
        'password': '',
        'remember': 'false',
        'ticket': ''
    }
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies

def getcomment(cookies, mvid):  # parameters: cookies from a successful login (the backend identifies the user by them) and the movie's id
    start = 0
    w = xlwt.Workbook(encoding='utf-8')  # create a writable workbook object
    ws = w.add_sheet('sheet1')  # create a worksheet
    index = 1  # the row to write next in the xls file
    while True:
        # headers to mimic a browser
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        }
        # try/except: once a request errors out we are done; keep going while there are no errors
        try:
            # assemble the url, bumping start by 20 each round
            url = 'https://movie.douban.com/subject/' + str(mvid) + '/comments?start=' + str(
                start) + '&limit=20&sort=new_score&status=P&comments_only=1'
            start += 20
            # send the request
            req = requests.get(url, cookies=cookies, headers=header)
            # the response is a JSON string; req.json() parses it
            res = req.json()
            res = res['html']  # the data we need sits under the `html` key
            soup = BeautifulSoup(res, 'lxml')  # build a BeautifulSoup object from this html to extract information
            node = soup.select('.comment-item')  # every comment has class comment-item; each url yields 20 of them
            if not node:  # no comments left: stop paging
                break
            for va in node:  # iterate over the comments
                name = va.a.get('title')  # reviewer name
                star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]  # star rating
                votes = va.select_one('.votes').text  # upvote count
                comment = va.select_one('.short').text  # comment text
                print(name, star, votes, comment)
                ws.write(index, 0, index)    # row index, column 0: the index
                ws.write(index, 1, name)     # row index, column 1: reviewer
                ws.write(index, 2, star)     # row index, column 2: star rating
                ws.write(index, 3, votes)    # row index, column 3: upvotes
                ws.write(index, 4, comment)  # row index, column 4: comment text
                index += 1
        except Exception as e:  # exit on any error
            print(e)
            break
    w.save('test.xls')  # save as test.xls

if __name__ == '__main__':
    username = input('Enter account: ')
    password = input('Enter password: ')
    cookies = login(username, password)
    mvid = input('Movie id: ')
    getcomment(cookies, mvid)
After execution, the data is stored successfully:
Next we want to analyze the score distribution and the word frequencies, and generate word clouds; this is where the matplotlib and WordCloud libraries come in.
The logic of the implementation: read the xls file, segment the comment text and count word frequencies, plot the most frequent words as a bar chart and as word clouds, and show the star-rating distribution as a pie chart. The main steps are commented in the code:
import matplotlib.pyplot as plt
import matplotlib
import jieba
import jieba.analyse
import xlwt
import xlrd
from wordcloud import WordCloud
import numpy as np
from collections import Counter

# set the font; some Linux setups have trouble rendering Chinese otherwise
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False

# `comment` holds the review rows, e.g.
# [ ['1', 'name', 'star', 'votes', 'comment text'], ['2', 'name', 'star', 'votes', 'comment text'] ]
def anylasescore(comment):
    score = [0, 0, 0, 0, 0, 0]  # occurrence counts for ratings 0 1 2 3 4 5
    count = 0  # total number of ratings
    for va in comment:  # each row looks like ['1', 'name', 'star', 'votes', 'comment text']
        try:
            score[int(va[2])] += 1  # column 3 is the star rating; cast it to int
            count += 1
        except Exception as e:
            continue
    print(score)
    label = '1 star', '2 stars', '3 stars', '4 stars', '5 stars'
    color = 'blue', 'orange', 'yellow', 'green', 'red'
    size = [0, 0, 0, 0, 0]  # percentages; they sum to 100
    explode = [0, 0, 0, 0, 0]  # explode: how far each slice sits from the center
    for i in range(5):  # ratings 1-5 map to slices 0-4
        size[i] = score[i + 1] * 100 / count
        explode[i] = score[i + 1] / count / 10
    pie = plt.pie(size, colors=color, explode=explode, labels=label, shadow=True, autopct='%1.1f%%')
    for font in pie[1]:
        font.set_size(8)
    for digit in pie[2]:
        digit.set_size(8)
    plt.axis('equal')  # keep the pie circular
    plt.title(u'Percentage of each rating', fontsize=12)  # title
    plt.legend(loc=0, bbox_to_anchor=(0.82, 1))  # legend
    # set the legend font size
    leg = plt.gca().get_legend()
    ltext = leg.get_texts()
    plt.setp(ltext, fontsize=6)
    plt.savefig("score.png")
    # display the chart
    plt.show()
def getzhifang(counter):  # bar chart: needs x and y coordinates
    x = []
    y = []
    for k, v in counter.most_common(15):  # take the 15 most frequent words
        x.append(k)
        y.append(v)
    Xi = np.array(x)  # convert to numpy arrays
    Yi = np.array(y)
    width = 0.6
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.figure(figsize=(8, 6))  # image aspect ratio 8:6
    plt.bar(Xi, Yi, width, color='blue', label='Top word frequencies', alpha=0.8)
    plt.xlabel("Word")
    plt.ylabel("Frequency")
    plt.savefig('zhifang.png')
    plt.show()
    return
def getciyun_most(counter):  # generate the word clouds
    # words paired with their occurrence counts
    x = []
    y = []
    for k, v in counter.most_common(300):  # the 300 most common words
        x.append(k)
        y.append(v)
    xi = x[0:150]  # take the first 150
    xi = ' '.join(xi)  # join with spaces, the format WordCloud expects
    print(xi)
    # backgroud_Image = plt.imread('')  # only needed for a custom-shaped word cloud
    # basic settings: size, font, and so on
    wc = WordCloud(background_color="white",
                   width=1500, height=1200,
                   # min_font_size=40,
                   # mask=backgroud_Image,
                   font_path="simhei.ttf",
                   max_font_size=150,  # maximum font size
                   random_state=50,  # number of random color schemes
                   )
    # font_path is a pitfall: be sure to set it to a Chinese font such as simhei.ttf,
    # otherwise the words render as a pile of little boxes
    my_wordcloud = wc.generate(xi)  # feed the first 150 words into the cloud
    plt.imshow(my_wordcloud)  # display
    my_wordcloud.to_file("img.jpg")  # save
    xi = ' '.join(x[150:300])  # take the next 150 words and save a second cloud
    my_wordcloud = wc.generate(xi)
    my_wordcloud.to_file("img2.jpg")
    plt.axis("off")
def anylaseword(comment):
    # filter words: these carry no meaning here and need to be dropped
    stop_words = ['这个', '一个', '不少', '起来', '没有', '就是', '不是', '那个', '还是', '剧情', '这样', '那样', '这种', '那种', '故事', '人物', '什么']
    print(stop_words)
    commnetstr = ''  # all comment text joined together
    c = Counter()  # a Python container for counting hashable items
    index = 0
    for va in comment:
        seg_list = jieba.cut(va[4], cut_all=False)  # jieba word segmentation
        index += 1
        for x in seg_list:
            if len(x) > 1 and x != '\r\n':  # not a single character and not a special symbol
                try:
                    c[x] += 1  # bump this word's count
                except:
                    continue
        commnetstr += va[4]
    for (k, v) in c.most_common():  # drop words that occur fewer than 5 times or sit in the filter list
        if v < 5 or k in stop_words:
            c.pop(k)
            continue
        # print(k, v)
    print(len(c), c)
    getzhifang(c)  # draw the bar chart from these counts
    getciyun_most(c)  # and the word clouds
    # print(commnetstr)
def anylase():
    data = xlrd.open_workbook('test.xls')  # open the xls file
    table = data.sheets()[0]  # open the first sheet
    nrows = table.nrows  # number of rows
    comment = []
    for i in range(nrows):
        comment.append(table.row_values(i))  # collect each row
    # print(comment)
    anylasescore(comment)
    anylaseword(comment)

if __name__ == '__main__':
    anylase()
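If jieba segmentation is new to you, here is a minimal standalone sketch of what jieba.cut plus Counter does to one made-up comment sentence:

import jieba
from collections import Counter

c = Counter()
# a made-up review sentence: "the film's story and characters really moved me"
for word in jieba.cut('这部电影的故事和人物都很打动我', cut_all=False):
    if len(word) > 1:  # keep multi-character tokens only, as in anylaseword
        c[word] += 1
print(c.most_common())  # each multi-character token with its count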
Let's take a look at the results.
Here I picked data from Jiang Ziya and Spirited Away; the two films' rating distributions compare like this:
The scores show that Spirited Away is far better received, with most people happy to give it five stars; it is widely regarded as one of the best animated films. Now the bar charts:
It is obvious that Spirited Away's director, Hayao Miyazaki, is more famous and influential, which is why so many people mention him. Finally, the two pairs of word clouds:
Hayao Miyazaki, the White Dragon, Granny... it really is full of memories. Okay, I'll stop here; anything you want to add is welcome in the discussion!