Logging in to Douban with Python and crawling movie reviews

Pig brother 66 2020-11-13 07:33:01

Last time we covered the basics of cookies and learned that cookies were created to make the web interactive. They are mainly used for the following three purposes:

  1. Session state management (such as user login status, shopping carts, game scores, or other information that needs to be recorded)
  2. Personalization (such as user-defined settings and themes)
  3. Browser behavior tracking (such as tracking and analyzing user behavior)

Today we will use the requests library to log in to Douban and then crawl movie reviews as an example,
using code to explain the session state management (login) function of cookies.

This tutorial is for learning purposes only, not for commercial profit! If it infringes on the interests of any company, please let me know and it will be removed!

One 、 Background

Previously, Brother Pig took you through crawling Youku's bullet comments (danmaku) and generating a word cloud from them. We found that the quality of Youku's bullet comments is not high: they contain many filler and otherwise meaningless words, such as "haha", "ah", "these", "those"... Douban, on the other hand, has always had a good reputation, and some of its book and movie reviews are very good. So today we will crawl Douban's movie reviews, generate a word cloud, and see how it turns out!

Two 、 Function description

We log in to Douban with requests, crawl the movie reviews, and finally generate a word cloud!

Why did our previous cases (JD.COM, Youku, etc.) not require a login, while crawling Douban today does? Because Douban only lets you view the first 200 movie reviews without logging in; beyond that you must log in. This is also an anti-crawling measure!

Three 、 Technical solution

Let's look at a simple technical solution, which can be roughly divided into three parts:

  1. Analyze Douban's login interface and use the requests library to log in and save the cookie
  2. Analyze Douban's movie review interface and crawl the data in batches
  3. Analyze the movie review data with a word cloud

With the plan settled, let's get hands-on!

Four 、 Logging in to Douban

Before writing any crawler code, we start in the browser and use the debug window to find the URLs involved.

1. Analyze the Douban login interface

Open the login page, open the debug window, enter your username and password, and click Log in.
Brother Pig suggests entering a wrong password here, so that the page does not redirect and make the request impossible to capture! From the captured request we can read off the login request URL.

Because it is a POST request, we also need to see the parameters carried with the login request, so we scroll down in the debug window to Form Data.

2. Implement the Douban login in code

Once we have the login request URL and its parameters, we can write a login function with the requests library!
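
The original code screenshot is not preserved, so here is a minimal sketch of what such a login function might look like. The login URL and the form field names are assumptions for illustration only; replace them with the URL and Form Data that your own debug window shows.

```python
import requests

# Assumed values for illustration: copy the real URL and the Form Data
# field names from the browser's debug window.
LOGIN_URL = "https://accounts.douban.com/j/mobile/login/basic"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_login_form(username, password):
    """Build the POST form body; the field names are assumptions."""
    return {"name": username, "password": password, "remember": "false"}


def login(username, password):
    """Send the login POST request and return the response."""
    response = requests.post(
        LOGIN_URL,
        data=build_login_form(username, password),
        headers=HEADERS,
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response
```

Passing a dict through `data=` makes requests send it as a form-encoded body, which matches what the browser shows under Form Data.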

3. Save session state

Last time, when we crawled Youku's bullet comments, we copied the Cookie from the browser into the request headers to keep the session state. But how can we make the code save the Cookie automatically?

You may have seen or used the urllib library, which saves cookies as follows:

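import http.cookiejar
import urllib.request
cookie = http.cookiejar.CookieJar()  # holds the cookies received from the server
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)  # requests sent via opener carry the cookies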

But when we introduced the requests library earlier, I said:

requests is a third-party HTTP library built on urllib3, characterized by powerful features and an elegant API. The official Python documentation also recommends requests for HTTP client work, and in practice requests is the more widely used library.

So let's see how the requests library elegantly saves cookies for us automatically. We make a small change to the code so that it saves cookies automatically and maintains the session state!
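
As a rough sketch of the adjusted code (the screenshot is not preserved): the login URL and form field names are again assumptions, and the only real change from a plain `requests.post` call is that the request now goes through a shared Session object named `s`.

```python
import requests

# One Session for the whole crawl: it stores the cookies set by the
# server and sends them back automatically on later requests.
s = requests.Session()
s.headers.update({"User-Agent": "Mozilla/5.0"})

# Assumed value for illustration; take the real one from the debug window.
LOGIN_URL = "https://accounts.douban.com/j/mobile/login/basic"


def login(username, password):
    """Log in through the shared Session so that its cookies are kept."""
    # The form field names below are assumptions.
    data = {"name": username, "password": password, "remember": "false"}
    response = s.post(LOGIN_URL, data=data, timeout=10)
    response.raise_for_status()
    return response
```
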
In the code above, we made two changes:

  1. Added a line at the top, s = requests.Session(), which creates a Session object to hold the cookies
  2. The request is no longer sent through the requests module itself, but through the Session object

As we can see, the object that sends the request is now the Session object. It sends requests the same way as the requests module does, but it carries the cookies along on every request, so from now on we will use the Session object for all requests!

4. Is this Session object the session we usually talk about?

Some readers may ask: is the requests.Session object the session we usually talk about?

The answer is, of course, no. The session we usually talk about is stored on the server, while a requests.Session object is just a client-side object that holds cookies. The introduction in its source code says as much: it provides cookie persistence, connection pooling, and configuration.
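
To make the point concrete, here is a small offline sketch. The cookie name and value are made up for illustration. It puts a cookie into the Session's cookie jar by hand and then prepares a request without sending it, to show that the cookie lives on the client side and is simply attached to outgoing requests.

```python
import requests

s = requests.Session()

# The Session keeps its cookies in a client-side cookie jar.
# Here we add one by hand (a made-up cookie, for illustration only).
s.cookies.set("demo", "1", domain="movie.douban.com", path="/")

# Prepare (but do not send) a request to see what the Session would attach.
request = requests.Request("GET", "https://movie.douban.com/")
prepared = s.prepare_request(request)
print(prepared.headers.get("Cookie"))
```
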
So we must not confuse the requests.Session object with session technology!

Five 、 Get movie reviews

Now that we can log in and keep the session state, we can get down to business!

1. Analyze the Douban movie review interface

First, find the movie you want to analyze on Douban. Here Brother Pig chooses an American movie, **《To Live in the Wilderness》**, because it is Brother Pig's favorite movie, bar none!
Then scroll down to the movie reviews, open the debug window, and find the request URL.

2. Crawl one page of review data

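
A sketch of fetching a single page might look like the following. The URL pattern, the `limit` parameter, and the movie ID are hypothetical placeholders (the article only names the `start` parameter); use whatever the debug window shows. The `session` argument is expected to be the logged-in requests.Session from the previous section.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real comments URL comes from the debug
# window. Only the `start` parameter is named in this article.
COMMENTS_URL = "https://movie.douban.com/subject/{subject_id}/comments"


def build_comments_url(subject_id, start=0, limit=20):
    """Return the URL of one page of short reviews."""
    query = urlencode({"start": start, "limit": limit})
    return COMMENTS_URL.format(subject_id=subject_id) + "?" + query


def fetch_comments_html(session, subject_id, start=0):
    """Fetch one page of reviews with a logged-in requests.Session."""
    response = session.get(build_comments_url(subject_id, start), timeout=10)
    response.raise_for_status()
    return response.text
```
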
But what we crawled is the page's HTML, so we still need to extract the review data from it.

3. Extract the review content

As we saw above, what comes back is HTML, with the review data nested inside HTML tags. How do we extract the review content?

Here we use regular expressions to match the tags we want. There are of course more advanced ways to extract content, such as parsing the HTML with libraries (bs4, xpath, etc.), which is also more efficient, but we will get to those later. Today we match with regular expressions!

Let's look at the structure of the HTML page again.
We find that the review content is always inside a <span class="short"></span> tag, so we can write a regular expression to match the content of this tag!
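
One possible regular expression for this, shown as a sketch rather than the author's exact code. The sample HTML is made up for illustration; real pages contain much more markup around each review.

```python
import re

# Non-greedy match of everything inside <span class="short">...</span>;
# re.S lets "." also match newlines inside a review.
SHORT_REVIEW = re.compile(r'<span class="short">(.*?)</span>', re.S)


def extract_reviews(html):
    """Return the text of every short review found in the HTML."""
    return [text.strip() for text in SHORT_REVIEW.findall(html)]


# Made-up sample HTML, for illustration only.
sample_html = (
    '<div><span class="short">A moving film.</span></div>'
    '<div><span class="short">Beautiful\nscenery.</span></div>'
)
print(extract_reviews(sample_html))
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would swallow everything from the first opening tag to the last closing tag on the page.
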
Then check the extracted content.

4. Batch crawl

Now that we can crawl, extract, and save a single page of data, let's crawl in batches. From our earlier crawling experience, we know that the key to batch crawling is finding the paging parameter. We can quickly spot a start parameter in the URL, which is the parameter that controls paging.
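
The paging loop can be sketched as follows. The page size of 20 is an assumption, and `fetch_html` is a hypothetical callback that returns the HTML for a given `start` offset (for example, by calling `session.get` on the logged-in Session). The loop stops at the page limit or as soon as a page contains no reviews.

```python
import re
import time

SHORT_REVIEW = re.compile(r'<span class="short">(.*?)</span>', re.S)


def page_starts(max_pages=25, page_size=20):
    """Yield the `start` offset of each page: 0, 20, 40, ..."""
    for page in range(max_pages):
        yield page * page_size


def crawl_reviews(fetch_html, max_pages=25, page_size=20, delay=0.0):
    """Collect reviews page by page until a page has none.

    fetch_html(start) must return the HTML of the page at that offset.
    """
    reviews = []
    for start in page_starts(max_pages, page_size):
        found = SHORT_REVIEW.findall(fetch_html(start))
        if not found:  # an empty page means there is nothing left
            break
        reviews.extend(text.strip() for text in found)
        time.sleep(delay)  # pause between requests to be polite
    return reviews
```
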
The crawl ends after just 25 pages. We can verify in the browser whether there really are only 25 pages; Brother Pig has checked, and there are indeed only 25!

Six 、 Analyze movie reviews

With the data captured, let's analyze the movie with a word cloud!

Word cloud analysis has already been covered in two earlier cases, so Brother Pig will only explain it briefly here!

1. Word segmentation with jieba

Because the reviews we downloaded are paragraphs of text, while a word cloud counts how often each word occurs, we need to segment the text into words first!
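
A minimal sketch of this step, assuming the third-party jieba package is installed. Dropping tokens shorter than two characters is an illustrative choice to filter out single-character particles, not something taken from the original code.

```python
from collections import Counter


def segment(text):
    """Split Chinese text into words with jieba (third-party package)."""
    import jieba  # imported here so the counting helper works without it

    return [word for word in jieba.cut(text) if word.strip()]


def count_words(words, min_length=2):
    """Count words, skipping very short tokens such as single characters."""
    return Counter(word for word in words if len(word) >= min_length)
```

With a list of review strings called `reviews` (a hypothetical name), the frequencies would come from `count_words(segment(" ".join(reviews)))`.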

2. Word cloud analysis

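
Rendering the word cloud could be sketched like this, assuming the third-party wordcloud package. The image size, background color, output file name, and the top-100 cutoff are illustrative choices; `font_path` must point to a font file on your machine that contains Chinese glyphs.

```python
def top_words(frequencies, n=100):
    """Keep only the n most frequent words to reduce clutter."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def build_word_cloud(frequencies, font_path, output_path="douban_reviews.png"):
    """Render a word -> count mapping as a word cloud image file."""
    from wordcloud import WordCloud  # third-party package

    cloud = WordCloud(
        font_path=font_path,  # a font with Chinese glyphs is needed
        width=800,
        height=600,
        background_color="white",
    )
    cloud.generate_from_frequencies(top_words(frequencies))
    cloud.to_file(output_path)
    return output_path
```
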
From the words in the final word cloud, we can tell that this is a movie about the pursuit of self and real life. Brother Pig strongly recommends it!!!

Seven 、 Summary

Today we used crawling Douban as an example and learned quite a lot. To summarize:

  1. How to send a POST request with the requests library
  2. How to log in to a website with the requests library
  3. How to keep session state with the requests library's Session object
  4. How to extract content from web page tags with regular expressions

Given the limited space, many details and tricks encountered while writing the crawler are not fully written out, so I hope you will try it yourself. You are also welcome to join Brother Pig's Python novice exchange group to learn together, and you can ask questions in the group if you run into problems! Add Brother Pig on WeChat: it-pig66, with the friend request format: Add group -xxx!

To get the source code, follow the WeChat official account 「Naked pigs」 and reply: Douban film review

This article was written by [Pig brother 66]. Please include a link to the original when reposting. Thank you.
