Last time we covered the basics of cookies and learned that they were created to make the web interactive. They are used mainly for three things: session state management, personalization, and tracking user behavior.
Today we'll use the requests library to log in to Douban and crawl movie reviews,
using code to demonstrate how cookies handle session state management (login).
This tutorial is for learning purposes only, not for commercial gain! If it infringes on any company's interests, please let me know and it will be removed!
Earlier, Brother Pig took you through crawling Youku's bullet comments and generating a word cloud from them. It turned out that the quality of Youku's bullet comments is not high: they are full of interjections and meaningless words such as "haha", "ah", "these", and "those". Douban, by contrast, has always had a good reputation, and its book and movie reviews are well regarded. So today we'll crawl Douban's movie reviews, generate a word cloud, and see how it turns out!
We'll use requests to log in to Douban, crawl the reviews, and finally generate the word cloud!
Why did our earlier cases (JD.com, Youku, and so on) need no login, while crawling Douban today does? Because Douban lets you view only the first 200 reviews of a movie without logging in; beyond that you must log in. This is also an anti-crawling measure!
Let's look at a simple technical plan, which divides roughly into three parts: log in to Douban, crawl the reviews, and generate the word cloud.
With the plan settled, let's get hands-on!
Before writing the crawler, we start in the browser and use the debug window to find the URL.
Open the login page, open the debug window, enter your username and password, and click log in.
Here Brother Pig suggests entering a wrong password. That way the page does not redirect, so you won't lose the chance to capture the request! This gives us the login request URL: https://accounts.douban.com/j/mobile/login/basic
Because it is a POST request, we also need to see which parameters are sent with the login request. Scroll down in the debug window to Form Data.
With the login request URL and its parameters in hand, we can write a login function with the requests library!
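A login function along these lines might look like the sketch below. The URL is the one captured above; the form field names (`name`, `password`, `remember`) and the `User-Agent` value are assumptions for illustration, so copy the real ones from the Form Data panel in your own browser.

```python
import requests

LOGIN_URL = "https://accounts.douban.com/j/mobile/login/basic"
# A browser-like User-Agent; the exact value here is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_login_form(username, password):
    """Build the POST body.

    The field names are assumed, not confirmed: replace them with the
    ones shown under Form Data in the debug window.
    """
    return {"name": username, "password": password, "remember": "false"}


def login(username, password):
    """Send the login request and return the response."""
    response = requests.post(
        LOGIN_URL,
        data=build_login_form(username, password),
        headers=HEADERS,
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response
```

Calling `login("your-account", "your-password")` sends one request, but nothing here keeps the cookies the server sets in its response.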
When we crawled the Youku bullet comments last time, we copied the cookie from the browser into the request headers to preserve the session state. But how do we make the code save cookies automatically?
You may have seen or used the urllib library,
which saves cookies like this:
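    import http.cookiejar
    import urllib.request

    cookie = http.cookiejar.CookieJar()  # holds the cookies
    handler = urllib.request.HTTPCookieProcessor(cookie)  # stores and resends them
    opener = urllib.request.build_opener(handler)
    opener.open(url)  # url: the page to request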
But as I said when we covered the requests library:
requests is a third-party HTTP library built on urllib3, notable for its powerful features and elegant API. For an HTTP client, the official Python documentation also recommends requests, and in practice it is the more widely used library.
So how does requests elegantly save cookies for us automatically? Let's tweak the code slightly so that it saves cookies automatically and maintains the session state!
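A minimal sketch of that tweak is shown here. As before, the form field names and the `User-Agent` value are assumptions; only the URL and the `s = requests.Session()` line come from the text.

```python
import requests

LOGIN_URL = "https://accounts.douban.com/j/mobile/login/basic"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed value

# Change 1: a Session object that stores cookies between requests.
s = requests.Session()


def login(username, password):
    """Log in through the shared session so its cookies are kept in `s`."""
    # Field names are assumptions; use the ones shown under Form Data.
    form = {"name": username, "password": password, "remember": "false"}
    # Change 2: send with s.post instead of requests.post.
    response = s.post(LOGIN_URL, data=form, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response
```

After a successful `login(...)`, any cookies the server set are held in `s.cookies`, and later calls such as `s.get(...)` send them back without extra code.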
In the code above, we made two changes:
s = requests.Session() creates a Session object that stores the cookies.
The request is now sent through that Session object rather than through requests directly. It sends requests the same way as before, but it carries the cookies along on every request, so from here on we use the Session object for all requests!
Some of you may ask: is the requests.Session object the same thing as the session we usually talk about?
The answer, of course, is no. The session we usually talk about is stored on the server, whereas a requests.Session object is just a client-side object that holds cookies. The description in its source code says as much.
So be sure not to confuse the requests.Session object with session technology!
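The difference can be checked without touching any server. The sketch below puts a made-up cookie (`demo_cookie` is purely illustrative) into a Session by hand and prepares a request without sending it, to show that the Session is just a client-side cookie holder.

```python
import http.cookiejar

import requests

s = requests.Session()

# The cookie store is an ordinary client-side cookie jar.
print(isinstance(s.cookies, http.cookiejar.CookieJar))

# Put a made-up cookie in the jar by hand; no server is involved.
s.cookies.set("demo_cookie", "12345")

# Prepare (but do not send) a request through the session.
request = requests.Request("GET", "https://movie.douban.com/")
prepared = s.prepare_request(request)

# The session has copied its stored cookie into the Cookie header.
print(prepared.headers["Cookie"])
```

Nothing here creates or reads server-side state: the Session only remembers cookies locally and attaches them to outgoing requests.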
With login and session persistence in place, we can get down to business!
First, find the movie you want to analyze on Douban. Here Brother Pig picks the American film **Into the Wild**, because it is the best movie in Brother Pig's heart, bar none!
Then scroll down to the reviews, open the debug window, and find the URL of the review request.
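Fetching one page of reviews could then be sketched as follows. The URL pattern, the `subject_id` parameter, and the `User-Agent` value are assumptions for illustration; use the exact URL you see in the debug window. The function takes a logged-in `requests.Session` as its first argument.

```python
# Assumed URL pattern for a movie's review page; confirm it in the debug window.
COMMENTS_URL = "https://movie.douban.com/subject/{subject_id}/comments"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed value


def fetch_comments_html(session, subject_id, start=0):
    """Return the raw HTML of one page of reviews.

    session    -- a logged-in requests.Session (anything with a compatible .get)
    subject_id -- the movie's ID as it appears in its Douban URL
    start      -- offset of the first review on the page
    """
    response = session.get(
        COMMENTS_URL.format(subject_id=subject_id),
        params={"start": start},
        headers=HEADERS,
        timeout=10,
    )
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    return response.text
```

Because the session is passed in, the cookies saved at login travel with this request automatically.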
But what comes back is an HTML page, and the review text is nested inside HTML tags. How do we extract the content of the reviews?
Here we use regular expressions to match the tags we want. There are more advanced ways to extract content, such as parsing the HTML with a library (bs4, XPath, and so on), and using a library is also more efficient. We'll get to those later; today we'll match with a regex!
Let's go back to the HTML structure of the page.
We find that every review sits inside a
<span class="short"></span> tag, so we can write a regex to match the content of that tag!
Then check the extracted content.
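One way to write that regex is sketched below. The sample HTML is made up for the demonstration; only the `<span class="short">` tag comes from the page structure described above.

```python
import re

# Non-greedy match of everything between the opening and closing tag;
# re.S lets "." span line breaks inside a review.
SHORT_REVIEW = re.compile(r'<span class="short">(.*?)</span>', re.S)


def extract_reviews(html):
    """Return the text of every <span class="short"> element in the page."""
    return [text.strip() for text in SHORT_REVIEW.findall(html)]


# Made-up sample page with two reviews, one of them spanning two lines.
sample = (
    '<div><span class="short">A film about freedom.</span></div>'
    '<div><span class="short">Beautiful,\nand sad.</span></div>'
)
for review in extract_reviews(sample):
    print(review)
```

The non-greedy `(.*?)` matters: a greedy `(.*)` would run from the first opening tag to the last closing tag and merge all reviews into one match.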
Once we can crawl, extract, and save one page of data, let's crawl in batches. From our earlier crawls we know that the key to batch crawling is finding the paging parameter. We can quickly spot that the
start parameter in the URL is the one that controls paging.
The crawl ends after only 25 pages. We can check in the browser whether there really are only 25 pages. Brother Pig has verified that there are!
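The paging loop can be sketched like this. It assumes that a page past the end simply contains no reviews, and that `start` advances by a fixed page size (20 here is an assumption; check the step in your own URLs). `fetch_page` is a hypothetical callable you supply that returns the HTML for a given `start` offset.

```python
import re
import time

SHORT_REVIEW = re.compile(r'<span class="short">(.*?)</span>', re.S)


def crawl_all_reviews(fetch_page, page_size=20, max_pages=100, delay=1.0):
    """Collect reviews page by page until a page comes back empty.

    fetch_page -- callable taking a `start` offset and returning that page's HTML
    page_size  -- reviews per page (assumed; check the step of `start` in the URL)
    max_pages  -- safety cap so the loop always terminates
    delay      -- seconds to wait between requests, to stay polite
    """
    reviews = []
    for page in range(max_pages):
        html = fetch_page(page * page_size)
        found = [text.strip() for text in SHORT_REVIEW.findall(html)]
        if not found:  # no reviews on this page: we ran past the last one
            break
        reviews.extend(found)
        time.sleep(delay)
    return reviews
```

Stopping on the first empty page means the code does not need to know the page count in advance, and the `max_pages` cap guards against a site that never returns an empty page.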
With the data captured, let's analyze the movie with a word cloud!
Word-cloud analysis was already covered in two earlier cases, so Brother Pig will only explain it briefly!
The reviews we downloaded are passages of text, and a word cloud is built by counting how often each word occurs, so we need to segment the text into words first!
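The counting step can be sketched with the standard library alone. The segmenter is passed in as `tokenize`, a hypothetical parameter name; the sample data is English stand-in text, for which `str.split` is enough.

```python
from collections import Counter


def word_frequencies(reviews, tokenize, stopwords=frozenset(), min_length=2):
    """Count how often each word occurs across all reviews.

    tokenize   -- callable that splits one review into a list of words
    stopwords  -- words to ignore (filler such as "haha", "these", "those")
    min_length -- drop tokens shorter than this (punctuation, single characters)
    """
    counts = Counter()
    for review in reviews:
        for word in tokenize(review):
            word = word.strip()
            if len(word) >= min_length and word not in stopwords:
                counts[word] += 1
    return counts


# English stand-in data; str.split is enough to segment it.
reviews = ["real life and real freedom", "the pursuit of self"]
counts = word_frequencies(reviews, str.split, stopwords={"and", "the", "of"})
print(counts.most_common(3))
```

For Chinese reviews, a segmenter such as `jieba.lcut` could be passed as `tokenize`, and the resulting counts could be handed to `WordCloud.generate_from_frequencies` from the wordcloud package. Both are assumptions about the tooling, since the text here does not name the libraries it used.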
The end result ：
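From these words we can tell that this is a movie about the pursuit of self and real life. Brother Pig strongly recommends it!!!
Today, taking Douban as our example, we learned a lot. To summarize: we logged in with requests, kept the session state with a requests.Session object, extracted the reviews with a regex, crawled page by page with the start parameter, and built a word cloud from the results.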
Given the limited space, many of the details and tricks that come up while crawling are not fully written out here, so I hope you will try it yourself. You are also welcome to join Brother Pig's Python beginners' group to learn together, and you can ask questions in the group if you run into problems! To join, add Brother Pig on WeChat:
it-pig66, with the friend request in the format: Add group -xxx!
To get the source code, follow the WeChat official account 「 Naked pigs 」 and reply: Douban film review