
Douban is a community website that collects books, films, and music, and it is popular with a large number of users. Many young people now check Douban's ratings and comments before going to the cinema or buying a book, so a Douban score reflects, to some extent, how well a work has been received.

Today we will scrape the data for the Douban Movie Top 250.

First, let's be clear about the information we want: title, director, country, link, release year, genre, score (with the share of five-star and four-star ratings), and number of raters.

Analyze the website

Open the Douban Movie Top 250 and you will find that it is split into 10 pages of 25 entries each. The URL is https://movie.douban.com/top250?start={start}&filter=, where start begins at 0, increases by 25 per page, and ends at 225.

So we use the function getUrls to build all the list-page links.

    def getUrls():
        # One list-page URL per offset: 0, 25, ..., 225
        url_init = 'https://movie.douban.com/top250?start={0}&filter='
        urls = [url_init.format(index * 25) for index in range(10)]
        return urls
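As a quick check of the pagination arithmetic, the sketch below rebuilds the same list in a slightly more general form. The names build_page_urls, total, and page_size are introduced here for illustration only and are not part of the original code.

```python
URL_TEMPLATE = 'https://movie.douban.com/top250?start={0}&filter='


def build_page_urls(total=250, page_size=25):
    """Return one list-page URL per offset 0, page_size, 2 * page_size, ... below total."""
    return [URL_TEMPLATE.format(start) for start in range(0, total, page_size)]


if __name__ == '__main__':
    urls = build_page_urls()
    print(len(urls))   # number of list pages
    print(urls[0])     # first page, start=0
    print(urls[-1])    # last page, start=225
```

Using range(0, total, page_size) makes the step explicit, so the same helper would still work if the page size were different.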

The list page, however, does not contain all the information we need; for that we have to visit each movie's details page. So we crawl the list pages first, take the details-page links from them, and then read the information we need from each details page.

Analyze the web pages

Next we need to find where each piece of information lives in the page. Open https://movie.douban.com/top250?start=0&filter= and then open the Chrome developer console.

Take "Farewell My Concubine" as an example. Each movie sits in its own li tag. Inside each li:

The movie link is in div.info > div.hd > a.
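To make that nesting concrete without any third-party dependency, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup. The HTML fragment and its URLs are hypothetical placeholders, and HdLinkParser and extract_detail_links are names invented for this example. It only approximates the selector: it accepts any a tag whose nearest enclosing div has class hd.

```python
from html.parser import HTMLParser

# Hypothetical list-page fragment; the URLs are placeholders, not real Douban links.
SAMPLE_LIST_HTML = """
<ol class="grid_view">
  <li>
    <div class="item">
      <div class="pic"><a href="https://example.com/poster"><img src="poster.jpg"></a></div>
      <div class="info">
        <div class="hd"><a href="https://example.com/subject/1/"><span class="title">Sample title</span></a></div>
      </div>
    </div>
  </li>
</ol>
"""


class HdLinkParser(HTMLParser):
    """Collect href values of <a> tags whose nearest enclosing <div> has class "hd"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._div_is_hd = []  # one flag per currently open <div>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'div':
            classes = (attr_map.get('class') or '').split()
            self._div_is_hd.append('hd' in classes)
        elif tag == 'a' and self._div_is_hd and self._div_is_hd[-1]:
            href = attr_map.get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'div' and self._div_is_hd:
            self._div_is_hd.pop()


def extract_detail_links(html):
    """Return the details-page links found in a list-page fragment."""
    parser = HdLinkParser()
    parser.feed(html)
    return parser.links
```

The poster link inside div.pic is skipped because its nearest enclosing div is not div.hd, which is the reason for targeting the hd block rather than every a tag in the li.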


With the details-page address of "Farewell My Concubine" in hand, let's analyze the details page.

We find that all the movie information is inside the <div id="content"> tag.

  • The title is in the span with property="v:itemreviewed".
  • The release year is in the span with class="year".
  • The director is in the span with class="attrs".
  • The country label has no unique attribute, so we match it with a regular expression.
  • The genre is in the span elements with property="v:genre".
  • The score is in the strong with property="v:average".
  • The number of raters is in the span with property="v:votes".
  • The star breakdown is in the div with class="ratings-on-weight".
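Because the country label carries no distinguishing attribute, matching on its text is the practical option. The sketch below shows the idea with a plain regular expression over a hypothetical fragment; the markup, the label text 制片国家/地区 and the helper name extract_countries are assumptions made for illustration, not taken from the article's code.

```python
import re

# Hypothetical fragment imitating the info block of a details page.
SAMPLE_INFO_HTML = (
    '<span class="pl">制片国家/地区:</span> 中国大陆 / 中国香港<br/>'
    '<span class="pl">语言:</span> 汉语普通话<br/>'
)

# Capture everything between the label's closing tag and the next tag.
COUNTRY_PATTERN = re.compile(r'制片国家/地区:</span>\s*([^<]+)')


def extract_countries(info_html):
    """Return the countries/regions that follow the label, or [] if the label is absent."""
    match = COUNTRY_PATTERN.search(info_html)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split('/') if part.strip()]


if __name__ == '__main__':
    print(extract_countries(SAMPLE_INFO_HTML))
```

Returning an empty list when the label is missing keeps the caller from crashing on pages that lack this field.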


Good. With the analysis done, we know exactly where the information is; next we fetch and parse it.

Get the data

Fetching web pages requires the requests library, and parsing them requires the BeautifulSoup library, so we import both first.

    import re

    import bs4
    import requests
    from requests.exceptions import RequestException

We need to fetch and parse a page both when collecting the movie links and when collecting the details, so we start by defining a function get_page_html(url) that returns the HTML content for a url.

To reduce the chance of being blocked by anti-crawler measures, we also send a few request headers.

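    def get_page_html(url):
        # Browser-like headers so the request is less likely to be rejected
        headers = {
            'Referer': 'https://movie.douban.com/chart',
            'Host': 'movie.douban.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
        }
        try:
            # The timeout keeps a stalled connection from hanging the crawl
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            return None
        except requests.RequestException:
            # Network errors and timeouts are reported as "no page"
            return None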
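Headers alone may not be enough if requests arrive too quickly. As a hedged sketch that is not part of the original code, a small wrapper could pause between attempts and retry when a fetch returns None. The name fetch_with_retry and its retries, delay, and sleep parameters are hypothetical; fetch stands for any function with the same contract as get_page_html.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing `delay` seconds between attempts.

    `fetch` is expected to return the page text, or None on failure.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    for attempt in range(retries):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries - 1:
            sleep(delay)  # wait before the next attempt, but not after the last one
    return None
```

A call such as fetch_with_retry(get_page_html, url, retries=3, delay=2.0) would then replace a direct call to get_page_html.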

Next we start collecting the data. First we need a function that extracts the details-page links from a list page.

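    def get_movie_url(html):
        ans = []
        soup = bs4.BeautifulSoup(html, 'html.parser')
        # Each movie on the list page is a div.item inside an li
        items = soup.select('li > div.item')
        for item in items:
            # The details-page link is the a tag under div.info > div.hd
            href = item.select('div.info > div.hd > a')[0]['href']
            ans.append(href)
        return ans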

We pass this function the HTML of a whole list page; it extracts the link to each movie's details page and returns the links to the caller as a list.

Once we have the details-page links, all that remains is to parse out the details. The details page repeats most of what the list page shows, so we can get everything we need from the details page alone.

We define a function to parse the details page.

    # Fields: title, link, director, country, release year, genre, score,
    # [five-star share, four-star share], number of raters
    def get_movie_info(url):
        html = get_page_html(url)
        soup = bs4.BeautifulSoup(html, 'html.parser')
        content = soup.find('div', id='content')

        # Title
        title = content.find('span', property='v:itemreviewed').text
        # Release year: the span text is wrapped in parentheses, so keep characters 1-4
        year = content.find('span', class_='year').text[1:5]
        # Directors
        directors = content.find('span', class_='attrs').find_all('a')
        director = [a.text for a in directors]

        # Country / region: the label has no unique attribute, so match its text
        # (the label on the page is in Chinese) and take the text node that follows it
        country = content.find(string=re.compile('制片国家/地区')).next_element.strip()
        # Genres
        genres = [span.text for span in content.find_all('span', property='v:genre')]

        # Average score
        average = content.find('strong', property='v:average').text
        # Number of raters
        votes = content.find('span', property='v:votes').text

        # Share of five-star and four-star ratings (the first two rows of the breakdown)
        rating_per_items = content.find('div', class_='ratings-on-weight').find_all('div', class_='item')
        rating_per = [rating_per_items[0].find('span', class_='rating_per').text,
                      rating_per_items[1].find('span', class_='rating_per').text]

        return {'title': title, 'url': url, 'director': director, 'country': country, 'year': year, 'type': genres,
                'average': average, 'votes': votes, 'rating_per': rating_per}

Scraping Douban is not difficult. The main work is analyzing the page structure carefully and picking the information we need out of a fairly intricate page. Because we collect many fields, this function is a little long.

Each field is parsed straight from the page source, so the values match what the page displays as long as the page structure stays the same. BeautifulSoup is also easy to use; you can pick up the basics in about half an hour. For more on using BeautifulSoup, see Day 65: Beautiful Soup, a Crawler's Power Tool: Traversing the Document.
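Every value parsed this way comes back as a string, which is awkward for later analysis. A minimal sketch of a conversion step is shown below; normalize_movie is a hypothetical helper, and the sample record contains made-up values that only mirror the keys returned by get_movie_info.

```python
def normalize_movie(record):
    """Return a copy of a scraped record with numeric fields converted from strings."""
    cleaned = dict(record)
    cleaned['year'] = int(record['year'])
    cleaned['average'] = float(record['average'])
    cleaned['votes'] = int(record['votes'])
    # Percent strings such as '61.5%' become plain floats such as 61.5
    cleaned['rating_per'] = [float(p.strip().rstrip('%')) for p in record['rating_per']]
    return cleaned


if __name__ == '__main__':
    # Made-up values, used only to show the shape of the data
    sample = {'title': 'Sample title', 'year': '1993', 'average': '9.6',
              'votes': '1000', 'rating_per': ['61.5%', '30.0%']}
    print(normalize_movie(sample))
```

Keeping the conversion separate from the parsing function means a malformed field fails in one obvious place instead of inside the scraper.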

After grabbing the data, we also need a function that saves it to a database or a local file for later analysis. For convenience, we write directly to a file here.

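    def writeToFile(content):
        filename = 'doubanTop250.txt'
        # Append mode, with an explicit encoding because the titles contain Chinese text
        with open(filename, 'a', encoding='utf-8') as f:
            f.write(content + '\n')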
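Writing str(detail) produces text that is hard to load back for analysis. One possible alternative, sketched here as an assumption rather than the article's method, is to write one JSON object per line. The function names and the file name doubanTop250.jsonl are hypothetical.

```python
import json


def append_json_line(record, filename='doubanTop250.jsonl'):
    """Append one record to the file as a single JSON line."""
    with open(filename, 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese titles readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_json_lines(filename='doubanTop250.jsonl'):
    """Load every non-empty line of the file back into a list of dicts."""
    with open(filename, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each line can then be parsed independently, so a crawl that stops halfway still leaves a usable file.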

With that, all the preparation is done and we can grab the data.

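    if __name__ == '__main__':
        list_urls = getUrls()
        list_htmls = [get_page_html(url) for url in list_urls]
        # get_movie_url returns one list per page, so flatten the links across all pages
        # and skip pages that failed to download (get_page_html returned None)
        movie_urls = [url for html in list_htmls if html for url in get_movie_url(html)]

        movie_details = [get_movie_info(url) for url in movie_urls]
        for detail in movie_details:
            writeToFile(str(detail))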

Afterwards you can see the data in doubanTop250.txt. Done.

Summary

Today we used the requests and BeautifulSoup libraries to scrape the Douban Movie Top 250, mainly as practice with BeautifulSoup. The approach in this article is straightforward; analyzing the pages just takes some patience. I hope you will try it yourself, which is a good way to improve your coding skills.

Code address

Sample code: https://github.com/JustDoPython/python-100-day/tree/master/day-119


Series articles

Day 118: Comparing and Copying Python Objects
Day 117: Machine Learning Algorithms: K-Nearest Neighbors
Day 116: Machine Learning Algorithms: Naive Bayes Theory
Day 115: Is Python Pass-by-Value or Pass-by-Reference?
Day 114: Three-Board Model Algorithm Project in Practice
Day 113: Python XGBoost Algorithm Project in Practice
Day 112: Machine Learning Algorithms: Monte Carlo
Day 111: Python's Garbage Collection Mechanism

Learning Python from Zero: Collected Summary of Days 0 - 110