
Last time, I showed how to use Python to grab official account articles and save them locally as PDF files. But the PDFs downloaded that way contain only text, no images, so that approach only suits articles without pictures. What if you want to download both the text and the images? Today I'll introduce another scheme: saving the articles as HTML.

Problems to be solved

In fact, there are three problems to solve:

  1. Images in the article are not saved into the PDF document.
  2. Some code snippets in the article, especially long single lines of code, get truncated when saved as PDF.
  3. PDF paginates automatically, which can split code blocks and images across pages.


Given these problems, I think the best option is to download the article as an HTML web page. Here's how to implement it.

Implementation

Getting the article links works the same way as in the previous PDF article: we query them through the hyperlink search in the official account platform's image-and-text material editor. So we can simply reuse last time's code with a few modifications. First, copy the original gzh_download.py to gzh_download_html.py, then modify the code on that basis:

```python
# gzh_download_html.py
# Imports
import requests
import json
import re
import time
from bs4 import BeautifulSoup
import os

# Read the saved login cookies from cookie.txt
with open("cookie.txt", "r") as file:
    cookie = file.read()
cookies = json.loads(cookie)
url = "https://mp.weixin.qq.com"
# Request the official account platform
response = requests.get(url, cookies=cookies)
# Extract the token from the redirect URL
token = re.findall(r'token=(\d+)', str(response.url))[0]
# Set the request headers
headers = {
    "Referer": "https://mp.weixin.qq.com/cgi-bin/appmsg?t=media/appmsg_edit_v2&action=edit&isNew=1&type=10&token=" + token + "&lang=zh_CN",
    "Host": "mp.weixin.qq.com",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
}

# Loop over the first 9 pages of articles
for j in range(1, 10, 1):
    begin = (j - 1) * 5
    # Request the article list for the current page
    requestUrl = ("https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=" + str(begin)
                  + "&count=5&fakeid=MzU1NDk2MzQyNg==&type=9&query=&token=" + token
                  + "&lang=zh_CN&f=json&ajax=1")
    search_response = requests.get(requestUrl, cookies=cookies, headers=headers)
    # Parse the returned JSON and get the article list
    re_text = search_response.json()
    article_list = re_text.get("app_msg_list")
    # Iterate over the articles on the current page
    for i in article_list:
        # The title (with spaces removed) names the directory that stores the HTML and images
        dir_name = i["title"].replace(' ', '')
        print("Downloading article: " + dir_name)
        # Request the article's URL to get its content
        response = requests.get(i["link"], cookies=cookies, headers=headers)
        # Save the article locally
        save(response, dir_name, i["aid"])
        print(dir_name + " downloaded!")
    # Requesting too fast may get blocked by WeChat, so wait 10 seconds between pages
    time.sleep(10)
```
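The script above expects cookie.txt to contain the platform login cookies serialized as a JSON object of name/value pairs, so that `json.loads()` can turn it back into a dict for `requests`. Here is a minimal sketch of producing such a file; the cookie names and values are made-up placeholders for illustration, not real WeChat cookies (real values come from a logged-in browser session on mp.weixin.qq.com):

```python
import json

# Hypothetical cookie names/values, for illustration only
cookies = {
    "slave_sid": "placeholder_session_id",
    "slave_user": "gh_0123456789ab",
}

# Serialize to cookie.txt in the JSON form the download script expects
with open("cookie.txt", "w") as f:
    f.write(json.dumps(cookies))
```

The download script can then restore the dict with `json.loads(file.read())` and pass it straight to `requests.get(url, cookies=cookies)`.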

As you can see from the code above, the main change is replacing the original `pdfkit.from_url(i["link"], i["title"] + ".pdf")` call: we now use `requests` to fetch each article's URL ourselves and then call a method that saves the page and its images locally. That `save()` method is implemented below.

The save() method

```python
# Save the downloaded HTML page and its images
def save(search_response, html_dir, file_name):
    # Directory where the HTML is stored
    htmlDir = os.path.join(os.path.dirname(os.path.abspath(__file__)), html_dir)
    # Directory where the images are stored
    targetDir = os.path.join(os.path.dirname(os.path.abspath(__file__)), html_dir + '/images')
    # Create the image folder if it does not exist
    if not os.path.isdir(targetDir):
        os.makedirs(targetDir)
    domain = 'https://mp.weixin.qq.com/s'
    # Save the HTML page
    save_html(search_response, htmlDir, file_name)
    # Save the images
    save_file_to_local(htmlDir, targetDir, search_response, domain, file_name)

# Save the images to the local folder
def save_file_to_local(htmlDir, targetDir, search_response, domain, file_name):
    # Parse the requested page with lxml
    obj = BeautifulSoup(search_response.content, 'lxml')
    # Find all <img> tags
    imgs = obj.find_all('img')
    # Collect the image links on the page
    urls = []
    for img in imgs:
        if 'data-src' in str(img):
            urls.append(img['data-src'])
        elif 'src=""' in str(img):
            pass
        elif "src" not in str(img):
            pass
        else:
            urls.append(img['src'])

    # Download every image link to the target folder, naming the files 0, 1, 2, ...
    i = 0
    for each_url in urls:
        # Handle the different URL formats that appear in articles
        if each_url.startswith('//'):
            new_url = 'https:' + each_url
            r_pic = requests.get(new_url)
        elif each_url.startswith('/') and each_url.endswith('gif'):
            new_url = domain + each_url
            r_pic = requests.get(new_url)
        elif each_url.endswith('png') or each_url.endswith('jpg') or each_url.endswith('gif') or each_url.endswith('jpeg'):
            r_pic = requests.get(each_url)
        else:
            # Skip links in a format we don't recognize
            continue
        # Local path for this image
        t = os.path.join(targetDir, str(i) + '.jpeg')
        print('This article has ' + str(len(urls)) + ' images; processing number ' + str(i + 1) + '...')
        # Write the image to the local directory
        with open(t, 'wb') as fw:
            fw.write(r_pic.content)
        i += 1
        # Rewrite the old or relative link so the page loads the local image
        update_file(each_url, t, htmlDir, file_name)

# Save the HTML page to disk
def save_html(url_content, htmlDir, file_name):
    with open(htmlDir + "/" + file_name + '.html', 'wb') as f:
        f.write(url_content.content)
    return url_content

# Modify the HTML file, changing image paths to the local copies
def update_file(old, new, htmlDir, file_name):
    # Read the original file and write the modified content to a temporary copy
    with open(htmlDir + "/" + file_name + '.html', encoding='utf-8') as f, \
         open(htmlDir + "/" + file_name + '_bak.html', 'w', encoding='utf-8') as fw:
        # Replace the path line by line
        for line in f:
            new_line = line.replace(old, new)
            new_line = new_line.replace("data-src", "src")
            fw.write(new_line)
    # Replace the original file with the updated copy
    os.remove(htmlDir + "/" + file_name + '.html')
    time.sleep(5)
    os.rename(htmlDir + "/" + file_name + '_bak.html', htmlDir + "/" + file_name + '.html')
```
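To see why update_file() renames data-src to src: WeChat article pages lazy-load images, so the real URL sits in the data-src attribute and the raw saved HTML would show blank images. Here is a minimal, self-contained sketch of the rewriting step; the sample HTML line and local path are made up for illustration:

```python
# A made-up line from a saved article page (WeChat stores the real image URL in data-src)
sample_line = '<img data-src="https://mmbiz.qpic.cn/abc/0?wx_fmt=png" alt="">'
old_url = "https://mmbiz.qpic.cn/abc/0?wx_fmt=png"
local_path = "images/0.jpeg"

# The same two replacements update_file() applies to every line:
# swap the remote URL for the local path, then rename data-src to src
new_line = sample_line.replace(old_url, local_path)
new_line = new_line.replace("data-src", "src")
print(new_line)  # <img src="images/0.jpeg" alt="">
```

After this rewrite, a browser opening the local .html file loads the images from the images/ folder instead of the network.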

That is all the code needed to download the article pages and images locally. Next, run the command python gzh_download_html.py and the program starts executing, printing logs like this:

```
$ python gzh_download_html.py
Downloading article: Learn Python just by reading this one!
This article has 3 images; processing number 1...
This article has 3 images; processing number 2...
This article has 3 images; processing number 3...
Learn Python just by reading this one! downloaded!
Downloading article: Python Flask data visualization
This article has 2 images; processing number 1...
This article has 2 images; processing number 2...
Python Flask data visualization downloaded!
Downloading article: Teach you to use Python to download short videos from your phone
This article has 11 images; processing number 1...
This article has 11 images; processing number 2...
This article has 11 images; processing number 3...
This article has 11 images; processing number 4...
This article has 11 images; processing number 5...
This article has 11 images; processing number 6...
This article has 11 images; processing number 7...
```

Now open the directory where the script lives, and you can see folders named after the articles:

[screenshot: folders named after the downloaded articles]

Enter an article's directory and you will see an .html file and an image folder named images. Double-click the .html file to open it, and you can view the article complete with its images and code blocks, just as it appeared in the official account.

[screenshot: the saved article opened in a browser, images and code blocks intact]

Summary

This article introduced how to batch-download official account articles with Python, saving each one locally as HTML plus its images so the articles can be browsed offline. And if you do want PDFs after all, converting the HTML is simple: calling pdfkit.from_file(xx.html, target.pdf) turns the saved page directly into a PDF, and a PDF produced this way keeps the images too.
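If you go the HTML-to-PDF route for a whole batch, the wiring might look like the following sketch. It assumes pdfkit and the wkhtmltopdf binary are installed, and that each article was saved as an .html file inside its own directory by the script above; find_article_html and html_to_pdf are helper names I made up for illustration:

```python
import glob
import os

def find_article_html(base_dir):
    """Collect the saved article pages: each article directory holds one
    .html file next to its images/ folder."""
    return sorted(glob.glob(os.path.join(base_dir, "*", "*.html")))

def html_to_pdf(html_path):
    """Convert one saved page; requires pdfkit plus the wkhtmltopdf binary on the PATH."""
    import pdfkit
    pdf_path = os.path.splitext(html_path)[0] + ".pdf"
    pdfkit.from_file(html_path, pdf_path)
    return pdf_path

# Usage (run from the download script's directory, after downloading):
# for page in find_article_html("."):
#     html_to_pdf(page)
```

Because the image paths in the saved HTML already point at local files, the rendered PDF picks the images up without any further changes.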