Learn to crawl novels with Python! Who can afford to pay for novels these days?

User 1477324266325 2021-04-06 22:26:17


The approach to crawling a novel:

  1. Get the address of the novel.
  2. Analyze the address structure of the catalog page.
  3. Stitch the addresses together.
  4. Analyze the content structure of a chapter.
  5. Get and save the text.
  6. Complete code

1. Get the address of the novel

Import the required packages:

import re
from bs4 import BeautifulSoup as ds
import requests

Request the novel's catalog page. A return value of <Response [200]> indicates that the page can be crawled normally.

base_url='https://www.soshuw.com/XuLiangShangYouWangFei/'
chapter_html=requests.get(base_url)
print(chapter_html)
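
Printing the response only lets you check the status by eye. As a minimal sketch, the same check can be done in code. The helper name fetch_ok, the get parameter and the FakeResponse class are assumptions made for illustration and are not part of the original walkthrough; get stands for any callable that behaves like requests.get.

```python
def fetch_ok(get, url):
    """Fetch url with the given callable and fail loudly unless the status is 200."""
    response = get(url)
    if response.status_code != 200:
        raise RuntimeError('{} returned HTTP status {}'.format(url, response.status_code))
    return response


class FakeResponse:
    """Hypothetical stand-in for a response object, so the sketch runs offline."""
    def __init__(self, status_code):
        self.status_code = status_code


# Demo with a fake getter; no network access is needed
ok = fetch_ok(lambda url: FakeResponse(200), 'https://example.com/')
print(ok.status_code)
```

With the real library the call would look like fetch_ok(requests.get, base_url). requests also provides response.raise_for_status(), which raises an exception for 4xx and 5xx responses.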

2. Analyze the address structure of the novel

Parse the catalog page. The output is the source code of the catalog page:

chapter_page=ds(chapter_html.content,'lxml')
print(chapter_page)

Open the catalog page. A list of the latest chapters (nine of them) appears before the full table of contents. The full catalog already contains those chapters, so the latest-chapters list is not needed.

Right-click the page and select "Inspect" (or "Properties"; the name differs between browsers, and I use IE), then choose the "Elements" tab. When the mouse moves over a code block on the right, the corresponding area of the page on the left is highlighted. Use this to find the code block that corresponds to the full catalog.

The full catalog has two anchors: class="novel_list" and id="novel108799". On closer inspection the class is not unique, so we use the id to extract the block.

Extract the full catalog block:

chapter_novel=chapter_page.find(id="novel108799")
print(chapter_novel)

This prints the HTML of the full catalog block (the original post showed part of the output in a screenshot).

Comparing a chapter's URL with the catalog URL (base_url), we find that joining base_url with the second half of the chapter URL gives the complete chapter URL.

3. Stitch the addresses

Use the regular-expression library (re) to extract the second half of each address:

chapter_novel_str=str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
print(chapter_href_list)
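
To see what the pattern captures without hitting the site, here is a small sketch that runs it on a hand-written fragment. The sample HTML is an assumption modeled on the structure described above (the <dl> wrapper and the second chapter number are made up); only the regular expression comes from the walkthrough.

```python
import re

# Hypothetical catalog fragment; the real page may be structured differently
sample_html = (
    '<dl class="novel_list" id="novel108799">\n'
    '<dd><a href="/XuLiangShangYouWangFei/3647144.html">Chapter 1</a></dd>\n'
    '<dd><a href="/XuLiangShangYouWangFei/3647145.html">Chapter 2</a></dd>\n'
    '</dl>'
)

# Same pattern as above: capture everything after the novel's path prefix
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
sample_href_list = re.findall(regx, sample_html)
print(sample_href_list)  # ['/3647144.html', '/3647145.html']
```

Each captured piece starts with a slash, which matters when it is joined to base_url in the next step.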

Stitch the URLs. Define a list, chapter_url_list, to receive the full addresses:

chapter_url_list = []
for i in chapter_href_list:
    # Each captured piece starts with '/', so drop the trailing '/' of base_url first
    url=base_url.rstrip('/')+i
    chapter_url_list.append(url)
print(chapter_url_list)
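
String concatenation works as long as the slashes line up. As an alternative sketch (not the method used in this article), the standard library's urljoin can resolve the full href against base_url. This assumes the whole href is captured rather than only its second half; the full_hrefs list below is a hypothetical example.

```python
from urllib.parse import urljoin

base_url = 'https://www.soshuw.com/XuLiangShangYouWangFei/'

# Hypothetical full href values, in the form the catalog page is described as using
full_hrefs = ['/XuLiangShangYouWangFei/3647144.html']

# urljoin replaces the path of base_url with the absolute path in each href
joined_url_list = [urljoin(base_url, href) for href in full_hrefs]
print(joined_url_list)  # ['https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html']
```

With this approach it does not matter whether base_url ends with a slash.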

4. Analyze the content structure of the chapter

Open a chapter, right-click and choose "Inspect" (or "Properties") to look at the content structure. The body text of the novel has two anchors, class and id. The class is the same in every chapter while the id changes from chapter to chapter, so we use the class to extract the text.

Extract the body text and the title:

# Fetch one chapter and parse it
body_html=requests.get('https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html')
body_page=ds(body_html.content,'lxml')
# The body text sits in the block with class="content"
body = body_page.find(class_='content')
body_content=str(body)
print(body_content)
# Each paragraph follows a <br/> tag and runs to the end of the line
body_regx='<br/> (.*?)\n'
content_list=re.findall(body_regx,body_content)
print(content_list)
# The chapter title is inside the <h1> tag
title_regx = '<h1>(.*?)</h1>'
title = re.findall(title_regx, body_html.text)
print(title)
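
The two patterns can be tried out on a small hand-written fragment before running them against the site. This is only a sketch: the sample markup is an assumption about what a chapter page roughly looks like, and the real page may use different whitespace after <br/>.

```python
import re

# Hypothetical chapter fragment; the real page may differ
sample_page = (
    '<h1>Chapter 1</h1>\n'
    '<div class="content" id="content1">\n'
    '<br/> First paragraph.\n'
    '<br/> Second paragraph.\n'
    '</div>'
)

body_regx = '<br/> (.*?)\n'      # text between '<br/> ' and the end of the line
title_regx = '<h1>(.*?)</h1>'    # text inside the <h1> tag

sample_content = re.findall(body_regx, sample_page)
sample_title = re.findall(title_regx, sample_page)
print(sample_title)    # ['Chapter 1']
print(sample_content)  # ['First paragraph.', 'Second paragraph.']
```

Note that re.findall always returns a list, so title[0] raises an IndexError when a page has no <h1> tag.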

5. Save text

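# encoding='utf-8' avoids errors when the system default encoding cannot represent the text
with open('1.txt', 'a+', encoding='utf-8') as f:
    f.write('\n\n')
    f.write(title[0] + '\n')
    f.write('\n\n')
    for e in content_list:
        f.write(e + '\n')
print('Finished crawling {}'.format(title[0]))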
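
Since the same writing steps repeat for every chapter, they can be wrapped in a function. The following is a minimal sketch under that assumption; the name save_chapter and the demo file name are hypothetical, and the demo writes to a temporary directory rather than to 1.txt.

```python
import os
import tempfile


def save_chapter(path, title, paragraphs):
    """Append one chapter (title, then one paragraph per line) to a UTF-8 text file."""
    with open(path, 'a', encoding='utf-8') as f:
        # Same layout as above: two blank lines, the title, two more blank lines
        f.write('\n\n' + title + '\n' + '\n\n')
        for paragraph in paragraphs:
            f.write(paragraph + '\n')


# Demo in a temporary directory so that nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'novel.txt')
    save_chapter(demo_path, 'Chapter 1', ['First paragraph.', 'Second paragraph.'])
    save_chapter(demo_path, 'Chapter 2', ['Third paragraph.'])
    with open(demo_path, encoding='utf-8') as f:
        saved_text = f.read()

print(saved_text)
```

Because the file is opened in append mode, calling the function once per chapter builds up the whole novel in one file.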

6. Complete code

import re
from bs4 import BeautifulSoup as ds
import requests
base_url='https://www.soshuw.com/XuLiangShangYouWangFei'
chapter_html=requests.get(base_url)
chapter_page=ds(chapter_html.content,'lxml')
chapter_novel=chapter_page.find(id="novel108799")
#print(chapter_novel)
chapter_novel_str=str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
#print(chapter_href_list)
chapter_url_list = []
for i in chapter_href_list:
    url=base_url+i
    chapter_url_list.append(url)
#print(chapter_url_list)
for u in chapter_url_list:
    body_html=requests.get(u)
    body_page=ds(body_html.content,'lxml')
    body = body_page.find(class_='content')
    body_content=str(body)
    # print(body_content)
    body_regx='<br/> (.*?)\n'
    content_list=re.findall(body_regx,body_content)
    #print(content_list)
    title_regx = '<h1>(.*?)</h1>'
    title = re.findall(title_regx, body_html.text)
    #print(title)
    # encoding='utf-8' avoids errors when the system default encoding cannot represent the text
    with open('1.txt', 'a+', encoding='utf-8') as f:
        f.write('\n\n')
        f.write(title[0] + '\n')
        f.write('\n\n')
        for e in content_list:
            f.write(e + '\n')
    print('Finished crawling {}'.format(title[0]))


Copyright notice
This article was written by [User 1477324266325]. Please include a link to the original when reposting. Thank you.
https://pythonmana.com/2021/04/20210406213026004X.html
