2020 Python Developer Day Online Technology Summit: technical implementation of a crawler framework and experience with module application

Tianyuan prodigal son 2020-11-13 09:52:57


1. Preface

On February 15, CSDN, together with PyCon China, wuhan2020, xinguan2020 and other communities, held the 2020 Python Developer Day Online Technology Summit under the theme "Fight the epidemic, developers in action", focusing on concrete Python applications and projects during the epidemic and showing the power of code to Python developers and enthusiasts.

When the organizer invited me, my first feeling was one of pressure and responsibility, because the backdrop of this event is the current epidemic: every sector is helping Hubei, all eyes are on Wuhan, and Wuhan touches the hearts of millions. At that moment I could not help writing the following words on my slides:

[Slide image]
This is a purely non-profit event with no commercial interests involved. Participants could scan the QR code and join for free, or voluntarily pay 19 yuan. Any income will be donated by the organizer to the areas most urgently in need.
All the code in this article has been uploaded to my Github: https://github.com/xufive/2020Pyday. Please download it yourself if needed.

2. Some concepts about crawlers we need to understand

Crawlers are probably one of the first technologies a Pythoneer encounters and one of the most frequently used. They look simple, but they touch on network communication, application protocols, HTML/CSS/JS, data parsing, service frameworks and other technical areas, so they are not easy for beginners to master; many people even equate crawlers with some popular crawler library such as scrapy. In my view, concepts are the foundation of theory, and ideas are the pioneers of code. Figure out the basic concepts and principles first, then write and use crawlers, and you will get twice the result with half the effort.

2.1 Definition of a crawler

  • Definition 1: A crawler is a program that automatically grabs information from the Internet, harvesting valuable information from the web.
  • Definition 2: A crawler, also called a web spider, is a kind of Internet robot that browses the World Wide Web automatically.

Read these carefully and there is a subtle difference: the former tends to crawl specific targets, while the latter tends to traverse an entire site or even the entire web.

2.2 Legal risks of crawlers

As a computer technology, crawling itself is not prohibited by law, but obtaining data with crawler technology does carry legal risks:

  • Crawling with excessive load, causing a website to crash or become inaccessible
  • Illegally intruding into computer information systems
  • Crawling personal information
  • Invading privacy
  • Unfair competition

2.3 Understanding crawler types from their application scenarios

  • Focused crawler: targets a specific object or goal, usually temporary
  • Incremental web crawler: crawls only the incremental part, which implies that crawling is a frequent or periodic activity
  • Deep web crawler: targets content that cannot be reached through static links and is hidden behind logins or search forms, i.e. data that can only be obtained after the user submits the necessary information
  • General purpose web crawler: also known as a whole-web crawler, mainly collects data for portal search engines and large web service providers

2.4 Basic techniques and framework of a crawler

A basic crawler framework consists of at least three parts: a scheduling server, a data downloader and a data processor, corresponding to the three basic crawling techniques: the scheduling service framework, data capture and data preprocessing.

[Figure: our crawler framework, with a management platform added to the basic three components]
The figure above shows a framework we have been using in recent years. Compared with the basic framework it has one more component, a management platform, which is used to configure download tasks, monitor the working state of system components, monitor data arrival, balance the load across nodes, analyze the continuity of downloaded data, and complete or re-download missing data.

3. Data capture technology

Usually we use the standard module urllib, or third-party modules such as requests or pycurl, to capture data; sometimes we also use the automated testing tool selenium. Of course, there are also many packaged frameworks available, such as pyspider and scrapy. The technical points of data capture include the following (sketched in the short example after the list):

  • Constructing and sending requests: method, headers, parameters, cookies
  • Receiving and interpreting responses: response code, response type, response content, encoding
  • A single data capture often consists of multiple request-response rounds
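
To make these points concrete, here is a minimal sketch of a single request-response round trip with requests. The URL, header, parameter and cookie values are placeholders for illustration only; they are not part of the summit demo.

import requests

url = 'https://example.com/api/data'           # hypothetical endpoint
headers = {'User-Agent': 'Mozilla/5.0'}        # request header
params = {'page': 1, 'size': 20}               # query-string parameters
cookies = {'sessionid': 'xxxx'}                # cookie sent with the request

resp = requests.get(url, headers=headers, params=params, cookies=cookies, timeout=10)

print(resp.status_code)                        # response code, e.g. 200
print(resp.headers.get('Content-Type'))        # type of the response
print(resp.encoding)                           # encoding guessed by requests
text = resp.text                               # decoded response body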

3.1 Downloading Tencent's NCP epidemic data

With a little analysis, we can easily obtain, from Tencent's real-time epidemic tracking site, the URL of its data service:

https://view.inews.qq.com/g2/getOnsInfo

as well as its 3 QueryString parameters:

  • name: disease_h5
  • callback: name of the callback function
  • _: timestamp accurate to the millisecond

The rest follows naturally:

>>> import time, json, requests
>>> url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'%int(time.time()*1000)
>>> data = json.loads(requests.get(url=url).json()['data'])
>>> data.keys()
dict_keys(['lastUpdateTime', 'chinaTotal', 'chinaAdd', 'isShowAdd', 'chinaDayList', 'chinaDayAddList', 'dailyNewAddHistory', 'dailyDeadRateHistory', 'confirmAddRank', 'areaTree', 'articleList'])
>>> d = data['areaTree'][0]['children']
>>> [item['name'] for item in d]
['Hubei', 'Guangdong', 'Henan', 'Zhejiang', 'Hunan', 'Anhui', 'Jiangxi', 'Jiangsu', 'Chongqing', 'Shandong', ..., 'Hong Kong', 'Taiwan', 'Qinghai', 'Macau', 'Tibet']
>>> d[0]['children'][0]
{'name': 'Wuhan', 'today': {'confirm': 1104, 'suspect': 0, 'dead': 0, 'heal': 0, 'isUpdated': True}, 'total': {'confirm': 19558, 'suspect': 0, 'dead': 820, 'heal': 1377, 'showRate': True, 'showHeal': False, 'deadRate': 4.19, 'healRate': 7.04}}

For a more detailed description, please refer to "Python in action: grab real-time pneumonia data and draw a 2019-nCoV epidemic map".

3.2 Downloading Modis data

Modis is an important sensor aboard the TERRA and AQUA remote sensing satellites. It is the only on-board instrument that broadcasts real-time observation data directly to the whole world via X band, and its data can be received free of charge and used without cost. It covers a wide spectrum: 36 bands in total, ranging from 0.4μm to 14.4μm. The steps to download Modis data are as follows:

  1. Request https://urs.earthdata.nasa.gov/home with GET
  2. Parse the authenticity token out of the response
  3. Build a form, filling in the username, the password and the token
  4. Request https://urs.earthdata.nasa.gov/login with POST
  5. Record the cookies
  6. Request the file download page with GET
  7. Parse the download URL of the file out of the response
  8. Download the file

Below, we use the requests module interactively in Python IDLE to go through the whole process. Of course, the same thing can also be done with the pycurl module. I have provided both the requests and the pycurl implementations on Github.

>>> import re
>>> from requests import request
>>> from requests.cookies import RequestsCookieJar
>>> resp = request('GET', 'https://urs.earthdata.nasa.gov/home')
>>> pt = re.compile(r'.*<input type="hidden" name="authenticity_token" value="(.*)" />.*')
>>> token = pt.findall(resp.text)[0]
>>> jar = RequestsCookieJar()
>>> jar.update(resp.cookies)
>>> url = 'https://urs.earthdata.nasa.gov/login'
>>> forms = {"username": "linhl", "redirect_uri": "", "commit": "Log+in", "client_id": "", "authenticity_token": token, "password": "*********"}
>>> resp = request('POST', url, data=forms, cookies=jar)
>>> resp.cookies.items()
[('urs_user_already_logged', 'yes'), ('_urs-gui_session', '4f87b3fd825b06ad825a666133481861')]
>>> jar.update(resp.cookies)
>>> url = 'https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MOD13Q1/2019/321/MOD13Q1.A2019321.h00v08.006.2019337235356.hdf'
>>> resp = request('GET', url, cookies=jar)
>>> pu = re.compile(r'href="(https://ladsweb.modaps.eosdis.nasa.gov.*hdf)"')
>>> furl = pu.findall(resp.text)[0]
>>> furl= furl.replace('&amp;', '&')
>>> resp = request('GET', furl, cookies=jar)
>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\fb\modis_demo.hdf', 'wb') as fp:
	fp.write(resp.content)

The Modis data we downloaded is in HDF format. HDF (Hierarchical Data File) is a hierarchical data format developed by the National Center for Supercomputing Applications (NCSA) to store and distribute scientific data efficiently, meeting the research needs of many different fields, and it satisfies most of the requirements for storing and distributing scientific data. HDF, together with another data format, netCDF, is used not only in the United States but also here in China; in space science, atmospheric science, geophysics and related fields, almost all data distribution relies on these two formats.
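
As a quick check on the download, here is a small sketch of opening the file with the pyhdf package. This is my own addition rather than part of the summit code, and the dataset name is only an example of what a MOD13Q1 file typically contains; verify it against datasets() for your product.

from pyhdf.SD import SD, SDC

hdf = SD('modis_demo.hdf', SDC.READ)      # open the HDF4 file read-only
print(hdf.datasets().keys())              # names of the scientific datasets inside
ndvi = hdf.select('250m 16 days NDVI')    # example MOD13Q1 dataset name; check datasets() first
print(ndvi.get().shape)                   # read it into a numpy array and show its shape
hdf.end()                                 # close the file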

This is what the HDF data file I just downloaded looks like:
[Figure: the downloaded HDF data file]

3.3 NCP epidemic data from Quark AI Search

The NCP epidemic data on Quark AI Search cannot be grabbed by ordinary means: the source code of the page does not match the rendered content (it is not rendered in the static HTML). Faced with such a site, do we have any technical options? Don't worry, let me introduce an interesting data capture technique: as long as the data can be reached through the browser address bar, it can be captured; truly "what you can see, you can get".

This kind of capture relies on the selenium module. In fact, selenium is not only a data capture tool; it is an automated testing tool for websites that supports mainstream browsers such as Chrome, Firefox and Safari. Fetching data with selenium is not a universal approach, because it only supports the GET method (although there are extensions that let selenium issue POST requests, such as installing the seleniumrequests module).

>>> from selenium import webdriver
>>> from selenium.webdriver.chrome.options import Options
>>> opt = Options()
>>> opt.add_argument('--headless')
>>> opt.add_argument('--disable-gpu')
>>> opt.add_argument('--window-size=1366,768')
>>> driver = webdriver.Chrome(options=opt)
>>> url = 'https://broccoli.uc.cn/apps/pneumonia/routes/index?uc_param_str=dsdnfrpfbivesscpgimibtbmnijblauputogpintnwktprchmt&fromsource=doodle'
>>> driver.get(url)
>>> with open(r'd:\broccoli.html', 'w') as fp:
	fp.write(driver.page_source)
247532
>>> driver.quit()

For more details about installing and using the selenium module, please refer to my article "An interesting data capture technique: what you can see, you can get".

4. Data preprocessing technology

4.1 Common preprocessing techniques

Questions to consider in data preprocessing:

  • Is the data format standard?
  • Is the data complete?
  • How should non-standard or incomplete data be handled?
  • How should the data be stored? How should it be distributed?

Based on the above considerations, the following preprocessing techniques have emerged (see the short pandas sketch after the list):

  • XML/HTML data parsing
  • Text data parsing
  • Data cleaning: checking, de-duplication, gap filling, interpolation, standardization
  • Data storage and distribution
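
The parsing techniques are demonstrated in the next subsection; as for the cleaning steps, here is a small sketch (not part of the original talk) of what de-duplication, interpolation and storage might look like with pandas on a made-up per-city table:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city':    ['wuhan', 'wuhan', 'huanggang', 'xiaogan'],    # made-up rows with a duplicate
    'confirm': [19558,    19558,   np.nan,      2436],        # and a missing value
    'dead':    [820,      820,     78,          29],
})

df = df.drop_duplicates()                      # de-duplication
df['confirm'] = df['confirm'].interpolate()    # fill the gap by interpolation
df['confirm'] = df['confirm'].astype(int)      # standardize the type
df.to_csv('ncp_clean.csv', index=False)        # store for later distribution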

4.2 A parsing example: the geomagnetic index (Dst)

The geomagnetic index is a graded index describing the intensity of magnetic disturbance over a period of time. The index used at low- and middle-latitude stations is called the Dst index; it is measured every hour and mainly reflects changes in the intensity of the horizontal component of the geomagnetic field. This site provides Dst index downloads, with the page giving the Dst index for every day and hour of the last month. The whole process is as follows:

  1. Grab the HTML page, using requests
  2. Parse the text data out of the HTML and save it as a data file, using bs4
  3. Parse the text data and save it as a two-dimensional data table, using regular expressions

Once again we implement the process interactively in Python IDLE:

>>> import requests
>>> html = requests.get('http://wdc.kugi.kyoto-u.ac.jp/dst_realtime/lastmonth/index.html')
>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\dst.html', 'w') as fp:
	fp.write(html.text)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html.text, "lxml")
>>> data_text = soup.pre.text
>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\dst.txt', 'w') as fp:
	fp.write(data_text)
>>> import re
>>> r = re.compile('^\s?(\d{1,2})\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)$', flags=re.M)
>>> data_re = r.findall(data_text)
>>> data = dict()
>>> for day in data_re:
	data.update({int(day[0]): [int(day[hour+1]) for hour in range(24)]})

5. Deep data processing techniques

5.1 Data visualization

Data visualization is the visual representation of data: with the help of graphics, it conveys and communicates information clearly and effectively. It is an evolving concept whose boundaries keep expanding; in fact, data visualization is also considered one of the means of data mining.

Matplotlib is the most influential 2D plotting library for Python. It offers a complete set of command-style APIs similar to Matlab and is very well suited to interactive plotting. It can also easily be used as a drawing control embedded in GUI applications. Matplotlib can draw many kinds of charts, including ordinary line charts, histograms, pie charts, scatter plots and error-bar plots; it makes it convenient to customize the attributes of graphics, such as line type, color, thickness and font size; and it supports a subset of TeX typesetting commands, so mathematical formulas can be displayed elegantly inside figures.
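
For example, a few lines are enough to get an interactive figure with a TeX-style label in the legend (a trivial sketch of my own, not from the talk):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 200)
plt.plot(x, np.sin(x), label=r'$y=\sin(x)$')   # TeX-style formula in the legend
plt.legend(loc='upper right')
plt.grid(linestyle=':')
plt.show()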

Although matplotlib focuses mainly on plotting, and mainly on two-dimensional graphics, it also has a number of extensions that let us plot on geographic maps or combine Excel and 3D charts. In the matplotlib world these extensions are called toolkits: collections of functions focused on a particular topic (such as 3D plotting). Popular toolkits include Basemap, the GTK tools, the Excel tools, Natgrid, AxesGrid and mplot3d.
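
As a taste of the Basemap toolkit (assuming it is installed; this sketch is mine, not from the talk), a globe centered roughly over China with coastlines and country borders takes only a few lines:

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

m = Basemap(projection='ortho', lat_0=30, lon_0=105)  # orthographic globe centered near China
m.drawcoastlines(linewidth=0.5)                       # draw the coastlines
m.drawcountries(linewidth=0.5)                        # draw the country borders
plt.show()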

Pyecharts is also an excellent plotting library; in particular, its Geo geographic coordinate system is powerful and easy to use. I first got to know it through its JavaScript counterpart, echarts. However, Pyecharts' shortcomings are also prominent: first, there is no continuity between versions; second, it does not support TeX typesetting commands. The second issue in particular seriously restricts Pyecharts' room for development.

For 3D data visualization, I recommend PyOpenGL; there are also options such as VTK, Mayavi and Vispy. I have my own 3D library, already open-sourced at https://github.com/xufive/wxgl. Please refer to my article "Open-sourcing my 3D library WxGL: turning the epidemic map into a 3D globe in 40 lines of code".


For more about Matplotlib, please refer to my blog post "The three musketeers of mathematical modeling: MSN". For Basemap, there are more detailed application examples in "Python in action: grab real-time pneumonia data and draw a 2019-nCoV epidemic map".

5.2 Data mining

Data mining refers to the process of using algorithms to reveal implicit, previously unknown and potentially valuable information from large amounts of data. It is a decision-support process, based mainly on statistics, databases, visualization techniques, artificial intelligence, machine learning, pattern recognition and other technologies, that analyzes data in a highly automated way.

Below, we take the nationwide curve of daily confirmed NCP cases as an example to give a simple demonstration of curve fitting. Curve fitting is often used for trend prediction; common methods include least-squares curve fitting and fitting to a target function.

# -*- coding: utf-8 -*-

import time, json, requests
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

plt.rcParams['font.sans-serif'] = ['FangSong']  # set the default font
plt.rcParams['axes.unicode_minus'] = False      # keep '-' from rendering as a box in saved images

def get_day_list():
    """Get the daily data"""
    url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'%int(time.time()*1000)
    data = json.loads(requests.get(url=url).json()['data'])['chinaDayList']
    return [(item['date'], item['confirm']) for item in data]

def fit_exp():
    """Fit the curve"""
    def func(x, a, b):
        return np.power(a, (x+b))  # exponential function y = a^(x+b)

    _date, _y = zip(*get_day_list())
    _x = np.arange(len(_y))
    x = np.arange(len(_y)+1)
    fita, fitb = optimize.curve_fit(func, _x, _y, (2,0))
    y = func(x, fita[0], fita[1])  # fita holds the best-fit parameters

    plt.plot(_date, _y, label='Raw data')
    plt.plot(x, y, label='$%0.3f^{x+%0.3f}$'%(fita[0], fita[1]))
    plt.legend(loc='upper left')
    plt.gcf().autofmt_xdate()  # auto-tilt the date labels
    plt.grid(linestyle=':')    # show the grid
    plt.show()

if __name__ == '__main__':
    fit_exp()

The fitting result looks like this:
[Figure: fitted exponential curve vs. the raw data]
At present the whole country is committed to fighting the virus and the epidemic has gradually stabilized, so with an exponential function as the fitting target the deviation will grow larger and larger over time; in the early stage, however, this kind of fit has real reference value for trend prediction.

5.3 Data services

A data service provides the data for crawlers to grab. Python has many mature web frameworks, such as Django, Tornado and Flask, that make it easy to implement a data service. Of course, besides the service framework, a data service usually also needs a database. Due to time limits, here is just a demonstration of the most minimal possible data server:

PS D:\XufiveGit\2020Pyday\fb> python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Give it a try: the browser turns into a file browser.
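
If something a bit more structured than http.server is needed, a minimal Flask sketch could serve the crawler's CSV output as JSON. This is my own illustration, not part of the Github code; the route and the CSV path are assumptions (the path matches the mini framework below).

from flask import Flask, jsonify
import csv

app = Flask(__name__)

@app.route('/ncp')
def ncp():
    # read the per-province CSV written by the mini crawler and return it as JSON
    with open('fb/ncp.csv', newline='') as fp:
        return jsonify(list(csv.reader(fp)))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)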

6. Scheduling service framework

As mentioned earlier, a basic crawler framework consists of at least three parts: a scheduling server, a data downloader and a data processor. We will now use these three parts to demonstrate a minimal crawler framework.

6.1 Scheduling service module

APScheduler is one of my favorite modules for scheduling services; its full name is Advanced Python Scheduler. It is a lightweight Python framework for scheduling timed tasks, and it is very powerful. APScheduler offers several kinds of triggers. The mini framework in the next section uses the cron trigger, the most complex trigger in APScheduler: it supports cron-style syntax and can express very elaborate trigger patterns.
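
In isolation, the cron trigger looks like the sketch below; the job function and the schedule are placeholders of mine, chosen only to show how elaborate a cron expression can be (every 10 minutes, Monday to Friday, during hours 8 through 20).

from apscheduler.schedulers.blocking import BlockingScheduler

def tick():
    print('crawl once')  # placeholder for a real download task

scheduler = BlockingScheduler()
scheduler.add_job(tick, 'cron', day_of_week='mon-fri', hour='8-20', minute='*/10')
scheduler.start()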

6.2 A mini crawler framework

The whole thing, comments included, is only about fifty lines of code, yet it grabs data from Tencent's epidemic data service on a fixed schedule and parses and saves it as a CSV data file.

# -*- coding: utf-8 -*-

import os, time, json, requests
import multiprocessing as mp
from apscheduler.schedulers.blocking import BlockingScheduler

def data_obtain():
    """Obtain the data"""
    url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'%int(time.time()*1000)
    with open('fb/ncp.txt', 'w') as fp:
        fp.write(requests.get(url=url).json()['data'])
    print('Obtain OK')

def data_process():
    """Process the data"""
    while True:
        if os.path.isfile('fb/ncp.txt'):
            with open('fb/ncp.txt', 'r') as fp:
                data = json.loads(fp.read())
            with open('fb/ncp.csv', 'w') as fp:
                for p in data['areaTree'][0]['children']:
                    fp.write('%s,%d,%d,%d,%d\n'%(p['name'], p['total']['confirm'], p['total']['suspect'], p['total']['dead'], p['total']['heal']))
            os.remove('fb/ncp.txt')
            print('Process OK')
        else:
            print('No data file')
        time.sleep(10)

if __name__ == '__main__':
    # create and start the data processing subprocess
    p_process = mp.Process(target=data_process)  # create the data processing subprocess
    p_process.daemon = True                      # make the subprocess a daemon
    p_process.start()                            # start the data processing subprocess

    # create the scheduler
    scheduler = BlockingScheduler()

    # add the task
    scheduler.add_job(
        data_obtain,                 # the data acquisition task
        trigger = 'cron',            # use the cron trigger
        minute = '*/1',              # execute every minute
        misfire_grace_time = 30      # skip the run if the job has not started within 30 seconds
    )

    # start the scheduling service
    scheduler.start()
Copyright notice
This article was written by [Tianyuan prodigal son]. Please include a link to the original when reposting. Thank you.
