Python crawler starter project

Pig brother 66 2020-11-13 07:32:38


What is Python?

Python was created by the famous "Uncle Gui", Guido van Rossum, during the Christmas holiday of 1989, as a programming language written to kill time over a boring Christmas.

Its founder, Guido van Rossum, is a fan of the BBC comedy series Monty Python's Flying Circus, so he named the programming language he created Python.

"Life is short, I use Python" — adapted from "Life is short, you need Python".

The British English pronunciation of Python is /ˈpaɪθən/, roughly "PIE-thuhn"; the American pronunciation is /ˈpaɪθɑːn/, roughly "PIE-thahn". MIT professors tend to use the American pronunciation, while most people in China use the British one.

In 2017, Python was indisputably number one and is often called the first language of AI. With big data and artificial intelligence booming, Python deserves that title. As for C, C++, and Java, which have long been the big brothers, if you compare sheer volume of existing code, I believe Java still dwarfs the other languages.

Judging from this year's programming language trends, Java is still the most widely used: apps, the web, and cloud computing are all inseparable from it. Python, by contrast, has a gentler learning path, and if you want to get into programming and catch up with the trend, Python is already the best language to start with.

Many large websites are developed with Python. In China: Douban, Sohu, Kingsoft, Tencent, Shanda, NetEase, Baidu, Alibaba, Taobao, Rekoo, Tudou, Sina, Guokr, and more; abroad: Google, NASA, YouTube, Facebook, Industrial Light & Magic, Red Hat, and more.

Python will be included in the college entrance examination

Zhejiang Province has announced its information technology curriculum reform plan: Python is entering the Zhejiang information technology college entrance examination, and from 2018 the programming language in the province's information technology textbooks will switch from VB to Python. It is not just Zhejiang: Beijing and Shandong, both major education provinces, have also decided to include Python programming basics in their information technology curricula and college entrance examinations, so Python courses are set to become a trend in children's education. In particular, Shandong's newly published sixth-grade primary school information technology textbook already includes Python content, so even primary school pupils are starting to learn Python!

If you stop learning, you will soon be overtaken by primary school students...

 

Python Introductory tutorial

What Python can do

  • Web crawler
  • Web application development
  • System and network operations and maintenance
  • Scientific and numerical computing
  • GUI development
  • Network programming
  • Natural language processing (NLP)
  • Artificial intelligence
  • Blockchain
  • And much more...

Getting started with a Python crawler

This is my first Python project, and I'm sharing it here~

  • Requirement
    • The product we are currently developing roughly works like this: a user receives an SMS, for example after buying a movie ticket, train ticket, or plane ticket. The app then reads the SMS, parses it, extracts the time and place, and the backend automatically creates a memo that reminds the user 1 hour before the event starts.
  • Design
    • At first we put the parsing logic on the server side, but out of concern for user privacy we later moved it into the app. The server is now only responsible for collecting data and pushing new data to the app.
    • The server side has two main functions: one, responding to requests from the app and returning data; two, crawling data and storing it in the database.
    • Responding to requests and returning data is done in Java, while crawling data and storing it in the database is done in Python. Different languages are used because each has its own strengths: Java is more efficient than Python and well suited to the web side, while crawling is not performance-critical, and Python's syntax plus its large number of libraries make it well suited to crawlers.
  • Code
    • This project uses Python 3.
    • To get the source code: scan the WeChat official account 「Naked pigs」 below and reply "crawler introduction" to obtain it.
       

       

    • To understand this project you only need a simple Python foundation; being able to read Python syntax is enough. In fact I started writing it before I had finished learning Python myself, and simply searched Baidu whenever I ran into a problem. Learning by doing is not boring at all, because with Python you can do interesting things, for example simulating repeated logins to earn points, or the Python script I was writing recently. I recommend reading Liao Xuefeng's Python introductory tutorial.
    • First, let's look at my directory structure. At first I planned to define a very thorough, well-organized layout, but I found that, since I was not yet familiar with the frameworks and this is just a beginner-level project, it was better to give that up and start from the simplest structure possible:
    • Going through the files from top to bottom: __init__.py is the identification file of a package. In Python a package is a folder, and it becomes a package once it contains an __init__.py file. Here I import the various py modules into this package so they can be called by the other py files.

__init__.py

# -*- coding: UTF-8 -*-
# import need manager module
import MongoUtil
import FileUtil
import conf_dev
import conf_test
import scratch_airport_name
import scratch_flight_number
import scratch_movie_name
import scratch_train_number
import scratch_train_station
import MainUtil
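
A note on imports: the plain `import MongoUtil` style above works when the scripts are run directly from the project directory. If the folder is instead imported as a package under Python 3, implicit relative imports are no longer resolved; a minimal sketch of the explicit form (assuming the same module names as the files above) would be:

# -*- coding: UTF-8 -*-
# Hedged Python 3 alternative: explicit relative imports, for the case where the
# folder is imported as a package rather than run as loose scripts.
from . import MongoUtil
from . import FileUtil
from . import conf_dev
from . import conf_test
from . import scratch_airport_name
from . import scratch_flight_number
from . import scratch_movie_name
from . import scratch_train_number
from . import scratch_train_station
from . import MainUtil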

The next two files are configuration files. The first is for the development environment (Windows), the second for the test environment (Linux); the appropriate configuration file is chosen at runtime according to the operating system.

conf_dev.py

# -*- coding: UTF-8 -*-
# the configuration file of develop environment
# path configure
data_root_path = 'E:/APK98_GNBJ_SMARTSERVER/Proj-gionee-data/smart/data'
# mongodb configure
user = "cmc"
pwd = "123456"
server = "localhost"
port = "27017"
db_name = "smartdb"

conf_test.py

# -*- coding: UTF-8 -*-
# the configuration file of test environment
#path configure
data_root_path = '/data/app/smart/data'
#mongodb configure
user = "smart"
pwd = "123456"
server = "10.8.0.30"
port = "27017"
db_name = "smartdb"

The following file is a utility module. It mainly reads the contents of the original resource file and appends new content to it.

FileUtil.py

# -*- coding: UTF-8 -*-
import conf_dev
import conf_test
import platform

# configure multi-environment
# Check the current operating system and pick the matching configuration module
platform_os = platform.system()
config = conf_dev
if platform_os == 'Linux':
    config = conf_test

# path
data_root_path = config.data_root_path


# load old data
def read(resources_file_path, encode='utf-8'):
    file_path = data_root_path + resources_file_path
    outputs = []
    for line in open(file_path, encoding=encode):
        # lines starting with "//" are date comments, skip them
        if not line.startswith("//"):
            outputs.append(line.strip('\n').split(',')[-1])
    return outputs


# append new data to the resource file
def append(resources_file_path, data, encode='utf-8'):
    file_path = data_root_path + resources_file_path
    with open(file_path, 'a', encoding=encode) as f:
        f.write(data)
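
As a rough usage sketch (the path below is just one of the resource files used later; the data format mirrors what MainUtil writes: a //-prefixed date line followed by one entry per line):

# Hypothetical usage of FileUtil; the path is an example resource file.
import FileUtil

old_trains = FileUtil.read('/resources/train/trainNameList.ini')   # e.g. ['G101', 'G102', ...]
FileUtil.append('/resources/train/trainNameList.ini', '//2017-12-07\nG103\n')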

The following main method controls the execution flow; each of the crawling scripts below calls this main method.

MainUtil.py

# -*- coding: UTF-8 -*-
import sys
from datetime import datetime
import MongoUtil
import FileUtil


# @param resources_file_path  path of the resource file
# @param base_url             URL to crawl
# @param scratch_func         crawling function to call
def main(resources_file_path, base_url, scratch_func):
    old_data = FileUtil.read(resources_file_path)   # read the existing resources
    new_data = scratch_func(base_url, old_data)     # crawl for new resources
    if new_data:  # if there is new data
        # prefix the new data with the current date
        date_new_data = "//" + datetime.now().strftime('%Y-%m-%d') + "\n" + "\n".join(new_data) + "\n"
        FileUtil.append(resources_file_path, date_new_data)   # append the new data to the file
        MongoUtil.insert(resources_file_path, date_new_data)  # insert the new data into the MongoDB database
    else:  # if there is no new data, just print a log line
        print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'), '----', getattr(scratch_func, '__name__'), ": nothing to update ")
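
As a hedged sketch of how the scripts below drive this method (the function, URL, and file name here are placeholders, not part of the project):

# Hypothetical example of driving MainUtil.main with a trivial crawling function.
import MainUtil

def scratch_demo(base_url, old_data):
    # a real implementation would request base_url and parse the response
    return [item for item in ['A1', 'A2'] if item not in old_data]

if __name__ == '__main__':
    MainUtil.main('/resources/demo/demoList.ini', 'http://example.com', scratch_demo)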

Insert the updated content into MongoDB:

MongoUtil.py

# -*- coding: UTF-8 -*-
import platform
from pymongo import MongoClient
from datetime import datetime, timedelta, timezone
import conf_dev
import conf_test

# configure multi-environment
platform_os = platform.system()
config = conf_dev
if platform_os == 'Linux':
    config = conf_test

# mongodb connection uri
uri = 'mongodb://' + config.user + ':' + config.pwd + '@' + config.server + ':' + config.port + '/' + config.db_name


# Write data to mongodb
# @author chenmc
# @param uri        connection string for mongodb
# @param path       field saved to mongodb
# @param data       field saved to mongodb
# @param operation  field saved to mongodb, default value 'append'
# @date 2017/12/07 16:30
# Before first use, insert an auto-increment counter into mongodb:
#   db.sequence.insert({ "_id" : "version", "seq" : 1 })
def insert(path, data, operation='append'):
    client = MongoClient(uri)
    resources = client.smartdb.resources
    sequence = client.smartdb.sequence
    seq = sequence.find_one({"_id": "version"})["seq"]             # get the auto-increment id
    sequence.update_one({"_id": "version"}, {"$inc": {"seq": 1}})  # increment the id by 1
    post_data = {
        "_class": "com.gionee.smart.domain.entity.Resources", "version": seq, "path": path,
        "content": data, "status": "enable", "operation": operation,
        "createtime": datetime.now(timezone(timedelta(hours=8)))}
    resources.insert_one(post_data)  # insert the document (insert_one replaces the deprecated insert)
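
The sequence counter mentioned in the comment can also be created once from Python instead of the mongo shell; a minimal sketch, reusing the uri defined in MongoUtil above:

# Hypothetical one-time initialization of the "sequence" counter with pymongo.
from pymongo import MongoClient
import MongoUtil

client = MongoClient(MongoUtil.uri)
client.smartdb.sequence.insert_one({"_id": "version", "seq": 1})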

Third-party libraries used by the project; you can install them with pip install -r requirements.txt.

requirements.txt

# modules that need to be installed
# note: json is part of the standard library and does not need to be installed
bs4
pymongo
requests

Now come the actual crawling scripts. These five py files crawl five kinds of information: airport names, flight numbers, movie names, train numbers, and train stations. They all share the same structure, as follows:

Part one: define the URL to crawl;
Part two: fetch the data, compare it with the old data, and return the new entries;
Part three: a main block that writes the new data to the file and to MongoDB;
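
As a hedged sketch, the shared skeleton of these five scripts looks roughly like this (the names and URL here are placeholders, not actual project files):

# -*- coding: UTF-8 -*-
# Illustrative skeleton shared by the five crawling scripts below.
import requests
import MainUtil

resources_file_path = '/resources/example/exampleList.ini'    # part one: resource file and URL
scratch_url = 'http://example.com/list'

def scratch_example(scratch_url, old_items):                  # part two: fetch, compare, return new entries
    new_items = []
    data = requests.get(scratch_url).text
    # ... parse `data` and append entries not already in old_items ...
    return new_items

if __name__ == '__main__':                                    # part three: write new data to file and MongoDB
    MainUtil.main(resources_file_path, scratch_url, scratch_example)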

scratch_airport_name.py: crawl airport names nationwide

# -*- coding: UTF-8 -*-
import requests
import bs4
import json
import MainUtil

resources_file_path = '/resources/airplane/airportNameList.ini'
scratch_url_old = 'https://data.variflight.com/profiles/profilesapi/search'
scratch_url = 'https://data.variflight.com/analytics/codeapi/initialList'
get_city_url = 'https://data.variflight.com/profiles/Airports/%s'


# Fetch the URL and compare with the old data: entries that already exist are skipped,
# entries that do not exist yet are collected, and the new data is returned at the end.
def scratch_airport_name(scratch_url, old_airports):
    new_airports = []
    data = requests.get(scratch_url).text
    all_airport_json = json.loads(data)['data']
    for airport_by_word in all_airport_json.values():
        for airport in airport_by_word:
            if airport['fn'] not in old_airports:
                get_city_uri = get_city_url % airport['id']
                data2 = requests.get(get_city_uri).text
                soup = bs4.BeautifulSoup(data2, "html.parser")
                # the label text on the airport page (Chinese for "City")
                city = soup.find('span', text="城市").next_sibling.text
                new_airports.append(city + ',' + airport['fn'])
    return new_airports


# When this script is executed directly, the main method is called (like main in Java)
if __name__ == '__main__':
    MainUtil.main(resources_file_path, scratch_url, scratch_airport_name)

scratch_flight_number.py: crawl flight numbers nationwide

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
import bs4
import MainUtil

resources_file_path = '/resources/airplane/flightNameList.ini'
scratch_url = 'http://www.variflight.com/sitemap.html?AE71649A58c77='


def scratch_flight_number(scratch_url, old_flights):
    new_flights = []
    data = requests.get(scratch_url).text
    soup = bs4.BeautifulSoup(data, "html.parser")
    a_flights = soup.find('div', class_='list').find_all('a', recursive=False)
    for flight in a_flights:
        # the second condition skips the page's header link (a Chinese label, rendered here in translation)
        if flight.text not in old_flights and flight.text != ' List of domestic segments ':
            new_flights.append(flight.text)
    return new_flights


if __name__ == '__main__':
    MainUtil.main(resources_file_path, scratch_url, scratch_flight_number)

scratch_movie_name.py: crawl the latest movies

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
import requests
import bs4
import json
import MainUtil

# relative path; this path is also stored in the database
resources_file_path = '/resources/movie/cinemaNameList.ini'
scratch_url = 'http://theater.mtime.com/China_Beijing/'


# scratch data from the defined url
def scratch_latest_movies(scratch_url, old_movies):
    data = requests.get(scratch_url).text
    soup = bs4.BeautifulSoup(data, "html.parser")
    new_movies = []
    # the currently showing movies are embedded in a JavaScript variable on the page
    new_movies_json = json.loads(
        soup.find('script', text=re.compile("var hotplaySvList")).text.split("=")[1].replace(";", ""))
    coming_movies_data = soup.find_all('li', class_='i_wantmovie')
    # movies now showing
    for movie in new_movies_json:
        move_name = movie['Title']
        if move_name not in old_movies:
            new_movies.append(movie['Title'])
    # upcoming movies
    for coming_movie in coming_movies_data:
        coming_movie_name = coming_movie.h3.a.text
        if coming_movie_name not in old_movies and coming_movie_name not in new_movies:
            new_movies.append(coming_movie_name)
    return new_movies


if __name__ == '__main__':
    MainUtil.main(resources_file_path, scratch_url, scratch_latest_movies)

scratch_train_number.py: crawl train numbers nationwide

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
import bs4
import json
import MainUtil

resources_file_path = '/resources/train/trainNameList.ini'
scratch_url = 'http://www.59178.com/checi/'


def scratch_train_number(scratch_url, old_trains):
    new_trains = []
    resp = requests.get(scratch_url)
    # re-encode with the guessed encoding, then decode as gb2312 because the page is gb2312 encoded
    data = resp.text.encode(resp.encoding).decode('gb2312')
    soup = bs4.BeautifulSoup(data, "html.parser")
    a_trains = soup.find('table').find_all('a')
    for train in a_trains:
        if train.text not in old_trains and train.text:
            new_trains.append(train.text)
    return new_trains


if __name__ == '__main__':
    MainUtil.main(resources_file_path, scratch_url, scratch_train_number)

scratch_train_station.py: crawl train stations nationwide

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
import bs4
import random
import MainUtil

resources_file_path = '/resources/train/trainStationNameList.ini'
scratch_url = 'http://www.smskb.com/train/'


def scratch_train_station(scratch_url, old_stations):
    new_stations = []
    provinces_eng = (
        "Anhui", "Beijing", "Chongqing", "Fujian", "Gansu", "Guangdong", "Guangxi", "Guizhou", "Hainan", "Hebei",
        "Heilongjiang", "Henan", "Hubei", "Hunan", "Jiangsu", "Jiangxi", "Jilin", "Liaoning", "Ningxia", "Qinghai",
        "Shandong", "Shanghai", "Shanxi", "Shanxisheng", "Sichuan", "Tianjin", "Neimenggu", "Xianggang", "Xinjiang",
        "Xizang", "Yunnan", "Zhejiang")
    provinces_chi = (
        "安徽", "北京", "重庆", "福建", "甘肃", "广东", "广西", "贵州", "海南", "河北",
        "黑龙江", "河南", "湖北", "湖南", "江苏", "江西", "吉林", "辽宁", "宁夏", "青海",
        "山东", "上海", "陕西", "山西", "四川", "天津", "内蒙古", "香港", "新疆", "西藏",
        "云南", "浙江")
    # crawl the station list page of each province in turn
    for i in range(len(provinces_eng)):
        cur_url = scratch_url + provinces_eng[i] + ".htm"
        resp = requests.get(cur_url)
        # the pages are gbk encoded, so re-decode the response accordingly
        data = resp.text.encode(resp.encoding).decode('gbk')
        soup = bs4.BeautifulSoup(data, "html.parser")
        a_stations = soup.find('left').find('table').find_all('a')
        for station in a_stations:
            if station.text not in old_stations:
                new_stations.append(provinces_chi[i] + ',' + station.text)
    return new_stations


if __name__ == '__main__':
    MainUtil.main(resources_file_path, scratch_url, scratch_train_station)

I put the project on the test server (CentOS 7) to run, and wrote a crontab to call the scripts regularly. The crontab is posted below.

/etc/crontab

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# For details see man 4 crontabs
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
0 0 * * * root python3 /data/app/smart/py/scratch_movie_name.py >> /data/logs/smartpy/out.log 2>&1
0 1 * * 1 root python3 /data/app/smart/py/scratch_train_station.py >> /data/logs/smartpy/out.log 2>&1
0 2 * * 2 root python3 /data/app/smart/py/scratch_train_number.py >> /data/logs/smartpy/out.log 2>&1
0 3 * * 4 root python3 /data/app/smart/py/scratch_flight_number.py >> /data/logs/smartpy/out.log 2>&1
0 4 * * 5 root python3 /data/app/smart/py/scratch_airport_name.py >> /data/logs/smartpy/out.log 2>&1

Follow-up

So far, the project has been running normally for more than three months...

Feedback on problems

If you run into any problems while reading or learning, feedback is welcome; you can reach me through any of the following channels:

  • WeChat official account : Naked pigs
  • Leave a message below
  • Send me a private message directly

About the official account

  • Free activation codes for various software may be provided later
  • Articles on programming topics such as Python and Java, plus interview tips, will be pushed
  • Of course, you can also send me whatever you are interested in
  • Thank you for your sincere attention; all proceeds from this official account will be given back to you in the form of lotteries
  • If I start a business in the future, I will also look for partners among the followers of this official account~
  • I hope you will share this, so that more friends who want to learn Python can see it~

 

 

Copyright notice
This article was written by [Pig brother 66]. Please include a link to the original when reposting. Thank you.
