An Illustrated Introduction to Python Crawlers: A Few of the Simplest Examples to Get You Started

DeROy 2021-01-20 13:21:46


1. Preface

Crawling has always been a major application scenario for Python. Crawlers can be written in almost any language, but programmers love Python; I prefer it for its simple syntax, which lets us write a crawler program very easily. This blog post uses Python and a few very simple examples to introduce you to Python crawlers.

2. Web Crawlers

If we compare the Internet to a complex spider web, then our crawler is a spider. We can let this spider crawl over the web, looking for "prey" that is valuable to us.

A web crawler is built on top of the web, so the foundation of web crawling is the network request. In daily life we use a browser to browse the web: we type a URL into the address bar, press Enter, and a few seconds later a web page is displayed.

On the surface we just pressed a few keys, but behind the scenes the browser completed several operations for us:

1. The browser sends a network request to the server
2. The server receives and processes your request
3. The server returns the data you need
4. The browser parses the data and displays it in the form of a web page

We can compare the above process to everyday shopping:

1. You tell the owner you want a pearl milk tea
2. The owner checks whether the store has what you want
3. The owner takes out the ingredients for making the milk tea
4. The owner makes the milk tea and hands it to you

The milk-tea example above is not a perfect fit, but I think it is a good way to explain what a network request is.

Once we know what a network request is, we can learn what a crawler is. A crawler is, in essence, also a network request: where we normally use a browser, a crawler simulates the network-request process with a program. But a bare network request is not yet a crawler; crawlers usually have a purpose. For example, if I want to write a crawler that collects pictures, I need to filter and match the requested data to find the parts that are valuable to me. That is the whole crawling process, from requesting the web to extracting from it.

Sometimes a website's anti-crawling protection is weak, and we can find its API directly in the browser. Through that API we can get the data we need directly, which is much simpler.
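
As a minimal sketch of this approach (the endpoint below is a hypothetical placeholder, not a real API; substitute the one you find in the browser's network panel):

import requests

# Hypothetical API endpoint found in the browser's network panel
api_url = 'https://example.com/api/articles?page=1'
response = requests.get(api_url)
# Many APIs return JSON, which requests can decode directly
data = response.json()
print(data)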

3. Simple Crawlers

A simple crawler is just a simple network request, possibly with some simple processing of the requested data. Python provides a native network-request module, urllib, as well as a third-party package, requests, that wraps this functionality at a higher level. Compared with urllib, requests is more convenient and easier to use, so this article uses requests to make network requests.
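
To see the difference, here is a minimal sketch of the same GET request written with both modules:

import urllib.request
import requests

url = 'http://www.baidu.com'

# Standard-library urllib: open the URL and read the raw bytes
with urllib.request.urlopen(url) as resp:
    data_urllib = resp.read()

# Third-party requests: one call, and the bytes are available on .content
data_requests = requests.get(url).content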

3.1 Crawling a Simple Web Page

When we send a request, the returned data can take many forms: HTML code, JSON data, XML data, or a binary stream. Let's take the Baidu home page as an example and crawl it:

import requests
# Send a request with the GET method and receive the response
response = requests.get('http://www.baidu.com')
# Open a file for binary writing
f = open('index.html', 'wb')
# Write the byte stream of the response to the file
f.write(response.content)
# Close the file
f.close()
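
In practice it is worth checking the response before saving it. A minimal sketch:

import requests

response = requests.get('http://www.baidu.com')
# A status code of 200 means the request succeeded
print(response.status_code)
# The encoding requests inferred from the headers, and from the content itself
print(response.encoding, response.apparent_encoding)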

Let's open the saved index.html in a browser and see what the page looks like:

This is our familiar Baidu page, and it looks fairly complete. Other websites can give a different result; let's use CSDN as an example:

As you can see, the page layout is completely broken and a lot is missing. Anyone who has studied front-end development knows that a web page consists of an HTML document plus many static files. We only crawled the HTML code; the static resources linked from the HTML, such as CSS stylesheets and image files, were not crawled, which is why the page looks so strange.
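
To make the saved page look closer to the original, we could also download the linked stylesheets. A rough sketch, assuming the stylesheets are referenced through simple href attributes (the regular expression below is a simplification and will not cover every linking style):

import re
import requests
from urllib.parse import urljoin

url = 'http://www.baidu.com'
html = requests.get(url).text
# Naively match the href values of linked .css files
for i, css_path in enumerate(re.findall(r'<link[^>]+?href="([^"]+?\.css)"', html)):
    # Resolve relative paths against the page URL
    css_url = urljoin(url, css_path)
    with open(str(i) + '.css', 'wb') as f:
        f.write(requests.get(css_url).content)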

3.2 Crawling the Images in a Web Page

First we need to be clear: when crawling a simple web page, we obtain images or videos by matching the URL information inside the page, that is, the web addresses we are all familiar with. We then download the picture through this specific URL, which completes the image crawl. Take the following URL: https://img-blog.csdnimg.cn/2020051614361339.jpg. We will use this image URL to demonstrate the download code:

import requests
# Prepare the URL
url = 'https://img-blog.csdnimg.cn/2020051614361339.jpg'
# Send a GET request
response = requests.get(url)
# Open the image file in binary-write mode
f = open('test.jpg', 'wb')
# Write the byte stream to the file
f.write(response.content)
# Close the file
f.close()

As you can see, the code is the same as the page crawl above; the only difference is that the file is opened with a .jpg suffix. Writing binary content is the proper approach for images, video, and audio files. For text information such as HTML code, we usually take the text directly via response.text; once we have the text we can match the image URLs inside it. Let's take http://topit.pro as an example:

import re
import requests
# The website to crawl
url = 'http://topit.pro'
# Get the page source
response = requests.get(url)
# Match the image resources in the source
results = re.findall(r'<img[\s\S]+?src="(.+?)"', response.text)
# Counter used to name the downloaded files
name = 0
# Iterate over the results
for result in results:
    # The image paths in the source are absolute paths, so the full URL is the site root + the path
    img_url = url + result
    # Download the image
    f = open(str(name) + '.jpg', 'wb')
    f.write(requests.get(img_url).content)
    f.close()
    name += 1

With that, we have crawled a whole website. For the matching we used a regular expression. Regular expressions are a large topic that won't be expanded on here; interested readers can explore them on their own. Python's regex support lives in the re module, whose findall function matches all required strings in a text. It takes two arguments: the first is a regular expression, the second is the string to match against. If you don't know regular expressions, it's enough to know that this one extracts the content of each image's src attribute.
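
For instance, a self-contained demonstration of the pattern used above:

import re

html = '<img class="a" src="/img/1.jpg"><img class="b" src="/img/2.jpg">'
# findall returns every captured src value
print(re.findall(r'<img[\s\S]+?src="(.+?)"', html))
# Output: ['/img/1.jpg', '/img/2.jpg']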

4. Using BeautifulSoup to Parse HTML

BeautifulSoup is a module for parsing XML and HTML documents. Above we used regular expressions for pattern matching, but writing regular expressions by hand is a complicated, error-prone process. If we hand the parsing over to BeautifulSoup, it greatly reduces our workload. Let's install it before using it.

4.1 Installing and Using BeautifulSoup

We install it directly with pip:

pip install beautifulsoup4

The module is imported as follows:

from bs4 import BeautifulSoup

Let's take a look at how BeautifulSoup is used, testing it with the following HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<img class="test" src="1.jpg">
<img class="test" src="2.jpg">
<img class="test" src="3.jpg">
<img class="test" src="4.jpg">
<img class="test" src="5.jpg">
<img class="test" src="6.jpg">
<img class="test" src="7.jpg">
<img class="test" src="8.jpg">
</body>
</html>

It's a very simple HTML page: the body contains 8 img tags, and we now want to get their src values. The code is as follows:

from bs4 import BeautifulSoup
# Read the HTML file
f = open('test.html', 'r', encoding='utf-8')
html = f.read()
f.close()
# Create a BeautifulSoup object; the first argument is the string to parse, the second is the parser
soup = BeautifulSoup(html, 'html.parser')
# Match content: the first argument is the tag name, the second the qualifying attributes;
# here we match img tags whose class is test
img_list = soup.find_all('img', {'class': 'test'})
# Iterate over the tags
for img in img_list:
    # Get the src value of the img tag
    src = img['src']
    print(src)

The results are as follows:

1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
7.jpg
8.jpg

That's exactly what we need.
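
As a side note, BeautifulSoup also supports CSS selectors through select (standard BeautifulSoup functionality, shown here as an alternative to find_all). Continuing from the example above:

# Equivalent query with a CSS selector: img tags with class "test"
for img in soup.select('img.test'):
    print(img['src'])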

4.2 BeautifulSoup in Practice

Now that we can parse web pages and extract the src values inside them, we can crawl resources such as images. Let's take Pear Video as an example and crawl a video. The home page address is https://www.pearvideo.com/. Right-clicking the page and inspecting it opens the browser's developer tools.

In the element panel we can pick the part of the page we want to crawl, and the panel jumps to the corresponding element. The area turns out to be wrapped in an a tag, and the page that clicking it jumps to is given by that a tag's href value. Since the href value is a path relative to the site, the full URL should be the site root + the href value. Knowing this, we can take the next step and extract the jump URLs from the home page:

import requests
from bs4 import BeautifulSoup
# The site root
url = 'https://www.pearvideo.com/'
# Simulate a browser visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# Send the request
response = requests.get(url, headers=headers)
# Get a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Find the a tags that meet our requirements
video_list = soup.find_all('a', {'class': 'actwapslide-link'})
# Iterate over the tags
for video in video_list:
    # Get the href and join it into a complete URL
    video_url = video['href']
    video_url = url + video_url
    print(video_url)

The output is as follows:

https://www.pearvideo.com/video_1674906
https://www.pearvideo.com/video_1674921
https://www.pearvideo.com/video_1674905
https://www.pearvideo.com/video_1641829
https://www.pearvideo.com/video_1674822
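
One caveat worth noting (an aside, not from the original post): plain string concatenation breaks when the href already starts with / or is an absolute URL. The standard library's urljoin handles all of these cases:

from urllib.parse import urljoin

base = 'https://www.pearvideo.com/'
# urljoin resolves relative paths and leading slashes correctly
print(urljoin(base, 'video_1674906'))   # https://www.pearvideo.com/video_1674906
print(urljoin(base, '/video_1674906'))  # https://www.pearvideo.com/video_1674906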

Let's crawl just one of them. If we visit the first URL and view the page source, we find a line like this:

var contId="1674906",liveStatusUrl="liveStatus.jsp",liveSta="",playSta="1",autoPlay=!1,isLiving=!1,isVrVideo=!1,hdflvUrl="",sdflvUrl="",hdUrl="",sdUrl="",ldUrl="",srcUrl="https://video.pearvideo.com/mp4/adshort/20200517/cont-1674906-15146856_adpkg-ad_hd.mp4",vdoUrl=srcUrl,skinRes="//www.pearvideo.com/domain/skin",videoCDN="//video.pearvideo.com";

Here srcUrl contains the URL of the video file. We certainly can't dig it out of every page by hand, so we use a regular expression:

import re
# Get the source of the single video page
response = requests.get(video_url)
# Match the video URL
results = re.findall('srcUrl="(.*?)"', response.text)
# Output the result
print(results)

The result is as follows:

['https://video.pearvideo.com/mp4/adshort/20200516/cont-1674822-14379289-191950_adpkg-ad_hd.mp4']

Then we can download this video:

with open('result.mp4', 'wb') as f:
    f.write(requests.get(results[0], headers=headers).content)
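
For large video files, a more memory-friendly variant (an optional refinement, assuming the same results and headers as above) streams the response in chunks instead of loading it all into memory:

# Stream the download so the whole video never sits in memory at once
with requests.get(results[0], headers=headers, stream=True) as r:
    with open('result.mp4', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)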

The complete code is as follows:

import re
import requests
from bs4 import BeautifulSoup
# The site root
url = 'https://www.pearvideo.com/'
# Simulate a browser visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# Send the request
response = requests.get(url, headers=headers)
# Get a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Find the a tags that meet our requirements
video_list = soup.find_all('a', {'class': 'actwapslide-link'})
# Take the first video and build its full page URL
video_url = url + video_list[0]['href']
# Fetch the video page and match out the video file URL
response = requests.get(video_url, headers=headers)
results = re.findall('srcUrl="(.*?)"', response.text)
# Download the video
with open('result.mp4', 'wb') as f:
    f.write(requests.get(results[0], headers=headers).content)

With that, we have implemented several different crawlers, from simple web pages to images to videos.

End

