Crawling Taobao product information with Python

Pig brother 66 2020-11-13 07:32:53

Dear students, I haven't written an original technical article in a long time. I've been busy lately, so updates have been slow. Sorry about that.

Warning: this tutorial is for learning and exchange only. Do not use it for commercial profit; anyone who does bears the consequences themselves! If this article infringes on the privacy or interests of any organization, group, or company, please contact brother pig to have it removed!

Taobao tutorial series:

  • Part 1: Simulated Taobao login with Python, a detailed walkthrough of logging in to Taobao's PC site with the requests library.
  • Part 2: Taobao auto login 2.0, which adds cookie serialization and shows how to save cookies.
  • Part 3: Crawling Taobao condoms with Python, which shows how to crawl product information from Taobao's PC site.
  • Part 4: Analyzing 2000 condoms with Python, which shows how to analyze the data and draw conclusions.


One. Taobao login review

We previously showed how to log in to Taobao with the requests library and received a lot of feedback and questions from students. Brother pig is very pleased, and apologizes to the students who didn't get a timely reply!

About the login function: the code itself has no problem. If you get the "apply st code failed" error when logging in, you can replace the login parameters used in the _verify_password method.

In Taobao login 2.0 we added cookie serialization to make it easier to crawl Taobao data, because logging in to Taobao frequently from the same IP may trigger Taobao's anti-crawling mechanism.

As for the login success rate: in brother pig's actual use it basically always succeeds. If it doesn't, change the login parameters as described above!

Two. Crawling Taobao product information

This article is mainly about how to crawl the data; the analysis comes in the next article. The two are split because crawling Taobao runs into so many problems, and brother pig wants to explain the crawling in detail. Considering the length and how much students can absorb at once, it is explained in two parts! The goal stays the same: make it understandable for beginners.

This crawl calls Taobao's PC search interface, extracts the returned data, and saves it as an Excel file!

A seemingly simple feature hides many problems. Let's go through them one by one!

Three. Crawling single-page data

Before writing a crawler project we need to break the work down into small steps, and the first step is usually to crawl a single page!

1. Find the URL that loads the data

Open Taobao in the browser and log in. Open the Chrome developer tools, click Network, check Preserve log, then enter the product name you want to search for in the search box.
This is the request for the first page. Looking at the data, we find that the returned product information is embedded in the web page rather than returned directly as pure JSON.

2. Is there an interface that returns pure JSON?

Brother pig then wondered whether there is an interface that returns pure JSON, so he clicked through to the next page (the second page).
After requesting the second page, brother pig found that the returned data was pure JSON. Comparing the two request URLs then reveals which parameter makes the response JSON only!
The comparison shows that if the search request URL carries the ajax=true parameter, JSON is returned directly. So can we simply simulate that request and fetch the JSON directly?

So brother pig used the second page's request parameters to request the data directly (that is, requested the JSON directly), but requesting the first page this way produced an error:
The response was just a link, with no JSON data. What is this link? Click it...
Ta-da, the slider appears. Some students will ask: **Can requests handle the Taobao slider?** Brother pig consulted some crawler experts. The slider works by collecting the response time, drag speed, duration, position, trajectory, number of retries, and so on, and then judging whether the slide was done by a human. The algorithm also changes often, so brother pig decided to give up on this route!

3. Use the web page request interface

So we choose the kind of request used for the first page (a request URL without the ajax=true parameter, which returns the entire web page) and then extract the data from it!

This way we can crawl Taobao's web page content.
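As a rough illustration of requesting the full web page instead of the JSON interface, here is a minimal sketch. It is not the author's code: the endpoint `https://s.taobao.com/search`, the `q` and `s` parameter names, and the page size of 44 are assumptions, and `session` is assumed to behave like the logged-in requests session from the earlier login articles.

```python
from urllib.parse import urlencode

# Assumed PC search endpoint; check the real URL in the Chrome Network tab
SEARCH_URL = "https://s.taobao.com/search"


def build_search_params(keyword, page=0, page_size=44):
    """Build query parameters for one page of search results.

    `q` (keyword), `s` (result offset) and the page size of 44 are assumptions.
    `ajax=true` is deliberately left out so the whole web page is returned.
    """
    return {"q": keyword, "s": page * page_size}


def fetch_search_page(session, keyword, page=0, timeout=10):
    """Fetch one search-result page as HTML text.

    `session` is expected to behave like a logged-in requests.Session.
    """
    response = session.get(
        SEARCH_URL, params=build_search_params(keyword, page), timeout=timeout
    )
    response.raise_for_status()
    return response.text


# Show the URL that would be requested for the second page of results
print(SEARCH_URL + "?" + urlencode(build_search_params("phone", page=1)))
```

Passing the session in as an argument keeps the sketch independent of how login and cookie loading are done.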

Four. Extracting product attributes

After crawling the web page, all that is left is to extract the data. Here we first extract the JSON data from the page and then parse the JSON to get the attributes we want.

1. Extract the product JSON data from the web page

Since we chose to request the entire page, we need to know where the data is embedded in the page and how to extract it.

After searching and comparing, brother pig found that the JS variable g_page_config in the returned page holds the product information we want, and it is in JSON format too!
Then we write a regular expression to extract the data!

import re
goods_match = re.search(r'g_page_config = (.*?)}};', response.text)
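To make the extraction step concrete, here is a small self-contained sketch that runs against a made-up page snippet. The pattern is a variant of the one above, not the author's exact code: it also captures the closing braces so the match can be passed straight to `json.loads`. The `sample_html` content and the function name `extract_page_config` are hypothetical.

```python
import json
import re

# Non-greedy match up to the first "}};", with the closing braces inside the group.
# Caveat: a "}};" sequence inside a JSON string value would cut the match short.
G_PAGE_CONFIG = re.compile(r"g_page_config = (\{.*?\}\});", re.S)


def extract_page_config(html):
    """Return the g_page_config object as a dict, or None if it is absent."""
    match = G_PAGE_CONFIG.search(html)
    if match is None:
        # Typically means a login page or slider page came back instead
        return None
    return json.loads(match.group(1))


# Made-up snippet standing in for a real search page
sample_html = (
    '<script>g_page_config = '
    '{"mods": {"itemlist": {"data": {"auctions": []}}}};</script>'
)
print(extract_page_config(sample_html))
```

Returning None instead of raising makes it easy for the caller to detect pages that carry no product data.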

2. Get product information such as price

To extract fields from the JSON data, we need to know the structure of the returned JSON. We can paste the data into a JSON plugin or an online JSON parser.
Once we understand the JSON structure, we can write a method that extracts the attributes we want.
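A method like that might look roughly like the sketch below. The original shows the JSON structure only in a screenshot, so the path `mods -> itemlist -> data -> auctions` and the field names (`raw_title`, `view_price`, `item_loc`, `view_sales`, `nick`) are assumptions; inspect the real g_page_config data and adjust them.

```python
def extract_goods(page_config):
    """Pull a few fields per product out of the parsed g_page_config dict.

    The nesting path and every field name below are assumptions,
    not confirmed by the article.
    """
    auctions = (
        page_config.get("mods", {})
        .get("itemlist", {})
        .get("data", {})
        .get("auctions", [])
    )
    return [
        {
            "title": item.get("raw_title"),
            "price": item.get("view_price"),
            "location": item.get("item_loc"),
            "sales": item.get("view_sales"),
            "shop": item.get("nick"),
        }
        for item in auctions
    ]


# Hypothetical sample shaped like the assumed structure
sample_config = {
    "mods": {"itemlist": {"data": {"auctions": [
        {"raw_title": "demo item", "view_price": "9.90", "nick": "demo shop"},
    ]}}}
}
print(extract_goods(sample_config))
```

Using `.get` throughout means a missing field becomes None in the output row instead of stopping the crawl with a KeyError.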

Five. Saving to Excel

There are many libraries for working with Excel, and someone online has written a comparison and evaluation of them. It is worth a look if you are interested.

Brother pig chose the pandas library to work with Excel, because pandas is easy to use and is a commonly used data analysis library!

1. Install the libraries

Working with Excel through pandas actually depends on other libraries, so we need to install several:

pip install xlrd
pip install openpyxl
pip install numpy
pip install pandas

2. Save to Excel

One pitfall here is that pandas has no append mode for writing Excel. You have to read the existing data first, append the new rows to it, and then write everything back to Excel!
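The read, append, write-back pattern can be sketched as follows. This is an illustration rather than the author's code: the function names and column names are hypothetical, and `pd.concat` is used for the append step because `DataFrame.append` was removed in pandas 2.0.

```python
import os

import pandas as pd


def append_rows(existing, new_rows):
    """Return a DataFrame holding the existing rows followed by new_rows."""
    new_frame = pd.DataFrame(new_rows)
    if existing is None or existing.empty:
        return new_frame
    return pd.concat([existing, new_frame], ignore_index=True)


def save_goods(path, new_rows):
    """Append new_rows to the Excel file at path, creating it if needed."""
    # Read what is already saved (if anything), then rewrite the whole file
    existing = pd.read_excel(path) if os.path.exists(path) else None
    append_rows(existing, new_rows).to_excel(path, index=False)


# In-memory demonstration; save_goods("goods.xlsx", rows) would write a file
first = append_rows(None, [{"title": "a", "price": "1.00"}])
both = append_rows(first, [{"title": "b", "price": "2.00"}])
print(both)
```

Rewriting the whole file on every page is simple but gets slower as the file grows; collecting rows in memory and writing once at the end is an alternative.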


Six. Batch crawling

Once the whole flow for a single crawl (crawl, extract, save) is done, we can run it in a loop to crawl in batches.
The wait time set here comes from brother pig's own practice, going from 3s to 5s to more than 10s. Crawl too frequently and the verification code shows up easily!
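A batch loop with a pause between pages could be sketched like this. The function name, the `fetch_page` callback, and the 15s upper bound are assumptions; only the "more than 10s" lower bound comes from the text above.

```python
import random
import time


def crawl_pages(fetch_page, page_count, min_delay=10.0, max_delay=15.0,
                sleep=time.sleep):
    """Call fetch_page(page) for each page, pausing between requests.

    fetch_page is whatever does crawl + extract + save for one page.
    The delay is randomized so requests are not evenly spaced.
    """
    results = []
    for page in range(page_count):
        results.append(fetch_page(page))
        if page < page_count - 1:
            # No need to wait after the last page
            sleep(random.uniform(min_delay, max_delay))
    return results


# Demonstration with a fake fetcher and a fake sleep that records the delays
recorded_delays = []
pages = crawl_pages(lambda p: f"page-{p}", 3, sleep=recorded_delays.append)
print(pages, len(recorded_delays))
```

The `sleep` argument exists only so the loop can be tried out without actually waiting; in a real run the default `time.sleep` is used.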
Brother pig crawled more than 2000 records over several runs.

Seven. Problems with crawling Taobao

Crawling Taobao runs into many problems. Here they are one by one:

1. Login problems

Problem: what if applying for the st code fails?
Answer: replace the login parameters used in the _verify_password method.

If the parameters are right, login will basically succeed!

2. Proxy pool

To keep his own IP from being banned, brother pig used a proxy pool. Crawling Taobao requires high-quality IPs; brother pig tried many free IPs found online and could not crawl with them.

But there is one site whose IPs are very good, Standing master: , which updates its IPs every hour. Many of the IPs brother pig tried from it can crawl Taobao.
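For readers who have not used proxies with requests, here is a minimal sketch of picking a proxy from a pool. The pool contents are placeholder addresses and `pick_proxy` is a hypothetical helper; the returned dict follows the scheme-to-proxy-URL mapping that requests accepts through its `proxies` argument.

```python
import random


def pick_proxy(pool):
    """Pick one "host:port" entry at random and format it for requests."""
    address = random.choice(pool)
    proxy_url = f"http://{address}"
    # requests accepts a dict mapping URL scheme to proxy URL
    return {"http": proxy_url, "https": proxy_url}


# Placeholder addresses from a documentation range, not real proxies
proxy_pool = ["192.0.2.10:8080", "192.0.2.11:3128"]
proxies = pick_proxy(proxy_pool)
print(proxies)
# With a requests session: session.get(url, proxies=proxies, timeout=10)
```

A fuller pool would also drop addresses that fail and refresh itself from the source site, which is the "automatic maintenance" the author lists as future work.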

3. Retry mechanism

To guard against ordinary request failures, brother pig added a retry mechanism to the crawl method!
This requires installing the retry library:

pip install retry
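The author's retry code is shown only as a screenshot. To illustrate the idea, here is a standard-library stand-in, not the retry library's own implementation; the name `retry_on_failure` and its parameters are hypothetical.

```python
import functools
import time


def retry_on_failure(tries=3, delay=1.0, exceptions=(Exception,),
                     sleep=time.sleep):
    """Decorator: call the function up to `tries` times before giving up."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts, surface the last error
                    sleep(delay)
        return wrapper
    return decorator


# Demonstration: a function that fails twice, then succeeds
attempts = []


@retry_on_failure(tries=3, delay=0, sleep=lambda seconds: None)
def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("temporary failure")
    return "ok"


print(flaky_fetch(), len(attempts))
```

Retrying helps with transient network errors only; it does not get past the slider, which needs a long wait as described below.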

4. The slider appears

Even with all of the above sorted out, the slider still appears. Brother pig tested many times, and it is most likely to show up after about 20 to 40 crawls.
When the slider appears, you can only wait half an hour before continuing, because the requests library cannot solve the slider for now. Later, after learning selenium and other frameworks, we will see whether they can!

5. The current state of this crawler

At present this crawler is not perfect; it is only a semi-finished product. There is plenty to improve, such as automatically maintaining the IP pool, multi-threaded segmented crawling, solving the slider problem, and so on. Let's improve this crawler together so it can become a well-rounded, sensible crawler!

To get the source code, follow the vx (WeChat) official account 「Naked pigs」 and reply: TaoBao

This article was written by [Pig brother 66]. Please include a link to the original when reposting. Thanks.
