Dear students, I haven't written an original technical article for a long time. I've been busy lately, so updates have been slow. I'm sorry.
Warning: this tutorial is for learning and exchange only. Please do not use it for commercial profit; anyone who does bears the consequences themselves! If this article infringes on the privacy or interests of any organization, group, or company, please contact Brother Pig to have it removed!
Taobao tutorial series:
We previously showed how to log in to Taobao with the requests library and received a lot of feedback and questions from students. Brother Pig is very pleased, and also apologizes to the students who did not get a timely reply!
A quick note on the login function: there is nothing wrong with the code. If login fails with an "apply st code failed" error, replace the login parameters in the _verify_password method.
In the improved Taobao login 2.0 we added cookie serialization. The purpose is to make it easier to access Taobao data, because logging in to Taobao frequently from the same IP may trigger Taobao's anti-crawling mechanism!
As for the login success rate: in Brother Pig's own use it basically always succeeds. If it does not, change the login parameters as described above!
This article is mainly about how to crawl the data; analyzing the data comes in the next article. They are split because crawling Taobao raises so many problems, and Brother Pig wants to explain the crawling in detail. Considering the length and how much students can absorb at once, the topic is covered in two parts. The goal stays the same: make it understandable for beginners!
This crawl calls Taobao's PC search interface, extracts the returned data, and saves it as an Excel file!
A seemingly simple feature hides many problems. Let's go through them one at a time!
Before writing any crawler project, we should break the work into small, measurable steps. Generally, the first step is to crawl a single page!
Open Taobao in the browser and log in. Open the Chrome DevTools window, click Network, check Preserve log, and then enter the product name you want to search for in the search box.
This is the request for the first page. Looking at the data, we found that the returned product information is embedded in the web page rather than returned directly as pure JSON!
So Brother Pig wondered whether there is an interface that returns pure JSON, and clicked through to the next page (that is, the second page).
After requesting the second page, Brother Pig found that the returned data was pure JSON. He then compared the URLs of the two requests to find the parameter that makes the interface return JSON only!
The comparison showed that if the search request URL carries the ajax=true parameter, JSON data is returned directly. So can we simply simulate that request and fetch the JSON directly?
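The comparison of the two request URLs can also be done in code rather than by eye. Below is a minimal sketch using only the standard library. The two URLs are made-up examples for illustration (the real requests carry many more parameters), and `diff_query_params` is a hypothetical helper name, not something from the original project.

```python
from urllib.parse import parse_qs, urlsplit


def diff_query_params(url_a: str, url_b: str) -> dict:
    """Return the query parameters whose values differ between two URLs.

    Each entry maps a parameter name to (value in url_a, value in url_b);
    None means the parameter is absent from that URL.
    """
    params_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    params_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    diff = {}
    for key in sorted(params_a.keys() | params_b.keys()):
        if params_a.get(key) != params_b.get(key):
            diff[key] = (params_a.get(key), params_b.get(key))
    return diff


# Made-up URLs, shortened for illustration only.
first_page = "https://s.taobao.com/search?q=phone"
second_page = "https://s.taobao.com/search?q=phone&ajax=true&s=44"

print(diff_query_params(first_page, second_page))
# {'ajax': (None, ['true']), 's': (None, ['44'])}
```

Pasting the two URLs copied from the Network panel into such a helper makes a parameter like ajax=true stand out immediately.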
So Brother Pig used the second page's request parameters to request the data directly (that is, to request the JSON directly), but the request for the first page came back with an error:
It returned a link instead of JSON data. What on earth is this link? Click it...
Ta-da, the slider appears. Some students will ask: **Can requests handle the Taobao slider?** Brother Pig consulted a few crawler experts. The slider works by collecting the response time, drag speed, duration, position, trajectory, number of retries, and so on, and then judging whether the slide was done by a human. The algorithm also changes often, so Brother Pig decided to give up on this route!
So we choose the same kind of request interface as the first page (the request URL does not carry the ajax=true parameter, and the whole web page is returned), and then extract the data from it!
This way we can crawl Taobao's page content.
Once the page is crawled, all that remains is to extract the data. Here we extract the JSON data from the page and then parse the JSON to get the properties we want.
Since we chose to request the whole page, we need to know where the data is embedded in the page and how to extract it.
After searching and comparing, Brother Pig found that the JS variable g_page_config in the returned page holds the product information we want, and it is in JSON format too!
Then we write a regular expression to extract the data!
goods_match = re.search(r'g_page_config = (.*?)}};', response.text)
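A minimal sketch of how that match can be turned into a Python dict. Note that the lazy group in the regular expression stops before the closing `}}`, so those two braces have to be appended again before parsing. The sample HTML below is made up for illustration, and `extract_page_config` is a hypothetical helper name.

```python
import json
import re


def extract_page_config(html: str) -> dict:
    """Pull the g_page_config JSON object out of a page and parse it."""
    match = re.search(r'g_page_config = (.*?)}};', html)
    if match is None:
        # Likely a login page, a slider page, or a changed page layout.
        raise ValueError("g_page_config not found in the page")
    # The captured group ends just before the final '}}', so add it back.
    return json.loads(match.group(1) + '}}')


# Made-up page fragment; the real g_page_config is far larger.
sample_html = '<script>g_page_config = {"name": "demo", "data": {"total": 2}};</script>'

print(extract_page_config(sample_html))
# {'name': 'demo', 'data': {'total': 2}}
```

Because `.` does not match newlines by default, this relies on the whole g_page_config assignment sitting on a single line of the page.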
To extract values from the JSON, we need to know the structure of the returned data. We can paste the data into a JSON browser plugin or an online JSON parser to inspect it.
Once we understand the JSON structure, we can write a method that extracts the properties we want.
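A sketch of what such an extraction method might look like. The nesting (`mods` → `itemlist` → `data` → `auctions`) and the field names (`raw_title`, `view_price`, `item_loc`, `view_sales`, `nick`) are assumptions for illustration; check them against the structure you actually see in the JSON viewer, since the page format can change.

```python
from typing import Dict, List


def extract_goods(page_config: dict) -> List[Dict[str, str]]:
    """Flatten the product list into simple rows.

    The key path and field names are assumed, not confirmed by the article.
    """
    auctions = (
        page_config.get("mods", {})
        .get("itemlist", {})
        .get("data", {})
        .get("auctions", [])
    )
    rows = []
    for item in auctions:
        rows.append({
            "title": item.get("raw_title", ""),
            "price": item.get("view_price", ""),
            "location": item.get("item_loc", ""),
            "sales": item.get("view_sales", ""),
            "shop": item.get("nick", ""),
        })
    return rows


# Made-up sample data in the assumed shape.
sample_config = {
    "mods": {"itemlist": {"data": {"auctions": [
        {"raw_title": "Demo phone", "view_price": "999.00",
         "item_loc": "Hangzhou", "view_sales": "12", "nick": "demo shop"},
    ]}}}
}

print(extract_goods(sample_config))
```

Using `.get()` with defaults at every level means a page without products yields an empty list instead of a KeyError.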
There are many libraries for working with Excel. Someone online has written a comparison and evaluation of them; have a look if you are interested: https://dwz.cn/M6D8AQnq
Brother Pig chose the pandas library to work with Excel, because pandas is easy to use and is a commonly used data analysis library!
pandas actually relies on other libraries to read and write Excel, so we need to install several libraries:
pip install xlrd
pip install openpyxl
pip install numpy
pip install pandas
One pitfall here is that pandas has no simple append mode for writing rows to an Excel file. You have to read the existing data first, append the new rows to it, and then write everything back to the Excel file!
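The read, append, write-back cycle can be sketched as follows. This is one possible way to do it, not necessarily Brother Pig's code: `merge_rows` and `append_to_excel` are hypothetical helper names, and `pd.concat` is used for the append step because `DataFrame.append` was removed in pandas 2.0.

```python
import os
from typing import Dict, List, Optional

import pandas as pd


def merge_rows(existing: Optional[pd.DataFrame], rows: List[Dict]) -> pd.DataFrame:
    """Return the existing rows followed by the new rows."""
    new_df = pd.DataFrame(rows)
    if existing is None:
        return new_df
    return pd.concat([existing, new_df], ignore_index=True)


def append_to_excel(rows: List[Dict], path: str) -> None:
    """Read the file if it exists, add the new rows, and rewrite the file."""
    existing = pd.read_excel(path) if os.path.exists(path) else None
    merge_rows(existing, rows).to_excel(path, index=False)


# Usage (writes to disk, so it is left commented out):
# append_to_excel([{"title": "Demo phone", "price": "999.00"}], "goods.xlsx")
```

Rewriting the whole file on every call is simple but gets slower as the file grows; for a few thousand rows it is usually acceptable.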
Here is the result:
Once the whole flow for a single crawl (crawling, data extraction, saving) works, we can run it in a batch loop.
The delay between requests set here comes from Brother Pig's own trials, going from 3s and 5s up to more than 10s. If requests are too frequent, the verification code shows up easily!
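A minimal sketch of such a batch loop with a randomized pause between pages. The 10 to 15 second default range is an assumption based on the "more than 10s" figure above, and `fetch_page` stands for whatever function crawls, extracts, and saves a single page; both names are hypothetical.

```python
import random
import time
from typing import Callable, List


def crawl_pages(
    fetch_page: Callable[[int], object],
    pages: int,
    min_delay: float = 10.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Call fetch_page for pages 1..pages, pausing randomly in between."""
    results = []
    for page in range(1, pages + 1):
        results.append(fetch_page(page))
        if page < pages:
            # A random pause looks less mechanical than a fixed interval.
            sleep(random.uniform(min_delay, max_delay))
    return results


# Dry run with a stub fetcher and no real waiting.
print(crawl_pages(lambda page: f"page-{page}", 3, sleep=lambda seconds: None))
# ['page-1', 'page-2', 'page-3']
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.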
Over several runs, Brother Pig crawled more than 2,000 records.
Crawling Taobao raises many problems. Here they are, one by one:
Question: What should I do if applying for the st code fails?
Answer: Replace the login parameters in the _verify_password method.
If the parameters are correct, login will basically succeed!
To keep his own IP from being blocked, Brother Pig used a proxy pool. Crawling Taobao requires high-quality IPs; Brother Pig tried many free IPs found online, and almost none of them worked.
One site, however, has very good IPs, zdaye: http://ip.zdaye.com/dayProxy.html . The site updates its IPs every hour, and many of the IPs Brother Pig tried from it could crawl Taobao.
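A minimal sketch of picking a proxy from a pool in the shape the requests library expects for its `proxies` argument. The addresses are placeholders from the documentation-only 192.0.2.0/24 range, and `pick_proxy` is a hypothetical helper; how the pool gets filled and refreshed is left out.

```python
import random
from typing import Dict, List


def pick_proxy(pool: List[str]) -> Dict[str, str]:
    """Choose one 'ip:port' entry and build a requests-style proxies dict."""
    if not pool:
        raise ValueError("proxy pool is empty")
    address = random.choice(pool)
    return {
        "http": f"http://{address}",
        "https": f"http://{address}",
    }


# Placeholder addresses only; replace them with proxies you have verified.
proxy_pool = ["192.0.2.10:8080", "192.0.2.11:3128"]

print(pick_proxy(proxy_pool))

# Usage with requests (not executed here):
# requests.get(url, proxies=pick_proxy(proxy_pool), timeout=10)
```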
To keep an occasional failed request from breaking the crawl, Brother Pig added a retry mechanism to the crawling method!
This requires installing the retry library:
pip install retry
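To show the idea behind a retry mechanism, here is a small standard-library stand-in. It is not the retry library itself and not Brother Pig's code; `retry_on_error` and its parameters are hypothetical names chosen for this sketch.

```python
import functools
import time


def retry_on_error(tries=3, delay=0.0, exceptions=(Exception,)):
    """Re-run the decorated function until it succeeds or tries run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # Out of attempts: surface the last error.
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on_error(tries=3, delay=0.0, exceptions=(ConnectionError,))
def flaky_request():
    """Stub that fails twice before succeeding, to mimic a shaky network."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky_request(), calls["count"])
# ok 3
```

Limiting the retried exceptions to network-type errors avoids hammering the site when the failure is really a parsing bug or a slider page.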
None of the above is a real obstacle, but the slider will still appear. In Brother Pig's repeated tests, it is most likely to show up after roughly 20 to 40 crawls.
When the slider appears, the only option is to wait half an hour before crawling again, because the requests library cannot solve the slider yet. Later, after learning selenium and other frameworks, we can see whether they can!
For now this crawler is not perfect; it is only a semi-finished product with plenty of room for improvement, such as automatic maintenance of the IP pool, multi-threaded crawling in segments, and solving the slider problem. Let's work together to improve it until it becomes a complete and well-behaved crawler!
To get the source code, scan the QR code below with WeChat, follow the WeChat official account 「 Naked pigs 」, and reply "TaoBao"!