The goal is to crawl all of the comments.
The links that were crawled are listed below. When a page loads its content dynamically with AJAX, there are two ways to crawl it.
Both methods are introduced here. (If you have any questions about the code, please suggest improvements.)
Method 1: Parse out the real address and crawl it directly
The example uses the URL from the reference link. The comment data on that site cannot be read from the page source directly, so you have to capture the network requests. Open the browser's developer tools, switch to the "Network" tab, and refresh the page; the comment data will be in one of the listed requests, usually provided as a JSON file. Find the file that holds the comment data (see the figure below) and click "Preview" to inspect it.
The code then fetches the data, extracts each comment, and prints the results as a quick test.
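A minimal sketch of this step, assuming a placeholder endpoint URL and placeholder JSON field names (substitute the real ones you saw in the Preview pane):

```python
import requests

# Placeholder endpoint: use the URL of the JSON file found in the Network tab.
link = "https://api.example.com/comments?offset=1"
# Some sites reject requests without a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(link, headers=headers)
data = r.json()  # parse the JSON response

# Field names are placeholders; check the Preview pane for the real structure.
for comment in data["comments"]:
    print(comment["content"])
```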
Improvement: the code above only grabs the comments on the first page, but all comments should be captured. Click another page number in the comment pager and more JSON files appear in the Network tab:
If you open these JSON files and compare their URLs, you will see that one parameter differs.
That parameter is the page number: offset=1 is the first page. (Note: on the first load the offset defaults to 1 and the parameter may be absent altogether, so it does not show up in the URL.)
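To capture every page, move the request into a loop and substitute the page number into the offset parameter. A sketch under the same placeholder assumptions:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}
total_pages = 27  # total page count, read off the site's pager

for offset in range(1, total_pages + 1):
    # offset is the page number; offset=1 is the first page.
    link = f"https://api.example.com/comments?offset={offset}"
    data = requests.get(link, headers=headers).json()
    for comment in data["comments"]:
        print(comment["content"])
```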
Method 2: Use Selenium to simulate a browser
Some websites encrypt their request addresses to defeat the previous method, which is where the second approach comes in.
Selenium installation and testing: install the selenium package first. If you are using Firefox, you also need its driver, geckodriver; matching drivers for other browsers are easy to find with a quick search. Then test that everything works with a short script.
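A minimal test sketch, assuming geckodriver is on your PATH:

```python
from selenium import webdriver

driver = webdriver.Firefox()          # needs geckodriver on the PATH
driver.get("https://www.example.com") # placeholder test page
print(driver.title)                   # if a window opens and a title prints, it works
driver.quit()
```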
Grab the data with the code below. Note that the comments sit inside an iframe, so you must parse the iframe: first shift the focus into it with switch_to. The line driver.implicitly_wait(10) makes the driver wait implicitly for up to 10 seconds. Without it, the iframe takes so long to load that the lookup fails with an error saying div.reply-content cannot be found.
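A sketch that puts these pieces together. The implicit wait, the switch_to call, and the div.reply-content selector come from the description above; the article URL and the iframe locator are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
driver.get("http://example.com/post.html")  # placeholder article URL

# The comments live in an iframe, so shift the focus into it first.
# The locator is a placeholder; inspect the page for the real one.
driver.switch_to.frame(driver.find_element(By.CSS_SELECTOR, "iframe"))

# Each comment body carries the class reply-content.
for comment in driver.find_elements(By.CSS_SELECTOR, "div.reply-content"):
    print(comment.text)

driver.quit()
```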
The pager shows 10 page numbers at a time; after browsing those 10 pages you click [Next] to reveal the following batch, for 27 pages in total. Everything is wrapped in nested for loops: the outer loop walks the batches (pages 1-10, 11-20, 21-27) and the inner loop prints the comments on each page, in the same way as the single-page grab above. The code is sketched below, with comments:
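In this sketch the pager link texts ("Next" and the page numbers) are guesses, so adapt the locators to the real widget:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("http://example.com/post.html")  # placeholder article URL
driver.switch_to.frame(driver.find_element(By.CSS_SELECTOR, "iframe"))  # placeholder locator

def print_comments():
    # Same single-page grab as above.
    for comment in driver.find_elements(By.CSS_SELECTOR, "div.reply-content"):
        print(comment.text)

# 27 pages, shown 10 page numbers at a time: batches 1-10, 11-20, 21-27.
page_no = 1
for batch_size in (10, 10, 7):       # outer loop: one batch of page numbers
    for _ in range(batch_size):      # inner loop: each page in the batch
        print_comments()
        page_no += 1
        if page_no > 27:
            break
        # Advance to the next page; the link texts below are guesses.
        if (page_no - 1) % 10 == 0:
            driver.find_element(By.LINK_TEXT, "Next").click()  # reveal next batch
        else:
            driver.find_element(By.LINK_TEXT, str(page_no)).click()
        time.sleep(2)                # give the new comment page time to render

driver.quit()
```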
Selenium tends to be slow, because it has to load the entire page before it can start extracting content. You can speed it up by disabling images, CSS, and JavaScript:
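A sketch of that setup. The stylesheet preference is the one quoted in the next paragraph; the image and JavaScript preferences are standard Firefox settings, added here as assumptions to cover the "images and JS" part:

```python
from selenium import webdriver

# Build a Firefox profile that skips stylesheets, images, and JavaScript.
# (FirefoxProfile and the firefox_profile argument are the older
# Selenium 3 style API used in this post.)
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.stylesheet", 2)  # 2 = do not load CSS
fp.set_preference("permissions.default.image", 2)       # 2 = do not load images
fp.set_preference("javascript.enabled", False)          # turn JavaScript off
# Note: with JavaScript off, AJAX-loaded content will not appear; drop that
# preference if the data you need is loaded dynamically.

driver = webdriver.Firefox(firefox_profile=fp)
driver.get("http://example.com/post.html")  # placeholder URL
```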
The code above uses fp = webdriver.FirefoxProfile() to control what the browser loads: fp.set_preference("permissions.default.stylesheet", 2) tells Firefox not to load CSS, and webdriver.Firefox(firefox_profile=fp) launches the browser with that profile so no stylesheets are fetched. After running the code, the resulting page, rendered without CSS, is shown below.