Nailing Python web crawlers: working both the inside and the outside?

Chen Python 2020-11-16 18:47:49

Data analysis

Many people who are learning Python don't know where to start.

Many people study Python and, after mastering the basic syntax, don't know where to find practice cases.

Many people have worked through practice cases but don't know how to move on to more advanced material.

So this article is for these three kinds of people.

In the big data era, data analysis starts with data sources. Relying only on the trickle of data inside the company is not enough to analyze anything meaningful. Only by learning to crawl, and pulling in relevant, useful data from outside (websites), can you give the boss a basis for business decisions. And with that skill, you're a boss too.

At the mention of bosses, a pretty girl nearby got excited and asked out loud: "In your IT world, isn't the most handsome one Boss Li, the one who runs the search engine?"

I was a little unconvinced and a little unhappy, but what could I do? After all, when it comes to web crawlers, his (Boss Li's) technology is better than mine. He knows how to use crawlers to go through massive amounts of Internet information every day, pick out the high-quality pages, and record them in his database. When a user enters keywords in the search engine, the engine analyzes and processes the keywords, finds the relevant pages, sorts them according to its ranking rules, and presents the results to the user.

Thinking about how Boss Li makes money from ranking and doesn't give me a cent of it, I told the girl: "Okay, I'm done talking to you. I'm going to explain the principle of web crawlers to my readers. You're a fangirl, so go look at your Boss Li."

  1. What is a crawler


A web crawler is also called a web spider, a web ant, or a web robot. It follows the rules we set and crawls data on the network. The crawled results may contain HTML code, JSON data, images, audio, or video. The programmer filters this data according to the actual requirements, extracts the useful parts, and stores them.
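Since a crawl can return HTML, JSON, or binary files such as images, a crawler usually has to decide how to treat each response. Here is a minimal sketch of that decision, assuming the server's Content-Type header is trustworthy. The function name `decode_body` is made up for illustration and is not from any library.

```python
import json

def decode_body(content_type, body):
    """Return (kind, value) for a raw response body, based on its Content-Type."""
    mime, _, params = content_type.partition(";")
    mime = mime.strip().lower()

    # Look for an optional "charset=..." parameter; default to UTF-8.
    charset = "utf-8"
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            charset = value.strip().strip('"')

    if mime == "application/json":
        return "json", json.loads(body.decode(charset))
    if mime.startswith("text/"):
        return "text", body.decode(charset, errors="replace")
    # Images, audio, video and so on: keep the raw bytes and save them as-is.
    return "binary", body
```

For example, `decode_body("text/html", b"<p>hi</p>")` gives back the page as a string, while an `image/png` body comes back untouched as bytes.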

Put plainly, a crawler uses the Python programming language to simulate a browser: it visits the specified website, gets the returned result, filters it according to rules to extract the data you need, and stores that data for later use.

If you've read my earlier posts "Day 10 | Master Python in 12 Days: File Operations" and "Day 11 | Master Python in 12 Days: Database Operations", you already know that data usually lives in a file or a database.

  2. The crawling process


How a user gets network data through a browser: open the browser -> enter the URL -> the browser submits the request -> download the page code -> parse it into a page.

How a crawler program does it: specify the URL and impersonate a browser to send the request (get the page code) -> extract the useful data -> store it in a file or database.
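The last step of that flow, storing into a database, can be sketched with the built-in `sqlite3` module. The table name `pages`, its columns, and the helper names are all made up for illustration, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

def save_rows(conn, rows):
    """Store (url, title) pairs; crawling the same URL again overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", rows
    )
    conn.commit()

def load_titles(conn):
    """Read everything back as a {url: title} dictionary."""
    return dict(conn.execute("SELECT url, title FROM pages ORDER BY url"))

conn = sqlite3.connect(":memory:")  # swap in a file path to keep the data
save_rows(conn, [("https://example.com/a", "Page A"),
                 ("https://example.com/b", "Page B")])
print(load_titles(conn))
```

Using the URL as the primary key is one simple way to avoid duplicate rows when the same page is crawled twice.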

For writing crawlers, Python is recommended, because its crawler libraries are easy to use and the built-in environment already covers most needs. It can:

(1) use an HTTP library to send a Request (including request headers and a request body) to the target site;

(2) take the Response from the server and parse it with built-in libraries (html, json, regular expressions);

(3) store the required data in a file or database.
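The three steps above can be sketched using only the standard library. This is one possible layout, not the only one. The URL, the User-Agent string, the choice to extract the title and links, and every function name here are assumptions made for illustration.

```python
import json
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen

URL = "https://example.com/"  # placeholder target site
HEADERS = {"User-Agent": "Mozilla/5.0 (tutorial crawler sketch)"}

def build_request(url):
    """Step 1a: build a Request that carries browser-like headers."""
    return Request(url, headers=HEADERS)

def fetch(url, timeout=10):
    """Step 1b: send the request and return the page source as text."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract(html_text):
    """Step 2: parse the response with html.parser plus a regular expression."""
    parser = LinkParser()
    parser.feed(html_text)
    title = re.search(r"<title>(.*?)</title>", html_text, re.S | re.I)
    return {"title": title.group(1).strip() if title else None,
            "links": parser.links}

def store(record, path):
    """Step 3: write the extracted data to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)

def crawl(url, path):
    """Run all three steps; this call needs network access."""
    store(extract(fetch(url)), path)
```

Calling `crawl(URL, "page.json")` would run the whole chain. Nothing is fetched until you call it, so `extract` can be tried on any HTML string first.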

If Python's built-in libraries are not enough, you can run pip install followed by the library name to quickly download a third-party library and use it.
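A script can also check whether a third-party library is installed before relying on it. The sketch below uses `requests` purely as an example name (the article does not name a specific library); it would be installed with `pip install requests`. The helper names are made up for illustration.

```python
import importlib.util

def available(name):
    """True if a top-level module called `name` can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

def pick_http_backend():
    """Prefer the third-party library when present, else the built-in one."""
    return "requests" if available("requests") else "urllib.request"

print(pick_http_backend())
```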

  3. Locating what to crawl


When writing crawler code, you often need to specify the node or path to crawl. If I told you that the Chrome browser lets you get that node or path quickly, would you install it on your computer right away?

If you would, good. If you haven't yet, go ahead and install it.

On the page, press F12 to open the developer tools and see the source code. To find a node, right-click it on the page and choose 【Inspect】 to locate it in the code. Then right-click that code and choose 【Copy】-【Copy selector】 or 【Copy XPath】 to copy the node's selector or path.
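A copied XPath can then be used in code. Here is a minimal sketch with the built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup. The sample page and the function name `select_text` are made up for illustration. Real pages are often not well-formed, so a third-party parser such as lxml is the usual choice for them.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a real page.
PAGE = """<html><body>
<div id="content"><ul><li>first</li><li>second</li></ul></div>
</body></html>"""

def select_text(markup, chrome_xpath):
    """Return the text of the first node matching an XPath copied from Chrome."""
    root = ET.fromstring(markup)
    # ElementTree rejects absolute paths on an element, so prefix the
    # copied "//..." expression with "." to search below the root.
    node = root.find("." + chrome_xpath)
    return None if node is None else node.text

print(select_text(PAGE, '//*[@id="content"]/ul/li[2]'))  # prints: second
```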

Okay, that's all from Lao Chen on how crawlers work. If you found it helpful, I hope you'll share and like this article so that more people can see it. Your shares and likes are the greatest encouragement for Lao Chen to keep creating and sharing.

This article was written by [Chen Python]. Please include a link to the original when reposting. Thank you.

