I'm studying the web crawler part of Python. To follow along, you should already know Python basics and some front-end fundamentals.

Contents

    • a) The concept of a crawler
    • b) Types of crawlers
    • c) How crawlers work
    • d) A comparison of crawlers written in various languages
    • Development environment
    • Sources and uses of data
    • Uses of data
    • Crawler concepts

Development environment

  • Operating system: Windows 10
  • Interpreter: Python 3.8
  • IDE: PyCharm

Sources and uses of data

Where does data come from?

  • User-generated data: for example, Baidu Index
  • Government statistics: data published by government bodies
  • Data management companies: aggregated data
  • Data you crawl yourself: for example, crawling videos from a website

Uses of data

  • Data analysis
  • Training data for smart products
  • Other uses (such as buying and selling data)

Crawler concepts

a) The concept of a crawler

A crawler is an application that downloads all kinds of resources from the Internet.
In other words, it is a program, written in some programming language, that crawls data from websites or apps.
How does a crawler get its data?

  • Find the target website to crawl and send a request to it
  • Analyze how the URLs change and extract the useful URLs
  • Extract the useful information
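
The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a complete crawler: the function names, the User-Agent string, the sample HTML, and the `page=` URL pattern are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Step 1: send a request to the target site and return the page text."""
    # The User-Agent value is a made-up example.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_urls(html, base_url, pattern):
    """Step 2: resolve links against base_url and keep those matching pattern."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = (urljoin(base_url, href) for href in collector.links)
    return [url for url in absolute if re.search(pattern, url)]


def extract_title(html):
    """Step 3: extract one piece of useful information (here, the page title)."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Offline demonstration on a hypothetical page, so no network access is needed.
SAMPLE_HTML = (
    "<html><head><title> Demo list </title></head><body>"
    '<a href="/list?page=2">next</a><a href="/about">about</a>'
    "</body></html>"
)
next_pages = extract_urls(SAMPLE_HTML, "https://example.com/list?page=1", r"[?&]page=\d+")
title = extract_title(SAMPLE_HTML)
```

In a real run, `fetch(url)` would supply the HTML that is hard-coded here as `SAMPLE_HTML`. Third-party packages are often used for these steps instead, but the standard library keeps the sketch self-contained.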

Can a crawler crawl any data it likes?
Of course not. A crawler has to follow certain rules and protocols, such as the rules a site publishes in its robots.txt file.

Take a look at the rules of JD.com (Jingdong):
[Image: screenshot of JD.com's crawler rules; not available]
Some things are allowed and some are not.
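
Assuming the rules in question are published as a robots.txt file, a crawler can check them before sending a request. The sketch below uses `urllib.robotparser` from the standard library. The rules, the crawler name `MyCrawler`, and the `example.com` URLs are invented for illustration and are not copied from JD.com or any other real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(rules_text):
    """Parse robots.txt content that is already available as a string."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(EXAMPLE_RULES)

# Ask whether a crawler calling itself "MyCrawler" may fetch each URL.
public_ok = parser.can_fetch("MyCrawler", "https://example.com/index.html")
private_ok = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")
```

For a live site, `set_url()` followed by `read()` downloads the file instead of parsing a string. The check is advisory: the crawler's author is responsible for honoring the result.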

b) Types of crawlers

  • General-purpose crawler
    Used by Baidu and other search engines. Starting from a set of initial URLs, it expands outward to the whole web, mainly collecting data for portal-site search engines and large web service providers.
  • Focused crawler
    Also called a topical crawler: a web crawler that selectively crawls only the pages relevant to a given requirement.
  • Incremental crawler
    Updates pages it has already downloaded and crawls only pages that are new or have changed.
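
One common way to implement the incremental idea is to remember a fingerprint of each page and skip pages whose fingerprint has not changed. The sketch below is one possible approach under that assumption. The class name `IncrementalStore` and its method are hypothetical, and it keeps fingerprints in memory only, whereas a real crawler would persist them.

```python
import hashlib


class IncrementalStore:
    """Remembers a content hash per URL so unchanged pages can be skipped."""

    def __init__(self):
        self._fingerprints = {}  # url -> SHA-256 hex digest of the last seen content

    def needs_update(self, url, content):
        """Return True if the URL is new or its content (bytes) has changed."""
        digest = hashlib.sha256(content).hexdigest()
        if self._fingerprints.get(url) == digest:
            return False  # same content as last time: nothing to do
        self._fingerprints[url] = digest
        return True


store = IncrementalStore()
first_visit = store.needs_update("https://example.com/a", b"version 1")
repeat_visit = store.needs_update("https://example.com/a", b"version 1")
changed_visit = store.needs_update("https://example.com/a", b"version 2")
```

Hashing the whole page is the simplest choice. Pages with timestamps or rotating ads would look changed on every visit, so a real crawler might hash only the extracted fields.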

c) How crawlers work

  • How a general-purpose crawler works
    [Figure: workflow of a general-purpose crawler; image not available]
  • How a focused crawler works
    [Figure: workflow of a focused crawler; image not available]
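
The original diagrams are not available, so the following is only a sketch of the commonly described general-purpose workflow: start from seed URLs, take a URL from a queue, fetch it, and add any newly discovered links to the queue. The function `crawl`, its parameters, and the fake link graph are assumptions for illustration. `fetch_links` stands in for real downloading and parsing so the example runs offline.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl from seed_urls.

    fetch_links(url) must return an iterable of URLs found on that page.
    Returns the URLs in the order they were visited.
    """
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A fake link graph standing in for real pages.
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: FAKE_WEB.get(url, []))
```

A focused crawler could reuse the same loop and add a relevance check before a link is queued, so that only pages on the chosen topic are followed.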

d) A comparison of crawlers written in various languages

  • PHP: support for multithreading and asynchronous code is not very friendly, concurrency is weak, and speed and efficiency are low.
  • Java: the code is verbose and refactoring is relatively costly, since any change can ripple through a lot of code, and a crawler's collection code needs to be modified frequently.
  • Python: development is fast and the code is concise. Many modules are available, with a rich choice of HTTP request and HTML parsing modules, and frameworks such as Scrapy and scrapy-redis make crawlers easier to develop.