Generally speaking, what we need to capture is the content of a website or an application, and then extract the useful information from it. That content generally falls into two categories: unstructured text and structured text.
About structured data
JSON, XML, HTML
HTML text (including JavaScript code) is the most common data format. It is nominally a structured way of organizing text, but the key information we need is not directly available: it has to be located by parsing the HTML, sometimes with additional string operations. For that reason it is still classified as unstructured data processing.
If you compare a web page to a person, then HTML is the skeleton, JS is the muscles, and CSS is the clothes.
The common parsing methods are XPath, CSS selectors, and regular expressions.
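As a rough illustration of two of these methods, the sketch below pulls the same links out of a small page with an XPath-style query and with a regular expression. The `PAGE` snippet and the function names are made up for this example. It uses only the standard library, whose `xml.etree.ElementTree` supports just a subset of XPath and requires well-formed markup, so real pages usually call for a dedicated HTML parser. CSS selectors are not in the standard library; third-party packages such as BeautifulSoup offer them, and the equivalent selector here would be `div.post > a`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """<html><body>
<div class="post"><a href="/a/1">First post</a></div>
<div class="post"><a href="/a/2">Second post</a></div>
</body></html>"""


def links_by_xpath(markup):
    """Collect (href, text) pairs using ElementTree's limited XPath support."""
    root = ET.fromstring(markup)
    anchors = root.findall(".//div[@class='post']/a")
    return [(a.get("href"), a.text) for a in anchors]


def links_by_regex(markup):
    """Collect the same pairs with a regular expression (fragile on real HTML)."""
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', markup)


if __name__ == "__main__":
    print(links_by_xpath(PAGE))
    print(links_by_regex(PAGE))
```

The XPath version follows the document structure, so it keeps working if attributes are reordered or whitespace changes. The regular expression only matches this exact textual shape, which is why it is usually reserved for small, predictable fragments.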
HTML DOM Example
The HTML DOM defines a standard way to access and manipulate HTML documents.
The DOM represents an HTML document as a tree structure.
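To make the tree structure concrete, here is a minimal sketch that parses a tiny document and prints its nodes with indentation. The `DOC` string and the `outline` helper are assumptions for illustration. `xml.dom.minidom` is an XML DOM implementation, so it only accepts well-formed markup; it stands in here for the HTML DOM that a browser builds.

```python
from xml.dom import minidom

# Hypothetical, well-formed document used only for this illustration.
DOC = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)


def outline(node, depth=0):
    """Return the DOM subtree as indented lines, one element or text node per line."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    # Recurse into the children; text nodes simply have none.
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


dom = minidom.parseString(DOC)
tree_lines = outline(dom.documentElement)

if __name__ == "__main__":
    print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting: `html` is the root, `head` and `body` are its children, and the text sits at the leaves.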
Text data
Take an article, or a single sentence, as an example. Our goal is to extract the valid information from it. If the text will be processed later, it can be stored as-is. If useful information has to be extracted in real time, the common approaches are as follows:
- Word segmentation: the dictionary used depends on the type of website being crawled. After basic segmentation, compute word-frequency statistics. The result resembles a vector, with the words as directions and the word frequencies as lengths.
- NLP (natural language processing): perform semantic analysis and express the outcome as a result, for example positive versus negative sentiment.
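The word-frequency idea in the first item can be sketched as follows. This is a simplified example under stated assumptions: the `tokenize` function just splits English text on non-letter characters, whereas Chinese text would need a real segmenter with a suitable dictionary. The function names and sample sentences are invented for this sketch. Cosine similarity is added only to show what treating the counts as a vector buys you.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letter characters (stand-in for a real segmenter)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(text):
    """Map each word to its count: words are the axes, counts are the lengths."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Compare two word-frequency vectors by the angle between them."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    tf_a = term_frequencies("The crawler fetches the page")
    tf_b = term_frequencies("The parser reads the page")
    print(tf_a.most_common(2))
    print(round(cosine_similarity(tf_a, tf_b), 3))
```

Semantic analysis such as sentiment classification goes beyond simple counting and normally relies on a trained model or a dedicated NLP library, so it is not sketched here.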