A little idle chatter first: once again, your blogger is here making his presence felt.
Looking back, the previous post covered the basics of web crawling and wrapped up a simple function for fetching a page's source data. Not bad, right?
Python Crawler Self-Study Series, Part One
Today we're going to extract the data we want from the page data we fetched.
(Note: much of this article has already been covered elsewhere, so this post is mostly links and won't be long.)
XPath is a language for describing the hierarchical structure of an XML document in terms of relationships between nodes. Because HTML shares XML's element structure, we can use XPath to locate and select elements in an HTML document.
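To make the idea concrete, here is a minimal sketch using lxml (the HTML snippet and the `job` class are invented purely for illustration):

```python
from lxml import etree

# A made-up HTML fragment to demonstrate XPath addressing
html = """
<html>
  <body>
    <ul>
      <li class="job">Engineer</li>
      <li class="job">Designer</li>
    </ul>
  </body>
</html>
"""

tree = etree.HTML(html)
# Select the text of every <li> whose class attribute is "job"
titles = tree.xpath('//li[@class="job"]/text()')
print(titles)  # ['Engineer', 'Designer']
```

The `//` prefix searches the whole document, and `[@class="job"]` filters by attribute; that one-line query replaces a fair amount of manual string handling.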
If you want to learn more about XPath, you can click the blue link here.
As for BeautifulSoup, I won't bring it up.
Don't ask me why; read on and you'll see.
Get the idea?
Good, that's it for the background knowledge. What we need next is to wrap up the code from the hands-on project.
There's a hands-on project too: Crawling Tencent's 2021 Campus Recruitment
Take a look at that first; then we'll come back, pick out a few functions, and encapsulate them.
This function grabs the data directly, but writing the XPath expression itself is not that easy.
```python
from lxml import etree

def get_data(html_data, Xpath_path):
    '''
    Extract the required data from the raw page response
    :param html_data: the response for a single page
    :param Xpath_path: an XPath expression
    :return: a list of matched results
    '''
    data = html_data.content
    data = data.decode().replace("<!--", "").replace("-->", "")  # strip comment markers from the data
    tree = etree.HTML(data)  # build the element object
    el_list = tree.xpath(Xpath_path)
    return el_list
```
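To sanity-check this function without hitting the network, you can feed it a stand-in object that mimics the `.content` attribute of a requests response (the `FakeResponse` class and the sample HTML are invented for illustration; the function body is repeated so the sketch runs on its own):

```python
from lxml import etree

class FakeResponse:
    # Hypothetical stand-in for a requests response, exposing only .content
    def __init__(self, text):
        self.content = text.encode()

def get_data(html_data, Xpath_path):
    # Same logic as the function above, repeated so this sketch is standalone
    data = html_data.content
    data = data.decode().replace("<!--", "").replace("-->", "")
    tree = etree.HTML(data)
    return tree.xpath(Xpath_path)

resp = FakeResponse('<div><p>hello</p><!-- a comment --><p>world</p></div>')
result = get_data(resp, '//p/text()')
print(result)  # ['hello', 'world']
```

Note that the comment markers are stripped before parsing, so commented-out markup does not hide data from the XPath query.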
The function above is one-shot. What about something reusable? For example, what if a single page requires capturing more than one kind of data, i.e. there are several XPath expressions? What then?
I have two approaches:
1. Pass the element object around and split the function in two. Look:
Step one: build the element object for the URL and return it.
```python
from lxml import etree

# Build the element object for the URL
def get_element(html_data):
    data = html_data.content
    data = data.decode().replace("<!--", "").replace("-->", "")
    tree = etree.HTML(data)
    return tree
```
Step two: extract the data from the element object.
```python
def parser_element_data(Tree, Xpath):
    el_list = Tree.xpath(Xpath)
    return el_list
```
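To see why the split helps, here is a sketch that builds the tree once and reuses it for several queries (the HTML fragment and XPath expressions are made up for illustration; step two is repeated so the sketch runs standalone):

```python
from lxml import etree

def parser_element_data(Tree, Xpath):
    # Step two from above: query an already-built element tree
    return Tree.xpath(Xpath)

# Hypothetical page fragment standing in for a real crawled page
html = '<ul><li><a href="/a">Job A</a></li><li><a href="/b">Job B</a></li></ul>'
tree = etree.HTML(html)  # built once, as in step one

# The same tree serves multiple XPath queries without re-parsing
names = parser_element_data(tree, '//a/text()')
links = parser_element_data(tree, '//a/@href')
print(names)  # ['Job A', 'Job B']
print(links)  # ['/a', '/b']
```

Parsing is the expensive part, so building the tree once and querying it repeatedly saves work when you need several pieces of data from one page.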
This approach is a bit clunky. When you actually use it, it isn't pretty either; it's redundant.
Let's look at approach two.
This approach passes all the XPath expressions in as a list, then pulls out the data in a loop.
```python
from lxml import etree

def get_data_2(html_data, Xpath_path_list):
    '''
    Extract data with multiple XPath expressions
    :param html_data: the raw page response
    :param Xpath_path_list: a list of XPath expressions
    :return: a 2-d list, one sub-list of results per expression
    '''
    el_data = []
    data = html_data.content
    data = data.decode().replace("<!--", "").replace("-->", "")
    tree = etree.HTML(data)
    for Xpath_path in Xpath_path_list:
        el_list = tree.xpath(Xpath_path)
        el_data.append(el_list)
        el_list = []  # reset, just to be safe
    return el_data
```
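A quick standalone check of the looping approach, again with a hypothetical stand-in for the response object (the class, sample HTML, and XPath list are invented for illustration):

```python
from lxml import etree

class FakeResponse:
    # Hypothetical stand-in for a requests response, exposing only .content
    def __init__(self, text):
        self.content = text.encode()

def get_data_2(html_data, Xpath_path_list):
    # Same logic as above, repeated so this sketch runs on its own
    el_data = []
    data = html_data.content.decode().replace("<!--", "").replace("-->", "")
    tree = etree.HTML(data)
    for Xpath_path in Xpath_path_list:
        el_data.append(tree.xpath(Xpath_path))
    return el_data

resp = FakeResponse('<div><h2>Engineer</h2><span>Shenzhen</span></div>')
result = get_data_2(resp, ['//h2/text()', '//span/text()'])
print(result)  # [['Engineer'], ['Shenzhen']]
```

Each XPath expression yields its own sub-list, so the i-th entry of the result lines up with the i-th expression you passed in.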
This version is fairly short, but it does plenty.
If you're keen, find a website to practice XPath on, say a recruitment site.