Usage Summary of Python parsing library lxml and XPath

Irene181 2021-06-28 02:54:20


This article covers XPath and the lxml library:

I. XPath: concepts, nodes, syntax, axes, operators

II. lxml: installation, usage, case study

I. XPath

1. XPath concepts

XPath is a language for finding information in XML documents. It uses path expressions to navigate an XML document, includes a library of standard functions, is a major element of XSLT, and is a W3C standard.

2. XPath nodes

XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes.

Node relationships: parent, children, siblings, ancestors, descendants.
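As a rough illustration of these relationships, the sketch below walks a tiny document with lxml's element API. The bookstore XML is made-up sample data (it is not from the original article), and the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(XML)   # the bookstore element
book = root[0]                 # first child element of bookstore
title, price = book            # the two child elements of book

parent_tag = title.getparent().tag                       # parent of title
child_tags = [child.tag for child in book]               # children of book
sibling_tag = title.getnext().tag                        # next sibling of title
ancestor_tags = [a.tag for a in title.iterancestors()]   # nearest ancestor first
descendant_tags = [d.tag for d in root.iterdescendants()]
attr_value = title.get("lang")                           # an attribute node's value
text_value = title.text                                  # a text node's value

print(parent_tag, child_tags, sibling_tag)
print(ancestor_tags, descendant_tags, attr_value, text_value)
```

Note that lxml exposes text and attributes as properties of elements rather than as separate node objects, so only elements appear in the tag lists.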

3. XPath syntax

The XPath syntax is described in detail on the W3C website; the essentials are summarized here for reference.

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or steps. The most useful path expressions are listed below:

Expression   Description
nodename     Selects all child elements of the current node that are named nodename.
/            Selects from the root node.
//           Selects matching nodes anywhere below the current node, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

The table below lists some path expressions and their results:

Path expression    Result
bookstore          Selects all elements named bookstore that are children of the current node.
/bookstore         Selects the root element bookstore. Note: a path that starts with a forward slash (/) always represents an absolute path to an element.
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, regardless of where they are in the document.
bookstore//book    Selects all book elements that are descendants of bookstore, no matter where they sit beneath it.
//@lang            Selects all attributes named lang.
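A minimal sketch of how these path expressions behave when evaluated with lxml. The bookstore document is assumed sample data, not something from the original article. Relative paths such as `book` are evaluated against the context node, which here is the bookstore element itself.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = etree.fromstring(XML)  # context node: the bookstore element

root_tags = [e.tag for e in root.xpath("/bookstore")]  # absolute path to the root element
child_books = root.xpath("book")                       # relative: book children of the context node
all_books = root.xpath("//book")                       # book elements anywhere in the document
titles = root.xpath("/bookstore//title/text()")        # text of title descendants of bookstore
langs = root.xpath("//@lang")                          # every attribute named lang
parent_tag = root.xpath("//book/..")[0].tag            # parent of the book elements
current_tag = root.xpath(".")[0].tag                   # the context node itself

print(root_tags, len(child_books), len(all_books))
print(titles, langs, parent_tag, current_tag)
```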

Predicates

A predicate is used to find a specific node, or a node that contains a specified value.

Predicates are written inside square brackets.

The table below lists some path expressions with predicates, and their results:

Path expression                        Result
/bookstore/book[1]                     Selects the first book element that is a child of bookstore.
/bookstore/book[last()]                Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]              Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]          Selects the first two book elements that are children of bookstore.
//title[@lang]                         Selects all title elements that have an attribute named lang.
//title[@lang='eng']                   Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]           Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title     Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
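The predicates above can be tried out with a short lxml sketch. The four-book document below is assumed sample data; the last title deliberately has no lang attribute so that `//title[@lang]` filters something out.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = """<bookstore>
  <book><title lang="eng">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="fr">XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
second_to_last = root.xpath("/bookstore/book[last()-1]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
with_lang = root.xpath("//title[@lang]/text()")
lang_eng = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

for name, value in [("first", first), ("last", last),
                    ("second_to_last", second_to_last),
                    ("first_two", first_two), ("with_lang", with_lang),
                    ("lang_eng", lang_eng), ("expensive", expensive)]:
    print(name, value)
```

XPath positions start at 1, not 0, which is the main thing to keep in mind when moving between predicates and Python list indexing.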

Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard   Description
*          Matches any element node.
@*         Matches any attribute node.
node()     Matches a node of any type.

The table below lists some path expressions and their results:

Path expression   Result
/bookstore/*      Selects all child elements of the bookstore element.
//*               Selects all elements in the document.
//title[@*]       Selects all title elements that have at least one attribute.
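A small sketch of the wildcards in lxml, again on assumed sample data. `node()` also returns text nodes (including whitespace between tags), which lxml hands back as strings, so the example filters on the presence of a `tag` attribute to keep only elements.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = """<bookstore>
  <book category="web">
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = etree.fromstring(XML)

child_tags = [e.tag for e in root.xpath("/bookstore/*")]   # child elements of bookstore
all_tags = [e.tag for e in root.xpath("//*")]              # every element in the document
titles_with_attr = root.xpath("//title[@*]")               # title elements with any attribute
attr_values = root.xpath("//@*")                           # every attribute value
nodes = root.xpath("/bookstore/book/node()")               # elements AND text nodes under book
element_tags = [n.tag for n in nodes if hasattr(n, "tag")] # keep only the elements

print(child_tags, all_tags, len(titles_with_attr))
print(attr_values, element_tags)
```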

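Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

The table below lists some path expressions and their results:

Path expression                     Result
//book/title | //book/price         Selects all title and price elements of book elements.
//title | //price                   Selects all title and price elements in the document.
/bookstore/book/title | //price     Selects all title elements of the book elements of bookstore, plus all price elements in the document.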
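The "|" operator can be sketched with lxml as follows. The document is assumed sample data; the result of a union is a single node set, so a node matched by both sides appears only once.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = """<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# title and price elements of every book
title_and_price = [e.tag for e in root.xpath("//book/title | //book/price")]

# text of every title and every price in the document
all_text = root.xpath("//title/text() | //price/text()")

# titles reached by an absolute path, plus every price anywhere
mixed = root.xpath("/bookstore/book/title | //price")

# a node matched by both sides is returned once
deduplicated = root.xpath("//title | /bookstore/book/title")

print(title_and_price, all_text, len(mixed), len(deduplicated))
```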

4. XPath axes

An axis defines a node set relative to the current node.

Axis name             Result
ancestor              Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self      Selects all ancestors (parent, grandparent, etc.) of the current node, and the current node itself.
attribute             Selects all attributes of the current node.
child                 Selects all children of the current node.
descendant            Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self    Selects all descendants (children, grandchildren, etc.) of the current node, and the current node itself.
following             Selects all nodes in the document after the end tag of the current node.
following-sibling     Selects all siblings after the current node.
namespace             Selects all namespace nodes of the current node.
parent                Selects the parent of the current node.
preceding             Selects all nodes in the document before the start tag of the current node (ancestors excluded).
preceding-sibling     Selects all siblings before the current node.
self                  Selects the current node.
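To make the axes more concrete, here is a minimal lxml sketch on assumed sample data. Each query is evaluated relative to a specific element, which plays the role of the current node; the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = etree.fromstring(XML)
first_title = root.xpath("//book[1]/title")[0]
second_price = root.xpath("//book[2]/price")[0]

def tags(nodes):
    """Return the tag names of a list of elements."""
    return [n.tag for n in nodes]

ancestors = tags(first_title.xpath("ancestor::*"))
ancestors_or_self = tags(first_title.xpath("ancestor-or-self::*"))
attributes = first_title.xpath("attribute::*")
parent = tags(first_title.xpath("parent::*"))
children = tags(root.xpath("child::*"))
descendants = tags(root.xpath("descendant::*"))
following = tags(first_title.xpath("following::*"))
following_siblings = tags(first_title.xpath("following-sibling::*"))
preceding = tags(second_price.xpath("preceding::*"))
preceding_siblings = tags(second_price.xpath("preceding-sibling::*"))
self_node = tags(first_title.xpath("self::title"))

print(ancestors, attributes, parent, children)
print(following, following_siblings, preceding_siblings, self_node)
```

The difference between `following` and `following-sibling` shows up clearly here: the first reaches into the second book, the second stays inside the first one.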

5. XPath operators

The operators available in XPath expressions are listed below:

Operator   Description               Example                         Return value
|          Union of two node sets    //book/title | //book/price     A node set containing all title and price elements of book elements.
+          Addition                  6 + 4                           10
-          Subtraction               6 - 4                           2
*          Multiplication            6 * 4                           24
div        Division                  8 div 4                         2
=          Equal to                  price=9.80                      true if price is 9.80; false if price is 9.90.
!=         Not equal to              price!=9.80                     true if price is 9.90; false if price is 9.80.
<          Less than                 price<9.80                      true if price is 9.00; false if price is 9.90.
<=         Less than or equal to     price<=9.80                     true if price is 9.00; false if price is 9.90.
>          Greater than              price>9.80                      true if price is 9.90; false if price is 9.80.
>=         Greater than or equal to  price>=9.80                     true if price is 9.90; false if price is 9.70.
or         Logical or                price=9.80 or price=9.70        true if price is 9.80; false if price is 9.50.
and        Logical and               price>9.00 and price<9.90       true if price is 9.80; false if price is 8.50.
mod        Remainder of division     5 mod 2                         1
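The operators can be checked directly with lxml, which returns Python floats for numeric results and booleans for comparisons. A minimal sketch, assuming a made-up one-element document in which price is 9.80:

```python
from lxml import etree

# Hypothetical sample document: a book whose price is 9.80.
book = etree.fromstring("<book><price>9.80</price></book>")

# Arithmetic operators yield floats.
added = book.xpath("6 + 4")
subtracted = book.xpath("6 - 4")
multiplied = book.xpath("6 * 4")
divided = book.xpath("8 div 4")
remainder = book.xpath("5 mod 2")

# Comparison and logical operators yield booleans.
is_equal = book.xpath("price = 9.80")
is_not_equal = book.xpath("price != 9.80")
is_less = book.xpath("price < 9.80")
is_greater_or_equal = book.xpath("price >= 9.80")
either = book.xpath("price = 9.80 or price = 9.70")
both = book.xpath("price > 9.00 and price < 9.90")

print(added, subtracted, multiplied, divided, remainder)
print(is_equal, is_not_equal, is_less, is_greater_or_equal, either, both)
```

In practice these operators mostly appear inside predicates, for example `//book[price > 9.00 and price < 9.90]`.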

That covers XPath. Next comes lxml, a very fast library. It has long been my favorite parser to use with BeautifulSoup, bar none, because it really is much faster than the alternatives html.parser and html5lib.

II. lxml

1. Installing lxml

lxml is a parsing module that supports XPath. Installation is easy: run pip install lxml (the older easy_install lxml also appears in many guides, but easy_install is deprecated).
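One quick way to confirm the installation worked is to import the package and print its version. This is only a sanity-check sketch; `LXML_VERSION` and `LIBXML_VERSION` are version tuples exposed by `lxml.etree`.

```python
from lxml import etree

# Version of lxml itself and of the libxml2 library it is running against.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
libxml_version = ".".join(str(part) for part in etree.LIBXML_VERSION)

print("lxml:", lxml_version)
print("libxml2:", libxml_version)
```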

2. Using lxml

lxml offers two ways to parse web pages: one parses a local (offline) HTML file that you have saved or written yourself, the other parses a page fetched online.

Import the package:

from lxml import etree

1. Parsing an offline page:

html = etree.parse('xx.html', etree.HTMLParser())
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')
print(aa)

2. Parsing an online page:

from lxml import etree
import requests
rep = requests.get('https://www.baidu.com')
html = etree.HTML(rep.text)
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')
print(aa)
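Both routes can be tried without a network connection. The sketch below is an assumption-laden stand-in: the `PAGE` string, the `news` id, and the example.com link are made up, a temporary file plays the role of the saved page, and the same string plays the role of `rep.text`.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page content standing in for a real site.
PAGE = ('<html><body><div id="news"><h2>'
        '<a href="https://example.com/a">First</a>'
        '</h2></div></body></html>')

# Way 1: parse a local file with etree.parse() and an HTML parser.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as fh:
    fh.write(PAGE)
    path = fh.name
try:
    tree = etree.parse(path, etree.HTMLParser())
    from_file = tree.xpath('//*[@id="news"]/h2/a[1]/@href')
finally:
    os.remove(path)  # clean up the temporary file

# Way 2: parse a string (for example the text of an HTTP response).
doc = etree.HTML(PAGE)
from_string = doc.xpath('//*[@id="news"]/h2/a[1]/@href')

print(from_file, from_string)
```

Both calls return a list, even when only one node matches, and an empty list when nothing matches.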

So how do we get these tags and their attribute values? It's simple: locate the tag first.

Then, for example, to get the text inside an a tag and the value of its href attribute, there are two ways:

1. Extract inside the expression

aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/text()')
ab = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')

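2. Extract outside the expression

aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]')  # xpath() returns a list of elements
aa[0].text                # text of the first matching a tag
aa[0].attrib.get('href')  # value of its href attribute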
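The two approaches can be compared side by side in a short sketch. The markup, the `news` id, and the link are made-up sample data. Because `xpath()` returns a list, the second approach has to pick an element out of it and should guard against an empty result.

```python
from lxml import etree

# Hypothetical page fragment, used only for illustration.
PAGE = '<div id="news"><h2><a href="/story/1">Headline</a></h2></div>'
doc = etree.HTML(PAGE)

# Way 1: let the expression return the text and the attribute value.
texts = doc.xpath('//*[@id="news"]/h2/a[1]/text()')
hrefs = doc.xpath('//*[@id="news"]/h2/a[1]/@href')

# Way 2: select the element, then read its properties in Python.
links = doc.xpath('//*[@id="news"]/h2/a[1]')
link = links[0] if links else None          # guard against no match
link_text = link.text if link is not None else None
link_href = link.attrib.get("href") if link is not None else None

print(texts, hrefs)
print(link_text, link_href)
```

The first way is compact; the second is handy when several properties of the same element are needed, since the element is located only once.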

That completes the extraction. Simple, isn't it?

Now let's go over the parsing rules again as they are used in lxml:

Expression   Description
nodename     Selects all child elements of the current node that are named nodename
/            Selects direct children of the current node
//           Selects descendants of the current node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
from lxml import etree

# text is assumed to hold an HTML string
html = etree.HTML(text)  # build an XPath-queryable tree from text; etree also repairs malformed HTML automatically
html = etree.parse('./ex.html', etree.HTMLParser())  # or read and parse a file directly

result = html.xpath('//*')  # select all nodes
result = html.xpath('//li')  # all li nodes
result = html.xpath('//li/a')  # all a nodes that are direct children of li nodes
result = html.xpath('//li//a')  # all a nodes that are descendants of li nodes
result = html.xpath('//a[@href="link.html"]/../@class')  # class attribute of the parent of every a node whose href is link.html
result = html.xpath('//li[@class="ni"]')  # all li nodes whose class attribute is ni
result = html.xpath('//li/text()')  # text of all li nodes
result = html.xpath('//li/a/@href')  # href attribute of the a nodes under all li nodes
result = html.xpath('//li[contains(@class,"li")]/a/text()')  # when the class attribute of li has several values, contains() is needed to match
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')  # match on several attributes

# Select by position; the brackets hold functions provided by XPath
result = html.xpath('//li[1]/a/text()')
result = html.xpath('//li[last()]/a/text()')
result = html.xpath('//li[position()<3]/a/text()')
result = html.xpath('//li[last()-2]/a/text()')

result = html.xpath('//li[1]/ancestor::*')  # all ancestor nodes
result = html.xpath('//li[1]/ancestor::div')  # only the div ancestors
result = html.xpath('//li[1]/attribute::*')  # all attribute values
result = html.xpath('//li[1]/child::a[@href="link1.html"]')  # direct child nodes
result = html.xpath('//li[1]/descendant::span')  # all descendant span nodes
result = html.xpath('//li[1]/following::*[2]')  # the second of all nodes after the current node
result = html.xpath('//li[1]/following-sibling::*')  # all following sibling nodes
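To see a few of these queries return actual values, here is a self-contained sketch. The `text` fragment is made-up sample markup (the original article does not show its input), and the last li is deliberately left unclosed to show that the HTML parser repairs it.

```python
from lxml import etree

# Hypothetical markup; the third <li> is intentionally not closed.
text = """
<div>
  <ul>
    <li class="li item-0" name="item"><a href="link1.html"><span>first</span></a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
    <li class="item-2"><a href="link3.html">third</a>
  </ul>
</div>
"""

html = etree.HTML(text)  # wraps the fragment in <html><body> and fixes the markup

all_links = html.xpath('//li//a/@href')
parent_class = html.xpath('//a[@href="link2.html"]/../@class')
class_contains = html.xpath('//li[contains(@class,"li")]/a/@href')
multi_attr = html.xpath('//li[contains(@class,"li") and @name="item"]/a/@href')
first_two = html.xpath('//li[position()<3]/a/@href')
last_text = html.xpath('//li[last()]/a/text()')
span_text = html.xpath('//li[1]/descendant::span/text()')
ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]
later_siblings = [e.tag for e in html.xpath('//li[1]/following-sibling::*')]

print(all_links, parent_class, class_contains, multi_attr)
print(first_two, last_text, span_text, ancestors, later_siblings)
```

Note that `contains(@class, "li")` is a plain substring test on the attribute value, so it would also match a class such as "online"; that is worth keeping in mind when class names overlap.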

3. lxml case study

To save effort, I decided to reuse the code from the earlier urllib article.

OK, that's all for today.

References for this article:

https://www.w3school.com.cn/


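This article is reposted from https://www.helloworld.net/redirect?target=https://mp.weixin.qq.com/s/yCvlmswfuRY9k-HiVxTXKg. If it infringes any rights, please get in touch to have it removed.

Copyright notice
This article was written by [Irene181]. Please include the original link when reposting. Thank you.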
https://pythonmana.com/2021/06/20210626091951816N.html
