Python regular expressions: this article is all you need!

Pig brother 66 2020-11-13 07:32:44

Earlier articles covered the origin, development, schools, syntax, engines, and optimization of regular expressions. Today we mainly look at how regular expressions are used in the Python language.

Most programming languages borrowed their regular expression design from Perl, so the syntax is largely the same. The difference is that each language provides its own functions to support regular expressions. Today we study the regular expression functions in Python.

One: Introduction to the re module

When it comes to regular expression support in Python, the first thing that comes to mind is the re library, a Python standard library for processing text.

The standard library consists of the modules built into Python, so nothing extra needs to be downloaded. Python currently has about 300 built-in modules. You can view all of Python's built-in modules here:

Because re is a built-in module, there is no need to download it; it can be used directly:

import re

The re module mainly defines 9 constants, 12 functions, and 1 exception. Brother Pig explains every constant and function through actual code cases, so that we can understand their roles more intuitively!

Note: to avoid garbled code formatting, Brother Pig tries to use code screenshots for the demonstrations.

Official documentation of the re module:
Source code of the re module:

Two: re module constants

Constants are variables that cannot be changed; they are generally used as flags.

The re module has 9 constants, and the constant values are all of type int!
All the constants are implemented in the RegexFlag enumeration class, which was introduced in Python 3.6. Versions before Python 3.6 defined the constants directly in the module. The advantage of using an enumeration is that the constants are easier to manage and use!

Let's quickly learn what these constants do and how to use them, ordered by how commonly they are used!


1. IGNORECASE

Syntax: re.IGNORECASE, or abbreviated as re.I

Effect: performs case-insensitive matching.

In the default matching mode, an uppercase B cannot match a lowercase b, but in ignore-case mode it can.
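
The original screenshot is not available, so here is a minimal sketch of the same idea. The sample text is made up for illustration:

```python
import re

text = "Brother Pig likes the letter b"
pattern = r"B"

# Default mode: only the uppercase "B" matches.
default_matches = re.findall(pattern, text)

# Ignore-case mode: the lowercase "b" matches as well.
ignorecase_matches = re.findall(pattern, text, re.I)

print(default_matches)     # ['B']
print(ignorecase_matches)  # ['B', 'b']
```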


2. ASCII

Syntax: re.ASCII, or abbreviated as re.A

Effect: as the name suggests, ASCII stands for the ASCII code. It makes \w, \W, \b, \B, \d, \D, \s and \S match only ASCII characters instead of Unicode characters.

In the author's example, whose text mixes the letters a, b, c with non-ASCII characters, \w+ matches the whole string in the default matching mode, whereas in ASCII mode it matches only a, b, and c (the characters that ASCII supports).

Note: this flag only applies to string patterns; it has no effect on bytes patterns.
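
A small sketch of the difference. The sample string is an assumption, chosen to mix ASCII letters with Chinese characters:

```python
import re

text = "a猪哥b大家好c"

# Default (Unicode) mode: \w also matches the Chinese characters.
unicode_matches = re.findall(r"\w+", text)

# ASCII mode: \w matches only [a-zA-Z0-9_].
ascii_matches = re.findall(r"\w+", text, re.A)

print(unicode_matches)  # ['a猪哥b大家好c']
print(ascii_matches)    # ['a', 'b', 'c']
```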


3. DOTALL

Syntax: re.DOTALL, or abbreviated as re.S

Effect: DOT stands for the dot (.) and ALL means everything, so together: the dot matches every character, including the newline \n.

In the default matching mode, the dot stops at the newline \n, so the text on each side of it is matched separately; in re.DOTALL mode, the newline \n is matched together with the rest of the string.

Note: in the default matching mode, the dot does not match the newline \n.
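
The behaviour can be sketched like this, using a made-up two-line string:

```python
import re

text = "abc\ndef"

# Default mode: the dot stops at the newline, so each line matches separately.
default_matches = re.findall(r".+", text)

# DOTALL mode: the dot also matches "\n", so the whole string is one match.
dotall_matches = re.findall(r".+", text, re.S)

print(default_matches)  # ['abc', 'def']
print(dotall_matches)   # ['abc\ndef']
```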


4. MULTILINE

Syntax: re.MULTILINE, or abbreviated as re.M

Effect: multiline mode. When a string contains the newline \n, the default mode does not treat each line separately for anchors such as the beginning of a line and the end of a line; multiline mode lets those anchors match at every line.

In regular expressions, ^ matches the beginning of a line. By default it can only match at the beginning of the string; in multiline mode it can also match right after a newline \n.

Note: in regular expression syntax, ^ matches the beginning of a line and \A matches the beginning of the string. In single-line mode the two behave the same; in multiline mode \A still matches only at the very beginning of the string and does not match after a \n.
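
A minimal sketch comparing ^ and \A, again with a made-up two-line string:

```python
import re

text = "abc\ndef"

# Default mode: ^ matches only at the start of the string.
default_matches = re.findall(r"^\w+", text)

# Multiline mode: ^ also matches right after each newline.
multiline_matches = re.findall(r"^\w+", text, re.M)

# \A always means "start of the string", even in multiline mode.
start_only_matches = re.findall(r"\A\w+", text, re.M)

print(default_matches)     # ['abc']
print(multiline_matches)   # ['abc', 'def']
print(start_only_matches)  # ['abc']
```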


5. VERBOSE

Syntax: re.VERBOSE, or abbreviated as re.X

Effect: verbose mode, which lets you add comments to a regular expression!

By default, comments inside a regular expression are not recognized; in verbose mode they are, and whitespace in the pattern is ignored.

When a regular expression is very complex, verbose mode gives you a way to annotate it, but it should not become a way to show off. Use it only after careful consideration!
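
As a rough illustration, the same pattern can be written compactly and in verbose mode. The e-mail-like pattern and the sample text are simplified assumptions, not a real e-mail validator:

```python
import re

text = "Contact: pig@example.com"

# Compact form of the pattern.
compact = re.compile(r"[\w.]+@[\w.]+")

# The same pattern in verbose mode: whitespace is ignored and
# everything after "#" on a line is treated as a comment.
verbose = re.compile(
    r"""
    [\w.]+   # user name
    @        # literal at sign
    [\w.]+   # domain name
    """,
    re.X,
)

compact_result = compact.search(text).group()
verbose_result = verbose.search(text).group()

print(compact_result)  # pig@example.com
print(verbose_result)  # pig@example.com
```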


6. LOCALE

Syntax: re.LOCALE, or abbreviated as re.L

Effect: makes \w, \W, \b, \B and case-insensitive matching depend on the current locale. This flag can only be used with bytes patterns. It is officially discouraged because the locale mechanism is very unreliable: it handles only one "culture" at a time and only works with 8-bit locales.

Note: because this flag is officially discouraged and Brother Pig has never used it, no code case is given!


7. UNICODE

Syntax: re.UNICODE, or abbreviated as re.U

Effect: the counterpart of the ASCII mode: it matches the characters that Unicode supports. However, Python 3 strings are already Unicode by default, so this flag is redundant.


8. DEBUG

Syntax: re.DEBUG

Effect: displays debug information about the compiled expression.

Although debug mode does print the compilation information, Brother Pig does not understand the notation or what it expresses. Advice from readers who do is welcome.
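
To see the debug output yourself, a sketch like the following is enough. The exact text printed is an implementation detail that varies between Python versions, so no expected output is shown here:

```python
import re

# Compiling with re.DEBUG prints a description of the parsed
# pattern to standard output as a side effect.
debug_regex = re.compile(r"\d+", re.DEBUG)

# The returned object is an ordinary compiled pattern.
print(debug_regex.search("a1").group())  # 1
```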


9. TEMPLATE

Syntax: re.TEMPLATE, or abbreviated as re.T

Effect: Brother Pig did not figure out the specific use of TEMPLATE. The source code comment says: disable backtracking. If you know, leave a message to let me know! Note that this flag was never documented; it was deprecated in Python 3.11 and removed in Python 3.13.

10. Constant summary

  1. Of the 9 constants, the first 5 (IGNORECASE, ASCII, DOTALL, MULTILINE, VERBOSE) are useful; two (LOCALE, UNICODE) are not recommended officially; and two (TEMPLATE, DEBUG) are experimental and should not be relied on.
  2. The constants can be used in the common re functions; check the source code to confirm.
  3. Constants can be combined. Because every constant value is a power of 2, they can be combined; use the | operator, not the + operator!
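
A short sketch of combining flags, and of why + is risky. The sample text is made up for illustration:

```python
import re

text = "Pig\npig\nPIG"

# Combine ignore-case and multiline mode with the bitwise OR operator.
flags = re.I | re.M
matches = re.findall(r"^pig$", text, flags)
print(matches)  # ['Pig', 'pig', 'PIG']

# OR-ing a flag with itself changes nothing ...
same_flag = (re.I | re.I) == re.I

# ... but adding it to itself produces the value of a different flag
# (re.I is 2, so the sum is 4, which is the value of re.L).
doubled_with_plus = re.I + re.I
print(same_flag, doubled_with_plus == int(re.L))  # True True
```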


Three: re module functions

The re module has 12 functions. Brother Pig explains them grouped by function category, which makes them easier to compare and to remember.

1. Find a match

There are 3 functions that find and return a single match: search, match, and fullmatch. The differences between them are:

  1. search: finds a match anywhere in the string
  2. match: must match from the beginning of the string
  3. fullmatch: the whole string must match the regular expression exactly

Let's compare them through actual code cases:

Case 1:
In case 1, the search function matches anywhere in the string: as long as some substring matches the regular expression, the match succeeds. There are actually two matches in the text, but search only returns the first one.

The match function must match from the beginning. The string starts with the letter a, so it does not match. The fullmatch function requires the whole string to match, so it does not match either!

Case 2:
Case 2 deletes the first letter a from text, so the match function now matches, while the fullmatch function still cannot match the whole string!

Case 3:
In case 3 we keep only one segment of the text, which is consistent with the regular expression; now the fullmatch function finally matches.

Note: the functions that find a single match all return a match object (Match).
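
The three cases can be sketched as follows. The pattern and the sample strings are assumptions standing in for the lost screenshots:

```python
import re

pattern = r"\d+"

text1 = "a123b456"  # starts with a letter and contains two runs of digits
text2 = "123b456"   # the leading letter is removed
text3 = "123"       # only one run of digits is left

# Case 1: only search succeeds, and it returns just the first match.
search1 = re.search(pattern, text1)
match1 = re.match(pattern, text1)
full1 = re.fullmatch(pattern, text1)

# Case 2: match now succeeds, fullmatch still fails.
match2 = re.match(pattern, text2)
full2 = re.fullmatch(pattern, text2)

# Case 3: the whole string fits the pattern, so fullmatch succeeds.
full3 = re.fullmatch(pattern, text3)

print(search1.group(), match1, full1)  # 123 None None
print(match2.group(), full2)           # 123 None
print(full3.group())                   # 123
```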

2. Find multiple matches

Having covered finding a single match, let's now look at finding multiple matches. The main functions for this are findall and finditer:

  1. findall: searches the whole string and returns a list
  2. finditer: searches the whole string and returns an iterator

The two functions are basically similar; one returns a list and the other returns an iterator. A list is built in memory all at once, whereas an iterator produces items one by one as they are needed, which uses memory better.

If there could be a large number of matches, finditer is recommended; in ordinary cases findall makes basically no difference.
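
A minimal sketch of both functions on a made-up string. Note that findall returns the matched strings, while finditer yields Match objects:

```python
import re

text = "a1b22c333"

# findall builds the complete list of matched strings at once.
all_matches = re.findall(r"\d+", text)

# finditer yields Match objects lazily, one at a time.
match_iter = re.finditer(r"\d+", text)
spans = [(m.group(), m.span()) for m in match_iter]

print(all_matches)  # ['1', '22', '333']
print(spans)        # [('1', (1, 2)), ('22', (3, 5)), ('333', (6, 9))]
```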

3. Split

re.split(pattern, string, maxsplit=0, flags=0): splits string by pattern. maxsplit is the maximum number of splits, and flags is the matching mode, i.e. the constants explained above!

Note: the str type also has a split method. How do we choose between the two?
str.split is simple and does not support splitting by a regular expression, while re.split does.
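
For illustration, here is a sketch of re.split next to str.split, with made-up sample strings:

```python
import re

text = "a1b22c333d"

# Split on every run of digits.
parts = re.split(r"\d+", text)

# maxsplit=1 performs at most one split; the rest is returned unsplit.
first_split = re.split(r"\d+", text, maxsplit=1)

# str.split only accepts a fixed separator, not a pattern.
fixed_parts = "a,b,c".split(",")

print(parts)        # ['a', 'b', 'c', 'd']
print(first_split)  # ['a', 'b22c333d']
print(fixed_parts)  # ['a', 'b', 'c']
```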

What about their speed? Brother Pig measured it, comparing the execution time of re.split and str.split over the same data for different numbers of executions.
According to that comparison, str.split is faster within 1,000 loops, while beyond 1,000 loops re.split is significantly faster, and the gap grows with the number of loops!

So the conclusion is: use str.split when regular expression support is not needed and the amount of data and the number of calls are small; otherwise use re.split.

Note: the specific execution time depends on the test data!

4. Replace

The main functions are sub and subn, which work similarly!

First, the usage of the sub function:

re.sub(pattern, repl, string, count=0, flags=0): repl replaces the parts of string matched by pattern, count is the maximum number of replacements, and flags is the regular expression constants.

It is worth noting that repl in the sub function can be either a string or a function! If repl is a function, it takes exactly one argument: the match object (Match).
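
Both forms of repl can be sketched like this. The sample text and the helper function double are hypothetical, introduced only for the example:

```python
import re

text = "price: 10 yuan, tax: 2 yuan"

# repl as a string: every run of digits becomes "*".
masked = re.sub(r"\d+", "*", text)

# count=1 limits the number of replacements to one.
first_only = re.sub(r"\d+", "*", text, count=1)

# repl as a function: it receives the Match object and returns the new text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

print(masked)      # price: * yuan, tax: * yuan
print(first_only)  # price: * yuan, tax: 2 yuan
print(doubled)     # price: 20 yuan, tax: 4 yuan
```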


re.subn(pattern, repl, string, count=0, flags=0) works the same way as re.sub, except that it returns a tuple (new string, number of replacements).
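
A one-call sketch of subn on a made-up string:

```python
import re

# subn returns the new string together with the number of replacements.
new_text, replacements = re.subn(r"\d+", "*", "a1b22c333")

print(new_text, replacements)  # a*b*c* 3
```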

5. Compile regular objects

The compile function and the template function compile a regular expression pattern into a regular expression object (the regular object Pattern), which provides the same regular expression functions as the re module (the regular object Pattern is explained later).
The template function is similar to the compile function; it just adds the re.TEMPLATE mode mentioned earlier, as the source code shows. The template function was deprecated in Python 3.11 and removed in Python 3.13.
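
A minimal sketch of compiling a pattern once and then calling the object's methods. The variable name digits and the sample string are chosen only for this example:

```python
import re

# Compile the pattern once and keep the resulting Pattern object.
digits = re.compile(r"\d+")

# The object offers the same operations as the module-level functions.
first = digits.search("a1b22c").group()
all_found = digits.findall("a1b22c")
replaced = digits.sub("*", "a1b22c")
pieces = digits.split("a1b22c")

print(first, all_found, replaced, pieces)
# 1 ['1', '22'] a*b*c ['a', 'b', 'c']
```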

6. Other functions

re.escape(pattern) escapes the characters that have a special meaning in regular expressions, such as . or *.
Using re.escape(pattern) seems convenient because we do not have to add the escapes ourselves, but it can easily lead to escaping mistakes, so Brother Pig does not recommend it and suggests escaping manually!
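
To show what escaping changes, here is a sketch with a made-up string containing . and *:

```python
import re

raw = "a.b*c"

# escape puts a backslash before the special characters "." and "*".
escaped = re.escape(raw)
print(escaped)  # a\.b\*c

# The escaped pattern matches only the literal text ...
literal_match = re.fullmatch(escaped, raw) is not None
other_match = re.fullmatch(escaped, "aXbbbc") is not None

# ... while the unescaped pattern treats "." and "*" as operators.
unescaped_match = re.fullmatch(raw, "aXbbbc") is not None

print(literal_match, other_match, unescaped_match)  # True False True
```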

The re.purge() function clears the regular expression cache. What cache is that? According to the source code, the re module keeps a cache of compiled patterns behind the scenes, and re.purge() clears it.
To verify this, Brother Pig called re.purge() between two cases and compared the cache in the source code before and after the call to see whether it changed!


Four: re module exception

The re module also contains one exception for regular expression compilation errors (re.error): when the regular expression we provide is invalid (the expression itself has a problem), an exception is raised!

In the author's case, the regular expression was written with an extra bracket, which caused the error. The error is raised before any matching is attempted, that is, when the regular expression is compiled.

Note: the exception is always caused by the regular expression itself being invalid; it has nothing to do with the string being matched!
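
A small sketch of catching the exception. The broken pattern is an assumption that mirrors the unbalanced bracket described above, and the exact message text may differ between Python versions:

```python
import re

bad_pattern = r"(\d+"  # the opening parenthesis is never closed

try:
    re.compile(bad_pattern)
    message = None
except re.error as exc:
    # Raised at compile time, before any text is matched.
    message = str(exc)

print(message)
```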

Five: The regular object Pattern

We have finished explaining the constants, functions, and exception of the re module, but the regular object Pattern absolutely needs to be discussed as well.

1. The same functions as the re module

Among the functions of the re module there is an important one, compile. It precompiles a pattern and returns a regular object, and this regular object has the same functions as the re module, as the source code of the Pattern class shows.
Since they are the same, which should we use in the end: the re module or the regular object Pattern?

Moreover, readers who have looked at the re module source code will have noticed that the compile function and the other re functions (search, split, sub, and so on) call the same internal function, and in the end they call the functions of the regular object!
In other words, the two ways of writing the code below have the same underlying implementation:

# re module function
re.search(pattern, text)
# Regular object function
regex = re.compile(pattern)
regex.search(text)

So is it still necessary to get a regular object with the compile function and then call its search function? Isn't it fine to call re.search directly?

2. What do the official documents say?

Does the official documentation say whether to use the re module or the regular object Pattern?

The official documentation recommends using a regular object Pattern when a regular expression is used multiple times, because saving and reusing the compiled object is more efficient. It also notes that the most recent patterns passed to re.compile(pattern) and to the module-level functions are cached!

3. How about the actual test ?

The official documentation above recommends using a regular object when a regular expression is used multiple times. Is that really the case?

Let's measure it.

Brother Pig wrote two functions, one using the re module function and the other using the regular object function, and ran each of them (at different times) in a loop count times (count from 1 to 10,000), comparing the time the two take!

The result, plotted as a line chart, leads to this conclusion: within 100 loops the two are basically equally fast; beyond 100 loops, the function using the regular object Pattern takes significantly less time, so it is faster than the re module function!

The actual test shows that the official recommendation to use regular object functions when a regular expression is used multiple times basically holds!
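
The original test code is not available, so the following is only a sketch of how such a comparison could be set up with timeit. The function names use_module and use_compiled, the pattern, and the sample text are all assumptions, and the measured times depend on the machine and the Python version:

```python
import re
import timeit

pattern = r"\d+"
text = "a123b456" * 10
regex = re.compile(pattern)

def use_module(count):
    # Module-level function: looks the pattern up in the cache on every call.
    for _ in range(count):
        re.search(pattern, text)

def use_compiled(count):
    # Regular object: the compiled pattern is reused directly.
    for _ in range(count):
        regex.search(text)

count = 10_000
module_time = timeit.timeit(lambda: use_module(count), number=1)
compiled_time = timeit.timeit(lambda: use_compiled(count), number=1)

print(f"re.search:    {module_time:.4f} s")
print(f"regex.search: {compiled_time:.4f} s")
```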

Six: Points to note

That basically covers Python regular expressions. Finally, here are a few points to watch out for.

1. Byte strings and character strings

The pattern and the searched string can both be Unicode strings (str) or both be 8-bit byte strings (bytes). However, Unicode strings and 8-bit byte strings cannot be mixed!
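
A short sketch of the three combinations, with made-up sample data:

```python
import re

# str pattern with a str subject.
str_result = re.findall(r"\d+", "a1b22")

# bytes pattern with a bytes subject.
bytes_result = re.findall(rb"\d+", b"a1b22")

# Mixing a str pattern with a bytes subject raises TypeError.
try:
    re.findall(r"\d+", b"a1b22")
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(str_result)    # ['1', '22']
print(bytes_result)  # [b'1', b'22']
print(type(mixed_error).__name__)  # TypeError
```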

2. The role of r

Regular expressions use the backslash ('\') to indicate special forms or to turn special characters into ordinary characters.

The backslash serves the same purpose in ordinary Python string literals, so there is a conflict.

The solution is to write regular expression patterns in Python's raw string notation: in a string literal prefixed with 'r', the backslash is not handled in any special way.
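
The conflict can be seen in a sketch like this one, using made-up sample strings:

```python
import re

# In a normal string literal "\b" is the backspace character,
# so the regex engine never sees a word-boundary escape.
normal_matches = re.findall("\bpig\b", "a pig here")

# In a raw string the backslash is kept, so \b means "word boundary".
raw_matches = re.findall(r"\bpig\b", "a pig here")

# Matching one literal backslash takes four backslashes without r
# and two with it; both literals are the same two-character pattern.
same_pattern = "\\\\" == r"\\"
backslash_matches = re.findall(r"\\", "C:\\temp")

print(normal_matches)          # []
print(raw_matches)             # ['pig']
print(same_pattern)            # True
print(len(backslash_matches))  # 1
```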

3. Regular search functions return a match object

The functions that find a single match (search, match, fullmatch) return a match object Match; the matched value has to be obtained through match.group(), which is easy to forget.
Also pay attention to the difference between the match.group() and match.groups() functions!
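
The difference can be sketched with a pattern that has two capture groups. The pattern and the sample text are made up for illustration:

```python
import re

match = re.search(r"(\d+)-(\d+)", "tel: 010-12345")

whole = match.group()        # the entire matched text
first_group = match.group(1) # text of the first capture group
second_group = match.group(2)
all_groups = match.groups()  # tuple with every capture group

print(whole)        # 010-12345
print(first_group)  # 010
print(second_group) # 12345
print(all_groups)   # ('010', '12345')
```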

4. Reusing a regular expression

If you want to reuse a regular expression, it is recommended to get a regular object with re.compile(pattern) and then reuse that object, which is faster!

5. Python regular expressions in interviews

Written tests may require Python regular expressions, but they will not be too hard. As long as you remember the differences between those methods and can use them, there should be no big problem.

Do you now have a clear understanding of regular expressions in Python?

This article was written by [Pig brother 66]. Please include a link to the original when reposting. Thanks.
