Earlier posts covered the origin, development, flavors, syntax, engines, and optimization of regular expressions. Today we look at how regular expressions are used in Python!
Most programming languages borrowed their regular expression design from Perl, so the syntax is broadly the same. What differs is the set of functions each language provides to support regular expressions. Today we study the regular expression functions in Python.
When it comes to regular expression support in Python, the first thing that comes to mind is the re library, Python's standard library for text processing.
A standard library module is built into Python and needs no extra download. Python currently ships with roughly 300 built-in modules. You can browse all of them here: https://docs.python.org/3/py-modindex.html#cap-r
Because re is a built-in module, there is nothing to install. It can be used directly:
import re
The re module mainly defines 9 constants, 12 functions, and 1 exception. Brother Pig will explain every constant and function through real code examples, so that their purpose is easier to grasp.
Note: to keep the code formatting from getting garbled, Brother Pig uses code screenshots for the demonstrations wherever possible.
re module official documentation: https://docs.python.org/zh-cn/3.8/library/re.html
re module source code: https://github.com/python/cpython/blob/3.8/Lib/re.py
Constants are variables whose values do not change; they are generally used as flags.
The re module has 9 constants, and their values are all of type int!
As the re.py source shows, all of the constants are implemented in the RegexFlag enumeration class, which was introduced in Python 3.6. Versions before Python 3.6 defined the constants directly in re.py. The advantage of an enumeration is that the constants are easier to manage and use!
Let's quickly go through what these constants do and how to use them, ordered by how often they are used!
Syntax: re.IGNORECASE, or the short form re.I
Effect: case-insensitive matching.
Code example:
In the default matching mode an uppercase B cannot match a lowercase b, but in ignore-case mode it can.
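Since the original screenshot is not available here, the following is a minimal sketch of the behaviour described above. The sample text and pattern are made up for illustration.

```python
import re

text = "abc"      # sample text (illustrative), contains only a lowercase b
pattern = r"B"    # uppercase B

# Default mode: matching is case-sensitive, so nothing is found.
default_result = re.findall(pattern, text)

# Ignore-case mode: B also matches b.
ignorecase_result = re.findall(pattern, text, re.IGNORECASE)

print(default_result)     # []
print(ignorecase_result)  # ['b']
```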
Syntax: re.ASCII, or the short form re.A
Effect: as the name suggests, ASCII refers to the ASCII character set. The flag makes \w, \W, \b, \B, \d, \D, \s and \S match ASCII only, instead of Unicode.
Code example:
In the default matching mode \w+ matches the whole string, while in ASCII mode it matches only a, b and c (the characters ASCII supports).
Note: this flag only affects str patterns; it is ignored for bytes patterns.
Syntax: re.DOTALL, or the short form re.S
Effect: DOT stands for the . character and ALL means everything, so together the name means that . matches everything, including the newline \n. In the default mode . cannot match the newline \n.
Code example:
In the default matching mode . does not match the newline \n, so each line is matched separately. In re.DOTALL mode the newline \n is matched together with the rest of the string.
Note: in the default matching mode . does not match the newline \n.
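A minimal sketch of this behaviour, with a two-line sample string chosen for illustration:

```python
import re

text = "line one\nline two"

# Default mode: . stops at the newline, so each line matches on its own.
default_result = re.findall(r".+", text)

# DOTALL mode: . also matches \n, so the whole string is one match.
dotall_result = re.findall(r".+", text, re.DOTALL)

print(default_result)  # ['line one', 'line two']
print(dotall_result)   # ['line one\nline two']
```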
Syntax: re.MULTILINE, or the short form re.M
Effect: multiline mode. When a string contains newlines \n, the default mode is not line-aware for anchors such as start of line and end of line. Multiline mode makes these anchors match at every line.
Code example:
In regular expressions ^ matches the start of a line. By default it only matches at the start of the string; in multiline mode it also matches right after a newline \n.
Note: in regular expression syntax ^ matches the start of a line and \A matches the start of the string. In single-line mode the two behave the same. In multiline mode \A still matches only at the start of the string and does not recognize \n.
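The difference between ^ and \A can be sketched as follows; the three-line sample text is an illustrative assumption:

```python
import re

text = "first\nsecond\nthird"

# Default mode: ^ matches only at the very start of the string.
default_result = re.findall(r"^\w+", text)

# Multiline mode: ^ also matches right after each newline.
multiline_result = re.findall(r"^\w+", text, re.MULTILINE)

# \A always means "start of the string", even in multiline mode.
start_only_result = re.findall(r"\A\w+", text, re.MULTILINE)

print(default_result)     # ['first']
print(multiline_result)   # ['first', 'second', 'third']
print(start_only_result)  # ['first']
```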
Syntax: re.VERBOSE, or the short form re.X
Effect: verbose mode, which lets you add comments to a regular expression!
Code example:
By default, comments inside a regular expression are not recognized as comments; in verbose mode they are.
When a regular expression is very complex, verbose mode gives you another way to document it. It should not become a way to show off, though, so use it only after careful consideration!
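A minimal sketch of an annotated pattern, assuming a simple year-month format picked for illustration. In verbose mode, whitespace in the pattern is ignored and everything from # to the end of the line is a comment.

```python
import re

pattern = r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
"""
text = "2020-05"

# Without re.VERBOSE the whitespace and comments are taken literally,
# so the pattern does not match.
default_match = re.search(pattern, text)

# With re.VERBOSE they are ignored and only the real tokens remain.
verbose_match = re.search(pattern, text, re.VERBOSE)

print(default_match)          # None
print(verbose_match.group())  # 2020-05
```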
Syntax: re.LOCALE, or the short form re.L
Effect: makes \w, \W, \b, \B and case-insensitive matching depend on the current locale. This flag can only be used with bytes patterns. Its use is officially discouraged, because the locale mechanism is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales.
Note: because this flag is officially discouraged and Brother Pig has never used it, no code example is given!
Syntax: re.UNICODE, or the short form re.U
Effect: the counterpart of ASCII mode; matching follows Unicode. In Python 3, however, str patterns already use Unicode matching by default, so this flag is redundant.
Syntax: re.DEBUG
Effect: displays debug information about the compiled expression.
Code example:
Debug mode does print the compilation information, but Brother Pig does not understand the notation or what it expresses. Readers who do are welcome to offer some pointers.
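The sketch below shows one way to look at the debug output. Capturing it with contextlib.redirect_stdout is just a convenience chosen here so the text can be inspected as a string. The exact output format is an implementation detail and can differ between Python versions; it is, roughly, a dump of the parsed pattern (for example a repeat node wrapping a digit category), in some versions followed by the compiled opcodes.

```python
import contextlib
import io
import re

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # re.DEBUG prints debug information while the pattern is compiled.
    re.compile(r"\d+", re.DEBUG)

debug_output = buffer.getvalue()
print(debug_output)
```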
Syntax: re.TEMPLATE, or the short form re.T
Effect: Brother Pig has not figured out what TEMPLATE is actually used for. The comment in the source code says: disable backtracking. If you know, leave a comment and let me know!
To combine several constants, use the | symbol, not the + symbol!
Finally, a mind map summarizes the constants in the re module.
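As a small illustration of why | is the right operator for combining flags: the flags are bit masks, so | merges them safely, while + is plain integer addition and can silently produce a different flag. The sample text is made up.

```python
import re

text = "Hello\nhello"

# Combine ignore-case and multiline mode with |.
combined = re.findall(r"^hello$", text, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['Hello', 'hello']

# OR-ing a flag with itself is harmless...
same_flag_or = re.IGNORECASE | re.IGNORECASE
# ...but adding it to itself yields the value of another flag (re.LOCALE).
same_flag_add = re.IGNORECASE + re.IGNORECASE

print(same_flag_or == re.IGNORECASE)  # True
print(same_flag_add == re.LOCALE)     # True
```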
The re module has 12 functions. Brother Pig will explain them grouped by purpose, which makes them easier to compare and easier to remember.
Three functions find and return a single match: search, match and fullmatch. They differ in where the match may occur: search looks anywhere in the string, match only matches at the beginning of the string, and fullmatch requires the entire string to match.
Let's compare them with real code examples:
Case 1:
In case 1 the search function matches anywhere in the string; as long as some part of the string matches the regular expression, the match succeeds. There are actually two possible matches, but search returns only one.
The match function matches from the beginning, and the string starts with the letter a, so it fails. The fullmatch function needs the whole string to match, so it fails as well!
Case 2:
Case 2 removes the leading letter a from text, so the match function now succeeds, while fullmatch still cannot match the whole string!
Case 3:
In case 3 we keep only a single segment that corresponds exactly to the regular expression. Now fullmatch finally matches.
The complete example:
Note: the functions that find a single match all return a match object (Match).
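The three cases can be sketched roughly as follows. The pattern and texts are stand-ins invented for illustration, not the ones from the original screenshots; when nothing matches, each function returns None instead of a Match object.

```python
import re

pattern = r"\d+"

# Case 1: the text starts with a letter and contains two numbers.
text1 = "a123 456"
case1 = (re.search(pattern, text1), re.match(pattern, text1), re.fullmatch(pattern, text1))
print(case1[0].group(), case1[1], case1[2])  # 123 None None

# Case 2: leading letter removed, so match() now succeeds.
text2 = "123 456"
case2 = (re.search(pattern, text2), re.match(pattern, text2), re.fullmatch(pattern, text2))
print(case2[0].group(), case2[1].group(), case2[2])  # 123 123 None

# Case 3: the whole text corresponds to the pattern, so fullmatch() succeeds.
text3 = "123"
case3 = re.fullmatch(pattern, text3)
print(case3.group())  # 123
```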
That covers finding a single match; now let's look at finding multiple matches. The main functions for this are findall and finditer.
The two are basically alike, except that one returns a list and the other returns an iterator. A list is built in memory all at once, whereas an iterator produces its items one by one as they are needed, which makes better use of memory.
If there may be a large number of matches, finditer is recommended. For ordinary use findall makes practically no difference.
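A short sketch of the two functions side by side, with a sample string chosen for illustration. Note that finditer yields Match objects, so positions are available too.

```python
import re

text = "a1b22c333"

# findall builds the complete list of matched strings at once.
all_numbers = re.findall(r"\d+", text)
print(all_numbers)  # ['1', '22', '333']

# finditer yields Match objects lazily, one at a time.
spans = [match.span() for match in re.finditer(r"\d+", text)]
print(spans)  # [(1, 2), (3, 5), (6, 9)]
```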
The re.split(pattern, string, maxsplit=0, flags=0) function splits string wherever pattern matches. maxsplit is the maximum number of splits, and flags is the matching mode, that is, the constants explained above!
Note: the str type also has a split method. How do you choose between the two?
str.split is simple and does not support splitting on a regular expression, while re.split does.
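A minimal sketch of the difference, assuming a sample string with mixed separators:

```python
import re

text = "a, b;c  d"

# re.split can split on any run of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit limits the number of splits; the rest stays in one piece.
first_split = re.split(r"[,;\s]+", text, maxsplit=1)
print(first_split)  # ['a', 'b;c  d']

# str.split only accepts a fixed separator string.
plain_parts = text.split(", ")
print(plain_parts)  # ['a', 'b;c  d']
```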
What about the speed of the two? Brother Pig ran an actual test, calling re.split and str.split on the same data and charting the number of executions against the execution time:
In that comparison, str.split was faster below 1,000 loop iterations, while above 1,000 iterations re.split was clearly faster, and the gap widened as the count grew!
So the conclusion is: when regular expression support is not needed and both the amount of data and the number of calls are small, str.split is the better fit; otherwise use re.split.
Note: the actual execution times depend on the test data!
Replacement is handled mainly by the sub function and the subn function, which work in similar ways!
First, the usage of the sub function:
re.sub(pattern, repl, string, count=0, flags=0) parameters: repl replaces the parts of string that pattern matches, count is the maximum number of replacements (0 means replace all), and flags are the regular expression constants.
It is worth noting that repl can be either a string or a function! If repl is a function, it takes exactly one argument: the Match object.
re.subn(pattern, repl, string, count=0, flags=0) behaves the same as re.sub, except that it returns a tuple (new string, number of replacements).
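The points above can be sketched as follows. The sample text and the helper function double are hypothetical, introduced only for this illustration.

```python
import re

text = "a1b22c333"

# repl as a string: every number is replaced by "#".
replaced_all = re.sub(r"\d+", "#", text)
print(replaced_all)  # a#b#c#

# count=1 replaces only the first match.
replaced_once = re.sub(r"\d+", "#", text, count=1)
print(replaced_once)  # a#b22c333


def double(match):
    """Hypothetical repl function: receives the Match object, returns a string."""
    return str(int(match.group()) * 2)


replaced_by_function = re.sub(r"\d+", double, text)
print(replaced_by_function)  # a2b44c666

# subn returns (new string, number of replacements).
result_and_count = re.subn(r"\d+", "#", text)
print(result_and_count)  # ('a#b#c#', 3)
```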
The compile function and the template function compile a regular expression pattern into a regular expression object (the regular object Pattern). This object offers the same regular expression functions as the re module (the Pattern object is explained later).
The template function is similar to compile, except that it adds the re.TEMPLATE mode mentioned earlier, as the source code shows.
re.escape(pattern) escapes the characters in pattern that have a special meaning in regular expressions, such as . or *. A real example:
re.escape(pattern) looks convenient because we no longer have to add escapes ourselves, but it escapes every special character, including ones that were meant as regular expression syntax, so it is easy to end up with wrong escaping. For that reason using escape is not recommended; escaping by hand is suggested instead!
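A small sketch of both sides of re.escape, using made-up strings: it works well when the whole input should be treated as literal text, and it goes wrong when the input already contains regular expression syntax.

```python
import re

# Intended use: treat the whole string as literal text.
literal = "1+1=2 (really?)"
escaped = re.escape(literal)
literal_match = re.search(escaped, "we know 1+1=2 (really?)")
print(literal_match.group())  # 1+1=2 (really?)

# Pitfall: escaping something that was meant to be a pattern.
# \d+ now means a literal backslash, "d" and "+" instead of "digits".
over_escaped = re.escape(r"\d+")
pitfall_match = re.search(over_escaped, "123")
print(pitfall_match)  # None
```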
The re.purge() function clears the regular expression cache. What cache is that? Let's look at the source code to see what it does behind the scenes:
Judging by the source, it simply clears the cache. Let's look at a concrete example:
Brother Pig called re.purge() between the two cases to clear the cache, then compared the cache in the module source before and after to see whether anything changed!
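One way to observe this is sketched below. It peeks at re._cache, which is a private implementation detail of CPython's re module and not a public API; its name and structure are an assumption here and may change between versions.

```python
import re

re.purge()           # start from an empty cache
re.compile(r"\d+")   # compiling stores the pattern in the module cache

# re._cache is private (assumed to exist as a dict, as in CPython's re.py).
size_before = len(re._cache)

re.purge()           # clear the regular expression cache
size_after = len(re._cache)

print(size_before > 0)  # True
print(size_after)       # 0
```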
Finally, a mind map summarizes the functions in the re module.
The re module also contains one exception, re.error, for regular expression compilation errors. When the regular expression we supply is invalid (the problem lies in the expression itself), this exception is raised!
Let's look at a concrete example:
In the example above, the regular expression contains one bracket too many, which causes the error. The error also appears before the output of all the other cases, which shows that it is reported when the regular expression is compiled.
Note: the exception always means that the regular expression itself is invalid; it has nothing to do with the string being matched!
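A minimal sketch of catching the exception, assuming a pattern with an unclosed bracket. No target string is involved at all, which underlines that the error comes from the pattern alone. The exact message text may vary between Python versions.

```python
import re

caught = False
message = ""
try:
    # The opening bracket is never closed, so compilation fails.
    re.compile(r"(\d+")
except re.error as exc:
    caught = True
    message = str(exc)

print(caught)   # True
print(message)  # describes the problem and its position in the pattern
```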
We have now covered the constants, functions and exception of the re module, but the regular object Pattern absolutely needs a mention too.
Among the functions of the re module there is one important function, compile. It precompiles a pattern and returns a regular object, and this regular object has the same functions as the re module. Let's look at the source of the Pattern class.
Since they are the same, which one should we actually use: the re module or the regular object Pattern?
Moreover, readers who have looked at the re module source will have noticed that compile and the other re functions (search, split, sub and so on) call the same internal function, and in the end it is the regular object's function that gets called!
In other words, the following two ways of writing the code share the same underlying implementation:
import re
pattern, text = r"\d+", "abc 123"  # placeholder inputs for illustration
# re module function
re.search(pattern, text)
# Regular object (Pattern) function
regex = re.compile(pattern)
regex.search(text)
So is it still necessary to use compile to get a regular object and then call its search function? Isn't calling re.search directly enough?
Does the official documentation say whether to use the re module or the regular object Pattern?
The official documentation recommends saving the regular object returned by re.compile(pattern) and reusing it when a regular expression is used several times, because that is more efficient. It also notes that the compiled versions of the most recent patterns passed to re.compile() and to the module-level functions are cached, so programs that use only a few regular expressions need not worry about compiling them.
So the documentation recommends using a regular object when a regular expression is used many times. Is that really the case?
Let's measure it.
Brother Pig wrote two functions, one calling re.search and the other calling search on a compiled regular object, and ran each of them (at separate times) in a loop count times (with count ranging from 1 to 10,000), then compared how long the two took!
The results as a line chart:
The conclusion: within 100 loop iterations the two are about equally fast. Beyond 100 iterations the regular object Pattern takes noticeably less time, so it is faster than the re module functions!
The actual test shows that the official recommendation to use the regular object's functions when a regular expression is used many times basically holds!
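A measurement along these lines could be sketched with timeit as below. This is not Brother Pig's original test code; the pattern, text, function names and loop counts are assumptions, and the timings depend on the machine and the data, so no expected numbers are shown.

```python
import re
import timeit

pattern = r"\d+"
text = "abc 123 def 456"
compiled = re.compile(pattern)


def use_module():
    # Module-level function: looks the pattern up in the cache on every call.
    return re.search(pattern, text)


def use_pattern():
    # Precompiled regular object: no per-call cache lookup.
    return compiled.search(text)


for count in (100, 10_000):
    module_time = timeit.timeit(use_module, number=count)
    pattern_time = timeit.timeit(use_pattern, number=count)
    print(f"{count:>6} loops: re.search {module_time:.6f}s, Pattern.search {pattern_time:.6f}s")
```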
That basically covers regular expressions in Python. To finish, here are a few points to watch out for.
Both the pattern and the searched string can be Unicode strings (str) or 8-bit byte strings (bytes). However, Unicode strings and 8-bit byte strings cannot be mixed!
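A short sketch of this rule with made-up inputs: str with str works, bytes with bytes works, and mixing the two raises a TypeError.

```python
import re

# str pattern with str text.
str_result = re.search(r"\d+", "abc123").group()
print(str_result)  # 123

# bytes pattern with bytes text.
bytes_result = re.search(rb"\d+", b"abc123").group()
print(bytes_result)  # b'123'

# Mixing a bytes pattern with a str text is not allowed.
mixed_failed = False
try:
    re.search(rb"\d+", "abc123")
except TypeError:
    mixed_failed = True
print(mixed_failed)  # True
```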
Regular expressions use the backslash ('\') to indicate special forms, or to escape special characters so that they are treated as ordinary characters.
The backslash serves the same purpose in ordinary Python strings, so the two conflict.
The solution is to write regular expression patterns in Python's raw string notation. In a string literal prefixed with 'r', backslashes are not handled in any special way.
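The conflict can be illustrated with \b, which means "backspace" in an ordinary Python string but "word boundary" in a regular expression. The sample text is made up.

```python
import re

plain = "\bcat"   # ordinary string: "\b" becomes one backspace character
raw = r"\bcat"    # raw string: backslash and "b" are kept for the regex engine

print(len(plain), len(raw))  # 4 5

plain_match = re.search(plain, "a cat")  # looks for backspace + "cat"
raw_match = re.search(raw, "a cat")      # looks for "cat" at a word boundary

print(plain_match)        # None
print(raw_match.group())  # cat

# Matching one literal backslash: the raw form needs half as many backslashes.
print("\\\\" == r"\\")    # True
```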
The functions that find a single match (search, match, fullmatch) return a match object (Match). You have to call match.group() to get the matched value, which is easy to forget.
Also pay attention to the difference between match.group() and match.groups()!
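The difference can be sketched with a pattern that has two capture groups (the pattern and text are illustrative): group() returns the whole match or selected groups, while groups() returns a tuple of all capture groups.

```python
import re

match = re.search(r"(\d{4})-(\d{2})", "date: 2020-05")

whole = match.group()         # the entire match
first = match.group(1)        # the first capture group
selected = match.group(1, 2)  # several groups at once, as a tuple
everything = match.groups()   # all capture groups, as a tuple

print(whole)       # 2020-05
print(first)       # 2020
print(selected)    # ('2020', '05')
print(everything)  # ('2020', '05')
```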
If you need to reuse a regular expression, it is recommended to call re.compile(pattern) to get a regular object and then reuse that object, which will be faster!
Written interview tests may require Python regular expressions, but the questions are usually not too hard. As long as you remember how those functions differ and know how to use them, you should be fine.
Do you now have a clear picture of regular expressions in Python?