Earlier we covered the origin, development, flavors, syntax, engines, and optimization of regular expressions. Today we will focus on how regular expressions are used in the Python language!
Most programming languages borrowed their regular-expression design from Perl, so the syntax is basically the same everywhere. What differs is the set of functions each language provides to support regular expressions, so today we will study Python's regular-expression functions.
When it comes to regular-expression support in Python, the first thing that comes to mind is the re library, a Python standard library for processing text.
A standard library is a module built into Python, so nothing extra needs to be downloaded. Python currently ships with about 300 built-in modules; you can view all of them here: https://docs.python.org/3/py-modindex.html#cap-r
Because re is a built-in module, there is nothing to install; it can be used directly after importing it with import re.
The re module mainly defines 9 constants, 12 functions, and 1 exception. Brother Pig will explain every constant and function through actual code cases so that we can understand what each one does more intuitively!
Note: to keep the code formatting from being garbled, Brother Pig tries to use code screenshots for the demonstrations.
Official documentation for the re module: https://docs.python.org/zh-cn/3.8/library/re.html
Source code of the re module: https://github.com/python/cpython/blob/3.8/Lib/re.py
Constants are variables whose values are not meant to change; they are generally used as flags.
The re module has 9 constants, and their values are all of type int!
In the source code, all of the constants are implemented in the RegexFlag enumeration class, which was introduced in Python 3.6. Before Python 3.6 the constants were written directly in re.py; the advantage of an enumeration is that it is easier to manage and use!
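The relationship between the constants, RegexFlag, and int can be checked in a few lines. This is a minimal sketch for Python 3.6 or later; the variable names are only illustrative.

```python
import re

# The long and the short name refer to the same enum member.
same_member = re.IGNORECASE is re.I

# Every flag is a member of RegexFlag, and RegexFlag members are also ints.
is_flag = isinstance(re.I, re.RegexFlag)
is_int = isinstance(re.I, int)

print(same_member, is_flag, is_int)  # True True True
```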
Let's quickly go through what these constants do and how to use them, ordered by how often they are used!
Syntax: re.IGNORECASE, or re.I for short
Effect: performs case-insensitive matching.
Example: in the default matching mode an uppercase B cannot match a lowercase b, but in ignore-case mode it can.
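A small sketch of the behaviour just described. The sample text "bob" is a made-up value, not the author's original test data.

```python
import re

text = "bob"      # hypothetical sample text
pattern = r"B"    # an uppercase B

# Default mode: the uppercase B does not match the lowercase b.
default_result = re.findall(pattern, text)

# Ignore-case mode: both lowercase b characters are found.
ignore_case_result = re.findall(pattern, text, re.IGNORECASE)

print(default_result)      # []
print(ignore_case_result)  # ['b', 'b']
```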
Syntax: re.ASCII, or re.A for short
Effect: as the name suggests, ASCII refers to the ASCII character set. This flag makes \w, \W, \b, \B, \d, \D, \s, and \S match ASCII only instead of Unicode.
Example: for a string that mixes the letters a, b, and c with non-ASCII characters, \w+ matches the whole string in the default mode, while in ASCII mode it only matches a, b, and c (the characters ASCII supports).
Note: this flag only matters for string (str) patterns; it has no effect on bytes patterns.
Syntax: re.DOTALL, or re.S for short
Effect: DOT stands for the . metacharacter and ALL means everything, so together the flag makes . match every character, including the newline \n. In the default mode . cannot match the newline \n.
Example: in the default matching mode . does not match the newline \n, so the text on each side of it is matched separately; in re.DOTALL mode the newline \n is matched together with the rest of the string.
Note: remember that in the default matching mode . matches any character except the newline.
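A minimal sketch of the two modes, using a made-up two-line string:

```python
import re

text = "abc\ndef"  # hypothetical text containing one newline

# Default mode: . stops at the newline, so each line is matched on its own.
default_result = re.findall(r".+", text)

# DOTALL mode: . also matches the newline, so the whole text is one match.
dotall_result = re.findall(r".+", text, re.DOTALL)

print(default_result)  # ['abc', 'def']
print(dotall_result)   # ['abc\ndef']
```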
Syntax: re.MULTILINE, or re.M for short
Effect: multi-line mode. When a string contains the newline \n, the default mode does not treat each line separately: ^ only matches at the beginning of the whole string and $ only at its end (or just before a final newline). In multi-line mode they also match at the beginning and end of every line.
Example: in regular expressions ^ matches the beginning of a line. By default it can only match at the beginning of the string; in multi-line mode it can also match right after each newline \n.
Note: in regular-expression syntax ^ matches the beginning of a line and \A matches the beginning of the string. In single-line mode the two behave the same; in multi-line mode \A still matches only at the very beginning of the string and does not recognize line starts.
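The difference between ^ and \A in the two modes can be sketched as follows; the sample text is hypothetical.

```python
import re

text = "first line\nsecond line"  # hypothetical two-line text

# Default mode: ^ only matches at the start of the whole string.
default_result = re.findall(r"^\w+", text)

# Multi-line mode: ^ also matches right after each newline.
multiline_result = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the flag and only matches at the very start of the string.
anchor_result = re.findall(r"\A\w+", text, re.MULTILINE)

print(default_result)    # ['first']
print(multiline_result)  # ['first', 'second']
print(anchor_result)     # ['first']
```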
Syntax: re.VERBOSE, or re.X for short
Effect: verbose mode, which lets you add comments to a regular expression!
Example: by default, comments inside a regular expression are not recognized and are treated as part of the pattern; in verbose mode they are recognized and ignored.
When a regular expression is very complex, verbose mode gives you a way to annotate it. It should not be used just to show off, though, so think carefully before using it!
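Here is a small sketch of a commented pattern. In verbose mode, whitespace in the pattern is ignored as well, which is what allows the one-piece-per-line layout. The phone-number pattern and the sample text are invented for illustration.

```python
import re

# A hypothetical pattern for strings such as "555-1234".
pattern = r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
"""
text = "call 555-1234"

# Default mode: the whitespace and comments are taken literally, so nothing matches.
default_match = re.search(pattern, text)

# Verbose mode: whitespace and comments are stripped before matching.
verbose_match = re.search(pattern, text, re.VERBOSE)

print(default_match)          # None
print(verbose_match.group())  # 555-1234
```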
Syntax: re.LOCALE, or re.L for short
Effect: makes \w, \W, \b, \B, and case-insensitive matching depend on the current locale. This flag can only be used with bytes patterns. The official documentation discourages it because the locale mechanism is very unreliable: it only handles one "culture" at a time and only works with 8-bit locales.
Note: because this flag is officially discouraged and Brother Pig has never used it, no example is given!
Syntax: re.UNICODE, or re.U for short
Effect: the counterpart of ASCII mode; it matches according to Unicode character properties. Since str patterns in Python 3 already use Unicode matching by default, the flag is redundant.
Syntax: re.DEBUG
Effect: displays debug information about the compiled expression.
Example: debug mode does print information about the compiled pattern, but Brother Pig does not understand the notation or what the output means. Friends who do are welcome to offer some pointers.
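To see the debug output yourself, something like the following sketch should be enough. The exact text printed differs between Python versions, so no output is shown here; broadly, it lists the operations the pattern was parsed into (for \d+, a repeat wrapped around a digit category).

```python
import re

# Compiling with re.DEBUG prints a dump of the parsed pattern to stdout.
pattern = re.compile(r"\d+", re.DEBUG)

# The compiled object still works like any other pattern.
found = pattern.search("ab12").group()
print(found)  # 12
```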
Syntax: re.TEMPLATE, or re.T for short
Effect: Brother Pig has not figured out what TEMPLATE is actually used for. The comment in the source code says "disable backtracking". If you know, leave a message and let me know! (This flag was never documented, and it was removed in Python 3.13.)
When several constants are needed at once, combine them with the | symbol (bitwise OR) rather than any other operator.
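As an illustration of combining flags with |, the sketch below applies re.I and re.M together; the sample text is made up.

```python
import re

text = "Python\npython"  # hypothetical two-line text

# Only re.I: case is ignored, but ^...$ must still span the whole string.
only_ignorecase = re.findall(r"^python$", text, re.I)

# Only re.M: each line is anchored separately, but case still matters.
only_multiline = re.findall(r"^python$", text, re.M)

# Both flags combined with bitwise OR.
combined = re.findall(r"^python$", text, re.I | re.M)

print(only_ignorecase)  # []
print(only_multiline)   # ['python']
print(combined)         # ['Python', 'python']
```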
Finally, let's summarize the constants in the re module with a mind map.
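The re module has 12 functions. Brother Pig will explain them grouped by what they do, which makes them easier to compare and to remember.
There are 3 functions that find and return a single match: search, match, and fullmatch. The difference between them is where the pattern has to match: search looks for a match anywhere in the string, match only matches at the beginning of the string, and fullmatch requires the whole string to match.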
Let's compare them with actual code cases.
Case 1: the search function matches anywhere in the string, so it succeeds as soon as some substring matches the regular expression. There are actually two matching substrings, but search only returns the first one. The match function has to match from the beginning, and the string starts with the letter a, so it fails. The fullmatch function requires the entire string to match, so it fails as well!
Case 2: the leading letter a is removed from the text. Now the match function succeeds, but fullmatch still cannot match the entire string!
Case 3: only a single piece of text is left, and it agrees exactly with the regular expression. This time the fullmatch function finally matches.
Note: the functions that look up a match all return a match object (Match) on success, and None otherwise.
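The three cases can be reproduced with a sketch like the one below. The pattern, the sample texts, and the helper try_all are all hypothetical stand-ins for the original screenshots.

```python
import re

PATTERN = r"pig"  # hypothetical pattern

def try_all(text):
    """Report which of search / match / fullmatch succeed on text."""
    return {
        "search": re.search(PATTERN, text) is not None,
        "match": re.match(PATTERN, text) is not None,
        "fullmatch": re.fullmatch(PATTERN, text) is not None,
    }

case1 = try_all("a pig, pig")  # starts with the letter a
case2 = try_all("pig, pig")    # leading letter removed
case3 = try_all("pig")         # exactly the pattern

print(case1)  # {'search': True, 'match': False, 'fullmatch': False}
print(case2)  # {'search': True, 'match': True, 'fullmatch': False}
print(case3)  # {'search': True, 'match': True, 'fullmatch': True}

# search returns only the first of the two occurrences.
first_span = re.search(PATTERN, "a pig, pig").span()
print(first_span)  # (2, 5)
```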
That covers finding a single match. Now let's look at finding multiple matches, for which there are two main functions: findall and finditer.
The two are basically the same, except that one returns a list and the other returns an iterator. A list is built in memory all at once, whereas an iterator produces items one by one as they are needed, which uses memory better.
If there may be a large number of matches, finditer is recommended; for ordinary use findall makes basically no difference.
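A short sketch of both functions on the same made-up input. Note that finditer yields Match objects rather than plain strings, so positions are available as well.

```python
import re

text = "a1b22c333"  # hypothetical sample text

# findall builds the complete list of matched strings at once.
all_numbers = re.findall(r"\d+", text)

# finditer yields Match objects lazily, one per match.
spans = [m.span() for m in re.finditer(r"\d+", text)]

print(all_numbers)  # ['1', '22', '333']
print(spans)        # [(1, 2), (3, 5), (6, 9)]
```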
re.split(pattern, string, maxsplit=0, flags=0): splits string by pattern. maxsplit is the maximum number of splits (0 means no limit), and flags sets the matching mode, that is, the constants explained above!
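For example, splitting on a mix of separators might look like this; the sample text and the separator pattern are assumptions for illustration.

```python
import re

text = "a, b;c  d"  # hypothetical text with mixed separators

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", text)

# maxsplit=1 stops after the first split and keeps the rest intact.
first_split = re.split(r"[,;\s]+", text, maxsplit=1)

print(parts)        # ['a', 'b', 'c', 'd']
print(first_split)  # ['a', 'b;c  d']
```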
Note: str also has a split method, so how should we choose between the two?
str.split is simple and does not support splitting by a regular expression, whereas re.split does.
What about speed? Brother Pig ran an actual test, calling re.split and str.split on the same data and plotting execution time against the number of executions.
The chart showed that str.split was faster below 1,000 iterations, while above 1,000 iterations re.split was noticeably faster, and the gap widened as the count grew!
So the conclusion is: when regular-expression support is not needed and the amount of data and the number of calls are small, str.split is the better fit; otherwise use re.split.
Note: the actual execution times depend on the test data!
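If you would like to repeat this kind of measurement yourself, a rough sketch with timeit could look like the following. The test data and loop counts are assumptions, not the author's original setup, and the timings depend on the machine, the Python version, and the data, so they may not reproduce the chart described above.

```python
import re
import timeit

text = ",".join(["item"] * 200)  # hypothetical test data

def with_str_split():
    return text.split(",")

def with_re_split():
    return re.split(",", text)

# Both approaches should produce the same pieces.
same_result = with_str_split() == with_re_split()

timings = {}
for loops in (100, 1000, 10000):
    t_str = timeit.timeit(with_str_split, number=loops)
    t_re = timeit.timeit(with_re_split, number=loops)
    timings[loops] = (t_str, t_re)
    print(f"{loops:>6} loops: str.split {t_str:.4f}s, re.split {t_re:.4f}s")
```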
For replacement there are mainly the sub and subn functions, which do similar things!
First, the usage of sub:
re.sub(pattern, repl, string, count=0, flags=0): repl replaces the parts of string matched by pattern, count is the maximum number of replacements (0 means replace all), and flags is one or more of the regular-expression constants.
It is worth noting that repl can be either a string or a function! If repl is a function, it takes exactly one argument, the Match object, and returns the replacement string.
re.subn(pattern, repl, string, count=0, flags=0) does the same thing as re.sub, but returns a tuple (new string, number of replacements).
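The variants described above can be sketched as follows. The sample text and the helper double are hypothetical.

```python
import re

text = "a1b22c333"  # hypothetical sample text

# repl as a string: every run of digits becomes "#".
replaced_all = re.sub(r"\d+", "#", text)

# count limits the number of replacements.
replaced_two = re.sub(r"\d+", "#", text, count=2)

def double(match):
    """repl as a function: receives the Match object, returns the new text."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

# subn also reports how many replacements were made.
result_with_count = re.subn(r"\d+", "#", text)

print(replaced_all)       # a#b#c#
print(replaced_two)       # a#b#c333
print(doubled)            # a2b44c666
print(result_with_count)  # ('a#b#c#', 3)
```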
The compile and template functions compile a regular-expression pattern into a regular expression object (a Pattern object). This object offers the same matching functions as the re module (the Pattern object is explained later).
The template function is similar to compile; it merely adds the re.TEMPLATE flag mentioned earlier, as the source code shows.
re.escape(pattern) escapes the characters in pattern that have a special meaning in regular expressions, such as *.
re.escape(pattern) looks convenient because we no longer have to add the escapes ourselves, but in Brother Pig's view it makes it easy to escape the wrong things, so he recommends escaping manually rather than using escape!
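To make the effect of escaping concrete, here is a small sketch that searches for the literal text a*b; the sample strings are invented.

```python
import re

literal = "a*b"       # hypothetical text we want to find literally
text = "a*b aab"

# Unescaped, * is a quantifier: "zero or more a, then b".
unescaped_result = re.findall(literal, text)

# re.escape puts a backslash in front of the *, making it a literal character.
escaped_pattern = re.escape(literal)
escaped_result = re.findall(escaped_pattern, text)

print(unescaped_result)  # ['b', 'aab']
print(escaped_pattern)   # a\*b
print(escaped_result)    # ['a*b']
```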
re.purge() clears the regular-expression cache. What cache is that? A look at the source code shows what it does behind the scenes: it simply clears the module's cache of compiled patterns.
Let's look at a concrete case. Brother Pig calls re.purge() between two cases to clear the cache and then compares the cache before and after to see whether anything changed!
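A rough way to watch the cache being cleared is sketched below. It peeks at re._cache, which is a private implementation detail of CPython's re module and may be renamed or restructured in other versions, so the attribute is looked up defensively; treat this as an experiment rather than a supported API. The helper cache_size is hypothetical.

```python
import re

def cache_size():
    """Return the size of the private pattern cache, or None if it is absent."""
    cache = getattr(re, "_cache", None)  # private attribute, may change
    return len(cache) if cache is not None else None

re.purge()                   # start from an empty cache
re.search(r"\d+", "abc123")  # a module-level call caches the compiled pattern
size_before = cache_size()

re.purge()                   # clear the cache again
size_after = cache_size()

print(size_before, size_after)
```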
Finally, here is a mind map summarizing the functions in the re module.
The re module also contains one exception, for regular-expression compilation errors. When the regular expression we supply is invalid (that is, the expression itself is wrong), the exception re.error is raised!
Let's look at a concrete case. The regular expression is written with an extra parenthesis, which causes an error. The error occurs before any matching takes place, so it is reported when the regular expression is compiled.
Note: the exception only means that the regular expression itself is invalid; it has nothing to do with the string being matched!
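A minimal sketch of catching the exception, using a made-up pattern with an unclosed parenthesis. The wording of the error message varies between Python versions, so it is printed but not shown here.

```python
import re

bad_pattern = r"(\d+"  # hypothetical pattern with an extra opening parenthesis

try:
    re.compile(bad_pattern)
    compile_failed = False
except re.error as exc:
    # Raised at compile time, before any string is matched.
    compile_failed = True
    print("invalid regular expression:", exc)
```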
We have finished explaining the constants, functions, and exception of the re module, but it is also essential to talk about the regular expression object, Pattern.
One important function in the re module is compile, which precompiles a pattern and returns a regular expression object. This object has the same functions as the re module; see the source code of the Pattern class.
Since they are the same, which should we use in the end: the re module or a Pattern object?
Moreover, readers who have looked at the re module source code will have noticed that compile and the other re functions (search, split, sub, and so on) all call the same internal function, and in the end they all call the methods of the regular expression object!
In other words, the following two ways of writing the code have the same underlying implementation:
# re function
re.search(pattern, text)

# regular expression object function
regex = re.compile(pattern)
regex.search(text)
So is it still necessary to obtain a regular expression object with compile and then call its search method? Isn't calling re.search directly good enough?
Does the official documentation say whether to use the re module or a Pattern object?
It does: the official documentation notes that when an expression will be used several times, saving the object returned by re.compile(pattern) and reusing it is more efficient. It also notes that the compiled versions of the most recent patterns passed to re.compile() and to the module-level functions are cached.
So the official documentation recommends a regular expression object when the same expression is used many times. Does that really hold up?
Let's measure it.
Brother Pig wrote two functions, one using re.search and the other using the search method of a compiled object, and ran each of them (at separate times) in a loop count times, with count ranging from 1 to 10,000, to compare how long they take!
The results were plotted as a line chart.
The conclusion: within 100 iterations the two are basically equally fast; beyond 100 iterations, the methods of the Pattern object take noticeably less time, so they are faster than the re module functions!
The test shows that the official recommendation to use a regular expression object when an expression is used many times basically holds!
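A comparison along these lines can be sketched with timeit. The pattern, text, and loop counts below are assumptions rather than the author's original benchmark, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

pattern = r"\d+"    # hypothetical pattern
text = "abc123def"  # hypothetical text
regex = re.compile(pattern)

def module_level():
    # Looks the pattern up in the module's cache on every call.
    return re.search(pattern, text)

def compiled_object():
    # Reuses the already compiled Pattern object directly.
    return regex.search(text)

# Both calls find the same match.
same_match = module_level().group() == compiled_object().group()

timings = {}
for loops in (100, 1000, 10000):
    t_module = timeit.timeit(module_level, number=loops)
    t_object = timeit.timeit(compiled_object, number=loops)
    timings[loops] = (t_module, t_object)
    print(f"{loops:>6} loops: re.search {t_module:.4f}s, Pattern.search {t_object:.4f}s")
```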
That basically covers Python's regular expressions. Finally, here are a few points to watch out for.
Both the pattern and the string being searched can be either Unicode strings (str) or 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed!
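The rule is easy to check with a short sketch; the sample values are made up.

```python
import re

# str pattern with str text, and bytes pattern with bytes text, both work.
str_match = re.search(r"\d+", "abc123").group()
bytes_match = re.search(rb"\d+", b"abc123").group()

# Mixing a str pattern with bytes text raises TypeError.
try:
    re.search(r"\d+", b"abc123")
    mixed_raises = False
except TypeError:
    mixed_raises = True

print(str_match)     # 123
print(bytes_match)   # b'123'
print(mixed_raises)  # True
```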
Regular expressions use the backslash character ('\') to indicate special forms or to let special characters be used as ordinary characters.
The backslash serves the same purpose in ordinary Python string literals, so the two uses conflict.
The solution is to write regular-expression patterns in Python's raw string notation: in a string literal prefixed with 'r', backslashes are not treated specially.
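The conflict shows up clearly with \b, which means "backspace" in an ordinary Python string but "word boundary" in a regular expression. A small sketch with a made-up sample text:

```python
import re

text = "a cat"  # hypothetical sample text

# Ordinary string: Python turns "\b" into a backspace character first,
# so the regex engine looks for literal backspaces and finds nothing.
plain_result = re.findall("\bcat\b", text)

# Raw string: the backslashes reach the regex engine untouched.
raw_result = re.findall(r"\bcat\b", text)

# Without raw strings, every regex backslash has to be doubled.
same_pattern = "\\bcat\\b" == r"\bcat\b"

print(plain_result)  # []
print(raw_result)    # ['cat']
print(same_pattern)  # True
```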
The functions that find a single match (search, match, fullmatch) return a match object (Match); the matched value has to be fetched through match.group(), which is easy to forget.
Also pay attention to the difference between match.group() and match.groups(): group() returns the whole match (or a selected group), while groups() returns a tuple of all captured groups!
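The difference between the two methods in a short sketch; the pattern and text are invented for illustration.

```python
import re

match = re.search(r"(\d{4})-(\d{2})", "date: 2020-05")  # hypothetical input

whole = match.group()         # the entire matched text
first_group = match.group(1)  # only the first parenthesized group
all_groups = match.groups()   # a tuple of every parenthesized group

print(whole)        # 2020-05
print(first_group)  # 2020
print(all_groups)   # ('2020', '05')
```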
If you want to reuse a regular expression, it is recommended to get a regular expression object from re.compile(pattern) and reuse that object; it will be faster!
Written tests may require Python regular expressions, but they will not be too hard. As long as you remember how those functions differ and know how to use them, you should basically be fine.
Do you now have a clear understanding of regular expressions in Python?