Multiline pattern matching
Problem
You're trying to use a regular expression to match a block of text, and the match needs to span multiple lines.
Solution
This problem typically arises when you use the dot (.) to match any character but forget that the dot does not match newlines. For example, suppose you are trying to match C-style comments:
>>> import re
>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>>
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>>
To fix the problem, you can add support for newlines to the pattern itself. For example:
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\n multiline comment ']
>>>
In this pattern, (?:.|\n) specifies a non-capturing group (that is, it defines a group for the purposes of matching, but that group is not captured separately or numbered).
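To see why the non-capturing form matters here, the sketch below compares it with a plain capturing group. The variable names are illustrative and not part of the recipe; text2 is rebuilt so the snippet runs on its own.

```python
import re

text2 = '/* this is a\n multiline comment */\n'

# Non-capturing inner group: only the outer group is reported,
# so findall() returns a list of plain strings.
noncapturing = re.compile(r'/\*((?:.|\n)*?)\*/')
noncapturing_result = noncapturing.findall(text2)

# Capturing inner group, shown only for comparison: the pattern now has
# two groups, so findall() returns one tuple per match instead.
capturing = re.compile(r'/\*((.|\n)*?)\*/')
capturing_result = capturing.findall(text2)

print(noncapturing_result)
print(capturing_result)
```

With the capturing variant, each result is a two-element tuple whose first item is the comment body, so callers would have to unpack it. The non-capturing group avoids that extra, unwanted group.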
Discussion
The re.compile() function accepts a flag, re.DOTALL, that is very useful here. It makes the dot (.) in a regular expression match any character, including newlines. For example:
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\n multiline comment ']
For simple cases, the re.DOTALL flag works well. However, if the pattern is very complex, or if you are combining several separate patterns into one (for example, to tokenize a string), the flag can cause problems. Given the choice, it is usually better to write the pattern so that it works correctly without any extra flags.
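One way such a problem might show up is sketched below. The sub-patterns, their names, and the sample input are assumptions made for illustration, not taken from the recipe. The point is that re.DOTALL applies to the whole combined pattern, so a sub-pattern that relies on the dot stopping at a newline changes behavior.

```python
import re

# Hypothetical sub-patterns for a toy tokenizer (illustrative names).
BLOCK_COMMENT = r'/\*(?:.|\n)*?\*/'   # handles newlines itself, needs no flag
LINE_COMMENT = r'//.*'                # relies on '.' stopping at a newline

source = 'x = 1; // set x\n/* block\ncomment */\ny = 2;'

# Combine the separate patterns into a single alternation.
combined = '|'.join([BLOCK_COMMENT, LINE_COMMENT])

# Without the flag, each comment is matched on its own.
without_flag = re.findall(combined, source)

# With re.DOTALL, the '.' in LINE_COMMENT also matches newlines,
# so the line comment runs on to the end of the input.
with_flag = re.findall(combined, source, re.DOTALL)

print(without_flag)
print(with_flag)
```

Because the block-comment pattern spells out its own newline handling, the combined pattern behaves as intended with no flag, and the line-comment pattern keeps its usual meaning.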