In interface (API) test automation, one request often depends on data returned by another. Regular expressions are a common way to extract that data.

A regular expression (often abbreviated in code as regex, regexp, or RE) is a special sequence of characters that lets you check whether a string matches a given pattern. Many text editors use regular expressions to search for and replace text that matches a pattern. Python has included the re module since version 1.5, which provides Perl-style regular expression patterns.

Regular expression syntax

Matching a single character

A single-character pattern matches exactly one character: for example, \d matches a digit and \D matches a non-digit.

Besides the special syntax below, a pattern can also contain literal characters, one or more of them, which match themselves.

Pattern   Description
.         Matches any single character (except \n)
[2a]      Matches any one of the characters listed in the brackets; here, either 2 or a
\d        Matches a digit, i.e. 0-9
\D        Matches a non-digit
\s        Matches a whitespace character, such as a space, tab, or newline
\S        Matches a non-whitespace character
\w        Matches a word character, i.e. a-z, A-Z, 0-9, and _ (letters, digits, underscore)
\W        Matches a non-word character

The examples below use findall(pattern, string), which finds every match in the string and returns the matches as a list. The re module is covered in more detail later.

import re
# .: matches any single character except \n
re1 = r'.'
res1 = re.findall(re1, '\nj8?0\nbth\nihb')
print(res1)  # Output: ['j', '8', '?', '0', 'b', 't', 'h', 'i', 'h', 'b']
# []: matches any one of the listed characters
re2 = r"[abc]"
res2 = re.findall(re2, '1iugfiSHOIFUOFGIDHFGFD2345a6a78b99cc')
print(res2)  # Output: ['a', 'a', 'b', 'c', 'c']
# \d: matches a digit
re3 = r"\d"
res3 = re.findall(re3, "dfghjkl32212dfghjk")
print(res3)  # Output: ['3', '2', '2', '1', '2']
# \D: matches a non-digit
re4 = r"\D"
res4 = re.findall(re4, "d212dk?\n$%3;]a")
print(res4)  # Output: ['d', 'd', 'k', '?', '\n', '$', '%', ';', ']', 'a']
# \s: matches a whitespace character (space, tab, newline, ...)
re5 = r"\s"
res5 = re.findall(re5, "a s d a 9999")
print(res5)  # Output: [' ', ' ', ' ', ' ']
# \S: matches a non-whitespace character
re6 = r"\S"
res6 = re.findall(re6, "a s d a 9999")
print(res6)  # Output: ['a', 's', 'd', 'a', '9', '9', '9', '9']
# \w: matches a word character (letter, digit, or underscore)
re7 = r"\w"
res7 = re.findall(re7, "ce12sd@#a as_#$")
print(res7)  # Output: ['c', 'e', '1', '2', 's', 'd', 'a', 'a', 's', '_']
# \W: matches a non-word character (anything but a letter, digit, or underscore)
re8 = r"\W"
res8 = re.findall(re8, "ce12sd@#a as_#$")
print(res8)  # Output: ['@', '#', ' ', '#', '$']
# Literal characters match themselves
re9 = r"python"
res9 = re.findall(re9, "cepy1thon12spython123@@python")
print(res9)  # Output: ['python', 'python']
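
As a small supplementary sketch (the sample strings are made up for illustration), the following shows two details of the single-character patterns: \s matches any whitespace character, including tab and newline, and . skips \n unless the re.DOTALL flag is passed.

```python
import re

# \s matches any whitespace character, not only the space
sample = "a b\tc\nd"  # illustrative string with a space, a tab, and a newline
whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', '\n']

# . skips \n by default; the re.DOTALL flag lets it match \n as well
text = "x\ny"
default_dot = re.findall(r".", text)
dotall_dot = re.findall(r".", text, re.DOTALL)
print(default_dot)  # ['x', 'y']
print(dotall_dot)   # ['x', '\n', 'y']
```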

Matching a quantity

To match a character more than once, add a quantifier after it. The rules are as follows:

Pattern   Description
*         Matches the preceding character zero or more times (it may be absent)
+         Matches the preceding character one or more times (at least once)
?         Matches the preceding character zero or one time (absent, or present exactly once)
{m}       Matches the preceding character exactly m times
{m,}      Matches the preceding character at least m times
{m,n}     Matches the preceding character from m to n times

Examples are as follows:

import re
# *: the preceding character appears zero or more times
re21 = r"\d*"  # the preceding character here is a digit
res21 = re.findall(re21, "343aa1112df345g1h6699")  # at a non-digit such as a, zero digits match, which yields an empty string
print(res21)  # Output: ['343', '', '', '1112', '', '', '345', '', '1', '', '6699', '']
# ?: the preceding character appears zero or one time
re22 = r"\d?"
res22 = re.findall(re22, "3@43*a111")
print(res22)  # Output: ['3', '', '4', '3', '', '', '1', '1', '1', '']
# {m}: the preceding character appears exactly m times
re23 = r"1[3456789]\d{9}"  # mobile number: 1st digit is 1, 2nd digit is one of 3-9, followed by exactly 9 more digits
res23 = re.findall(re23, "sas13566778899fgh256912345678jkghj12788990000aaa113588889999")
print(res23)  # Output: ['13566778899', '13588889999']
# {m,}: the preceding character appears at least m times
re24 = r"\d{7,}"
res24 = re.findall(re24, "sas12356fgh1234567jkghj12788990000aaa113588889999")
print(res24)  # Output: ['1234567', '12788990000', '113588889999']
# {m,n}: the preceding character appears from m to n times
re25 = r"\d{3,5}"
res25 = re.findall(re25, "aaaaa123456ghj333yyy77iii88jj909768876")
print(res25)  # Output: ['12345', '333', '90976', '8876']
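
The + quantifier is listed in the table but not exercised in the examples. The sketch below, using made-up sample strings, contrasts it with *: because + requires at least one occurrence, it never produces the empty matches that * does.

```python
import re

text = "id=7, count=120, empty="  # illustrative input, not from a real interface

# \d+ needs at least one digit, so it never returns empty strings
plus_matches = re.findall(r"\d+", text)
print(plus_matches)  # ['7', '120']

# \d* also accepts zero digits, so it yields an empty match at each non-digit position
star_matches = re.findall(r"\d*", "a12b")
print(star_matches)  # ['', '12', '', '']
```

When extracting values, + is usually the safer choice of the two, since the result list then contains only real data.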

Matching groups

Pattern   Description
|         Matches either the expression on its left or the one on its right
(ab)      Treats the characters in the parentheses as a group

Examples are as follows:

import re
# |: several alternatives at once; matching any one of them is enough
re31 = r"13566778899|13534563456|14788990000"
res31 = re.findall(re31, "sas13566778899fgh13534563456jkghj14788990000")
print(res31)  # Output: ['13566778899', '13534563456', '14788990000']
# (): group; from text that matches the whole rule, extract only the part in parentheses
re32 = r"aa(\d{3})bb"  # the whole rule must match, but the result keeps only the group, i.e. \d{3}
res32 = re.findall(re32, "ggghjkaa123bbhhaa672bbjhjjaa@45bb")
print(res32)  # Output: ['123', '672']
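
Groups are what make regular expressions useful for pulling dependent data out of an interface response. The following is a minimal sketch under stated assumptions: the response body and its field names (code, token, user_id) are hypothetical, invented only for illustration.

```python
import re

# Hypothetical response body from a login interface; the field names are made up
response_text = '{"code": 0, "token": "abc123XYZ", "user_id": 42}'

# One group: findall returns only the captured part of each match
tokens = re.findall(r'"token": "(\w+)"', response_text)
print(tokens)  # ['abc123XYZ']

# Two groups: findall returns one tuple per match
pairs = re.findall(r'"(\w+)": (\d+)', response_text)
print(pairs)  # [('code', '0'), ('user_id', '42')]
```

For well-formed JSON a real parser such as the json module is usually more robust; a pattern like this is mainly handy when the response is plain text or only partly structured.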

Matching boundaries

Pattern   Description
^         Matches the start of the string
$         Matches the end of the string
\b        Matches a word boundary (a word consists of letters, digits, and underscores)
\B        Matches a position that is not a word boundary

Examples are as follows:

import re
# ^: matches the start of the string
re41 = r"^python"  # the string must start with python
res41 = re.findall(re41, "python999python")  # only the python at the start of the string matches
res411 = re.findall(re41, "1python999python")  # the string starts with 1, so nothing matches
print(res41)  # Output: ['python']
print(res411)  # Output: []
# $: matches the end of the string
re42 = r"python$"  # the string must end with python
res42 = re.findall(re42, "python999python")
print(res42)  # Output: ['python']
# \b: matches a word boundary; words consist of letters, digits, and underscores
re43 = r"\bpython"  # match python only where it starts a word
res43 = re.findall(re43, "1python 999 python")  # the first python follows the word character 1, so it does not match
print(res43)  # Output: ['python']
# \B: matches a position that is not a word boundary
re44 = r"\Bpython"  # match python only where it is preceded by a word character
res44 = re.findall(re44, "1python999python")
print(res44)  # Output: ['python', 'python']
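
Boundary patterns are often combined. As an illustrative sketch (the inputs are made up, and the number rule is the simplified mobile-number pattern from the quantifier examples, not a complete validation), anchoring with both ^ and $ checks a whole string, and wrapping a word in \b on both sides keeps only whole-word matches.

```python
import re

# Simplified mobile-number rule, anchored at both ends so the whole string must match
phone_pattern = r"^1[3456789]\d{9}$"

valid = re.findall(phone_pattern, "13566778899")
too_long = re.findall(phone_pattern, "135667788990")
embedded = re.findall(phone_pattern, "tel:13566778899")
print(valid)     # ['13566778899']
print(too_long)  # []
print(embedded)  # []

# \b on both sides keeps only occurrences that are whole words
whole_words = re.findall(r"\bpython\b", "python pythonic my_python python3 python")
print(whole_words)  # ['python', 'python']
```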

Greedy mode

Python's quantifiers are greedy by default: they match as many characters as possible. Non-greedy mode matches as few characters as possible. Adding a question mark (?) after a quantifier turns off greedy matching for it.

The following example matches two or more digits. In greedy mode, a match keeps extending for as long as the next character is a digit: in 34656fya, the run 34656 has at least two digits, so the match extends through the final 6. With greedy mode turned off, each match stops as soon as it has two digits, so the same run yields 34 and 65.

import re
# Default greedy mode
test = 'aa123aaaa34656fyaa12a123d'
res = re.findall(r'\d{2,}', test)
print(res)  # Output: ['123', '34656', '12', '123']
# Greedy mode turned off
res2 = re.findall(r'\d{2,}?', test)
print(res2)  # Output: ['12', '34', '65', '12', '12']
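
The difference matters most with .* between two delimiters. The sketch below uses a made-up snippet of markup to show how the greedy form runs to the last closing delimiter, while the non-greedy form stops at the first one.

```python
import re

html = "<b>first</b> and <b>second</b>"  # illustrative input

# Greedy: .* extends as far as possible, up to the last </b>
greedy = re.findall(r"<b>(.*)</b>", html)
print(greedy)  # ['first</b> and <b>second']

# Non-greedy: .*? stops at the first </b> that completes the match
lazy = re.findall(r"<b>(.*?)</b>", html)
print(lazy)  # ['first', 'second']
```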

The re module

Regular expressions in Python are used through the re module. Its functions usually take two parameters:

  • Parameter 1: the matching rule (pattern)
  • Parameter 2: the string to match against
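
The two-parameter form can be sketched as follows. The variable names are illustrative; re.compile is shown only as an optional alternative for reusing one rule many times, in which case the compiled object's methods take just the string.

```python
import re

pattern = r"\d{2,}"            # parameter 1: the matching rule
text = "aa123aaaa34656fyaa12"  # parameter 2: the string to match against

direct = re.findall(pattern, text)
print(direct)  # ['123', '34656', '12']

# re.compile builds a reusable pattern object; its methods take only the string
compiled = re.compile(pattern)
reused = compiled.findall(text)
print(direct == reused)  # True
```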

re.findall()

Finds every substring that matches the pattern and returns the matches as a list.

import re
test = 'aa123aaaa34656fyaa12a123d'
res = re.findall(r'\d{2,}', test)
print(res)  # Output: ['123', '34656', '12', '123']

re.search()

Finds the first substring that matches the pattern and returns a match object; call group() on it to extract the matched text.

import re
s = "123abc123aaa123bbb888ccc"
res2 = re.search(r'123', s)
print(res2)  # Output: <re.Match object; span=(0, 3), match='123'>
# group() extracts the matched data; the return type is str
print(res2.group())  # Output: 123

In the returned match object, span is the index range of the matched text and match is the matched value.
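
One detail worth guarding against: when nothing matches, re.search() returns None, so calling group() on the result directly would raise an AttributeError. A minimal sketch, reusing the same sample string, that checks the result first and reads the index range through span(), start(), and end():

```python
import re

s = "123abc123aaa123bbb888ccc"

match = re.search(r"888", s)
if match is not None:
    print(match.group())               # 888
    print(match.span())                # (18, 21)
    print(match.start(), match.end())  # 18 21

# No match: search returns None instead of a match object
missing = re.search(r"xyz", s)
print(missing)  # None
```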

group() parameters