Using Unicode in Regular Expressions
You're using regular expressions to process text, but you're concerned about how Unicode characters are handled.
By default, the re module already has basic support for certain Unicode character classes. For example, \d already matches any Unicode digit character:
>>> import re
>>> num = re.compile(r'\d+')
>>> # ASCII digits
>>> num.match('123')
<_sre.SRE_Match object at 0x1007d9ed0>
>>> # Arabic digits
>>> num.match('\u0661\u0662\u0663')
<_sre.SRE_Match object at 0x101234030>
>>>
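
If the Unicode-aware behavior of \d is not wanted, the standard re.ASCII flag restricts it to [0-9]. The following is a minimal sketch of the difference; the variable names are illustrative only.

```python
import re

# Default matching for str patterns: \d covers every Unicode decimal digit.
unicode_digits = re.compile(r'\d+')
# With re.ASCII, \d is restricted to the characters 0-9.
ascii_digits = re.compile(r'\d+', re.ASCII)

arabic_indic = '\u0661\u0662\u0663'   # Arabic-Indic digits one, two, three

unicode_hit = unicode_digits.match(arabic_indic)   # a match object
ascii_hit = ascii_digits.match(arabic_indic)       # None
```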
If you need to include specific Unicode characters in a pattern, you can use the escape sequence for the character (such as the four-digit form \uFFFF or the eight-digit form \UXXXXXXXX). For example, here is a regular expression that matches all of the characters in several different Arabic code pages:
>>> arabic = re.compile('[\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff]+')
>>>
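
As a rough illustration of how such a pattern might be used, the sketch below pulls the Arabic runs out of a mixed-script string. The sample text is made up for the example.

```python
import re

arabic = re.compile('[\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff]+')

# Hypothetical sample: Latin words mixed with an Arabic word and Arabic-Indic digits.
text = 'Hello \u0645\u0631\u062d\u0628\u0627 world \u0661\u0662\u0663'

# findall returns each maximal run of characters that fall inside the ranges.
runs = arabic.findall(text)
```

Here runs holds two strings: the Arabic word and the digit sequence. The Latin words and the spaces are skipped because they fall outside the listed ranges.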
When performing matching and searching operations, it's best to normalize all of the text and clean it up into a standard form first. You should also watch out for some special cases, such as how case-insensitive matching interacts with case conversion.
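
One possible way to apply that advice is to run both the pattern and the text through unicodedata.normalize before matching. This is only a sketch: normalized_search is a hypothetical helper, and normalizing the pattern is safe here only because the pattern is a plain literal.

```python
import re
import unicodedata

def normalized_search(literal, text, form='NFC'):
    """Search for a literal string after normalizing both sides (hypothetical helper)."""
    pat = re.compile(re.escape(unicodedata.normalize(form, literal)))
    return pat.search(unicodedata.normalize(form, text))

composed = 'Spicy Jalape\u00f1o'      # n-with-tilde as one code point
decomposed = 'Spicy Jalapen\u0303o'   # n followed by a combining tilde

# Without normalization the two spellings do not match each other.
raw_hit = re.search('Jalape\u00f1o', decomposed)             # None
norm_hit = normalized_search('Jalape\u00f1o', decomposed)    # a match object
```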
>>> pat = re.compile('stra\u00dfe', re.IGNORECASE)
>>> s = 'straße'
>>> pat.match(s) # Matches
<_sre.SRE_Match object at 0x10069d370>
>>> pat.match(s.upper()) # Doesn't match
>>> s.upper() # Case folds
'STRASSE'
>>>
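
A workaround that stays within the standard library is to case-fold both the pattern text and the input with str.casefold() before matching, rather than relying on re.IGNORECASE. This is a sketch of that substitute technique for a literal pattern, not a general fix.

```python
import re

s = 'stra\u00dfe'

# casefold() maps the sharp s to 'ss', so both spellings fold to 'strasse'.
folded_pat = re.compile(re.escape(s.casefold()))

lower_hit = folded_pat.match(s.casefold())           # matches
upper_hit = folded_pat.match(s.upper().casefold())   # also matches
```

Note that any match positions refer to the folded string, which can be longer than the original text.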
Mixing Unicode and regular expressions is often a good way to drive yourself crazy. If you really intend to do it, consider installing the third-party regex library, which provides full support for Unicode case folding along with a number of other interesting features, including approximate (fuzzy) matching.