Normalizing Unicode Text to a Standard Representation
You are working with Unicode strings and need to make sure that all of them have the same underlying representation.
In Unicode, certain characters can be represented by more than one valid sequence of code points. To illustrate, consider the following example:
>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15
>>>
Here the text "Spicy Jalapeño" is shown in two forms. The first uses the fully composed character "ñ" (U+00F1). The second uses the Latin letter "n" followed by the combining tilde character "˜" (U+0303).
Having multiple representations of the same character is a problem for programs that compare strings. To fix it, first normalize the text into a standard representation using the unicodedata module:
>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'
>>>
The first argument to normalize() specifies how the string should be normalized. NFC means that characters should be fully composed (that is, represented by a single code point where possible), while NFD means that characters should be fully decomposed, using combining characters where applicable.
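To see the difference concretely, here is a small sketch (not part of the original session) that prints the names of the last few code points in each normalized form:

import unicodedata

s = 'Jalape\u00f1o'
for form in ('NFC', 'NFD'):
    t = unicodedata.normalize(form, s)
    # Show how the ending of the word is encoded in each form
    print(form, [unicodedata.name(c) for c in t[-3:]])
# NFC ['LATIN SMALL LETTER E', 'LATIN SMALL LETTER N WITH TILDE', 'LATIN SMALL LETTER O']
# NFD ['LATIN SMALL LETTER N', 'COMBINING TILDE', 'LATIN SMALL LETTER O']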
Python also supports the extended normalization forms NFKC and NFKD, which add extra compatibility features for dealing with certain kinds of characters. For example:
>>> s = '\ufb01'   # A single character
>>> s
'ﬁ'
>>> unicodedata.normalize('NFD', s)
'ﬁ'
# Notice how the combined letters are broken apart here
>>> unicodedata.normalize('NFKD', s)
'fi'
>>> unicodedata.normalize('NFKC', s)
'fi'
>>>
Normalization is an important part of any code that needs to process Unicode text in a consistent way. This is especially true when handling strings that come from user input, where you have little control over the encoding.
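For instance, a common pattern is to normalize text as soon as it arrives from the user, so that later comparisons and dictionary lookups behave consistently. A minimal sketch (the normalize_key() helper is illustrative, not part of the recipe):

import unicodedata

def normalize_key(text):
    # Normalize to NFC so that equivalent spellings map to the same key
    return unicodedata.normalize('NFC', text)

index = {normalize_key('Jalape\u00f1o'): 'hot pepper'}
# The decomposed spelling finds the same entry after normalization
print(index[normalize_key('Jalapen\u0303o')])   # 'hot pepper'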
Normalization is also an important part of sanitizing and filtering text. For example, suppose you want to remove all diacritical marks from some text (perhaps for searching or matching):
>>> t1 = unicodedata.normalize('NFD', s1)
>>> ''.join(c for c in t1 if not unicodedata.combining(c))
'Spicy Jalapeno'
>>>
This last example shows another important aspect of the unicodedata module: utility functions for testing characters against character classes. The combining() function tests whether a character is a combining character. The module also provides functions for looking up character categories, testing digits, and so forth.
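As a quick illustration of a few of those lookups (a sketch, not part of the original session):

import unicodedata

print(unicodedata.category('A'))        # 'Lu' -- uppercase letter
print(unicodedata.category('\u0303'))   # 'Mn' -- nonspacing (combining) mark
print(unicodedata.combining('\u0303'))  # 230  -- nonzero for combining characters
print(unicodedata.digit('3'))           # 3    -- numeric value of a digit character
print(unicodedata.name('\u00f1'))       # 'LATIN SMALL LETTER N WITH TILDE'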