String token parsing
Problem
You have a string that you want to parse from left to right into a stream of tokens.
Solution
Suppose you have a string of text like this:
text = 'foo = 23 + 42 * 10'
To tokenize the string, you need to do more than merely match patterns; you also need some way to identify the type of each pattern. For instance, you might want to turn the string into a sequence of pairs like this:
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),
          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
To do this kind of splitting, the first step is to define all of the possible tokens, including whitespace, using named capture groups in a regular expression, like this:
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
In these patterns, the ?P<TOKENNAME> convention is used to assign a name to each pattern (the name is later available as the match object's lastgroup attribute). Next, to tokenize, use the little-known scanner() method of pattern objects. This method creates a scanner object in which repeated calls to match() step through the supplied text one match at a time. Here is an interactive example of how a scanner object works:
>>> scanner = master_pat.scanner('foo = 42')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('NAME', 'foo')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('EQ', '=')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('NUM', '42')
>>> scanner.match()
>>>
To put this technique into practical use, it can easily be packaged into a generator like this:
from collections import namedtuple

def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())
# Example use
for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)
# Produces output
# Token(type='NAME', value='foo')
# Token(type='WS', value=' ')
# Token(type='EQ', value='=')
# Token(type='WS', value=' ')
# Token(type='NUM', value='42')
If you want to filter the token stream in some way, you can either define more generator functions or use a generator expression. For example, here is how you might filter out all whitespace tokens:
tokens = (tok for tok in generate_tokens(master_pat, text) if tok.type != 'WS')
for tok in tokens:
    print(tok)
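Alternatively, the same filtering can be wrapped up in its own generator function. This is only a small sketch of the second approach mentioned above; the filter_whitespace() helper is a hypothetical name, not part of the recipe:

def filter_whitespace(tokens):
    # Hypothetical helper: drop WS tokens from any token stream
    for tok in tokens:
        if tok.type != 'WS':
            yield tok

for tok in filter_whitespace(generate_tokens(master_pat, text)):
    print(tok)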
Discussion
Generally speaking, tokenizing is the first step in many kinds of advanced text parsing and processing. To use the scanning technique shown above, there are a few important details to keep in mind. First, you must make sure that you identify every possible text sequence that might appear in the input with a corresponding regular expression pattern. If any non-matching text is found, scanning simply stops. This is why it was necessary to specify the whitespace (WS) token in the example.
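To see why this matters, here is a small sketch (reusing the master_pat and generate_tokens() defined in the solution) of what happens when the input contains a character that no pattern covers:

for tok in generate_tokens(master_pat, 'foo = 23 & 42'):
    print(tok)
# The '&' character is not matched by any pattern, so iteration
# stops silently after the whitespace token following '23';
# the remaining '& 42' text is never reported.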
The order of tokens in the master regular expression also matters. When matching, the re module tries the patterns in the order specified. Thus, if a pattern happens to be a substring of a longer pattern, you need to make sure the longer pattern is listed first. For example:
LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'
master_pat = re.compile('|'.join([LE, LT, EQ]))      # Correct
# master_pat = re.compile('|'.join([LT, LE, EQ]))    # Incorrect
The second pattern is wrong because it would match the text <= as the token LT immediately followed by the token EQ, not as the single token LE, which is not what we want.
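A quick check illustrates the difference, assuming the generate_tokens() function from the solution:

for tok in generate_tokens(master_pat, '<='):
    print(tok)
# With the correct ordering (LE first), this prints one token:
# Token(type='LE', value='<=')
# With the incorrect ordering, it would print two tokens instead:
# Token(type='LT', value='<')
# Token(type='EQ', value='=')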
Finally, you need to watch out for patterns that form substrings of other tokens. For example, suppose you have the following two patterns:
PRINT = r'(?P<PRINT>print)'
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
master_pat = re.compile('|'.join([PRINT, NAME]))
for tok in generate_tokens(master_pat, 'printer'):
    print(tok)
# Outputs:
# Token(type='PRINT', value='print')
# Token(type='NAME', value='er')
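One possible way to guard against this, not shown in the recipe itself, is to add a \b word-boundary anchor so that the keyword only matches as a whole word. This is a sketch of that idea:

PRINT = r'(?P<PRINT>print\b)'    # \b stops 'print' matching inside 'printer'
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
master_pat = re.compile('|'.join([PRINT, NAME]))

for tok in generate_tokens(master_pat, 'printer'):
    print(tok)
# Now produces a single token:
# Token(type='NAME', value='printer')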
For more advanced kinds of tokenizing, you may want to check out packages such as PyParsing or PLY. An example involving PLY is shown in the next section.