Writing a Simple Recursive Descent Parser
Problem
You need to parse text and execute commands according to a set of grammar rules, or construct an abstract syntax tree representing the input. If the grammar is simple enough, you can write the parser yourself instead of using a framework.
Solution
In this problem, we focus on parsing text according to a particular grammar. To do this, you should start by specifying the grammar in BNF or EBNF form. For example, the grammar for simple arithmetic expressions might look like this:
expr ::= expr + term
     |   expr - term
     |   term

term ::= term * factor
     |   term / factor
     |   factor

factor ::= ( expr )
       |   NUM
Or, in EBNF form:
expr ::= term { (+|-) term }*
term ::= factor { (*|/) factor }*
factor ::= ( expr )
       |   NUM
In EBNF, parts of a rule enclosed in {...}* are optional. The * means zero or more repetitions (the same meaning as in a regular expression).
Now, if you're not familiar with how BNF works, think of it as a set of substitution rules, where the symbol on the left side can be replaced by the symbols on the right. Generally, parsing works by using the BNF to perform a sequence of substitutions and expansions that match the input text to the grammar. To illustrate, suppose you are parsing an expression of the form 3 + 4 * 5. This expression would first be broken down into a token stream, using the tokenizing techniques described in Recipe 2.18. The result might be a sequence of tokens like this:
NUM + NUM * NUM
From there, parsing involves trying to match the grammar to the input tokens through substitution operations:
expr
expr ::= term { (+|-) term }*
expr ::= factor { (*|/) factor }* { (+|-) term }*
expr ::= NUM { (*|/) factor }* { (+|-) term }*
expr ::= NUM { (+|-) term }*
expr ::= NUM + term { (+|-) term }*
expr ::= NUM + factor { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM * factor { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM * NUM { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM * NUM { (+|-) term }*
expr ::= NUM + NUM * NUM
Working through all of the substitution steps may take a moment, but the idea is that the parser walks the input and tries to match it against the grammar rules. The first input token is a NUM, so the substitutions first match that part. Once it is matched, attention moves to the next token (+), and so on. Right-hand parts (such as { (*|/) factor }*) disappear when it is determined that they cannot match the next token. In a successful parse, the entire right-hand side is expanded completely to match the input token stream.
With this background, here is a simple example that shows how to build a recursive descent expression evaluator:
import re
import collections

# Token specification
NUM    = r'(?P<NUM>\d+)'
PLUS   = r'(?P<PLUS>\+)'
MINUS  = r'(?P<MINUS>-)'
TIMES  = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS     = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES,
                                  DIVIDE, LPAREN, RPAREN, WS]))

# Tokenizer
Token = collections.namedtuple('Token', ['type', 'value'])

def generate_tokens(text):
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        tok = Token(m.lastgroup, m.group())
        if tok.type != 'WS':
            yield tok
# Parser
class ExpressionEvaluator:
    '''
    Implementation of a recursive descent parser. Each method
    implements a single grammar rule. Use the ._accept() method
    to test and accept the current lookahead token. Use the ._expect()
    method to exactly match and discard the next token on the input
    (or raise a SyntaxError if it doesn't match).
    '''
    def parse(self, text):
        self.tokens = generate_tokens(text)
        self.tok = None          # Last symbol consumed
        self.nexttok = None      # Next symbol tokenized
        self._advance()          # Load first lookahead token
        return self.expr()

    def _advance(self):
        'Advance one token ahead'
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        'Test and consume the next token if it matches toktype'
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        else:
            return False

    def _expect(self, toktype):
        'Consume next token if it matches toktype or raise SyntaxError'
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    # Grammar rules follow
    def expr(self):
        "expression ::= term { ('+'|'-') term }*"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval += right
            elif op == 'MINUS':
                exprval -= right
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval *= right
            elif op == 'DIVIDE':
                termval /= right
        return termval

    def factor(self):
        "factor ::= NUM | ( expr )"
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')
Here is an example of using the ExpressionEvaluator class interactively:
>>> e = ExpressionEvaluator()
>>> e.parse('2')
2
>>> e.parse('2 + 3')
5
>>> e.parse('2 + 3 * 4')
14
>>> e.parse('2 + (3 + 4) * 5')
37
>>> e.parse('2 + (3 + * 4)')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "exprparse.py", line 40, in parse
return self.expr()
File "exprparse.py", line 67, in expr
right = self.term()
File "exprparse.py", line 77, in term
termval = self.factor()
File "exprparse.py", line 93, in factor
exprval = self.expr()
File "exprparse.py", line 67, in expr
right = self.term()
File "exprparse.py", line 77, in term
termval = self.factor()
File "exprparse.py", line 97, in factor
raise SyntaxError("Expected NUMBER or LPAREN")
SyntaxError: Expected NUMBER or LPAREN
>>>
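The tokenizer can also be exercised on its own, which is often useful when debugging a parser. The following self-contained sketch repeats the tokenizer code from above and inspects the token stream for 3 + 4 * 5, the same NUM + NUM * NUM sequence used in the derivation earlier:

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# Same token specification as above, compiled into one alternation
master_pat = re.compile('|'.join([
    r'(?P<NUM>\d+)', r'(?P<PLUS>\+)', r'(?P<MINUS>-)',
    r'(?P<TIMES>\*)', r'(?P<DIVIDE>/)',
    r'(?P<LPAREN>\()', r'(?P<RPAREN>\))', r'(?P<WS>\s+)',
]))

def generate_tokens(text):
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        tok = Token(m.lastgroup, m.group())
        if tok.type != 'WS':        # Whitespace is filtered out
            yield tok

tokens = list(generate_tokens('3 + 4 * 5'))
# token types: ['NUM', 'PLUS', 'NUM', 'TIMES', 'NUM']
```

Inspecting the stream this way makes it easy to confirm that whitespace is being dropped and that each token carries the type name the parser's _accept() calls are checking for.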
If you want to do something other than pure evaluation, you need to change the ExpressionEvaluator class. For example, the following implementation constructs a simple parse tree:
class ExpressionTreeBuilder(ExpressionEvaluator):
    def expr(self):
        "expression ::= term { ('+'|'-') term }*"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval = ('+', exprval, right)
            elif op == 'MINUS':
                exprval = ('-', exprval, right)
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval = ('*', termval, right)
            elif op == 'DIVIDE':
                termval = ('/', termval, right)
        return termval

    def factor(self):
        'factor ::= NUM | ( expr )'
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')
The following example shows how it works :
>>> e = ExpressionTreeBuilder()
>>> e.parse('2 + 3')
('+', 2, 3)
>>> e.parse('2 + 3 * 4')
('+', 2, ('*', 3, 4))
>>> e.parse('2 + (3 + 4) * 5')
('+', 2, ('*', ('+', 3, 4), 5))
>>> e.parse('2 + 3 + 4')
('+', ('+', 2, 3), 4)
>>>
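Once you have such a tree, a separate tree walker can evaluate it (or generate code from it). The following is a minimal sketch of such a walker; the evaluate() function is a hypothetical helper introduced here for illustration, not part of the recipe's classes. It operates on the tuple format shown above:

```python
def evaluate(node):
    # Leaves are plain integers; interior nodes are (op, left, right) tuples
    if isinstance(node, int):
        return node
    op, left, right = node
    lv, rv = evaluate(left), evaluate(right)
    if op == '+':
        return lv + rv
    if op == '-':
        return lv - rv
    if op == '*':
        return lv * rv
    if op == '/':
        return lv / rv
    raise ValueError('Unknown operator: {!r}'.format(op))

# Tree for '2 + (3 + 4) * 5', as produced by ExpressionTreeBuilder above
tree = ('+', 2, ('*', ('+', 3, 4), 5))
result = evaluate(tree)   # 37
```

Separating tree building from evaluation like this is the usual design once a parser grows beyond a calculator: the same tree can feed an evaluator, an optimizer, or a code generator.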
Discussion
Parsing is a huge topic that generally occupies students for the first few weeks of a compilers course. If you are seeking background knowledge about grammars, parsing algorithms, and related theory, a compilers book is where you should turn. Needless to say, there is far too much to cover here.
Nevertheless, the overall idea of writing a recursive descent parser is generally simple. To start, you take every grammar rule and turn it into a function or method. So, if your grammar looks like this:
expr ::= term { ('+'|'-') term }*
term ::= factor { ('*'|'/') factor }*
factor ::= '(' expr ')'
       |   NUM
you start by turning it into a set of methods like this:
class ExpressionEvaluator:
    ...
    def expr(self):
        ...
    def term(self):
        ...
    def factor(self):
        ...
The task of each method is simple: it must walk from left to right over each part of the grammar rule, consuming tokens in the process. In a sense, the goal of each method is to either consume the entire rule or raise a syntax error if it gets stuck. To do this, the following implementation techniques are applied:
- If the next symbol in the rule is the name of another grammar rule (e.g., term or factor), you simply call the method with the same name. This is the "descent" part of the algorithm: control descends into another grammar rule. Sometimes rules will involve calls to methods that are already executing (e.g., the call to expr in the rule factor ::= '(' expr ')'). This is the "recursive" part of the algorithm.
- If the next symbol in the rule has to be a specific symbol (e.g., '('), you look at the next token and check for an exact match. If it doesn't match, it's a syntax error. The _expect() method in this recipe is used to perform these steps.
- If the next symbol in the rule could be one of a few possible choices (e.g., + or -), you have to check the next token for each possibility and advance only if a match is made. This is the purpose of the _accept() method in this recipe. It's like a weaker version of _expect(): it advances if a match is made, but if not, it simply backs off without raising an error (thus allowing subsequent checks to be made).
- For grammar rules with repeated parts (e.g., the { ('+'|'-') term }* in the rule expr ::= term { ('+'|'-') term }*), the repetition gets implemented by a while loop. The body of the loop collects or processes all of the repeated items until no more are found.
- Once an entire grammar rule has been consumed, each method returns some kind of result back to the caller. This is how values propagate during parsing. For example, in the expression evaluator, the return values represent partial results of the expression being parsed. Eventually, they all get combined together in the topmost grammar rule method.
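As an exercise in applying these steps, here is a compact, self-contained sketch that extends the grammar with one more option, unary minus (factor ::= '-' factor | NUM | '(' expr ')'). It uses a simplified string-token scanner instead of the Token namedtuple from the recipe, purely to keep the example short:

```python
import re

class MiniEvaluator:
    "Recursive descent evaluator with unary minus added to factor"
    def parse(self, text):
        self.toks = re.findall(r'\d+|[()+\-*/]', text)
        self.pos = 0
        return self.expr()

    def _accept(self, tok):
        # Advance past the next token only if it matches exactly
        if self.pos < len(self.toks) and self.toks[self.pos] == tok:
            self.pos += 1
            return True
        return False

    def expr(self):
        "expr ::= term { ('+'|'-') term }*"
        val = self.term()
        while self.pos < len(self.toks) and self.toks[self.pos] in '+-':
            op = self.toks[self.pos]
            self.pos += 1
            right = self.term()
            val = val + right if op == '+' else val - right
        return val

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        val = self.factor()
        while self.pos < len(self.toks) and self.toks[self.pos] in '*/':
            op = self.toks[self.pos]
            self.pos += 1
            right = self.factor()
            val = val * right if op == '*' else val / right
        return val

    def factor(self):
        "factor ::= '-' factor | NUM | '(' expr ')'"
        if self._accept('-'):            # New option: unary minus
            return -self.factor()
        if self._accept('('):
            val = self.expr()
            if not self._accept(')'):
                raise SyntaxError("Expected ')'")
            return val
        if self.pos >= len(self.toks) or not self.toks[self.pos].isdigit():
            raise SyntaxError('Expected NUM')
        tok = self.toks[self.pos]
        self.pos += 1
        return int(tok)
```

Note how adding the new option required touching only factor(): one extra _accept() check and a recursive call, exactly per the rules listed above.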
Although a simple example has been shown here, recursive descent parsers can be used to implement rather complicated parsing. For example, Python code itself is interpreted by a recursive descent parser. If you're so inclined, you can study the underlying grammar by inspecting the file Grammar/Grammar in the Python source. That said, there are still numerous pitfalls and limitations with making a parser by hand.
One such limitation of recursive descent parsers is that they can't be written for grammar rules involving any kind of left recursion. For example, suppose you need to translate a rule like this:
items ::= items ',' item
| item
To do it, you might try to write an items() method like this:
def items(self):
    itemsval = self.items()
    if itemsval and self._accept(','):
        itemsval.append(self.item())
    else:
        itemsval = [ self.item() ]
The only problem is that it doesn't work. In fact, it blows up with an infinite recursion error: items() calls itself before consuming any input.
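The usual fix is to eliminate the left recursion by rewriting the rule in the repetition form items ::= item { ',' item }* and implementing it with a while loop, exactly as expr() and term() do above. The following self-contained sketch shows the idea; the simplified tokenizer handling only numbers and commas is introduced here purely for illustration:

```python
import re

def tokenize(text):
    # Minimal tokenizer: integers and commas only (whitespace skipped)
    for m in re.finditer(r'\d+|,', text):
        yield m.group()

class ItemsParser:
    def parse(self, text):
        self.tokens = tokenize(text)
        self.nexttok = next(self.tokens, None)
        return self.items()

    def _advance(self):
        tok, self.nexttok = self.nexttok, next(self.tokens, None)
        return tok

    def items(self):
        "items ::= item { ',' item }*   (left recursion rewritten as a loop)"
        itemsval = [self.item()]
        while self.nexttok == ',':
            self._advance()              # Consume the comma
            itemsval.append(self.item())
        return itemsval

    def item(self):
        "item ::= NUM"
        tok = self._advance()
        if tok is None or not tok.isdigit():
            raise SyntaxError('Expected NUM')
        return int(tok)
```

The loop consumes at least one item before checking for a comma, so the method always makes progress and the infinite recursion disappears.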
You can also run into tricky issues concerning the grammar rules themselves. For example, you might have wondered whether expressions could have been described by this simpler grammar:
expr ::= factor { ('+'|'-'|'*'|'/') factor }*
factor ::= '(' expr ')'
       |   NUM
This grammar technically works, but it doesn't observe the standard rules of operator precedence. For example, the expression "3 + 4 * 5" would evaluate to 35 instead of the expected 23. The use of separate "expr" and "term" rules is what makes evaluation work correctly.
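To see the problem concretely, the single-rule grammar effectively evaluates strictly left to right. The following sketch simulates that behavior (it is a hypothetical flat evaluator written for this comparison, not code from the recipe) and contrasts it with Python's own precedence-aware arithmetic:

```python
import re

def evaluate_flat(text):
    # Evaluate strictly left to right, ignoring operator precedence --
    # this mirrors what the single 'expr' rule above would compute.
    toks = re.findall(r'\d+|[+\-*/]', text)
    result = int(toks[0])
    for op, num in zip(toks[1::2], toks[2::2]):
        n = int(num)
        if op == '+':
            result += n
        elif op == '-':
            result -= n
        elif op == '*':
            result *= n
        elif op == '/':
            result /= n
    return result

flat = evaluate_flat('3 + 4 * 5')   # 35: computed as (3 + 4) * 5
correct = 3 + 4 * 5                 # 23: precedence respected
```

Layering the grammar (expr calls term, term calls factor) is what forces * and / to bind before + and -, because the tighter operators are consumed deeper in the recursion.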
For really complicated grammars, you are often better off using a parsing tool such as PyParsing or PLY. This is what the expression evaluator code looks like using PLY:
# plyexample.py
#
# Example of parsing with PLY
from ply.lex import lex
from ply.yacc import yacc

# Token list
tokens = [ 'NUM', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN' ]

# Ignored characters
t_ignore = ' \t\n'

# Token specifications (as regexs)
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# Token processing functions
def t_NUM(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Error handler
def t_error(t):
    print('Bad character: {!r}'.format(t.value[0]))
    t.lexer.skip(1)

# Build the lexer
lexer = lex()

# Grammar rules and handler functions
def p_expr(p):
    '''
    expr : expr PLUS term
         | expr MINUS term
    '''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

def p_expr_term(p):
    '''
    expr : term
    '''
    p[0] = p[1]

def p_term(p):
    '''
    term : term TIMES factor
         | term DIVIDE factor
    '''
    if p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

def p_term_factor(p):
    '''
    term : factor
    '''
    p[0] = p[1]

def p_factor(p):
    '''
    factor : NUM
    '''
    p[0] = p[1]

def p_factor_group(p):
    '''
    factor : LPAREN expr RPAREN
    '''
    p[0] = p[2]

def p_error(p):
    print('Syntax error')

parser = yacc()
In this program, all of the code is written at a much higher level. You simply write regular expressions for the tokens and high-level handling functions that execute when grammar rules are matched. The actual mechanics of running the parser, accepting tokens, and so forth are implemented entirely by the library.
Here is an example of how to use the resulting parser object:
>>> parser.parse('2')
2
>>> parser.parse('2+3')
5
>>> parser.parse('2+(3+4)*5')
37
>>>
If you need a bit of a challenge and excitement in your programming, writing parsers and compilers is a good choice. Again, a compilers book will contain a lot of the underlying theory. However, many fine resources can also be found online. Python's own ast module is also worth a look.