Notes on Python Cookbook, 3rd Edition (2.19): Implementing a Simple Recursive Descent Parser

Giant ship 2020-11-15 01:20:23


Implementing a simple recursive descent parser

Problem

You need to parse text and execute commands according to a set of grammar rules, or construct an abstract syntax tree representing the input. If the grammar is simple enough, you can write the parser yourself instead of using a framework.

Solution

In this recipe, we focus on the problem of parsing text according to a particular grammar. To do that, you should start by specifying the grammar in BNF or EBNF form. For example, a grammar for simple arithmetic expressions might look like this:

expr ::= expr + term
     |   expr - term
     |   term

term ::= term * factor
     |   term / factor
     |   factor

factor ::= ( expr )
       |   NUM

Or, alternatively, in EBNF form:

expr ::= term { (+|-) term }*

term ::= factor { (*|/) factor }*

factor ::= ( expr )
       |   NUM

In EBNF, parts of a rule enclosed in { ... }* are optional. The * means zero or more repetitions (the same meaning as in regular expressions).

Now, if you're not familiar with how BNF works, think of it as a specification of substitution or replacement rules, where the symbol on the left can be replaced by the symbols on the right. Generally, parsing works by using the BNF to perform a series of substitutions and expansions that match the input text against the grammar. To illustrate, suppose you are parsing an expression such as 3 + 4 * 5. This expression would first be broken down into a token stream using the tokenizing techniques described in Recipe 2.18. The result might be a sequence of tokens like this:

NUM + NUM * NUM
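Concretely, if you feed '3 + 4 * 5' through the generate_tokens() function defined in the solution code below, the token stream consists of named tuples like these (whitespace tokens are filtered out):

Token(type='NUM', value='3')
Token(type='PLUS', value='+')
Token(type='NUM', value='4')
Token(type='TIMES', value='*')
Token(type='NUM', value='5')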

From there, parsing involves trying to match the grammar against the input tokens by making substitutions:

expr
expr ::= term { (+|-) term }*
expr ::= factor { (*|/) factor }* { (+|-) term }*
expr ::= NUM { (*|/) factor }* { (+|-) term }*
expr ::= NUM { (+|-) term }*
expr ::= NUM + term { (+|-) term }*
expr ::= NUM + factor { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM * factor { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM * NUM { (*|/) factor }* { (+|-) term }*
expr ::= NUM + NUM * NUM { (+|-) term }*
expr ::= NUM + NUM * NUM

Working through all of the substitution steps takes a bit of time, but they're all driven by looking at the input and trying to match it to grammar rules. The first input token is a NUM, so substitution first focuses on matching that part. Once it has been matched, attention moves to the next token, +, and so on. Certain parts of the righthand side (e.g., { (*|/) factor }*) disappear when it's determined that they can't match the next token. In a successful parse, the entire righthand side is expanded completely to match the input token stream.

With that background, here is a simple example that shows how to build a recursive descent expression evaluator:

import re
import collections

# Token specification
NUM    = r'(?P<NUM>\d+)'
PLUS   = r'(?P<PLUS>\+)'
MINUS  = r'(?P<MINUS>-)'
TIMES  = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS     = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES,
                                  DIVIDE, LPAREN, RPAREN, WS]))

# Tokenizer
Token = collections.namedtuple('Token', ['type', 'value'])

def generate_tokens(text):
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        tok = Token(m.lastgroup, m.group())
        if tok.type != 'WS':
            yield tok

# Parser
class ExpressionEvaluator:
    '''
    Implementation of a recursive descent parser.  Each method
    implements a single grammar rule.  Use the ._accept() method
    to test and accept the current lookahead token.  Use the ._expect()
    method to exactly match and discard the next token on the input
    (or raise a SyntaxError if it doesn't match).
    '''
    def parse(self, text):
        self.tokens = generate_tokens(text)
        self.tok = None         # Last symbol consumed
        self.nexttok = None     # Next symbol tokenized
        self._advance()         # Load first lookahead token
        return self.expr()

    def _advance(self):
        'Advance one token ahead'
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        'Test and consume the next token if it matches toktype'
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        else:
            return False

    def _expect(self, toktype):
        'Consume next token if it matches toktype or raise SyntaxError'
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    # Grammar rules follow
    def expr(self):
        "expression ::= term { ('+'|'-') term }*"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval += right
            elif op == 'MINUS':
                exprval -= right
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval *= right
            elif op == 'DIVIDE':
                termval /= right
        return termval

    def factor(self):
        "factor ::= NUM | ( expr )"
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')

Here is an example of using the ExpressionEvaluator class interactively:

>>> e = ExpressionEvaluator()
>>> e.parse('2')
2
>>> e.parse('2 + 3')
5
>>> e.parse('2 + 3 * 4')
14
>>> e.parse('2 + (3 + 4) * 5')
37
>>> e.parse('2 + (3 + * 4)')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "exprparse.py", line 40, in parse
    return self.expr()
  File "exprparse.py", line 67, in expr
    right = self.term()
  File "exprparse.py", line 77, in term
    termval = self.factor()
  File "exprparse.py", line 93, in factor
    exprval = self.expr()
  File "exprparse.py", line 67, in expr
    right = self.term()
  File "exprparse.py", line 77, in term
    termval = self.factor()
  File "exprparse.py", line 97, in factor
    raise SyntaxError("Expected NUMBER or LPAREN")
SyntaxError: Expected NUMBER or LPAREN
>>>

If you want to do something other than pure evaluation, you need to change the ExpressionEvaluator class accordingly. For example, the following implementation constructs a simple parse tree instead:

class ExpressionTreeBuilder(ExpressionEvaluator):
    def expr(self):
        "expression ::= term { ('+'|'-') term }"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval = ('+', exprval, right)
            elif op == 'MINUS':
                exprval = ('-', exprval, right)
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval = ('*', termval, right)
            elif op == 'DIVIDE':
                termval = ('/', termval, right)
        return termval

    def factor(self):
        'factor ::= NUM | ( expr )'
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')

The following example shows how it works:

>>> e = ExpressionTreeBuilder()
>>> e.parse('2 + 3')
('+', 2, 3)
>>> e.parse('2 + 3 * 4')
('+', 2, ('*', 3, 4))
>>> e.parse('2 + (3 + 4) * 5')
('+', 2, ('*', ('+', 3, 4), 5))
>>> e.parse('2 + 3 + 4')
('+', ('+', 2, 3), 4)
>>>
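As a small follow-on sketch (not from the book), the tuple-based parse tree can then be processed by separate code. For example, a tiny recursive function (a hypothetical helper named evaluate()) can evaluate the tree after the fact:

def evaluate(tree):
    'Recursively evaluate a parse tree built by ExpressionTreeBuilder (sketch)'
    if isinstance(tree, tuple):
        op, left, right = tree
        left, right = evaluate(left), evaluate(right)
        if op == '+':
            return left + right
        elif op == '-':
            return left - right
        elif op == '*':
            return left * right
        elif op == '/':
            return left / right
    return tree

For example:

>>> evaluate(ExpressionTreeBuilder().parse('2 + (3 + 4) * 5'))
37
>>>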

Discussion

Text parsing is a big topic that generally occupies students for the first three weeks of a compilers course. If you're looking for background on grammars, parsing algorithms, and related theory, a compilers book is the place to turn. Needless to say, there is far too much to cover here in full.

Nevertheless, the overall idea of writing a recursive descent parser is fairly simple. To start, you take every grammar rule and turn it into a function or method. So, if your grammar looks like this:

expr ::= term { ('+'|'-') term }*

term ::= factor { ('*'|'/') factor }*

factor ::= '(' expr ')'
       |   NUM

You start by turning them into a set of methods like this:

class ExpressionEvaluator:
    ...
    def expr(self):
        ...

    def term(self):
        ...

    def factor(self):
        ...

The task of each method is simple: it must walk from left to right over each part of its grammar rule, consuming tokens in the process. In a sense, the goal of each method is to either consume the whole rule or raise a syntax error if it gets stuck. To do this, the following implementation techniques are applied (a small sketch applying them to one extra rule follows this list):

  • If the next symbol in the rule is the name of another grammar rule (e.g., term or factor), you simply call the method with the same name. This is the "descent" part of the algorithm: control descends into another grammar rule. Sometimes rules will involve calls to methods that are already executing (e.g., the call to expr in the rule factor ::= '(' expr ')'). This is the "recursive" part of the algorithm.
  • If the next symbol in the rule has to be a specific symbol (e.g., '('), you look at the next token and check for an exact match. If it doesn't match, it's a syntax error. The _expect() method in this recipe is used to perform this step.
  • If the next symbol in the rule could be one of a few possible choices (e.g., + or -), you have to check the next token for each possibility and advance only if a match is made. This is the purpose of the _accept() method in this recipe. It's essentially a weaker version of _expect(): it advances if a match is made, but if not, it simply backs off without raising an error (thus allowing further checks to be made).
  • For grammar rules with repeated parts (e.g., as in the rule expr ::= term { ('+'|'-') term }*), the repetition is implemented by a while loop. The body of the loop collects or processes all of the repeated items until no more are found.
  • Once an entire grammar rule has been consumed, each method returns some kind of result back to the caller. This is how values propagate during parsing. For example, in the expression evaluator, return values represent partial results of the expression being parsed. Eventually they all get combined together in the topmost grammar rule method.
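For instance, here is a minimal sketch (not part of the recipe; the assign rule and the NAME and ASSIGN token types are hypothetical) showing how one additional EBNF rule, assign ::= NAME '=' expr, would be translated into a method using exactly these techniques:

    def assign(self):
        "assign ::= NAME '=' expr"
        self._expect('NAME')      # exact match of a required token
        name = self.tok.value
        self._expect('ASSIGN')    # exact match of the '=' symbol
        value = self.expr()       # rule name on the right: call the method of the same name
        return (name, value)      # return an accumulated result to the caller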

Although simple examples were shown here, recursive descent parsers can be used to implement rather complicated parsers. For example, the Python language itself is interpreted by a recursive descent parser. If you're so inclined, you can study the underlying grammar by looking at the file Grammar/Grammar in the Python source. That said, there are still numerous limitations and pitfalls in writing a parser by hand.


One such limitation is that recursive descent parsers can't be written for grammar rules involving any kind of left recursion. For example, suppose you need to translate a rule like this:

items ::= items ',' item
      |   item

To do it, you might be tempted to write an items() method like this:

def items(self):
    itemsval = self.items()
    if itemsval and self._accept(','):
        itemsval.append(self.item())
    else:
        itemsval = [ self.item() ]

The only problem is that it doesn't work at all; in fact, it blows up with an infinite recursion error. A common way around this is sketched below.
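A common workaround (a sketch, not from the book; it assumes a COMMA token type and an item() rule method exist) is to rewrite the rule in EBNF form without left recursion, as items ::= item { ',' item }*, which then translates into a simple loop:

    def items(self):
        "items ::= item { ',' item }*"
        itemsval = [ self.item() ]        # first item is required
        while self._accept('COMMA'):      # zero or more ', item' repetitions
            itemsval.append(self.item())
        return itemsval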


You can also run into subtle problems with the grammar rules themselves. For example, you might wonder whether expressions could have been described more simply by this grammar:

expr ::= factor { ('+'|'-'|'*'|'/') factor }*

factor ::= '(' expr ')'
       |   NUM

This grammar technically "works," but it doesn't observe the standard rules of operator precedence. For example, the expression "3 + 4 * 5" is evaluated as "(3 + 4) * 5", giving 35 instead of the expected 23 (since multiplication binds more tightly, the answer should be 3 + (4 * 5) = 23). The use of separate "expr" and "term" rules is precisely what makes evaluation work correctly, as the sketch following this paragraph illustrates.
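To see the difference concretely, here is a small sketch (a hypothetical FlatEvaluator class, not from the book) that implements the single-level grammar on top of the ExpressionEvaluator machinery from the solution. Because it applies operators strictly left to right, it loses precedence:

class FlatEvaluator(ExpressionEvaluator):
    def expr(self):
        "expr ::= factor { ('+'|'-'|'*'|'/') factor }*   (no separate term rule)"
        exprval = self.factor()
        while (self._accept('PLUS') or self._accept('MINUS') or
               self._accept('TIMES') or self._accept('DIVIDE')):
            op = self.tok.type
            right = self.factor()
            if op == 'PLUS':
                exprval += right
            elif op == 'MINUS':
                exprval -= right
            elif op == 'TIMES':
                exprval *= right
            elif op == 'DIVIDE':
                exprval /= right
        return exprval

For example:

>>> FlatEvaluator().parse('3 + 4 * 5')
35
>>> ExpressionEvaluator().parse('3 + 4 * 5')
23
>>>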


For really complicated grammars, you are often better off using a parsing tool such as PyParsing or PLY. Here is what the expression evaluator code looks like rewritten with PLY:

# plyexample.py
#
# Example of parsing with PLY

from ply.lex import lex
from ply.yacc import yacc

# Token list
tokens = [ 'NUM', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN' ]

# Ignored characters
t_ignore = ' \t\n'

# Token specifications (as regexs)
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# Token processing functions
def t_NUM(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Error handler
def t_error(t):
    print('Bad character: {!r}'.format(t.value[0]))
    t.lexer.skip(1)

# Build the lexer
lexer = lex()

# Grammar rules and handler functions
def p_expr(p):
    '''
    expr : expr PLUS term
         | expr MINUS term
    '''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

def p_expr_term(p):
    '''
    expr : term
    '''
    p[0] = p[1]

def p_term(p):
    '''
    term : term TIMES factor
         | term DIVIDE factor
    '''
    if p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

def p_term_factor(p):
    '''
    term : factor
    '''
    p[0] = p[1]

def p_factor(p):
    '''
    factor : NUM
    '''
    p[0] = p[1]

def p_factor_group(p):
    '''
    factor : LPAREN expr RPAREN
    '''
    p[0] = p[2]

def p_error(p):
    print('Syntax error')

parser = yacc()

In this program, all of the code is written at a much higher level. You simply write regular expressions for the tokens and high-level handling functions that execute when the various grammar rules are matched. The actual mechanics of running the parser, accepting tokens, and so forth is implemented entirely by the library.

Here is an example of how to use the resulting parser object:

>>> parser.parse('2')
2
>>> parser.parse('2+3')
5
>>> parser.parse('2+(3+4)*5')
37
>>>

If you need a challenge to spice up your programming, writing parsers and compilers is a fine project. Again, a compilers textbook covers a lot of the underlying theory, and many good resources can also be found online. Python's own ast module is also worth a look.
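For a quick taste, the standard library's ast module will parse the same kind of expression into an abstract syntax tree for you (the exact ast.dump() output varies slightly between Python versions; the session below is what recent versions print):

>>> import ast
>>> tree = ast.parse('2 + (3 + 4) * 5', mode='eval')
>>> ast.dump(tree)
"Expression(body=BinOp(left=Constant(value=2), op=Add(), right=BinOp(left=BinOp(left=Constant(value=3), op=Add(), right=Constant(value=4)), op=Mult(), right=Constant(value=5))))"
>>>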

Copyright notice
This article was written by [Giant ship]. Please include a link to the original when reposting. Thanks.
