Some background (a bit of rambling)
Historically, in Python 2.4 and earlier, executing Python code, that is, going from source code to bytecode, took two steps:
Parse the Python source into a parse tree (Parser/pgen.c)
Emit (and optimize) bytecode directly from the parse tree (Python/compile.c)
Given the historical constraints Guido van Rossum was working under, Python was simply implemented that way back then, and it ran just fine. Later, as the LLVM crowd rose to prominence, people gradually came to see the limitations of this model; more on that later.
Properly speaking, a standard "modern" interpreter (compiler) is expected to look like this:
Parse the Python source into a parse tree (Parser/pgen.c)
Convert the parse tree into an Abstract Syntax Tree (AST) (Python/ast.c)
Convert the AST into a Control Flow Graph (CFG) (Python/compile.c)
Emit optimized bytecode from the Control Flow Graph (Python/compile.c)
Starting with Python 2.5, CPython "mended its ways" and joined the ranks of modern compilers. In short, the original second step was split into three. Below we will briefly walk through those last three steps.
This article won't say much about how parsing itself works, beyond what is needed to understand the compilation process. If you want the details of Python's parser, your best bet is to read the CPython source directly.
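Before diving in, it helps to see that the later stages are observable from ordinary Python code. The sketch below uses only the standard library (ast, dis, and the built-in compile()); the concrete parse tree and the CFG stay internal to CPython, and the output noted in the comments is only approximate:

import ast
import dis

source = "x = 1 + 2"

# Source -> AST (step 2 of the pipeline above)
tree = ast.parse(source, mode="exec")
print(ast.dump(tree))            # roughly: Module(body=[Assign(...)], ...)

# AST -> CFG -> bytecode (steps 3 and 4, both hidden inside compile())
code = compile(tree, filename="<demo>", mode="exec")
dis.dis(code)                    # e.g. LOAD_CONST 3 / STORE_NAME x after constant folding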
The parse tree
Since 2.5, Python's parser has been a fairly standard LL(1) parser along the lines of the "Dragon Book". Python's grammar definition can be found in Grammar/Grammar (https://github.com/python/cpython/blob/master/Grammar/Grammar) in the CPython source:
# Grammar for Python
# NOTE WELL: You should also follow all the steps listed at
# https://docs.python.org/devguide/grammar.html
# Start symbols for the grammar:
# single_input is a single interactive statement;
# file_input is a module or sequence of commands read from an input file;
# eval_input is the input for the eval() functions.
# NB: compound_stmt in single_input is followed by extra NEWLINE!
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
file_input: (NEWLINE | stmt)* ENDMARKER
eval_input: testlist NEWLINE* ENDMARKER
decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
decorators: decorator+
decorated: decorators (classdef | funcdef | async_funcdef)
async_funcdef: ASYNC funcdef
funcdef: 'def' NAME parameters ['->' test] ':' suite
parameters: '(' [typedargslist] ')'
typedargslist: (tfpdef ['=' test] (',' tfpdef ['=' test])* [',' [
'*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]]
| '**' tfpdef [',']]]
| '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]]
| '**' tfpdef [','])
tfpdef: NAME [':' test]
varargslist: (vfpdef ['=' test] (',' vfpdef ['=' test])* [',' [
'*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]]
| '**' vfpdef [',']]]
| '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]]
| '**' vfpdef [','])
vfpdef: NAME
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) |
('=' (yield_expr|testlist_star_expr))*)
annassign: ':' test ['=' test]
testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']
augassign: ('+=' | '-=' | '*=' | '@=' | '/=' | '%=' | '&=' | '|=' | '^=' |
'<<=' | '>>=' | '**=' | '//=')
# For normal and annotated assignments, additional restrictions enforced by the interpreter
del_stmt: 'del' exprlist
pass_stmt: 'pass'
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'
continue_stmt: 'continue'
return_stmt: 'return' [testlist]
yield_stmt: yield_expr
raise_stmt: 'raise' [test ['from' test]]
import_stmt: import_name | import_from
import_name: 'import' dotted_as_names
# note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS
import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+)
'import' ('*' | '(' import_as_names ')' | import_as_names))
import_as_name: NAME ['as' NAME]
dotted_as_name: dotted_name ['as' NAME]
import_as_names: import_as_name (',' import_as_name)* [',']
dotted_as_names: dotted_as_name (',' dotted_as_name)*
dotted_name: NAME ('.' NAME)*
global_stmt: 'global' NAME (',' NAME)*
nonlocal_stmt: 'nonlocal' NAME (',' NAME)*
assert_stmt: 'assert' test [',' test]
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
try_stmt: ('try' ':' suite ((except_clause ':' suite)+ ['else' ':' suite]
['finally' ':' suite] |
'finally' ':' suite))
with_stmt: 'with' with_item (',' with_item)* ':' suite
with_item: test ['as' expr]
# NB compile.c makes sure that the default except clause is last
except_clause: 'except' [test ['as' NAME]]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
test: or_test ['if' or_test 'else' test] | lambdef
test_nocond: or_test | lambdef_nocond
lambdef: 'lambda' [varargslist] ':' test
lambdef_nocond: 'lambda' [varargslist] ':' test_nocond
or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)*
# <> isn't actually a valid comparison operator in Python. It's here for the
# sake of a __future__ import described in PEP 401 (which really works :-)
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
star_expr: '*' expr
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<'|'>>') arith_expr)*
arith_expr: term (('+'|'-') term)*
term: factor (('*'|'@'|'/'|'%'|'//') factor)*
factor: ('+'|'-'|'~') factor | power
power: atom_expr ['**' factor]
atom_expr: [AWAIT] atom trailer*
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] )
trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
subscriptlist: subscript (',' subscript)* [',']
subscript: test | [test] ':' [test] [sliceop]
sliceop: ':' [test]
exprlist: (expr|star_expr) (',' (expr|star_expr))* [',']
testlist: test (',' test)* [',']
dictorsetmaker: ( ((test ':' test | '**' expr)
(comp_for | (',' (test ':' test | '**' expr))* [','])) |
((test | star_expr)
(comp_for | (',' (test | star_expr))* [','])) )
classdef: 'class' NAME ['(' [arglist] ')'] ':' suite
arglist: argument (',' argument)* [',']
# The reason that keywords are test nodes instead of NAME is that using NAME
# results in an ambiguity. ast.c makes sure it's a NAME.
# "test '=' test" is really "keyword '=' test", but we have no such token.
# These need to be in a single rule to avoid grammar that is ambiguous
# to our LL(1) parser. Even though 'test' includes '*expr' in star_expr,
# we explicitly match '*' here, too, to give it proper precedence.
# Illegal combinations and orderings are blocked in ast.c:
# multiple (test comp_for) arguments are blocked; keyword unpackings
# that precede iterable unpackings are blocked; etc.
argument: ( test [comp_for] |
test '=' test |
'**' test |
'*' test )
comp_iter: comp_for | comp_if
comp_for: [ASYNC] 'for' exprlist 'in' or_test [comp_iter]
comp_if: 'if' test_nocond [comp_iter]
# not used in grammar, but may appear in "node" passed from Parser to Compiler
encoding_decl: NAME
yield_expr: 'yield' [yield_arg]
yield_arg: 'from' test | testlist
The numeric values for the grammar's node types are defined in Include/graminit.h (https://github.com/python/cpython/blob/master/Include/graminit.h). The node * structs that make up the parse tree are defined in Include/node.h (https://github.com/python/cpython/blob/master/Include/node.h).
Querying data from a node struct is done through a set of macros defined in Include/token.h:
- CHILD(node *, int)
Returns the nth child of the node using zero-offset indexing
- RCHILD(node *, int)
Returns the nth child of the node from the right side; use negative numbers!
- NCH(node *)
Number of children the node has
- STR(node *)
String representation of the node; e.g., will return ':' for a COLON token
- TYPE(node *)
The type of node as specified in Include/graminit.h
- REQ(node *, TYPE)
Assert that the node is the type that is expected
- LINENO(node *)
Retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c
The definition of the while statement shows all of these macros in action:
while_stmt: 'while' test ':' suite ['else' ':' suite]
For the node representing this rule, the following holds:
TYPE(node) == while_stmt;
The node will have either 4 or 7 children, depending on whether the 'while' is followed by an 'else' clause. The code below accesses what should be the first ':' and asserts that it really is a colon:
REQ(CHILD(node, 2), COLON)
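The same structure can be poked at from Python itself. In CPython versions that still ship the legacy parser and symbol modules (deprecated in 3.9, removed in 3.10), the concrete parse tree is exposed as nested lists whose first element is the node type; the sketch below checks the 4-versus-7-children claim directly:

import parser
import symbol

def while_children(src):
    st = parser.suite(src).tolist()
    # file_input -> stmt -> compound_stmt -> while_stmt
    node = st[1][1][1]
    assert node[0] == symbol.while_stmt   # the Python-level analogue of REQ()
    return len(node) - 1                  # first element is the type; the rest are children

print(while_children("while x:\n    pass\n"))                     # 4
print(while_children("while x:\n    pass\nelse:\n    pass\n"))    # 7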
Abstract Syntax Tree (AST)
The abstract syntax tree (AST) is a high-level representation of the program's structure; once it has been built, the source code itself is no longer needed. It can be thought of as an abstract, equivalent representation of the source. CPython uses the Zephyr Abstract Syntax Definition Language (ASDL) (http://www.cs.princeton.edu/research/techreps/TR-554-97) to describe the AST.
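To make the "source is no longer needed" point concrete, here is a minimal sketch using the standard ast module: once the tree has been built, compile() can produce runnable code without ever seeing the original text again.

import ast

tree = ast.parse("print('hello from the AST')")
# The source string could be discarded here; compile() works directly on the tree.
code = compile(tree, filename="<no source>", mode="exec")
exec(code)   # prints: hello from the AST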
Python's AST definition can be found in Parser/Python.asdl (https://github.com/python/cpython/blob/master/Parser/Python.asdl) in the source tree:
-- ASDL's 7 builtin types are:
-- identifier, int, string, bytes, object, singleton, constant
--
-- singleton: None, True or False
-- constant can be None, whereas None means "no value" for object.

module Python
{
mod = Module(stmt* body, string? docstring)
| Interactive(stmt* body)
| Expression(expr body)
-- not really an actual node but useful in Jython's typesystem.
| Suite(stmt* body)
stmt = FunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list, expr? returns,
string? docstring)
| AsyncFunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list, expr? returns,
string? docstring)
| ClassDef(identifier name,
expr* bases,
keyword* keywords,
stmt* body,
expr* decorator_list,
string? docstring)
| Return(expr? value)
| Delete(expr* targets)
| Assign(expr* targets, expr value)
| AugAssign(expr target, operator op, expr value)
-- 'simple' indicates that we annotate simple name without parens
| AnnAssign(expr target, expr annotation, expr? value, int simple)
-- use 'orelse' because else is a keyword in target languages
| For(expr target, expr iter, stmt* body, stmt* orelse)
| AsyncFor(expr target, expr iter, stmt* body, stmt* orelse)
| While(expr test, stmt* body, stmt* orelse)
| If(expr test, stmt* body, stmt* orelse)
| With(withitem* items, stmt* body)
| AsyncWith(withitem* items, stmt* body)
| Raise(expr? exc, expr? cause)
| Try(stmt* body, excepthandler* handlers, stmt* orelse, stmt* finalbody)
| Assert(expr test, expr? msg)
| Import(alias* names)
| ImportFrom(identifier? module, alias* names, int? level)
| Global(identifier* names)
| Nonlocal(identifier* names)
| Expr(expr value)
| Pass | Break | Continue
-- XXX Jython will be different
-- col_offset is the byte offset in the utf8 string the parser uses
attributes (int lineno, int col_offset)
-- BoolOp() can use left & right?
expr = BoolOp(boolop op, expr* values)
| BinOp(expr left, operator op, expr right)
| UnaryOp(unaryop op, expr operand)
| Lambda(arguments args, expr body)
| IfExp(expr test, expr body, expr orelse)
| Dict(expr* keys, expr* values)
| Set(expr* elts)
| ListComp(expr elt, comprehension* generators)
| SetComp(expr elt, comprehension* generators)
| DictComp(expr key, expr value, comprehension* generators)
| GeneratorExp(expr elt, comprehension* generators)
-- the grammar constrains where yield expressions can occur
| Await(expr value)
| Yield(expr? value)
| YieldFrom(expr value)
-- need sequences for compare to distinguish between
-- x < 4 < 3 and (x < 4) < 3
| Compare(expr left, cmpop* ops, expr* comparators)
| Call(expr func, expr* args, keyword* keywords)
| Num(object n) -- a number as a PyObject.
| Str(string s) -- need to specify raw, unicode, etc?
| FormattedValue(expr value, int? conversion, expr? format_spec)
| JoinedStr(expr* values)
| Bytes(bytes s)
| NameConstant(singleton value)
| Ellipsis
| Constant(constant value)
-- the following expression can appear in assignment context
| Attribute(expr value, identifier attr, expr_context ctx)
| Subscript(expr value, slice slice, expr_context ctx)
| Starred(expr value, expr_context ctx)
| Name(identifier id, expr_context ctx)
| List(expr* elts, expr_context ctx)
| Tuple(expr* elts, expr_context ctx)
-- col_offset is the byte offset in the utf8 string the parser uses
attributes (int lineno, int col_offset)
expr_context = Load | Store | Del | AugLoad | AugStore | Param
slice = Slice(expr? lower, expr? upper, expr? step)
| ExtSlice(slice* dims)
| Index(expr value)
boolop = And | Or
operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift
| RShift | BitOr | BitXor | BitAnd | FloorDiv
unaryop = Invert | Not | UAdd | USub
cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn
comprehension = (expr target, expr iter, expr* ifs, int is_async)
excepthandler = ExceptHandler(expr? type, identifier? name, stmt* body)
attributes (int lineno, int col_offset)
arguments = (arg* args, arg? vararg, arg* kwonlyargs, expr* kw_defaults,
arg? kwarg, expr* defaults)
arg = (identifier arg, expr? annotation)
attributes (int lineno, int col_offset)
-- keyword arguments supplied to call (NULL identifier for **kwargs)
keyword = (identifier? arg, expr value)
-- import name with optional 'as' alias.
alias = (identifier name, identifier? asname)
withitem = (expr context_expr, expr? optional_vars)
}
Every kind of AST node (statements, expressions, and a few special types such as comprehensions and exception handlers) is defined by the ASDL. Most AST definitions correspond to a particular source construct, such as an 'if' statement or an attribute lookup. The definitions are independent of the language used to implement Python: CPython, PyPy, Jython, and even IronPython could all work from the same ASDL.
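For instance, an attribute lookup in the source maps directly onto an Attribute node in the AST. A one-liner with the standard ast module shows the correspondence (the exact dump format varies slightly between versions):

import ast

print(ast.dump(ast.parse("obj.attr", mode="eval")))
# Expression(body=Attribute(value=Name(id='obj', ctx=Load()), attr='attr', ctx=Load()))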
The following fragment of the Python ASDL shows the constructs related to function definitions:
module Python
{
    stmt = FunctionDef(identifier name, arguments args, stmt* body,
                       expr* decorators)
          | Return(expr? value) | Yield(expr value)
          attributes (int lineno)
}
The fragment above defines three kinds of statements: function definitions (def), return statements, and yield statements. The three alternatives are separated by '|', and each takes a different number and type of arguments.
Each argument type carries a modifier (much like in regular expressions) that specifies how many values are expected:
'?' means the argument is optional (0 or 1);
'*' means zero or more values are allowed;
no modifier means exactly one value is required.
For example, a function definition takes an identifier for its name, 'arguments' for its parameter list, zero or more stmt for its body, and zero or more expr for its decorators.
Note that 'arguments', perhaps contrary to expectation, is itself a node type, represented by its own AST definition, just like stmt.
All three statement kinds share the 'attributes' argument, which is why 'attributes' is not preceded by a '|'.
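These modifiers are easy to observe from the Python side. The sketch below inspects a parsed function definition with the standard ast module (it relies on the full FunctionDef definition shown earlier, which also has an optional 'returns' field):

import ast

fn = ast.parse("def greet(name):\n    return name\n").body[0]

print(type(fn).__name__)        # FunctionDef
print(fn.name)                  # 'greet'      -- identifier, exactly one
print(type(fn.args).__name__)   # arguments    -- a node type of its own
print(len(fn.body))             # 1            -- stmt*, zero or more
print(fn.decorator_list)        # []           -- expr*, zero or more (none here)
print(fn.returns)               # None         -- expr?, optional (absent here)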
The ASDL above gets translated into the following C structure types:
typedef struct _stmt *stmt_ty;

struct _stmt {
enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
union {
struct {
identifier name;
arguments_ty args;
asdl_seq *body;
} FunctionDef;
struct {
expr_ty value;
} Return;
struct {
expr_ty value;
} Yield;
} v;
int lineno;
};
Constructor functions are generated alongside these types; they allocate an stmt_ty struct and perform the necessary initialization. The 'kind' enum field indicates which member of the union is in use. The FunctionDef() constructor sets 'kind' to FunctionDef_kind and initializes the 'name', 'args', 'body', and 'attributes' fields.
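The Python-level ast module mirrors those generated constructors fairly closely, so a rough sketch of what the C constructors do can be written in pure Python. This assumes CPython 3.8 through 3.11 (newer releases add further required fields such as type_params), and every required field has to be supplied before the tree will compile:

import ast

# Hand-build a FunctionDef node, much as the generated FunctionDef() constructor does in C.
fn = ast.FunctionDef(
    name="answer",
    args=ast.arguments(posonlyargs=[], args=[], vararg=None, kwonlyargs=[],
                       kw_defaults=[], kwarg=None, defaults=[]),
    body=[ast.Return(value=ast.Constant(value=42))],
    decorator_list=[],
    returns=None,
)
module = ast.Module(body=[fn], type_ignores=[])
ast.fix_missing_locations(module)   # fill in the lineno/col_offset attributes

namespace = {}
exec(compile(module, "<synthesized>", "exec"), namespace)
print(namespace["answer"]())        # 42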