Compiler | 3 Lexical Analyzer

This note is partially copied from Dragon Book - Compilers Principals Techniques and Tools (2nd Edition) Chapter 3 Lexical Analyzer.

3.1 The Role of the Lexical Analyzer

At the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis.

Strip out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
Correlate error messages generated by the compiler with the source program.
The expansion of macros.

Sometimes, lexical analyzers are divided into a cascade of two processes:

Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

3.1.2 Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

In many programming languages, the following classes cover most or all of the tokens:

One token for each keyword. The pattern for a keyword is the same as the keyword itself.
Tokens for the operators, either individually or in classes such as the token comparison (< or > or <= or >= or == or !=).
One token representing all identifiers.
One or more tokens representing constants, such as numbers and literal strings.
Tokens for each punctuation symbols, such as left and right parentheses, comma, and semicolon.

3.2 Input Buffering

3.2.1 Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternatively reloaded, as suggested in Fig 3.3.

Each buffer is of the same size N, and N is usually the size of a disk block, e.g. 4096 bytes.

Two pointers to the input are maintained:

Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found.

Advancing forward requires that we first test whether we have readched the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

3.2.2 Sentinels

The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.

3.3 Specification of Tokens

regular expressions

3.3.1 Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. Unicode, which includes approximately 100,000 characters from alphabets around the world, is another example of an alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theories, the terms "sentence" and "word" are often used as synonyms for "string". The length of a string s, usually written |s|, is the number of occurrences of symbols in s. The empty string, denoted ε, is the string of length zero.

A language is any countable set of strings over some fixed alphabet. Note that the definition of "language" does not require that any meaning be ascribed to the strings in the language.

3.3.2 Operations on Languages

3.3.3 Regular Expressions

The regular expressions are built recursively out of smaller regular expressions, using the rules described below. Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions. Here are the rules that define the regular expressions over some alphabet ∑ and the languages that those expressions denote.

BASIS: There are two rules that form the basis:

ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.

INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.

(r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
(r) (s) is a regular expression denoting the language L(r)L(s).
(r)* is a regular expression denoting (L(r))*.
(r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.

As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:

a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.

A language that can be defined by a regular expression is called a regular set.

3.3.4 Regular Definitions

3.3.5 Extensions of Regular Expressions

One or more instances. If r us a regular expression, then (r)⁺ denotes the language (L(r))⁺.
Zero or one instance. r? is equivalent to r|ε, or put in another way, L(r?) = L(r) ∪ {ε}.
Character classes. A regular expression a₁|a₂|···|a_n, where the a_i's are each symbols of the alphabet, can be replaced by the shorthand [a₁a₂···a_n]. More importantly, when a₁|a₂|···|a_n form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a₁-a_n, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|···|z.

3.4 Recognition of Tokens

3.4.1 Transition Diagrams

Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one if several patterns. We may think of a state as summarizing all we need to know about what characters we have seen between the lexemeBegin pointer and the forward pointer (as in the situation of Fig 3.3).

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads. We shall assume that all out transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.

Some conventions about transition diagrams:

Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle, and if there is an action to be taken —— typically returning a token and an attribute value to the parse —— we shall attach that action to the accepting state.
In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. (If necessary, we can attach any number of *'s to the accepting state.)
One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

3.4.2 Recognition of Reserved Words and Identifiers

There are two ways that we can handle reserved words that look like identifiers:

Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent.
Create separate transition diagrams for each keyword.

3.4.4 Architecture of a Transition-Diagram-Based Lexical Analyzer

A variable state is holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.

To place the simulation of one transition diagram in perspective, let us consider the ways code like Fig 3.18 could fit into the entire lexical analyzer.

We could arrange for the transition diagrams for each token to be tried sequentially.
We could run the various transition diagrams "in parallel", feeding the next input character to all of them and allowing each one to make whatever transitions it required. Note that a normal rule is to take the longest prefix of the input that matches any pattern.
The preferred approach is to combine all the transition diagrams into one. We allow the transition diagram to read input until there is no possible next state, and then take the longest lexeme that matched any pattern.

3.5 The Lexical-Analyzer Generator Lex

3.6 Finite Automata

Finite automata are essentially graphs like transition diagrams, with a few differences:

Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and ε, the empty string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet exactly one edge with that symbol leaving that state.

Both deterministic and nondeterministic finite automata are capable of recognizing the same languages. In fact these languages are exactly the same languages, called the regular languages, that regular expressions can describe.

3.6.1 Nondeterministic Finite Automata

A nondeterministic finite automata (NFA) consists of:

A finite set of states S.
A set of input symbols ∑, the input alphabet. We assume that ε, which stands for the empty string, is never a member of ∑.
A transition function that gives, for each state, and for each symbol in ∑∪{ε} a set of next states.
A state s₀ from S that is distinguished as the start state (or initial state).
A set of states F, a subset of S, that is distinguished as the accepting states (or final states).

We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the transition function. There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a. This graph is very much like a transition diagram, except:

a) The same symbol can label edges from one state to several different states and

b) An edge may be labeled by ε, the empty string, instead of, or in addition to, symbols from the input alphabet.

3.6.2 Transition Tables

We can also represent an NFA by a transition table, whose rows correspond to states, and whose columns correspond to the input symbols and ε. The entry for a given state and input is the value of the transition function applied to those arguments. If the transition function has no in formation about that state-input pair, we put ? in the table for the pair.

3.6.3 Acceptance of Input Strings by Automata

An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states, such that the symbols along the path spell out x.

Remember that an NFA accepts a string as long as some path labeled by that string leads from the start to accepting state. The existance of other paths leading to a nonaccepting state is irrelevant.

The language defined (or accepted) by an NFA is the set of strings labeled some path from the start to an accepting state. We may use L(A) to stand for the language accepted by automaton A.

3.6.4 Deterministic Finite Automata

A deterministic finite automaton (DFA) is a special case of an NFA where:

There are no movers on input ε, and
For each state s and input symbol a, there is exactly one edge out of s labeled a.

While NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings. It is fortunate indeed that every regular expression and every NFA can be converted to a DFA accepting the same language, because it is the DFA that we really implement or simulate when building lexical analyzers.

3.7 From Regular Expressions to Automata

3.7.1 Conversion of an NFA to a DFA

The general idea behind the subset construction is that each state of the constructed DFA corresponds to a set of NFA states. After reading the input a₁a₂···a_n, the DFA is in that state which corresponds to the set of states that the NFA can reach, from its start state, following paths labeled a₁a₂···a_n.

It is possible that the number of DFA states is exponential in the number of NFA states, which could lead to difficulties when we try to implement this DFA. However, part of the power of the automaton-based approach to lexical analysis is that for real languages, the NFA and DFA have approximately the same number of states, and the exponential behavior is not seen.

(It is rare in my notes to put an example, but the following one is an exception for that this example provides a concrete supply for the above algorithm.)

3.7.2 Simulation of an NFA

3.7.3 Efficiency of NFA Simulation

O (k(m+n)), in which k is the length of the input, m+n is the size of the transition graph.

3.7.4 Construction of an NFA from a Regular Expression

最后編輯于：2018.08.28 18:06:05

?著作權歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個濱河市安皱，隨后出現(xiàn)的幾起案子容握，更是在濱河造成了極大的恐慌疯溺，老刑警劉巖榛了，帶你破解...
沈念sama閱讀 219,188評論 6贊 508
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異卤唉，居然都是意外死亡计盒，警方通過查閱死者的電腦和手機，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,464評論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門藏杖，熙熙樓的掌柜王于貴愁眉苦臉地迎上來将塑，“玉大人，你說我怎么就攤上這事制市√” “怎么了？”我有些...
開封第一講書人閱讀 165,562評論 0贊 356
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵祥楣，是天一觀的道長开财。經(jīng)常有香客問我，道長误褪，這世上最難降的妖魔是什么责鳍？我笑而不...
開封第一講書人閱讀 58,893評論 1贊 295
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮兽间，結果婚禮上历葛，老公的妹妹穿的比我還像新娘。我一直安慰自己嘀略，他們只是感情好恤溶，可當我...
茶點故事閱讀 67,917評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著帜羊，像睡著了一般咒程。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上讼育，一...
開封第一講書人閱讀 51,708評論 1贊 305
城市分裂傳說
那天帐姻，我揣著相機與錄音，去河邊找鬼奶段。笑死饥瓷，一個胖子當著我的面吹牛，可吹牛的內(nèi)容都是我干的痹籍。我是一名探鬼主播呢铆，決...
沈念sama閱讀 40,430評論 3贊 420
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢啊……” “哼蹲缠！你這毒婦竟也來了棺克？” 一聲冷哼從身側響起鳖宾，我...
開封第一講書人閱讀 39,342評論 0贊 276
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎逆航，沒想到半個月后，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體渔肩，經(jīng)...
沈念sama閱讀 45,801評論 1贊 317
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡因俐，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,976評論 3贊 337
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時候發(fā)現(xiàn)自己被綠了周偎。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片抹剩。...
茶點故事閱讀 40,115評論 1贊 351
活死人
序言：一個原本活蹦亂跳的男人離奇死亡，死狀恐怖蓉坎，靈堂內(nèi)的尸體忽然破棺而出澳眷，到底是詐尸還是另有隱情，我是刑警寧澤蛉艾，帶...
沈念sama閱讀 35,804評論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布钳踊，位于F島的核電站，受9級特大地震影響勿侯，放射性物質(zhì)發(fā)生泄漏拓瞪。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點故事閱讀 41,458評論 3贊 331
男人毒藥：我在死后第九天來索命
文/蒙蒙一助琐、第九天我趴在偏房一處隱蔽的房頂上張望祭埂。院中可真熱鬧，春花似錦兵钮、人聲如沸蛆橡。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,008評論 0贊 22
一樁弒父案掘譬，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽泰演。三九已至，卻和暖如春屁药，著一層夾襖步出監(jiān)牢的瞬間粥血，已是汗流浹背。一陣腳步聲響...
開封第一講書人閱讀 33,135評論 1贊 272
情欲美人皮
我被黑心中介騙來泰國打工酿箭，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留复亏，地道東北人。一個月前我還...
沈念sama閱讀 48,365評論 3贊 373
代替公主和親
正文我出身青樓缭嫡，卻偏偏與公主長得像缔御，于是被迫代替她去往敵國和親。傳聞我的和親對象是個殘疾皇子妇蛀，可洞房花燭夜當晚...
茶點故事閱讀 45,055評論 2贊 355