CISC 5920: Compiler Construction
Chapter 2: Lexical Analysis

Arthur G. Werschulz
Fordham University, Department of Computer and Information Sciences
Copyright Arthur G. Werschulz, 2017. All rights reserved.
Spring, 2017

Tasks of lexer

Scan source code and convert it into tokens. Example:
    if (big < x[i]) big = x[i];
becomes
    KEYWORD LPAREN ID LT ID LBRACK ID RBRACK RPAREN ID GETS ID LBRACK ID RBRACK SEMI
Remove comments
Case conversion (where applicable)
Remove white space (Fortran is a special case)
Interpret compiler directives
Communicate with symbol table
Prepare output listing

Tokens and lexemes

Go back to
    if (big < x[i]) big = x[i];
For parsing purposes:
    All identifiers are the same.
    All relational operators are the same.
But eventually we must distinguish them.
Distinguish between tokens and lexemes:

    Token    Lexeme
    id       x, i
    relop    <
    (        (
    )        )
    if       if
    then     then
    =        =
    [        [
    ]        ]

Buffering

Lexer may need to back up when scanning input data. How?
Read text chunks into a buffer array, divided in half:
    first half: "if (big < x[i"     second half: "] big = "
After the first half has been processed, it is refilled (the buffer wraps around):
    first half: "x[i];"             second half: "] big = "
Only refresh one half if you're really done with it: "for" may be the beginning of "fork".
Typically you know you are done by seeing a separator (whitespace, punctuation).
Need two indices for the buffer: beginning of lexeme, current char.

Finite-state automata (FSAs)

Can use regular expressions to define lexemes (l = letter, d = digit), for instance:
    identifier:             l(l | d)*
    integer constant:       (ε | + | -)dd*
    floating-point number:  (ε | + | -)(0 | d…).(0 | d…
Finite-state automaton: a system with a finite set of states and rules for state transitions upon inputs.
Game plan: RE-based lexeme description → FSA to recognize lexemes → code for recognizing lexemes

State diagrams and state tables for FSAs

Example: FSA for a 10¢ candy machine
Inputs: s (select), n (nickel), d (dime), q (quarter)
States: 0 (no money), 1 (5¢), 2 (10¢), 3 (overpayment)
Can draw a digraph (nodes are states, edges are transitions).
Can represent via a state table:

    Current |     Inputs
    state   |  n   d   q   s
    --------+-----------------
       0    |  1   2   3   0
       1    |  2   3   3   1
       2    |  3   3   3   0*
       3    |  3   3   3   0*

    0*: give candy and change

How about an FSA for an identifier?

Formal definition of FSA

A (deterministic) finite state machine $M$ is given by a quintuple $M = (\Sigma, Q, q_0, F, N)$, where
    $\Sigma$ is a finite set (alphabet) of input symbols
    $Q$ is a finite set of states
    $q_0 \in Q$: start state
    $F \subseteq Q$: set of accepting or final states
    $N : Q \times \Sigma \to Q$ is the state-transition function
Interpretation: $N(q_i, x) = q_j$ means that if $M$ is in state $q_i$ and the current input is $x$, the next state is $q_j$.
Can represent $N$ by the state table (one row per state).

Formal definition of FSA (cont'd)

Example: $\Sigma = \{a, b\}$, $Q = \{1, 2, 3\}$, $q_0 = 1$, $F = \{3\}$

    N | a  b
    --+------
    1 | 2  3
    2 | 1  2
    3 | 1  3

Example: $\Sigma = \{a, b\}$, $Q = \{1, 2, 3, 4\}$, $q_0 = 1$, $F = \{3\}$

    N | a  b
    --+------
    1 | 2  3
    2 | 1  2
    3 | 1  3
    4 | 1  2

State 4 is unreachable (can remove without loss of generality).
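A state table translates almost directly into code. The following is a minimal C++ sketch of the candy machine, assuming the table is read as rows 0 to 3 and columns n, d, q, s; the names `kCandyTable`, `candy_column`, and `run_candy_machine` are illustrative and do not come from the slides.

```cpp
#include <cassert>
#include <string>

// State table for the 10-cent candy machine.
// Rows: states 0..3; columns: inputs n, d, q, s (in that order).
constexpr int kCandyTable[4][4] = {
    /* 0: no money    */ {1, 2, 3, 0},
    /* 1: 5 cents     */ {2, 3, 3, 1},
    /* 2: 10 cents    */ {3, 3, 3, 0},
    /* 3: overpayment */ {3, 3, 3, 0},
};

// Hypothetical helper: maps an input character to its table column,
// or -1 if the character is not in the input alphabet.
int candy_column(char ch) {
    switch (ch) {
        case 'n': return 0;
        case 'd': return 1;
        case 'q': return 2;
        case 's': return 3;
        default:  return -1;
    }
}

// Runs the machine from state 0 and returns the state reached,
// or -1 if the input contains a symbol outside {n, d, q, s}.
int run_candy_machine(const std::string& inputs) {
    int state = 0;
    for (char ch : inputs) {
        int col = candy_column(ch);
        if (col < 0) return -1;
        assert(state >= 0 && state < 4);
        state = kCandyTable[state][col];
    }
    return state;
}
```

For instance, `run_candy_machine("nn")` ends in state 2 (10¢ deposited), and a following `s` returns the machine to state 0.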

Acceptance

$w = x_0 x_1 \dots x_{n-1} \in \Sigma^*$ is accepted by $M$ if we have
    $q_{i+1} = N(q_i, x_i) \quad (0 \le i \le n-1)$
with $q_n \in F$.
Example (previous FSA): Is abab accepted? Is ababa accepted?
$L(M) = \{\, w \in \Sigma^* : w \text{ is accepted by } M \,\}$ ... language accepted by $M$
Machines $M_i$ and $M_j$ are equivalent if $L(M_i) = L(M_j)$.

Use FSA for recognizing tokens

Example: If one FSA per token, then $L(M) = \{\, \text{lexemes corresponding to } M\text{'s token} \,\}$

Acceptance (cont'd)

Coding an FSA? Suppose that
    char table[n_states][n_symbols];
and that
    int char_to_column(char ch);
gives the column number for character ch. Then...
    state = 0;
    for (int i = 0; i < w.size(); i++)
        state = table[state][char_to_column(w[i])];
computes the state that $M$ reaches for the input string w.

Non-deterministic finite-state automata

A nondeterministic finite state machine $M$ is given by a quintuple $M = (\Sigma, Q, q_0, F, N)$, where
    $\Sigma$ is a finite set (alphabet) of input symbols
    $Q$ is a finite set of states
    $q_0 \in Q$: start state
    $F \subseteq Q$: set of accepting or final states
    $N : Q \times (\Sigma \cup \{\varepsilon\}) \to \mathcal{P}(Q)$ is the nondeterministic state-transition function
How does an NFA differ from a DFA?
    State can change without reading a character.
    Transition can be to a set of states (i.e., more than one state).
Why?
    It's easier to build an NFA to recognize REs than a DFA.
    It's straightforward to convert an NFA into a DFA.

NFAs (cont'd)

Example: NFA with state table

      | a      b
    --+--------------
    1 | {1,2}  {3}
    2 | {1}    {2,3}
    3 | {1,2}  {3}

Does it accept aab?
Only need one good path to accept, but all paths must be bad to reject.
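The "one good path" rule can be checked mechanically by tracking every state the NFA could be in at once. Below is a hedged C++ sketch for the three-state example NFA over {a, b}; the bitmask encoding and the names `StateSet`, `bit`, `kNfa`, `nfa_run`, and `nfa_accepts` are assumptions made for illustration. The slide does not say which states are accepting, so the accepting set is passed in as a parameter.

```cpp
#include <cassert>
#include <string>

// A set of NFA states: bit q is set iff state q is in the set.
using StateSet = unsigned;
constexpr StateSet bit(int q) { return 1u << q; }

// Transition table of the example NFA (no epsilon-moves).
// kNfa[q][c] is N(q, c), with c = 0 for 'a' and c = 1 for 'b'; row 0 is unused.
constexpr StateSet kNfa[4][2] = {
    {0, 0},
    {bit(1) | bit(2), bit(3)},           // 1: a -> {1,2}, b -> {3}
    {bit(1),          bit(2) | bit(3)},  // 2: a -> {1},   b -> {2,3}
    {bit(1) | bit(2), bit(3)},           // 3: a -> {1,2}, b -> {3}
};

// Returns the set of states reachable from `start` after reading w,
// following every nondeterministic choice simultaneously.
StateSet nfa_run(StateSet start, const std::string& w) {
    StateSet current = start;
    for (char ch : w) {
        assert(ch == 'a' || ch == 'b');
        int col = (ch == 'a') ? 0 : 1;
        StateSet next = 0;
        for (int q = 1; q <= 3; ++q)
            if (current & bit(q)) next |= kNfa[q][col];
        current = next;
    }
    return current;
}

// w is accepted iff at least one reachable state is accepting.
bool nfa_accepts(StateSet accepting, const std::string& w) {
    return (nfa_run(bit(1), w) & accepting) != 0;
}
```

Starting from state 1, the input aab reaches the set {2, 3}, so it is accepted whenever state 2 or state 3 is accepting.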

NFAs: ε-transitions

Example: Let $M$ have state table

      | a      b      ε
    --+---------------------
    1 | {1,2}  {3}    {2,3}
    2 | {1}    {2,3}  {1}
    3 | {1,2}  {3}    {}

$M$ goes to accept-state 2 if no input!
Only need one good path to accept, but all paths must be bad to reject.

NFAs: Equivalence

Transitions between states: unpredictable
Transitions between sets of states: predictable
For any NFA $M$, we can construct a DFA $M'$ for which $L(M) = L(M')$.
The $M'$-states correspond to sets of $M$-states.
DFAs and NFAs accept the same languages.
Use? In our proposed workflow
    Tokens → REs → NFA → DFA
Since we're at the end of the chain, we continue working backwards through the chain...

The subset construction

Ken Thompson, AT&T Bell Labs
To start with, suppose that there are no ε-transitions.
Given: the NFA $M = (\Sigma, Q, q_0, F, N)$
Our DFA $M' = (\Sigma, Q', q_0', F', N')$, where
    $Q' = \mathcal{P}(Q)$
    $q_0' = [q_0]$ (use brackets, not braces, for subsets in $M'$)
    $[q_{i_1}, \dots, q_{i_n}] \in F'$ iff $q_{i_j} \in F$ for some index $j$
    If $N(\{q_{i_1}, \dots, q_{i_n}\}, x) = \{q_{k_1}, \dots, q_{k_m}\}$, then $N'([q_{i_1}, \dots, q_{i_n}], x) = [q_{k_1}, \dots, q_{k_m}]$.

The subset construction (cont'd)

      | a      b
    --+--------------
    1 | {1,2}  {3}
    2 | {1}    {2,3}
    3 | {1,2}  {3}

becomes

              | a      b
    ----------+--------------
    [1]       | [1,2]  [3]
    [2]*      | [1]    [2,3]
    [3]       | [1,2]  [3]
    [1,2]     | [1,2]  [2,3]
    [1,3]*    | [1,2]  [3]
    [2,3]     | [1,2]  [2,3]
    [1,2,3]*  | [1,2]  [2,3]
    []*       | []     []

    (*: unreachable state)

Disadvantages: $|Q'| = 2^{|Q|}$, unreachable states
Can we do better? Yes! Do one state at a time!

The subset construction (cont'd)

Algorithm:
    create start state [q_0]
    Q' = {[q_0]}
    while (there is an uncompleted row r in the table for M') do
        x = [s_1, ..., s_k] = state for row r
        for a ∈ Σ do
            T = N({s_1, ..., s_k}, a)
            y = [T]
            if y ∉ Q' then Q' = Q' ∪ {y}
            add rule N'(x, a) = y to the M'-transition rules
    identify accepting states in M'

The subset construction (cont'd)

      | a      b
    --+--------------
    1 | {1,2}  {3}
    2 | {1}    {2,3}
    3 | {1,2}  {3}

becomes

          | a      b
    ------+--------------
    [1]   | [1,2]  [3]
    [1,2] | [1,2]  [2,3]
    [3]   | [1,2]  [3]
    [2,3] | [1,2]  [2,3]

The subset construction (cont'd)

What about handling ε-transitions?
For [q_0] and for each new M'-state, also include the ε-closure of the state, i.e., the set of all states reachable from said state via ε-transitions.
Revised algorithm:
    Q' = {ε-closure([q_0])}
    while (there is an uncompleted row r in the table for M') do
        x = [s_1, ..., s_k] = state for row r
        for a ∈ Σ do
            T = N({s_1, ..., s_k}, a)
            y = ε-closure([T])
            if y ∉ Q' then Q' = Q' ∪ {y}
            add rule N'(x, a) = y to the M'-transition rules
    identify accepting states in M'

The subset construction (cont'd)

      | a      b      ε
    --+---------------------
    1 | {1,2}  {3}    {2,3}
    2 | {1}    {2,3}  {...}
    3 | {1,2}  {3}    {1,2}

becomes

            | a        b
    --------+------------------
    [1,2,3] | [1,2,3]  [1,2,3]
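The revised algorithm can be sketched in C++ as a worklist loop, which builds only the reachable DFA states. This is an illustrative sketch under stated assumptions: NFA state sets are encoded as bitmasks, and the names `Nfa`, `eps_closure`, `subset_construction`, and `example_nfa` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>

using StateSet = unsigned;                     // bit q set <=> NFA state q is in the set
constexpr StateSet bit(int q) { return 1u << q; }

struct Nfa {
    int num_states;                            // states are 1..num_states
    int num_symbols;                           // input columns 0..num_symbols-1
    std::vector<std::vector<StateSet>> move;   // move[q][c] = N(q, c); row 0 unused
    std::vector<StateSet> eps;                 // eps[q] = N(q, epsilon)
};

// Smallest superset of s that is closed under epsilon-moves.
StateSet eps_closure(const Nfa& m, StateSet s) {
    for (bool changed = true; changed;) {
        changed = false;
        for (int q = 1; q <= m.num_states; ++q)
            if ((s & bit(q)) && (m.eps[q] & ~s)) { s |= m.eps[q]; changed = true; }
    }
    return s;
}

// Builds the DFA transition table, keyed by the set of NFA states
// that each DFA state stands for. Only reachable DFA states are created.
std::map<StateSet, std::vector<StateSet>> subset_construction(const Nfa& m, int start) {
    assert(start >= 1 && start <= m.num_states);
    std::map<StateSet, std::vector<StateSet>> dfa;
    std::queue<StateSet> todo;                 // rows not yet completed
    StateSet s0 = eps_closure(m, bit(start));
    dfa.emplace(s0, std::vector<StateSet>{});
    todo.push(s0);
    while (!todo.empty()) {
        StateSet x = todo.front();
        todo.pop();
        std::vector<StateSet> row(m.num_symbols);
        for (int c = 0; c < m.num_symbols; ++c) {
            StateSet t = 0;
            for (int q = 1; q <= m.num_states; ++q)
                if (x & bit(q)) t |= m.move[q][c];
            StateSet y = eps_closure(m, t);
            if (dfa.find(y) == dfa.end()) {    // new DFA state: schedule its row
                dfa.emplace(y, std::vector<StateSet>{});
                todo.push(y);
            }
            row[c] = y;
        }
        dfa[x] = row;
    }
    return dfa;
}

// The epsilon-free example NFA over {a, b} (column 0 = a, column 1 = b).
Nfa example_nfa() {
    Nfa m;
    m.num_states = 3;
    m.num_symbols = 2;
    m.move = {{0, 0},
              {bit(1) | bit(2), bit(3)},
              {bit(1), bit(2) | bit(3)},
              {bit(1) | bit(2), bit(3)}};
    m.eps = {0, 0, 0, 0};
    return m;
}
```

Run on `example_nfa()` with start state 1, the sketch produces the four reachable states [1], [1,2], [3], and [2,3]. Setting `eps[1] = {2,3}` and `eps[3] = {1,2}` collapses the result to the single state [1,2,3], whatever the ε-entry for state 2 is.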

Regular expressions

Lexing, parsing: use tables, rather than customized code
Building a DFA by hand: difficult, error-prone
Need a mechanism:
    How to represent a token? Regular expressions
    Program to turn representation into DFA? Unix lex: regexp → NFA → DFA
Examples of regular expressions:
    $b^4 = bbbb$
    $a^n$: $n$ instances of $a$ in a row
    $a^*$: any concatenation of $a$'s (the Kleene star operation)
    $b^+$: any nonempty concatenation of $b$'s
    $ab \mid bc$: $ab$ or $bc$

Regular expressions (cont'd)

Formal definition: Let $\Sigma$ be an alphabet. Then:
    $x \in \Sigma \implies x$ is a regular expression.
    $\varepsilon$ is a regular expression.
    If $R$ is a regular expression, then $R^*$ is a regular expression.
    If $R$ and $S$ are regular expressions, then $RS$ (sometimes written $R \cdot S$) and $R \mid S$ are regular expressions.
    Nothing else is a regular expression.
Here:
    $L(RS) = \{\, vw : v \in L(R),\ w \in L(S) \,\}$
    $L(R \mid S) = L(R) \cup L(S)$
    $L(R^*) = \{\, w_1 \dots w_n : n \ge 0 \text{ and } w_1, \dots, w_n \in L(R) \,\}$
    $L(R^+) = \{\, w_1 \dots w_n : n \ge 1 \text{ and } w_1, \dots, w_n \in L(R) \,\}$

Regular expressions (cont'd)

Example: Let $\Sigma = \{a, b, \dots, z\}$.
    $a \in \Sigma \implies a$ is a regular expression.
    $b$ and $c$ are regular expressions.
    $ab$ is a regular expression.
    $(ab \mid b)$ is a regular expression.
    $(ab \mid b)^*$ is a regular expression.

Regular expressions (cont'd)

Regular expressions $R$ and $S$ are equivalent if $L(R) = L(S)$.
    $L(R) = L(S) \iff L(R) \subseteq L(S) \,\wedge\, L(S) \subseteq L(R)$.
Useful equivalences:
    $R(ST) = (RS)T$
    $R \mid (S \mid T) = (R \mid S) \mid T$
    $R \mid S = S \mid R$
    $R(S \mid T) = RS \mid RT$
    $R^+ = RR^*$
    $R^* = \varepsilon \mid R^+$
    $R\varepsilon = \varepsilon R = R$
    but $RS \ne SR$!
If $L$ is a language such that $L(E) = L$ for a regular expression $E$, then $L$ is a regular language.
Not all languages are regular! However, tokens are regular.
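Regular expressions of this kind can be tried out directly with the C++ standard library. The sketch below encodes the identifier pattern l(l | d)* and a signed integer constant in `std::regex` syntax, assuming l means an ASCII letter and d a decimal digit; the function names are illustrative only.

```cpp
#include <cassert>
#include <regex>
#include <string>

// identifier: l(l|d)*, with l = ASCII letter and d = decimal digit (assumed).
bool is_identifier(const std::string& s) {
    static const std::regex re("[A-Za-z][A-Za-z0-9]*");
    return std::regex_match(s, re);   // the whole string must match
}

// integer constant: an optional sign followed by dd*, i.e. one or more digits.
bool is_integer_constant(const std::string& s) {
    static const std::regex re("[+-]?[0-9][0-9]*");
    return std::regex_match(s, re);
}
```

Note that `[0-9][0-9]*` is the library spelling of $dd^*$, which by the equivalence $R^+ = RR^*$ could also be written `[0-9]+`.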

Regular expressions and finite automata

Regular expressions are defined inductively.
Thompson's construction inductively builds an NFA to recognize regular expressions.
[Figure: NFA fragments recognizing $\varepsilon$, $a \in \Sigma$, $RS$, $R \mid S$, $R^+$, and $R^*$]
Exercise: Design an NFA to recognize $a(a \mid bc)^*$.

Pumping lemma

Theorem. Let $M = (\Sigma, Q, q_0, F, N)$ be a finite automaton, with $n = |Q|$. Then there exists $k \le n$ such that the following holds: If $w \in L(M)$ with $|w| \ge k$, then there exist $x, y, z \in \Sigma^*$, with $y \ne \varepsilon$, such that $w = xyz$ and $xy^*z \subseteq L(M)$.

Proof. Let $w = w_1 \dots w_n$. There exist states $q_0, \dots, q_n \in Q$ such that
    $N(q_i, w_{i+1}) = q_{i+1} \quad (0 \le i \le n-1)$.
Since $|Q| = n$, there exist $i, j \in \{0, \dots, n\}$ such that $i < j$ and $q_i = q_j$.

Pumping lemma (cont'd)

Proof (cont'd). Let
    $x = w_1 \dots w_i$
    $y = w_{i+1} \dots w_j$
    $z = w_{j+1} \dots w_n$
Then $y \ne \varepsilon$ and $M$ accepts $xy^*z$.
Now let $S$ be the set of all positive $k \in \mathbb{Z}$ such that
    $w \in L(M) \,\wedge\, |w| \ge k \implies \exists\, x, y, z \in \Sigma^*$ with $y \ne \varepsilon$ such that $w = xyz \,\wedge\, xy^*z \subseteq L(M)$.
$S$ is nonempty since $n \in S$. Hence $S$ has a minimal element $k$.

Pumping lemma (cont'd)

Example: Let $\Sigma = \{\, (,\ ) \,\}$ and let $L_{()} = \{\, (^k\,)^k : k \in \mathbb{N} \,\}$. We show that $L_{()}$ is not regular.
Proof. Suppose that $L_{()} = L(M)$ for some FSA $M$. Let $n = |Q|$. Consider $(^k\,)^k$ for some $k \ge n$. Write $(^k\,)^k = xyz$ where $y \ne \varepsilon$ and $xy^*z \subseteq L_{()}$. Regardless of whether $y$ consists only of "(" characters, only of ")" characters, or $y = (^l\,)^m$ for some $l, m > 0$, one can find strings in $xy^*z$ that are unbalanced, contradicting our assumption that $L_{()} = L(M)$.
Conclusion: Lexer can't check for balanced parentheses!
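To make the three cases in the parenthesis example concrete, here is a short worked version that pumps $y$ exactly twice. It is a sketch of one way to finish the argument; the exponents $l, m > 0$ and the choice of $xy^2z$ are illustrative, not taken from the slides.

```latex
\begin{align*}
\text{Case } y = (^{l}:\quad
  & xy^{2}z = (^{\,k+l}\,)^{k}, && \text{more ``('' than ``)''}\\
\text{Case } y = )^{m}:\quad
  & xy^{2}z = (^{k}\,)^{\,k+m}, && \text{more ``)'' than ``(''}\\
\text{Case } y = (^{l}\,)^{m}:\quad
  & xy^{2}z = (^{k}\,)^{m}\,(^{l}\,)^{k}, && \text{a ``)'' precedes a ``(''}
\end{align*}
```

In each case $xy^2z$ is not of the form $(^j\,)^j$, so $xy^2z \notin L_{()}$ although $xy^2z \in xy^*z$, which gives the contradiction.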

Application to lexical analysis

Let $\Sigma = \{a, b, c\}$. Suppose there are exactly two tokens: $X = ab^*$ and $Y = (a \mid c)^*$.
To build a DFA recognizing $X \mid Y$:
1. Build NFA for $X$.
2. Build NFA for $Y$.
3. Build NFA $M$ for $X \mid Y$.

Application to lexical analysis (cont'd)

[Figure: NFA for $X = ab^*$]
[Figure: NFA for $Y = (a \mid c)^*$]

Application to lexical analysis (cont'd)

[Figure: NFA for $X \mid Y = (ab^*) \mid (a \mid c)^*$, with states labeled A through U]

Application to lexical analysis (cont'd)

We want to build the DFA $M'$ corresponding to the NFA $M$:
Let $N$ and $N'$ denote the transition functions for $M$ and $M'$.
T is the start state for $M$.
[T], with its ε-closure included, is [AGHIKNTU], the start state for $M'$.

                 |                Inputs
    State        | a             b        c
    -------------+----------------------------------
    [AGHIKNTU]   | [BCFHIJKMNU]  [ ]      [HIKLMNU]
    [BCFHIJKMNU] | [HIJKMNU]     [CDEFU]  [HIKLMNU]
    [ ]          | [ ]           [ ]      [ ]
    [HIKLMNU]    | [HIJKMNU]     [ ]      [HIKLMNU]
    [HIJKMNU]    | [HIJKMNU]     [ ]      [HIKLMNU]
    [CDEFU]      | [ ]           [CDEFU]  [ ]

Application to lexical analysis (cont'd)

Relabeling, we get:

          |  Inputs
    State | a  b  c
    ------+---------
      1   | 2  3  4
      2   | 5  6  4
      3   | 3  3  3
      4   | 5  3  4
      5   | 5  3  4
      6   | 3  6  3

Use state minimization techniques (Appendix A) to remove equivalent states.

State minimization

Two states $q_i, q_j \in Q$ are equivalent ($q_i \equiv q_j$) if every input string that leads from $q_i$ to a state in $F$ also leads from $q_j$ to a state in $F$, and vice versa.
Two non-equivalent states are distinguishable.
Reduce the machine to one for which all state pairs are distinguishable.
1. Initially assume all state pairs are indistinguishable (until proven otherwise).
2. Only look at single input symbols:
    $\exists\, a \in \Sigma,\ N(q_i, a) \in F \,\wedge\, N(q_j, a) \notin F \implies q_i \not\equiv q_j$
    $\exists\, a \in \Sigma,\ N(q_i, a) \not\equiv N(q_j, a) \implies q_i \not\equiv q_j$

State minimization (cont'd)

Distinguishability matrix $D$: an $(n-1) \times (n-1)$ upper triangular bit matrix with
    $d_{i,j} = 1$ iff $q_i \not\equiv q_j$.
Here, $n = |Q|$.
Start with at least one pair of distinguishable states, say, $q_i \in F$ and $q_j \notin F$; we set $d_{i,j} = 1$ for each such pair.
Consider all unmarked entries. Suppose the states are $p$ and $q$. Then, for every $a \in \Sigma$, compare $N(p, a)$ with $N(q, a)$; that is, look at the p-row and q-row in the state table:
(a) Rows are identical or equivalent: leave the entry unmarked.
(b) Rows differ by known distinguishable states: the states are distinguishable.
(c) Rows differ by states whose distinguishability is unknown: don't know yet.

State minimization (cont'd)

More on case (c): Suppose we have

    p | r  t  v
    q | s  u  w

where we know nothing about {(r, s), (t, u), (v, w)}.
Distinguishability of (p, q) depends on what we later learn about these pairs.
Put (p, q) on a list linked to each pair.
Once we find a pair to be non-equivalent, mark each pair on its list as also being non-equivalent.
Algorithm terminates when all entries are checked: if the row or column for $q$ contains a zero, we've found a state equivalent to $q$.
The equivalence classes of states are the states of the new machine.
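The marking procedure can also be run as a simple fixed-point iteration: keep sweeping over the unmarked pairs until a full sweep marks nothing new. The C++ sketch below uses that simpler variant in place of the linked lists described above; it computes the same matrix but may revisit pairs. The function names are hypothetical, states 1 to 6 are stored as indices 0 to 5, and it is assumed that every state except 3 (the empty set) is accepting.

```cpp
#include <cassert>
#include <vector>

// Returns d with d[p][q] == true iff DFA states p and q are distinguishable.
// next[p][c] is the transition from state p on symbol column c (0-based);
// accepting[p] says whether state p is final.
std::vector<std::vector<bool>> distinguishable(
        const std::vector<std::vector<int>>& next,
        const std::vector<bool>& accepting) {
    const int n = static_cast<int>(next.size());
    assert(static_cast<int>(accepting.size()) == n);
    std::vector<std::vector<bool>> d(n, std::vector<bool>(n, false));
    // Base case: a final state and a non-final state are distinguishable.
    for (int p = 0; p < n; ++p)
        for (int q = 0; q < n; ++q)
            d[p][q] = (accepting[p] != accepting[q]);
    // Propagate: (p, q) is distinguishable if some symbol leads to a marked pair.
    for (bool changed = true; changed;) {
        changed = false;
        for (int p = 0; p < n; ++p) {
            for (int q = p + 1; q < n; ++q) {
                if (d[p][q]) continue;
                const int num_symbols = static_cast<int>(next[p].size());
                for (int c = 0; c < num_symbols; ++c) {
                    if (d[next[p][c]][next[q][c]]) {
                        d[p][q] = true;
                        d[q][p] = true;
                        changed = true;
                        break;
                    }
                }
            }
        }
    }
    return d;
}

// The relabeled six-state DFA, columns a, b, c, with states shifted down by one.
std::vector<std::vector<int>> example_table() {
    return {{1, 2, 3}, {4, 5, 3}, {2, 2, 2}, {4, 2, 3}, {4, 2, 3}, {2, 5, 2}};
}

// Assumption: only state 3 (index 2) is non-accepting.
std::vector<bool> example_accepting() {
    return {true, true, false, true, true, true};
}
```

On the example table, the only pair left unmarked is (4, 5), i.e. indices 3 and 4, so those two states can be merged.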

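Go back to original problem

          |  Inputs
    State | a  b  c
    ------+---------
      1   | 2  3  4
      2   | 5  6  4
      3   | 3  3  3
      4   | 5  3  4
      5   | 5  3  4
      6   | 3  6  3

Initial distinguishability matrix (columns are states 2 to 6; rows 1 and 2 shown):

        2  3  4  5  6
    1   0  1  0  0  0
    2      1  0  0  0

Then compare rows 1 and 2:

    1 | 2  3  4
    2 | 5  6  4

Since $3 \not\equiv 6 \implies 1 \not\equiv 2$, we now have

        2  3  4  5  6
    1   1  1  0  0  0
    2      1  0  0  0

Compare rows 1 and 4:

    1 | 2  3  4
    4 | 5  3  4

Since the status of (2, 5) is not yet known, we have the list
    (2, 5) → (1, 4)

Compare rows 1 and 5:

    1 | 2  3  4
    5 | 5  3  4

Since the status of (2, 5) is still not known, we also have
    (2, 5) → (1, 4) → (1, 5)

Compare rows 1 and 6:

    1 | 2  3  4
    6 | 3  6  3

So $3 \not\equiv 6 \implies 1 \not\equiv 6$. Hence

        2  3  4  5  6
    1   1  1  0  0  1
    2      1  0  0  0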

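Compare rows 2 and 4:

    2 | 5  6  4
    4 | 5  3  4

So $6 \not\equiv 3 \implies 2 \not\equiv 4$. Hence

        2  3  4  5  6
    1   1  1  0  0  1
    2      1  1  0  0

Compare rows 2 and 5:

    2 | 5  6  4
    5 | 5  3  4

So $6 \not\equiv 3 \implies 2 \not\equiv 5$. Since (2, 5) → (1, 4) → (1, 5), we now have $1 \not\equiv 4$ and $1 \not\equiv 5$. Hence

        2  3  4  5  6
    1   1  1  1  1  1
    2      1  1  1  0

Compare rows 2 and 6:

    2 | 5  6  4
    6 | 3  6  3

So $5 \not\equiv 3 \implies 2 \not\equiv 6$, and we now have

        2  3  4  5  6
    1   1  1  1  1  1
    2      1  1  1  1

Compare rows 4 and 5:

    4 | 5  3  4
    5 | 5  3  4

So $4 \equiv 5$, and so we now have

        2  3  4  5  6
    1   1  1  1  1  1
    2      1  1  1  1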

Examining the remaining unmarked pairs (4, 6) and (5, 6) in like manner, we find that $4 \not\equiv 6$ and $5 \not\equiv 6$. So we finally have

        2  3  4  5  6
    1   1  1  1  1  1
    2      1  1  1  1
    4            0  1
    5               1

(Row 3 is omitted: state 3 is already distinguished from every other state.)
We can delete either state 4 or state 5. Let's delete state 5, leaving us with states 1, 2, 3, 4, and 6:

          |  Inputs
    State | a  b  c
    ------+---------
      1   | 2  3  4
      2   | 4  6  4
      3   | 3  3  3
      4   | 4  3  4
      6   | 3  6  3

Finally, relabel state 6 as state 5, getting

          |  Inputs
    State | a  b  c
    ------+---------
      1   | 2  3  4
      2   | 4  5  4
      3   | 3  3  3
      4   | 4  3  4
      5   | 3  5  3

Recognizing tokens

Modifications to the basic FSA for lexing program source:
1. Ignore whitespace, except when it delimits a token. Use an extra state for this.
2. Whenever we reach an accepting state, announce a token. Don't enter an accepting state until the entire token is read.
3. Exactly one accepting state per token; the state identifies the token type.
4. Treat keywords as identifiers, but do a table lookup in the symbol table.
Other considerations:
    Some tokens are prefixes of others.
    Can't recognize an identifier until past its end. May need to back up one character.
    Comments: may have multi-character delimiters (for instance, /* ... */ or // ...).
    Quotation marks within quotes (\" vs. "").

Stripped-down Pascal lexer

Need to identify the following tokens:
    Identifiers, constants, labels
    Keywords (such as for).
    Simple operators (such as <).
    Compound operators (such as <=). Multi-character tokens whose prefix is a token.
    Comment syntax. Turbo Pascal allows { ... } as well as (* ... *).
    Compiler directives (analogous to #include).
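Treating keywords as identifiers and then consulting a table might look like the following C++ sketch. The keyword set shown is a small hypothetical sample rather than the full Pascal list, and the lexeme is assumed to have been case-converted to lower case already; `Kind`, `keyword_table`, and `classify` are illustrative names.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

enum class Kind { Keyword, Identifier };

// Hypothetical keyword table; a real Pascal lexer would list every reserved word.
const std::unordered_set<std::string>& keyword_table() {
    static const std::unordered_set<std::string> table = {
        "begin", "do", "else", "end", "for", "if", "then", "while"};
    return table;
}

// The FSA recognizes every word as an identifier; the table lookup
// afterwards decides whether the lexeme is actually a keyword.
Kind classify(const std::string& lexeme) {
    assert(!lexeme.empty());
    return keyword_table().count(lexeme) != 0 ? Kind::Keyword : Kind::Identifier;
}
```

Because the lookup happens only after the whole lexeme has been isolated, `for` is classified as a keyword while `fork` remains an identifier.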

Stripped-down Pascal lexer (cont'd)

[Figure]

Coding an FSA

Basic outline of pseudocode:
    repeat
        get next input char
        find new state table entry
        if (new state is final for some token) then
        begin
            isolate token
            pass to parser
            decrement cp if necessary
        end
    until no more input
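As a rough illustration of the loop above, here is a small hand-coded scanner in C++. It is a sketch under stated assumptions, not the table-driven lexer the slides build: it uses character tests in place of a state table, and the names `TokenType`, `Token`, and `tokenize` are hypothetical. The indices `begin` and `cp` play the roles of "beginning of lexeme" and "current char".

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenType { Id, Number, Relop, Other };

struct Token {
    TokenType type;
    std::string lexeme;
};

// Splits src into tokens. Keywords come out as Id; a later table lookup
// would separate them from ordinary identifiers.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t cp = 0;                                  // current character
    while (cp < src.size()) {
        unsigned char ch = static_cast<unsigned char>(src[cp]);
        if (std::isspace(ch)) { ++cp; continue; }        // whitespace only delimits
        std::size_t begin = cp;                          // beginning of lexeme
        TokenType type;
        if (std::isalpha(ch)) {
            // Stop at the first character past the identifier without consuming it.
            while (cp < src.size() &&
                   std::isalnum(static_cast<unsigned char>(src[cp]))) ++cp;
            type = TokenType::Id;
        } else if (std::isdigit(ch)) {
            while (cp < src.size() &&
                   std::isdigit(static_cast<unsigned char>(src[cp]))) ++cp;
            type = TokenType::Number;
        } else if (ch == '<' || ch == '>') {
            ++cp;
            if (cp < src.size() && src[cp] == '=') ++cp; // compound operator <= or >=
            type = TokenType::Relop;
        } else {
            ++cp;                                        // single-character token
            type = TokenType::Other;
        }
        assert(cp > begin);                              // every token is nonempty
        tokens.push_back({type, src.substr(begin, cp - begin)});
    }
    return tokens;
}
```

This version peeks at the next character before consuming it, so it never needs the "decrement cp" step; a table-driven lexer that reads one character too far would back up instead.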