Tasks of lexer. CISC 5920: Compiler Construction Chapter 2 Lexical Analysis. Tokens and lexemes. Buffering


CISC 5920: Compiler Construction
Chapter 2: Lexical Analysis
Arthur G. Werschulz
Fordham University, Department of Computer and Information Sciences
Spring, 2017
Copyright Arthur G. Werschulz, 2017. All rights reserved.

Tasks of lexer

- Scan source code and convert it into tokens. Example:
      if (big < x[i]) big = x[i];
  becomes
      KEYWORD LPAREN ID LT ID LBRACK ID RBRACK RPAREN ID GETS ID LBRACK ID RBRACK SEMI
- Remove comments
- Case conversion (where applicable)
- Remove white space (Fortran is a special case)
- Interpret compiler directives
- Communicate with symbol table
- Prepare output listing

Tokens and lexemes

Go back to
    if (big < x[i]) big = x[i];
For parsing purposes:
- All identifiers are the same.
- All relational operators are the same.
But eventually we must distinguish them.

Distinguish between tokens and lexemes:

    Token    Lexeme
    id       x, i
    relop    <
    (        (
    )        )
    if       if
    then     then
    =        =
    [        [
    ]        ]

Buffering

The lexer may need to back up when scanning input data. How?
- Read text chunks into a buffer array, divided in half:
      if (big < x[i ] big =
  After the first half has been processed (the buffer wraps around):
      x[i] ; ] big =
- Only refresh one half if you're really done with it: "for" may be the beginning of "fork". Typically you know you are done by seeing a separator (white space, punctuation).
- Need two indices for the buffer: beginning of lexeme, current char.
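The token stream in the example above can be reproduced with a few lines of code. The following is a minimal sketch, not the course's lexer: the function name `tokenize` is hypothetical, the token names are the ones from the example, and treating only `if` as a keyword is an assumption made to keep it short.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: turns source text into the token names used in the
// example (KEYWORD, ID, LPAREN, ...). Only "if" is treated as a keyword.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char ch = static_cast<unsigned char>(src[i]);
        if (std::isspace(ch)) { ++i; continue; }      // remove white space
        if (std::isalpha(ch)) {                       // identifier or keyword
            std::size_t start = i;                    // beginning of lexeme
            while (i < src.size() &&
                   std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            std::string lexeme = src.substr(start, i - start);
            tokens.push_back(lexeme == "if" ? "KEYWORD" : "ID");
            continue;
        }
        switch (src[i]) {                             // single-character tokens
            case '(': tokens.push_back("LPAREN");  break;
            case ')': tokens.push_back("RPAREN");  break;
            case '[': tokens.push_back("LBRACK");  break;
            case ']': tokens.push_back("RBRACK");  break;
            case '<': tokens.push_back("LT");      break;
            case '=': tokens.push_back("GETS");    break;
            case ';': tokens.push_back("SEMI");    break;
            default:  tokens.push_back("UNKNOWN"); break;
        }
        ++i;
    }
    return tokens;
}
```

Note that all three identifiers come out as the same token ID; the lexemes big, x and i would have to be carried along separately (for instance in the symbol table) to tell them apart later.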

Finite-state automata (FSAs)

Can use regular expressions to define lexemes, for instance:
- identifier: $l(l \mid d)^*$
- integer constant: $(\epsilon \mid + \mid -)dd^*$
- floating-point number: $(\epsilon \mid + \mid -)(0 \mid d \ldots).(0 \mid d \ldots$

Finite-state automaton: a system with a finite set of states and rules for state transition upon inputs.

Game plan: RE-based lexeme description → FSA to recognize lexemes → code for recognizing lexemes

State diagrams and state tables for FSAs

Example: FSA for a 10¢ candy machine
- Inputs: s (select), n (nickel), d (dime), q (quarter)
- States: 0 (no money), 1 (5¢), 2 (10¢), 3 (overpayment)
- Can draw a digraph (nodes are states, edges are transitions)
- Can represent via a state table:

    Current         Inputs
    state      n    d    q    s
      0        1    2    3    0
      1        2    3    3    1
      2        3    3    3    0*
      3        3    3    3    0*

    0*: give candy and change

How about an FSA for an identifier?

Formal definition of FSA

A (deterministic) finite state machine $M$ is given by a quintuple $M = (\Sigma, Q, q_0, F, N)$, where
- $\Sigma$ is a finite set (alphabet) of input symbols
- $Q$ is a finite set of states
- $q_0 \in Q$: start state
- $F \subseteq Q$: set of accepting or final states
- $N : Q \times \Sigma \to Q$ is the state-transition function

Interpretation: $N(q_i, x) = q_j$ means that if $M$ is in state $q_i$ and the current input is $x$, the next state is $q_j$.
Can represent $N$ by the state table (one row per state).

Formal definition of FSA (cont'd)

Example: $\Sigma = \{a, b\}$, $Q = \{1, 2, 3\}$, $q_0 = 1$, $F = \{3\}$

    N    a    b
    1    2    3
    2    1    2
    3    1    3

Example: $\Sigma = \{a, b\}$, $Q = \{1, 2, 3, 4\}$, $q_0 = 1$, $F = \{3\}$

    N    a    b
    1    2    3
    2    1    2
    3    1    3
    4    1    2

State 4 is unreachable (can remove without loss of generality).
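The quintuple maps directly onto data. As a rough sketch (the names `N`, `START`, `FINAL`, `run` and `accepts` are chosen here for illustration), the three-state example with $q_0 = 1$ and $F = \{3\}$ can be stored as its state table and run on an input string:

```cpp
#include <set>
#include <string>

// Transition table N for the three-state example: row s holds the entries
// for state s (row 0 is unused padding), columns are the inputs a and b.
const int N[4][2] = {{0, 0}, {2, 3}, {1, 2}, {1, 3}};
const int START = 1;               // q_0
const std::set<int> FINAL = {3};   // F

// Returns the state reached from START on input w, or -1 if w contains a
// symbol outside the alphabet {a, b}.
int run(const std::string& w) {
    int state = START;
    for (char ch : w) {
        if (ch != 'a' && ch != 'b') return -1;
        state = N[state][ch - 'a'];
    }
    return state;
}

// w is accepted if the run ends in a state of F.
bool accepts(const std::string& w) {
    return FINAL.count(run(w)) > 0;
}
```

For example, `run("abab")` visits 1, 2, 2, 1, 3 and so ends in the accepting state 3, while one more `a` leads back to state 1.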

Acceptance

$w = x_0 x_1 \dots x_{n-1} \in \Sigma^*$ is accepted by $M$ if we have
    $q_{i+1} = N(q_i, x_i) \quad (0 \le i \le n-1)$
with $q_n \in F$.

Example (previous FSA): Is abab accepted? Is ababa accepted?

$L(M) = \{\, w \in \Sigma^* : w \text{ is accepted by } M \,\}$ is the language accepted by $M$.
Machines $M_i$ and $M_j$ are equivalent if $L(M_i) = L(M_j)$.

Use FSA for recognizing tokens

Example: If there is one FSA per token, then
    $L(M) = \{\, \text{lexemes corresponding to } M\text{'s token} \,\}$

Acceptance (cont'd)

Coding an FSA? Suppose that

    char table[n_states][n_symbols];

and that

    int char_to_column(char ch);

gives the column number for character ch. Then

    state = 0;
    for (int i = 0; i < w.size(); i++)
        state = table[state][char_to_column(w[i])];

computes the state that M reaches for the input string w.

Non-deterministic finite-state automata

A nondeterministic finite state machine $M$ is given by a quintuple $M = (\Sigma, Q, q_0, F, N)$, where
- $\Sigma$ is a finite set (alphabet) of input symbols
- $Q$ is a finite set of states
- $q_0 \in Q$: start state
- $F \subseteq Q$: set of accepting or final states
- $N : Q \times (\Sigma \cup \{\epsilon\}) \to P(Q)$ is the nondeterministic state-transition function

NFAs (cont'd)

How does an NFA differ from a DFA?
- State can change without reading a character.
- Transition can be to a set of states (i.e., more than one state).

Why?
- It's easier to build an NFA to recognize REs than a DFA.
- It's straightforward to convert an NFA into a DFA.

Example: NFA with state table

         a        b
    1    {1,2}    {3}
    2    {1}      {2,3}
    3    {1,2}    {3}

Does it accept aab?
Only need one good path to accept, but all paths must be bad to reject.
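"One good path" can be checked without enumerating paths, by tracking the set of all states the NFA could be in after each character. The sketch below does this for the example table; the names `StateSet`, `reachable` and `accepts` are illustrative, and because the slide does not list the accepting states for this NFA, they are passed in as a parameter (start state 1 is likewise an assumption).

```cpp
#include <cassert>
#include <set>
#include <string>

using StateSet = std::set<int>;

// NFA table from the example: row (s - 1) is state s, columns are a and b.
const StateSet NFA[3][2] = {
    {{1, 2}, {3}},
    {{1}, {2, 3}},
    {{1, 2}, {3}},
};

// Set of states the NFA can be in after reading w from `start`.
StateSet reachable(const std::string& w, int start = 1) {
    StateSet current = {start};
    for (char ch : w) {
        assert(ch == 'a' || ch == 'b');
        StateSet next;
        for (int s : current) {
            const StateSet& targets = NFA[s - 1][ch - 'a'];
            next.insert(targets.begin(), targets.end());
        }
        current = next;
    }
    return current;
}

// Accept if at least one path ends in an accepting state.
bool accepts(const std::string& w, const StateSet& finals) {
    for (int s : reachable(w))
        if (finals.count(s)) return true;
    return false;
}
```

For aab the reachable sets are {1}, {1,2}, {1,2}, {2,3}, so the string is accepted whenever state 2 or state 3 is accepting.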

NFAs: ɛ-transitions

Example: Let $M$ have state table

         a        b        ɛ
    1    {1,2}    {3}      {2,3}
    2    {1}      {2,3}    {1}
    3    {1,2}    {3}      {}

M goes to accept-state 2 if no input!
Only need one good path to accept, but all paths must be bad to reject.

NFAs: Equivalence

- Transitions between states: unpredictable
- Transitions between sets of states: predictable

For any NFA $M$, we can construct a DFA $M'$ for which $L(M) = L(M')$.
The $M'$-states correspond to sets of $M$-states.
DFAs and NFAs accept the same languages.

Use? In our proposed workflow
    Tokens → REs → NFA → DFA
Since we're at the end of the chain, we continue working backwards through the chain...

The subset construction

Ken Thompson, AT&T Bell Labs

To start with, suppose that there are no ɛ-transitions.
Given: the NFA $M = (\Sigma, Q, q_0, F, N)$
Our DFA $M' = (\Sigma, Q', q_0', F', N')$, where
- $Q' = P(Q)$
- $q_0' = [q_0]$ (use brackets, not braces, for subsets in $M'$)
- $[q_{i_1}, \dots, q_{i_n}] \in F'$ iff $q_{i_j} \in F$ for some index $j$
- If $N(\{q_{i_1}, \dots, q_{i_n}\}, x) = \{q_{k_1}, \dots, q_{k_m}\}$, then $N'([q_{i_1}, \dots, q_{i_n}], x) = [q_{k_1}, \dots, q_{k_m}]$.

The subset construction (cont'd)

         a        b
    1    {1,2}    {3}
    2    {1}      {2,3}
    3    {1,2}    {3}

becomes

                 a        b
      [1]        [1,2]    [3]
    * [2]        [1]      [2,3]
      [3]        [1,2]    [3]
      [1,2]      [1,2]    [2,3]
    * [1,3]      [1,2]    [3]
      [2,3]      [1,2]    [2,3]
    * [1,2,3]    [1,2]    [2,3]
    * []         []       []

    (*: unreachable state)

Disadvantages:
- $|Q'| = 2^{|Q|}$
- unreachable states

Can we do better? Yes! Do one state at a time!
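The full power-set table above is easy to recompute mechanically. In the sketch below a set of NFA states is encoded as a bitmask (an encoding chosen here for brevity, not taken from the slides): bit s-1 stands for state s, so {1,2} is `0b011` and [1,3] is `0b101`. `step` computes $N'$ for any subset, and `reachable_subsets` finds which of the eight subsets can actually be reached from [1].

```cpp
#include <cassert>

// NFA without epsilon-transitions. A set of states is a bitmask in which
// bit (s - 1) stands for state s, so {1,2} is 0b011 and {3} is 0b100.
const unsigned NFA[3][2] = {
    {0b011, 0b100},  // state 1: a -> {1,2}, b -> {3}
    {0b001, 0b110},  // state 2: a -> {1},   b -> {2,3}
    {0b011, 0b100},  // state 3: a -> {1,2}, b -> {3}
};

// N'(subset, symbol): union of N(s, symbol) over all s in subset.
// symbol 0 is a, symbol 1 is b.
unsigned step(unsigned subset, int symbol) {
    assert(subset < 8 && (symbol == 0 || symbol == 1));
    unsigned result = 0;
    for (int s = 0; s < 3; ++s)
        if (subset & (1u << s)) result |= NFA[s][symbol];
    return result;
}

// Bitmask over the 8 subsets: bit m is set if subset m is reachable
// from the start state [1] (subset 0b001).
unsigned reachable_subsets() {
    unsigned seen = 1u << 0b001;
    for (bool changed = true; changed;) {
        changed = false;
        for (unsigned m = 0; m < 8; ++m) {
            if (!(seen & (1u << m))) continue;
            for (int sym = 0; sym < 2; ++sym) {
                unsigned bit = 1u << step(m, sym);
                if (!(seen & bit)) { seen |= bit; changed = true; }
            }
        }
    }
    return seen;
}
```

Only the subsets [1], [1,2], [3] and [2,3] turn out to be reachable, which is why half of the eight rows are wasted work.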

The subset construction (cont'd)

Algorithm:

    create start state [q_0]
    Q' = {[q_0]}
    while (∃ uncompleted row r in table for M') do
        x = [s_1, ..., s_k] = state for row r
        for a ∈ Σ do
            T = N({s_1, ..., s_k}, a)
            y = [T]
            if y ∉ Q' then
                Q' = Q' ∪ {y}
            add rule N'(x, a) = y to M'-transition rules
    identify accepting states in M'

Example:

         a        b
    1    {1,2}    {3}
    2    {1}      {2,3}
    3    {1,2}    {3}

becomes

             a        b
    [1]      [1,2]    [3]
    [1,2]    [1,2]    [2,3]
    [3]      [1,2]    [3]
    [2,3]    [1,2]    [2,3]

The subset construction (cont'd)

What about handling ɛ-transitions?
For $[q_0]$ and for each new $M'$-state, also include the ɛ-closure of the state, i.e., the set of all states reachable from said state via ɛ-transitions.

Revised algorithm:

    Q' = {ɛ-closure([q_0])}
    while (∃ uncompleted row r in table for M') do
        x = [s_1, ..., s_k] = state for row r
        for a ∈ Σ do
            T = N({s_1, ..., s_k}, a)
            y = ɛ-closure([T])
            if y ∉ Q' then
                Q' = Q' ∪ {y}
            add rule N'(x, a) = y to M'-transition rules
    identify accepting states in M'

Example:

         a        b        ɛ
    1    {1,2}    {3}      {2,3}
    2    {1}      {2,3}    {...}
    3    {1,2}    {3}      {1,2}

becomes

               a          b
    [1,2,3]    [1,2,3]    [1,2,3]
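Both versions of the algorithm fit in one routine if the ɛ-closure is applied everywhere (with no ɛ-transitions it changes nothing). The following is a sketch under stated assumptions: state sets are bitmasks (bit s-1 for state s), the names `Nfa`, `eps_closure`, `subset_construction` and `example_nfa` are illustrative, and the ɛ entry for state 2, which is elided on the slide, is taken to be empty.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>

using Mask = unsigned;  // bit (s - 1) stands for NFA state s

struct Nfa {
    std::vector<std::vector<Mask>> delta;  // delta[s - 1][symbol]
    std::vector<Mask> eps;                 // eps[s - 1]: epsilon targets
};

// All states reachable from `states` via epsilon-transitions only.
Mask eps_closure(const Nfa& nfa, Mask states) {
    for (bool changed = true; changed;) {
        changed = false;
        for (std::size_t s = 0; s < nfa.eps.size(); ++s) {
            if ((states & (1u << s)) && (nfa.eps[s] & ~states)) {
                states |= nfa.eps[s];
                changed = true;
            }
        }
    }
    return states;
}

// DFA table with one row per reachable subset and one column per symbol.
std::map<Mask, std::vector<Mask>> subset_construction(const Nfa& nfa, int start) {
    assert(start >= 1 && static_cast<std::size_t>(start) <= nfa.delta.size());
    std::map<Mask, std::vector<Mask>> dfa;
    std::queue<Mask> uncompleted;
    uncompleted.push(eps_closure(nfa, 1u << (start - 1)));
    while (!uncompleted.empty()) {
        Mask x = uncompleted.front();
        uncompleted.pop();
        if (dfa.count(x)) continue;  // this row is already complete
        std::vector<Mask> row;
        for (std::size_t a = 0; a < nfa.delta[0].size(); ++a) {
            Mask t = 0;
            for (std::size_t s = 0; s < nfa.delta.size(); ++s)
                if (x & (1u << s)) t |= nfa.delta[s][a];
            Mask y = eps_closure(nfa, t);
            row.push_back(y);
            if (!dfa.count(y)) uncompleted.push(y);
        }
        dfa[x] = row;
    }
    return dfa;
}

// The three-state example. With with_eps, state 1 has epsilon targets {2,3}
// and state 3 has {1,2}; state 2's entry is assumed empty.
Nfa example_nfa(bool with_eps) {
    Nfa nfa;
    nfa.delta = {{0b011, 0b100}, {0b001, 0b110}, {0b011, 0b100}};
    nfa.eps = with_eps ? std::vector<Mask>{0b110, 0b000, 0b011}
                       : std::vector<Mask>{0, 0, 0};
    return nfa;
}
```

Calling `subset_construction(example_nfa(false), 1)` yields the four rows [1], [1,2], [3], [2,3], and with the ɛ column the single state [1,2,3], matching the two tables above.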

Regular expressions

- Lexing, parsing: use tables, rather than customized code
- Building a DFA by hand: difficult, error-prone
- Need a mechanism:
  - How to represent a token? Regular expressions
  - Program to turn representation into DFA? Unix lex: regexp → NFA → DFA

Examples of regular expressions
- $b^4 = bbbb$
- $a^n$: $n$ instances of $a$ in a row
- $a^*$: any concatenation of $a$'s (the Kleene star operation)
- $b^+$: any nonempty concatenation of $b$'s
- $ab \mid bc$: $ab$ or $bc$

Regular expressions (cont'd)

Formal definition: Let $\Sigma$ be an alphabet. Then:
- $x \in \Sigma \implies x$ is a regular expression.
- $\epsilon$ is a regular expression.
- If $R$ is a regular expression, then $R^*$ is a regular expression.
- If $R$ and $S$ are regular expressions, then $RS$ (sometimes written $R \cdot S$) and $R \mid S$ are regular expressions.
- Nothing else is a regular expression.

Here:
- $L(RS) = \{\, vw : v \in L(R),\ w \in L(S) \,\}$
- $L(R \mid S) = L(R) \cup L(S)$
- $L(R^*) = \{\, w_1 \dots w_n : n \ge 0 \text{ and } w_1, \dots, w_n \in L(R) \,\}$
- $L(R^+) = \{\, w_1 \dots w_n : n \ge 1 \text{ and } w_1, \dots, w_n \in L(R) \,\}$

Regular expressions (cont'd)

Example: Let $\Sigma = \{a, b, \dots, z\}$.
- $a \in \Sigma \implies a$ is a regular expression.
- $b$ and $c$ are regular expressions.
- $ab$ is a regular expression.
- $(ab \mid b)$ is a regular expression.
- $(ab \mid b)^*$ is a regular expression.

Regular expressions (cont'd)

Regular expressions $R$ and $S$ are equivalent if $L(R) = L(S)$.
    $L(R) = L(S) \iff L(R) \subseteq L(S) \wedge L(S) \subseteq L(R)$.

Useful equivalences:
- $R(ST) = (RS)T$
- $R \mid (S \mid T) = (R \mid S) \mid T$
- $R \mid S = S \mid R$
- $R(S \mid T) = RS \mid RT$
- $R^+ = RR^*$
- $R^* = \epsilon \mid R^+$
- $R\epsilon = \epsilon R = R$
- but $RS \ne SR$!

If $L$ is a language such that $L(E) = L$ for a regular expression $E$, then $L$ is a regular language.
Not all languages are regular! However, tokens are regular.
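The equivalences can be spot-checked by brute force: enumerate every string up to some length and compare what two expressions match. The sketch below uses the standard library's `std::regex` (ECMAScript syntax, where `|` is alternation) as the matcher; `all_strings` and `agree_up_to` are names made up for this illustration. Agreement up to a bounded length is only evidence of equivalence, not a proof.

```cpp
#include <regex>
#include <string>
#include <vector>

// All strings over `alphabet` of length at most max_len, shortest first.
std::vector<std::string> all_strings(const std::string& alphabet, int max_len) {
    std::vector<std::string> result = {""};
    std::size_t begin = 0;
    for (int len = 1; len <= max_len; ++len) {
        std::size_t end = result.size();
        for (std::size_t i = begin; i < end; ++i)
            for (char ch : alphabet) result.push_back(result[i] + ch);
        begin = end;
    }
    return result;
}

// True if r and s match exactly the same strings of length <= max_len.
bool agree_up_to(const std::string& r, const std::string& s,
                 const std::string& alphabet, int max_len) {
    const std::regex re_r(r), re_s(s);
    for (const std::string& w : all_strings(alphabet, max_len))
        if (std::regex_match(w, re_r) != std::regex_match(w, re_s))
            return false;
    return true;
}
```

For instance, `agree_up_to("(ab)+", "(ab)(ab)*", "ab", 6)` checks $R^+ = RR^*$ for $R = ab$, while `agree_up_to("ab", "ba", "ab", 3)` fails, illustrating $RS \ne SR$.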

Regular expressions and finite automata

Regular expressions are defined inductively. Thompson's construction inductively builds an NFA to recognize regular expressions, with one building block for each case:
- recognizes $\epsilon$
- recognizes $a \in \Sigma$
- recognizes $R \mid S$
- recognizes $RS$
- recognizes $R^+$
- recognizes $R^*$

Exercise: Design an NFA to recognize $a(a \mid bc)^*$.

Pumping lemma

Theorem. Let $M = (\Sigma, Q, q_0, F, N)$ be a finite automaton, with $n = |Q|$. Then there exists $k \le n$ such that the following holds: If $w \in L(M)$ with $|w| \ge k$, then there exist $x, y, z \in \Sigma^*$, with $y \ne \epsilon$, such that $w = xyz$ and $xy^*z \subseteq L(M)$.

Proof. Let $w = w_1 \dots w_n$. There exist states $q_0, \dots, q_n \in Q$ such that
    $N(q_i, w_{i+1}) = q_{i+1} \quad (0 \le i \le n-1)$.
Since $|Q| = n$, there exist $i, j \in \{0, \dots, n\}$ such that $i < j$ and $q_i = q_j$.

Pumping lemma (cont'd)

Proof (cont'd). Let
    $x = w_1 \dots w_i, \quad y = w_{i+1} \dots w_j, \quad z = w_{j+1} \dots w_n$.
Then $y \ne \epsilon$ and $M$ accepts $xy^*z$.
Now let $S$ be the set of all positive $k \in \mathbb{Z}$ such that
    $w \in L(M) \wedge |w| \ge k \implies \exists\, x, y, z \in \Sigma^*$ with $y \ne \epsilon$ such that $w = xyz \wedge xy^*z \subseteq L(M)$.
$S$ is nonempty since $n \in S$. Hence $S$ has a minimal element $k$.

Pumping lemma (cont'd)

Example: Let $\Sigma = \{\texttt{(}, \texttt{)}\}$ and let $L_{()} = \{\, \texttt{(}^k \texttt{)}^k : k \in \mathbb{N} \,\}$. We show that $L_{()}$ is not regular.

Proof. Suppose that $L_{()} = L(M)$ for some FSA $M$. Let $n = |Q|$. Consider $\texttt{(}^k \texttt{)}^k$ for some $k \ge n$. Write $\texttt{(}^k \texttt{)}^k = xyz$ where $y \ne \epsilon$ and $xy^*z \subseteq L_{()}$. Regardless of whether $y \in \texttt{(}^+$, $y \in \texttt{)}^+$, or $y = \texttt{(}^l \texttt{)}^m$ for some $l, m > 0$, one can find strings in $xy^*z$ that are unbalanced, contradicting our assumption that $L_{()} = L(M)$.

Conclusion: Lexer can't check for balanced parentheses!
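The proof is constructive: run the machine, wait for a state to repeat, and cut the input at the two visits. The sketch below does exactly that on the three-state DFA from the formal-definition example ($q_0 = 1$, $F = \{3\}$); the names `Split`, `pump_split` and `pumped` are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <string>

// The three-state DFA used earlier: states 1..3, inputs a and b, F = {3}.
const int N[3][2] = {{2, 3}, {1, 2}, {1, 3}};

int step(int state, char ch) {
    assert(ch == 'a' || ch == 'b');
    return N[state - 1][ch - 'a'];
}

bool accepts(const std::string& w) {
    int state = 1;
    for (char ch : w) state = step(state, ch);
    return state == 3;
}

struct Split { std::string x, y, z; };

// Follows the proof: record q_0, q_1, ... and stop at the first repeated
// state q_i = q_j (i < j). Requires |w| >= number of states (3 here).
Split pump_split(const std::string& w) {
    assert(w.size() >= 3);
    std::map<int, std::size_t> first_seen;  // state -> index of first visit
    int state = 1;
    first_seen[state] = 0;
    for (std::size_t j = 1; j <= w.size(); ++j) {
        state = step(state, w[j - 1]);
        auto it = first_seen.find(state);
        if (it != first_seen.end()) {
            std::size_t i = it->second;
            return {w.substr(0, i), w.substr(i, j - i), w.substr(j)};
        }
        first_seen[state] = j;
    }
    return {w, "", ""};  // not reached when |w| >= 3 (pigeonhole)
}

// Builds x y^k z.
std::string pumped(const Split& s, int k) {
    std::string out = s.x;
    for (int i = 0; i < k; ++i) out += s.y;
    return out + s.z;
}
```

On the accepted string abab the run visits 1, 2, 2, so the split is x = a, y = b, z = ab, and every string $a\,b^k\,ab$ is accepted as well.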

Application to lexical analysis

Let $\Sigma = \{a, b, c\}$. Suppose there are exactly two tokens: $X = ab^*$ and $Y = (a \mid c)^*$.
To build a DFA recognizing $X \mid Y$:
1. Build NFA for $X$.
2. Build NFA for $Y$.
3. Build NFA $M$ for $X \mid Y$.

Application to lexical analysis (cont'd)

[Figure: NFA for $X = ab^*$]
[Figure: NFA for $Y = (a \mid c)^*$]

Application to lexical analysis (cont'd)

[Figure: NFA for $X \mid Y = (ab^*) \mid (a \mid c)^*$]

Application to lexical analysis (cont'd)

We want to build the DFA $M'$ corresponding to the NFA $M$:
- Let $N$ and $N'$ denote the transition functions for $M$ and $M'$.
- $T$ is the start state for $M$.
- $[T] = [AGHIKNTU]$ is the start state for $M'$.

                                  Inputs
    State           a               b          c
    [AGHIKNTU]      [BCFHIJKMNU]    [ ]        [HIKLMNU]
    [BCFHIJKMNU]    [HIJKMNU]       [CDEFU]    [HIKLMNU]
    [ ]             [ ]             [ ]        [ ]
    [HIKLMNU]       [HIJKMNU]       [ ]        [HIKLMNU]
    [HIJKMNU]       [HIJKMNU]       [ ]        [HIKLMNU]
    [CDEFU]         [ ]             [CDEFU]    [ ]
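One way to gain confidence in a hand-built table like this is to compare it against an independent matcher on every short string. In the sketch below the six rows are numbered 1 to 6 in the order they appear in the table, and every state except the empty set [ ] is assumed to be accepting (each of the others contains the NFA state U); `std::regex` with the pattern `ab*|(a|c)*` serves as the reference. The function names are illustrative.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Rows in table order: 1 = [AGHIKNTU], 2 = [BCFHIJKMNU], 3 = [ ],
// 4 = [HIKLMNU], 5 = [HIJKMNU], 6 = [CDEFU]. Columns are a, b, c.
const int TABLE[6][3] = {
    {2, 3, 4}, {5, 6, 4}, {3, 3, 3}, {5, 3, 4}, {5, 3, 4}, {3, 6, 3},
};

// Assumption: every state except the empty set (state 3) is accepting.
bool dfa_accepts(const std::string& w) {
    int state = 1;
    for (char ch : w) {
        assert(ch >= 'a' && ch <= 'c');
        state = TABLE[state - 1][ch - 'a'];
    }
    return state != 3;
}

// Reference answer from the standard library's regex engine.
bool regex_accepts(const std::string& w) {
    static const std::regex re("ab*|(a|c)*");
    return std::regex_match(w, re);
}

// Compares both recognizers on every string over {a,b,c} that extends
// `prefix` and has length at most max_len.
bool agree_up_to(int max_len, const std::string& prefix = "") {
    if (dfa_accepts(prefix) != regex_accepts(prefix)) return false;
    if (static_cast<int>(prefix.size()) == max_len) return true;
    for (char ch : std::string("abc"))
        if (!agree_up_to(max_len, prefix + ch)) return false;
    return true;
}
```

`agree_up_to(5)` walks all 364 strings of length at most 5; a disagreement would point to a wrong table entry.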

Application to lexical analysis (cont'd)

Relabeling, we get:

              Inputs
    State    a    b    c
      1      2    3    4
      2      5    6    4
      3      3    3    3
      4      5    3    4
      5      5    3    4
      6      3    6    3

State minimization

Use state minimization techniques (Appendix A) to remove equivalent states.
- Two states $q_i, q_j \in Q$ are equivalent if $\exists$ language $L$ and two states $q_m, q_n \in F$ such that $L : q_i \to q_m$ and $L : q_j \to q_n$; that is, exactly the same strings lead from $q_i$ and from $q_j$ to an accepting state.
- Two non-equivalent states are distinguishable.
- Reduce the machine to one for which all state pairs are distinguishable.

1. Initially assume all state pairs indistinguishable (until proven otherwise).
2. Only look at single input symbols:
   - $\exists a \in \Sigma,\ N(q_i, a) \in F \wedge N(q_j, a) \notin F \implies q_i \not\equiv q_j$
   - $\exists a \in \Sigma,\ N(q_i, a) \not\equiv N(q_j, a) \implies q_i \not\equiv q_j$

Distinguishability matrix

$D$: $(n-1) \times (n-1)$ upper triangular bit matrix: $d_{i,j} = 1$ iff $q_i \not\equiv q_j$. Here, $n = |Q|$.
- Start with at least one pair of distinguishable states, say, $q_i \in F$ and $q_j \notin F$; we set $d_{i,j} = 1$ for same.
- Consider all unmarked entries. Suppose the states are $p$ and $q$. Then check, for each $a \in \Sigma$, whether $N(p, a) \in F \iff N(q, a) \in F$.
- So look at the p-row and q-row in the state table:
  (a) Rows are identical or equivalent: leave entry unmarked.
  (b) Rows differ by known distinguishable states: states are distinguishable.
  (c) Rows differ by states whose distinguishability is unknown: don't know yet.

More on case (c): Suppose we have

    p    r    t    v
    q    s    u    w

where we know nothing about $\{(r, s), (t, u), (v, w)\}$.
- Distinguishability of $(p, q)$ depends on what we later learn about these pairs.
- Put $(p, q)$ on a list linked to each pair.
- Once we find a pair to be non-equivalent, mark each pair on that list as also being non-equivalent.

The algorithm terminates when all entries are checked:
- If the row or column for $q$ contains a zero, we've found a state equivalent to $q$.
- The equivalence classes of states are the states of the new machine.
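The marking procedure can be sketched compactly if the linked lists for case (c) are replaced by something simpler: just repeat the single-symbol check over all unmarked pairs until a full pass marks nothing new. That substitution is a choice made for this sketch (it reaches the same final matrix but is less efficient than the linked lists described above). State 3, the empty set, is assumed to be the only non-accepting state, and the function names are illustrative.

```cpp
#include <utility>
#include <vector>

constexpr int STATES = 6, SYMBOLS = 3;

// The relabeled DFA: row (s - 1) is state s, columns are a, b, c.
const int TABLE[STATES][SYMBOLS] = {
    {2, 3, 4}, {5, 6, 4}, {3, 3, 3}, {5, 3, 4}, {5, 3, 4}, {3, 6, 3},
};

// Assumption: state 3 (the empty set) is the only non-accepting state.
bool is_final(int state) { return state != 3; }

// Returns all pairs (p, q), p < q, that are never marked distinguishable.
std::vector<std::pair<int, int>> equivalent_pairs() {
    bool d[STATES + 1][STATES + 1] = {};  // d[p][q]: p and q distinguishable
    for (int p = 1; p <= STATES; ++p)
        for (int q = p + 1; q <= STATES; ++q)
            d[p][q] = d[q][p] = (is_final(p) != is_final(q));

    // Repeat the single-symbol check until no new pair gets marked.
    for (bool changed = true; changed;) {
        changed = false;
        for (int p = 1; p <= STATES; ++p)
            for (int q = p + 1; q <= STATES; ++q) {
                if (d[p][q]) continue;
                for (int a = 0; a < SYMBOLS; ++a) {
                    int r = TABLE[p - 1][a], s = TABLE[q - 1][a];
                    if (d[r][s]) {
                        d[p][q] = d[q][p] = true;
                        changed = true;
                        break;
                    }
                }
            }
    }

    std::vector<std::pair<int, int>> result;
    for (int p = 1; p <= STATES; ++p)
        for (int q = p + 1; q <= STATES; ++q)
            if (!d[p][q]) result.push_back(std::make_pair(p, q));
    return result;
}
```

The pair (1, 4) shows why more than one pass is needed: it depends on (2, 5), which is only marked later in the first pass.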

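Go back to original problem

              Inputs
    State    a    b    c
      1      2    3    4
      2      5    6    4
      3      3    3    3
      4      5    3    4
      5      5    3    4
      6      3    6    3

Then the distinguishability matrix starts as (entry $d_{i,j}$ in row $i$, column $j$):

         2    3    4    5    6
    1    0    1    0    0    0
    2         1    0    0    0

Compare states 1 and 2:

    1    2    3    4
    2    5    6    4

Since $3 \not\equiv 6 \implies 1 \not\equiv 2$, we now have

         2    3    4    5    6
    1    1    1    0    0    0
    2         1    0    0    0

Compare states 1 and 4:

    1    2    3    4
    4    5    3    4

Since $2 \overset{?}{\equiv} 5$ is not yet known, we have the list

    (2, 5) → (1, 4)

Compare states 1 and 5:

    1    2    3    4
    5    5    3    4

Since $2 \overset{?}{\equiv} 5$ is still not known, we also have

    (2, 5) → (1, 4) → (1, 5)

Compare states 1 and 6:

    1    2    3    4
    6    3    6    3

So $3 \not\equiv 6 \implies 1 \not\equiv 6$. Hence

         2    3    4    5    6
    1    1    1    0    0    1
    2         1    0    0    0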

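Compare states 2 and 4:

    2    5    6    4
    4    5    3    4

So $6 \not\equiv 3 \implies 2 \not\equiv 4$. Hence

         2    3    4    5    6
    1    1    1    0    0    1
    2         1    1    0    0

Compare states 2 and 5:

    2    5    6    4
    5    5    3    4

So $6 \not\equiv 3 \implies 2 \not\equiv 5$. Since (2, 5) → (1, 4) → (1, 5), we now have $1 \not\equiv 4$ and $1 \not\equiv 5$. Hence

         2    3    4    5    6
    1    1    1    1    1    1
    2         1    1    1    0

Compare states 2 and 6:

    2    5    6    4
    6    3    6    3

So $5 \not\equiv 3 \implies 2 \not\equiv 6$, and we now have

         2    3    4    5    6
    1    1    1    1    1    1
    2         1    1    1    1

Compare states 4 and 5:

    4    5    3    4
    5    5    3    4

So $4 \equiv 5$, and so we now have

         2    3    4    5    6
    1    1    1    1    1    1
    2         1    1    1    1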

Examining the remaining unmarked pairs (4, 6) and (5, 6) in like manner, we find that $4 \not\equiv 6$ and $5 \not\equiv 6$. So we finally have

         2    3    4    5    6
    1    1    1    1    1    1
    2         1    1    1    1
    4                   0    1
    5                        1

We can delete either state 4 or state 5. Let's delete state 5, leaving us with states 1, 2, 3, 4, and 6:

              Inputs
    State    a    b    c
      1      2    3    4
      2      4    6    4
      3      3    3    3
      4      4    3    4
      6      3    6    3

Finally, relabel state 6 as state 5, getting

              Inputs
    State    a    b    c
      1      2    3    4
      2      4    5    4
      3      3    3    3
      4      4    3    4
      5      3    5    3

Recognizing tokens

Modifications to the basic FSA for lexing program source:
1. Ignore whitespace, except when it delimits a token. Use an extra state for this.
2. Whenever we reach an accepting state, announce a token. Don't enter the accepting state until the entire token is read.
3. Exactly one accepting state per token; the state identifies the token type.
4. Treat keywords as identifiers, but do a table lookup in the symbol table.

Other considerations:
- Some tokens are prefixes of others.
- Can't recognize an identifier until past its end. May need to back up one character.
- Comments: may have multi-character delimiters (for instance, /* ... */ or // ...).
- Quotation marks within quotes (\" vs. "").

Stripped-down Pascal lexer

Need to identify the following tokens:
- Identifiers, constants, labels
- Keywords (such as for).
- Simple operators (such as <).
- Compound operators (such as <=). Multi-character tokens whose prefix is a token.
- Comment syntax. Turbo Pascal allows { ... } as well as (* ... *).
- Compiler directives (analogous to #include).
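Item 4 above (treat keywords as identifiers, then do a table lookup) keeps the automaton small: the FSA only needs an identifier path, and a lookup afterwards decides the token type. A minimal sketch follows, with a hypothetical `classify` function and a deliberately incomplete keyword set; the lookup is made case-insensitive because Pascal does not distinguish case in keywords.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Illustrative subset of keywords; a real table would list them all.
const std::set<std::string> KEYWORDS = {"begin", "end", "for", "if", "then"};

// Lower-cases a copy of s, so that the lookup ignores case.
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// The FSA has already accepted `lexeme` as an identifier; the table lookup
// decides afterwards whether it is really a keyword.
std::string classify(const std::string& lexeme) {
    return KEYWORDS.count(to_lower(lexeme)) ? "KEYWORD" : "ID";
}
```

Because the lookup happens only once the whole lexeme has been isolated, `classify("fork")` yields ID even though the lexeme starts with the keyword for.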

Stripped-down Pascal lexer (cont'd)

[Figures: two slides of diagrams for the stripped-down Pascal lexer]

Coding an FSA

Basic outline of pseudocode:

    repeat
        get next input char
        find new state table entry
        if (new state is final for some token) then
        begin
            isolate token
            pass to parser
            decrement cp if necessary
        end
    until no more input
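The outline above can be fleshed out for a tiny token set to show where "decrement cp if necessary" comes in. The sketch below is an illustration under several assumptions: it recognizes only identifiers, `<` and `<=`; a `switch` on the state stands in for the state-table lookup; tokens are collected in a vector instead of being passed to a parser; and the names `Token` and `lex` are made up. The index `cp` plays the role of the current-character pointer and `begin` marks the beginning of the lexeme.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Token = std::pair<std::string, std::string>;  // (token name, lexeme)

std::vector<Token> lex(const std::string& input) {
    enum State { START, IN_ID, SEEN_LT };
    std::vector<Token> out;
    State state = START;
    std::size_t begin = 0;  // beginning of the current lexeme
    // One extra iteration with '\0' flushes a token that ends at end of input.
    for (std::size_t cp = 0; cp <= input.size(); ++cp) {
        char ch = cp < input.size() ? input[cp] : '\0';
        bool alnum = std::isalnum(static_cast<unsigned char>(ch)) != 0;
        switch (state) {
            case START:
                begin = cp;
                if (std::isalpha(static_cast<unsigned char>(ch))) state = IN_ID;
                else if (ch == '<') state = SEEN_LT;
                // anything else (white space, end marker) is skipped
                break;
            case IN_ID:
                if (!alnum) {  // we are one character past the identifier
                    out.push_back({"ID", input.substr(begin, cp - begin)});
                    state = START;
                    --cp;      // back up: this character is scanned again
                }
                break;
            case SEEN_LT:
                if (ch == '=') {
                    out.push_back({"LE", "<="});
                } else {
                    out.push_back({"LT", "<"});
                    --cp;      // back up one character
                }
                state = START;
                break;
        }
    }
    return out;
}
```

Both back-up cases arise because a token's end is only detected by reading one character too many: an identifier ends at the first non-alphanumeric character, and `<` is only known not to be `<=` once the next character has been seen.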