Lexical Analysis: DFA Minimization & Wrap Up


Automating Scanner Construction

PREVIOUSLY:
- RE → NFA (Thompson's construction): build an NFA for each term, combine them with ε-moves
- NFA → DFA (subset construction): build the simulation

TODAY:
- DFA → minimal DFA (Hopcroft's algorithm)
- DFA → RE (not really part of scanner construction): an all-pairs, all-paths problem; union together the paths from s_0 to a final state

The Cycle of Constructions: RE → NFA → DFA → minimal DFA


DFA Minimization: The Big Picture
Discover sets of equivalent states, and represent each such set with just one state.
Two states are equivalent if and only if:
- the sets of paths leading to them are equivalent, and
- for all α ∈ Σ, their transitions on α lead to equivalent states (in a DFA, each state has at most one α-transition)
States with α-transitions to distinct sets must themselves be in distinct sets.
A partition P of S: each s ∈ S is in exactly one set p_i ∈ P. The algorithm iteratively partitions the DFA's states.

DFA Minimization: Details of the Algorithm
- Group states into maximal-size sets, optimistically
- Iteratively subdivide those sets, as needed
- States that remain grouped together are equivalent
The initial partition, P_0, has two sets: {D_F} and {D − D_F} (for the DFA D = (Q, Σ, δ, q_0, F), where D_F is its set of final states).
Splitting a set ("partitioning a set by a"):
- Assume q_i and q_j ∈ s, with δ(q_i, a) = q_x and δ(q_j, a) = q_y
- If q_x and q_y are not in the same set, then s must be split
- If q_i has a transition on a and q_j does not, then a splits s
- One state in the final DFA cannot have two transitions on a

DFA Minimization: The Algorithm

P ← { {D_F}, {D − D_F} }
while (P is still changing)
    T ← Ø
    for each set p ∈ P
        T ← T ∪ Split(p)
    P ← T

Split(S)
    for each α ∈ Σ
        if α splits S into s_1 and s_2
            then return { s_1, s_2 }
    return { S }

Why does this work?
- A partition P is a subset of 2^D
- The algorithm starts with 2 subsets of D: {D_F} and {D − D_F}
- Each iteration of the while loop takes P_i to P_{i+1} by splitting one or more sets, so P_{i+1} is at least one step closer to the partition with |D| sets
- At most |D| splits can occur, and partitions are never combined
- The initial partition ensures that final states stay separate from non-final states
This is a fixed-point algorithm!
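The partition-refinement loop above can be sketched in a few lines of Python. This is a minimal, quadratic version of the idea (Moore-style refinement rather than Hopcroft's worklist optimization), and the example DFA at the bottom is the one for a(b|c)* used later in the lecture, with an explicit error state se added as an assumption:

```python
def minimize(states, alphabet, delta, finals):
    """Iteratively split {D_F, D - D_F} until no set can be split further."""
    partition = [set(finals), set(states) - set(finals)]
    partition = [p for p in partition if p]          # drop an empty block
    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # group states by which block each character sends them to
            groups = {}
            for s in block:
                key = tuple(
                    next(i for i, b in enumerate(partition) if delta[s][a] in b)
                    for a in alphabet
                )
                groups.setdefault(key, set()).add(s)
            new_partition.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = new_partition
    return partition

# DFA for a(b|c)*; se is an assumed explicit error state.
states = ['s0', 's1', 's2', 's3', 'se']
delta = {
    's0': {'a': 's1', 'b': 'se', 'c': 'se'},
    's1': {'a': 'se', 'b': 's2', 'c': 's3'},
    's2': {'a': 'se', 'b': 's2', 'c': 's3'},
    's3': {'a': 'se', 'b': 's2', 'c': 's3'},
    'se': {'a': 'se', 'b': 'se', 'c': 'se'},
}
parts = minimize(states, 'abc', delta, finals={'s1', 's2', 's3'})
print(sorted(sorted(p) for p in parts))   # [['s0'], ['s1', 's2', 's3'], ['se']]
```

Note that the algorithm never recombines blocks: each pass only subdivides, which is why it must reach a fixed point after at most |D| splits.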

Key Idea: Splitting S around α
[Figure: the original set S, with α-transitions from S to the sets R, Q, and T]
S has transitions on α to R, Q, and T. The algorithm partitions S around α.

Key Idea: Splitting S around α
[Figure: S split into S_1, whose α-transitions lead to T, and S_2 = S − S_1, whose α-transitions lead to R and Q]
Could we split S_2 further? Yes, but it does not help asymptotically.

DFA Minimization: What about a (b | c)*?
[NFA from Thompson's construction: q_0 →a→ q_1, followed by a Kleene star over (b | c) built from states q_2 … q_9]
First, the subset construction computes ε-closure(move(s, *)):

  DFA state   NFA states                        a      b      c
  s_0         q_0                               s_1    none   none
  s_1         q_1, q_2, q_3, q_4, q_6, q_9      none   s_2    s_3
  s_2         q_5, q_8, q_9, q_3, q_4, q_6      none   s_2    s_3
  s_3         q_7, q_8, q_9, q_3, q_4, q_6      none   s_2    s_3

The resulting DFA: s_0 →a→ s_1, with s_1, s_2, and s_3 cycling among themselves on b and c. s_1, s_2, and s_3 are the final states.
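The table above can be reproduced mechanically. Here is a sketch of the subset construction in Python, using the Thompson NFA for a(b|c)* with the state numbering from the slide (the exact ε-edge list is my reconstruction of that figure):

```python
from collections import deque

def eps_closure(states, eps):
    """All NFA states reachable from `states` via ε-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start, delta, eps, alphabet):
    """Return the DFA transition table, keyed by frozensets of NFA states."""
    s0 = eps_closure({start}, eps)
    dfa, work = {s0: {}}, deque([s0])
    while work:
        S = work.popleft()
        for a in alphabet:
            moved = {t for s in S for t in delta.get((s, a), ())}
            if not moved:
                continue                      # no transition on a (error)
            T = eps_closure(moved, eps)
            if T not in dfa:
                dfa[T] = {}
                work.append(T)
            dfa[S][a] = T
    return dfa

# Thompson NFA for a(b|c)*: q0 -a-> q1; star entry q2, exit q9;
# alternation entry q3 with branches q4 -b-> q5 and q6 -c-> q7, merging at q8.
delta = {(0, 'a'): {1}, (4, 'b'): {5}, (6, 'c'): {7}}
eps = {1: {2}, 2: {3, 9}, 3: {4, 6}, 5: {8}, 7: {8}, 8: {9, 3}}

dfa = subset_construction(0, delta, eps, 'abc')
finals = [S for S in dfa if 9 in S]   # DFA states containing accepting state q9
print(len(dfa), len(finals))          # 4 2  -- no: 4 DFA states, 3 of them final
```

Running this yields the four DFA states s_0 … s_3 from the table, three of them final, matching the slide.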

DFA Minimization: Then, apply the minimization algorithm

  Current partition              Split on a   Split on b   Split on c
  P_0   {s_1, s_2, s_3}  {s_0}   none         none         none

No set splits, so s_1, s_2, and s_3 collapse into one state, producing the minimal DFA:
s_0 →a→ s_1, with s_1 looping on b and c (s_1 final).
In lecture 4, we observed that a human would design a simpler automaton than Thompson's construction and the subset construction did. Minimizing that DFA produces the one that a human would design!

Abbreviated Register Specification
Start with a regular expression:
r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9
The Cycle of Constructions: RE → NFA → DFA → minimal DFA

Abbreviated Register Specification
Thompson's construction produces an NFA: a start state s_0 with ε-branches to one path per alternative r0 … r9, all joining at a final state s_f.
(To make it fit, we've eliminated the ε-transition between r and 0…9.)
The Cycle of Constructions: RE → NFA → DFA → minimal DFA

Abbreviated Register Specification
The subset construction builds a DFA: s_0 →r→ an intermediate state, then one transition per digit 0…9 to separate final states s_f0 … s_f9.
This is a DFA, but it has a lot of states.
The Cycle of Constructions: RE → NFA → DFA → minimal DFA

Abbreviated Register Specification
The DFA minimization algorithm builds: s_0 →r→ s_1 →0,1,2,3,4,5,6,7,8,9→ s_f
This looks like what a skilled compiler writer would do!
The Cycle of Constructions: RE → NFA → DFA → minimal DFA

Limits of Regular Languages
Advantages of regular expressions:
- Simple & powerful notation for specifying patterns
- Automatic construction of fast recognizers
- Many kinds of syntax can be specified with REs
Example: an expression grammar
  Term → [a-zA-Z] ([a-zA-Z] | [0-9])*
  Op → + | - | * | /
  Expr → (Term Op)* Term
Of course, this would generate a DFA.
If REs are so useful, why not use them for everything?
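The expression "grammar" above really is a single regular expression, which is why it generates a DFA. A quick sketch with Python's re module (the token classes are from the slide; the operator set + - * / is my reading of the transcript):

```python
import re

# Term -> [a-zA-Z] ([a-zA-Z] | [0-9])*
TERM = r'[a-zA-Z][a-zA-Z0-9]*'
# Op -> + | - | * | /
OP = r'[+\-*/]'
# Expr -> (Term Op)* Term
EXPR = rf'({TERM}{OP})*{TERM}'

print(bool(re.fullmatch(EXPR, 'x1+y/z2')))   # True
print(bool(re.fullmatch(EXPR, 'x+')))        # False: trailing operator
```

The library compiles EXPR to an automaton once, then recognizes each input in a single left-to-right pass, exactly the property the slide is advertising.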

Limits of Regular Languages
Not all languages are regular: RLs ⊂ CFLs ⊂ CSLs.
You cannot construct DFAs to recognize these languages:
- L = { p^k q^k } (the parenthesis language)
- L = { wcw^r | w ∈ Σ* }
Neither of these is a regular language (nor is there an RE for it). But this is a little subtle. You can construct DFAs for:
- Strings with alternating 0s and 1s: (ε | 1)(01)*(ε | 0)
- Strings with an even number of 0s and 1s
REs can count bounded sets and bounded differences.
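"Even number of 0s and 1s" is regular precisely because the counting is bounded: a DFA only has to remember two parity bits, i.e. four states. A minimal sketch (the function name and the two-flag encoding of the four states are my own):

```python
def accepts_even_0s_and_1s(w):
    """Simulate the 4-state DFA whose state is the pair (parity of 0s, parity of 1s)."""
    even0, even1 = True, True          # start state: (even, even), which is accepting
    for ch in w:
        if ch == '0':
            even0 = not even0
        elif ch == '1':
            even1 = not even1
    return even0 and even1             # accept only back in the (even, even) state

print(accepts_even_0s_and_1s('0110'))  # True: two 0s, two 1s
print(accepts_even_0s_and_1s('010'))   # False: odd number of 1s
```

Contrast this with p^k q^k, where the count k is unbounded, so no fixed number of states can track it.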

What can be so hard?
Poor language design can complicate scanning.
- Reserved words are important (PL/I):
  if then then then = else; else else = then
- Insignificant blanks (Fortran & Algol 68):
  do 10 i = 1,25
  do 10 i = 1.25
- String constants with special characters (C, C++, Java, …): newline, tab, quote, comment delimiters, …
- Finite closures (Fortran 66 & Basic): limited identifier length adds states to count the length

Building Faster Scanners from the DFA
Table-driven recognizers waste effort. On each input character they:
- Read (& classify) the next character
- Find the next state
- Assign to the state variable
- Trip through case logic in action()
- Branch back to the top

char ← next character
state ← s_0
call action(state, char)
while (char ≠ eof)
    state ← δ(state, char)
    call action(state, char)
    char ← next character
if Type(state) = final
    then report acceptance
    else report failure

We can do better:
- Encode states & actions in the code
- Do transition tests locally
- Generate ugly, spaghetti-like code
This takes (many) fewer operations per input character.
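The skeleton above, instantiated for the register recognizer r Digit Digit*, looks like this in Python (the state names, table layout, and character classifier are assumptions for illustration; action() is folded away since this version only accepts or rejects):

```python
def classify(ch):
    """Map a character to its character class, shrinking the table."""
    if ch == 'r':
        return 'r'
    if ch.isdigit():
        return 'digit'
    return 'other'

# delta[state][char-class] -> next state; 'se' is the error state
delta = {
    's0': {'r': 's1', 'digit': 'se', 'other': 'se'},
    's1': {'r': 'se', 'digit': 's2', 'other': 'se'},
    's2': {'r': 'se', 'digit': 's2', 'other': 'se'},
    'se': {'r': 'se', 'digit': 'se', 'other': 'se'},
}
FINAL = {'s2'}

def scan(text):
    state = 's0'
    for ch in text:            # read & classify, look up next state, loop
        state = delta[state][classify(ch)]
    return state in FINAL      # "if Type(state) = final"

print(scan('r17'))   # True
print(scan('r'))     # False: needs at least one digit
```

Every character pays for a classification, two dictionary lookups, and an assignment; the direct-coded version on the next slide eliminates the lookups.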

Building Faster Scanners from the DFA
A direct-coded recognizer for r Digit Digit*:

     goto s_0
s_0: word ← Ø
     char ← next character
     if (char = 'r') then goto s_1 else goto s_e
s_1: word ← word + char
     char ← next character
     if ('0' ≤ char ≤ '9') then goto s_2 else goto s_e
s_2: word ← word + char
     char ← next character
     if ('0' ≤ char ≤ '9') then goto s_2
     else if (char = eof) then report success
     else goto s_e
s_e: print error message
     return failure

Many fewer operations per character, and almost no memory operations. Even faster with careful use of fall-through cases.
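Since Python has no goto, a sketch of the same direct-coded idea uses straight-line code and a loop in place of the labels; the point is that each transition test is inlined, with no table lookup (function name and return convention are my own):

```python
def scan_register(text):
    """Direct-coded recognizer for r Digit Digit*; returns the word or None."""
    it = iter(text)
    ch = next(it, None)
    # s0: expect 'r'
    if ch != 'r':
        return None
    word = ch
    ch = next(it, None)
    # s1: expect a first digit
    if ch is None or not ch.isdigit():
        return None
    # s2: accumulate digits until end of input (the goto-s2 self-loop)
    while ch is not None and ch.isdigit():
        word += ch
        ch = next(it, None)
    # report success only if we consumed the whole input (char = eof)
    return word if ch is None else None

print(scan_register('r17'))   # 'r17'
print(scan_register('r1x'))   # None
```

Each character now costs one comparison and one fetch, which is the "fewer operations, almost no memory operations" claim on the slide.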

Building Faster Scanners
Hashing keywords versus encoding them directly:
- Some (well-known) compilers recognize keywords as identifiers and check them in a hash table
- Encoding keywords in the DFA is a better idea: O(1) cost per transition, and it avoids a hash lookup on each identifier
It is hard to beat a well-implemented DFA scanner.

Building Scanners: The Point
All this technology lets us automate scanner construction:
- The implementer writes down the regular expressions
- The scanner generator builds the NFA, DFA, and minimal DFA, then writes out the (table-driven or direct-coded) code
This reliably produces fast, robust scanners. For most modern language features, this works. You should think twice before introducing a feature that defeats a DFA-based scanner; the ones we've seen (e.g., insignificant blanks, non-reserved keywords) have not proven particularly useful or long-lasting.