
Scanner

[Diagram: source code → scanner → tokens → parser → IR, with an errors output]

- maps characters into tokens, the basic unit of syntax
    x = x + y; becomes <id, x> = <id, x> + <id, y> ;
- character string value for a token is a lexeme
- typical tokens: number, id, +, -, *, /, do, end
- eliminates white space (tabs, blanks, comments)
- a key issue is speed ⇒ use a specialized recognizer (as opposed to lex)

Specifying patterns

A scanner must recognize the units of syntax
Some parts are easy:
- white space
    <ws> ::= <ws> ' ' | <ws> '\t' | ' ' | '\t'
- keywords and operators
    specified as literal patterns: do, end
- comments
    opening and closing delimiters: /* ... */

Copyright © 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from hosking@cs.purdue.edu.

CS502 Scanning

Specifying patterns

A scanner must recognize the units of syntax
Other parts are much harder:
- identifiers
    alphabetic followed by k alphanumerics (_, $, &, ...)
- numbers
    integers: 0 or digit from 1-9 followed by digits from 0-9
    decimals: integer '.' digits from 0-9
    reals: (integer or decimal) 'E' (+ or -) digits from 0-9
    complex: '(' real ',' real ')'
We need a powerful notation to specify these patterns

Operations on languages

Operation                                    Definition
union of L and M, written $L \cup M$         $L \cup M = \{s \mid s \in L \text{ or } s \in M\}$
concatenation of L and M, written $LM$       $LM = \{st \mid s \in L \text{ and } t \in M\}$
Kleene closure of L, written $L^*$           $L^* = \bigcup_{i=0}^{\infty} L^i$
positive closure of L, written $L^+$         $L^+ = \bigcup_{i=1}^{\infty} L^i$
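The operations on languages can be tried out on small finite languages. The following is only a sketch: the function names are made up for illustration, and because L* and L+ are infinite in general, the two closures are cut off at a caller-supplied bound max_i (an assumption of this sketch, not part of the definitions).

```python
from itertools import product

def union(L, M):
    """L ∪ M = {s | s in L or s in M}."""
    return L | M

def concat(L, M):
    """LM = {st | s in L and t in M}."""
    return {s + t for s, t in product(L, M)}

def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_closure(L, max_i):
    """Finite approximation of L*: the union of L^0 .. L^max_i."""
    result = set()
    for i in range(max_i + 1):
        result |= power(L, i)
    return result

def positive_closure(L, max_i):
    """Finite approximation of L+: the union of L^1 .. L^max_i."""
    result = set()
    for i in range(1, max_i + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(concat(L, M)))
print(sorted(kleene_closure({"a"}, 3)))
```

The only difference between the two closures is whether the empty string from L^0 is included, which is visible when L itself does not contain the empty string.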

Regular expressions

Patterns are often specified as regular languages
Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars
Regular expressions (over an alphabet Σ):
1. ε is a RE denoting the set {ε}
2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
     (r) is a RE denoting L(r)
     (r) | (s) is a RE denoting L(r) ∪ L(s)
     (r)(s) is a RE denoting L(r)L(s)
     (r)* is a RE denoting L(r)*
If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.

Examples

identifier
    letter  → (a | b | c | ... | z | A | B | C | ... | Z)
    digit   → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
    id      → letter ( letter | digit )*
numbers
    integer → (+ | - | ε) (0 | (1 | 2 | 3 | ... | 9) digit*)
    decimal → integer . ( digit )*
    real    → ( integer | decimal ) E (+ | -) digit*
    complex → '(' real , real ')'
Numbers can get much more complicated
Most programming language tokens can be described with REs
We can use REs to build scanners automatically

Algebraic properties of REs

Axiom                    Description
r|s = s|r                | is commutative
r|(s|t) = (r|s)|t        | is associative
(rs)t = r(st)            concatenation is associative
r(s|t) = rs|rt           concatenation distributes over |
(s|t)r = sr|tr
εr = r, rε = r           ε is the identity for concatenation
r* = (r|ε)*              relation between * and ε
r** = r*                 * is idempotent

Examples

Let Σ = {a,b}
1. a|b denotes {a,b}
2. (a|b)(a|b) denotes {aa,ab,ba,bb}
   i.e., (a|b)(a|b) = aa|ab|ba|bb
3. a* denotes {ε,a,aa,aaa,...}
4. (a|b)* denotes the set of all strings of a's and b's (including ε)
   i.e., (a|b)* = (a*b*)*
5. a|a*b denotes {a,b,ab,aab,aaab,aaaab,...}
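The identifier and number patterns above can be transcribed almost symbol for symbol into Python's re module. This is an illustrative sketch, not how a production scanner would be written: the constant names and the matches helper are invented here, and the patterns deliberately keep the slide's structure (for example, the sign in a real's exponent is mandatory, exactly as in the RE above).

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"
# id -> letter ( letter | digit )*
ID = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
# integer -> (+ | - | ε) (0 | (1 | ... | 9) digit*)
INTEGER = rf"(?:\+|-|)(?:0|[1-9]{DIGIT}*)"
# decimal -> integer . digit*
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
# real -> ( integer | decimal ) E (+ | -) digit*
REAL = rf"(?:{INTEGER}|{DECIMAL})E(?:\+|-){DIGIT}*"
# complex -> '(' real , real ')'
COMPLEX = rf"\({REAL},{REAL}\)"

def matches(pattern, text):
    """True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None

for text in ["x1", "1x", "-42", "007", "3.14E+10", "(1E+0,2.5E-3)"]:
    kinds = [name for name, pat in
             [("id", ID), ("integer", INTEGER), ("decimal", DECIMAL),
              ("real", REAL), ("complex", COMPLEX)]
             if matches(pat, text)]
    print(text, kinds)
```

Note that re is a backtracking matcher rather than a DFA, so this shows what the patterns accept, not how a generated scanner executes them.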

Recognizers

From a regular expression we can construct a deterministic finite automaton (DFA)
Recognizer for identifier:

[State diagram: 0 --letter--> 1; 1 --letter, digit--> 1; 1 --other--> 2 (accept); 0 --digit, other--> 3 (error)]

identifier
    letter → (a | b | c | ... | z | A | B | C | ... | Z)
    digit  → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
    id     → letter ( letter | digit )*

Code for the recognizer

    char ← next_char();
    state ← 0;              /* code for state 0 */
    done ← false;
    token_value ← ""        /* empty string */
    while( not done ) {
        class ← char_class[char];
        state ← next_state[class,state];
        switch(state) {
            case 1:         /* building an id */
                token_value ← token_value + char;
                char ← next_char();
                break;
            case 2:         /* accept state */
                token_type = identifier;
                done = true;
                break;
            case 3:         /* error */
                token_type = error;
                done = true;
                break;
        }
    }
    return token_type;

Tables for the recognizer

Two tables control the recognizer

char_class:
              a-z      A-Z      0-9      other
    value     letter   letter   digit    other

next_state:
    class     0    1    2    3
    letter    1    1    -    -
    digit     3    1    -    -
    other     3    2    -    -

To change languages, we can just change tables

Automatic construction

Scanner generators automatically construct code from RE-like descriptions
- construct a DFA
- use state minimization techniques
- emit code for the scanner (table driven or direct code)
A key issue in automation is an interface to the parser
lex is a scanner generator supplied with UNIX
- emits C code for scanner
- provides macro definitions for each token (used in the parser)
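The pseudocode and the two tables translate fairly directly into a runnable program. The version below is a sketch in Python under a few assumptions that are not in the slides: input comes from a string, end of input is represented by the empty string (which falls into class "other"), and the function returns the lexeme together with the token type.

```python
def char_class(ch):
    """The char_class table: map a character to letter, digit, or other."""
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "letter"
    if "0" <= ch <= "9":
        return "digit"
    return "other"  # also covers "" (end of input, an assumption of this sketch)

# The next_state table, indexed by (class, state). States 2 and 3 are
# terminal, so they need no entries.
NEXT_STATE = {
    ("letter", 0): 1, ("letter", 1): 1,
    ("digit", 0): 3,  ("digit", 1): 1,
    ("other", 0): 3,  ("other", 1): 2,
}

def recognize_identifier(text):
    """Run the table-driven recognizer on text; return (token_type, lexeme)."""
    pos = 0

    def next_char():
        nonlocal pos
        ch = text[pos] if pos < len(text) else ""
        pos += 1
        return ch

    ch = next_char()
    state = 0
    token_value = ""
    while True:
        state = NEXT_STATE[(char_class(ch), state)]
        if state == 1:      # building an id
            token_value += ch
            ch = next_char()
        elif state == 2:    # accept state
            return ("identifier", token_value)
        else:               # state 3: error
            return ("error", token_value)

print(recognize_identifier("x1 = y"))
print(recognize_identifier("9lives"))
```

As the slide notes, recognizing a different token class only requires different tables; the driver loop stays the same.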

Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it describes a regular language?
Provable fact: For any RE r, ∃ a grammar g such that L(r) = L(g)
Grammars that generate regular sets are called regular grammars:
They have productions in one of 2 forms:
1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol
These are also called type 3 grammars (Chomsky)

More regular languages

Example: the set of strings containing an even number of zeros and an even number of ones

[State diagram: states s0, s1, s2, s3 with transitions labelled 0 and 1]

The RE is (00 | 11)*((01 | 10)(00 | 11)*(01 | 10)(00 | 11)*)*

More regular expressions

What about the RE (a | b)*abb?

[State diagram: s0 loops on a and b; s0 --a--> s1 --b--> s2 --b--> s3]

State s0 has multiple transitions on a! ⇒ nondeterministic finite automaton

           a           b
    s0     {s0,s1}     {s0}
    s1     -           {s2}
    s2     -           {s3}

Finite automata

A non-deterministic finite automaton (NFA) consists of:
1. a set of states S = {s0,...,sn}
2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F
A Deterministic Finite Automaton (DFA) is a special case of an NFA:
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labelled a leaving s
A DFA accepts x iff. ∃ a unique path through the transition graph from s0 to a final state such that the edges spell x.
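An NFA such as the one for (a | b)*abb can be run directly by tracking the set of states it could be in after each input symbol. The sketch below encodes the move table from the slide as a Python dictionary; treating s3 as the only final state is an assumption read off the RE, since the diagram itself did not survive, and the function name is invented for illustration.

```python
# move: (state, symbol) -> set of next states, from the transition table above.
MOVE = {
    ("s0", "a"): {"s0", "s1"},
    ("s0", "b"): {"s0"},
    ("s1", "b"): {"s2"},
    ("s2", "b"): {"s3"},
}
START = "s0"
FINAL = {"s3"}  # assumed: s3 is the single accepting state

def nfa_accepts(text):
    """Simulate the NFA by following every possible transition at once."""
    current = {START}
    for ch in text:
        # Union of move(s, ch) over all states s we might be in.
        current = set().union(*(MOVE.get((s, ch), set()) for s in current))
    return bool(current & FINAL)

for w in ["abb", "aabb", "babb", "ab", "abba"]:
    print(w, nfa_accepts(w))
```

The sets of states that show up in `current` during a run are exactly what become single states when the NFA is converted to a DFA.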

DFAs and NFAs are equivalent

1. DFAs are clearly a subset of NFAs
2. Any NFA can be converted into a DFA, by simulating sets of simultaneous states:
   - each DFA state corresponds to a set of NFA states
   - possible exponential blowup

NFA to DFA using the subset construction: example 1

[NFA diagram: states s0, s1, s2, s3]

                a           b
    {s0}        {s0,s1}     {s0}
    {s0,s1}     {s0,s1}     {s0,s2}
    {s0,s2}     {s0,s1}     {s0,s3}
    {s0,s3}     {s0,s1}     {s0}

[DFA diagram: states s0, s0s1, s0s2, s0s3]

Constructing a DFA from a regular expression

[Diagram: RE → NFA w/ ε moves → DFA → minimized DFA, and DFA → RE]

RE → NFA w/ ε moves
- build NFA for each term
- connect them with ε moves
NFA w/ ε moves to DFA
- construct the simulation
- the "subset" construction
DFA → minimized DFA
- merge compatible states
DFA → RE
- construct $R^{k}_{ij} = R^{k-1}_{ik} (R^{k-1}_{kk})^{*} R^{k-1}_{kj} \cup R^{k-1}_{ij}$

RE to NFA

[Diagrams of the NFA fragments N(ε), N(a), N(A|B), N(AB), N(A*)]
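The RE to NFA step (build an NFA for each term, connect them with ε moves) can be sketched as a handful of small constructors, one per RE operator. Everything named here (Frag, symbol, alt, concat, star, accepts) is invented for illustration. One deliberate simplification: concat joins the two fragments with an extra ε edge rather than merging states, so the automaton has more states than the textbook diagram for the same RE but accepts the same language.

```python
from itertools import count

EPS = None        # label used for ε moves
_ids = count()    # source of fresh state numbers

class Frag:
    """NFA fragment: one start state, one accepting state, a list of edges."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def symbol(a):
    """N(a): start --a--> accept."""
    s, f = next(_ids), next(_ids)
    return Frag(s, f, [(s, a, f)])

def concat(A, B):
    """N(AB): link A's accepting state to B's start state with an ε move."""
    return Frag(A.start, B.accept,
                A.edges + B.edges + [(A.accept, EPS, B.start)])

def alt(A, B):
    """N(A|B): new start and accept states, ε moves into and out of A and B."""
    s, f = next(_ids), next(_ids)
    return Frag(s, f, A.edges + B.edges + [
        (s, EPS, A.start), (s, EPS, B.start),
        (A.accept, EPS, f), (B.accept, EPS, f)])

def star(A):
    """N(A*): ε moves allow skipping A entirely or repeating it."""
    s, f = next(_ids), next(_ids)
    return Frag(s, f, A.edges + [
        (s, EPS, A.start), (s, EPS, f),
        (A.accept, EPS, A.start), (A.accept, EPS, f)])

def eps_closure(states, edges):
    """All states reachable from `states` on ε moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for p, label, q in edges:
            if p == s and label is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(nfa, text):
    """Simulate the ε-NFA on text."""
    current = eps_closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {q for p, label, q in nfa.edges if p in current and label == ch}
        current = eps_closure(moved, nfa.edges)
    return nfa.accept in current

# (a | b)*abb
nfa = concat(concat(concat(star(alt(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "babb"), accepts(nfa, "ab"))
```

Each constructor returns a fragment with exactly one start and one accepting state, which is what lets the fragments compose without special cases.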

RE to NFA: example

(a | b)*abb

[NFA diagrams: a | b (states 1-6), (a | b)* (states 0-7), abb (states 7-10)]

NFA to DFA: the subset construction

Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, define:

Operation        Definition
ε-closure(s)     set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)     set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T,a)        set of NFA states to which there is a transition on input symbol a from some NFA state s in T

    add state T = ε-closure(s0) unmarked to Dstates
    while ∃ unmarked state T in Dstates
        mark T
        for each input symbol a
            U = ε-closure(move(T,a))
            if U ∉ Dstates then add U to Dstates unmarked
            Dtrans[T,a] = U
        endfor
    endwhile

ε-closure(s0) is the start state of D
A state of D is final if it contains at least one final state in N

NFA to DFA using subset construction: example 2

[NFA diagram for (a | b)*abb: states 0-10]

    A = {0,1,2,4,7}          D = {1,2,4,5,6,7,9}
    B = {1,2,3,4,6,7,8}      E = {1,2,4,5,6,7,10}
    C = {1,2,4,5,6,7}

         a    b
    A    B    C
    B    B    D
    C    B    C
    D    B    E
    E    B    C

[DFA diagram: states A, B, C, D, E]

Limits of regular languages

Not all languages are regular
One cannot construct DFAs to recognize these languages:
    $L = \{p^k q^k\}$
    $L = \{wcw^r \mid w \in \Sigma^*\}$
Note: neither of these is a regular expression! (DFAs cannot count!)
But, this is a little subtle. One can construct DFAs for:
- alternating 0's and 1's
    (ε | 1)(01)*(ε | 0)
- sets of pairs of 0's and 1's
    (01 | 10)+
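The subset construction pseudocode can be run on example 2 to reproduce the sets A through E. The sketch below is a direct Python transcription. The NFA edges are an assumption: the diagram did not survive, so the usual textbook numbering for (a | b)*abb is used, which is consistent with the state sets listed on the slide. Variable and function names are illustrative.

```python
EPS = ""  # label for ε-transitions in this sketch

# Assumed NFA for (a | b)*abb with states 0..10: (state, label) -> next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},  (9, "b"): {10},
}
NFA_START, NFA_FINAL, ALPHABET = 0, {10}, "ab"

def eps_closure(T):
    """NFA states reachable from some state in T on ε-transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(T, a):
    """NFA states with a transition on symbol a from some state in T."""
    return {t for s in T for t in NFA.get((s, a), set())}

def subset_construction():
    start = eps_closure({NFA_START})
    dstates, unmarked, dtrans = [start], [start], {}
    while unmarked:
        T = unmarked.pop(0)              # mark T
        for a in ALPHABET:
            U = eps_closure(move(T, a))  # never empty for this particular NFA
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtrans[(T, a)] = U
    finals = {T for T in dstates if T & NFA_FINAL}
    return start, dstates, dtrans, finals

start, dstates, dtrans, finals = subset_construction()
names = {T: chr(ord("A") + i) for i, T in enumerate(dstates)}
for T in dstates:
    print(names[T], sorted(T), names[dtrans[(T, "a")]], names[dtrans[(T, "b")]])
```

Processing the unmarked states in first-in, first-out order happens to discover the DFA states in the same order as the slide's labels A to E; any other order would give the same DFA up to renaming.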

So what is hard?

Language features that can cause problems:
- reserved words
    PL/I had no reserved words
    if then then then = else; else else = then;
- significant blanks
    FORTRAN and Algol68 ignore blanks
    do 10 i = 1,25
    do 10 i = 1.25
- string constants
    special characters in strings
    newline, tab, quote, comment delimiter
- finite closures
    some languages limit identifier lengths
    adds states to count length
    FORTRAN 66 → 6 characters
These can be swept under the rug in the language design

How bad can it get?

          INTEGERFUNCTIONA
          PARAMETER(A=6,B=2)
          IMPLICIT CHARACTER*(A-B)(A-B)
          INTEGER FORMAT(10),IF(10),DO9E1
    100   FORMAT(4H)=(3)
    200   FORMAT(4 )=(3)
          DO9E1=1
          DO9E1=1,2
          IF(X)=1
          IF(X)H=1
          IF(X)300,200
    300   CONTINUE
          END
    C     this is a comment
    $     FILE(1)
          END

Example due to Dr. F.K. Zadeck of IBM Corporation

Scanning MiniJava

White space: ' ', \t, \n, \r, \f
Tokens:
- Operators, keywords (straightforward; I've done them for you)
- Identifiers (straightforward)
- Integers (straightforward)
- Strings (tricky for escapes)
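To see why strings are "tricky for escapes", here is a minimal hand-written sketch of scanning one double-quoted string literal. It is not taken from the course's MiniJava scanner: the function name, the supported escape set (\n, \t, \\, \"), and the choice to reject a raw newline inside a literal are all assumptions made for illustration.

```python
# Assumed escape set for this sketch; a real language defines its own.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def scan_string(text, pos):
    """Scan a double-quoted string literal starting at text[pos].

    Returns (value, next_pos), where value has escapes decoded and next_pos
    is the index just past the closing quote. Raises ValueError on
    malformed input.
    """
    if pos >= len(text) or text[pos] != '"':
        raise ValueError("expected opening quote")
    pos += 1
    value = []
    while pos < len(text):
        ch = text[pos]
        if ch == '"':                 # unescaped quote ends the literal
            return "".join(value), pos + 1
        if ch == "\n":
            raise ValueError("newline in string literal")
        if ch == "\\":                # escape: look at the next character
            pos += 1
            if pos >= len(text) or text[pos] not in ESCAPES:
                raise ValueError("bad escape sequence")
            value.append(ESCAPES[text[pos]])
        else:
            value.append(ch)
        pos += 1
    raise ValueError("unterminated string literal")

print(scan_string('"hi" + x', 0))
print(scan_string(r'"a\"b"', 0))
```

The tricky part is that a quote character ends the literal only when it is not preceded by an escaping backslash, so the scanner has to consume each escape as a two-character unit.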