Scanner

  source code → scanner → tokens → parser → IR
  (the scanner also reports errors)

- maps characters into tokens, the basic unit of syntax
    x = x + y;   becomes   <id, x> = <id, x> + <id, y> ;
- the character string value for a token is a lexeme
- typical tokens: number, id, +, -, *, /, do, end
- eliminates white space (tabs, blanks, comments)
- a key issue is speed, so use a specialized recognizer (as opposed to lex)

Specifying patterns

A scanner must recognize the units of syntax.

Some parts are easy:
- white space
    <ws> ::= <ws> ' ' | <ws> '\t' | ' ' | '\t'
- keywords and operators
    specified as literal patterns: do, end
- comments
    opening and closing delimiters: /* ... */

Copyright © 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from hosking@cs.purdue.edu.

CS502 Scanning

Specifying patterns

A scanner must recognize the units of syntax.

Other parts are much harder:
- identifiers
    alphabetic followed by k alphanumerics (_, $, &, ...)
- numbers
    integers: 0, or a digit from 1-9 followed by digits from 0-9
    decimals: integer '.' digits from 0-9
    reals: (integer or decimal) 'E' (+ or -) digits from 0-9
    complex: '(' real ',' real ')'

We need a powerful notation to specify these patterns.

Operations on languages

  Operation                                 Definition
  union of L and M, written L ∪ M           L ∪ M = {s | s ∈ L or s ∈ M}
  concatenation of L and M, written LM      LM = {st | s ∈ L and t ∈ M}
  Kleene closure of L, written L*           L* = ⋃_{i=0}^{∞} L^i
  positive closure of L, written L+         L+ = ⋃_{i=1}^{∞} L^i
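The operations on languages can be tried out on small finite languages. The following is a minimal sketch, assuming a language is represented as a Python set of strings; the function names are hypothetical, and because L* is infinite whenever L contains a non-empty string, the closure is truncated at an assumed bound max_power.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure(L, max_power, positive=False):
    """Finite approximation of L* (or L+): the union of L^i up to max_power."""
    first = 1 if positive else 0
    result = set()
    for i in range(first, max_power + 1):
        result |= power(L, i)
    return result
```

For example, closure({"a"}, 3) collects ε, a, aa and aaa, while passing positive=True leaves out ε, which is the only difference between the two closures.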
Regular expressions

Patterns are often specified as regular languages.

Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars.

Regular expressions (over an alphabet Σ):
1. ε is a RE denoting the set {ε}
2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
     (r) is a RE denoting L(r)
     (r) | (s) is a RE denoting L(r) ∪ L(s)
     (r)(s) is a RE denoting L(r)L(s)
     (r)* is a RE denoting L(r)*

If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.

Examples

identifier
  letter  → (a | b | c | ... | z | A | B | C | ... | Z)
  digit   → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id      → letter ( letter | digit )*

numbers
  integer → (+ | - | ε) (0 | (1 | 2 | 3 | ... | 9) digit*)
  decimal → integer . ( digit )*
  real    → ( integer | decimal ) E (+ | -) digit*
  complex → '(' real , real ')'

Numbers can get much more complicated.

Most programming language tokens can be described with REs.

We can use REs to build scanners automatically.

Algebraic properties of REs

  Axiom                        Description
  r | s = s | r                | is commutative
  r | (s | t) = (r | s) | t    | is associative
  (rs)t = r(st)                concatenation is associative
  r(s | t) = rs | rt           concatenation distributes over |
  (s | t)r = sr | tr
  εr = r                       ε is the identity for concatenation
  rε = r
  r* = (r | ε)*                relation between * and ε
  r** = r*                     * is idempotent

Examples

Let Σ = {a, b}
1. a | b denotes {a, b}
2. (a | b)(a | b) denotes {aa, ab, ba, bb}
   i.e., (a | b)(a | b) = aa | ab | ba | bb
3. a* denotes {ε, a, aa, aaa, ...}
4. (a | b)* denotes the set of all strings of a's and b's (including ε)
   i.e., (a | b)* = (a*b*)*
5. a | a*b denotes {a, b, ab, aab, aaab, aaaab, ...}
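The identifier and number definitions can be written almost directly in the syntax of Python's re module. This is a sketch under stated assumptions: the constant names and the helper matches are hypothetical, character classes such as [a-zA-Z] stand in for the long alternations, an optional [+-] stands in for (+ | - | ε), and the patterns follow the slide literally, so digit* still allows an empty exponent.

```python
import re

# Building blocks, mirroring the definitions of letter and digit.
LETTER = "[a-zA-Z]"
DIGIT = "[0-9]"

# id -> letter ( letter | digit )*
ID = f"{LETTER}(?:{LETTER}|{DIGIT})*"

# integer -> (+ | - | ε) (0 | (1 | ... | 9) digit*)
INTEGER = f"[+-]?(?:0|[1-9]{DIGIT}*)"

# decimal -> integer . ( digit )*
DECIMAL = rf"{INTEGER}\.{DIGIT}*"

# real -> ( integer | decimal ) E (+ | -) digit*
REAL = f"(?:{INTEGER}|{DECIMAL})E[+-]{DIGIT}*"

# complex -> '(' real , real ')'
COMPLEX = rf"\({REAL},{REAL}\)"


def matches(pattern, lexeme):
    """True if the whole lexeme is in the language of the pattern."""
    return re.fullmatch(pattern, lexeme) is not None
```

Using re.fullmatch rather than re.match matters here: membership in L(r) is a statement about the whole string, not about a prefix of it.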
Recognizers

From a regular expression we can construct a deterministic finite automaton (DFA).

Recognizer for identifier:

[Figure: DFA with start state 0. A letter takes state 0 to state 1; a letter or digit loops on state 1; other takes state 1 to the accept state 2; a digit or other takes state 0 to the error state 3.]

identifier
  letter → (a | b | c | ... | z | A | B | C | ... | Z)
  digit  → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id     → letter ( letter | digit )*

Code for the recognizer

  char ← next_char();
  state ← 0;            /* code for state 0 */
  done ← false;
  token_value ← ""      /* empty string */
  while( not done ) {
    class ← char_class[char];
    state ← next_state[class,state];
    switch(state) {
      case 1:           /* building an id */
        token_value ← token_value + char;
        char ← next_char();
        break;
      case 2:           /* accept state */
        token_type = identifier;
        done = true;
        break;
      case 3:           /* error */
        token_type = error;
        done = true;
        break;
    }
  }
  return token_type;

Tables for the recognizer

Two tables control the recognizer.

char_class:
            a-z      A-Z      0-9     other
  value     letter   letter   digit   other

next_state:
  class     0    1    2    3
  letter    1    1    -    -
  digit     3    1    -    -
  other     3    2    -    -

To change languages, we can just change tables.

Automatic construction

Scanner generators automatically construct code from RE-like descriptions:
- construct a DFA
- use state minimization techniques
- emit code for the scanner (table driven or direct code)

A key issue in automation is an interface to the parser.

lex is a scanner generator supplied with UNIX:
- emits C code for the scanner
- provides macro definitions for each token (used in the parser)
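The table-driven recognizer can be written out as a runnable sketch. The state numbers and the next_state entries are taken from the tables for the recognizer; everything else is an assumption of this sketch: the function names are hypothetical, the input is a Python string rather than a character stream, and end of input is treated as a character of class other.

```python
LETTER, DIGIT, OTHER = "letter", "digit", "other"


def char_class(ch):
    """Classify one character; the empty string (end of input) is 'other'."""
    if ch.isascii() and ch.isalpha():
        return LETTER
    if ch.isascii() and ch.isdigit():
        return DIGIT
    return OTHER


# next_state[(class, state)]; states 2 (accept) and 3 (error) have no exits.
NEXT_STATE = {
    (LETTER, 0): 1, (LETTER, 1): 1,
    (DIGIT, 0): 3, (DIGIT, 1): 1,
    (OTHER, 0): 3, (OTHER, 1): 2,
}


def recognize_identifier(text):
    """Run the DFA from the start of text; return (token_type, token_value)."""
    pos = 0
    state = 0
    token_value = ""
    while True:
        ch = text[pos] if pos < len(text) else ""  # "" models end of input
        state = NEXT_STATE[(char_class(ch), state)]
        if state == 1:      # building an id: consume the character
            token_value += ch
            pos += 1
        elif state == 2:    # accept: the lookahead character is not consumed
            return ("identifier", token_value)
        else:               # state 3: error
            return ("error", token_value)
```

Only the two tables encode the language; the driver loop would stay the same for a different token class, which is the point of the remark that changing languages only requires changing tables.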
Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact:
  For any RE r, ∃ a grammar g such that L(r) = L(g)

Grammars that generate regular sets are called regular grammars.

They have productions in one of 2 forms:
1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol.

These are also called type 3 grammars (Chomsky).

More regular languages

Example: the set of strings containing an even number of zeros and an even number of ones

[Figure: four-state DFA with states s0, s1, s2, s3 and transitions labelled 0 and 1]

The RE is (00 | 11)* ((01 | 10)(00 | 11)* (01 | 10)(00 | 11)*)*

More regular expressions

What about the RE (a | b)*abb?

[Figure: automaton with states s0, s1, s2, s3; s0 loops on a | b, then a takes s0 to s1, b takes s1 to s2, and b takes s2 to s3]

State s0 has multiple transitions on a!
⇒ nondeterministic finite automaton

          a           b
  s0      {s0, s1}    {s0}
  s1      -           {s2}
  s2      -           {s3}

Finite automata

A non-deterministic finite automaton (NFA) consists of:
1. a set of states S = {s0, ..., sn}
2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labelled a leaving s

A DFA accepts x iff ∃ a unique path through the transition graph from s0 to a final state such that the edges spell x.
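An NFA can be run directly by tracking the set of states it could be in after each input symbol. The sketch below does this for the (a | b)*abb automaton, using the transition table for s0, s1 and s2. Two things are assumptions rather than something the slide states: s3 is taken to be the only accepting state (as the RE implies), and the names MOVE, START, FINAL and nfa_accepts are hypothetical.

```python
# move: state -> symbol -> set of successor states (missing entry = no move).
MOVE = {
    "s0": {"a": {"s0", "s1"}, "b": {"s0"}},
    "s1": {"b": {"s2"}},
    "s2": {"b": {"s3"}},
    "s3": {},
}
START = "s0"
FINAL = {"s3"}  # assumed accepting state


def nfa_accepts(x):
    """True if some path from START spelling x ends in a final state."""
    current = {START}
    for symbol in x:
        # Follow every possible transition from every current state.
        current = {t for s in current for t in MOVE[s].get(symbol, set())}
    return bool(current & FINAL)
```

The set held in current is exactly what a single DFA state would stand for, so this loop is a preview of simulating sets of simultaneous states when converting an NFA to a DFA.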
DFAs and NFAs are equivalent

1. DFAs are clearly a subset of NFAs
2. Any NFA can be converted into a DFA, by simulating sets of simultaneous states:
   - each DFA state corresponds to a set of NFA states
   - possible exponential blowup

NFA to DFA using the subset construction: example 1

[Figure: NFA with states s0, s1, s2, s3]

              a           b
  {s0}        {s0, s1}    {s0}
  {s0, s1}    {s0, s1}    {s0, s2}
  {s0, s2}    {s0, s1}    {s0, s3}
  {s0, s3}    {s0, s1}    {s0}

[Figure: resulting DFA with states {s0}, {s0, s1}, {s0, s2}, {s0, s3}]

Constructing a DFA from a regular expression

[Figure: RE → NFA with ε moves → DFA → minimized DFA, with an edge from DFA back to RE]

RE → NFA with ε moves
  build an NFA for each term
  connect them with ε moves
NFA with ε moves → DFA
  construct the simulation
  the "subset" construction
DFA → minimized DFA
  merge compatible states
DFA → RE
  construct R^k_{ij} = R^{k-1}_{ik} (R^{k-1}_{kk})* R^{k-1}_{kj} ∪ R^{k-1}_{ij}

RE to NFA

[Figure: NFA fragments N(ε), N(a), N(A | B), N(AB), and N(A*)]
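The RE to NFA step (build an NFA for each term, connect them with ε moves) can be sketched as a small builder. This is an illustration under stated assumptions, not the construction drawn in the figures: the class NFABuilder and the helpers closure and accepts are hypothetical names, ε is represented by None, and concatenation links the two fragments with an ε move instead of merging states, so the state count differs from a hand-drawn NFA.

```python
from itertools import count

EPS = None  # label used for ε moves in this sketch


class NFABuilder:
    """Builds NFA fragments; a fragment is a (start, final) pair of states."""

    def __init__(self):
        self._ids = count()
        self.edges = {}  # state -> list of (label, target)

    def _state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def _edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def symbol(self, a):        # N(a)
        s, f = self._state(), self._state()
        self._edge(s, a, f)
        return s, f

    def alt(self, A, B):        # N(A | B)
        s, f = self._state(), self._state()
        for start, final in (A, B):
            self._edge(s, EPS, start)
            self._edge(final, EPS, f)
        return s, f

    def concat(self, A, B):     # N(AB), joined by an ε move
        self._edge(A[1], EPS, B[0])
        return A[0], B[1]

    def star(self, A):          # N(A*)
        s, f = self._state(), self._state()
        self._edge(s, EPS, A[0])
        self._edge(s, EPS, f)
        self._edge(A[1], EPS, A[0])
        self._edge(A[1], EPS, f)
        return s, f


def closure(edges, states):
    """All states reachable from `states` on ε moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in edges[stack.pop()]:
            if label is EPS and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def accepts(edges, fragment, x):
    """Simulate the fragment as an NFA on input x."""
    current = closure(edges, {fragment[0]})
    for ch in x:
        step = {t for s in current for label, t in edges[s] if label == ch}
        current = closure(edges, step)
    return fragment[1] in current


# Build (a | b)*abb term by term.
b = NFABuilder()
nfa = b.star(b.alt(b.symbol("a"), b.symbol("b")))
for ch in "abb":
    nfa = b.concat(nfa, b.symbol(ch))
```

Each method adds at most two states and four edges, so the NFA grows linearly with the size of the RE; the possible exponential blowup only appears in the later NFA to DFA step.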
RE to NFA: example

(a | b)*abb

[Figure: step-by-step NFA construction for the example, ending with an NFA over states 0 through 10]

NFA to DFA: the subset construction

Input: NFA N
Output: DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, define:

  Operation        Definition
  ε-closure(s)     set of NFA states reachable from NFA state s on ε-transitions alone
  ε-closure(T)     set of NFA states reachable from some NFA state s in T on ε-transitions alone
  move(T, a)       set of NFA states to which there is a transition on input symbol a from some NFA state s in T

  add state T = ε-closure(s0) unmarked to Dstates
  while ∃ unmarked state T in Dstates
    mark T
    for each input symbol a
      U = ε-closure(move(T, a))
      if U ∉ Dstates then add U to Dstates unmarked
      Dtrans[T, a] = U
    endfor
  endwhile

ε-closure(s0) is the start state of D.
A state of D is final if it contains at least one final state in N.

NFA to DFA using subset construction: example 2

[Figure: NFA with states 0 through 10]

  A = {0,1,2,4,7}
  B = {1,2,3,4,6,7,8}
  C = {1,2,4,5,6,7}
  D = {1,2,4,5,6,7,9}
  E = {1,2,4,5,6,7,10}

        a    b
  A     B    C
  B     B    D
  C     B    C
  D     B    E
  E     B    C

[Figure: resulting DFA with states A, B, C, D, E]

Limits of regular languages

Not all languages are regular.

One cannot construct DFAs to recognize these languages:
- L = {p^k q^k}
- L = {w c w^r | w ∈ Σ*}

Note: neither of these is a regular expression! (DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:
- alternating 0's and 1's
    (ε | 1)(01)*(ε | 0)
- sets of pairs of 0's and 1's
    (01 | 10)+
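The subset construction can be run on example 2 to reproduce the sets A through E. The sketch below follows the algorithm step by step, with a list index playing the role of the mark. The edge list of the 11-state NFA is an assumption: it is the usual textbook NFA for (a | b)*abb, chosen because it is consistent with the sets listed in example 2, and the names NFA, eps_closure, move and subset_construction are hypothetical.

```python
EPS = None  # label for ε-transitions in this sketch

# Assumed NFA for (a | b)*abb: state -> list of (label, target).
NFA = {
    0: [(EPS, 1), (EPS, 7)],
    1: [(EPS, 2), (EPS, 4)],
    2: [("a", 3)],
    3: [(EPS, 6)],
    4: [("b", 5)],
    5: [(EPS, 6)],
    6: [(EPS, 1), (EPS, 7)],
    7: [("a", 8)],
    8: [("b", 9)],
    9: [("b", 10)],
    10: [],
}
START, FINALS, ALPHABET = 0, {10}, "ab"


def eps_closure(states):
    """NFA states reachable from `states` on ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in NFA[stack.pop()]:
            if label is EPS and target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)


def move(T, a):
    """NFA states reachable from some state in T on input symbol a."""
    return {t for s in T for label, t in NFA[s] if label == a}


def subset_construction():
    start = eps_closure({START})
    dstates = [start]   # discovery order; entries before `marked` are marked
    dtrans = {}
    marked = 0
    while marked < len(dstates):
        T = dstates[marked]
        marked += 1
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)
            dtrans[(T, a)] = U
    finals = {T for T in dstates if T & FINALS}
    return start, dstates, dtrans, finals
```

States are stored as frozensets so they can be used as dictionary keys. If some move(T, a) were empty, the empty set would show up as an explicit dead state; that does not happen for this NFA.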
So what is hard?

Language features that can cause problems:
- reserved words
    PL/I had no reserved words
    if then then then = else; else else = then;
- significant blanks
    FORTRAN and Algol68 ignore blanks
    do 10 i = 1,25
    do 10 i = 1.25
- string constants
    special characters in strings
    newline, tab, quote, comment delimiter
- finite closures
    some languages limit identifier lengths
    adds states to count length
    FORTRAN 66: 6 characters

These can be swept under the rug in the language design.

How bad can it get?

        INTEGERFUNCTIONA
        PARAMETER(A=6,B=2)
        IMPLICIT CHARACTER*(A-B)(A-B)
        INTEGER FORMAT(10),IF(10),DO9E1
  100   FORMAT(4H)=(3)
  200   FORMAT(4 )=(3)
        DO9E1=1
        DO9E1=1,2
        IF(X)=1
        IF(X)H=1
        IF(X)300,200
  300   CONTINUE
        END
  C     this is a comment
  $     FILE(1)
        END

Example due to Dr. F.K. Zadeck of IBM Corporation

Scanning MiniJava

White space: ' ', \t, \n, \r, \f

Tokens:
- Operators, keywords (straightforward; I've done them for you)
- Identifiers (straightforward)
- Integers (straightforward)
- Strings (tricky for escapes)
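To show why strings are tricky for escapes, here is a minimal sketch of a string-literal scanner. The slides do not give MiniJava's escape rules, so the set of escapes below (Java-like \n, \t, \r, \f, \" and \\) is an assumption, as are the function name scan_string and the choice to reject a raw newline inside a literal.

```python
# Assumed escape sequences: character after the backslash -> character meant.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "f": "\f", '"': '"', "\\": "\\"}


def scan_string(text, pos=0):
    """Scan a string literal starting at text[pos].

    Returns (value, next_pos), where value has its escapes decoded and
    next_pos is the index just past the closing quote.
    """
    if pos >= len(text) or text[pos] != '"':
        raise ValueError("string literal must start with a double quote")
    chars = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':                 # unescaped quote closes the literal
            return "".join(chars), i + 1
        if ch == "\n":
            raise ValueError("newline inside string literal")
        if ch == "\\":                # escape: look at the next character
            i += 1
            if i >= len(text) or text[i] not in ESCAPES:
                raise ValueError("bad escape sequence")
            chars.append(ESCAPES[text[i]])
        else:
            chars.append(ch)
        i += 1
    raise ValueError("unterminated string literal")
```

The difficulty is that a quote does not always end the literal: the scanner has to remember whether the previous character was a backslash, which is one extra state in the corresponding DFA.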