CONTEXT FREE GRAMMAR AND PARSING STATIC ANALYSIS - PARSING Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Syntatic structure Semantic Analysis (IC generator) Syntatic/sema ntic structure Code Generator Target language 2 Code Optimizer Syntax described formally Tokens organized into syntax tree that describes structure Error checking Symbol Table CS 540 Spri ng 2009 GM U 1
STATIC ANALYSIS - PARSING We can use context free grammars to specify the syntax of programming languages. 3 CS 540 Spri ng 2009 GM U CONTEXT FREE GRAMMARS Definition: A context free grammar is a formal model that consists of: A set of tokens or terminal symbols Vt A set of nonterminal symbols Vn Start symbol S (in Vn) Finite set of productions of form: A R1 Rn (n >= 0) where A in Vn, R in Vn Vt 4 2
CFG EXAMPLES Indicates a production V t = {+,-,0..9}, V n = {L,D}, s = {L} L L + D L D D D 0 9 Shorthand for multiple productions V t ={(,)}, V n = {L}, s = {L} L ( L ) L L e 5 LANGUAGES Regular Context free Context sensitive Type 0 A a B, C e A a aab agb a b 3
ANY REGULAR LANGUAGE CAN BE EXPREED USING A CFG Starting with a NFA: For each state Si in the NFA Create non-terminal Ai If transition (Si,a) = Sk, create production Ai a Ak If transition (Si,e) = Sk, create production Ai Ak If Si is a final state, create production Ai e If Si is the NFA start state, s = Ai What does the existence of this algorithm tell us about the relationship between regular and context free languages? 7 NFA TO CFG EXAMPLE ab*a a a 1 2 3 A1 a A2 A2 b A2 A2 a A3 A3 e b 8 4
WRITING GRAMMARS When writing a grammar (or RE) for some language, the following must be true: All strings generated are in the language. Your grammar produces all strings in the language. 9 TRY THESE: Integers divisible by 2 Legal postfix expressions Floating point numbers with no extra zeros Strings of 0,1 where there are more 0 than 1 10 5
PARSING The task of parsing is figuring out what the parse tree looks like for a given input and language. If a string is in the given language, a parse tree must exist. However, just because a parse tree exists for some string in a given language doesn t mean a given algorithm can find it. 11 PARSE TREES The parse tree for some string in a language that is defined by the grammar G as follows: The root is the start symbol of G The leaves are terminals or e. When visited from left to right, the leaves form the input string The interior nodes are non-terminals of G For every non-terminal A in the tree with children B1 Bk, there is some production A B1 Bk 12 6
PARSE TREE FOR (())() L L ( L ) L L e 13 L L ( ( L ) L ) ( ) L L e e e e PARSE TREE FOR (())() L L ( L ) L L e 14 L L ( ( L ) L ) ( L ) L e e e e 7
PARSING Parsing algorithms are based on the idea of derivations. General Algorithms: LL (top down) LR (bottom up) 15 SINGLE STEP DERIVATION Definition: Given a A b (with a,b in (Vn Vt)*) and a production A g, a A b a g b is a single step derivation. Examples: L + D L D + D ( L ) ( L ) ( ( L ) L ) ( L ) L L - D L (L) L 16 8
DERIVATIONS Definition: A sequence of the form: w0 w1 wn is a derivation of wn from w0 (w0 * wn) L ( L ) L ( ) L ( ) production L ( L ) L production L e production L e L * ( ) If wi has non-terminal symbols, it is referred to as sentential form. 17 L * (( )) ( ) L ( L ) L ( L ) ( L ) L ( L ) ( L ) ( ( L ) L ) ( L ) ( ( ) L ) ( L ) ( ( ) L ) ( ) ( ( ) ) ( ) production L ( L ) L production L ( L ) L production L e production L ( L ) L production L e production L e production L e 18 9
L(G), the language generated by grammar G is {w in Vt*: s * w, for start symbol s} Example: We ve just shown that both () and (())() are in L(G) for the previous grammar. 19 LEFTMOST DERIVATIONS A derivation where the leftmost nonterminal is always chosen If a string is in a given language (i.e. a derivation exists), then a leftmost derivation must exist Rightmost derivation defined as you would expect 20 10
LEFTMOST DERIVATION FOR (())() L production L ( L ) L ( L ) L production L ( L ) L ( ( L ) L ) L production L e ( ( ) L ) L production L e ( ( ) ) L ( ( ) ) ( L ) L ( ( ) ) ( ) L ( ( ) ) ( ) production L e production L ( L ) L production L e RIGHTMOST DERIVATION FOR (())() L production L ( L ) L ( L ) L production L ( L ) L ( L ) ( L ) L production L e ( L ) ( L ) production L e ( L ) ( ) production L ( L ) L ( ( L ) L ) ( ) production L e ( ( L ) ) ( ) production L e ( ( ) ) ( ) 11
AMBIGUITY Two (or more) parse trees or leftmost derivations for some string in the language E E + E E E E E 0 9 E E E + E E - E E 4 2 - E E + E 2 3 3 4 2 TWO LEFTMOST DERIVATIONS E E + E E E + E 2 E + E 2 3 + E 2 3 + 4 E E E E E + E 2 E + E 2 3 + E 2 3 + 4 24 12
An ambiguous grammar can sometimes be made unambiguous: E E + T E T T T 0 9 Precedence can be specified as well: E E + T E T T T T * F T / F F F ( E ) 0 9 enforces the correct associativity 2 INPUT: { SIMPLESTMT; SIMPLESTMT; } P { } S ; e S simplestmt S { } P 26 S S { simplestmt ; simplestmt ; } e 13
TOP DOWN (LL) PARSING P { } S ; e S simplestmt S { } end P 27 { simplestmt ; simplestmt ; } TOP DOWN (LL) PARSING P { } S ; e S simplestmt S { } end P 28 S { simplestmt ; simplestmt; } 14
TOP DOWN (LL) PARSING P { } S ; e S simplestmt S { } end P 29 S { simplestmt ; simplestmt; } TOP DOWN (LL) PARSING P { } S ; e S simplestmt S { } end P 30 S S { simplestmt ; simplestmt; } 15
TOP DOWN (LL) PARSING P { } S ; e S simplestmt S { } end P 31 S S { simplestmt ; simplestmt; } TOP DOWN (LL) PARSING P { } S ; e S simplestmt S { } end P 32 S S e { simplestmt ; simplestmt; } 16
BOTTOMUP (LR) PARSING P { } S ; e S simplestmt S { } end 33 S { simplestmt ; simplestmt ; } BOTTOMUP (LR) PARSING P { } S ; e S simplestmt S { } end 34 S S { simplestmt ; simplestmt ; } 17
BOTTOMUP (LR) PARSING P { } S ; e S simplestmt S { } end 35 S S { simplestmt ; simplestmt ; } e CS 540 Sp BOTTOMUP (LR) PARSING P { } S ; e S simplestmt S { } end 36 S S e { simplestmt ; simplestmt ; } CS 540 Spri ng 2009 GM U 18
BOTTOMUP (LR) PARSING P { } S ; e S simplestmt S { } end 37 S S { simplestmt ; simplestmt ; } e BOTTOMUP (LR) PARSING P { } S ; e S simplestmt S { } end P 38 S S { simplestmt ; simplestmt ; } e 19
BACKUS-NAUR FORM (BNF) BNF is a meta-language i.e. a language used to describe another language invented by John Backus to describe ALGOL 58 used by Peter Naur to describe ALGOL 60 BNF is equivalent to context-free grammars a BNF grammar is defined by a set of terminal symbols, a set of nonterminal symbols a set of rules a start symbol (one of the terminal symbols) BNF ELEMENTS terminal symbols are the lexemes of the target PL e.g., while, (, ) nonterminal symbols represent classes of syntactic structures rules they act like syntactic variables e.g., <statement> define how a nonterminal symbol can by developed into a sequence of nonterminal and terminal symbols e.g., <while_stmt> while ( <logic_expr> ) <stmt> 20
BNF RULES A rule has a left-hand side (LHS) then a right-hand side (RHS) There can be several rules for one LHS <stmt> <assignment> <stmt> begin <stmt_list> end Syntactic lists are described using recursion <ident_list> ident <ident_list> ident, <ident_list> A grammar is a finite nonempty set of rules EBNF Extended BNF (EBNF) is most often used avoids having numerous rules for the same LHS Extra meta-symbols (in addition to ) [ ] enclosed symbols are optional (1 or 0 times) e.g., <if_stmt> if ( <exp> ) <stmt> [ else <stmt> ] { } enclosed symbols can be repeated (0 to n times) e.g., <ident_list> ident {, ident } choice of one of the symbol sequences separated by e.g., <stmt> <assignment> begin <stmt_list> end ( ) groups enclosed symbols 21
BNF VS. EBNF 43 BNF <expr> <expr> + <term> <expr> <expr> - <term> <expr> <term> <term> <term> * <factor> <term> <term> / <factor> <term> <factor> <factor> <exp> ** <factor> <factor> <exp> <exp> ( <expr> ) <exp> id EBNF <expr> <term> { ( + - ) <term> } <term> <factor> { ( * / ) <factor> } <factor> <exp> [ ** <factor> ] <exp> ( <expr> ) id AUGMENTED EBNF another meta-symbol ::= (equal) instead of meta-symbols for repetitions + means one or more times * means zero or more times <ident> = <letter> + ( <letter> <digit> ) * rules can use iteration instead of recursion e.g.: <stmt_list> <stmt> <stmt> ; <stmt_list> can be formulated as <stmt_list> ::= <stmt> ( ; <stmt> ) * 22
A SMALL LANGUAGE IN EBNF <program> ::= begin <stmt_list> end <stmt_list> ::= <stmt> <stmt> ; <stmt_list> <stmt> ::= <var> = <expr> <expr> ::= <term> + <term> <term> - <term> <term> ::= <var> const <var> ::= a b c PARSING General Algorithms: LL (top down) LR (bottom up) Both algorithms are driven by the input grammar and the input to be parsed Two important sets: FIRST and FOLLOW 46 23
FIRST SETS FIRST(a) is the set of all terminal symbols that can begin some sentential form in a derivation that starts with a a a b FIRST(a) = {a in V t a * ab } { e } if a * e Example: S simple { S } FIRST(S) = { simple, { } 47 To compute FIRST across strings of terminals and non-terminals: FIRST(e) = { e } FIRST(Aa) = A if A is a terminal FIRST(A) FIRST(a) If A * e FIRST(A) otherwise 48 24
COMPUTING FIRST SETS Initially FIRST(A) (for all A) is empty 1. For productions A c b, where c in V t Add { c } to FIRST(A) 2. For productions A e Add { e } to FIRST(A) 3. For productions A a B b, where a * e and NOT (B * e) Add FIRST(aB) to FIRST(A) 4. For productions A a, where a * e Add FIRST(a) and { e } to FIRST(A) same thing as e FIRST(B) 49 EXAMPLE 1 S a S e S B B b B e B C C c C e C d FIRST(C) = {c,d} FIRST(B) = FIRST(S) = Start with the simplest non-terminal 50 25
EXAMPLE 1 S a S e S B B b B e B C C c C e C d FIRST(C) = {c,d} FIRST(B) = {b,c,d} FIRST(S) = CS 540 Spring 2009 GMU Now that we know FIRST(C) 51 EXAMPLE 1 S a S e S B B b B e B C C c C e C d FIRST(C) = {c,d} FIRST(B) = {b,c,d} FIRST(S) = {a,b,c,d} 52 26
EXAMPLE 2 P i c n T S Q P a S d S c S T R b e S e R n e T R S q FIRST(P) = {i,c,n} FIRST(Q) = FIRST(R) = {b,e} FIRST(S) = FIRST(T) = 53 EXAMPLE 2 P i c n T S Q P a S d S c S T R b e S e R n e T R S q FIRST(P) = {i,c,n} FIRST(Q) = {i,c,n,a,d} FIRST(R) = {b,e} FIRST(S) = FIRST(T) = 54 27
EXAMPLE 2 P i c n T S Q P a S d S c S T R b e S e R n e T R S q FIRST(P) = {i,c,n} FIRST(Q) = {i,c,n,a,d} FIRST(R) = {b,e} FIRST(S) = {e,b,n,e} FIRST(T) = 55 EXAMPLE 2 P i c n T S Q P a S d S c S T R b e S e R n e T R S q FIRST(P) = {i,c,n} FIRST(Q) = {i,c,n,a,d} FIRST(R) = {b,e} FIRST(S) = {e,b,n,e} FIRST(T) = {b,c,n,q} 56 28
EXAMPLE 3 S a S e S T S T R S e Q R r S r e Q S T e FIRST(S) = FIRST(R) = FIRST(T) = FIRST(Q) = 57 EXAMPLE 3 S a S e S T S T R S e Q R r S r e Q S T e FIRST(S) = {a} FIRST(R) = {r, e} FIRST(T) = {r,a, e} FIRST(Q) = {a, e} 58 29
FOLLOW SETS FOLLOW(A) is the set of terminals (including end of file - $) that may follow non-terminal A in some sentential form. FOLLOW(A) = {c in Vt S + Ac } {$} if S + A For example, consider L + (())(L)L Both ) and end of file can follow L NOTE: e is never in FOLLOW sets 59 COMPUTING FOLLOW(A) If A is start symbol, put $ in FOLLOW(A) Productions of the form B a A b, Add FIRST(b) {e} to FOLLOW(A) INTUITION: Suppose B AX and FIRST(X) = {c} S + a B b a A X b + a A c d b 60 30
3. Productions of the form B a A or B a A b where b * e Add FOLLOW(B) to FOLLOW(A) INTUITION: Suppose B Y A S + a B b a Y A b FOLLOW(B) Suppose B A X and X * e S + a B b a A X b * a A b FOLLOW(B) 61 Assume the first non-terminal is the start symbol EXAMPLE 4 S a S e B B b B C f C C c C g d e FIRST(C) = {c,d,e} FIRST(B) = {b,c,d,e} FIRST(S) = {a,b,c,d,e} FOLLOW(C) = FOLLOW(B) = FOLLOW(S) {$} Using rule #1 62 31
EXAMPLE 4 S a S e B B b B C f C C c C g d e FIRST(C) = {c,d,e} FIRST(B) = {b,c,d,e} FIRST(S) = {a,b,c,d,e} FOLLOW(C) {f,g} FOLLOW(B) {c,d,f} FOLLOW(S) {$,e} Using rule #2 63 EXAMPLE 4 S a S e B B b B C f C C c C g d e FIRST(C) = {c,d,e} FIRST(B) = {b,c,d,e} FIRST(S) = {a,b,c,d,e} FOLLOW(S) = {$} FOLLOW(B) = FIRST(Cf) {$} = {c,d,e,} {$} = {c,d,e,$} FOLLOW(C) = FOLLOW(B) {f,g} = {c,d,e,g,$} 32
EXAMPLE 5 S A B C A D A e a A B b c e C D d C D e b f c FOLLOW(S) = FOLLOW(A) = FOLLOW(B) = FOLLOW(C) = FOLLOW(D) = FIRST(D) = {e,f} FIRST(C) = {e,f} FIRST(B) = {b,c,e} FIRST(A) = {a, e} FIRST(S) = {a,b,c,e,f} 65 EXAMPLE 5 S A B C A D A e a A B b c e C D d C D e b f c FOLLOW(S) = {$} FOLLOW(A) = {b,c,e,f} FOLLOW(B) = {e,f} FOLLOW(C) = {$} FOLLOW(D) = {$} FIRST(D) = {e,f} FIRST(C) = {e,f} FIRST(B) = {b,c,e} FIRST(A) = {a, e} FIRST(S) = {a,b,c,e,f} 66 33
EXAMPLE 6 S ( A) e A T E E & T E e T ( A ) a b c FOLLOW(S) = FOLLOW(A) = FOLLOW(E) = FOLLOW(T) = FIRST(T) = {(,a,b,c} FIRST(E) = {&, e } FIRST(A) = {(,a,b,c} FIRST(S) = {(, e} 67 EXAMPLE 6 S ( A) e A T E E & T E e T ( A ) a b c FIRST(T) = {(,a,b,c} FIRST(E) = {&, e } FIRST(A) = {(,a,b,c} FIRST(S) = {(, e} FOLLOW(S) = {$} FOLLOW(A) = { ) } FOLLOW(E) = FOLLOW(A) = { ) } FOLLOW(T) = FIRST(E) FOLLOW(A) FOLLOW(E) = {&, )} CS 540 Spring 2009 GMU 68 34
EXAMPLE 7 E T E E + T E e T F T T * F T e F ( E ) id FOLLOW(E) = FOLLOW(E ) = FOLLOW(T) = FOLLOW(T ) = FOLLOW(F) = FIRST(F) = FIRST(T) = FIRST(E) = {(,id} FIRST(T ) = {*,e} FIRST(E ) = {+,e} 69 EXAMPLE 7 E T E E + T E e T F T T * F T e F ( E ) id FOLLOW(E) = {$,)} FOLLOW(E ) = FOLLOW(E) = {$,)} FOLLOW(T) = FIRST(E ) FOLLOW(E) FOLLOW(E ) = {+,$,)} FOLLOW(T ) = FOLLOW(T) = {+,$,)} FOLLOW(F) = FIRST(T ) FOLLOW(T) FOLLOW(T ) = {*,+,$,)} FIRST(F) = FIRST(T) = FIRST(E) = {(,id} FIRST(T ) = {*,e} FIRST(E ) = {+,e} 70 35