Context Free Grammars: Introduction

A context free grammar (CFG) is a tuple G = (V, Σ, R, S), where
  - V is a set of grammar symbols
  - Σ ⊆ V is a set of terminal symbols
  - R is a set of rules, where R ⊆ (V − Σ) × V*
  - S ∈ V − Σ is a distinguished non-terminal symbol: the start symbol

Unlike regular grammars (RGs), CFGs can have any number of symbols on the right-hand side of a rule, so RGs ⊂ CFGs.

CFGs are more powerful than RGs because of the following 2 properties:

1. Recursion
  - A rule is recursive if it is of the form X → w_1 Y w_2, where Y ⇒* w_3 X w_4 and w_1, w_2, w_3, w_4 ∈ V*
  - A grammar is recursive if it has at least one recursive rule
2. Self-embedment
  - A rule is self-embedding if it is of the form X → w_1 Y w_2, where Y ⇒* w_3 X w_4 and both w_1 w_3 and w_4 w_2 are in Σ+
  - A grammar is self-embedding if it has at least one self-embedding rule
  - If grammar G is not self-embedding, L(G) is regular
  - If every grammar that defines language L is self-embedding, L is not regular

Context Free Grammars: Designing CFGs

The following heuristics can be helpful in designing CFGs to generate language L:

1. If L is such that every string can be divided into 2 distinct regions that are related to each other, the regions must be generated in tandem
2. To generate strings with multiple regions that must occur in a fixed order, but where the regions are not related to each other, use rules of the form A → BC...
3. To generate strings with 2 regions that are related to each other and must occur in a fixed order, start at the ends of the string and work inward. If there is an independent region in the middle, generate it after the surrounding regions have been generated
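
As an illustration of heuristics 1 and 2, the sketch below uses a hypothetical grammar for { a^n b^n c^m : n, m ≥ 0 }. The grammar, the dictionary layout, and the function name generate are assumptions made for this example. The a and b regions are related, so one rule generates them in tandem; the c region is independent, so it is attached with a rule of the form A → BC.

```python
from collections import deque

# Hypothetical grammar for { a^n b^n c^m : n, m >= 0 }.
RULES = {
    "S": [("A", "C")],            # heuristic 2: independent regions in a fixed order
    "A": [("a", "A", "b"), ()],   # heuristic 1: related regions generated in tandem
    "C": [("c", "C"), ()],
}

def generate(rules, start, max_len):
    """Return every terminal string with at most max_len symbols derivable from start."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Locate the left-most non-terminal, if any.
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:
            results.add("".join(form))
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            # Prune sentential forms that already hold too many terminals.
            terminal_count = sum(1 for s in new if s not in rules)
            if terminal_count <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results
```

The pruning step is enough for this grammar because every recursive rule adds a terminal; a grammar with non-terminal-only recursion would need a stronger bound.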

Context Free Grammars: Simplifying CFGs

CFGs may have
1. Non-terminals that never terminate in terminal symbols
2. Symbols that are unreachable

Want to clean up such grammars

1. Eliminating symbols that do not terminate
  - Strategy: Work from terminal symbols towards non-terminal symbols
  - Any non-terminals that are not encountered are non-terminating (unproductive)
  - Algorithm:

    CFG removeunproductive (CFG G) {
        marked = G.Sigma;                  // productive symbols: terminals are productive
        do {
            oldmarked = marked;
            for (each X -> alpha in G.R) {
                productive = TRUE;
                for (each s in alpha)
                    if (s not in marked) productive = FALSE;
                if (productive) marked = marked + X;
            }
        } while (marked != oldmarked);
        G'.R = NULL;                       // keep only rules built from productive symbols
        for (each X -> alpha in G.R) {
            keep = (X in marked);
            for (each symbol s in alpha)
                if (s not in marked) keep = FALSE;
            if (keep) G'.R = G'.R + (X -> alpha);
        }
        G'.V = marked;
        G'.Sigma = G.Sigma;
        G'.S = G.S;
        return G';
    }
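
The marking strategy above can be sketched in Python as follows. The representation is an assumption of this example: rules are (lhs, rhs) pairs with rhs a tuple of symbols, and the function name remove_unproductive is illustrative.

```python
def remove_unproductive(nonterminals, terminals, rules, start):
    """Drop non-terminals that derive no terminal string, and the rules that use them."""
    productive = set(terminals)          # terminals are productive by definition
    changed = True
    while changed:                       # stop after a full pass that marks nothing new
        changed = False
        for lhs, rhs in rules:
            # An empty rhs (an epsilon rule) also makes lhs productive.
            if lhs not in productive and all(s in productive for s in rhs):
                productive.add(lhs)
                changed = True
    kept_rules = [(lhs, rhs) for lhs, rhs in rules
                  if lhs in productive and all(s in productive for s in rhs)]
    kept_nonterminals = set(nonterminals) & productive
    return kept_nonterminals, set(terminals), kept_rules, start
```

If the start symbol itself is unproductive, the returned rule list is empty and the grammar generates the empty language.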

Context Free Grammars: Simplifying CFGs (2)

2. Eliminating unreachable symbols
  - Strategy: Work from start symbol towards terminal symbols
  - Algorithm:

    CFG removeunreachable (CFG G) {
        marked = {G.S};                    // reachable symbols
        do {
            oldmarked = marked;
            for (each X -> alpha in G.R)
                if (X in marked)
                    for (each s in alpha) marked = marked + s;
        } while (marked != oldmarked);
        G'.V = G.V;
        for (each X in G.V)                // drops unreachable terminals and non-terminals
            if (X not in marked) G'.V = G'.V - X;
        G'.Sigma = G.Sigma;
        for (each s in G.Sigma)
            if (s not in marked) G'.Sigma = G'.Sigma - s;
        G'.R = G.R;
        for (each (X -> alpha) in G.R)
            if (X not in marked) G'.R = G'.R - (X -> alpha);
        G'.S = G.S;
        return G';
    }
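
A minimal Python sketch of the same reachability marking, assuming rules are (lhs, rhs) pairs with rhs a tuple of symbols; the function name remove_unreachable is illustrative.

```python
def remove_unreachable(nonterminals, terminals, rules, start):
    """Drop symbols that cannot appear in any derivation from start."""
    reachable = {start}
    changed = True
    while changed:                       # stop after a full pass that marks nothing new
        changed = False
        for lhs, rhs in rules:
            if lhs in reachable:
                for s in rhs:
                    if s not in reachable:
                        reachable.add(s)
                        changed = True
    # A rule survives exactly when its left-hand side is reachable.
    kept_rules = [(lhs, rhs) for lhs, rhs in rules if lhs in reachable]
    return (set(nonterminals) & reachable,
            set(terminals) & reachable,
            kept_rules,
            start)
```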

Context Free Grammars: Proving Grammar Is Correct

To prove grammar G is correct for a language L, must show
1. G generates only strings in L
2. G generates all strings in L

Most straightforward way to do the first step is to reason about the following algorithm:

    string generate (grammar G) {
        w = G.S;
        do {
            apply some rule r = (X -> gamma) from G.R to an occurrence of X in w;
        } while (w contains some X in (G.V - G.Sigma));
        return w;
    }

Proof constructed as follows:
1. Construct loop invariant I for above algorithm
2. Show that
   (a) I is true when loop starts
   (b) I holds on each iteration of loop
   (c) At termination, I and the exit condition imply w ∈ L
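
To make the invariant idea concrete, here is a small sketch for a hypothetical grammar S → aSb | ε with target language { a^n b^n : n ≥ 0 }. The grammar, the invariant, and the step budget are assumptions of this example; the assertions only test the invariant on sample runs and are not a substitute for the proof.

```python
import random
import re

RULES = [("S", "aSb"), ("S", "")]  # hypothetical grammar for { a^n b^n : n >= 0 }

def invariant(w):
    """I: w has the shape a^n S? b^n (a's before b's, equal counts)."""
    return re.fullmatch(r"a*S?b*", w) is not None and w.count("a") == w.count("b")

def generate(rng, max_steps=50):
    w = "S"
    assert invariant(w)                  # (a) I holds when the loop starts
    steps = 0
    while "S" in w:
        # Force the terminating rule once the step budget is used up.
        lhs, rhs = RULES[1] if steps >= max_steps else rng.choice(RULES)
        w = w.replace(lhs, rhs, 1)
        steps += 1
        assert invariant(w)              # (b) every rule application preserves I
    return w                             # (c) I plus "no S left" gives w = a^n b^n
```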

Context Free Grammars: Ambiguity

Issues arise with CFGs that are not inherent in RGs

1. CFG derivation order not obvious from parse tree
  - Not an issue for RGs: there can only be a single NT on the RHS
  - Example: Consider grammar

    <S>  → <NP><VP>
    <NP> → <A><N> | <A><N><PP>
    <VP> → <V><NP> | <V><NP><PP>
    <PP> → <P><NP>
    <N>  → boy
    <N>  → man
    <N>  → telescope
    <A>  → the
    <A>  → a
    <V>  → saw
    <P>  → with

Context Free Grammars: Ambiguity (2)

Derivations of "the man saw a boy":

    <S> ⇒ <NP><VP>
        ⇒ <A><N><VP>
        ⇒ the <N><VP>
        ⇒ the <N><V><NP>
        ⇒ the <N> saw <NP>
        ⇒ the <N> saw <A><N>
        ⇒ the <N> saw <A> boy
        ⇒ the <N> saw a boy
        ⇒ the man saw a boy

vs.

    <S> ⇒ <NP><VP>
        ⇒ <A><N><VP>
        ⇒ the <N><VP>
        ⇒ the man <VP>
        ⇒ the man <V><NP>
        ⇒ the man saw <NP>
        ⇒ the man saw <A><N>
        ⇒ the man saw a <N>
        ⇒ the man saw a boy

Left-most derivation: One in which the left-most NT is always replaced (the second derivation is left-most)
Right-most derivation: One in which the right-most NT is always replaced
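
A left-most derivation can be found mechanically by always expanding the first non-terminal. The sketch below does this for the rules of the example grammar that the sentence "the man saw a boy" uses; the dictionary layout and the function name leftmost_derivation are assumptions of this example, and the search relies on the grammar having no ε rules and no left recursion.

```python
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["A", "N"]],
    "VP": [["V", "NP"]],
    "N": [["boy"], ["man"], ["telescope"]],
    "A": [["the"], ["a"]],
    "V": [["saw"]],
}

def leftmost_derivation(target, form=("S",)):
    """Return the sentential forms of a left-most derivation of target, or None."""
    idx = next((i for i, s in enumerate(form) if s in RULES), None)
    if idx is None:                       # only terminals left
        return [form] if list(form) == target else None
    # Without epsilon rules a form never shrinks, and the terminal prefix is final.
    if len(form) > len(target) or list(form[:idx]) != target[:idx]:
        return None
    for rhs in RULES[form[idx]]:
        rest = leftmost_derivation(target, form[:idx] + tuple(rhs) + form[idx + 1:])
        if rest is not None:
            return [form] + rest
    return None
```

Calling leftmost_derivation("the man saw a boy".split()) returns the ten sentential forms of the left-most derivation, from ("S",) to the full sentence.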

Context Free Grammars: Ambiguity (3)

2. CFG may be ambiguous
  - Grammar is ambiguous if there is more than one parse tree for the same string
  - Grammar G is unambiguous iff, for all strings w ∈ L(G), there is only one rule that can be applied at any point in a left-most or right-most derivation of w
  - RGs can be ambiguous, but the problem is more serious for CFGs
      Structure modifies meaning of string
      Cannot guarantee an equivalent unambiguous grammar for one that is ambiguous
  - Inherently ambiguous: Language for which every grammar is ambiguous, so that no equivalent unambiguous grammar exists

To reduce ambiguity

1. Eliminate ε rules
  - An ε rule is of the form X → ε
  - Allows NTs that provide no structure
  - NT X is nullable iff
      (a) There is a rule (X → ε) ∈ R, OR
      (b) There is a rule (X → P_1 P_2 ... P_n) ∈ R, and P_1, P_2, ..., P_n are all nullable
  - A rule is modifiable iff it is of the form X → αQβ and Q is nullable, for any α, β ∈ V*

Context Free Grammars: Ambiguity (4)

Algorithm:

    CFG removeeps (CFG G) {
        nullable = NULL;
        for (each rule (X -> alpha) in G.R)
            if (alpha == epsilon) nullable = nullable + X;
        do {
            oldnullable = nullable;
            for (each rule (X -> alpha) in G.R) {
                if (alpha contains only NTs) {
                    flag = TRUE;
                    for (each Y in alpha)
                        if (Y not in nullable) flag = FALSE;
                    if (flag) nullable = nullable + X;
                }
            }
        } while (oldnullable != nullable);
        R' = G.R;
        do {
            oldR' = R';
            for (each (X -> alpha Q beta) in R')
                if (Q in nullable)
                    if (((X -> alpha beta) not in R') AND (alpha beta != epsilon) AND (X != alpha beta))
                        R' = R' + (X -> alpha beta);
        } while (oldR' != R');
        for (each X -> alpha in R')
            if (alpha == epsilon) R' = R' - (X -> alpha);
        G' = (G.V, G.Sigma, R', G.S);
        return G';
    }
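
The two phases of the algorithm (find the nullable NTs, then add shortened copies of modifiable rules) can be sketched in Python. This assumes rules are (lhs, rhs) pairs with rhs a tuple and ε written as the empty tuple; the function names are illustrative.

```python
def nullable_symbols(rules):
    """Return the set of non-terminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # An empty rhs satisfies all(), which covers rules X -> epsilon.
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

def remove_eps(rules):
    """Return an epsilon-free rule set; the empty string itself is lost."""
    nullable = nullable_symbols(rules)
    new_rules = set(rules)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in list(new_rules):
            for i, sym in enumerate(rhs):
                if sym in nullable:
                    shorter = rhs[:i] + rhs[i + 1:]
                    # Skip epsilon and the useless rule X -> X.
                    if shorter and shorter != (lhs,) and (lhs, shorter) not in new_rules:
                        new_rules.add((lhs, shorter))
                        changed = True
    return {(lhs, rhs) for lhs, rhs in new_rules if rhs}
```

For example, with S → aTa, T → ABC, A → aA | C, B → Bb | C, C → c | ε, the nullable set is {A, B, C, T} and the result contains 16 rules, including S → aa and T → A.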

Context Free Grammars: Ambiguity (5)

Need to check special case in above algorithm: Language may include ε
Add special rule for this case at end of algorithm, where S' is a new start symbol:

    if (nullable(G.S)) {
        G'.V = G'.V + S';
        G'.R = G'.R - (S -> epsilon);
        G'.R = G'.R + (S' -> epsilon);
        G'.R = G'.R + (S' -> S);
        G'.S = S';
    }

2. Eliminating symmetric recursive rules
  - Symmetric recursive rule:
      (a) Contains 2 or more copies of LHS on RHS
      (b) RHS is symmetric
      (c) E.g., X → X op X
  - Grammars that have such rules will have additional rules for expanding the LHS
  - To eliminate ambiguity, will force branching to left or right by adding new NT symbols
      Replication of original LHS symbol achieved via that symbol
      New symbol provides lower-level structure

3. Ambiguous attachment
  - Problem associated with optional structures
  - Given a recursive structure with such an optional structure, the problem is identifying to which level of recursion the structure should be attached
  - Solution is to add additional rules to remove ambiguity

Proving non-ambiguity
  - The problem of determining whether an arbitrary grammar is ambiguous is undecidable, in general
  - Can determine this for a specific grammar
  - To prove a grammar is unambiguous, must show that every string in the language has a single left-most (or right-most) derivation
  - General technique simply demonstrates, for each NT, that a given string of terminals can only be generated in one way
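
To see the effect of removing a symmetric recursive rule, the sketch below counts parse trees for a token string. Both grammars (E → E + E | id and its left-branching replacement E → E + T | T, T → id) are hypothetical examples, and count_parses is an illustrative helper that assumes no ε rules and no cycles of unit rules.

```python
from functools import lru_cache

def count_parses(rules, start, tokens):
    """Count the parse trees of tokens; rules maps an NT to a tuple of rhs tuples."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def sym(s, i, j):
        """Number of ways symbol s derives tokens[i:j]."""
        if s not in rules:                               # terminal symbol
            return 1 if j == i + 1 and tokens[i] == s else 0
        return sum(seq(rhs, i, j) for rhs in rules[s])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        """Number of ways the symbol sequence rhs derives tokens[i:j]."""
        if not rhs:
            return 1 if i == j else 0
        if len(rhs) == 1:
            return sym(rhs[0], i, j)
        # Every symbol must cover at least one token, so split strictly inside (i, j).
        return sum(sym(rhs[0], i, k) * seq(rhs[1:], k, j) for k in range(i + 1, j))

    return sym(start, 0, len(tokens))

AMBIGUOUS = {"E": (("E", "+", "E"), ("id",))}
LEFT_BRANCHING = {"E": (("E", "+", "T"), ("T",)), "T": (("id",),)}
```

Under these assumptions, id + id + id has two parse trees in the first grammar and one in the second, because the new NT T forces all branching to the left.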

Context Free Grammars: Normal Forms

1. Chomsky Normal Form (CNF)
  - All rules are in one of the forms
      (a) X → a, where a ∈ Σ
      (b) X → BC, where B, C ∈ V − Σ
  - Every parse tree of a grammar in CNF has a branching factor of at most 2
  - Therefore, every derivation of string w has
      (a) |w| − 1 applications of rules of form X → BC, and
      (b) |w| applications of rules of form X → a

2. Greibach Normal Form (GNF)
  - All rules are in the form
      (a) X → aβ, where a ∈ Σ, β ∈ (V − Σ)*
  - One terminal generated per rule application
  - Therefore, every derivation of string w has |w| rule applications
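
The two rule shapes can be checked mechanically. The predicates below are a sketch that assumes rules are (lhs, rhs) pairs with rhs a tuple, and that any symbol not in the terminal set is a non-terminal; the names is_cnf and is_gnf are illustrative.

```python
def is_cnf(rules, terminals):
    """True if every rule is X -> a (a terminal) or X -> B C (two non-terminals)."""
    for _, rhs in rules:
        if len(rhs) == 1 and rhs[0] in terminals:
            continue
        if len(rhs) == 2 and all(s not in terminals for s in rhs):
            continue
        return False
    return True

def is_gnf(rules, terminals):
    """True if every rule is X -> a B1 ... Bk: one terminal, then k >= 0 non-terminals."""
    return all(
        len(rhs) >= 1
        and rhs[0] in terminals
        and all(s not in terminals for s in rhs[1:])
        for _, rhs in rules
    )
```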

Context Free Grammars: Converting to Normal Forms

Following strategy used:
1. Apply a transformation to the grammar to eliminate an undesirable property
   Language must not change as a result of the transform
2. Repeat, making sure undesirable properties are not re-introduced by subsequent transformations
3. Continue until desired form is achieved

Theorem 11.3: Rule Substitution
  Statement:
    Let G = (V, Σ, R, S) be a CFG with a rule r_i of the form X → αYβ, where α, β ∈ V*, Y ∈ V − Σ
    Let Y → γ_1 | γ_2 | ... | γ_n be the set of rules in R with Y as LHS
    Let R' = (R − {r_i}) ∪ {X → αγ_1β, X → αγ_2β, ..., X → αγ_nβ}
        G' = (V, Σ, R', S)
    Then, L(G') = L(G)
  Proof: (See p 235)
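
The construction of R' in the theorem can be sketched as a small Python function. The representation (rules as (lhs, rhs) pairs with tuple right-hand sides) and the name substitute are assumptions of this example.

```python
def substitute(rules, rule, position):
    """Build R': replace rule X -> alpha Y beta by X -> alpha gamma_i beta for every Y -> gamma_i.

    position is the index of Y in the right-hand side of rule.
    """
    lhs, rhs = rule
    y = rhs[position]
    expansions = [r for l, r in rules if l == y]          # gamma_1 ... gamma_n
    new_rules = [r for r in rules if r != rule]           # R - {r_i}
    for gamma in expansions:
        candidate = (lhs, rhs[:position] + gamma + rhs[position + 1:])
        if candidate not in new_rules:
            new_rules.append(candidate)
    return new_rules
```

For instance, substituting Y in X → aYb with Y → c | dd yields X → acb | addb while the Y rules stay in place.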

Context Free Grammars: Converting to Normal Forms - CNF

Theorem 11.1: CNF
  Statement: Given CFG G, there is a CNF grammar G' such that L(G') = L(G) − {ε}
  Proof: By construction. Algorithm uses 4 basic steps

    CFG converttocnf (CFG G) {
        Eliminate epsilon rules;
        Eliminate unit productions;
        Eliminate rules where |RHS| > 1 and have terminal symbol on RHS;
        Eliminate rules where |RHS| > 2;
        return (G.V, G.Sigma, modified rules, G.S);
    }

1. ε productions: See above
2. Unit productions
  - Rules of form X → Y, where Y ∈ V − Σ
  - Replace with rules of form X → α, where α ∈ Σ or |α| > 1
  - Algorithm:

    CFG removeunits (CFG G) {
        R' = G.R;
        visited = NULL;
        while (unit productions remain in R') {
            r = some (X -> Y) in R';
            R' = R' - r;
            visited = visited + r;
            for (each r' = (Y -> beta) in R')
                if ((X -> beta) not in visited) {
                    R' = R' + (X -> beta);
                    visited = visited + (X -> beta);
                }
        }
        return (G.V, G.Sigma, R', G.S);
    }
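
A Python sketch of unit-production removal, assuming rules are (lhs, rhs) pairs with tuple right-hand sides. The function name remove_units is illustrative, and two details are choices of this sketch rather than of the slide: unit rules are picked in sorted order to make runs repeatable, and rules of the form X → X are never added.

```python
def remove_units(rules, nonterminals):
    """Return a rule set without unit productions X -> Y (Y a non-terminal)."""
    rules = set(rules)
    removed = set()                       # unit rules removed so far; never re-added
    while True:
        unit = next(((l, r) for l, r in sorted(rules)
                     if len(r) == 1 and r[0] in nonterminals), None)
        if unit is None:
            return rules
        rules.remove(unit)
        removed.add(unit)
        x, (y,) = unit
        for lhs, rhs in list(rules):
            # Copy each remaining Y-rule to X, skipping X -> X and removed rules.
            if lhs == y and rhs != (x,) and (x, rhs) not in removed:
                rules.add((x, rhs))
```

The removed set is what guarantees termination: each of the finitely many possible unit rules is deleted at most once.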

Context Free Grammars: Converting to Normal Forms - CNF (2)

3. Replacing rules where |RHS| > 1 and RHS contains a terminal
  - Algorithm:

    CFG removemixed (CFG G) {
        R' = G.R;
        for (each a in G.Sigma) {
            create non-terminal symbol Ta;
            G.V = G.V + Ta;
            R' = R' + (Ta -> a);
        }
        for (each r in R') {
            if (length(r.rhs) > 1)
                for (each s in r.rhs)
                    if (s in G.Sigma) replace s in r with Ts;
        }
        return (G.V, G.Sigma, R', G.S);
    }
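
A minimal Python sketch of this step, assuming rules are (lhs, rhs) pairs with tuple right-hand sides. The naming scheme "T" + terminal for the new non-terminals is a hypothetical choice and assumes those names are not already in use.

```python
def remove_mixed(nonterminals, terminals, rules):
    """Replace terminals inside right-hand sides longer than 1 by proxy non-terminals."""
    proxy = {a: "T" + a for a in sorted(terminals)}      # Ta stands in for terminal a
    new_rules = [(proxy[a], (a,)) for a in sorted(terminals)]
    for lhs, rhs in rules:
        if len(rhs) > 1:
            rhs = tuple(proxy.get(s, s) for s in rhs)    # non-terminals are left untouched
        new_rules.append((lhs, rhs))
    return set(nonterminals) | set(proxy.values()), new_rules
```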

Context Free Grammars: Converting to Normal Forms - CNF (3)

4. Replace rules where |RHS| > 2 with rules where |RHS| = 2
  - Algorithm (R_1, R_2, ... are new NTs, created fresh for each long rule):

    CFG removelong (CFG G) {
        R' = NULL;
        for (each r in G.R)
            if (length(r.rhs) > 2) {
                n = length(r.rhs);              // r = (X -> Y_1 Y_2 ... Y_n)
                R' = R' + (X -> Y_1 R_1);
                G.V = G.V + R_1;
                for (i = 1; i < n - 2; i++) {
                    G.V = G.V + R_(i+1);
                    R' = R' + (R_i -> Y_(i+1) R_(i+1));
                }
                R' = R' + (R_(n-2) -> Y_(n-1) Y_n);
            }
            else R' = R' + r;
        return (G.V, G.Sigma, R', G.S);
    }
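
The chain of new NTs can be built iteratively, as in this Python sketch. It assumes rules are (lhs, rhs) pairs with tuple right-hand sides; the generated names M1, M2, ... are hypothetical and assumed not to clash with existing symbols.

```python
from itertools import count

def remove_long(nonterminals, rules):
    """Split every rule with more than 2 right-hand-side symbols into binary rules."""
    fresh = count(1)                      # numbers the new symbols M1, M2, ...
    new_nts = set(nonterminals)
    new_rules = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            m = f"M{next(fresh)}"
            new_nts.add(m)
            new_rules.append((lhs, (rhs[0], m)))   # X -> Y_1 M
            lhs, rhs = m, rhs[1:]                  # continue with M -> Y_2 ... Y_n
        new_rules.append((lhs, rhs))
    return new_nts, new_rules
```

A rule with n > 2 symbols on its RHS becomes n − 1 binary rules, which is why this step grows the grammar only linearly.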

Context Free Grammars: Converting to Normal Form - CNF Analysis

Analysis

1. Removing ε rules
  - Worst case: Consider a rule of the form X → A_1 A_2 ... A_k where each A_i is nullable
      Result has 2^k − 1 rules
      Since k ≤ n, grammar size is O(2^n)
  - To make more tractable, use short rules
      Apply removelong first, guaranteeing rules of length no greater than 2
      Ensures ε removal is linear
2. Removing unit productions
  - Halts in at most |V − Σ|^2 steps
  - Each step takes constant time and produces at most one new rule
  - O(n^2)
3. Eliminating terminal symbols from mixed rules
  - Grows linearly
4. Eliminating rules where |RHS| > 2
  - Grows linearly
5. Overall complexity:
  - Time: O(n^2)
  - Grammar: O(n^2)
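
The 2^k − 1 worst case can be checked by enumerating every way of deleting nullable symbols from one right-hand side. The helper eps_free_variants below is an illustrative sketch, not part of the conversion algorithm itself.

```python
from itertools import combinations

def eps_free_variants(rhs, nullable):
    """All non-empty right-hand sides obtained by deleting nullable symbols from rhs."""
    positions = [i for i, s in enumerate(rhs) if s in nullable]
    variants = set()
    for r in range(len(positions) + 1):
        for drop in combinations(positions, r):
            kept = tuple(s for i, s in enumerate(rhs) if i not in drop)
            if kept:                      # the empty right-hand side is excluded
                variants.add(kept)
    return variants
```

With k distinct nullable symbols there are 2^k subsets to delete, and only the one that deletes everything is excluded. For a rule of length 2 this gives at most 3 variants, which is why running removelong first keeps ε removal linear.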

Context Free Grammars: Converting to Normal Form - GNF

Theorem 11.2: GNF
  Statement: Given CFG G, there is a GNF grammar G' such that L(G') = L(G) − {ε}
  Proof: By construction (See Appendix D.1). Algorithm uses 3 basic steps

    CFG converttognf (CFG G) {
        G = converttocnf(G);
        Modify rules so that they are of the form
            S -> ε
            A -> cα, where c ∈ Σ, α ∈ (V − Σ)*
            A -> Bα, where B ∈ V − Σ, α ∈ (V − Σ)*
        Modify rules so that they are in GNF;
        return (G.V, G.Sigma, modified rules, G.S);
    }

1. Convert G to CNF (See CNF algorithm)
2. Modify the rules so that they are of the form
       S → ε
       A → cα, where c ∈ Σ, α ∈ (V − Σ)*
       A → Bα, where B ∈ V − Σ, α ∈ (V − Σ)*
   (a) Number the NTs
     - S is numbered 1
     - Order of the remaining NTs is immaterial
     - The algorithm considers rules in order of the values of the NTs on their LHSs
   (b) Left recursion is eliminated
     - Direct left recursion results from rules of the form X → Xα
     - Strings are generated by adding symbols to the right of the recursive NT in a derivation
     - Strings grow from right to left
     - Terminal symbols do not appear as the leftmost characters until the left recursive NT is replaced by one that is not

Context Free Grammars: Converting to Normal Form - GNF (2)

     To eliminate direct left recursion
     - Consider rules
           A → A u_1 | A u_2 | ... | A u_j
           A → v_1 | v_2 | ... | v_k, where the first symbol of each v_i is not A
     - Replace these rules with the following
           Z → u_1 | u_2 | ... | u_j | u_1 Z | u_2 Z | ... | u_j Z, where Z is a new symbol
           A → v_1 | v_2 | ... | v_k | v_1 Z | v_2 Z | ... | v_k Z
   (c) The goal of this step is to ensure that value(X) < value(Y) for rules of the form X → Yγ, where Y ∈ V − Σ
     - NTs Y that violate the above restriction are replaced with the RHSs of rules that have Y as their LHS
     - Substitutions may introduce additional left recursion, which must be removed
     - Variables introduced by the elimination of left recursion are not included in this processing
     - At the end of this step,
         Rules with the highest-numbered NT on their LHS will have a terminal symbol as the first symbol on their RHS
         Rules with the NT numbered n on their LHS will have only a terminal symbol, or an NT numbered > n, as the first symbol on their RHS
3. Modify rules to GNF
  - Process rules, based on the number associated with their LHSs, from highest to lowest
  - For rules of the form X → Yγ, where Y ∈ V − Σ, replace Y with the RHSs of rules with Y on the LHS
  - Note that this does not address rules introduced by eliminating left recursion
4. Apply the above step to the rules that have new symbols on their LHSs

Note that while L(G') = L(G) − {ε}, parse trees for a given string may differ based on the creation of G'
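
The replacement scheme for direct left recursion can be sketched in Python. Rules are assumed to be (lhs, rhs) pairs with tuple right-hand sides; the function name and the caller-supplied fresh symbol z are assumptions of this example.

```python
def eliminate_direct_left_recursion(rules, a, z):
    """Rewrite the rules for non-terminal a so that none starts with a; z is a new symbol."""
    # u_i: what follows a in the left-recursive rules a -> a u_i
    recursive = [rhs[1:] for lhs, rhs in rules if lhs == a and rhs[:1] == (a,)]
    # v_i: right-hand sides of the remaining rules for a
    others = [rhs for lhs, rhs in rules if lhs == a and rhs[:1] != (a,)]
    if not recursive:
        return list(rules)                # nothing to do
    result = [r for r in rules if r[0] != a]
    result += [(a, v) for v in others] + [(a, v + (z,)) for v in others]
    result += [(z, u) for u in recursive] + [(z, u + (z,)) for u in recursive]
    return result
```

For A → Aa | Ab | c | d this gives A → c | d | cZ | dZ and Z → a | b | aZ | bZ: the string now grows from left to right, and every A rule starts with a symbol other than A.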