Context-Free Grammars and Languages

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

Outline
Context-free grammars
Parse trees

Palindrome Example
Consider the language of palindromes, $L = \{w \in \{0,1\}^* \mid w = w^R\}$, where a palindrome is a string that reads the same forward and backward (e.g., otto).
Question: Is there a recursive definition of $L$?
Answer: Yes. A palindrome of length at least 2 must begin and end with the same symbol, and what remains after removing both symbols is again a palindrome. This leads to:
Basis: $\epsilon$, $0$, and $1$ are palindromes.
Induction: If $w$ is a palindrome, so are $0w0$ and $1w1$.
No string of 0's and 1's is a palindrome unless it follows from this basis and induction rule.

Grammar: Palindrome Example
$G_{pal} = (\{S\}, \{0,1\}, S, P)$, where $P$ consists of the productions
$S \to \epsilon$, $S \to 0$, $S \to 1$, $S \to 0S0$, $S \to 1S1$.
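
The five productions of the palindrome grammar can be read directly as a recursive membership test. The following is a minimal sketch, not part of the original slides; the function name `in_gpal` is a hypothetical choice made for illustration.

```python
def in_gpal(w: str) -> bool:
    """Return True if w can be derived from S in the palindrome grammar."""
    # Only strings over the terminal alphabet {0, 1} can be generated.
    if any(c not in "01" for c in w):
        return False
    # Basis productions: S -> epsilon | 0 | 1.
    if len(w) <= 1:
        return True
    # Inductive productions: S -> 0S0 | 1S1.
    return w[0] == w[-1] and in_gpal(w[1:-1])


print(in_gpal("0110"))  # True
print(in_gpal("0111"))  # False
```

Each recursive call strips one application of $S \to 0S0$ or $S \to 1S1$, so the recursion depth is about half the length of the input.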

Context-Free Grammars
Definition: A grammar $G = (V, T, S, P)$ is said to be context-free if all productions in $P$ are of the form $A \to x$, where $A \in V$ and $x \in (V \cup T)^*$. Here $A$ is the head and $x$ is the body of the production.
There is no restriction on the right-hand side of a production rule.
The left-hand side of a production rule is restricted to a single variable.

Example: Consider the grammar $G = (V, T, S, P)$ with productions $S \to aSa \mid bSb \mid \epsilon$. A typical derivation in this grammar is
$S \Rightarrow aSa \Rightarrow aaSaa \Rightarrow aabSbaa \Rightarrow aabbaa$.
This makes it clear that $L(G) = \{ww^R \mid w \in \{a,b\}^*\}$. This language is not regular, but it is context-free.

Derivations Using Grammars
We apply the productions of a CFG to infer that certain strings are in the language. There are two approaches to this inference:
Recursive inference: use productions from body to head.
Derivation: use productions from head to body (leftmost derivation or rightmost derivation).
See Fig. 5.2 and 5.3 for recursive inference and Ex. 5.6 for derivation (pp. 178-179).

Consider the following CFG $G = (\{E, I\}, \{+, *, (, ), a, b, 0, 1\}, E, P)$ with productions
1. $E \to I$
2. $E \to E + E$
3. $E \to E * E$
4. $E \to (E)$
5. $I \to a$
6. $I \to b$
7. $I \to Ia$
8. $I \to Ib$
9. $I \to I0$
10. $I \to I1$
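
Recursive inference on this grammar can be sketched as a fixed-point computation: starting from nothing, keep applying productions from body to head until no new string is inferred. The code below is an illustration under stated assumptions, not the slides' own procedure. The names `PRODUCTIONS`, `VARIABLES`, and `infer` are hypothetical, and a length bound `max_len` is added so that the computation terminates.

```python
from itertools import product

# Productions as (head, body); each body is a tuple of symbols.
PRODUCTIONS = [
    ("E", ("I",)), ("E", ("E", "+", "E")), ("E", ("E", "*", "E")),
    ("E", ("(", "E", ")")),
    ("I", ("a",)), ("I", ("b",)), ("I", ("I", "a")), ("I", ("I", "b")),
    ("I", ("I", "0")), ("I", ("I", "1")),
]
VARIABLES = {"E", "I"}


def infer(max_len: int) -> dict:
    """Strings of length <= max_len inferred for each variable."""
    known = {v: set() for v in VARIABLES}
    changed = True
    while changed:
        changed = False
        for head, body in PRODUCTIONS:
            # A body symbol contributes itself (terminal) or any string
            # already inferred for it (variable). Snapshot the sets so
            # they can be extended while we iterate.
            choices = [list(known[x]) if x in VARIABLES else [x] for x in body]
            for parts in product(*choices):
                w = "".join(parts)
                if len(w) <= max_len and w not in known[head]:
                    known[head].add(w)
                    changed = True
    return known


known = infer(3)
print("a+b" in known["E"])  # True
print("a1" in known["I"])   # True
print("+" in known["E"])    # False
```

The bound is kept small on purpose: the number of candidate combinations grows quickly with `max_len`, so this brute-force sketch is only meant to make the body-to-head direction concrete.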

Context-Free Languages
Definition: A language $L$ is said to be context-free iff there is a context-free grammar $G$ such that $L = L(G)$, where $L(G) = \{w \in T^* \mid S \Rightarrow_G^* w\}$.
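
The definition of $L(G)$ can be explored by searching over derivations. The sketch below enumerates the terminal strings of bounded length reachable from the start symbol, always rewriting the leftmost variable. The function name `language_up_to` and the dictionary encoding of the grammar are assumptions made for illustration. The search is only guaranteed to stop for grammars such as $G_{pal}$, where every step either adds terminals or removes the last variable.

```python
from collections import deque


def language_up_to(productions, start, max_len):
    """Terminal strings w with start =>* w and len(w) <= max_len.

    Variables are the keys of `productions`; every other character is a
    terminal. Sentential forms with too many terminals are pruned.
    """
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, if any.
        pos = next((i for i, s in enumerate(form) if s in productions), None)
        if pos is None:
            found.add(form)
            continue
        for body in productions[form[pos]]:
            new = form[:pos] + body + form[pos + 1:]
            terminals = sum(1 for s in new if s not in productions)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found


G_PAL = {"S": ["", "0", "1", "0S0", "1S1"]}
words = language_up_to(G_PAL, "S", 3)
print(sorted(words, key=lambda w: (len(w), w)))
# ['', '0', '1', '00', '11', '000', '010', '101', '111']
```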

The Language of $G_{pal}$
Theorem: $L(G_{pal}) = \{w \in \{0,1\}^* \mid w = w^R\}$. That is, for $w \in \{0,1\}^*$, $w \in L(G_{pal})$ iff $w = w^R$.
Proof ("if" part). Suppose $w = w^R$. We prove by induction on $|w|$ that $w \in L(G_{pal})$.
Basis: $|w| = 0$ or $|w| = 1$. Then $w$ is $\epsilon$, $0$, or $1$. Since $S \to \epsilon \mid 0 \mid 1$ are productions, we conclude that $S \Rightarrow_G^* w$ in all base cases.
Induction: Suppose $|w| \ge 2$. Since $w = w^R$, we have $w = 0x0$ or $w = 1x1$, with $x = x^R$. If $w = 0x0$, we know from the IH that $S \Rightarrow^* x$. Then $S \Rightarrow 0S0 \Rightarrow^* 0x0 = w$. The case $w = 1x1$ is similar.

Proof ("only if" part). We assume that $w \in L(G_{pal})$ and must show that $w = w^R$. Since $w \in L(G_{pal})$, we have $S \Rightarrow^* w$. We prove the claim by induction on the length of the derivation.
Basis: The derivation $S \Rightarrow^* w$ is done in one step. Then $w$ must be $\epsilon$, $0$, or $1$, all palindromes.
Induction: The IH assumes that $S \Rightarrow^* x$ in $n$ steps implies $x = x^R$. Suppose the derivation of $w$ takes $n + 1$ steps. Then we must have
$S \Rightarrow 0S0 \Rightarrow^* 0x0 = w$, or
$S \Rightarrow 1S1 \Rightarrow^* 1x1 = w$.
By the IH, $x = x^R$, and hence $w = w^R$.

Example: Show that $L = \{a^n b^m \mid n \ne m\}$ is a CFL.
Solution. Note that the CFG $G = (\{S\}, \{a, b\}, S, P)$ with productions $S \to aSb \mid \epsilon$ leads to $L(G) = \{a^n b^n \mid n \ge 0\}$.
To take care of the case $n > m$, we first generate a string with an equal number of a's and b's, then add extra a's on the left, leading to
$S \to AS_1$, $S_1 \to aS_1b \mid \epsilon$, $A \to aA \mid a$.
We use similar reasoning for the case $n < m$. Thus, the CFG for $L$ is given by
$S \to AS_1 \mid S_1B$, $S_1 \to aS_1b \mid \epsilon$, $A \to aA \mid a$, $B \to bB \mid b$.
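
To see the construction at work, the sketch below builds a leftmost derivation of $a^n b^m$ in the final grammar: the balanced core $a^k b^k$ with $k = \min(n, m)$ comes from $S_1$, and the surplus symbols come from $A$ or $B$. The helper names `leftmost_step` and `derive`, and the token `"S1"` standing for $S_1$, are hypothetical choices for this illustration.

```python
VARIABLES = {"S", "S1", "A", "B"}


def leftmost_step(form, head, body):
    """Rewrite the leftmost variable of `form`, which must be `head`."""
    i = next(k for k, s in enumerate(form) if s in VARIABLES)
    if form[i] != head:
        raise ValueError(f"leftmost variable is {form[i]}, not {head}")
    return form[:i] + body + form[i + 1:]


def derive(n, m):
    """Leftmost derivation of a^n b^m (n != m) as a list of sentential forms."""
    if n < 0 or m < 0 or n == m:
        raise ValueError("a^n b^m is in L only when n != m")
    k, extra = min(n, m), abs(n - m)
    # S1 -> a S1 b (k times), then S1 -> epsilon.
    core = [("S1", ["a", "S1", "b"])] * k + [("S1", [])]
    if n > m:
        # A -> aA (extra - 1 times), then A -> a.
        surplus = [("A", ["a", "A"])] * (extra - 1) + [("A", ["a"])]
        plan = [("S", ["A", "S1"])] + surplus + core
    else:
        # B -> bB (extra - 1 times), then B -> b.
        surplus = [("B", ["b", "B"])] * (extra - 1) + [("B", ["b"])]
        plan = [("S", ["S1", "B"])] + core + surplus
    forms = [["S"]]
    for head, body in plan:
        forms.append(leftmost_step(forms[-1], head, body))
    return forms


print(" => ".join(" ".join(f) for f in derive(3, 1)))
# S => A S1 => a A S1 => a a S1 => a a a S1 b => a a a b
```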

Leftmost and Rightmost Derivations
In CFGs that are not linear, a derivation may involve sentential forms with more than one variable. In such cases, we have a choice in the order in which variables are replaced.
A derivation is said to be leftmost (rightmost) if in each step the leftmost (rightmost) variable in the sentential form is replaced.

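Consider $G = (\{A, B, S\}, \{a, b\}, S, P)$ with productions
1. $S \to AB$
2. $A \to aaA$
3. $A \to \epsilon$
4. $B \to Bb$
5. $B \to \epsilon$
The following two derivations use the same productions and produce the same sentence, although the order in which the productions are applied is different (the number over each arrow is the production used):
$S \overset{1}{\Rightarrow} AB \overset{2}{\Rightarrow} aaAB \overset{3}{\Rightarrow} aaB \overset{4}{\Rightarrow} aaBb \overset{5}{\Rightarrow} aab$,
$S \overset{1}{\Rightarrow} AB \overset{4}{\Rightarrow} ABb \overset{2}{\Rightarrow} aaABb \overset{5}{\Rightarrow} aaAb \overset{3}{\Rightarrow} aab$.
Note that $L(G) = \{a^{2n} b^m \mid n \ge 0, m \ge 0\}$.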
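
The effect of the replacement order can be checked mechanically. The sketch below applies a sequence of numbered productions of this grammar, always to the leftmost or always to the rightmost variable. The names `PRODUCTIONS` and `derive` are hypothetical, and the rightmost order `[1, 4, 5, 2, 3]` is an example chosen here, not one of the two derivations shown above.

```python
PRODUCTIONS = {1: ("S", "AB"), 2: ("A", "aaA"), 3: ("A", ""),
               4: ("B", "Bb"), 5: ("B", "")}
VARIABLES = "SAB"


def derive(order, leftmost=True):
    """Apply numbered productions to the leftmost (or rightmost) variable."""
    form, forms = "S", ["S"]
    for num in order:
        head, body = PRODUCTIONS[num]
        positions = [i for i, s in enumerate(form) if s in VARIABLES]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != head:
            raise ValueError(f"production {num} does not apply to {form!r}")
        form = form[:i] + body + form[i + 1:]
        forms.append(form)
    return forms


print(" => ".join(derive([1, 2, 3, 4, 5], leftmost=True)))
# S => AB => aaAB => aaB => aaBb => aab
print(" => ".join(derive([1, 4, 5, 2, 3], leftmost=False)))
# S => AB => ABb => Ab => aaAb => aab
```

Both runs use each production exactly once and end in the same sentence `aab`; only the intermediate sentential forms differ.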

Parse Trees
Definition: An ordered tree for a CFG $G = (V, T, S, P)$ is a parse tree for $G$ if and only if
1. The root is labeled $S$.
2. Every leaf has a label from $T \cup \{\epsilon\}$.
3. Every interior vertex (a vertex which is not a leaf) is labeled by a variable in $V$.
4. If a vertex is labeled $A$ and its children are labeled $a_1, a_2, \ldots, a_n$, then $P$ must contain $A \to a_1 a_2 \cdots a_n$.
5. If a leaf is labeled $\epsilon$, then it must be the only child of its parent.
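
The five conditions translate almost line by line into a checker. In the sketch below a tree is a nested pair `(label, children)`, an $\epsilon$ leaf is written as the empty string, and a grammar is a dictionary from each variable to its list of bodies. These encodings and the names `is_parse_tree` and `_check` are assumptions made for illustration.

```python
def is_parse_tree(tree, productions, start):
    """Check conditions 1-5 for a tree given as (label, children)."""
    if tree[0] != start:                       # 1. root is labeled S
        return False
    return _check(tree, productions)


def _check(node, productions):
    label, children = node
    if not children:
        # 2. a leaf carries a terminal or epsilon (written "" here)
        return label not in productions
    if label not in productions:               # 3. interior labels are variables
        return False
    body = [child[0] for child in children]
    if "" in body and len(body) > 1:           # 5. epsilon is an only child
        return False
    # 4. the children's labels form a production body
    #    (an epsilon leaf stands for the empty body)
    if [s for s in body if s != ""] not in productions[label]:
        return False
    return all(_check(child, productions) for child in children)


# Palindrome grammar P -> epsilon | 0 | 1 | 0P0 | 1P1.
G = {"P": [[], ["0"], ["1"], ["0", "P", "0"], ["1", "P", "1"]]}
good = ("P", [("0", []),
              ("P", [("1", []), ("P", [("", [])]), ("1", [])]),
              ("0", [])])
bad = ("P", [("0", []), ("P", [("", [])]), ("1", [])])
print(is_parse_tree(good, G, "P"))  # True
print(is_parse_tree(bad, G, "P"))   # False
```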

More to Say about Parse Trees
A parse tree tells us the syntactic structure of $w$.
It is an alternative representation to derivations and recursive inference.
There can be several parse trees for the same string (ambiguity).
Ideally there should be only one parse tree (the true structure) for each string, i.e., the language should be unambiguous.
Unfortunately, we cannot always remove the ambiguity.

Example: In the grammar $E \to I$, $E \to E + E$, $E \to E * E$, $E \to (E)$, $\ldots$ the following is the parse tree which shows the derivation $E \Rightarrow^* I + E$.
      E
    / | \
   E  +  E
   |
   I

Example: In the grammar $P \to \epsilon \mid 0 \mid 1 \mid 0P0 \mid 1P1$, the following is the parse tree which shows the derivation $P \Rightarrow^* 0110$.
      P
    / | \
   0  P  0
    / | \
   1  P  1
      |
      ε

The Yield of a Parse Tree
The yield of a parse tree is the string of leaves from left to right.
The important parse trees are those where:
1. The yield is a terminal string.
2. The root is labeled by the start symbol.
We shall see that the set of yields of these important parse trees is the language of the grammar.

The yield is $a * (a + b00)$.
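
Reading off a yield is a left-to-right traversal of the leaves. The sketch below encodes a parse tree of the expression grammar as nested pairs `(label, children)` and concatenates the leaf labels. The encoding, the helper `leaf`, and the function name `tree_yield` are assumptions made for illustration.

```python
def tree_yield(node):
    """Concatenate the leaf labels of a (label, children) tree, left to right."""
    label, children = node
    if not children:
        return label  # an epsilon leaf would be the empty string
    return "".join(tree_yield(child) for child in children)


def leaf(symbol):
    return (symbol, [])


# Parse tree for a * (a + b00) in the grammar E -> I | E+E | E*E | (E),
# I -> a | b | Ia | Ib | I0 | I1.
I_a = ("I", [leaf("a")])
I_b00 = ("I", [("I", [("I", [leaf("b")]), leaf("0")]), leaf("0")])
inner_sum = ("E", [("E", [I_a]), leaf("+"), ("E", [I_b00])])
tree = ("E", [("E", [I_a]), leaf("*"),
              ("E", [leaf("("), inner_sum, leaf(")")])])

print(tree_yield(tree))  # a*(a+b00)
```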

Let $G = (V, T, S, P)$ be a CFG and $A \in V$. We will show that the following are equivalent:
1. We can determine by recursive inference that $w$ is in the language of variable $A$.
2. $A \Rightarrow^* w$.
3. $A \Rightarrow_{lm}^* w$.
4. $A \Rightarrow_{rm}^* w$.
5. There is a parse tree of $G$ with root $A$ and yield $w$.

Plan of the proof: inferences → trees → derivations → inferences.

From Inferences to Trees
Theorem: Let $G = (V, T, S, P)$ be a CFG. If the recursive inference procedure tells us that terminal string $w$ is in the language of variable $A$, then there is a parse tree with root $A$ and yield $w$.
Proof. We do an induction on the length of the inference.
Basis: One step. Then we must have used a production $A \to w$. The desired parse tree is then the tree with root $A$ whose children are leaves that, read from left to right, spell $w$.

Induction: $w$ is inferred in $n + 1$ steps. Suppose that the last step was based on a production $A \to X_1 X_2 \cdots X_k$, where $X_i \in V \cup T$.
We break $w$ up as $w_1 w_2 \cdots w_k$, where $w_i = X_i$ when $X_i \in T$, and when $X_i \in V$, $w_i$ was previously inferred to be in the language of $X_i$ in at most $n$ steps.
By the IH, for each such $i$ there is a parse tree with root $X_i$ and yield $w_i$. Then the following is a parse tree for $G$ with root $A$ and yield $w$:
          A
     /    |     \
   X_1   X_2 ... X_k
    |     |       |
   w_1   w_2 ... w_k

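From Trees to Derivations
We will show how to construct a leftmost derivation for a parse tree.
Example: In the expression grammar above (variables $E$ and $I$), there clearly is a derivation
$E \Rightarrow I \Rightarrow Ib \Rightarrow ab$.
Then, for any $\alpha$ and $\beta$ there is a derivation
$\alpha E \beta \Rightarrow \alpha I \beta \Rightarrow \alpha Ib \beta \Rightarrow \alpha ab \beta$.
For example, suppose we have a derivation $E \Rightarrow E + E \Rightarrow E + (E)$. We can choose $\alpha = E + ($ and $\beta = )$ and continue the derivation as
$E + (E) \Rightarrow E + (I) \Rightarrow E + (Ib) \Rightarrow E + (ab)$.
This is why CFGs are called context-free.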

Theorem: Let $G = (V, T, S, P)$ be a CFG and suppose there is a parse tree with root labeled $A$ and yield $w$. Then $A \Rightarrow_{lm}^* w$ in $G$.
Proof. We do an induction on the height of the parse tree.
Basis: Height is 1. The tree must consist of the root $A$ whose children are leaves that together spell $w$. Consequently $A \to w \in P$ and $A \Rightarrow_{lm} w$.

Induction: Height is $n + 1$. The tree must look like
          A
     /    |     \
   X_1   X_2 ... X_k
    |     |       |
   w_1   w_2 ... w_k
Then $w = w_1 w_2 \cdots w_k$, where
1. If $X_i \in T$, then $w_i = X_i$.
2. If $X_i \in V$, then $X_i \Rightarrow_{lm}^* w_i$ in $G$ by the IH.

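Now we construct $A \Rightarrow_{lm}^* w$ by an inner induction, showing that for all $i$:
$A \Rightarrow_{lm}^* w_1 w_2 \cdots w_i X_{i+1} X_{i+2} \cdots X_k$.
Basis (inner): Let $i = 0$. We already know that $A \Rightarrow_{lm} X_1 X_2 \cdots X_k$.
Induction (inner): Make the IH that $A \Rightarrow_{lm}^* w_1 w_2 \cdots w_{i-1} X_i X_{i+1} \cdots X_k$.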

Case 1: $X_i \in T$. Do nothing, since $X_i = w_i$ gives us $A \Rightarrow_{lm}^* w_1 w_2 \cdots w_i X_{i+1} X_{i+2} \cdots X_k$.

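Case 2: $X_i \in V$. By the IH there is a derivation $X_i \Rightarrow_{lm} \alpha_1 \Rightarrow_{lm} \alpha_2 \Rightarrow_{lm} \cdots \Rightarrow_{lm} w_i$. By the context-free property of derivations we can proceed with
$A \Rightarrow_{lm}^* w_1 w_2 \cdots w_{i-1} X_i X_{i+1} \cdots X_k$
$\Rightarrow_{lm} w_1 w_2 \cdots w_{i-1} \alpha_1 X_{i+1} \cdots X_k$
$\Rightarrow_{lm} w_1 w_2 \cdots w_{i-1} \alpha_2 X_{i+1} \cdots X_k$
$\Rightarrow_{lm} \cdots$
$\Rightarrow_{lm} w_1 w_2 \cdots w_{i-1} w_i X_{i+1} \cdots X_k$.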

Example: Let's construct the leftmost derivation for the parse tree of $a * (a + b00)$.

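Suppose we have inductively constructed the leftmost derivation
$E \Rightarrow_{lm} I \Rightarrow_{lm} a$
corresponding to the leftmost subtree, and the leftmost derivation
$E \Rightarrow_{lm} (E) \Rightarrow_{lm} (E + E) \Rightarrow_{lm} (I + E) \Rightarrow_{lm} (a + E) \Rightarrow_{lm} (a + I) \Rightarrow_{lm} (a + I0) \Rightarrow_{lm} (a + I00) \Rightarrow_{lm} (a + b00)$
corresponding to the rightmost subtree.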

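For the derivation corresponding to the whole tree, we start with $E \Rightarrow_{lm} E * E$ and expand the first $E$ with the first derivation and the second $E$ with the second derivation:
$E \Rightarrow_{lm} E * E \Rightarrow_{lm} I * E \Rightarrow_{lm} a * E \Rightarrow_{lm} a * (E) \Rightarrow_{lm} a * (E + E) \Rightarrow_{lm} a * (I + E) \Rightarrow_{lm} a * (a + E) \Rightarrow_{lm} a * (a + I) \Rightarrow_{lm} a * (a + I0) \Rightarrow_{lm} a * (a + I00) \Rightarrow_{lm} a * (a + b00)$.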
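
The construction in the proof amounts to repeatedly replacing the leftmost unexpanded interior vertex by its children. The sketch below does this for the parse tree of $a * (a + b00)$, with trees encoded as nested pairs `(label, children)`. The encoding and the names `leaf` and `leftmost_derivation` are assumptions made for illustration.

```python
def leftmost_derivation(tree):
    """Sentential forms of the leftmost derivation encoded by a parse tree."""
    # A form is a list of subtrees; a subtree with children is a variable
    # that still has to be expanded.
    form = [tree]
    steps = [form]
    while True:
        i = next((k for k, (_, kids) in enumerate(form) if kids), None)
        if i is None:
            break
        # Replace the leftmost unexpanded variable by its children.
        form = form[:i] + form[i][1] + form[i + 1:]
        steps.append(form)
    return ["".join(label for label, _ in f) for f in steps]


def leaf(symbol):
    return (symbol, [])


I_a = ("I", [leaf("a")])
I_b00 = ("I", [("I", [("I", [leaf("b")]), leaf("0")]), leaf("0")])
inner_sum = ("E", [("E", [I_a]), leaf("+"), ("E", [I_b00])])
tree = ("E", [("E", [I_a]), leaf("*"),
              ("E", [leaf("("), inner_sum, leaf(")")])])

print(" => ".join(leftmost_derivation(tree)))
# E => E*E => I*E => a*E => a*(E) => a*(E+E) => a*(I+E) => a*(a+E)
#   => a*(a+I) => a*(a+I0) => a*(a+I00) => a*(a+b00)   (printed on one line)
```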

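From Derivations to Recursive Inferences
Observation: Suppose that $A \Rightarrow X_1 X_2 \cdots X_k \Rightarrow^* w$. Then $w = w_1 w_2 \cdots w_k$, where $X_i \Rightarrow^* w_i$.
The factor $w_i$ can be extracted from $A \Rightarrow^* w$ by looking at the expansion of $X_i$ only.
Example: $E \Rightarrow^* a * b + a$ and
$E \Rightarrow^* \underbrace{E}_{X_1} \underbrace{*}_{X_2} \underbrace{E}_{X_3} \underbrace{+}_{X_4} \underbrace{E}_{X_5}$.
We have
$E \Rightarrow E * E \Rightarrow E * E + E \Rightarrow I * E + E \Rightarrow I * I + E \Rightarrow I * I + I \Rightarrow a * I + I \Rightarrow a * b + I \Rightarrow a * b + a$.
By looking at the expansion of $X_3 = E$ only, we can extract $E \Rightarrow I \Rightarrow b$.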

Theorem: Let $G = (V, T, S, P)$ be a CFG. Suppose $A \Rightarrow_G^* w$ and that $w$ is a string of terminals. Then we can infer that $w$ is in the language of variable $A$.
Proof. We do an induction on the length of the derivation $A \Rightarrow_G^* w$.
Basis: One step. If $A \Rightarrow_G w$, there must be a production $A \to w$ in $P$. Then we can infer that $w$ is in the language of $A$.
Induction: Suppose $A \Rightarrow_G^* w$ in $n + 1$ steps. Write the derivation as $A \Rightarrow_G X_1 X_2 \cdots X_k \Rightarrow_G^* w$.
As noted on the previous slide, we can break $w$ up as $w_1 w_2 \cdots w_k$ where $X_i \Rightarrow_G^* w_i$. Furthermore, $X_i \Rightarrow_G^* w_i$ can use at most $n$ steps.
Now we have a production $A \to X_1 X_2 \cdots X_k$, and we know by the IH that we can infer $w_i$ to be in the language of $X_i$. Therefore we can infer $w_1 w_2 \cdots w_k$ to be in the language of $A$.

Ambiguity in Grammars and Languages: Example
In the grammar $E \to I$, $E \to E + E$, $E \to E * E$, $E \to (E)$, $\ldots$ the sentential form $E + E * E$ has two derivations:
$E \Rightarrow E + E \Rightarrow E + E * E$, and
$E \Rightarrow E * E \Rightarrow E + E * E$.

This gives us two parse trees:
        E                    E
      / | \                / | \
     E  +  E              E  *  E
         / | \          / | \
        E  *  E        E  +  E
Left-hand side: The second and the third expressions are multiplied and the result is added to the first expression (e.g., $1 + (2 * 3) = 7$).
Right-hand side: The first two expressions are added and the result is multiplied by the third (e.g., $(1 + 2) * 3 = 9$).

Ambiguity in Grammars and Languages
Definition: A CFG $G$ is said to be ambiguous if there exists some $w \in L(G)$ that has at least two distinct parse trees.
Definition: A CFL $L$ is said to be inherently ambiguous if all its grammars are ambiguous.
Definition: If $L$ is a CFL for which there exists an unambiguous grammar, then $L$ is said to be unambiguous.
If even one grammar for $L$ is unambiguous, then $L$ is an unambiguous language.

Removing Ambiguity from Grammars
Good news: Sometimes we can remove ambiguity by hand.
Bad news: There is no algorithm to do it.
More bad news: Some CFLs have only ambiguous CFGs.

Let us consider the grammar:
$E \to I \mid E + E \mid E * E \mid (E)$,
$I \to a \mid b \mid Ia \mid Ib \mid I0 \mid I1$.
There are two problems:
1. There is no precedence between $*$ and $+$.
2. There is no grouping of sequences of operators, e.g., is $E + E + E$ meant to be $(E + E) + E$ or $E + (E + E)$?

Solution: We introduce more variables, each representing expressions that share a level of binding strength.
1. A factor is an expression that cannot be broken apart by an adjacent $*$ or $+$. Our factors are
1.1 Identifiers
1.2 A parenthesized expression
2. A term is an expression that cannot be broken by $+$. A term is a product of one or more factors. For instance, $a * b$ can be broken by $a1 *$ or $* a1$. It cannot be broken by $+$, since, e.g., $a1 + a * b$ is (by precedence rules) the same as $a1 + (a * b)$, and $a * b + a1$ is the same as $(a * b) + a1$.
3. The rest are expressions, i.e., they can be broken apart by $*$ or $+$.

We will let $F$ stand for factors, $T$ for terms, and $E$ for expressions. Consider the following grammar:
$I \to a \mid b \mid Ia \mid Ib \mid I0 \mid I1$,
$F \to I \mid (E)$,
$T \to F \mid T * F$,
$E \to T \mid E + T$.
Now the only parse tree for $a + a * a$ will be
        E
      / | \
     E  +  T
     |   / | \
     T  T  *  F
     |  |     |
     F  F     I
     |  |     |
     I  I     a
     |  |
     a  a
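
The difference between the two expression grammars can be checked by counting parse trees. The sketch below is an illustration, not an algorithm from the slides: it counts the parse trees of a string by trying every way of splitting it among the symbols of a production body. It assumes a grammar with single-character symbols, no $\epsilon$-productions, and no cycles of unit productions, which holds for both grammars here. The names `count_trees`, `AMBIGUOUS`, and `UNAMBIGUOUS` are hypothetical.

```python
from functools import lru_cache


def count_trees(productions, start, w):
    """Number of parse trees with root `start` and yield `w`."""

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        if symbol not in productions:  # terminal symbol
            return 1 if w[i:j] == symbol else 0
        return sum(split(body, i, j) for body in productions[symbol])

    @lru_cache(maxsize=None)
    def split(body, i, j):
        # Ways for the symbols of `body` to cover w[i:j], each symbol
        # taking a non-empty piece.
        if len(body) == 1:
            return count(body[0], i, j)
        rest = len(body) - 1
        return sum(count(body[0], i, k) * split(body[1:], k, j)
                   for k in range(i + 1, j - rest + 1))

    return count(start, 0, len(w))


IDENT = ["a", "b", "Ia", "Ib", "I0", "I1"]
AMBIGUOUS = {"E": ["I", "E+E", "E*E", "(E)"], "I": IDENT}
UNAMBIGUOUS = {"E": ["T", "E+T"], "T": ["F", "T*F"],
               "F": ["I", "(E)"], "I": IDENT}

print(count_trees(AMBIGUOUS, "E", "a+a*a"))    # 2
print(count_trees(UNAMBIGUOUS, "E", "a+a*a"))  # 1
```

A count above 1 for some string shows that a grammar is ambiguous. A count of 1 on a few samples is only evidence, not a proof, that a grammar is unambiguous.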

Why is the grammar shown on the previous slide unambiguous?
A factor is either an identifier or $(E)$, for some expression $E$.
The only parse tree for a sequence $f_1 * f_2 * \cdots * f_{n-1} * f_n$ of factors is the one that gives $f_1 * f_2 * \cdots * f_{n-1}$ as a term and $f_n$ as a factor, as in the parse tree on the next slide.
An expression is a sequence $t_1 + t_2 + \cdots + t_{n-1} + t_n$ of terms $t_i$. It can only be parsed with $t_1 + t_2 + \cdots + t_{n-1}$ as an expression and $t_n$ as a term.

              T
            / | \
           T  *  F
         / | \
        T  *  F
        .
        .
        .
        T
      / | \
     T  *  F
     |
     F