FLAC Context-Free Grammars


FLAC Context-Free Grammars Klaus Sutner Carnegie Mellon University Fall 2017

1 Generating Languages 2 Properties of CFLs

Generation vs. Recognition 3 Turing machines can be used to check membership in decidable sets. They can also be used to enumerate semidecidable sets, whence the classical notion of recursively enumerable sets. For languages L ⊆ Σ* there is a similar notion of generation. The idea is to set up a system of simple rules that can be used to derive all words in a particular formal language. These systems are typically highly nondeterministic, and it is not clear how to find (efficient) recognition algorithms for the corresponding languages.

Noam Chomsky 4 Historically, these ideas go back to work by Chomsky in the 1950s. Chomsky was mostly interested in natural languages: the goal is to develop grammars that differentiate between grammatical and ungrammatical sentences. 1 The cat sat on the mat. 2 The mat on the sat cat. Alas, this turns out to be inordinately difficult: the syntax and semantics of natural languages are closely connected and very complicated. But for artificial languages such as programming languages, Chomsky's approach turned out to be perfectly suited.

Cat-Mat Example 5 [Parse tree of "The cat sat on the mat.": Sentence → Noun Phrase, Verb Phrase, Punctuation. The Noun Phrase covers Determiner Noun ("The cat"); the Verb Phrase covers Verb ("sat") and Prepositional Phrase; the Prepositional Phrase covers Preposition ("on") and a Noun Phrase with Determiner Noun ("the mat").]

Mat-Cat Example 6 [Attempted parse tree of "The mat on the sat cat.": Noun Phrase → Noun Phrase, Prepositional Phrase, Punctuation. The first Noun Phrase covers Determiner Noun ("The mat"); the Prepositional Phrase covers Preposition ("on") and a Noun Phrase with Determiner Adjective Noun ("the sat cat").]

Killer App: Programming Languages 7 Many programming languages have a block structure like so: begin end begin end begin begin end begin end end Clearly, this is not a regular language and cannot be checked by a finite state machine. We need more compute power.

Generalizing 8 We have two rather different ways of describing regular languages: finite state machine acceptors, and regular expressions. We could try to generalize either one of these. Let's start with the algebra angle and handle the machine model later.

Grammars 9 Definition A (formal) grammar is a quadruple G = ⟨V, Σ, P, S⟩ where V and Σ are disjoint alphabets, S ∈ V, and P is a finite set of productions or rules. The symbols of V are (syntactic) variables, the symbols of Σ are terminals, and S is called the start symbol (or axiom). We often write Γ = V ∪ Σ for the complete alphabet of G.

Context Free Grammars 10 Definition (CFG) A context free grammar is a grammar where the set of productions satisfies P ⊆ V × Γ*. It is convenient to write productions in the form π : A → α where A ∈ V and α ∈ Γ*. The idea is that we may replace A by α.

Naming Conventions 11 A, B, C, ... represent elements of V; S ∈ V is the start symbol; a, b, c, ... represent elements of Σ; X, Y, Z, ... represent elements of Γ; w, x, y, ... represent elements of Σ*; α, β, γ, ... represent elements of Γ*.

Derivations 12 Given a CFG G, define a one-step relation ⇒¹ ⊆ Γ* × Γ* as follows: α A β ⇒¹ α γ β whenever A → γ ∈ P. As usual, define by induction α ⇒^(k+1) β if there is some γ with α ⇒^k γ and γ ⇒¹ β, and α ⇒* β if α ⇒^k β for some k, in which case one says that α derives or yields β. α is a sentential form if it can be derived from the start symbol S. To keep notation simple we'll often just write α ⇒ β.
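The one-step relation and ⇒* can be made concrete in a few lines, under a toy encoding that is an assumption for illustration (not from the slides): variables are single uppercase letters, terminals lowercase, and a grammar is a dict mapping each variable to its right-hand sides.

```python
# Sketch of the one-step relation =>^1 and of L(G) up to a length bound,
# assuming the toy encoding described above.

def one_step(form, productions):
    """All sentential forms beta with form =>^1 beta."""
    succs = set()
    for i, symbol in enumerate(form):
        if symbol in productions:                    # symbol is a variable A
            for rhs in productions[symbol]:          # production A -> rhs
                succs.add(form[:i] + rhs + form[i + 1:])
    return succs

def generated_words(start, productions, max_len):
    """Terminal words of length <= max_len derivable from `start` (BFS).

    The pruning bound max_len + 2 is a heuristic that happens to work for
    the toy grammars in these slides; it is not correct for arbitrary CFGs.
    """
    seen = {start}
    frontier = {start}
    words = set()
    while frontier:
        nxt = set()
        for form in frontier:
            for succ in one_step(form, productions):
                if succ in seen or len(succ) > max_len + 2:
                    continue
                seen.add(succ)
                if any(s in productions for s in succ):
                    nxt.add(succ)                    # still has variables
                elif len(succ) <= max_len:
                    words.add(succ)                  # a terminal word
        frontier = nxt
    return words
```

For instance, `generated_words("S", {"S": ["aSb", ""]}, 4)` returns the short words of the counting grammar discussed below.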

Context Free Languages 13 Definition The language of a context free grammar G is defined to be L(G) = { x ∈ Σ* | S ⇒* x }. Thus L(G) is the set of all sentential forms that lie in Σ*. We also say that G generates L(G). A language is context free (CFL) if there exists a context free grammar that generates it. Note that in a CFG one can replace a single syntactic variable A by a string over Γ independently of where A occurs; whence the name context free. Later on we will generalize to replacement rules that operate on a whole block of symbols (context-sensitive grammars).

Example: Regular 14 Let G = ⟨{S, A, B}, {a, b}, P, S⟩ where the set P of productions is defined by: S → aA | aB, A → aA | aB, B → bB | b. A typical derivation is: S ⇒ aA ⇒ aaA ⇒ aaaB ⇒ aaabB ⇒ aaabb. It is not hard to see that L(G) = a^+ b^+. Not too interesting: we already know how to deal with regular languages. Can you see the finite state machine hiding in the grammar? Is it minimal?

Derivation Graph 15 Derivations of length at most 6 in this grammar.

Labeled 16 [Figure: the derivation graph again, nodes now labeled with their sentential forms (S, aA, aB, aaA, ..., down to strings of length 6).]

Example: Mystery 17 Let G = ⟨{A, B}, {a, b}, P, A⟩ where the set P of productions is defined by: A → AA | AB | a, B → AA | BB | b. A typical derivation is: A ⇒ AA ⇒ AAB ⇒ AABB ⇒ AABAA ⇒* aabaa. In this case it is not obvious what the language of G is (assuming it has some easy description, which it does). More next time when we talk about parsing.

Derivation Graph 18 Derivations of length at most 3 in this grammar. Three terminal strings appear at this point.

Depth 4 19

Example: Counting 20 Let G = ⟨{S}, {a, b}, P, S⟩ where the set P of productions is defined by: S → aSb | ε. A typical derivation is: S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb. Clearly, this grammar generates the language { a^i b^i | i ≥ 0 }. It is easy to see that this language is not regular.
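The language of this grammar also has a trivial direct membership test, which makes the non-regularity intuition concrete: the check compares two unbounded counts, exactly what a finite state machine cannot do. A minimal sketch:

```python
# Membership test for { a^i b^i | i >= 0 }, the language of S -> aSb | eps:
# all a's first, all b's second, in equal numbers.

def is_counting_word(w):
    half, rest = len(w) // 2, len(w) % 2
    return rest == 0 and w == "a" * half + "b" * half
```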

Derivation Graph 21

Example: Palindromes 22 Let G = ⟨{S}, {a, b}, P, S⟩ where the set P of productions is defined by: S → aSa | bSb | a | b | ε. A typical derivation is: S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aababaa. This grammar generates the language of palindromes. Exercise Give a careful proof of this claim.

Derivation Graph 23

Example: Parens 24 Let G = ⟨{S}, {(, )}, P, S⟩ where the set P of productions is defined by: S → SS | (S) | ε. A typical derivation is: S ⇒ SS ⇒ (S)S ⇒ (S)(S) ⇒ (S)((S)) ⇒* ()(()). This grammar generates the language of well-formed parenthesized expressions. Exercise Give a careful proof of this claim.

Derivation Graph 25

Example: Expressions of Arithmetic 26 Let G = ⟨{E}, {+, *, (, ), v}, P, E⟩ where the set P of productions is defined by: E → E + E | E * E | (E) | v. A typical derivation is: E ⇒ E * E ⇒ E * (E) ⇒ E * (E + E) ⇒* v * (v + v). This grammar generates a language of arithmetical expressions with plus and times. Alas, there are problems: the following derivation is slightly awkward. E ⇒ E + E ⇒ E + (E) ⇒ E + (E * E) ⇒* v + (v * v). Our grammar is symmetric in + and *; it knows nothing about precedence.

Derivation Graph 27

Ambiguity 28 We may not worry about awkward, but the following problem is fatal:
E ⇒ E + E ⇒ E + E * E ⇒* v + v * v
E ⇒ E * E ⇒ E + E * E ⇒* v + v * v
There are two derivations for the same word v + v * v. Since derivations determine the semantics of a string, this is really bad news: a compiler could interpret v + v * v in two different ways, producing different results.

Parse Trees 29 Derivation chains are hard to read; a better representation is a tree. Let G = ⟨V, Σ, P, S⟩ be a context free grammar. A parse tree of G (aka grammatical tree) is an ordered tree on nodes N, together with a labeling λ : N → V ∪ Σ ∪ {ε} such that: for all interior nodes x, λ(x) ∈ V; if x1, ..., xk are the children, in left-to-right order, of interior node x, then λ(x) → λ(x1) ... λ(xk) is a production of G; and λ(x) = ε implies x is an only child.

Derivation Trees 30 Here are the parse trees of the expressions grammar from above. [Figure: two parse trees for v + v * v, one with root production E → E + E, the other with root production E → E * E.] Note that the trees provide a method to evaluate arithmetic expressions, so the existence of two trees becomes a nightmare.

Information Hiding 31 A parse tree typically represents several derivations. The tree for v * v + v with root production E → E * E, left subtree v, and right subtree E + E (with leaves v and v) represents for example
θ1 : E ⇒ E * E ⇒ E * E + E ⇒ v * E + E ⇒ v * v + E ⇒ v * v + v
θ2 : E ⇒ E * E ⇒ E * E + E ⇒ E * E + v ⇒ E * v + v ⇒ v * v + v
θ3 : E ⇒ E * E ⇒ v * E ⇒ v * E + E ⇒ v * v + E ⇒ v * v + v
but not
θ4 : E ⇒ E + E ⇒ E * E + E ⇒ v * E + E ⇒ v * v + E ⇒ v * v + v

Leftmost Derivations 32 Let G be a grammar and assume α ⇒¹ β. We call this derivation step leftmost if α = x A α′ and β = x γ α′ where x ∈ Σ* and A → γ ∈ P. A whole derivation is leftmost if it only uses leftmost steps. Thus, each replacement is made in the first possible position. Proposition Parse trees correspond exactly to leftmost derivations.

Ambiguity 33 Definition A CFG G is ambiguous if there is a word in the language of G that has two different parse trees. Alternatively, there are two different leftmost derivations. As the arithmetic example demonstrates, trees are connected to semantics, so ambiguity is a serious problem in a programming language.

Unambiguous Arithmetic 34 For a reasonable context free language it is usually possible to remove ambiguity by rewriting the grammar. For example, here is an unambiguous grammar for our arithmetic expressions. E → E + T | T, T → T * F | F, F → (E) | v. In this grammar, v + v * v has only one parse tree. Here {E, T, F} are syntactic variables that correspond to expressions, terms and factors. Note that it is far from clear how to come up with these syntactic categories.
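One way to see that the rewritten grammar pins down a unique tree: it translates almost mechanically into a recursive-descent parser. This is a sketch under illustration choices not in the slides (tokens are single characters, the parser returns a fully parenthesized string, i.e. the parse tree written linearly); the left-recursive rules E → E + T and T → T * F are rendered as loops.

```python
# Recursive-descent reading of E -> E + T | T, T -> T * F | F, F -> (E) | v.
# Returns a fully parenthesized form, making the unique parse visible.

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():                       # E -> T ( + T )*
        nonlocal pos
        t = term()
        while peek() == "+":
            pos += 1
            t = "(" + t + "+" + term() + ")"
        return t
    def term():                       # T -> F ( * F )*
        nonlocal pos
        f = factor()
        while peek() == "*":
            pos += 1
            f = "(" + f + "*" + factor() + ")"
        return f
    def factor():                     # F -> ( E ) | v
        nonlocal pos
        if peek() == "(":
            pos += 1
            e = expr()
            assert peek() == ")"
            pos += 1
            return e
        assert peek() == "v"
        pos += 1
        return "v"
    result = expr()
    assert pos == len(tokens)         # all input consumed
    return result
```

On `v+v*v` the parser produces `(v+(v*v))`: multiplication binds tighter, and there is no second reading.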

Inherently Ambiguous Languages 35 Alas, there are CFLs where this trick will not work: every CFG for the language is already ambiguous. Here is a well-known example: L = { a^i b^j c^k | i = j or j = k; i, j, k ≥ 1 }. L consists of two parts, and each part by itself has an unambiguous grammar. But strings of the form a^i b^i c^i belong to both parts and introduce a kind of ambiguity that cannot be removed. BTW, { a^i b^i c^i | i ≥ 0 } is not context free. Exercise Show that L really is inherently ambiguous.

1 Generating Languages 2 Properties of CFLs

Regular Implies Context Free 37 Lemma Every regular language is context free. Proof. Suppose M = ⟨Q, Σ, δ, q0, F⟩ is a DFA for L. Consider the CFG with V = Q, start symbol q0, and productions p → a q whenever δ(p, a) = q, and p → ε whenever p ∈ F.
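The construction in the proof can be sketched directly: each DFA state becomes a variable, each transition a right-linear production, each final state an ε-production. The example DFA below (for the language a^+) is an assumption chosen for illustration.

```python
# DFA -> CFG, as in the proof above. A DFA is given by its states, a dict
# delta mapping (state, letter) to a state, a start state, and final states.

def dfa_to_cfg(states, delta, start, finals):
    productions = {p: [] for p in states}
    for (p, a), q in delta.items():
        productions[p].append((a, q))      # production p -> a q
    for p in finals:
        productions[p].append(None)        # production p -> epsilon
    return productions, start

def generates(productions, start, w):
    """Does the grammar derive w? Because the grammar mirrors a DFA, the
    derivation, if any, is unique: just follow the run of the automaton."""
    state = start
    for a in w:
        for rhs in productions[state]:
            if rhs is not None and rhs[0] == a:
                state = rhs[1]
                break
        else:
            return False                   # no transition on letter a
    return None in productions[state]      # can we finish with p -> epsilon?
```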

Substitutions 38 Definition A substitution is a map σ : Σ → P(Γ*). The idea is that for any word x = x1 x2 ... xn ∈ Σ* we can define its image under σ to be the language σ(x1) σ(x2) ... σ(xn). Likewise, σ(L) = ⋃_{x ∈ L} σ(x). If every σ(a) = {w} is a singleton, we have essentially a homomorphism.

The Substitution Lemma 39 Lemma Let L ⊆ Σ* be a CFL and suppose σ : Σ → P(Γ*) is a substitution such that σ(a) is context free for every a ∈ Σ. Then the language σ(L) is also context free. Proof. Let G = ⟨V, Σ, P, S⟩ and G_a = ⟨V_a, Γ, P_a, S_a⟩ be CFGs for the languages L and L_a = σ(a), respectively. We may safely assume that the corresponding sets of syntactic variables are pairwise disjoint. Define G′ as follows: replace all terminals a on the right hand side of a production in G by the corresponding variable S_a. It is obvious that f(L(G′)) = L where f is the homomorphism defined by f(S_a) = a.

Proof, cont'd 40 Now define a new grammar H as follows. The variables of H are V ∪ ⋃_{a ∈ Σ} V_a, the terminals are Γ, the start symbol is S, and the productions are given by P′ ∪ ⋃_{a ∈ Σ} P_a. Then the language generated by H is σ(L). It is clear that H derives every word in σ(L). For the opposite direction consider the parse trees in H.

Closure Properties 41 Corollary Suppose L, L1, L2 ⊆ Σ* are CFLs. Then the following languages are also context free: L1 ∪ L2, L1 L2 and L*: context free languages are closed under union, concatenation and Kleene star. Proof. This follows immediately from the substitution lemma and the fact that the languages {a, b}, {ab} and {a}* are trivially context free.
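The mechanism behind the corollary can be exercised on finite languages. A minimal sketch, with σ and the example languages chosen as assumptions: substituting σ(a) = L1, σ(b) = L2 into {a, b} yields the union, into {ab} the concatenation.

```python
# Substitution on finite languages: sigma maps each letter to a language,
# and sigma(x) concatenates the images letter by letter.

def substitute_word(x, sigma):
    result = {""}
    for a in x:
        result = {u + v for u in result for v in sigma[a]}
    return result

def substitute(L, sigma):
    return set().union(*(substitute_word(x, sigma) for x in L))

# Illustration: L1 = {0, 00} as sigma(a), L2 = {1} as sigma(b).
sigma = {"a": {"0", "00"}, "b": {"1"}}
union_result = substitute({"a", "b"}, sigma)   # L1 union L2
concat_result = substitute({"ab"}, sigma)      # L1 L2
```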

Non Closure 42 Proposition CFLs are not closed under intersection and complement. Consider L1 = { a^i b^i c^j | i, j ≥ 0 } and L2 = { a^i b^j c^j | i, j ≥ 0 }. We will see in a moment that L1 ∩ L2 = { a^i b^i c^i | i ≥ 0 } fails to be context free. Non-closure under complement then follows via De Morgan, since CFLs are closed under union.

More Closure 43 Lemma Suppose L is a CFL and R is regular. Then L ∩ R is also context free. Proof. This will be easy once we have a machine model for CFLs (push-down automata); more later.

Dyck Languages 44 One can generalize strings of balanced parentheses to strings involving multiple types of parens. To this end one uses special alphabets with paired symbols: Γ = Σ ∪ { ā | a ∈ Σ }. The Dyck language D_k is generated by the grammar S → SS | a S ā | ε (with one production a S ā for each a ∈ Σ). A typical derivation looks like so: S ⇒ SS ⇒ aSāS ⇒ aāS ⇒ aāaSā ⇒ aāaaSāā ⇒ aāaaāā. Exercise Find an alternative definition of a Dyck language.

A Characterization 45 Let us write D_k for the Dyck language with k = |Σ| kinds of parens. For D_1 there is a nice characterization via a simple counting function. Define #_a x to be the number of letters a in word x, and set f_a(x) = #_a x − #_ā x. Lemma A string x ∈ {a, ā}* belongs to the Dyck language D_1 iff f_a(x) = 0 and, for every prefix z of x, f_a(z) ≥ 0.
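The lemma translates directly into a one-pass membership test. Encoding assumption (not from the slides): the barred letter ā is written as uppercase "A".

```python
# Membership in D_1 via the counting function f_a: the running value must
# never dip below 0 on a prefix, and must be exactly 0 at the end.

def in_dyck1(x):
    f = 0
    for c in x:
        f += 1 if c == "a" else -1   # "a" opens, "A" (for a-bar) closes
        if f < 0:                    # some prefix z has f_a(z) < 0
            return False
    return f == 0
```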

A Paren Mountain 46 [Figure: the counting function f_a plotted along a balanced string; heights range over 1–5, positions over 1–21.] Note that one can read off a proof for the correctness of the grammar S → SS | a S ā | ε from the picture.

k-parens 47 For D k we can still count but this time we need values in N k f(x) = (f a1 (x), f a2 (x),..., f ak (x)) Then we need f(x) = 0 and f(z) 0 for all prefixes z of x, just like for D 1. Alas, this is not enough: we also have make sure that proper nesting occurs between different types of parens. The critical problem is that we do not want abab.

Matching Pairs 48 Let x = x1 x2 ... xn. Note that if x ∈ D_k and x_i = a, then there is a unique minimal j > i such that f_a(x[i]) = f_a(x[j]) (why?). Intuitively, x_j = ā is the matching right paren for the a = x_i. Hence we obtain an interval [i, j] associated with the a in position i. Call the collection of all such intervals I_a, a ∈ Σ. The critical additional condition for a balanced string is that none of the intervals in the I_a overlap: they are all nested or disjoint. Exercise Show that these conditions really describe the language D_k.
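For D_k with k > 1 the counting functions no longer suffice, but a stack captures both the counting and the nesting condition at once. Encoding assumption as before: lowercase letters open, the matching uppercase letter closes (so "abBA" is balanced but "abAB" is not).

```python
# Membership in D_k via a stack: push opening letters, and each closing
# letter must match the most recently opened, still-unclosed paren. This
# rejects exactly the crossing intervals, e.g. a b a-bar b-bar.

def in_dyck(x):
    stack = []
    for c in x:
        if c.islower():                        # an opening paren
            stack.append(c)
        elif not stack or stack.pop() != c.lower():
            return False                       # unmatched or crossing close
    return not stack                           # nothing left open
```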

Dyck vs. CF 49 In a strong sense, Dyck languages are the most general context free languages: all context free languages are built around the notion of matching parens, though this may not at all be obvious from their definitions (and, actually, not even from their grammars). Theorem (Chomsky-Schützenberger 1963) Every context free language L ⊆ Σ* has the form L = h(D ∩ R) where D is a Dyck language, R is regular and h is a homomorphism. The proof also relies on a machine model; more later.

Parikh Vectors 50 Suppose Σ = {a1, a2, ..., ak}. For x ∈ Σ*, the Parikh vector of x is defined by #x = (#_{a1} x, #_{a2} x, ..., #_{ak} x) ∈ ℕ^k. Lift to languages via #L = { #x | x ∈ L } ⊆ ℕ^k. In a sense, the Parikh vector gives the commutative version of a word: we just count all the letters but ignore order entirely. For example, for the Dyck language D_1 over {a, ā} we have #D_1 = { (i, i) | i ≥ 0 }.
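Parikh vectors are just letter counts, so they are trivial to compute; a sketch for finite sample languages (the alphabet is passed explicitly to fix the coordinate order):

```python
# Parikh vector of a word and Parikh set of a (finite) language: only the
# letter counts survive, the order of letters is discarded.

def parikh(x, alphabet):
    return tuple(x.count(a) for a in alphabet)

def parikh_set(L, alphabet):
    return {parikh(x, alphabet) for x in L}
```

A finite sample of the counting language { a^i b^i } already shows the diagonal shape of its Parikh set, matching #D_1 = { (i, i) | i ≥ 0 }.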

Semi-Linear Sets 51 A set A ⊆ ℕ^k is semi-linear if it is a finite union of sets of the form { a0 + Σ_i a_i x_i | x_i ≥ 0 } with the a_i ∈ ℕ^k fixed. In the special case k = 1, semi-linear sets are often called ultimately periodic: A = A_t ∪ (a + mℕ + A_p) where A_t ⊆ {0, ..., a − 1} and A_p ⊆ {0, ..., m − 1} are the transient and the periodic part, respectively. Observe that for any language L ⊆ {a}*: L is regular iff #L ⊆ ℕ is semi-linear.

Parikh's Theorem 52 Theorem For any context free language L, the Parikh set #L is semi-linear. Instead of a proof, consider the example of the Dyck language D_1 with grammar S → SS | aSā | ε. Let A = #D_1; then A is the least set X ⊆ ℕ² such that: S → SS: X is closed under addition; S → aSā: X is closed under x ↦ x + (1, 1); S → ε: X contains (0, 0). Clearly, A = { (i, i) | i ≥ 0 }.

Application Parikh 53 It follows immediately that every context free language over Σ = {a} is already regular. As a consequence, { a^p | p prime } is not context free. This type of argument also works for a slightly messier language like L = { a^k b^l | k > l or (k ≤ l and k prime) }. Note that in this case L and #L are essentially the same, so it all comes down to the set of primes not being semi-linear.

Markings 54 Another powerful method to show that a language fails to be context free is a generalization of the infamous pumping lemma for regular languages. Alas, this time we need to build up a bit of machinery. Definition Let w ∈ Σ* and n = |w|. A position in w is a number p, 1 ≤ p ≤ n. A set K ⊆ [n] of positions is called a marking of w. A 5-factorization of w consists of 5 words x1, x2, x3, x4, x5 such that x1 x2 x3 x4 x5 = w. Given a factorization and a marking of w, let K(x_i) = { p | |x1 ... x_{i−1}| < p ≤ |x1 ... x_i| } ∩ K. Thus K(x_i) simply consists of all the marked positions in block x_i.

The Iteration Theorem 55 Theorem Let G = ⟨V, Σ, P, S⟩ be a CFG. Then there exists a number N = N(G) such that for all x ∈ L(G) and every marking K ⊆ [|x|] of x of cardinality at least N, there exists a 5-factorization x1, ..., x5 of x such that, letting K_i = K(x_i), we have: K1, K2, K3 ≠ ∅ or K3, K4, K5 ≠ ∅; |K2 ∪ K3 ∪ K4| ≤ N; and for all t ≥ 0, x1 x2^t x3 x4^t x5 ∈ L(G). Proof. Stare at parse trees.

Non-Closure 56 Lemma { a^i b^i c^i | i ≥ 0 } is not context free. Proof. Recall that this shows non-closure under complements and intersections. So a CFG can count and compare two letters, as in L1 = { a^i b^i c^j | i, j ≥ 0 } and L2 = { a^i b^j c^j | i, j ≥ 0 }, but three letters are not manageable. The intuition behind this will become clear next time when we introduce a machine model.

Proof 57 Let N be as in the iteration theorem and set w = a^N b^N c^N, K = [N + 1, 2N] (so the b's in the middle are marked). Then there is a factorization x1, ..., x5 of w such that, letting K_i = K(x_i), we have: Case 1: K1, K2, K3 ≠ ∅. Then x1 = a^N b^i, x2 = b^j, x3 = b^k y where j > 0. But then x1 x3 x5 ∉ L, contradiction. Case 2: K3, K4, K5 ≠ ∅. Then x3 = y b^i, x4 = b^j, x5 = b^k c^N where j > 0. Again x1 x3 x5 ∉ L, contradiction.

More Non-Closure 58 It follows that { x ∈ {a, b, c}* | |x|_a = |x|_b = |x|_c } is not context free: otherwise its intersection with a* b* c* would also be context free. Exercise Show that the copy language L_copy = { x x | x ∈ Σ* } fails to be context free. Compare this to the palindrome language { x x^op | x ∈ Σ* }.