REGular and Context-Free Grammars

Nicholas Mainardi (1)
Dipartimento di Elettronica e Informazione, Politecnico di Milano
nicholas.mainardi@polimi.it
March 26, 2018
(1) Partly based on Alessandro Barenghi's material, largely enriched with some additional exercises

Grammars
What are grammars?
Another formalism to define a language
Generative approach: the grammar specifies how a sentence (i.e. an element of the language) is generated
For some grammar classes, automated algorithms to derive the recognizer automaton are available (and pretty widely used!)
It is possible to define grammar classes corresponding to a precise computing model (FSA, [N|D]PDA, TM)

Formalization
A formal definition
A grammar is defined by a 4-tuple $(V_n, V_t, P, S)$ where:
$V_n$ is the non-terminal symbol alphabet
$V_t$ is the terminal symbol alphabet
$P \subseteq V_n^+ \times (V_t \cup V_n)^*$ is the set of syntactic productions
$S \in V_n$ is the starting symbol, known as the axiom
A production rule maps one or more symbols from $V_n$ into zero or more symbols from $V = (V_t \cup V_n)$
More formally, a derivation step $\alpha \Rightarrow \beta$, with $\alpha \in V^+$, $\alpha = \alpha_1 \alpha_2 \alpha_3$ and $\beta \in V^*$, $\beta = \alpha_1 \beta_2 \alpha_3$, exists if and only if there is a $p \in P$ such that $p = \alpha_2 \to \beta_2$
$\Rightarrow^*$ indicates the reflexive and transitive closure of $\Rightarrow$
A grammar generates the language $L_G = \{x \mid x \in V_t^* \wedge S \Rightarrow^* x\}$
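The definition above can be sketched in code. The following is a minimal illustration in Python, not part of the original slides: the names `Grammar` and `derive_once` are hypothetical, symbols are assumed to be single characters, and the example grammar is the one for $(aa)^*$ restricted to three productions.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    non_terminals: frozenset   # V_n
    terminals: frozenset       # V_t
    productions: tuple         # P, as (lhs, rhs) pairs of strings
    axiom: str                 # S

def derive_once(grammar, form):
    """Return every sentential form reachable from `form` in one step."""
    results = set()
    for lhs, rhs in grammar.productions:
        start = form.find(lhs)
        while start != -1:
            # form = alpha_1 + lhs + alpha_3  ==>  alpha_1 + rhs + alpha_3
            results.add(form[:start] + rhs + form[start + len(lhs):])
            start = form.find(lhs, start + 1)
    return results

# Assumed example: S -> epsilon | aA, A -> aS (epsilon is the empty string)
G = Grammar(frozenset("SA"), frozenset("a"),
            (("S", ""), ("S", "aA"), ("A", "aS")), "S")

print(derive_once(G, G.axiom))
```

Iterating `derive_once` from the axiom and keeping only the forms made of terminals enumerates $L_G$, which is exactly the reflexive and transitive closure in the definition.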

Notation
Common conventions
Non-terminal symbols are UPPERCASE, terminal symbols are lowercase
$S$ is the axiom of the grammar
Single-character symbols are used (no tokenization needed)
The concatenation mark $.$ is omitted
Regular expressions are not used in the RHS of a rule, except for the $\mid$ symbol, employed to shorten the notation

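Recognizer automaton
Equivalences
Grammar type | Language class         | Generic rule form                                       | Recognizer automaton
3            | Regular                | $A \to aB \mid \varepsilon$                             | FSA
2            | Context-free           | $A \to \gamma$, $\gamma \in V^*$ (e.g. $A \to ABC$)     | NPDA
1            | Context-sensitive      | $\alpha \to \beta$, $|\alpha| \le |\beta|$              | Linear-bounded TM
0            | Recursively enumerable | any $\alpha \to \beta$                                  | TM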

First examples
Warming up
Regular: $L = (aa)^*$
Sample grammar generating the language:
$S \to \varepsilon$ (zero is even)
$S \to aA$ (when the first $a$ is produced...)
$A \to aS$ (make a pair and continue, or...)
$A \to a$ (make a pair and stop)
Context-free: $L = \{a^n b^n c^m a^m \mid n \ge 0, m \ge 1\}$
Sample grammar generating the language:
$S \to S_1 S_2$ (concatenation is easy)
$S_1 \to a S_1 b \mid \varepsilon$ (grow the pairs from within)
$S_2 \to c S_2 a \mid ca$ (ensures that at least one $ca$ pair is generated)
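To see the two sample grammars at work, the sketch below enumerates every sentence up to a given length by breadth-first expansion of the leftmost non-terminal. It is only an illustration in Python: the helper `generate` and the dictionary encoding of the productions are assumptions, and the search terminates only because every cycle of productions in these grammars adds at least one terminal.

```python
from collections import deque

def generate(grammar, start, max_len):
    """All terminal strings with at most max_len symbols derivable from start.

    grammar maps each non-terminal to a list of right-hand sides (tuples of
    symbols); any symbol that is not a key is treated as a terminal.
    """
    words, seen, queue = set(), {(start,)}, deque([(start,)])
    while queue:
        form = queue.popleft()
        pos = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if pos is None:                      # no non-terminal left: a sentence
            words.add("".join(form))
            continue
        for rhs in grammar[form[pos]]:       # expand the leftmost non-terminal
            new_form = form[:pos] + rhs + form[pos + 1:]
            n_terminals = sum(1 for sym in new_form if sym not in grammar)
            if n_terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words

# S -> eps | aA, A -> aS | a
EVEN_A = {"S": [(), ("a", "A")], "A": [("a", "S"), ("a",)]}
# S -> S1 S2, S1 -> a S1 b | eps, S2 -> c S2 a | ca
PAIRS = {"S": [("S1", "S2")],
         "S1": [("a", "S1", "b"), ()],
         "S2": [("c", "S2", "a"), ("c", "a")]}

print(sorted(generate(EVEN_A, "S", 6), key=len))
print(sorted(generate(PAIRS, "S", 4)))
```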

Union With Grammars
Union of languages is straightforward too!
Big/Little Endian Encodings
Consider the language $L$ defined on the alphabet $\{0, 1, a\}$ as the union of 2 sublanguages $L_l$ and $L_b$:
$L_l = \{(n\,a^n)^+ \mid 0 \le n \le 3\}$. Here, $n$ is written in little-endian binary representation (the first digit is the least significant one)
$L_b = \{(n\,a^n)^+ \mid 0 \le n \le 3\}$. Here, $n$ is written in big-endian binary representation (the first digit is the most significant one)
For instance, for a sequence of 2 $a$:
Little endian: 01aa
Big endian: 10aa

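Union With Grammars
Grammar for $L_l$:
$S_l \to 0Z \mid 1U$
$Z \to 0N_0 \mid 1N_2$
$U \to 0N_1 \mid 1N_3$
$N_0 \to \varepsilon \mid S_l$
$N_1 \to aN_0$
$N_2 \to aN_1$
$N_3 \to aN_2$
Grammar for $L_b$:
$S_b \to 0Z \mid 1U$
$Z \to 0N_0 \mid 1N_1$
$U \to 0N_2 \mid 1N_3$
$N_0 \to \varepsilon \mid S_b$
$N_1 \to aN_0$
$N_2 \to aN_1$
$N_3 \to aN_2$
Grammar for $L = L_l \cup L_b$:
$S \to S_l \mid S_b$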

Union With Grammars
Problem: there are conflicting non-terminal symbols in the sub-grammars!
Hence, the grammar generates strings such as 01a10a, which mixes the two encodings and does not belong to $L$:
$S \Rightarrow S_b \Rightarrow 0Z \Rightarrow 01N_1 \Rightarrow 01aN_0 \Rightarrow 01aS_b \Rightarrow 01a1U \Rightarrow 01a10N_1 \Rightarrow 01a10aN_0 \Rightarrow 01a10a$
This derivation is possible since the merging operation transforms the $U$ rule into: $U \to 0N_1 \mid 0N_2 \mid 1N_3$
Before performing the union, the sets of non-terminal symbols of the sub-grammars must be made disjoint
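The conflict and its fix can be reproduced with a short experiment. The Python sketch below is illustrative only: `generate`, `rename` and `union` are hypothetical helpers, the axioms are spelled `Sl` and `Sb`, and disjointness is obtained by suffixing the non-terminals of each sub-grammar, which is just one possible renaming scheme.

```python
from collections import deque

def generate(grammar, start, max_len):
    """Terminal strings with at most max_len symbols derivable from start."""
    words, seen, queue = set(), {(start,)}, deque([(start,)])
    while queue:
        form = queue.popleft()
        pos = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if pos is None:
            words.add("".join(form))
            continue
        for rhs in grammar[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            if (sum(1 for s in new_form if s not in grammar) <= max_len
                    and new_form not in seen):
                seen.add(new_form)
                queue.append(new_form)
    return words

def rename(grammar, tag):
    """Suffix every non-terminal with tag, leaving terminals untouched."""
    def new(sym):
        return sym + tag if sym in grammar else sym
    return {new(nt): [tuple(new(s) for s in rhs) for rhs in rules]
            for nt, rules in grammar.items()}

def union(g1, axiom1, g2, axiom2):
    """Merge the productions and add S -> axiom1 | axiom2 (S assumed unused)."""
    merged = {"S": [(axiom1,), (axiom2,)]}
    for g in (g1, g2):
        for nt, rules in g.items():
            merged.setdefault(nt, []).extend(rules)
    return merged

COUNT = {"N1": [("a", "N0")], "N2": [("a", "N1")], "N3": [("a", "N2")]}
LITTLE = {"Sl": [("0", "Z"), ("1", "U")], "Z": [("0", "N0"), ("1", "N2")],
          "U": [("0", "N1"), ("1", "N3")], "N0": [(), ("Sl",)], **COUNT}
BIG = {"Sb": [("0", "Z"), ("1", "U")], "Z": [("0", "N0"), ("1", "N1")],
       "U": [("0", "N2"), ("1", "N3")], "N0": [(), ("Sb",)], **COUNT}

naive = union(LITTLE, "Sl", BIG, "Sb")                       # shared Z, U, N_i
safe = union(rename(LITTLE, "_l"), "Sl_l", rename(BIG, "_b"), "Sb_b")

print("01a10a" in generate(naive, "S", 6))   # mixed encodings slip through
print("01a10a" in generate(safe, "S", 6))
```

With disjoint non-terminals, a derivation that starts in one sub-grammar can never reach a rule of the other, so each sentence uses a single encoding throughout.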

Union With Grammars
$L$ is a regular language defined on the alphabet $\Sigma = \{a, b, 0, 1\}$ as the union of three sub-languages:
$L_1 = \{x = a.y \mid y \in \Sigma^* \wedge |y|_0 = 2k + 1 \wedge |y|_1 = 2h,\ h, k \ge 0\}$
$L_2 = \{x = b.y \mid y \in \Sigma^* \wedge |y|_0 = 2k \wedge |y|_1 = 2h + 1,\ h, k \ge 0\}$
$L_3 = \{x = (0 \mid 1).y \mid y \in \Sigma^*\}$
Idea: we can easily handle the 3 sub-languages with 3 sub-grammars, with axioms $S_1$, $S_2$ and $S_3$, and then choose among these sub-languages depending on the first character
Grammar for $L$:
$S \to aS_1 \mid bS_2 \mid 0S_3 \mid 1S_3$

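Union With Grammars
Grammar for $L_1$:
$S_1 \to aS_1 \mid bS_1 \mid 0O_e \mid 1E_o$
$O_e \to 0S_1 \mid 1O_o \mid aO_e \mid bO_e \mid \varepsilon$
$E_o \to 0O_o \mid 1S_1 \mid aE_o \mid bE_o$
$O_o \to 0E_o \mid 1O_e \mid aO_o \mid bO_o$
Grammar for $L_2$:
$S_2 \to aS_2 \mid bS_2 \mid 0O_e^2 \mid 1E_o^2$
$O_e^2 \to 0S_2 \mid 1O_o^2 \mid aO_e^2 \mid bO_e^2$
$E_o^2 \to 0O_o^2 \mid 1S_2 \mid aE_o^2 \mid bE_o^2 \mid \varepsilon$
$O_o^2 \to 0E_o^2 \mid 1O_e^2 \mid aO_o^2 \mid bO_o^2$
The grammar for $L_3$ is easier:
$S_3 \to aS_3 \mid bS_3 \mid 0S_3 \mid 1S_3 \mid \varepsilon$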

A Left-Linear Version
The grammar for the language $L$ can be written in a more compact form using left-linear productions.
With a left-linear grammar, we generate a string from the end to the beginning! Therefore, right-linear grammars are often easier to conceive.
Idea for this language: generate the string backwards up to the second character, then replace the non-terminal symbol with a first character that is allowed by the parities of 0 and 1 generated so far:
1. $S \to Sa \mid Sb \mid O_e 0 \mid E_o 1 \mid 0 \mid 1$. The first character must be 0 or 1
2. $O_e \to O_e a \mid O_e b \mid S0 \mid O_o 1 \mid a \mid 0 \mid 1$. The first character cannot be $b$
3. $E_o \to E_o a \mid E_o b \mid O_o 0 \mid S1 \mid b \mid 0 \mid 1$. The first character cannot be $a$
4. $O_o \to O_o a \mid O_o b \mid E_o 0 \mid O_e 1 \mid 0 \mid 1$. The first character must be 0 or 1
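Since a left-linear grammar builds the string from its end, it can be simulated by scanning a word from right to left. The Python sketch below is an assumption-laden illustration: each production $X \to Yc$ is encoded as `STEP[(X, c)] = Y` (with $a$ and $b$ leaving the non-terminal unchanged), the terminal-only productions become the table `FIRST`, and the result is compared by brute force with the definition of $L = L_1 \cup L_2 \cup L_3$.

```python
from itertools import product

# X -> Y c  becomes  STEP[(X, c)] = Y; reading a or b keeps the same symbol
STEP = {("S", "0"): "Oe", ("S", "1"): "Eo",
        ("Oe", "0"): "S", ("Oe", "1"): "Oo",
        ("Eo", "0"): "Oo", ("Eo", "1"): "S",
        ("Oo", "0"): "Eo", ("Oo", "1"): "Oe"}
# Terminal-only productions: first characters allowed by each non-terminal
FIRST = {"S": "01", "Oe": "a01", "Eo": "b01", "Oo": "01"}

def generated(x):
    """True if the left-linear grammar derives x (scan from the last symbol)."""
    if not x:
        return False
    symbol = "S"
    for ch in reversed(x[1:]):
        symbol = STEP.get((symbol, ch), symbol)
    return x[0] in FIRST[symbol]

def in_language(x):
    """Membership in L by its definition (parities of 0 and 1 after x[0])."""
    if not x:
        return False
    zeros, ones = x[1:].count("0"), x[1:].count("1")
    if x[0] == "a":
        return zeros % 2 == 1 and ones % 2 == 0
    if x[0] == "b":
        return zeros % 2 == 0 and ones % 2 == 1
    return True

words = ["".join(w) for n in range(5) for w in product("ab01", repeat=n)]
print(all(generated(w) == in_language(w) for w in words))
```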

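Simplifying a Grammar
Consider the language
$L = \{s = c^m.y \mid y \in L_Y \wedge m > 0\}$
$L_Y = \{a^m.x \mid x \in L_X \wedge m > 0\} \cup \{\varepsilon\}$
$L_X = \{b^m.y \mid y \in L_Y \wedge m > 0\} \cup \{\varepsilon\}$
A grammar for $L$:
1. $S \to cS \mid cY$
2. $Y \to aA \mid aX \mid \varepsilon$
3. $X \to bB \mid bY \mid \varepsilon$
4. $A \to aA \mid aX$
5. $B \to bB \mid bY$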

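Simplifying a Grammar: Transforming to FSA
We can generate an FSA from a grammar:
$Q = V_n \cup \{q_f\}$
$I = V_t$
$q' \in \delta(q, i) \iff q \to i q' \in P$
$q_f \in \delta(q, i) \iff q \to i \in P$
$F = \{q \mid q \to \varepsilon \in P\} \cup \{q_f\}$
[Figure: ND-FSA for $L$, with states S, Y, A, B, X and transitions labelled a, b, c]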
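The construction rules above translate almost literally into code. This Python sketch is an illustration under stated assumptions: productions are strings of the form `""`, `"i"` or `"iQ"` with single-character symbols, the extra final state is named `qf`, and `grammar_to_nfa` and `accepts` are hypothetical helper names. It is applied to the five-rule grammar for $L$ with axiom $S$.

```python
def grammar_to_nfa(productions, axiom):
    """Build (delta, start, final) from a right-linear grammar."""
    q_f = "qf"
    delta, final = {}, {q_f}
    for q, rules in productions.items():
        for rhs in rules:
            if rhs == "":                     # q -> eps: q is final
                final.add(q)
            elif len(rhs) == 1:               # q -> i: move to q_f
                delta.setdefault((q, rhs), set()).add(q_f)
            else:                             # q -> i q'
                delta.setdefault((q, rhs[0]), set()).add(rhs[1])
    return delta, axiom, final

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of reachable states."""
    delta, start, final = nfa
    current = {start}
    for ch in word:
        current = set().union(*(delta.get((q, ch), set()) for q in current))
    return bool(current & final)

GRAMMAR = {"S": ["cS", "cY"], "Y": ["aA", "aX", ""],
           "X": ["bB", "bY", ""], "A": ["aA", "aX"], "B": ["bB", "bY"]}
NFA = grammar_to_nfa(GRAMMAR, "S")

for w in ["c", "ccaabb", "cabc", "a"]:
    print(w, accepts(NFA, w))
```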

Simplifying a Grammar: Determinization
Generally, the automaton derived from a grammar is likely to be non-deterministic
We can make it deterministic with the known algorithm (subset construction):
[Figure: deterministic FSA for $L$, with states S, SY, AX, BY and transitions labelled a, b, c]
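A compact way to check the determinization is to run the subset construction on the transition table of the ND-FSA. In the Python sketch below the table `NFA_DELTA` is an assumption, written out from the productions of the grammar for $L$; `determinize` is a hypothetical helper, and the error state (the empty set) is simply left out.

```python
def determinize(delta, start, final, alphabet):
    """Subset construction; DFA states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        state = todo.pop()
        for ch in alphabet:
            target = frozenset(q2 for q in state
                               for q2 in delta.get((q, ch), ()))
            if not target:                    # omit the error state
                continue
            dfa_delta[(state, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_final = {s for s in seen if s & final}
    return dfa_delta, start_set, dfa_final, seen

# Assumed NFA table, read off the grammar S -> cS|cY, Y -> aA|aX|eps, ...
NFA_DELTA = {("S", "c"): {"S", "Y"}, ("Y", "a"): {"A", "X"},
             ("A", "a"): {"A", "X"}, ("X", "b"): {"B", "Y"},
             ("B", "b"): {"B", "Y"}}
dfa_delta, dfa_start, dfa_final, dfa_states = determinize(
    NFA_DELTA, "S", {"Y", "X"}, "abc")

for (state, ch), target in sorted(dfa_delta.items(),
                                  key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print("".join(sorted(state)), ch, "->", "".join(sorted(target)))
```

The reachable subsets are $\{S\}$, $\{S, Y\}$, $\{A, X\}$ and $\{B, Y\}$, matching the state names S, SY, AX and BY used on the slide.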

Simplifying a Grammar: FSA Minimization
Given an FSA, it is always possible to get the FSA recognizing the same language with the minimum number of states
There is an algorithm, which is based on the concept of indistinguishable states
Indistinguishable States
Given an FSA, 2 states $q, q' \in Q$ are indistinguishable if:
$(q \in F \iff q' \in F) \wedge \forall i \in I\,(\delta(q, i) = \delta(q', i))$
Idea of the algorithm:
1. Search for indistinguishable states
2. If there are indistinguishable states, merge them into a unique state and go back to 1
3. Otherwise, the minimum automaton has been reached
NB: Indistinguishability is transitive!

Simplifying a Grammar: FSA Minimization
Are there any indistinguishable states in our FSA? Yes! AX and BY! Indeed:
1. $AX \in F \wedge BY \in F$
2. $\delta(AX, a) = \delta(BY, a) = AX$
3. $\delta(BY, b) = \delta(AX, b) = BY$
[Figure: minimum deterministic FSA for $L$, with states S, SY, F and transitions labelled a, b, c]
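The merge of AX and BY can be double-checked mechanically. The Python sketch below does not implement the search-and-merge loop literally: it swaps in partition refinement (Moore's algorithm), which starts from the split final / non-final and keeps separating states whose successors fall in different blocks. The transition table is an assumption reconstructed from the deterministic FSA with states S, SY, AX, BY, and `minimize` is a hypothetical helper.

```python
def minimize(states, alphabet, delta, final):
    """Return a dict state -> block id; equal ids mean indistinguishable."""
    block = {q: int(q in final) for q in states}
    while True:
        # Signature: own block plus the block reached on each input symbol
        # (-1 stands for a missing transition, i.e. the error state)
        sig = {q: (block[q],
                   tuple(block.get(delta.get((q, ch)), -1) for ch in alphabet))
               for q in states}
        ids = {s: n for n, s in enumerate(sorted(set(sig.values())))}
        new_block = {q: ids[sig[q]] for q in states}
        if len(ids) == len(set(block.values())):   # no block was split
            return new_block
        block = new_block

STATES = ["S", "SY", "AX", "BY"]
DELTA = {("S", "c"): "SY", ("SY", "c"): "SY", ("SY", "a"): "AX",
         ("AX", "a"): "AX", ("AX", "b"): "BY",
         ("BY", "a"): "AX", ("BY", "b"): "BY"}
blocks = minimize(STATES, "abc", DELTA, {"SY", "AX", "BY"})

print(blocks)
```

States sharing a block id can be merged, so AX and BY collapse into the single state called F on the slide, while S and SY stay separate.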

Simplifying a Grammar: Getting Back to the Grammar
Once we have our minimum deterministic FSA, we can transform it back to a grammar:
$V_n = Q$
$V_t = I$
$q \to i q' \in P \iff \delta(q, i) = q'$
$q \to \varepsilon \in P \iff q \in F$
Simplified grammar for $L$:
1. $S \to cS_y$
2. $S_y \to cS_y \mid aF \mid \varepsilon$
3. $F \to aF \mid bF \mid \varepsilon$
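The way back from the automaton to the grammar is a direct reading of the two rules above. In this Python sketch the minimum FSA is written with the single-character state names `S`, `Y` (standing for $S_y$) and `F`, which is an assumption made only to keep right-hand sides readable as strings; `dfa_to_grammar` is a hypothetical helper.

```python
def dfa_to_grammar(delta, final):
    """Map a DFA to right-linear productions: state -> sorted list of RHS."""
    rules = {}
    for (q, i), q2 in delta.items():
        rules.setdefault(q, []).append(i + q2)   # q -> i q'
    for q in final:
        rules.setdefault(q, []).append("")       # q -> eps for final states
    return {q: sorted(rhs) for q, rhs in rules.items()}

MIN_DELTA = {("S", "c"): "Y", ("Y", "c"): "Y", ("Y", "a"): "F",
             ("F", "a"): "F", ("F", "b"): "F"}
productions = dfa_to_grammar(MIN_DELTA, {"Y", "F"})

for nt, rhs in productions.items():
    print(nt, "->", " | ".join(r or "eps" for r in rhs))
```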

Counting with grammars
Position-independent counting
Target language: $L = \{x \in (a|b)^+ : |x|_a = |x|_b\}$
Context-free language: we only need to count one kind of symbol
$S \to aAbG \mid bBaG$
$G \to aAbG \mid bBaG \mid \varepsilon$
$A \to aAb \mid bBa \mid \varepsilon \mid G$
$B \to aAb \mid bBa \mid \varepsilon \mid G$
or, in a more compact form,
$S \to aGbG \mid bGaG$
$G \to aGbG \mid bGaG \mid \varepsilon$
The free choice of the production rule at each step allows generating every combination

Counting With Grammars
Example of a derivation for the string bbaabaaababb:
1. $S \Rightarrow bBaG$. It starts with a sequence of $b$
2. $bBaG \Rightarrow bbBaaG$. Dealing with the first sequence of $b$
3. $bbBaaG \Rightarrow bbaaG$. The sequence is finished, thus get rid of $B$
4. $bbaaG \Rightarrow bbaabBaG$. Another pair starting with $b$
5. $bbaabBaG \Rightarrow bbaabaG$. The sequence is finished, thus get rid of $B$
6. $bbaabaG \Rightarrow bbaabaaAbG$. This time we have a sequence of $a$
7. $bbaabaaAbG \Rightarrow bbaabaaaAbbG$. Dealing with the sequence of $a$
8. $bbaabaaaAbbG \Rightarrow bbaabaaaGbbG$. There are other sequences between the second $a$ and the matching $b$, thus we need a new non-terminal $G$ before the $b$
9. $bbaabaaaGbbG \Rightarrow bbaabaaabBaGbbG$. We need to generate a pair $ba$
10. $bbaabaaabBaGbbG \Rightarrow bbaabaaabaGbbG$. The sequence is finished, thus get rid of $B$
11. $bbaabaaabaGbbG \Rightarrow^* bbaabaaababb$. We have generated all the necessary $A$ and $B$, thus we can get rid of every $G$
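The derivation can be replayed step by step. The Python sketch below is a hedged illustration: it assumes the four-rule grammar with non-terminals $S$, $G$, $A$, $B$, encodes $\varepsilon$ as the empty string, and always rewrites the leftmost non-terminal, which happens to be the order followed in the example. `RULES`, `STEPS` and `leftmost_step` are hypothetical names.

```python
RULES = {"S": ["aAbG", "bBaG"],
         "G": ["aAbG", "bBaG", ""],
         "A": ["aAb", "bBa", "", "G"],
         "B": ["aAb", "bBa", "", "G"]}

def leftmost_step(form, rhs):
    """Rewrite the leftmost non-terminal (uppercase symbol) of form with rhs."""
    pos = next(i for i, sym in enumerate(form) if sym.isupper())
    if rhs not in RULES[form[pos]]:
        raise ValueError(f"no production {form[pos]} -> {rhs!r}")
    return form[:pos] + rhs + form[pos + 1:]

# Right-hand sides applied in order; the last two entries erase the two G
STEPS = ["bBaG", "bBa", "", "bBaG", "", "aAbG",
         "aAb", "G", "bBaG", "", "", ""]

form, trace = "S", ["S"]
for rhs in STEPS:
    form = leftmost_step(form, rhs)
    trace.append(form)

print(" => ".join(trace))
```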

The Generative Approach of Grammars
Consider the language $L = \{a^m b^n \mid m \ne n,\ m, n \ge 0\}$. We want to write a grammar for this language.
A possible solution:
1. $S \to aSb \mid aA \mid Bb$ (generate balanced $a$, $b$ pairs until an unbalanced character is generated)
2. $A \to aA \mid aAb \mid \varepsilon$ (generate balanced and unbalanced $a$)
3. $B \to Bb \mid aBb \mid \varepsilon$ (generate balanced and unbalanced $b$)
Basic idea of the design: the non-terminal symbols $A$ and $B$ generate strings with $m \ge n$ and with $m \le n$, respectively. In order to introduce an $A$ or $B$ symbol, we must generate an unbalanced $a$ or $b$, thus ensuring that the generated strings have either $m > n$ or $m < n$. The unbalanced $b$ is appended on the right of $B$, so that every $a$ still precedes every $b$.

The Generative Approach of Grammars
Grammars are a generative model!
We do not need to let the same non-terminal generate both balanced and unbalanced characters: we can split the generated string into 2 parts, the first where $a$ and $b$ are balanced, the second where we add only $a$ or only $b$.
A simplified solution:
1. $S \to aSb \mid aA \mid bB$ (generate balanced $a$, $b$ pairs until an unbalanced character is generated)
2. $A \to aA \mid \varepsilon$ (add unbalanced $a$ after the balanced $a$)
3. $B \to bB \mid \varepsilon$ (add unbalanced $b$ before the balanced $b$)
The grammar is easier to design, since we have two simpler, separate generation phases
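The two generation phases of the simplified solution can be mimicked directly. In the Python sketch below, `derive` is a hypothetical helper that applies $S \to aSb$ a chosen number of times, then leaves $S$ through $aA$ or $bB$, then adds further unbalanced characters; the result is compared with the definition of $L$ for all words of length at most 6.

```python
def derive(k, unbalanced, extra):
    """Phase 1: k balanced pairs. Phase 2: 1 + extra unbalanced characters."""
    form = "S"
    for _ in range(k):
        form = form.replace("S", "aSb")            # S -> aSb
    nt = "A" if unbalanced == "a" else "B"
    form = form.replace("S", unbalanced + nt)      # S -> aA  or  S -> bB
    for _ in range(extra):
        form = form.replace(nt, unbalanced + nt)   # A -> aA  or  B -> bB
    return form.replace(nt, "")                    # A -> eps or  B -> eps

LIMIT = 6
generated = {w for k in range(LIMIT) for c in "ab" for e in range(LIMIT)
             if len(w := derive(k, c, e)) <= LIMIT}
expected = {"a" * m + "b" * n
            for m in range(LIMIT + 1) for n in range(LIMIT + 1)
            if m != n and m + n <= LIMIT}

print(derive(2, "b", 1), generated == expected)
```

Every word $a^m b^n$ with $m > n$ is obtained with $k = n$ and $m - n - 1$ extra characters, and symmetrically for $m < n$, which is why the two sets coincide.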

The Generative Approach of Grammars
Consider the language:
$L = \{ab^{n_1} ab^{n_2} \ldots ab^{n_k} \mid \forall i,\ n_i > 0 \wedge k \ge 2 \wedge \exists j, h\,(1 \le j < h \le k \wedge n_j = n_h)\}$
That is, the strings of the form $(ab^+)(ab^+)^+$ where at least two substrings $(ab^+)$ have the same number of $b$.
A grammar for $L$:
$S \to GaXG$ (this defines the structure of a string)
$X \to bXb \mid bGab$ (generate the two substrings with the same number of $b$)
$G \to aH \mid \varepsilon$ (generate a possibly empty sequence of substrings $ab^+$)
$H \to bH \mid bG$ (generate $b^+$)
The strategy is again to force the grammar to generate a substring with the required property (2 substrings with the same number of $b$) somewhere in the string, without specifying when this substring should be generated
Sooner or later the substring will be generated, and this is enough!
Again, grammars have more free will than automata
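A bounded brute-force comparison gives some confidence that the grammar matches the definition. The Python sketch below is only a check up to length 8, not a proof: `generate` is a hypothetical breadth-first enumerator, and `in_language` tests the definition with regular expressions (at least two blocks $ab^+$, two of which have the same length).

```python
import re
from collections import deque
from itertools import product

def generate(grammar, start, max_len):
    """Terminal strings with at most max_len symbols derivable from start."""
    words, seen, queue = set(), {(start,)}, deque([(start,)])
    while queue:
        form = queue.popleft()
        pos = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if pos is None:
            words.add("".join(form))
            continue
        for rhs in grammar[form[pos]]:       # expand the leftmost non-terminal
            new_form = form[:pos] + rhs + form[pos + 1:]
            if (sum(1 for s in new_form if s not in grammar) <= max_len
                    and new_form not in seen):
                seen.add(new_form)
                queue.append(new_form)
    return words

GRAMMAR = {"S": [("G", "a", "X", "G")],
           "X": [("b", "X", "b"), ("b", "G", "a", "b")],
           "G": [("a", "H"), ()],
           "H": [("b", "H"), ("b", "G")]}

def in_language(word):
    """At least two blocks ab+, and two blocks with the same number of b."""
    if not re.fullmatch(r"(ab+){2,}", word):
        return False
    sizes = [len(block) for block in re.findall(r"ab+", word)]
    return len(set(sizes)) < len(sizes)

LIMIT = 8
expected = {w for n in range(LIMIT + 1)
            for w in map("".join, product("ab", repeat=n)) if in_language(w)}
print(generate(GRAMMAR, "S", LIMIT) == expected)
```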

Proofs on grammars
Mathematical induction on generation
Goal: prove that $S \to 1S1 \mid 0S0 \mid 1 \mid 0 \mid \varepsilon$ generates all, and only, the palindromes over $\Sigma = \{0, 1\}$
The theorem we want to prove is a double implication:
1. $x = w(0 \mid 1 \mid \varepsilon)w^R,\ w \in \Sigma^* \implies S \Rightarrow^* x$
2. $S \Rightarrow^* x \implies x = w(0 \mid 1 \mid \varepsilon)w^R,\ w \in \Sigma^*$
We will thus use two different inductions:
1. Since the hypothesis is defined on words, induction on the word length
2. Since the hypothesis is defined on the grammar, induction on the number of applied productions

Proofs on grammars - Part 1
"Is a palindrome" $\implies$ "is generated by the grammar"
Base case: $\varepsilon$ (length 0) is a palindrome $\wedge$ $\varepsilon$ is generated by the grammar
Induction step: the theorem holds for $|x| = k - 1$, $k \in \mathbb{N}$; prove that it holds for $|x| = k$
Split into two cases:
1. $k$ is odd: $x = u(0 \mid 1)u^R$
2. $k$ is even: $x = ww^R$

Proofs on grammars - Part 1
1. $x = u(0 \mid 1)u^R$. By induction hypothesis, $S \Rightarrow^* uu^R$. We argue that $S \Rightarrow^* uSu^R$. Indeed, the string $uu^R$ has an even number of characters, and since every production that keeps $S$ in the sentential form generates 2 characters, we cannot stop the generation with the productions $S \to 1$ or $S \to 0$: we would get a string with an odd number of characters. Therefore, $S \to \varepsilon$ must be employed to get $uu^R$, which in turn implies that the grammar generates the string $uSu^R$. We can get $x$ from this string by applying one of the productions $S \to 1$ or $S \to 0$.
2. $x = ww^R = uaau^R$, $a \in \{0, 1\}$. By induction hypothesis, $S \Rightarrow^* uau^R$. As in the previous case, we also know that $S \Rightarrow^* uSu^R$, since $uau^R$ is necessarily obtained by applying one of the $S \to a$ productions as the last step. Then, if we apply the matching $S \to aSa$ production followed by $S \to \varepsilon$, we generate $x$.

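Proofs on grammars - Part 2
"Is generated by the grammar" $\implies$ "is a palindrome"
Base case: the productions $S \to 0 \mid 1 \mid \varepsilon$ generate $\varepsilon$, 0 and 1, which are palindromes
Induction step: the theorem holds for $S \Rightarrow^j w$, $j < k \in \mathbb{N}$; prove that it holds for $S \Rightarrow^k w$
We need to check that all the grammar productions preserve the palindrome property. For $a \in \{0, 1\}$, by inductive hypothesis:
$S \Rightarrow^{k-1} x,\ x \in L \implies S \Rightarrow^{k-2} wSw^R \implies S \Rightarrow^{k-1} waSaw^R$
$S \Rightarrow^{k-1} waSaw^R \implies S \Rightarrow^k waaw^R \in L$ (using $S \to \varepsilon$)
$S \Rightarrow^{k-1} waSaw^R \implies S \Rightarrow^k wa(0 \mid 1)aw^R \in L$ (using $S \to 0$ or $S \to 1$)
All the possible productions leading to a valid word in $k$ steps preserve the palindrome property, thus the theorem holds for $k$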
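The statement proved by the two inductions can also be sanity-checked on short words. The Python sketch below is an illustration, not a replacement for the proof: `derivable` is a hypothetical recursive membership test that mirrors the productions (base cases for $S \to \varepsilon \mid 0 \mid 1$, recursion for $S \to 0S0 \mid 1S1$), and it is compared with the usual palindrome test on every binary word of length at most 8.

```python
from itertools import product

def derivable(x):
    """True if S =>* x with S -> 1S1 | 0S0 | 1 | 0 | eps (x over {0, 1})."""
    if x in ("", "0", "1"):                   # S -> eps | 0 | 1
        return True
    # S -> 0S0 | 1S1: the outer characters match and the inner part
    # must itself be derivable from S
    return len(x) >= 2 and x[0] == x[-1] and derivable(x[1:-1])

words = ["".join(w) for n in range(9) for w in product("01", repeat=n)]
print(all(derivable(w) == (w == w[::-1]) for w in words))
```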