Formal Languages, Grammars and Automata Lecture 5 Helle Hvid Hansen helle@cs.ru.nl http://www.cs.ru.nl/~helle/ Foundations Group Intelligent Systems Section Institute for Computing and Information Sciences 6 June 2014 Helle Hvid Hansen 6 June 2014 FLGA 1 / 19
Midterm Enquete Results lecture level count (very slow) 0 0 1 4 (good) 2 7 3 5 (very fast) 4 3 average: 2.4 Comments: exercise level count (very easy) 0 0 1 3 (appropriate) 2 9 3 6 (very hard) 4 0 average: 2.2 Sometimes too fast, sometimes too slow (3 students). Solutions online (2 students). Helle Hvid Hansen 6 June 2014 FLGA 2 / 19
Overview Introduction Applications of finite automata: text search, natural language processing, lexical analysis (parsing), biology, video games (PacMan), internet protocols (TCP),... (see notes on webpage). Programming languages (like Java etc.) and natural languages are generally not regular. Helle Hvid Hansen 6 June 2014 FLGA 3 / 19
Overview Introduction Applications of finite automata: text search, natural language processing, lexical analysis (parsing), biology, video games (PacMan), internet protocols (TCP),... (see notes on webpage). Programming languages (like Java etc.) and natural languages are generally not regular. Formal Languages Grammars (generators) Automata (acceptors) Helle Hvid Hansen 6 June 2014 FLGA 3 / 19
Today Introduction Topics: Context-free grammars and context-free languages Regular grammars. Motivation/Application: Compilation of programming languages. Helle Hvid Hansen 6 June 2014 FLGA 4 / 19
Introduction Compilation and Parsing source code (ASCII) token string lexical analysis (DFA) (context-free language) parsing parse tree (data structure) code generation executable code Helle Hvid Hansen 6 June 2014 FLGA 5 / 19
Generating Strings with Production Rules Example: S O E E λ aea beb O a b aoa bob Helle Hvid Hansen 6 June 2014 FLGA 6 / 19
Generating Strings with Production Rules Example: S O E E λ aea beb O a b aoa bob Productions, always start with S S E aea abeba abba Helle Hvid Hansen 6 June 2014 FLGA 6 / 19
Generating Strings with Production Rules Example: S O E E λ aea beb O a b aoa bob Productions, always start with S S E aea abeba abba S E beb baeab babebab babaeabab babaabab Helle Hvid Hansen 6 June 2014 FLGA 6 / 19
Generating Strings with Production Rules Example: S O E E λ aea beb O a b aoa bob Productions, always start with S S E aea abeba abba S E beb baeab babebab babaeabab babaabab S O bob bab Helle Hvid Hansen 6 June 2014 FLGA 6 / 19
Generating Strings with Production Rules Example: S O E E λ aea beb O a b aoa bob Productions, always start with S S E aea abeba abba S E beb baeab babebab babaeabab babaabab S O bob bab S O bob baoab babab Helle Hvid Hansen 6 June 2014 FLGA 6 / 19
Generating Strings with Production Rules Example: S O E E λ aea beb O a b aoa bob Productions, always start with S S E aea abeba abba S E beb baeab babebab babaeabab babaabab S O bob bab S O bob baoab babab We can generate exactly the set of words w with w = w R (w is a palindrome) Helle Hvid Hansen 6 June 2014 FLGA 6 / 19
Context-Free Grammar Def. A context-free grammar (CFG) G = (V, Σ, S, P) consists of V Σ S P a set of non-terminal symbols a set of terminal symbols a start symbol, S V a set of production rules of the form X w where X V, w (V Σ) Helle Hvid Hansen 6 June 2014 FLGA 7 / 19
Context-Free Grammar Def. A context-free grammar (CFG) G = (V, Σ, S, P) consists of V Σ S P a set of non-terminal symbols a set of terminal symbols a start symbol, S V a set of production rules of the form X w where X V, w (V Σ) Notation (Backus-Naur Form or BNF): Group together rules for the same non-terminal: is shorthand for three rules: E λ aea beb E λ, E aea, E beb Helle Hvid Hansen 6 June 2014 FLGA 7 / 19
Derivations and Language Let G = (V, Σ, S, P) be a context-free grammar, and let u, v, w (V Σ) be arbitrary. Given a string uxv (V Σ), we can apply a rule X w: uxv uwv A derivation of u from v is a sequence of rule applications: v v v u We say that u can be derived from v in G if there is a derivation of u from v in G, and write v u. Helle Hvid Hansen 6 June 2014 FLGA 8 / 19
Derivations and Language Let G = (V, Σ, S, P) be a context-free grammar, and let u, v, w (V Σ) be arbitrary. Given a string uxv (V Σ), we can apply a rule X w: uxv uwv A derivation of u from v is a sequence of rule applications: v v v u We say that u can be derived from v in G if there is a derivation of u from v in G, and write v u. Why context-free? A rule X w can be applied in any context to replace X with w. (There are also context-sensitive grammars) Helle Hvid Hansen 6 June 2014 FLGA 8 / 19
Language and Parsing The language generated by G is L(G) = {w Σ S w} A parser is an algorithm that determines for given G and w whether w L(G)? Note: in general there can be many derivations of a w in G. Helle Hvid Hansen 6 June 2014 FLGA 9 / 19
Language and Parsing The language generated by G is L(G) = {w Σ S w} A parser is an algorithm that determines for given G and w whether w L(G)? Note: in general there can be many derivations of a w in G. Def. A derivation is leftmost if in each step a rule is applied to the leftmost non-terminal. (Think of reading from left to right). Rightmost derivations are defined analogously. Lemma: For a CFG G and word w, w L(G) iff there is a leftmost derivation of w in G. (So we can restrict to searching for a leftmost derivation.) Helle Hvid Hansen 6 June 2014 FLGA 9 / 19
Another Example Introduction G : S asb λ, B b Derivations of aabb: d 1 : d 2 : d 3 : S asb aasbb aabb aabb aabb S asb asb aasbb aasbb aabb S asb asb aasbb aabb aabb (Beware of typo in derivations in [Silva], Example 5.2.4) Derivation d 1 is leftmost, d 2 is rightmost, d 3 is neither. Derivation Trees (on blackboard). Helle Hvid Hansen 6 June 2014 FLGA 10 / 19
Ambiguity Introduction Def. A CFG G is unambiguous if for each w L(G) there exists a unique leftmost derivation of w in G. Otherwise, G is ambiguous. Helle Hvid Hansen 6 June 2014 FLGA 11 / 19
Ambiguity Introduction Def. A CFG G is unambiguous if for each w L(G) there exists a unique leftmost derivation of w in G. Otherwise, G is ambiguous. Two syntactically correct strings can have different meanings. Examples: Time flies like an arrow; fruit flies like a banana The peasants are revolting ( revolting = disgusting/in rebellion) Helle Hvid Hansen 6 June 2014 FLGA 11 / 19
Ambiguity Introduction Def. A CFG G is unambiguous if for each w L(G) there exists a unique leftmost derivation of w in G. Otherwise, G is ambiguous. Two syntactically correct strings can have different meanings. Examples: Time flies like an arrow; fruit flies like a banana The peasants are revolting ( revolting = disgusting/in rebellion) Programming language should be given by unambiguous grammar. Helle Hvid Hansen 6 June 2014 FLGA 11 / 19
Regular and Context-Free Languages Def. A language L is context-free if there is a CFG G such that L = L(G). Theorem: Every regular language is context-free. Helle Hvid Hansen 6 June 2014 FLGA 12 / 19
Regular and Context-Free Languages Def. A language L is context-free if there is a CFG G such that L = L(G). Theorem: Every regular language is context-free. Proof: Let M = (Q, q 0, δ, F ) be a DFA that accepts L Σ. Define a context-free grammar G M as follows: V = Q(non-terminals are states) Σ (terminal symbols are letters of alphabet) S = q 0 P = {q aq δ(q)(a) = q } {q λ q F } Helle Hvid Hansen 6 June 2014 FLGA 12 / 19
Regular and Context-Free Languages Def. A language L is context-free if there is a CFG G such that L = L(G). Theorem: Every regular language is context-free. Proof: Let M = (Q, q 0, δ, F ) be a DFA that accepts L Σ. Define a context-free grammar G M as follows: V = Q(non-terminals are states) Σ (terminal symbols are letters of alphabet) S = q 0 P = {q aq δ(q)(a) = q } {q λ q F } L(G M ) = L(M) since computations correspond to derivations: a M : q 1 a 0 2 a q1 n qn F G M : q 0 a 1 q 1 a 1 a 2 q 2 a 1 a n q n a 1 a n λ Helle Hvid Hansen 6 June 2014 FLGA 12 / 19
Introduction Def. A regular grammar is a CFG G = (V, Σ, S, P) in which all rules have the form X uy or X u where X, Y V and u Σ (u consists only of terminals). Helle Hvid Hansen 6 June 2014 FLGA 13 / 19
Introduction Def. A regular grammar is a CFG G = (V, Σ, S, P) in which all rules have the form X uy or X u where X, Y V and u Σ (u consists only of terminals). Example: S abax Y X bx ay λ Y X ax bb Helle Hvid Hansen 6 June 2014 FLGA 13 / 19
Introduction Def. A regular grammar is a CFG G = (V, Σ, S, P) in which all rules have the form X uy or X u where X, Y V and u Σ (u consists only of terminals). Example: S abax Y X bx ay λ Y X ax bb Theorem: Every regular language is generated by a regular grammar (by the previous construction). Note: In some texts, a regular grammar is defined as having rules only of the form X ay or X λ. This corresponds to the difference between DFA and NFA-λ. Helle Hvid Hansen 6 June 2014 FLGA 13 / 19
and Regular Languages Theorem: If G is a regular grammar, then L(G) is a regular. Helle Hvid Hansen 6 June 2014 FLGA 14 / 19
and Regular Languages Theorem: If G is a regular grammar, then L(G) is a regular. Proof: Build an NFA-λ M G = (Q, q 0, δ, F ) as follows: State set Q = V plus some extra states (see below), q 0 = S, F = {X V X λ} plus some extra states (see below) Transitions: For each rule of the form X a 1 a 2 a n Y, add new states q 1, q 2,..., q n 1 to Q and transitions X a1 a q 2 a n 1 a 1 n qn 1 Y For each rule of the form X a 1 a 2 a n, add new states q 1, q 2,..., q n 1 to Q and transitions X a1 a q 2 a n 1 a 1 n qn 1 qn and add q n to F. For each rule of the form X Y add λ-transition X λ Y. Helle Hvid Hansen 6 June 2014 FLGA 14 / 19
and Regular Languages II...Proof continued: Again, derivations correspond to computations, e.g. in we have: G : G : S abax Y X bx ay λ Y X ax bb S Y ax aay aabb M G : S λ Y a X Hence we have: L(M G ) = L(G). a Y b q 1 b q n F Helle Hvid Hansen 6 June 2014 FLGA 15 / 19
Parsing Algorithm Introduction Given a CFG G and a word w, how do we determine w L(G)? The Cocke-Younger-Kasami (CYK) ALgorithm. Requires G to be in Chomsky normal form. Helle Hvid Hansen 6 June 2014 FLGA 16 / 19
Parsing Algorithm Introduction Given a CFG G and a word w, how do we determine w L(G)? The Cocke-Younger-Kasami (CYK) ALgorithm. Requires G to be in Chomsky normal form. Def. A CFG G is in Chomsky normal form if all its productions are of the form: X YZ or X a where X, Y, Z V and a Σ. (Note: λ / L(G).) Lemma: Every CFG G can be transformed into an equivalent CFG in Chomsky normal form. (See e.g. Sudkamp.) Helle Hvid Hansen 6 June 2014 FLGA 16 / 19
Introduction Idea: For each substring u of input word w, compute the set of non-terminals that can produce u. Example: G : S AB BA SS AC BD, A a, C SB, B b, D SA. Run algorithm on w = aabbab. (continue on blackboard, see notes on webpage) Helle Hvid Hansen 6 June 2014 FLGA 17 / 19
The CYK-Algorithm Introduction Helle Hvid Hansen 6 June 2014 FLGA 18 / 19
Context-Free Art: http://www.contextfreeart.org Helle Hvid Hansen 6 June 2014 FLGA 19 / 19