Lecture 17: Language Recognition

Lecture 17: Language Recognition Finite State Automata Deterministic and Non-Deterministic Finite Automata Regular Expressions Push-Down Automata Turing Machines

Modeling Computation When attempting to deal with theoretical issues concerning digital computers, we need a simplified definition of what a computer does. Such a description is called a computational model. The first computational model we will consider is called a finite automaton or finite state machine. A finite automaton is depicted below as a state control with a read head that scans an input string. At the end of the string the state control will indicate whether or not the scanned string is accepted or rejected as a member of the set of strings recognized by the finite automaton.

Formal Definition of Computation The sequence of state transitions shown above is formally defined as computation in the following manner: Let M = (Q,Σ,δ,q 0,F) be a finite automaton and w = w 1 w 2...w n be a string over the alphabet Σ. Then M accepts the string w if a sequence of states r 0,r 1,...,r n exists in Q with the following three conditions: 1. r 0 =q 0, 2. δ(r,w i+1 ) = r i+1 for i = 0,..., n-1, and 3. r n is an element F. These three conditions state that (1) the inital state must be the start state, (2) the sequence of state transitions must be valid transitions for the finite state machine M and (3) the final state of M after processing w must be an accept state.

Finite Automaton Formally, a finite Automaton (FA) is defined as a 5-tuple (Q,S, d, q0, F) where, (1) Q is a finite set called the states. (2) S is a finite set called the alphabet. (3) d:q x S -> Q is the transition function (4) q0 is an element of Q called the start state, and (5) F is a subset of Q called the set of accept states. The set of states Q is just a set of labels, q0..q3. The alphabet S is the set of symbols that can be understood by the FA. In this case the alphabet consists of the two symbols 0 and 1. This means that the FA will be able to read (scan the symbols in) strings comprised only of 0's and 1's.

State Transition Function The transition function is represented by the arrows in the state transition diagram. This is a discrete function that maps every possible combination of a state with a symbol from the alphabet to another state, as shown in the table below: Note that there is a transition out of each state for each symbol, so we have 4x2=8 transitions. The start state, q0 is the initial state of the FA. The set F is the set of accept states. In our example F = {q1, q3}. This means that the binary string being scanned is recognized or accepted by the FA if the final state (after scanning the binary string) is one of the accept states.

Let's try a few example strings. (Scan from left to right)

Regular Language The set of all strings A accepted by a finite state machine M is called the language of M. We say that M accepts or recognizes A. Sybollically we write, A={ω M accepts ω} We say that A is a regular language if it is recognized by a finite automaton.

Regular Operations The operations of union, concatenation and Kleene-closure (also called star) are called regular operations on regular languages. Union: A U B = {x x is an element of A or x is an element of B} Concatenation: AoB = {xy x is an element of A and y is an element of B} Kleene-closure: A* = {x 1 x 2...x k k>= 0 and each x i is an element of A}

Nondeterminism When there is exactly one transition from each state for each alphabet symbol a finite automaton is said to be deterministic. When there can be zero transitions or more than one transition from some states for certain alphabet symbols, then the finite automaton is said to be nondeterministic. Nondeterministic Finite Automata (NFA) are essential to the development of theoretical computer science because they greatly simplify several important proofs. Every deterministic finite automata (DFA) is automatically an NFA since NFAs are a more general form of DFAs. A nondeterministic finite automation is a 5-tuple (Q,Σ,δ,q 0,F), where, (1) Q is a finite set of states (2) Σ is a finite alphabet (3) δ: QxΣε -> P(Q) is the transition function (4) q 0 is an element of Q (the start state) (5) F is a subset of Q (the set of accept states).

An Example This definition is similar to the DFA except that the nondeterminism permits the transition from a state to more than one state for the same symbol in the alphabet. In general a nondeterministic transition can be from a state to any subset of the states in Q. This is why QxΣε maps to the power set of Q P(Q) rather than to just Q as for the DFA. Σε refers to the alphabet (set of symbols) with epsilon ε. An epsilon move is a transition from one state to another without reading a symbol from the candidate string. a b b a

Equivalence of NFA's and DFA's In terms of computational power, the DFA and the NFA are equivalent. This means that for every NFA there is a DFA that recognizes the same language. Since every DFA is an NFA the reverse is trivially true. Rather than reviewing the proof we will use the proof idea The methods in this to generate an equivalent DFA for an NFA as shown in the following example: Given the NFA of the previous examle, we will construct a DFA that recognizes the same language. 1. We will build a DFA with a set of states Q D whose members correspond to the members of P(Q). Since the NFA has states Q={q 0,q 1,q 2 } the DFA will have states corresponding to { φ, {q 0 }, {q 1 }, {q 2 }, {q 0,q 1 }, {q 0,q 2 }, {q 1,q 2 }, {q 0,q 1,q 2 }} and represented by Q D ={q φ, q 0, q 1, q 2, q 01,q 02, q 12, q 012 }.

2. The start state for the DFA is the state representing the subset of states in the NFA that includes the start state and all the states reachable from the start state on an epsilon move. In this case the start state is q 01. The accept states in the DFA are all the states derived from one or more of the accept states of the NFA. Recall that, in a DFA there is a transition from each state for each symbol in the alphabet Σ. Since states in an NFA may have no transitions for some symbols, we need a trap state in our DFA to dump the deadend transitions (i.e. those that do not exist in the NFA). This trap state is not an accept state and it has a transition to itself on every symbol.

3. Now we will define the DFA transitions for each of the NFA transitions. Transitions from q 0 - Starting with state q 0 we observe that there is no transistion in the NFA from q 0 reading the symbol a, therefore we will include a transition from state q 0 in the DFA to state q φ. There is a transistion from q 0 to q 2 in the NFA so we simply replicate this transition for the DFA. For convenience we show the NFA on the right below.

Transitions from q 1 - The NFA transitions from q 1 reading a and b are transferred to the DFA. Since there is no transition from q0 since

Transitions from q 0 and q 1 - The state q 01 in the DFA corresponds to the pair of states q 0 and q 1 in the NFA. Look at the NFA and determine all the states that can be reached from both of these states when reading the symbol a. In this case we note that the transtion from q1 to q0 for the symbol a also leads back to q1 on an epsilon move. Since states q 0 and q 1 can both be reached upon reading the symbol a, we note that this corresponds to the single state q 01 in the DFA.

Transitions from q 2 - State q 2 in the NFA has two transitions for the symbol a, one to state q 1 and one back to state q 2, so we generate a transition from state q 2 to state q 12 in the DFA. Threre is a single transition from q 2 to q 1 in the NFA so it is copied to the DFA.

Transitions from q 1 and q 2 - We observe that from states q 1 and q 2 every state in the NFA can be reached on transitions for the symbol a. We represent this set of transitions in the DFA with an a-transition from q 12 to q 012. There is only one transition for the symbol b from states q 1 and q 2 in the NFA, which we copy to the DFA.

Transitions from q 0 and q 2 - Transitions from q 0 and q 2 for both a and b go to states q 1 and q 2 in the NFA, therefore we generate transitions from state q 02 to state q 12 in the DFA for symbols a and b.

Transitions from states q 0, q 1 and q 2 - Generating the DFA transitions for state q 012 requires that we determine all the states that can be reached from any state in the NFA for each of the symbols a and b. Every state in the NFA is can be reached on an a-transition from some state, therefore we generate a transition from q012 to itself for the symbol a. There is a transition from some state to every state but q 0 for the symbol b, so we generate a b- transition from state q 012 to state q 12 in the DFA.

4. Next we remove any unreachable states. In this example, state q 0 and q 02 have no transitions into them, therefore they are unrechable and can be removed without affecting the language being recognized by this DFA.

Finally we can rearrange the positions of the nodes to simplify the graphical representation. NFA DFA

Regular Expressions Up to now we have defined regular languages using finite automata and we have shown that DFAs and NFAs are equivalent. Sometimes using a state transition diagram to describe a regular language is inconvenient. Regular expressions are algebraic forms for defining regular languages, that use regular operations on the symbols of the regular language to indicate the member strings of the lanugage. For example, the regular expression, (0+1)0* describes the regular language over the alphabet {0,1} consisting of all binay strings starting with a 0 or a 1 and followed by zero or more 0's. The regular operations include a choice or union operator (+ or ), concatenation (implied by sequence of symbols or shown as a fixed number of repetitions n e.g. xy n z), and star or Kleene closure (*) which means zero or more repetitions of a substring..

Regular Expressions are Equivalent to NFA's We wish to show that the regular expressions are equivalent to NFAs and DFAs in the ability to describe regular languages (i.e. they all have the same computational power to describe the class of languages called regular). The first step in this goal is to show that the class of regular languages are closed under the regular operations of union, concatenation and star. The formal proofs are covered in the textbook. Here will look at graphical representations of the proof ideas. We represent generic NFAs as collections of three types of states,. and we will use epsion moves to combine generic NFAs to construct new NFAs demonstrating closure for each of the regular operations.

Regular Languages Closed under Union Theorem: The class of regular languages is closed under the union operation. In the diagram above we see that the two NFAs, N1 and N2 can be combined into a new NFA, N that recognizes the union of the languages on N1 and N2. N is constructed by adding a new start state and connecting it to the start states of N1 and N2 with epsilon moves. The remaining structures of N1 and N2 are left unchanged. It should be clear that any string that is accepted by N1 or N2 will be accepted by N and that any string that is accepted by N would be accepted by N1 or N2 or both.

Regular Languages Closed under Concatenation Theorem: The class of regular languages is closed under the concatenation operation. In this construction we wish to show that N accepts any string that is the contatenation of a member string from N2 to the end of a member string from N1. N is created by connecting all the accept states in N1 to the start state of N2 with epsilon moves. We then convert all the accept states of N1 to non-accept states (there will be no accept states in the portion of N that was N1.) Note that this works whether or not the start states of N1 and N2 are accept states. For example, if N1 accepts the empty string but N2 does not, then N should not accept the empty string. On the other hand if both N1 and N2 accept the empty string then there will be an epsion move from the start state of N1 to the start state of N2 (which will still be an accept state).

Regular Languages Closed under Star Operation Kleene Closure Theorem: The class of regular languages is closed under the star operation. Converting an NFA, N1 into N1* can be accomplished by connecting all the accept states to the original start state through epsilon moves. Since N1* must accept the empty string we will add a new start state which is also an accept state and connect it to the original start state of N1 through an epsion move. You may wonder why we don't just make the original start state into an accept state. The reason this does not work is that the NFA, N1 could have a path of transitions back to the start state for some string of symbols that is not a member string. Converting the start state of N1 into an accept state would result in these strings being accepted.

Generating Regular Expressions Create a regular expression for the set of binary strings that contain at least two consecutive 0's. any sequence of 0's and 1's including nothing followed by two consecutive 0's followed by any sequence of 0's and 1's including nothing (0 + 1)* 00 (0 + 1)*

another Example Create a regular expression to generation all binary strings that do not contain two consecutive 0's. 0 1 01 10 11 011 101 110 0101 0110 1010 0111 1011 1101 1110 1111 (1 + 01)* (0 + λ) 01 01 1 01 1 1 01 01 1 1 01 1 01 1 1 1 1 01 0

Regular Expressions are not Unique L = { all strings with at least two consecutive 0 } r1 = (1 + 01)*(0 + λ) r = (1*011*)*(0 + λ ) + 1*(0 + 2 λ ) L r ) = L( r ) = ( 1 2 L r and 1 r 2 are equivalent regular expr.

Converting a Regular Expression to an NFA 0 (00 + 11)* 0 start 0 ε 0 ε ε 0 0 ε ε 1 1 ε

Pushdown Automata We will learn that some context-free grammars are not regular. The reason for this is that context-free grammars can be recursive. We can modify finite automata to be able to account for this recursive structure and to recognize context-free languages. The missing feature in finite automata is an accessible and arbitrarily large memory capacity. We include a stack memory component to an NFA to obtain the pushdown automata.

Formal Definition of Pushdown Automata A pushdown automata (PDA) reads symbols from the input string and pushes symbols to, and pops symbols from, a stack memory. The symbol set for the input string is not necessarily the same as the symbol set for the stack memory. Formally we define a pushdown automata as a 6-tuple (Q,Σ,Γ,δ,q 0,F) where 1. Q is the set of states, 2. Σ is the input symbol set, 3. Γ is the stack symbol set, 4. δ:q x Σε x Γε-> P(Q x Γε) is the transition function, 5. q 0 is an element of Q (the start state), 6. F is a subset of Q (the set of accept states). The symbol ε as a subscript on the input symbol set and the stack symbol set in the definition of the transition function means that the PDA can move (transition) without reading an input symbol or a stack symbol.

Pushdown Automata is more Computationally Powerful than Regular Expressions Example: Recall that the language {0 n 1 n n>= 0 } is not regular so there is no NFA that can recognize it. As an example of the increased computational power of pushdown automata we will describe a PDA that recognizes this non-regular language. Q = { q1, q2, q3, q4}, Σ = { 0,1} Γ = {0,$}, q start = q1, F = {q1,q4}, δ is defined in the table below:

Graphical Representation of the PDA Also, we can display the PDA graphically as, The left side of each transition label represents the current input symbol, the right side shows the stack symbol being popped off the top off the stack followed by the stack symbol to be pushed onto the stack. Input symbols are read in a left-to-right direction. The ε symbol indicates either an epsilon move (i.e. transition without reading input symbol or a stack symbol) or that the PDA should not write to the stack for the current transition.

Operation of the PDA Let's examine the operation of this PDA a little more closely. First of all, since the start state is an accept state, an empty input string is accepted. Since there is an epsilon move out of state q1, we also transition to state q2. On this transition a $ symbol is pushed onto the stack when the PDA transitions to state q2. Each 0 symbol read from the input string when the PDA is in state q2 causes a 0 to be pushed onto the stack. When a 1 is read from the input string in state q2 one of the zero's (if stack not empty) is popped off the stack and the PDA transitions to state q3 without pushing anything back onto the stack. For each 1 read from the input while in state q3, a 0 is popped off the stack. As soon as the special symbol $ reaches the top of the stack the PDA executes an epsilon move to state q4. If the end of the string has been reached at this time it will be accepted (PDA in an accept state). If another input symbol is read when the PDA is in states q3 and q4 there is no valid transition and all control paths are terminated.

An Example For example, consider the operation of this PDA while scanning the string 0001111. At the begining the non-deterministic nature of the PDA is realized as we transition from q1 to q2 on an epsilon move. Before we read the first symbol of the input string we have pushed a '$' onto the stack.

The first symbol in the input string is a '0' so the only valid transition is from q2 to q2. For this transition we do not pop a value from the stack but we push a '0' symbol onto the stack. This transition is repeated for the next two '0' symbols in the input string as shown below:

The next input symbol being read is a '1'. For this symbol the only valid transition is from q2 to q3. For this transition we pop a symbol '0' off the stack but we do not push a symbol onto the stack. Note that there is no epsilon move from q3 to q4 at this point since we are not seeing a '$' symbol at the top of the stack. For the next two input symbols (both of them '1's) we stay in state q3 and pop a '0' off the stack.

Since the '$' symbol is now at the top of the stack, there is a valid epsilon move from q3 to q4. This transition pops the '$' off the stack leaving it empty and placing the PDA in both states q3 and q4. If this were the end of the string we would accept it as a member of the language {0 n 1 n n>= 0 }, however there is another '1' symbol in the input string. Reading the final '1' symbol in the input string we find that there is no valid transition from either q3 or q4 on a '1' input symbol AND an empty stack. In this case we are left in no state, so we do not accept.

Theorem: Context-Free <=> Pushdown Automaton Theorem: A language is context free if and only if some pushdown automaton recognizes it. This theorem is typical of the if and only if theorems in that it requires a proof in two directions. Lemma: If a language is context free, then some pushdown automaton recognizes it. We will assume that A is a CFL. By definition A must have a corresponding CFG, G, that generates it. We need a procedure (an algorithm) for converting any CFG, G into an equivalent PDA, P. We can define such an algorithm: 1. Place the marker symbol $ and the start variable on the stack of the PDA P. 2. Repeat the following steps forever: a. If the top of stack is a variable symbol A, nondeterministically select one of the rules for A and substitute A by the string on the right-hand side of the rule. b. If the top of stack is a terminal symbol a, read the next symbol from the input and compare it to a. If they match then repeat. If they do not match, reject on this branch of the nondeterminism. c. If the top of stack is the symbol $, enter the accept state. Doing so accepts the input if it has all been read.

Lemma: If a pushdown automaton recognizes some language, then it is context free. Given a PDA, P we need to design a CFG, G that will generate all the strings accepted by P where P=(Q,Σ,Γ,δ,q0,{q accept }). The variables of G are {A pq p,q are in Q}. the start variable is Aqo,q accept. The rules of Γ are: 1. For each p,q,r,s in Q, t in Γ, and a,b in Σε, if δ(p,a,ε) contains (r,t) and δ(s,b,t) contains (q,ε) put the rule A pq -> aa rs b in G. 2. For each p,q,r in Q put the rule A pq -> A pr A rq in G. 3. Finally, for each p in Q put the rule A pp ->ε in G. We can prove that this construction works by demonstratring that A pq generates x if and only if (iff) x can bring P from p with empty stack to q with empty stack. Once again this proof is broken down into two separate cases corresponding to the two if..then conditions implied by the iff. Both of these proofs are by induction and can be found in a number of textbooks. We will assume that the proof is valid and, instead, spend our time using it to generate CFG's from PDA's and vice versa.

Designing Context-Free Grammars The design and construction of CFG's is a bit more involved than the creation of finite automata. We can use the following techniques to make this process simpler. Technique 1: Many CFG's can be constructed through the union of simpler CFG's. If your goal CFG can be broken down into a number of simpler CFG's you can combine them by creating a new rule S -> S1 S2... Sk, where the Si are the start variables for each of the simpler CFG's and S is the start variable for the goal CFG. Example: Create the grammar for the language L= { 0 n 1 n or 1 n 0 n n>=0}. S 1 -> 0S 1 1 ε is the grammar for { 0 n 1 n n>=0}. S 2 -> 1S 2 0 ε is the grammar for { 1 n 0 n n>=0}. so S -> S 1 S 2 S1 -> 0S 1 1 ε S 2 -> 1S 2 0 ε is the grammar for L.

Technique 2: When constructing a CFG for a language that is regular, you can first create a DFA for the language. You can convert any DFA into an equivalent CFG in the following way: 1. Make a variable R i for each state qi of the DFA. 2. Add the rule R i -> ar j to the CFG for each transition δ(q i,a) = q j in the DFA. 3. Add the rule R i -> ε if qi is an accept state of the DFA. 4. Make R 0 the start variable of the grammar, where q 0 is the start state of the DFA. Example: Create a grammar to recognize the language over {0,1} of all strings containing the substring 1101. First we create the DFA as shown below: Now, using these rules, we build the CFG: R 0 -> 0R 0 1R 1 R 1 -> 0R 0 1R 2 R 2 -> 0R 3 1R 2 R 3 -> 0R 0 1R 4 R 4 -> 0R 4 1R 4 ε

Technique 3: Some context-free languages are defined by the matching or comparing arbitrarily long substrings. In such cases the substrings are said to be linked. Consider the example of palindromes, P={ w w R aw is in P and a is in Σε), where w R is the reverse substring of w. The substrings w and w R are linked. In such situations we can construct a CFG using a rule of the form R -> urv, which simultaneously generates the linked substrings on either side of the variable R. Example: Create a CFG over {0,1} for the language {0 n 1 n n>=0} R 0 -> 0R1 ε.

Technique 4: In more practical CFG's such as grammars for computer languages or arithmetic expressions, the stirngs can contain recursively occurring structures. When generating such grammars you may place symbols that resolve to these recursive structures into the rules at the locations at which the recursive structures can appear. Example: Create a CFG that generates arithmetic expressions with the terminals a,b and c and the operations of addition (+) and multiplication (x). <EXPR> -> <EXPR> + <TERM> <TERM> <TERM> -> <TERM> x <FACTOR> <FACTOR> <FACTOR> -> ( <EXPR> ) a b c The parse trees above show how the grammar generates valid arithmetic expressions. The tree on the right includes a recursive component in which a <FACTOR> can generate another expression inside parentheses.

Theorem: The Pumping Lemma for Context-Free Languages. If A is a context-free language, then there is a number p (the pumping length) where, if s is any string in A of length at least p, then s may be divided into five pieces s = uvxyz satisfying the conditions: 1. For each i >= 0, uv i xy i z is an element of A, 2. vy > 0, and 3. vxy <= p.

Example: Use the pumping lemma for context-free languages to show that the language B={a n b n c n n >= 0} is not context free. Just as with the pumping lemma for regular expressions we will use proof by contradiction to show that our example language is not context-free. First assume that B is a context-free language, then let p be the pumping length for B. This p must exist by the pumping lemma. Now we select a string s = a p b p c p. Possible examples of s include abc aabbcc aaabbbccc aaaabbbbcccc We note that s is a member of B and that s >=p, as required by the pumping lemma. Now if we assume that s is context-free then we will be able to divide it into five sections s = uvxyz where either v or y is nonempty (condition 2). We can now consider one of two possibilities for the substrings v and y. Case 1: The substrings v and y each contain all of the same symbols. Whether v and y both contain the same symbols or different symbols at most there will be two of the three symbols in v and y. Therefore, pumping v and y will result in an unequal number of the symbols a, b and/or c. Case 2: One of the two substrings v and y (or both) contain more than one type of symbol. In this case pumping v and/or y creates symbols in the wrong order. One of these two cases must occur and since each results in a contradiction, so B is not context free.

Some Philosophical Questions Are natural languages context-free? Are context-free grammars useful for describing/understanding the grammar of a natural language? Not too long ago, the answer to the first question was clearly no, however, more recent studies have brought these earlier conclusions into question. In 1957, Noam Chomsky demonstrated that context-free grammars were inadequate to represent the grammar of natural languages (called transformational grammars). He showed that a context free grammar could represent only the 'base component' of natural language and that any natural language would need the addition of transformations. However, in 1982, Gazdar showed that a context free grammar could, in fact, encompass all the transformations suggested by Chomsky. Since that time it remains an open question if all natural languages can be represented by context free grammars or if CFG's are the best way to represent them even if it is possible. Pullum provides a good overview of the history of this development in formal linguistics.

References 1. Lecture 28. Push-Down Storage Automata and Context-Free Grammars - Ling 409 Lecture Notes, Partee, Lecture 28, Dec 8, 2003 2. Pullum, Geoffrey K. 1991. The Great Eskimo Snow Vocabulary Hoax and Other Irreverent Essays on the Study of Language. Chicago: The University of Chicago Press. book review.