UNIT II REGULAR LANGUAGES - PDF Free Download

1 UNIT II REGULAR LANGUAGES Introduction: A regular expression is a way of describing a regular language. The various operations are closure, union and concatenation. We can also find the equivalent regular expression for the given finite automata. The properties of the regular languages include the pumping lemma, closure properties, decision properties and minimization techniques. Objectives: To learn about Regular Expressions. To learn about Finite Automata and Regular Expressions. To learn about Applications of Regular Expressions. To learn about Regular Grammars. Regular Languages: Operations on Languages Let L, L 1, L 2 be subsets of Σ * Concatenation: L 1 L 2 = {xy x is in L 1 and y is in L 2 } Concatenating a language with itself: L 0 = {ε} L i = LL i-1, for all i >= 1 Kleene Closure: L * = L i = L 0 U L 1 U L 2 U Positive Closure: L + = L i = L 1 U L 2 U Question: Does L + contain ε? Kleene closure 1 Say, L ={a, abc, ba}, on Σ ={a,b,c} Then, L 3 2 = {aa, aabc, aba, abca, abcabc, abcba, baa, baabc, baba} L = {a, abc, ba}. L 1 L* = {ε, L, L, L,...} Regular Expressions 2 3 Highlights: 2

2 A regular expression is used to specify a language, and it does so precisely. Regular expressions are very intuitive. Regular expressions are very useful in a variety of contexts. Given a regular expression, an NFA-ε can be constructed from it automatically. Thus, so can an NFA, a DFA, and a corresponding program, all automatically! Definition of a Regular Expression Let Σ be an alphabet. The regular expressions over Σ are: Ø Represents the empty set { } ε Represents the set {ε} a Represents the set {a}, for any symbol a in Σ Let r and s be regular expressions that represent the sets R and S, respectively. r+s Represents the set R U S (precedence 3) rs Represents the set RS (precedence 2) r * Represents the set R* (highest precedence) (r) Represents the set R (not an op, provides precedence) If r is a regular expression, then L(r) is used to denote the corresponding language. Examples: Let Σ = {0, 1} (0 + 1)* All strings of 0 s and 1 s 0(0 + 1)* All strings of 0 s and 1 s, beginning with a 0 (0 + 1)*1 All strings of 0 s and 1 s, ending with a 1 (0 + 1)*0(0 + 1)* All strings of 0 s and 1 s containing at least one 0 (0 + 1)*0(0 + 1)*0(0 + 1)* All strings of 0 s and 1 s containing at least two 0 s (0 + 1)*01*01* All strings of 0 s and 1 s containing at least two 0 s (1 + 01*0)* All strings of 0 s and 1 s containing an even number of 0 s 1*(01*01*)* All strings of 0 s and 1 s containing an even number of 0 s (1*01*0)*1* All strings of 0 s and 1 s containing an even number of 0 s Equivalence of Regular Expressions and NFA-εs Note: Throughout the following, keep in mind that a string is accepted by an NFA-ε if there exists a path from the start state to a final state. Lemma 1: Let r be a regular expression. Then there exists an NFA-ε M such that L(M) = L(r). Furthermore, M has exactly one final state with no transitions out of it.

3 Proof: (by induction on the number of operators, denoted by OP(r), in r). Inductive Hypothesis: Suppose there exists a k 0 such that for any regular expression r where 0 OP(r) k, there exists an NFA-ε such that L(M) = L(r). Furthermore, suppose that M has exactly one final state. Inductive Step: Let r be a regular expression with k + 1 operators (OP(r) = k + 1), where k + 1 >= 1. Case 1) r = r 1 + r 2 Since OP(r) = k +1, it follows that 0<= OP(r 1 ), OP(r 2 ) <= k. By the inductive hypothesis there exist NFA-ε machines M 1 and M 2 such that L(M 1 ) = L(r 1 ) and L(M 2 ) = L(r 2 ). Furthermore, both M 1 and M 2 have exactly one final state. Construct M as: Case 2) r = r 1 r 2 Since OP(r) = k+1, it follows that 0<= OP(r 1 ), OP(r 2 ) <= k. By the inductive hypothesis there exist NFA-ε machines M 1 and M 2 such that L(M 1 ) = L(r 1 ) and L(M 2 ) = L(r 2 ). Furthermore, both M 1 and M 2 have exactly one final state. Construct M as: Case 3) r = r 1 *

4 Since OP(r) = k+1, it follows that 0<= OP(r 1 ) <= k. By the inductive hypothesis there exists an NFA-ε machine M 1 such that L(M 1 ) = L(r 1 ). Furthermore, M 1 has exactly one final state. Example: r = 0(0+1)* r = r 1 r 2 r 1 = 0 r 2 = (0+1)* r 2 = r 3 * r 3 = 0+1 r 3 = r 4 + r 5 r 4 = 0 r 5 = 1 Example: r = 0(0+1)* r = r 1 r 2 r 1 = 0 r 2 = (0+1)* r 2 = r 3 * r 3 = 0+1

5 r = r + r 3 4 5 r 4 = 0 r 5 = 1 Example: r = 0(0+1)* r = r 1 r 2 r 1 = 0 r 2 = (0+1)* r 2 = r 3 * r 3 = 0+1 r 3 = r 4 + r 5 r 4 = 0 r 5 = 1

6 Example: r = 0(0+1)* r = r 1 r 2 r 1 = 0 r 2 = (0+1)* r 2 = r 3 * r 3 = 0+1 r 3 = r 4 + r 5 r 4 = 0 r 5 = 1 Definitions Required Converting a DFA to a Regular Expression Let M = (Q, Σ, δ, q 1, F) be a DFA with state set Q = {q 1, q 2,, q n }, and define: R i,j = { x x is in Σ* and δ(q i,x) = q j } R i,j is the set of all strings that define a path in M from q i to q j. Note that states have been numbered starting at 1, not 0! Example:

7 R 2,3 = {0, 001, 00101, 011, } R 1,4 = {01, 00101, } R 3,3 = {11, 100, } Another definition: R k = { x x is in Σ* and δ(q,x) = q, and for no u where 1 u < x and i,j i j x = uv there is no case such that δ(q i,u) = q p where p>k} In words: R k is the set of all the strings that define a path in M from q to q but that i,j i j passes through no state numbered greater than k. Note that it may be true that i>k or j>k, only the intermediate states may not be >k.

8 R 4 2,3 = {0, 1000, 011, } R1 2,3 = {0} 111 is not in R 4 2,3 111 is not in R 1 2,3 101 is not in R 1 2,3 R 5 2,3 = R 2,3 Obeservations: 1) R n i,j = R i,j 2) R k-1 i,j is a subset of R k i,j 3) L(M) = R n 1,q = R 1,q 4) R 0 = Easily computed from the DFA! i,j 5) R k i,j = Rk-1 i,k (Rk-1 k,k )* Rk-1 k,j U Rk-1 i,j Notes on 5: 5) R k i,j = Rk-1 i,k (Rk-1 k,k )* R k-1 k,j U Rk-1 i,j Consider paths represented by the strings in R k i,j :

9 IF x is a string in R k then no state numbered > k is passed through when processing x i,j and either: q k is not passed through, i.e., x is in R k-1 i,j q k is passed through one or more times, i.e., x is in R k-1 i,k (Rk-1 k,k )* R k-1 k,j Lemma 2: Let M = (Q, Σ, δ, q 1, F) be a DFA. Then there exists a regular expression r such that L(M) = L(r). Proof: First we will show (by induction on k) that for all i,j, and k, where 1 i,j n and 0 k n, that there exists a regular expression r such that L(r) = R k i,j. Basis: k=0 R 0 contains single symbols, one for each transition from q to q, and possibly ε if i=j. i,j i j case 1) No transitions from q i to q j and i!= j r 0 i,j = Ø case 2) At least one (m 1) transition from q i to q j and i!= j r 0 i,j = a 1 + a 2 + a 3 + + a m for all 1 p m where δ(q i, a p ) = q j, case 3) No transitions from q i to q j and i = j r 0 i,j = ε

10 case 4) At least one (m 1) transition from q i to q j and i = j r 0 = a + a + a + + a + ε where δ(q, a ) = q i,j 1 2 3 m i p j for all 1 p m Inductive Hypothesis: Suppose that R k-1 can be represented by the regular expression i,j rk-1 for all i,j 1 i,j n, and some k 1. Inductive Step: Consider R k = i,j Rk-1 i,k (Rk-1 k,k )* R k-1 U k,j Rk-1. By the inductive hypothesis there exist i,j regular expressions r k-1 i,k, rk-1 k,k, rk-1 k,j, and rk-1 i,j 1 i,j, respectively. Thus, if we let generating R k-1, i,k Rk-1, k,k Rk-1, and Rkk,j r k i,j = rk-1 i,k (rk-1 k,k )* r k-1 k,j + rk-1 i,j then r k i,j is a regular expression generating Rk i,j,i.e., L(rk i,j ) = Rk i,j. Finally, if F = {q j1, q j2,, q jr }, then r n 1,j1 + rn 1,j2 + + rn 1,jr is a regular expression generating L(M). Note: not only does this prove that the regular expressions generate the regular languages, but it also provides an algorithm for computing it!

11 All remaining columns are computed from the previous column using the formula. r 1 2,3 = 0 (ε)* 1 + 1 = 01 + 1 = r 0 2,1 (r0 1,1 )* r0 1,3 + r0 2,3 r 2 1,3 = r 1 1,2 (r1 2,2 )* r1 2,3 + r1 1,3 = 0 (ε + 00)* (1 + 01) + 1 = 0*1

12 To complete the regular expression, we compute: r 3 1,2 + r3 1,3

13 Theorem: Let L be a language. Then there exists an a regular expression r such that L = L(r) if and only if there exits a DFA M such that L = L(M). Proof: (if) Suppose there exists a DFA M such that L = L(M). Then by Lemma 2 there exists a regular expression r such that L = L(r). (only if) Suppose there exists a regular expression r such that L = L(r). Then by Lemma 1 there exists a DFA M such that L = L(M). Corollary: The regular expressions define the regular languages. Note: The conversion from a regular expression to a DFA and a program accepting L(r) is now complete, and fully automated! Applications of Regular Expression 1.Regular expressions in Unix In the UNIX operating system various commands use an extended regular expressions language that provideshorthands for many common expressions. In this we can write character classes (A character class is a pattern that defines a set of characters and matches exactly one character from that set.) to represent large set of characters. There are some rules for forming this character classes: The dot symbol (.) is to represent any character. The regular expression a+b+c+ +z is represented by [abc z] Within a character class representation, - can be used to define a set of characters in terms of a range. For example, a-z defines the set of lower-case letters and A-Z defines the set of upper-case letters. The endpoints of a range may be specified in either order (i.e. both 0-9 and 9-0 define the set of digits). If our expression involves operators such as minus then we can place it first or last to avoid confusion with the range specifier. i.e. [-.0-9]. The special characters in UNIX regular language can be represented as characters using \ symbol i.e. \ provides the usual escapes within character class brackets. Thus [[\]] matches either [ or ], because \ causes the first ] in the character class representation to be taken as a normal character rather than the closing bracket of the representation. Special notations [: digit : ] same as [0-9] [: alpha:] same as [A-Za-z]

14 [: alnum :] same as [A-Za-z0-9] Operators Used in place of +? 0 or 1 of R? Means 0 or 1 occurrence of R 1 or more of R+ means 1 or more occurrence of R {n} n copies of R {3} means RRR ^ Compliment of If the first character after the opening bracket of a character class is ^, the set defined by the remainder of the class is complemented with respect to the computer's character set. Using this notation, the character class represented by. can be described as [^\n]. If ^ appears as any character of a class except the first, it is not considered to be an operator. Thus [^abc] matches any character except a, b, or c but [a^bc] or [abc^] matches a, b, c or ^. When more than one expression can match the current character sequence, a choice is made as follows: 1. The longest match is preferred. 2. Among rules, which match the same number of characters, the rule given first is preferred. 2.Lexical analysis Compilers in a nutshell Purpose: translate a program in some language (the source language) into a lower-level language (the target language). Phases: Lexical Analysis: Converts a sequence of characters into words, or tokens

15 Syntax Analysis: Converts a sequence of tokens into a parse tree Semantic Analysis: Manipulates parse tree to verify symbol and type information Intermediate Code Generation: Converts parse tree into a sequence of intermediate code instructions Optimization: Manipulates intermediate code to produce a more efficient program Final Code Generation: Translates intermediate code into final (machine/assembly) code Overview of Lexical Analysis Convert character sequence into tokens, skip comments & whitespace Handle lexical errors Efficiency is crucial Tokens are specified as regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]* Lexical Analyzers are implemented by regular expressions. There is a problem that more than one token may be recognized at once. Suppose the string else matches for regular expression as well as the expression for identifiers. This problem is resolved by giving priority to first expression listed.

16 Regular Grammars A grammar is a quadruple G = (V, T, S, P) where V is a finite set of variables T is a finite set of symbols, called terminals S is in V and is called the startsymbol P is a finite set of productions, which are rules of the form α β whereα and β are strings consisting of terminals and variables. A grammar is said to be right-linear if every production in P is of the form A xb or A x where A and B are variables (perhaps the same, perhaps the start symbol S) in V and x is any string of terminal symbols (including the empty string λ) An alternate (and better) definition of a right-linear grammar says that every production in P is of the form A ab A a or or S λ(to allow λ to be in the language) where A and B are variables (perhaps the same, but B can't be S) in V and a is any terminal symbol An alternate (and better) definition of a right-linear grammar says that every production in P is of the form A ab A a or or

17 S λ(to allow λ to be in the language) where A and B are variables (perhaps the same, but B can't be S) in V and a is any terminal symbol A grammar is said to be left-linear if every production in P is of the form A Bx or A x where A and B are variables (perhaps the same, perhaps the start symbol S) in V and x is any string of terminal symbols (including the empty string λ) The alternate definition of a left-linear grammar says that every production in P is of the form A Ba A a or or S λ where A and B are variables (perhaps the same, but B can't be S) in V and a is any terminal symbol Any left-linear or right-linear grammar is called a regular grammar. For brevity, we often write a set of productions such as A x 1 A x 2 A x 3 As A x 1 x 2 x 3 A derivation in grammar G is any sequence of strings in V and T, connected with

18 starting with S and ending with a string containing no variables where each subsequent string is obtained by applying a production in P is called a derivation. S x 1 x 2 x 3... x n abbreviated as: S x n S x 1 x 2 x 3... x n abbreviated as: S x n We say that x n is a sentence of the language generated by G, L(G). We say that the other x's are sentential forms. L(G) = {w w T* and S x n } We call L(G) the language generated by G L(G) is the set of all sentences over grammar G Example 1 S abs a is an example of a right-linear grammar. Can you figure out what language it generates? L = {w {a,b}* w contains alternating a's and b's, begins with an a, and ends with a b} {a} L((ab)*a) Example 2 S Aab A Aab ab B a is an example of a left-linear grammar. Can you figure out what language it generates?

19 L = {w {a,b}* w is aa followed by at least ab's} one set of alternating L(aaab(ab)*) Regular Grammars and NFA's We get a feel for this by example. Let S aa A abs b Regular Grammars and Regular Expressions Example: L(aab*a) We can easily construct a regular language for this expression: S aa A ab B b B a

20 Types of grammars:

21 Key Terms: Introduction to regular operators, regular languages, Precedence of regular operators Regular expressions, Formal definition of regular expressions, Equivalence of Regular Expressions and Finite Automata. Theorem for conversion from regular expression to epsilon FA. Application of regular expressions Algebraic Laws for Regular Expressions. Multiple choice questions: 1. The regular expression 01*. a) The language consisting of strings of length 2. b) The language consisting of all strings that are a single 0 followed by any number of 1 s c) The language consisting of all strings that is a single 0. d) The language consisting of all strings that are a single 1 followed by any number of 0 s. 2. Union and concatenation are associative. a)true b)false 3. The regular expression 10*. a) The language consisting of strings of length 2. b) The language consisting of all strings that are either a single 0 followed by any number of 1 s c) The language consisting of all strings that is a single 0. d) The language consisting of all strings that are a single 1 followed by any number of 0 s. 4. The Kleene closure is represented by a) L b)l+ c)l* d)l1 5. In a regular expression L(E+F) is equal to a) L(E) + L(F) b) L(E) U L(F) c) L(EUF) d) L(EnF) 6. Union of a regular expression is commutative. a)true b)false

22 7.. In a regular expression L(E*) is equal to a) L(E) * b) (L(E) *) c) (L(E) ) * d) all of the above 8. The regular expression operators are a)union, intersection and concatenation b) union, concatenation and closure c) closure, intersection and concatenation d) union, intersection and closure 9. Concatenation of a regular expression is commutative. a) Trueb)False 10. The regular expression (10)*. a) The language consisting of strings of length 2. b) The language consisting of all strings that are either a single 0 followed by any number of 1 s c) The language consisting of alternating strings that begin with 1 and end with 0. d) The language consisting of all strings that are a single 1 followed by any number of 0 s. 11. Every language is a regular language. a) Trueb) False 12. The inverse homomorphism of a regular language is regular a) Trueb) False 13. The positive closure is represented by a) L b) L+ c) L* d)l1 14. The regular expression for the set of strings that end with 1 and has no substring 00 is given by

23 a) (0+1)*0101(0+1)* b) 11(1+0+0)*11 c) (1+01)*(10+11)*1 d) none 2 marks Question and answer:

26 Summary: A regular expression is a string that describes the whole set of strings according to the syntax. Arden s theorem helps in checking the equivalence of two regular expressions. Lexical analyzers and text editors are the applications of regular expressions. Grammar: generative description of a language.