UNIT II REGULAR LANGUAGES


Introduction:

A regular expression is a way of describing a regular language. The basic operations are union, concatenation and closure. We can also find the equivalent regular expression for a given finite automaton. The properties of regular languages include the pumping lemma, closure properties, decision properties and minimization techniques.

Objectives:

To learn about Regular Expressions.
To learn about Finite Automata and Regular Expressions.
To learn about Applications of Regular Expressions.
To learn about Regular Grammars.

Regular Languages: Operations on Languages

Let L, L_1, L_2 be subsets of Σ*.

Concatenation: L_1 L_2 = {xy | x is in L_1 and y is in L_2}
Concatenating a language with itself: L^0 = {ε}, and L^i = L L^{i-1} for all i ≥ 1
Kleene Closure: L* = ∪_{i≥0} L^i = L^0 ∪ L^1 ∪ L^2 ∪ ...
Positive Closure: L+ = ∪_{i≥1} L^i = L^1 ∪ L^2 ∪ ...

Question: Does L+ contain ε? (It does exactly when L itself contains ε.)

Kleene closure

Say L = {a, abc, ba} over Σ = {a, b, c}. Then

L^2 = {aa, aabc, aba, abca, abcabc, abcba, baa, baabc, baba}
L^3 = {a, abc, ba} L^2
L* = {ε} ∪ L^1 ∪ L^2 ∪ L^3 ∪ ...

Regular Expressions

Highlights:

A regular expression is used to specify a language, and it does so precisely.
Regular expressions are very intuitive.
Regular expressions are very useful in a variety of contexts.
Given a regular expression, an NFA-ε can be constructed from it automatically. Thus, so can an NFA, a DFA, and a corresponding program, all automatically!

Definition of a Regular Expression

Let Σ be an alphabet. The regular expressions over Σ are:

Ø      Represents the empty set { }
ε      Represents the set {ε}
a      Represents the set {a}, for any symbol a in Σ

Let r and s be regular expressions that represent the sets R and S, respectively.

r+s    Represents the set R ∪ S (precedence 3, the lowest)
rs     Represents the set RS (precedence 2)
r*     Represents the set R* (highest precedence)
(r)    Represents the set R (not an operator; provides grouping)

If r is a regular expression, then L(r) is used to denote the corresponding language.

Examples: Let Σ = {0, 1}

(0 + 1)*                      All strings of 0's and 1's
0(0 + 1)*                     All strings of 0's and 1's beginning with a 0
(0 + 1)*1                     All strings of 0's and 1's ending with a 1
(0 + 1)*0(0 + 1)*             All strings of 0's and 1's containing at least one 0
(0 + 1)*0(0 + 1)*0(0 + 1)*    All strings of 0's and 1's containing at least two 0's
(0 + 1)*01*01*                All strings of 0's and 1's containing at least two 0's
(1 + 01*0)*                   All strings of 0's and 1's containing an even number of 0's
1*(01*01*)*                   All strings of 0's and 1's containing an even number of 0's
(1*01*0)*1*                   All strings of 0's and 1's containing an even number of 0's

Equivalence of Regular Expressions and NFA-εs

Note: Throughout the following, keep in mind that a string is accepted by an NFA-ε if there exists a path labeled with that string from the start state to a final state.

Lemma 1: Let r be a regular expression. Then there exists an NFA-ε M such that L(M) = L(r). Furthermore, M has exactly one final state with no transitions out of it.
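As a quick sanity check on the definitions so far, the following Python sketch computes powers of a finite language and tests, by brute force, that the three "even number of 0's" expressions agree on all short strings. It is only an illustration: the function names are hypothetical, and the textbook operator + is rewritten as | because that is how Python's re module spells union.

```python
import re
from itertools import product

def concat(l1, l2):
    """Concatenation L1 L2 = {xy | x in L1, y in L2} for finite languages."""
    return {x + y for x in l1 for y in l2}

def power(lang, i):
    """L^0 = {ε}; L^i = L L^(i-1) for i >= 1 (ε is the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(lang, result)
    return result

L = {"a", "abc", "ba"}
L2 = power(L, 2)  # the nine strings listed in the Kleene closure example

# The three expressions for "even number of 0's" in Python syntax:
# textbook + (union) becomes |, juxtaposition is still concatenation.
EVEN_ZEROS = ["(1|01*0)*", "1*(01*01*)*", "(1*01*0)*1*"]

def agree_up_to(patterns, max_len):
    """True if every pattern accepts exactly the binary strings of length
    <= max_len that contain an even number of 0's."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            w = "".join(chars)
            expected = w.count("0") % 2 == 0
            for p in patterns:
                if (re.fullmatch(p, w) is not None) != expected:
                    return False
    return True
```

Agreement on short strings is evidence, not a proof, that the expressions are equivalent; the lemmas in this unit are what justify the general claim.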

Proof: (by induction on the number of operators, denoted by OP(r), in r).

Basis: OP(r) = 0. Then r is Ø, ε, or a for some symbol a in Σ, and each of these languages is accepted by an NFA-ε with exactly one final state.

Inductive Hypothesis: Suppose there exists a k ≥ 0 such that for any regular expression r where 0 ≤ OP(r) ≤ k, there exists an NFA-ε M such that L(M) = L(r). Furthermore, suppose that M has exactly one final state.

Inductive Step: Let r be a regular expression with k + 1 operators (OP(r) = k + 1), where k + 1 ≥ 1.

Case 1) r = r_1 + r_2
Since OP(r) = k + 1, it follows that 0 ≤ OP(r_1), OP(r_2) ≤ k. By the inductive hypothesis there exist NFA-ε machines M_1 and M_2 such that L(M_1) = L(r_1) and L(M_2) = L(r_2). Furthermore, both M_1 and M_2 have exactly one final state. Construct M as:
[Figure: NFA-ε M for r_1 + r_2, built from M_1 and M_2]

Case 2) r = r_1 r_2
Since OP(r) = k + 1, it follows that 0 ≤ OP(r_1), OP(r_2) ≤ k. By the inductive hypothesis there exist NFA-ε machines M_1 and M_2 such that L(M_1) = L(r_1) and L(M_2) = L(r_2). Furthermore, both M_1 and M_2 have exactly one final state. Construct M as:
[Figure: NFA-ε M for r_1 r_2, built from M_1 and M_2]

Case 3) r = r_1*
Since OP(r) = k + 1, it follows that 0 ≤ OP(r_1) ≤ k. By the inductive hypothesis there exists an NFA-ε machine M_1 such that L(M_1) = L(r_1). Furthermore, M_1 has exactly one final state.
[Figure: NFA-ε M for r_1*, built from M_1]

Example: r = 0(0+1)*

r = r_1 r_2
r_1 = 0
r_2 = (0+1)*
r_2 = r_3*
r_3 = 0+1
r_3 = r_4 + r_5
r_4 = 0
r_5 = 1
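The construction figures are not reproduced in this transcription, so the following Python sketch uses the standard Thompson-style construction, which may differ in detail from the original diagrams. Each fragment has one start state and one final state with no outgoing transitions, as Lemma 1 requires. The class and function names (Frag, symbol, union, concat, star, accepts) are hypothetical, chosen only for illustration.

```python
from itertools import count

_ids = count()  # fresh state numbers

class Frag:
    """An NFA-ε fragment with one start state and one final state."""
    def __init__(self):
        self.start = next(_ids)
        self.final = next(_ids)
        self.edges = []  # (source, label, target); label None means ε

def symbol(a):
    f = Frag()
    f.edges.append((f.start, a, f.final))
    return f

def union(m1, m2):  # r1 + r2
    f = Frag()
    f.edges = m1.edges + m2.edges + [
        (f.start, None, m1.start), (f.start, None, m2.start),
        (m1.final, None, f.final), (m2.final, None, f.final)]
    return f

def concat(m1, m2):  # r1 r2
    f = Frag()
    f.edges = m1.edges + m2.edges + [
        (f.start, None, m1.start), (m1.final, None, m2.start),
        (m2.final, None, f.final)]
    return f

def star(m1):  # r1*
    f = Frag()
    f.edges = m1.edges + [
        (f.start, None, m1.start), (f.start, None, f.final),
        (m1.final, None, m1.start), (m1.final, None, f.final)]
    return f

def closure(states, edges):
    """All states reachable from `states` using ε-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(m, w):
    current = closure({m.start}, m.edges)
    for ch in w:
        moved = {dst for src, label, dst in m.edges
                 if src in current and label == ch}
        current = closure(moved, m.edges)
    return m.final in current

# Build r = 0(0+1)* following the decomposition in the example.
r4, r5 = symbol("0"), symbol("1")
r3 = union(r4, r5)
r2 = star(r3)
r1 = symbol("0")
nfa = concat(r1, r2)
```

With this machine, accepts(nfa, w) should hold exactly for the binary strings w that begin with a 0. The sketch adds a new start and final state in every case, which keeps the three constructions uniform at the cost of a few extra ε-transitions.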


Definitions Required: Converting a DFA to a Regular Expression

Let M = (Q, Σ, δ, q_1, F) be a DFA with state set Q = {q_1, q_2, ..., q_n}, and define:

R_{i,j} = { x | x is in Σ* and δ(q_i, x) = q_j }

R_{i,j} is the set of all strings that define a path in M from q_i to q_j.

Note that states have been numbered starting at 1, not 0!

Example:
[Figure: example DFA M]

R_{2,3} = {0, 001, 00101, 011, ...}
R_{1,4} = {01, 00101, ...}
R_{3,3} = {11, 100, ...}

Another definition:

R^k_{i,j} = { x | x is in Σ*, δ(q_i, x) = q_j, and there is no u with 1 ≤ |u| < |x| and x = uv such that δ(q_i, u) = q_p where p > k }

In words: R^k_{i,j} is the set of all strings that define a path in M from q_i to q_j that passes through no state numbered greater than k.

Note that i > k or j > k is allowed; only the intermediate states may not be numbered greater than k.

R^4_{2,3} = {0, 1000, 011, ...}
R^1_{2,3} = {0}
111 is not in R^4_{2,3}
111 is not in R^1_{2,3}
101 is not in R^1_{2,3}
R^5_{2,3} = R_{2,3}

Observations:

1) R^n_{i,j} = R_{i,j}
2) R^{k-1}_{i,j} is a subset of R^k_{i,j}
3) L(M) = ∪_{q in F} R^n_{1,q} = ∪_{q in F} R_{1,q}
4) R^0_{i,j} is easily computed from the DFA!
5) R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j} ∪ R^{k-1}_{i,j}

Notes on 5: Consider the paths represented by the strings in R^k_{i,j}. If x is a string in R^k_{i,j}, then no state numbered greater than k is passed through when processing x, and either:

q_k is not passed through, i.e., x is in R^{k-1}_{i,j}, or
q_k is passed through one or more times, i.e., x is in R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j}

Lemma 2: Let M = (Q, Σ, δ, q_1, F) be a DFA. Then there exists a regular expression r such that L(M) = L(r).

Proof: First we will show (by induction on k) that for all i, j, and k, where 1 ≤ i, j ≤ n and 0 ≤ k ≤ n, there exists a regular expression r such that L(r) = R^k_{i,j}.

Basis: k = 0. R^0_{i,j} contains single symbols, one for each transition from q_i to q_j, and possibly ε if i = j.

case 1) No transitions from q_i to q_j and i != j:
r^0_{i,j} = Ø

case 2) At least one (m ≥ 1) transition from q_i to q_j and i != j:
r^0_{i,j} = a_1 + a_2 + a_3 + ... + a_m, where δ(q_i, a_p) = q_j for all 1 ≤ p ≤ m

case 3) No transitions from q_i to q_j and i = j:
r^0_{i,j} = ε

case 4) At least one (m ≥ 1) transition from q_i to q_j and i = j:
r^0_{i,j} = a_1 + a_2 + a_3 + ... + a_m + ε, where δ(q_i, a_p) = q_j for all 1 ≤ p ≤ m

Inductive Hypothesis: Suppose that R^{k-1}_{i,j} can be represented by the regular expression r^{k-1}_{i,j} for all 1 ≤ i, j ≤ n, and some k ≥ 1.

Inductive Step: Consider R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j} ∪ R^{k-1}_{i,j}. By the inductive hypothesis there exist regular expressions r^{k-1}_{i,k}, r^{k-1}_{k,k}, r^{k-1}_{k,j}, and r^{k-1}_{i,j} generating R^{k-1}_{i,k}, R^{k-1}_{k,k}, R^{k-1}_{k,j}, and R^{k-1}_{i,j}, respectively. Thus, if we let

r^k_{i,j} = r^{k-1}_{i,k} (r^{k-1}_{k,k})* r^{k-1}_{k,j} + r^{k-1}_{i,j}

then r^k_{i,j} is a regular expression generating R^k_{i,j}, i.e., L(r^k_{i,j}) = R^k_{i,j}.

Finally, if F = {q_{j1}, q_{j2}, ..., q_{jr}}, then

r^n_{1,j1} + r^n_{1,j2} + ... + r^n_{1,jr}

is a regular expression generating L(M).

Note: not only does this prove that every language accepted by a DFA is generated by a regular expression, but it also provides an algorithm for computing that expression!

All remaining columns are computed from the previous column using the formula.

r^1_{2,3} = r^0_{2,1} (r^0_{1,1})* r^0_{1,3} + r^0_{2,3} = 0 (ε)* 1 + 1 = 01 + 1

r^2_{1,3} = r^1_{1,2} (r^1_{2,2})* r^1_{2,3} + r^1_{1,3} = 0 (ε + 00)* (1 + 01) + 1 = 0*1

To complete the regular expression, we compute: r^3_{1,2} + r^3_{1,3}
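The recurrence for r^k_{i,j} can be turned into a short program. The Python sketch below is one way to do it, under stated assumptions: the example DFA is not reproduced in this transcription, so the three-state transition table used here is a hypothetical one chosen to be consistent with the values of r^1_{2,3} and r^2_{1,3} computed above, with F = {q_2, q_3}. Expressions are built in Python re syntax (| instead of +), Ø is represented by None and ε by the empty pattern, and only trivial simplifications are applied, so the output is much longer than a hand-simplified expression.

```python
import re
from itertools import product

def union(r, s):
    if r is None:              # Ø + s = s
        return s
    if s is None:              # r + Ø = r
        return r
    return r if r == s else f"(?:{r}|{s})"

def concat(r, s):
    if r is None or s is None:  # Ø r = r Ø = Ø
        return None
    return r + s                # operands are symbols, groups, or products

def star(r):
    if r is None or r == "":    # Ø* = ε* = ε
        return ""
    return f"(?:{r})*"

def dfa_to_regex(n, delta, finals):
    """States are 1..n, start state is 1; delta maps (state, symbol) -> state."""
    # Basis: r[i][j] is r^0_{i,j}
    r = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            expr = "" if i == j else None
            for (q, a), target in sorted(delta.items()):
                if q == i and target == j:
                    expr = union(expr, re.escape(a))
            r[i][j] = expr
    # Inductive step: compute column k from column k-1
    for k in range(1, n + 1):
        new = [[None] * (n + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                via_k = concat(concat(r[i][k], star(r[k][k])), r[k][j])
                new[i][j] = union(via_k, r[i][j])
        r = new
    result = None
    for f in sorted(finals):
        result = union(result, r[1][f])
    return result

def dfa_accepts(delta, finals, w):
    q = 1
    for ch in w:
        q = delta[(q, ch)]
    return q in finals

# Hypothetical DFA consistent with the worked values above.
DELTA = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 1,
         (2, "1"): 3, (3, "0"): 2, (3, "1"): 2}
FINALS = {2, 3}
PATTERN = dfa_to_regex(3, DELTA, FINALS)
```

A simple way to gain confidence in the result is to compare re.fullmatch(PATTERN, w) with dfa_accepts(DELTA, FINALS, w) for every binary string w up to some small length.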

Theorem: Let L be a language. Then there exists a regular expression r such that L = L(r) if and only if there exists a DFA M such that L = L(M).

Proof:
(if) Suppose there exists a DFA M such that L = L(M). Then by Lemma 2 there exists a regular expression r such that L = L(r).
(only if) Suppose there exists a regular expression r such that L = L(r). Then by Lemma 1 there exists an NFA-ε accepting L, and this NFA-ε can be converted to an equivalent DFA M such that L = L(M).

Corollary: The regular expressions define the regular languages.

Note: The conversion from a regular expression to a DFA and a program accepting L(r) is now complete, and fully automated!

Applications of Regular Expressions

1. Regular expressions in Unix

In the UNIX operating system, various commands use an extended regular expression language that provides shorthands for many common expressions. In this language we can write character classes to represent large sets of characters. (A character class is a pattern that defines a set of characters and matches exactly one character from that set.) There are some rules for forming character classes:

The dot symbol (.) represents any character.
The regular expression a+b+c+...+z is represented by [abc...z].
Within a character class, - can be used to define a set of characters in terms of a range. For example, a-z defines the set of lower-case letters and A-Z defines the set of upper-case letters. Some tools allow the endpoints of a range to be specified in either order (i.e. both 0-9 and 9-0 define the set of digits); POSIX tools such as grep expect the lower endpoint first.
If the class must contain the minus sign itself, place it first or last to avoid confusion with the range specifier, e.g. [-.0-9].
Special characters can be represented as ordinary characters using the \ symbol, i.e. \ provides the usual escapes within character class brackets. Thus [[\]] matches either [ or ], because \ causes the first ] in the character class to be taken as a normal character rather than as the closing bracket of the class.

Special notations:

[:digit:]    same as [0-9]
[:alpha:]    same as [A-Za-z]
[:alnum:]    same as [A-Za-z0-9]

Operators:

|      Used in place of +
?      0 or 1 of; R? means 0 or 1 occurrence of R
+      1 or more of; R+ means 1 or more occurrences of R
{n}    n copies of; R{3} means RRR
^      Complement of

If the first character after the opening bracket of a character class is ^, the set defined by the remainder of the class is complemented with respect to the computer's character set. Using this notation, the character class represented by . can be described as [^\n]. If ^ appears as any character of a class except the first, it is not considered to be an operator. Thus [^abc] matches any character except a, b, or c, but [a^bc] or [abc^] matches a, b, c or ^.

When more than one expression can match the current character sequence, a choice is made as follows:
1. The longest match is preferred.
2. Among rules that match the same number of characters, the rule given first is preferred.

2. Lexical analysis

Compilers in a nutshell

Purpose: translate a program in some language (the source language) into a lower-level language (the target language).

Phases:
Lexical Analysis: Converts a sequence of characters into words, or tokens
Syntax Analysis: Converts a sequence of tokens into a parse tree
Semantic Analysis: Manipulates the parse tree to verify symbol and type information
Intermediate Code Generation: Converts the parse tree into a sequence of intermediate code instructions
Optimization: Manipulates intermediate code to produce a more efficient program
Final Code Generation: Translates intermediate code into final (machine/assembly) code

Overview of Lexical Analysis

Convert the character sequence into tokens, skipping comments and whitespace
Handle lexical errors
Efficiency is crucial
Tokens are specified as regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*

Lexical analyzers are generated from regular expressions. One problem is that more than one token may be recognized at once. For example, the string else matches the regular expression for the keyword else as well as the expression for identifiers. This problem is resolved by giving priority to the first expression listed.
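The two disambiguation rules (longest match first, then the earliest rule) can be sketched in a few lines of Python. This is a toy illustration, not how lex or flex is implemented: real generators compile all rules into a single DFA, whereas this sketch simply tries each rule in turn. The IDENTIFIER pattern comes from the text; the other rule names and patterns (ELSE, NUMBER, SKIP) are assumptions added for the example.

```python
import re

# Order matters: when two rules match the same number of characters,
# the rule listed first wins.
RULES = [
    ("ELSE", r"else"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("SKIP", r"[ \t\n]+"),   # whitespace is matched but not emitted
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

On the input "else elsewhere", the first word ties between ELSE and IDENTIFIER and is resolved in favour of ELSE by rule order, while "elsewhere" is a longer match for IDENTIFIER and is therefore not split into else followed by where.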

Regular Grammars

A grammar is a quadruple G = (V, T, S, P) where

V is a finite set of variables
T is a finite set of symbols, called terminals
S is in V and is called the start symbol
P is a finite set of productions, which are rules of the form α → β, where α and β are strings consisting of terminals and variables.

A grammar is said to be right-linear if every production in P is of the form

A → xB or
A → x

where A and B are variables (perhaps the same, perhaps the start symbol S) in V and x is any string of terminal symbols (including the empty string λ).

An alternate (and better) definition of a right-linear grammar says that every production in P is of the form

A → aB or
A → a or
S → λ (to allow λ to be in the language)

where A and B are variables (perhaps the same, but B can't be S) in V and a is any terminal symbol.

A grammar is said to be left-linear if every production in P is of the form

A → Bx or
A → x

where A and B are variables (perhaps the same, perhaps the start symbol S) in V and x is any string of terminal symbols (including the empty string λ).

The alternate definition of a left-linear grammar says that every production in P is of the form

A → Ba or
A → a or
S → λ

where A and B are variables (perhaps the same, but B can't be S) in V and a is any terminal symbol.

Any left-linear or right-linear grammar is called a regular grammar.

For brevity, we often write a set of productions such as

A → x_1
A → x_2
A → x_3

as

A → x_1 | x_2 | x_3

A derivation in grammar G is any sequence of strings in V and T, connected with ⇒, starting with S and ending with a string containing no variables, where each subsequent string is obtained by applying a production in P.

S ⇒ x_1 ⇒ x_2 ⇒ x_3 ⇒ ... ⇒ x_n is abbreviated as S ⇒* x_n

We say that x_n is a sentence of the language generated by G, L(G). We say that the other x's are sentential forms.

L(G) = {w | w ∈ T* and S ⇒* w}

We call L(G) the language generated by G. L(G) is the set of all sentences over grammar G.

Example 1

S → abS | a

is an example of a right-linear grammar. Can you figure out what language it generates?

L = {w ∈ {a,b}* | w is empty, or w contains alternating a's and b's, begins with an a, and ends with a b} {a} = L((ab)*a)

Example 2

S → Aab
A → Aab | aB
B → a

is an example of a left-linear grammar. (Strictly, A → aB has the right-linear form; replacing it with A → Ba gives a purely left-linear grammar for the same language.) Can you figure out what language it generates?

L = {w ∈ {a,b}* | w is aa followed by at least one set of alternating ab's} = L(aaab(ab)*)

Regular Grammars and NFA's

We get a feel for this by example. Let

S → aA
A → abS | b

Regular Grammars and Regular Expressions

Example: L(aab*a)

We can easily construct a regular grammar for this expression:

S → aA
A → aB
B → bB
B → a
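The link between right-linear grammars and NFAs can be sketched as follows, assuming the stricter form of right-linear grammar (productions A → aB or A → a only; S → λ is left out for brevity). Each variable becomes a state, A → aB becomes a transition from A to B on a, and A → a becomes a transition from A to one extra accepting state. The names FINAL, grammar_to_nfa and generates are hypothetical, and the grammar used is the one given above for L(aab*a).

```python
FINAL = "#final"  # extra accepting state; the name is an assumption

def grammar_to_nfa(productions):
    """productions: list of (A, a, B) for A -> aB, or (A, a, None) for A -> a.
    Returns a transition map (state, symbol) -> set of states."""
    delta = {}
    for head, terminal, var in productions:
        target = var if var is not None else FINAL
        delta.setdefault((head, terminal), set()).add(target)
    return delta

def generates(productions, start, w):
    """True if the grammar derives w, decided by simulating the NFA."""
    delta = grammar_to_nfa(productions)
    current = {start}
    for ch in w:
        current = {t for s in current for t in delta.get((s, ch), ())}
    return FINAL in current

# Grammar for L(aab*a):  S -> aA,  A -> aB,  B -> bB | a
AAB_STAR_A = [
    ("S", "a", "A"),
    ("A", "a", "B"),
    ("B", "b", "B"),
    ("B", "a", None),
]
```

For example, generates(AAB_STAR_A, "S", "aabba") follows S, A, B, B, B and then moves to the accepting state on the last a, mirroring the derivation S ⇒ aA ⇒ aaB ⇒ aabB ⇒ aabbB ⇒ aabba.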

Types of grammars:

Key Terms:

Introduction to regular operators, regular languages, precedence of regular operators, regular expressions, formal definition of regular expressions, equivalence of regular expressions and finite automata, theorem for conversion from regular expression to epsilon FA, applications of regular expressions, algebraic laws for regular expressions.

Multiple choice questions:

1. The regular expression 01* denotes
a) The language consisting of strings of length 2.
b) The language consisting of all strings that are a single 0 followed by any number of 1's.
c) The language consisting of the single string 0.
d) The language consisting of all strings that are a single 1 followed by any number of 0's.

2. Union and concatenation are associative.
a) True b) False

3. The regular expression 10* denotes
a) The language consisting of strings of length 2.
b) The language consisting of all strings that are a single 0 followed by any number of 1's.
c) The language consisting of the single string 0.
d) The language consisting of all strings that are a single 1 followed by any number of 0's.

4. The Kleene closure is represented by
a) L b) L+ c) L* d) L1

5. In a regular expression, L(E+F) is equal to
a) L(E) + L(F) b) L(E) U L(F) c) L(E U F) d) L(E ∩ F)

6. Union of regular expressions is commutative.
a) True b) False

7. In a regular expression, L(E*) is equal to
a) L(E)* b) (L(E)*) c) (L(E))* d) all of the above

8. The regular expression operators are
a) union, intersection and concatenation
b) union, concatenation and closure
c) closure, intersection and concatenation
d) union, intersection and closure

9. Concatenation of regular expressions is commutative.
a) True b) False

10. The regular expression (10)* denotes
a) The language consisting of strings of length 2.
b) The language consisting of all strings that are a single 0 followed by any number of 1's.
c) The language consisting of alternating strings that begin with 1 and end with 0.
d) The language consisting of all strings that are a single 1 followed by any number of 0's.

11. Every language is a regular language.
a) True b) False

12. The inverse homomorphism of a regular language is regular.
a) True b) False

13. The positive closure is represented by
a) L b) L+ c) L* d) L1

14. The regular expression for the set of strings that end with 1 and have no substring 00 is given by
a) (0+1)*0101(0+1)* b) 11(1+0+0)*11 c) (1+01)*(10+11)*1 d) none

2 marks Question and answer:


Summary:

A regular expression is a string that describes a whole set of strings according to certain syntax rules.
Arden's theorem helps in checking the equivalence of two regular expressions.
Lexical analyzers and text editors are applications of regular expressions.
Grammar: a generative description of a language.