Text Search and Closure Properties CSCI 330 Formal Languages and Automata Theory Siu On CHAN Fall 208 Chinese University of Hong Kong /28
Text Search
grep program grep -E regex file.txt Searches for an occurrence of patterns matching a regular expression regex language meaning cat 2 {cat, 2} union [abc] {a, b, c} shorthand for a b c [ab][2] {a, a2, b, b2} concatenation (ab) * {ε, ab, abab,... } star [ab]? {ε, a, b} zero or one (cat)+ {cat, catcat,... } one or more [ab]{2} {aa, ab, ba, bb} n copies 2/28
Searching with grep Words containing savor or savour cd /usr/share/dict/ grep -E 'savou?r' words savor savor's savored savorier savories savoriest savoring savors savory savory's unsavory 3/28
Searching with grep Words containing savor or savour cd /usr/share/dict/ grep -E 'savou?r' words Words with 5 consecutive a or b grep -E '[abab]{5}' words Babbage savor savor's savored savorier savories savoriest savoring savors savory savory's unsavory 3/28
More grep commands. any symbol [a-d] anything in a range ^ beginning of line $ end of line grep -E '^a.pl.$' words 4/28
How do you look for Words that start in go and have another go grep -E '^go.*go' words Words with at least ten vowels? grep -ie '([aeiouy].*){0}' words Words without any vowels? grep -ie '^[^aeiouy]*$' words [^R] means does not contain Words with exactly ten vowels? grep -ie '^[^aeiouy]*([aeiouy][^aeiouy]*){0}$' words 5/28
How grep (could) work regular expression NFA input DFA text file differences in class in grep [ab]?, a+, (cat){3} not allowed allowed input handling matches whole looks for substring output accept/reject finds substring Regular expression also supported in modern languages (C, Java, Python, etc) 6/28
Implementation of grep How do you handle expressions like [ab]? () [ab] zero or more R? ε R (cat)+ (cat)(cat)* one or more R+ RR a{3} aaa n copies R{n} RR }{{... R } n times [^aeiouy]? not containing 7/28
Closure properties
Example The language L of strings that end in 0 is regular (0 + ) 0 How about the language L of strings that do not end in 0? 8/28
Example The language L of strings that end in 0 is regular (0 + ) 0 How about the language L of strings that do not end in 0? Hint: a string does not end in 0 if and only if it ends in 000, 00, 00, 0, 00, 0 or or has length 0,, or 2 So L can be described by the regular expression (0+) (000+00+00+0+00+0+)+ε+(0+)+(0+)(0+) 8/28
Complement The complement L of a language L contains those strings that are not in L L = {w Σ w / L} Examples (Σ = {0, }) L = lang. of all strings that end in 0 L = lang. of all strings that do not end in 0 = lang. of all strings that end in 000,, (but not 0) or have length 0,, or 2 L 2 = lang. of = {ε,,,,... } L 2 = lang. of all strings that contain at least one 0 = lang. of the regular expression (0 + ) 0(0 + ) 9/28
Example The language L of strings that contain 0 is regular (0 + ) 0(0 + ) How about the language L of strings that do not contain 0? You can write a regular expression, but it is a lot of work! 0/28
Closure under complement If L is a regular language, so is L To argue this, we can use any of the equivalent definitions of regular languages regular expression NFA DFA The DFA definition will be the most convenient here We assume L has a DFA, and show L also has a DFA /28
Arguing closure under complement Suppose L is regular, then it has a DFA M accepts L Now consider the DFA M with the accepting and rejecting states of M reversed accepts strings not in L 2/28
Can we do the same with an NFA? 0, q 0 q 0 q 2 (0 + ) 0 0, q 0 q 0 q 2 3/28
Can we do the same with an NFA? 0, q 0 q 0 q 2 (0 + ) 0 0, q 0 q 0 q 2 (0 + ) Not the complement! 3/28
Intersection The intersection L L is the set of strings that are in both L and L Examples: L L L L (0 + ) L L L L (0 + ) 0 If L and L are regular, is L L also regular? 4/28
Closure under intersection If L and L are regular languages, so is L L To argue this, we can use any of the equivalent definitions of regular languages regular expression NFA DFA Suppose L and L have DFAs, call them M and M Goal: construct a DFA (or NFA) for L L 5/28
Example L (odd number of s) M 0 0 s 0 s L (even number of 0s) M r 0 0 0 r L L = lang. of even number of 0s and odd number of s 6/28
Example L (odd number of s) M 0 0 s 0 s L (even number of 0s) M 0 r 0 r 0 r 0, s 0 r 0, s 0 0 0 0 r, s 0 r, s L L = lang. of even number of 0s and odd number of s 6/28
Closure under intersection M and M DFA for L L states Q = {r,..., r s } Q = {s,..., s m } Q Q = {(r, s ), (r, s 2 ),..., (r 2, s ),..., (r n, s m )} start states r i for M (r i, s j ) s j for M accepting states F for M F F = F for M {(r i, s j ) r i F, s j F } Whenever M is in state r i and M is in state s j, the DFA for L L will be in state (r i, s j ) 7/28
Closure under intersection M and M DFA for L L transitions a r i r j r i, s k a r j, s l s k a s l 8/28
Reversal The reversal w R of a string w is w written backwards w = dog w R = god The reversal L R of a language L is the language obtained by reversing all its strings L = {dog, war, level} L R = {god, raw, level} 9/28
Reversal of regular languages L = language of all strings that end in 0 L is regular and has regex (0 + ) 0 How about L R? This is the language of all strings beginning in 0 It is regular and represented by 0(0 + ) 20/28
Closure under reversal If L is a regular language, so is L R How do we argue? regular expression NFA DFA 2/28
Arguing closure under reversal Take a regular expression E for L We will find a regular expression E R representing L R A regular expression can be of the following types: special symbols and ε alphabet symbols like a and b union, concatenation, or star of simpler expressions 22/28
Inductive proof of closure under reversal Regular expression E ε a reversal E R ε a E + E 2 E R + ER 2 E E 2 E R 2 ER E (E R ) 23/28
Duplication? L DUP = {ww w L} Example: L = {cat, dog} L DUP = {catcat, dogdog} If L is regular, is L DUP also regular? 24/28
Attempts Let s try regular expression L DUP? = L 2 25/28
Attempts Let s try regular expression L DUP? = L 2 L = {a, b} L DUP = {aa, bb} LL = {aa, ab, ba, bb} Let s try NFA q ε ε ε 0 NFA for L NFA for L q 25/28
An example L = language of 0 (L is regular) L = {, 0, 00, 000,... } L DUP = {, 00, 0000, 000000,... } = {0 n 0 n n 0} Let s design an NFA for L DUP 26/28
An example L DUP = {, 00, 0000, 000000,... } = {0 n 0 n n 0} 0 0 0 0 0 00 000 27/28
An example L DUP = {, 00, 0000, 000000,... } = {0 n 0 n n 0} 0 0 0 0 0 00 000 Seems to require infinitely many states! Next lecture: will show that languages like L DUP are not regular 27/28
Backreferences in grep Advanced feature in grep and other regular expression libraries grep -E '^(.*)\$' words the special expression \ refers to the substring specified by (.*) (.*)\ looks for a repeated substring, e.g. mama ^(.*)\$ accepts the language L DUP Standard regular expression libraries can accept irregular languages (as defined in this course)! 28/28