Regular Languages and Applications

Yo-Sub Han
Department of Computer Science, Yonsei University
SNU 4/14

Regular Languages
An old and well-known topic in CS
Kleene Theorem in 1959
FA (finite-state automaton) constructions: Thompson automata, position automata in 1960s
Pattern Matching Problem in 1970s
... in 1980s
REVISIT: State Complexity, Prime Decomposition, Pattern Matching since mid 1990s

Overview
Basic Notions
Position Construction and XML DTD
Regular-Expression Pattern Matching
State Complexity
Future Directions and Conclusions

Regular Expressions
Regular expressions are a very convenient form that represents (infinite) sets of strings called regular sets. Given a finite alphabet Σ, a regular expression over Σ is defined recursively as follows:
1. $\emptyset$, the empty-set symbol, is a regular expression.
2. λ, the empty-string symbol, is a regular expression.
3. $a \in \Sigma$ is a regular expression.
4. $E + F$ (union), where E and F are regular expressions, is a regular expression.
5. $E \cdot F$ (catenation), where E and F are regular expressions, is a regular expression.
6. $E^*$ (Kleene star), where E is a regular expression, is a regular expression.

Finite-state Automata (FAs)
A finite-state automaton A is specified by a tuple (Q, Σ, δ, s, F):
Q: a finite set of states
Σ: a finite alphabet
δ: a set of transition rules of the form δ(p, a) = q
$s \in Q$: the start state
$F \subseteq Q$: a set of final states
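As a minimal sketch of how the tuple above drives a computation, the following Python function tracks the set of states reachable after each input symbol. The dictionary encoding of δ, the function name `accepts` and the three-state example automaton are assumptions made for illustration; they are not taken from the slides.

```python
from typing import Dict, Set, Tuple

# Transition rules delta(p, a) = q, stored as (state, symbol) -> set of states.
Delta = Dict[Tuple[str, str], Set[str]]


def accepts(delta: Delta, start: str, finals: Set[str], text: str) -> bool:
    """Simulate an FA by tracking the set of states reachable so far."""
    current = {start}
    for symbol in text:
        current = {q for p in current for q in delta.get((p, symbol), set())}
        if not current:  # no rule applies any more: reject early
            return False
    return bool(current & finals)


# Hypothetical automaton for strings over {a, b} that end in "ab".
delta = {
    ("q1", "a"): {"q1", "q2"},
    ("q1", "b"): {"q1"},
    ("q2", "b"): {"q3"},
}
print(accepts(delta, "q1", {"q3"}, "aab"))  # True
print(accepts(delta, "q1", {"q3"}, "aba"))  # False
```

The input is accepted exactly when the final set of current states shares a state with F, which is the same test the example slides apply to their set Q.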

Finite-state Automata - example
[Figure: an automaton with states $q_1, \dots, q_{10}$, start state $s = q_1$ and final states $F = \{q_9, q_{10}\}$. An input string T is read symbol by symbol while the set Q of current states is tracked: $\{q_2\}$, $\{q_3\}$, $\{q_4, q_6\}$, $\{q_7\}$, $\{q_8, q_9\}$, $\{q_9\}$. T is accepted because the last set, $Q = \{q_9\}$, contains a final state. The final panel states the language L of the automaton as a regular expression.]

REs into Finite-state Automata
The well-known Thompson construction by Ken Thompson in 1968.
[Figure: Thompson automata for the base cases $E = \emptyset$, $E = \lambda$ and $E = a$, and for the inductive cases $E_1 + E_2$, $E_1 E_2$ and $E^*$, each built from the sub-automata $M(E_1)$, $M(E_2)$ or $M(E)$ connected by λ-transitions.]
Easy to understand and build up
Too many λ-transitions
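The inductive cases can be sketched in Python as follows. This is a simplified variant for illustration rather than Thompson's original formulation: every node gets a fresh start and final state, and catenation is glued with extra λ-transitions. The tuple encoding of expressions (`"sym"`, `"union"`, `"cat"`, `"star"`, `"lambda"`, `"empty"`) and all function names are assumptions of this sketch.

```python
from itertools import count


def thompson(expr):
    """Build a lambda-NFA (start, final, transitions) from a tuple-encoded expression."""
    fresh = count()
    trans = []  # triples (p, label, q); label None stands for a lambda-transition

    def build(e):
        s, f = next(fresh), next(fresh)
        kind = e[0]
        if kind == "empty":  # no transition at all: nothing is accepted
            pass
        elif kind == "lambda":
            trans.append((s, None, f))
        elif kind == "sym":
            trans.append((s, e[1], f))
        elif kind == "union":
            for sub in e[1:]:
                s1, f1 = build(sub)
                trans.extend([(s, None, s1), (f1, None, f)])
        elif kind == "cat":
            s1, f1 = build(e[1])
            s2, f2 = build(e[2])
            trans.extend([(s, None, s1), (f1, None, s2), (f2, None, f)])
        elif kind == "star":
            s1, f1 = build(e[1])
            trans.extend([(s, None, s1), (f1, None, s1), (s, None, f), (f1, None, f)])
        else:
            raise ValueError(f"unknown node {kind!r}")
        return s, f

    start, final = build(expr)
    return start, final, trans


def closure(states, trans):
    """All states reachable from `states` through lambda-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        p = stack.pop()
        for a, label, b in trans:
            if a == p and label is None and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def accepts(expr, text):
    start, final, trans = thompson(expr)
    current = closure({start}, trans)
    for ch in text:
        step = {q for p, label, q in trans if p in current and label == ch}
        current = closure(step, trans)
    return final in current


# (a + b)* c, encoded with the hypothetical tuple syntax
E = ("cat", ("star", ("union", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
print(accepts(E, "abac"))  # True
print(accepts(E, "ab"))    # False
```

The repeated calls to `closure` are where the cost of the many λ-transitions shows up during simulation.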

Position Automata - another automaton construction
Proposed by Glushkov and by McNaughton and Yamada in 1960 independently.
The construction is based on the positions of characters of a given regular expression.

Position Automata - an example
[Figure: step-by-step position construction for an example expression E over the symbols a, b and c. The five symbol occurrences of E are numbered 1 to 5 in the pattern (1 + 2) 3 (4 + 5), with c at position 3. The resulting automaton has the start state 0 and one state per position, 1 to 5; all three transitions entering state 3 are labeled c.]

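Position Automata
The construction looks nice!
All in-transitions of a state have the same label.
The number of states = $|E| + 1$, where $|E|$ is the number of character positions in E.
Fewer states than the Thompson automata and, thus, usually faster!
[Figure: the position automaton of the example expression, with states 0 to 5 and three c-transitions into state 3.]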
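A compact way to see where the $|E| + 1$ states come from is to compute the usual first, last and follow sets over positions. The Python sketch below does this for tuple-encoded expressions; the encoding, the function names and the sample expression (a + b)*c(a + b) are assumptions chosen for illustration and are not claimed to be the expression on the slide.

```python
def glushkov(expr):
    """Position automaton data for a tuple-encoded regular expression.

    Positions are numbered 1..n from left to right; state 0 is the start state.
    Returns (symbols, first, last, follow, nullable).
    """
    symbols = {}  # position -> character occurring at that position
    follow = {}   # position -> positions that may be read next

    def walk(e):
        kind = e[0]
        if kind == "sym":
            pos = len(symbols) + 1
            symbols[pos] = e[1]
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind == "lambda":
            return True, set(), set()
        if kind == "empty":
            return False, set(), set()
        if kind == "union":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            for p in l1:  # the end of the left part continues into the right part
                follow[p] |= f2
            return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)
        if kind == "star":
            _, f1, l1 = walk(e[1])
            for p in l1:  # the end of one iteration continues into the next one
                follow[p] |= f1
            return True, f1, l1
        raise ValueError(f"unknown node {kind!r}")

    nullable, first, last = walk(expr)
    return symbols, first, last, follow, nullable


def accepts(expr, text):
    symbols, first, last, follow, nullable = glushkov(expr)
    current = {0}
    for ch in text:
        nxt = set()
        for p in current:
            targets = first if p == 0 else follow[p]
            nxt |= {q for q in targets if symbols[q] == ch}
        current = nxt
    return bool(current & last) or (nullable and 0 in current)


ab = ("union", ("sym", "a"), ("sym", "b"))
E = ("cat", ("cat", ("star", ab), ("sym", "c")), ab)  # (a + b)* c (a + b)
symbols, first, last, follow, nullable = glushkov(E)
print(len(symbols) + 1)  # number of states: 6
print(sorted(p for p in [0, *follow] if 3 in (first if p == 0 else follow[p])))  # [0, 1, 2]
```

Because a transition into position q is always labeled with the symbol at q, the property that all in-transitions of a state carry the same label holds by construction, and no λ-transitions are needed.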

Where do position automata lead us?

One-Unambiguous Regular Languages
Proposed by Brüggemann-Klein and Wood.
A regular language L is one-unambiguous if there is a regular expression E such that L = L(E) and the position automaton of E is deterministic.
Given a one-unambiguous regular expression E and an input string w, we can read w using one lookahead with respect to E.
$E = SEO(UL)^*N$
[Figure: the input S E O U L U L N is read one symbol at a time; each symbol determines the next position of E uniquely.]

One-Unambiguous Regular Languages
Proposed by Brüggemann-Klein and Wood.
A regular language L is one-unambiguous if there is a regular expression E such that L = L(E) and the position automaton of E is deterministic.
Not all regular expressions are one-unambiguous.
$E = SEO(UL)^*UNI$
[Figure: after S E O, the next input symbol U can match either the U inside $(UL)^*$ or the U of UNI.]
Not all regular languages are one-unambiguous. There are some regular languages that cannot be defined by any one-unambiguous regular expression, e.g. $L((a + b)^* a (a + b)^k)$, $k \ge 1$.
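Checking whether a given expression is one-unambiguous amounts to checking that no state of its position automaton has two outgoing transitions with the same label. The sketch below hand-codes the follow sets of the two SEOUL expressions; the dictionary layout and the function name `is_deterministic` are assumptions made for illustration.

```python
def is_deterministic(symbols, first, follow):
    """True iff no state has two out-transitions carrying the same symbol."""
    for targets in [first, *follow.values()]:
        labels = [symbols[q] for q in targets]
        if len(labels) != len(set(labels)):
            return False
    return True


# E1 = SEO(UL)*N with positions S1 E2 O3 U4 L5 N6
symbols1 = dict(enumerate("SEOULN", start=1))
first1 = {1}
follow1 = {1: {2}, 2: {3}, 3: {4, 6}, 4: {5}, 5: {4, 6}, 6: set()}

# E2 = SEO(UL)*UNI with positions S1 E2 O3 U4 L5 U6 N7 I8
symbols2 = dict(enumerate("SEOULUNI", start=1))
first2 = {1}
follow2 = {1: {2}, 2: {3}, 3: {4, 6}, 4: {5}, 5: {4, 6}, 6: {7}, 7: {8}, 8: set()}

print(is_deterministic(symbols1, first1, follow1))  # True
print(is_deterministic(symbols2, first2, follow2))  # False: positions 4 and 6 are both U
```

The check concerns the expression, not the language: the second language is also described by SEOU(LU)*NI, whose position automaton is deterministic, so only languages with no deterministic expression at all fail to be one-unambiguous.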

One-Unambiguous Regular Languages - XML DTD
  <?xml version="1.0"?>
  <!DOCTYPE BOOK [
    <!ELEMENT p (#PCDATA)>
    <!ELEMENT BOOK (OPENER,SUBTITLE?,INTRODUCTION?,(SECTION | PART)+)>
    <!ELEMENT OPENER (TITLE_TEXT)*>
    <!ELEMENT TITLE_TEXT (#PCDATA)>
    <!ELEMENT SUBTITLE (#PCDATA)>
    <!ELEMENT INTRODUCTION (HEADER, p+)+>
    <!ELEMENT PART (HEADER, CHAPTER+)>
    <!ELEMENT SECTION (HEADER, p+)>
    <!ELEMENT HEADER (#PCDATA)>
    <!ELEMENT CHAPTER (CHAPTER_NUMBER, CHAPTER_TEXT)>
    <!ELEMENT CHAPTER_NUMBER (#PCDATA)>
    <!ELEMENT CHAPTER_TEXT (p)+>
  ]>
The parenthesized right-hand sides of the element declarations are content models. For example:
BOOK ::= OPENER (SUBTITLE + λ)(INTRODUCTION + λ)(SECTION + PART)(SECTION + PART)*
One-unambiguous regular expression!!

One-Unambiguous Regular Languages vs XML DTD
Regular expressions for content models of DTDs are one-unambiguous
XML DTDs are LL(1) grammars [Wood 96]
LL(k) grammars have a proper hierarchy [AU 72]
k-unambiguous regular languages??

k-lookahead Regular Languages
Two ways for defining k-lookahead regular languages.
The first is based on a lookahead of at most $k \ge 1$ symbols to determine the next, at most one, matching position in a given regular expression: deterministic k-lookahead regular expressions.
The second is similar except that when we use a lookahead of k symbols, we must match the next k positions uniquely: k-block-deterministic regular expressions.

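Deterministic k-lookahead regular languages
[Figure: at state $q_i$, the k-lookahead window covers the input symbols $a_{i+1}, a_{i+2}, \dots, a_{i+k}$. After reading $a_{i+1}$, the automaton is at state $q_{i+1}$ and the window moves one symbol to the right, covering $a_{i+2}, \dots, a_{i+k+1}$.]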

Deterministic k-lookahead regular languages
A regular language L is deterministic k-lookahead if there is a deterministic k-lookahead regular expression for L.
A regular expression is deterministic k-lookahead if its position automaton is deterministic k-lookahead.
[Figure: a position automaton with states 0, 1, 2, 3 for an example expression E.]

Deterministic k-lookahead regular languages
Thm. $L((a + b)^* a (a + b)^k)$, for $k \ge 0$, is deterministic (k+1)-lookahead.
[Figure: example automata for k = 1 and k = 2.]
There exists a hierarchy for deterministic k-lookahead regular languages.
[Figure: nested families for lookahead 1, ..., k-3, k-2, k-1, k, each contained in the next.]

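k-block-deterministic regular languages
[Figure: at state $q_i$, the k-lookahead window covers the input symbols $a_{i+1}, a_{i+2}, \dots, a_{i+k}$. The whole block $a_{i+1} \cdots a_{i+k}$ is read in one step, and the next window starts at $a_{i+k+1}$.]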

k-block-deterministic regular languages
We define a regular language L to be k-block-deterministic if there exists a k-block automaton A = (Q, Σ, Γ, δ, s, F) that satisfies the following conditions:
1. A is a position automaton over Γ.
2. A is a deterministic block automaton.
3. L = L(A).
It is easy to verify that a position automaton A for a 1-deterministic regular language is 1-block-deterministic.

k-block-deterministic regular languages
Thm. There is a proper hierarchy in k-block-deterministic regular languages.
Sketch of Proof. A (k-1)-block-deterministic regular language is k-block-deterministic by definition. Thus, it is enough to show that there is a k-block-deterministic regular language that is not (k-1)-block-deterministic.
[Figure: nested families for block size 1, ..., k-3, k-2, k-1, k, each contained in the next.]

[Figure: automata with states $q_1, \dots, q_5$ and a chain of $k - 3$ states, used in the proof sketch.]

Two Ways...
Thm. k-block-deterministic regular languages are a proper subfamily of deterministic k-lookahead regular languages.
[Figure: k-block determinism drawn inside k-lookahead determinism.]
Generalizations of One-Deterministic Regular Languages, Yo-Sub Han and Derick Wood, Information and Computation, Vol. 206, 1117-1125, 2008

XML DTD vs XML Schema
There is no "vs": XML Schemas are much more flexible and powerful.
Thus, they are also much more difficult and confusing.
XML DTD: 1-lookahead determinism
XML Schema: k-lookahead determinism?

Pattern Matching - an application of regular languages
Given a regular expression pattern P and a text T, find all substrings of T that are in L(P).
T = AGCTAATCCCTGAGAGTCCAGTTAGTCCCAT
P = T(AG + C)T

Pattern Matching
New Domains: WEB, Bioinformatics, Huge DB, Images or Source Codes

Pattern Matching - related work
Given a text T and a regular expression E:
The recognition problem: We can report all end positions of matching substrings of T in $O(mn)$ time [Aho] or in $O(mn / \log n)$ time [Myers].
The identification problem: We can report all (start, end) positions of matching substrings of T in $O(mn^2)$ time [Aho].

Pattern Matching - recognition problem
Given E over Σ, we prepend $\Sigma^*$ to E; this allows matching to begin at any position in T.
[Figure: an example expression E and text T, scanned from left to right with the automaton for $\Sigma^* E$.]
Given E and T, we can find all end positions of matching substrings of T in $O(mn)$ time using $O(m)$ space, where $|E| = m$ and $|T| = n$ [Aho].
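The effect of prepending $\Sigma^*$ can be sketched by re-entering the start state at every text position while simulating a λ-free automaton. The Python sketch below uses a hand-coded position automaton; the pattern T(AG + C)*T (with a Kleene star), the data layout and the function name `match_ends` are assumptions chosen for illustration, and the code makes no attempt to reproduce the exact $O(mn)$ bound of [Aho].

```python
def match_ends(symbols, first, last, follow, text):
    """Report the 1-based end positions of all substrings of text that match."""
    ends = []
    current = set()  # active positions; the start state is implicitly always active
    for i, ch in enumerate(text, start=1):
        targets = set(first)  # restarting here realises the Sigma* prefix
        for p in current:
            targets |= follow[p]
        current = {q for q in targets if symbols[q] == ch}
        if current & last:
            ends.append(i)
    return ends


# Hypothetical pattern T(AG + C)*T with positions T1 A2 G3 C4 T5
symbols = {1: "T", 2: "A", 3: "G", 4: "C", 5: "T"}
first, last = {1}, {5}
follow = {1: {2, 4, 5}, 2: {3}, 3: {2, 4, 5}, 4: {2, 4, 5}, 5: set()}

text = "AGCTAATCCCTGAGAGTCCAGTTAGTCCCAT"
print(match_ends(symbols, first, last, follow, text))  # [11, 22, 23, 26]
```

Only the current set of positions is stored, so the working space depends on the pattern and not on the text length.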

Pattern Matching - identification problem
Given E over Σ, we prepend $\Sigma^*$ to E; this allows matching to begin at any position in T.
[Figure: an example expression E, its reversal $E^R$ and a text T. A forward scan with E finds the matching end positions; for each end position, a backward scan with $E^R$ finds the corresponding start positions.]
Running Time = (No. of matching end positions) $\times\ O(mn) = O(n) \times O(mn) = O(mn^2)$.
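The two-pass idea can be sketched as one forward scan followed by one backward scan per end position, which is where the factor "number of end positions" in the running time comes from. The position automaton layout, the helper names `run`, `reverse` and `identify`, and the pattern T(AG + C)*T are assumptions of this illustration.

```python
def run(symbols, first, last, follow, text, restart):
    """Return the 0-based indices at which a final position is active.

    With restart=True the start state is re-entered at every index,
    which has the same effect as prepending Sigma* to the expression.
    """
    hits, current = [], None
    for i, ch in enumerate(text):
        targets = set()
        if current is None or restart:
            targets |= first
        for p in current or ():
            targets |= follow[p]
        current = {q for q in targets if symbols[q] == ch}
        if current & last:
            hits.append(i)
    return hits


def reverse(first, last, follow):
    """Position automaton of the reversed expression: swap ends, invert follow."""
    pred = {p: set() for p in follow}
    for p, targets in follow.items():
        for q in targets:
            pred[q].add(p)
    return last, first, pred


def identify(symbols, first, last, follow, text):
    """All (start, end) index pairs, 0-based and inclusive, of matching substrings."""
    r_first, r_last, r_follow = reverse(first, last, follow)
    matches = []
    for end in run(symbols, first, last, follow, text, restart=True):
        backwards = text[end::-1]  # text[0..end] read from right to left
        for k in run(symbols, r_first, r_last, r_follow, backwards, restart=False):
            matches.append((end - k, end))
    return sorted(matches)


# Hypothetical pattern T(AG + C)*T with positions T1 A2 G3 C4 T5
symbols = {1: "T", 2: "A", 3: "G", 4: "C", 5: "T"}
first, last = {1}, {5}
follow = {1: {2, 4, 5}, 2: {3}, 3: {2, 4, 5}, 4: {2, 4, 5}, 5: set()}
print(identify(symbols, first, last, follow, "TCCTAGT"))  # [(0, 3), (3, 6)]
```

Each backward scan may walk all the way to the beginning of the text, so with up to n end positions the total work grows quadratically in n.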

Prefix and Infix
Given two strings x and y over Σ, we say
x is a prefix of y if there exists $z \in \Sigma^*$ such that xz = y.
x is an infix of y if there exist $u, v \in \Sigma^*$ such that uxv = y; we often call x a substring of y.
For example, with y = seoul, eou is an infix of y.
We define a pattern P to be
prefix-free if no string in P is a prefix of any other string in P.
infix-free if no string in P is an infix of any other string in P.
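For a finite pattern given as a set of strings, both definitions translate directly into pairwise checks. The following Python sketch is only an illustration of the definitions; the function names and the sample sets are assumptions, and the quadratic loops are chosen for readability rather than speed.

```python
def is_prefix_free(pattern):
    """No string in the pattern is a prefix of another string in the pattern."""
    return not any(x != y and y.startswith(x) for x in pattern for y in pattern)


def is_infix_free(pattern):
    """No string in the pattern is an infix (substring) of another string in the pattern."""
    return not any(x != y and x in y for x in pattern for y in pattern)


print(is_prefix_free({"seoul", "eou"}))  # True
print(is_infix_free({"seoul", "eou"}))   # False: eou is an infix of seoul
print(is_prefix_free({"ab", "abc"}))     # False: ab is a prefix of abc
```

Since every prefix is also an infix, any infix-free pattern is prefix-free, but not the other way around, as the first two calls show.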

Infix-free Regular-Expression Matching
[Figure: nested language families $L_{IN} \subset L_{PRE} \subset L_{REG}$ (infix-free, prefix-free and regular languages).]
Given an infix-free regular expression E and a text T:
[Figure: a text T with positions 1 to 12. The recognition process with E marks the end positions of matches, and a scan with $E^R$ marks the start positions.]
Because of infix-freeness, each pair of (start, end) positions from left to right must be a matching substring.
We can find all matching substrings in $O(mn)$ time [HWW07].
Prefix-Free Regular Languages and Pattern Matching, Yo-Sub Han, Yajun Wang and Derick Wood, Theoretical Computer Science, Vol. 389, 307-317, 2007

Prefix-free Regular-Expression Matching
[Figure: nested language families $L_{IN} \subset L_{PRE} \subset L_{REG}$.]
If E is infix-free, we have an $O(mn)$ running time algorithm.
If E is a (normal) regular expression, we have an $O(mn^2)$ running time algorithm.
If E is prefix-free, then there are at most n matching substrings of T that belong to L(E), where n is the size of T.

Prefix-free Regular-Expression Matching
Sketch of our algorithm:
[Figure: a text T with positions 1 to 15. The recognition process with E finds the matching end positions, here 12 and 15. A single backward scan with $E^R$ then serves all end positions in parallel, keeping one set of states for 12 and one set of states for 15; parallel processing starts when the scan reaches a further end position.]
Handling each end position separately would give Running Time = (No. of matching end positions) $\times\ O(mn) = O(n) \times O(mn) = O(mn^2)$.
Because of prefix-freeness, no two processes can have the same state of E at the same time.
This implies that a single reverse scan is enough to find the corresponding start positions for each end position.

Prefix-free Regular-Expression Matching
Given a prefix-free regular expression E and a text T, we can identify all matching substrings of T that belong to L(E) in $O(mn)$ worst-case time [HWW07].
Prefix-Free Regular Languages and Pattern Matching, Yo-Sub Han, Yajun Wang and Derick Wood, Theoretical Computer Science, Vol. 389, 307-317, 2007

State Complexity
What is the state complexity of a regular language L?
State complexity is a descriptional complexity of L.
L has a unique minimal DFA A.
We define the state complexity of L to be the number of states in A.
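Since the state complexity is the size of the minimal DFA, it can be computed from any complete DFA for L by discarding unreachable states and merging equivalent ones. The sketch below uses a simple Moore-style partition refinement; the dictionary encoding of δ, the function name `state_complexity` and the example automaton are assumptions made for illustration, and a sink state, if present, is counted like any other state.

```python
def state_complexity(alphabet, delta, start, finals):
    """Number of states of the minimal complete DFA for the given complete DFA."""
    # Step 1: keep only the states reachable from the start state.
    reach, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for a in alphabet:
            q = delta[p, a]
            if q not in reach:
                reach.add(q)
                stack.append(q)
    # Step 2: refine the partition {final, non-final} until it is stable.
    block = {p: (p in finals) for p in reach}
    while True:
        signature = {
            p: (block[p], tuple(block[delta[p, a]] for a in alphabet)) for p in reach
        }
        if len(set(signature.values())) == len(set(block.values())):
            return len(set(block.values()))
        block = signature


# Hypothetical 3-state DFA over {a, b} for "strings ending in a";
# states 0 and 2 behave identically, so the minimal DFA has 2 states.
delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
print(state_complexity("ab", delta, 0, {1}))  # 2
```

Each refinement round only ever splits blocks, so the loop stops as soon as a round produces no new block.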

State Complexity Problem
Given two (arbitrary) regular languages $L_1$ and $L_2$ with state complexities $m_1$ and $m_2$, what is the state complexity of the language obtained by combining $L_1$ and $L_2$ with an operation such as union?
Upper bound: at most $f(m_1, m_2)$ states.
Lower bound: Present two (general) $L_1$ and $L_2$ such that the state complexity of the resulting language always reaches the upper bound.

State Complexity - Motivation
1970s to 2011
In recent years, there have been many new applications of FAs, such as in natural language and speech processing, software engineering, and image generation and encoding, that need a large number of states.
The Bell Labs multilingual TTS system: 26.6MB for German, 30.0MB for French and 39.0MB for Chinese.

State Complexity - motivation
New Helper: FA manipulation software systems such as Grail+, Automate and FireLite
We calculate the upper bound.
We guess a lower bound and verify it, and repeat this step until we find a matching lower bound.

State Complexity
operation              finite languages                               regular languages
$L_1 \cup L_2$         $O(mn)$                                        $mn$
$L_1 \cap L_2$         $O(mn)$                                        $mn$
$\Sigma^* \setminus L_1$   $m$                                        $m$
$L_1 \cdot L_2$        $(m - n + 3)2^{n-2} - 1$                       $(2m - 1)2^{n-1}$
$L_1^*$                $2^{m-3} + 2^{m-4}$, for $m \ge 4$             $2^{m-1} + 2^{m-2}$
$L_1^R$                $3 \cdot 2^{p-1} - 1$ if $m = 2p$; $2^p - 1$ if $m = 2p - 1$   $2^m$

Union of Finite Languages
Given two minimal DFAs A and B for non-empty finite languages $L_1$ and $L_2$, we can construct a DFA for $L(A) \cup L(B)$ based on the Cartesian product of states as follows:
Let $A = (Q_1, \Sigma, \delta_1, s_1, F_1)$ and $B = (Q_2, \Sigma, \delta_2, s_2, F_2)$.
$M = (Q_1 \times Q_2, \Sigma, \delta, (s_1, s_2), F)$, where for all $p \in Q_1$, $q \in Q_2$ and $a \in \Sigma$,
$\delta((p, q), a) = (\delta_1(p, a), \delta_2(q, a))$ and $F = (F_1 \times Q_2) \cup (Q_1 \times F_2)$.
M is deterministic.
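The construction of M can be written down almost verbatim. In the Python sketch below, a DFA is a tuple (states, delta, start, finals) with a complete transition dictionary; this encoding, the function names and the two tiny example automata for {a} and {bb} are assumptions made for illustration.

```python
from itertools import product


def union_dfa(A, B, alphabet):
    """Cartesian-product DFA M for L(A) ∪ L(B)."""
    (Q1, d1, s1, F1), (Q2, d2, s2, F2) = A, B
    states = set(product(Q1, Q2))
    delta = {((p, q), a): (d1[p, a], d2[q, a]) for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in F1 or q in F2}  # (F1 x Q2) ∪ (Q1 x F2)
    return states, delta, (s1, s2), finals


def accepts(dfa, word):
    _, delta, state, finals = dfa
    for a in word:
        state = delta[state, a]
    return state in finals


# A accepts {a}: states 0 (start), 1 (final), 2 (sink).
A = ({0, 1, 2},
     {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 2, (2, "a"): 2, (2, "b"): 2},
     0, {1})
# B accepts {bb}: states 0 (start), 1, 2 (final), 3 (sink).
B = ({0, 1, 2, 3},
     {(0, "a"): 3, (0, "b"): 1, (1, "a"): 3, (1, "b"): 2,
      (2, "a"): 3, (2, "b"): 3, (3, "a"): 3, (3, "b"): 3},
     0, {2})

M = union_dfa(A, B, "ab")
print(len(M[0]))          # 12 product states before any pruning
print(accepts(M, "a"))    # True
print(accepts(M, "bb"))   # True
print(accepts(M, "ab"))   # False
```

The product has mn states, but many of them are unreachable or equivalent to each other for finite languages, which is what the counting on the following slides exploits.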

Union of Finite Languages - Cartesian Product of States
[Figure: the product states arranged in a grid, from (1,1), (1,2), ..., (1,n) in the first row down to (m,1), ..., (m,n-1), (m,n) in the last row.]
The (m-1)th state in A is the final state whose out-transitions go to the sink state, the mth state.
For a state (i, j) in M, $L_{i,j}(M) = L_i(A) \cup L_j(B)$.
All states in the first row and the first column other than (1,1), that is, $m + n - 2$ states, are unreachable from state (1,1) since A and B are non-returning.
The states (m-1,n-1), (m-1,n) and (m,n-1) are all equivalent since $L_{m-1,n-1} = L_{m-1,n} = L_{m,n-1} = \{\lambda\}$.
Lemma 1. $mn - (m + n - 2) - 2 = mn - (m + n)$ states are sufficient for $L(A) \cup L(B)$.

Union of Finite Languages
Lemma 1. $mn - (m + n)$ states are sufficient for $L(A) \cup L(B)$.
The next question is whether or not the bound is reachable in general.
The answer is YES and NO.

Union of Finite Languages
Lemma 2. The upper bound $mn - (m + n)$ cannot be reached with a fixed alphabet when m and n are arbitrarily large.
Proof. Let A have the states $\{p_0, p_1, \dots, p_{m-1}\}$ and B have the states $\{q_0, q_1, \dots, q_{n-1}\}$, and let t be the size of the alphabet. We order the states such that if $p_j$ is reachable from $p_i$, then $i < j$.
Let $i \in \{1, \dots, m-1\}$. Any string that reaches $p_i$ from $p_0$ can go through only the states $p_1, \dots, p_{i-1}$ in between and cannot visit the same state twice. Hence, there are at most
$t + t^2 + \cdots + t^i = \frac{t(t^i - 1)}{t - 1} =_{def} D(i)$
strings that can reach $p_i$ from $p_0$.
Since M is deterministic, for any fixed i with $1 \le i < m - 1$, at most $D(i)$ of the pair-states $(p_i, q_j)$ are reachable from $(p_0, q_0)$ in M.
Thus, if $n - 2 > D(i)$, then some pair-states with $p_i$ as the first component are not reachable. Therefore, the bound $mn - (m + n)$ is not reachable.

Union of Finite Languages
Lemma 2. The upper bound $mn - (m + n)$ cannot be reached with a fixed alphabet when m and n are arbitrarily large.
What if the size of an alphabet is NOT fixed?
Lemma 3. The upper bound $mn - (m + n)$ is reachable if the size of the alphabet can depend on m and n.

Union of Finite Languages
Lemma 3. The upper bound $mn - (m + n)$ is reachable if the size of the alphabet can depend on m and n.
We prove the lemma by presenting two finite languages whose union reaches the bound.
Let $\Sigma = \{b, c\} \cup \{a_{i,j} \mid 1 \le i \le m-2,\ 1 \le j \le n-2 \text{ and } (i, j) \ne (m-2, n-2)\}$.
Let $A = (Q_1, \Sigma, \delta_1, p_0, \{p_{m-2}\})$, where $Q_1 = \{p_0, p_1, \dots, p_{m-1}\}$ and $\delta_1$ is defined as follows:
$\delta_1(p_i, b) = p_{i+1}$, for $0 \le i \le m-2$.
$\delta_1(p_0, a_{i,j}) = p_i$, for $1 \le i \le m-2$ and $1 \le j \le n-2$, $(i, j) \ne (m-2, n-2)$.
Let $B = (Q_2, \Sigma, \delta_2, q_0, \{q_{n-2}\})$, where $Q_2 = \{q_0, q_1, \dots, q_{n-1}\}$ and $\delta_2$ is defined as follows:
$\delta_2(q_i, c) = q_{i+1}$, for $0 \le i \le n-2$.
$\delta_2(q_0, a_{i,j}) = q_j$, for $1 \le j \le n-2$ and $1 \le i \le m-2$, $(i, j) \ne (m-2, n-2)$.

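Union of Finite Languages
Lemma 3. The upper bound $mn - (m + n)$ is reachable if the size of the alphabet can depend on m and n.
[Figure: An example of two minimal DFAs for finite languages whose sizes are 6 and 5, respectively, where state 5 above and state 4 below are sink states. In A (above, states 0 to 5), consecutive states are linked by b, and state 0 goes to state 1 on $b, a_{11}, a_{12}, a_{13}$, to state 2 on $a_{21}, a_{22}, a_{23}$, to state 3 on $a_{31}, a_{32}, a_{33}$ and to state 4 on $a_{41}, a_{42}$. In B (below, states 0 to 4), consecutive states are linked by c, and state 0 goes to state 1 on $c, a_{11}, a_{21}, a_{31}, a_{41}$, to state 2 on $a_{12}, a_{22}, a_{32}, a_{42}$ and to state 3 on $a_{13}, a_{23}, a_{33}$.]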

Union of Finite Languages
Lemma 3. The upper bound $mn - (m + n)$ is reachable if the size of the alphabet can depend on m and n.
Let $L = L(A) \cup L(B)$. We show that there exists a set R consisting of $mn - (m + n)$ strings over Σ that are pairwise inequivalent modulo the right invariant congruence of L.
Let $R = R_1 \cup R_2 \cup R_3$, where
$R_1 = \{b^i \mid 0 \le i \le m-1\}$.
$R_2 = \{c^j \mid 1 \le j \le n-3\}$. (Note that $R_2$ does not include the strings $c^0$, $c^{n-2}$ and $c^{n-1}$.)
$R_3 = \{a_{i,j} \mid 1 \le i \le m-2 \text{ and } 1 \le j \le n-2 \text{ and } (i, j) \ne (m-2, n-2)\}$.
It is easy to verify that all strings in R are pairwise inequivalent. (The complete proof is given in the proceedings.)
Then, $|R| = m + (n - 3) + ((m-2)(n-2) - 1) = mn - (m + n)$.

Union of Finite Languages
Theorem 1. Given two minimal DFAs A and B for finite languages, $mn - (m + n)$ states are necessary and sufficient in the worst case for the minimal DFA of $L(A) \cup L(B)$, where $m = |A|$ and $n = |B|$.

Union of Finite Languages
Lemma 2 shows that the upper bound is unreachable if $|\Sigma|$ is fixed, whereas Lemma 3 shows that the upper bound is reachable if $|\Sigma|$ depends on m and n.
Then, what is the state complexity of union with a fixed-size alphabet?

Union of Finite Languages
Lemma 4. There exist DFAs A and B, with m and n states respectively, that recognize finite languages over a fixed alphabet Σ such that the minimal DFA for $L(A) \cup L(B)$ requires $c \cdot (\min\{m, n\})^2$ states, for some constant c.
Proof. Let $s \ge 1$ be arbitrary and $r = \lceil \log s \rceil$. We define the finite language
$L_1 = \{w_1 w_2 \mid |w_1| = 2r,\ w_2 = odd(w_1) \in \{a, b\}^*,\ even(w_1) \in \{c, d\}^*\}$.
$L_1$ can be recognized by a DFA A with at most 10s states.

Union of Finite Languages
$L_1 = \{w_1 w_2 \mid |w_1| = 2r,\ w_2 = odd(w_1) \in \{a, b\}^*,\ even(w_1) \in \{c, d\}^*\}$.
[Figure: A DFA A that recognizes $L_1$ when r = 3, drawn as an expanding tree for the $w_1$ part followed by a merging tree for the $w_2$ part; the transitions at even positions are labeled c, d. We omit the sink state and its in-transitions.]

Union of Finite Languages
Symmetrically, we define
$L_2 = \{w_1 w_2 \mid |w_1| = 2r,\ odd(w_1) \in \{a, b\}^*,\ w_2 = even(w_1) \in \{c, d\}^*\}$.
The language $L_2$ consists of strings uv, where $|u| = 2r$, odd characters of u are in {a, b}, even characters of u are in {c, d} and even(u) coincides with v.
By a similar argument, $L_2$ can be recognized by a DFA B with at most 10s states.

Union of Finite Languages
Now let $L = L_1 \cup L_2$. Let $u_1$ and $u_2$ be distinct strings of length 2r such that $odd(u_i) \in \{a, b\}^*$ and $even(u_i) \in \{c, d\}^*$ for i = 1, 2.
If $odd(u_1) \ne odd(u_2)$: $u_1 \cdot odd(u_1) \in L_1 \subseteq L$ but $u_2 \cdot odd(u_1) \notin L$. Hence, $u_1$ and $u_2$ are not equivalent modulo the right invariant congruence of L.
If $even(u_1) \ne even(u_2)$: $u_1 \cdot even(u_1) \in L_2 \subseteq L$ but $u_2 \cdot even(u_1) \notin L$.
The above implies that the right invariant congruence of L has at least $2^r \cdot 2^r \ge s^2$ different classes.
Therefore, if m = n = 10s is the size of the minimal DFAs for the finite languages $L_1$ and $L_2$, then we know that the minimal DFA for $L = L_1 \cup L_2$ needs at least $\frac{1}{100} n^2$ states.
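For small r, the counting argument can be checked by brute force: enumerate $L = L_1 \cup L_2$, collect the set of possible continuations of every valid prefix of length 2r, and count how many distinct sets occur. The Python sketch below does this; the helper names are assumptions, and the count is only a lower bound on the number of congruence classes because prefixes of other lengths are ignored.

```python
from itertools import product


def odd(w):
    """Characters at positions 1, 3, 5, ... (1-based)."""
    return w[0::2]


def even(w):
    """Characters at positions 2, 4, 6, ... (1-based)."""
    return w[1::2]


def prefixes(r):
    """All strings u of length 2r with odd(u) over {a, b} and even(u) over {c, d}."""
    for xs in product("ab", repeat=r):
        for ys in product("cd", repeat=r):
            yield "".join(x + y for x, y in zip(xs, ys))


def language(r):
    """L = L1 ∪ L2, where L1 appends odd(u) and L2 appends even(u) to each prefix u."""
    return {u + odd(u) for u in prefixes(r)} | {u + even(u) for u in prefixes(r)}


def count_classes(r):
    """Number of distinct continuation sets among the prefixes of length 2r."""
    L = language(r)
    residuals = {
        frozenset(w[2 * r:] for w in L if w.startswith(u)) for u in prefixes(r)
    }
    return len(residuals)


print(count_classes(2))  # 16 = 2^2 * 2^2
print(count_classes(3))  # 64 = 2^3 * 2^3
```

Each prefix u has the continuation set {odd(u), even(u)}, and that set determines u, which is why the count equals $2^r \cdot 2^r$.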

RECAP
Structural properties of the k-lookahead determinism that might lead to an efficient XML Schema parser
Fast regular-expression pattern matching algorithms
State Complexity

Future Directions and Conclusions
Hierarchy of k-lookahead determinism: XML Schema parser
Regular-expression pattern matching system for source codes
Pattern matching + indexing
State complexity
[Figure: the topics are placed along an axis from pure theory to practical application.]

THANK YOU
ANY QUESTIONS??