On-Line Construction of Suffix Trees

E. Ukkonen

Overview
Suffix tries
On-line construction of suffix tries in quadratic time
Suffix trees
On-line construction of suffix trees in linear time
Applications

Suffix Trees
A suffix tree is a trie-like data structure representing all suffixes of a string.

Notations
Let T = t_1 … t_n be a string.
For 0 ≤ i ≤ n, let T^i = t_1 … t_i denote the i-length prefix of T.
For 1 ≤ i ≤ n + 1, let T_i = t_i … t_n denote the suffix of T that starts at the i-th position.
Let σ(T) = {T_i | 1 ≤ i ≤ n + 1}.
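The notation can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper names `prefix`, `suffix` and `sigma` and the sample string "cacao" are assumptions made here, and the 1-based positions of the slides are mapped onto Python's 0-based slices.

```python
def prefix(T: str, i: int) -> str:
    """T^i = t_1 ... t_i, the prefix of length i (0 <= i <= n)."""
    return T[:i]


def suffix(T: str, i: int) -> str:
    """T_i = t_i ... t_n, the suffix starting at 1-based position i (1 <= i <= n + 1)."""
    return T[i - 1:]


def sigma(T: str) -> set:
    """sigma(T): the set of all suffixes of T, including the empty suffix T_{n+1}."""
    return {suffix(T, i) for i in range(1, len(T) + 2)}


if __name__ == "__main__":
    print(prefix("cacao", 3))        # the 3-length prefix
    print(sorted(sigma("cacao")))    # all suffixes, empty string included
```

Note that T_{n+1} is the empty string ε, so σ(T) always has n + 1 elements.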

Suffix Tries
The suffix trie of T, denoted by STrie(T), is a trie representing σ(T).

Definition: STrie(T) is an augmented DFA, STrie(T) = (Q ∪ {⊥}, root, F, g, f), where:
Q = {x̄ | x is a substring of T} is the set of the states of the DFA.
⊥ is an auxiliary state.
root is the initial state, corresponding to the empty string ε.
F = σ(T) is the set of final states.

g : (Q ∪ {⊥}) × Σ → Q (a partial function) is the transition function, defined as follows:
g(x̄, a) = ȳ for all x̄, ȳ ∈ Q and a ∈ Σ such that y = xa.
g(⊥, a) = root for all a ∈ Σ.

f : Q → Q ∪ {⊥} is the suffix function, defined as follows:
f(x̄) = ȳ for all x̄, ȳ ∈ Q, x̄ ≠ root, such that x = ay for some a ∈ Σ.
f(root) = ⊥.

An Example
[Figure: an example suffix trie (labels: Σ, ε).]

The Size of Suffix Tries
Theorem: The size of STrie(T), where |T| = n, is O(n^2).
Proof: The size of STrie(T) is linear in the number of substrings of T. T has at most O(n^2) substrings. Thus the size of STrie(T) is O(n^2).

On-Line Construction of Suffix Tries
Let T = t_1 … t_n. For every 1 ≤ i ≤ n, the algorithm constructs STrie(T^i).
First we construct STrie(T^0) = STrie(ε).
Then, for every 1 ≤ i ≤ n, we obtain STrie(T^i) from STrie(T^{i-1}).

Observation 1: σ(T^i) = {x t_i | x ∈ σ(T^{i-1})} ∪ {ε}.
Observation 2: The suffixes of T^i can be found by starting at the state T^i and following the suffix links until ε. Thus, σ(T^i) = {f^j(T^i) | 0 ≤ j ≤ i}.
Definition: The path from T^i to ⊥ following the suffix links is called the boundary path of STrie(T^i).

[Figure: a suffix trie with its boundary path (labels: Σ, ε).]
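Both the size bound and Observation 1 are easy to check by brute force on small inputs. The sketch below is not part of the slides; the function names and the sample strings are chosen here purely for illustration.

```python
def substrings(T: str) -> set:
    """All distinct substrings of T, including the empty string.
    STrie(T) has one state per element (not counting the auxiliary state)."""
    n = len(T)
    return {T[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def suffixes(T: str) -> set:
    """sigma(T): all suffixes of T, including the empty one."""
    return {T[i:] for i in range(len(T) + 1)}


def observation_1_holds(T: str) -> bool:
    """Check sigma(T^i) = {x t_i | x in sigma(T^(i-1))} U {eps} for every i."""
    for i in range(1, len(T) + 1):
        previous, current, t_i = T[:i - 1], T[:i], T[i - 1]
        if suffixes(current) != {x + t_i for x in suffixes(previous)} | {""}:
            return False
    return True


if __name__ == "__main__":
    for T in ("aaaa", "cacao", "abcd"):
        n = len(T)
        # The number of states never exceeds n(n+1)/2 + 1, which is O(n^2).
        print(T, len(substrings(T)), n * (n + 1) // 2 + 1, observation_1_holds(T))
```

A string of n distinct characters reaches the bound n(n+1)/2 + 1 exactly, while a run of one repeated character has only n + 1 states.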

[Figure: STrie(T^{i-1}) and STrie(T^i).]

The Algorithm

    create STrie(ε); top ← ε
    for i ← 1 to n do
        r ← top
        while g(r, t_i) is undefined do
            create new state r' and set g(r, t_i) ← r'
            if r ≠ top then f(old-r') ← r'
            old-r' ← r'
            r ← f(r)
        f(old-r') ← g(r, t_i)
        top ← g(top, t_i)

Running Time
Theorem: The running time of the algorithm is linear in the size of STrie(T), which is, in the worst case, O(|T|^2).
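The quadratic on-line algorithm above can be sketched in Python as follows. The class name `State`, the helper `state_names` and the way the auxiliary state ⊥ is modelled (a sentinel object whose transitions all lead to the root) are assumptions of this sketch, not something the slides prescribe.

```python
class State:
    def __init__(self):
        self.g = {}     # transition function: character -> State
        self.f = None   # suffix link


def build_suffix_trie(T: str):
    """Return (root, bottom) of STrie(T), built one character at a time."""
    bottom, root = State(), State()
    root.f = bottom

    def g(r, a):
        # g(bottom, a) = root for every character a; otherwise a partial function.
        return root if r is bottom else r.g.get(a)

    top = root                       # state of the longest suffix, T^(i-1)
    for a in T:
        r, old_r = top, None
        while g(r, a) is None:       # walk the boundary path up to the endpoint
            new_state = State()
            r.g[a] = new_state
            if old_r is not None:    # corresponds to "if r != top" in the pseudocode
                old_r.f = new_state
            old_r = new_state
            r = r.f
        if old_r is not None:
            old_r.f = g(r, a)
        top = g(top, a)
    return root, bottom


def state_names(root) -> dict:
    """Map the string spelled by each state to the state object (for inspection)."""
    names, stack = {}, [(root, "")]
    while stack:
        state, x = stack.pop()
        names[x] = state
        stack.extend((child, x + a) for a, child in state.g.items())
    return names


if __name__ == "__main__":
    root, _ = build_suffix_trie("cacao")
    print(sorted(state_names(root)))   # one state per distinct substring
```

Each iteration of the while loop creates exactly one state, which is why the running time is linear in the size of the trie.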

Running Time (cont.)
In the algorithm above, the body of the while loop takes O(1) time per iteration, and each iteration creates one new state. The algorithm therefore spends O(1) time for each node added to STrie(T).

Suffix Trees
A suffix tree STree(T) represents STrie(T) in space linear in |T|. This is achieved by representing only a subset Q' ∪ {⊥} of Q ∪ {⊥}, called the explicit states.

Explicit and Implicit States
Definition: A state q is called explicit in the following cases:
q is a leaf.
q is a branching state (has at least two transitions).
root and ⊥ are also defined to be branching states.
Otherwise (if q has exactly one transition and is not the root or ⊥), q is called implicit.

[Figure: explicit and implicit states in an example suffix trie.]
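To see how few states are explicit, one can build a naive suffix trie and classify its states. The sketch below uses nested dictionaries instead of the DFA of the slides and does not count the auxiliary state ⊥; both choices, like the function names, are assumptions made for brevity.

```python
def suffix_trie(T: str) -> dict:
    """Naive suffix trie as nested dicts: insert every suffix of T (quadratic time)."""
    root = {}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {})
    return root


def count_states(T: str):
    """Return (total states, explicit states) of the suffix trie of T.
    A state is explicit if it is the root, a leaf, or has at least two transitions."""
    total = explicit = 0
    stack = [(suffix_trie(T), True)]
    while stack:
        node, is_root = stack.pop()
        total += 1
        if is_root or len(node) != 1:   # 0 transitions: leaf; >= 2: branching
            explicit += 1
        stack.extend((child, False) for child in node.values())
    return total, explicit


if __name__ == "__main__":
    for T in ("cacao", "aaaa", "abcd"):
        print(T, count_states(T))
```

For "cacao" only the root, the states a and ca, and the five leaves are explicit; the remaining states lie in the middle of a path and are implicit.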

Generalized Transition Function
The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as a generalized transition g'(s, w) = r.

A generalized transition g'(s, w) = r is called an a-transition if a ∈ Σ and there is v ∈ Σ* such that w = av. Note that for each explicit state s and each a ∈ Σ there is at most one a-transition from s.

[Figures: side-by-side comparisons of STrie(T) and STree(T).]

Suffix Links
Definition: If x̄ ∈ Q' is a branching state and x = ay, where a ∈ Σ, then the suffix link of x̄ is defined by f'(x̄) = ȳ, and f'(ε) = ⊥.
Proposition: If x̄ ∈ Q' is a branching state and f'(x̄) = ȳ, then ȳ is also a branching state.
Proof: There exist a ≠ b ∈ Σ such that xa and xb are substrings of T. y is a suffix of x. Thus ya and yb are also substrings of T.

STree(T)
STree(T) = (Q' ∪ {⊥}, root, g', f').
[Figure: an example suffix tree.]

The Size of Suffix Trees
Theorem: The size of STree(T), where |T| = n, is O(n).
Proof: Since we represent each substring w = t_k … t_p of T by a pair of pointers (k, p), the size of STree(T) is linear in the number of explicit states. STree(T) has at most n leaves, and thus at most n - 1 branching states. Therefore, the size of STree(T) is O(n).

Reference Pairs
Definition: Let r be an explicit or implicit state. (s, w) is called a reference pair for r if:
s is an explicit state and an ancestor of r.
w is the string spelled out by the transitions from s to r in the corresponding suffix trie.
Definition: A reference pair (s, w) for r is called canonical if s is the closest explicit ancestor of r (or r itself, if it is explicit).

Active Point and Endpoint
Let s_1 = T^{i-1}, s_2, …, s_i = root, s_{i+1} = ⊥ be the boundary path of STrie(T^{i-1}).
Definition: s_j is called the active point of STrie(T^{i-1}) if j is the smallest index for which s_j is not a leaf.
Definition: s_{j'} is called the endpoint of STrie(T^{i-1}) if j' is the smallest index for which s_{j'} has a t_i-transition.

[Figure: the boundary path with the active point and the endpoint marked.]

Proposition: s_j and s_{j'} are well defined and j ≤ j'.
Proof:
root is not a leaf ⇒ s_j is defined.
g(⊥, t_i) is defined ⇒ s_{j'} is defined.
g(s_{j'}, t_i) is defined ⇒ s_{j'} is not a leaf ⇒ j ≤ j'.

Adding t_i-Transitions to STrie(T^{i-1})
Lemma: When obtaining STrie(T^i) from STrie(T^{i-1}) the algorithm adds a t_i-transition to each state s_h such that 1 ≤ h < j', and only to these states, as follows:
For 1 ≤ h < j, the new transition expands an old branch of the trie that ends at s_h.
For j ≤ h < j', the new transition initiates a new branch from s_h.

[Figure: the active point and the endpoint on the boundary path, and the added t_i-transitions.]

On-Line Construction of Suffix Trees
We create STree(ε), and then for every 1 ≤ i ≤ n we obtain STree(T^i) from STree(T^{i-1}).
When obtaining STree(T^i) from STree(T^{i-1}), we update STree(T^{i-1}) according to the transitions we would add to STrie(T^{i-1}).
Note that s_1, …, s_{i-1} are not necessarily explicit states.

For 1 ≤ h < j:
s_h is a leaf. Thus there exist an explicit state s' and an index k ≤ i - 1 such that g'(s', (k, i-1)) = s_h. We replace this transition by g'(s', (k, i)) = s_h.
Performing this update for every leaf would take too much time. Thus, we denote transitions of the type g'(s', (k, i-1)) in STree(T^{i-1}) by g'(s', (k, ∞)). Hence, no updates are needed.

For j ≤ h < j':
If s_h is an implicit state, we turn it into an explicit state by splitting the transition containing it.
We create a new leaf s_h t_i and add a new transition g'(s_h, (i, ∞)).

[Figure: example steps of the construction, with the active point (AP) and the endpoint (EP) marked.]

Lemma 1
Let (s, (k, p)) be some reference pair for a state r. Then there exist s' and k' such that (s', (k', p)) is the canonical reference pair for r.
Proof: Let s' be the closest explicit ancestor of r, or r itself if r is explicit. t_k … t_p is the path from the explicit state s to r. Thus, the path from s' to r is a suffix t_{k'} … t_p of t_k … t_p.

Lemma 2
Let r be a state on the boundary path of STrie(T^i). Then there exist s and k such that (s, (k, i)) is the canonical reference pair for r.
Proof: r is on the boundary path of STrie(T^i) ⇒ r refers to some suffix t_k … t_i of T^i ⇒ (ε, (k, i)) is a reference pair for r ⇒ the claim holds by Lemma 1.

Lemma 3
Let (s, (k, i-1)) be a reference pair for the endpoint of STrie(T^{i-1}). Then (s, (k, i)) is a reference pair for the active point of STrie(T^i).
Proof: s_j is the active point of STrie(T^{i-1}) iff t_j … t_{i-1} is the longest suffix of T^{i-1} that occurs at least twice in T^{i-1}.

Lemma 3 (cont.)
Proof (cont.): s_{j'} is the endpoint of STrie(T^{i-1}) iff t_{j'} … t_{i-1} is the longest suffix of T^{i-1} such that t_{j'} … t_{i-1} t_i is a substring of T^{i-1}.
Thus, if s_{j'} is the endpoint of STrie(T^{i-1}), then t_{j'} … t_{i-1} t_i is the longest suffix of T^i that occurs at least twice in T^i.
Therefore, s_{j'} t_i is the active point of STrie(T^i).

The Algorithm

    create STree(ε)
    s ← root; k ← 1
    for i ← 1 to n do
        (s, k) ← update(s, (k, i))
        (s, k) ← canonize(s, (k, i))

update transforms STree(T^{i-1}) into STree(T^i).
Input: (s, (k, i)) such that (s, (k, i-1)) is the active point of STrie(T^{i-1}).
Output: (s', k') such that (s', (k', i-1)) is the endpoint of STrie(T^{i-1}).

canonize
Input: a reference pair (s, (k, p)) for some state r.
Output: (s', k') such that (s', (k', p)) is the canonical reference pair for r.

test-and-split
Input: the canonical reference pair for some state r, and t_i.
Output: true/false depending on whether r is the endpoint or not, and the explicit state r (creating it if needed).

    update(s, (k, i))
        old-r ← root
        (endpoint, r) ← test-and-split(s, (k, i-1), t_i)
        while not endpoint do
            create new state r'; g'(r, (i, ∞)) ← r'
            if old-r ≠ root then f'(old-r) ← r
            old-r ← r
            (s, k) ← canonize(f'(s), (k, i-1))
            (endpoint, r) ← test-and-split(s, (k, i-1), t_i)
        if old-r ≠ root then f'(old-r) ← s
        return (s, k)

[Figure: step-by-step example of update, showing the transitions (k, p) and (k, ∞) and the values of s, k and i.]

    test-and-split(s, (k, p), t)
        if k ≤ p then
            find the t_k-transition g'(s, (k', p')) = s' from s
            if t = t_{k'+p-k+1} then return (true, s)
            else
                create new state r
                replace g'(s, (k', p')) = s' by g'(s, (k', k'+p-k)) = r and g'(r, (k'+p-k+1, p')) = s'
                return (false, r)
        else
            if there is no t-transition from s then return (false, s)
            else return (true, s)

    canonize(s, (k, p))
        if p < k then return (s, k)
        else
            find the t_k-transition g'(s, (k', p')) = s' from s
            while p' - k' ≤ p - k do
                k ← k + p' - k' + 1
                s ← s'
                if k ≤ p then find the t_k-transition g'(s, (k', p')) = s' from s
            return (s, k)

Running Time
Theorem: The running time of the algorithm is O(n).
Proof: We divide the running time into two components:
1. The total time of the procedure canonize.
2. The rest.

For the rest: update is called n times. Each call to test-and-split takes O(1) time, and in each execution of the while loop in update a new state is created.
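The procedures update, test-and-split and canonize can be transcribed into Python fairly directly. The following is a sketch under a few assumptions made here: edges are stored per node in a dictionary keyed by their first character, the text is padded so that the 1-based indices of the pseudocode can be used unchanged, the open end ∞ is `math.inf`, and the transitions from ⊥ are simulated by the helper `find` rather than stored. The names `Node`, `find` and `leaf_strings` are not from the slides.

```python
import math


class Node:
    def __init__(self):
        self.edges = {}    # first character -> (k, p, child); label is t_k ... t_p
        self.link = None   # suffix link f'


def build_suffix_tree(text: str) -> Node:
    T = " " + text                       # pad so that T[1..n] is the text
    n = len(text)
    root, bottom = Node(), Node()
    root.link = bottom

    def find(s, k):
        """The t_k-transition from s; bottom leads to root on every character."""
        return (k, k, root) if s is bottom else s.edges[T[k]]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k1, p1, s1 = find(s, k)
        while p1 - k1 <= p - k:          # the whole edge fits: move down to s1
            k += p1 - k1 + 1
            s = s1
            if k <= p:
                k1, p1, s1 = find(s, k)
        return s, k

    def test_and_split(s, k, p, t):
        if k <= p:                       # implicit state inside an edge
            k1, p1, s1 = find(s, k)
            if t == T[k1 + p - k + 1]:
                return True, s
            r = Node()                   # make the implicit state explicit
            s.edges[T[k1]] = (k1, k1 + p - k, r)
            r.edges[T[k1 + p - k + 1]] = (k1 + p - k + 1, p1, s1)
            return False, r
        return (s is bottom or t in s.edges), s

    def update(s, k, i):
        old_r = root
        end_point, r = test_and_split(s, k, i - 1, T[i])
        while not end_point:
            r.edges[T[i]] = (i, math.inf, Node())    # new open leaf edge (i, inf)
            if old_r is not root:
                old_r.link = r
            old_r = r
            s, k = canonize(s.link, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, T[i])
        if old_r is not root:
            old_r.link = s
        return s, k

    s, k = root, 1
    for i in range(1, n + 1):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root


def leaf_strings(text: str) -> list:
    """Strings spelled by the root-to-leaf paths (used to inspect the result)."""
    root, n, out = build_suffix_tree(text), len(text), []
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if not node.edges and node is not root:
            out.append(path)
        for k, p, child in node.edges.values():
            stack.append((child, path + text[k - 1:min(p, n)]))
    return sorted(out)


if __name__ == "__main__":
    print(leaf_strings("cacao"))
```

When the text ends with a character that occurs nowhere else (for example a terminator such as `$`), every non-empty suffix ends at its own leaf, which gives a simple way to sanity-check the construction on small inputs.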

For canonize: in each execution of the while loop the value of k increases. Since k never decreases and never exceeds n, the loop is executed O(n) times in total. canonize itself is called O(n) times, so the total time of canonize is O(n).

Applications - Exact String Matching
Input: two strings, a text T and a pattern P.
Output: all the occurrences of P in T.
This problem can be solved in O(|T| + |P|) time (Boyer-Moore, Knuth-Morris-Pratt).

We look at the case where we have a text T first, and then a sequence of patterns P_1, …, P_r.
This problem can be solved using suffix trees.
Preprocessing time: O(|T|).
Finding a pattern P: O(|P| + k), where k is the number of occurrences of P in T.

[Figure: suffix tree of an example text ending with the terminator #.]
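The preprocess-once, query-many idea can be illustrated without a full suffix tree. The sketch below swaps in a naive suffix trie (quadratic preprocessing time and space, not the linear bound stated above) that stores at every state the start positions of the suffixes passing through it; a query then costs O(|P|) plus the size of the answer. The names `build_index` and `occurrences` and the example text are assumptions of this sketch.

```python
def build_index(T: str) -> dict:
    """Naive suffix trie of T. Each state records the (0-based) start positions
    of all suffixes that pass through it, i.e. the occurrences of its string."""
    root = {"children": {}, "starts": []}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i)
    return root


def occurrences(index: dict, P: str) -> list:
    """All start positions of the non-empty pattern P, in increasing order."""
    node = index
    for ch in P:
        node = node["children"].get(ch)
        if node is None:
            return []          # P is not a substring of the text
    return node["starts"]


if __name__ == "__main__":
    index = build_index("abababab")           # preprocess the text once
    for pattern in ("aba", "bab", "abc"):     # then answer many queries
        print(pattern, occurrences(index, pattern))
```

In a real suffix tree the positions are not stored at every state; they are read off the leaves below the state reached by P, which is where the O(|P| + k) bound comes from.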

Applications in Biology: Finding Repeats in DNA
The DNA contains many repetitive sequences with different biological functions.
We want to find all maximal repeats in a DNA sequence.

ACCAGTTCGCGCATGAACGTTCGACCGGTTCGAT

Theorem: All maximal repeats in a sequence T can be found in O(|T|) time using suffix trees.

Lemma: If w is a maximal repeat in T, then the state w in STree(T) is explicit.
Proof: If w is a maximal repeat then there are at least two occurrences of w in T such that the character following w is different. Thus w is a branching state, and therefore it is explicit.

Corollary: There are at most O(|T|) maximal repeats in T.
Proof: By the above lemma, each maximal repeat corresponds to an explicit state. Since STree(T) has O(|T|) explicit states, T has O(|T|) maximal repeats.

Definition: The left character of a leaf t_i … t_n of STree(T) is t_{i-1}.
Definition: A node w of STree(T) is called left diverse if there are at least two leaves in w's subtree with different left characters.
Note that, by definition, a left diverse node is not a leaf.

Lemma: A substring w of T is a maximal repeat iff w is a left diverse explicit state in STree(T).
Proof:
1. Suppose w is a maximal repeat.
i. By the previous lemma w is explicit.
ii. There exist a ≠ b ∈ Σ such that aw and bw are substrings of T. Let awu and bwv be the corresponding suffixes. wu and wv are two leaves in the subtree of w with different left characters.

2. Suppose that w is explicit and left diverse.
[Figure: cases (i) and (ii) of the proof.]

Example: CAGCATAGC
[Figure: suffix tree of CAGCATAGC# with the left diverse (LD) nodes marked.]
The maximal repeats: ε, C, CA, A, AGC

Bibliography
E. Ukkonen, On-Line Construction of Suffix Trees
Dan Gusfield, Algorithms on Strings, Trees, and Sequences