On-Line Construction of Suffix Trees
E. Ukkonen

Overview
Suffix tries
On-line construction of suffix tries in quadratic time
Suffix trees
On-line construction of suffix trees in linear time
Applications

Suffix Trees
A suffix tree is a trie-like data structure representing all suffixes of a string.

Notations
Let T = t_1 ... t_n be a string.
For 0 ≤ i ≤ n, let T^i = t_1 ... t_i denote the i-length prefix of T.
For 1 ≤ i ≤ n + 1, let T_i = t_i ... t_n denote the suffix of T that starts at the i-th position.
Let σ(T) = {T_i | 1 ≤ i ≤ n + 1}.
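The notation can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper names `prefix`, `suffix` and `sigma` are hypothetical, and the example string "cacao" is an assumption chosen for illustration.

```python
def prefix(t, i):
    """T^i = t_1 ... t_i, for 0 <= i <= n (1-based positions)."""
    return t[:i]


def suffix(t, i):
    """T_i = t_i ... t_n, for 1 <= i <= n + 1; T_{n+1} is the empty string."""
    return t[i - 1:]


def sigma(t):
    """The set of all suffixes of t, including the empty suffix."""
    return {suffix(t, i) for i in range(1, len(t) + 2)}


if __name__ == "__main__":
    print(prefix("cacao", 3))            # the 3-length prefix
    print(sorted(sigma("cacao"), key=len))
```

Note that σ(T) has n + 1 elements, because the empty suffix T_{n+1} = ε is included.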
Suffix Tries
The suffix trie of T, denoted by STrie(T), is a trie representing σ(T).

Suffix Tries (cont.)
Definition: STrie(T) is an augmented DFA, STrie(T) = (Q ∪ {⊥}, root, F, g, f), where:
Q = {x | x is a substring of T} is the set of the states of the DFA.
⊥ is an auxiliary state.
root is the initial state, corresponding to the empty string ε.
F = σ(T) is the set of final states.

Suffix Tries (cont.)
g : (Q ∪ {⊥}) × Σ → Q (a partial function) is the transition function, defined as follows:
g(x, a) = y for all x, y ∈ Q and a ∈ Σ s.t. y = xa.
g(⊥, a) = root for all a ∈ Σ.
f : Q → Q ∪ {⊥} is the suffix function, defined as follows:
f(x) = y for all x, y ∈ Q, x ≠ root, s.t. ∃ a ∈ Σ s.t. x = ay.
f(root) = ⊥.

An Example
STrie(cacao)
[Figure: the suffix trie of the example string with its suffix links; ⊥ has a Σ-transition to the root ε.]
The Size of Suffix Tries
Theorem: The size of STrie(T), where |T| = n, is O(n^2).
Proof: The size of STrie(T) is linear in the number of substrings of T. T has at most O(n^2) substrings. Thus the size of STrie(T) is O(n^2).

On-Line Construction of Suffix Tries
Let T = t_1 ... t_n.
∀ 1 ≤ i ≤ n, the algorithm constructs STrie(T^i).
First we construct STrie(T^0) = STrie(ε).
Then, ∀ 1 ≤ i ≤ n, we obtain STrie(T^i) from STrie(T^{i-1}).

On-Line Construction of Suffix Tries (cont.)
Observation 1: σ(T^i) = {x t_i | x ∈ σ(T^{i-1})} ∪ {ε}.
Observation 2: The suffixes of T^i can be found by starting at the state T^i and following the suffix links until ε is reached. Thus, σ(T^i) = {f^j(T^i) | 0 ≤ j ≤ i}.
Definition: The path from T^i to ⊥ following the suffix links is called the boundary path of STrie(T^i).

On-Line Construction of Suffix Tries (cont.)
[Figure: a suffix trie with its boundary path; ⊥ has a Σ-transition to the root ε.]
[Figure: STrie(T^{i-1}) and STrie(T^i) side by side.]

The Algorithm
create STrie(ε)
top ← ε
for i ← 1 to n do
    r ← top
    while g(r, t_i) is undefined do
        create new state r' and g(r, t_i) ← r'
        if r ≠ top then f(old-r') ← r'
        old-r' ← r'
        r ← f(r)
    f(old-r') ← g(r, t_i)
    top ← g(top, t_i)

Running Time
Theorem: The running time of the algorithm is linear in the size of STrie(T), which is, in the worst case, O(|T|^2).
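The suffix trie algorithm above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: states are represented as integers, `BOT` stands for ⊥, and the names `build_suffix_trie`, `step` and `old_new` are hypothetical. The rule g(⊥, a) = root for every a ∈ Σ is handled as a special case instead of storing one transition per letter.

```python
BOT, ROOT = 0, 1  # the auxiliary state and the root (empty string)


def build_suffix_trie(text):
    """Return (g, f): g[s] maps a letter to the next state, f[s] is the suffix link."""
    g = [{}, {}]
    f = [None, BOT]                      # f(root) = BOT

    def step(r, c):
        # g(BOT, c) = ROOT for every letter c; otherwise look the transition up.
        return ROOT if r == BOT else g[r].get(c)

    top = ROOT                           # the state of the whole prefix read so far
    for c in text:
        r = top
        old_new = None
        while step(r, c) is None:        # walk the boundary path
            new = len(g)                 # create new state r'
            g.append({})
            f.append(None)
            g[r][c] = new
            if r != top:
                f[old_new] = new         # link the previously created state
            old_new = new
            r = f[r]
        f[old_new] = step(r, c)
        top = g[top][c]
    return g, f


if __name__ == "__main__":
    g, f = build_suffix_trie("cacao")
    print(len(g) - 1, "states besides the auxiliary state")
```

Each iteration of the inner loop creates exactly one state, which is the observation behind the running-time theorem: the work is proportional to the number of states, that is, to the number of distinct substrings.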
Running Time (cont.)
In the algorithm, each execution of the body of the while loop creates one new state. The work is therefore O(1) for each node added to STrie(T).

Suffix Trees
A suffix tree STree(T) represents STrie(T) in space linear in |T|.
This is achieved by representing only a subset Q' ∪ {⊥} of Q ∪ {⊥}, called the explicit states.

Explicit and Implicit States
Definition: A state q is called explicit in the following cases:
q is a leaf.
q is a branching state (has at least two transitions).
root and ⊥ are also defined to be branching states.
Otherwise (if q has exactly one transition and is not the root or ⊥), q is called implicit.

Explicit and Implicit States (cont.)
[Figure: a suffix trie with its explicit and implicit states marked.]
Generalized Transition Function
The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as a generalized transition g'(s, w) = r.
A generalized transition g'(s, w) = r is called an a-transition if a ∈ Σ and ∃ v ∈ Σ* s.t. w = av.
Note that for each explicit state s and a ∈ Σ there is at most one a-transition from s.

[Figures: STrie(T) next to the corresponding STree(T).]
Suffix Links
Definition: If x ∈ Q' is a branching state and x = ay, where a ∈ Σ, then the suffix link of x is defined by f'(x) = y, and f'(ε) = ⊥.
Proposition: If x ∈ Q' is a branching state and f'(x) = y, then y is also a branching state.
Proof: ∃ a ≠ b ∈ Σ s.t. xa and xb are substrings of T. y is a suffix of x. Thus ya and yb are also substrings of T.

STree(T)
STree(T) = (Q' ∪ {⊥}, root, g', f').

The Size of Suffix Trees
Theorem: The size of STree(T), where |T| = n, is O(n).
Proof: Since we represent each substring w = t_k ... t_p of T by a pair of pointers (k, p), the size of STree(T) is linear in the number of explicit states. STree(T) has at most n leaves, and thus at most n - 1 branching states. Therefore, the size of STree(T) is O(n).

Reference Pairs
Definition: Let r be an explicit or implicit state. (s, w) is called a reference pair for r if:
s is an explicit state and an ancestor of r.
w is the string spelled out by the transitions from s to r in the corresponding suffix trie.
Definition: A reference pair (s, w) for r is called canonical if s is the closest explicit ancestor of r (or r itself, if it is explicit).
Active Point and Endpoint
Let s_1 = T^{i-1}, s_2, ..., s_i = root, s_{i+1} = ⊥ be the boundary path of STrie(T^{i-1}).
Definition: s_j is called the active point of STrie(T^{i-1}) if j is the smallest index for which s_j is not a leaf.
Definition: s_{j'} is called the endpoint of STrie(T^{i-1}) if j' is the smallest index for which s_{j'} has a t_i-transition.
[Figure: a boundary path with the active point and the endpoint marked.]

Active Point and Endpoint (cont.)
Proposition: s_j and s_{j'} are well defined and j ≤ j'.
Proof:
root is not a leaf ⇒ s_j is defined.
g(⊥, t_i) is defined ⇒ s_{j'} is defined.
g(s_{j'}, t_i) is defined ⇒ s_{j'} is not a leaf ⇒ j ≤ j'.

Adding t_i-Transitions to STrie(T^{i-1})
Lemma: When obtaining STrie(T^i) from STrie(T^{i-1}), the algorithm adds a t_i-transition to each state s_h s.t. 1 ≤ h < j', and only to these states, as follows:
For 1 ≤ h < j, the new transition expands an old branch of the trie that ends at s_h.
For j ≤ h < j', the new transition initiates a new branch from s_h.
Adding t_i-Transitions to STrie(T^{i-1}) (cont.)
[Figure: the boundary path with the active point and the endpoint marked.]

On-Line Construction of Suffix Trees
We create STree(ε), and then ∀ 1 ≤ i ≤ n we obtain STree(T^i) from STree(T^{i-1}).
When obtaining STree(T^i) from STree(T^{i-1}), we update STree(T^{i-1}) according to the transitions we would add to STrie(T^{i-1}).
Note that s_1, ..., s_{i-1} are not necessarily explicit states.

On-Line Construction of Suffix Trees (cont.)
For 1 ≤ h < j:
s_h is a leaf. Thus, ∃ s, 0 ≤ k ≤ i-1 s.t. g'(s, (k, i-1)) = s_h.
We replace this transition by g'(s, (k, i)) = s_h.
This would take too much time. Thus, we denote transitions of the type g'(s, (k, i-1)) in STree(T^{i-1}) by g'(s, (k, ∞)). Hence, no updates are needed.

On-Line Construction of Suffix Trees (cont.)
For j ≤ h < j':
If s_h is an implicit state, we turn it into an explicit state by splitting the transition containing it.
We create a new leaf s_h t_i and add a new transition g'(s_h, (i, ∞)) = s_h t_i.
On-Line Construction of Suffix Trees (cont.)
[Figure: successive trees with the active point (AP) and the endpoint (EP) marked.]

Lemma 1
Lemma 1: Let (s, (k, p)) be some reference pair for a state r. Then ∃ s', k' s.t. (s', (k', p)) is the canonical reference pair for r.
Proof: Let s' be the closest explicit ancestor of r, or r itself if r is explicit. t_k ... t_p is the path from the explicit state s to r. Thus, the path from s' to r is a suffix t_{k'} ... t_p of t_k ... t_p.

Lemma 2
Lemma 2: Let r be a state on the boundary path of STrie(T^i). Then ∃ s, k s.t. (s, (k, i)) is the canonical reference pair for r.
Proof: r is on the boundary path of STrie(T^i) ⇒ r refers to some suffix t_k ... t_i of T^i ⇒ (ε, (k, i)) is a reference pair for r ⇒ the claim holds by Lemma 1.

Lemma 3
Lemma 3: Let (s, (k, i-1)) be a reference pair for the endpoint of STrie(T^{i-1}). Then (s, (k, i)) is a reference pair for the active point of STrie(T^i).
Proof: s_j is the active point of STrie(T^{i-1}) iff t_j ... t_{i-1} is the longest suffix of T^{i-1} that occurs at least twice in T^{i-1}.
Lemma 3 (cont.)
Proof (cont.): s_{j'} is the endpoint of STrie(T^{i-1}) iff t_{j'} ... t_{i-1} is the longest suffix of T^{i-1} such that t_{j'} ... t_{i-1} t_i is a substring of T^{i-1}.
Thus, if s_{j'} is the endpoint of STrie(T^{i-1}), then t_{j'} ... t_{i-1} t_i is the longest suffix of T^i that occurs at least twice in T^i.
Therefore, s_{j'} t_i is the active point of STrie(T^i).

The Algorithm
create STree(ε)
s ← root
k ← 1
for i ← 1 to n do
    (s, k) ← update(s, (k, i))
    (s, k) ← canonize(s, (k, i))

update: transforms STree(T^{i-1}) into STree(T^i).
Input: (s, (k, i)) s.t. (s, (k, i-1)) is the canonical reference pair for the active point of STrie(T^{i-1}).
Output: (s', k') s.t. (s', (k', i-1)) is a reference pair for the endpoint of STrie(T^{i-1}).

canonize:
Input: a reference pair (s, (k, p)) for some state r.
Output: (s', k') s.t. (s', (k', p)) is the canonical reference pair for r.

test-and-split:
Input: the canonical reference pair for some state r, and t_i.
Output: true or false according to whether r is the endpoint, and the explicit state r (creating it if needed).

update(s, (k, i))
    old-r ← root
    (endpoint, r) ← test-and-split(s, (k, i-1), t_i)
    while not endpoint do
        create new state r'; g'(r, (i, ∞)) ← r'
        if old-r ≠ root then f'(old-r) ← r
        old-r ← r
        (s, k) ← canonize(f'(s), (k, i-1))
        (endpoint, r) ← test-and-split(s, (k, i-1), t_i)
    if old-r ≠ root then f'(old-r) ← s
    return (s, k)

[Figure: step-by-step run of update on an example, showing the transitions and the values of s, k and i.]
test-and-split(s, (k, p), t)
    if k ≤ p then
        find the t_k-transition g'(s, (k', p')) = s' from s
        if t = t_{k'+p-k+1} then return (true, s)
        else
            create new state r
            replace g'(s, (k', p')) = s' by g'(s, (k', k'+p-k)) = r and g'(r, (k'+p-k+1, p')) = s'
            return (false, r)
    else
        if there is no t-transition from s then return (false, s)
        else return (true, s)

canonize(s, (k, p))
    if p < k then return (s, k)
    find the t_k-transition g'(s, (k', p')) = s' from s
    while p' - k' ≤ p - k do
        k ← k + p' - k' + 1
        s ← s'
        if k ≤ p then find the t_k-transition g'(s, (k', p')) = s' from s
    return (s, k)

Running Time
Theorem: The running time of the algorithm is O(n).
Proof: We divide the running time into two components:
1. The total time of the procedure canonize.
2. The rest.

Running Time (cont.): update
update is called n times.
Each call to test-and-split takes O(1) time.
In each execution of the while loop, a new state is created. Since STree(T) has O(n) states, the loop is executed O(n) times in total.
Running Time (cont.): canonize
canonize is called O(n) times.
In each execution of the while loop, the value of k increases. Since k never decreases and never exceeds n + 1, the loop is executed O(n) times in total.

Applications - Exact String Matching
Input: two strings, a text T and a pattern P.
Output: all the occurrences of P in T.
This problem can be solved in O(|T| + |P|) time (Boyer-Moore, Knuth-Morris-Pratt).

Applications - Exact String Matching (cont.)
We look at the case where we have a text T first, and then a sequence of patterns P_1, ..., P_r.
This problem can be solved using suffix trees.
Preprocessing time: O(|T|).
Finding a pattern P: O(|P| + k), where k is the number of occurrences of P in T.

Applications - Exact String Matching (cont.)
[Figure: the suffix tree of an example text over {a, b} terminated by #, with a pattern search marked.]
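The procedures update, test-and-split and canonize, together with the pattern search described above, can be sketched in Python as follows. This is a sketch under stated assumptions rather than a definitive implementation: the class name `SuffixTree`, the integer state numbering, the `edge` helper and the `occurrences` method are hypothetical, ∞ is represented by `math.inf`, and the transitions from ⊥ are handled as a special case. The search assumes that the text ends with a terminator that occurs nowhere else, so that every suffix ends at a leaf.

```python
import math

BOT, ROOT = 0, 1  # the auxiliary state and the root


class SuffixTree:
    def __init__(self, text):
        self.t = "\0" + text        # pad so that t[1..n] matches the 1-based notation
        self.n = len(text)
        self.g = [{}, {}]           # g[s][c] = (k, p, r): transition s --t[k..p]--> r
        self.f = [None, BOT]        # suffix links, f'(root) = BOT
        s, k = ROOT, 1
        for i in range(1, self.n + 1):
            s, k = self.update(s, k, i)
            s, k = self.canonize(s, k, i)

    def new_state(self):
        self.g.append({})
        self.f.append(None)
        return len(self.g) - 1

    def edge(self, s, k):
        # BOT has a one-letter transition to the root for every letter.
        return (k, k, ROOT) if s == BOT else self.g[s][self.t[k]]

    def canonize(self, s, k, p):
        if p < k:
            return s, k
        k2, p2, s2 = self.edge(s, k)
        while p2 - k2 <= p - k:
            k, s = k + p2 - k2 + 1, s2
            if k <= p:
                k2, p2, s2 = self.edge(s, k)
        return s, k

    def test_and_split(self, s, k, p, c):
        if k <= p:
            k2, p2, s2 = self.edge(s, k)
            if c == self.t[k2 + p - k + 1]:
                return True, s
            r = self.new_state()    # make the implicit state explicit
            self.g[s][self.t[k2]] = (k2, k2 + p - k, r)
            self.g[r][self.t[k2 + p - k + 1]] = (k2 + p - k + 1, p2, s2)
            return False, r
        return (s == BOT or c in self.g[s]), s

    def update(self, s, k, i):
        c, old_r = self.t[i], ROOT
        end_point, r = self.test_and_split(s, k, i - 1, c)
        while not end_point:
            self.g[r][c] = (i, math.inf, self.new_state())  # open leaf transition
            if old_r != ROOT:
                self.f[old_r] = r
            old_r = r
            s, k = self.canonize(self.f[s], k, i - 1)
            end_point, r = self.test_and_split(s, k, i - 1, c)
        if old_r != ROOT:
            self.f[old_r] = s
        return s, k

    def occurrences(self, pattern):
        """Sorted 1-based start positions of pattern in the text."""
        s, depth = ROOT, 0
        while depth < len(pattern):          # walk down along the pattern
            if pattern[depth] not in self.g[s]:
                return []
            k, p, s = self.g[s][pattern[depth]]
            p = min(p, self.n)
            part = pattern[depth:depth + p - k + 1]
            if self.t[k:k + len(part)] != part:
                return []
            depth += p - k + 1               # string depth of the state reached
        result, stack = [], [(s, depth)]
        while stack:                         # collect the leaves below that state
            s, depth = stack.pop()
            if not self.g[s]:
                result.append(self.n - depth + 1)
            for k, p, r in self.g[s].values():
                stack.append((r, depth + min(p, self.n) - k + 1))
        return sorted(result)


if __name__ == "__main__":
    tree = SuffixTree("cacao#")
    print(tree.occurrences("ca"))  # [1, 3]
```

The walk down the tree inspects each pattern letter once, and the leaf collection visits a subtree whose size is proportional to its number of leaves, which is where the O(|P| + k) search bound comes from. The final `sorted` call is only for readable output and is not part of that bound.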
Applications in Biology
The DNA contains many repetitive sequences with different biological functions.

Finding Repeats in DNA
We want to find all maximal repeats in a DNA sequence.
ACCAGTTCGCGCATGAACGTTCGACCGGTTCGAT

Finding Repeats in DNA (cont.)
Theorem: All maximal repeats in a sequence T can be found in O(|T|) time using suffix trees.

Finding Repeats in DNA (cont.)
Lemma: If w is a maximal repeat in T, then the state w in STree(T) is explicit.
Proof: If w is a maximal repeat then there are at least two occurrences of w in T s.t. the character following w is different. Thus w is a branching state, and therefore it is explicit.
Finding Repeats in DNA (cont.)
Corollary: There are at most O(|T|) maximal repeats in T.
Proof: By the above lemma, each maximal repeat corresponds to an explicit state. Since STree(T) has O(|T|) explicit states, T has O(|T|) maximal repeats.

Finding Repeats in DNA (cont.)
Definition: The left character of a leaf t_i ... t_n of STree(T) is t_{i-1}.
Definition: A node w of STree(T) is called left diverse if there are at least two leaves in w's subtree with different left characters.
Note that, by definition, a left diverse node is not a leaf.

Finding Repeats in DNA (cont.)
Lemma: A substring w of T is a maximal repeat iff w is a left diverse explicit state in STree(T).

Finding Repeats in DNA (cont.)
Proof:
1. Suppose w is a maximal repeat.
   i. By the previous lemma w is explicit.
   ii. ∃ a ≠ b ∈ Σ s.t. aw and bw are substrings of T. Let awu and bwv be the corresponding suffixes. wu and wv are two leaves in the subtree of w with different left characters.
Finding Repeats in DNA (cont.)
2. Suppose that w is explicit and left diverse.
[Figure: the two cases (i) and (ii), showing the occurrences of w with their left and right characters.]

Finding Repeats in DNA (cont.)
Example: CAGCATAGC
[Figure: the suffix tree of CAGCATAGC#, with the left characters at the leaves and the left diverse (LD) nodes marked.]
The maximal repeats: ε, C, CA, A, AGC

Bibliography
E. Ukkonen, On-Line Construction of Suffix Trees.
Dan Gusfield, Algorithms on Strings, Trees, and Sequences.