On Suffix Tree Breadth - PDF Free Download

On Suffix Tree Bredth Golnz Bdkoeh 1,, Juh Kärkkäinen 2, Simon J. Puglisi 2,, nd Bell Zhukov 2, 1 Deprtment of Computer Science University of Wrwick Conventry, United Kingdom g.dkoeh@wrwick.c.uk 2 Helsinki Institute for Informtion Technology Deprtment of Computer Science University of Helsinki Helsinki, Finlnd {juh.krkkinen,puglisi,zhukov}@cs.helsinki.fi Astrct. The suffix tree the compcted trie of ll the suffixes of string is the most importnt nd widely-used dt structure in string processing. We consider nturl comintoril question out suffix trees: for string S of length n, how mny nodes ν S(d) cn there e t (string) depth d in its suffix tree? We prove ν(n, d) = mx S Σ n ν S(d) is O((n/d) log n), nd show tht this ound is lmost tight, descriing strings for which ν S(d) is Ω((n/d) log(n/d)). 1 Introduction The suffix tree, T S, of string S of n symols is compcted trie contining ll the suffixes of S. Since its discovery y Weiner 44 yers go [6] s n optiml solution to the longest common sustring prolem the suffix tree hs emerged s perhps the most importnt strction in string processing [1], nd now hs dozens of pplictions, most notly in ioinformtics [5]. Consequently, comintoril properties of suffix trees re of gret interest, nd hve een exploited in vrious wys to otin fster construction lgorithms, succinct representtions, nd efficient pttern mtching nd discovery lgorithms. Our focus in this rticle is nturl comintoril question out suffix trees: how mny nodes ν S (d) cn there e t (string) depth d in the suffix tree of string S? We prove tht ν(n, d) = mx S Σ n ν S (d) is O((n/d) log n), nd show tht this ound is lmost tight, descriing strings for which ν S (d) is Ω((n/d) log(n/d)). In the following section we ly down nottion nd formlly define sic concepts. Section 3 nd Section 4 del with the upper ound nd lower ound in turn, nd we close with discussion of the results. Supported y Leverhulme Erly Creer Fellowship. Supported y the Acdemy of Finlnd vi grnt 294143.

2 G. Bdkoeh, J. Kärkkäinen, S. J. Puglisi, nd B. Zhukov 2 Preliminries Throughout we consider string S = S[1..n] = S[1]S[2]... S[n] of n symols drwn from n ordered lphet Σ of size σ. For i = 1,..., n we write S[i..n] to denote the suffix of S of length n i + 1, tht is S[i..n] = S[i]S[i + 1] S[n]. For convenience we will frequently refer to suffix S[i..n] simply s suffix i. The suffix tree of S is compct trie representing ll the suffixes of S. Every suffix tree node either represents suffix or is rnching node. Ech rnching node represents string tht occurs t lest twice in S nd hs t lest two distinct symols following those occurrences. The string depth or simply depth of node is the length of the string it represents. Figs. 1 nd 2 show exmples of suffix trees. The suffix rry of S, denoted SA, is n rry SA[1..n] which contins permuttion of the integers 1..n such tht S[SA[1]..n] < S[SA[2]..n] < < S[SA[n]..n]. In other words, SA[j] = i iff S[i..n] is the j th suffix of S in scending lexicogrphicl order. We use SA 1 to denote the inverse permuttion. For convenience, we lso define SA[] = n + 1 to represent the empty suffix. The lcp rry LCP = LCP[1..n] is n rry defined y S nd SA. Let lcp(i, j) denote the length of the longest common prefix of suffixes i nd j. For every j 1..n, LCP[j] = lcp(sa[j 1], SA[j]), tht is, LCP contins the length of the longest common prefix for ech pir of lexicogrphiclly djcent suffixes. The permuted lcp rry PLCP[1..n] hs the sme contents s LCP ut in different order. Specificlly, for every j 1..n, PLCP[SA[j]] = LCP[j]. (1) Then PLCP[i] = lcp(i, φ(i)) when we define φ(i) = SA[SA 1 [i] 1]. A inry de Bruijn sequence of order k, denoted y β k, is inry word of length 2 k +k 1 where ech of the 2 k words of length k over the inry lphet ppers s fctor exctly once. As n exmple, β 4 = is de Bruijn sequence of order 4, see Fig. 1. 3 Upper Bound We re interested in the quntity ν(n, d), which is the mximum numer of rnching nodes t depth d over ny string of length n. A trivil upper ound on ν(n, d) relevnt for shllow levels is ν(n, d) σ d for strings over n lphet of size σ. Another esy upper ound is ν(n, d) (n d)/2, since there re n d suffixes longer thn d nd ech rnching node t depth d must represent prefix of t lest two such suffixes. In prticulr, ν(2 k + k 1, k 1) = 2 k 1 since the upper ound is mtched y inry de Bruijn sequence of order k, s shown in Fig. 1.

On Suffix Tree Bredth 3 19 18 17 1 2 3 6 4 7 12 16 5 9 11 15 8 14 13 Fig. 1. The suffix tree of string β 4 =, the inry de Bruijn sequence of order 4. The dshed rectngle contins internl nodes t depth 3. Bsed on the ove, ν(n, d) increses with d up to level d log σ n nd then strts to go down. The min result of this section is much tighter upper ound for lrger d showing quick decrese fter level log n. Our upperound proof mkes use of the concept of irreducile lcp vlues, first defined in [3]. We sy tht PLCP[i] = lcp(i, φ(i)) is reducile if S[i 1] = S[φ(i) 1] nd irreducile otherwise. In prticulr, it is irreducile if i = 1 or φ(i) = 1. Reducile vlues re esy to compute vi the next lemm. Lemm 1 ([3]). If PLCP[i] is reducile, then PLCP[i] = PLCP[i 1] 1. We lso need n upper ound on the sum of irreducile lcp vlues. Lemm 2 ([3,2]). The sum of ll irreducile lcp vlues is n log n. Theorem 3. The numer of rnching nodes t depth d in the suffix tree for string of length n is (n/d) log n. Proof. Let S e string with ν(n, d) rnching nodes t depth d in the suffix tree of S. Every such rnching node corresponds to one or more vlues d in the lcp rry, ech of which in turn corresponds to position in the PLCP rry with vlue d. In other words, the numer of d s in the PLCP rry of S is n upper ound on ν S (d). Let i 1,..., i r e the positions of irreducile vlues in the PCLP rry in scending order, nd let i r+1 = n + 1. Since i 1 = 1, the intervls PLCP[i j..i j+1 1], j = 1..n, form prtitioning of the PLCP rry. Due to

4 G. Bdkoeh, J. Kärkkäinen, S. J. Puglisi, nd B. Zhukov 38 37 36 35 34 33 32 1 2 3 4 5 11 6 18 12 22 7 19 13 23 3 8 16 2 28 14 26 24 31 9 17 21 29 15 27 25 Fig. 2. The suffix tree of string W 2,4 =. The dshed rectngle contins internl nodes t depth 6. Lemm 1, for every j = 1..n, PLCP[i j..i j+1 1] contins t most one d nd only if PLCP[i j ] d. Therefore, ech occurrence of d cn e mpped to unique irreducile lcp vlue d. Using Lemm 2, dν(n, d) n log n so the upper ound follows. 4 Lower Bound This section is devoted to proving the following result. Theorem 4. For ny positive integers j 1 nd k 3, there ( exists string of length n = j(2 k + k 1) such tht its suffix tree hs 1 n 2 d 1) log ( n d 1) rnching nodes t depth d = j(k 1). Proof. Our proof is sed on construction of the following string, W j,k. Let β k e inry de Bruijn sequence of order k. Clerly, the suffix tree of β k is full up to depth k 1, nd hs 2 k 1 nodes t depth k 1. Now, let W j,k = w j (β k ) where morphism w j is the following { wj () = j w j () = j 1

On Suffix Tree Bredth 5 It is cler tht W j,k = n = j(2 k + k 1). Let m = ν Wj,k (j(k 1)) denote the numer of rnching nodes of the suffix tree of string W j,k t depth d = j(k 1). We clim tht m 2 k 1. If oth y nd y occur in β k for some string y, then oth x nd x1 occur in W i,j for x = w j (y). Thus every rnching node representing y in the suffix tree of β k is uniquely mpped to rnching node representing x = w j (y) in the suffix tree of W i,j. Since the suffix tree of β k hs 2 k 1 rnching nodes t depth k 1, the clim m 2 k 1 follows. Wht remins is to show the steps for the clcultion of the lower ound. Since m 2 k 1 nd d = j(k 1) = n(k 1) (2 k +k 1), we hve n d = 2k k 1 + 1 2m k 1 + 1, which implies m 1 ( n ) 2 d 1 (k 1) 1 ( n ) ( ) 2 k 2 d 1 log = 1 ( n ) ( n ) k 1 2 d 1 log d 1. 5 Discussion Notice tht the loweround construction implies d = Ω(log n); thus it does not contrdict the upper ounds for smll d discussed in Section 3. The upper nd lower ounds of Theorems 3 nd 4 re symptoticlly equl when d = O(n 1 ɛ ) for ny constnt ɛ >. We conjecture tht the lower ound is symptoticlly tight even for lrger d, ut proving mtching upper ound remins n open prolem. Essentilly the sme ounds hold for ll vrints nd generliztions. We hve counted only rnching nodes ut including leves (nd unry nodes representing suffixes) too would not chnge much s there cn e only one lef (or unry node) t ech level. Similrly, dding unique termintor symol to the end of the string dds t most one node per level. Considering suffix tree of multiple strings (contining ll suffixes of ll strings) could dd more lefs to level ut no more thn n/d lefs t level d; thus the symptotic results do not chnge. Another vrint considers the string to e cyclic replcing suffixes with rottions nd even suffix trees for collections of cyclic strings hve een considered [2,4]. All the results hold in this cse too: the key result for the upper ound, Lemm 2, ws proved for collections of cyclic strings [2], nd de Bruijn sequences re nturlly defined s cyclic strings. Finlly, notice tht Theorems 3 nd 4 hold for ny lphet size. References 1. Apostolico, A., Crochemore, M., Frch-Colton, M., Glil, Z., Muthukrishnn, S.: 4 yers of suffix trees. Commun. ACM 59(4), 66 73 (216) 2. Kärkkäinen, J., Kemp, D., Pitkowski, M.: Tighter ounds for the sum of irreducile LCP vlues. Theor. Comput. Sci. 656, 265 278 (216), https://doi.org/.16/j.tcs.215.12.9

6 G. Bdkoeh, J. Kärkkäinen, S. J. Puglisi, nd B. Zhukov 3. Kärkkäinen, J., Mnzini, G., Puglisi, S.J.: Permuted longest-common-prefix rry. In: Comintoril Pttern Mtching, 2th Annul Symposium, CPM 29. Lecture Notes in Computer Science, vol. 5577, pp. 181 192. Springer (29), https://doi. org/.7/978-3-642-2441-2_17 4. Kärkkäinen, J., Pitkowski, M., Puglisi, S.J.: String inference from longest-commonprefix rry. In: 44th Interntionl Colloquium on Automt, Lnguges, nd Progrmming, ICALP 217. pp. 62:1 62:14 (217), https://doi.org/.423/ LIPIcs.ICALP.217.62 5. Mäkinen, V., Belzzougui, D., Cunil, F., Tomescu, A.I.: Genome-Scle Algorithm Design: Biologicl Sequence Anlysis in the Er of High-Throughput Sequencing. Cmridge University Press (215), http://www.genome-scle.info/ 6. Weiner, P.: Liner pttern mtching lgorithms. In: 14th Annul Symposium on Switching nd Automt Theory. pp. 1 11. IEEE Computer Society (1973)