On Suffix Tree Breadth

Similar documents
First Midterm Examination

Coalgebra, Lecture 15: Equations for Deterministic Automata

Farey Fractions. Rickard Fernström. U.U.D.M. Project Report 2017:24. Department of Mathematics Uppsala University

1 Nondeterministic Finite Automata

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true.

Homework Solution - Set 5 Due: Friday 10/03/08

Lecture 09: Myhill-Nerode Theorem

Riemann is the Mann! (But Lebesgue may besgue to differ.)

CS 275 Automata and Formal Language Theory

Dynamic Fully-Compressed Suffix Trees

First Midterm Examination

CS 330 Formal Methods and Models Dana Richards, George Mason University, Spring 2016 Quiz Solutions

CMPSCI 250: Introduction to Computation. Lecture #31: What DFA s Can and Can t Do David Mix Barrington 9 April 2014

Harvard University Computer Science 121 Midterm October 23, 2012

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4

Minimal DFA. minimal DFA for L starting from any other

Parse trees, ambiguity, and Chomsky normal form

Torsion in Groups of Integral Triangles

The size of subsequence automaton

Assignment 1 Automata, Languages, and Computability. 1 Finite State Automata and Regular Languages

More on automata. Michael George. March 24 April 7, 2014

Homework 3 Solutions

1 From NFA to regular expression

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.)

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

Bases for Vector Spaces

Recitation 3: More Applications of the Derivative

MTH 505: Number Theory Spring 2017

Designing finite automata II

Lecture 3. Introduction digital logic. Notes. Notes. Notes. Representations. February Bern University of Applied Sciences.

A recursive construction of efficiently decodable list-disjunct matrices

Lecture 08: Feb. 08, 2019

Chapter Five: Nondeterministic Finite Automata. Formal Language, chapter 5, slide 1

CS 275 Automata and Formal Language Theory

Let's start with an example:

CS 330 Formal Methods and Models

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /j.jda

p-adic Egyptian Fractions

5. (±±) Λ = fw j w is string of even lengthg [ 00 = f11,00g 7. (11 [ 00)± Λ = fw j w egins with either 11 or 00g 8. (0 [ ffl)1 Λ = 01 Λ [ 1 Λ 9.

UniversitaireWiskundeCompetitie. Problem 2005/4-A We have k=1. Show that for every q Q satisfying 0 < q < 1, there exists a finite subset K N so that

Finite-State Automata: Recap

Hamiltonian Cycle in Complete Multipartite Graphs

Nondeterminism and Nodeterministic Automata

Riemann Sums and Riemann Integrals

Theory of Computation Regular Languages

Lecture Note 9: Orthogonal Reduction

Quadratic Forms. Quadratic Forms

CM10196 Topic 4: Functions and Relations

1 Error Analysis of Simple Rules for Numerical Integration

Connected-components. Summary of lecture 9. Algorithms and Data Structures Disjoint sets. Example: connected components in graphs

Riemann Sums and Riemann Integrals

CIRCULAR COLOURING THE PLANE

Lecture 6. Notes. Notes. Notes. Representations Z A B and A B R. BTE Electronics Fundamentals August Bern University of Applied Sciences

Section 6: Area, Volume, and Average Value

Module 9: Tries and String Matching

Module 9: Tries and String Matching

Chapter 8.2: The Integral

Formal Languages and Automata

CHAPTER 1 Regular Languages. Contents

Finite Automata-cont d

Domino Recognizability of Triangular Picture Languages

Linear Inequalities. Work Sheet 1

Lecture 3 ( ) (translated and slightly adapted from lecture notes by Martin Klazar)

Math 61CM - Solutions to homework 9

Homework 4. 0 ε 0. (00) ε 0 ε 0 (00) (11) CS 341: Foundations of Computer Science II Prof. Marvin Nakayama

Theory of Computation Regular Languages. (NTU EE) Regular Languages Fall / 38

CS103 Handout 32 Fall 2016 November 11, 2016 Problem Set 7

The Regulated and Riemann Integrals

Anatomy of a Deterministic Finite Automaton. Deterministic Finite Automata. A machine so simple that you can understand it in less than one minute

Some Theory of Computation Exercises Week 1

CS103B Handout 18 Winter 2007 February 28, 2007 Finite Automata

Lecture 3: Equivalence Relations

1.4 Nonregular Languages

Matching patterns of line segments by eigenvector decomposition

CS 310 (sec 20) - Winter Final Exam (solutions) SOLUTIONS

Formal languages, automata, and theory of computation

Symbolic enumeration methods for unlabelled structures

Convert the NFA into DFA

38 Riemann sums and existence of the definite integral.

Section 6.1 INTRO to LAPLACE TRANSFORMS

2.4 Linear Inequalities and Interval Notation

USA Mathematical Talent Search Round 1 Solutions Year 21 Academic Year

Matrix Algebra. Matrix Addition, Scalar Multiplication and Transposition. Linear Algebra I 24

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

The area under the graph of f and above the x-axis between a and b is denoted by. f(x) dx. π O

Lecture 20: Numerical Integration III

set is not closed under matrix [ multiplication, ] and does not form a group.

Section 4: Integration ECO4112F 2011

Lab 11 Approximate Integration

Section 7.1 Area of a Region Between Two Curves

SWEN 224 Formal Foundations of Programming WITH ANSWERS

(e) if x = y + z and a divides any two of the integers x, y, or z, then a divides the remaining integer

1.3 Regular Expressions

Surface maps into free groups

Chapter 2 Finite Automata

Talen en Automaten Test 1, Mon 7 th Dec, h45 17h30

Fingerprint idea. Assume:

The maximal number of runs in standard Sturmian words

Solving the String Statistics Problem in Time O(n log n)

1 The Riemann Integral

Transcription:

On Suffix Tree Bredth Golnz Bdkoeh 1,, Juh Kärkkäinen 2, Simon J. Puglisi 2,, nd Bell Zhukov 2, 1 Deprtment of Computer Science University of Wrwick Conventry, United Kingdom g.dkoeh@wrwick.c.uk 2 Helsinki Institute for Informtion Technology Deprtment of Computer Science University of Helsinki Helsinki, Finlnd {juh.krkkinen,puglisi,zhukov}@cs.helsinki.fi Astrct. The suffix tree the compcted trie of ll the suffixes of string is the most importnt nd widely-used dt structure in string processing. We consider nturl comintoril question out suffix trees: for string S of length n, how mny nodes ν S(d) cn there e t (string) depth d in its suffix tree? We prove ν(n, d) = mx S Σ n ν S(d) is O((n/d) log n), nd show tht this ound is lmost tight, descriing strings for which ν S(d) is Ω((n/d) log(n/d)). 1 Introduction The suffix tree, T S, of string S of n symols is compcted trie contining ll the suffixes of S. Since its discovery y Weiner 44 yers go [6] s n optiml solution to the longest common sustring prolem the suffix tree hs emerged s perhps the most importnt strction in string processing [1], nd now hs dozens of pplictions, most notly in ioinformtics [5]. Consequently, comintoril properties of suffix trees re of gret interest, nd hve een exploited in vrious wys to otin fster construction lgorithms, succinct representtions, nd efficient pttern mtching nd discovery lgorithms. Our focus in this rticle is nturl comintoril question out suffix trees: how mny nodes ν S (d) cn there e t (string) depth d in the suffix tree of string S? We prove tht ν(n, d) = mx S Σ n ν S (d) is O((n/d) log n), nd show tht this ound is lmost tight, descriing strings for which ν S (d) is Ω((n/d) log(n/d)). In the following section we ly down nottion nd formlly define sic concepts. Section 3 nd Section 4 del with the upper ound nd lower ound in turn, nd we close with discussion of the results. Supported y Leverhulme Erly Creer Fellowship. Supported y the Acdemy of Finlnd vi grnt 294143.

2 G. Bdkoeh, J. Kärkkäinen, S. J. Puglisi, nd B. Zhukov 2 Preliminries Throughout we consider string S = S[1..n] = S[1]S[2]... S[n] of n symols drwn from n ordered lphet Σ of size σ. For i = 1,..., n we write S[i..n] to denote the suffix of S of length n i + 1, tht is S[i..n] = S[i]S[i + 1] S[n]. For convenience we will frequently refer to suffix S[i..n] simply s suffix i. The suffix tree of S is compct trie representing ll the suffixes of S. Every suffix tree node either represents suffix or is rnching node. Ech rnching node represents string tht occurs t lest twice in S nd hs t lest two distinct symols following those occurrences. The string depth or simply depth of node is the length of the string it represents. Figs. 1 nd 2 show exmples of suffix trees. The suffix rry of S, denoted SA, is n rry SA[1..n] which contins permuttion of the integers 1..n such tht S[SA[1]..n] < S[SA[2]..n] < < S[SA[n]..n]. In other words, SA[j] = i iff S[i..n] is the j th suffix of S in scending lexicogrphicl order. We use SA 1 to denote the inverse permuttion. For convenience, we lso define SA[] = n + 1 to represent the empty suffix. The lcp rry LCP = LCP[1..n] is n rry defined y S nd SA. Let lcp(i, j) denote the length of the longest common prefix of suffixes i nd j. For every j 1..n, LCP[j] = lcp(sa[j 1], SA[j]), tht is, LCP contins the length of the longest common prefix for ech pir of lexicogrphiclly djcent suffixes. The permuted lcp rry PLCP[1..n] hs the sme contents s LCP ut in different order. Specificlly, for every j 1..n, PLCP[SA[j]] = LCP[j]. (1) Then PLCP[i] = lcp(i, φ(i)) when we define φ(i) = SA[SA 1 [i] 1]. A inry de Bruijn sequence of order k, denoted y β k, is inry word of length 2 k +k 1 where ech of the 2 k words of length k over the inry lphet ppers s fctor exctly once. As n exmple, β 4 = is de Bruijn sequence of order 4, see Fig. 1. 3 Upper Bound We re interested in the quntity ν(n, d), which is the mximum numer of rnching nodes t depth d over ny string of length n. A trivil upper ound on ν(n, d) relevnt for shllow levels is ν(n, d) σ d for strings over n lphet of size σ. Another esy upper ound is ν(n, d) (n d)/2, since there re n d suffixes longer thn d nd ech rnching node t depth d must represent prefix of t lest two such suffixes. In prticulr, ν(2 k + k 1, k 1) = 2 k 1 since the upper ound is mtched y inry de Bruijn sequence of order k, s shown in Fig. 1.

On Suffix Tree Bredth 3 19 18 17 1 2 3 6 4 7 12 16 5 9 11 15 8 14 13 Fig. 1. The suffix tree of string β 4 =, the inry de Bruijn sequence of order 4. The dshed rectngle contins internl nodes t depth 3. Bsed on the ove, ν(n, d) increses with d up to level d log σ n nd then strts to go down. The min result of this section is much tighter upper ound for lrger d showing quick decrese fter level log n. Our upperound proof mkes use of the concept of irreducile lcp vlues, first defined in [3]. We sy tht PLCP[i] = lcp(i, φ(i)) is reducile if S[i 1] = S[φ(i) 1] nd irreducile otherwise. In prticulr, it is irreducile if i = 1 or φ(i) = 1. Reducile vlues re esy to compute vi the next lemm. Lemm 1 ([3]). If PLCP[i] is reducile, then PLCP[i] = PLCP[i 1] 1. We lso need n upper ound on the sum of irreducile lcp vlues. Lemm 2 ([3,2]). The sum of ll irreducile lcp vlues is n log n. Theorem 3. The numer of rnching nodes t depth d in the suffix tree for string of length n is (n/d) log n. Proof. Let S e string with ν(n, d) rnching nodes t depth d in the suffix tree of S. Every such rnching node corresponds to one or more vlues d in the lcp rry, ech of which in turn corresponds to position in the PLCP rry with vlue d. In other words, the numer of d s in the PLCP rry of S is n upper ound on ν S (d). Let i 1,..., i r e the positions of irreducile vlues in the PCLP rry in scending order, nd let i r+1 = n + 1. Since i 1 = 1, the intervls PLCP[i j..i j+1 1], j = 1..n, form prtitioning of the PLCP rry. Due to

4 G. Bdkoeh, J. Kärkkäinen, S. J. Puglisi, nd B. Zhukov 38 37 36 35 34 33 32 1 2 3 4 5 11 6 18 12 22 7 19 13 23 3 8 16 2 28 14 26 24 31 9 17 21 29 15 27 25 Fig. 2. The suffix tree of string W 2,4 =. The dshed rectngle contins internl nodes t depth 6. Lemm 1, for every j = 1..n, PLCP[i j..i j+1 1] contins t most one d nd only if PLCP[i j ] d. Therefore, ech occurrence of d cn e mpped to unique irreducile lcp vlue d. Using Lemm 2, dν(n, d) n log n so the upper ound follows. 4 Lower Bound This section is devoted to proving the following result. Theorem 4. For ny positive integers j 1 nd k 3, there ( exists string of length n = j(2 k + k 1) such tht its suffix tree hs 1 n 2 d 1) log ( n d 1) rnching nodes t depth d = j(k 1). Proof. Our proof is sed on construction of the following string, W j,k. Let β k e inry de Bruijn sequence of order k. Clerly, the suffix tree of β k is full up to depth k 1, nd hs 2 k 1 nodes t depth k 1. Now, let W j,k = w j (β k ) where morphism w j is the following { wj () = j w j () = j 1

On Suffix Tree Bredth 5 It is cler tht W j,k = n = j(2 k + k 1). Let m = ν Wj,k (j(k 1)) denote the numer of rnching nodes of the suffix tree of string W j,k t depth d = j(k 1). We clim tht m 2 k 1. If oth y nd y occur in β k for some string y, then oth x nd x1 occur in W i,j for x = w j (y). Thus every rnching node representing y in the suffix tree of β k is uniquely mpped to rnching node representing x = w j (y) in the suffix tree of W i,j. Since the suffix tree of β k hs 2 k 1 rnching nodes t depth k 1, the clim m 2 k 1 follows. Wht remins is to show the steps for the clcultion of the lower ound. Since m 2 k 1 nd d = j(k 1) = n(k 1) (2 k +k 1), we hve n d = 2k k 1 + 1 2m k 1 + 1, which implies m 1 ( n ) 2 d 1 (k 1) 1 ( n ) ( ) 2 k 2 d 1 log = 1 ( n ) ( n ) k 1 2 d 1 log d 1. 5 Discussion Notice tht the loweround construction implies d = Ω(log n); thus it does not contrdict the upper ounds for smll d discussed in Section 3. The upper nd lower ounds of Theorems 3 nd 4 re symptoticlly equl when d = O(n 1 ɛ ) for ny constnt ɛ >. We conjecture tht the lower ound is symptoticlly tight even for lrger d, ut proving mtching upper ound remins n open prolem. Essentilly the sme ounds hold for ll vrints nd generliztions. We hve counted only rnching nodes ut including leves (nd unry nodes representing suffixes) too would not chnge much s there cn e only one lef (or unry node) t ech level. Similrly, dding unique termintor symol to the end of the string dds t most one node per level. Considering suffix tree of multiple strings (contining ll suffixes of ll strings) could dd more lefs to level ut no more thn n/d lefs t level d; thus the symptotic results do not chnge. Another vrint considers the string to e cyclic replcing suffixes with rottions nd even suffix trees for collections of cyclic strings hve een considered [2,4]. All the results hold in this cse too: the key result for the upper ound, Lemm 2, ws proved for collections of cyclic strings [2], nd de Bruijn sequences re nturlly defined s cyclic strings. Finlly, notice tht Theorems 3 nd 4 hold for ny lphet size. References 1. Apostolico, A., Crochemore, M., Frch-Colton, M., Glil, Z., Muthukrishnn, S.: 4 yers of suffix trees. Commun. ACM 59(4), 66 73 (216) 2. Kärkkäinen, J., Kemp, D., Pitkowski, M.: Tighter ounds for the sum of irreducile LCP vlues. Theor. Comput. Sci. 656, 265 278 (216), https://doi.org/.16/j.tcs.215.12.9

6 G. Bdkoeh, J. Kärkkäinen, S. J. Puglisi, nd B. Zhukov 3. Kärkkäinen, J., Mnzini, G., Puglisi, S.J.: Permuted longest-common-prefix rry. In: Comintoril Pttern Mtching, 2th Annul Symposium, CPM 29. Lecture Notes in Computer Science, vol. 5577, pp. 181 192. Springer (29), https://doi. org/.7/978-3-642-2441-2_17 4. Kärkkäinen, J., Pitkowski, M., Puglisi, S.J.: String inference from longest-commonprefix rry. In: 44th Interntionl Colloquium on Automt, Lnguges, nd Progrmming, ICALP 217. pp. 62:1 62:14 (217), https://doi.org/.423/ LIPIcs.ICALP.217.62 5. Mäkinen, V., Belzzougui, D., Cunil, F., Tomescu, A.I.: Genome-Scle Algorithm Design: Biologicl Sequence Anlysis in the Er of High-Throughput Sequencing. Cmridge University Press (215), http://www.genome-scle.info/ 6. Weiner, P.: Liner pttern mtching lgorithms. In: 14th Annul Symposium on Switching nd Automt Theory. pp. 1 11. IEEE Computer Society (1973)