Forbidden Patterns

Johannes Fischer (1), Travis Gagie (2), Tsvi Kopelowitz (3), Moshe Lewenstein (4), Veli Mäkinen (5), Leena Salmela (5), and Niko Välimäki (5)

1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu
2 Aalto University, Espoo, Finland, travis.gagie@aalto.fi
3 Weizmann Institute of Science, Rehovot, Israel, kopelot@gmail.com
4 Bar-Ilan University, Ramat Gan, Israel, moshe@cs.biu.ac.il
5 University of Helsinki, Helsinki, Finland, {vmakinen|leena.salmela|nvalimak}@cs.helsinki.fi

Funding and affiliation notes: Supported by the German Research Foundation (DFG). Supported by Academy of Finland grant. Partially funded by the Academy of Finland grant. Also affiliated with Helsinki Institute for Information Technology (HIIT). Supported by Academy of Finland grant (ALGODAN). Partially funded by the Academy of Finland grant (ALGODAN) and Helsinki Doctoral Programme in Computer Science (Hecse).

Abstract. We consider the problem of indexing a collection of documents (a.k.a. strings) of total length $n$ such that the following kind of queries are supported: given two patterns $P^+$ and $P^-$, list all $n_{match}$ documents containing $P^+$ but not $P^-$. This is a natural extension of the classic problem of document listing as considered by Muthukrishnan [SODA '02], where only the positive pattern $P^+$ is given. Our main solution is an index of size $O(n^{3/2})$ bits that supports queries in $O(|P^+| + |P^-| + n_{match} + \sqrt{n})$ time.

1 Introduction

In recent years, the pattern matching community has paid a considerable amount of attention to document retrieval tasks. In contrast to traditional indexed pattern matching, the task is to output each document that contains a search pattern at least once, without spending time proportional to the total number of occurrences of the pattern in that document. Starting with Muthukrishnan's seminal paper [14] (building in fact on an earlier paper by the same author with colleagues [12]), an abundance of articles on variations of this scheme emerged, including: space reductions of the underlying data structures [7, 16, 17], ranking of the output [9, 11, 15], two-pattern queries [3, 5], and perhaps many more.

A different possible, and indeed very natural, extension of the basic problem is to exclude some documents from the output. In this setting, the user specifies, in addition to the query pattern $P^+$, a negative pattern $P^-$ that should not occur in the retrieved documents. Formally, this problem of forbidden patterns can be modeled as follows (see footnote 6):

Given: A collection of static documents $\mathcal{D} = \{D_1, \ldots, D_d\}$ over an alphabet $\Sigma$ of total length $n = \sum_{i \le d} |D_i|$.
Compute: An index that, given two patterns $P^+$ and $P^-$ online, allows us to compute all $n_{match}$ documents containing $P^+$, but not $P^-$.

The best we can hope for is certainly an index of linear size having a query time of $O(|P^+| + |P^-| + n_{match})$, as this is the time just to read the input and write the output of the query. However, achieving anything close to this optimum seems completely out of reach (at least at the current state of research), as the forbidden pattern queries can be regarded as set difference queries, which are arguably at least as hard as set intersection queries. In the realm of document retrieval, those latter queries correspond to the case where two positive patterns $P_1^+$ and $P_2^+$ are given (and one is interested in all documents containing both positive patterns). We are aware of three indexes that address this problem:

(1) $\tilde{O}(n^{3/2})$ words of space with query time $O(|P_1^+| + |P_2^+| + n_{match} + \sqrt{n})$ [5],
(2) $O(n \log n)$ words and $O(|P_1^+| + |P_2^+| + (\sqrt{n_{match}\, n \log n} + n_{match}) \log^2 n)$ time [3], and
(3) $O(n)$ words and $O(|P_1^+| + |P_2^+| + (\sqrt{n_{match}\, n \log n} + n_{match}) \log n)$ time [8] (this improves on [3] in both time and space and has the further advantage that it generalizes to more than two patterns).

In Appendix A of this paper, we show that this problem of two positive patterns is indeed harder than document retrieval with just one pattern (see footnote 7). However, the main body of this paper is devoted to the forbidden patterns problem. The following theorem summarizes our main result.

Theorem 1. For a text collection of total length $n$, there exists a data structure of size $O(n^{3/2})$ bits such that subsequent forbidden pattern queries can be answered in $O(|P^+| + |P^-| + n_{match} + \sqrt{n})$ time.
The rest of this article is structured as follows. Sect. 2 introduces known results that form the basic building blocks of our solution, including a description of the preprocessing algorithm for document retrieval with just one positive pattern [14]. In Sect. 3, we then give data structures for the forbidden patterns problem, where, apart from proving Thm. 1, we also look at the variation of just counting the number of documents. Finally, Sect. 4 concludes the paper.

Footnote 6. Muthukrishnan [14] already considered the case where just a negative pattern $P^-$ is given, and has an optimal solution that outputs all documents not containing $P^-$.
Footnote 7. Unfortunately, we could not come up with any meaningful lower bound for the forbidden patterns problem.
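As a point of reference for the index structures discussed in this paper, the query semantics of the forbidden patterns problem can be written as a brute-force scan. The following is only a minimal sketch: the function name `forbidden_query_naive` is hypothetical, documents are numbered from 0, and every query scans the whole collection, so it says nothing about the bounds of Thm. 1.

```python
def forbidden_query_naive(docs, p_plus, p_minus):
    """Return the indices of all documents that contain p_plus but not p_minus.

    Baseline only: each query takes time proportional to the total length
    of the collection (times the pattern lengths), with no preprocessing.
    """
    return [i for i, doc in enumerate(docs)
            if p_plus in doc and p_minus not in doc]
```

For the collection {turing, tape, enigma, apple}, the query with positive pattern "a" and negative pattern "p" returns only the index of "enigma", since "tape" and "apple" both contain the forbidden pattern.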

2 Preliminaries

2.1 Succinct Data Structures

Consider a bit-string $S[1, n]$ of length $n$. We define the fundamental rank- and select-operations on $S$ as follows: $\mathrm{rank}_1(S, i)$ gives the number of 1's in the prefix $S[1, i]$, and $\mathrm{select}_1(S, i)$ gives the position of the $i$-th 1 in $S$, reading $S$ from left to right ($1 \le i \le n$). The following lemma summarizes a by-now classic result (see, e.g., [13]):

Lemma 1. A bit-string of length $n$ can be represented in $n + o(n)$ bits such that rank- and select-operations are supported in $O(1)$ time.

2.2 Range Minimum Queries

A basic building block for our solution is a space-efficient preprocessing scheme for $O(1)$ range minimum queries. For a static array $E[1, n]$ of $n$ objects from a totally ordered universe and two indices $i$ and $j$ with $1 \le i \le j \le n$, a range minimum query $\mathrm{rmq}_E(i, j)$ returns the position of a minimum element in the sub-array $E[i, j]$; in symbols: $\mathrm{rmq}_E(i, j) = \operatorname{argmin}\{E[k] \mid i \le k \le j\}$. We state the following result [6, Thm. 5.8]:

Lemma 2. A static array $E[1, n]$ can be preprocessed in $O(n)$ time into a data structure of size $2n + o(n)$ bits such that subsequent range minimum queries on $E$ can be answered in $O(1)$ time, without consulting $E$ at query time. This size is asymptotically optimal.

2.3 Document Retrieval

We now explain Muthukrishnan's solution [14] for document retrieval with only one positive pattern $P$ (in fact, we describe a variant [16] of the original algorithm that is more convenient for our purposes). The overall idea is to build a generalized suffix tree ST for the collection of documents $\mathcal{D} = \{D_1, \ldots, D_d\}$, and enhance it with additional information for reporting the documents. A generalized suffix tree for $\mathcal{D}$ is a suffix tree for the text $T := D_1 \#_1 D_2 \#_2 \ldots D_d \#_d$, where the $\#_i$'s are distinct characters not appearing elsewhere in $\mathcal{D}$. A suffix tree ST on $T$, in turn, is a compacted trie on all suffixes of $T$, and consists of only $O(|T|)$ nodes.
Every such node $v$ is said to correspond to a substring $\alpha$ of $T$ iff the letters on the root-to-$v$ path are exactly $\alpha$. A suffix tree ST for $T$ allows us to locate all $occ$ occurrences of a search pattern $P$ in $T$ in optimal $O(|P| + occ)$ time (with perfect hashing; otherwise the search takes $O(|P| \log |\Sigma| + occ)$ time). This search proceeds in two steps: it first finds in $O(|P|)$ time the node $v$ in ST such that all leaves below $v$ correspond to the $occ$ suffixes that are prefixed by $P$. In a second step, the starting points of all these suffixes are reported in additional $O(occ)$ time.

For document retrieval, it should be clear that we can reuse only the first part of this search, but must modify the second step such that it uses $O(n_{match})$ instead of $O(occ)$ time. To this end, Muthukrishnan's solution proceeds as follows. Consider the leaves of ST in lexicographic order. The positions in $T$ of their corresponding suffixes form a permutation of the numbers $[1, n']$ ($n' = n + d = O(n)$ being the size of $T$), the so-called suffix array $A[1, n']$. Define a document array $D[1, n']$ of the same size as $A$, such that $D[i]$ holds the document number of the lexicographically $i$-th suffix. More formally, $D[i] = j$ iff $\#_j$ is the first document separator in $T[A[i], n']$. We now chain suffixes from the same document in a new array $E[1, n']$ by defining $E[i] = \max\{j < i \mid D[j] = D[i]\}$, where the maximum of the empty set is assumed to be $-\infty$. Array $E$ is prepared for constant-time range minimum queries using Lemma 2.

With these data structures, we can obtain optimal $O(|P| + n_{match})$ listing time, as explained next. We first use ST to find in $O(|P|)$ time the interval $[l, r]$ in $A$ such that the suffixes in $A[l, r]$ are exactly those that are prefixed by $P$. We then call the recursive procedure list in Alg. 1, initially invoked by list($l$, $r$) and assuming $V[i] = 0$ for all $1 \le i \le d$ just before that first call.

Algorithm 1: List all documents in $D[i, j]$ not occurring in $D[l, i - 1]$.
  procedure list(i, j)
      if i > j then return          (empty interval: nothing to report)
      m ← rmq_E(i, j)
      if V[D[m]] = 0 then
          output D[m]
          V[D[m]] ← 1
          list(i, m − 1)
          list(m + 1, j)

The idea of procedure list is that each distinct document identifier in $D[l, r]$ is listed only at the place $m$ of its leftmost occurrence in the interval $[l, r]$; such places are conveniently located by range minimum queries on $E$.
To avoid duplicate outputs, we mark all documents found by a 1 in an additional array $V[1, d]$, which is initialized with all 0's in the preprocessing phase. Whenever the smallest element in $E[i, j]$ comes from a document already reported (hence $V[D[m]] = 1$), the recursion can be stopped, since every document in $D[i, j]$ has been reported when visiting distinct intervals $[i', j']$ with $l \le i' \le j' < i$. Hence, the overall running time is $O(n_{match})$. At the end of the reporting phase, we need to reset $V[\cdot]$ to 0 for all documents in the output. This takes additional $O(n_{match})$ time.

Apart from the suffix tree ST, the space for this solution is dominated by the $n'$ words needed for storing the document array $D$.
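The listing procedure of Sect. 2.3 can be tried out on small inputs with the sketch below. It is an illustration under simplifying assumptions, not the data structure of the text: the suffix array is built by naive sorting, the suffix tree search is replaced by a linear scan (`sa_interval`), the range minimum query is a linear scan instead of the structure of Lemma 2, indices are 0-based, and private-use Unicode characters stand in for the separators $\#_i$. All function names are hypothetical.

```python
import math

def build_arrays(docs):
    """Build text T, suffix array A, document array D and chaining array E."""
    text, owner = "", []
    for k, d in enumerate(docs):
        text += d + chr(0xE000 + k)          # assumed-unique separator #_k
        owner += [k] * (len(d) + 1)
    sa = sorted(range(len(text)), key=lambda p: text[p:])   # naive construction
    D = [owner[p] for p in sa]
    E, last = [], {}
    for i, doc in enumerate(D):
        E.append(last.get(doc, -math.inf))   # previous suffix of the same document
        last[doc] = i
    return text, sa, D, E

def sa_interval(text, sa, pattern):
    """Interval [l, r] of suffixes prefixed by pattern, or None (linear scan)."""
    hits = [i for i, p in enumerate(sa) if text.startswith(pattern, p)]
    return (hits[0], hits[-1]) if hits else None

def list_documents(D, E, l, r):
    """Procedure list of Alg. 1 with a fresh marking array V per query."""
    V = [0] * (max(D) + 1)
    out = []

    def rec(i, j):
        if i > j:
            return
        m = min(range(i, j + 1), key=E.__getitem__)   # stand-in for rmq_E(i, j)
        if V[D[m]] == 0:
            out.append(D[m])
            V[D[m]] = 1
            rec(i, m - 1)      # left part first, as the stopping rule requires
            rec(m + 1, j)

    rec(l, r)
    return out
```

Allocating `V` per query keeps the sketch short; the text instead resets only the marked entries so that the reset costs $O(n_{match})$ time.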

3 Document Retrieval with Forbidden Patterns

We now come to the description of our solution to the problem of forbidden patterns, as presented in the introduction. We proceed by first presenting a rather simple solution (Sect. 3.1), which is then subsequently refined (Sects. 3.2 and 3.3).

3.1 $O(n^2)$ Words of Space

We first show how to achieve optimal $O(|P^+| + |P^-| + n_{match})$ query time with $O(n^2)$ space. The idea is again to store a generalized suffix tree ST for the set of documents $\mathcal{D}$ and enhance it with additional information. In particular, for every node $v$ in ST corresponding to string $\alpha$, we store a copy of ST that excludes the documents containing $\alpha$. We call that copy ST$_v$. Every ST$_v$ is prepared for normal document listing (Sect. 2.3). When a query arrives, we first match $P^-$ in ST until reaching node $v$ (if the matching ends on an edge, we take the following node). We then jump to ST$_v$, where we match $P^+$, and list all $n_{match}$ documents in optimal $O(n_{match})$ time.

The space for this solution is clearly $O(n^2)$ words, as the number of nodes in ST is $O(n)$, and for each such node $v$ we build another generalized suffix tree ST$_v$, all of which could contain $O(n)$ nodes in the worst case.

3.2 $O(n^2)$ Bits of Space

We now reduce the solution from the previous section to $O(n^2)$ bits. Our aim is to reuse the full suffix tree ST also when matching the positive pattern $P^+$, and use a modified RMQ-structure when reporting documents by procedure list (see Sect. 2.3). To this end, let $v$ be a node in ST, and let $\mathcal{D}_v$ be the set of documents containing the string represented by $v$ (hence $\mathcal{D}_v$ is the set of documents in $D[l, r]$ if $A[l, r]$ is the suffix array interval for $v$ in the sense of Sect. 2.3). In a (conceptual) copy $E_v$ of the global chaining array $E$, we blank out all entries corresponding to documents in $\mathcal{D}_v$ by setting the corresponding values to $+\infty$. More precisely,

$$E_v[i] = \begin{cases} +\infty & \text{if } D[i] \in \mathcal{D}_v, \text{ and} \\ E[i] & \text{otherwise.} \end{cases}$$
Now each such $E_v$ is prepared for range minimum queries using Lemma 2, and only this RMQ-structure (not the array $E_v$ itself!) is stored at node $v$. A further bit-vector $B_v[1, n']$ at node $v$ marks those positions with a 1 that correspond to documents in $\mathcal{D}_v$, in symbols: $B_v[i] = 1$ if $E_v[i] = +\infty$, and $B_v[i] = 0$ otherwise. Hence, the total space needed is $3n' + o(n') = O(n)$ bits per node in ST. We also store the global document array $D[1, n']$, plus the bit-vector $V[1, d]$ needed by Alg. 1, needing $O(n \log n)$ and $d = O(n)$ bits, respectively. The space for the entire data structure thereby amounts to $O(n^2)$ bits.

The query processing starts as in the previous section: we first match $P^-$ in ST until reaching node $v$ (and again take the following node if we end on an edge). Now instead of jumping to ST$_v$ (which is no longer stored), we use ST again to find the interval $[l, r]$ in the suffix array $A$ such that the suffixes in $A[l, r]$ are exactly those that are prefixed by $P^+$. We then call procedure list($l$, $r$), but using the RMQ-structure for $E_v$ instead of $E$ (corresponding to the negative pattern $P^-$). We need to further modify that procedure such that it does not list those documents in $\mathcal{D}_v$; this can be accomplished by checking if the $m$-th bit in $B_v$ is set to 0. If, on the other hand, $B_v[m] = 1$, we can stop the recursion, as in that case all other entries in $E_v[i, j]$ must also be $+\infty$ (and hence come from documents in $\mathcal{D}_v$). The complete modified algorithm list$'$ can be seen in Alg. 2.

Algorithm 2: Modified procedure for document listing.
  procedure list'(i, j)
      if i > j then return          (empty interval: nothing to report)
      m ← rmq_{E_v}(i, j)
      if V[D[m]] = 0 and B_v[m] = 0 then
          output D[m]
          V[D[m]] ← 1
          list'(i, m − 1)
          list'(m + 1, j)

As before, after having listed all $n_{match}$ documents, we need to unmark the listed documents in $V$ in additional $O(n_{match})$ time in order to prepare for the next query.

3.3 $O(n^{3/2})$ Bits of Space

We now present a space/time tradeoff for the solution given in the previous section. Our general idea is to store the RMQ-structures only at a selected subset of nodes in ST, thereby possibly listing false documents that need to be filtered at query time. For what follows, the reader should also consult the example shown in Fig. 1.

We assign a weight $w_v$ to each node $v$ in ST as follows. As before, let $\mathcal{D}_v$ denote the set of documents below $v$; i.e., the set of documents in $\mathcal{D}$ that contain the string represented by $v$.
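The blanking idea of Sect. 3.2 can be illustrated with the following sketch. It makes several simplifying assumptions that are not part of the text: $E_v$ and $B_v$ are computed at query time from the suffix array interval of $P^-$ rather than precomputed per suffix tree node, the RMQ is a linear scan over an explicitly stored $E_v$, suffix array and pattern search are naive, and indices are 0-based. The name `forbidden_list` and the helpers are hypothetical.

```python
import math

def build_arrays(docs):
    """Naive text, suffix array, document array D and chaining array E."""
    text, owner = "", []
    for k, d in enumerate(docs):
        text += d + chr(0xE000 + k)          # assumed-unique separator #_k
        owner += [k] * (len(d) + 1)
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    D = [owner[p] for p in sa]
    E, last = [], {}
    for i, doc in enumerate(D):
        E.append(last.get(doc, -math.inf))
        last[doc] = i
    return text, sa, D, E

def sa_interval(text, sa, pattern):
    hits = [i for i, p in enumerate(sa) if text.startswith(pattern, p)]
    return (hits[0], hits[-1]) if hits else None

def forbidden_list(docs, p_plus, p_minus):
    """List documents containing p_plus but not p_minus via E_v and B_v."""
    text, sa, D, E = build_arrays(docs)
    neg = sa_interval(text, sa, p_minus)
    Dv = set(D[neg[0]:neg[1] + 1]) if neg else set()     # documents containing P^-
    Ev = [math.inf if D[i] in Dv else E[i] for i in range(len(D))]
    Bv = [1 if D[i] in Dv else 0 for i in range(len(D))]
    pos = sa_interval(text, sa, p_plus)
    if pos is None:
        return []
    V, out = [0] * len(docs), []

    def rec(i, j):                                       # procedure list' of Alg. 2
        if i > j:
            return
        m = min(range(i, j + 1), key=Ev.__getitem__)     # stand-in for rmq_{E_v}
        if V[D[m]] == 0 and Bv[m] == 0:
            out.append(D[m])
            V[D[m]] = 1
            rec(i, m - 1)
            rec(m + 1, j)

    rec(*pos)
    return out
```

If the minimum of $E_v$ in the current interval is a blanked entry, every entry in the interval is blanked, which is why the recursion may stop there without missing a document.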
Then the weight of $v$ is defined to be the number of documents in $\mathcal{D}_v$, $w_v = |\mathcal{D}_v|$.

In ST, we mark certain nodes as important. The RMQ-structures will only be stored at important nodes. We will make sure that each node $v$ has an important successor $u$ such that $w_v \le w_u + s$, for some integer $s$ to be determined later. At $v$, we also store a pointer to this important successor $u$. Let $p_v = u$ denote this pointer (for important nodes $v$ we define $p_v = v$). At query time, when the search for $P^-$ ends at $v$, we use the algorithm from the previous section, but now with the RMQ-structure for $E_{p_v}$. This reports at most $s$ false documents, which need to be discarded from the output.

Fig. 1. Generalized suffix tree (with super-leaf $l$) for the collection of documents $\mathcal{D}$ = {turing, tape, enigma, apple}, i.e., for $T$ = turing$\#_1$tape$\#_2$enigma$\#_3$apple$\#_4$ (with irrelevant parts pruned). The number inside a node $v$ denotes its weight $w_v$. Important nodes (assuming $s = 2$) are shown in bold. Hence, in this example only 3 RMQ-structures are stored. [Tree drawing not reproduced.]

It thus remains to identify the false documents. For this, we need not store any additional data structures, as explained next. Let $\alpha$ denote the string represented by node $p_v$. Observe that the false documents are exactly those that contain $P^-$, but not $\alpha$; but this corresponds to a forbidden pattern query, with $P^-$ as the positive pattern, and $\alpha$ as the negative one! And for this query all necessary data structures are at hand, because at $p_v$ there exists an RMQ-structure that filters exactly those documents containing $\alpha$.

In summary, to answer the query "$P^+$ and not $P^-$", we first match $P^-$ in ST up to node $v$, where we follow the pointer $p_v$ to an important successor representing string $\alpha$. At $p_v$, we answer the query "$P^-$ and not $\alpha$" with the algorithm from Sect. 3.2 to identify the set of false documents, which we mark in a bit-vector $F[1, d]$. Finally, we answer the query "$P^+$ and not $\alpha$", again by using the algorithm from Sect. 3.2, but this time outputting only those documents not marked in $F$. In the end, all marked documents are unmarked to prepare for the next query. As by definition the marked (= false) documents are at most $s$ in number, the total query time is $O(|P^+| + |P^-| + n_{match} + s)$.

Identifying Important Nodes. It remains to show how the important nodes can be identified. We do this in a bottom-up traversal of ST.
We first enhance ST with a super-leaf $l$ that is the single child of all original leaves in ST. This node $l$ has weight $w_l = 0$, is marked as important, and stores an RMQ-structure for the original array $E$. During the bottom-up traversal of all original nodes in ST (excluding the super-leaf $l$), let us assume we arrive at a node $v$ with children $v_1, \ldots, v_k$. By induction, all $v_i$'s have already been assigned an important successor $p_{v_i}$. Let $v_m$ be the child of $v$ having an important successor occurring in most documents, $m = \operatorname{argmax}\{w_{p_{v_i}} \mid 1 \le i \le k\}$. Then if $w_v - w_{p_{v_m}} \le s$, we set $p_v$ to $p_{v_m}$. Otherwise, we mark $v$ as important, create an RMQ-structure for $E_v$, and set $p_v = v$.

Space Analysis. To analyze the space, consider the subtree ST$_I$ of ST consisting only of important nodes and their mutual lowest common ancestors (in the terminology of Cole et al. [4], ST$_I$ is the subtree of ST induced by the important nodes). The nodes in ST$_I$ are further divided into two different classes: (1) non-branching internal, and (2) other. A node belongs to class non-branching internal iff it has exactly one child in ST$_I$; otherwise it belongs to class other. We analyze the number of non-branching internal and other nodes separately.

Let us first consider the other nodes. They form yet another induced subtree of ST$_I$, let us call it ST$'_I$. A leaf $v$ in ST$'_I$ covers at least $s$ (original) leaves in ST, as its weight must by definition satisfy $w_v > s$, and in order to cover $s$ documents at least $s$ suffixes from $T$ must be covered. Hence, the number of leaves in ST$'_I$ is bounded from above by $n/s$, and because ST$'_I$ is compact, the total number of nodes in ST$'_I$ is $O(n/s)$.

Now look at the non-branching internal nodes of ST$_I$. Because every such node $v$ must have $w_v - w_u > s$ for its single child $u$ in ST$_I$, this increase in weight can only come from at least $s$ (original) leaves of ST for which $v$ is their nearest important ancestor. As every leaf in ST can contribute to at most one such non-branching internal node, the number of non-branching internal nodes is also bounded by $O(n/s)$.
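The bottom-up marking rule can be sketched on an abstract tree as follows. This is an illustration only: the tree is given as plain dictionaries (`children`, `weight`) instead of a suffix tree, the weights $w_v$ are assumed to be precomputed, no RMQ-structures are built, and the names `mark_important` and `SUPER_LEAF` are hypothetical.

```python
SUPER_LEAF = "super-leaf"   # artificial node l with weight 0, always important

def mark_important(children, weight, root, s):
    """Return (important, p): the set of important nodes and the pointers p_v.

    children: dict mapping a node to the list of its children (leaves absent
              or mapped to an empty list); weight: dict mapping a node to w_v.
    """
    w = dict(weight)
    w[SUPER_LEAF] = 0
    p = {SUPER_LEAF: SUPER_LEAF}
    important = {SUPER_LEAF}

    # Pre-order listing; its reverse visits every node after all its descendants.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    for v in reversed(order):
        kids = children.get(v) or [SUPER_LEAF]     # original leaves hang above l
        best = max((p[c] for c in kids), key=w.__getitem__)
        if w[v] - w[best] <= s:
            p[v] = best                            # reuse the heaviest successor
        else:
            important.add(v)                       # v gets its own RMQ-structure
            p[v] = v
    return important, p
```

By construction every node $v$ ends up with $w_v - w_{p_v} \le s$, which is the property the query algorithm relies on to bound the number of false documents.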
In total, we store the RMQ-structures (using $O(n)$ bits each) only at $O(n/s)$ nodes of ST, which amounts to $O(n^2/s)$ bits, while a query spends $O(s)$ additional time on false documents. By setting $s = \sqrt{n}$, we obtain Thm. 1.

3.4 Approximate Counting Queries

If only the number of documents containing $P^+$ but not $P^-$ matters, we can obtain a faster data structure than that of Thm. 1. We first explain Sadakane's succinct data structure for document counting in the presence of just a positive pattern $P$ [16, Sect. 5.2]. Without going too much into the details, the idea of that algorithm is as follows: for each node $v$ in ST we store in $c(v)$ how many duplicate suffixes from the same document occur in its subtree. When a query pattern $P$ arrives, we first locate the node $v$ representing $P$ and its suffix array interval $[l, r]$. Then $r - l + 1 - c(v)$ gives the number of documents containing $P$, as $r - l + 1$ is the total number of occurrences, and $c(v)$ is the right correction term that needs to be subtracted. (This idea was first used by Hui [10], who also shows how to compute the correction terms in overall linear time by lowest common ancestor queries.) The $c(v)$-numbers are not stored directly in each node (this would need $O(n \log n)$ bits overall), but rather in a bit-vector $H$ of length $O(n)$ such that $c(v)$ can be computed by a constant number of rank- and select-queries on $H$ (in fact by counting the 0's in a certain interval in $H$), given $v$'s suffix array interval $[l, r]$. In total, this solution uses $O(n)$ bits (see Lemma 1).

For forbidden patterns, we modify the data structure from Sect. 3.3 and store only the following information at important nodes $v$:

- Vector $B_v$, marking the documents in $\mathcal{D}_v$, as defined in Sect. 3.2. Each $B_v$ is prepared for constant-time $\mathrm{rank}_1$-queries.
- Vector $H_v$, which is the bit-vector $H$ as defined in the preceding paragraph, but modified to exclude those documents from $\mathcal{D}_v$. Using $H_v$ we can compute for any node $u$ a modified correction term $c_v(u)$ that excludes the documents containing the string represented by $v$. Vector $H_v$ is at most as long as $H$, hence using only $O(n)$ bits.

Finally, each node $v$ in ST stores the pointer $p_v$ to its nearest important successor.

When we need to answer a query of the form "what is the approximate number of documents containing $P^+$ but not $P^-$?", we first match $P^-$ in ST until reaching node $v$. We then match $P^+$ in ST, say until reaching $u$. Assume that $u$ corresponds to the suffix array interval $[l, r]$. Now observe that the number of 1's in $B_{p_v}[l, r]$ gives the number of suffixes below $u$ that belong to forbidden documents (those in $\mathcal{D}_{p_v}$), and that this number can be computed by $f = \mathrm{rank}_1(B_{p_v}, r) - \mathrm{rank}_1(B_{p_v}, l - 1)$. So $r - l + 1 - f$ is the number of suffixes below $u$ excluding those from forbidden documents. Hence, we return $\tilde{n}_{match} = r - l + 1 - f - c_{p_v}(u)$ as the approximate answer to $n_{match}$, which satisfies $n_{match} \le \tilde{n}_{match} \le n_{match} + \sqrt{n}$ by the definition of important nodes (with $s = \sqrt{n}$).

Theorem 2. For a text collection of total length $n$, there exists a data structure of size $O(n^{3/2})$ bits such that subsequent queries asking for the approximate number of documents containing $P^+$ but not $P^-$ can be answered in $O(|P^+| + |P^-|)$ time, with an additive error of at most $\sqrt{n}$.

4 Conclusions

We initiated the study of document retrieval in the presence of forbidden patterns. Apart from improving on the space consumption and/or query time of our data structures, we point out the following subjects deserving further investigation: (1) document retrieval with more than two patterns, both positive and negative, (2) lower bounds for forbidden patterns, possibly in the spirit of Appendix A, but also in more realistic machine models such as the word-RAM, and (3) algorithmic engineering and comparison with inverted indexes.

References

1. B. Chazelle. Lower bounds for orthogonal range searching, I: The reporting case. J. ACM, 37.
2. Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Proc. DCC. IEEE Press.
3. H. Cohen and E. Porat. Fast set intersection and two-patterns matching. Theor. Comput. Sci., 411(40–42).
4. R. Cole, M. Farach-Colton, R. Hariharan, T. M. Przytycka, and M. Thorup. An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J. Comput., 30(5).
5. P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. J. Comput. Syst. Sci., 66(4).
6. J. Fischer and V. Heun. Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2).
7. T. Gagie, S. J. Puglisi, and A. Turpin. Range quantile queries: Another virtue of wavelet trees. In Proc. SPIRE, volume 5721 of LNCS, pages 1–6. Springer.
8. W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. String retrieval for multipattern queries. In Proc. SPIRE, volume 6393 of LNCS. Springer.
9. W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In Proc. FOCS. IEEE Computer Society.
10. L. C. K. Hui. Color set size problem with application to string matching. In Proc. CPM, volume 644 of LNCS. Springer.
11. M. Karpinski and Y. Nekrich. Top-k color queries for document retrieval. In Proc. SODA. ACM/SIAM.
12. Y. Matias, S. Muthukrishnan, S. C. Şahinalp, and J. Ziv. Augmenting suffix trees, with applications. In Proc. ESA, volume 1461 of LNCS. Springer.
13. J. I. Munro and V. Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3).
14. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA. ACM/SIAM.
15. G. Navarro and Y. Nekrich. Top-k document retrieval in optimal time and linear space. In Proc. SODA. ACM/SIAM, to appear.
16. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12–22.
17. N. Välimäki and V. Mäkinen. Space-efficient algorithms for document retrieval. In Proc. CPM, volume 4580 of LNCS. Springer.

Appendix

A A Lower Bound for Two Positive Patterns

Consider the two-pattern matching problem in [3]:

Given: A collection of static documents $\mathcal{D} = \{D_1, \ldots, D_d\}$ over an alphabet $\Sigma$ of total length $n = \sum_{i \le d} |D_i|$.
Compute: An index that, given two patterns $P_1^+$ and $P_2^+$ online, allows us to compute all $n_{match}$ documents containing $P_1^+$ and $P_2^+$.

We give a new lower bound result for the problem using the technique introduced in [2]. The reduction is from 4-dimensional range queries: Given a set $S$ of $n$ points in $\mathbb{R}^4$, each represented with $4h'$ bits, preprocess $S$ to find the points $\{s \mid s \in S \cap ([x_l, x_r] \times [y_l, y_r] \times [z_l, z_r] \times [t_l, t_r])\}$, where $x$, $y$, $z$, and $t$ denote the 4 coordinate ranges. Chazelle [1] showed that on a pointer machine, an index supporting $d$-dimensional range searching in $O(\mathrm{polylog}(n) + occ)$ query time requires $\Omega(n (\log n / \log\log n)^{d-1})$ words of storage.

First, we can assume that points in $S$ are from a $[1, n] \times [1, n] \times [1, n] \times [1, n]$ grid, since sorting and mapping $n$ points in $\mathbb{R}^4$ to their ranks in each coordinate take $O(nh')$ space and $O(n \log n)$ time. Later, a range query can be cast to the corresponding one on the ranks in $O(\log n)$ time. From the $i$-th point $s = (x_i, y_i, z_i, t_i) \in S$ we create the document

$$D_i = \overleftarrow{y_i}\, \#_1\, \overrightarrow{x_i}\, \#_2\, \overleftarrow{t_i}\, \#_3\, \overrightarrow{z_i}, \qquad (1)$$

where $\overrightarrow{c}$ denotes the $h = \Theta(\log n)$-bit binary representation of integer $c$, and $\overleftarrow{c}$ its reverse (and the $\#_i$'s are again new characters).

Consider a balanced binary tree on values $1, 2, \ldots, n$ in its leaves in this order. Associating 0 with left-branches and 1 with right-branches, paths from the root to the leaves define prefix codes for the values. An interval $[c, d]$ can be partitioned into $O(\log n)$ intervals such that each interval corresponds to a different subtree; denote by $P(c, d)$ the set of $O(\log n)$ prefix codes defined by paths from the root to the roots of these $O(\log n)$ subtrees. We cast a given 4-dimensional range query $[x_l, x_r] \times [y_l, y_r] \times [z_l, z_r] \times [t_l, t_r]$ into $O(\log^4 n)$ two-pattern queries

$$P_1^+ = \overleftarrow{y}\, \#_1\, \overrightarrow{x} \quad \text{and} \quad P_2^+ = \overleftarrow{t}\, \#_3\, \overrightarrow{z} \qquad (2)$$

for all $(x, y, z, t) \in P(x_l, x_r) \times P(y_l, y_r) \times P(z_l, z_r) \times P(t_l, t_r)$.
One can now see that these $O(\log^4 n)$ two-pattern queries on $\mathcal{D} = \{D_1, \ldots, D_n\}$ of total length $N = \Theta(n \log n)$ bits, constructed using Eq. (1), solve the 4-dimensional range reporting query.

Theorem 3. On a pointer machine, an index on a document collection of total length $n$ supporting two-pattern matching in $O(|P_1^+| + |P_2^+| + n_{match} + \mathrm{polylog}\, n)$ time requires $\Omega(n (\log n / \log\log n)^3)$ bits in the worst case.

Proof. The reduction showed the connection between an $N$-bit collection and 4-dimensional range queries on $\Theta(N / \log N)$ points, so recasting the result with a collection of length $n$ gives

$$\Omega\!\left(\frac{n}{\log n} \left(\frac{\log(n / \log n)}{\log\log(n / \log n)}\right)^{4-1}\right)$$

words, i.e. $\Omega(n (\log n / \log\log n)^3)$ bits.
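The reduction can be checked on toy inputs with the sketch below. It is an illustration under assumptions that differ slightly from the text: coordinates range over $[0, 2^h - 1]$ instead of $[1, n]$, the characters "#", "$" and "%" stand in for $\#_1$, $\#_2$ and $\#_3$, and the two-pattern index is replaced by plain substring tests. The names `prefix_codes`, `document` and `range_report` are hypothetical.

```python
from itertools import product

def prefix_codes(c, d, h):
    """Prefix codes of the maximal subtrees covering [c, d] within [0, 2**h - 1]."""
    out = []

    def rec(lo, hi, code):
        if d < lo or hi < c:            # subtree disjoint from the query interval
            return
        if c <= lo and hi <= d:         # subtree fully inside: emit its prefix code
            out.append(code)
            return
        mid = (lo + hi) // 2
        rec(lo, mid, code + "0")        # left branch
        rec(mid + 1, hi, code + "1")    # right branch

    rec(0, 2 ** h - 1, "")
    return out

def document(point, h):
    """Document of Eq. (1): reversed y, forward x, reversed t, forward z."""
    x, y, z, t = (format(v, "0{}b".format(h)) for v in point)
    return y[::-1] + "#" + x + "$" + t[::-1] + "%" + z

def range_report(points, h, box):
    """Answer a 4-dimensional range query through two-pattern queries, Eq. (2)."""
    (xl, xr), (yl, yr), (zl, zr), (tl, tr) = box
    docs = [document(p, h) for p in points]
    hits = set()
    for x, y, z, t in product(prefix_codes(xl, xr, h), prefix_codes(yl, yr, h),
                              prefix_codes(zl, zr, h), prefix_codes(tl, tr, h)):
        p1 = y[::-1] + "#" + x
        p2 = t[::-1] + "%" + z
        # Stand-in for the two-pattern index: scan all documents.
        hits.update(i for i, d in enumerate(docs) if p1 in d and p2 in d)
    return sorted(hits)
```

Reversing the codes of $y$ and $t$ places their prefix bits directly before the separator, so a prefix of the binary representation becomes a contiguous substring of the document together with the separator and the prefix of the next coordinate.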


More information

arxiv: v1 [cs.ds] 25 Nov 2009

arxiv: v1 [cs.ds] 25 Nov 2009 Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,

More information

Space-Efficient Construction Algorithm for Circular Suffix Tree

Space-Efficient Construction Algorithm for Circular Suffix Tree Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes

More information

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:

More information

Compressed Representations of Sequences and Full-Text Indexes

Compressed Representations of Sequences and Full-Text Indexes Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Università di Pisa GIOVANNI MANZINI Università del Piemonte Orientale VELI MÄKINEN University of Helsinki AND GONZALO NAVARRO

More information

Dynamic Entropy-Compressed Sequences and Full-Text Indexes

Dynamic Entropy-Compressed Sequences and Full-Text Indexes Dynamic Entropy-Compressed Sequences and Full-Text Indexes VELI MÄKINEN University of Helsinki and GONZALO NAVARRO University of Chile First author funded by the Academy of Finland under grant 108219.

More information

Online Sorted Range Reporting and Approximating the Mode

Online Sorted Range Reporting and Approximating the Mode Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted

More information

Text Indexing: Lecture 6

Text Indexing: Lecture 6 Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question

More information

A Faster Grammar-Based Self-index

A Faster Grammar-Based Self-index A Faster Grammar-Based Self-index Travis Gagie 1,Pawe l Gawrychowski 2,, Juha Kärkkäinen 3, Yakov Nekrich 4, and Simon J. Puglisi 5 1 Aalto University, Finland 2 University of Wroc law, Poland 3 University

More information

Cell Probe Lower Bounds and Approximations for Range Mode

Cell Probe Lower Bounds and Approximations for Range Mode Cell Probe Lower Bounds and Approximations for Range Mode Mark Greve, Allan Grønlund Jørgensen, Kasper Dalgaard Larsen, and Jakob Truelsen MADALGO, Department of Computer Science, Aarhus University, Denmark.

More information

Succinct Suffix Arrays based on Run-Length Encoding

Succinct Suffix Arrays based on Run-Length Encoding Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space

More information

arxiv: v1 [cs.ds] 30 Nov 2018

arxiv: v1 [cs.ds] 30 Nov 2018 Faster Attractor-Based Indexes Gonzalo Navarro 1,2 and Nicola Prezza 3 1 CeBiB Center for Biotechnology and Bioengineering 2 Dept. of Computer Science, University of Chile, Chile. gnavarro@dcc.uchile.cl

More information

Fast Fully-Compressed Suffix Trees

Fast Fully-Compressed Suffix Trees Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University

More information

Optimal Dynamic Sequence Representations

Optimal Dynamic Sequence Representations Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on

More information

Space-Efficient Re-Pair Compression

Space-Efficient Re-Pair Compression Space-Efficient Re-Pair Compression Philip Bille, Inge Li Gørtz, and Nicola Prezza Technical University of Denmark, DTU Compute {phbi,inge,npre}@dtu.dk Abstract Re-Pair [5] is an effective grammar-based

More information

A Simple Alphabet-Independent FM-Index

A Simple Alphabet-Independent FM-Index A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Lecture 2 September 4, 2014

Lecture 2 September 4, 2014 CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 2 September 4, 2014 Scribe: David Liu 1 Overview In the last lecture we introduced the word RAM model and covered veb trees to solve the

More information

Compressed Representations of Sequences and Full-Text Indexes

Compressed Representations of Sequences and Full-Text Indexes Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards

More information

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their

More information

On Compressing and Indexing Repetitive Sequences

On Compressing and Indexing Repetitive Sequences On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv

More information

String Indexing for Patterns with Wildcards

String Indexing for Patterns with Wildcards MASTER S THESIS String Indexing for Patterns with Wildcards Hjalte Wedel Vildhøj and Søren Vind Technical University of Denmark August 8, 2011 Abstract We consider the problem of indexing a string t of

More information

Reducing the Space Requirement of LZ-Index

Reducing the Space Requirement of LZ-Index Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of

More information

Advanced Text Indexing Techniques. Johannes Fischer

Advanced Text Indexing Techniques. Johannes Fischer Advanced ext Indexing echniques Johannes Fischer SS 2009 1 Suffix rees, -Arrays and -rays 1.1 Recommended Reading Dan Gusfield: Algorithms on Strings, rees, and Sequences. 1997. ambridge University Press,

More information

Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT)

Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT) Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT) Disclaimer Students attending my lectures are often astonished that I present the material in a much livelier form than in this script.

More information

arxiv: v2 [cs.ds] 5 Mar 2014

arxiv: v2 [cs.ds] 5 Mar 2014 Order-preserving pattern matching with k mismatches Pawe l Gawrychowski 1 and Przemys law Uznański 2 1 Max-Planck-Institut für Informatik, Saarbrücken, Germany 2 LIF, CNRS and Aix-Marseille Université,

More information

Advanced Data Structures

Advanced Data Structures Simon Gog gog@kit.edu - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Predecessor data structures We want to support

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Università degli studi di Udine

Università degli studi di Udine Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1

More information

Advanced Data Structures

Advanced Data Structures Simon Gog gog@kit.edu - Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Predecessor data structures We want to support the following operations on a set of integers from

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

arxiv: v1 [cs.ds] 9 Apr 2018

arxiv: v1 [cs.ds] 9 Apr 2018 From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract

More information

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv Based Compressed Text Indexing Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.

More information

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b) Introduction to Algorithms October 14, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Srini Devadas and Constantinos (Costis) Daskalakis Quiz 1 Solutions Quiz 1 Solutions Problem

More information

Burrows-Wheeler Transforms in Linear Time and Linear Bits

Burrows-Wheeler Transforms in Linear Time and Linear Bits Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.

More information

arxiv: v2 [cs.ds] 8 Apr 2016

arxiv: v2 [cs.ds] 8 Apr 2016 Optimal Dynamic Strings Paweł Gawrychowski 1, Adam Karczmarz 1, Tomasz Kociumaka 1, Jakub Łącki 2, and Piotr Sankowski 1 1 Institute of Informatics, University of Warsaw, Poland [gawry,a.karczmarz,kociumaka,sank]@mimuw.edu.pl

More information

arxiv: v3 [cs.ds] 6 Sep 2018

arxiv: v3 [cs.ds] 6 Sep 2018 Universal Compressed Text Indexing 1 Gonzalo Navarro 2 arxiv:1803.09520v3 [cs.ds] 6 Sep 2018 Abstract Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of

More information

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility Tao Jiang, Ming Li, Brendan Lucier September 26, 2005 Abstract In this paper we study the Kolmogorov Complexity of a

More information

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de

More information

Rotation and Lighting Invariant Template Matching

Rotation and Lighting Invariant Template Matching Rotation and Lighting Invariant Template Matching Kimmo Fredriksson 1, Veli Mäkinen 2, and Gonzalo Navarro 3 1 Department of Computer Science, University of Joensuu. kfredrik@cs.joensuu.fi 2 Department

More information

Sorting suffixes of two-pattern strings

Sorting suffixes of two-pattern strings Sorting suffixes of two-pattern strings Frantisek Franek W. F. Smyth Algorithms Research Group Department of Computing & Software McMaster University Hamilton, Ontario Canada L8S 4L7 April 19, 2004 Abstract

More information

Covering Linear Orders with Posets

Covering Linear Orders with Posets Covering Linear Orders with Posets Proceso L. Fernandez, Lenwood S. Heath, Naren Ramakrishnan, and John Paul C. Vergara Department of Information Systems and Computer Science, Ateneo de Manila University,

More information

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

In-Memory Storage for Labeled Tree-Structured Data

In-Memory Storage for Labeled Tree-Structured Data In-Memory Storage for Labeled Tree-Structured Data by Gelin Zhou A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer

More information

arxiv: v2 [cs.ds] 6 Jul 2015

arxiv: v2 [cs.ds] 6 Jul 2015 Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science

More information

Internal Pattern Matching Queries in a Text and Applications

Internal Pattern Matching Queries in a Text and Applications Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords

More information

A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems

A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems Shunsuke Inenaga 1, Ayumi Shinohara 2,3 and Masayuki Takeda 2,3 1 Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23)

More information

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Notes on Logarithmic Lower Bounds in the Cell Probe Model Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)

More information

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs

Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs Krishnendu Chatterjee Rasmus Ibsen-Jensen Andreas Pavlogiannis IST Austria Abstract. We consider graphs with n nodes together

More information

An O(N) Semi-Predictive Universal Encoder via the BWT

An O(N) Semi-Predictive Universal Encoder via the BWT An O(N) Semi-Predictive Universal Encoder via the BWT Dror Baron and Yoram Bresler Abstract We provide an O(N) algorithm for a non-sequential semi-predictive encoder whose pointwise redundancy with respect

More information

Dynamic Ordered Sets with Exponential Search Trees

Dynamic Ordered Sets with Exponential Search Trees Dynamic Ordered Sets with Exponential Search Trees Arne Andersson Computing Science Department Information Technology, Uppsala University Box 311, SE - 751 05 Uppsala, Sweden arnea@csd.uu.se http://www.csd.uu.se/

More information

Preview: Text Indexing

Preview: Text Indexing Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text

More information

A Linear Time Algorithm for Ordered Partition

A Linear Time Algorithm for Ordered Partition A Linear Time Algorithm for Ordered Partition Yijie Han School of Computing and Engineering University of Missouri at Kansas City Kansas City, Missouri 64 hanyij@umkc.edu Abstract. We present a deterministic

More information

Jumbled String Matching: Motivations, Variants, Algorithms

Jumbled String Matching: Motivations, Variants, Algorithms Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov

More information

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts) Introduction to Algorithms October 13, 2010 Massachusetts Institute of Technology 6.006 Fall 2010 Professors Konstantinos Daskalakis and Patrick Jaillet Quiz 1 Solutions Quiz 1 Solutions Problem 1. We

More information

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Advanced Implementations of Tables: Balanced Search Trees and Hashing Advanced Implementations of Tables: Balanced Search Trees and Hashing Balanced Search Trees Binary search tree operations such as insert, delete, retrieve, etc. depend on the length of the path to the

More information

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th CSE 3500 Algorithms and Complexity Fall 2016 Lecture 10: September 29, 2016 Quick sort: Average Run Time In the last lecture we started analyzing the expected run time of quick sort. Let X = k 1, k 2,...,

More information

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard

More information

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181. Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

Approximate String Matching with Lempel-Ziv Compressed Indexes

Approximate String Matching with Lempel-Ziv Compressed Indexes Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,

More information

Data selection. Lower complexity bound for sorting

Data selection. Lower complexity bound for sorting Data selection. Lower complexity bound for sorting Lecturer: Georgy Gimel farb COMPSCI 220 Algorithms and Data Structures 1 / 12 1 Data selection: Quickselect 2 Lower complexity bound for sorting 3 The

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.

More information

E D I C T The internal extent formula for compacted tries

E D I C T The internal extent formula for compacted tries E D C T The internal extent formula for compacted tries Paolo Boldi Sebastiano Vigna Università degli Studi di Milano, taly Abstract t is well known [Knu97, pages 399 4] that in a binary tree the external

More information

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Binary Search Introduction Problem Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Strategy 1: Random Search Randomly select a page until the page containing

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

Sliding Windows with Limited Storage

Sliding Windows with Limited Storage Electronic Colloquium on Computational Complexity, Report No. 178 (2012) Sliding Windows with Limited Storage Paul Beame Computer Science and Engineering University of Washington Seattle, WA 98195-2350

More information

Motif Extraction from Weighted Sequences

Motif Extraction from Weighted Sequences Motif Extraction from Weighted Sequences C. Iliopoulos 1, K. Perdikuri 2,3, E. Theodoridis 2,3,, A. Tsakalidis 2,3 and K. Tsichlas 1 1 Department of Computer Science, King s College London, London WC2R

More information

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * 1 Jorma Tarhio and Esko Ukkonen Department of Computer Science, University of Helsinki Tukholmankatu 2, SF-00250 Helsinki,

More information

CSE 202 Homework 4 Matthias Springer, A

CSE 202 Homework 4 Matthias Springer, A CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and

More information

Skriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT)

Skriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT) 1 Recommended Reading Skriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT) D. Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. M. Crochemore,

More information

SORTING SUFFIXES OF TWO-PATTERN STRINGS.

SORTING SUFFIXES OF TWO-PATTERN STRINGS. International Journal of Foundations of Computer Science c World Scientific Publishing Company SORTING SUFFIXES OF TWO-PATTERN STRINGS. FRANTISEK FRANEK and WILLIAM F. SMYTH Algorithms Research Group,

More information

Text matching of strings in terms of straight line program by compressed aleshin type automata

Text matching of strings in terms of straight line program by compressed aleshin type automata Text matching of strings in terms of straight line program by compressed aleshin type automata 1 A.Jeyanthi, 2 B.Stalin 1 Faculty, 2 Assistant Professor 1 Department of Mathematics, 2 Department of Mechanical

More information