Forbidden Patterns

Johannes Fischer^1, Travis Gagie^2, Tsvi Kopelowitz^3, Moshe Lewenstein^4, Veli Mäkinen^5, Leena Salmela^5, and Niko Välimäki^5

1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu
2 Aalto University, Espoo, Finland, travis.gagie@aalto.fi
3 Weizmann Institute of Science, Rehovot, Israel, kopelot@gmail.com
4 Bar-Ilan University, Ramat Gan, Israel, moshe@cs.biu.ac.il
5 University of Helsinki, Helsinki, Finland, {vmakinen|leena.salmela|nvalimak}@cs.helsinki.fi

Author footnotes:
Supported by the German Research Foundation (DFG).
Supported by Academy of Finland grant.
Partially funded by the Academy of Finland grant.
Also affiliated with Helsinki Institute for Information Technology (HIIT).
Supported by Academy of Finland grant (ALGODAN).
Partially funded by the Academy of Finland grant (ALGODAN) and Helsinki Doctoral Programme in Computer Science (Hecse).

Abstract. We consider the problem of indexing a collection of documents (a.k.a. strings) of total length $n$ such that the following kind of queries are supported: given two patterns $P^+$ and $P^-$, list all $n_{match}$ documents containing $P^+$ but not $P^-$. This is a natural extension of the classic problem of document listing as considered by Muthukrishnan [SODA'02], where only the positive pattern $P^+$ is given. Our main solution is an index of size $O(n^{3/2})$ bits that supports queries in $O(|P^+| + |P^-| + n_{match} + \sqrt{n})$ time.

1 Introduction

In recent years, the pattern matching community has paid considerable attention to document retrieval tasks where, in contrast to traditional indexed pattern matching, the task is to output each document that contains a search pattern at least once, and in particular without spending time proportional to the total number of occurrences of the pattern in that document. Starting with Muthukrishnan's seminal paper [14] (which in fact builds on an earlier paper by the same author with colleagues [12]), an abundance of articles on variations of this scheme has emerged, including space reductions of the underlying data structures [7, 16, 17], ranking of the output [9, 11, 15], two-pattern queries [3, 5], and perhaps many more.

A different possible, and indeed very natural, extension of the basic problem is to exclude some documents from the output. In this setting, the user specifies, in addition to the query pattern $P^+$, a negative pattern $P^-$ that should not occur in the retrieved documents. Formally, this problem of forbidden patterns can be modeled as follows:^6

Given: A collection of static documents $\mathcal{D} = \{D_1, \dots, D_d\}$ over an alphabet $\Sigma$ of total length $n = \sum_{i \le d} |D_i|$.
Compute: An index that, given two patterns $P^+$ and $P^-$ online, allows us to compute all $n_{match}$ documents containing $P^+$, but not $P^-$.

The best we can hope for is certainly an index of linear size with a query time of $O(|P^+| + |P^-| + n_{match})$, as this is the time needed just to read the input and write the output of the query. However, achieving anything close to this optimum seems completely out of reach (at least at the current state of research), as forbidden pattern queries can be regarded as set difference queries, which are arguably at least as hard as set intersection queries. In the realm of document retrieval, those latter queries correspond to the case where two positive patterns $P_1^+$ and $P_2^+$ are given (and one is interested in all documents containing both positive patterns). We are aware of three indexes that address this problem:

(1) $\tilde{O}(n^{3/2})$ words of space with query time $O(|P_1^+| + |P_2^+| + n_{match} + \sqrt{n})$ [5],
(2) $O(n \log n)$ words and $O(|P_1^+| + |P_2^+| + (\sqrt{n_{match}\, n \log n} + n_{match}) \log^2 n)$ time [3], and
(3) $O(n)$ words and $O(|P_1^+| + |P_2^+| + (\sqrt{n_{match}\, n \log n} + n_{match}) \log n)$ time [8] (this improves on [3] in both time and space and has the further advantage that it generalizes to more than two patterns).

In Appendix A of this paper, we show that this problem of two positive patterns is indeed harder than document retrieval with just one pattern.^7 However, the main body of this paper is devoted to the forbidden patterns problem. The following theorem summarizes our main result.

Theorem 1. For a text collection of total length $n$, there exists a data structure of size $O(n^{3/2})$ bits such that subsequent forbidden pattern queries can be answered in $O(|P^+| + |P^-| + n_{match} + \sqrt{n})$ time.

The rest of this article is structured as follows. Sect. 2 introduces known results that form the basic building blocks of our solution, including a description of the preprocessing algorithm for document retrieval with just one positive pattern [14]. In Sect. 3, we then give data structures for the forbidden patterns problem, where, apart from proving Thm. 1, we also look at the variation of just counting the number of documents. Finally, Sect. 4 concludes the paper.

^6 Muthukrishnan [14] already considered the case where just a negative pattern $P^-$ is given, and has an optimal solution that outputs all documents not containing $P^-$.
^7 Unfortunately, we could not come up with any meaningful lower bound for the forbidden patterns problem.
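To make the query semantics of the problem statement concrete, here is a minimal brute-force baseline. It is only a sketch for checking answers on small inputs: it scans every document at query time and is in no way the index of Thm. 1. The function name `forbidden_query_naive` and the 1-based document numbering are assumptions made for this illustration.

```python
def forbidden_query_naive(documents, p_plus, p_minus):
    """Return the 1-based numbers of documents containing p_plus but not p_minus.

    Brute force: every document is scanned for both patterns, so the cost is
    proportional to the size of the whole collection, not to the output.
    """
    return [
        number
        for number, text in enumerate(documents, start=1)
        if p_plus in text and p_minus not in text
    ]


documents = ["turing", "tape", "enigma", "apple"]
print(forbidden_query_naive(documents, "a", "p"))  # prints [3]
```

Any indexed solution should return the same set of documents as this baseline, which makes it a convenient reference when testing.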
2 Preliminaries

2.1 Succinct Data Structures

Consider a bit-string $S[1, n]$ of length $n$. We define the fundamental rank- and select-operations on $S$ as follows: $\mathrm{rank}_1(S, i)$ gives the number of 1's in the prefix $S[1, i]$, and $\mathrm{select}_1(S, i)$ gives the position of the $i$-th 1 in $S$, reading $S$ from left to right ($1 \le i \le n$). The following lemma summarizes a by-now classic result (see, e.g., [13]):

Lemma 1. A bit-string of length $n$ can be represented in $n + o(n)$ bits such that rank- and select-operations are supported in $O(1)$ time.

2.2 Range Minimum Queries

A basic building block for our solution is a space-efficient preprocessing scheme for $O(1)$ range minimum queries. For a static array $E[1, n]$ of $n$ objects from a totally ordered universe and two indices $i$ and $j$ with $1 \le i \le j \le n$, a range minimum query $\mathrm{rmq}_E(i, j)$ returns the position of a minimum element in the sub-array $E[i, j]$; in symbols: $\mathrm{rmq}_E(i, j) = \operatorname{argmin}\{ E[k] \mid i \le k \le j \}$. We state the following result [6, Thm. 5.8]:

Lemma 2. A static array $E[1, n]$ can be preprocessed in $O(n)$ time into a data structure of size $2n + o(n)$ bits such that subsequent range minimum queries on $E$ can be answered in $O(1)$ time, without consulting $E$ at query time. This size is asymptotically optimal.

2.3 Document Retrieval

We now explain Muthukrishnan's solution [14] for document retrieval with only one positive pattern $P$ (in fact, we describe a variant [16] of the original algorithm that is more convenient for our purposes). The overall idea is to build a generalized suffix tree $ST$ for the collection of documents $\mathcal{D} = \{D_1, \dots, D_d\}$, and enhance it with additional information for reporting the documents. A generalized suffix tree for $\mathcal{D}$ is a suffix tree for the text $T := D_1 \#_1 D_2 \#_2 \dots D_d \#_d$, where the $\#_i$'s are distinct characters not appearing elsewhere in $\mathcal{D}$. A suffix tree $ST$ on $T$, in turn, is a compacted trie on all suffixes of $T$, and consists of only $O(|T|)$ nodes.
Every such node $v$ is said to correspond to a substring $\alpha$ of $T$ iff the letters on the root-to-$v$ path are exactly $\alpha$. A suffix tree $ST$ for $T$ allows us to locate all $occ$ occurrences of a search pattern $P$ in $T$ in optimal $O(|P| + occ)$ time (with perfect hashing; otherwise the search takes $O(|P| \log |\Sigma| + occ)$ time). This search proceeds in two steps: it first finds in $O(|P|)$ time the node $v$ in $ST$ such that all leaves below $v$ correspond to the $occ$ suffixes that are prefixed by $P$. In a second step, the starting points of all these suffixes are reported in additional $O(occ)$ time.

For document retrieval, it should be clear that we can reuse only the first part of this search, but must modify the second step such that it uses $O(n_{match})$ instead of $O(occ)$ time. To this end, Muthukrishnan's solution proceeds as follows. Consider the leaves of $ST$ in lexicographic order. The positions in $T$ of their corresponding suffixes form a permutation of the numbers $[1, n']$ ($n' = n + d = O(n)$ being the size of $T$), the so-called suffix array $A[1, n']$. Define a document array $D[1, n']$ of the same size as $A$, such that $D[i]$ holds the document number of the lexicographically $i$-th suffix. More formally, $D[i] = j$ iff $\#_j$ is the first document separator in $T[A[i], n']$. We now chain suffixes from the same document in a new array $E[1, n']$ by defining $E[i] = \max\{ j < i \mid D[j] = D[i] \}$, where the maximum of the empty set is assumed to be $-\infty$. Array $E$ is prepared for constant-time range minimum queries using Lemma 2.

With these data structures, we can obtain optimal $O(|P| + n_{match})$ listing time, as explained next. We first use $ST$ to find in $O(|P|)$ time the interval $[l, r]$ in $A$ such that the suffixes in $A[l, r]$ are exactly those that are prefixed by $P$. We then call the recursive procedure list in Alg. 1, initially invoked by list($l$, $r$) and assuming $V[i] = 0$ for all $1 \le i \le d$ just before that first call:

Algorithm 1: List all documents in $D[i, j]$ not occurring in $D[l, i-1]$.
  procedure list(i, j)
    if i > j then return
    m ← rmq_E(i, j)
    if V[D[m]] = 0 then
      output D[m]
      V[D[m]] ← 1
      list(i, m − 1)
      list(m + 1, j)

The idea of procedure list is that each distinct document identifier in $D[l, r]$ is listed only at the place $m$ of its leftmost occurrence in the interval $[l, r]$; such places are conveniently located by range minimum queries on $E$.
To avoid duplicate outputs, we mark all documents found by a 1 in an additional array $V[1, d]$, which is initialized with all 0's in the preprocessing phase. Whenever the smallest element in $E[i, j]$ comes from a document already reported (hence $V[D[m]] = 1$), the recursion can be stopped, since every document in $D[i, j]$ is reported when visiting distinct intervals $[i', j']$ with $l \le i' \le j' < i$. Hence, the overall running time is $O(n_{match})$. At the end of the reporting phase, we need to reset $V[\cdot]$ to 0 for all documents in the output. This takes additional $O(n_{match})$ time. Apart from the suffix tree $ST$, the space for this solution is dominated by the $n$ words needed for storing the document array $D$.
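The listing procedure of this section can be sketched in Python as follows. This is an illustration under several assumptions, not the authors' implementation: indices are 0-based, the value -1 stands in for $-\infty$, a linear scan replaces the constant-time RMQ structure of Lemma 2, a Python set replaces the bit-vector $V$ (so no reset step is needed), and an explicit stack replaces the recursion while preserving the left-before-right order. All function names are hypothetical.

```python
def build_chain_array(doc_array):
    """E[i] = largest j < i with doc_array[j] == doc_array[i], or -1 if none."""
    last_seen = {}
    chain = []
    for i, doc in enumerate(doc_array):
        chain.append(last_seen.get(doc, -1))
        last_seen[doc] = i
    return chain


def rmq_naive(values, i, j):
    """Position of a minimum of values[i..j] (inclusive); linear-time stand-in."""
    return min(range(i, j + 1), key=values.__getitem__)


def list_documents(doc_array, chain, l, r):
    """Report each distinct document in doc_array[l..r] exactly once."""
    visited = set()      # plays the role of the bit-vector V
    output = []
    stack = [(l, r)]
    while stack:
        i, j = stack.pop()
        if i > j:
            continue
        m = rmq_naive(chain, i, j)
        if doc_array[m] in visited:
            continue     # every document in [i, j] was reported further left
        visited.add(doc_array[m])
        output.append(doc_array[m])
        stack.append((m + 1, j))   # pushed first, so handled after the left part
        stack.append((i, m - 1))
    return output


doc_array = [1, 2, 1, 3, 2, 2, 4, 1]
chain = build_chain_array(doc_array)
print(sorted(list_documents(doc_array, chain, 2, 5)))  # prints [1, 2, 3]
```

With a constant-time RMQ structure in place of `rmq_naive`, each stack entry costs constant time and there are at most a constant number of entries per reported document, matching the $O(n_{match})$ bound stated above.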
3 Document Retrieval with Forbidden Patterns

We now come to the description of our solution to the problem of forbidden patterns, as presented in the introduction. We proceed by first presenting a rather simple solution (Sect. 3.1), which is then subsequently refined (Sects. 3.2 and 3.3).

3.1 $O(n^2)$ Words of Space

We first show how to achieve optimal $O(|P^+| + |P^-| + n_{match})$ query time with $O(n^2)$ space. The idea is again to store a generalized suffix tree $ST$ for the set of documents $\mathcal{D}$ and enhance it with additional information. In particular, for every node $v$ in $ST$ corresponding to string $\alpha$, we store a copy of $ST$ that excludes the documents containing $\alpha$. We call that copy $ST_v$. Every $ST_v$ is prepared for normal document listing (Sect. 2.3). When a query arrives, we first match $P^-$ in $ST$ until reaching node $v$ (if the matching ends on an edge, we take the following node). We then jump to $ST_v$, where we match $P^+$, and list all $n_{match}$ documents in optimal $O(n_{match})$ time. The space for this solution is clearly $O(n^2)$ words, as the number of nodes in $ST$ is $O(n)$, and for each such node $v$ we build another generalized suffix tree $ST_v$, all of which could contain $O(n)$ nodes in the worst case.

3.2 $O(n^2)$ Bits of Space

We now reduce the solution from the previous section to $O(n^2)$ bits. Our aim is to reuse the full suffix tree $ST$ also when matching the positive pattern $P^+$, and to use a modified RMQ-structure when reporting documents by procedure list (see Sect. 2.3). To this end, let $v$ be a node in $ST$, and let $\mathcal{D}_v$ be the set of documents containing the string represented by $v$ (hence $\mathcal{D}_v$ is the set of documents in $D[l, r]$ if $A[l, r]$ is the suffix array interval for $v$ in the sense of Sect. 2.3). In a (conceptual) copy $E_v$ of the global chaining array $E$, we blank out all entries corresponding to documents in $\mathcal{D}_v$ by setting the corresponding values to $+\infty$. More precisely,

$E_v[i] = \begin{cases} +\infty & \text{if } D[i] \in \mathcal{D}_v, \text{ and} \\ E[i] & \text{otherwise.} \end{cases}$
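The blanking of the chaining array can be illustrated with a small Python sketch. It assumes 0-based indices, uses -1 for $-\infty$ and `math.inf` for $+\infty$, and replaces the constant-time RMQ structure by a linear scan; the function names are hypothetical. The listing loop stops as soon as the minimum of an interval is $+\infty$, since then every entry of that interval belongs to a forbidden document.

```python
import math


def build_chain_array(doc_array):
    """E[i] = largest j < i with doc_array[j] == doc_array[i], or -1 if none."""
    last_seen, chain = {}, []
    for i, doc in enumerate(doc_array):
        chain.append(last_seen.get(doc, -1))
        last_seen[doc] = i
    return chain


def blank_chain_array(doc_array, chain, forbidden_docs):
    """E_v: copy of E whose entries for forbidden documents are +infinity."""
    return [math.inf if doc in forbidden_docs else e
            for doc, e in zip(doc_array, chain)]


def list_allowed(doc_array, blanked, l, r):
    """Distinct non-forbidden documents in doc_array[l..r] (inclusive)."""
    visited, output, stack = set(), [], [(l, r)]
    while stack:
        i, j = stack.pop()
        if i > j:
            continue
        # Linear scan as a stand-in for the O(1) RMQ structure of Lemma 2.
        m = min(range(i, j + 1), key=blanked.__getitem__)
        if blanked[m] == math.inf or doc_array[m] in visited:
            continue  # only forbidden or already reported documents remain
        visited.add(doc_array[m])
        output.append(doc_array[m])
        stack.append((m + 1, j))  # right part, handled after the left part
        stack.append((i, m - 1))
    return output


doc_array = [1, 2, 1, 3, 2, 2, 4, 1]
chain = build_chain_array(doc_array)
blanked = blank_chain_array(doc_array, chain, forbidden_docs={2})
print(sorted(list_allowed(doc_array, blanked, 1, 6)))  # prints [1, 3, 4]
```

Note that the sketch keeps the blanked array itself, whereas the text stores only an RMQ-structure over it; the sketch therefore shows the logic, not the space bound.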
Now each such $E_v$ is prepared for range minimum queries using Lemma 2, and only this RMQ-structure (not the array $E_v$ itself!) is stored at node $v$. A further bit-vector $B_v[1, n']$ at node $v$ marks those positions with a 1 that correspond to documents in $\mathcal{D}_v$; in symbols: $B_v[i] = 1$ if $E_v[i] = +\infty$, and $B_v[i] = 0$ otherwise. Hence, the total space needed is $3n' + o(n') = O(n)$ bits per node in $ST$. We also store the global document array $D[1, n']$, plus the bit-vector $V[1, d]$ needed by Alg. 1, needing $O(n \log n)$ and $d = O(n)$ bits, respectively. The space for the entire data structure thereby amounts to $O(n^2)$ bits.

The query processing starts as in the previous section: we first match $P^-$ in $ST$ until reaching node $v$ (and again take the following node if we end on an edge). Now instead of jumping to $ST_v$ (which is no longer stored), we use $ST$ again to find the interval $[l, r]$ in the suffix array $A$ such that the suffixes in $A[l, r]$ are exactly those that are prefixed by $P^+$. We then call procedure list($l$, $r$), but using the RMQ-structure for $E_v$ instead of $E$ (corresponding to the negative pattern $P^-$). We need to further modify that procedure such that it does not list those documents in $\mathcal{D}_v$; this can be accomplished by checking if the $m$-th bit in $B_v$ is set to 0. If, on the other hand, $B_v[m] = 1$, we can stop the recursion, as in that case all other entries in $E_v[i, j]$ must also be $+\infty$ (and hence come from documents in $\mathcal{D}_v$). The complete modified algorithm list' can be seen in Alg. 2. As before, after having listed all $n_{match}$ documents, we need to unmark the listed documents in $V$ in additional $O(n_{match})$ time in order to prepare for the next query.

Algorithm 2: Modified procedure for document listing.
  procedure list'(i, j)
    if i > j then return
    m ← rmq_{E_v}(i, j)
    if V[D[m]] = 0 and B_v[m] = 0 then
      output D[m]
      V[D[m]] ← 1
      list'(i, m − 1)
      list'(m + 1, j)

3.3 $O(n^{3/2})$ Bits of Space

We now present a space/time tradeoff for the solution given in the previous section. Our general idea is to store the RMQ-structures only at a selected subset of nodes in $ST$, thereby possibly listing false documents that need to be filtered at query time. For what follows, the reader should also consult the example shown in Fig. 1.

We assign a weight $w_v$ to each node $v$ in $ST$ as follows. As before, let $\mathcal{D}_v$ denote the set of documents below $v$; i.e., the set of documents in $\mathcal{D}$ that contain the string represented by $v$.
Then the weight of $v$ is defined to be the number of documents in $\mathcal{D}_v$, $w_v = |\mathcal{D}_v|$. In $ST$, we mark certain nodes as important. The RMQ-structures will only be stored at important nodes. We will make sure that each node $v$ has an important successor $u$ such that $w_v \le w_u + s$, for some integer $s$ to be determined later. At $v$, we also store a pointer to this important successor $u$. Let $p_v = u$ denote this pointer (for important nodes $v$ we define $p_v = v$). At query time, when the search for $P^-$ ends at $v$, we use the algorithm from the previous section, but now with the RMQ-structure for $E_{p_v}$. This reports at most $s$ false documents, which need to be discarded from the output.

It thus remains to identify the false documents. For this, we need not store any additional data structures, as explained next. Let $\alpha$ denote the string represented by node $p_v$. Observe that the false documents are exactly those that contain $P^-$, but not $\alpha$; but this corresponds to a forbidden pattern query, with $P^-$ as the positive pattern, and $\alpha$ as the negative one! And for this query all necessary data structures are at hand, because at $p_v$ there exists an RMQ-structure that filters exactly those documents containing $\alpha$.

In summary, to answer the query "$P^+$ and not $P^-$", we first match $P^-$ in $ST$ up to node $v$, where we follow the pointer $p_v$ to an important successor representing string $\alpha$. At $p_v$, we answer the query "$P^-$ and not $\alpha$" with the algorithm from Sect. 3.2 to identify the set of false documents, which we mark in a bit-vector $F[1, d]$. Finally, we answer the query "$P^+$ and not $\alpha$", again by using the algorithm from Sect. 3.2, but this time outputting only those documents not marked in $F$. In the end, all marked documents are unmarked to prepare for the next query. As by definition the marked (= false) documents are at most $s$ in number, the total query time is $O(|P^+| + |P^-| + n_{match} + s)$.

[Figure 1: drawing of the generalized suffix tree for $T = \text{turing}\#_1\,\text{tape}\#_2\,\text{enigma}\#_3\,\text{apple}\#_4$.]
Fig. 1. Generalized suffix tree (with super-leaf $\ell$) for the collection of documents $\mathcal{D} = \{\text{turing}, \text{tape}, \text{enigma}, \text{apple}\}$ (with irrelevant parts pruned). The number inside a node $v$ denotes its weight $w_v$. Important nodes (assuming $s = 2$) are shown in bold. Hence, in this example only 3 RMQ-structures are stored.

Identifying Important Nodes. It remains to show how the important nodes can be identified. We do this in a bottom-up traversal of $ST$.
We first enhance $ST$ with a super-leaf $\ell$ that is the single child of all original leaves in $ST$. This node $\ell$ has weight $w_\ell = 0$, is marked as important, and stores an RMQ-structure for the original array $E$. During the bottom-up traversal of all original nodes in $ST$ (excluding the super-leaf $\ell$), let us assume we arrive at a node $v$ with children $v_1, \dots, v_k$. By induction, all $v_i$'s have already been assigned an important successor $p_{v_i}$. Let $v_m$ be the child of $v$ having an important successor occurring in most documents, $m = \operatorname{argmax}\{ w_{p_{v_i}} \mid 1 \le i \le k \}$. Then if $w_v - w_{p_{v_m}} \le s$, we set $p_v$ to $p_{v_m}$. Otherwise, we mark $v$ as important, create an RMQ-structure for $E_v$, and set $p_v = v$.

Space Analysis. To analyze the space, consider the subtree $ST_I$ of $ST$ consisting only of important nodes and their mutual lowest common ancestors (in the terminology of Cole et al. [4], $ST_I$ is the subtree of $ST$ induced by the important nodes). The nodes in $ST_I$ are further divided into two different classes: (1) non-branching internal, and (2) other. A node belongs to class non-branching internal iff it has exactly one child in $ST_I$; otherwise it belongs to class other. We analyze the number of non-branching internal and other nodes separately.

Let us first consider the other nodes. They form yet another induced subtree of $ST_I$; let us call it $ST'_I$. A leaf $v$ in $ST'_I$ covers at least $s$ (original) leaves in $ST$, as its weight must by definition satisfy $w_v > s$, and in order to cover $s$ documents at least $s$ suffixes from $T$ must be covered. Hence, the number of leaves in $ST'_I$ is bounded from above by $n/s$, and because $ST'_I$ is compact, the total number of nodes in $ST'_I$ is $O(n/s)$.

Now look at the non-branching internal nodes of $ST_I$. Because every such node $v$ must have $w_v - w_u > s$ for its single child $u$ in $ST_I$, this increase in weight can only come from at least $s$ (original) leaves of $ST$ for which $v$ is their nearest important ancestor. As every leaf in $ST$ can contribute to at most one such non-branching internal node, the number of non-branching internal nodes is also bounded by $O(n/s)$.
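The bottom-up marking rule can be sketched as follows. This is an illustration only: the tree is assumed to be given as a dictionary of child lists together with precomputed weights $w_v$, the super-leaf is modelled by a sentinel string, and the names `assign_important_successors` and `SUPER_LEAF` are hypothetical. The RMQ-structures that would be built at important nodes are omitted.

```python
SUPER_LEAF = "super-leaf"  # sentinel: weight 0, always important


def assign_important_successors(children, weight, root, s):
    """Return (pointer, important) following the bottom-up rule with slack s.

    children: dict mapping a node to the list of its children (leaves may be absent)
    weight:   dict mapping a node v to w_v, the number of documents below v
    """
    w = dict(weight)
    w[SUPER_LEAF] = 0
    pointer = {SUPER_LEAF: SUPER_LEAF}
    important = {SUPER_LEAF}

    # Pre-order via a stack; its reverse visits all children before their parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    for v in reversed(order):
        kids = children.get(v) or [SUPER_LEAF]  # original leaves hang above the super-leaf
        best = max((pointer[c] for c in kids), key=w.__getitem__)
        if w[v] - w[best] <= s:
            pointer[v] = best          # reuse the heaviest important successor
        else:
            important.add(v)           # an RMQ-structure for E_v would be built here
            pointer[v] = v
    return pointer, important


children = {"root": ["a", "b"], "a": ["c", "d"], "d": ["e", "f"]}
weight = {"root": 4, "a": 3, "b": 1, "c": 1, "d": 2, "e": 1, "f": 1}
pointer, important = assign_important_successors(children, weight, "root", s=1)
print(sorted(important))  # prints ['d', 'root', 'super-leaf']
```

By construction every node $v$ ends up with $w_v - w_{p_v} \le s$, which is the property the query algorithm relies on to bound the number of false documents.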
In total, we store the RMQ-structures (using $O(n)$ bits each) only at $O(n/s)$ nodes of $ST$. By setting $s = \sqrt{n}$, we obtain Thm. 1.

3.4 Approximate Counting Queries

If only the number of documents containing $P^+$ but not $P^-$ matters, we can obtain a faster data structure than that of Thm. 1. We first explain Sadakane's succinct data structure for document counting in the presence of just a positive pattern $P$ [16, Sect. 5.2]. Without going too much into the details, the idea of that algorithm is as follows: for each node $v$ in $ST$ we store in $c(v)$ how many duplicate suffixes from the same document occur in its subtree. When a query pattern $P$ arrives, we first locate the node $v$ representing $P$ and its suffix array interval $[l, r]$. Then $r - l + 1 - c(v)$ gives the number of documents containing $P$, as $r - l + 1$ is the total number of occurrences, and $c(v)$ is the right correction term that needs to be subtracted. (This idea was first used by Hui [10], who also shows how to compute the correction terms in overall linear time by lowest common ancestor queries.) The $c(v)$-numbers are not stored directly in each node (this would need $O(n \log n)$ bits overall), but rather in a bit-vector $H$ of length $O(n)$ such that $c(v)$ can be computed by a constant number of rank- and select-queries on $H$ (in fact by counting the 0's in a certain interval in $H$), given $v$'s suffix array interval $[l, r]$. In total, this solution uses $O(n)$ bits (see Lemma 1).

For forbidden patterns, we modify the data structure from Sect. 3.3 and store only the following information at important nodes $v$:

- Vector $B_v$, marking the documents in $\mathcal{D}_v$, as defined in Sect. 3.2. Each $B_v$ is prepared for constant-time $\mathrm{rank}_1$-queries.
- Vector $H_v$, which is the bit-vector $H$ as defined in the preceding paragraph, but modified to exclude those documents from $\mathcal{D}_v$. Using $H_v$ we can compute for any node $u$ a modified correction term $c_v(u)$ that excludes the documents containing the string represented by $v$. Vector $H_v$ is at most as long as $H$, hence using only $O(n)$ bits.

Finally, each node $v$ in $ST$ stores the pointer $p_v$ to its nearest important successor.

When we need to answer a query of the form "what is the approximate number of documents containing $P^+$ but not $P^-$?", we first match $P^-$ in $ST$ until reaching node $v$. We then match $P^+$ in $ST$, say until reaching $u$. Assume that $u$ corresponds to the suffix array interval $[l, r]$. Now observe that the number of 1's in $B_{p_v}[l, r]$ gives the number of suffixes below $u$ that come from forbidden documents (those in $\mathcal{D}_{p_v}$), and that this number can be computed by $f = \mathrm{rank}_1(B_{p_v}, r) - \mathrm{rank}_1(B_{p_v}, l - 1)$. So $r - l + 1 - f$ is the number of suffixes below $u$ excluding those from documents in $\mathcal{D}_{p_v}$. Hence, we return $\tilde{n}_{match} = r - l + 1 - f - c_{p_v}(u)$ as the approximate answer to $n_{match}$, which satisfies $n_{match} \le \tilde{n}_{match} \le n_{match} + \sqrt{n}$ by the definition of important nodes (with $s = \sqrt{n}$).

Theorem 2.
For a text collection of total length $n$, there exists a data structure of size $O(n^{3/2})$ bits such that subsequent queries asking for the approximate number of documents containing $P^+$ but not $P^-$ can be answered in $O(|P^+| + |P^-|)$ time, with an additive error of at most $\sqrt{n}$.

4 Conclusions

We initiated the study of document retrieval in the presence of forbidden patterns. Apart from improving on the space consumption and/or query time of our data structures, we point out the following subjects deserving further investigation: (1) document retrieval with more than two patterns, both positive and negative, (2) lower bounds for forbidden patterns, possibly in the spirit of Appendix A, but also in more realistic machine models such as the word-RAM, and (3) algorithmic engineering and comparison with inverted indexes.

References

1. B. Chazelle. Lower bounds for orthogonal range searching, I: The reporting case. J. ACM, 37.
2. Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Proc. DCC. IEEE Press.
3. H. Cohen and E. Porat. Fast set intersection and two-patterns matching. Theor. Comput. Sci., 411(40-42).
4. R. Cole, M. Farach-Colton, R. Hariharan, T. M. Przytycka, and M. Thorup. An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J. Comput., 30(5).
5. P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. J. Comput. Syst. Sci., 66(4).
6. J. Fischer and V. Heun. Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2).
7. T. Gagie, S. J. Puglisi, and A. Turpin. Range quantile queries: Another virtue of wavelet trees. In Proc. SPIRE, volume 5721 of LNCS, pages 1-6. Springer.
8. W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. String retrieval for multipattern queries. In Proc. SPIRE, volume 6393 of LNCS. Springer.
9. W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In Proc. FOCS. IEEE Computer Society.
10. L. C. K. Hui. Color set size problem with application to string matching. In Proc. CPM, volume 644 of LNCS. Springer.
11. M. Karpinski and Y. Nekrich. Top-k color queries for document retrieval. In Proc. SODA. ACM/SIAM.
12. Y. Matias, S. Muthukrishnan, S. C. Şahinalp, and J. Ziv. Augmenting suffix trees, with applications. In Proc. ESA, volume 1461 of LNCS. Springer.
13. J. I. Munro and V. Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3).
14. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA. ACM/SIAM.
15. G. Navarro and Y. Nekrich. Top-k document retrieval in optimal time and linear space. In Proc. SODA. ACM/SIAM, to appear.
16. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12-22.
17. N. Välimäki and V. Mäkinen. Space-efficient algorithms for document retrieval. In Proc. CPM, volume 4580 of LNCS. Springer.

Appendix A: A Lower Bound for Two Positive Patterns

Consider the two-pattern matching problem in [3]:
Given: A collection of static documents $\mathcal{D} = \{D_1, \dots, D_d\}$ over an alphabet $\Sigma$ of total length $n = \sum_{i \le d} |D_i|$.
Compute: An index that, given two patterns $P_1^+$ and $P_2^+$ online, allows us to compute all $n_{match}$ documents containing $P_1^+$ and $P_2^+$.

We give a new lower bound result for the problem using the technique introduced in [2]. The reduction is from 4-dimensional range queries: Given a set $S$ of $n$ points in $\mathbb{R}^4$, each represented with $4h$ bits, preprocess $S$ to find the points $\{ s \mid s \in S \cap ([x_l, x_r] \times [y_l, y_r] \times [z_l, z_r] \times [t_l, t_r]) \}$, where $x$, $y$, $z$, and $t$ denote the 4 coordinate ranges. Chazelle [1] showed that on a pointer machine, an index supporting $d$-dimensional range searching in $O(\mathrm{polylog}(n) + occ)$ query time requires $\Omega(n (\log n / \log\log n)^{d-1})$ words of storage.

First, we can assume that points in $S$ are from a $[1, n] \times [1, n] \times [1, n] \times [1, n]$ grid, since sorting and mapping $n$ points in $\mathbb{R}^4$ to their ranks in each coordinate take $O(nh)$ space and $O(n \log n)$ time. Later, a range query can be cast to the corresponding one on the ranks in $O(\log n)$ time. From the $i$-th point $s = (x, y, z, t) \in S$ we create the document

$D_i = \overleftarrow{\langle y_i \rangle}\, \#_1\, \langle x_i \rangle\, \#_2\, \overleftarrow{\langle t_i \rangle}\, \#_3\, \langle z_i \rangle, \qquad (1)$

where $\langle c \rangle$ denotes the $h = \Theta(\log n)$-bit binary representation of integer $c$, and $\overleftarrow{\langle c \rangle}$ its reverse (and the $\#_i$'s are again new characters).

Consider a balanced binary tree on values $1, 2, \dots, n$ in its leaves in this order. Associating 0 with left-branches and 1 with right-branches, paths from the root to the leaves define prefix codes for the values. An interval $[c, d]$ can be partitioned into $O(\log n)$ intervals such that each interval corresponds to a different subtree; denote by $\mathcal{P}(c, d)$ the set of $O(\log n)$ prefix codes defined by paths from the root to the roots of these $O(\log n)$ subtrees. We cast a given 4-dimensional range query $[x_l, x_r] \times [y_l, y_r] \times [z_l, z_r] \times [t_l, t_r]$ into $O(\log^4 n)$ two-pattern queries

$P_1^+ = \overleftarrow{y}\, \#_1\, x \quad \text{and} \quad P_2^+ = \overleftarrow{t}\, \#_3\, z \qquad (2)$

for all $(x, y, z, t) \in \mathcal{P}(x_l, x_r) \times \mathcal{P}(y_l, y_r) \times \mathcal{P}(z_l, z_r) \times \mathcal{P}(t_l, t_r)$.
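The partition of an interval into prefix codes can be sketched as follows. The sketch assumes that the number of leaves is a power of two, $n = 2^h$, so that the leaf holding value $v$ has the $h$-bit code of $v - 1$; this offset and the function names are assumptions of the illustration, not details taken from the reduction.

```python
def prefix_codes(c, d, h):
    """Prefix codes of the maximal subtrees that exactly cover the leaves c..d.

    The tree is a complete binary tree over the values 1..2**h; '0' is a left
    branch and '1' a right branch. At most O(h) codes are returned.
    """
    codes = []

    def visit(lo, hi, code):
        if hi < c or d < lo:          # subtree disjoint from [c, d]
            return
        if c <= lo and hi <= d:       # subtree entirely inside [c, d]
            codes.append(code)
            return
        mid = (lo + hi) // 2
        visit(lo, mid, code + "0")
        visit(mid + 1, hi, code + "1")

    visit(1, 2 ** h, "")
    return codes


def leaf_code(value, h):
    """Root-to-leaf code of `value` in the same tree: value - 1 written with h bits."""
    return format(value - 1, "0{}b".format(h))


print(prefix_codes(3, 6, 3))  # prints ['01', '10']
```

A value lies in $[c, d]$ exactly when its leaf code starts with one of the returned prefix codes, which is what lets a range condition on one coordinate be expressed as a small number of string prefix (or, after reversal, suffix) conditions.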
One can now see that these $O(\log^4 n)$ two-pattern queries on $\mathcal{D} = \{D_1, \dots, D_n\}$ of total length $N = \Theta(n \log n)$ bits, constructed using Eq. (1), solve the 4-dimensional range reporting query.

Theorem 3. On a pointer machine, an index on a document collection of total length $n$ supporting two-pattern matching in $O(|P_1^+| + |P_2^+| + n_{match} + \mathrm{polylog}\, n)$ time requires $\Omega(n (\log n / \log\log n)^3)$ bits in the worst case.

Proof. The reduction showed the connection between an $N$-bit collection and 4-dimensional range queries on $\Theta(N / \log N)$ points, so recasting the result with a collection of length $n$ gives $\Omega\big( (n / \log n) \, (\log(n / \log n) / \log\log(n / \log n))^{4-1} \big)$ words, i.e. $\Omega(n (\log n / \log\log n)^3)$ bits.
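The final step of the proof can be unpacked as follows. This is a sketch that assumes a machine word of $\Theta(\log n)$ bits, which the proof uses implicitly when converting words to bits; the symbol $m$ for the number of encoded points is introduced here only for the derivation.

```latex
\begin{align*}
m &= \Theta\!\left(\frac{n}{\log n}\right)
   && \text{(points encoded by a collection of $n$ bits)} \\
\log m &= \Theta(\log n), \qquad \log\log m = \Theta(\log\log n) \\
\Omega\!\left(m \left(\frac{\log m}{\log\log m}\right)^{4-1}\right) \text{ words}
  &= \Omega\!\left(\frac{n}{\log n} \left(\frac{\log n}{\log\log n}\right)^{3}\right) \text{ words} \\
  &= \Omega\!\left(n \left(\frac{\log n}{\log\log n}\right)^{3}\right) \text{ bits}
   && \text{(one word is $\Theta(\log n)$ bits)}
\end{align*}
```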
More informationDynamic Entropy-Compressed Sequences and Full-Text Indexes
Dynamic Entropy-Compressed Sequences and Full-Text Indexes VELI MÄKINEN University of Helsinki and GONZALO NAVARRO University of Chile First author funded by the Academy of Finland under grant 108219.
More informationOnline Sorted Range Reporting and Approximating the Mode
Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted
More informationText Indexing: Lecture 6
Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question
More informationA Faster Grammar-Based Self-index
A Faster Grammar-Based Self-index Travis Gagie 1,Pawe l Gawrychowski 2,, Juha Kärkkäinen 3, Yakov Nekrich 4, and Simon J. Puglisi 5 1 Aalto University, Finland 2 University of Wroc law, Poland 3 University
More informationCell Probe Lower Bounds and Approximations for Range Mode
Cell Probe Lower Bounds and Approximations for Range Mode Mark Greve, Allan Grønlund Jørgensen, Kasper Dalgaard Larsen, and Jakob Truelsen MADALGO, Department of Computer Science, Aarhus University, Denmark.
More informationSuccinct Suffix Arrays based on Run-Length Encoding
Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space
More informationarxiv: v1 [cs.ds] 30 Nov 2018
Faster Attractor-Based Indexes Gonzalo Navarro 1,2 and Nicola Prezza 3 1 CeBiB Center for Biotechnology and Bioengineering 2 Dept. of Computer Science, University of Chile, Chile. gnavarro@dcc.uchile.cl
More informationFast Fully-Compressed Suffix Trees
Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University
More informationOptimal Dynamic Sequence Representations
Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on
More informationSpace-Efficient Re-Pair Compression
Space-Efficient Re-Pair Compression Philip Bille, Inge Li Gørtz, and Nicola Prezza Technical University of Denmark, DTU Compute {phbi,inge,npre}@dtu.dk Abstract Re-Pair [5] is an effective grammar-based
More informationA Simple Alphabet-Independent FM-Index
A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl
More informationString Range Matching
String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings
More informationLecture 2 September 4, 2014
CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 2 September 4, 2014 Scribe: David Liu 1 Overview In the last lecture we introduced the word RAM model and covered veb trees to solve the
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte
More informationImproved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards
More informationBreaking a Time-and-Space Barrier in Constructing Full-Text Indices
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their
More informationOn Compressing and Indexing Repetitive Sequences
On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv
More informationString Indexing for Patterns with Wildcards
MASTER S THESIS String Indexing for Patterns with Wildcards Hjalte Wedel Vildhøj and Søren Vind Technical University of Denmark August 8, 2011 Abstract We consider the problem of indexing a string t of
More informationReducing the Space Requirement of LZ-Index
Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of
More informationAdvanced Text Indexing Techniques. Johannes Fischer
Advanced ext Indexing echniques Johannes Fischer SS 2009 1 Suffix rees, -Arrays and -rays 1.1 Recommended Reading Dan Gusfield: Algorithms on Strings, rees, and Sequences. 1997. ambridge University Press,
More informationSkriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT)
Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT) Disclaimer Students attending my lectures are often astonished that I present the material in a much livelier form than in this script.
More informationarxiv: v2 [cs.ds] 5 Mar 2014
Order-preserving pattern matching with k mismatches Pawe l Gawrychowski 1 and Przemys law Uznański 2 1 Max-Planck-Institut für Informatik, Saarbrücken, Germany 2 LIF, CNRS and Aix-Marseille Université,
More informationAdvanced Data Structures
Simon Gog gog@kit.edu - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Predecessor data structures We want to support
More informationSmall-Space Dictionary Matching (Dissertation Proposal)
Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length
More informationUniversità degli studi di Udine
Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,
More informationON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION
ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced
More informationLower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)
Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1
More informationAdvanced Data Structures
Simon Gog gog@kit.edu - Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Predecessor data structures We want to support the following operations on a set of integers from
More informationOn Pattern Matching With Swaps
On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720
More informationarxiv: v1 [cs.ds] 9 Apr 2018
From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract
More informationStronger Lempel-Ziv Based Compressed Text Indexing
Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.
More informationQuiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)
Introduction to Algorithms October 14, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Srini Devadas and Constantinos (Costis) Daskalakis Quiz 1 Solutions Quiz 1 Solutions Problem
More informationBurrows-Wheeler Transforms in Linear Time and Linear Bits
Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.
More informationarxiv: v2 [cs.ds] 8 Apr 2016
Optimal Dynamic Strings Paweł Gawrychowski 1, Adam Karczmarz 1, Tomasz Kociumaka 1, Jakub Łącki 2, and Piotr Sankowski 1 1 Institute of Informatics, University of Warsaw, Poland [gawry,a.karczmarz,kociumaka,sank]@mimuw.edu.pl
More informationarxiv: v3 [cs.ds] 6 Sep 2018
Universal Compressed Text Indexing 1 Gonzalo Navarro 2 arxiv:1803.09520v3 [cs.ds] 6 Sep 2018 Abstract Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of
More informationAverage Case Analysis of QuickSort and Insertion Tree Height using Incompressibility
Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility Tao Jiang, Ming Li, Brendan Lucier September 26, 2005 Abstract In this paper we study the Kolmogorov Complexity of a
More informationFinding Consensus Strings With Small Length Difference Between Input and Solution Strings
Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de
More informationRotation and Lighting Invariant Template Matching
Rotation and Lighting Invariant Template Matching Kimmo Fredriksson 1, Veli Mäkinen 2, and Gonzalo Navarro 3 1 Department of Computer Science, University of Joensuu. kfredrik@cs.joensuu.fi 2 Department
More informationSorting suffixes of two-pattern strings
Sorting suffixes of two-pattern strings Frantisek Franek W. F. Smyth Algorithms Research Group Department of Computing & Software McMaster University Hamilton, Ontario Canada L8S 4L7 April 19, 2004 Abstract
More informationCovering Linear Orders with Posets
Covering Linear Orders with Posets Proceso L. Fernandez, Lenwood S. Heath, Naren Ramakrishnan, and John Paul C. Vergara Department of Information Systems and Computer Science, Ateneo de Manila University,
More informationCOMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES
COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL
More information1 Approximate Quantiles and Summaries
CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity
More informationIn-Memory Storage for Labeled Tree-Structured Data
In-Memory Storage for Labeled Tree-Structured Data by Gelin Zhou A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer
More informationarxiv: v2 [cs.ds] 6 Jul 2015
Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science
More informationInternal Pattern Matching Queries in a Text and Applications
Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords
More informationA Fully Compressed Pattern Matching Algorithm for Simple Collage Systems
A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems Shunsuke Inenaga 1, Ayumi Shinohara 2,3 and Masayuki Takeda 2,3 1 Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23)
More informationNotes on Logarithmic Lower Bounds in the Cell Probe Model
Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)
More informationCompressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey
More informationSelf-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
More informationOptimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs
Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs Krishnendu Chatterjee Rasmus Ibsen-Jensen Andreas Pavlogiannis IST Austria Abstract. We consider graphs with n nodes together
More informationAn O(N) Semi-Predictive Universal Encoder via the BWT
An O(N) Semi-Predictive Universal Encoder via the BWT Dror Baron and Yoram Bresler Abstract We provide an O(N) algorithm for a non-sequential semi-predictive encoder whose pointwise redundancy with respect
More informationDynamic Ordered Sets with Exponential Search Trees
Dynamic Ordered Sets with Exponential Search Trees Arne Andersson Computing Science Department Information Technology, Uppsala University Box 311, SE - 751 05 Uppsala, Sweden arnea@csd.uu.se http://www.csd.uu.se/
More informationPreview: Text Indexing
Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text
More informationA Linear Time Algorithm for Ordered Partition
A Linear Time Algorithm for Ordered Partition Yijie Han School of Computing and Engineering University of Missouri at Kansas City Kansas City, Missouri 64 hanyij@umkc.edu Abstract. We present a deterministic
More informationJumbled String Matching: Motivations, Variants, Algorithms
Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov
More informationQuiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)
Introduction to Algorithms October 13, 2010 Massachusetts Institute of Technology 6.006 Fall 2010 Professors Konstantinos Daskalakis and Patrick Jaillet Quiz 1 Solutions Quiz 1 Solutions Problem 1. We
More informationAdvanced Implementations of Tables: Balanced Search Trees and Hashing
Advanced Implementations of Tables: Balanced Search Trees and Hashing Balanced Search Trees Binary search tree operations such as insert, delete, retrieve, etc. depend on the length of the path to the
More informationRandomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th
CSE 3500 Algorithms and Complexity Fall 2016 Lecture 10: September 29, 2016 Quick sort: Average Run Time In the last lecture we started analyzing the expected run time of quick sort. Let X = k 1, k 2,...,
More informationComplexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler
Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard
More informationOutline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.
Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität
More informationOptimal spaced seeds for faster approximate string matching
Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching
More informationOptimal spaced seeds for faster approximate string matching
Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching
More informationApproximate String Matching with Lempel-Ziv Compressed Indexes
Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
More informationData selection. Lower complexity bound for sorting
Data selection. Lower complexity bound for sorting Lecturer: Georgy Gimel farb COMPSCI 220 Algorithms and Data Structures 1 / 12 1 Data selection: Quickselect 2 Lower complexity bound for sorting 3 The
More informationSelf-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
More informationSIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding
SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.
More informationE D I C T The internal extent formula for compacted tries
E D C T The internal extent formula for compacted tries Paolo Boldi Sebastiano Vigna Università degli Studi di Milano, taly Abstract t is well known [Knu97, pages 399 4] that in a binary tree the external
More informationProblem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26
Binary Search Introduction Problem Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Strategy 1: Random Search Randomly select a page until the page containing
More informationEECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have
EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,
More informationSliding Windows with Limited Storage
Electronic Colloquium on Computational Complexity, Report No. 178 (2012) Sliding Windows with Limited Storage Paul Beame Computer Science and Engineering University of Washington Seattle, WA 98195-2350
More informationMotif Extraction from Weighted Sequences
Motif Extraction from Weighted Sequences C. Iliopoulos 1, K. Perdikuri 2,3, E. Theodoridis 2,3,, A. Tsakalidis 2,3 and K. Tsichlas 1 1 Department of Computer Science, King s College London, London WC2R
More informationA GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *
A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * 1 Jorma Tarhio and Esko Ukkonen Department of Computer Science, University of Helsinki Tukholmankatu 2, SF-00250 Helsinki,
More informationCSE 202 Homework 4 Matthias Springer, A
CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and
More informationSkriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT)
1 Recommended Reading Skriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT) D. Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. M. Crochemore,
More informationSORTING SUFFIXES OF TWO-PATTERN STRINGS.
International Journal of Foundations of Computer Science c World Scientific Publishing Company SORTING SUFFIXES OF TWO-PATTERN STRINGS. FRANTISEK FRANEK and WILLIAM F. SMYTH Algorithms Research Group,
More informationText matching of strings in terms of straight line program by compressed aleshin type automata
Text matching of strings in terms of straight line program by compressed aleshin type automata 1 A.Jeyanthi, 2 B.Stalin 1 Faculty, 2 Assistant Professor 1 Department of Mathematics, 2 Department of Mechanical
More information