arxiv: v1 [cs.ds] 25 Nov 2009


Alphabet Partitioning for Compressed Rank/Select with Applications

Jérémy Barbay (1), Travis Gagie (1), Gonzalo Navarro (1) and Yakov Nekrich (2)

(1) Department of Computer Science, University of Chile, {jbarbay, tgagie, gnavarro}@dcc.uchile.cl
(2) Department of Computer Science, University of Bonn, yasha@cs.uni-bonn.de

Abstract. We show how, if we have a data structure that efficiently supports access, rank and select queries on strings in compressed form, and another that supports those queries efficiently on strings over large alphabets, we can combine their strengths via alphabet partitioning. Specifically, we present a data structure that stores a string $s[1..n]$ over alphabet $[1..\sigma]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the zero-order entropy of $s$, and supports access and rank queries in $O(\log\log\sigma)$ time and select queries in $O(1)$ time. We also show how our data structure can be used to store strings and text self-indexes compressed in terms of high-order entropies, as well as to manage compressed permutations, compressed functions, and a compressed dynamic collection of disjoint sets, while supporting rich functionality on those.

1 Introduction

There has been a surge of interest recently in string data structures supporting access, rank and select queries. Given a string $s$, the query $s.\mathrm{access}(i)$ returns the $i$th character of $s$, which we denote $s[i]$; the query $s.\mathrm{rank}_a(i)$ returns the number of occurrences of the character $a$ up to position $i$; and the query $s.\mathrm{select}_a(i)$ returns the position of the $i$th $a$ in $s$. Researchers study these operations not only because they are interesting in themselves but also because they can be used as primitives to implement many other operations (see, e.g., [11,3,9,12] for recent discussions). Two of the most active areas of research on such data structures are, first, increasing the sizes of the alphabets they can handle and, second, decreasing their space bounds.
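To fix the semantics of the three queries, here is a minimal, uncompressed reference sketch in Python. It is not one of the data structures discussed in the paper: it runs in linear time and uses plain Python sequences. The function names and the 1-indexed convention are assumptions chosen to match the notation above.

```python
def access(s, i):
    """Return s[i], the i-th character of s (positions are 1-indexed)."""
    return s[i - 1]


def rank(s, a, i):
    """Return the number of occurrences of a in s[1..i]."""
    return s[:i].count(a)


def select(s, a, i):
    """Return the position of the i-th occurrence of a in s, or None if there is none."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == a:
            seen += 1
            if seen == i:
                return pos
    return None


s = "abracadabra"
# rank and select are inverse in the sense that rank_a(select_a(k)) = k.
k = 3
assert rank(s, "a", select(s, "a", k)) == k
```

These naive versions are only useful as a specification against which a compressed implementation can be checked.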
Table 1 summarizes recent results in these two areas. Throughout, we write $n$ to denote the length of the string $s$, $\sigma$ to denote the alphabet size, $\log$ to denote $\log_2$, and $H_k(s)$ to denote the $k$th-order empirical entropy of $s$ (see, e.g., [4] for a definition and discussion). For example, Golynski, Munro and Rao [6] presented a data structure that stores $s$ in $n\log\sigma + O\left(\frac{n\log\sigma}{\log\log\sigma}\right)$ bits and supports access and rank in $O(\log\log\sigma)$ time and select in $O(1)$ time (see row 3 of Table 1); Ferragina, Manzini, Mäkinen and Navarro [4] presented a data structure called a multiary wavelet tree that stores $s$ in $nH_0(s) + O\left(\frac{n\log\sigma\log\log n}{\log n}\right)$ bits and supports access, rank and select in $O\left(1 + \frac{\log\sigma}{\log\log n}\right)$ time. When $\sigma = \log^{O(1)} n$, their space is $nH_0(s) + o(n)$ bits and their times are $O(1)$ (see row 2 and the caption of Table 1). Comparing these results, we see that with Golynski et al.'s data structure we can handle large alphabets but we do not compress in terms of the usual entropy measures; with a multiary wavelet tree, the situation is reversed.

The second and third authors were funded in part by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P F, Mideplan, Chile.

Table 1. Recent and our new bounds for data structures supporting access, rank and select. The space bound in the first row holds when $k = o(\log_\sigma n)$. In the second row the space formula holds for $\sigma = o(n)$, and it becomes $nH_0(s) + o(n)$ when $\sigma = \log^{O(1)} n$. In the times for our results, $\sigma$ can be changed to $\min(\sigma, n/\mathrm{occ}(a,s))$, where $a$ is the character involved.

| Source | Space (bits) | access | rank | select |
|---|---|---|---|---|
| [1]+[5] | $nH_k(s) + o(n)\log\sigma + n\,o(\log\sigma)$ | $O(1)$ | $O(\log\log\sigma\,(\log\log\log\sigma)^2)$ | $O(\log\log\sigma\,\log\log\log\sigma)$ |
| [4] | $nH_0(s) + o(n)\log\sigma$ | $O\left(1 + \frac{\log\sigma}{\log\log n}\right)$ | $O\left(1 + \frac{\log\sigma}{\log\log n}\right)$ | $O\left(1 + \frac{\log\sigma}{\log\log n}\right)$ |
| [6] | $n\log\sigma + n\,o(\log\sigma)$ | $O(\log\log\sigma)$ | $O(\log\log\sigma)$ | $O(1)$ |
| [6] | $n\log\sigma + n\,o(\log\sigma)$ | $O(1)$ | $O(\log\log\sigma\,\log\log\log\sigma)$ | $O(\log\log\sigma)$ |
| Ours | $nH_0(s) + o(n)(H_0(s)+1)$ | $O(\log\log\sigma)$ | $O(\log\log\sigma)$ | $O(1)$ |
| Ours | $nH_0(s) + o(n)(H_0(s)+1)$ | $O(1)$ | $O(\log\log\sigma\,\log\log\log\sigma)$ | $O(\log\log\sigma)$ |

In Section 2 we show how to combine the strengths of such a pair of data structures. Our idea is to partition the alphabet into sub-alphabets according to the characters' frequencies in $s$, then use a multiary wavelet tree to store the string that results from replacing the characters in $s$ by identifiers of their sub-alphabets, and store separate strings with the characters of $s$ belonging to each sub-alphabet, this time using the structure for large alphabets. We achieve a data structure that stores a string $s[1..n]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits and supports queries in the times shown in Table 1 (rows 5 and 6 give two alternatives).

In Section 3 we show how to achieve compression in terms of the $k$th-order empirical entropy $H_k(s)$ of $s$, by giving up or slowing down rank and select queries (this gives a simpler construction that achieves row 1 in Table 1). We also show how our result can be used to improve an existing text index that achieves $k$th-order entropy [4].
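The partitioning idea outlined above, which Section 2 makes precise, can be sketched in Python as follows. This is a toy illustration only: every component is stored as a plain list and queried with linear-time helpers, whereas the paper uses multiary wavelet trees and Golynski et al.'s structure to obtain the stated bounds. The class name `PartitionedString`, its attribute names, and the helper functions are hypothetical, and the sub-alphabet number is computed as the ceiling of $\log(n/\mathrm{occ}(a,s))\log n$, following the definition in Section 2. The input is assumed to be non-empty.

```python
import math
from collections import Counter


def rank(seq, a, i):
    """Occurrences of a in seq[1..i] (1-indexed prefix)."""
    return seq[:i].count(a)


def select(seq, a, j):
    """Position (1-indexed) of the j-th occurrence of a in seq, or None."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == a:
            seen += 1
            if seen == j:
                return pos
    return None


class PartitionedString:
    def __init__(self, s):
        n = len(s)
        occ = Counter(s)
        self.alphabet = sorted(occ)  # plays the role of the effective alphabet [1..sigma]
        self.code = {a: k for k, a in enumerate(self.alphabet, start=1)}
        log_n = math.log2(n)
        # l[a]: sub-alphabet number of character a, grouping characters of similar frequency
        self.l = [math.ceil(math.log2(n / occ[a]) * log_n) for a in self.alphabet]
        # t[i]: sub-alphabet number of s[i]
        self.t = [self.l[self.code[ch] - 1] for ch in s]
        # sub[c]: the characters of s in sub-alphabet c, renamed to 1..sigma_c
        self.sub = {}
        for ch in s:
            c = self.l[self.code[ch] - 1]
            self.sub.setdefault(c, []).append(rank(self.l, c, self.code[ch]))

    def access(self, i):
        c = self.t[i - 1]
        x = self.sub[c][rank(self.t, c, i) - 1]
        return self.alphabet[select(self.l, c, x) - 1]

    def rank(self, a, i):
        c = self.l[self.code[a] - 1]
        x = rank(self.l, c, self.code[a])
        return rank(self.sub[c], x, rank(self.t, c, i))

    def select(self, a, j):
        c = self.l[self.code[a] - 1]
        x = rank(self.l, c, self.code[a])
        p = select(self.sub[c], x, j)
        return None if p is None else select(self.t, c, p)


text = "abracadabra"
ps = PartitionedString(text)
assert all(ps.access(i) == text[i - 1] for i in range(1, len(text) + 1))
```

Any assignment of characters to sub-alphabets keeps the three queries correct; the specific frequency-based assignment matters only for the space analysis.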
In Sections 4, 5 and 6, respectively, we show how to apply our data structure to store a compressed permutation, a compressed function and a compressed dynamic collection of disjoint sets, while supporting a rich set of operations on those. This improves or gives alternatives to the best previous results [2,10,8].

2 Alphabet partitioning

Suppose $s$ is a sequence over effective alphabet $[\sigma]$, that is, every character appears in $s$, thus $\sigma \le n$.^1 The zero-order entropy of $s$ is $H_0(s) = \sum_{a\in[\sigma]} \frac{\mathrm{occ}(a,s)}{n}\log\frac{n}{\mathrm{occ}(a,s)}$, where $\mathrm{occ}(a,s)$ is the number of occurrences of character $a$ in $s$. Note that by convexity we have $nH_0(s) \ge (\sigma-1)\log n + (n-\sigma+1)\log(n/(n-\sigma+1))$, a property we will use later.

^1 The case of a general alphabet $\Sigma$ can be handled by standard means, using a compressed bitvector [13] to map it to the effective range $[1,\sigma]$. This adds constant time to the operations and $O\left(\sigma\log\frac{\max\Sigma}{\sigma}\right) + o(\max\Sigma)$ bits of space.

Our results are based on the following alphabet partitioning scheme. Let $t$ be the string in which each character $s[i]$ is replaced by the number $t[i] = \lceil\log(n/\mathrm{occ}(s[i],s))\log n\rceil \le \lceil\log^2 n\rceil$. For $0 \le \ell \le \lceil\log^2 n\rceil$, let $s_\ell$ be the string consisting of those characters $s[i]$ that were replaced by $\ell$ in $t$, and let $\sigma_\ell$ be the number of distinct characters in $s_\ell$. More precisely, $s_\ell$ will be a sequence over alphabet $[1,\sigma_\ell]$. The mapping with $[1,\sigma]$ will be done by storing another sequence $l[1,\sigma]$ so that $l[a]$ will be the sub-alphabet number assigned to character $a$ (we write $\ell$ for a sub-alphabet number and $l$ for this sequence). Hence character $a$ is mapped to character $l.\mathrm{rank}_{l[a]}(a)$ in string $s_{l[a]}$, and symbol $x \in [1,\sigma_\ell]$ in $s_\ell$ corresponds to character $l.\mathrm{select}_\ell(x)$.

Notice that, if both $a$ and $b$ are replaced by the same number in $t$, then $\log(n/\mathrm{occ}(b,s)) - \log(n/\mathrm{occ}(a,s)) < 1/\log n$ and so $\mathrm{occ}(a,s)/\mathrm{occ}(b,s) < 2^{1/\log n}$. It follows that, if $a$ is replaced by $\ell$ in $t$, then $\sigma_\ell < 2^{1/\log n}\,|s_\ell|/\mathrm{occ}(a,s)$ (by fixing $a$ and adding over all those $b$ replaced by $\ell$). Since $\sum_{a:\,\lceil\log(n/\mathrm{occ}(a,s))\log n\rceil = \ell} \mathrm{occ}(a,s) = |s_\ell|$ and $\sum_a \mathrm{occ}(a,s) = \sum_\ell |s_\ell| = n$, we have

$$nH_0(t) + \sum_\ell |s_\ell|\log\sigma_\ell \;<\; \sum_\ell |s_\ell|\log(n/|s_\ell|) + \sum_\ell \sum_{a:\,\lceil\log(n/\mathrm{occ}(a,s))\log n\rceil = \ell} \mathrm{occ}(a,s)\log\left(2^{1/\log n}\,|s_\ell|/\mathrm{occ}(a,s)\right)$$
$$=\; \sum_a \mathrm{occ}(a,s)\log(n/\mathrm{occ}(a,s)) + n/\log n \;=\; nH_0(s) + o(n).$$

In other words, if we represent $t$ with $H_0(t)$ bits per symbol and each $s_\ell$ with $\log\sigma_\ell$ bits per symbol, then we achieve a good overall compression. Thus we can obtain a very compact representation of a string $s$ by storing a compact representation of $t$ and storing each $s_\ell$ as an uncompressed string over an alphabet of size $\sigma_\ell$.

Now we show how our approach can be used to obtain a fast and compact rank/select data structure. Suppose we have a data structure $T$ that supports access, rank and select queries on $t$; another structure $L$ that supports the same queries on $l$; and data structures $S_1,\ldots,S_{\lceil\log^2 n\rceil}$ that support the same queries on $s_1,\ldots,s_{\lceil\log^2 n\rceil}$. With these data structures we can implement

$s.\mathrm{access}(i) = l.\mathrm{select}_\ell(s_\ell.\mathrm{access}(t.\mathrm{rank}_\ell(i)))$, where $\ell = t.\mathrm{access}(i)$;
$s.\mathrm{rank}_a(i) = s_\ell.\mathrm{rank}_c(t.\mathrm{rank}_\ell(i))$, where $\ell = l.\mathrm{access}(a)$ and $c = l.\mathrm{rank}_\ell(a)$;
$s.\mathrm{select}_a(i) = t.\mathrm{select}_\ell(s_\ell.\mathrm{select}_c(i))$, where $\ell = l.\mathrm{access}(a)$ and $c = l.\mathrm{rank}_\ell(a)$.

Assume we implement $T$ and $L$ as multiary wavelet trees [4], and implement each $S_\ell$ as either a multiary wavelet tree or an instance of Golynski et al.'s [6] first access/rank/select data structure, depending on whether $\sigma_\ell \le \log n$ or not. The wavelet tree for $T$ uses at most $nH_0(t) + o(n)$ bits and operates in constant time, because its alphabet size is polylogarithmic. If $S_\ell$ is implemented as a wavelet tree, it uses at most $|s_\ell|H_0(s_\ell) + o(|s_\ell|)$ bits and operates in constant time for the same reason; otherwise it uses at most $|s_\ell|\log\sigma_\ell + O\left(\frac{|s_\ell|\log\sigma_\ell}{\log\log\sigma_\ell}\right) \le |s_\ell|\log\sigma_\ell + O\left(\frac{|s_\ell|\log\sigma_\ell}{\log\log\log n}\right)$ bits (the latter because $\sigma_\ell > \log n$). Thus in all cases the space for $s_\ell$ is bounded by $|s_\ell|\log\sigma_\ell + o(|s_\ell|\log\sigma_\ell)$ bits.^2 Finally, since $L$ is a sequence of length $\sigma$ over an alphabet of size $\lceil\log^2 n\rceil$, the wavelet tree for $L$ takes at most $O(\sigma\log\log n)$ bits. Because of the property we recalled at the beginning of this section, $nH_0(s) \ge (\sigma-1)\log n$, this space is also $o(nH_0(s))$.

^2 We note that this $o(\cdot)$ expression is in all cases asymptotic in $n$: in the case of multiary wavelet trees, it is achieved by using block sizes of length $\frac{\log n}{2}$ and not $\frac{\log|s_\ell|}{2}$, at the price of storing universal tables of size $O\left(\sqrt{n}\,\log^{O(1)} n\right)$.

By these calculations, the space for $T$, $L$ and the $S_\ell$'s adds up to $nH_0(s) + o(nH_0(s)) + o(n)$. Depending on which time tradeoff we use for Golynski et al.'s data structure, we obtain the results of Table 1. We can refine the time complexity by noticing that the only non-constant times are due to operating on some string $s_\ell$, where the alphabet is of size $\sigma_\ell < 2^{1/\log n}\,|s_\ell|/\mathrm{occ}(a,s)$, where $a$ is the character in question. Thus the actual times are, for example, $O(\log\log\sigma_\ell) = O(\log\log\min(\sigma, n/\mathrm{occ}(a,s)))$.

Theorem 1. We can store $s[1,n]$ over effective alphabet $[1,\sigma]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits and support access, rank and select queries in $O(\log\log\sigma)$, $O(\log\log\sigma)$, and $O(1)$ time, respectively (variant (i)). Alternatively, we can support access, rank and select queries in $O(1)$, $O(\log\log\sigma\,\log\log\log\sigma)$ and $O(\log\log\sigma)$ time, respectively (variant (ii)). All the $\sigma$ terms in the time complexities are actually $\min(\sigma, n/\mathrm{occ}(a,s))$, where $a = s[i]$ for access, and $a$ is the character argument for rank and select. If $[\sigma]$ is not the effective alphabet, our structure handles it gracefully with the same structure $L$, using $O(\sigma\log\log n)$ extra bits.

3 Higher-order entropies

If we are willing to give up support of rank and select queries, then we can compress $s$ well in terms of $nH_k(s)$. We note this result is not new: it was first proven by Sadakane and Grossi [14], simplified by González and Navarro [7] and then further simplified by Ferragina and Venturini [5]. The next theorem can be seen as yet a further simplification.

Theorem 2. We can store $s$ in $nH_k(s) + o(n)\log\sigma$ bits for all $k = o(\log_\sigma n)$ and perform access queries in $O(1)$ time (more than that, we can retrieve any $O(\log_\sigma n)$ contiguous symbols within that time).^3

Proof. We divide $s$ into blocks of length $b = \frac{\log_\sigma n}{2}$ and assign new characters to the distinct blocks.
There are at most $\sigma^b = n^{1/2}$ new characters and we build a table that maps them to their blocks in $O(1)$ time, which takes $O(n^{1/2}\,b\log\sigma) = O(n^{1/2}\log n)$ bits. We build a new string $s'$ by replacing each block of $s$ by its assigned new character and store $s'$ using Theorem 1(ii). Ferragina and Venturini [5] showed $|s'|H_0(s') \le nH_k(s) + O((n/b)\,k\log\sigma)$ for $k \le b$, so for $k = o(\log_\sigma n)$ we use a total of at most $nH_k(s) + o(n)\log\sigma$ bits. To compute $s[i]$, we find $s'[\lceil i/b\rceil]$ and report the $(i \bmod b)$th character of the associated block. This can be easily extended to output any $O(\log_\sigma n)$ contiguous symbols in constant time.

Barbay, He, Munro and Rao [1] showed how, given a data structure that supports only access queries, it is possible to build a succinct index supporting rank and select. For example, building a succinct index on top of Ferragina and Venturini's data structure, whose size is bounded in terms of $H_k(s)$, gives them the bounds shown in the first row of Table 1. The same applies to our simplified result, of course.

^3 Note that in this case we do not need $\sigma$ to be the effective alphabet: in order to achieve any $k > 0$, we need (at the very least) that $\sigma = o(n)$, and thus the extra space for our structure $L$ is $O(\sigma\log\log n) = o(\sigma\log n) = o(n\log\sigma)$, already considered in the space formula.
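The blocking step in the proof of Theorem 2 can be illustrated with a short Python sketch. It only shows the mechanics (split into blocks, rename distinct blocks, answer access through the table); the block string is kept as a plain list rather than being stored with Theorem 1(ii), so no entropy bound is achieved here. The function names are hypothetical, the new characters are 0-based integers for convenience, and the block length `b` is passed in by hand instead of being derived from $\log_\sigma n$.

```python
def build_blocks(s, b):
    """Split s into blocks of length b and rename each distinct block to a new character."""
    table = []   # new character -> contents of its block
    ids = {}     # contents of a block -> its new character
    s_prime = []
    for start in range(0, len(s), b):
        block = s[start:start + b]
        if block not in ids:
            ids[block] = len(table)
            table.append(block)
        s_prime.append(ids[block])
    return s_prime, table


def access(s_prime, table, b, i):
    """Return s[i] (1-indexed): locate the block holding position i, then index into it."""
    q, r = divmod(i - 1, b)
    return table[s_prime[q]][r]


def extract(s_prime, table, b, i, length):
    """Return s[i..i+length-1] by repeated access (the paper does this in constant time)."""
    return "".join(access(s_prime, table, b, k) for k in range(i, i + length))


s = "abababcdcdab"
b = 2
s_prime, table = build_blocks(s, b)
assert all(access(s_prime, table, b, i) == s[i - 1] for i in range(1, len(s) + 1))
```

Repetitive input produces few distinct blocks, which is what makes the zero-order entropy of the block string small.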

Alternatively, we can achieve $k$th-order entropy by partitioning the Burrows-Wheeler transform of the sequence and encoding each partition to zero-order entropy [4]. This representation, however, is more useful for self-indexing than for supporting access, rank, and select on the sequence. Self-indexes also represent a sequence, but they support other operations related to text searching. By using Theorem 1(i) to represent the partitions of Ferragina et al. [4], on which we need to carry out access and rank, we achieve the following result, improving previous ones [4,6].

Theorem 3. Let $t[1,n]$ be a text string over alphabet $[1,\sigma]$, $\sigma = o(n)$.^4 Then we can represent $t$ using $nH_k(t) + o(n)\log\sigma$ bits of space, for any $k \le \alpha\log_\sigma n$ and constant $0 < \alpha < 1$. The following operations can be carried out: (i) count the number of occurrences of a pattern $p[1,m]$ in $t$, in time $O(m\log\log\sigma)$; (ii) locate any such occurrence in time $O(\log_\sigma n\,\log\log n\,\log\log\sigma)$; (iii) extract $t[l,r]$ in time $O((r - l + \log_\sigma n\,\log\log n)\log\log\sigma)$.

For this particular locating time we are sampling one out of every $\log_\sigma n\,\log\log n$ text positions.

4 Compressing permutations

We now show how to use access/rank/select data structures to store a compressed permutation. We follow Barbay and Navarro's notation [2] and improve their space and, especially, time performance. They measure the compressibility of a permutation $\pi$ in terms of the entropy of the distribution of the lengths of runs of different kinds. Let $\pi$ be covered by $\rho$ runs (of some sort) of lengths $\mathrm{runs}(\pi) = \langle n_1,\ldots,n_\rho\rangle$. Then $H(\mathrm{runs}(\pi)) = \sum_i \frac{n_i}{n}\log\frac{n}{n_i} \le \log\rho$ is called the entropy of the runs (and, because $n_i \ge 1$, it also holds that $nH(\mathrm{runs}(\pi)) \ge (\rho-1)\log n$). We first give a result for the most general class of runs considered there [2] (i.e., interleaved sequences of increasing or decreasing values) and then give specialized results for less general runs.
We start by considering the application of the permutation $\pi(i)$ and its inverse, $\pi^{-1}(i)$, and focus later on iterated applications of these.

Theorem 4. Suppose $\pi$ is a permutation on $n$ elements that consists of $\rho$ interleaved increasing or decreasing runs. We can store $\pi$ in $2nH(\mathrm{runs}(\pi)) + o(n)(H(\mathrm{runs}(\pi))+1)$ bits and perform $\pi$ and $\pi^{-1}$ queries in $O(\log\log\rho)$ time.

Proof. We first replace all the elements of the $r$th run by $r$, for $1 \le r \le \rho$. Let $s$ be the resulting string and let $s'$ be $s$ permuted according to $\pi$. We store $s$ and $s'$ using Theorem 1(i) and store $\rho$ bits indicating whether each run is increasing or decreasing. Notice $s'[\pi(i)] = s[i]$ and, if $\pi(i)$ is part of an increasing run, then $s'.\mathrm{rank}_{s[i]}(\pi(i)) = s.\mathrm{rank}_{s[i]}(i)$, so

$\pi(i) = s'.\mathrm{select}_{s[i]}\left(s.\mathrm{rank}_{s[i]}(i)\right)$;

if $\pi(i)$ is part of a decreasing run, then $s'.\mathrm{rank}_{s[i]}(\pi(i)) = s.\mathrm{rank}_{s[i]}(n) + 1 - s.\mathrm{rank}_{s[i]}(i)$, so

$\pi(i) = s'.\mathrm{select}_{s[i]}\left(s.\mathrm{rank}_{s[i]}(n) + 1 - s.\mathrm{rank}_{s[i]}(i)\right)$.

A $\pi^{-1}$ query is symmetric. The space of the bitmap is $\rho = o(nH(\mathrm{runs}(\pi)))$ because $nH(\mathrm{runs}(\pi)) \ge (\rho-1)\log n$.

^4 Again, $[\sigma]$ does not need to be the effective alphabet.
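The construction in the proof of Theorem 4 can be exercised with a small Python sketch. Several simplifying assumptions are made and none of them come from the paper: only increasing runs are used (so the direction bits are unnecessary), the cover by interleaved runs is found with a simple greedy rule, and $s$ and $s'$ are plain lists queried in linear time instead of instances of Theorem 1(i). All function names are hypothetical.

```python
def rank(seq, a, i):
    """Occurrences of a in seq[1..i]."""
    return seq[:i].count(a)


def select(seq, a, j):
    """Position (1-indexed) of the j-th occurrence of a in seq, or None."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == a:
            seen += 1
            if seen == j:
                return pos
    return None


def interleaved_increasing_runs(pi):
    """Greedy cover: put each position into the first run whose last value is smaller."""
    last = []  # last value placed in each run so far
    s = []     # s[i] = identifier (1-based) of the run containing position i
    for v in pi:
        for r, tail in enumerate(last):
            if tail < v:
                last[r] = v
                s.append(r + 1)
                break
        else:
            last.append(v)
            s.append(len(last))
    return s


def build(pi):
    """Return (s, s') with s'[pi(i)] = s[i]."""
    s = interleaved_increasing_runs(pi)
    s_perm = [0] * len(pi)
    for i, v in enumerate(pi, start=1):
        s_perm[v - 1] = s[i - 1]
    return s, s_perm


def apply(s, s_perm, i):
    """pi(i) = s'.select_{s[i]}(s.rank_{s[i]}(i)) for increasing runs."""
    r = s[i - 1]
    return select(s_perm, r, rank(s, r, i))


def apply_inverse(s, s_perm, v):
    """The symmetric query: swap the roles of s and s'."""
    r = s_perm[v - 1]
    return select(s, r, rank(s_perm, r, v))


pi = [3, 1, 4, 2, 6, 5, 7]  # pi[i-1] holds pi(i)
s, s_perm = build(pi)
assert all(apply(s, s_perm, i) == pi[i - 1] for i in range(1, len(pi) + 1))
```

The original permutation is never consulted at query time: the two run-identifier strings alone determine it.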

We now consider the case of runs restricted to be strictly incrementing ($+1$) or decrementing ($-1$), while still letting them be interleaved.

Theorem 5. Suppose $\pi$ is a permutation on $n$ elements that consists of $\rho$ interleaved incrementing or decrementing runs. For any constant $\epsilon > 0$, we can store $\pi$ in $nH(\mathrm{runs}(\pi)) + o(n)(H(\mathrm{runs}(\pi))+1) + O(\rho n^\epsilon)$ bits and perform $\pi$ queries in $O(\log\log\rho)$ time and $\pi^{-1}$ queries in $O(1/\epsilon)$ time.

Proof. We first replace all the elements of the $r$th run by $r$, for $1 \le r \le \rho$, considering the runs in order by minimum element. Let $s \in \{1,\ldots,\rho\}^n$ be the resulting string. We store $s$ using Theorem 1(i); we also store an array containing the runs' lengths, directions (incrementing or decrementing), and minima, in order by minimum element; and we store a predecessor data structure containing the runs' minima as keys with their positions in the array as auxiliary information. The predecessor data structure, which we will describe in the full version of this paper, takes $O(\rho n^\epsilon)$ bits and answers queries in $O(1/\epsilon)$ time.

With the array and the predecessor data structure, we can retrieve a run's data given either its array index or any of its elements. If $\pi(i)$ is the $j$th element in an incrementing run whose minimum element is $m$, then $\pi(i) = m + j - 1$; on the other hand, if $\pi(i)$ is the $j$th element of a decrementing run of length $l$ whose minimum element is $m$, then $\pi(i) = m + l - j$. It follows that, given $i$, we can compute $\pi(i)$ by using the query $j = s.\mathrm{rank}_{s[i]}(i)$ and then an array lookup at position $s[i]$ to find $m$, $l$ and the direction, finally computing $\pi(i)$ from them. Also, given $\pi(i)$, we can compute $i$ by first using a predecessor query to find the run's array position $r$, then an array lookup to find $m$, $l$ and the direction, then computing $j = \pi(i) - m + 1$ (incrementing) or $j = m + l - \pi(i)$ (decrementing), and finally using the query $i = s.\mathrm{select}_r(j)$.
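A minimal Python sketch of the scheme in the proof of Theorem 5 follows. It makes simplifying assumptions that are not in the paper: only incrementing runs are handled (so lengths and directions need not be stored), the string $s$ is a plain list queried in linear time, and binary search over the sorted minima (Python's `bisect`) stands in for the predecessor data structure. The function names are hypothetical.

```python
from bisect import bisect_right


def rank(seq, a, i):
    """Occurrences of a in seq[1..i]."""
    return seq[:i].count(a)


def select(seq, a, j):
    """Position (1-indexed) of the j-th occurrence of a in seq, or None."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == a:
            seen += 1
            if seen == j:
                return pos
    return None


def build_incrementing(pi):
    """Split pi into maximal interleaved incrementing runs, numbered by minimum element."""
    n = len(pi)
    pos = [0] * (n + 1)  # pos[v] = position i with pi(i) = v
    for i, v in enumerate(pi, start=1):
        pos[v] = i
    s = [0] * n
    minima = []
    for v in range(1, n + 1):
        # value v continues the run of v - 1 only if it appears later in pi
        if v == 1 or pos[v] < pos[v - 1]:
            minima.append(v)
        s[pos[v] - 1] = len(minima)
    return s, minima


def apply(s, minima, i):
    """pi(i) = m + j - 1, where j = s.rank_{s[i]}(i) and m is the run's minimum."""
    r = s[i - 1]
    return minima[r - 1] + rank(s, r, i) - 1


def apply_inverse(s, minima, v):
    """Find the run whose minimum is the predecessor of v, then select inside it."""
    r = bisect_right(minima, v)
    j = v - minima[r - 1] + 1
    return select(s, r, j)


pi = [4, 1, 5, 2, 6, 3]  # pi[i-1] holds pi(i)
s, minima = build_incrementing(pi)
assert all(apply(s, minima, i) == pi[i - 1] for i in range(1, len(pi) + 1))
```

Only `s` and the per-run minima are needed at query time, which is why a single copy of the run-identifier string suffices here.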
Notice that, if $\pi$ consists of $\rho$ contiguous increasing or decreasing runs, then $\pi^{-1}$ consists of $\rho$ interleaved incrementing or decrementing runs. Therefore, Theorem 5 applies to such permutations as well, with the time bounds for $\pi$ and $\pi^{-1}$ queries reversed.

Theorem 6. Suppose $\pi$ is a permutation on $n$ elements that consists of $\rho$ contiguous increasing or decreasing runs. For any constant $\epsilon > 0$, we can store $\pi$ in $nH(\mathrm{runs}(\pi)) + o(n)(H(\mathrm{runs}(\pi))+1) + O(\rho n^\epsilon)$ bits and perform $\pi$ queries in $O(1/\epsilon)$ time and $\pi^{-1}$ queries in $O(\log\log\rho)$ time.

If $\pi$'s runs are both contiguous and incrementing or decrementing (so that the runs of $\pi^{-1}$ are as well), then we can store $\pi$ in $O(\rho n^\epsilon)$ bits and answer $\pi$ and $\pi^{-1}$ queries in $O(1)$ time. To do this, we use two predecessor data structures: for each run, in one of the data structures we store the position $j$ in $\pi$ of the first element of the run, with $\pi(j)$ as auxiliary information; in the other, we store $\pi(j)$, with $j$ as auxiliary information. To perform a query $\pi(i)$, we use the first predecessor data structure to find the starting position $j$ of the run containing $i$, and return $\pi(j) + i - j$. A $\pi^{-1}$ query is symmetric. Decreasing runs are handled as before.

Theorem 7. Suppose $\pi$ is a permutation on $n$ elements that consists of $\rho$ contiguous incrementing or decrementing runs. For any constant $\epsilon > 0$, we can store $\pi$ in $O(\rho n^\epsilon)$ bits and perform $\pi$ and $\pi^{-1}$ queries in $O(1/\epsilon)$ time.

We now prove a novel result achieving exponentiation ($\pi^k(i)$, $\pi^{-k}(i)$) within compressed space. The previous result by Munro, Raman, Raman and Rao [10] translated the problem to $\pi$ and $\pi^{-1}$ queries on a different permutation. If one directly applies our results to the permutation they store, the runs are not those of the original $\pi$. The following construction, instead, retains the compressibility properties of $\pi$.

Theorem 8. Suppose we have a data structure $D$ that stores a permutation $\pi$ on $n$ elements and supports queries $\pi(i)$ in time $g(\pi)$. Then for any $t \le n$, we can build a companion data structure that takes $O((n/t)\log n)$ bits and, when used in conjunction with $D$, supports $\pi^k$ and $\pi^{-k}$ queries in $O(t\,g(\pi))$ time.

Proof. We decompose $\pi$ into its cycles and, for every cycle of length at least $t$, store the cycle's length and an array containing pointers to every $t$th element in the cycle, which we call marked. We also store a compressed binary string, aligned to $\pi$, indicating the marked elements. For each marked element, we record to which cycle it belongs and its position in the array of that cycle. To compute $\pi^k(i)$, we repeatedly apply $\pi$ at most $t$ times until we either loop (in which case we need to apply $\pi$ at most $t$ more times to find $\pi^k(i)$ in the loop) or we find a marked element. Once we have reached a marked element, we use its array position and cycle length to find the pointer to the last marked element in the cycle before $\pi^k(i)$, and the number of applications of $\pi$ needed to map that to $\pi^k(i)$ (at most $t$). A $\pi^{-k}$ query is similar (note it does not need to use $\pi^{-1}$).

5 Compressing functions

Hreinsson, Krøyer and Pagh [8] recently showed how to store a compressed function $f : [n] \to [\sigma]$ with constant-time access. To do this, they store a prefix-free encoding of the string $f = f(1),\ldots,f(n)$ of function values in such a way that, given $i$, they can find and decode the $i$th codeword in constant time. Their representation occupies at most $(1+\delta)nH_0(f) + n\min(p_{\max}, 1.82(1-p_{\max})) + o(\sigma)$ bits, where $\delta > 0$ is a given constant and $p_{\max}$ is the relative frequency of the most common function value. Notice that, by regarding $f$ as a string, we can achieve constant-time access and better space bounds using either Theorem 1 or Theorem 2.
With Theorem 1 we can also find all the elements in $[n]$ that $f$ maps to a given element in $[\sigma]$ (using select), find an element's rank among the elements with the same image, or find the size of the preimage (using rank), etc.

Theorem 9. Let $f : [n] \to [\sigma]$ be a surjective function.^5 We can represent $f$ using $nH_0(f) + o(n)(H_0(f)+1)$ bits so that any $f(i)$ can be computed in $O(1)$ time. Moreover, each element of $f^{-1}(a)$ can be computed in $O(\log\log\sigma)$ time, and $|f^{-1}(a)|$ requires time $O(\log\log\sigma\,\log\log\log\sigma)$. Alternatively we can compute $f(i)$ and $|f^{-1}(a)|$ in time $O(\log\log\sigma)$ and deliver any element of $f^{-1}(a)$ in $O(1)$ time.

We can also achieve interesting results with our theorems from Section 4, as runs arise naturally in many real-life functions. For example, suppose we decompose $f(1),\ldots,f(n)$ into $\rho$ interleaved non-increasing or non-decreasing runs. Then we can store it as a combination of the permutation $\pi$ that stably sorts the values $f(i)$, plus a compressed rank/select data structure storing a binary string $b[1,n+\sigma+1]$ with $\sigma+1$ bits set to 1: if $f$ maps $c$ values in $[n]$ to a value $j$ in $[\sigma]$ then, in $b$, there are $c$ bits set to 0 between the $j$th and $(j+1)$th bits set to 1. Therefore, $f(i) = b.\mathrm{rank}_1(b.\mathrm{select}_0(\pi(i)))$ and the theorem below follows immediately from Theorem 4. Similarly, $f^{-1}(a)$ is obtained by applying $\pi^{-1}$ to the area $b.\mathrm{rank}_0(b.\mathrm{select}_1(a))+1 \ldots b.\mathrm{rank}_0(b.\mathrm{select}_1(a+1))$, and $|f^{-1}(a)|$ is computed in $O(1)$ time. Notice $H(\mathrm{runs}(\pi)) = H(\mathrm{runs}(f)) \le H_0(f)$, and that $b$ can be stored in $O\left(\sigma\log\frac{n}{\sigma}\right) + o(n)$ bits [13].

^5 So that $[\sigma]$ is the effective alphabet of string $f$. General functions with image of size $\sigma' < \sigma$ require $O(\sigma'\log(\sigma/\sigma')) + o(\sigma)$ extra bits, or we can handle them using $O(\sigma\log\log n)$ bits with our structure $L$.
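The combination of the stably sorting permutation $\pi$ and the bitmap $b$ described above can be checked with a short Python sketch. Everything here is stored uncompressed, which is an assumption made for illustration: $\pi$ and $\pi^{-1}$ are plain lists instead of the structure of Theorem 4, and $b$ is a list of bits queried in linear time instead of a compressed rank/select bitvector [13]. The function names are hypothetical.

```python
def rank(bits, bit, i):
    """Occurrences of bit in bits[1..i]."""
    return bits[:i].count(bit)


def select(bits, bit, j):
    """Position (1-indexed) of the j-th occurrence of bit, or None."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == bit:
            seen += 1
            if seen == j:
                return pos
    return None


def build(f, sigma):
    """f is a list with f[i-1] = f(i) in 1..sigma. Returns (pi, pi_inv, b)."""
    n = len(f)
    # pi_inv lists the positions 1..n stably sorted by function value (sorted() is stable)
    pi_inv = sorted(range(1, n + 1), key=lambda i: f[i - 1])
    pi = [0] * n
    for p, i in enumerate(pi_inv, start=1):
        pi[i - 1] = p
    counts = [0] * (sigma + 1)
    for v in f:
        counts[v] += 1
    b = []
    for a in range(1, sigma + 1):
        b.append(1)
        b.extend([0] * counts[a])  # one 0 per element mapped to a
    b.append(1)
    return pi, pi_inv, b


def evaluate(pi, b, i):
    """f(i) = b.rank_1(b.select_0(pi(i)))."""
    return rank(b, 1, select(b, 0, pi[i - 1]))


def preimage_size(b, a):
    return rank(b, 0, select(b, 1, a + 1)) - rank(b, 0, select(b, 1, a))


def preimage(pi_inv, b, a):
    """Apply pi^{-1} to the range of sorted positions that belongs to value a."""
    lo = rank(b, 0, select(b, 1, a)) + 1
    hi = rank(b, 0, select(b, 1, a + 1))
    return [pi_inv[p - 1] for p in range(lo, hi + 1)]


f = [2, 1, 2, 3, 1]
pi, pi_inv, b = build(f, 3)
assert all(evaluate(pi, b, i) == f[i - 1] for i in range(1, len(f) + 1))
```

The bitmap has length $n + \sigma + 1$, and the function values themselves are never stored: they are recovered from $\pi$ and $b$ alone.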

Theorem 10. Suppose $f : [n] \to [\sigma]$ is a surjective function^6 with $f(1),\ldots,f(n)$ consisting of $\rho$ interleaved non-increasing or non-decreasing runs. Then we can store $f$ in $2nH(\mathrm{runs}(f)) + o(n)(H(\mathrm{runs}(f))+1) + O\left(\sigma\log\frac{n}{\sigma}\right)$ bits and compute any $f(i)$, as well as retrieve any element in $f^{-1}(a)$, in $O(\log\log\rho)$ time. The size $|f^{-1}(a)|$ can be computed in $O(1)$ time.

We can obtain a more competitive result if $f$ is split into contiguous runs, but their entropy is no longer bounded by the zero-order entropy of the string $f$.

Theorem 11. Suppose $f : [n] \to [\sigma]$ is a surjective function with $f(1),\ldots,f(n)$ consisting of $\rho$ contiguous non-increasing or non-decreasing runs. Then we can represent $f$ in $nH(\mathrm{runs}(f)) + o(n)(H(\mathrm{runs}(f))+1) + O(\rho n^\epsilon) + O\left(\sigma\log\frac{n}{\sigma}\right)$ bits, for any constant $\epsilon > 0$, and compute any $f(i)$ in $O(\log\log\sigma)$ time, as well as retrieve any element in $f^{-1}(a)$ in $O(1/\epsilon)$ time. The size $|f^{-1}(a)|$ can be computed in $O(1)$ time.

6 Compressing dynamic disjoint sets

Finally, we prove what is, to the best of our knowledge, the first result about storing a compressed collection of disjoint sets. The key point in the next theorem is that, as the sets in the collection $C$ are unioned, our space bound shrinks with the entropy of the distribution $\mathrm{sets}(C)$ of elements to sets.

Theorem 12. Suppose $C$ is a collection of disjoint sets whose union is $\{1,\ldots,n\}$. For any constant $\epsilon > 0$, we can store $C$ in $(1+\epsilon)nH(\mathrm{sets}(C)) + O(|C|\log n) + o(n)(H(\mathrm{sets}(C))+1)$ bits and perform any sequence of $m$ union and find operations in a total of $O((1/\epsilon)(m+n)\log\log n)$ time.

Proof. We first choose an arbitrary order for the sets and use Theorem 1(ii) to store the string $s[1..n]$ in which each $s[i]$ is the identifier of the set containing $i$. We then choose an arbitrary representative for each set and store the representatives in both an array and a standard disjoint-set data structure [15]. Together, our data structures take $nH(\mathrm{sets}(C)) + O(|C|\log n) + o(n)(H(\mathrm{sets}(C))+1)$ bits.
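The two components just described, the set-identifier string $s$ and a standard union-find structure over the sets, can be sketched in Python as follows. This is an illustrative toy under several assumptions: $s$ is a plain list rather than an instance of Theorem 1(ii), set identifiers double as representatives, path halving is used as the union-find heuristic, and the rebuild is triggered by hand rather than by tracking the entropy. The class and method names are hypothetical.

```python
class CompressedSets:
    def __init__(self, set_of):
        # s[i] = identifier of the set that initially contains element i (1-indexed)
        self.s = list(set_of)
        # standard disjoint-set forest over the set identifiers
        self.parent = {c: c for c in set(self.s)}

    def _find_id(self, c):
        """Find the representative identifier of set c, with path halving."""
        while self.parent[c] != c:
            self.parent[c] = self.parent[self.parent[c]]
            c = self.parent[c]
        return c

    def find(self, i):
        """find(i) on the collection is find(s[i]) on the identifiers."""
        return self._find_id(self.s[i - 1])

    def union(self, i, j):
        """Union the sets containing i and j; s itself is left untouched."""
        a, b = self.find(i), self.find(j)
        if a != b:
            self.parent[b] = a

    def rebuild(self):
        """Rewrite s with the current representatives and drop merged identifiers."""
        self.s = [self._find_id(c) for c in self.s]
        self.parent = {c: c for c in set(self.s)}


cs = CompressedSets([1, 1, 2, 3, 3, 2])
cs.union(1, 3)  # merges the sets identified by 1 and 2
assert cs.find(2) == cs.find(6)
cs.rebuild()
assert len(cs.parent) == 2
```

Between rebuilds only the small forest over identifiers changes, so unions are cheap; rewriting `s` is what lets the stored string reflect the lower entropy of the merged collection.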
We can perform a query $\mathrm{find}(i)$ on $C$ by performing $\mathrm{find}(s[i])$ on the representatives, and perform a $\mathrm{union}(i,j)$ operation on $C$ by just performing the corresponding operation on the representatives. In order for our data structure to shrink as we union the sets, we keep track of $H(\mathrm{sets}(C))$ and, whenever it shrinks by a factor of $1+\epsilon$, we rebuild our entire data structure; we will show in the full version of this paper that this takes $O(n)$ time. Since $H(\mathrm{sets}(C))$ is always less than $\log n$, we need to rebuild only $O(\log_{1+\epsilon}\log n) = O((1/\epsilon)\log\log n)$ times. Notice we can stop rebuilding once $nH(\mathrm{sets}(C)) = o(n)$ and the space bound becomes dominated by the $o(n)(H(\mathrm{sets}(C))+1)$ term.

^6 Otherwise we proceed as usual to map the domain to the effective one.

References

1. J. Barbay, M. He, J. I. Munro, and S. S. Rao. Succinct indexes for strings, binary relations and multi-labeled trees. In Proceedings of the 18th Symposium on Discrete Algorithms.
2. J. Barbay and G. Navarro. Compressed representations of permutations, and applications. In Proceedings of the 26th Symposium on Theoretical Aspects of Computer Science.
3. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th Symposium on String Processing and Information Retrieval.

4. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2).
5. P. Ferragina and R. Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1).
6. A. Golynski, J. I. Munro, and S. S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proceedings of the 17th Symposium on Discrete Algorithms.
7. R. González and G. Navarro. Statistical encoding of succinct data structures. In Proceedings of the 17th Symposium on Combinatorial Pattern Matching.
8. J. B. Hreinsson, M. Krøyer, and R. Pagh. Storing a compressed function with constant time access. In Proceedings of the 17th European Symposium on Algorithms.
9. V. Mäkinen and G. Navarro. Rank and select revisited and extended. Theoretical Computer Science, 387(3).
10. J. I. Munro, R. Raman, V. Raman, and S. S. Rao. Succinct representations of permutations. In Proceedings of the 30th International Colloquium on Algorithms, Languages and Programming.
11. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2.
12. N. Rahman and R. Raman. Rank and select operations on binary strings. In M.-Y. Kao, editor, Encyclopedia of Algorithms. Springer.
13. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).
14. K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. In Proceedings of the 17th Symposium on Discrete Algorithms.
15. R. E. Tarjan and J. van Leeuwen. Worst-case analysis of set union algorithms. Journal of the ACM, 31(2).


More information

A Simple Alphabet-Independent FM-Index

A Simple Alphabet-Independent FM-Index A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl

More information

Preview: Text Indexing

Preview: Text Indexing Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text

More information

Optimal lower bounds for rank and select indexes

Optimal lower bounds for rank and select indexes Optimal lower bounds for rank and select indexes Alexander Golynski David R. Cheriton School of Computer Science, University of Waterloo agolynski@cs.uwaterloo.ca Technical report CS-2006-03, Version:

More information

LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations

LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations Jérémy Barbay 1, Johannes Fischer 2, and Gonzalo Navarro 1 1 Department of Computer Science, University of Chile, {jbarbay gnavarro}@dcc.uchile.cl

More information

arxiv: v1 [cs.ds] 22 Nov 2012

arxiv: v1 [cs.ds] 22 Nov 2012 Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:

More information

Compressed Index for Dynamic Text

Compressed Index for Dynamic Text Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution

More information

A Faster Grammar-Based Self-Index

A Faster Grammar-Based Self-Index A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University

More information

A Space-Efficient Frameworks for Top-k String Retrieval

A Space-Efficient Frameworks for Top-k String Retrieval A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott

More information

Compact Data Strutures

Compact Data Strutures (To compress is to Conquer) Compact Data Strutures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23 TH AUGUST 2017 Agenda

More information

Reducing the Space Requirement of LZ-Index

Reducing the Space Requirement of LZ-Index Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of

More information

Efficient Accessing and Searching in a Sequence of Numbers

Efficient Accessing and Searching in a Sequence of Numbers Regular Paper Journal of Computing Science and Engineering, Vol. 9, No. 1, March 2015, pp. 1-8 Efficient Accessing and Searching in a Sequence of Numbers Jungjoo Seo and Myoungji Han Department of Computer

More information

Simple Compression Code Supporting Random Access and Fast String Matching

Simple Compression Code Supporting Random Access and Fast String Matching Simple Compression Code Supporting Random Access and Fast String Matching Kimmo Fredriksson and Fedor Nikitin Department of Computer Science and Statistics, University of Joensuu PO Box 111, FIN 80101

More information

Practical Indexing of Repetitive Collections using Relative Lempel-Ziv

Practical Indexing of Repetitive Collections using Relative Lempel-Ziv Practical Indexing of Repetitive Collections using Relative Lempel-Ziv Gonzalo Navarro and Víctor Sepúlveda CeBiB Center for Biotechnology and Bioengineering, Chile Department of Computer Science, University

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Forbidden Patterns. {vmakinen leena.salmela

Forbidden Patterns. {vmakinen leena.salmela Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu

More information

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv Based Compressed Text Indexing Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.

More information

New Algorithms and Lower Bounds for Sequential-Access Data Compression

New Algorithms and Lower Bounds for Sequential-Access Data Compression New Algorithms and Lower Bounds for Sequential-Access Data Compression Travis Gagie PhD Candidate Faculty of Technology Bielefeld University Germany July 2009 Gedruckt auf alterungsbeständigem Papier ISO

More information

arxiv: v2 [cs.ds] 28 Jan 2009

arxiv: v2 [cs.ds] 28 Jan 2009 Minimax Trees in Linear Time Pawe l Gawrychowski 1 and Travis Gagie 2, arxiv:0812.2868v2 [cs.ds] 28 Jan 2009 1 Institute of Computer Science University of Wroclaw, Poland gawry1@gmail.com 2 Research Group

More information

On Compressing and Indexing Repetitive Sequences

On Compressing and Indexing Repetitive Sequences On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv

More information

CRAM: Compressed Random Access Memory

CRAM: Compressed Random Access Memory CRAM: Compressed Random Access Memory Jesper Jansson 1, Kunihiko Sadakane 2, and Wing-Kin Sung 3 1 Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho,

More information

arxiv: v1 [cs.ds] 1 Jul 2018

arxiv: v1 [cs.ds] 1 Jul 2018 Representation of ordered trees with a given degree distribution Dekel Tsur arxiv:1807.00371v1 [cs.ds] 1 Jul 2018 Abstract The degree distribution of an ordered tree T with n nodes is n = (n 0,...,n n

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

Fast Fully-Compressed Suffix Trees

Fast Fully-Compressed Suffix Trees Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University

More information

Compact Indexes for Flexible Top-k Retrieval

Compact Indexes for Flexible Top-k Retrieval Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne

More information

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 2: Text Compression Lecture 5: Context-Based Compression Juha Kärkkäinen 14.11.2017 1 / 19 Text Compression We will now look at techniques for text compression. These techniques

More information

Succinct Data Structures for the Range Minimum Query Problems

Succinct Data Structures for the Range Minimum Query Problems Succinct Data Structures for the Range Minimum Query Problems S. Srinivasa Rao Seoul National University Joint work with Gerth Brodal, Pooya Davoodi, Mordecai Golin, Roberto Grossi John Iacono, Danny Kryzanc,

More information

More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries

More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries Roberto Grossi, Alessio Orlandi, Rajeev Raman, S. Srinivasa Rao To cite this version: Roberto Grossi, Alessio Orlandi, Rajeev

More information

LZ77-like Compression with Fast Random Access

LZ77-like Compression with Fast Random Access -like Compression with Fast Random Access Sebastian Kreft and Gonzalo Navarro Dept. of Computer Science, University of Chile, Santiago, Chile {skreft,gnavarro}@dcc.uchile.cl Abstract We introduce an alternative

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 1: Entropy Coding Lecture 4: Asymmetric Numeral Systems Juha Kärkkäinen 08.11.2017 1 / 19 Asymmetric Numeral Systems Asymmetric numeral systems (ANS) is a recent entropy

More information

Succincter text indexing with wildcards

Succincter text indexing with wildcards University of British Columbia CPM 2011 June 27, 2011 Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview

More information

arxiv:cs/ v1 [cs.it] 21 Nov 2006

arxiv:cs/ v1 [cs.it] 21 Nov 2006 On the space complexity of one-pass compression Travis Gagie Department of Computer Science University of Toronto travis@cs.toronto.edu arxiv:cs/0611099v1 [cs.it] 21 Nov 2006 STUDENT PAPER Abstract. We

More information

Text Indexing: Lecture 6

Text Indexing: Lecture 6 Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question

More information

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE General e Image Coder Structure Motion Video x(s 1,s 2,t) or x(s 1,s 2 ) Natural Image Sampling A form of data compression; usually lossless, but can be lossy Redundancy Removal Lossless compression: predictive

More information

Rotation and Lighting Invariant Template Matching

Rotation and Lighting Invariant Template Matching Rotation and Lighting Invariant Template Matching Kimmo Fredriksson 1, Veli Mäkinen 2, and Gonzalo Navarro 3 1 Department of Computer Science, University of Joensuu. kfredrik@cs.joensuu.fi 2 Department

More information

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their

More information

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms A General Lower ound on the I/O-Complexity of Comparison-based Algorithms Lars Arge Mikael Knudsen Kirsten Larsent Aarhus University, Computer Science Department Ny Munkegade, DK-8000 Aarhus C. August

More information

Opportunistic Data Structures with Applications

Opportunistic Data Structures with Applications Opportunistic Data Structures with Applications Paolo Ferragina Giovanni Manzini Abstract There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and

More information

arxiv: v3 [cs.ds] 6 Sep 2018

arxiv: v3 [cs.ds] 6 Sep 2018 Universal Compressed Text Indexing 1 Gonzalo Navarro 2 arxiv:1803.09520v3 [cs.ds] 6 Sep 2018 Abstract Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of

More information

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.

More information

Grammar Compressed Sequences with Rank/Select Support

Grammar Compressed Sequences with Rank/Select Support Grammar Compressed Sequences with Rank/Select Support Gonzalo Navarro and Alberto Ordóñez 2 Dept. of Computer Science, Univ. of Chile, Chile. gnavarro@dcc.uchile.cl 2 Lab. de Bases de Datos, Univ. da Coruña,

More information

Succinct 2D Dictionary Matching with No Slowdown

Succinct 2D Dictionary Matching with No Slowdown Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d

More information

Optimal-Time Text Indexing in BWT-runs Bounded Space

Optimal-Time Text Indexing in BWT-runs Bounded Space Optimal-Time Text Indexing in BWT-runs Bounded Space Travis Gagie Gonzalo Navarro Nicola Prezza Abstract Indexing highly repetitive texts such as genomic databases, software repositories and versioned

More information

Università degli studi di Udine

Università degli studi di Udine Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,

More information

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey

More information

arxiv: v1 [cs.ds] 8 Sep 2018

arxiv: v1 [cs.ds] 8 Sep 2018 Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology

More information

Complementary Contextual Models with FM-index for DNA Compression

Complementary Contextual Models with FM-index for DNA Compression 2017 Data Compression Conference Complementary Contextual Models with FM-index for DNA Compression Wenjing Fan,WenruiDai,YongLi, and Hongkai Xiong Department of Electronic Engineering Department of Biomedical

More information

An Approximation Algorithm for Constructing Error Detecting Prefix Codes

An Approximation Algorithm for Constructing Error Detecting Prefix Codes An Approximation Algorithm for Constructing Error Detecting Prefix Codes Artur Alves Pessoa artur@producao.uff.br Production Engineering Department Universidade Federal Fluminense, Brazil September 2,

More information

arxiv: v2 [cs.ds] 3 Oct 2017

arxiv: v2 [cs.ds] 3 Oct 2017 Orthogonal Vectors Indexing Isaac Goldstein 1, Moshe Lewenstein 1, and Ely Porat 1 1 Bar-Ilan University, Ramat Gan, Israel {goldshi,moshe,porately}@cs.biu.ac.il arxiv:1710.00586v2 [cs.ds] 3 Oct 2017 Abstract

More information

In-Memory Storage for Labeled Tree-Structured Data

In-Memory Storage for Labeled Tree-Structured Data In-Memory Storage for Labeled Tree-Structured Data by Gelin Zhou A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer

More information

Jumbled String Matching: Motivations, Variants, Algorithms

Jumbled String Matching: Motivations, Variants, Algorithms Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov

More information

Advanced Text Indexing Techniques. Johannes Fischer

Advanced Text Indexing Techniques. Johannes Fischer Advanced ext Indexing echniques Johannes Fischer SS 2009 1 Suffix rees, -Arrays and -rays 1.1 Recommended Reading Dan Gusfield: Algorithms on Strings, rees, and Sequences. 1997. ambridge University Press,

More information

A Simpler Analysis of Burrows-Wheeler Based Compression

A Simpler Analysis of Burrows-Wheeler Based Compression A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan School of Computer Science, Tel Aviv University, Tel Aviv, Israel; email: haimk@post.tau.ac.il Shir Landau School of Computer Science,

More information

A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform

A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform A Four-Stage Algorithm for Updating a Burrows-Wheeler ransform M. Salson a,1,. Lecroq a, M. Léonard a, L. Mouchard a,b, a Université de Rouen, LIIS EA 4108, 76821 Mont Saint Aignan, France b Algorithm

More information

Lecture 4 : Adaptive source coding algorithms

Lecture 4 : Adaptive source coding algorithms Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv

More information

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner

More information

Optimal Color Range Reporting in One Dimension

Optimal Color Range Reporting in One Dimension Optimal Color Range Reporting in One Dimension Yakov Nekrich 1 and Jeffrey Scott Vitter 1 The University of Kansas. yakov.nekrich@googlemail.com, jsv@ku.edu Abstract. Color (or categorical) range reporting

More information

Approximate String Matching with Ziv-Lempel Compressed Indexes

Approximate String Matching with Ziv-Lempel Compressed Indexes Approximate String Matching with Ziv-Lempel Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2, and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,

More information

3F1 Information Theory, Lecture 3

3F1 Information Theory, Lecture 3 3F1 Information Theory, Lecture 3 Jossy Sayir Department of Engineering Michaelmas 2011, 28 November 2011 Memoryless Sources Arithmetic Coding Sources with Memory 2 / 19 Summary of last lecture Prefix-free

More information

the subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology

the subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology A simple sub-quadratic algorithm for computing the subset partial order Paul Pritchard P.Pritchard@cit.gu.edu.au Technical Report CIT-95-04 School of Computing and Information Technology Grith University

More information

A Faster Grammar-Based Self-index

A Faster Grammar-Based Self-index A Faster Grammar-Based Self-index Travis Gagie 1,Pawe l Gawrychowski 2,, Juha Kärkkäinen 3, Yakov Nekrich 4, and Simon J. Puglisi 5 1 Aalto University, Finland 2 University of Wroc law, Poland 3 University

More information

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility Tao Jiang, Ming Li, Brendan Lucier September 26, 2005 Abstract In this paper we study the Kolmogorov Complexity of a

More information

Space-Efficient Data-Analysis Queries on Grids

Space-Efficient Data-Analysis Queries on Grids Space-Efficient Data-Analysis Queries on Grids Gonzalo Navarro a,1, Yakov Nekrich a,1, Luís M. S. Russo c,b,2 a Dept. of Computer Science, University of Chile. b KDBIO / INESC-ID c Instituto Superior Técnico

More information

Approximate String Matching with Lempel-Ziv Compressed Indexes

Approximate String Matching with Lempel-Ziv Compressed Indexes Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,

More information

Module 1: Analyzing the Efficiency of Algorithms

Module 1: Analyzing the Efficiency of Algorithms Module 1: Analyzing the Efficiency of Algorithms Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu What is an Algorithm?

More information

Bloom Filters, general theory and variants

Bloom Filters, general theory and variants Bloom Filters: general theory and variants G. Caravagna caravagn@cli.di.unipi.it Information Retrieval Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered.

More information

Disjoint-Set Forests

Disjoint-Set Forests Disjoint-Set Forests Thanks for Showing Up! Outline for Today Incremental Connectivity Maintaining connectivity as edges are added to a graph. Disjoint-Set Forests A simple data structure for incremental

More information

arxiv: v2 [cs.ds] 6 Jul 2015

arxiv: v2 [cs.ds] 6 Jul 2015 Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information

String Matching with Variable Length Gaps

String Matching with Variable Length Gaps String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards

More information

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding SIGNAL COMPRESSION Lecture 3 4.9.2007 Shannon-Fano-Elias Codes and Arithmetic Coding 1 Shannon-Fano-Elias Coding We discuss how to encode the symbols {a 1, a 2,..., a m }, knowing their probabilities,

More information

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak 4. Quantization and Data Compression ECE 32 Spring 22 Purdue University, School of ECE Prof. What is data compression? Reducing the file size without compromising the quality of the data stored in the

More information

UNIT I INFORMATION THEORY. I k log 2

UNIT I INFORMATION THEORY. I k log 2 UNIT I INFORMATION THEORY Claude Shannon 1916-2001 Creator of Information Theory, lays the foundation for implementing logic in digital circuits as part of his Masters Thesis! (1939) and published a paper

More information

Efficient Alphabet Partitioning Algorithms for Low-complexity Entropy Coding

Efficient Alphabet Partitioning Algorithms for Low-complexity Entropy Coding Efficient Alphabet Partitioning Algorithms for Low-complexity Entropy Coding Amir Said (said@ieee.org) Hewlett Packard Labs, Palo Alto, CA, USA Abstract We analyze the technique for reducing the complexity

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

An Algorithmic Framework for Compression and Text Indexing

An Algorithmic Framework for Compression and Text Indexing An Algorithmic Framework for Compression and Text Indexing Roberto Grossi Ankur Gupta Jeffrey Scott Vitter Abstract We present a unified algorithmic framework to obtain nearly optimal space bounds for

More information

On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004

On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004 On Universal Types Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA University of Minnesota, September 14, 2004 Types for Parametric Probability Distributions A = finite alphabet,

More information

Internal Pattern Matching Queries in a Text and Applications

Internal Pattern Matching Queries in a Text and Applications Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords

More information

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal

More information

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017 Maximal Unbordered Factors of Random Strings arxiv:1704.04472v1 [cs.ds] 14 Apr 2017 Patrick Hagge Cording 1 and Mathias Bæk Tejs Knudsen 2 1 DTU Compute, Technical University of Denmark, phaco@dtu.dk 2

More information