Rank and Select Operations on Binary Strings (1974; Elias)


Naila Rahman, University of Leicester, www.cs.le.ac.uk/~nyr1
Rajeev Raman, University of Leicester, www.cs.le.ac.uk/~rraman
Entry editor: Paolo Ferragina

INDEX TERMS: Succinct data structures, bit-vectors, predecessor search, sets, multisets.

SYNONYMS: Binary bit-vector, compressed bit-vector, rank and select dictionary, fully indexable dictionary (FID).

1 PROBLEM DEFINITION

Given a static sequence b = b_1 ... b_m of m bits, preprocess the sequence and create a space-efficient data structure that supports the following operations rapidly:

rank_1(i) takes an index i as input, 1 ≤ i ≤ m, and returns the number of 1s among b_1 ... b_i.

select_1(i) takes an index i ≥ 1 as input, and returns the position of the i-th 1 in b, or −1 if i is greater than the number of 1s in b.

The operations rank_0 and select_0 are defined analogously for the 0s in b. As rank_0(i) = i − rank_1(i), one considers just rank_1 (abbreviated to rank), and refers to select_0 and select_1 collectively as select. In what follows, |x| denotes the length of a bit sequence x and w(x) denotes the number of 1s in it. b is always used to denote the input bit sequence, m to denote |b| and n to denote w(b).

Models of Computation, Time and Space Bounds. Two models of computation are commonly considered. One is the unit-cost RAM model with word size O(lg m) bits [1]. The other model, which is particularly useful for proving lower bounds, is the bit-probe model, where the data structure is stored in bit-addressable memory, and the complexity of answering a query is the worst-case number of bits of the data structure that are probed by the algorithm to answer that query. In the RAM model, the algorithm can read O(lg m) consecutive bits in one step, so supporting all operations in O(1) time on the RAM model implies a solution that uses O(lg m) bit-probes, but the converse is not true.
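As a point of reference, the operations themselves can be stated in a few lines of Python. This is a naive O(m)-time sketch of the definitions only, not one of the O(1)-time structures discussed below, and the function names are ours:

```python
def rank1(b: str, i: int) -> int:
    """Number of 1s among b[1..i] (1-based, as in the problem definition)."""
    return b[:i].count("1")

def select1(b: str, i: int) -> int:
    """Position of the i-th 1 in b (1-based), or -1 if b has fewer than i 1s."""
    count = 0
    for pos, bit in enumerate(b, start=1):
        if bit == "1":
            count += 1
            if count == i:
                return pos
    return -1

def rank0(b: str, i: int) -> int:
    """rank0 reduces to rank1: rank0(i) = i - rank1(i)."""
    return i - rank1(b, i)

b = "01100100"
assert rank1(b, 4) == 2      # 1s among b[1..4] = 0,1,1,0
assert select1(b, 3) == 6    # the third 1 sits at position 6
assert select1(b, 4) == -1   # b contains only three 1s
assert rank0(b, 5) == 3
```

The structures below answer the same queries in O(1) time while storing b in (close to) the information-theoretic minimum space.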
This entry considers three variants of the problem: in each variant, rank and select must be supported in O(1) time on the RAM model, or in O(lg m) bit-probes. However, the use of memory varies:

Problem 1 (Bit-Vector). The overall space used must be m + o(m) bits.

Problem 2 (Bit-Vector Index). b is given in read-only memory and the algorithm can create auxiliary data structures (called indices) which must use o(m) bits. Indices allow the representation of b to be de-coupled from the auxiliary data structure; e.g., b can be stored (in a potentially highly compressed form) in a data structure such as that of

[6, 9, 10], which allows access to O(lg m) consecutive bits of b in O(1) time on the RAM model. Most bit-vectors developed to date are bit-vector indices.

Recalling that n = w(b), observe that if m and n are known to an algorithm, there are only ℓ = (m choose n) possibilities for b, so an information-theoretically optimal encoding of b would require B(m, n) = ⌈lg ℓ⌉ bits (it can be verified that B(m, n) < m for all m, n). The next problem is:

Problem 3 (Compressed Bit-Vector). The overall space used must be B(m, n) + o(n) bits.

It is helpful to understand the asymptotics of B(m, n) in order to appreciate the difference between the bit-vector and the compressed bit-vector problems. Using standard approximations of the factorial function, one can show [15] that:

    B(m, n) = n lg(m/n) + n lg e + O(n^2/m)     (1)

If n = o(m), then B(m, n) = o(m); if such a sparse sequence b were represented as a compressed bit-vector, it would occupy o(m) bits, rather than m + o(m) bits.

B(m, n) = m − O(lg m) whenever |n − m/2| = O(√(m lg m)). In such cases, a compressed bit-vector will take about the same amount of space as a bit-vector.

Taking p = n/m, H_0(b) = p lg(1/p) + (1 − p) lg(1/(1 − p)) is the empirical zeroth-order entropy of b. If b is compressed using an entropy compressor such as non-adaptive arithmetic coding [18], the size of the compressed output is at least mH_0(b) bits. However, B(m, n) = mH_0(b) − O(lg m). Applying Golomb coding to the gaps between successive 1s, which is the best way to compress bit sequences that represent inverted lists [18], also gives a space usage close to B(m, n) [4].

Related Problems. Viewing b as the characteristic vector of a set S ⊆ U = {1, ..., m}, note that the well-known predecessor problem (given y ∈ U, return pred(y) = max{z ∈ S : z ≤ y}) may be implemented as select_1(rank_1(y)). One may also view b as a multiset of size m − n over the universe {1, ..., n + 1} [5]. First, append a 1 to b.
Then, take each of the n + 1 1s to be the elements of the universe, and the number of consecutive 0s immediately preceding a 1 to indicate its multiplicity. For example, b = 01100100 maps to the multiset {1, 3, 3, 4, 4}. Seen this way, select_1(i) − i on b gives the number of items in the multiset that are ≤ i, and select_0(i) − i + 1 gives the value of the i-th element of the multiset.

Lower-Order Terms. From an asymptotic viewpoint, the space utilization is dominated by the main terms in the space bound. However, the second (apparently lower-order) terms are of interest for several reasons: primarily because they are extremely significant in determining practical space usage, and also because non-trivial lower bounds have been proven on their size.

2 KEY RESULTS

2.1 Reductions

It has already been noted that rank_0 and rank_1 reduce to each other, and that operations on multisets reduce to select operations on a bit sequence. Some other reductions, whereby one can support operations on b by performing operations on bit sequences derived from b, are:

Theorem 1. (a) rank reduces to select_0 on a bit sequence c such that |c| = m + n and w(c) = n.

(b) If b has no consecutive 1s, then select_0 on b can be reduced to rank on a bit sequence c such that |c| = m − n and w(c) is either n − 1 or n.

(c) From b one can derive two bit sequences b_0 and b_1 such that |b_0| = m − n, |b_1| = n, w(b_0), w(b_1) ≤ min{m − n, n}, and select_0 and select_1 on b can be supported by supporting select_1 and rank on b_0 and b_1.

Parts (a) and (b) follow from Elias's observations on multiset representations (see the Related Problems paragraph), specialised to sets. For part (a), create c from b by adding a 0 after every 1. For example, if b = 01100100 then c = 01010001000. Then, rank_1(i) on b equals select_0(i) − i on c. For part (b), essentially invert the mapping of part (a). Part (c) is shown in [3].

2.2 Bit-Vector Indices

Theorem 2 ([8]). There is an index of size (1 + o(1))(m lg lg m / lg m) + O(m / lg m) bits that supports rank and select in O(1) time on the RAM model.

Elias previously gave an o(m)-bit index that supported select in O(lg m) bit-probes on average (where the average was computed across all select queries). Jacobson [11] gave o(m)-bit indices that supported rank and select in O(lg m) bit-probes in the worst case. Clark and Munro [2] gave the first o(m)-bit indices that support both rank and select in O(1) time on the RAM. A matching lower bound on the size of indices has also been shown (this also applies to indices which support rank and select in O(1) time on the RAM model):

Theorem 3 ([8]). Any index that allows rank or select_1 to be supported in O(lg m) bit-probes has size Ω(m lg lg m / lg m) bits.

2.3 Compressed Bit-Vectors

Theorem 4. There is a compressed bit-vector that uses:

(a) B(m, n) + O(m lg lg m / lg m) bits and supports rank and select in O(1) time.

(b) B(m, n) + O(n (lg lg n)^2 / lg n) bits and supports rank in O(1) time, when n = m/(lg m)^{O(1)}.

(c) B(m, n) + O(n lg lg n / lg n) bits and supports select_1 in O(1) time.

Theorem 4(a) and (c) were shown by Raman et al. [17] and Theorem 4(b) by Pagh [15].
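Elias's multiset mapping and the reduction of Theorem 1(a) are easy to check numerically. The following Python sketch (all helper names are ours) reproduces the running example b = 01100100:

```python
def select(b: str, bit: str, i: int) -> int:
    """1-based position of the i-th occurrence of `bit` in b, or -1."""
    count = 0
    for pos, c in enumerate(b, start=1):
        if c == bit:
            count += 1
            if count == i:
                return pos
    return -1

def rank1(b: str, i: int) -> int:
    """Number of 1s among b[1..i]."""
    return b[:i].count("1")

def to_multiset(b: str) -> list:
    """Elias's mapping: append a 1, then read the run of 0s before the
    j-th 1 as the multiplicity of element j of the universe."""
    result, run, element = [], 0, 0
    for c in b + "1":
        if c == "0":
            run += 1
        else:
            element += 1
            result += [element] * run
            run = 0
    return result

b = "01100100"
assert to_multiset(b) == [1, 3, 3, 4, 4]

# Multiset queries on the appended sequence, as described above:
bp = b + "1"
assert select(bp, "1", 3) - 3 == 3       # number of items <= 3 in {1,3,3,4,4}
assert select(bp, "0", 5) - 5 + 1 == 4   # value of the 5th multiset element

# Theorem 1(a): add a 0 after every 1 of b to obtain c; then
# rank1(i) on b equals select0(i) - i on c, for every i.
c = "".join(ch + "0" if ch == "1" else ch for ch in b)
assert c == "01010001000"                # |c| = m + n = 11, w(c) = n = 3
for i in range(1, len(b) + 1):
    assert rank1(b, i) == select(c, "0", i) - i
```

The final loop verifies the part (a) identity at every position of b, matching the worked example above.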
Note that Theorem 4(a) has a lower-order term that is o(m), rather than o(n) as required by the problem statement. As compressed bit-vectors must represent b compactly, they are not bit-vector indices, and the lower bound of Theorem 3 does not apply to compressed bit-vectors. Coarser lower bounds are obtained by reduction to the predecessor problem on sets of integers, for which tight upper and lower bounds in the RAM model are now known. In particular, the work of [16] implies:

Theorem 5. Let U = {1, ..., M} and let S ⊆ U, |S| = N. Any data structure on a RAM with word size O(lg M) bits that occupies at most O(N lg M) bits of space can support predecessor queries on S in O(1) time only when N = M/(lg M)^{O(1)} or N = (lg M)^{O(1)}.

As noted in the paragraph Related Problems, the predecessor problem can be solved by the use of rank and select_1 operations. Thus, Theorem 5 has consequences for compressed bit-vector data structures, which are spelt out below:

Corollary 1. There is no data structure that uses B(m, n) + o(n) bits and supports either rank or select_0 in O(1) time unless n = m/(lg m)^{O(1)} or n = (lg m)^{O(1)}.

Given a set S ⊆ U = {1, ..., m}, |S| = n, we have already noted that the predecessor problem on S is equivalent to rank and select_1 on a bit-vector c with w(c) = n and |c| = m. However, B(m, n) + o(n) = O(n lg m). Thus, given a bit-vector that uses B(m, n) + o(n) bits and supports rank in O(1) time for m = n(lg n)^{ω(1)}, we can augment it with the trivial O(1)-time data structure for select_1 that stores the value of select_1(i) for i = 1, ..., n (which occupies a further O(n lg m) bits), solving the predecessor problem in O(1) time, a contradiction. The hardness of select_0 is shown in [17], but follows easily from Theorem 1(a) and Equation (1).

3 APPLICATIONS

There are a vast number of applications of bit-vectors in succinct and compressed data structures (see e.g. [13]). Such data structures are used for, e.g., text indexing, compact representations of graphs and trees, and representations of semi-structured (XML) data.

4 EXPERIMENTAL RESULTS

Several teams have implemented bit-vectors and compressed bit-vectors. When implementing bit-vectors for good practical performance, both in terms of speed and space usage, the lower-order terms are very important, even for uncompressed bit-vectors (see footnote 1), and can dominate the space usage even for bit-vector sizes that are at the limit of conceivable future practical interest. Unfortunately, this problem may not be best addressed purely by a theoretical analysis of the lower-order terms.

Bit-vectors work by partitioning the input bit sequence into (usually equal-sized) blocks at several levels of granularity; usually 2-3 levels are needed to obtain a space bound of m + o(m) bits. However, better space usage, as well as better speed in practice, can be obtained by reducing the number of levels, resulting in space bounds of the form (1 + ε)m bits, for any ε > 0, with support for rank and select in O(1/ε) time. Clark [2] implemented bit-vectors for external-memory suffix trees.
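To illustrate the space/time trade-off just described, here is a minimal single-level rank directory (our sketch, not the implementation of [7] or [12]; the block-size choice s = ⌈lg m / ε⌉ is one standard way to get roughly εm bits of directory on top of the raw sequence, with rank scanning at most s bits, i.e. O(1/ε) words):

```python
import math

class OneLevelRank:
    """Single-level rank directory sketching the (1 + eps)m-bit trade-off:
    a cumulative 1-count is stored every s bits, so rank needs one table
    lookup plus a scan of at most s bits. Larger eps -> larger blocks ->
    less directory space but slower queries."""

    def __init__(self, bits: str, eps: float = 0.5):
        self.bits = bits
        m = max(len(bits), 2)
        self.s = max(1, math.ceil(math.log2(m) / eps))   # block size in bits
        # counts[j] = number of 1s in the first j blocks
        self.counts = [0]
        for blk_start in range(0, len(bits), self.s):
            block = bits[blk_start:blk_start + self.s]
            self.counts.append(self.counts[-1] + block.count("1"))

    def rank1(self, i: int) -> int:
        """Number of 1s among bits[1..i] (1-based)."""
        blk = i // self.s                     # complete blocks before position i
        return self.counts[blk] + self.bits[blk * self.s:i].count("1")

bv = OneLevelRank("01100100", eps=0.5)
assert [bv.rank1(i) for i in range(1, 9)] == [0, 1, 2, 2, 2, 3, 3, 3]
```

Real implementations pack the counters into machine words and use hardware popcount for the in-block scan; adding a second level of smaller counters is what brings the directory down to the o(m) bits of the theoretical bounds.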
More recently, an implementation using ideas of Clark and Jacobson was used by [7]; it occupied (1 + ε)m bits and supported operations in O(1/ε) time. Using a substantially different approach, Kim et al. [12] gave a bit-vector that takes (2 + ε)n bits to support rank and select. Experiments using bit sequences derived from real-world data in [3, 4] showed that, if parameters are set to ensure that [12] and [7] use similar space on typical inputs, the Clark-Jacobson implementation of [7] is somewhat faster than an implementation of [12]. On some inputs, the Clark-Jacobson implementation can use significantly more space, whereas Kim et al.'s bit-vector appears to have stable space usage; Kim et al.'s bit-vector may also be superior for somewhat sparse bit-vectors. Combining ideas from [7, 12], a third practical bit-vector (which is not a bit-vector index) was described in [4], and appears to have the desirable features of both [12] and [7]. A first implementational study of compressed bit-vectors can be found in [14] (compressed bit-vectors supporting only select_1 were considered in [4]).

5 URL to CODE

Bit-vector implementations from [3, 4, 7] can be found at http://hdl.handle.net/2381/318.

6 CROSS REFERENCES

Arithmetic Coding for Data Compression; Compressed Text Indexing; Succinct Encoding of Permutations and its Applications to Text Indexing; Tree Compression and Indexing.

Footnote 1: For compressed bit-vectors, the lower-order o(m) or o(n) term can dominate B(m, n), but this is not our concern here.

7 RECOMMENDED READING

[1] A. V. Aho, J. E. Hopcroft and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[2] D. Clark and J. I. Munro. Efficient suffix trees on secondary storage. In Proc. 7th ACM-SIAM SODA, pp. 383-391, 1996.
[3] O. Delpratt, N. Rahman and R. Raman. Engineering the LOUDS succinct tree representation. In Proc. WEA 2006, LNCS 4007, pp. 134-145, Springer, 2006.
[4] O. Delpratt, N. Rahman and R. Raman. Compressed prefix sums. In Proc. SOFSEM 2007, LNCS 4362, pp. 235-247, 2007.
[5] P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, April 1974.
[6] P. Ferragina and R. Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372 (2007), pp. 115-121.
[7] R. F. Geary, N. Rahman, R. Raman and V. Raman. A simple optimal representation for balanced parentheses. Theoretical Computer Science, 368 (2006), pp. 231-246.
[8] A. Golynski. Optimal lower bounds for rank and select indexes. In Proc. ICALP 2006, Part I, LNCS 4051, pp. 370-381, 2006.
[9] R. González and G. Navarro. Statistical encoding of succinct data structures. In Proc. CPM 2006, LNCS 4009, pp. 294-305, Springer, 2006.
[10] K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. In Proc. 17th ACM-SIAM SODA, pp. 1230-1239, 2006.
[11] G. Jacobson. Space-efficient static trees and graphs. In Proc. 30th FOCS, pp. 549-554, 1989.
[12] D. K. Kim, J. C. Na, J. E. Kim and K. Park. Efficient implementation of rank and select functions for succinct representation. In Proc. WEA 2005, LNCS 3505, pp. 315-327, 2005.
[13] J. I. Munro and S. Srinivasa Rao. Succinct representation of data structures. Chapter 37 in Handbook of Data Structures and Applications, D. Mehta and S. Sahni, eds., Chapman and Hall/CRC Press, 2005.
[14] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX 07), SIAM, 2007.
[15] R. Pagh. Low redundancy in static dictionaries with constant query time. SIAM Journal on Computing, 31 (2001), pp. 353-363.
[16] M. Patrascu and M. Thorup. Time-space trade-offs for predecessor search. In Proc. 38th ACM STOC, pp. 232-240, 2006.
[17] R. Raman, V. Raman and S. S. Rao. Succinct indexable dictionaries, with applications to representing k-ary trees and multisets. In Proc. 13th ACM-SIAM SODA, pp. 233-242, 2002.
[18] I. H. Witten, A. Moffat and T. C. Bell. Managing Gigabytes, 2nd edition. Morgan Kaufmann, 1999.