Minimal Indices for Successor Search


TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
SCHOOL OF COMPUTER SCIENCE

Minimal Indices for Successor Search

This thesis was submitted in partial fulfillment of the requirements for the M.Sc. degree in the School of Computer Science, Tel-Aviv University

by

Sarel Cohen

The research work for this thesis has been carried out at Tel-Aviv University under the supervision of Prof. Amos Fiat and Prof. Haim Kaplan

December 2012

Acknowledgements

I would like to thank my advisors Prof. Amos Fiat and Prof. Haim Kaplan for their help, guidance, and numerous contributions to this work. The many talks we had during the writing of this dissertation have inspired me to new and wonderful ideas.

I would like to thank my parents and my sister for their endless love, support and encouragement throughout my life. All the support you have given me over the years is the greatest gift anyone has ever given me. Without your love and support this thesis wouldn't have been possible.

I would like to thank Moshik Hershcovitch, who coauthored the extended abstract of this work, for his many ideas and numerous contributions to this work. I would like to thank Nir Shavit for introducing us to the problems of contention in multicore environments, for posing the question of multicore-efficient data structures, and for many useful discussions. We also wish to thank Mikkel Thorup for his kindness and useful comments.

Abstract

We give a new successor data structure which improves upon the index size of the Pǎtraşcu-Thorup data structures, reducing the index size from O(nw^{4/5}) bits to O(n log w) bits, with optimal probe complexity. Alternatively, our new data structure can be viewed as matching the space complexity of the (probe-suboptimal) z-fast trie of Belazzougui et al. Thus, we get the best of both approaches with respect to both probe count and index size. The penalty we pay is an extra O(log w) inter-register operations. Our data structure can also be used to solve the weak prefix search problem; the index size of O(n log w) bits is known to be optimal for any such data structure.

The technical contributions include highly efficient single word indices with out-degree w/ log w (compared to the w^{1/5} out-degree of fusion tree based indices). To construct such high efficiency single word indices we devise highly efficient bit extractors which, we believe, are of independent interest.

Contents

Abstract
1 Introduction
  1.1 The word RAM model
2 Related Work
  2.1 The Successor Problem
    2.1.1 Blind Tries and Important Bits
    2.1.2 Fusion Trees
  2.2 Implicit and Succinct Data Structures
  2.3 I/O Efficient Data Structures
  2.4 Reducing Probe Complexity: New Motivation from Multicore Architectures
3 Our Results
  3.1 A Succinct, Probe-Optimal Data-Structure for Successor Search
  3.2 α, β, γ nodes
  3.3 Introducing Bit-Extractors and Building a (k, k)-Bit Extractor
  3.4 Optimality Of Results: Probes, Operations, and Space
4 α-nodes and β-nodes
  4.1 α-nodes
  4.2 β-nodes
    4.2.1 The Prefix Partitioning Lemma
    4.2.2 Construction of a B-structure
    4.2.3 Querying the B-node
    4.2.4 Correctness of the B-structure
    4.2.5 Making it succinct: from B-structure to β-structure
5 Bit Extractors
  5.1 The ever changing x and I
  5.2 Phase 1: Packing blocks to the left (Figure 5.1)
  5.3 Phase 2: Sorting blocks by size (Figure 5.4)
  5.4 Phase 3: Dispersing bits (Figure 5.5)
  5.5 Phase 4: Packing bits (Figure 5.6)
  5.6 Phase 5: Spacing the bits (Figure 5.7)
  5.7 Phase 6: Duplicating bits (Figure 5.8)
  5.8 Phase 7: Final Positioning (Figure 5.9)
  5.9 Permuting elements in a word by simulating a Benes network
    5.9.1 Overview of the Benes-Network
    5.9.2 Permuting the b = w/ log w leftmost bits of the word
    5.9.3 Permuting the w/ log w leftmost blocks of the word
6 γ-nodes
  6.1 Construction of Slow γ-nodes
  6.2 Improving the running time
    Blind binary search
    Comparing x[i_q] and ζ_q
    Precomputed constants so as to compute ζ_q efficiently
    Choosing the right sequence I
    Computing ζ_q and x[i_q] during the binary search
7 Applications of High Degree Word Indices
  7.1 Weak Prefix Search
8 Open Issues
Bibliography

Chapter 1

Introduction

A fundamental problem in data structures is the successor problem: given a RAM with w-bit word operations and n keys (each w bits long), give a data structure that answers successor queries efficiently. Of particular interest are data structures with small index structures, sublinear in the input size, that can fit within a high speed cache.

The simplest successor data structure is a sorted list; this requires no index, and O(log n) probes for binary search. This is highly inefficient for large data sets, as the memory addresses accessed are widely dispersed. Given a high speed cache, one can improve performance by building a small index via x-fast tries [Wil83]: use an x-fast trie for a set of equally spaced splitters, ranks poly(w) apart. Such an index requires only O(n/poly(w)) bits. Unfortunately, the number of probes external to the index is still relatively large, O(log w) probes.

Another approach is to employ one or more levels of Fusion tree nodes; such nodes have outdegree B = w^{1/5}. Done appropriately, this gives an index of size O(nw/B) = O(nw^{4/5}) bits, and only O(1) probes external to the index. This motivates searching for alternatives to Fusion tree nodes that have outdegree greater than w^{1/5}. This we do in this paper, obtaining outdegree w/ log w. Yet another consideration, due to multicore shared cache contention issues, makes it desirable,

not only to reduce the index size, but also to reduce the number of probes to the index.

Pǎtraşcu and Thorup [PT06] solve the successor problem optimally (to within an O(1) factor) for any possible point along the probe count/space tradeoff, and for any value of n and w. However, they do not distinguish between the space required to store the input (which is exactly nw bits) and the extra space required; they consider only the total space. This distinction is of interest when the index size (the extra space) is small, o(nw) bits. Conversely, the probabilistic z-fast trie of Belazzougui et al. [BBPV09] reduces the extra space requirement to O(n log w) extra bits, but requires a (suboptimal) expected O(log w) probes (and O(log n) probes in the worst case).

We give a (deterministic) data structure that, for any n, w, answers a successor query with an optimal number of probes (within an O(1) factor), and requires only O(n log w) extra bits. We remark that we only access O(1) words in the extra storage (which is true of the Pǎtraşcu-Thorup data structures as well, with minor modifications). The penalty we pay is an additional O(log w) in the time complexity.

Many successor data structures make use of bucketing techniques, as do Pǎtraşcu and Thorup, and as we do here. A key component of the Pǎtraşcu-Thorup construction is a single word fusion-tree node [FW93]. Analogously to the use of fusion-nodes, we make use of new single word indices, denoted by α, β, and γ nodes hereinafter. At the heart of our construction are single word indices with outdegree w/ log w (vs. outdegree w^{1/5} for fusion nodes).

1.1 The word RAM model

We assume a RAM model of computation with w bits per word. Generally, we assume word operations take O(1) time. We give several alternative constructions; our bit extractors and our penultimate γ-node construction do not require multiplication and meet the restrictions of a Practical RAM, as defined by Peter Bro Miltersen [Mil96]. A key (or query) is one word (w bits long).

Chapter 2

Related Work

2.1 The Successor Problem

Previous work on the successor problem includes van Emde Boas priority queues [veb77], Fusion trees due to Fredman and Willard [FW93], the Beame and Fich successor data structure [PF99], x-fast tries and y-fast tries due to Willard [Wil83], z-fast tries due to Belazzougui et al. [BBPV09, BBPV08, BBPV10], and the upper and lower bounds of Pǎtraşcu & Thorup [PT06]. See Table 2.1 for a comparison between these data structures with respect to the space and probe parameters under consideration. See Figure 2.1 for a graphical description of the minimal probe complexity as a function of n and w.

2.1.1 Blind Tries and Important Bits

Blind tries (such as PATRICIA [Mor68]), which can be used to answer successor queries, were devised to prevent excessive access to external memory. Blind tries delay making comparisons as long as possible by skipping over unimportant segments of the search pattern. See Figure 2.2. Given some set of binary bit strings, S, consider the blind trie whose leaves are the elements of S. Define the important indices to be the indices of bits used to branch (somewhere) in the trie. The number of important indices for S is no more than |S|.
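As a concrete illustration, the following minimal sketch (ours, not from the thesis) computes the important indices of a set of distinct, equal-length bit strings by walking the implicit trie; a position is important exactly when some trie node branches on it.

```python
def important_indices(S):
    """Branching (important) bit positions of the blind trie over S.
    Assumes S is a set of distinct, equal-length bit strings; there are
    at most |S| - 1 branching nodes, hence at most |S| important indices."""
    def rec(keys, depth, out):
        if len(keys) <= 1 or depth >= len(keys[0]):
            return
        zeros = [k for k in keys if k[depth] == "0"]
        ones = [k for k in keys if k[depth] == "1"]
        if zeros and ones:
            out.add(depth)              # the trie branches on bit `depth`
        rec(zeros, depth + 1, out)
        rec(ones, depth + 1, out)
    out = set()
    rec(sorted(S), 0, out)
    return sorted(out)
```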

| Data Structure | Ref. | Index size (bits) | # Non-index probes | Total # probes | # operations |
| Binary Search | | none | O(log n) | O(log n) | O(#probes) |
| van Emde Boas | [veb77] | O(2^w) | O(1) | O(log w) | O(#probes) |
| x-fast trie | [Wil83] | O(nw^2) | O(1) | O(log w) | O(#probes) |
| y-fast trie | [Wil83] | O(nw) | O(1) | O(log w) | O(#probes) |
| x-fast trie on splitters poly(w) apart | Folklore (anon. reviewer) | O(n/poly(w)) | O(log w) | O(log w) | O(#probes) |
| Beame & Fich | [PF99] | Θ(n^{1+ε} w) | O(1) | O(log w / log log w) | O(#probes) |
| Fusion Trees | [FW93] | O(nw^{4/5}) | O(1) | O(log n / log w) | O(#probes) |
| z-fast trie | [BBPV09, BBPV08, BBPV10] | O(n log w) | exp. O(1), w.c. O(log n) | exp. O(log w), w.c. O(log n) | O(#probes) |
| Pǎtraşcu & Thorup | [PT06] | O(nw) or O(nw^{4/5}) | O(1) | Optimal given linear space (see Figure 2.1) | O(#probes) |
| FT + γ-nodes; {x, y}-tries + γ-nodes; P & T + γ-nodes | This paper | O(n log w) | O(1) | As above + O(1) | O(#probes + log w) |

Table 2.1: Comparison of the probe and space requirements of various data structures for the successor problem. Word length is w. Adding layers of γ-nodes to Fusion Trees, {x, y}-fast tries, or any of the probe-optimal Pǎtraşcu & Thorup constructions reduces the index space requirement to O(n log w) bits while preserving the probe complexity, and increases the number of operations by O(log w).

2.1.2 Fusion Trees

Fusion Trees [FW93] are successor data structures that answer successor queries in time O(log n/ log w). A fusion node, used in a Fusion tree, occupies one w-bit word and can be used to index w^{1/5} keys in O(1) time. Like blind tries, the search is based on extracting and comparing important bits rather than the entire query. In our terminology (to be defined in Section 3.3), Fusion trees apply a (w^{1/5}, w^{4/5}) bit-extractor to every key, producing a sketch of length w^{4/5} that contains the w^{1/5} important bits of the key. The sketches of many keys (i.e., w/w^{4/5} = w^{1/5} keys) are packed into a single w-bit word; this forms a fusion node. The key observation is that multiple parallel operations can be performed on all these w^{1/5} sketches by a single word operation. Omitting lots of (critical) detail, this allows Fusion trees to search amongst these w^{1/5} keys in O(1) time, resulting in the O(log n/ log w) bound above. Note that the shorter the sketch, the more sketches can be packed into a word, the greater the parallelism, and the greater the resulting speedup. A word-parallel comparison of packed sketches is sketched below.

We remark that Andersson, Miltersen, and Thorup [AMT99] give an AC(0) implementation of Fusion trees, i.e., they use special purpose hardware to implement a (k, k) bit-extractor (that produces a sketch of length k containing k bits of the key). Ignoring other difficulties, computing a [perfect] sketch in AC(0) is easy: just route wires connecting the source bits to the target bits. With this interpretation, our bit-extractor is an O(log w)-time software implementation of the special purpose hardware of [AMT99].
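To make "multiple parallel operations on packed sketches" concrete, here is a minimal sketch of the classic fusion-node comparison trick (ours, not the thesis's; names like pack_sketches are hypothetical, and a real fusion node involves further machinery, e.g., sketch approximation). Each s-bit sketch sits behind a guard 1-bit; subtracting the replicated query clears exactly the guards of smaller sketches.

```python
def pack_sketches(sketches, s):
    """Pack sketches of s bits each into one word, each field prefixed
    by a guard 1-bit, so field i holds the value 2^s + sketch_i."""
    word = 0
    for i, sk in enumerate(sketches):
        word |= ((1 << s) | sk) << (i * (s + 1))
    return word

def parallel_rank(word, q, n, s):
    """Number of packed sketches strictly below the query sketch q,
    via one word-wide subtraction: since 0 <= q < 2^s, each field
    2^s + sketch_i - q stays non-negative, and its guard bit survives
    exactly when sketch_i >= q."""
    rep = 0
    guard = 0
    for i in range(n):                      # one multiplication in practice
        rep |= q << (i * (s + 1))
        guard |= 1 << (i * (s + 1) + s)
    alive = (word - rep) & guard            # guards of sketches >= q
    return n - bin(alive).count("1")
```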

Figure 2.1: The Pǎtraşcu-Thorup lower bounds, opt(n, w), where w is fixed and n ranges over all possible values. The formulas are correct up to (absolute) constants, and also match the Pǎtraşcu-Thorup upper bounds. To make this figure illustrative, an absurdly long word size was chosen.

2.2 Implicit and Succinct Data Structures

Space efficient data structures have been extensively studied. Implicit data structures are data structures that require no extra space other than the space required for the data itself (see e.g. [FM06]). Succinct data structures (see e.g. [RRS07]) do make use of extra space, but no more than

o(1) of the information theoretic lower bound required to represent the data.

Figure 2.2: A Compact Trie and a Blind Trie. Testing bits 1, 3, 5, 6, and 7 of a query will produce a candidate successor. As not all bits have been considered, this requires verification.

A motivation for small extra space is to devise data structures where most of the memory probes are to a small set of addresses, the so-called working set. When this set fits in fast memory, overall performance is improved.

2.3 I/O Efficient Data Structures

I/O efficient data structures [Vit06] seek to minimize the number of data transfers between different levels of memory. Cache oblivious data structures [Bro04] are designed to achieve the same goal even when the architecture of the memory hierarchy is unknown.

2.4 Reducing Probe Complexity: New Motivation from Multicore Architectures

Multicore architectures pose new performance challenges [Dre07, Sha11]. When multiple cores access shared cache/memory, contention arises. Such contention is deviously problematic because it may cause serialization of memory accesses, making a mockery of multicore parallelism. Multiple memory banks and other hardware are attempts to deal with such problems, to various degrees. Thus, the goals of reducing the index size, the number of probes extraneous to the index, and the

number of probes within the index, become even more critical in the multicore setting.

Chapter 3

Our Results

3.1 A Succinct, Probe-Optimal Data-Structure for Successor Search

In this paper we give data structures for successor queries that greatly improve upon the index size of previous designs, are optimal with respect to the total number of probes, and perform O(1) probes outside the index (one extraneous probe is needed simply so as to access the successor). The total number of operations required is bounded by the number of probes plus an extra O(log w) (inter-register) operations.

Fundamental to our construction are high outdegree single word indices, used to reduce space requirements and probes. We give three such variants: α nodes, β nodes, and γ nodes. Each of these indexes w/ log w keys and answers successor queries using only O(1) w-bit words, O(log w) time, and O(1) extra-index probes (in expectation for α and β nodes, worst case for γ nodes).

Combining such α, β, or γ nodes with space efficient data structures such as Fusion Trees to index the buckets gives us successor queries requiring only O(log n/ log w) probes and a much reduced index size, while paying a minor increase in the number of inter-register operations (O(log w)); see the last row of Table 2.1.

If the top levels consist of a y-fast-trie data structure, we get an improvement upon the recently introduced [probabilistic] z-fast-trie [BBPV09, BBPV08]; see Chapter 7 (we omit the probabilistic prefix hereinafter). The worst-case query time and probe count improve from O(log n) to O(1) by using γ-nodes within the last-level buckets, and the query becomes deterministic.

If the top levels consist of a Pǎtraşcu-Thorup data structure, we get a probe-optimal successor

data-structure, using an index of O(n log w) bits. The number of probes external to the index is O(1).

3.2 α, β, γ nodes

The α node is simply a z-fast trie applied to w/ log w keys. Given the small number of keys, the z-fast trie can be simplified. A major component of the z-fast trie involves translating a prefix match (between a query and a trie node) to the query rank. As there are only w/ log w keys involved, we can discard this part of the z-fast trie and store ranks explicitly in O(1) words. One drawback of the original z-fast trie was that the worst case number of probes, external to the index, was O(log n); with only w/ log w keys, when using α nodes the worst case number of probes drops to O(log w).

Based on a different set of ideas, β nodes are arguably simpler than the z-fast trie, and have the same performance as the α nodes. Our last variant, γ nodes, has the advantages that it is deterministic, gives worst case O(1) non-index probes, and, furthermore, requires no multiplication.

We introduce highly efficient bit-extractors and use them to build γ-nodes. Essentially, a bit-extractor selects a multiset of bits from a binary input string and outputs a rearrangement of these bits within a shorter output string.

3.3 Introducing Bit-Extractors and Building a (k, k)-Bit Extractor

A (k, L) bit-extractor, 1 ≤ k ≤ L ≤ w, consists of a preprocessing phase and a query phase (see Figure 3.1):

The preprocessing phase: The input is a sequence of length k (with repetitions), I = I[1], I[2], ..., I[k], where 0 ≤ I[j] ≤ w − 1 for all j = 1, ..., k. Given I, we compute the following:

- A sequence of k strictly increasing indices, 0 ≤ j_1 < j_2 < ... < j_k ≤ L − 1, and,

- An O(1) word data structure, D(I).

Figure 3.1: A (k, L) bit-extractor, 1 ≤ k ≤ L ≤ w. Given a sequence I of k indices in the range [0, w − 1], we preprocess I into an O(1) word data-structure D(I), such that given queries x ∈ {0, 1}^w we can extract the bits of x in positions I[1], I[2], ..., I[k] to positions 0 ≤ j_1 < j_2 < ... < j_k ≤ L − 1, with zeros between them. Note that the values I[l], 1 ≤ l ≤ k, can include repetitions and need not be in any order; however, the j_l sequence is an ascending subsequence of 0, 1, ..., L − 1.

The query phase: given an input word x, and using D(I), produce an output word y such that y_{j_l} = x_{I[l]} for 1 ≤ l ≤ k, and y_m = 0 for all m ∈ {0, ..., w − 1} \ {j_l}_{l=1}^{k}.

One technically difficult main result is a construction of (k, k) bit-extractors for all 1 ≤ k ≤ w/ log w (Chapter 5). The bit extractor query time is O(log w), while the probe complexity and space are constant.

With respect to upper bounds, Brodnik, Miltersen, and Munro [BMM97] give several bit manipulation primitives, similar to some of the components we use for bit-extraction, but the setting of [BMM97] is somewhat different, and there is no attempt to optimize criteria such as memory probes and index size. The use of Benes networks to represent permutations also appears in Munro et al. [MRRR03].
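The query-phase contract is easy to state as straight-line code. The following naive reference (ours, for illustration; O(k) time rather than the O(log w) achieved in Chapter 5) pins down the (k, k) case, where j_l = l − 1:

```python
W = 64   # word length w; positions run 0..W-1 from the left, as in the text

def extract_naive(x, I):
    """Reference semantics of a (k, k) bit-extractor: bit I[l] of x moves
    to output position l (0-based); all other output positions are zero."""
    y = 0
    for l, idx in enumerate(I):
        bit = (x >> (W - 1 - idx)) & 1     # read position idx (0 = leftmost)
        y |= bit << (W - 1 - l)            # write position l
    return y
```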

Note that for (k, k)-bit-extractors it must be that j_l = l − 1, 1 ≤ l ≤ k, independently of I. Bit-extractors are implicit in Fusion Trees and lie at the core of the data structure. Figure 3.2 compares the Fusion tree bit-extractor with our construction.

| | k | L | D(I) in words | # Operations | Multiplication? |
| Fusion tree extractor | 1 ≤ k ≤ w^{1/5} | k^4 | O(1) | O(1) | Yes |
| Our extractor | 1 ≤ k ≤ w/ log w | k | O(1) | O(log w) | No |

Figure 3.2: The bit-extractor used for Fusion Trees in [FW93] vs. our bit-extractor.

3.4 Optimality Of Results: Probes, Operations, and Space

Our succinct data structures, and components thereof, meet the following lower bounds:

1. The number of probes for successor search is optimal, to within a constant factor, for any set S of n w-bit keys. The number of probes is within an (absolute) O(1) factor of the lower bounds from [PT06], specialized to the case of no more than linear memory. These bounds appear in Figure 2.1. (Other probe lower bounds for successor search appear in [BF01, PT06, PT07, AMT99, SV08].)

2. Our bit-extractors, used as a critical component of the successor data structure, are also optimal with respect to query time, when considering implementation on a practical RAM. This follows from Brodnik et al. [BMM97] (Theorem 17), who prove that in the practical RAM model (without multiplication), any (k, k)-bit-extractor with k ≥ log^{10} w requires at least Ω(log k) time per bit-extractor query. (Observe that the bit-reversal of Theorem 17 in [BMM97] is a special case of bit-extraction.)

3. Moreover, generalizing the practical RAM lower bound, Thorup [Tho03] gives a lower bound on the number of operations (not probes) for successor search in a standard AC(0) model. It is impossible to have O(1) time successor search, for any non-constant number of keys, unless one uses enormous space, 2^{Ω(w)}, where w is the number of bits per word. This means that it is impossible to derive an improved γ-node (or Fusion tree node) with O(1) time successor search in the standard AC(0) model.

4. Weak prefix search takes a query string p and, if p is a prefix of one or more keys in S, returns the rank interval of all keys with p as a prefix. If p is not a prefix of any key in S, the output is arbitrary. It is easy to modify the index of our successor data structure (with the γ-nodes), of O(n log w) bits, to derive a new data structure for weak prefix search [BBPV10, Fer11] with the same space (see Section 7.1). Belazzougui et al. [BBPV10] show that any data structure supporting weak prefix search must have size Ω(n log w) bits. Hence, our index size is optimal for this related problem.

Chapter 4

α-nodes and β-nodes

4.1 α-nodes

We refer to an α node (our term) as a z-fast trie [BBPV09] applied to w/ log w keys. Because of the small number of keys, the z-fast trie can be simplified; note that it fits into one word. The z-fast trie begins by mapping any input query into one of a linear number of strings, computed during preprocessing. This part is unchanged. Next, the z-fast trie uses monotone minimal perfect hash functions so as to compute the rank of the string, and this rank is used as an index to compute the rank of the successor to the query. This part becomes trivial in the context of α-nodes.

4.2 β-nodes

We give an alternate single word index, the β-node, which, like the α-node, is randomized. Its expected query time is O(log w). Its worst-case probe complexity is inferior to that of the γ-node, but it may be simpler to understand and implement than the α-node. The β-node does not require the use of our bit-extractors; instead, like the z-fast trie, it compares hash values.

Let S_0 be a set of n w-bit binary strings stored consecutively in ascending order in memory. A β-structure is a randomized succinct index data-structure which supports rank_{S_0}(x) queries for any x ∈ {0, 1}^w (recall that rank_S(x) is the rank of x in S). Its size is O(n(log w + log n)/w) w-bit words, the query time is O(log n) (in expectation and w.h.p.), and the number of probes it makes

outside the index is O(1) (in expectation and w.h.p.). Here w.h.p. means that the probability that a query takes O(log n) time and O(1) probes outside the index is at least 1 − 1/n. By a β-node we refer to a β-structure with n = w/ log w keys. The size of a β-node is O(1) words, the query time is O(log w) (in expectation and w.h.p.), and the number of probes outside the index is O(1) (in expectation and w.h.p.).

We start by describing a non-succinct version of the β-structure, which we refer to as a B-structure, and then we describe how to transform the non-succinct B-structure into a succinct β-structure which occupies only O(n(log w + log n)) bits.

4.2.1 The Prefix Partitioning Lemma

Let us start by defining a prefix-partition operator. Let S ⊆ {0, 1}* be an arbitrary set of binary strings, and let p ∈ {0, 1}* be a binary string. We partition S into two subsets, S_p and S_p̄: S_p is the set of all the elements of S which start with p, and S_p̄ = S − S_p is the set of all the elements of S which don't start with p. The following lemma proves that there exists a prefix p which partitions S into two approximately equal subsets.

Lemma 4.2.1. For every set S of binary strings, |S| ≥ 2, there exists a binary string p ∈ {0, 1}* such that (1/3)|S| ≤ |S_p| ≤ (2/3)|S| and (1/3)|S| ≤ |S_p̄| ≤ (2/3)|S|.

Proof. Let initially p = ε be the empty string. While |S_p| > (2/3)|S|: if |S_{p0}| ≥ |S_{p1}|, add a 0-bit to the end of p, otherwise add a 1-bit to the end of p. We stop the loop at the first time that |S_p| ≤ (2/3)|S|, and it is easy to verify that at that point we also have that |S_p| ≥ (1/3)|S|.
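The proof of Lemma 4.2.1 is directly executable. A minimal sketch (ours), assuming S is a set of distinct, equal-length bit strings, so no element equals a proper prefix p:

```python
def partition_prefix(S):
    """Greedy prefix from the proof of Lemma 4.2.1: extend p bit by bit,
    always into the more popular subtree, until at most 2/3 of S starts
    with p; at that point at least 1/3 of S still does."""
    p = ""
    Sp = set(S)                          # current S_p: elements starting with p
    while 3 * len(Sp) > 2 * len(S):
        zeros = {s for s in Sp if s[len(p)] == "0"}
        p += "0" if 2 * len(zeros) >= len(Sp) else "1"   # |S_p0| >= |S_p1| ?
        Sp = zeros if p[-1] == "0" else Sp - zeros
    return p
```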

4.2.2 Construction of a B-structure

Let S_0 be the initial set of n w-bit binary strings. Assume without loss of generality that 0^w ∈ S_0; we can assume this, since if 0^w ∉ S_0 then rank_{S_0}(x) = rank_{S_0 ∪ {0^w}}(x) − 1 for every x > 0. The B-structure is a binary tree containing a prefix of the strings in S_0 in each internal node, and at most 3 keys of S_0 at each leaf.

We define the B-structure recursively for a subset S ⊆ S_0 (starting with S = S_0). Let p be a prefix of a key in S as in Lemma 4.2.1, and define p_min, p_max ∈ S to be the minimum and maximum keys, respectively, in S which start with p. Store p in the root of the B-structure of S; the right child is a B-structure of S_R = S_p ∪ {StrictPredecessor(S, p_min)} (we define here StrictPredecessor(S, p_min) to be the maximal key in S which is smaller than p_min, or p_min itself if it is the minimal key in S); the left child is a B-structure of S_L = S_p̄ ∪ {p_max}. We associate S with the root, S_R with its right child, and S_L with its left child. When |S| ≤ 3, we stop the recursion and store the (at most 3) keys of S in a leaf of the B-structure. According to Lemma 4.2.1, |S_R|, |S_L| ≤ (2/3)|S| + 1, and since |S_L| + |S_R| ≤ |S| + 2, we get that the height of the resulting tree is O(log |S_0|), and the number of nodes in the tree is O(|S_0|).

4.2.3 Querying the B-node

Let x ∈ {0, 1}^w be the query word. Start the search in the root. The root contains a prefix p; if x starts with p, continue the search in the right child, otherwise continue the search in the left child. Continue the search similarly in every internal node that we reach, until we reach a leaf. In the leaf at most 3 keys are stored; denote them by s_1 < s_2 < s_3. If s_1 ≤ x < s_2, output rank_{S_0}(s_1) (that is, the rank of s_1 in S_0). If s_2 ≤ x < s_3, output rank_{S_0}(s_2). Otherwise, x ≥ s_3, and we output rank_{S_0}(s_3). A sketch of the construction and query procedure appears below.
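A compact executable sketch of Sections 4.2.2-4.2.3 (ours; it reuses partition_prefix from the sketch above and stores plain strings rather than the succinct encoding of Section 4.2.5):

```python
import bisect

def strict_predecessor(S, key):
    # maximal element of the sorted list S smaller than key, or key itself
    # if key is minimal (the thesis's StrictPredecessor)
    i = bisect.bisect_left(S, key)
    return S[i - 1] if i > 0 else key

def build_B(S):
    """B-structure over the sorted list S; leaves hold at most 3 keys."""
    if len(S) <= 3:
        return ("leaf", list(S))
    p = partition_prefix(S)
    Sp = [s for s in S if s.startswith(p)]            # S_p, still sorted
    p_min, p_max = Sp[0], Sp[-1]
    SR = sorted(set(Sp) | {strict_predecessor(S, p_min)})
    SL = sorted({s for s in S if not s.startswith(p)} | {p_max})
    return ("node", p, build_B(SL), build_B(SR))

def query_B(root, x, S0):
    """0-based rank of Pred(S0, x) in the sorted list S0 (assumes 0^w is
    in S0, so every query has a predecessor)."""
    node = root
    while node[0] == "node":
        _, p, left, right = node
        node = right if x.startswith(p) else left
    pred = max(s for s in node[1] if s <= x)   # by Lemma 4.2.2, Pred is here
    return S0.index(pred)
```

For instance, with S0 a sorted list of distinct w-bit strings containing the all-zeros string, build_B(S0) followed by query_B(tree, x, S0) returns the rank of x's predecessor.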

4.2.4 Correctness of the B-structure

Given x ∈ {0, 1}^w we need to prove that the output of the search procedure is rank_{S_0}(x). Let y = Pred(S_0, x) be the predecessor of x in S_0. At the end of the search we reach a leaf which stores between 1 and 3 elements of S_0; we need to prove that y is one of these keys. Let (v_0, v_1, ..., v_k) be the root-to-leaf path traversed during the search. Let p_i be the prefix stored at v_i and let S_i be the subset of S_0 associated with v_i, for i = 0, 1, ..., k. We prove by induction on i that y is a member of S_i. This is correct at the root (i = 0), since y ∈ S_0 by our assumption that 0^w ∈ S_0. For the inductive step, we assume y ∈ S_i, and prove that y ∈ S_{i+1}.

Lemma 4.2.2. Let y = Pred(S_0, x). If y ∈ S_i then y ∈ S_{i+1}.

Proof. Let p_min, p_max ∈ S_i be the minimum and maximum keys in S_i, respectively, which start with p_i. If v_{i+1} is a right child, then x starts with p_i and S_{i+1} = (S_i)_{p_i} ∪ {StrictPredecessor(S_i, p_min)}. If y ∈ (S_i)_{p_i} we are done; otherwise it must be that y = StrictPredecessor(S_i, p_min), since x starts with p_i and y is its predecessor in S_i.

If v_{i+1} is a left child, then x doesn't start with p_i and S_{i+1} = (S_i)_{p̄_i} ∪ {p_max}. If x < p_max then x < p_min (because x doesn't start with p_i), and hence y doesn't start with p_i, i.e., y ∈ (S_i)_{p̄_i}. Otherwise x > p_max, and then either y = p_max or y > p_max. If y > p_max then y ∈ (S_i)_{p̄_i}. We get that in all cases y ∈ (S_i)_{p̄_i} ∪ {p_max}, as required.

4.2.5 Making it succinct: from B-structure to β-structure

We now describe the succinct variant of the B-structure, which we call a β-structure. Its index occupies O(n(log w + log n)) bits, its search time is O(log n) (w.h.p.), and the number of probes outside the index is O(1) (w.h.p.). Each node of the β-structure occupies O(log w + log n) bits, defined as follows:

Inner nodes: Replace every prefix p in the B-structure with a pair <|p|, h(p)>, where |p| is the length of the prefix p (log w bits) and h(p) is a signature of p of length 2 log n bits, computed using a universal hash function. To test whether x starts with p, check whether h(x[1..|p|]) = h(p). If so, then x starts with p with probability at least 1 − (1/n)^2; otherwise x doesn't start with p.

Leaves: In the leaves, replace the O(1) (at most 3) keys stored at each leaf with their ranks in S_0 (O(log n) bits). In the search procedure, when reaching a leaf, retrieve these keys (from the static sorted list of the keys of S_0) using their ranks, in O(1) word-accesses.

Finally, at the end of the search, assume the algorithm suggests that rank_{S_0}(x) = i. We can test whether this is correct using O(1) word-accesses, by checking that s_i ≤ x < s_{i+1} (recall the notation S_0 = {s_1 < s_2 < ... < s_n}). With probability at most 2^{−2 log n} · log n < 1/n we get an error at some node along the search path. When an error is detected, we do a binary search to find the predecessor of x among the static set of n keys; this takes O(log n) time and probes, so the binary search contributes o(1) on average. When no error occurs, the search time is O(log n) (this happens with probability at least 1 − 1/n). Hence, the query time is O(log n) (in expectation and w.h.p.), and we access only O(1) words outside the index (in expectation and w.h.p.).
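For concreteness, a toy version (ours) of the inner-node test; the multiplicative rolling hash here is only a stand-in for the universal family the thesis assumes, and SIG_BITS plays the role of 2 log n:

```python
import random

SIG_BITS = 20                        # 2 log n bits, e.g. for n = 1024 keys
_a = random.getrandbits(64) | 1      # random odd multiplier, fixed at build time

def sig(p):
    # illustrative signature of a bit-string prefix p (NOT a true universal
    # family; any family with collision probability ~ 1/n^2 works)
    h = 0
    for c in p:
        h = (h * _a + ord(c)) & ((1 << 64) - 1)
    return h >> (64 - SIG_BITS)

def maybe_starts_with(x, plen, psig):
    # inner-node test: compare the stored pair <|p|, h(p)> against x
    return sig(x[:plen]) == psig
```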

Chapter 5

Bit Extractors

In this chapter we describe both the preprocessing and extraction operations for our bit-extractors. We sketch the extraction process, which makes use of D(I), the output of the preprocessing. Recall the definitions of a (k, L)-bit extractor, I, and D(I) from Section 3.3. The output of preprocessing, D(I), consists of O(1) words and includes precomputed constants used during the extraction process. As D(I) is O(1) words, we assume that D(I) is loaded into registers at the start of the extraction process. Also, the total working memory required throughout the extraction is O(1) words, all of which we assume to reside within the set of registers.

Partition the sequence σ = 0, 1, ..., w − 1 into w/ log w blocks (consecutive, disjoint subsequences of σ), each of length log w. Let B_j denote the jth block of a word, i.e., B_j = j log w, j log w + 1, ..., (j + 1) log w − 1, for 0 ≤ j ≤ w/ log w − 1.

Given an input word x and the precomputed D(I), the extraction process goes through the seven phases sketched below. The following sections include full details and figures.

Phase 1: Packing blocks to the left. All bits of x whose index appears in I are shifted to the left within their block. Let the number of such bits in block j be b_j.

Phase 2: Sorting blocks by size. Blocks are sorted in descending order of b_j (defined in Phase 1 above). This block permutation requires a Benes network simulation; the precomputed D(I) includes O(1) words to encode this Benes network. The values (block contents) have log w bits, and there are w/ log w such values, so the number of operations required for this simulation is O(log w).

Phase 3: Dispersing bits. The goal of Phase 3 is to reorganize the word produced in Phase 2 so that each of the distinct bits whose index is in I occupies the leftmost bit of a unique block. As there may be fewer distinct indices in I than blocks, some of the blocks may be empty, and these will be the rightmost blocks. This process requires O(log w) word operations to reposition the bits.

Phase 4: Packing bits. The goal now is to move the bits, positioned by Phase 3 at the leftmost bits of the leftmost r blocks (r being the number of indices in I without repetitions), to the r leftmost positions of the word. Again, by appropriate bit manipulation, this can be done with O(log w) word operations. We remark that if r = k, i.e., if I contains no duplicate indices, then we can skip Phases 5 and 6, whose purpose is to duplicate those bits required several times in I.

Phase 5: Spacing the bits. Once again, we simulate a Benes network on the k leftmost bits. The purpose of this permutation is to space out and rearrange the bits, so that bits that appear multiple times in I are placed so that multiple copies can be made.

Phase 6: Duplicating bits. We duplicate the bits for which space was prepared during Phase 5.

Phase 7: Final positioning. The bits are all now bunched up in the k leftmost positions of a word, and every bit appears the same number of times its index appears in I; we need to run one last Benes network simulation so as to permute these k bits. This permutation gives the final outcome.

5.1 The ever changing x and I

As we process the various phases and subphases of the bit extraction, the original bits of x are permuted, duplicated, or set to zero. Phases 1-5 and 7 simply permute the bits of x; each permutation is performed in O(log w) operations. Phase 6 duplicates some of the bits in x (those bits with multiplicity > 1 in the sequence I). Each of the phases requires O(1) precomputed words throughout its execution.

Let x_0 be the original word and I_0 = I be the original sequence of indices. Moreover, let x_t be

the word x after phases 1 through t, and let I_t be a sequence of indices such that for all j = 0, ..., k − 1,

x_t[I_t[j]] = x_0[I_0[j]].

For all 0 ≤ t ≤ 5 and 0 ≤ j ≤ k − 1, the multiplicity of I_t[j] is equal to the multiplicity of I_0[j]. For t = 6, 7 and 0 ≤ j ≤ k − 1, the multiplicity of I_t[j] is one. For any 0 < t ≤ 6, imagine that I_t is obtained by changing I_{t−1} so as to reflect the bit permutation performed on x_{t−1} to get x_t. This permutation on I_{t−1} need not actually be performed; these permutations are implicitly used by the bit extraction algorithm. During Phase 6, where bits are duplicated so that the number of copies of each bit is equal to the multiplicity of the index of the bit in I_0 (or I_5), imagine that I_6 is produced from I_5 by removing multiplicities and substituting i + j − 1 for the jth appearance of index i in I_5.

Initially, all bits not appearing in I_0 are set to zero, simply by setting x ← x AND M, where M is a mask whose ith bit equals one iff i appears in I. The final output of the bit extraction, from left to right, is the word x_0[I] · 0^{w−|I|}. For brevity, we use x as a continuously changing variable throughout the description of the different phases. The sequences I_t are needed only during the preprocessing phase; the query phase requires only a constant number of precomputed words. We describe how the preprocessing phase keeps track of the various I_t sequences and the permutations applied to x, implicitly, through the description of the phases.

5.2 Phase 1: Packing blocks to the left (Figure 5.1)

We now describe the procedure for rearranging the bits of x so that for all blocks B, the bits x[B ∩ I] are assigned to the leftmost positions of x[B], preserving their order. This is done for all blocks in parallel, by the inherent parallelism of word operations. For block B, let suff_i(B) be the length-i suffix of B. Phase 1 requires log w subphases. We maintain the following invariant after subphase i, 1 ≤ i ≤ log w: the bits x[suff_i(B) ∩ I] are assigned to the leftmost positions of x[suff_i(B)], and the other bits of x[suff_i(B)] are set to zero. Bits of x whose indices are not in suff_i(B) do not change. Note

that this invariant initially holds for i = 1. At subphase i, for i = 2, ..., log w, for each block B whose ith largest index is not in I_0, we assign x[suff_{i−1}(B)] · 0 (the old suffix bits followed by a zero) to x[suff_i(B)].

Figure 5.1: An illustration of Phase 1.

Let Z_i be a word with a 1 at the ith largest index of each block, and zeros elsewhere; we need Z_i during subphase i of Phase 1. Z_1 can be constructed on the fly in a register, in time O(log w), and Z_i is simply a left shift of Z_1 by i − 1. Let L_i = M AND Z_i, and let S_i = (Z_i − L_i) − ((Z_i − L_i) ≫ (i − 1)); that is, S_i selects, within every block whose ith largest index is not in I, the i − 1 positions of suff_{i−1}(B). See Figure 5.2.

Figure 5.2: The masks used during Phase 1.

The ith subphase is as follows: we compute y_1 = x AND S_i, which gives the bits that have to be shifted left by one position, and we compute y_2 = x AND ¬S_i, which gives a word containing the bits that are to remain in their positions. Finally, we set x = (y_1 ≪ 1) OR y_2. See Figure 5.3.

Figure 5.3: A subphase of Phase 1.
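Putting the subphases together, Phase 1 fits in a few lines. The sketch below is ours: it adopts the mask reconstruction above, mirrors the bit order (bit 0 is the least significant bit, so the text's "left" is the high end of each block), and rounds w down to a multiple of the block length for simplicity:

```python
W, LOGW = 60, 6        # block length log w = 6; 10 blocks of 6 bits

def phase1_pack(x, member):
    """Phase 1: within every block, pack the bits whose positions appear
    in I (marked by `member`) toward the block's high end, in order."""
    Z1 = 0
    for b in range(W // LOGW):
        Z1 |= 1 << (b * LOGW)                # lowest position of every block
    x &= member                              # zero all bits with index not in I
    for i in range(2, LOGW + 1):
        Zi = Z1 << (i - 1)                   # i-th lowest position of every block
        need = Zi & ~member                  # blocks where that position is not in I
        Si = need - (need >> (i - 1))        # their i-1 lower positions (the suffix)
        x = ((x & Si) << 1) | (x & ~Si)      # shift the packed suffix up by one
    return x
```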

5.3 Phase 2: Sorting blocks by size (Figure 5.4)

We permute x[B_0], ..., x[B_{w/ log w − 1}] such that they are in non-increasing order of |B_j ∩ I_1|. Note that we know this permutation when preprocessing I_1. We implement this step using a simulation of a Benes network (described in Section 5.9). This simulation requires O(log w) operations, and uses O(1) precomputed constants stored in D(I) and O(1) registers.

Figure 5.4: An illustration of Phase 2.

5.4 Phase 3: Dispersing bits (Figure 5.5)

Recall that r is the number of distinct values in I. Let σ = B_0[0], B_1[0], ..., B_{r−1}[0], i.e., σ is the sequence of the first (and smallest) index in each of the first r blocks. For any sequence π, define τ(π) = τ_1, τ_2, ..., τ_r to be the subsequence of π where π[j] is discarded if π[j′] = π[j] for some j′ < j. In Phase 3 we disperse the bits of x[I_2], so that x[σ] ← x[τ′], for some sequence τ′ produced by some permutation of the order of τ(I). The description of τ′ is implicit in the description of Phase 3 below.

Following Phase 2, we have that |B_0 ∩ I_2| ≥ |B_1 ∩ I_2| ≥ ... ≥ |B_{w/ log w − 1} ∩ I_2|. Therefore, for 1 ≤ i ≤ log w, we can define a_i = |{j : |B_j ∩ I_2| ≥ i}|. Let A_i = Σ_{l=1}^{i} a_l for i = 1, ..., log w, and define A_0 = 0. We can now define the sequence σ_i = σ[A_{i−1}], σ[A_{i−1} + 1], ..., σ[A_i − 1], for 1 ≤ i ≤ log w, and the sequence ξ_i = B_0[i − 1], B_1[i − 1], ..., B_{a_i − 1}[i − 1], for 1 ≤ i ≤ log w.

Phase 3 has log w subphases. Subphase i of Phase 3 performs the assignment x[σ_i] ← x[ξ_i]; this assignment can be implemented using O(1) operations: isolate the bits to be moved (indices ξ_i), and shift them to their new locations (indices σ_i; note that σ_i[j] − ξ_i[j] = σ_i[j′] − ξ_i[j′] for all 1 ≤ j, j′ ≤ |σ_i|, so a single shift suffices), producing a word y. Next, update x by setting x[ξ_i] to zero and taking the OR with y.

The log w values A_i are stored as part of D(I) (log^2 w bits in total). These values suffice to generate all the masks and operations required in Phase 3.

Figure 5.5: An illustration of Phase 3.

5.5 Phase 4: Packing bits (Figure 5.6)

Let σ and τ(π) be as defined at the start of Phase 3. In Phase 4 we push the bits x[τ(I_3)] to the left, i.e., x[0, 1, ..., r − 1] ← x[τ′], where τ′ is a sequence produced by some permutation of the order of τ(I_3). As in Phase 3, the description of τ′ is implicit in the description of Phase 4 below. Note that τ(I_3) is a permutation of σ.

There are log w subphases in Phase 4. Let q = ⌊r/ log w⌋. In subphases 1, ..., log w − 1 we fill x[B_0], x[B_1], ..., x[B_{q−1}] with some permutation of the first q log w bits of x[σ]. The last subphase is used to copy the remaining r − q log w < log w leftover bits of x[σ] into x[B_q[0]], x[B_q[1]], ..., x[B_q[r − q log w − 1]].

For 1 ≤ i < log w define the sequences υ_i = i, log w + i, ..., (q − 1) log w + i and ζ_i = iq log w, (iq + 1) log w, ..., ((i + 1)q − 1) log w. In subphase i of Phase 4 we perform the assignment x[υ_i] ← x[ζ_i]. To do this using word operations, we first isolate the bits of x[ζ_i], shift them so as to be in their target locations υ_i, and OR them into place.

The last subphase copies the remaining bits one by one, for a total of O(log w) operations.

Figure 5.6: An illustration of Phase 4.

5.6 Phase 5: Spacing the bits (Figure 5.7)

At the end of Phase 4, x[0, 1, ..., r − 1] is a permutation of the bits of x[τ(I_4)], and x[j] = 0 for j ≥ r. Our goal is now to space the bits so as to make room for duplication of those bits whose indices appear multiple times in I_4. In this phase we space the bits x[0, 1, ..., r − 1] by inserting j − 1 zeros between x[l] and x[l + 1] iff l appears j times in I_4. We do this by permuting the bits of x[0, 1, ..., k − 1]; there is a unique permutation that achieves this goal. This is done by simulating a Benes network, in time O(log w), using only O(1) precomputed constants and O(1) registers.

Figure 5.7: An illustration of Phase 5.

5.7 Phase 6: Duplicating bits (Figure 5.8)

For a sequence ϱ, let m_l(ϱ) be the number of occurrences of l in ϱ. At the end of Phase 5, for every l ∈ I_5 such that m_l(I_5) > 1, we have that x[l + 1], x[l + 2], ..., x[l + m_l(I_5) − 1] contain zeros, and none of the indices l + 1, l + 2, ..., l + m_l(I_5) − 1 appear in I_5.

Phase 6 consists of log w subphases, i = 1, ..., log w. Subphase i duplicates a subset of the bits of x, specified by a w/ log w bit mask M_i. All these masks are precomputed at preprocessing and stored in a single word within D(I).

We compute the masks M_i as follows. Let I_6^i be the sequence which describes the positions of the bits of I at the end of subphase i, and let I_6^0 = I_5. When a bit is copied, we split its remaining multiplicity among the two copies. Let ∆_i = 2^{log w − i}. Subphase i = 1, ..., log w duplicates those bits x[l] for which

m_l(I_6^{i−1}) > ∆_i.   (5.1)

So M_i is set to one in all positions l for which Equation (5.1) holds, and it is zero in all other positions. I_6^i is computed from I_6^{i−1} as follows: for every l that appears somewhere in I_6^{i−1}, let i_1 < i_2 < ... < i_t be the indices of all occurrences of l in I_6^{i−1}. Let I_6^i[i_j] = l for j ≤ ∆_i (unchanged from I_6^{i−1}[i_j]), and set I_6^i[i_j] = l + ∆_i otherwise. This effectively splits the multiplicity of l between l and its new copy l + ∆_i: bit l now has multiplicity ∆_i, and bit l + ∆_i has the remaining multiplicity. At query time, in subphase i we set x = x OR ((x AND M_i) ≫ ∆_i).
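At query time, Phase 6 is just the following doubling loop (our sketch, in the same mirrored LSB-0 convention as the Phase 1 sketch; the masks M_i are assumed precomputed as described):

```python
def phase6_duplicate(x, masks, LOGW=6):
    """log w duplication subphases; masks[i-1] is M_i, and each copy lands
    Delta_i = 2^(log w - i) positions toward larger indices (lower bits here)."""
    for i in range(1, LOGW + 1):
        delta = 1 << (LOGW - i)
        x |= (x & masks[i - 1]) >> delta
    return x
```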

Figure 5.8: An illustration of Phase 6.

5.8 Phase 7: Final Positioning (Figure 5.9)

At this stage we need to permute the bits x[0, 1, ..., k − 1] so as to get the final output. Note that I_6 is a permutation, and its inverse permutation, I_6^{−1}, is the permutation we need to apply to x. This too requires a simulation of a Benes network; see Section 5.9.

Figure 5.9: An illustration of Phase 7.

5.9 Permuting elements in a word by simulating a Benes network

We show how to prepare a set C of O(1) words such that, given C, a Benes network implementing a given permutation σ can be applied to a word x in O(log w) operations (shift, AND, OR). We use such networks in two contexts:

- To permute the b ≤ w/ log w leftmost bits of x. We need this in Phases 5 and 7 of bit extraction.
- To permute w/ log w blocks of bits (each block of length log w). We need this during Phase 2 of bit extraction.

Figure 5.10: One of many variants of the Benes network.

5.9.1 Overview of the Benes-Network

Assume that n is a power of 2. A Benes network B(n) of size n consists of two Benes networks of size n/2, B_u(n/2) and B_d(n/2). For 1 ≤ i ≤ n/2, inputs i and i + n/2 of B(n) can be routed to the ith input of B_u(n/2) or to the ith input of B_d(n/2). The outputs are connected similarly. For every 1 ≤ i ≤ n/2 we define inputs i and i + n/2 as mates, and analogously we define outputs i and i + n/2 as mates. Note that mates cannot both be routed to the same subnetwork. See Figure 5.10.

The looping algorithm: A Benes network can realize any permutation σ of its inputs, as follows. We start with an arbitrary input, say 1, and route it to B_u(n/2). This implies that the output

σ(1) is also routed through B_u(n/2). The mate o of σ(1) must then be routed through B_d(n/2). This implies that the input σ^{−1}(o) is routed to B_d(n/2). If the mate of σ^{−1}(o) is 1, we have completed a cycle, and we start again with an arbitrary input which we haven't routed yet. Otherwise, if the mate of σ^{−1}(o) is not 1, we route this mate to B_u(n/2) and repeat the process.

Levels of the Benes network: If we lay out the Benes network, then the first level of the recursion above gives us 2 stages, each consisting of n/2 2×2 switches: stage #1 connects the inputs to B_u(n/2) and B_d(n/2), and stage #(2 log n − 1) connects B_u(n/2) and B_d(n/2) to the outputs. Opening the recursion gives us 2 log n − 1 stages, each consisting of n/2 2×2 switches. To implement any specific permutation, one needs to set each of these switches.
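The looping algorithm is short enough to state exactly. Below is a sketch of ours (0-based indices, with perm[i] the output to which input i is routed): route_benes computes, for one level of the recursion, the subnetwork each input uses and the two half-size permutations; apply_benes recurses to permute an actual list, which is handy for sanity-checking the routing.

```python
def route_benes(perm):
    """One recursion level of the looping algorithm for a Benes network
    realizing perm (len(perm) a power of two, at least 2). Returns
    (side, up, down): side[i] is 0 if input i goes through the upper
    half-size subnetwork and 1 otherwise; up/down are the permutations
    those subnetworks must realize."""
    n = len(perm)
    h = n // 2
    inv = [0] * n
    for i, p in enumerate(perm):
        inv[p] = i
    side = [None] * n
    up, down = [None] * h, [None] * h
    for start in range(n):
        i = start
        while side[i] is None:
            side[i] = 0                               # route i through the upper half
            up[i % h] = perm[i] % h
            o_mate = perm[i] + h if perm[i] < h else perm[i] - h
            j = inv[o_mate]                           # source of the mate output...
            side[j] = 1                               # ...must use the lower half
            down[j % h] = perm[j] % h
            i = j + h if j < h else j - h             # continue from j's input mate
    return side, up, down

def apply_benes(bits, perm):
    """Permute `bits` by recursive simulation: out[perm[i]] == bits[i]."""
    n = len(bits)
    if n == 1:
        return bits[:]
    side, up, down = route_benes(perm)
    h = n // 2
    up_in, down_in = [None] * h, [None] * h
    for i, bit in enumerate(bits):
        (up_in if side[i] == 0 else down_in)[i % h] = bit
    up_out = apply_benes(up_in, up)
    down_out = apply_benes(down_in, down)
    out = [None] * n
    for i in range(n):
        src = up_out if side[i] == 0 else down_out
        out[perm[i]] = src[perm[i] % h]
    return out
```

For example, apply_benes(list("abcd"), [2, 3, 0, 1]) returns ['c', 'd', 'a', 'b'].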

5.9.2 Permuting the b = w/ log w leftmost bits of the word

We now describe an O(1) word representation for any permutation σ on b = w/ log w elements that allows us to apply σ to the b leftmost bits of a query word x while doing only O(log b) operations. We obtain this data structure by encoding the Benes network for σ in O(1) words. To answer a query, we use this encoding to apply each of the 2 log b − 1 = O(log b) stages of the Benes network for σ to the leftmost b bits of x. Every stage requires O(1) operations, giving a total of O(log b) = O(log w) operations.

Figure 5.11: Benes network control bits C_{i,j} and Dir_{i,j}.

During preprocessing we prepare two b × (2 log b − 1) binary matrices, Dir and C. Both Dir and C have 2b log b − b ≤ 2w bits, so each can fit into 2 w-bit words. The jth column of these matrices corresponds to stage j of the Benes network, and the ith row corresponds to the ith input of the stage. Pictorially, we imagine that inputs are numbered top-down. The use of C and Dir is illustrated in Figure 5.11.

Recall that the mate of input i in stage j is some other input i′ of stage j. If i′ < i we define Dir_{i,j} = 0; otherwise (i′ > i) we define Dir_{i,j} = 1. This is defined for i = 1, ..., b and j = 1, ..., 2 log b − 1. The matrix C is computed as follows: C_{i,j} = 0 if input i of stage j routes to input i of stage j + 1 (i.e., goes straight), and C_{i,j} = 1 otherwise.

We pack the b × (2 log b − 1) binary matrix C into 2 w-bit words C_1, C_2, column by column:

C_1[1, 2, ..., w] ← C_{1,1}, ..., C_{b,1}, C_{1,2}, ..., C_{b,2}, ..., C_{1,log b}, ..., C_{b,log b};
C_2[1, 2, ..., w] ← C_{1,log b+1}, ..., C_{b,log b+1}, ..., C_{1,2 log b−1}, ..., C_{b,2 log b−1}.

We pack the matrix Dir into words Dir_1 and Dir_2 analogously.

Figure 5.12: Applying stage 1 of the Benes network to x.

During query processing we apply stage i (for i = 1, ..., log b) of the Benes network by computing

x = (x AND ¬C_1[1, ..., b])
  OR ((x AND C_1[1, ..., b] AND ¬Dir_1[1, ..., b]) ≪ (b/2^i))
  OR ((x AND C_1[1, ..., b] AND Dir_1[1, ..., b]) ≫ (b/2^i)).   (5.2)

This should be parsed as follows: x AND ¬C_1[1, ..., b] gives the bits of x that are not going to change position at stage i; (x AND C_1[1, ..., b] AND ¬Dir_1[1, ..., b]) ≪ (b/2^i) takes the bits of x that are to move up at stage i and shifts them accordingly; (x AND C_1[1, ..., b] AND Dir_1[1, ..., b]) ≫ (b/2^i) takes the bits of x that are to move down at stage i and shifts them accordingly. In preparation for the next stage, we also compute C_1 ← C_1 ≪ b and Dir_1 ← Dir_1 ≪ b, bringing the control bits of the next stage to the front. Analogously, during stages i = log b + 1, log b + 2, ..., 2 log b − 1, we use the words C_2 and Dir_2 rather than C_1 and Dir_1 (with the shift amounts mirrored, since the second half of the network is symmetric to the first). An example of applying stage 1 of a Benes network of size 8 is shown in Figure 5.12.
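A one-stage sketch (ours), again in the mirrored LSB-0 convention of the earlier sketches, with the per-stage control masks already extracted; Dir = 1 is taken to mean "mate is further down", matching Equation (5.2) as reconstructed above:

```python
def apply_stage(x, C, D, b, i):
    """One Benes stage on the b low bits of x. C marks bits that switch;
    D distinguishes the two directions; the pair distance at stage i
    (of the first log b stages) is b / 2^i."""
    shift = b >> i
    stay = x & ~C
    moved_up = (x & C & ~D) << shift      # toward the word's high end ("up")
    moved_down = (x & C & D) >> shift     # toward the word's low end ("down")
    return stay | moved_up | moved_down
```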

5.9.3 Permuting the w/ log w leftmost blocks of the word

To operate the permutation on blocks of bits, we need masks that replicate the appropriate C_{i,j} and Dir_{i,j} values log w times, so that they operate upon all bits of the block simultaneously and not only on one single bit. To precompute and store such replications in advance would require O(log w) words of storage, and we allow, in total, only O(1) words of storage for the entire bit extraction. Thus, we need to compute these expansions on the fly, in O(log w) operations.

We now add all-zero columns on the left of the matrices C and Dir, so that each of them has exactly 2 log w columns. This is well defined because 2 log b − 1 ≤ 2 log w. Let these new matrices be C′ and Dir′. Let C′_R be the rightmost log w columns of C′, and let C′_L be the leftmost log w columns of C′; let Dir′_R and Dir′_L be defined analogously.

Previously, we packed the C and Dir matrices into the words (C_1, C_2) and (Dir_1, Dir_2), respectively, column by column. To perform block permutations we do so row by row, as follows: the matrix C′_L is packed into the word C_1, row by row; likewise, C′_R is packed, row by row, into C_2, Dir′_L into Dir_1, and Dir′_R into Dir_2. The C and Dir bits associated with stage j of the Benes network will thus be spaced out, log w bits apart; see Figure 5.13. For j ≤ log w these bits are in C_1 and Dir_1.

Figure 5.13: Benes network control bits C_{i,j} and Dir_{i,j}, packed row by row.

Given C_i or Dir_i, we seek to isolate and replicate the bits associated with a stage 1 ≤ j ≤ log w. We define a transformation g : {0, 1}^w × {1, ..., log w} → {0, 1}^w such that for any w-bit word z and any 1 ≤ j ≤ log w, g(z, j) is a mask such that for any block B, all bits of g(z, j)[B] are equal to z[B[j]]. We compute g(z, j) in O(1) time as follows. Let Z_j, 1 ≤ j ≤ log w, be a bit pattern with w/ log w 1's, one at the jth index of every block. The operation y_0 = Z_j AND z isolates the jth bit of every block of z. Let y_1 = y_0 ≪ (j − 1) and y_2 = y_0 ≫ (log w − j), and let y_3 = y_1 − y_2. Blocks B for which the bit z[B[j]] = 1 now have y_3[B] = 011...1; blocks B for which the bit was zero now have y_3[B] consisting only of zeros. Finally, set y_4 = y_3 OR y_1. Now all bits of y_4[B] are equal to the bit z[B[j]].

To simulate the Benes network and sort blocks rather than bits, for stages j ≤ log w we use the masks g(C_1, j) and g(Dir_1, j), analogously to our use of the masks C_1[1, ..., b] and Dir_1[1, ..., b] in Equation (5.2). For stages j > log w we use g(C_2, j − log w) and g(Dir_2, j − log w), analogously to the use of C_2[1, ..., b] and Dir_2[1, ..., b]. Given this transformation, we can simulate the Benes network in parallel on entire blocks, and thus permute the w/ log w leftmost blocks of the word in O(log w) operations.
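The transformation g is again a handful of word operations. Here is our sketch in the mirrored LSB-0 convention (blocks of LOGW bits; j = 1 addresses each block's highest bit, matching "leftmost" in the text):

```python
W, LOGW = 60, 6

def g(z, j):
    """Expand bit j of every block to fill its whole block, in O(1) word
    operations (plus the mask Z_j, which can be precomputed in D(I))."""
    Zj = 0
    for b in range(W // LOGW):
        Zj |= 1 << (b * LOGW + LOGW - j)   # j-th highest position of each block
    y0 = z & Zj
    y1 = y0 << (j - 1)                     # move to each block's high end
    y2 = y0 >> (LOGW - j)                  # move to each block's low end
    y3 = y1 - y2                           # 011...1 in blocks whose bit was set
    return y3 | y1                         # every block bit now equals bit j
```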


More information

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181. Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität

More information

Rank and Select Operations on Binary Strings (1974; Elias)

Rank and Select Operations on Binary Strings (1974; Elias) Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo

More information

T U M I N S T I T U T F Ü R I N F O R M A T I K. Binary Search on Two-Dimensional Data. Riko Jacob

T U M I N S T I T U T F Ü R I N F O R M A T I K. Binary Search on Two-Dimensional Data. Riko Jacob T U M I N S T I T U T F Ü R I N F O R M A T I K Binary Search on Two-Dimensional Data Riko Jacob ABCDE FGHIJ KLMNO TUM-I08 21 Juni 08 T E C H N I S C H E U N I V E R S I T Ä T M Ü N C H E N TUM-INFO-06-I0821-0/1.-FI

More information

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity Cell-obe oofs and Nondeterministic Cell-obe Complexity Yitong Yin Department of Computer Science, Yale University yitong.yin@yale.edu. Abstract. We study the nondeterministic cell-probe complexity of static

More information

Divide-and-Conquer Algorithms Part Two

Divide-and-Conquer Algorithms Part Two Divide-and-Conquer Algorithms Part Two Recap from Last Time Divide-and-Conquer Algorithms A divide-and-conquer algorithm is one that works as follows: (Divide) Split the input apart into multiple smaller

More information

Lecture 15 April 9, 2007

Lecture 15 April 9, 2007 6.851: Advanced Data Structures Spring 2007 Mihai Pătraşcu Lecture 15 April 9, 2007 Scribe: Ivaylo Riskov 1 Overview In the last lecture we considered the problem of finding the predecessor in its static

More information

Compute the Fourier transform on the first register to get x {0,1} n x 0.

Compute the Fourier transform on the first register to get x {0,1} n x 0. CS 94 Recursive Fourier Sampling, Simon s Algorithm /5/009 Spring 009 Lecture 3 1 Review Recall that we can write any classical circuit x f(x) as a reversible circuit R f. We can view R f as a unitary

More information

The Las-Vegas Processor Identity Problem (How and When to Be Unique)

The Las-Vegas Processor Identity Problem (How and When to Be Unique) The Las-Vegas Processor Identity Problem (How and When to Be Unique) Shay Kutten Department of Industrial Engineering The Technion kutten@ie.technion.ac.il Rafail Ostrovsky Bellcore rafail@bellcore.com

More information

1 Definition of a Turing machine

1 Definition of a Turing machine Introduction to Algorithms Notes on Turing Machines CS 4820, Spring 2017 April 10 24, 2017 1 Definition of a Turing machine Turing machines are an abstract model of computation. They provide a precise,

More information

Content. 1 Static dictionaries. 2 Integer data structures. 3 Graph data structures. 4 Dynamization and persistence. 5 Cache-oblivious structures

Content. 1 Static dictionaries. 2 Integer data structures. 3 Graph data structures. 4 Dynamization and persistence. 5 Cache-oblivious structures Data strutures II NTIN067 Jira Fin https://timl.mff.cuni.cz/ fin/ Katedra teoreticé informatiy a matematicé logiy Matematico-fyziální faulta Univerzita Karlova v Praze Summer semester 08/9 Last change

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions. CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the

More information

Strategies for Stable Merge Sorting

Strategies for Stable Merge Sorting Strategies for Stable Merge Sorting Preliminary version. Comments appreciated. Sam Buss sbuss@ucsd.edu Department of Mathematics University of California, San Diego La Jolla, CA, USA Alexander Knop aknop@ucsd.edu

More information

Lecture 5: Fusion Trees (ctd.) Johannes Fischer

Lecture 5: Fusion Trees (ctd.) Johannes Fischer Lecture 5: Fusion Trees (ctd.) Johannes Fischer Fusion Trees fusion tree pred/succ O(lg n / lg w) w.c. construction O(n) w.c. + SORT(n,w) space O(n) w.c. 2 2 Idea B-tree with branching factor b=w /5 b

More information

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication Stavros Tripakis Abstract We introduce problems of decentralized control with communication, where we explicitly

More information

Lecture 4: Divide and Conquer: van Emde Boas Trees

Lecture 4: Divide and Conquer: van Emde Boas Trees Lecture 4: Divide and Conquer: van Emde Boas Trees Series of Improved Data Structures Insert, Successor Delete Space This lecture is based on personal communication with Michael Bender, 001. Goal We want

More information

Processing Top-k Queries from Samples

Processing Top-k Queries from Samples TEL-AVIV UNIVERSITY RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES SCHOOL OF COMPUTER SCIENCE Processing Top-k Queries from Samples This thesis was submitted in partial fulfillment of the requirements

More information

Covering Linear Orders with Posets

Covering Linear Orders with Posets Covering Linear Orders with Posets Proceso L. Fernandez, Lenwood S. Heath, Naren Ramakrishnan, and John Paul C. Vergara Department of Information Systems and Computer Science, Ateneo de Manila University,

More information

Introduction to Turing Machines. Reading: Chapters 8 & 9

Introduction to Turing Machines. Reading: Chapters 8 & 9 Introduction to Turing Machines Reading: Chapters 8 & 9 1 Turing Machines (TM) Generalize the class of CFLs: Recursively Enumerable Languages Recursive Languages Context-Free Languages Regular Languages

More information

ENS Lyon Camp. Day 2. Basic group. Cartesian Tree. 26 October

ENS Lyon Camp. Day 2. Basic group. Cartesian Tree. 26 October ENS Lyon Camp. Day 2. Basic group. Cartesian Tree. 26 October Contents 1 Cartesian Tree. Definition. 1 2 Cartesian Tree. Construction 1 3 Cartesian Tree. Operations. 2 3.1 Split............................................

More information

Monday, April 29, 13. Tandem repeats

Monday, April 29, 13. Tandem repeats Tandem repeats Tandem repeats A string w is a tandem repeat (or square) if: A tandem repeat αα is primitive if and only if α is primitive A string w = α k for k = 3,4,... is (by some) called a tandem array

More information

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Binary Search Introduction Problem Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Strategy 1: Random Search Randomly select a page until the page containing

More information

Higher Cell Probe Lower Bounds for Evaluating Polynomials

Higher Cell Probe Lower Bounds for Evaluating Polynomials Higher Cell Probe Lower Bounds for Evaluating Polynomials Kasper Green Larsen MADALGO, Department of Computer Science Aarhus University Aarhus, Denmark Email: larsen@cs.au.dk Abstract In this paper, we

More information

6.854 Advanced Algorithms

6.854 Advanced Algorithms 6.854 Advanced Algorithms Homework Solutions Hashing Bashing. Solution:. O(log U ) for the first level and for each of the O(n) second level functions, giving a total of O(n log U ) 2. Suppose we are using

More information

1 Basic Combinatorics

1 Basic Combinatorics 1 Basic Combinatorics 1.1 Sets and sequences Sets. A set is an unordered collection of distinct objects. The objects are called elements of the set. We use braces to denote a set, for example, the set

More information

Improved Submatrix Maximum Queries in Monge Matrices

Improved Submatrix Maximum Queries in Monge Matrices Improved Submatrix Maximum Queries in Monge Matrices Pawe l Gawrychowski 1, Shay Mozes 2, and Oren Weimann 3 1 MPII, gawry@mpi-inf.mpg.de 2 IDC Herzliya, smozes@idc.ac.il 3 University of Haifa, oren@cs.haifa.ac.il

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

Optimal Bounds for the Predecessor Problem and Related Problems

Optimal Bounds for the Predecessor Problem and Related Problems Journal of Computer and System Sciences 65, 38 72 (2002) doi:10.1006/jcss.2002.1822 Optimal Bounds for the Predecessor Problem and Related Problems Paul Beame 1 Computer Science and Engineering, University

More information

arxiv: v2 [cs.ds] 8 Apr 2016

arxiv: v2 [cs.ds] 8 Apr 2016 Optimal Dynamic Strings Paweł Gawrychowski 1, Adam Karczmarz 1, Tomasz Kociumaka 1, Jakub Łącki 2, and Piotr Sankowski 1 1 Institute of Informatics, University of Warsaw, Poland [gawry,a.karczmarz,kociumaka,sank]@mimuw.edu.pl

More information

BRICS. Hashing, Randomness and Dictionaries. Basic Research in Computer Science BRICS DS-02-5 R. Pagh: Hashing, Randomness and Dictionaries

BRICS. Hashing, Randomness and Dictionaries. Basic Research in Computer Science BRICS DS-02-5 R. Pagh: Hashing, Randomness and Dictionaries BRICS Basic Research in Computer Science BRICS DS-02-5 R. Pagh: Hashing, Randomness and Dictionaries Hashing, Randomness and Dictionaries Rasmus Pagh BRICS Dissertation Series DS-02-5 ISSN 1396-7002 October

More information

Space-Efficient Re-Pair Compression

Space-Efficient Re-Pair Compression Space-Efficient Re-Pair Compression Philip Bille, Inge Li Gørtz, and Nicola Prezza Technical University of Denmark, DTU Compute {phbi,inge,npre}@dtu.dk Abstract Re-Pair [5] is an effective grammar-based

More information

Balls and Bins: Smaller Hash Families and Faster Evaluation

Balls and Bins: Smaller Hash Families and Faster Evaluation Balls and Bins: Smaller Hash Families and Faster Evaluation L. Elisa Celis Omer Reingold Gil Segev Udi Wieder April 22, 2011 Abstract A fundamental fact in the analysis of randomized algorithm is that

More information

Asymptotic redundancy and prolixity

Asymptotic redundancy and prolixity Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends

More information

Cache-Oblivious Hashing

Cache-Oblivious Hashing Cache-Oblivious Hashing Zhewei Wei Hong Kong University of Science & Technology Joint work with Rasmus Pagh, Ke Yi and Qin Zhang Dictionary Problem Store a subset S of the Universe U. Lookup: Does x belong

More information

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Forbidden Patterns. {vmakinen leena.salmela

Forbidden Patterns. {vmakinen leena.salmela Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

Finger Search in the Implicit Model

Finger Search in the Implicit Model Finger Search in the Implicit Model Gerth Stølting Brodal, Jesper Sindahl Nielsen, Jakob Truelsen MADALGO, Department o Computer Science, Aarhus University, Denmark. {gerth,jasn,jakobt}@madalgo.au.dk Abstract.

More information

Quick Sort Notes , Spring 2010

Quick Sort Notes , Spring 2010 Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.

More information

Longest Gapped Repeats and Palindromes

Longest Gapped Repeats and Palindromes Discrete Mathematics and Theoretical Computer Science DMTCS vol. 19:4, 2017, #4 Longest Gapped Repeats and Palindromes Marius Dumitran 1 Paweł Gawrychowski 2 Florin Manea 3 arxiv:1511.07180v4 [cs.ds] 11

More information

Assignment 5: Solutions

Assignment 5: Solutions Comp 21: Algorithms and Data Structures Assignment : Solutions 1. Heaps. (a) First we remove the minimum key 1 (which we know is located at the root of the heap). We then replace it by the key in the position

More information

Lecture 4 Thursday Sep 11, 2014

Lecture 4 Thursday Sep 11, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture 4 Thursday Sep 11, 2014 Prof. Jelani Nelson Scribe: Marco Gentili 1 Overview Today we re going to talk about: 1. linear probing (show with 5-wise independence)

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 12.1 Introduction Today we re going to do a couple more examples of dynamic programming. While

More information

MINORS OF GRAPHS OF LARGE PATH-WIDTH. A Dissertation Presented to The Academic Faculty. Thanh N. Dang

MINORS OF GRAPHS OF LARGE PATH-WIDTH. A Dissertation Presented to The Academic Faculty. Thanh N. Dang MINORS OF GRAPHS OF LARGE PATH-WIDTH A Dissertation Presented to The Academic Faculty By Thanh N. Dang In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in Algorithms, Combinatorics

More information

Asymmetric Communication Complexity and Data Structure Lower Bounds

Asymmetric Communication Complexity and Data Structure Lower Bounds Asymmetric Communication Complexity and Data Structure Lower Bounds Yuanhao Wei 11 November 2014 1 Introduction This lecture will be mostly based off of Miltersen s paper Cell Probe Complexity - a Survey

More information

A Faster Grammar-Based Self-Index

A Faster Grammar-Based Self-Index A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University

More information

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts) Introduction to Algorithms October 13, 2010 Massachusetts Institute of Technology 6.006 Fall 2010 Professors Konstantinos Daskalakis and Patrick Jaillet Quiz 1 Solutions Quiz 1 Solutions Problem 1. We

More information

An Approximation Algorithm for Constructing Error Detecting Prefix Codes

An Approximation Algorithm for Constructing Error Detecting Prefix Codes An Approximation Algorithm for Constructing Error Detecting Prefix Codes Artur Alves Pessoa artur@producao.uff.br Production Engineering Department Universidade Federal Fluminense, Brazil September 2,

More information

Balls and Bins: Smaller Hash Families and Faster Evaluation

Balls and Bins: Smaller Hash Families and Faster Evaluation Balls and Bins: Smaller Hash Families and Faster Evaluation L. Elisa Celis Omer Reingold Gil Segev Udi Wieder May 1, 2013 Abstract A fundamental fact in the analysis of randomized algorithms is that when

More information

Leveraging Big Data: Lecture 13

Leveraging Big Data: Lecture 13 Leveraging Big Data: Lecture 13 http://www.cohenwang.com/edith/bigdataclass2013 Instructors: Edith Cohen Amos Fiat Haim Kaplan Tova Milo What are Linear Sketches? Linear Transformations of the input vector

More information

And, even if it is square, we may not be able to use EROs to get to the identity matrix. Consider

And, even if it is square, we may not be able to use EROs to get to the identity matrix. Consider .2. Echelon Form and Reduced Row Echelon Form In this section, we address what we are trying to achieve by doing EROs. We are trying to turn any linear system into a simpler one. But what does simpler

More information

1. Problem I calculated these out by hand. It was tedious, but kind of fun, in a a computer should really be doing this sort of way.

1. Problem I calculated these out by hand. It was tedious, but kind of fun, in a a computer should really be doing this sort of way. . Problem 5.2-. I calculated these out by hand. It was tedious, but kind of fun, in a a computer should really be doing this sort of way. 2 655 95 45 243 77 33 33 93 86 5 36 8 3 5 2 4 2 2 2 4 2 2 4 4 2

More information

General Methods for Algorithm Design

General Methods for Algorithm Design General Methods for Algorithm Design 1. Dynamic Programming Multiplication of matrices Elements of the dynamic programming Optimal triangulation of polygons Longest common subsequence 2. Greedy Methods

More information

Module 1: Analyzing the Efficiency of Algorithms

Module 1: Analyzing the Efficiency of Algorithms Module 1: Analyzing the Efficiency of Algorithms Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu What is an Algorithm?

More information

CS361 Homework #3 Solutions

CS361 Homework #3 Solutions CS6 Homework # Solutions. Suppose I have a hash table with 5 locations. I would like to know how many items I can store in it before it becomes fairly likely that I have a collision, i.e., that two items

More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Yuval Filmus April 4, 2017 Abstract The seminal complete intersection theorem of Ahlswede and Khachatrian gives the maximum cardinality of

More information

Weighted Activity Selection

Weighted Activity Selection Weighted Activity Selection Problem This problem is a generalization of the activity selection problem that we solvd with a greedy algorithm. Given a set of activities A = {[l, r ], [l, r ],..., [l n,

More information

Sorting suffixes of two-pattern strings

Sorting suffixes of two-pattern strings Sorting suffixes of two-pattern strings Frantisek Franek W. F. Smyth Algorithms Research Group Department of Computing & Software McMaster University Hamilton, Ontario Canada L8S 4L7 April 19, 2004 Abstract

More information

On the Complexity of Partitioning Graphs for Arc-Flags

On the Complexity of Partitioning Graphs for Arc-Flags Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 17, no. 3, pp. 65 99 (013) DOI: 10.7155/jgaa.0094 On the Complexity of Partitioning Graphs for Arc-Flags Reinhard Bauer Moritz Baum Ignaz

More information

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Advanced Implementations of Tables: Balanced Search Trees and Hashing Advanced Implementations of Tables: Balanced Search Trees and Hashing Balanced Search Trees Binary search tree operations such as insert, delete, retrieve, etc. depend on the length of the path to the

More information

arxiv: v1 [math.co] 10 Sep 2010

arxiv: v1 [math.co] 10 Sep 2010 Ranking and unranking trees with a given number or a given set of leaves arxiv:1009.2060v1 [math.co] 10 Sep 2010 Jeffery B. Remmel Department of Mathematics U.C.S.D., La Jolla, CA, 92093-0112 jremmel@ucsd.edu

More information

Chapter 1. Comparison-Sorting and Selecting in. Totally Monotone Matrices. totally monotone matrices can be found in [4], [5], [9],

Chapter 1. Comparison-Sorting and Selecting in. Totally Monotone Matrices. totally monotone matrices can be found in [4], [5], [9], Chapter 1 Comparison-Sorting and Selecting in Totally Monotone Matrices Noga Alon Yossi Azar y Abstract An mn matrix A is called totally monotone if for all i 1 < i 2 and j 1 < j 2, A[i 1; j 1] > A[i 1;

More information

COMP Analysis of Algorithms & Data Structures

COMP Analysis of Algorithms & Data Structures COMP 3170 - Analysis of Algorithms & Data Structures Shahin Kamali Computational Complexity CLRS 34.1-34.4 University of Manitoba COMP 3170 - Analysis of Algorithms & Data Structures 1 / 50 Polynomial

More information