Multiple Choice Tries and Distributed Hash Tables

1 Multiple Choice Tries and Distributed Hash Tables
Luc Devroye, Gábor Lugosi, Gahyun Park, and W. Szpankowski
January 3, 2007
McGill University, Montreal, Canada; U. Pompeu Fabra, Barcelona, Spain; U. Wisconsin, Whitewater, USA; Purdue University, W. Lafayette, USA

2 Outline of the Talk
1. Digital Tries and Their Applications
2. Known Results
3. Our Main Results
   - Two-Choice Trie (greedy and optimal algorithms)
   - Algorithmic Considerations
   - Multiple-Choice Trie
4. Distributed Hash Tables

3 Trie and Its Parameters
(Figure: a trie built over ten strings x_1, ..., x_10, with its parameters marked.)
F_n = fill-up level; D_n = typical depth; H_n = height.
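
The three parameters are easy to make concrete in code. Below is a minimal sketch (mine, not the authors'): it builds a standard binary trie, assuming the input strings are distinct and long enough to be separated, and reports F_n, D_n (taken here as the average leaf depth) and H_n.

```python
def build(strings, d=0):
    """Split on bit d until a node holds at most one string."""
    if len(strings) <= 1:
        return strings                                  # (possibly empty) leaf
    zero = [s for s in strings if s[d] == '0']
    one  = [s for s in strings if s[d] == '1']
    return (build(zero, d + 1), build(one, d + 1))      # internal node

def leaves(node, d=0):
    """Yield (depth, strings) for every leaf, including empty external ones."""
    if isinstance(node, list):
        yield d, node
    else:
        yield from leaves(node[0], d + 1)
        yield from leaves(node[1], d + 1)

def parameters(strings):
    ds = list(leaves(build(strings)))
    fillup = min(d for d, s in ds) - 1                  # F_n: last all-internal level
    depth  = sum(d for d, s in ds if s) / len(strings)  # D_n: average leaf depth
    height = max(d for d, s in ds if s)                 # H_n: deepest leaf
    return fillup, depth, height

print(parameters(["0001", "0010", "0100", "1000", "1100"]))  # -> (1, 2.4, 3)
```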

4 Applications of Tries
Tries are popular and efficient data structures, initially developed and analyzed by Fredkin (1960) and Knuth (1973) as an efficient method for searching and sorting digital data. Applications include:
- dynamic hashing
- conflict resolution algorithms
- leader election algorithms
- IP address lookup
- Lempel-Ziv compression schemes
- distributed hash tables (for ID management, though tries were never explicitly named).

5 Distributed Hash Tables
Figure 1: (a) IDs are randomly generated on the perimeter of a circle; either each ID owns the interval to its left in clockwise order, or the boundaries are determined by virtue of a trie or digital search tree. The objective is to make all intervals of about equal length, so that all hosts receive about equal traffic. (b) A standard trie for five strings. The correspondence between nodes and intervals in a dyadic partition of the unit interval is shown. The ID assigned to a leaf is read off the path from the root (0 for left, 1 for right). The external nodes, not normally part of the trie, are shown as well. Together, external nodes and leaves define a partition of the unit interval (shaded boxes). The fill-up level of this tree is one, while the height is four.
The balance B_n is defined as B_n = 2^(H_n − F_n + O(1)).
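
The node-to-interval correspondence of Figure 1(b) is easy to state in code. A small illustration (my own, under the convention described above, 0 = left and 1 = right): the node reached by the bit path w of length d owns the dyadic interval [0.w, 0.w + 2^(−d)) of [0, 1).

```python
def interval(path):
    """Map a root-to-node bit path, e.g. '011', to its dyadic interval."""
    d = len(path)
    left = sum(int(b) << (d - 1 - i) for i, b in enumerate(path)) / 2**d
    return left, left + 2**-d

print(interval('0'))    # (0.0, 0.5)
print(interval('011'))  # (0.375, 0.5)
```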

6 Some Known Results
Probabilistic assumption: a memoryless or Markov (mixing) source generates n binary sequences over a finite or infinite alphabet, with p_i the probability of emitting the i-th symbol.
Depth D_n (path distance from the root to a typical leaf):
D_n / log n → 1/h in probability as n → ∞,
where h = −Σ_i p_i log p_i is the entropy of the distribution. The mean, variance and the central limit theorem for D_n were first obtained by Jacquet and Régnier (1986), Pittel (1985) and W.S. (1988).
Height H_n (maximum distance between root and leaves):
H_n / log n → 2/Q in probability as n → ∞,
where Q = log(1/Σ_i p_i^2); Pittel (1985), Clément, Flajolet, Vallée (2001). The asymptotic distribution is of the extreme-value type.

7 Known Results for Generalized Tries
Height in b-tries (i.e., up to b strings may be stored in an external node):
H_n / log n → (b + 1)/log(1/Σ_i p_i^(b+1)) in probability;
Flajolet (1980), Pittel (1985), see also W.S. (2001).
Height in PATRICIA trees and digital search trees:
H_n / log n → 1/log(1/max_i p_i) = 1/h_∞ in probability (Pittel, 1985),
where h_∞ = log(1/max_i p_i). Moreover, Pittel, and Knessl & W.S., showed H_n = (1/h_∞) log n + O(√(log n)).

8 More Known Results for Tries
Fill-up F_n (the last level that has a full set of internal nodes):
F_n / log n → 1/log(1/p_min) = 1/h_{−∞} in probability,
where p_min = min_i p_i and h_{−∞} = log(1/p_min) is the Rényi entropy of infinite order (cf. W.S. (2001)). Furthermore, F_n concentrates on the two points k_n and k_n + 1, where
k_n = (1/log(1/p_min)) (log n − log log log n) + O(1),
while for symmetric sources (i.e., sources with p_1 = p_2 = 1/2)
k_n = log_2 n − log_2 log_2 n + O(1);
Pittel (1986), Devroye (1992), and Knessl and W.S. (2005).

9 Important Relationship
From Jensen's inequality and (max_i p_i)^2 ≤ Σ_i p_i^2 ≤ max_i p_i, we conclude
1/log(1/min_i p_i) ≤ 1/h ≤ 1/Q ≤ 1/log(1/max_i p_i) ≤ 2/Q,
so that the height is always at least twice as big as the typical depth of a node.
For distributed hash tables, the so-called load balancing ratio B_n = 2^(H_n − F_n) is important.

10 Two-Choice Tries
In many applications (e.g., distributed hashing) one needs to construct a well-balanced trie: height as small as possible, and as close as possible to the fill-up level.
Two-choice trie: each datum (key) comes with two strings, X_i and Y_i; that is, there are n pairs of strings (X_i, Y_i), and we may select either of the two to insert into the trie.
A greedy heuristic: choose the string which, at the time of insertion, would yield the leaf nearest to the root. Note: once the selection is made, it cannot be undone!
Main result: for the greedy heuristic,
H_n / log n → 3/(2Q) in probability as n → ∞.
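
A sketch of the greedy heuristic follows (mine; the node layout and names are illustrative assumptions, not the talk's code, and strings are assumed distinct and long enough to be separated). A node is either a dict of children or a one-element list holding a string; probe reports how deep a string would land, and the greedy rule inserts whichever of the two strings probes shallower.

```python
import random

def probe(root, s):
    """Depth of the leaf s would occupy if inserted now (trie unchanged)."""
    node, d = root, 0
    while isinstance(node, dict) and s[d] in node:
        node, d = node[s[d]], d + 1
    if isinstance(node, dict):          # s falls into an empty slot
        return d + 1
    t = node[0]                         # occupied leaf: extend shared prefix
    while s[d] == t[d]:
        d += 1
    return d + 1

def insert(root, s):
    """Insert s, splitting an occupied leaf if necessary."""
    node, d = root, 0
    while True:
        b = s[d]
        child = node.get(b)
        if child is None:               # free slot: s becomes a leaf
            node[b] = [s]
            return
        if isinstance(child, dict):     # internal node: descend
            node, d = child, d + 1
            continue
        t = child[0]                    # leaf in the way: push both down
        while True:
            node[b] = {}
            node, d = node[b], d + 1
            if s[d] != t[d]:
                node[s[d]] = [s]
                node[t[d]] = [t]
                return
            b = s[d]

def height(node, d=0):
    if isinstance(node, list):
        return d
    return max(height(c, d + 1) for c in node.values())

def greedy_trie(pairs):
    """For each datum, insert whichever of its two strings lands higher."""
    root = {}
    for x, y in pairs:
        insert(root, x if probe(root, x) <= probe(root, y) else y)
    return root

random.seed(1)
rand_bits = lambda m: ''.join(random.choice('01') for _ in range(m))
pairs = [(rand_bits(64), rand_bits(64)) for _ in range(1000)]
# For the symmetric binary source Q = log 2, so the greedy height should be
# near (3/2) * log2(1000), i.e. around 15.
print(height(greedy_trie(pairs)))
```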

11 Sketch of Proof
Theorem 1. For any t > 0,
P{H_n ≥ (3 log n + t)/(2Q)} ≤ 4e^(−t) + 2n^(−1/4) e^(−3t/4).
If p_1 = ... = p_V = 1/V (symmetric case), then
lim_{n→∞} P{H_n ≤ (3 − ε) log n/(2Q)} = 0.
Upper bound: Let C(X, Y) denote the length of the longest common prefix of X and Y, let Z_i be the string selected for the i-th datum (i.e., Z_i = X_i or Z_i = Y_i), and write P_r = Σ_i p_i^r, so that Q = log(1/P_2). Then
{H_n > d} ⊆ ⋃_{l=1}^n ⋃_{1≤i,j<l} ({C(Z_i, X_l) > d} ∩ {C(Z_j, Y_l) > d}),
hence
P(H_n > d) ≤ 4n^3 P_2^(2d) + 2n^2 P_3^d ≤ 4n^3 P_2^(2d) + 2n^2 P_2^(3d/2),
since P_3 ≤ P_2^(3/2).

12 Optimized Off-line Algorithm
Define Z_i(0) = X_i and Z_i(1) = Y_i, for (i_1,..., i_n) ∈ {0, 1}^n, and let H_n(i_1,..., i_n) be the height of the trie built over Z_1(i_1),..., Z_n(i_n). Finally, let
H_n = min_{i_1,...,i_n} H_n(i_1,..., i_n)
be the minimal height over all these 2^n tries.
Theorem 2. If max_i p_i < 1, then H_n / log n → 1/Q in probability. In particular, for fixed t,
P{H_n ≥ (log n + t)/Q} ≤ 8e^(−t).
Also, for all ε > 0,
lim_{n→∞} P{H_n ≤ (1 − ε) log n/Q} = 0.
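
For toy inputs the off-line optimum can be computed by brute force over all 2^n selections, which makes Theorem 2 easy to sanity-check against the greedy heuristic. A sketch of mine, reusing insert and height from the previous sketch (so it is not standalone), feasible only for very small n:

```python
from itertools import product

def offline_optimum(pairs):
    """Exact minimal height by exhausting all 2^n selections (toy n only)."""
    best = float('inf')
    for bits in product((0, 1), repeat=len(pairs)):
        root = {}
        for (x, y), i in zip(pairs, bits):
            insert(root, y if i else x)   # i selects X_i or Y_i
        best = min(best, height(root))
    return best

# offline_optimum(pairs[:12]) already builds 2^12 tries.
```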

13 Upper Bound Proof
1. Construct an infinite trie over the 2n strings.
2. Let T_j (1 ≤ j ≤ 2^d) denote the subtrees rooted at distance d from the root.
3. A bad datum is one whose two strings fall in the same T_j.
4. A colliding pair of data is such that, for some j ≠ k, each datum in the pair delivers one string to T_j and one string to T_k.
Define λ = Σ_i p_i^2 = P_2.
Lemma 1. (i) The probability that there exists a bad datum anywhere is at most nλ^d. (ii) The probability that there exists a colliding pair of data anywhere is at most 2n^2 λ^(2d).
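
Steps 3-4 amount to bucketing strings by their length-d prefix, i.e., by the subtree T_j they enter. A small sketch of mine (assuming binary strings of length at least d); the list edges built here is exactly the edge multiset of the multigraph G introduced on the next slide:

```python
from collections import Counter

def bad_and_colliding(pairs, d):
    """Classify data after cutting the trie at depth d: a datum's two strings
    enter the subtrees named by their length-d prefixes."""
    edges = [(x[:d], y[:d]) for x, y in pairs]           # one edge per datum
    bad = [e for e in edges if e[0] == e[1]]             # both strings in one T_j
    count = Counter(frozenset(e) for e in edges if e[0] != e[1])
    colliding = [set(e) for e, c in count.items() if c >= 2]  # repeated {T_j, T_k}
    return bad, colliding
```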

14 A Multigraph Representation
5. Construct a multigraph G(d) whose vertices represent the subtrees T_j. We connect T_j and T_l if some datum deposits one string in each of these trees.
(Figure 2: the multigraph G and an infinite trie for n = 3 pairs of strings, denoted (1, 1′), (2, 2′) and (3, 3′). Note that (2, 2′) and (3, 3′) form a colliding pair.)

15 Cycles in G
6. Consider cycles of length at least 3.
Lemma 2. The probability that G has a cycle of length ≥ 3 is at most
(4n)^3 λ^(3d) / (1 − 4nλ^d).
Sketch of proof. The probability of a given cycle of length l can be bounded by the number of possible data assignments times the probability that the l pairs of data land in the given subtrees, i.e., by 2^l (2n)^l λ^(dl) = (4n)^l λ^(dl). Hence the probability of a cycle of length ≥ 3 does not exceed
Σ_{l=3}^∞ (4n)^l λ^(dl) = (4n)^3 λ^(3d) / (1 − 4nλ^d).

16 Selection Process
7. Assume that there is (i) no bad datum, (ii) no colliding pair of data, and (iii) no cycle, so that G is a forest with no multi-edges and one can select one string for each datum.
We can assign strings as follows (see the sketch below). Choose any one of the strings in the root node's list. For every other string in the root's list, choose the companion string of the same datum (found by following the edge away from the root). This either terminates, or has an impact on one or more child trees. But in a child tree of the root we have already fixed one string (as we did for the root), so we again choose companion strings for that child's list, and so forth. This process continues until one string of each datum has been chosen for the trie.
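
One compact way to realize this selection (a variant sketch of mine, assuming G is indeed a forest with vertices indexed 0..V−1, e.g. after mapping prefixes to indices): root each component by BFS and let every datum keep the string that fell into the endpoint farther from the root. In a forest every non-root vertex is the child side of exactly one edge, so every subtree T_j receives at most one selected string.

```python
from collections import deque

def select_strings(num_vertices, data_edges):
    """data_edges[i] = (u, v): datum i put one string in T_u and one in T_v.
    Returns choice[i] = 0 to keep the u-side string, 1 for the v-side."""
    adj = [[] for _ in range(num_vertices)]
    for i, (u, v) in enumerate(data_edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    depth = [None] * num_vertices
    choice = [None] * len(data_edges)
    for r in range(num_vertices):
        if depth[r] is not None:
            continue
        depth[r] = 0                        # r roots a new component
        q = deque([r])
        while q:
            u = q.popleft()
            for v, i in adj[u]:
                if depth[v] is None:        # tree edge: v is the child side
                    depth[v] = depth[u] + 1
                    choice[i] = 1 if data_edges[i][1] == v else 0
                    q.append(v)
    return choice

# Example: three data over four subtrees; every datum keeps its child-side string.
print(select_strings(4, [(0, 1), (1, 2), (0, 3)]))   # -> [1, 1, 1]
```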

17 Finally
If the optimal height H_n exceeds d, then in the infinite trie over all 2n strings there must exist a bad datum, a colliding pair, or a cycle. Thus
P{H_n > d} ≤ P{there exists a bad datum} + P{there exists a colliding pair} + P{there exists a cycle}
≤ nλ^d + 2n^2 λ^(2d) + (4n)^3 λ^(3d) / (1 − 4nλ^d).
If we set A = nλ^d, then
P{H_n > d} ≤ 4A·I[A ≤ 1/8] + I[A > 1/8] ≤ 4A·I[A ≤ 1/8] + 8A·I[A > 1/8] ≤ 8nλ^d.
Algorithm: using a parent-pointer representation of the forest, the optimal selection can be found in O(n log n) time.

18 Multiple-Choice Tries
Consider now k strings per datum: n data, each composed of k independent strings of i.i.d. symbols drawn from a memoryless source. Let H_n(k) denote the minimal height over all tries that take one string of each datum.
Theorem 3. Assume h < ∞. For all ε > 0, there exists k large enough such that
lim_{n→∞} P{(1 − ε) log n/h ≤ H_n(k) ≤ (1 + ε) log n/h} = 1.
Observe that H_n(k) ≈ D_n: with enough choices per datum, the minimal height matches the typical depth.

19 Uniform Distribution for k = O(log n)
Consider the interval [0, 1] and let X_1,..., X_n be n independent vectors of k = c log n i.i.d. uniform [0, 1] random variables X_{i,j}, 1 ≤ i ≤ n, 1 ≤ j ≤ k, where c > 0 is a constant.
Theorem 4. Let α ∈ (0, 1/3) and c = 2/α. Then there exists a selection Z_1,..., Z_n such that the height H_n and fill-up level F_n of the trie associated with X_{1,Z_1},..., X_{n,Z_n} satisfy, for n ≥ 8,
P{H_n − F_n ≤ 2} ≥ 1 − 3/n.
Thus B_n = O(1) (an existential result). For DHTs, a greedy heuristic (an on-line algorithm) with k = O(log n) suffices to yield H_n − F_n ≤ 7 in probability.
