The space requirement of m-ary search trees: distributional asymptotics for m 27

Size: px

Start display at page:

Download "The space requirement of m-ary search trees: distributional asymptotics for m 27"

Abel Leonard
6 years ago
Views:

1 1 The space requirement of m-ary search trees: distributional asymptotics for m 7 James Allen Fill 1 and Nevin Kapur 1 Applied Mathematics and Statistics, The Johns Hopkins University, 3400 N. Charles St., Baltimore MD Computer Science, California Institute of Technology, MC 56-80, 100 E.California Blvd., Pasadena CA 9115 Abstract. We study the space requirement of m-ary search trees under the random permutation model when m 7 is fixed. Chauvin and Pouyanne have shown recently that X n, the space requirement of an m-ary search tree on n keys, equals µ(n Re [Λn λ ] + ɛ n n Re λ, where µ and λ are certain constants, Λ is a complex-valued random variable, and ɛ n 0 a.s. and in L as n. Using the contraction method, we identify the distribution of Λ. Keywords. m-ary search trees, space requirement, limiting distributions, contraction method. 1 Introduction We start by giving a brief overview of search trees, which are fundamental data structures in computer science used in searching and sorting. For integer m, the m-ary search tree, or multiway tree, generalizes the binary search tree. The quantity m is called the branching factor. According to [10], search trees of branching factors higher than were first suggested by Muntz and Uzgalis [1] to solve internal memory problems with large quantities of data. For more background we refer the reader to [7, 8] and [10]. An m-ary tree is a rooted tree with at most m children for each node (vertex, each child of a node being distinguished as one of m possible types. Recursively expressed, an m-ary tree either is empty or consists of a distinguished node (called the root together with an ordered m-tuple of subtrees, each of which is an m-ary tree. An m-ary search tree is an m-ary tree in which each node has the capacity to contain m 1 elements of some linearly ordered set, called the set of keys. In typical implementations of m-ary search trees, the keys at each node are stored in increasing order and at each node one has m pointers to the subtrees. By spreading the input data in m directions instead of only, as is the case for a binary search tree, one seeks to have shorter path lengths and thus quicker searches. We consider the space of m-ary search trees on n keys, and assume that the keys are linearly ordered. Hence, without loss of generality, we can take the set of keys to be [n] := {1,,...,n}. We construct an m-ary search tree from a sequence s of n distinct keys in the following way: (i If n < m, then all the keys are stored in the root node in increasing order. Research supported by NSF grant DMS and by The Johns Hopkins University s Acheson J. Duncan Fund for the Advancement of Research in Statistics. Research partially supported by NSF grant

2 (ii If n m, then the first m 1 keys in the sequence are stored in the root in increasing order, and the remaining n (m 1 keys are stored in the subtrees subject to the condition that if σ 1 < σ < < σ m 1 denotes the ordered sequence of keys in the root, then the keys in the jth subtree are those that lie between σ j 1 and σ j, where σ 0 := 0 and σ m := n + 1, sequenced as in s. (iii All the subtrees are m-ary search trees that satisfy conditions (i, (ii, and (iii. For example the m-ary search constructed from the sequence (10,7,1,4,1,8,5,6,9,14,11,,15,13,3 is show in Figure 1. Note that empty nodes (also called external nodes are represented as circles in the figure; m such nodes arise as children of a given node when that node becomes filled to its capacity of m 1 keys. In this paper the total number of nodes (empty and nonempty in an m-ary search tree is called the space requirement of the tree. Fig. 1. An m-ary search tree with space requirement 13. The uniform distribution on the space of permutations of [n] induces a distribution of the space of m-ary search trees with n keys. This is known as the random permutation model. Several authors have studied the limiting distribution of the space requirement under the random permutation model. Mahmoud and Pittel [11] showed that when m 15, the limiting distribution is normal. The result was later extended to include m 6 by Lew and Mahmoud [9]. Chern and Hwang [3] proved that when m 7, the space requirement centered by its mean and scaled by its standard deviation does not have a limiting distribution. Our result, stated as Theorem 1, for the case m 7 was inspired by a recent development (stated at the beginning of Section of Chauvin and Pouyanne []. Summary Let X n denote the space requirement of an m-ary search tree on n keys chosen under the random permutation model. Recently, Chauvin and Pouyanne [] have used martingale techniques to show that when m 7, we have X n = X n + n σ ɛ n, where 1 X n := H m 1 (n Re Λ], (1 [nλ

3 3 with Λ some complex-valued random variable and ɛ n 0 a.s. and in L. [In fact, they derive the asymptotics of the random vector (S n (0,...,S n (m 1, where S n (i denotes the number of nodes with i keys in a tree with n keys, but we shall be content here to study X n = m 1 i=0 S(i n.] In this representation, λ = σ + iτ is the root of the polynomial φ(z φ m (z := (z + 1 (z + m 1 m! ( having second-largest real part and positive imaginary part. It is our goal to describe the distribution of the random variable Λ. To begin, we define the following distributional transform T on M (µ, the space of probability distributions with a certain mean µ defined at (7 and finite second absolute moment: ( m T : M (µ M (µ, L(W L S λ k W k, (3 where (W k m are independent copies of W. Here S (S 1,...,S m is the vector of spacings of m 1 independent Uniform(0,1 random variables U 1,...,U m 1 ; i.e., if U (1,...,U (m 1 are their order statistics and U (0 := 0, U (m := 1, then S j := U (j U (j 1, j = 1,...,m. (4 Furthermore, we take S to be independent of (W k m. Next, define the metric d on M (µ by d (F,G := min{ X Y : L(X = F, L(Y = G}, with X := (E X 1/ denoting the L -norm. In the sequel, for notational convenience we will write d (X,Y instead of d (L(X, L(Y. Our main result is the following. (See the remark below Lemma 7 for a strengthening. Theorem 1. Let X n denote the space requirement of an m-ary search tree on n keys under the random permutation model with m 7. Define V n := X n 1 (n + 1 H m 1 and V n := Re [n λ Y ]. Here Y is a random variable with distribution equal to the unique fixed point L(Y of the distributional transform (3. Then d (V n, V n = o(n σ and consequently Λ has the same distribution as Y. The proof of Theorem 1 is presented in Section 3, with the existence of the unique fixed point established in Section 3.1 and bounds on the d -distance derived in Section 3.. n,...,s n (m 1 Remark. As discussed in [] and [6], the study of the random vector (S (0 can be recast as a generalized Pólya urn scheme which in turn can be studied by embedding into a continuous-time Markov multitype branching process. Janson [6] obtains asymptotic distributional results for a very general class of urn schemes and multitype branching processes. These include results for m-ary search trees, with (1 as a notable example. We anticipate that our contraction-method technique for identifying L(Λ in (1 will extend quite generally to oscillatory cases of Janson s results; this is the subject of ongoing research.

4 4 In the sequel we will use 1 =: λ 1,λ,...,λ m 1 to denote the m 1 roots of ( in nonincreasing order of real parts and roots with positive imaginary parts listed before their conjugates. In [10, 3.3] and [5], the polynomial ψ(λ = φ(λ 1 is considered. The properties of the roots of φ that we employ follow immediately from those known for the roots of ψ. 3 Proofs As preliminaries, note that the space requirement X n has initial conditions X 0 = X 1 = = X m = 1, and for n m 1 that the number of keys not stored in the root is n := n (m 1. It is well known that, under the random permutation model, X n satisfies the distributional recurrence X n L = m X (k J k + 1, n m 1, (5 where L = denotes equality in law (i.e., in distribution, and where, on the right, the random vector J (J 1,...,J m is uniformly distributed over all m-tuples (j 1,...,j m of nonnegative integers with j j m = n ; for each k = 1,...,m, we have X (k L j = X j ; the quantities J;X (1 0,...,X(1 n ;X( 0,...,X( n ;...;X(m 0,...,X (m n are all independent. Using (5, we get a distributional recurrence for V n, with notation as for the X s: V n L = m V (k J k, n m 1. (6 The initial conditions here are V j = 1 j+1 H m 1 for j = 0,1,...,m. The asymptotics of the mean of V n can be derived using [5, Equation (.7]: EV n = µn λ + µn λ3 + O(n Re λ4, (7 where µ is a constant. Note that no two roots of ( have the same real part unless they are mutually conjugate, so that Reλ 4 < Re λ 3 = Re λ = σ. For the reader s convenience, we state here a part of the Asymptotic Transfer Theorem of [5]. We will use this result in Section 3.. The constant K can be expressed in terms of K, but we shall have no use here for such an expression. Proposition. For fixed m, consider the recurrence a n = b n + ( m n m 1 n j=0 ( n 1 j a j, n m 1, m with specified initial conditions (a j m j=0. If b n = Kn v + o(n v with v > 1 and K a constant, then a n = K n v + o(n v where K is a constant.

5 5 3.1 Fixed point The existence and uniqueness of the fixed point of the map T at (3 follows from the contraction method (see, e.g., [13]. Indeed a routine modification of the argument presented in [5, 6] yields that T is a contraction on M (µ with contraction factor [ ] 1/ [ ] 1/ Γ(σ + 1 m! ρ = m! = < 1, Γ(σ + m (σ + m 1 (σ + 1 since for m 7, we have σ > 1/ [10, 5]. 3. d bounds We begin by defining d n := d (V n, V n and f(t := Re t = t + t. Unless otherwise noted we will henceforth assume n m 1. Throughout j will denote a sum over all m-tuples (j 1,...,j m of nonnegative integers summing to n. By the triangle inequality, d n a n + b n, (8 where, taking (Y k m to be independent copies of the random variable Y in Theorem 1 and J and S each independent of (Y k m, a n := d (V n, f(j λ k Y k (9 and b n := d ( m f(j λ k Y k, f(n λ S λ k Y k. (10 We proceed by deriving upper bounds for a n and b n separately. The bound on b n is proved as Lemma 4. For a n a crude bound can be derived as follows. Even though this bound is not sufficient to show that d n = o(n σ, it will be employed in Lemma 6, which in turn will be used to derive the estimate that we need. Lemma 3. With a n defined at (9, a n = O(n σ. Proof. By the triangle inequality, a n V n + f(j λ k Y k = V n + m f(j λ 1 Y 1. Since J 1 n and Y 1 <, we have f(j λ 1 Y 1 = O(n σ. Using independence of the V (k s, (6, and (7, we have V n = P[J = j]e V (k j k = ( 1 n j m 1 V jk + O(n σ j = ( m n m 1 n (m 1 j=0 ( n 1 j m V j + O(n σ. It follows from Theorem that V n = O(n σ, and the result follows.

6 6 To sharpen Lemma 3, we employ the following coupling between the distributions of V n and of m f(jλ k Y k. The L distance exhibited by this coupling serves as an upper bound on the d -distance. For k = 1,...,m, let (V (k 1,V (k,...;y k be independent copies of (V 1,V,...;Y such that the coupling between V j and Y is d -optimal for each j. [To construct such a coupling, first choose optimally-coupled V 1 and Y ; having chosen (V 1,...,V j ;Y, choose V j+1 so that it is optimally-coupled with Y.] Then, with J (J k m independent of everything else, Now a n V (k J k f(j λ k Y k = j P[J = j] V (k f(j λ k Y k. (11 = = V (k (k V d + f(j λ k Y k f(j λ k Y k + E 1 k l m E [ V (k 1 k l m [ V (k f(j λ k Y k ][ V (l j l f(j λ l Y l ] f(j λ k Y k ] E [ V (l j l f(j λ l Y l ] (1 If we choose the mean EY to be µ, it follows from (7 that E [ V n f(n λ Y ] = O(n Re λ4. It follows then that the second sum in (1 is O(n Re λ4 = o(n σ uniformly in j. Thus, from (11 and (1, a n E d J k + r n, (13 where r n = o(n σ. Next, we proceed to bound b n. Lemma 4. With b n defined at (10, b n = o(n σ. Proof. We take Y 1,...,Y m to be independent copies of Y and (J,S independent of Y 1,...,Y m. The conditional distribution of J given S = s (s 1,...,s m is taken to be Multinomial(n,s. Indeed this yields the distribution of the vector of sizes of

7 7 the subtrees rooted at the root of a random m-ary search tree [4]. Then b n f(j λ k Y k f(n λ S λ k Y k f(j λ k Y k f(n λ S λ k Y k [J λ k (ns k λ ]Y k (by definition of f m = Y J λ k (ns k λ (by independence = m Y J λ 1 (ns 1 λ. (by symmetry We know that Y <, and by Lemma 5 to follow the last factor above is o(n σ. Lemma 5. With σ > 1/ denoting Re λ, J λ 1 (ns 1 λ = o(n σ. Proof. Given ɛ > 0 we will show that the L -norm in question is bounded by a constant times ɛ 1/ n σ. The lemma then follows by letting ɛ 0. Observe that J λ 1 (ns 1 λ = E J λ 1 (ns 1 λ = EE [ J λ 1 (ns 1 λ S 1 ]. (14 Until further notice assume s > ɛ, and note that the conditional expectation E[ J λ 1 (ns 1 λ S 1 = s] equals P[J 1 = j S 1 = s] j λ (ns λ = n j=0 0 j n(s ɛ + n(s ɛ<j<n(s+ɛ + n(s+ɛ j n The conditional distribution of J 1 given S 1 = s is Binomial(n,s. The last sum on the right is o(1 uniformly in s since, by [7, Ex ], P[J 1 n(s + ɛ S 1 = s] P[J 1 n (s + ɛ S 1 = s] exp ( ɛ n /. For the first sum observe that, for n large enough (independently of s, [ ( P[J 1 n(s ɛ S 1 = s] P J 1 n s ɛ ] S1 = s exp( ɛ n /8, the last inequality being a consequence of the aforementioned exercise. Thus the first sum is also o(1 uniformly in s. On the other hand, for the range of summation in the middle sum, by the mean value theorem and the assumed inequality ɛ < s/ we have ( λ j s λ n ɛ λ max ζ (s ɛ,s+ɛ ζ σ 1 ɛ λ c σ s σ 1,.

8 8 where c σ is (3/ σ 1 if σ 1 and (1/ σ 1 if σ < 1. Thus j λ (ns λ = n σ ( j n λ s λ Hence the middle sum is at most ɛ λ c σs (σ 1 n σ. Note that S 1 has distribution Beta(1,m and that since σ > 1/. So Finally, ɛ 0 1 ɛ 1 0 s (σ 1 (1 s m 1 ds = ɛ λ c σs (σ 1 n σ. Γ(mΓ(σ 1 Γ(m + σ 1 < E[ J λ 1 (ns 1 λ S 1 = s]p[s 1 ds] constant ɛ n σ. E[ J λ 1 (ns 1 λ S 1 = s]p[s 1 ds] Combining (8 and (13, we get a n E (a Jk + b Jk + r n = E constant n σ P[S 1 ɛ] constant ɛn σ. a J k + E a Jk b Jk + E b J k + r n. (15 Next we bound the terms on the right-hand side, so that (15 will yield a recursive inequality. Lemma 6. E b J k = o(n σ. Proof. By linearity of expectation and symmetry, E b J k = Eb J k = meb J 1. Now, the conditional distribution of J 1 given S 1 = s is Binomial(n,s. We show that the conditional expectation E[b J 1 S 1 = s] is o(n σ. To that end, let X be distributed Binomial(n,s. For ɛ > 0, n Eb X = P[X = j]b j = +. j=0 0 j n(s ɛ n(s ɛ j n Now an argument similar to the one used in the proof of Lemma 5 can be employed. The first sum on the right is o(n σ. On the other hand, we use the fact that b n = o(n σ from Lemma 4 to conclude that the second sum is o(n σ.

9 9 Lemma 7. E a Jk b Jk = o(n σ. Proof. The proof (using the crude bound on a n established in Lemma 3 is very similar to that of Lemma 6. We omit the details. We now complete the proof of Theorem 1. Using (15 and Lemmas 7 and 6 we find a n E = ( m n m 1 a J k + g n = 1 ( n m 1 j n (m 1 j=0 ( n 1 j m a + g n = ( m n m 1 j a j + g n, a j 1 + g n where g n = o(n σ. It follows from Proposition that a n = o(n σ, so that d n a n + b n = o(n σ, as desired. Remark. The o-estimates in Lemmas 4 7 can be improved to O-estimates. In the proof of Lemma 5, choosing ɛ as a function of n (specifically, taking ɛ n to be a suitable constant multiple of n 1/ log n sharpens the estimate o(n σ to O(n σ 1 4 log n, so that b n = O(n σ 1 4 log n in Lemma 4. In turn, Lemmas 6 and 7 are then immediately strengthened to O(n σ 1 lnn and O(n σ 1 4 log n, respectively. This leads to d (V n, V n = O(n Re λ4 + O(n σ 1 8 (log n 1 4. Numerics strongly support the conjecture that σ Re λ 4 0 as m. If this is true, then d (V n, V n is O(n Re λ4 whenever m Due to the presence of r n = O(n Re λ4 in (13, this large-m rate of convergence cannot be improved by the methods of this paper and presumably is the exact rate. Finally, to prove equality in distribution of Λ and Y, we show that d (Λ,Y = 0. Indeed with Λ = Λ e iθ and Y = Y e it, we have d ( Re(n λ Λ,Re (n λ Y = d ( Re(n σ+iτ Λ e iθ,re (n σ+iτ Y e it = d (n σ Λ cos(τ lnn + Θ,n σ Y cos(τ lnn + T. But d ( Re(n λ Λ,Re (n λ Y = o(n σ so that, as n, d ( Λ cos(τ lnn + Θ, Y cos(τ lnn + T 0. For any fixed φ [0,π we can choose n such that (τ lnn mod π φ. Then Λ cos(φ + Θ and Y cos(φ + T have the same distribution. It follows from the Cramer Wold device [1, Theorem 9.4] that the random vectors ( Λ cos Θ, Λ sin Θ and ( Y cos T, Y sin T have the same distribution. In particular, Λ = Λ e iθ and Y = Y e it have the same distribution, as claimed. This completes the proof of Theorem 1.

10 10 References 1. P. Billingsley. Probability and measure. John Wiley & Sons Inc., New York, third edition, A Wiley-Interscience Publication.. B. Chauvin and N. Pouyanne. m-ary search trees when m 7: a strong asymptotics for the space requirements. Random Structures Algorithms, 4(: , H.-H. Chern and H.-K. Hwang. Phase changes in random m-ary search trees and generalized quicksort. Random Structures Algorithms, 19(3-4: , 001. Analysis of algorithms (Krynica Morska, L. Devroye. Universal limit laws for depths in random trees. SIAM J. Comput., 8(: (electronic, J. A. Fill and N. Kapur. Transfer theorems and asymptotic distributional results for m-ary search trees, arxiv:math.pr/ Submitted for publication. 6. S. Janson. Functional limit theorems for multitype branching processes and generalized Pólya urns. Stochastic Process. Appl., 110(:177 45, D. E. Knuth. The art of computer programming. Volume 1. Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont., 3rd edition, D. E. Knuth. The art of computer programming. Volume 3. Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont., nd edition, W. Lew and H. M. Mahmoud. The joint distribution of elastic buckets in multiway search trees. SIAM J. Comput., 3(5: , H. M. Mahmoud. Evolution of random search trees. John Wiley & Sons Inc., New York, 199. A Wiley-Interscience Publication. 11. H. M. Mahmoud and B. Pittel. Analysis of the space of search trees under the random insertion algorithm. J. Algorithms, 10(1:5 75, R. R. Muntz and R. C. Uzgalis. Dynamic storage allocation for binary search trees in a two-level memory. In Proceedings of Princeton Conference on Information Sciences and Systems, volume 4, pages , U. Rösler and L. Rüschendorf. The contraction method for recursive algorithms. Algorithmica, 9(1-:3 33, 001. Average-case analysis of algorithms (Princeton, NJ, 1998.

arxiv: v1 [math.pr] 21 Mar 2014

arxiv: v1 [math.pr] 21 Mar 2014 Asymptotic distribution of two-protected nodes in ternary search trees Cecilia Holmgren Svante Janson March 2, 24 arxiv:4.557v [math.pr] 2 Mar 24 Abstract We study protected nodes in m-ary search trees,