MARKING A BINARY TREE: PROBABILISTIC ANALYSIS OF A RANDOMIZED ALGORITHM

XIANG LI

Abstract. This paper centers on the analysis of a specific randomized algorithm, a basic random process that marks a binary tree, in light of concepts and techniques from probability theory. The first part of the paper is based on an assumption that significantly simplifies the problem and thus yields the expected number of time steps required by this algorithm. The essential part of the solution is a coupon-collector model that mainly makes use of the geometric probability distribution. The rest of the paper verifies the legitimacy of our assumption with a balls-into-bins model and the Poisson distribution.

Contents

1. Presenting the Problem: Marking a Binary Tree
2. Random Variables, Probability and Expectation
3. The Expectation of Total Time Steps Assuming the Existence of a Bottleneck
4. The Balls-into-bins Model and Poisson Distribution
5. Verifying the Existence of a Bottleneck
6. Conclusion
Acknowledgments
References

1. Presenting the Problem: Marking a Binary Tree

The study of a random process often involves understanding its patterns or mechanisms at a higher level, with formal mathematical proofs developed to support that understanding. A basic random process of marking a binary tree is presented below; we are particularly interested in the number of time steps required to mark the entire tree.

Consider a complete binary tree with $N = 2^n - 1$ nodes and depth $n$. For example, Figure 1 is a binary tree of depth 5. For a particular node, its parent is the node directly connected to it one level above. Its sibling is the node on the same level that shares the same parent. Its two children, if it has any at all, are the two nodes directly connected to it one level below.

Date: August 6, 2018.

[Figure 1. A Complete Binary Tree of Depth 5 and Relationships between Nodes]

Initially, all nodes are unmarked, and our ultimate goal is to mark the entire tree with the process we shall describe. First we number each node with a unique identifying number in the range $\{1, 2, \ldots, N\}$. At every time step, a number chosen uniformly at random from $\{1, 2, \ldots, N\}$ is generated and sent as a signal to mark the node with that identifying number. After the sent node is marked, an infecting process is immediately invoked:

(1) If a node and its sibling are both marked, their parent is marked.
(2) If a node and its parent are both marked, the node's sibling is marked.

The marking rules are always applied recursively as much as possible before the next node is sent. For example, in Figure 2, the marked nodes are filled in. The arrival of the signal for the node labeled X allows you to mark the remainder of the nodes by applying the marking rules to nodes 1, 2, 3 in sequence.

[Figure 2. The Infection Caused by the Arrival of the Marked Node X]

Throughout the analysis of this process, the leaf nodes, the nodes at the very bottom level, are of particular interest to us. Moreover, we will frequently treat two leaf nodes that are siblings as a pair. The number of pairs of leaf-node siblings will be denoted by $N'$, which is equal to $2^{n-2} = (N+1)/4$.
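The process above is easy to simulate. The following is a minimal Python sketch, not taken from the paper; the function name mark_tree and the heap-style array layout are our own choices. It returns both the number of time steps and the last identifier sent, which will be useful in Section 3.

```python
import random

def mark_tree(n, seed=None):
    """Simulate marking a complete binary tree of depth n.

    Nodes are stored in heap order: node 1 is the root and node p has
    children 2p and 2p+1, so nodes 2**(n-1) .. 2**n - 1 are the leaves.
    Returns (steps, last): the number of signals sent until the whole
    tree is marked, and the last identifier sent.
    """
    rng = random.Random(seed)
    N = 2**n - 1
    marked = [False] * (N + 1)            # index 0 unused

    def infect():
        # Apply the two marking rules until a fixed point is reached.
        changed = True
        while changed:
            changed = False
            for p in range(1, 2**(n - 1)):          # internal nodes
                c1, c2 = 2 * p, 2 * p + 1
                if marked[c1] and marked[c2] and not marked[p]:
                    marked[p] = True                # rule (1)
                    changed = True
                if marked[p] and marked[c1] != marked[c2]:
                    marked[c1] = marked[c2] = True  # rule (2)
                    changed = True

    steps, last = 0, None
    while not all(marked[1:]):
        steps += 1
        last = rng.randint(1, N)          # uniform signal
        marked[last] = True
        infect()
    return steps, last
```

For instance, mark_tree(5) runs the process once on the depth-5 tree of Figure 1.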

Before diving into the task, we shall review some basic concepts of probability theory.

2. Random Variables, Probability and Expectation

Definition 2.1. A random variable on a sample space $\Omega$ is a real-valued function on $\Omega$, denoted by $X : \Omega \to \mathbb{R}$.

Definition 2.2. A probability function is any function $\Pr : \mathcal{F} \to \mathbb{R}$ that satisfies the following conditions:
(1) for any event $E$, $0 \le \Pr(E) \le 1$;
(2) for the sample space $\Omega$, $\Pr(\Omega) = 1$; and
(3) for any finite or countably infinite sequence of pairwise mutually disjoint events $E_1, E_2, E_3, \ldots$,
$$\Pr\Big(\bigcup_i E_i\Big) = \sum_i \Pr(E_i).$$

Definition 2.3. The expectation of a discrete random variable $X$, denoted by $E[X]$, is given by
$$E[X] = \sum_i i \Pr(X = i),$$
where the summation is over all values in the range of $X$.

Theorem 2.4 (Linearity of expectation). For any finite collection of discrete random variables $X_1, X_2, \ldots, X_n$ with finite expectations,
$$E\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n E(X_i).$$

Proof. We can first prove the case of two random variables and derive from it the general case by induction. Let $X$ and $Y$ be two random variables. Then
$$E(X+Y) = \sum_x \sum_y (x+y)\Pr(X=x, Y=y)$$
$$= \sum_x \sum_y x\Pr(X=x, Y=y) + \sum_x \sum_y y\Pr(X=x, Y=y)$$
$$= \sum_x x \sum_y \Pr(X=x, Y=y) + \sum_y y \sum_x \Pr(X=x, Y=y)$$
$$= \sum_x x\Pr(X=x) + \sum_y y\Pr(Y=y) = E(X) + E(Y),$$
where the last equality directly follows from the definition of expectation and all the summations are over the ranges of the corresponding random variables.

Definition 2.5. A geometric random variable $X$ with parameter $p$ is given by the probability distribution
$$\Pr(X = n) = (1-p)^{n-1} p,$$
where $n$ takes on positive integer values.

Remark 2.6. It can be easily verified that the geometric probability distribution satisfies the three properties in Definition 2.2.

Now we turn to computing the expectation of a geometric random variable.

Lemma 2.7. Let $X$ be a discrete random variable that takes on only nonnegative integer values. Then
$$E(X) = \sum_{i=1}^{\infty} \Pr(X \ge i).$$

Proof.
$$\sum_{i=1}^{\infty} \Pr(X \ge i) = \sum_{i=1}^{\infty} \sum_{j=i}^{\infty} \Pr(X = j) = \sum_{j=1}^{\infty} \sum_{i=1}^{j} \Pr(X = j) = \sum_{j=1}^{\infty} j \Pr(X = j) = E(X).$$
The second equality, the exchange of the order of summation, is justified because all the terms being summed are nonnegative.

Theorem 2.8. The expectation of a geometric random variable $X$ with parameter $p$ is given by $E(X) = 1/p$.

Proof. By Definition 2.5,
$$\Pr(X \ge i) = \sum_{j=i}^{\infty} (1-p)^{j-1} p = (1-p)^{i-1}.$$
Hence, by Lemma 2.7,
$$E(X) = \sum_{i=1}^{\infty} \Pr(X \ge i) = \sum_{i=1}^{\infty} (1-p)^{i-1} = \frac{1}{p}.$$
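As a quick sanity check of Theorem 2.8 (our own illustration, not part of the paper), one can sample geometric variables and compare the empirical mean with $1/p$:

```python
import random

def geometric_sample(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while rng.random() >= p:   # failure: keep trying
        n += 1
    return n

rng = random.Random(0)
p = 0.25
samples = [geometric_sample(p, rng) for _ in range(100_000)]
print(sum(samples) / len(samples), 1 / p)   # both close to 4.0
```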

We shall proceed by making use of the definitions and results presented so far to model the random process of marking trees from Section 1 in the light of probability theory.

3. The Expectation of Total Time Steps Assuming the Existence of a Bottleneck

For those with some background in computer programming, it is easy to write a simulation program and have it print out the sequence of identifiers sent. After a few trials, one finds that the last identifier is almost always a leaf node, one of the $2^{n-1}$ nodes at the bottom level (as the check below confirms). Such behavior of the random process is no surprise if we think about what is going on at the bottom level during the marking process. In order for a leaf node to be marked, since it does not have any children to infect it, either the node itself or its sibling (the adjacent node sharing the same parent) must be sent directly as an identifier. Therefore, a necessary condition for marking the entire tree is that at least one of each pair of siblings at the leaves is sent as an identifier. In other words, we must reach every one of the $N'$ pairs of leaf nodes.

In the rest of the paper, we will first assume that with high probability the number of steps required for marking the entire tree equals the number required for marking, directly or indirectly, all the leaf nodes, and we will compute the expected number of identifiers sent under this assumption. Then we will verify the legitimacy of the assumption.
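This observation can be reproduced with the mark_tree sketch from Section 1 (again our own illustration, and it assumes that sketch is in scope): the last identifier sent is a leaf in nearly every run.

```python
# Fraction of runs whose final identifier is a leaf node.
n, runs = 5, 1000
leaf_start = 2**(n - 1)          # first leaf index in heap order
leaf_last = sum(1 for s in range(runs)
                if mark_tree(n, seed=s)[1] >= leaf_start)
print(leaf_last / runs)          # typically very close to 1
```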

Theorem 3.1 (A coupon-collector problem). The expected number of identifiers sent in order to mark at least one of each pair of siblings at the leaves is $\frac{N}{2}\ln N' + \Theta(N)$.

Proof. A close analogy of this process is the classic coupon-collector model. Our goal is to collect $N'$ distinct pairs of siblings, the coupons. At each unit of time, you send an identifier that may or may not belong to a pair of leaf nodes you have not reached yet, which is analogous to opening a box that may or may not contain a coupon you have not collected.

Let $X$ be the number of identifiers sent until at least one node of every pair of leaf siblings is marked. Our goal is to determine $E(X)$. Gaining insight from the coupon-collector model, we let $X_i$ be the number of identifiers sent while exactly $i-1$ pairs of leaf nodes have been reached. In other words, you send $X_i$ identifiers to go from reaching $i-1$ pairs to reaching $i$ pairs. It follows that
$$X = \sum_{i=1}^{N'} X_i.$$
When exactly $i-1$ pairs of leaf nodes have been reached, each time an identifier is sent, the probability of reaching a new pair is
$$p_i = \frac{2\,[N' - (i-1)]}{N} = \frac{N - 4i + 5}{2N}.$$
Therefore $\Pr(X_i = r) = (1 - p_i)^{r-1} p_i$, where $r$ is an arbitrary positive integer. By Definition 2.5, $X_i$ is a geometric random variable with parameter $p_i$. Applying Theorem 2.8, we have $E(X_i) = 1/p_i$. Then, using Theorem 2.4, the linearity of expectation, we have
$$E(X) = E\Big(\sum_{i=1}^{N'} X_i\Big) = \sum_{i=1}^{N'} E(X_i) = \sum_{i=1}^{N'} \frac{2N}{N - 4i + 5} = \frac{N}{2}\sum_{i=1}^{N'} \frac{1}{N' - i + 1} = \frac{N}{2}\sum_{i=1}^{N'} \frac{1}{i}.$$
As we will show in Lemma 3.3, the last summation is equal to $\ln N' + \Theta(1)$. Therefore, the expected number of time steps needed to mark at least one of each pair of siblings at the leaves is $\frac{N}{2}\ln N' + \Theta(N)$.

Definition 3.2 (The big-$\Theta$ notation). Let $f$ and $g$ both be real-valued functions defined on some unbounded subset of the positive real numbers. We write $f(x) = \Theta(g(x))$ as $x \to \infty$ if and only if there exist positive real numbers $M_1, M_2$ and a real number $x_0$ such that
$$M_1\, g(x) \le f(x) \le M_2\, g(x) \quad \text{for all } x \ge x_0.$$

Lemma 3.3. The following holds for all $n \in \mathbb{N}$:
$$\sum_{i=1}^{n} \frac{1}{i} = \ln n + \Theta(1).$$

Proof. From the integral definition of the natural logarithm we have both
$$\ln n = \int_1^n \frac{1}{x}\,dx \le \sum_{i=1}^{n} \frac{1}{i}$$
and
$$\sum_{i=1}^{n} \frac{1}{i} = 1 + \sum_{i=2}^{n} \frac{1}{i} \le 1 + \int_1^n \frac{1}{x}\,dx = \ln n + 1.$$
Since $\ln n \le \sum_{i=1}^n 1/i \le \ln n + 1$, by Definition 3.2 we have $\sum_{i=1}^n 1/i = \ln n + \Theta(1)$.
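The exact sum in Theorem 3.1 is easy to evaluate numerically and to compare against the $\frac{N}{2}\ln N'$ approximation (our own check, under the paper's convention $N = 2^n - 1$, $N' = (N+1)/4$):

```python
import math

for n in (5, 10, 15, 20):
    N = 2**n - 1
    Np = (N + 1) // 4                        # number of leaf pairs N'
    exact = (N / 2) * sum(1 / i for i in range(1, Np + 1))
    approx = (N / 2) * math.log(Np)
    print(n, round(exact), round(approx), round(exact - approx))
    # the gap grows linearly in N, matching the Theta(N) term
```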

Before we verify our assumption that the entire process reduces to marking the leaf nodes, a new model, the balls-into-bins model, is to be introduced, along with several relevant techniques of analysis.

4. The Balls-into-bins Model and Poisson Distribution

The balls-into-bins model refers to the process of randomly and uniformly placing $m$ balls into $n$ bins. This is similar to our task of marking the tree: $m$ signals are distributed randomly and uniformly among $N$ nodes. We are interested in the probability that a particular bin has $r$ balls. We start by counting the number of different configurations of the $r$ balls selected from a total of $m$ balls: there are $\binom{m}{r}$ ways to select the $r$ balls. For each configuration, the probability that all of the given $r$ balls fall into our bin and no other balls do is $(\frac{1}{n})^r (1-\frac{1}{n})^{m-r}$. Hence, the probability that our bin contains exactly $r$ balls is equal to
$$\binom{m}{r}\Big(\frac{1}{n}\Big)^r \Big(1-\frac{1}{n}\Big)^{m-r} = \frac{m(m-1)\cdots(m-r+1)}{r!\,n^r}\Big(1-\frac{1}{n}\Big)^{m-r} \approx \frac{(m/n)^r}{r!}\Big(1-\frac{1}{n}\Big)^{m} \approx \frac{(m/n)^r}{r!}\,e^{-m/n}.$$

Definition 4.1. A Poisson random variable $X$ with parameter $\mu$ is given by the probability distribution
$$\Pr(X = r) = \frac{\mu^r e^{-\mu}}{r!}.$$

Remark 4.2. The probability distribution of a Poisson variable satisfies the three properties in Definition 2.2. In particular, we can verify that
$$\sum_{r=0}^{\infty} \Pr(X = r) = \sum_{r=0}^{\infty} \frac{\mu^r e^{-\mu}}{r!} = e^{-\mu}\sum_{r=0}^{\infty} \frac{\mu^r}{r!} = e^{-\mu} e^{\mu} = 1,$$
using the Taylor expansion of $e^x$.

Definition 4.3. Let $X, Y$ be two random variables. $X$ and $Y$ are independent if and only if for all possible $i, j$ we have
$$\Pr((X = i) \cap (Y = j)) = \Pr(X = i) \cdot \Pr(Y = j).$$

Theorem 4.4. Let $X_1, X_2, \ldots, X_n$ be independent Poisson random variables with parameters $\mu_1, \mu_2, \ldots, \mu_n$ respectively. Then $Y = X_1 + X_2 + \cdots + X_n$ is a Poisson random variable with parameter $\mu = \mu_1 + \mu_2 + \cdots + \mu_n$.

Proof. Let $X, Y$ be two Poisson random variables with parameters $\mu_1, \mu_2$ respectively. Then
$$\Pr(X+Y = r) = \sum_{j=0}^{r} \Pr((X = j) \cap (Y = r-j)) = \sum_{j=0}^{r} \Pr(X = j)\Pr(Y = r-j)$$
$$= \sum_{j=0}^{r} \frac{\mu_1^j e^{-\mu_1}}{j!}\cdot\frac{\mu_2^{r-j} e^{-\mu_2}}{(r-j)!} = \frac{e^{-(\mu_1+\mu_2)}}{r!}\sum_{j=0}^{r}\binom{r}{j}\mu_1^j \mu_2^{r-j} = \frac{e^{-(\mu_1+\mu_2)}(\mu_1+\mu_2)^r}{r!},$$
where the last equality uses the binomial theorem. The more general case regarding the sum of $n$ variables can be proven by induction.
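The Poisson approximation of a single bin's load, derived at the start of this section, can be checked numerically (our own illustration): for $m$ balls in $n$ bins, the exact binomial probabilities are close to the Poisson probabilities with $\mu = m/n$.

```python
import math

def binom_load(m, n, r):
    """Exact probability that a given bin holds r of the m balls."""
    return math.comb(m, r) * (1 / n)**r * (1 - 1 / n)**(m - r)

def poisson_pmf(mu, r):
    return mu**r * math.exp(-mu) / math.factorial(r)

m, n = 2000, 500                    # mu = m/n = 4
for r in range(8):
    print(r, round(binom_load(m, n, r), 5), round(poisson_pmf(m / n, r), 5))
```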

5. Verifying the Existence of a Bottleneck

A detailed discussion of the Poisson approximation is left out here. Its main conclusion is that when we are concerned with events of sufficiently extreme probability, throwing $m$ balls into $n$ bins yields roughly the same distribution as assigning each bin a number of balls that is Poisson distributed with $\mu = m/n$; moreover, the loads of the bins are independent under the Poisson approximation.

Let us first pay attention to the level of depth $n-1$ of the tree. Given that at least one node of each pair of leaf siblings is marked, if all the nodes on level $n-1$ are also marked, then the entire tree is marked, because the nodes on level $n-1$ can infect the rest of the tree upwards. Hence, when at least one node of each pair of leaf siblings is marked,
$$\Pr(\text{the tree is completely marked}) = \Pr(\text{level } n-1 \text{ is completely marked}) = \big(1 - \Pr(\text{node } i \text{ on level } n-1 \text{ is not marked, for an arbitrary } i)\big)^{2^{n-2}},$$
where the last equality uses the independence granted by the Poisson approximation. Therefore, we just need to bound the probability that a particular node $i$ on level $n-1$ of the tree is not marked.

Let $m$ be the total number of signals sent and $X_i$ the number of signals sent to node $i$. According to our previous analysis of the balls-into-bins model, $X_i$ is approximately a Poisson variable with distribution
$$\Pr(X_i = r) = \frac{(m/N)^r e^{-m/N}}{r!}.$$
For node $i$, let $X_p, X_s, X_{c_1}, X_{c_2}$ be the numbers of signals sent to its parent, sibling, first child and second child respectively. Three pieces of information can be gained from an unmarked node $i$:
(1) $X_i = 0$;
(2) $X_p = 0$ or $X_s = 0$;
(3) $X_{c_1} = 0$ or $X_{c_2} = 0$.
Note that these three conditions combined are a necessary but not sufficient condition for node $i$ to be unmarked, because a zero value of $X_p$ does not necessarily make $i$'s parent unmarked. The simplification is valid here because we are only interested in an upper bound on the probability that $i$ is unmarked. Under the Poisson approximation, $X_i, X_p, X_s, X_{c_1}, X_{c_2}$ are all independent Poisson random variables, and we can therefore invoke Theorem 4.4.

Without loss of generality, let us consider the combination $Y = X_i + X_p + X_{c_1}$. By Theorem 4.4, $Y$ is again a Poisson random variable, with parameter $\mu = 3m/N$, so by Definition 4.1,
$$\Pr(Y = 0) = e^{-3m/N}.$$
Applying the union bound over the four possible combinations among $X_p, X_s$ and $X_{c_1}, X_{c_2}$, we have
$$\Pr(\text{node } i \text{ is not marked}) \le 4e^{-3m/N},$$
and hence
$$\Pr(\text{the tree is completely marked} \mid \text{every pair of leaf siblings is reached}) \ge \big(1 - 4e^{-3m/N}\big)^{2^{n-2}}.$$
When $m$ takes on the value $\frac{N}{2}\ln N'$, the leading term of our earlier result for the expected number of signals needed, we have $e^{-3m/N} = N'^{-3/2}$ and therefore
$$\Pr(\text{the tree is completely marked} \mid \text{every pair of leaf siblings is reached}) \ge \Big(1 - \frac{4}{N'^{3/2}}\Big)^{N'},$$
where $N' = (N+1)/4$. For sufficiently large $N'$, $(1 - 4/N'^{3/2})^{N'}$ is close to 1, and it tends to 1 as $N'$ approaches infinity. Therefore, we can conclude that when the total number of time steps taken is around $\frac{N}{2}\ln N'$, with high probability marking the nodes above the leaves is completed before marking each pair of leaf nodes; that is, marking the leaves is with high probability the bottleneck of the entire process.

It is worth pointing out that we are only concerned here with the situation where the total number of time steps is about $\frac{N}{2}\ln N' + \Theta(N)$ because, as we show next using the balls-into-bins model, for sufficiently large $N$ the number of signals required to reach all pairs of leaf nodes stays very close to $\frac{N}{2}\ln N'$ with high probability.

Theorem 5.1. Let $Y$ be the number of signals sent before every pair of leaf nodes is reached. Then for any constant $c$ we have
$$\lim_{N\to\infty} \Pr\Big(Y > \frac{N}{2}\ln N' + \frac{cN}{2}\Big) = 1 - e^{-e^{-c}}.$$
Empirically, this means the random variable $Y$ concentrates around its expectation $\frac{N}{2}\ln N' + \Theta(N)$. Plugging in values of $c$, we can see, for instance, that the probability that $Y > \frac{N}{2}\ln N' + 2N$ is, in the limit, less than 2%, since $1 - e^{-e^{-4}} \approx 0.018$.

Proof. Again we assume the Poisson approximation is appropriate in this case. Under the Poisson approximation, when $m = \frac{N}{2}(\ln N' + c)$ signals have been sent, we can let the number of signals sent to each pair of leaf nodes be a Poisson random variable with parameter
$$\mu = \frac{2m}{N} = \ln N' + c.$$
By Definition 4.1, for a particular pair of leaf nodes and the corresponding count $X$,
$$\Pr(X = 0) = e^{-\mu} = \frac{e^{-c}}{N'}.$$
Since the pairs are independent under the Poisson approximation, the probability that every pair of leaf nodes receives at least one signal is
$$\Big(1 - \frac{e^{-c}}{N'}\Big)^{N'} \longrightarrow e^{-e^{-c}};$$
that is, the probability that not all pairs of leaf nodes are reached when $\frac{N}{2}\ln N' + \frac{cN}{2}$ signals have been sent approaches $1 - e^{-e^{-c}}$.
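Both limits above are easy to evaluate numerically (our own check): the bottleneck bound $(1 - 4/N'^{3/2})^{N'}$ approaches 1, and the limiting tail probability $1 - e^{-e^{-c}}$ falls off quickly in $c$.

```python
import math

# Bottleneck bound from Section 5: probability the upper tree is done
# by the time every leaf pair is reached, for growing N' = (N+1)/4.
for n in (5, 10, 15, 20):
    Np = 2**(n - 2)
    print(n, (1 - 4 / Np**1.5)**Np)      # tends to 1 as n grows

# Limiting tail of Theorem 5.1: Pr(Y exceeds (N/2)ln N' by cN/2).
for c in (0, 1, 2, 4):
    print(c, 1 - math.exp(-math.exp(-c)))   # c = 4 gives about 0.018
```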

6. Conclusion

At this point, we can conclude that the expected number of time steps needed in total is equal to $\frac{N}{2}\ln N' + \Theta(N)$, and that with high probability the total number of signals required stays very close to $\frac{N}{2}\ln N'$.

Acknowledgments

I would like to thank Professor Greg Lawler for first introducing me to probability theory and stochastic processes during the REU program. It is a pleasure to thank my mentor, Kevin Casto, for recommending reading materials and giving me insight into the topic of my paper. I would also like to thank Daniil Rudenko and Peter May for organizing the REU program. Your efforts are much appreciated.

References

[1] Michael Mitzenmacher, Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
