Theorem 1.7 [Bayes' Law]: Assume that E_1, E_2, ..., E_n are mutually disjoint events in the sample space Ω such that ∪_{i=1}^n E_i = Ω. Then

Pr(E_j | B) = Pr(E_j ∩ B) / Pr(B) = Pr(B | E_j) Pr(E_j) / Σ_{i=1}^n Pr(B | E_i) Pr(E_i).

We are given three coins and are told that two of the coins are fair and the third coin is biased, landing heads with probability 2/3. We permute the coins randomly and then flip each of the coins.

MAT-72306 RandAl, Spring 2017 18-Jan-17 49

The first and second coins come up heads, and the third comes up tails. What is the probability that the first coin is the biased one? The coins are in a random order, so before observing the outcomes of the coin flips, each of the three coins is equally likely to be the biased one. Let E_i be the event that the ith coin flipped is the biased one, and let B be the event that the three coin flips came up heads, heads, and tails.
Before we flip the coins, Pr(E_i) = 1/3 for all i. The probability of the event B conditioned on E_i:

Pr(B | E_1) = Pr(B | E_2) = (2/3)(1/2)(1/2) = 1/6 and Pr(B | E_3) = (1/2)(1/2)(1/3) = 1/12.

Applying Bayes' law, we have

Pr(E_1 | B) = Pr(B | E_1) Pr(E_1) / Σ_{i=1}^3 Pr(B | E_i) Pr(E_i) = (1/18) / (1/18 + 1/18 + 1/36) = 2/5.

The three coin flips increase the likelihood that the first coin is the biased one from 1/3 to 2/5.

In the randomized matrix multiplication test, we want to evaluate the increase in confidence in the matrix identity obtained through repeated tests. In the Bayesian approach one starts with a prior model, giving some initial value to the model parameters. This model is then modified, by incorporating new observations, to obtain a posterior model that captures the new information. If we have no information about the process that generated the identity, then a reasonable prior assumption is that the identity is correct with probability 1/2.
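The coin posterior above can be checked by an exact computation. This is a quick sketch (not from the slides) using Python's fractions module; the priors and likelihoods are exactly those of the example.

```python
from fractions import Fraction

# Prior: each of the three coins is equally likely to be the biased one.
prior = [Fraction(1, 3)] * 3

# Pr(B | E_i): likelihood of observing (heads, heads, tails) given that
# coin i is the biased one (biased coin lands heads with probability 2/3).
likelihood = [
    Fraction(2, 3) * Fraction(1, 2) * Fraction(1, 2),  # coin 1 biased: 1/6
    Fraction(1, 2) * Fraction(2, 3) * Fraction(1, 2),  # coin 2 biased: 1/6
    Fraction(1, 2) * Fraction(1, 2) * Fraction(1, 3),  # coin 3 biased: 1/12
]

# Bayes' law: Pr(E_1 | B) = Pr(B|E_1)Pr(E_1) / sum_i Pr(B|E_i)Pr(E_i).
total = sum(l * p for l, p in zip(likelihood, prior))
posterior_1 = likelihood[0] * prior[0] / total
print(posterior_1)  # 2/5
```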
Let E be the event that the identity is correct, and let B be the event that the test returns that the identity is correct. We start with Pr(E) = Pr(Ē) = 1/2, and since the test has a one-sided error bounded by 1/2, we have Pr(B | E) = 1 and Pr(B | Ē) ≤ 1/2. Applying Bayes' law yields

Pr(E | B) = Pr(B | E) Pr(E) / (Pr(B | E) Pr(E) + Pr(B | Ē) Pr(Ē)) ≥ (1/2) / (1/2 + (1/2)(1/2)) = 2/3.

Assume now that we run the randomized test again and it again returns that the identity is correct. After the first test, we may have revised our prior model, so that we believe Pr(E) ≥ 2/3 and Pr(Ē) ≤ 1/3. Now let B be the event that the new test returns that the identity is correct; since the tests are independent, as before we have Pr(B | E) = 1 and Pr(B | Ē) ≤ 1/2.
Applying Bayes' law then yields

Pr(E | B) ≥ (2/3) / (2/3 + (1/3)(1/2)) = 4/5.

In general: if our prior model (before running the test) is that Pr(E) ≥ 2^i / (2^i + 1) and if the test returns that the identity is correct (event B), then

Pr(E | B) ≥ 2^{i+1} / (2^{i+1} + 1) = 1 − 1/(2^{i+1} + 1).

Thus, if all 100 calls to the matrix identity test return that it is correct, our confidence in the correctness of this identity is at least 1 − 1/(2^100 + 1).

1.4. A Randomized Min-Cut Algorithm

A cut-set in a graph is a set of edges whose removal breaks the graph into two or more connected components. Given a graph G = (V, E) with n vertices, the minimum cut or min-cut problem is to find a minimum cardinality cut-set in G. Minimum cut problems arise in many contexts, including the study of network reliability.
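The growth of confidence under repeated tests can be sketched as an exact iteration (a check of my own, not from the slides), using the worst-case false-positive rate Pr(B | Ē) = 1/2 at each step:

```python
from fractions import Fraction

def posterior_after_tests(k):
    """Confidence that the identity is correct after k tests all return
    'correct', starting from the prior Pr(E) = 1/2 and using the
    worst-case one-sided error Pr(B | not E) = 1/2 at every step."""
    p = Fraction(1, 2)  # prior Pr(E)
    for _ in range(k):
        # Bayes' law with Pr(B | E) = 1 and Pr(B | not E) = 1/2.
        p = p / (p + Fraction(1, 2) * (1 - p))
    return p

print(posterior_after_tests(1))  # 2/3
print(posterior_after_tests(2))  # 4/5
```

Each iteration maps 2^i/(2^i + 1) to 2^{i+1}/(2^{i+1} + 1), matching the general bound above.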
Minimum cuts also arise in clustering problems. For example, if nodes represent Web pages (or any documents in a hypertext-based system) and two nodes have an edge between them if the corresponding documents have a hyperlink between them, then small cuts divide the graph into clusters of documents with few links between clusters. Documents in different clusters are likely to be unrelated.

The main operation of the algorithm is edge contraction. In contracting an edge {u, v} we merge the vertices u and v into one, eliminate all edges connecting u and v, and retain all other edges in the graph. The new graph may have parallel edges but no self-loops. The algorithm consists of n − 2 iterations. Each iteration picks an edge from the existing edges in the graph and contracts that edge. Our randomized algorithm chooses the edge uniformly at random from the remaining edges.
Each iteration reduces the number of vertices by one. After n − 2 iterations, there are two vertices left. The algorithm outputs the set of edges connecting the two remaining vertices.

Any cut-set in an intermediate iteration of the algorithm is also a cut-set of the original graph. Not every cut-set of the original graph is a cut-set in an intermediate iteration, since some of its edges may have been contracted in previous iterations. As a result, the output of the algorithm is always a cut-set of the original graph but not necessarily the minimum cardinality cut-set.
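The contraction algorithm can be sketched compactly. The union-find bookkeeping, the edge-list representation, and the example graph below are implementation choices of mine, not from the slides:

```python
import random

def contract_min_cut(edges, n_vertices, rng):
    """One run of the random contraction algorithm on a multigraph.

    edges: list of (u, v) pairs on vertices 0..n_vertices-1.
    Returns the size of the cut this run outputs."""
    parent = list(range(n_vertices))  # union-find over merged vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    remaining = n_vertices
    while remaining > 2:
        # Choose an edge uniformly at random; edges that became
        # self-loops through earlier contractions are rejected, which
        # leaves a uniform choice among the remaining real edges.
        u, v = edges[rng.randrange(len(edges))]
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # contract the edge {u, v}
            remaining -= 1
    # Edges whose endpoints lie in different super-vertices form the cut.
    return sum(1 for u, v in edges if find(u) != find(v))

# Illustrative graph (an assumption for the demo): two triangles joined
# by a single bridge edge, so the min cut has size 1.
example_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rng = random.Random(0)
results = [contract_min_cut(example_edges, 6, rng) for _ in range(300)]
print(min(results))  # with 300 runs, the size-1 bridge cut is found
```

Since a single run succeeds only with probability Ω(1/n²), the algorithm is run many times and the smallest cut found is kept.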
Theorem 1.8: The algorithm outputs a min-cut set with probability at least 2/(n(n − 1)).

Proof: Let k be the size of the min-cut set of G. The graph may have several cut-sets of minimum size. We compute the probability of finding one specific such set C. Since C is a cut-set in the graph, removal of the set C partitions the set of vertices into two sets, S and V − S, such that there are no edges connecting vertices in S to vertices in V − S.

Assume that, throughout an execution of the algorithm, we contract only edges that connect two vertices in S or two vertices in V − S, but not edges in C. In that case, all the edges eliminated throughout the execution will be edges connecting vertices in S or vertices in V − S, and after n − 2 iterations the algorithm returns a graph with two vertices connected by the edges in C. We may conclude that, if the algorithm never chooses an edge of C in its n − 2 iterations, then the algorithm returns C as the minimum cut-set.
If the size of the cut is small, then the probability that the algorithm chooses an edge of C is small, at least when the number of edges remaining is large compared to the size of C. Let E_i be the event that the edge contracted in iteration i is not in C, and let F_i = ∩_{j=1}^i E_j be the event that no edge of C was contracted in the first i iterations. We need to compute Pr(F_{n−2}).

Start by computing Pr(E_1) = Pr(F_1). Since the minimum cut-set has k edges, all vertices in the graph must have degree k or larger. If each vertex is adjacent to at least k edges, then the graph must have at least nk/2 edges. Since there are at least nk/2 edges in the graph and since C has k edges, the probability that we do not choose an edge of C in the first iteration is

Pr(E_1) = Pr(F_1) ≥ 1 − k/(nk/2) = 1 − 2/n.

Suppose that the first contraction did not eliminate an edge of C; that is, we condition on the event F_1. Then, after the first iteration, we are left with an (n − 1)-node graph with minimum cut-set of size k. Again, the degree of each vertex in the graph must be at least k, and the graph must have at least k(n − 1)/2 edges.
Hence

Pr(E_2 | F_1) ≥ 1 − k/(k(n − 1)/2) = 1 − 2/(n − 1).

Similarly,

Pr(E_i | F_{i−1}) ≥ 1 − k/(k(n − i + 1)/2) = 1 − 2/(n − i + 1).

To compute Pr(F_{n−2}), we use

Pr(F_{n−2}) = Pr(E_{n−2} ∩ F_{n−3}) = Pr(E_{n−2} | F_{n−3}) · Pr(F_{n−3})
= Pr(E_{n−2} | F_{n−3}) · Pr(E_{n−3} | F_{n−4}) ··· Pr(E_2 | F_1) · Pr(F_1)
≥ ∏_{i=1}^{n−2} (1 − 2/(n − i + 1)) = ∏_{i=1}^{n−2} (n − i − 1)/(n − i + 1)
= ((n−2)/n)((n−3)/(n−1))((n−4)/(n−2)) ··· (2/4)(1/3)
= 2/(n(n − 1)). ∎

2. Discrete Random Variables and Expectation

Random Variables and Expectation
The Bernoulli and Binomial Random Variables
Conditional Expectation
The Geometric Distribution
The Expected Run-Time of Quicksort
In tossing two dice we are often interested in the sum of the dice rather than their separate values. The sample space in tossing two dice consists of 36 events of equal probability, given by the ordered pairs of numbers {(1,1), (1,2), ..., (6,6)}. If the quantity we are interested in is the sum of the two dice, then we are interested in 11 events (of unequal probability). Any such function from the sample space to the real numbers is called a random variable.

2.1. Random Variables and Expectation

Definition 2.1: A random variable (RV) X on a sample space Ω is a real-valued function on Ω; that is, X: Ω → R. A discrete random variable is a RV that takes on only a finite or countably infinite number of values.

For a discrete RV X and a real value a, the event "X = a" includes all the basic events of the sample space in which X assumes the value a. That is, "X = a" represents the set {s ∈ Ω : X(s) = a}.
We denote the probability of that event by

Pr(X = a) = Σ_{s ∈ Ω : X(s) = a} Pr(s).

If X is the RV representing the sum of the two dice, then the event X = 4 corresponds to the set of basic events {(1,3), (2,2), (3,1)}. Hence

Pr(X = 4) = 3/36 = 1/12.

Definition 2.2: Two RVs X and Y are independent if and only if

Pr((X = x) ∩ (Y = y)) = Pr(X = x) · Pr(Y = y)

for all values x and y. Similarly, RVs X_1, X_2, ..., X_k are mutually independent if and only if, for any subset I ⊆ [1, k] and any values x_i, i ∈ I,

Pr(∩_{i ∈ I} (X_i = x_i)) = ∏_{i ∈ I} Pr(X_i = x_i).
Definition 2.3: The expectation of a discrete RV X, denoted by E[X], is given by

E[X] = Σ_i i · Pr(X = i),

where the summation is over all values i in the range of X. The expectation is finite if Σ_i |i| Pr(X = i) converges; otherwise, it is unbounded.

E.g., the expectation of the RV X representing the sum of two dice is

E[X] = (1/36)·2 + (2/36)·3 + (3/36)·4 + ··· + (1/36)·12 = 7.

As an example of where the expectation of a discrete RV is unbounded, consider a RV X that takes on the value 2^i with probability 1/2^i for i = 1, 2, .... The expected value of X is

E[X] = Σ_{i=1}^∞ (1/2^i) · 2^i = Σ_{i=1}^∞ 1,

which expresses that E[X] is unbounded.
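The two dice computations above can be verified by brute-force enumeration of the 36 outcomes (a quick check of my own, not from the slides):

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Pr(X = 4): the outcomes (1,3), (2,2), (3,1).
p4 = Fraction(sum(1 for a, b in outcomes if a + b == 4), 36)
print(p4)  # 1/12

# E[X] = sum over outcomes of (a + b) * 1/36.
expectation = sum(Fraction(a + b, 36) for a, b in outcomes)
print(expectation)  # 7
```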
2.1.1. Linearity of Expectations

By this property, the expectation of the sum of RVs is equal to the sum of their expectations.

Theorem 2.1 [Linearity of Expectations]: For any finite collection of discrete RVs X_1, X_2, ..., X_n with finite expectations,

E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i].

Proof: We prove the statement for two random variables X and Y (the general case follows by induction). The summations that follow are understood to be over the ranges of the corresponding RVs:

E[X + Y] = Σ_i Σ_j (i + j) Pr((X = i) ∩ (Y = j))
= Σ_i Σ_j i Pr((X = i) ∩ (Y = j)) + Σ_i Σ_j j Pr((X = i) ∩ (Y = j))
= Σ_i i Σ_j Pr((X = i) ∩ (Y = j)) + Σ_j j Σ_i Pr((X = i) ∩ (Y = j))
= Σ_i i Pr(X = i) + Σ_j j Pr(Y = j)
= E[X] + E[Y].

The first equality follows from Definition 2.3. The penultimate equality uses Theorem 1.6, the law of total probability. ∎
Let us now compute the expected sum of two standard dice. Let X = X_1 + X_2, where X_i represents the outcome of die i for i = 1, 2. Then

E[X_i] = Σ_{j=1}^6 (1/6) · j = 7/2.

Applying the linearity of expectations, we have

E[X] = E[X_1] + E[X_2] = 7.

Linearity of expectations holds for any collection of RVs, even if they are not independent. Consider, e.g., the previous example and let the random variable Y = X_1 + X_1^2. We have

E[Y] = E[X_1 + X_1^2] = E[X_1] + E[X_1^2],

even though X_1 and X_1^2 are clearly dependent. Verify the identity by considering the six possible outcomes for X_1.
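The suggested verification over the six outcomes of X_1 can be done exactly (my own check, not from the slides):

```python
from fractions import Fraction

# One die: X_1 is uniform on 1..6, and Y = X_1 + X_1^2.
# X_1 and X_1^2 are dependent, yet E[Y] = E[X_1] + E[X_1^2].
e_x1 = sum(Fraction(k, 6) for k in range(1, 7))
e_x1_sq = sum(Fraction(k * k, 6) for k in range(1, 7))
e_y = sum(Fraction(k + k * k, 6) for k in range(1, 7))

print(e_x1)     # 7/2
print(e_x1_sq)  # 91/6
print(e_y)      # 56/3
```

Indeed 7/2 + 91/6 = 112/6 = 56/3, as linearity promises.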
Lemma 2.2: For any constant c and discrete RV X, E[cX] = cE[X].

Proof: The lemma is obvious for c = 0. For c ≠ 0,

E[cX] = Σ_j j Pr(cX = j) = c Σ_j (j/c) Pr(X = j/c) = c Σ_k k Pr(X = k) = cE[X]. ∎

2.1.2. Jensen's Inequality

Let us choose the length X of a side of a square uniformly at random from the integers in the range [1, 99]. What is the expected value of the area? We can write this as E[X^2]. It is tempting to think of this as being equal to (E[X])^2, but a simple calculation shows that this is not correct. In fact,

(E[X])^2 = 50^2 = 2500, whereas E[X^2] = 9950/3 ≈ 3316.7 > 2500.
More generally, E[X^2] ≥ (E[X])^2. Consider Y = (X − E[X])^2. The RV Y is nonnegative and hence its expectation must also be nonnegative:

0 ≤ E[Y] = E[(X − E[X])^2]
= E[X^2 − 2X·E[X] + (E[X])^2]
= E[X^2] − 2E[X]·E[X] + (E[X])^2
= E[X^2] − (E[X])^2.

To obtain the penultimate line, use the linearity of expectations. To obtain the last line, use Lemma 2.2 to simplify E[X·E[X]] = E[X]·E[X].

The fact that E[X^2] ≥ (E[X])^2 is an example of Jensen's inequality. Jensen's inequality shows that, for any convex function f, we have E[f(X)] ≥ f(E[X]).

Definition 2.4: A function f: R → R is said to be convex if, for any x_1, x_2 and 0 ≤ λ ≤ 1,

f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2).

Lemma 2.3: If f is a twice differentiable function, then f is convex if and only if f''(x) ≥ 0.
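The square example can be checked exactly over the integers 1..99 (a sketch of my own, not from the slides):

```python
from fractions import Fraction

# Side length X is uniform on the integers 1..99.
values = range(1, 100)
e_x = sum(Fraction(i, 99) for i in values)
e_x2 = sum(Fraction(i * i, 99) for i in values)

print(e_x)   # 50
print(e_x2)  # 9950/3, about 3316.7, strictly larger than 50^2 = 2500
```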
Theorem 2.4 [Jensen's Inequality]: If f is a convex function, then E[f(X)] ≥ f(E[X]).

Proof: We prove the theorem assuming that f has a Taylor expansion. Let μ = E[X]. By Taylor's theorem there is a value c such that

f(x) = f(μ) + f'(μ)(x − μ) + f''(c)(x − μ)^2/2 ≥ f(μ) + f'(μ)(x − μ),

since f''(c) ≥ 0 by convexity. Taking expectations and applying linearity of expectations and Lemma 2.2 yields:

E[f(X)] ≥ E[f(μ) + f'(μ)(X − μ)] = E[f(μ)] + f'(μ)(E[X] − μ) = f(μ) = f(E[X]). ∎
2.2. The Bernoulli and Binomial Random Variables

We run an experiment that succeeds with probability p and fails with probability 1 − p. Let Y be a RV such that

Y = 1 if the experiment succeeds, and 0 otherwise.

The variable Y is called a Bernoulli or an indicator random variable. Note that, for a Bernoulli RV,

E[Y] = p·1 + (1 − p)·0 = p = Pr(Y = 1).

If we, e.g., flip a fair coin and consider heads a success, then the expected value of the corresponding indicator RV is 1/2.

Consider a sequence of n independent coin flips. What is the distribution of the number of heads in the entire sequence? More generally, consider a sequence of n independent experiments, each of which succeeds with probability p. If we let X represent the number of successes in the n experiments, then X has a binomial distribution.
Definition 2.5: A binomial RV X with parameters n and p, denoted by B(n, p), is defined by the following probability distribution on j = 0, 1, 2, ..., n:

Pr(X = j) = C(n, j) p^j (1 − p)^{n−j}.

That is, the binomial RV (BRV) X equals j when there are exactly j successes and n − j failures in n independent experiments, each of which is successful with probability p. Definition 2.5 gives a valid probability function (Definition 1.2), since by the binomial theorem

Σ_{j=0}^n C(n, j) p^j (1 − p)^{n−j} = 1.

Suppose we want to gather data about the packets going through a router; e.g., we want to know the approximate fraction of packets from a certain source or of a certain type. We store a random subset or sample of the packets for later analysis. If each packet is stored with probability p and n packets go through the router each day, then the number X of sampled packets each day is a BRV with parameters n and p. To know how much memory is necessary for such a sample, we determine the expectation of X.
If X is a BRV with parameters n and p, then X is the number of successes in n trials, where each trial is successful with probability p. Define a set of n indicator RVs X_1, ..., X_n, where X_i = 1 if the ith trial is successful and 0 otherwise. Clearly, E[X_i] = p and X = Σ_{i=1}^n X_i, and so, by the linearity of expectations,

E[X] = Σ_{i=1}^n E[X_i] = np.

2.3. Conditional Expectation

Definition 2.6:

E[Y | Z = z] = Σ_y y Pr(Y = y | Z = z),

where the summation is over all y in the range of Y. The conditional expectation of a RV is, like the ordinary expectation, a weighted sum of the values it assumes. Now each value is weighted by the conditional probability that the variable assumes that value.
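Both binomial facts, that the pmf sums to 1 and that E[X] = np, can be checked exactly; this sketch (not from the slides) uses illustrative parameters n = 10, p = 1/3:

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Exact pmf of B(n, p): Pr(X = j) = C(n, j) p^j (1-p)^(n-j)."""
    return [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]

n, p = 10, Fraction(1, 3)
pmf = binomial_pmf(n, p)

print(sum(pmf))  # 1: a valid probability distribution
mean = sum(j * pmf[j] for j in range(n + 1))
print(mean)      # 10/3, i.e. np
```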
Suppose that we independently roll two standard six-sided dice. Let X_1 be the number that shows on the first die, X_2 the number on the second die, and X the sum of the numbers on the two dice. Then

E[X | X_1 = 2] = Σ_x x · Pr(X = x | X_1 = 2) = Σ_{x=3}^{8} x · (1/6) = 11/2.

As another example, consider E[X_1 | X = 5]:

E[X_1 | X = 5] = Σ_{x=1}^{4} x · Pr(X_1 = x | X = 5) = Σ_{x=1}^{4} x · (1/36)/(4/36) = 5/2.
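Both conditional expectations can be recovered by enumerating the 36 outcomes (a check of my own; the helper `cond_expectation` is not from the slides):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def cond_expectation(f, cond):
    """E[f | cond] by direct enumeration over the equally likely outcomes."""
    sel = [o for o in outcomes if cond(o)]
    return Fraction(sum(f(o) for o in sel), len(sel))

# E[X | X_1 = 2], where X = X_1 + X_2:
e1 = cond_expectation(lambda o: o[0] + o[1], lambda o: o[0] == 2)
print(e1)  # 11/2

# E[X_1 | X = 5]:
e2 = cond_expectation(lambda o: o[0], lambda o: o[0] + o[1] == 5)
print(e2)  # 5/2
```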
Lemma 2.5: For any RVs X and Y,

E[X] = Σ_y Pr(Y = y) E[X | Y = y],

where the sum is over all values y in the range of Y and all of the expectations exist.

Proof:

Σ_y Pr(Y = y) E[X | Y = y] = Σ_y Pr(Y = y) Σ_x x Pr(X = x | Y = y)
= Σ_x Σ_y x Pr(X = x | Y = y) Pr(Y = y)
= Σ_x Σ_y x Pr((X = x) ∩ (Y = y))
= Σ_x x Pr(X = x) = E[X]. ∎

The linearity of expectations also extends to conditional expectations.

Lemma 2.6: For any finite collection of discrete RVs X_1, X_2, ..., X_n with finite expectations and for any RV Y,

E[Σ_{i=1}^n X_i | Y = y] = Σ_{i=1}^n E[X_i | Y = y].
Confusingly, the term conditional expectation is also used to refer to the following RV.

Definition 2.7: The expression E[Y | Z] is a RV f(Z) that takes on the value E[Y | Z = z] when Z = z.

E[Y | Z] is not a real value; it is actually a function of the RV Z. Hence E[Y | Z] is itself a function from the sample space to the real numbers and can therefore be thought of as a RV.

In the previous example of rolling two dice,

E[X | X_1] = Σ_{k=1}^{6} (X_1 + k)(1/6) = X_1 + 7/2.

We see that E[X | X_1] is a RV whose value depends on X_1. If E[Y | Z] is a RV, then it makes sense to consider its expectation E[E[Y | Z]]. We found that E[X | X_1] = X_1 + 7/2. Thus,

E[E[X | X_1]] = E[X_1 + 7/2] = 7/2 + 7/2 = 7 = E[X].
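That E[X | X_1] is the RV X_1 + 7/2, and that averaging it over X_1 recovers E[X] = 7, can be verified by enumeration (my own check, not from the slides):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice

def e_x_given_x1(a):
    """E[X | X_1 = a] for the sum X = X_1 + X_2."""
    sel = [x1 + x2 for x1, x2 in outcomes if x1 == a]
    return Fraction(sum(sel), len(sel))

# The conditional expectation is the RV X_1 + 7/2 ...
for a in range(1, 7):
    assert e_x_given_x1(a) == a + Fraction(7, 2)

# ... and its expectation over X_1 recovers E[X].
e_e = sum(Fraction(1, 6) * e_x_given_x1(a) for a in range(1, 7))
print(e_e)  # 7
```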
More generally,

Theorem 2.7: E[Y] = E[E[Y | Z]].

Proof: From Definition 2.7 we have E[Y | Z] = f(Z), where f(Z) takes on the value E[Y | Z = z] when Z = z. Hence

E[E[Y | Z]] = Σ_z E[Y | Z = z] Pr(Z = z).

The right-hand side equals E[Y] by Lemma 2.5. ∎

Consider a program that includes one call to a process S. Assume that each call to process S recursively spawns new copies of the process S, where the number of new copies is a BRV with parameters n and p. We assume that these random variables are independent for each call to S. What is the expected number of copies of the process S generated by the program?
To analyze this recursive spawning process, we use generations. The initial process S is in generation 0. Otherwise, we say that a process S is in generation i if it was spawned by another process S in generation i − 1. Let Y_i denote the number of S processes in generation i. Since we know that Y_0 = 1, the number of processes in generation 1 has a binomial distribution. Thus,

E[Y_1] = np.

Similarly, suppose we knew that the number of processes in generation i − 1 was y_{i−1}, so Y_{i−1} = y_{i−1}. Then

E[Y_i | Y_{i−1} = y_{i−1}] = y_{i−1} np.

Applying Theorem 2.7, we can compute the expected size of the ith generation inductively. We have

E[Y_i] = E[E[Y_i | Y_{i−1}]] = E[np · Y_{i−1}] = np · E[Y_{i−1}].

By induction on i, and using the fact that Y_0 = 1, we then obtain

E[Y_i] = (np)^i.
The expected total number of copies of process S generated by the program is given by

E[Σ_{i≥0} Y_i] = Σ_{i≥0} E[Y_i] = Σ_{i≥0} (np)^i.

If np ≥ 1 then the expectation is unbounded; if np < 1, the expectation is 1/(1 − np). The expected number of processes generated by the program is bounded if and only if the expected number of processes spawned by each process is less than 1. This is a simple example of a branching process, a probabilistic paradigm extensively studied in probability theory.

2.4. The Geometric Distribution

Suppose we flip a coin until it lands on heads. What is the distribution of the number of flips? This is an example of a geometric distribution. It arises when we perform a sequence of independent trials until the first success, where each trial succeeds with probability p.

Definition 2.8: A geometric RV X with parameter p is given by the following probability distribution on n = 1, 2, ...:

Pr(X = n) = (1 − p)^{n−1} p.
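The branching-process expectation 1/(1 − np) can be sanity-checked with a seeded Monte Carlo simulation. This is a sketch of my own; the parameters n = 2, p = 1/4 (so np = 1/2 and the expected total is 2) and the safety cap are illustrative assumptions, not from the slides:

```python
import random

def total_progeny(n, p, rng, cap=10**6):
    """Total number of copies spawned from one initial call to S, where
    each call spawns Binomial(n, p) new calls, all independent."""
    total, frontier = 1, 1
    while frontier and total < cap:
        # Children of the current generation: the sum of `frontier`
        # independent Binomial(n, p) draws.
        children = sum(1 for _ in range(frontier * n) if rng.random() < p)
        total += children
        frontier = children
    return total

rng = random.Random(42)
n, p = 2, 0.25  # np = 1/2 < 1, so E[total] = 1/(1 - np) = 2
trials = 20000
mean = sum(total_progeny(n, p, rng) for _ in range(trials)) / trials
print(mean)  # should be close to 2
```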
Geometric RVs are said to be memoryless because the probability that you will reach your first success n trials from now is independent of the number of failures you have experienced. Informally, one can ignore past failures: they do not change the distribution of the number of future trials until the first success. Formally, we have the following.

Lemma 2.8: For a geometric RV X with parameter p and for n > 0,

Pr(X = n + k | X > k) = Pr(X = n).

When a RV takes values in the set of natural numbers N = {0, 1, 2, 3, ...}, there is an alternative formula for calculating its expectation.

Lemma 2.9: Let X be a discrete RV that takes on only nonnegative integer values. Then

E[X] = Σ_{i=1}^∞ Pr(X ≥ i).

Proof:

Σ_{i=1}^∞ Pr(X ≥ i) = Σ_{i=1}^∞ Σ_{j=i}^∞ Pr(X = j)
= Σ_{j=1}^∞ Σ_{i=1}^{j} Pr(X = j)
= Σ_{j=1}^∞ j Pr(X = j) = E[X].

The interchange of the (possibly infinite) summations is justified because all terms are nonnegative. ∎
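The memoryless property of Lemma 2.8 can be verified exactly for a range of n and k (my own check, using the illustrative parameter p = 1/3):

```python
from fractions import Fraction

def geom_pmf(p, n):
    """Pr(X = n) for a geometric RV: (1-p)^(n-1) * p."""
    return (1 - p) ** (n - 1) * p

def geom_gt(p, k):
    """Pr(X > k) = (1-p)^k: the first k trials all fail."""
    return (1 - p) ** k

p = Fraction(1, 3)
for k in range(1, 8):
    for n in range(1, 8):
        # Pr(X = n + k | X > k) = Pr(X = n + k) / Pr(X > k).
        lhs = geom_pmf(p, n + k) / geom_gt(p, k)
        assert lhs == geom_pmf(p, n)
print("memoryless property verified for p = 1/3")
```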
For a geometric RV X with parameter p,

Pr(X ≥ i) = Σ_{n=i}^∞ (1 − p)^{n−1} p = (1 − p)^{i−1}.

Hence

E[X] = Σ_{i=1}^∞ (1 − p)^{i−1} = 1/(1 − (1 − p)) = 1/p.

Thus, for a fair coin where p = 1/2, on average it takes two flips to see the first heads.

We can also find the expectation of a geometric RV X with parameter p using conditional expectations and the memoryless property of geometric RVs. Recall that X corresponds to the number of flips until the first heads, given that each flip is heads with probability p. Let Y = 0 if the first flip is tails and Y = 1 if the first flip is heads. By the identity from Lemma 2.5,

E[X] = Pr(Y = 0) E[X | Y = 0] + Pr(Y = 1) E[X | Y = 1]
= (1 − p) E[X | Y = 0] + p E[X | Y = 1].
If Y = 1 then X = 1, so E[X | Y = 1] = 1. If Y = 0, then X > 1. In this case, let the number of remaining flips (after the first flip, until the first heads) be Z. Then, by the linearity of expectations,

E[X | Y = 0] = E[Z + 1] = E[Z] + 1.

By the memoryless property of geometric RVs, Z is also a geometric RV with parameter p. Hence E[Z] = E[X], since they both have the same distribution. We therefore have

E[X] = (1 − p)(E[X] + 1) + p = (1 − p) E[X] + 1,

which yields E[X] = 1/p.

2.4.1. Example: Coupon Collector's Problem

Each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize. Assume that the coupon in each box is chosen independently and uniformly at random from the n possibilities, and that you do not collaborate with others to collect coupons. How many boxes of cereal must you buy before you obtain at least one of every type of coupon?
Let X be the number of boxes bought until at least one of every type of coupon is obtained. If X_i is the number of boxes bought while you had exactly i − 1 different coupons, then clearly

X = Σ_{i=1}^n X_i.

The advantage of breaking X into a sum of random variables X_i, i = 1, ..., n, is that each X_i is a geometric RV. When exactly i − 1 coupons have been found, the probability of obtaining a new coupon is

p_i = 1 − (i − 1)/n.

Hence, X_i is a geometric RV with parameter p_i:

E[X_i] = 1/p_i = n/(n − i + 1).

Using the linearity of expectations, we have that

E[X] = Σ_{i=1}^n E[X_i] = Σ_{i=1}^n n/(n − i + 1) = n Σ_{i=1}^n 1/i.
The summation Σ_{i=1}^n 1/i is known as the harmonic number H(n).

Lemma 2.10: The harmonic number H(n) = Σ_{i=1}^n 1/i satisfies H(n) = ln n + Θ(1).

Thus, for the coupon collector's problem, the expected number of random coupons required to obtain all n coupons is n ln n + Θ(n).
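The identity E[X] = n·H(n) can be checked exactly, and a seeded simulation gives a rough empirical comparison. This is a sketch of my own (the choice n = 10 and the trial count are illustrative, not from the slides):

```python
from fractions import Fraction
import random

def expected_boxes(n):
    """E[X] = sum_i n/(n - i + 1), exactly."""
    return sum(Fraction(n, n - i + 1) for i in range(1, n + 1))

n = 10
harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
print(expected_boxes(n))  # 7381/252, about 29.3 boxes for n = 10

# A quick seeded simulation for comparison with the exact value.
rng = random.Random(1)

def boxes_until_complete(n):
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # a uniformly random coupon
        count += 1
    return count

mean = sum(boxes_until_complete(n) for _ in range(5000)) / 5000
print(mean)  # should be close to 29.3
```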