6.042/18.062J Mathematics for Computer Science          December 5, 2006
Tom Leighton and Ronitt Rubinfeld                          Lecture Notes

Expected Value II

1 The Expected Number of Events that Happen

Last week we concluded by showing that if you have a collection of events A_1, A_2, ..., A_N and T is the number of events that occur, then

    Ex(T) = Pr(A_1) + Pr(A_2) + ··· + Pr(A_N),

i.e., the expected number of events that happen is the sum of the probabilities of the individual events. Today we're going to start by seeing how this value can be used to determine whether at least one of the events is likely to happen. This is particularly relevant in situations where the events correspond to failures that can kill an entire system.

For example, suppose you want to compute the probability that a nuclear plant will have a meltdown. This is actually a real problem: in practice, the government assembles a panel of experts who try to figure out all the events that could cause a meltdown. For example, A_1 could be the event that an operator goes crazy and deliberately forces a meltdown, A_2 the event that an earthquake hits, a series of pipes break, and the coolant stops flowing, A_3 the event of a certain kind of terrorist attack, and so on. The experts then estimate the probability of each event and add the estimates up to get the expected number of failures. Hopefully this number is very small, far less than one, in which case they can conclude that the probability of a meltdown is small. That's because of a simple theorem which says that the probability that one or more events happen is at most the expected number of events that happen.

Theorem 1. Pr(T ≥ 1) ≤ Ex(T).

Proof. Since T is a non-negative integer-valued random variable,

    Ex(T) = Σ_{i=1}^{∞} Pr(T ≥ i) ≥ Pr(T ≥ 1).

So if there are a thousand ways a meltdown could happen and each happens with probability at most 1 in a million, we can say that the probability of a meltdown is at most 1 in a thousand. Note that we did not use any independence properties of the events.
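To make the bound concrete, here is a minimal Monte Carlo sketch in Python (my own illustration, not part of the original notes; the event count and probability are made-up values chosen so the simulation is feasible). It estimates Pr(T ≥ 1) for independent events and compares it against Ex(T); the bound itself needs no independence.

import random

# Hypothetical setup: 100 events, each with probability 0.005,
# so Ex(T) = 100 * 0.005 = 0.5 and Theorem 1 gives Pr(T >= 1) <= 0.5.
N, p, trials = 100, 0.005, 100_000

at_least_one = 0
for _ in range(trials):
    # T = number of events that happen in this trial
    T = sum(1 for _ in range(N) if random.random() < p)
    if T >= 1:
        at_least_one += 1

print("Ex(T) =", N * p)                                  # 0.5
print("estimated Pr(T >= 1) =", at_least_one / trials)   # about 0.39, below the bound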

Similarly, if you bought 1000 lottery tickets, each with a 1-in-a-million chance of winning, your chance of winning is at most 1 in a thousand. This is a simple tool, but it is very widely used to bound the probability of failure or disaster.

So that's the good case, namely that the expected number of events to happen is close to 0. But what happens if the answer comes out differently? Suppose the experts add up the probabilities and find that the expected number of events to happen is bigger than 1; say they get 10. Then it is unclear, without more information, how to upper bound the probability of a meltdown. It might still be unlikely for a meltdown to happen. For example, suppose there are 1000 events and each happens with a 1% chance. Then the expected number to occur is 10. However, if all the events are totally dependent, i.e., Pr(A_i | A_j) = 1 for all i, j, then Pr(T ≥ 1) = 1/100. That is to say, either all the events happen or none of them do, so there is a 99% chance that there is no meltdown.

Now let's suppose we know a bit more about these events. In particular, suppose they are all mutually independent, and suppose we expect 10 events to occur. In this case it turns out that the probability of meltdown is > 99.99%, which is surprisingly close to 1. This is because of the following theorem.

Theorem 2. Given mutually independent events A_1, A_2, ..., A_N,

    Pr(T = 0) ≤ e^{−Ex(T)},

where T is a random variable denoting the number of events that happen. Thus, the probability that no event occurs is exponentially small!

Proof.

    Pr(T = 0) = Pr(Ā_1 ∩ Ā_2 ∩ ··· ∩ Ā_N)
              = Π_{i=1}^{N} Pr(Ā_i)           (mutual independence)
              = Π_{i=1}^{N} (1 − Pr(A_i))
              ≤ Π_{i=1}^{N} e^{−Pr(A_i)}      (∀x, 1 − x ≤ e^{−x})
              = e^{−Σ_{i=1}^{N} Pr(A_i)}
              = e^{−Ex(T)}.

Note that independence was used critically here. Thus, as a corollary, given mutually independent events, if we expect 10 or more events to occur, the probability that no event occurs is at most e^{−10} < 1/22000.
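For the numbers above, the exact probability can be checked directly (a quick sketch of mine, not from the notes): with N = 1000 independent events each of probability 1%, the exact Pr(T = 0) = (1 − p)^N lands just below the bound e^{−Ex(T)}.

import math

N, p = 1000, 0.01           # Ex(T) = N * p = 10
exact = (1 - p) ** N        # exact Pr(T = 0) for mutually independent events
bound = math.exp(-N * p)    # Theorem 2's bound, e^{-Ex(T)}

print(f"exact Pr(T = 0) = {exact:.3e}")   # ~4.317e-05
print(f"e^(-Ex(T))      = {bound:.3e}")   # ~4.540e-05; exact <= bound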

Notice that in this theorem there is no dependence on N, the total number of events. This is a really strong result, since it says that if we expect some things to happen, then almost surely something will! That is, if you expect your system to have some faults, then it surely will. This is sort of a proof of Murphy's Law: if there are a lot of things that can go wrong, even if none of them is very likely, this result says that almost surely something will go wrong.

As an example, if you have N components and each fails with probability p = 10/N, then the expected number of failures is 10. Assuming mutual independence, the probability that there is no failure is at most e^{−10}. So even though the probability of failure of any one component is small, and even though only 10 failures are expected, the odds that nothing fails are not 1/10, but rather at most e^{−10}. This helps explain coincidences and "acts of God", i.e., why crazy events that seem to have been very unlikely do happen: there are so many unlikely events possible that some of them are likely to happen. For example, someone does win the lottery.

To see another example, let's play a mind-reading game. There is a deck of cards. All the face cards (Jack, Queen, King, Ace) are worth 1, as is the card labeled 10; the remaining cards are worth the value of their label. First, you guess a number i_0 between 1 and 9. Then a single deck is shuffled, and the cards are revealed one at a time. On the i_0-th card, you read how much it is worth; denote this value i_1. Now you choose the i_1-th card after the i_0-th; suppose it has value i_2. You repeat this process until the deck is finished. At the end you have some value i_r, the value of the last card you landed on.

It turns out that, with high probability over the shuffling of the cards, any two initial guesses i_0 and i_0′ result in the same final value i_r. The reason is that if two different people (with two different guesses) ever land on the same card, then they finish on the same card: after landing on the same card, the two people track each other. Now, each time one of us lands on a card and starts counting again, there is some chance of landing on the other's current position. If the positions are 1 apart, there is a 5/13 chance of a collision; if they are more than 1 apart, there is a 1/13 chance. There are also many jumps: we expect more than 10 of them for a random shuffling, so about 20 jumps between the two players in total, and it is very likely that the players eventually collide. That is, we can define A_i to be the event that the players collide on the i-th jump, and then use our previous theorem to show that with high probability one of the A_i occurs.

Now in this case the intuition I went through was pretty fuzzy. In fact, the theorem doesn't really apply in this example, because the events are not independent: if we have not collided so far, that tells us something about the numbers we have seen, and since we have a 52-card deck, that tells us something about the numbers to come, which conditions the probability that we hit a match. However, we could analyze the game formally if the cards were selected at random from an infinite number of decks; then the event of a hit would be independent for each jump. We won't cover this more formal analysis in class. It turns out, though, that the probability I can read your mind is at least 90% for 52-card decks.
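The notes don't give code for this game, but here is a rough simulation sketch (my own; the 1-indexed counting convention and the stopping rule are my assumptions) that estimates how often two different initial guesses end on the same card of a shuffled 52-card deck.

import random

def final_position(deck, start):
    """Follow the counting chain from 1-indexed position `start`;
    return the 0-indexed position of the last card landed on."""
    pos = start - 1                       # convert to 0-indexed
    while pos + deck[pos] < len(deck):    # jump forward by the current card's value
        pos += deck[pos]
    return pos

# A, 10, J, Q, K are each worth 1 (5 ranks x 4 suits = 20 cards);
# the cards 2 through 9 are worth their labels.
deck = [1] * 20 + [v for v in range(2, 10) for _ in range(4)]

trials, agree = 10_000, 0
for _ in range(trials):
    random.shuffle(deck)
    i0, j0 = random.sample(range(1, 10), 2)   # two different guesses in 1..9
    if final_position(deck, i0) == final_position(deck, j0):
        agree += 1

# The notes claim at least 90% for 52-card decks; the estimate comes out around there.
print("agreement rate:", agree / trials)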

The point of all this can be summed up as follows. If the expected number of events to occur is small, then it is likely that no event occurs, whether or not the events are independent. But if the expected number of events is large, then surely some event will occur, provided the events are mutually independent. That is,

    1 − Ex(T) ≤ Pr(T = 0) ≤ e^{−Ex(T)},

where the right inequality holds only if the events are mutually independent. We'll talk more about the probability that certain numbers of events happen next time, but first we need some additional properties of the expected value. We already know that the expected value of the sum of two random variables is the sum of their expected values.

2 Expected Value of a Product

Enough with sums! What about the expected value of a product of random variables? If R_1 and R_2 are independent, then the expected value of their product is the product of their expected values.

Theorem 3. For independent random variables R_1 and R_2:

    Ex(R_1 · R_2) = Ex(R_1) · Ex(R_2).

Proof. We'll transform the right side into the left side:

    Ex(R_1) · Ex(R_2) = ( Σ_{x ∈ Range(R_1)} x · Pr(R_1 = x) ) · ( Σ_{y ∈ Range(R_2)} y · Pr(R_2 = y) )
                      = Σ_{x ∈ Range(R_1)} Σ_{y ∈ Range(R_2)} xy · Pr(R_1 = x) · Pr(R_2 = y)
                      = Σ_{x ∈ Range(R_1)} Σ_{y ∈ Range(R_2)} xy · Pr(R_1 = x ∧ R_2 = y).

The second line comes from multiplying out the product of sums; the third uses the fact that R_1 and R_2 are independent. Now let's group terms for which the product xy is the same:

    = Σ_{z ∈ Range(R_1·R_2)} Σ_{x,y: xy=z} z · Pr(R_1 = x ∧ R_2 = y)
    = Σ_{z ∈ Range(R_1·R_2)} z · Σ_{x,y: xy=z} Pr(R_1 = x ∧ R_2 = y)
    = Σ_{z ∈ Range(R_1·R_2)} z · Pr(R_1·R_2 = z)
    = Ex(R_1 · R_2).

2.1 The Product of Two Independent Dice

Suppose we throw two independent, fair dice and multiply the numbers that come up. What is the expected value of this product? Let random variables R_1 and R_2 be the numbers shown on the two dice. We can compute the expected value of the product as follows:

    Ex(R_1 · R_2) = Ex(R_1) · Ex(R_2) = 3 1/2 · 3 1/2 = 12 1/4.

On the first line, we're using Theorem 3; then we use the result from last lecture that the expected value of one die is 3 1/2.

2.2 The Product of Two Dependent Dice

Suppose that the two dice are not independent; in fact, suppose that the second die always shows the same number as the first. Does this change the expected value of the product? Is the independence condition in Theorem 3 really necessary? As before, let random variables R_1 and R_2 be the numbers shown on the two dice. We can compute the expected value of the product directly as follows:

    Ex(R_1 · R_2) = Ex(R_1²)
                  = Σ_{i=1}^{6} i² · Pr(R_1 = i)
                  = 1²/6 + 2²/6 + 3²/6 + 4²/6 + 5²/6 + 6²/6
                  = 15 1/6.

The first step uses the fact that the outcome of the second die is always the same as that of the first; then we expand Ex(R_1²) using one of our formulations of expectation. Now that the dice are no longer independent, the expected value of the product has changed to 15 1/6. So the expectation of a product of dependent random variables need not equal the product of their expectations.
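A quick simulation (my own sketch, not part of the notes) makes both answers concrete: products of independent dice average to about 12.25, while a die multiplied by itself averages to about 15.17.

import random

trials = 1_000_000
ind = dep = 0
for _ in range(trials):
    r1 = random.randint(1, 6)
    r2 = random.randint(1, 6)    # independent second die
    ind += r1 * r2
    dep += r1 * r1               # dependent case: second die equals the first

print("independent:", ind / trials)   # about 12.25 = Ex(R_1) * Ex(R_2)
print("dependent:  ", dep / trials)   # about 15.17 = Ex(R_1^2)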

2.3 Corollaries

Theorem 3 extends to a collection of mutually independent variables.

Corollary 4. If random variables R_1, R_2, ..., R_n are mutually independent, then

    Ex(R_1 · R_2 ··· R_n) = Ex(R_1) · Ex(R_2) ··· Ex(R_n).

The proof uses induction, Theorem 3, and the definition of mutual independence. We'll omit the details.

Adjusting a random variable by an additive or multiplicative constant adjusts the expected value in the same way.

Corollary 5. If R is a random variable and a and b are constants, then

    Ex(aR + b) = a · Ex(R) + b.

This corollary follows if we regard a and b as random variables that each take on one particular value with probability 1. Constants are always independent of other random variables, so the equation holds by linearity of expectation and Theorem 3.

We now know the expected value of a sum or product of random variables. Unfortunately, the expected value of a reciprocal is not so easy to characterize. Here is a flawed attempt.

False Corollary 6. If R is a random variable, then

    Ex(1/R) = 1 / Ex(R).

As a counterexample, suppose the random variable R is 1 with probability 1/2 and is 2 with probability 1/2. Then we have:

    Ex(R) = 1 · 1/2 + 2 · 1/2 = 3/2
    Ex(1/R) = 1 · 1/2 + 1/2 · 1/2 = 3/4

Since 1 / Ex(R) = 2/3, the two quantities are not equal, so the corollary must be false.
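Checking the counterexample numerically (a one-off sketch of mine, using Python's exact rational arithmetic):

from fractions import Fraction

# R is 1 with probability 1/2 and 2 with probability 1/2
values = [Fraction(1), Fraction(2)]
p = Fraction(1, 2)

ex_R = sum(p * v for v in values)       # Ex(R)   = 3/2
ex_inv = sum(p / v for v in values)     # Ex(1/R) = 3/4

print(ex_inv, 1 / ex_R)   # 3/4 vs 2/3: not equal, so False Corollary 6 fails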

But here is another false corollary, which we can actually prove!

False Corollary 7. If Ex(R/T) > 1, then Ex(R) > Ex(T).

Proof. We begin with the if-part, multiply both sides by Ex(T), and then apply Theorem 3:

    Ex(R/T) > 1
    Ex(R/T) · Ex(T) > Ex(T)
    Ex(R) > Ex(T)

This proof is bogus! The first step is valid only if Ex(T) > 0. More importantly, we can't apply Theorem 3 in the second step, because R/T and T are not necessarily independent. Unfortunately, the fact that Corollary 7 is false does not mean it is never used!

2.3.1 A RISC Paradox

The following data is taken from a paper by some famous professors. They wanted to show that programs on a RISC processor are generally shorter than programs on a CISC processor. For this purpose, they made a table of program lengths for some benchmark problems, which looked like this:

    Benchmark        RISC    CISC    CISC/RISC
    E-string search   150     120       0.8
    F-bit test        120     180       1.5
    Ackerman          150     300       2.0
    Rec 2-sort       2800    1400       0.5
    Average                             1.2

Each row contains the data for one benchmark. The numbers in the first two columns are program lengths for each type of processor. The third column contains the ratio of the CISC program length to the RISC program length. Averaging this ratio over all benchmarks gives the value 1.2 in the lower right. The authors conclude that CISC programs are 20% longer on average.

But there's a pretty serious problem here. Suppose we redo the final column, taking the inverse ratio, RISC/CISC instead of CISC/RISC:

    Benchmark        RISC    CISC    RISC/CISC
    E-string search   150     120       1.25
    F-bit test        120     180       0.67
    Ackerman          150     300       0.5
    Rec 2-sort       2800    1400       2.0
    Average                             1.1

By exactly the same reasoning used by the authors, we could conclude that RISC programs are 10% longer on average than CISC programs! What's going on?

2.3.2 A Probabilistic Interpretation

To shed some light on this paradox, we can model the RISC vs. CISC debate with the machinery of probability theory. Let the sample space be the set of benchmark programs. Let the random variable R be the length of the RISC program, and let the random variable C be the length of the CISC

program. We would like to compare the average length of a RISC program, Ex(R), to the average length of a CISC program, Ex(C).

To compare average program lengths, we must assign a probability to each sample point; in effect, this assigns a weight to each benchmark. One might like to weight benchmarks based on how frequently similar programs arise in practice. But let's follow the original authors' lead: they assign each ratio equal weight in their average, so they're implicitly assuming that similar programs arise with equal probability. Let's do the same and make the sample space uniform. We can now compute Ex(R) and Ex(C) as follows:

    Ex(R) = 150/4 + 120/4 + 150/4 + 2800/4 = 805
    Ex(C) = 120/4 + 180/4 + 300/4 + 1400/4 = 500

So the average length of a RISC program is actually Ex(R) / Ex(C) = 1.61 times greater than the average length of a CISC program. RISC is even worse than either of the two previous answers would suggest!

In terms of our probability model, the authors computed C/R for each sample point and then averaged to obtain Ex(C/R) = 1.2. This much is correct. However, they interpret this to mean that CISC programs are longer than RISC programs on average. Thus, the key conclusion of this milestone paper rests on Corollary 7, which we know to be false!

2.3.3 A Simpler Example

The root of the problem is clearer in the following, simpler example. Suppose the data were as follows.

    Benchmark    Processor A    Processor B    B/A    A/B
    Problem 1         2              1         1/2     2
    Problem 2         1              2          2     1/2
    Average                                   1.25   1.25

Now the statistics for processors A and B are exactly symmetric. Yet, from the third column we would conclude that Processor B programs are 25% longer on average, and from the fourth column we would conclude that Processor A programs are 25% longer on average. Both conclusions are obviously wrong. The moral is that averages of ratios can be very misleading. More generally, if you're computing the expectation of a quotient, think twice; you're going to get a value ripe for misuse and misinterpretation.
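A small sketch (mine, using the table's numbers) computes all three summaries side by side; only the ratio of averages is consistent under inversion.

risc = [150, 120, 150, 2800]
cisc = [120, 180, 300, 1400]
n = len(risc)

avg_c_over_r = sum(c / r for r, c in zip(risc, cisc)) / n   # 1.2:  "CISC 20% longer"
avg_r_over_c = sum(r / c for r, c in zip(risc, cisc)) / n   # 1.1:  "RISC 10% longer"
ratio_of_avgs = (sum(risc) / n) / (sum(cisc) / n)           # 1.61: Ex(R) / Ex(C)

print(avg_c_over_r, avg_r_over_c, ratio_of_avgs)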

3 Variance

So we've talked a lot about expectation: we've seen several ways of computing it and have done lots of examples. We know it tells us the weighted average of a distribution and is useful for bounding the probability that 1 or more events occur. In some cases, the expected value is very close to the observed values. For example, consider the number of heads when you flip 100 mutually independent coins. You expect to get 50 heads, and the probability that you get fewer than 25 or more than 75 is at most 1 in 5 million. In general, an observed random variable is very close to its expected value with probability near 1 when you have a binomial distribution with large n. For example, in our analysis of packet loss on a channel with a 1% expected error rate, the probability of getting a 2% error rate was 2^{−60}, which is incredibly small.

But it is not always the case that observed values are close to expected values. Recall the packet latency example we covered last week, where the expected latency was infinite but typical latencies were < 10 ms. In that case, the expected value was not very useful. As a simpler example, consider the following Bernoulli random variable: Pr(R = 1000) = 1/2 and Pr(R = −1000) = 1/2. Then Ex(R) = 0, which is the same as that of the Bernoulli random variable S defined by Pr(S = 1) = 1/2 and Pr(S = −1) = 1/2. Clearly the distributions, though both Bernoulli, are very different from each other! If these corresponded to betting games or stock market investments, one would carry a lot higher risk and reward than the other.

In an effort to get a better handle on the shape and nature of a distribution, mathematicians have defined several other properties of a distribution, in addition to the expected value, which help to describe it. After the expected value, the next most important measure of a random variable is its variance.

Definition 1. The variance of R is

    Var[R] = Ex((R − Ex(R))²).

That is, the variance gives us the average of the squares of the deviations from the mean. The idea behind the variance is that it is large when R is usually far from its expectation, and small when R is always close to its expectation. So the variance tries to capture how likely a random variable is to be near its mean.

Let's see what the variance is for the two random variables with expectation 0 that we just looked at. We have Pr(R − Ex(R) = 1000) = Pr(R − Ex(R) = −1000) = 1/2, and so (R − Ex(R))² = 10^6 with probability 1. Similarly, Pr(S − Ex(S) = 1) = Pr(S − Ex(S) = −1) = 1/2, and so (S − Ex(S))² = 1. Thus, Var[R] = 10^6 while Var[S] is only 1. Hence, the variance really helps show us the difference between these two situations. A high variance indicates that the random variable is likely to stray from the mean. Risk-averse people should stay away from games or investment strategies with high variance: you might gain a lot, but you also might lose a lot!

Why do we bother to square the deviation before taking the expectation? Well, if we just

computed the expected deviation Ex(R − Ex(R)), we'd find that

    Ex(R − Ex(R)) = Ex(R) − Ex(Ex(R)) = Ex(R) − Ex(R) = 0,

which is meaningless. The trouble is that the positive and negative deviations cancel. This must happen, because the expected value is the center of mass of the distribution, so positive and negative swings balance out by definition. One advantage of squaring is that all deviations become positive and there is no cancellation.

Of course, we could get around the cancellation problem without squaring. We could use Ex(|R − Ex(R)|) or higher powers such as Ex((R − Ex(R))⁴); the latter is known as the kurtosis of R (sounds like a foot disease!). It turns out that squaring is easier to work with, and that's why it is widely used. For example, there is a very nice linearity-of-variance rule, similar to linearity of expectation, which would not hold for any other function such as the absolute value or the 4th power. So we all use the square when measuring the expected distance from the mean.

In an effort to remove the square, people often use a closely related measure called the standard deviation of the random variable.

Definition 2. For any random variable R, the standard deviation of R is

    σ(R) = √Var[R] = √(Ex(deviation²)).

This is also sometimes referred to as the root-mean-square deviation, for obvious reasons. It is a very popular measure in statistics, particularly when you are fitting curves to data. For the two random variables we saw earlier, σ(R) = √(10^6) = 1000 and σ(S) = √1 = 1. This works out pretty well: the standard deviations of R and S very accurately reflect the expected distance from the mean. This is often true in practice, though it is not always the case.
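As a closing sketch (my own, not from the notes), here is a direct computation of Var and σ from the definitions for the two random variables above.

def variance(dist):
    """dist: list of (value, probability) pairs for a finite random variable."""
    mean = sum(p * v for v, p in dist)
    return sum(p * (v - mean) ** 2 for v, p in dist)

R = [(1000, 0.5), (-1000, 0.5)]
S = [(1, 0.5), (-1, 0.5)]

for name, dist in [("R", R), ("S", S)]:
    var = variance(dist)
    print(name, "Var =", var, "sigma =", var ** 0.5)
# R: Var = 1000000.0, sigma = 1000.0
# S: Var = 1.0, sigma = 1.0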