Expectation is linear

So far we saw that E(X + Y) = E(X) + E(Y). Let α ∈ R. Then,

    E(αX) = Σ_ω (αX)(ω) Pr(ω) = Σ_ω α X(ω) Pr(ω) = α Σ_ω X(ω) Pr(ω) = αE(X).

Corollary. For α, β ∈ R, E(αX + βY) = αE(X) + βE(Y).
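The corollary is easy to check numerically on an explicit finite sample space. Below is a minimal sketch; the sample space, the random variables X and Y, and the constants α, β are made up for illustration:

```python
# Check E(aX + bY) = a*E(X) + b*E(Y) on an explicit finite sample space.
# The outcomes, random variables, and constants below are illustrative.

omega = ["hh", "ht", "th", "tt"]                  # four equally likely outcomes
pr = {w: 0.25 for w in omega}                     # Pr(w) for each outcome

X = {w: w.count("h") for w in omega}              # X = number of h's
Y = {w: 1 if w[0] == "t" else 0 for w in omega}   # Y = indicator of first flip t

def E(Z):
    """Expectation of a random variable given as a dict omega -> value."""
    return sum(Z[w] * pr[w] for w in omega)

a, b = 3.0, -2.0
lhs = E({w: a * X[w] + b * Y[w] for w in omega})  # E(aX + bY) directly
rhs = a * E(X) + b * E(Y)                         # via linearity
print(lhs, rhs)  # the two values agree
```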

Expectation of ϕ(X)

X is a random variable and ϕ : R → R. We want the expectation of Y = ϕ(X). We can compute

    f_Y(y) = Pr(ϕ(X) = y) = Pr({ω : X(ω) ∈ ϕ⁻¹(y)}),

and use E(Y) = Σ_{y ∈ R_Y} y f_Y(y), where R_Y is the range of Y. Alternatively we have,

Claim. E(ϕ(X)) = Σ_{x ∈ R_X} ϕ(x) f_X(x).

Proof.

    E(ϕ(X)) = Σ_ω ϕ(X(ω)) Pr(ω)
            = Σ_{x ∈ R_X} Σ_{ω : X(ω) = x} ϕ(X(ω)) Pr(ω)
            = Σ_x ϕ(x) Σ_{ω : X(ω) = x} Pr(ω)
            = Σ_x ϕ(x) f_X(x).

Example. For a random variable X, E(X²) = Σ_x x² f_X(x).
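The claim says both routes give the same answer: summing ϕ(x) against f_X, or first deriving f_Y and summing y against it. A small sketch with a fair die and ϕ(x) = x², using exact rational arithmetic:

```python
from fractions import Fraction

# f_X for a fair die; phi(x) = x^2. Both computation routes must agree.
f_X = {x: Fraction(1, 6) for x in range(1, 7)}
phi = lambda x: x * x

# Route 1 (the claim): E(phi(X)) = sum_x phi(x) f_X(x)
e1 = sum(phi(x) * p for x, p in f_X.items())

# Route 2: derive f_Y for Y = phi(X), then E(Y) = sum_y y f_Y(y)
f_Y = {}
for x, p in f_X.items():
    f_Y[phi(x)] = f_Y.get(phi(x), Fraction(0)) + p
e2 = sum(y * p for y, p in f_Y.items())

print(e1, e2)  # both equal 91/6
```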

Variance of X

Consider the following three distributions:

    f_X(x) = 1 if x = 0, and 0 otherwise
    f_Y(y) = 1/2 if y = 1 or y = −1, and 0 otherwise
    f_Z(z) = 1/2 if z = 100 or z = −100, and 0 otherwise

What are the expectations of these distributions? Does the expectation tell the whole story? Clearly Z is much more spread about its mean than X and Y.

An intuitively appealing measurement of the spread of X about its mean µ = E(X) is given by E(|X − µ|).

Def. For convenience, the variance of X is defined as V(X) = E[(X − µ)²].

Def. The standard deviation is σ(X) = √V(X).
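For the three distributions above, a quick sketch computes the mean, the intuitive spread E|X − µ|, and the variance, showing that all three share the same mean while differing wildly in spread:

```python
# The three pmfs from the text, as dicts value -> probability.
dists = {
    "X": {0: 1.0},
    "Y": {1: 0.5, -1: 0.5},
    "Z": {100: 0.5, -100: 0.5},
}

def moments(pmf):
    """Return (mean, E|X - mu|, variance) for a finite pmf."""
    mu = sum(v * p for v, p in pmf.items())
    mad = sum(abs(v - mu) * p for v, p in pmf.items())    # E|X - mu|
    var = sum((v - mu) ** 2 * p for v, p in pmf.items())  # V(X)
    return mu, mad, var

for name, pmf in dists.items():
    print(name, moments(pmf))  # same mean 0; spreads 0, 1, 100
```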

Examples

Let X be Bernoulli(p). We saw that µ = p.

    V(X) = (0 − p)²(1 − p) + (1 − p)²p = p(1 − p)[p + (1 − p)] = p(1 − p).

Claim. V(X) = E(X²) − µ².

Proof.

    E[(X − µ)²] = E(X² − 2µX + µ²)
                = E(X²) − 2µE(X) + E(µ²)
                = E(X²) − 2µ² + µ²
                = E(X²) − µ².

Let X be the outcome of a roll of a fair die. We saw that E(X) = 7/2.

    E(X²) = 1² · (1/6) + 2² · (1/6) + ... + 6² · (1/6) = 91/6.

So, V(X) = 91/6 − (7/2)² = 35/12.
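The die computation can be verified mechanically, again with exact rational arithmetic, using the shortcut V(X) = E(X²) − µ²:

```python
from fractions import Fraction

# Fair die: f_X(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mu  = sum(x * p for x, p in pmf.items())      # E(X)   = 7/2
ex2 = sum(x * x * p for x, p in pmf.items())  # E(X^2) = 91/6
var = ex2 - mu ** 2                           # V(X) = E(X^2) - mu^2

print(mu, ex2, var)  # 7/2 91/6 35/12
```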

V(X + Y)

Let X and Y be random variables with µ = E(X) and ν = E(Y).

Def. The covariance of X and Y is Cov(X, Y) = E(XY) − E(X)E(Y).

Claim. V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y).

Proof. E(X + Y) = µ + ν, so

    V(X + Y) = E[(X + Y)²] − (µ + ν)²
             = E(X² + 2XY + Y²) − (µ² + 2µν + ν²)
             = [E(X²) − µ²] + [E(Y²) − ν²] + 2[E(XY) − µν]
             = V(X) + V(Y) + 2 Cov(X, Y).
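The claim holds for any joint distribution; a sketch checking it on a small joint pmf (the table below is made up for illustration, with X and Y deliberately dependent so that the covariance term matters):

```python
# Joint pmf: (x, y) -> probability. Illustrative and deliberately dependent.
joint = {(0, 0): 0.4, (1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1}

ex  = sum(x * p for (x, y), p in joint.items())      # mu  = E(X)
ey  = sum(y * p for (x, y), p in joint.items())      # nu  = E(Y)
exy = sum(x * y * p for (x, y), p in joint.items())  # E(XY)

vx  = sum((x - ex) ** 2 * p for (x, y), p in joint.items())  # V(X)
vy  = sum((y - ey) ** 2 * p for (x, y), p in joint.items())  # V(Y)
cov = exy - ex * ey                                          # Cov(X, Y)

# Direct computation of V(X + Y), to compare with the claim.
vxy = sum((x + y - ex - ey) ** 2 * p for (x, y), p in joint.items())
print(vxy, vx + vy + 2 * cov)  # the two sides agree
```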

Suppose X and Y are independent

Claim. If X and Y are independent then Cov(X, Y) = 0.

Proof.

    E(XY) = Σ_ω (XY)(ω) Pr(ω)
          = Σ_{x ∈ R_X} Σ_{y ∈ R_Y} Σ_{ω : X(ω)=x, Y(ω)=y} X(ω) Y(ω) Pr(ω)
          = Σ_x Σ_y x y Σ_{ω : X(ω)=x, Y(ω)=y} Pr(ω)
          = Σ_x Σ_y x y Pr(X = x, Y = y)
          = Σ_x Σ_y x y Pr(X = x) Pr(Y = y)
          = [Σ_x x Pr(X = x)] · [Σ_y y Pr(Y = y)]
          = E(X) E(Y).

Corollary. If X and Y are independent then V(X + Y) = V(X) + V(Y).
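A sketch checking both the claim and the corollary for two independent fair dice, where independence is built in by taking the product of the marginal pmfs:

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice: joint pmf is the product of the marginals.
sixth = Fraction(1, 6)
joint = {(x, y): sixth * sixth for x, y in product(range(1, 7), repeat=2)}

# Expectation of any function g(x, y) under the joint pmf.
E = lambda g: sum(g(x, y) * p for (x, y), p in joint.items())

cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

v_sum = E(lambda x, y: (x + y - 7) ** 2)           # V(X+Y); E(X+Y) = 7
v_x   = E(lambda x, y: (x - Fraction(7, 2)) ** 2)  # V of one die = 35/12
print(cov, v_sum, 2 * v_x)  # cov = 0 and V(X+Y) = V(X) + V(Y)
```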

The variance of B_{n,p}

Corollary. If X_1, ..., X_n are independent then V(X_1 + X_2 + ... + X_n) = V(X_1) + V(X_2) + ... + V(X_n).

Proof. By induction, but note that we need to show that X_1 + ... + X_{k−1} is independent of X_k.

Let X be a B_{n,p} random variable. Then X = Σ_{k=1}^n X_k, where the X_k are independent Bernoulli(p) random variables. So,

    V(X) = V(Σ_{k=1}^n X_k) = Σ_{k=1}^n V(X_k) = np(1 − p).

For a fixed p the variance increases with n. Does this make sense? For a fixed n the variance is minimized for p = 0, 1 and maximized for p = 1/2. Does that make sense?

Expectation and variance are just two measurements of the distribution. They cannot possibly convey the same amount of information as the distribution function itself. Nevertheless, we can learn a lot from them.
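The formula V(X) = np(1 − p) can be confirmed against a direct computation from the binomial pmf. A sketch for a few (n, p) values (math.comb requires Python 3.8+):

```python
from math import comb

def binom_var(n, p):
    """Variance of B_{n,p} computed directly from the pmf."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mu = sum(k * q for k, q in enumerate(pmf))
    return sum((k - mu) ** 2 * q for k, q in enumerate(pmf))

# Each direct computation matches the closed form np(1-p).
for n, p in [(10, 0.5), (20, 0.1), (50, 0.9)]:
    print(n, p, binom_var(n, p), n * p * (1 - p))
```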

Markov's Inequality

Theorem. Suppose X is a nonnegative random variable and α > 0. Then

    Pr(X ≥ α) ≤ E(X)/α.

Proof.

    E(X) = Σ_x x f_X(x) ≥ Σ_{x ≥ α} x f_X(x) ≥ Σ_{x ≥ α} α f_X(x) = α Σ_{x ≥ α} f_X(x) = α Pr(X ≥ α).

Example. If X is B_{100,1/2},

    Pr(X ≥ 100) ≤ 50/100.

This is not very accurate: the correct answer is 2⁻¹⁰⁰ ≈ 10⁻³⁰. What would happen if you tried to estimate Pr(X ≥ 49) this way?
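A sketch comparing Markov's bound E(X)/α with the exact tail of B_{100,1/2}, computed from the binomial pmf:

```python
from math import comb

n, p = 100, 0.5
mean = n * p  # E(X) = 50

def tail(a):
    """Exact Pr(X >= a) for X ~ B_{n,p}."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))

# Markov bound vs exact probability at several thresholds.
for a in [100, 60, 49]:
    print(a, mean / a, tail(a))
```

Note that at α = 49 the bound 50/49 exceeds 1, so Markov's inequality tells us nothing there: probabilities never exceed 1 anyway.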

Chebyshev's Inequality

Theorem. Let X be a random variable and β > 0. Then

    Pr(|X − µ| ≥ β) ≤ V(X)/β².

Proof. Let Y = (X − µ)². Then

    |X − µ| ≥ β  ⟺  Y ≥ β²,

so

    {ω : |X(ω) − µ| ≥ β} = {ω : Y(ω) ≥ β²}.

In particular, the probabilities of these events are the same:

    Pr(|X − µ| ≥ β) = Pr(Y ≥ β²).

Since Y ≥ 0, by Markov's inequality

    Pr(Y ≥ β²) ≤ E(Y)/β².

Finally, note that E(Y) = E[(X − µ)²] = V(X).

Example

Chebyshev's inequality gives a lower bound on how concentrated X is about its mean. Suppose X is B_{100,1/2} and we want a lower bound on Pr(40 < X < 60). Note that

    40 < X < 60  ⟺  −10 < X − 50 < 10  ⟺  |X − 50| < 10,

so

    Pr(40 < X < 60) = Pr(|X − 50| < 10) = 1 − Pr(|X − 50| ≥ 10).

Now,

    Pr(|X − 50| ≥ 10) ≤ V(X)/10² = 100 · (1/2)² / 100 = 1/4.

So,

    Pr(40 < X < 60) ≥ 1 − V(X)/10² = 3/4.

This is not too bad: the correct answer is ≈ 0.943.
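The Chebyshev bound and the exact probability can be compared directly using the binomial pmf:

```python
from math import comb

n, p = 100, 0.5
# For p = 1/2, p^k (1-p)^(n-k) = p^n for every k.
pmf = [comb(n, k) * p**n for k in range(n + 1)]

exact = sum(pmf[k] for k in range(41, 60))  # Pr(40 < X < 60), i.e. X in 41..59
cheb  = 1 - (n * p * (1 - p)) / 10**2       # Chebyshev lower bound = 3/4
print(round(exact, 4), cheb)                # exact probability dominates the bound
```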

The law of large numbers (LLN)

You suspect the coin you are betting on is biased. You would like to get an idea of the probability that it lands heads. How would you do that? Flip it n times and check the relative number of Hs. In other words, if X_k is the indicator of H on the kth flip, you estimate p as

    (Σ_{k=1}^n X_k) / n.

The underlying assumption is that as n grows bigger the approximation is more likely to be accurate. Is there a mathematical justification for this intuition?
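A quick simulation sketch of this estimator; the hidden bias 0.3, the seed, and the flip counts are made up for illustration:

```python
import random

random.seed(0)
p_true = 0.3  # hidden bias of the coin, made up for illustration

def estimate(n):
    """Estimate p by the fraction of heads in n simulated flips."""
    heads = sum(1 for _ in range(n) if random.random() < p_true)
    return heads / n

# Estimates tend to tighten around 0.3 as n grows.
for n in [10, 100, 10_000]:
    print(n, estimate(n))
```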

LLN cont.

Consider the following betting scheme: at every round the croupier rolls a die. You pay $1 to join the game, in which you bet on the result of the next 5 rolls. If you guess them all correctly you get 6⁵ = 7776 dollars, and 0 otherwise. How can you estimate whether this is a fair game? Study the average winnings of the last n gamblers. Formally, let X_k be the winnings of the kth gambler. We hope to estimate E(X_k) by

    (Σ_{k=1}^n X_k) / n.

Is there a mathematical justification for this intuition? Is the previous problem essentially different from this one?

Example of the (weak) LLN

Consider again the binomial p = 1/2 case. With S_n = Σ_{k=1}^n X_k, we expect, for example, that

    Pr(0.4 < S_n/n < 0.6) = Pr(0.4n < S_n < 0.6n)

will be big (close to 1) as n increases. As before,

    Pr(0.4n < S_n < 0.6n) = Pr(−0.1n < S_n − 0.5n < 0.1n)
                          = Pr(|S_n − 0.5n| < 0.1n)
                          = 1 − Pr(|S_n − 0.5n| ≥ 0.1n).

As before we can bound

    Pr(|S_n − 0.5n| ≥ 0.1n) ≤ V(S_n)/(0.1n)² = n(1/2)² / (0.01n²) = 1/(0.04n).

So

    Pr(0.4 < S_n/n < 0.6) ≥ 1 − 1/(0.04n) → 1 as n → ∞.

Are any of 0.4, 0.6 or p = 1/2 special?
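A sketch tabulating the Chebyshev guarantee 1 − 1/(0.04n) against the exact probability, computed from the binomial pmf for a few moderate n:

```python
from math import comb

def exact(n):
    """Exact Pr(0.4n < S_n < 0.6n) for S_n ~ B_{n,1/2} (n a multiple of 10)."""
    lo, hi = 2 * n // 5, 3 * n // 5          # 0.4n and 0.6n as exact integers
    return sum(comb(n, k) for k in range(lo + 1, hi)) * 0.5**n

# The exact probability always dominates the Chebyshev bound.
for n in [100, 500, 1000]:
    bound = 1 - 1 / (0.04 * n)
    print(n, bound, exact(n))
```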

The (weak) law of large numbers

The previous example can be generalized to the following statement about a sequence of Bernoulli(p) trials: for any ε > 0,

    Pr(|Σ_{k=1}^n X_k / n − p| ≥ ε) → 0 as n → ∞.

A further generalization allows us to replace p by E(X_k). Suppose X_1, X_2, ... is a sequence of iid (independent and identically distributed) random variables. Then, with µ = E(X_k),

    Pr(|Σ_{k=1}^n X_k / n − µ| ≥ ε) → 0 as n → ∞.

The proof is essentially identical to the previous one, using Chebyshev's inequality.

The binomial dispersion

S_n is a binomial B_{n,p} random variable. How tightly is it concentrated about its mean? In particular, how large an interval about the mean should we consider in order to guarantee that S_n is in that interval with probability at least 0.99? Can you readily name such an interval? Can we be more frugal?

We know that if we take an interval of length, say, 2n/10 then

    Pr(np − n/10 < S_n < np + n/10) → 1 as n → ∞.

Why is that true?

    np − n/10 < S_n < np + n/10  ⟺  −1/10 < S_n/n − p < 1/10  ⟺  |S_n/n − p| < 1/10,

and by the LLN

    Pr(|S_n/n − p| < 1/10) → 1 as n → ∞.

Therefore, for all sufficiently large n,

    Pr(np − n/10 < S_n < np + n/10) ≥ 0.99.

Two problems: we didn't really say what n is, and we are still being wasteful (as you will see).

Clearly, the question is that of the dispersion of S_n about its mean. Recall that the variance is supposed to (crudely) measure just that. Chebyshev's inequality helps visualize that: for β > 0,

    Pr(|X − E(X)| < β) = 1 − Pr(|X − E(X)| ≥ β) ≥ 1 − V(X)/β².

What should β be in order to make sure that Pr(|X − E(X)| < β) ≥ 0.99? We need V(X)/β² ≤ 0.01, or β ≥ √(100 V(X)) = 10σ(X). Applying this general rule to the binomial S_n we have

    Pr(|S_n − np| < 10√(np(1 − p))) ≥ 0.99.

More generally,

    Pr(|S_n − np| < ασ(S_n)) ≥ 1 − 1/α².

Since σ(S_n) = √(np(1 − p)), this means most of the action takes place in an interval of size c√n about np (before, we had an interval of size proportional to n).

Confidence interval

What happens if we don't know p? We can still repeat the argument above to get:

    Pr(|S_n − np| < 5√n) ≥ 1 − V(S_n)/(5√n)² = 1 − np(1 − p)/(25n) ≥ 1 − (1/4)/25 = 0.99,

since p(1 − p) ≤ 1/4. It follows that for any p and n:

    Pr(|S_n/n − p| < 5/√n) ≥ 0.99.

So with probability at least 0.99, S_n/n is within a distance of 5/√n of its unknown mean, p. This can help us design an experiment to estimate p. For example, suppose that a coin is flipped 2500 times and that S = S_2500 is the number of heads. Then with probability at least 0.99, S/2500 is within 5/50 = 0.1 of p.
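The resulting procedure, with half-width 5/√n, can be sketched in a few lines (Chebyshev-based, so deliberately conservative):

```python
from math import sqrt

def chebyshev_ci(heads, n):
    """99% confidence interval for p from n flips, via Chebyshev.

    Half-width is 5/sqrt(n), valid for every p since p(1-p) <= 1/4.
    """
    p_hat = heads / n
    half = 5 / sqrt(n)
    return (p_hat - half, p_hat + half)

# Example from the text: 750 heads in 2500 flips gives (0.2, 0.4).
lo, hi = chebyshev_ci(750, 2500)
print(lo, hi)  # a fair coin (p = 0.5) lies outside this interval
```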

Equivalently, with probability at least 0.99 the interval (S/2500 − 0.1, S/2500 + 0.1) contains p. Such an interval is called a 99% confidence interval for p.

For example, suppose we see only 750 heads in 2500 flips. Since 750/2500 = 0.3, our 99% confidence interval is (0.3 − 0.1, 0.3 + 0.1) = (0.2, 0.4). We should therefore be quite suspicious of this coin.

Remark. We have been quite careless: all we used to generate our confidence interval was Chebyshev's inequality. Chebyshev's inequality doesn't know that S_n happens to be a binomial random variable: it only uses the mean and the variance of S_n. A more careful analysis would gain us a significantly tighter 99% confidence interval.