STAT/MTHE 353: 4 - More on Expectations and Variances
T. Linder, Queen's University, Winter 2017

Expectations of Sums of Random Variables

Recall that if X_1, ..., X_n are random variables with finite expectations, then

    E(X_1 + X_2 + ... + X_n) = E(X_1) + E(X_2) + ... + E(X_n)

The X_i can be continuous or discrete or of any other type. The expectation on the left-hand side is with respect to the joint distribution of X_1, ..., X_n. The ith expectation on the right-hand side is with respect to the marginal distribution of X_i, i = 1, ..., n.

Often we can write a r.v. X as a sum of simpler random variables. Then E(X) is the sum of the expectations of these simpler random variables.

Example: Consider (X_1, ..., X_r) having a multinomial distribution with parameters n and (p_1, ..., p_r). Compute E(X_i), i = 1, ..., r.

Example: Let (X_1, ..., X_r) have the multivariate hypergeometric distribution with parameters N and n_1, ..., n_r. Compute E(X_i), i = 1, ..., r.

Example: We have two urns. Initially Urn 1 contains n red balls and Urn 2 contains n blue balls. At each stage of the experiment we pick a ball from Urn 1 at random, also pick a ball from Urn 2 at random, and then swap the two balls. Let X = # of red balls in Urn 1 after k stages. Compute E(X) for even k.

Example: (Matching problem) If the integers 1, 2, ..., n are randomly permuted, what is the probability that integer i is in the ith position? What is the expected number of integers in the correct position? (A simulation sketch follows below.)
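As a quick sanity check of the matching-problem answers (by indicators, P(integer i is in position i) = 1/n, so the expected number of matches is n · 1/n = 1), here is a minimal simulation sketch; the choice n = 10 and the trial count are arbitrary, not part of the lecture.

```python
# Simulation sketch: expected number of fixed points of a uniform random
# permutation of 1..n is 1, regardless of n.
import random

def count_fixed_points(n):
    perm = list(range(n))
    random.shuffle(perm)
    return sum(1 for i, v in enumerate(perm) if i == v)

n, trials = 10, 100_000
avg = sum(count_fixed_points(n) for _ in range(trials)) / trials
print(f"average number of fixed points: {avg:.3f} (theory: 1)")
```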
Conditional Expectation

Suppose X = (X_1, ..., X_n)^T and Y = (Y_1, ..., Y_m)^T are two vector random variables defined on the same probability space. The distributions (joint marginals) of X and Y can be described by the pdfs f_X(x) and f_Y(y) (if both X and Y are continuous) or by the pmfs p_X(x) and p_Y(y) (if both are discrete). The joint distribution of the pair (X, Y) can be described by their joint pdf f_{X,Y}(x, y) or joint pmf p_{X,Y}(x, y). The conditional distribution of X given Y = y is described by either the conditional pdf or the conditional pmf

    f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y),    p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y)

Remarks:

(1) In general, X and Y can have different types of distribution (e.g., one is discrete, the other is continuous).

Example: Let n = m = 1 and X = Y + Z, where Y is a Bernoulli(p) r.v. and Z ~ N(0, σ²), and Y and Z are independent. Determine the conditional pdf of X given Y = 0 and Y = 1. Also, determine the pdf of X. (A numerical sketch follows below.)

(2) Not all random variables are either discrete or continuous. Mixed discrete-continuous and even more general distributions are possible, but they are mostly outside the scope of this course.

Definitions:

(1) The conditional expectation of X given Y = y is the mean (expectation) of the distribution of X given Y = y and is denoted by E(X | Y = y).

(2) The conditional variance of X given Y = y is the variance of the distribution of X given Y = y and is denoted by Var(X | Y = y).

If both X and Y are discrete,

    E(X | Y = y) = Σ_x x p_{X|Y}(x|y)    and    Var(X | Y = y) = Σ_x (x − E(X | Y = y))² p_{X|Y}(x|y)

In case both X and Y are continuous, we have

    E(X | Y = y) = ∫ x f_{X|Y}(x|y) dx    and    Var(X | Y = y) = ∫ (x − E(X | Y = y))² f_{X|Y}(x|y) dx

with the integrals taken over the whole real line.

Special case: Assume X and Y are independent. Then (considering the discrete case) p_{X|Y}(x|y) = p_X(x), so that for all y,

    E(X | Y = y) = Σ_x x p_{X|Y}(x|y) = Σ_x x p_X(x) = E(X)

A similar argument shows E(X | Y = y) = E(X) if X and Y are independent continuous random variables.
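For the Bernoulli-plus-Gaussian example above, conditioning on Y gives X | Y = 0 ~ N(0, σ²) and X | Y = 1 ~ N(1, σ²), so the unconditional pdf is the mixture f_X(x) = (1 − p) f_{X|Y}(x|0) + p f_{X|Y}(x|1). A minimal sketch of this computation, with p and σ being arbitrary illustrative values:

```python
# Sketch: pdf of X = Y + Z, Y ~ Bernoulli(p), Z ~ N(0, sigma^2) independent.
import math

def normal_pdf(x, mean, sigma):
    return math.exp(-((x - mean) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def pdf_X(x, p=0.3, sigma=1.0):
    # mixture of the two conditional pdfs, weighted by P(Y=0) and P(Y=1)
    return (1 - p) * normal_pdf(x, 0.0, sigma) + p * normal_pdf(x, 1.0, sigma)

print(pdf_X(0.5))
```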
Notation: Let g(y) = E(X | Y = y). We define the random variable E(X | Y) by setting

    E(X | Y) = g(Y)

Similarly, letting h(y) = Var(X | Y = y), the random variable Var(X | Y) is defined by

    Var(X | Y) = h(Y)

For example, if X and Y are independent, then E(X | Y = y) = E(X) (a constant function of y), so E(X | Y) = E(X).

The following are important properties of conditional expectation. We don't prove them formally, but they should be intuitively clear.

Properties:

(i) (Linearity of conditional expectation) If X_1 and X_2 are random variables with finite expectations, then for all a, b ∈ R,

    E(aX_1 + bX_2 | Y) = a E(X_1 | Y) + b E(X_2 | Y)

(ii) If g: R → R is a function such that E[g(Y)] is finite, then

    E[g(Y) | Y] = g(Y)

and if E[g(Y)X] is finite, then

    E[g(Y)X | Y] = g(Y) E(X | Y)

Theorem 1 (Law of total expectation)

    E(X) = E[E(X | Y)]

Proof: Assume both X and Y are discrete. Then

    E[E(X | Y)] = Σ_y E(X | Y = y) p_Y(y) = Σ_y (Σ_x x p_{X|Y}(x|y)) p_Y(y)
                = Σ_y Σ_x x (p_{X,Y}(x, y) / p_Y(y)) p_Y(y) = Σ_x x Σ_y p_{X,Y}(x, y)
                = Σ_x x p_X(x) = E(X)

Example: Expected value of the geometric distribution...

Lemma 2 (Variance formula)

    Var(X) = E[Var(X | Y)] + Var(E(X | Y))

Proof: Since Var(X | Y = y) is the variance of the conditional distribution of X given Y = y,

    Var(X | Y) = E[X² | Y] − (E[X | Y])²

Taking expectation (with respect to Y),

    E[Var(X | Y)] = E[E[X² | Y]] − E[(E[X | Y])²] = E(X²) − E[(E[X | Y])²]

On the other hand,

    Var(E[X | Y]) = E[(E[X | Y])²] − (E[E[X | Y]])² = E[(E[X | Y])²] − (E(X))²

Adding the two equations,

    E[Var(X | Y)] + Var(E[X | Y]) = E(X²) − (E(X))² = Var(X)
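A numerical check of both Theorem 1 and Lemma 2 on an example of our own choosing (not from the lecture): Y is a fair die roll and, given Y = y, X ~ Binomial(y, 1/2), so E(X | Y) = Y/2 and Var(X | Y) = Y/4.

```python
# Simulation check of E(X) = E[E(X|Y)] and
# Var(X) = E[Var(X|Y)] + Var(E(X|Y)).
import random

random.seed(0)
xs = []
for _ in range(200_000):
    y = random.randint(1, 6)                       # Y: fair die roll
    xs.append(sum(random.random() < 0.5 for _ in range(y)))  # X | Y=y ~ Bin(y, 1/2)

mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
ey, vy = 3.5, 35 / 12                              # mean and variance of a die roll
print(mean_x, ey / 2)                              # both ~ 1.75
print(var_x, ey / 4 + vy / 4)                      # both ~ 1.604
```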
Remarks:

(1) Let A be an event and X the indicator of A:

    X = 1 if A occurs,  X = 0 if A^c occurs

Then E(X) = P(A). Assuming Y is a discrete r.v., we have E(X | Y = y) = P(A | Y = y), and the law of total expectation states

    P(A) = E(X) = Σ_y E(X | Y = y) p_Y(y) = Σ_y P(A | Y = y) p_Y(y)

which is the law of total probability. For continuous Y we have

    P(A) = ∫ E(X | Y = y) f_Y(y) dy = ∫ P(A | Y = y) f_Y(y) dy

(2) The law of total expectation says that we can compute the mean of a distribution by conditioning on another random variable. This distribution can itself be a conditional distribution. For example, for r.v.'s X, Y, and Z,

    E(X | Y = y) = E[E(X | Y, Z) | Y = y]

so that

    E(X | Y) = E[E(X | Y, Z) | Y]

For example, if Z is discrete,

    E(X | Y = y) = Σ_z E(X | Y = y, Z = z) p_{Z|Y}(z|y) = Σ_z E(X | Y = y, Z = z) P(Z = z | Y = y)

Exercise: Prove the above statement if X, Y, and Z are discrete.

Example: Repeatedly flip a biased coin which comes up heads with probability p. Let X denote the number of flips until two consecutive heads occur. Find E(X). (A simulation sketch follows below.)

Solution:

Example: (Simplex algorithm) There are n vertices (points) that are ranked from best to worst. Start from point j and at each step, jump to one of the better points at random (with equal probability). What is the expected number of steps to reach the best point?

Solution:

Minimum mean square error (MMSE) estimation

Suppose a r.v. Y is observed and based on its value we want to guess the value of another r.v. X. Formally, we want to use a function g(Y) of Y to estimate the unobserved X in the sense of minimizing the mean square error

    E[(X − g(Y))²]

It turns out that g*(Y) = E(X | Y) is the optimal choice.

Theorem 3: Suppose X has finite variance. Then for g*(Y) = E(X | Y) and any function g,

    E[(X − g(Y))²] ≥ E[(X − g*(Y))²]
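The consecutive-heads solution is left for lecture; conditioning on the first flips (law of total expectation) gives the standard answer E(X) = (1 + p)/p² for two consecutive heads, which we can check by simulation. The parameter p = 0.5 and trial count below are arbitrary choices.

```python
# Simulation sketch: expected number of flips until two consecutive heads.
import random

def flips_until_two_heads(p):
    flips, run = 0, 0
    while run < 2:
        flips += 1
        run = run + 1 if random.random() < p else 0  # extend or reset the head run
    return flips

p, trials = 0.5, 100_000
avg = sum(flips_until_two_heads(p) for _ in range(trials)) / trials
print(avg, (1 + p) / p**2)   # both close to 6 for p = 0.5
```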
Proof: Use the properties of conditional expectation:

    E[(X − g(Y))² | Y]
      = E[(X − g*(Y) + g*(Y) − g(Y))² | Y]
      = E[(X − g*(Y))² + (g*(Y) − g(Y))² + 2(X − g*(Y))(g*(Y) − g(Y)) | Y]
      = E[(X − g*(Y))² | Y] + (g*(Y) − g(Y))² + 2 E[(X − g*(Y))(g*(Y) − g(Y)) | Y]
      = E[(X − g*(Y))² | Y] + (g*(Y) − g(Y))² + 2(g*(Y) − g(Y)) E[X − g*(Y) | Y]
      = E[(X − g*(Y))² | Y] + (g*(Y) − g(Y))² + 2(g*(Y) − g(Y)) (E(X | Y) − g*(Y))   [the last factor = 0]
      = E[(X − g*(Y))² | Y] + (g*(Y) − g(Y))²

Proof cont'd: Thus

    E[(X − g(Y))² | Y] = E[(X − g*(Y))² | Y] + (g*(Y) − g(Y))²

Take expectations on both sides and use the law of total expectation to obtain

    E[(X − g(Y))²] = E[(X − g*(Y))²] + E[(g*(Y) − g(Y))²]

Since (g*(Y) − g(Y))² ≥ 0, this implies

    E[(X − g(Y))²] ≥ E[(X − g*(Y))²]

Remark: Note that since g*(y) = E(X | Y = y), we have

    E[Var(X | Y)] = E[(X − g*(Y))²]

i.e., E[Var(X | Y)] is the mean square error of the MMSE estimate of X given Y.

Example: Suppose X ~ N(0, σ_X²) and Z ~ N(0, σ_Z²), where X and Z are independent. Here X represents a signal sent from a remote location which is corrupted by noise Z, so that the received signal is Y = X + Z. What is the MMSE estimate of X given Y = y?

Random Sums

Theorem 4 (Wald's equation): Let X_1, X_2, ... be i.i.d. random variables with mean µ. Let N be a r.v. with values in {1, 2, ...} that is independent of the X_i's and has finite mean E(N). Define X = Σ_{i=1}^N X_i. Then

    E(X) = E(N)µ

Proof:

    E(X | N = n) = E(X_1 + ... + X_N | N = n)
                 = E(X_1 + ... + X_n | N = n)
                 = E(X_1 | N = n) + ... + E(X_n | N = n)   (linearity of expectation)
                 = E(X_1) + ... + E(X_n)                   (N and the X_i are independent)
                 = nµ
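A quick simulation check of Wald's equation, with distributions of our own choosing (N uniform on {1, ..., 10}, X_i exponential with mean µ = 2, so E(X) should be 5.5 · 2 = 11):

```python
# Simulation check of E(X_1 + ... + X_N) = E(N) * mu.
import random

random.seed(1)
mu, trials = 2.0, 100_000
total = 0.0
for _ in range(trials):
    n = random.randint(1, 10)                       # N independent of the X_i, E(N) = 5.5
    total += sum(random.expovariate(1 / mu) for _ in range(n))
print(total / trials, 5.5 * mu)                     # both close to 11
```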
Proof cont'd: We obtained E(X | N = n) = nµ for all n = 1, 2, ..., i.e., E(X | N) = Nµ. By the law of total expectation,

    E(X) = E[E(X | N)] = E(Nµ) = E(N)µ

Example: (Branching process) Suppose a population evolves in generations starting from a single individual (generation 0). Each individual of the ith generation produces a random number of offspring; the collection of all offspring of generation i individuals forms generation i + 1. The numbers of offspring born to distinct individuals are independent random variables with mean µ. Let X_n be the number of individuals in the nth generation. Find E(X_n). (A simulation sketch follows below.)

Covariance and Correlation

Covariance

Definition: Let X and Y be two random variables with finite variance. Their covariance is defined by

    Cov(X, Y) = E[(X − E(X))(Y − E(Y))]

Properties:

(1) Expanding the product,

    Cov(X, Y) = E[XY − X E(Y) − E(X) Y + E(X)E(Y)]
              = E(XY) − E[X E(Y)] − E[E(X) Y] + E(X)E(Y)
              = E(XY) − E(X)E(Y) − E(X)E(Y) + E(X)E(Y)
              = E(XY) − E(X)E(Y)

The formula Cov(X, Y) = E(XY) − E(X)E(Y) is often useful in computations.

(2) Cov(X, Y) = Cov(Y, X).

(3) If X = Y we obtain

    Cov(X, X) = E[(X − E(X))²] = Var(X)

(4) For any constants a, b, c, and d,

    Cov(aX + b, cY + d) = E[(aX + b − E(aX + b))(cY + d − E(cY + d))]
                        = E[a(X − E(X)) c(Y − E(Y))]
                        = ac E[(X − E(X))(Y − E(Y))]
                        = ac Cov(X, Y)

(5) If X and Y are independent, then Cov(X, Y) = 0.

Proof: By independence, E(XY) = E(X)E(Y), so

    Cov(X, Y) = E(XY) − E(X)E(Y) = 0

Definition: Let X_1, ..., X_n be random variables with finite variances. The covariance matrix of the vector X = (X_1, ..., X_n)^T is the n × n matrix Cov(X) whose (i, j)th entry is Cov(X_i, X_j).

Remarks:

- The ith diagonal entry of Cov(X) is Var(X_i), i = 1, ..., n.
- Cov(X) is a symmetric matrix since Cov(X_i, X_j) = Cov(X_j, X_i) for all i and j.
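For the branching process, applying Wald's equation generation by generation gives E(X_{n+1}) = µ E(X_n), hence E(X_n) = µ^n. A simulation sketch; the offspring law here (uniform on {0, 1, 2}, so µ = 1) is an assumption made for illustration, not part of the example:

```python
# Branching-process sketch: E(X_n) = mu^n; here mu = 1, so E(X_n) = 1 for all n.
import random

def generation_size(n):
    size = 1                                        # generation 0: one individual
    for _ in range(n):
        size = sum(random.choice([0, 1, 2]) for _ in range(size))
    return size

n, trials = 5, 50_000
avg = sum(generation_size(n) for _ in range(trials)) / trials
print(avg, 1.0**n)                                  # both close to 1
```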
Some properties of covariance are easier to derive using a matrix formalism. Let V = {V_ij ; i = 1, ..., m, j = 1, ..., n} be an m × n matrix of random variables having finite expectations. We define E(V) by taking expectations componentwise: E(V) is the m × n matrix with (i, j)th entry

    (E(V))_ij = E(V_ij),    i = 1, ..., m,  j = 1, ..., n

Now notice that the n × n matrix (X − E(X))(X − E(X))^T has (X_i − E(X_i))(X_j − E(X_j)) as its (i, j)th entry. Thus

    Cov(X) = E[(X − E(X))(X − E(X))^T]

Lemma 5: Let A be an m × n real matrix and define Y = AX (an m-dimensional random vector). Then

    Cov(Y) = A Cov(X) A^T

Proof: First note that by the linearity of expectation, E(Y) = E(AX) = A E(X). Thus

    Cov(Y) = E[(Y − E(Y))(Y − E(Y))^T]
           = E[(AX − A E(X))(AX − A E(X))^T]
           = E[(A(X − E(X)))(A(X − E(X)))^T]
           = E[A(X − E(X))(X − E(X))^T A^T]
           = A E[(X − E(X))(X − E(X))^T] A^T
           = A Cov(X) A^T

For any m-vector c = (c_1, ..., c_m)^T we also have Cov(Y + c) = Cov(Y) since Cov(Y_i + c_i, Y_j + c_j) = Cov(Y_i, Y_j). Thus

    Cov(AX + c) = A Cov(X) A^T

Let m = 1 so that c = c is a scalar and A is a 1 × n matrix, i.e., A is a row vector A = a^T = (a_1, ..., a_n). Then

    Cov(a^T X + c) = Cov(Σ_{i=1}^n a_i X_i + c) = Var(Σ_{i=1}^n a_i X_i + c)

On the other hand,

    Cov(a^T X + c) = a^T Cov(X) a = Σ_{i=1}^n Σ_{j=1}^n a_i a_j Cov(X_i, X_j)
                   = Σ_{i=1}^n a_i² Cov(X_i, X_i) + 2 Σ_{i<j} a_i a_j Cov(X_i, X_j)
                   = Σ_{i=1}^n a_i² Var(X_i) + 2 Σ_{i<j} a_i a_j Cov(X_i, X_j)

Hence

    Var(Σ_{i=1}^n a_i X_i + c) = Σ_{i=1}^n a_i² Var(X_i) + 2 Σ_{i<j} a_i a_j Cov(X_i, X_j)

Note that if X_1, ..., X_n are independent, then Cov(X_i, X_j) = 0 for i ≠ j, and we obtain

    Var(Σ_{i=1}^n a_i X_i + c) = Σ_{i=1}^n a_i² Var(X_i)
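A numerical sanity check of Lemma 5 (with the shift c, which drops out); the matrices and covariance below are arbitrary toy values of our own choosing:

```python
# Check that the sample covariance of Y = AX + c matches A Cov(X) A^T.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[2, 0.5, 0], [0.5, 1, 0.3], [0, 0.3, 1.5]],
                            size=200_000)          # rows are samples of X
A = np.array([[1.0, -1.0, 2.0], [0.5, 1.0, 0.0]])
c = np.array([3.0, -2.0])
Y = X @ A.T + c                                     # Y = AX + c, sample by sample

empirical = np.cov(Y, rowvar=False)
theoretical = A @ np.cov(X, rowvar=False) @ A.T     # the constant shift c drops out
print(np.round(empirical, 2))
print(np.round(theoretical, 2))
```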
More generally, let X = (X_1, ..., X_n)^T and Y = (Y_1, ..., Y_m)^T, and let Cov(X, Y) be the n × m matrix with (i, j)th entry Cov(X_i, Y_j). Note that

    Cov(X, Y) = E[(X − E(X))(Y − E(Y))^T]

If A is a k × n matrix, B is an l × m matrix, c is a k-vector and d is an l-vector, then

    Cov(AX + c, BY + d) = E[(AX + c − E(AX + c))(BY + d − E(BY + d))^T]
                        = A E[(X − E(X))(Y − E(Y))^T] B^T

We obtain

    Cov(AX + c, BY + d) = A Cov(X, Y) B^T

We can now prove the following important property of covariance:

Lemma 6: For any constants a_1, ..., a_n and b_1, ..., b_m,

    Cov(Σ_{i=1}^n a_i X_i, Σ_{j=1}^m b_j Y_j) = Σ_{i=1}^n Σ_{j=1}^m a_i b_j Cov(X_i, Y_j)

i.e., Cov(X, Y) is bilinear.

Proof: Let k = l = 1 and A = a^T = (a_1, ..., a_n) and B = b^T = (b_1, ..., b_m). Then we have

    Cov(Σ_{i=1}^n a_i X_i, Σ_{j=1}^m b_j Y_j) = Cov(a^T X, b^T Y) = Cov(AX, BY)
        = A Cov(X, Y) B^T = a^T Cov(X, Y) b = Σ_{i=1}^n Σ_{j=1}^m a_i b_j Cov(X_i, Y_j)

The following property of covariance is of fundamental importance:

Lemma 7:

    |Cov(X, Y)| ≤ √(Var(X) Var(Y))

Proof: First we prove the Cauchy-Schwarz inequality for random variables U and V with finite variances. Let λ ∈ R; then

    0 ≤ E[(U − λV)²] = E(U² − 2λUV + λ²V²) = E(U²) − 2λ E(UV) + λ² E(V²)

This is a quadratic polynomial in λ which cannot have two distinct real roots.

Proof cont'd: Thus its discriminant cannot be positive:

    4 [E(UV)]² − 4 E(U²) E(V²) ≤ 0

so we obtain

    [E(UV)]² ≤ E(U²) E(V²)

Use this with U = X − E(X) and V = Y − E(Y) to get

    |Cov(X, Y)| = |E[(X − E(X))(Y − E(Y))]| ≤ √(E[(X − E(X))²] E[(Y − E(Y))²]) = √(Var(X) Var(Y))
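A toy check of the bilinearity in Lemma 6 on simulated data (the coefficient vectors and the way Y is made correlated with X are arbitrary choices):

```python
# Check Cov(sum_i a_i X_i, sum_j b_j Y_j) = sum_{i,j} a_i b_j Cov(X_i, Y_j).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100_000, 3))                          # three X components
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=(100_000, 2))  # correlated Y
a, b = np.array([1.0, -2.0, 0.5]), np.array([3.0, 1.0])

lhs = np.cov(X @ a, Y @ b)[0, 1]
C = np.array([[np.cov(X[:, i], Y[:, j])[0, 1] for j in range(2)]
              for i in range(3)])                          # entrywise Cov(X_i, Y_j)
rhs = a @ C @ b
print(lhs, rhs)                                            # agree up to sampling error
```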
Correlation

Recall that Cov(aX, bY) = ab Cov(X, Y). This is an undesirable property if we want to use Cov(X, Y) as a measure of association between X and Y. A proper normalization will solve this problem:

Definition: The correlation coefficient between X and Y having nonzero variances is defined by

    ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

Remarks:

- Since Var(aX + b) = a² Var(X), we have ρ(aX + b, cY + d) = ρ(X, Y) (for ac > 0).
- Letting µ_X = E(X), µ_Y = E(Y), σ_X² = Var(X), σ_Y² = Var(Y), we have

    ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y) = Cov(X − µ_X, Y − µ_Y) / (σ_X σ_Y)
            = Cov((X − µ_X)/σ_X, (Y − µ_Y)/σ_Y)

  Thus ρ(X, Y) is the covariance between the standardized versions of X and Y.
- If X and Y are independent, then Cov(X, Y) = 0, so ρ(X, Y) = 0. On the other hand, ρ(X, Y) = 0 does not imply that X and Y are independent.

Remark: If ρ(X, Y) = 0 we say that X and Y are uncorrelated.

Example: Find random variables X and Y that are uncorrelated but not independent. (A concrete choice is sketched below.)

Example: Covariance and correlation for multinomial random variables...

Theorem 8: The correlation coefficient always satisfies

    |ρ(X, Y)| ≤ 1

Moreover, |ρ(X, Y)| = 1 if and only if Y = aX + b for some constants a and b (a ≠ 0), i.e., Y is an affine function of X.

Proof: We know that |Cov(X, Y)| ≤ √(Var(X) Var(Y)), so |ρ(X, Y)| ≤ 1 always holds. Let's assume now that Y = aX + b, where a ≠ 0. Then

    Cov(X, Y) = Cov(X, aX + b) = Cov(X, aX) = a Cov(X, X) = a Var(X)

so

    ρ(X, Y) = a Var(X) / √(Var(X) a² Var(X)) = a / |a| = ±1

Proof cont'd: Conversely, suppose that ρ(X, Y) = 1. Then

    Var(X/σ_X − Y/σ_Y) = Var(X/σ_X) + Var(Y/σ_Y) − 2 Cov(X/σ_X, Y/σ_Y)
                       = Var(X)/σ_X² + Var(Y)/σ_Y² − 2 Cov(X, Y)/(σ_X σ_Y)
                       = 1 + 1 − 2 ρ(X, Y) = 0

This means that X/σ_X − Y/σ_Y = c for some constant c, so

    Y = (σ_Y/σ_X) X − c σ_Y

If ρ(X, Y) = −1, consider Var(X/σ_X + Y/σ_Y) and use the same proof.

Remark: The previous theorem implies that correlation can be thought of as a measure of linear association (linear dependence) between X and Y. Recall the multinomial example...
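One concrete answer (our own pick) to the "uncorrelated but not independent" exercise: take X ~ N(0, 1) and Y = X². Then Cov(X, Y) = E(X³) − E(X)E(X²) = 0, yet Y is a deterministic function of X. A simulation sketch:

```python
# X and Y = X^2 are uncorrelated but clearly dependent.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=500_000)
Y = X**2
print(np.corrcoef(X, Y)[0, 1])               # ~0: uncorrelated
# Dependence: P(|X| > 1 and Y > 1) differs from P(|X| > 1) * P(Y > 1),
# since {Y > 1} is exactly the event {|X| > 1}.
print(np.mean((np.abs(X) > 1) & (Y > 1)),
      np.mean(np.abs(X) > 1) * np.mean(Y > 1))
```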
Example: (Linear MMSE estimation) Let X and Y be random variables with zero means and finite variances σ_X² > 0 and σ_Y² > 0. Suppose we want to estimate X in the MMSE sense using a linear function of Y; i.e., we are looking for a ∈ R minimizing

    E[(X − aY)²]

Find the minimizing a and determine the resulting minimum mean square error. Relate the results to ρ(X, Y).
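The derivation is left as the exercise above; expanding E[(X − aY)²] as a quadratic in a and minimizing gives the standard answers a* = Cov(X, Y)/σ_Y² and minimum mean square error σ_X² (1 − ρ(X, Y)²). A simulation sketch checking these on a zero-mean pair of our own construction:

```python
# Numerical check of the linear MMSE answers a* = Cov(X,Y)/Var(Y)
# and min MSE = Var(X) * (1 - rho^2).
import numpy as np

rng = np.random.default_rng(4)
Y = rng.normal(size=400_000)
X = 0.8 * Y + rng.normal(scale=0.5, size=Y.size)    # zero-mean, correlated pair

a_star = np.cov(X, Y)[0, 1] / np.var(Y)
mse = np.mean((X - a_star * Y) ** 2)
rho = np.corrcoef(X, Y)[0, 1]
print(a_star)                                       # ~0.8
print(mse, np.var(X) * (1 - rho**2))                # both ~0.25
```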