Expectation
DS-GA 1002 Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16
Carlos Fernandez-Granda
Aim Describe random variables with a few numbers: mean, variance, covariance
Expectation operator Mean and variance Covariance Conditional expectation
Discrete random variables
Average of the values of a function weighted by the pmf:
$E(g(X)) = \sum_{x \in R_X} g(x)\, p_X(x)$
$E(g(X,Y)) = \sum_{x \in R_X} \sum_{y \in R_Y} g(x,y)\, p_{X,Y}(x,y)$
$E(g(\vec{X})) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} g(\vec{x})\, p_{\vec{X}}(\vec{x})$
Continuous random variables
Average of the values of a function weighted by the pdf:
$E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$
$E(g(X,Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)\, f_{X,Y}(x,y)\, dx\, dy$
$E(g(\vec{X})) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\vec{x})\, f_{\vec{X}}(\vec{x})\, dx_1\, dx_2 \cdots dx_n$
Discrete and continuous random variables
$E(g(C,D)) = \sum_{d \in R_D} \int_{-\infty}^{\infty} g(c,d)\, f_C(c)\, p_{D \mid C}(d \mid c)\, dc$
$\quad\quad\quad\quad = \sum_{d \in R_D} \int_{-\infty}^{\infty} g(c,d)\, p_D(d)\, f_{C \mid D}(c \mid d)\, dc$
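These formulas translate directly into code. Below is a minimal sketch (assuming NumPy; the fair-die example is an illustration, not from the slides) that evaluates the discrete formula as a weighted sum:

```python
import numpy as np

# Expectation of g(X) = X^2 for a fair die, straight from the definition:
# E(g(X)) = sum over x of g(x) * p_X(x)
x = np.arange(1, 7)          # possible values
pmf = np.full(6, 1 / 6)      # p_X(x) for a fair die
print(np.sum(x**2 * pmf))    # E(X^2) = 91/6 ~ 15.17
```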
St Petersburg paradox
A casino offers you a game: flip an unbiased coin until it lands on heads
You get $2^k$ dollars, where k = number of flips
Expected gain?
St Petersburg paradox
$E(\text{Gain}) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = \infty$
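A quick simulation makes the paradox tangible: because the expectation is infinite, the empirical average of the payoffs never stabilizes. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def play_once():
    # Flip a fair coin until heads; the payoff is 2^k, k = total number of flips
    k = 1
    while rng.random() < 0.5:   # tails with probability 1/2
        k += 1
    return 2.0 ** k

# The empirical mean drifts upward as n grows, as expected
# for a random variable with infinite mean
for n in (10**3, 10**5):
    print(n, np.mean([play_once() for _ in range(n)]))
```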
Linearity of expectation
For any constants a and b and any functions $g_1$ and $g_2$:
$E(a\, g_1(X,Y) + b\, g_2(X,Y)) = a\, E(g_1(X,Y)) + b\, E(g_2(X,Y))$
Follows from linearity of sums and integrals
Example: Coffee beans Company buys coffee beans from two local producers Beans from Colombia: C tons/year Beans from Vietnam: V tons/year Model: C uniform between 0 and 1 V uniform between 0 and 2 C and V independent What is the expected total amount of beans B?
Example: Coffee beans
$E(B) = E(C + V) = E(C) + E(V) = 0.5 + 1 = 1.5$ tons
Holds even if C and V are not independent
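A simulation sketch of this example (assuming NumPy) confirms both the value and the fact that linearity does not need independence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
C = rng.uniform(0, 1, n)    # Colombian beans, tons/year
V = rng.uniform(0, 2, n)    # Vietnamese beans, tons/year
print(np.mean(C + V))       # ~1.5 = E(C) + E(V)

# Linearity does not require independence: make V fully determined by C
V_dep = 2 * C               # still uniform on [0, 2], but perfectly correlated with C
print(np.mean(C + V_dep))   # still ~1.5
```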
Independence
If X and Y are independent, then $E(g(X)\, h(Y)) = E(g(X))\, E(h(Y))$:
$E(g(X)\, h(Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)\, h(y)\, f_{X,Y}(x,y)\, dx\, dy$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)\, h(y)\, f_X(x)\, f_Y(y)\, dx\, dy$
$= E(g(X))\, E(h(Y))$
Expectation operator Mean and variance Covariance Conditional expectation
Mean
The mean or first moment of X is $E(X)$
It is the center of mass of the distribution
Bernoulli
$E(X) = 0 \cdot p_X(0) + 1 \cdot p_X(1) = p$
Binomial
A binomial random variable is a sum of n independent Bernoulli random variables: $X = \sum_{i=1}^{n} B_i$
$E(X) = E\left(\sum_{i=1}^{n} B_i\right) = \sum_{i=1}^{n} E(B_i) = np$
Mean of important random variables

Random variable | Parameters | Mean
--------------- | ---------- | ----
Bernoulli       | p          | p
Geometric       | p          | 1/p
Binomial        | n, p       | np
Poisson         | λ          | λ
Uniform         | a, b       | (a + b)/2
Exponential     | λ          | 1/λ
Gaussian        | µ, σ       | µ
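The table can be checked empirically by sampling. A minimal sketch, assuming NumPy (note that NumPy's geometric counts trials up to and including the first success, matching the 1/p convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
checks = {
    "Bernoulli(p=0.3), mean p = 0.3":            rng.binomial(1, 0.3, n),
    "Geometric(p=0.2), mean 1/p = 5":            rng.geometric(0.2, n),
    "Binomial(n=20, p=0.5), mean np = 10":       rng.binomial(20, 0.5, n),
    "Poisson(lambda=25), mean 25":               rng.poisson(25, n),
    "Uniform(0, 1), mean (a+b)/2 = 0.5":         rng.uniform(0, 1, n),
    "Exponential(lambda=1), mean 1/lambda = 1":  rng.exponential(1.0, n),
    "Gaussian(mu=0, sigma=1), mean 0":           rng.normal(0, 1, n),
}
for name, samples in checks.items():
    print(f"{name}: {samples.mean():.3f}")
```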
Cauchy random variable
[Figure: pdf of the Cauchy distribution]
$f_X(x) = \frac{1}{\pi(1 + x^2)}$
Cauchy random variable
$E(X) = \int_{-\infty}^{\infty} \frac{x}{\pi(1 + x^2)}\, dx = \int_{0}^{\infty} \frac{x}{\pi(1 + x^2)}\, dx - \int_{0}^{\infty} \frac{x}{\pi(1 + x^2)}\, dx$
Each piece diverges: with the change of variables $t = x^2$,
$\int_{0}^{\infty} \frac{x}{\pi(1 + x^2)}\, dx = \int_{0}^{\infty} \frac{1}{2\pi(1 + t)}\, dt = \lim_{t \to \infty} \frac{\log(1 + t)}{2\pi} = \infty$
so the mean is of the form $\infty - \infty$ and does not exist
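The divergence shows up in simulation: the running average of Cauchy samples never converges, unlike for distributions with a finite mean. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(10**6)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
# The running average keeps jumping around; a single extreme
# sample can move it arbitrarily far at any point
print(running_mean[[99, 9_999, 999_999]])
```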
Mean of a random vector
Vector formed by the means of its components:
$E(\vec{X}) := \begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{pmatrix}$
By linearity of expectation, for any matrix $A \in \mathbb{R}^{m \times n}$ and vector $\vec{b} \in \mathbb{R}^{m}$:
$E(A\vec{X} + \vec{b}) = A\, E(\vec{X}) + \vec{b}$
The mean as a typical value
The mean is often interpreted as a typical value of the random variable, but:
The probability that X equals $E(X)$ can be zero
The mean can be severely distorted by a small subset of extreme values
Density with subset of extreme values
[Figure: pdf of X]
Uniform random variable X with support $[-4.5, 4.5] \cup [99.5, 100.5]$
Density with subset of extreme values
$E(X) = \int_{-4.5}^{4.5} x\, f_X(x)\, dx + \int_{99.5}^{100.5} x\, f_X(x)\, dx = \frac{1}{10} \cdot \frac{100.5^2 - 99.5^2}{2} = 10$
Median
Midpoint of the distribution: a number m such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$
For continuous random variables:
$F_X(m) = \int_{-\infty}^{m} f_X(x)\, dx = \frac{1}{2}$
Density with subset of extreme values
$F_X(m) = \int_{-4.5}^{m} f_X(x)\, dx = \frac{m + 4.5}{10} = \frac{1}{2} \implies m = 0.5$
Density with subset of extreme values
[Figure: pdf with the mean (10) and the median (0.5) marked]
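A sampling sketch of this density (assuming NumPy) reproduces both numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
# With prob 9/10 draw from [-4.5, 4.5], with prob 1/10 from [99.5, 100.5]
extreme = rng.random(n) < 0.1
x = np.where(extreme, rng.uniform(99.5, 100.5, n), rng.uniform(-4.5, 4.5, n))
print(np.mean(x))     # ~10: dragged far to the right by the extreme values
print(np.median(x))   # ~0.5: barely affected
```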
Variance
The mean square or second moment of X is $E(X^2)$
The variance of X is
$\mathrm{Var}(X) := E\left((X - E(X))^2\right) = E\left(X^2 - 2X\, E(X) + E^2(X)\right) = E(X^2) - E^2(X)$
The standard deviation of X is $\sigma_X := \sqrt{\mathrm{Var}(X)}$
Bernoulli
$E(X^2) = 0^2 \cdot p_X(0) + 1^2 \cdot p_X(1) = p$
$\mathrm{Var}(X) = E(X^2) - E^2(X) = p - p^2 = p(1 - p)$
Variance of common random variables

Random variable | Parameters | Variance
--------------- | ---------- | --------
Bernoulli       | p          | p(1 − p)
Geometric       | p          | (1 − p)/p²
Binomial        | n, p       | np(1 − p)
Poisson         | λ          | λ
Uniform         | a, b       | (b − a)²/12
Exponential     | λ          | 1/λ²
Gaussian        | µ, σ       | σ²
[Figures: pmf/pdf of Geometric (p = 0.2), Binomial (n = 20, p = 0.5), Poisson (λ = 25), Uniform [0, 1], Exponential (λ = 1), and Gaussian (µ = 0, σ = 1) random variables]
Variance
The variance operator is not linear, but
$\mathrm{Var}(aX + b) = E\left((aX + b - E(aX + b))^2\right) = E\left((aX + b - a E(X) - b)^2\right) = a^2\, E\left((X - E(X))^2\right) = a^2\, \mathrm{Var}(X)$
Bounding probabilities using expectations Aim: Characterize behavior of X to some extent using E (X ) and Var (X )
Markov's inequality
For any nonnegative random variable X and any a > 0:
$P(X \ge a) \le \frac{E(X)}{a}$
Markov's inequality
Consider the indicator variable $1_{X \ge a}$. Since X is nonnegative, $X - a \cdot 1_{X \ge a} \ge 0$, so
$E(X) \ge a\, E(1_{X \ge a}) = a\, P(X \ge a)$
Age of students at NYU
Mean: 20 years. How many are younger than 30? By Markov's inequality,
$P(A \ge 30) \le \frac{E(A)}{30} = \frac{2}{3}$
so at least 1/3 are younger than 30
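The bound is valid for any nonnegative age distribution with mean 20, but it can be very loose. A minimal sketch with a made-up age model (17 years plus an exponential with mean 3; our assumption, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
ages = 17 + rng.exponential(3.0, 10**6)  # hypothetical nonnegative ages, mean ~20
print(ages.mean())            # ~20
print((ages >= 30).mean())    # true fraction of students aged 30+ (~0.013 here)
print(ages.mean() / 30)       # Markov bound: 2/3 -- valid, but very loose
```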
Chebyshev's inequality
For any constant a > 0:
$P(|X - E(X)| \ge a) \le \frac{\mathrm{Var}(X)}{a^2}$
Corollary: if $\mathrm{Var}(X) = 0$ then $P(X \ne E(X)) = 0$, since for any $\epsilon > 0$
$P(|X - E(X)| \ge \epsilon) \le \frac{\mathrm{Var}(X)}{\epsilon^2} = 0$
Chebyshev's inequality
Define $Y := (X - E(X))^2$. By Markov's inequality,
$P(|X - E(X)| \ge a) = P(Y \ge a^2) \le \frac{E(Y)}{a^2} = \frac{\mathrm{Var}(X)}{a^2}$
Age of students at NYU
Mean: 20 years, standard deviation: 3 years. How many are younger than 30? By Chebyshev's inequality,
$P(A \ge 30) \le P(|A - 20| \ge 10) \le \frac{\mathrm{Var}(A)}{100} = \frac{9}{100}$
so at least 91% are younger than 30
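Using the second moment tightens the bound considerably. Continuing the same hypothetical age model, whose standard deviation also happens to be 3:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = 17 + rng.exponential(3.0, 10**6)  # mean ~20, std ~3
print((ages >= 30).mean())    # true fraction (~0.013)
print(ages.var() / 10**2)     # Chebyshev bound: ~9/100, far tighter than Markov's 2/3
```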
Expectation operator Mean and variance Covariance Conditional expectation
Covariance
The covariance of X and Y is
$\mathrm{Cov}(X,Y) := E\left((X - E(X))(Y - E(Y))\right) = E\left(XY - Y E(X) - X E(Y) + E(X) E(Y)\right) = E(XY) - E(X)\, E(Y)$
If $\mathrm{Cov}(X,Y) = 0$, X and Y are uncorrelated
Covariance
[Figures: scatter plots of samples with Cov(X, Y) = 0.5, 0.9, 0.99 and Cov(X, Y) = 0, −0.9, −0.99]
Variance of the sum
$\mathrm{Var}(X + Y) = E\left((X + Y - E(X + Y))^2\right)$
$= E\left((X - E(X))^2\right) + E\left((Y - E(Y))^2\right) + 2\, E\left((X - E(X))(Y - E(Y))\right)$
$= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\, \mathrm{Cov}(X, Y)$
If X and Y are uncorrelated, then $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
Independence implies uncorrelation
$\mathrm{Cov}(X, Y) = E(XY) - E(X)\, E(Y) = E(X)\, E(Y) - E(X)\, E(Y) = 0$
Uncorrelation does not imply independence
X and Y are independent Bernoulli random variables with parameter $\frac{1}{2}$
Let $U = X + Y$ and $V = X - Y$
Are U and V independent? Are they uncorrelated?
Uncorrelation does not imply independence
$p_U(0) = P(X = 0, Y = 0) = \frac{1}{4}$
$p_V(0) = P(X = 1, Y = 1) + P(X = 0, Y = 0) = \frac{1}{2}$
$p_{U,V}(0, 0) = P(X = 0, Y = 0) = \frac{1}{4}$
$p_U(0)\, p_V(0) = \frac{1}{8} \ne p_{U,V}(0, 0)$, so U and V are not independent
Uncorrelation does not imply independence
$\mathrm{Cov}(U, V) = E(UV) - E(U)\, E(V) = E((X + Y)(X - Y)) - E(X + Y)\, E(X - Y) = E(X^2) - E(Y^2) - E^2(X) + E^2(Y) = 0$
because X and Y have the same distribution, so U and V are uncorrelated
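A simulation sketch (assuming NumPy) confirms both conclusions at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
X = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.5, n)
U, V = X + Y, X - Y
print(np.cov(U, V)[0, 1])                 # ~0: uncorrelated
print(np.mean((U == 0) & (V == 0)))       # ~1/4
print(np.mean(U == 0) * np.mean(V == 0))  # ~1/8 != 1/4, so not independent
```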
Correlation coefficient
The Pearson correlation coefficient of X and Y is
$\rho_{X,Y} := \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$
It is the covariance between $X/\sigma_X$ and $Y/\sigma_Y$
Correlation coefficient
[Figures: scatter plots for $\sigma_Y = 1$, Cov(X, Y) = 0.9, $\rho_{X,Y} = 0.9$; $\sigma_Y = 3$, Cov(X, Y) = 0.9, $\rho_{X,Y} = 0.3$; $\sigma_Y = 3$, Cov(X, Y) = 2.7, $\rho_{X,Y} = 0.9$]
Cauchy-Schwarz inequality
For any X and Y:
$|E(XY)| \le \sqrt{E(X^2)\, E(Y^2)}$
with equality in the two extreme cases:
$E(XY) = -\sqrt{E(X^2)\, E(Y^2)} \iff Y = -\sqrt{\frac{E(Y^2)}{E(X^2)}}\, X$
$E(XY) = \sqrt{E(X^2)\, E(Y^2)} \iff Y = \sqrt{\frac{E(Y^2)}{E(X^2)}}\, X$
Cauchy-Schwarz inequality
We have $|\mathrm{Cov}(X, Y)| \le \sigma_X\, \sigma_Y$, or equivalently $|\rho_{X,Y}| \le 1$
In addition, $|\rho_{X,Y}| = 1 \iff Y = cX + d$, where
$c := \begin{cases} \frac{\sigma_Y}{\sigma_X} & \text{if } \rho_{X,Y} = 1 \\ -\frac{\sigma_Y}{\sigma_X} & \text{if } \rho_{X,Y} = -1 \end{cases} \qquad d := E(Y) - c\, E(X)$
Covariance matrix of a random vector
The covariance matrix of $\vec{X}$ is defined as
$\Sigma_{\vec{X}} := \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Var}(X_n) \end{pmatrix} = E\left(\vec{X}\vec{X}^T\right) - E(\vec{X})\, E(\vec{X})^T$
Covariance matrix after a linear transformation
$\Sigma_{A\vec{X} + \vec{b}} = E\left((A\vec{X} + \vec{b})(A\vec{X} + \vec{b})^T\right) - E(A\vec{X} + \vec{b})\, E(A\vec{X} + \vec{b})^T$
$= A\, E(\vec{X}\vec{X}^T)\, A^T + \vec{b}\, E(\vec{X})^T A^T + A\, E(\vec{X})\, \vec{b}^T + \vec{b}\,\vec{b}^T - A\, E(\vec{X})\, E(\vec{X})^T A^T - A\, E(\vec{X})\, \vec{b}^T - \vec{b}\, E(\vec{X})^T A^T - \vec{b}\,\vec{b}^T$
$= A\left(E(\vec{X}\vec{X}^T) - E(\vec{X})\, E(\vec{X})^T\right) A^T$
$= A\, \Sigma_{\vec{X}}\, A^T$
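This identity is easy to verify numerically. A minimal sketch, assuming NumPy, with an arbitrary Σ, A, and b chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=10**6)  # rows are samples
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([5.0, -1.0])
Y = X @ A.T + b
print(np.cov(Y, rowvar=False))   # empirical covariance of A X + b
print(A @ Sigma @ A.T)           # matches A Sigma A^T; b has no effect
```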
Variance in a fixed direction
For any unit vector $\vec{u}$:
$\mathrm{Var}(\vec{u}^T \vec{X}) = \vec{u}^T\, \Sigma_{\vec{X}}\, \vec{u}$
Direction of maximum variance
To find the direction of maximum variance we must solve
$\arg\max_{\|\vec{u}\|_2 = 1} \vec{u}^T\, \Sigma_{\vec{X}}\, \vec{u}$
Linear algebra
Symmetric matrices have orthogonal eigenvectors:
$\Sigma_{\vec{X}} = U \Lambda U^T = \begin{pmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix} \begin{pmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{pmatrix}^T$
Linear algebra
For a symmetric matrix A with eigendecomposition as above:
$\lambda_1 = \max_{\|\vec{u}\|_2 = 1} \vec{u}^T A\, \vec{u}, \qquad \vec{u}_1 = \arg\max_{\|\vec{u}\|_2 = 1} \vec{u}^T A\, \vec{u}$
$\lambda_k = \max_{\|\vec{u}\|_2 = 1,\ \vec{u} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{u}^T A\, \vec{u}, \qquad \vec{u}_k = \arg\max_{\|\vec{u}\|_2 = 1,\ \vec{u} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{u}^T A\, \vec{u}$
Direction of maximum variance
[Figures: scatter plots with principal directions for $\lambda_1 = 1.22$, $\lambda_2 = 0.71$; $\lambda_1 = 1$, $\lambda_2 = 1$; $\lambda_1 = 1.38$, $\lambda_2 = 0.32$]
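Numerically, the direction of maximum variance drops out of an eigendecomposition of the sample covariance matrix. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=10**5)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending order
u1 = eigvecs[:, -1]            # eigenvector of the largest eigenvalue
print(eigvals[-1])             # lambda_1, the maximum variance
print(np.var(X @ u1))          # variance along u1: the same value
```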
Whitening
Let $\Sigma_{\vec{X}} = U \Lambda U^T$ be full rank. Then all the entries of $\Lambda^{-1/2}\, U^T \vec{X}$, where
$\Lambda^{-1/2} := \begin{pmatrix} \frac{1}{\sqrt{\lambda_1}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\lambda_2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{\lambda_n}} \end{pmatrix}$
are uncorrelated
Whitening
$\Sigma_{\Lambda^{-1/2} U^T \vec{X}} = \Lambda^{-1/2}\, U^T\, \Sigma_{\vec{X}}\, U\, \Lambda^{-1/2} = \Lambda^{-1/2}\, U^T U \Lambda U^T U\, \Lambda^{-1/2} = \Lambda^{-1/2}\, \Lambda\, \Lambda^{-1/2} = I$
because $U^T U = I$
Whitening
[Figures: scatter plots of $\vec{X}$, $U^T \vec{X}$, and $\Lambda^{-1/2}\, U^T \vec{X}$]
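Whitening is a one-liner once the eigendecomposition is available. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=10**5)
eigvals, U = np.linalg.eigh(np.cov(X, rowvar=False))
W = np.diag(1.0 / np.sqrt(eigvals)) @ U.T   # Lambda^{-1/2} U^T
Z = X @ W.T                                 # whitened samples
print(np.cov(Z, rowvar=False))              # ~identity: entries are uncorrelated
```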
For Gaussian random variables, uncorrelation implies mutual independence
Uncorrelation implies
$\Sigma_{\vec{X}} = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}$
which in turn implies
$f_{\vec{X}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n\, |\Sigma|}} \exp\left(-\frac{1}{2}(\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right) = \prod_{i=1}^{n} f_{X_i}(x_i)$
Expectation operator Mean and variance Covariance Conditional expectation
Conditional expectation
What is the expectation of $g(X, Y)$ given $X = x$?
$E(g(X, Y) \mid X = x) = \int_{-\infty}^{\infty} g(x, y)\, f_{Y \mid X}(y \mid x)\, dy$
This can be interpreted as a function $h(x) := E(g(X, Y) \mid X = x)$
The conditional expectation of $g(X, Y)$ given X is $E(g(X, Y) \mid X) := h(X)$
It is a random variable
Iterated expectation
For any X and Y and any function $g : \mathbb{R}^2 \to \mathbb{R}$:
$E(g(X, Y)) = E\left(E(g(X, Y) \mid X)\right)$
Iterated expectation
$h(x) := E(g(X, Y) \mid X = x) = \int_{-\infty}^{\infty} g(x, y)\, f_{Y \mid X}(y \mid x)\, dy$
$E\left(E(g(X, Y) \mid X)\right) = E(h(X)) = \int_{-\infty}^{\infty} h(x)\, f_X(x)\, dx = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_X(x)\, f_{Y \mid X}(y \mid x)\, g(x, y)\, dy\, dx = E(g(X, Y))$
Example: Desert Car traveling through the desert Time until the car breaks down: T State of the motor: M State of the road: R Model: M uniform between 0 (no problem) and 1 (very bad) R uniform between 0 (no problem) and 1 (very bad) M and R independent T exponential with parameter M + R
Example: Desert
$E(T) = E(E(T \mid M, R)) = E\left(\frac{1}{M + R}\right) = \int_0^1 \int_0^1 \frac{1}{m + r}\, dm\, dr = \int_0^1 \left(\log(r + 1) - \log(r)\right) dr = \log 4 \approx 1.39$
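A Monte Carlo sketch (assuming NumPy, which parametrizes the exponential by its scale 1/λ) confirms the value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
M = rng.uniform(0, 1, n)               # motor state
R = rng.uniform(0, 1, n)               # road state
T = rng.exponential(1.0 / (M + R))     # scale = 1 / rate, per sample
print(T.mean())                        # ~log(4) = 1.386
```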
Grizzlies in Yellowstone
Model for the weight of grizzly bears in Yellowstone:
Males: Gaussian with µ := 240 kg and σ := 40 kg
Females: Gaussian with µ := 140 kg and σ := 20 kg
There are about the same number of females and males
Grizzlies in Yellowstone
$E(W) = E(E(W \mid S)) = \frac{E(W \mid S = \text{male}) + E(W \mid S = \text{female})}{2} = \frac{240 + 140}{2} = 190$ kg
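The mixture is easy to simulate. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
male = rng.random(n) < 0.5   # equal numbers of males and females
W = np.where(male, rng.normal(240, 40, n), rng.normal(140, 20, n))
print(W.mean())              # ~(240 + 140) / 2 = 190 kg
```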
Bayesian coin flip
Bayesian methods often endow parameters of discrete distributions with a continuous marginal distribution
You suspect a coin is biased; since you are uncertain about the bias, you model it as a random variable B with pdf
$f_B(b) = 2b$ for $b \in [0, 1]$
What is the expected value of the coin flip X?
Bayesian coin flip
$E(X) = E(E(X \mid B)) = E(B) = \int_0^1 2b^2\, db = \frac{2}{3}$
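A final simulation sketch (assuming NumPy): sample the bias B by inverse-CDF sampling, then flip the coin:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
B = np.sqrt(rng.random(n))   # inverse-CDF sampling: F_B(b) = b^2 on [0, 1]
X = rng.random(n) < B        # flip a coin with bias B
print(X.mean())              # ~2/3
```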