
Expectation. DS GA 1002 Probability and Statistics for Data Science. Carlos Fernandez-Granda. http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17

Aim: describe random variables with a few numbers: mean, variance, covariance.

Outline: expectation operator, mean and variance, covariance, conditional expectation.

Discrete random variables. The expectation is the average of the values of a function weighted by the pmf: $E(g(X)) = \sum_{x \in R_X} g(x)\, p_X(x)$, $E(g(X,Y)) = \sum_{x \in R_X} \sum_{y \in R_Y} g(x,y)\, p_{X,Y}(x,y)$, and for a random vector $\vec{X} = (X_1, X_2, \dots, X_n)$, $E(g(\vec{X})) = \sum_{\vec{x} \in R_{\vec{X}}} g(\vec{x})\, p_{\vec{X}}(\vec{x})$.

Continuous random variables. The expectation is the average of the values of a function weighted by the pdf: $E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$, $E(g(X,Y)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\, f_{X,Y}(x,y)\, dx\, dy$, and $E(g(\vec{X})) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\vec{x})\, f_{\vec{X}}(\vec{x})\, dx_1\, dx_2 \cdots dx_n$.

Discrete and continuous random variables. If $C$ is continuous and $D$ is discrete, $E(g(C,D)) = \sum_{d \in R_D} \int_{c=-\infty}^{\infty} g(c,d)\, f_C(c)\, p_{D|C}(d|c)\, dc = \sum_{d \in R_D} \int_{c=-\infty}^{\infty} g(c,d)\, p_D(d)\, f_{C|D}(c|d)\, dc$.

St Petersburg paradox. A casino offers you a game: flip an unbiased coin until it lands on heads; you receive $2^k$ dollars, where $k$ is the number of flips. What is the expected gain?

St Petersburg paradox. $E(\text{Gain}) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = \infty$.
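The divergence is easy to see empirically. Below is a minimal Monte Carlo sketch (assuming Python with numpy; the seed and sample sizes are arbitrary choices): the average payoff keeps growing with the number of simulated games instead of stabilizing.

```python
import numpy as np

rng = np.random.default_rng(0)

def st_petersburg_payoffs(n_games):
    # Number of flips until the first head is geometric with p = 1/2;
    # the payoff of each game is 2**k dollars, where k is that number of flips.
    k = rng.geometric(0.5, size=n_games)
    return 2.0 ** k

for n in [10**3, 10**5, 10**7]:
    print(n, st_petersburg_payoffs(n).mean())
# The empirical average keeps growing with n, consistent with an infinite expectation.
```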

Linearity of expectation. For any constants $a$ and $b$ and any functions $g_1$ and $g_2$, $E(a\, g_1(X,Y) + b\, g_2(X,Y)) = a\, E(g_1(X,Y)) + b\, E(g_2(X,Y))$. This follows from linearity of sums and integrals: $\sum_{x \in R_X} \sum_{y \in R_Y} \left(a\, g_1(x,y) + b\, g_2(x,y)\right) p_{X,Y}(x,y) = a \sum_{x \in R_X} \sum_{y \in R_Y} g_1(x,y)\, p_{X,Y}(x,y) + b \sum_{x \in R_X} \sum_{y \in R_Y} g_2(x,y)\, p_{X,Y}(x,y)$.

Example: coffee beans. A company buys coffee beans from two local producers: beans from Colombia, $C$ tons/year, and beans from Vietnam, $V$ tons/year. Model: $C$ uniform between 0 and 1, $V$ uniform between 0 and 2, $C$ and $V$ independent. What is the expected total amount of beans $B = C + V$?

Example: coffee beans. $E(C + V) = E(C) + E(V) = 0.5 + 1 = 1.5$ tons. This holds even if $C$ and $V$ are not independent.
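A quick simulation (a sketch assuming numpy; the sample size is an illustrative choice) confirms the computation, and also that linearity does not require independence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
C = rng.uniform(0, 1, n)       # Colombian beans, uniform on [0, 1]
V = rng.uniform(0, 2, n)       # Vietnamese beans, uniform on [0, 2], independent of C
print((C + V).mean())          # close to E(C) + E(V) = 0.5 + 1 = 1.5

V_dep = 2 * C                  # fully dependent on C, but still uniform on [0, 2]
print((C + V_dep).mean())      # still close to 1.5: linearity needs no independence
```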

Independence. If $X$ and $Y$ are independent, then $E(g(X)\, h(Y)) = E(g(X))\, E(h(Y))$: $E(g(X)\, h(Y)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)\, h(y)\, f_{X,Y}(x,y)\, dx\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)\, h(y)\, f_X(x)\, f_Y(y)\, dx\, dy = E(g(X))\, E(h(Y))$.

Mean and variance

Mean. The mean or first moment of $X$ is $E(X)$. It is the center of mass of the distribution.

Bernoulli. $E(X) = 0 \cdot p_X(0) + 1 \cdot p_X(1) = p$.

Binomial. A binomial random variable is a sum of $n$ Bernoulli random variables, $X = \sum_{i=1}^{n} B_i$, so $E(X) = E\left(\sum_{i=1}^{n} B_i\right) = \sum_{i=1}^{n} E(B_i) = np$.
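As a sanity check, the same identity can be verified by simulation (a sketch assuming numpy; $n$, $p$ and the number of draws are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3
bernoullis = rng.random((10**5, n)) < p   # 10**5 rows of n independent Bernoulli(p) variables
X = bernoullis.sum(axis=1)                # each row sum is a Binomial(n, p) draw
print(X.mean(), n * p)                    # empirical mean vs. np
```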

Mean of important random variables:
Random variable | Parameters | Mean
Bernoulli | p | p
Geometric | p | 1/p
Binomial | n, p | np
Poisson | λ | λ
Uniform | a, b | (a+b)/2
Exponential | λ | 1/λ
Gaussian | µ, σ | µ

Cauchy random variable. $f_X(x) = \frac{1}{\pi(1 + x^2)}$. [Plot of the pdf for $-10 \le x \le 10$.]

Cauchy random variable. $E(X) = \int_{-\infty}^{\infty} \frac{x}{\pi(1+x^2)}\, dx = \int_{0}^{\infty} \frac{x}{\pi(1+x^2)}\, dx - \int_{0}^{\infty} \frac{x}{\pi(1+x^2)}\, dx$, but with the change of variables $t = x^2$, $\int_{0}^{\infty} \frac{x}{\pi(1+x^2)}\, dx = \int_{0}^{\infty} \frac{1}{2\pi(1+t)}\, dt = \lim_{t \to \infty} \frac{\log(1+t)}{2\pi} = \infty$, so the expression is of the form $\infty - \infty$ and the mean is not well defined.
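The undefined mean has a practical consequence: averages of Cauchy samples do not settle down as the sample size grows. A minimal sketch (assuming numpy; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(10**6)
# Running averages of Cauchy samples never stabilize, because there is
# no mean for them to converge to (the law of large numbers does not apply).
for n in [10**2, 10**4, 10**6]:
    print(n, x[:n].mean())
```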

Mean of a random vector. The mean of a random vector is the vector formed by the means of its components: $E(\vec{X}) := \left(E(X_1), E(X_2), \dots, E(X_n)\right)^T$. By linearity of expectation, for any matrix $A \in \mathbb{R}^{m \times n}$ and $\vec{b} \in \mathbb{R}^m$, $E(A\vec{X} + \vec{b}) = A\, E(\vec{X}) + \vec{b}$.

The mean as a typical value. The mean is a typical value of the random variable, but the probability that $X$ equals $E(X)$ can be zero, and the mean can be severely distorted by a subset of extreme values.

Density with a subset of extreme values. Uniform random variable $X$ with support $[-4.5, 4.5] \cup [99.5, 100.5]$. [Plot of the pdf: $f_X(x) = 0.1$ on the support, 0 elsewhere.]

Density with a subset of extreme values. $E(X) = \int_{-4.5}^{4.5} x\, f_X(x)\, dx + \int_{99.5}^{100.5} x\, f_X(x)\, dx = 0 + \frac{1}{10} \cdot \frac{100.5^2 - 99.5^2}{2} = 10$.


Median. The median is a midpoint of the distribution: a number $m$ such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$. For continuous random variables, $F_X(m) = \int_{-\infty}^{m} f_X(x)\, dx = \frac{1}{2}$.

Density with a subset of extreme values. $F_X(m) = \int_{-4.5}^{m} f_X(x)\, dx = \frac{m + 4.5}{10} = \frac{1}{2}$, so $m = 0.5$.

Density with a subset of extreme values. [Plot of the pdf with the mean (10) and the median (0.5) marked: the median sits in the bulk of the distribution, while the mean is pulled toward the extreme values.]
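The gap between the two summaries is easy to reproduce by sampling from this density (a sketch assuming numpy; the mixture construction mirrors the support described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
# X is uniform on [-4.5, 4.5] with probability 9/10 and on [99.5, 100.5] with
# probability 1/10, matching a density equal to 0.1 on the union of the intervals.
extreme = rng.random(n) < 0.1
x = np.where(extreme, rng.uniform(99.5, 100.5, n), rng.uniform(-4.5, 4.5, n))
print(x.mean())      # close to 10
print(np.median(x))  # close to 0.5
```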

Variance. The mean square or second moment of $X$ is $E(X^2)$. The variance of $X$ is $\mathrm{Var}(X) := E\left((X - E(X))^2\right) = E\left(X^2 - 2X\,E(X) + E^2(X)\right) = E(X^2) - E^2(X)$. The standard deviation of $X$ is $\sigma_X := \sqrt{\mathrm{Var}(X)}$.

Bernoulli. $E(X^2) = 0 \cdot p_X(0) + 1 \cdot p_X(1) = p$, so $\mathrm{Var}(X) = E(X^2) - E^2(X) = p - p^2 = p(1-p)$.

Variance of common random variables:
Random variable | Parameters | Variance
Bernoulli | p | p(1-p)
Geometric | p | (1-p)/p²
Binomial | n, p | np(1-p)
Poisson | λ | λ
Uniform | a, b | (b-a)²/12
Exponential | λ | 1/λ²
Gaussian | µ, σ | σ²

[Plots of the pmf or pdf of the following distributions: Geometric (p = 0.2), Binomial (n = 20, p = 0.5), Poisson (λ = 25), Uniform on [0, 1], Exponential (λ = 1), Gaussian (µ = 0, σ = 1).]

Variance. The variance operator is not linear, but $\mathrm{Var}(aX + b) = E\left((aX + b - E(aX + b))^2\right) = E\left((aX + b - a\,E(X) - b)^2\right) = a^2\, E\left((X - E(X))^2\right) = a^2\, \mathrm{Var}(X)$.

Bounding probabilities using expectations. Aim: characterize the behavior of $X$ to some extent using $E(X)$ and $\mathrm{Var}(X)$.

Markov's inequality. For any nonnegative random variable $X$ and any $a > 0$, $P(X \ge a) \le \frac{E(X)}{a}$.

Markov's inequality. Consider the indicator variable $1_{X \ge a}$. Since $X$ is nonnegative, $X - a\, 1_{X \ge a} \ge 0$, so $E(X) \ge a\, E(1_{X \ge a}) = a\, P(X \ge a)$.

Age of students at NYU. Mean: 20 years. How many are younger than 30? By Markov's inequality, $P(A \ge 30) \le \frac{E(A)}{30} = \frac{2}{3}$, so at least 1/3 are younger than 30.
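Markov's inequality holds for any nonnegative age distribution with mean 20. The sketch below (assuming numpy; the Gamma distribution is a hypothetical choice, not part of the example) compares the actual tail probability of one such distribution with the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical nonnegative age distribution with mean shape * scale = 20.
ages = rng.gamma(shape=10.0, scale=2.0, size=10**6)
print((ages >= 30).mean())   # empirical P(A >= 30), well below the bound
print(20 / 30)               # Markov bound E(A) / 30 = 2/3
```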

Chebyshev's inequality. For any positive constant $a > 0$, $P(|X - E(X)| \ge a) \le \frac{\mathrm{Var}(X)}{a^2}$. Corollary: if $\mathrm{Var}(X) = 0$ then $P(X \ne E(X)) = 0$, since for any $\epsilon > 0$, $P(|X - E(X)| \ge \epsilon) \le \frac{\mathrm{Var}(X)}{\epsilon^2} = 0$.

Chebyshev's inequality. Define $Y := (X - E(X))^2$. By Markov's inequality, $P(|X - E(X)| \ge a) = P(Y \ge a^2) \le \frac{E(Y)}{a^2} = \frac{\mathrm{Var}(X)}{a^2}$.

Age of students at NYU. Mean: 20 years, standard deviation: 3 years. How many are younger than 30? By Chebyshev's inequality, $P(A \ge 30) \le P(|A - 20| \ge 10) \le \frac{\mathrm{Var}(A)}{100} = \frac{9}{100}$, so at least 91% are younger than 30.
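Again, the bound applies to any distribution with this mean and standard deviation; the sketch below (assuming numpy; the Gaussian is a hypothetical stand-in for the age distribution) shows that it is usually quite loose:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ages with mean 20 and standard deviation 3.
ages = rng.normal(20, 3, 10**6)
print((np.abs(ages - 20) >= 10).mean())   # empirical P(|A - 20| >= 10)
print(9 / 100)                            # Chebyshev bound Var(A) / 10^2
```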

Covariance

Covariance. The covariance of $X$ and $Y$ is $\mathrm{Cov}(X,Y) := E\left((X - E(X))(Y - E(Y))\right) = E\left(XY - Y\,E(X) - X\,E(Y) + E(X)\,E(Y)\right) = E(XY) - E(X)\,E(Y)$. If $\mathrm{Cov}(X,Y) = 0$, $X$ and $Y$ are uncorrelated.

Covariance. [Scatter plots of samples from pairs of random variables with $\mathrm{Cov}(X,Y)$ equal to 0.5, 0.9, 0.99, 0, $-0.9$ and $-0.99$.]

Variance of the sum. $\mathrm{Var}(X+Y) = E\left((X + Y - E(X+Y))^2\right) = E\left((X - E(X))^2\right) + E\left((Y - E(Y))^2\right) + 2\,E\left((X - E(X))(Y - E(Y))\right) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$. If $X$ and $Y$ are uncorrelated, then $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

Independence implies uncorrelation. $\mathrm{Cov}(X,Y) = E(XY) - E(X)\,E(Y) = E(X)\,E(Y) - E(X)\,E(Y) = 0$.

Uncorrelation does not imply independence. Let $X$ and $Y$ be independent Bernoulli random variables with parameter $\frac{1}{2}$, and let $U = X + Y$ and $V = X - Y$. Are $U$ and $V$ independent? Are they uncorrelated?

Uncorrelation does not imply independence. $p_U(0) = P(X=0, Y=0) = \frac{1}{4}$, $p_V(0) = P(X=1, Y=1) + P(X=0, Y=0) = \frac{1}{2}$, and $p_{U,V}(0,0) = P(X=0, Y=0) = \frac{1}{4}$. Since $p_U(0)\, p_V(0) = \frac{1}{8} \ne p_{U,V}(0,0)$, $U$ and $V$ are not independent.

Uncorrelation does not imply independence. $\mathrm{Cov}(U,V) = E(UV) - E(U)\,E(V) = E\left((X+Y)(X-Y)\right) - E(X+Y)\,E(X-Y) = E(X^2) - E(Y^2) - E^2(X) + E^2(Y) = 0$, so $U$ and $V$ are uncorrelated.
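A short simulation (a sketch assuming numpy) reproduces both conclusions: the empirical covariance of $U$ and $V$ is essentially zero, while the joint pmf at $(0,0)$ does not factor into the product of the marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
X = rng.integers(0, 2, n)     # Bernoulli(1/2)
Y = rng.integers(0, 2, n)     # Bernoulli(1/2), independent of X
U, V = X + Y, X - Y
print(np.cov(U, V)[0, 1])                  # close to 0: uncorrelated
print(((U == 0) & (V == 0)).mean())        # close to 1/4
print((U == 0).mean() * (V == 0).mean())   # close to 1/8: not independent
```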

Correlation coefficient. The Pearson correlation coefficient of $X$ and $Y$ is $\rho_{X,Y} := \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$. It is the covariance between $X/\sigma_X$ and $Y/\sigma_Y$.

Correlation coefficient. [Scatter plots for $\sigma_X = 1$ and: $\sigma_Y = 1$, $\mathrm{Cov}(X,Y) = 0.9$, $\rho_{X,Y} = 0.9$; $\sigma_Y = 3$, $\mathrm{Cov}(X,Y) = 0.9$, $\rho_{X,Y} = 0.3$; $\sigma_Y = 3$, $\mathrm{Cov}(X,Y) = 2.7$, $\rho_{X,Y} = 0.9$.]

Cauchy-Schwarz inequality. For any $X$ and $Y$, $|E(XY)| \le \sqrt{E(X^2)\,E(Y^2)}$. Moreover, $E(XY) = \sqrt{E(X^2)\,E(Y^2)}$ if and only if $Y = \sqrt{\frac{E(Y^2)}{E(X^2)}}\, X$, and $E(XY) = -\sqrt{E(X^2)\,E(Y^2)}$ if and only if $Y = -\sqrt{\frac{E(Y^2)}{E(X^2)}}\, X$.

Cauchy-Schwarz inequality. It follows that $|\mathrm{Cov}(X,Y)| \le \sigma_X \sigma_Y$, or equivalently $|\rho_{X,Y}| \le 1$. In addition, $|\rho_{X,Y}| = 1$ if and only if $Y = cX + d$, where $c := \sigma_Y/\sigma_X$ if $\rho_{X,Y} = 1$, $c := -\sigma_Y/\sigma_X$ if $\rho_{X,Y} = -1$, and $d := E(Y) - c\,E(X)$.

Covariance matrix of a random vector. The covariance matrix of $\vec{X}$ is defined as
$\Sigma_{\vec{X}} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Var}(X_n) \end{bmatrix} = E\left(\vec{X}\vec{X}^T\right) - E\left(\vec{X}\right) E\left(\vec{X}\right)^T$.

Covariance matrix after a linear transformation.
$\Sigma_{A\vec{X}+\vec{b}} = E\left((A\vec{X}+\vec{b})(A\vec{X}+\vec{b})^T\right) - E\left(A\vec{X}+\vec{b}\right) E\left(A\vec{X}+\vec{b}\right)^T = A\,E\left(\vec{X}\vec{X}^T\right) A^T + \vec{b}\,E\left(\vec{X}\right)^T A^T + A\,E\left(\vec{X}\right)\vec{b}^T + \vec{b}\vec{b}^T - A\,E\left(\vec{X}\right)E\left(\vec{X}\right)^T A^T - A\,E\left(\vec{X}\right)\vec{b}^T - \vec{b}\,E\left(\vec{X}\right)^T A^T - \vec{b}\vec{b}^T = A\left(E\left(\vec{X}\vec{X}^T\right) - E\left(\vec{X}\right)E\left(\vec{X}\right)^T\right) A^T = A\,\Sigma_{\vec{X}}\,A^T$.
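The identity is easy to check numerically. The sketch below (assuming numpy; the covariance matrix, $A$ and $\vec{b}$ are arbitrary illustrative choices) compares the empirical covariance of $A\vec{X}+\vec{b}$ with $A\,\Sigma_{\vec{X}}\,A^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_X = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(3), Sigma_X, size=10**6)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
b = np.array([5.0, -2.0])
Y = X @ A.T + b                   # samples of A X + b
print(np.cov(Y, rowvar=False))    # empirical covariance of A X + b
print(A @ Sigma_X @ A.T)          # theoretical A Sigma_X A^T (b plays no role)
```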

Variance in a fixed direction. For any unit vector $\vec{u}$, $\mathrm{Var}\left(\vec{u}^T \vec{X}\right) = \vec{u}^T \Sigma_{\vec{X}}\, \vec{u}$.

Direction of maximum variance. To find the direction of maximum variance we must solve $\arg\max_{\|\vec{u}\|_2 = 1} \vec{u}^T \Sigma_{\vec{X}}\, \vec{u}$.

Linear algebra. Symmetric matrices have orthogonal eigenvectors: $\Sigma_{\vec{X}} = U \Lambda U^T = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix}^T$.

Linear algebra. For a symmetric matrix $A$ with eigenvalues ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, $\lambda_1 = \max_{\|\vec{u}\|_2 = 1} \vec{u}^T A \vec{u}$ and $\vec{u}_1 = \arg\max_{\|\vec{u}\|_2 = 1} \vec{u}^T A \vec{u}$, while for $k > 1$, $\lambda_k = \max_{\|\vec{u}\|_2 = 1,\ \vec{u} \perp \vec{u}_1, \dots, \vec{u}_{k-1}} \vec{u}^T A \vec{u}$ and $\vec{u}_k = \arg\max_{\|\vec{u}\|_2 = 1,\ \vec{u} \perp \vec{u}_1, \dots, \vec{u}_{k-1}} \vec{u}^T A \vec{u}$.

Direction of maximum variance. [Scatter plots of two-dimensional samples for three covariance matrices with eigenvalues $\lambda_1 = 1.22$, $\lambda_2 = 0.71$; $\lambda_1 = 1$, $\lambda_2 = 1$; and $\lambda_1 = 1.38$, $\lambda_2 = 0.32$.]
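Putting the last few slides together: the eigenvector associated with the largest eigenvalue of the covariance matrix is the direction of maximum variance, and the variance along it equals that eigenvalue. A minimal sketch (assuming numpy; the covariance matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
samples = rng.multivariate_normal(np.zeros(2), Sigma, size=10**5)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns eigenvalues in ascending order
u1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
print(eigvals[-1])                         # lambda_1
print(np.var(samples @ u1))                # sample variance along u1, close to lambda_1
```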

Coloring. Goal: transform uncorrelated samples with unit variance so that they have a prescribed covariance matrix $\Sigma$. 1. Compute the eigendecomposition $\Sigma = U \Lambda U^T$. 2. Set $\vec{y} := U \sqrt{\Lambda}\, \vec{x}$, where $\sqrt{\Lambda} := \begin{bmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_n} \end{bmatrix}$.

Coloring. $\Sigma_{\vec{Y}} = U \sqrt{\Lambda}\, \Sigma_{\vec{X}} \sqrt{\Lambda}^T U^T = U \sqrt{\Lambda}\, I\, \sqrt{\Lambda}^T U^T = \Sigma$.

Coloring. [Scatter plots of the samples $\vec{x}$, $\sqrt{\Lambda}\,\vec{x}$ and $U\sqrt{\Lambda}\,\vec{x}$.]

Generating Gaussian random vectors. Goal: sample from an $n$-dimensional Gaussian random vector with mean $\vec{\mu}$ and covariance matrix $\Sigma$. 1. Generate a vector $\vec{x}$ of $n$ independent standard Gaussian samples. 2. Compute the eigendecomposition $\Sigma = U \Lambda U^T$. 3. Set $\vec{y} := U \sqrt{\Lambda}\, \vec{x} + \vec{\mu}$. For non-Gaussian random vectors, coloring does not necessarily preserve the distribution.
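The procedure above translates directly into a few lines of code. A sketch (assuming numpy; the mean and covariance are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
# Coloring: eigendecomposition Sigma = U Lambda U^T, then y = U sqrt(Lambda) x + mu.
eigvals, U = np.linalg.eigh(Sigma)
coloring = U @ np.diag(np.sqrt(eigvals))
x = rng.standard_normal((10**6, 2))   # independent standard Gaussian samples
y = x @ coloring.T + mu
print(y.mean(axis=0))                 # close to mu
print(np.cov(y, rowvar=False))        # close to Sigma
```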

For Gaussian random vectors, uncorrelation implies mutual independence. Uncorrelation implies $\Sigma_{\vec{X}} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}$, which in turn implies $f_{\vec{X}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right) = \prod_{i=1}^{n} f_{X_i}(x_i)$.

Conditional expectation

Conditional expectation. The expectation of $g(X,Y)$ given $X = x$ is $E(g(X,Y) \mid X = x) = \int_{y=-\infty}^{\infty} g(x,y)\, f_{Y|X}(y|x)\, dy$. It can be interpreted as a function $h(x) := E(g(X,Y) \mid X = x)$. The conditional expectation of $g(X,Y)$ given $X$ is $E(g(X,Y) \mid X) := h(X)$; it is a random variable.

Iterated expectation. For any $X$ and $Y$ and any function $g: \mathbb{R}^2 \to \mathbb{R}$, $E(g(X,Y)) = E\left(E(g(X,Y) \mid X)\right)$.

Iterated expectation. Let $h(x) := E(g(X,Y) \mid X = x) = \int_{y=-\infty}^{\infty} g(x,y)\, f_{Y|X}(y|x)\, dy$. Then $E\left(E(g(X,Y) \mid X)\right) = E(h(X)) = \int_{x=-\infty}^{\infty} h(x)\, f_X(x)\, dx = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} f_X(x)\, f_{Y|X}(y|x)\, g(x,y)\, dy\, dx = E(g(X,Y))$.

Example: desert. A car is traveling through the desert. Time until the car breaks down: $T$. State of the motor: $M$. State of the road: $R$. Model: $M$ uniform between 0 (no problem) and 1 (very bad), $R$ uniform between 0 (no problem) and 1 (very bad), $M$ and $R$ independent, and $T$ exponential with parameter $M + R$.

Example: desert. $E(T) = E\left(E(T \mid M, R)\right) = E\left(\frac{1}{M+R}\right) = \int_0^1 \int_0^1 \frac{1}{m+r}\, dm\, dr = \int_0^1 \left(\log(r+1) - \log(r)\right) dr = \log 4 \approx 1.39$.
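A Monte Carlo check of the iterated-expectation calculation (a sketch assuming numpy; note that numpy's exponential sampler is parameterized by its scale, the inverse of the rate):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
M = rng.uniform(0, 1, n)                    # state of the motor
R = rng.uniform(0, 1, n)                    # state of the road
T = rng.exponential(scale=1.0 / (M + R))    # exponential with parameter (rate) M + R
print(T.mean(), np.log(4))                  # both close to 1.386
```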

Grizzlies in Yellowstone. Model for the weight $W$ of grizzly bears in Yellowstone: males are Gaussian with $\mu := 240$ kg and $\sigma := 40$ kg, females are Gaussian with $\mu := 140$ kg and $\sigma := 20$ kg, and there are about the same number of females and males.

Grizzlies in Yellowstone. Letting $S$ denote the sex of a bear chosen uniformly at random, $E(W) = E\left(E(W \mid S)\right) = \frac{E(W \mid S = 0) + E(W \mid S = 1)}{2} = \frac{240 + 140}{2} = 190$ kg.

Bayesian coin flip. Bayesian methods often endow the parameters of discrete distributions with a continuous marginal distribution. You suspect a coin is biased. You are uncertain about the bias $B$, so you model it as a random variable with pdf $f_B(b) = 2b$ for $b \in [0, 1]$. What is the expected value of the coin flip $X$?

Bayesian coin flip. $E(X) = E\left(E(X \mid B)\right) = E(B) = \int_0^1 b \cdot 2b\, db = \int_0^1 2b^2\, db = \frac{2}{3}$.
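A quick simulation of the two-stage experiment (a sketch assuming numpy; the bias is sampled by inverting its cdf $F_B(b) = b^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
B = np.sqrt(rng.random(n))   # inverse-cdf sampling: f_B(b) = 2b on [0, 1]
X = rng.random(n) < B        # flip a coin with (random) bias B
print(X.mean())              # close to E(X) = 2/3
```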