STAT:5100 (22S:193) Statistical Inference I
Week 10
Luke Tierney
University of Iowa
Fall 2015
Monday, October 26, 2015: Recap

- Multivariate normal distribution
- Linear combinations of random variables
- Copula models
- Conditional distributions
- Conditional PMFs and PDFs
- Conditional expectation
- Law of total expectation
Monday, October 26, 2015: Conditional Expectations

Some useful properties of conditional expectation:
\[ E[g(Y) + X \mid Y] = g(Y) + E[X \mid Y] \]
\[ E[g(Y) \mid Y] = g(Y) \]
\[ E[g(Y)X \mid Y] = g(Y)\, E[X \mid Y] \]
These properties show that
\[ E[g(Y)X] = E[g(Y)\, E[X \mid Y]] \]
since
\[ E[g(Y)X] = E[E[g(Y)X \mid Y]] = E[g(Y)\, E[X \mid Y]]. \]
This property can be taken as the definition of $E[X \mid Y]$.
Monday, October 26, 2015: Conditional Expectations

A formal definition of conditional expectation:

Definition. Let X, Y be random variables, and assume that $E[|X|] < \infty$. A conditional expectation $E[X \mid Y]$ of X given Y is any random variable Z such that
- $Z = h(Y)$ for some measurable function h, and
- $E[X g(Y)] = E[Z g(Y)]$ for all bounded measurable functions g.

Theorem. A version of the conditional expectation $E[X \mid Y]$ exists, and any two versions are almost surely equal.
Monday, October 26, 2015: Conditional Expectations

A useful consequence of the properties of conditional expectation:

Theorem. Suppose X and Y are random variables and g is a function such that $E[X^2] < \infty$ and $E[g(Y)^2] < \infty$. Then
\[ E[(X - g(Y))^2] = E[(X - E[X \mid Y])^2] + E[(E[X \mid Y] - g(Y))^2]. \]

Notes
- $E[(X - g(Y))^2]$ is the mean squared error for using $g(Y)$ to predict X.
- The function of Y with the smallest mean squared prediction error is $E[X \mid Y]$.
Monday, October 26, 2015: Conditional Expectations

Proof.
\[
E[(X - g(Y))^2]
= E[(X - E[X \mid Y] + E[X \mid Y] - g(Y))^2]
= E[(X - E[X \mid Y])^2] + E[(E[X \mid Y] - g(Y))^2]
  + 2\, E[(X - E[X \mid Y])(E[X \mid Y] - g(Y))].
\]
To complete the proof we need to show that the cross product term is zero.
Monday, October 26, 2015: Conditional Expectations

Proof (continued).
For any function h with $E[h(Y)^2] < \infty$,
\[ E[(X - E[X \mid Y])\, h(Y)] = E[E[X - E[X \mid Y] \mid Y]\, h(Y)], \]
and
\[ E[X - E[X \mid Y] \mid Y] = E[X \mid Y] - E[X \mid Y] = 0. \]
So
\[ E[(X - E[X \mid Y])\, h(Y)] = 0. \]
The cross product expectation corresponds to $h(Y) = E[X \mid Y] - g(Y)$ and is therefore zero.
Monday, October 26, 2015: Conditional Expectations

Corollary. If X, Y are random variables with $E[X^2] < \infty$, then
\[ \mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y]). \]

Proof. Take $g(Y) = E[X]$:
\[
\mathrm{Var}(X) = E[(X - E[X])^2]
= E[(X - E[X \mid Y])^2] + E[(E[X \mid Y] - E[X])^2]
= E[E[(X - E[X \mid Y])^2 \mid Y]] + E[(E[X \mid Y] - E[E[X \mid Y]])^2]
= E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y]).
\]
Monday, October 26, 2015: Conditional Expectations

Example. Consider again the example of N customers making total purchases
\[ S = \sum_{i=1}^{N} X_i. \]
Suppose the independent purchase amounts have finite variance $\sigma^2$. The conditional variance of S given N = n is
\[
\mathrm{Var}(S \mid N = n)
= \mathrm{Var}\Big(\sum_{i=1}^{N} X_i \,\Big|\, N = n\Big)
= \mathrm{Var}\Big(\sum_{i=1}^{n} X_i \,\Big|\, N = n\Big) \quad\text{(fix the upper limit)}
= \mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) \quad\text{(independence of the $X_i$ and $N$)}
= n\sigma^2 \quad\text{(mutual independence of the $X_i$).}
\]
Monday, October 26, 2015: Conditional Expectations

Example (continued). The conditional variance of S given N is the random variable
\[ \mathrm{Var}(S \mid N) = N\sigma^2. \]
With $N \sim \mathrm{Poisson}(\lambda)$ and purchase mean $\mu$ as before, the variance of S is therefore
\[
\mathrm{Var}(S) = E[\mathrm{Var}(S \mid N)] + \mathrm{Var}(E[S \mid N])
= E[N\sigma^2] + \mathrm{Var}(N\mu)
= E[N]\sigma^2 + \mathrm{Var}(N)\mu^2
= \lambda\sigma^2 + \lambda\mu^2.
\]
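A quick simulation sketch of this variance formula (not part of the original slides): the purchase amounts are taken to be exponential with mean $\mu$ purely for illustration, and the values of $\lambda$ and $\mu$ are arbitrary.

set.seed(1)
lambda <- 5     # mean number of customers
mu <- 10        # mean purchase amount
sigma2 <- mu^2  # variance of an Exponential with mean mu

S <- replicate(100000, sum(rexp(rpois(1, lambda), rate = 1 / mu)))
var(S)                            # simulated variance of total sales
lambda * sigma2 + lambda * mu^2   # theoretical value from the corollary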
Monday, October 26, 2015: Conditional Expectations

Corollary. Let X, Y be random variables with $E[X^2] < \infty$. Then for any g such that $E[g(Y)^2] < \infty$,
\[ E[(X - g(Y))^2] \ge E[(X - E[X \mid Y])^2]. \]

Proof. From the theorem,
\[ E[(X - g(Y))^2] = E[(X - E[X \mid Y])^2] + E[(E[X \mid Y] - g(Y))^2] \ge E[(X - E[X \mid Y])^2]. \]
Monday, October 26, 2015: Conditional Expectations

Conditional Expectation and Orthogonality

A distance between random variables can be defined as
\[ \|X - Y\| = \sqrt{E[(X - Y)^2]}. \]
In terms of this distance, $E[X \mid Y]$ is the closest function of Y to X. Put another way: among all functions of Y, the function with the lowest mean squared error for predicting X is $E[X \mid Y]$.

The result that $E[(X - E[X \mid Y])\, h(Y)] = 0$ for all h can be interpreted as: the prediction error $X - E[X \mid Y]$ is orthogonal to the set of all functions of Y.

$E[X \mid Y]$ can be viewed as the orthogonal projection of X onto the set of all random variables that are functions of Y. There are strong parallels to least squares fitting in regression analysis.
Monday, October 26, 2015: Hierarchical Models

Often it is useful to build up models from conditional distributions.

Example. Each customer visiting a store buys something with probability p. Customers make their purchasing decisions independently. Let
X = number who buy something
N = number who come to the store
Then $X \mid N \sim \mathrm{Binomial}(N, p)$.
Monday, October 26, 2015: Hierarchical Models

Example (continued). Suppose the number of customers who arrive in a given period has a Poisson(λ) distribution. This is a two-stage hierarchical model:
\[ X \mid N \sim \mathrm{Binomial}(N, p) \]
\[ N \sim \mathrm{Poisson}(\lambda) \]
If the store is chosen at random, then λ would vary from store to store. This produces a three-stage hierarchical model:
\[ X \mid N, \lambda \sim \mathrm{Binomial}(N, p) \]
\[ N \mid \lambda \sim \mathrm{Poisson}(\lambda) \]
\[ \lambda \sim f \]
p might also vary from store to store.
Monday, October 26, 2015: Hierarchical Models

Example (continued). The marginal distribution of X is a mixture distribution. In the two-stage model, X is a Poisson mixture of binomials:
\[
f_X(x) = P(X = x) = \sum_n P(X = x, N = n) = \sum_n P(X = x \mid N = n)\, P(N = n)
= \sum_{n=x}^{\infty} \binom{n}{x} p^x (1-p)^{n-x} \frac{\lambda^n}{n!} e^{-\lambda}
= \frac{(p\lambda)^x}{x!} e^{-\lambda} \sum_{n=x}^{\infty} \frac{[(1-p)\lambda]^{n-x}}{(n-x)!}
= \frac{(p\lambda)^x}{x!} e^{-\lambda + (1-p)\lambda}
= \frac{(p\lambda)^x}{x!} e^{-p\lambda}.
\]
So $X \sim \mathrm{Poisson}(p\lambda)$.
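A short R sketch (illustrative parameter values assumed) checking that binomial thinning of a Poisson count is again Poisson:

set.seed(1)
lambda <- 8; p <- 0.3
N <- rpois(100000, lambda)
X <- rbinom(100000, size = N, prob = p)
freq <- table(factor(X, levels = 0:8)) / length(X)
round(rbind(simulated = as.numeric(freq),
            theoretical = dpois(0:8, p * lambda)), 4)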
Monday, October 26, 2015: Hierarchical Models

Example (continued). With a third stage,
\[
P(X = x) = \int_0^{\infty} P(X = x \mid \lambda)\, f(\lambda)\, d\lambda
= \int_0^{\infty} \frac{(p\lambda)^x}{x!} e^{-p\lambda} f(\lambda)\, d\lambda.
\]
What forms of f are both flexible and convenient? A gamma distribution is a natural choice:
\[ f(\lambda) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \lambda^{\alpha-1} e^{-\lambda/\beta}\, 1_{(0,\infty)}(\lambda). \]
Monday, October 26, 2015: Hierarchical Models

Example (continued). Then the marginal PMF of X is
\[
P(X = x) = \int_0^{\infty} \frac{(p\lambda)^x}{x!} e^{-p\lambda} \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \lambda^{\alpha-1} e^{-\lambda/\beta}\, d\lambda
= \frac{p^x}{x!\,\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} \lambda^{x+\alpha-1} e^{-\lambda(p + 1/\beta)}\, d\lambda
= \frac{p^x\, \Gamma(x+\alpha)}{x!\,\Gamma(\alpha)\beta^{\alpha} (p + 1/\beta)^{x+\alpha}}
= \frac{\Gamma(x+\alpha)}{\Gamma(\alpha)\, x!} \left(\frac{p}{p + 1/\beta}\right)^{x} \left(\frac{1/\beta}{p + 1/\beta}\right)^{\alpha}.
\]
For α a positive integer this is a negative binomial distribution. For non-integer α this is also called a negative binomial distribution; this is a definition.
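A similar simulation sketch (again with assumed values of α, β, and p), comparing the simulated gamma mixture with R's dnbinom using size = α and success probability $(1/\beta)/(p + 1/\beta) = 1/(1 + p\beta)$ as implied by the formula above:

set.seed(1)
alpha <- 2; beta <- 4; p <- 0.3
lambda <- rgamma(100000, shape = alpha, scale = beta)
X <- rpois(100000, p * lambda)   # X | lambda ~ Poisson(p * lambda) from the previous slide
freq <- table(factor(X, levels = 0:6)) / length(X)
round(rbind(simulated = as.numeric(freq),
            theoretical = dnbinom(0:6, size = alpha, prob = 1 / (1 + p * beta))), 4)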
Wednesday, October 28, 2015: Recap

- Properties of conditional expectation
- Variance decomposition
- Geometry of conditional expectation
- Hierarchical models
- Poisson mixture of binomial distributions
- Gamma mixture of Poisson distributions
Wednesday, October 28, 2015: Hierarchical Models

Example. Suppose we observe N = n customers visiting the store. What is the conditional distribution of the store's λ value? The joint density/PMF of λ and N is
\[
f(\lambda, n) = f_{N \mid \lambda}(n \mid \lambda)\, f(\lambda)
= \frac{\lambda^n}{n!} e^{-\lambda} \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \lambda^{\alpha-1} e^{-\lambda/\beta}
= \frac{\lambda^{n+\alpha-1}}{n!\,\Gamma(\alpha)\beta^{\alpha}} e^{-\lambda(1 + 1/\beta)}.
\]
The conditional density of λ given N = n is therefore
\[
f_{\lambda \mid N}(\lambda \mid n) = \frac{f(\lambda, n)}{f_N(n)} \propto f(\lambda, n)
\propto \lambda^{n+\alpha-1} e^{-\lambda(1 + 1/\beta)}
= \lambda^{n+\alpha-1} e^{-\lambda(\beta+1)/\beta}.
\]
This corresponds to a $\mathrm{Gamma}(n+\alpha,\ \beta/(1+\beta))$ distribution.
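A simulation sketch of this posterior calculation, with illustrative values of α, β, and n: among simulated stores that happen to have exactly n customers, the λ values should behave like draws from the claimed gamma posterior.

set.seed(1)
alpha <- 2; beta <- 4; n <- 6
lambda <- rgamma(200000, shape = alpha, scale = beta)
N <- rpois(200000, lambda)
post <- lambda[N == n]   # lambda values for stores with exactly n customers
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(simulated = quantile(post, probs),
            theoretical = qgamma(probs, shape = n + alpha, scale = beta / (1 + beta))), 3)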
Wednesday, October 28, 2015: Hierarchical Models

Example. A poll of size n is to be taken to see whether voters prefer Candidate A or Candidate B. Assuming sampling with replacement, the number X in the sample who favor A is Binomial(n, p), with p the population proportion who favor A.

The race is close and we are fairly confident that p is between 0.4 and 0.6. We can capture this by thinking of p as a random variable with a distribution that puts about 90% probability between 0.4 and 0.6. This is our prior probability distribution on p.

Once we have collected our data we can compute a posterior probability like $P(p > 0.5 \mid \text{data})$. This is an example of Bayesian inference.
Wednesday, October 28, 2015: Hierarchical Models

Example (continued). A convenient form for the prior distribution is a Beta(α, β) distribution. The joint density/PMF of X, p is then of the form
\[
f(x, p) = f_{X \mid p}(x \mid p)\, f(p)
= \binom{n}{x} p^x (1-p)^{n-x} \frac{1}{B(\alpha, \beta)} p^{\alpha-1} (1-p)^{\beta-1}.
\]
The posterior density of p given X = x is
\[
f(p \mid x) = \frac{f(x, p)}{f_X(x)} \propto f(x, p) \propto p^{x+\alpha-1} (1-p)^{n-x+\beta-1}.
\]
This is the density of a $\mathrm{Beta}(x+\alpha,\ n-x+\beta)$ distribution.
Wednesday, October 28, 2015: Hierarchical Models

Example (continued). A Beta distribution with α = β = 33 is symmetric about 0.5 and assigns approximately 0.9 probability to the interval between 0.4 and 0.6.

Results for some possible sample sizes and observed counts:

    x       n       x/n     P(p > 0.5 | data)
    15      20      0.75    0.86
    115     200     0.575   0.97
    1,046   2,000   0.523   0.98

All three scenarios produce approximately the same p-value for testing H0: p <= 0.5 against H1: p > 0.5.

Some plots:

p <- seq(0, 1, len = 101)
plot(p, dbeta(p, 33, 33), type = "l", ylim = c(0, 40))
abline(v = 0.5, lty = 2)
lines(p, dbeta(p, 33 + 15, 33 + 20 - 15), col = "red")
lines(p, dbeta(p, 33 + 115, 33 + 200 - 115), col = "forestgreen")
lines(p, dbeta(p, 33 + 1046, 33 + 2000 - 1046), col = "blue")
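The posterior probabilities in the table can be reproduced directly from the Beta(33 + x, 33 + n - x) posterior; a short sketch:

x <- c(15, 115, 1046)
n <- c(20, 200, 2000)
round(1 - pbeta(0.5, 33 + x, 33 + n - x), 2)   # P(p > 0.5 | data) for each scenario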
Wednesday, October 28, 2015: Hierarchical Models

Example. Suppose X, Y are bivariate normal with standard normal marginals and correlation ρ. The joint density is
\[
f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{ -\frac{x^2 + y^2 - 2\rho x y}{2(1-\rho^2)} \right\}.
\]
The conditional density of Y given X = x is
\[
f_{Y \mid X}(y \mid x) \propto \exp\left\{ -\frac{y^2 - 2\rho x y}{2(1-\rho^2)} \right\}.
\]
This is a $N(\rho x,\ 1-\rho^2)$ density.

For a general bivariate normal distribution,
\[
Y \mid X = x \sim N\left( \mu_Y + \rho\sigma_Y \frac{x - \mu_X}{\sigma_X},\ \sigma_Y^2(1-\rho^2) \right).
\]
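A brief simulation sketch of the standard-marginal case (ρ chosen arbitrarily): conditioning on X falling in a narrow window around x = 1 should give a conditional mean near ρ and a conditional variance near 1 - ρ².

set.seed(1)
rho <- 0.6
Z1 <- rnorm(200000); Z2 <- rnorm(200000)
X <- Z1
Y <- rho * Z1 + sqrt(1 - rho^2) * Z2   # bivariate normal with correlation rho
keep <- abs(X - 1) < 0.05              # observations with X near x = 1
c(mean(Y[keep]), var(Y[keep]))         # should be near rho * 1 and 1 - rho^2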
Wednesday, October 28, 2015: Hierarchical Models

Example. Suppose X1, X2, given M = m, are independent N(m, 1). The marginal distribution of M is $N(0, \delta^2)$. Then X1, X2 are bivariate normal with
\[ E[X_1] = E[X_2] = E[E[X_1 \mid M]] = E[M] = 0 \]
\[ \mathrm{Var}(X_1) = \mathrm{Var}(X_2) = E[\mathrm{Var}(X_1 \mid M)] + \mathrm{Var}(E[X_1 \mid M]) = 1 + \mathrm{Var}(M) = 1 + \delta^2. \]
The covariance of X1, X2 is
\[
\mathrm{Cov}(X_1, X_2) = E[X_1 X_2] = E[E[X_1 X_2 \mid M]] = E[E[X_1 \mid M]\, E[X_2 \mid M]] = E[M^2] = \delta^2.
\]
So the correlation is
\[ \rho = \frac{\delta^2}{1 + \delta^2}. \]
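A quick simulation check of the correlation formula, with an arbitrary value of δ:

set.seed(1)
delta <- 1.5
M <- rnorm(100000, 0, delta)
X1 <- rnorm(100000, M, 1)
X2 <- rnorm(100000, M, 1)
cor(X1, X2)                # simulated correlation
delta^2 / (1 + delta^2)    # theoretical value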
Wednesday, October 28, 2015: Change of Variables

Suppose X, Y are jointly continuous on a set A. Let U, V be defined as
\[ U = g_1(X, Y) \qquad V = g_2(X, Y). \]
The image of A under the transformation g is
\[ B = \{(u, v) : u = g_1(x, y) \text{ and } v = g_2(x, y) \text{ for some } (x, y) \in A\}. \]
Assume $g = (g_1, g_2)$ is one-to-one on A. Then g has an inverse $h = (h_1, h_2)$ defined on B, and
\[ X = h_1(U, V) \qquad Y = h_2(U, V). \]
Wednesday, October 28, 2015: Change of Variables

A small rectangle $[u, u + du] \times [v, v + dv]$ has area $du\,dv$. The image of this rectangle under h is (approximately) a parallelogram.

(Figure: a small rectangle in the (u, v) plane and its approximate parallelogram image in the (x, y) plane.)
Wednesday, October 28, 2015: Change of Variables

The area of this parallelogram is approximately $|J(u, v)|\, du\, dv$, where
\[
J(u, v) = \frac{dx\,dy}{du\,dv}
= \det\begin{pmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[1ex] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{pmatrix}
= \frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial x}{\partial v}\frac{\partial y}{\partial u}.
\]
J is called the Jacobian determinant of the transformation. This generalizes to three or more variables in the obvious way.
Wednesday, October 28, 2015: Change of Variables

The density of U, V for $(u, v) \in B$ can be derived from
\[ f_{U,V}(u, v)\, du\, dv = f_{X,Y}(x, y)\, dx\, dy. \]
Then
\[ f_{U,V}(u, v) = f_{X,Y}(x, y) \left| \frac{dx\,dy}{du\,dv} \right| = f_{X,Y}(x, y)\, |J(u, v)|. \]
Alternatively,
\[ f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v))\, |J(u, v)| \]
where
\[
J(u, v) = \det\begin{pmatrix} \dfrac{\partial h_1(u,v)}{\partial u} & \dfrac{\partial h_1(u,v)}{\partial v} \\[1ex] \dfrac{\partial h_2(u,v)}{\partial u} & \dfrac{\partial h_2(u,v)}{\partial v} \end{pmatrix}.
\]
Friday, October 30, 2015: Recap

- Simple Bayesian inference examples
- Gaussian hierarchical models
- Change of variables for jointly continuous random variables
Friday, October 30, 2015: Change of Variables

Example. Let X, Y be independent with
\[ X \sim \mathrm{Gamma}(\alpha, 1) \qquad Y \sim \mathrm{Gamma}(\beta, 1). \]
Define
\[ U = X/(X + Y) \qquad V = X + Y. \]
Then
\[ A = (0, \infty) \times (0, \infty) \qquad B = (0, 1) \times (0, \infty). \]
Friday, October 30, 2015: Change of Variables

Example (continued). The inverse transformation is
\[ x = h_1(u, v) = uv \qquad y = h_2(u, v) = v - uv = (1 - u)v. \]
The Jacobian is
\[
J(u, v) = \det\begin{pmatrix} v & u \\ -v & 1 - u \end{pmatrix} = v(1 - u) + vu = v.
\]
Friday, October 30, 2015: Change of Variables

Example (continued). The joint density of U and V for 0 < u < 1 and v > 0 is therefore
\[
f_{U,V}(u, v) = f_X(uv)\, f_Y((1-u)v)\, v
= \frac{1}{\Gamma(\alpha)} (uv)^{\alpha-1} e^{-uv} \frac{1}{\Gamma(\beta)} ((1-u)v)^{\beta-1} e^{-(1-u)v}\, v
= \frac{1}{\Gamma(\alpha)\Gamma(\beta)} u^{\alpha-1} (1-u)^{\beta-1} v^{\alpha+\beta-1} e^{-v}.
\]
Incorporating indicator functions for B gives
\[
f_{U,V}(u, v) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)} \left[ u^{\alpha-1} (1-u)^{\beta-1}\, 1_{(0,1)}(u) \right] \left[ v^{\alpha+\beta-1} e^{-v}\, 1_{(0,\infty)}(v) \right].
\]
So U, V are independent with
\[ U \sim \mathrm{Beta}(\alpha, \beta) \qquad V \sim \mathrm{Gamma}(\alpha + \beta, 1). \]
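A simulation sketch (arbitrary shape parameters) of this factorization. The correlation only checks that U and V are uncorrelated, but the Kolmogorov-Smirnov comparisons check the claimed marginals.

set.seed(1)
alpha <- 2; beta <- 5
X <- rgamma(100000, shape = alpha)
Y <- rgamma(100000, shape = beta)
U <- X / (X + Y); V <- X + Y
cor(U, V)                            # near zero
ks.test(U, "pbeta", alpha, beta)     # compare U with Beta(alpha, beta)
ks.test(V, "pgamma", alpha + beta)   # compare V with Gamma(alpha + beta, 1)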
Friday, October 30, 2015: Change of Variables

Example. Suppose X, Y are uniformly distributed on the region
\[ A = \{(x, y) : -1 \le x \le 1,\ -1 \le y \le 1\}. \]
The joint density of X, Y is
\[
f_{X,Y}(x, y) = \begin{cases} \frac{1}{4} & \text{if } -1 \le x \le 1 \text{ and } -1 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}
\]
Let
\[ U = X + Y \qquad V = X - Y. \]
Friday, October 30, 2015: Change of Variables

Example (continued). The inverse transformation is
\[ x = \frac{u + v}{2} \qquad y = \frac{u - v}{2}. \]
The range of the transformation is
\[ B = \{(u, v) : -2 \le u + v \le 2,\ -2 \le u - v \le 2\}. \]
The Jacobian determinant of the inverse transformation is
\[
J(u, v) = \det\begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{pmatrix} = -\frac{1}{4} - \frac{1}{4} = -\frac{1}{2}.
\]
Friday, October 30, 2015: Change of Variables

Example (continued). The joint density of U, V is therefore
\[
f_{U,V}(u, v) = f_{X,Y}\left( \frac{u+v}{2}, \frac{u-v}{2} \right) |J(u, v)|
= \begin{cases} \frac{1}{8} & \text{if } -2 \le u + v \le 2 \text{ and } -2 \le u - v \le 2 \\ 0 & \text{otherwise.} \end{cases}
\]
This is a uniform distribution on the square B.

(Figure: the square B in the (u, v) plane, bounded by the lines u + v = ±2 and u - v = ±2.)
Friday, October 30, 2015: Change of Variables

Example (continued). We can compute the marginal density of U by integrating out v from the joint density:
\[
f_U(u) = \int f_{U,V}(u, v)\, dv
= \begin{cases}
\int_{-2-u}^{2+u} \frac{1}{8}\, dv = \frac{1}{8}\, 2(2+u) = \frac{2+u}{4} & \text{if } -2 \le u \le 0 \\[1ex]
\int_{-(2-u)}^{2-u} \frac{1}{8}\, dv = \frac{1}{8}\, 2(2-u) = \frac{2-u}{4} & \text{if } 0 \le u \le 2 \\[1ex]
0 & \text{otherwise}
\end{cases}
= \begin{cases} \frac{2 - |u|}{4} & \text{if } |u| \le 2 \\ 0 & \text{otherwise.} \end{cases}
\]
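A simulation sketch of the triangular marginal density, in the style of the plotting code used earlier in these notes:

set.seed(1)
X <- runif(100000, -1, 1); Y <- runif(100000, -1, 1)
U <- X + Y
hist(U, breaks = 50, freq = FALSE, main = "U = X + Y")
u <- seq(-2, 2, len = 201)
lines(u, (2 - abs(u)) / 4, col = "red")   # theoretical density (2 - |u|)/4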
Friday, October 30, 2015: Change of Variables

Example. Suppose X, Y are uniformly distributed on the unit disk
\[ A = \{(x, y) : x^2 + y^2 \le 1\}. \]
The joint density of X and Y is
\[ f_{X,Y}(x, y) = \frac{1}{\pi}\, 1_A(x, y). \]
Let
\[ U = X/Y \qquad V = Y. \]
Friday, October 30, 2015: Change of Variables

Example (continued). The inverse transformation is
\[ x = uv \qquad y = v. \]
The range of the transformation is
\[ B = \{(u, v) : u^2 v^2 + v^2 \le 1\} = \{(u, v) : |v| \le 1/\sqrt{1 + u^2}\}. \]
The Jacobian determinant of the inverse transformation is
\[ J(u, v) = \det\begin{pmatrix} v & u \\ 0 & 1 \end{pmatrix} = v. \]
Friday, October 30, 2015: Change of Variables

Example (continued). The joint density of U, V is therefore
\[
f_{U,V}(u, v) = f_{X,Y}(uv, v)\, |J(u, v)|
= \begin{cases} \frac{|v|}{\pi} & \text{if } |v| \le 1/\sqrt{1 + u^2} \\ 0 & \text{otherwise.} \end{cases}
\]
The marginal density of U is
\[
f_U(u) = \int f_{U,V}(u, v)\, dv
= \int_{-1/\sqrt{1+u^2}}^{1/\sqrt{1+u^2}} \frac{|v|}{\pi}\, dv
= 2 \int_0^{1/\sqrt{1+u^2}} \frac{v}{\pi}\, dv
= \frac{1}{\pi}\, \frac{1}{1 + u^2}.
\]
This is a Cauchy density.
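A simulation sketch of the Cauchy result, using rejection sampling to draw uniformly from the disk; quantiles are compared rather than a histogram because of the Cauchy's heavy tails.

set.seed(1)
x <- runif(300000, -1, 1); y <- runif(300000, -1, 1)
inside <- x^2 + y^2 <= 1              # keep points inside the unit disk
U <- x[inside] / y[inside]
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(simulated = quantile(U, probs), theoretical = qcauchy(probs)), 3)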
Friday, October 30, 2015: Change of Variables

Two problems:
1. X, Y are continuous, independent. Find the density of U = X + Y.
2. $X \sim N(0, 1)$, $Y \sim \chi^2_p$, independent. Find the density of $U = X/\sqrt{Y/p} \sim t_p$.

Two possible approaches:
a. Identify the region $\{U \le u\}$ in the (x, y) plane, integrate the joint density over this region, and differentiate the result with respect to u.
b. Add another variable V so that $(X, Y) \to (U, V)$ is one-to-one, find the joint density of U, V, and integrate out V.

The second approach is often easier since it only requires a one-dimensional integral, and V can be chosen to make this integral easier. Often taking V = X or V = Y is sufficient.
Friday, October 30, 2015: Change of Variables

Example. For the first problem with U = X + Y, take V = X. The inverse transformation is
\[ X = V \qquad Y = U - V. \]
The Jacobian determinant of the inverse is
\[ J(u, v) = \det\begin{pmatrix} 0 & 1 \\ 1 & -1 \end{pmatrix} = -1. \]
So
\[ f_{U,V}(u, v) = f_X(v)\, f_Y(u - v). \]
The marginal density of U is therefore
\[ f_U(u) = \int f_{U,V}(u, v)\, dv = \int f_X(v)\, f_Y(u - v)\, dv. \]
This is called the convolution of $f_X$ and $f_Y$. Transforms (e.g. moment generating functions) turn convolutions into products.
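A numeric sketch of the convolution formula: the convolution of two standard normal densities is known to give the N(0, 2) density, which provides a convenient check of the integral (the choice of normal densities here is only for illustration).

conv <- function(u) integrate(function(v) dnorm(v) * dnorm(u - v), -Inf, Inf)$value
sapply(c(-1, 0, 1.5), conv)          # numeric convolution of two N(0, 1) densities
dnorm(c(-1, 0, 1.5), sd = sqrt(2))   # known closed form for comparison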
Friday, October 30, 2015: Change of Variables

Example. For the second problem with $U = X/\sqrt{Y/p}$, set V = Y:
\[ U = X/\sqrt{Y/p} \qquad V = Y. \]
Then the inverse transformation is
\[ X = U\sqrt{V/p} \qquad Y = V. \]
The Jacobian determinant of the inverse transformation is
\[
J(u, v) = \det\begin{pmatrix} \sqrt{v/p} & \dfrac{u}{2\sqrt{pv}} \\ 0 & 1 \end{pmatrix} = \sqrt{v/p}.
\]
The range of the transformation is
\[ B = \{(u, v) : -\infty < u < \infty,\ 0 < v < \infty\}. \]
Friday, October 30, 2015: Change of Variables

Example (continued). So the joint density of U and V is
\[
f_{U,V}(u, v) = f_X(u\sqrt{v/p})\, f_Y(v)\, \sqrt{v/p}
= \begin{cases}
\dfrac{1}{\sqrt{2\pi}} \exp\left\{ -\dfrac{1}{2} u^2 v/p \right\} \dfrac{1}{\Gamma(p/2)\, 2^{p/2}} v^{p/2 - 1} e^{-v/2} \sqrt{v/p} & v > 0 \\[1ex]
0 & v \le 0.
\end{cases}
\]
Friday, October 30, 2015: Change of Variables

Example (continued). The marginal density of U is therefore
\[
f_U(u) = \int_0^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} u^2 v/p \right\} \frac{1}{\Gamma(p/2)\, 2^{p/2}} v^{p/2 - 1} e^{-v/2} \sqrt{v/p}\, dv
= \frac{1}{\sqrt{2\pi}\, \sqrt{p}\, \Gamma(p/2)\, 2^{p/2}} \int_0^{\infty} v^{\frac{p+1}{2} - 1} e^{-\frac{1}{2} v (1 + u^2/p)}\, dv
= \frac{\Gamma\left(\frac{p+1}{2}\right) 2^{(p+1)/2}}{\sqrt{2\pi}\, \Gamma(p/2)\, \sqrt{p}\, 2^{p/2}} \frac{1}{(1 + u^2/p)^{(p+1)/2}}
= \frac{\Gamma\left(\frac{p+1}{2}\right)}{\Gamma(p/2)\, (p\pi)^{1/2}} \frac{1}{(1 + u^2/p)^{(p+1)/2}}.
\]
This is the density of Student's t distribution with p degrees of freedom. For p = 1 this is a Cauchy distribution.
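A simulation sketch (p chosen arbitrarily) that the ratio construction does give Student's t:

set.seed(1)
p <- 5
X <- rnorm(100000); Y <- rchisq(100000, df = p)
U <- X / sqrt(Y / p)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(simulated = quantile(U, probs), theoretical = qt(probs, df = p)), 3)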