ST 740: Linear Models and Multivariate Normal Inference


1 ST 740: Linear Models and Multivariate Normal Inference
Alyson Wilson
Department of Statistics
North Carolina State University
November 4, 2013
A. Wilson (NCSU STAT) Linear Models November 4, 2013 / 28

2 Normal Linear Regression Model
We assume that E[Y | x] = β₁x₁ + β₂x₂ + ... + β_p x_p and that
    Y_i = β₁x_{i1} + β₂x_{i2} + ... + β_p x_{ip} + ε_i = βᵀx_i + ε_i,
where the ε_i are i.i.d. Normal(0, σ²). This implies
    p(y₁,...,y_n | x₁,...,x_n, β, σ²) = ∏_{i=1}^n p(y_i | x_i, β, σ²)
                                      = (2πσ²)^{-n/2} exp( -(1/(2σ²)) ∑_{i=1}^n (y_i - βᵀx_i)² )
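As a quick sanity check on this factorization, the vectorized log-density can be compared against the product of univariate normal densities. A minimal sketch; the design matrix, coefficients, and noise level below are made-up illustrations, not from the slides:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))          # rows are the x_i
beta = np.array([1.0, -2.0, 0.5])    # illustrative coefficients
sigma2 = 0.7
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Joint log-density in the vectorized form from the slide
resid = y - X @ beta
joint = -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# Sum of the n univariate Normal(beta^T x_i, sigma^2) log-densities
per_obs = norm.logpdf(y, loc=X @ beta, scale=np.sqrt(sigma2)).sum()
```

The two quantities agree to floating-point precision, confirming the i.i.d. factorization.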

3 Normal Linear Regression Model: Multivariate Normal
If we think of y as the n-dimensional column vector (y₁,...,y_n)ᵀ and X as the n × p matrix with ith row x_iᵀ, then we can write the normal linear regression model as
    y | X, β, σ² ~ MVN(Xβ, σ²I)

4 Multivariate Normal Sampling Distribution
Assume that we have data y₁,...,y_n with a multivariate normal sampling distribution. If y_i is a p-dimensional vector, we have
    f(y_i | θ, Σ) = (2π)^{-p/2} |Σ|^{-1/2} exp( -(y_i - θ)ᵀΣ⁻¹(y_i - θ)/2 )

5 Multivariate Normal Inference: Prior Distribution for θ
If our sampling distribution is univariate normal, we saw that it was often (computationally) convenient to specify a normal prior distribution for the mean. Suppose that we specify the prior distribution for θ, the mean of our multivariate normal sampling distribution, as MVN(µ₀, Λ₀). What is the full conditional distribution of θ given y and Σ? To figure this out, we'll need the likelihood (MVN(θ, Σ)), the prior distribution for θ, and some information about the prior distribution for Σ.

6 Prior Distribution for θ
First, let's rewrite the prior distribution for θ:
    π(θ | µ₀, Λ₀) = (2π)^{-p/2} |Λ₀|^{-1/2} exp( -(θ - µ₀)ᵀΛ₀⁻¹(θ - µ₀)/2 )
                  = (2π)^{-p/2} |Λ₀|^{-1/2} exp( -½ θᵀΛ₀⁻¹θ + θᵀΛ₀⁻¹µ₀ - ½ µ₀ᵀΛ₀⁻¹µ₀ )
                  ∝ exp( -½ θᵀΛ₀⁻¹θ + θᵀΛ₀⁻¹µ₀ )
                  = exp( -½ θᵀA₀θ + θᵀb₀ ),
where A₀ = Λ₀⁻¹ and b₀ = Λ₀⁻¹µ₀. Notice that the prior covariance matrix is A₀⁻¹ and the prior mean is A₀⁻¹b₀.

7 Sampling Distribution
Now, let's take a look at the sampling distribution for y₁,...,y_n:
    π(y₁,...,y_n | θ, Σ) = ∏_{i=1}^n (2π)^{-p/2} |Σ|^{-1/2} exp( -(y_i - θ)ᵀΣ⁻¹(y_i - θ)/2 )
                         = (2π)^{-np/2} |Σ|^{-n/2} exp( -½ ∑_{i=1}^n (y_i - θ)ᵀΣ⁻¹(y_i - θ) )
                         ∝ |Σ|^{-n/2} exp( -½ ∑_{i=1}^n y_iᵀΣ⁻¹y_i ) exp( -(n/2) θᵀΣ⁻¹θ + θᵀΣ⁻¹(nȳ) )
                         = |Σ|^{-n/2} exp( -½ ∑_{i=1}^n y_iᵀΣ⁻¹y_i ) exp( -½ θᵀA₁θ + θᵀb₁ ),
where A₁ = nΣ⁻¹ and b₁ = nΣ⁻¹ȳ.

8 Posterior/Full Conditional Distribution for θ
    π(θ, Σ | y₁,...,y_n) ∝ π(Σ) |Σ|^{-n/2} exp( -½ ∑_{i=1}^n y_iᵀΣ⁻¹y_i )
                           × exp( -½ θᵀA₁θ + θᵀb₁ ) exp( -½ θᵀA₀θ + θᵀb₀ )

9 Posterior/Full Conditional Distribution for θ
Assuming that the prior distribution for Σ does not depend on θ,
    π(θ | Σ, y₁,...,y_n) ∝ exp( -½ θᵀA₁θ + θᵀb₁ ) exp( -½ θᵀA₀θ + θᵀb₀ )
                         ∝ exp( -½ θᵀ(A₀ + A₁)θ + θᵀ(b₀ + b₁) )
This is a MVN with covariance matrix
    (A₀ + A₁)⁻¹ = (Λ₀⁻¹ + nΣ⁻¹)⁻¹
and mean
    (A₀ + A₁)⁻¹(b₀ + b₁) = (Λ₀⁻¹ + nΣ⁻¹)⁻¹(Λ₀⁻¹µ₀ + nΣ⁻¹ȳ)

10 Full Conditional Distribution for θ
This is a MVN with covariance matrix
    (A₀ + A₁)⁻¹ = (Λ₀⁻¹ + nΣ⁻¹)⁻¹
The precision (inverse variance) is the sum of the prior precision and the data precision. The mean is
    (A₀ + A₁)⁻¹(b₀ + b₁) = (Λ₀⁻¹ + nΣ⁻¹)⁻¹(Λ₀⁻¹µ₀ + nΣ⁻¹ȳ),
which is a weighted average of the prior mean and the sample mean.
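The precision-weighted combination is easy to compute directly. A small numeric sketch, with made-up values of µ₀, Λ₀, Σ, and data; with a vague prior, the posterior mean lands essentially on ȳ:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 2, 50
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # treated as known here
y = rng.multivariate_normal([1.0, -1.0], Sigma, size=n)

mu0 = np.zeros(p)
Lambda0 = 10.0 * np.eye(p)                     # vague prior covariance

A0 = np.linalg.inv(Lambda0)                    # prior precision
A1 = n * np.linalg.inv(Sigma)                  # data precision
b0 = A0 @ mu0
b1 = A1 @ y.mean(axis=0)

post_cov = np.linalg.inv(A0 + A1)              # (Λ0⁻¹ + nΣ⁻¹)⁻¹
post_mean = post_cov @ (b0 + b1)               # matrix-weighted average of µ0 and ȳ
```

Because Λ₀ is large relative to Σ/n, the data precision dominates and post_mean ≈ ȳ.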

11 Multivariate Normal Inference: Prior Distribution for Σ
To construct a prior distribution for Σ, we need a distribution over symmetric, positive definite matrices. The idea is a generalization of the chi-squared (gamma) distribution.
1. Sample z₁,...,z_{ν₀} i.i.d. MVN(0, Φ₀).
2. Calculate ZᵀZ = ∑_{i=1}^{ν₀} z_i z_iᵀ.
The distribution of the matrices ZᵀZ is called a Wishart(ν₀, Φ₀) distribution.

12 Wishart Properties
If ν₀ > p, then ZᵀZ is positive definite with probability 1.
ZᵀZ is symmetric with probability 1.
E[ZᵀZ] = ν₀Φ₀
By analogy with the univariate normal case, we use the Wishart distribution as a prior for the precision matrix and the inverse Wishart distribution as a prior for the covariance matrix.
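The two-step construction above can be simulated directly; the ν₀ and Φ₀ below are arbitrary illustrations. A modest Monte Carlo run shows the draws are symmetric and positive definite and that the sample mean approaches ν₀Φ₀:

```python
import numpy as np

rng = np.random.default_rng(2)
p, nu0 = 2, 5
Phi0 = np.array([[2.0, 0.5], [0.5, 1.0]])

def rwishart(nu, Phi, rng):
    """One draw of Z^T Z where z_1,...,z_nu are i.i.d. MVN(0, Phi)."""
    Z = rng.multivariate_normal(np.zeros(len(Phi)), Phi, size=nu)
    return Z.T @ Z

draws = np.array([rwishart(nu0, Phi0, rng) for _ in range(20000)])
W = draws[0]
```

Each draw is symmetric by construction, and positive definite with probability 1 since ν₀ > p.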

13 Prior Distribution for Σ: Inverse Wishart Distribution
1. Sample z₁,...,z_{ν₀} i.i.d. MVN(0, S₀⁻¹).
2. Calculate (ZᵀZ)⁻¹ = (∑_{i=1}^{ν₀} z_i z_iᵀ)⁻¹.
The distribution of the matrices (ZᵀZ)⁻¹ is called an inverse Wishart(ν₀, S₀⁻¹) distribution.

14 Inverse Wishart Distribution
    E[ZᵀZ] = ν₀S₀⁻¹
    E[(ZᵀZ)⁻¹] = S₀/(ν₀ - p - 1)
The density function is
    π(Σ) = [ 2^{ν₀p/2} π^{p(p-1)/4} |S₀|^{-ν₀/2} ∏_{j=1}^p Γ([ν₀ + 1 - j]/2) ]⁻¹ |Σ|^{-(ν₀+p+1)/2} exp[ -tr(S₀Σ⁻¹)/2 ]

15 Choosing Parameters
For the inverse Wishart distribution, think of ν₀ as the prior sample size. If we are confident that the true covariance matrix is near some covariance matrix Σ₀, choose ν₀ large and set S₀ = (ν₀ - p - 1)Σ₀. If we choose ν₀ = p + 2 and S₀ = Σ₀, then the samples will be loosely centered around Σ₀.

16 Full Conditional Distribution for Σ: Useful Matrix Algebra Fact
    ∑_{k=1}^K b_kᵀ A b_k = tr(BᵀBA),
where B is the matrix whose kth row is b_kᵀ. Therefore,
    ∑_{i=1}^n (y_i - θ)ᵀΣ⁻¹(y_i - θ) = tr(S_θΣ⁻¹),
where S_θ = ∑_{i=1}^n (y_i - θ)(y_i - θ)ᵀ is the residual sum of squares matrix for y₁,...,y_n.
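The trace identity is easy to verify numerically; all values below are arbitrary made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 3
y = rng.normal(size=(n, p))                    # rows are the y_i
theta = rng.normal(size=p)
B = rng.normal(size=(p, p))
Sigma = B @ B.T + np.eye(p)                    # a positive definite covariance
Siginv = np.linalg.inv(Sigma)

# Left side: sum of quadratic forms
quad = sum((yi - theta) @ Siginv @ (yi - theta) for yi in y)

# Right side: trace of S_theta Sigma^{-1}
R = y - theta                                  # residual matrix, rows (y_i - theta)^T
S_theta = R.T @ R                              # residual sum of squares matrix
trace_form = np.trace(S_theta @ Siginv)
```

The two expressions agree to floating-point precision, which is the identity used on the next slide.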

17 Full Conditional Distribution for Σ
    π(θ, Σ | y₁,...,y_n) ∝ π(θ) |Σ|^{-n/2} exp( -½ ∑_{i=1}^n (y_i - θ)ᵀΣ⁻¹(y_i - θ) )
                           × |Σ|^{-(ν₀+p+1)/2} exp[ -tr(S₀Σ⁻¹)/2 ]
    π(Σ | θ, y₁,...,y_n) ∝ |Σ|^{-(n+ν₀+p+1)/2} exp( -½ tr((S₀ + S_θ)Σ⁻¹) )
So the full conditional distribution of Σ is Inverse Wishart(ν₀ + n, [S₀ + S_θ]⁻¹).

18 Noninformative Prior for Multivariate Normal Data
A commonly used noninformative prior is
    π(θ, Σ) ∝ |Σ|^{-(p+1)/2}

19 MCMC for Multivariate Normal Model
To summarize, suppose that our model is
    y_i | θ, Σ ~ MVN(θ, Σ)
    θ ~ MVN(µ₀, Λ₀)
    Σ ~ Inverse Wishart(ν₀, S₀⁻¹)
We can draw a random sample from our posterior distribution using Gibbs sampling:
    θ | y₁,...,y_n, Σ ~ MVN( (Λ₀⁻¹ + nΣ⁻¹)⁻¹(Λ₀⁻¹µ₀ + nΣ⁻¹ȳ), (Λ₀⁻¹ + nΣ⁻¹)⁻¹ )
    Σ | y₁,...,y_n, θ ~ Inverse Wishart(ν₀ + n, [S₀ + S_θ]⁻¹)
The R package MCMCpack has a random inverse Wishart function, riwish(v, S). It seems to be built for Gibbs sampling: it generates only one random draw at a time.
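A Python analogue of this two-step Gibbs sampler can be sketched with scipy.stats.invwishart in place of R's riwish; all hyperparameters and data below are illustrative. Note that scipy parameterizes the inverse Wishart by the scale matrix itself, so the full conditional IW(ν₀ + n, [S₀ + S_θ]⁻¹) in the slides' notation corresponds to passing S₀ + S_θ as the scale:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(4)
p, n = 2, 40
y = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.3], [0.3, 2.0]], size=n)
ybar = y.mean(axis=0)

# Illustrative hyperparameters
mu0, Lambda0 = np.zeros(p), 10.0 * np.eye(p)
nu0, S0 = p + 2, np.eye(p)

theta, Sigma = ybar.copy(), np.cov(y.T)        # starting values
theta_draws, Sigma_draws = [], []
for _ in range(500):
    # theta | Sigma, y: precision-weighted MVN full conditional
    A = np.linalg.inv(Lambda0) + n * np.linalg.inv(Sigma)
    b = np.linalg.inv(Lambda0) @ mu0 + n * np.linalg.inv(Sigma) @ ybar
    cov = np.linalg.inv(A)
    theta = rng.multivariate_normal(cov @ b, cov)
    # Sigma | theta, y: inverse Wishart with df nu0 + n, scale S0 + S_theta
    R = y - theta
    Sigma = invwishart.rvs(df=nu0 + n, scale=S0 + R.T @ R, random_state=rng)
    theta_draws.append(theta)
    Sigma_draws.append(Sigma)

theta_draws = np.array(theta_draws)
Sigma_draws = np.array(Sigma_draws)
```

With a vague prior on θ, the posterior mean of the θ draws should sit close to ȳ.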

20 Classical Approach
From a classical approach, we choose an estimator for β that minimizes ∑_{i=1}^n (y_i - βᵀx_i)². We will call β̂_OLS the ordinary least squares estimate of β.

21 Bayesian Approach: Semi-Conjugate Prior Distribution for β
Suppose that we choose a prior distribution for β that is MVN(β₀, Σ₀). Then
    π(β | y, X, σ²) ∝ p(y | X, β, σ²)π(β)
                    ∝ exp( -(1/(2σ²))(yᵀy - 2βᵀXᵀy + βᵀXᵀXβ) ) exp( -½ βᵀΣ₀⁻¹β + βᵀΣ₀⁻¹β₀ )
                    ∝ exp( -½ βᵀ(XᵀX/σ²)β + βᵀXᵀy/σ² ) exp( -½ βᵀΣ₀⁻¹β + βᵀΣ₀⁻¹β₀ )
                    = exp( -½ βᵀ(XᵀX/σ² + Σ₀⁻¹)β + βᵀ(Xᵀy/σ² + Σ₀⁻¹β₀) )

22 Bayesian Approach: Semi-Conjugate Prior Distribution for β
The full conditional distribution for β, π(β | y, X, σ²), is
    MVN( (XᵀX/σ² + Σ₀⁻¹)⁻¹(Xᵀy/σ² + Σ₀⁻¹β₀), (XᵀX/σ² + Σ₀⁻¹)⁻¹ )

23 Bayesian Approach: Semi-Conjugate Prior Distribution for σ²
Suppose that we choose a prior distribution for σ² that is InverseGamma(ν₀/2, ν₀σ₀²/2). Then
    π(σ² | y, X, β) ∝ p(y | X, β, σ²)π(σ²)
                    ∝ (σ²)^{-n/2} exp( -(1/(2σ²))(yᵀy - 2βᵀXᵀy + βᵀXᵀXβ) ) (1/σ²)^{ν₀/2+1} exp( -ν₀σ₀²/(2σ²) )
                    ∝ (1/σ²)^{(n+ν₀)/2+1} exp( -(1/(2σ²))(ν₀σ₀² + (y - Xβ)ᵀ(y - Xβ)) ),
which is InverseGamma( (n + ν₀)/2, [ν₀σ₀² + (y - Xβ)ᵀ(y - Xβ)]/2 ).
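Putting the two full conditionals together gives a Gibbs sampler for (β, σ²). A sketch with simulated data and illustrative hyperparameters; the inverse gamma draw is taken as the reciprocal of a gamma variate:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Illustrative semi-conjugate hyperparameters
beta0, Sigma0 = np.zeros(p), 100.0 * np.eye(p)
nu0, sigma2_0 = 1.0, 1.0

sigma2 = 1.0
beta_draws, sigma2_draws = [], []
for _ in range(1000):
    # beta | y, X, sigma2: MVN with precision X'X/sigma2 + Sigma0^{-1}
    A = X.T @ X / sigma2 + np.linalg.inv(Sigma0)
    b = X.T @ y / sigma2 + np.linalg.inv(Sigma0) @ beta0
    cov = np.linalg.inv(A)
    beta = rng.multivariate_normal(cov @ b, cov)
    # sigma2 | y, X, beta: InverseGamma((n+nu0)/2, (nu0*sigma2_0 + SSR)/2),
    # sampled as 1 / Gamma(shape, scale = 1/rate)
    ssr = np.sum((y - X @ beta) ** 2)
    sigma2 = 1.0 / rng.gamma(shape=(n + nu0) / 2, scale=2.0 / (nu0 * sigma2_0 + ssr))
    beta_draws.append(beta)
    sigma2_draws.append(sigma2)

beta_draws = np.array(beta_draws)
sigma2_draws = np.array(sigma2_draws)
```

With this much data and a vague prior, the posterior means of the β draws land near the generating coefficients.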

24 Other Prior Distributions
One common (and convenient) noninformative prior distribution is uniform on (β, log(σ)), i.e.,
    π(β, σ²) ∝ 1/σ²
In this case, we can show that
    β | σ², y, X ~ MVN( (XᵀX)⁻¹Xᵀy, (XᵀX)⁻¹σ² )
    σ² | y, X ~ InverseGamma( (n - p)/2, (y - Xβ̂_OLS)ᵀ(y - Xβ̂_OLS)/2 )
It is straightforward to derive all of the full conditionals and marginals for this model, so there are many choices for calculating samples from the posterior distribution.

25 Other Prior Distributions When we discussed noninformative prior distributions, we learned that prior distributions can be selected to capture many different kinds of knowledge. If we do not have much knowledge about the values of the regression parameters, we may wish to choose a prior distribution that is somehow minimally informative. A. Wilson (NCSU STAT) Linear Models November 4, / 28

26 Other Prior Distributions: Unit Information Prior
One suggestion for the normal linear regression model comes from Kass and Wasserman (1995). The idea is to create a prior distribution that contains the same amount of information as one observation. Recall that the variance of β̂_OLS is σ²(XᵀX)⁻¹. If we think about the precision (inverse variance) of this estimate, we have (XᵀX)/σ². The idea is that the precision of β̂_OLS is the amount of information in n observations. This implies that the amount of information in one observation is one nth as much, or (XᵀX)/(nσ²). Suppose that we set Σ₀ = nσ²(XᵀX)⁻¹ in the prior distribution for β. They also suggest setting β₀ = β̂_OLS, ν₀ = 1, and σ₀² = σ̂²_OLS.

27 Other Prior Distributions: g-prior
The g-prior starts from the idea that parameter estimation should be invariant to the scale of the regressors. Suppose that X is a given set of regressors and X̃ = XH for some invertible p × p matrix H. If we obtain the posterior distribution of β from y and X and of β̃ from y and X̃, then, according to this principle of invariance, the posterior distributions of β and Hβ̃ should be the same. This invariance holds if we choose
    π(β | σ²) = MVN(0, k(XᵀX)⁻¹)
    π(σ²) = InverseGamma(ν₀/2, ν₀σ₀²/2)
A popular choice for k is gσ². Choosing g large reflects less informative prior beliefs.

28 Other Prior Distributions: g-prior
In this case, we can show that
    β | σ², y, X ~ MVN( (g/(g+1)) (XᵀX)⁻¹Xᵀy, (g/(g+1)) (XᵀX)⁻¹σ² )
    σ² | y, X ~ InverseGamma( (n + ν₀)/2, (ν₀σ₀² + SSR_g)/2 )
where SSR_g = yᵀ(I - (g/(g+1)) X(XᵀX)⁻¹Xᵀ)y. Another popular noninformative choice is ν₀ = 0.
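The g-prior posterior mean is the OLS estimate shrunk toward zero by the factor g/(g+1), and SSR_g exceeds the OLS residual sum of squares by yᵀHy/(g+1), where H is the hat matrix. A quick numeric sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, g = 50, 3, 100.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
shrink = g / (g + 1.0)

post_mean = shrink * beta_ols                  # posterior mean of beta under the g-prior
H = X @ XtX_inv @ X.T                          # hat matrix
SSR_g = y @ (np.eye(n) - shrink * H) @ y       # as defined on the slide

resid = y - X @ beta_ols
ssr_ols = resid @ resid                        # ordinary OLS residual sum of squares
```

As g grows, the shrinkage factor approaches 1 and the posterior mean approaches β̂_OLS.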


More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Modeling Real Estate Data using Quantile Regression

Modeling Real Estate Data using Quantile Regression Modeling Real Estate Data using Semiparametric Quantile Regression Department of Statistics University of Innsbruck September 9th, 2011 Overview 1 Application: 2 3 4 Hedonic regression data for house prices

More information

Linear Regression and Discrimination

Linear Regression and Discrimination Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian

More information

AMS-207: Bayesian Statistics

AMS-207: Bayesian Statistics Linear Regression How does a quantity y, vary as a function of another quantity, or vector of quantities x? We are interested in p(y θ, x) under a model in which n observations (x i, y i ) are exchangeable.

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

MEI Exam Review. June 7, 2002

MEI Exam Review. June 7, 2002 MEI Exam Review June 7, 2002 1 Final Exam Revision Notes 1.1 Random Rules and Formulas Linear transformations of random variables. f y (Y ) = f x (X) dx. dg Inverse Proof. (AB)(AB) 1 = I. (B 1 A 1 )(AB)(AB)

More information

Models for spatial data (cont d) Types of spatial data. Types of spatial data (cont d) Hierarchical models for spatial data

Models for spatial data (cont d) Types of spatial data. Types of spatial data (cont d) Hierarchical models for spatial data Hierarchical models for spatial data Based on the book by Banerjee, Carlin and Gelfand Hierarchical Modeling and Analysis for Spatial Data, 2004. We focus on Chapters 1, 2 and 5. Geo-referenced data arise

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

Chapter 17: Undirected Graphical Models

Chapter 17: Undirected Graphical Models Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

1 Priors, Contd. ISyE8843A, Brani Vidakovic Handout Nuisance Parameters

1 Priors, Contd. ISyE8843A, Brani Vidakovic Handout Nuisance Parameters ISyE8843A, Brani Vidakovic Handout 6 Priors, Contd.. Nuisance Parameters Assume that unknown parameter is θ, θ ) but we are interested only in θ. The parameter θ is called nuisance parameter. How do we

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

The Normal Linear Regression Model with Natural Conjugate Prior. March 7, 2016

The Normal Linear Regression Model with Natural Conjugate Prior. March 7, 2016 The Normal Linear Regression Model with Natural Conjugate Prior March 7, 2016 The Normal Linear Regression Model with Natural Conjugate Prior The plan Estimate simple regression model using Bayesian methods

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

The Relationship Between the Power Prior and Hierarchical Models

The Relationship Between the Power Prior and Hierarchical Models Bayesian Analysis 006, Number 3, pp. 55 574 The Relationship Between the Power Prior and Hierarchical Models Ming-Hui Chen, and Joseph G. Ibrahim Abstract. The power prior has emerged as a useful informative

More information

Gaussian Models (9/9/13)

Gaussian Models (9/9/13) STA561: Probabilistic machine learning Gaussian Models (9/9/13) Lecturer: Barbara Engelhardt Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava 1 Multivariate Normal Distribution The multivariate

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing

More information

Bayesian Inference for Regression Parameters

Bayesian Inference for Regression Parameters Bayesian Inference for Regression Parameters 1 Bayesian inference for simple linear regression parameters follows the usual pattern for all Bayesian analyses: 1. Form a prior distribution over all unknown

More information

A tutorial on. Richard Boys

A tutorial on. Richard Boys A tutorial on Bayesian inference for the normal linear model Richard Boys Statistics Research Group, Newcastle University, UK Motivation Much of Bayesian inference nowadays analyses complex hierarchical

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Multivariate spatial models and the multikrig class

Multivariate spatial models and the multikrig class Multivariate spatial models and the multikrig class Stephan R Sain, IMAGe, NCAR ENAR Spring Meetings March 15, 2009 Outline Overview of multivariate spatial regression models Case study: pedotransfer functions

More information

Outline. Motivation Contest Sample. Estimator. Loss. Standard Error. Prior Pseudo-Data. Bayesian Estimator. Estimators. John Dodson.

Outline. Motivation Contest Sample. Estimator. Loss. Standard Error. Prior Pseudo-Data. Bayesian Estimator. Estimators. John Dodson. s s Practitioner Course: Portfolio Optimization September 24, 2008 s The Goal of s The goal of estimation is to assign numerical values to the parameters of a probability model. Considerations There are

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Fall, 2013 Page 1 Random Variable and Probability Distribution Discrete random variable Y : Finite possible values {y

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Lecture 3. Univariate Bayesian inference: conjugate analysis

Lecture 3. Univariate Bayesian inference: conjugate analysis Summary Lecture 3. Univariate Bayesian inference: conjugate analysis 1. Posterior predictive distributions 2. Conjugate analysis for proportions 3. Posterior predictions for proportions 4. Conjugate analysis

More information

Weighted Least Squares

Weighted Least Squares Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w

More information