Part 6: Multivariate Normal and Linear Models
2 Multiple measurements Up until now, all of our statistical models have been univariate: models for a single measurement on each member of a sample of individuals, or each run of a repeated experiment. However, datasets are frequently multivariate, having multiple measurements for each individual or experiment.
3 Multivariate model The most useful (most commonly used) model for multivariate data is the multivariate normal model. For a collection of variables, it allows us to jointly estimate population means, variances, and correlations.
4 Example: reading comprehension A sample of 22 children are given reading comprehension tests before and after receiving a particular instructional method. Each student i will have two scores, Y_{i,1} and Y_{i,2}, denoting the pre- and post-instructional scores respectively. We denote each student's pair of scores as a vector:

Y_i = (Y_{i,1}, Y_{i,2})ᵀ = (score on first test, score on second test)ᵀ
5 Example: quantities of interest The things we might be interested in include the population mean μ, particularly the difference μ_2 − μ_1:

E{Y} = (E{Y_{i,1}}, E{Y_{i,2}})ᵀ = (μ_1, μ_2)ᵀ

and the population covariance matrix

Σ = Cov[Y] = [ E{Y_1²} − E{Y_1}²          E{Y_1 Y_2} − E{Y_1}E{Y_2} ]
             [ E{Y_1 Y_2} − E{Y_1}E{Y_2}  E{Y_2²} − E{Y_2}²         ]
           = [ σ_1²     σ_{1,2} ]
             [ σ_{1,2}  σ_2²    ]

where ρ_{1,2} = σ_{1,2}/(σ_1 σ_2) ∈ [−1, 1] measures the consistency of the intervention.
6 Multivariate normal One model for describing first- and second-order moments of multivariate data is the multivariate normal model. We say that a p-dimensional data vector Y has a multivariate normal (MVN) distribution if its sampling density is

p(y | μ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp( −(1/2) (y − μ)ᵀ Σ^{−1} (y − μ) )

where

y = (y_1, ..., y_p)ᵀ,  μ = (μ_1, ..., μ_p)ᵀ,  Σ = [ σ_1²    ...  σ_{1,p} ]
                                                  [ ...     ...  ...     ]
                                                  [ σ_{1,p} ...  σ_p²    ]
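As a quick numerical check of this density, a short Python sketch (using numpy and scipy; the mean and covariance are taken from the bivariate example on the next slide, and the evaluation point y is arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(y, mu, Sigma):
    """(2*pi)^{-p/2} |Sigma|^{-1/2} exp(-(1/2)(y-mu)' Sigma^{-1} (y-mu))"""
    p = len(mu)
    d = y - mu
    quad = d @ np.linalg.solve(Sigma, d)  # (y-mu)' Sigma^{-1} (y-mu)
    return (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

mu = np.array([50.0, 50.0])
Sigma = np.array([[64.0, 48.0], [48.0, 144.0]])   # sigma_{1,2} = 48, so rho = 0.5
y = np.array([52.0, 60.0])
# agrees with scipy's built-in MVN density
print(np.isclose(mvn_density(y, mu, Sigma), multivariate_normal(mu, Sigma).pdf(y)))
```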
7 Example: bivariate normals [Figure: contours of three bivariate normal densities with σ_{1,2} = −48 (ρ = −0.5), σ_{1,2} = 0 (ρ = 0), and σ_{1,2} = 48 (ρ = 0.5), each with μ = (50, 50) and (σ_1², σ_2²) = (64, 144).]
8 Notation & marginals We shall write Y ~ N_p(μ, Σ) when Y has a p-dimensional MVN distribution. An interesting feature of the MVN distribution is that the marginal distribution of each variable is univariate normal: Y_j ~ N(μ_j, σ_j²).
9 Semiconjugate prior Recall that if Y_1, ..., Y_n are IID (univariate) normal, then a convenient conjugate prior distribution for the population mean is also (univariate) normal. Similarly, a convenient prior distribution for the multivariate mean μ is a MVN distribution, which we will parameterize as

p(μ) = N_p(μ; μ_0, Λ_0)

where μ_0 and Λ_0 are the prior mean and variance of μ, respectively.
10 The essence of the prior We have that if μ ~ N_p(μ_0, Λ_0), then

p(μ) ∝ exp( −(1/2) μᵀ A_0 μ + μᵀ b_0 )

where A_0 = Λ_0^{−1} and b_0 = Λ_0^{−1} μ_0. Conversely, this result says that if a random vector μ has a density on R^p that is proportional to

exp( −(1/2) μᵀ A μ + μᵀ b )

for some matrix A and vector b, then μ must have a MVN distribution with covariance A^{−1} and mean A^{−1} b.
11 The essence of the likelihood If the sampling model is {Y_1, ..., Y_n | μ, Σ} ~ iid N_p(μ, Σ), then similar calculations show that the joint sampling density of the observed vectors y_1, ..., y_n is

p(y_1, ..., y_n | μ, Σ) ∝ exp( −(1/2) μᵀ A_1 μ + μᵀ b_1 )

where A_1 = n Σ^{−1}, b_1 = n Σ^{−1} ȳ, and ȳ = ( (1/n)∑_i y_{i,1}, ..., (1/n)∑_i y_{i,p} )ᵀ.
12 The essence of the posterior Combining the prior and likelihood gives the posterior

p(μ | y_1, ..., y_n, Σ) ∝ exp( −(1/2) μᵀ A_1 μ + μᵀ b_1 ) × exp( −(1/2) μᵀ A_0 μ + μᵀ b_0 )
                        = exp( −(1/2) μᵀ A_n μ + μᵀ b_n )

where

A_n = A_0 + A_1 = Λ_0^{−1} + n Σ^{−1}
b_n = b_0 + b_1 = Λ_0^{−1} μ_0 + n Σ^{−1} ȳ
13 The posterior This implies that the conditional posterior distribution of μ must be MVN with covariance A_n^{−1} and mean A_n^{−1} b_n, so

Cov[μ | y_1, ..., y_n, Σ] = Λ_n = (Λ_0^{−1} + n Σ^{−1})^{−1}
E{μ | y_1, ..., y_n, Σ} = μ_n = Λ_n (Λ_0^{−1} μ_0 + n Σ^{−1} ȳ)

giving {μ | y_1, ..., y_n, Σ} ~ N_p(μ_n, Λ_n). Just like in the univariate case: the posterior precision (inverse covariance) is the sum of the prior precision and the data precision, and the posterior expectation is a precision-weighted average of the prior expectation and the sample mean.
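These posterior quantities are a direct computation; a minimal numpy sketch, with Σ treated as known and illustrative values (the prior mean and covariance here anticipate the reading-comprehension example, and the simulated data merely stand in for real observations):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 22
Sigma = np.array([[64.0, 48.0], [48.0, 144.0]])     # treated as known here
y = rng.multivariate_normal([48.0, 54.0], Sigma, size=n)

mu0 = np.array([50.0, 50.0])                        # prior mean mu_0
L0 = np.array([[625.0, 312.5], [312.5, 625.0]])     # prior covariance Lambda_0
ybar = y.mean(axis=0)

iL0, iS = np.linalg.inv(L0), np.linalg.inv(Sigma)
An = iL0 + n * iS            # posterior precision = prior precision + data precision
bn = iL0 @ mu0 + n * iS @ ybar
Ln = np.linalg.inv(An)       # posterior covariance Lambda_n
mun = Ln @ bn                # posterior mean mu_n (precision-weighted average)
print(mun.round(2))
```

With n = 22 and a diffuse prior (variance 625), the data precision dominates and μ_n lands very close to ȳ.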
14 Prior covariance elements Just as the variance σ² must be positive, the covariance matrix Σ must be positive definite, i.e., xᵀ Σ x > 0 for all nonzero vectors x. Positive definiteness guarantees that σ_j² > 0 for all j and that all correlations are between −1 and 1. Another requirement is that the covariance matrix must be symmetric, i.e., σ_{j,k} = σ_{k,j}. Any valid prior distribution for Σ must put all of its probability mass on this complicated set of symmetric, positive definite matrices.
15 Wishart distribution A convenient family of distributions with just these properties is the Wishart, the multivariate analogue of the gamma family. Recall that, in the univariate normal model, the gamma distribution is conjugate for the precision 1/σ², so the conjugate prior for σ² is inverse-gamma (IG). Similarly, it turns out that the Wishart distribution is a semi-conjugate prior for the precision matrix Σ^{−1}, and so the conjugate prior for Σ is inverse-Wishart.
16 Inverse-Wishart To obtain Σ ~ IW(ν_0, S_0^{−1}), where S_0 is a p × p positive-definite matrix and ν_0 is a positive integer:

1. Sample z_1, ..., z_{ν_0} ~ iid N_p(0, S_0^{−1})
2. Calculate ZᵀZ = ∑_{i=1}^{ν_0} z_i z_iᵀ
3. Set Σ = (ZᵀZ)^{−1}

Accordingly, the precision matrix Σ^{−1} has a W(ν_0, S_0^{−1}) distribution.
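The three-step construction above can be sketched directly in numpy (the ν_0 = p + 2 and S_0 values are illustrative; the Monte Carlo check uses the Wishart mean E{Σ^{−1}} = ν_0 S_0^{−1} from the next slide):

```python
import numpy as np

def riwish(nu0, S0, rng):
    """Draw Sigma ~ IW(nu0, S0^{-1}) via the constructive definition above."""
    p = S0.shape[0]
    # step 1: nu0 draws from N_p(0, S0^{-1}); step 2: Z'Z; step 3: invert
    Z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(S0), size=nu0)
    return np.linalg.inv(Z.T @ Z)

rng = np.random.default_rng(0)
S0 = np.array([[625.0, 312.5], [312.5, 625.0]])
nu0 = 4                                     # p + 2, a "loosely centered" choice
draws = [riwish(nu0, S0, rng) for _ in range(10000)]
# sanity check: the average precision should be near nu0 * S0^{-1}
prec_mean = np.mean([np.linalg.inv(d) for d in draws], axis=0)
print(np.round(prec_mean / np.linalg.inv(S0), 2))   # entries near nu0
```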
17 Expected covariance The expected precision and covariance under an inverse-Wishart are

E{Σ^{−1}} = ν_0 S_0^{−1}    and    E{Σ} = S_0 / (ν_0 − p − 1)

As a prior for a MVN covariance: if we are confident that the true Σ is near some Σ_0, then we might choose ν_0 large and set S_0 = (ν_0 − p − 1) Σ_0, so that the distribution is tightly centered around Σ_0. If not, we may choose ν_0 = p + 2 and S_0 = Σ_0, so that the distribution is loosely centered around Σ_0.
18 IW (prior) density The density for Σ ~ IW(ν_0, S_0^{−1}) is given by

p(Σ) ∝ |Σ|^{−(ν_0 + p + 1)/2} exp( −(1/2) tr(S_0 Σ^{−1}) )

where tr(·) is the trace, or sum of the diagonal elements, of a matrix. A useful result from linear algebra is that

∑_{k=1}^{K} b_kᵀ A b_k = tr(B A Bᵀ) = tr(Bᵀ B A)

where B is a matrix whose k-th row is b_kᵀ.
19 Convenient likelihood To combine the IW prior distribution with the sampling distribution for Y_1, ..., Y_n we take advantage of the above result for traces:

p(y_1, ..., y_n | μ, Σ) ∝ |Σ|^{−n/2} exp( −(1/2) ∑_{i=1}^{n} (y_i − μ)ᵀ Σ^{−1} (y_i − μ) )
                        = |Σ|^{−n/2} exp( −(1/2) tr(S_μ Σ^{−1}) )

where S_μ = ∑_{i=1}^{n} (y_i − μ)(y_i − μ)ᵀ is the residual sum of squares matrix for the vectors y_1, ..., y_n if the population mean is presumed to be μ.
20 Posterior We are now in a convenient position to combine the prior with the likelihood as follows:

p(Σ | y_1, ..., y_n, μ) ∝ p(y_1, ..., y_n | μ, Σ) × p(Σ)
  ∝ |Σ|^{−n/2} exp( −(1/2) tr(S_μ Σ^{−1}) ) × |Σ|^{−(ν_0 + p + 1)/2} exp( −(1/2) tr(S_0 Σ^{−1}) )
  = |Σ|^{−(ν_0 + n + p + 1)/2} exp( −(1/2) tr([S_0 + S_μ] Σ^{−1}) )
21 Posterior Thus we have

{Σ | y_1, ..., y_n, μ} ~ IW(ν_0 + n, [S_0 + S_μ]^{−1})

Even though it was rough going to obtain, hopefully the result is intuitive: the posterior sample size ν_n = ν_0 + n is the sum of the prior sample size and the data sample size. Similarly, the posterior residual sum of squares S_n = S_0 + S_μ is the sum of the prior sum of squares and the data sum of squares.
22 Gibbs sampler We now have the set of full posterior conditionals

{μ | y_1, ..., y_n, Σ} ~ N_p(μ_n, Λ_n)
{Σ | y_1, ..., y_n, μ} ~ IW(ν_n, S_n^{−1})

which we may use to construct a Gibbs sampler (GS) algorithm to obtain a MCMC approximation to the joint posterior p(μ, Σ | y_1, ..., y_n).
23 Gibbs sampler Given a starting value Σ^{(0)}, the GS algorithm generates {μ^{(s+1)}, Σ^{(s+1)}} from {μ^{(s)}, Σ^{(s)}} via the following two steps:

1. Sample μ^{(s+1)} from its full conditional distribution:
   a) compute μ_n and Λ_n from y_1, ..., y_n and Σ^{(s)}
   b) sample μ^{(s+1)} ~ N_p(μ_n, Λ_n)
2. Sample Σ^{(s+1)} from its full conditional distribution:
   a) compute S_n from y_1, ..., y_n and μ^{(s+1)}
   b) sample Σ^{(s+1)} ~ IW(ν_n, S_n^{−1})

Note that {μ_n, Λ_n} depend on Σ and S_n depends on μ, so these must be recalculated every iteration.
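The two-step sweep above can be sketched as a short numpy function. This is a minimal sketch, not the slides' actual code: the simulated score pairs and prior settings below are illustrative stand-ins for the real reading-comprehension data.

```python
import numpy as np

def gibbs_mvn(y, mu0, L0, nu0, S0, n_sweeps=2000, seed=0):
    """Gibbs sampler for the semi-conjugate MVN model: alternate the two
    full conditionals, recomputing (mu_n, Lambda_n) and S_n each sweep."""
    rng = np.random.default_rng(seed)
    n, p = y.shape
    ybar = y.mean(axis=0)
    iL0 = np.linalg.inv(L0)
    Sigma = np.cov(y.T)                      # start at the sample covariance
    mu_draws, Sigma_draws = [], []
    for _ in range(n_sweeps):
        # 1. mu | y, Sigma ~ N_p(mu_n, Lambda_n)
        iS = np.linalg.inv(Sigma)
        Ln = np.linalg.inv(iL0 + n * iS)
        mun = Ln @ (iL0 @ mu0 + n * iS @ ybar)
        mu = rng.multivariate_normal(mun, Ln)
        # 2. Sigma | y, mu ~ IW(nu0 + n, S_n^{-1}), S_n = S0 + S_mu
        resid = y - mu
        Sn = S0 + resid.T @ resid
        Z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Sn), size=nu0 + n)
        Sigma = np.linalg.inv(Z.T @ Z)       # constructive IW draw
        mu_draws.append(mu)
        Sigma_draws.append(Sigma)
    return np.array(mu_draws), np.array(Sigma_draws)

# illustrative data standing in for the 22 score pairs
rng = np.random.default_rng(1)
y = rng.multivariate_normal([47.0, 54.0], [[182.0, 148.0], [148.0, 244.0]], size=22)
prior = np.array([[625.0, 312.5], [312.5, 625.0]])
mu_s, Sig_s = gibbs_mvn(y, np.array([50.0, 50.0]), prior, nu0=4, S0=prior)
print((mu_s[:, 1] > mu_s[:, 0]).mean())      # estimates P(mu_2 > mu_1 | y)
```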
24 Example: reading comprehension Let's return to our motivating reading comprehension example. We shall model the 22 pairs of scores, before and after instruction, as IID samples from a MVN distribution. We start by thinking about the priors for μ and Σ. The exam was designed to give average scores of around 50 out of 100, so μ_0 = (50, 50)ᵀ is sensible as a prior expectation.
25 Example: mean prior variance Since the true mean cannot be below 0 or above 100, it is desirable to use a prior variance for μ that puts little probability outside of this range. If we take λ²_{0,1} = λ²_{0,2} = (50/2)² = 625, then P(μ_j ∉ [0, 100]) ≈ 0.05. Since the two exams are measuring the same thing, whatever the true values of μ_1 and μ_2 are, it is probable that they are close. We can reflect this with a prior correlation of 0.5, so that λ_{0,12} = 0.5 × 625 = 312.5.
26 Example: prior covariance & data Some of the same logic about the range of exam scores applies to choosing a prior for Σ. We'll take S_0 = Λ_0, but only loosely center Σ around this value by taking ν_0 = p + 2 = 4. The sufficient statistics of y_1, ..., y_22 needed for MVN inference are

ȳ = (47.18, 53.86)ᵀ,  (s_1², s_2²) = (182.16, …),  s_{1,2}/(s_1 s_2) = 0.7
27 Example: Gibbs sampler Initialize the GS algorithm with

Σ^{(0)} = [ s_1²     s_{1,2} ]
          [ s_{1,2}  s_2²    ]

[Figure: scatterplot of posterior samples of (μ_1, μ_2) with the "no effect" line μ_2 = μ_1, and quantile-based CIs (2.5% and 97.5% posterior quantiles) for μ_1 and μ_2.] We find

P(μ_2 > μ_1 | y_1, ..., y_22) > 0.99

strong evidence that the instruction is working.
28 Example: Posterior predictive What is the probability that a randomly selected child will score higher on the second exam than on the first? [Figure: scatterplot of posterior predictive samples (ỹ_1, ỹ_2) with the "no effect" line, and quantile-based CIs (2.5% and 97.5% quantiles).] Here

P(ỹ_2 > ỹ_1 | y_1, ..., y_22) = …

a less significant, and possibly worrying, result!
29 Regression Regression modeling is concerned with describing how the sampling distribution of one random variable Y (the response variable) varies with another variable or set of variables x = (x_1, ..., x_p) (the explanatory variables, or covariates). Specifically, a regression model postulates a form for p(y | x), the conditional distribution of Y given x. Estimation of p(y | x) is made using data y_1, ..., y_n gathered under a variety of conditions x_1, ..., x_n.
30 Linear model One simple but flexible approach to regression is via the linear (sampling) model (LM). The LM treats responses Y_i as independent (but not identically distributed) realizations of a process that is linear in the explanatory variables x_i = (x_{i,1}, ..., x_{i,p})ᵀ, observed with Gaussian noise:

Y_i ~ ind N(μ_i, σ²),  where μ_i = E{Y | x_i} = βᵀ x_i

Here the x_i are known, β is an unknown p-dimensional vector of regression coefficients, and σ² is an unknown variance parameter.
31 Compact notation The LM is usually written as Y = Xβ + ε, where

Y = (Y_1, ..., Y_n)ᵀ,  β = (β_1, ..., β_p)ᵀ,  ε = (ε_1, ..., ε_n)ᵀ

X is the n × p design matrix whose i-th row is x_iᵀ, and {ε_1, ..., ε_n} ~ iid N(0, σ²). In even more compact notation,

Y ~ N_n(Xβ, σ² I_n)

where I_n is the n × n identity matrix.
32 Maximum Likelihood As long as XᵀX is invertible, which is the case when X is of full rank p, the MLE is

β̂ = (XᵀX)^{−1} Xᵀ Y

Similarly, we may find

σ̂² = (1/n) ||Y − Xβ̂||² = (1/n) ∑_{i=1}^{n} (Y_i − x_iᵀ β̂)²
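These two estimators are one line each in numpy; a sketch on simulated data (the design matrix and coefficients below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # p = 3, full rank
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)   # (1/n) ||y - X beta_hat||^2
print(beta_hat.round(2))
```

Solving the normal equations with `np.linalg.solve` (or `lstsq`) is numerically preferable to forming (XᵀX)^{−1} explicitly.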
33 Sampling the MLE It may be shown that

β̂ ~ N_p(β, σ² (XᵀX)^{−1})  and  n σ̂²/σ² ~ χ²_{n−p}

Moreover, β̂ and σ̂² are independent. These results may be used to construct confidence intervals, test hypotheses, etc., in a frequentist setup.
34 Example: Oxygen intake 12 healthy men who did not exercise regularly were recruited to take part in a study of the effects of two different exercise regimens on oxygen intake. 6 were randomly assigned to a 12-week flat-terrain running program; the remaining 6 were assigned a 12-week step aerobics program. The maximum oxygen intake of each subject was measured (in liters per minute) while running on an inclined treadmill, both before and after the 12-week program.
35 Example: Oxygen intake Of interest is how a subject's change in maximal oxygen intake may depend upon which program they were assigned to (one explanatory variable). However, other factors, such as age, are expected to affect the change in maximal intake as well (two explanatory variables). I.e., we wish to estimate the conditional distribution of oxygen intake for a given exercise program and age.
36 Example: Data [Figure: change in maximal O2 uptake plotted against age, with separate symbols for the aerobic and running groups.] It is easy to imagine two straight lines, one for the aerobic points and the other for the running ones. How do we estimate them, and do we need two?
37 Example: Linear in covariates A sensible LM may be constructed as follows:

Y_i = β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + ε_i

x_{i,1} = 1 for each subject i (intercept term)
x_{i,2} = 0 if subject i is running, 1 if aerobic
x_{i,3} = the age of subject i
x_{i,4} = x_{i,2} × x_{i,3} (interaction term)

Thus, the 12 rows of X are comprised of 4 columns: x_iᵀ = (x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}).
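Building this design matrix is mechanical; a numpy sketch with hypothetical ages and group assignments (the slide's actual data are not reproduced here, so these numbers are purely illustrative):

```python
import numpy as np

# hypothetical ages and group labels for the 12 subjects (illustration only)
age = np.array([23, 22, 22, 25, 27, 20, 31, 23, 27, 28, 22, 24], dtype=float)
aerobic = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([
    np.ones(12),     # x_{i,1}: intercept
    aerobic,         # x_{i,2}: 0 = running, 1 = aerobic
    age,             # x_{i,3}: age
    aerobic * age,   # x_{i,4}: interaction
])
print(X.shape)
```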
38 Example: Explaining the terms in the model Under this model the conditional expectations of Y for the two different levels of x_{i,2} are

E{Y | x} = β_1 + β_3 × age, if running
E{Y | x} = (β_1 + β_2) + (β_3 + β_4) × age, if aerobic

The model assumes that the relationship is linear in age for both exercise groups, with the difference in intercepts given by β_2 and the difference in slopes by β_4.
39 Example: MLE/OLS inference [Figure: change in maximal O2 uptake vs. age, with fitted OLS lines for the aerobic and running groups.] Classical tests indicate that the exercise program may not be significant. [Coefficients table (estimate, standard error, t value, Pr(>|t|)): the intercept (x1) and age (x3) terms are flagged significant (**), while the group (x2) and interaction (x4) terms are not.]
40 Bayesian LM We demonstrate how a simple semi-conjugate prior distribution for β and σ² can be used when there is information available about the parameters. In situations where prior information is unavailable or difficult to quantify, an alternative default class of prior distributions is given. We shall see how the MLE (β̂, σ̂²) crops up as a factor in our posterior distributions, with similar sampling distributions as posteriors in the default-prior case.
41 Semi-conjugate prior The sampling density of the data, as a function of β, is

p(y | X, β, σ²) = N_n(y; Xβ, σ² I_n)
  ∝ exp( −(y − Xβ)ᵀ(y − Xβ) / (2σ²) )
  = exp( −[yᵀy − 2βᵀXᵀy + βᵀXᵀXβ] / (2σ²) )

The role that β plays in the exponent looks very similar to that played by y, which is MVN. This suggests that a MVN prior for β is conjugate.
42 Semi-conjugate prior If β ~ N_p(β_0, Σ_0), then

p(β | y, X, σ²) ∝ exp( −(1/2) βᵀ [Σ_0^{−1} + XᵀX/σ²] β + βᵀ [Σ_0^{−1} β_0 + Xᵀy/σ²] )

which we recognize as proportional to a MVN with

Var[β | y, X, σ²] = Σ_n = (Σ_0^{−1} + XᵀX/σ²)^{−1}
E{β | y, X, σ²} = β_n = Σ_n (Σ_0^{−1} β_0 + Xᵀy/σ²)

i.e., {β | y, X, σ²} ~ N_p(β_n, Σ_n).
43 Interpretation We can gain some understanding of the conditional posterior by considering some limiting cases of

Var[β | y, X, σ²] = Σ_n = (Σ_0^{−1} + XᵀX/σ²)^{−1}
E{β | y, X, σ²} = β_n = Σ_n (Σ_0^{−1} β_0 + Xᵀy/σ²)

If the elements of the prior precision matrix Σ_0^{−1} are small, then β_n ≈ β̂, the MLE (or OLS estimator). On the other hand, if the measurement precision is very small (σ² is very large), then β_n ≈ β_0, the prior expectation.
44 Conjugate prior variance As in most normal sampling problems, the semi-conjugate prior for σ² is inverse-gamma. If σ² ~ IG(ν_0/2, ν_0 σ_0²/2), then

p(σ² | y, X, β) ∝ (σ²)^{−(ν_0+n)/2 − 1} exp( −[ν_0 σ_0² + (y − Xβ)ᵀ(y − Xβ)] / (2σ²) )

which we recognize as an IG density, i.e.,

{σ² | y, X, β} ~ IG( (ν_0 + n)/2, [ν_0 σ_0² + (y − Xβ)ᵀ(y − Xβ)]/2 )
45 Gibbs sampler Using these full conditionals, we may construct a GS algorithm as follows: given current values {β^{(s)}, σ²^{(s)}}, new values can be generated by

1. updating β:
   a) compute β_n(y, X, σ²^{(s)}) and Σ_n(y, X, σ²^{(s)})
   b) sample β^{(s+1)} ~ N_p(β_n, Σ_n)
2. updating σ²:
   a) compute SSR = (y − Xβ^{(s+1)})ᵀ(y − Xβ^{(s+1)})
   b) sample σ²^{(s+1)} ~ IG([ν_0 + n]/2, [ν_0 σ_0² + SSR]/2)
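The two updating steps can be sketched as a short numpy function; a minimal sketch with illustrative simulated data and prior settings (an IG draw is obtained as the reciprocal of a gamma draw):

```python
import numpy as np

def gibbs_lm(y, X, beta0, Sigma0, nu0, s20, n_sweeps=2000, seed=0):
    """Gibbs sampler for the semi-conjugate LM.
    Priors: beta ~ N_p(beta0, Sigma0), sigma^2 ~ IG(nu0/2, nu0*s20/2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    iS0 = np.linalg.inv(Sigma0)
    XtX, Xty = X.T @ X, X.T @ y
    sigma2 = s20
    beta_draws, s2_draws = [], []
    for _ in range(n_sweeps):
        # 1. beta | y, X, sigma^2 ~ N_p(beta_n, Sigma_n)
        Sn = np.linalg.inv(iS0 + XtX / sigma2)
        bn = Sn @ (iS0 @ beta0 + Xty / sigma2)
        beta = rng.multivariate_normal(bn, Sn)
        # 2. sigma^2 | y, X, beta ~ IG((nu0 + n)/2, [nu0*s20 + SSR]/2)
        ssr = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (nu0 * s20 + ssr))
        beta_draws.append(beta)
        s2_draws.append(sigma2)
    return np.array(beta_draws), np.array(s2_draws)

# illustrative data: intercept-plus-slope model
rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
bs, s2s = gibbs_lm(y, X, np.zeros(2), 100.0 * np.eye(2), nu0=1, s20=1.0)
print(bs.mean(axis=0).round(2))
```

With this weak prior, the posterior mean of β should sit close to the OLS estimate.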
46 Prior difficulty The Bayesian analysis of the LM requires a specification of the prior parameters (β_0, Σ_0) and (ν_0, σ_0²). There are O(p²) such parameters, so this can be quite a monumental task even for modest p. Even when prior information exists (as in the oxygen intake example), sometimes an analysis must be done in the absence of prior information. Fortunately, there are some convenient weakly informative priors that are easy to use.
47 Jeffreys prior The (independence) Jeffreys prior for the LM is

p(β, σ²) ∝ 1/σ²,  i.e.,  p(β) ∝ 1 and p(σ²) ∝ 1/σ²

We may interpret the former as β ~ N(0, c I_p) with c → ∞, and the latter as σ² ~ IG(0, 0), from which we may easily derive

β | y, X, σ² ~ N_p(β̂, σ² (XᵀX)^{−1})
σ² | y, X ~ IG((n − p)/2, n σ̂²/2)

Check that the posterior is proper for n > p + 1.
48 Marginal posterior variance The marginal posterior of the variance is available in closed form under the IG (and Jeffreys) prior (by conditional probability). Since we can sample from both of these distributions, samples from the joint posterior p(β, σ² | y, X) may be obtained by MC: first draw σ² from its marginal posterior, then draw β given that σ².
49 Example: Oxygen intake We will use the independence Jeffreys prior for this example, with the MC procedure using the marginal posterior p(σ² | y, X). [Figure: histograms of the marginal posteriors of β_1, β_2, β_3, and β_4, each annotated with its 2.5% and 97.5% posterior quantiles.]
50 Example: effect of aerobics? The posterior distributions seem to suggest only weak evidence of a difference between the two groups, as the 95% CIs for β_2 and β_4 both contain zero. However, these parameters themselves do not quite tell the whole story. According to the model, the average difference in y between two people of the same age x but in different exercise programs is β_2 + β_4 × x. Thus, the posterior distribution for the effect of the aerobics program over the running program is obtained via the posterior distribution of β_2 + β_4 × x for each x.
51 Example: effect of aerobics depends on age [Figure: posterior quantiles of β_2 + β_4 × x plotted against age x.] This indicates reasonably strong evidence of a difference at young ages and less at older ones.
52 Prediction (forecasting) Suppose that we wish to obtain the posterior predictive distribution of Y(x*) at a new location x*. There are several ways in which we may go about sampling from this predictive distribution. The decomposition (by the law of total probability and conditional probability)

p(y* | y, X, x*) = ∫ p(y* | x*, β, σ²) p(β, σ² | y, X) dβ dσ²

says that we may obtain MC samples as y*^{(s)} ~ N(x*ᵀ β^{(s)}, σ²^{(s)}).
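Under the Jeffreys prior both posterior draws are available in closed form, so Monte Carlo prediction is a short loop. A numpy sketch on simulated data (here σ² is drawn from its marginal posterior IG((n − p)/2, SSR/2), a standard result under this prior; the data and x* are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# posterior ingredients under the Jeffreys prior
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ssr = np.sum((y - X @ beta_hat) ** 2)
XtXinv = np.linalg.inv(X.T @ X)
xstar = np.array([1.0, 0.5])                  # new location x*

ystar = []
for _ in range(5000):
    sigma2 = 1.0 / rng.gamma((n - p) / 2.0, 2.0 / ssr)          # sigma^2 | y, X
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtXinv)   # beta | y, X, sigma^2
    ystar.append(rng.normal(xstar @ beta, np.sqrt(sigma2)))     # y* | beta, sigma^2
ystar = np.array(ystar)
print(ystar.mean().round(1))
```

Quantiles of `ystar` give the predictive intervals plotted on the next slide's figure.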
53 Example: predictive quantiles [Figure: change in maximal O2 uptake vs. age, with posterior predictive quantiles for the aerobic and running groups.]
More informationCS281A/Stat241A Lecture 22
CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution
More informationCorrelation and Regression
Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class
More informationRegression diagnostics
Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION
COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:
More informationEfficient Bayesian Multivariate Surface Regression
Efficient Bayesian Multivariate Surface Regression Feng Li (joint with Mattias Villani) Department of Statistics, Stockholm University October, 211 Outline of the talk 1 Flexible regression models 2 The
More informationChapter 5. Chapter 5 sections
1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions
More information2.1 Linear regression with matrices
21 Linear regression with matrices The values of the independent variables are united into the matrix X (design matrix), the values of the outcome and the coefficient are represented by the vectors Y and
More informationCS281A/Stat241A Lecture 17
CS281A/Stat241A Lecture 17 p. 1/4 CS281A/Stat241A Lecture 17 Factor Analysis and State Space Models Peter Bartlett CS281A/Stat241A Lecture 17 p. 2/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian
More informationA Primer on Statistical Inference using Maximum Likelihood
A Primer on Statistical Inference using Maximum Likelihood November 3, 2017 1 Inference via Maximum Likelihood Statistical inference is the process of using observed data to estimate features of the population.
More informationBayesian Regression Linear and Logistic Regression
When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we
More informationGibbs Sampling in Latent Variable Models #1
Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor
More informationIntroduction to Matrix Algebra and the Multivariate Normal Distribution
Introduction to Matrix Algebra and the Multivariate Normal Distribution Introduction to Structural Equation Modeling Lecture #2 January 18, 2012 ERSH 8750: Lecture 2 Motivation for Learning the Multivariate
More informationDependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.
Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More informationGibbs Sampling in Linear Models #1
Gibbs Sampling in Linear Models #1 Econ 690 Purdue University Justin L Tobias Gibbs Sampling #1 Outline 1 Conditional Posterior Distributions for Regression Parameters in the Linear Model [Lindley and
More informationLecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011
Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector
More informationStatistical View of Least Squares
Basic Ideas Some Examples Least Squares May 22, 2007 Basic Ideas Simple Linear Regression Basic Ideas Some Examples Least Squares Suppose we have two variables x and y Basic Ideas Simple Linear Regression
More informationLinear Regression. In this lecture we will study a particular type of regression model: the linear regression model
1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor
More informationBayesian Inference for Normal Mean
Al Nosedal. University of Toronto. November 18, 2015 Likelihood of Single Observation The conditional observation distribution of y µ is Normal with mean µ and variance σ 2, which is known. Its density
More information9. Linear Regression and Correlation
9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,
More informationEconomics 240A, Section 3: Short and Long Regression (Ch. 17) and the Multivariate Normal Distribution (Ch. 18)
Economics 240A, Section 3: Short and Long Regression (Ch. 17) and the Multivariate Normal Distribution (Ch. 18) MichaelR.Roberts Department of Economics and Department of Statistics University of California
More informationECON 4160, Autumn term Lecture 1
ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least
More information(Moderately) Advanced Hierarchical Models
(Moderately) Advanced Hierarchical Models Ben Goodrich StanCon: January 12, 2018 Ben Goodrich Advanced Hierarchical Models StanCon 1 / 18 Obligatory Disclosure Ben is an employee of Columbia University,
More informationCorrelation and Linear Regression
Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means
More informationMultivariate Regression: Part I
Topic 1 Multivariate Regression: Part I ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà Outline of this topic Statement of the objective: we want to explain the behavior of one variable as a
More informationGaussian Models (9/9/13)
STA561: Probabilistic machine learning Gaussian Models (9/9/13) Lecturer: Barbara Engelhardt Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava 1 Multivariate Normal Distribution The multivariate
More informationCopula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011
Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011 Outline Ordinary Least Squares (OLS) Regression Generalized Linear Models
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More informationThe Multivariate Normal Distribution 1
The Multivariate Normal Distribution 1 STA 302 Fall 2017 1 See last slide for copyright information. 1 / 40 Overview 1 Moment-generating Functions 2 Definition 3 Properties 4 χ 2 and t distributions 2
More informationCONJUGATE DUMMY OBSERVATION PRIORS FOR VAR S
ECO 513 Fall 25 C. Sims CONJUGATE DUMMY OBSERVATION PRIORS FOR VAR S 1. THE GENERAL IDEA As is documented elsewhere Sims (revised 1996, 2), there is a strong tendency for estimated time series models,
More informationIntroduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models
Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December
More informationSTAT 100C: Linear models
STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix
More informationIntroduction to Probabilistic Graphical Models: Exercises
Introduction to Probabilistic Graphical Models: Exercises Cédric Archambeau Xerox Research Centre Europe cedric.archambeau@xrce.xerox.com Pascal Bootcamp Marseille, France, July 2010 Exercise 1: basics
More informationBernoulli and Poisson models
Bernoulli and Poisson models Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes rule
More informationRegression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin
Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationGaussians Linear Regression Bias-Variance Tradeoff
Readings listed in class website Gaussians Linear Regression Bias-Variance Tradeoff Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 22 nd, 2007 Maximum Likelihood Estimation
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More information