Version: July 10, 2013

Notes on the Multivariate Normal and Related Topics

Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions and sampling distributions. One might say that anyone worth knowing knows these distinctions. We start with the simple univariate framework and then move on to the multivariate context.

A) Begin by positing a population of interest that is operationalized by some random variable, say $X$. In this Theory World framework, $X$ is characterized by parameters, such as the expectation of $X$, $\mu = E(X)$, or its variance, $\sigma^2 = V(X)$. The random variable $X$ has a (population) distribution, which for us will typically be assumed normal.

B) A sample is generated by taking observations on $X$, say, $X_1, \ldots, X_n$, considered independent and identically distributed as $X$, i.e., they are exact copies of $X$. In this Data World context, statistics are functions of the sample and therefore characterize the sample: the sample mean, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$; the sample variance, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^2$, with some possible variation in dividing by $n-1$ to generate an unbiased estimator of $\sigma^2$. The statistics, $\hat{\mu}$ and $\hat{\sigma}^2$, are point estimators of $\mu$ and $\sigma^2$. They are random variables themselves, so they have distributions of their own, called sampling distributions. The general problem of statistical inference is to ask what the sample statistics, $\hat{\mu}$ and $\hat{\sigma}^2$, tell us about their population counterparts,
such as $\mu$ and $\sigma^2$. Can we obtain some notion of accuracy from the sampling distribution, e.g., confidence intervals?

The multivariate problem is generally the same but with more notation:

A) The population (Theory World) is characterized by a collection of $p$ random variables, $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$, with parameters: $\mu_i = E(X_i)$; $\sigma_i^2 = V(X_i)$; $\sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))]$; or the correlation between $X_i$ and $X_j$, $\rho_{ij} = \sigma_{ij}/\sigma_i\sigma_j$.

B) The sample (Data World) is defined by $n$ independent observations on the random vector $\mathbf{X}$, with the observations placed into an $n \times p$ data matrix (e.g., subject by variable) that we also (with a slight abuse of notation) denote by a bold-face capital letter, $\mathbf{X}_{n \times p}$:

$$\mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}$$

The statistics corresponding to the parameters of the population are: $\hat{\mu}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ki}$; $\hat{\sigma}_i^2 = \frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \hat{\mu}_i)^2$, with again some possible variation to a division by $n-1$ for an unbiased estimate; $\hat{\sigma}_{ij} = \frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \hat{\mu}_i)(x_{kj} - \hat{\mu}_j)$; and $\hat{\rho}_{ij} = \hat{\sigma}_{ij}/\hat{\sigma}_i\hat{\sigma}_j$.

To obtain a good sense of what the estimators tell us about the population parameters, we will have to make some assumption about the population, e.g., that $[X_1, \ldots, X_p]$ has a multivariate normal distribution. As we will see, this assumption leads to some very nice results.
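As a concrete Data World illustration, the sample statistics just defined can be computed directly from a data matrix. The sketch below uses a simulated data set of my own (not from the notes) and the divide-by-$n$ versions of the estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))              # an n x p data matrix (subjects by variables)

mu_hat = X.mean(axis=0)                  # sample mean of each variable
Xc = X - mu_hat                          # center each column
Sigma_hat = (Xc.T @ Xc) / n              # divide-by-n covariance estimates sigma_hat_ij
sd = np.sqrt(np.diag(Sigma_hat))         # sample standard deviations sigma_hat_i
rho_hat = Sigma_hat / np.outer(sd, sd)   # sample correlations rho_hat_ij

# np.cov with bias=True reproduces the divide-by-n estimate
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```

Passing `bias=False` (the default) to `np.cov` instead gives the divide-by-$(n-1)$ unbiased variant mentioned above.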
0.1 Developing the Multivariate Normal Distribution

Suppose $X_1, \ldots, X_p$ are $p$ continuous random variables with density functions $f_1(x_1), \ldots, f_p(x_p)$, and distribution functions $F_1(x_1), \ldots, F_p(x_p)$, where $P(X_i \le x_i) = F_i(x_i) = \int_{-\infty}^{x_i} f_i(x)\,dx$. We define a $p$-dimensional random variable (or random vector) as the vector $\mathbf{X}' = [X_1, \ldots, X_p]$; $\mathbf{X}$ has the joint cumulative distribution function

$$F(x_1, \ldots, x_p) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(x_1, \ldots, x_p)\,dx_1 \cdots dx_p.$$

If the random variables $X_1, \ldots, X_p$ are independent, then the joint density and cumulative distribution functions factor:

$$f(x_1, \ldots, x_p) = f_1(x_1) \cdots f_p(x_p) \quad\text{and}\quad F(x_1, \ldots, x_p) = F_1(x_1) \cdots F_p(x_p).$$

The independence property will not be assumed in the following; in fact, the whole idea is to investigate what type of dependency exists in the random vector $\mathbf{X}$.

When $\mathbf{X}' = [X_1, X_2, \ldots, X_p] \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the joint density function has the form

$$\phi(x_1, \ldots, x_p) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\{-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\}.$$

In the univariate case, the density provides the usual bell-shaped curve:

$$\phi(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-\tfrac{1}{2}[(x - \mu)/\sigma]^2\}.$$

Given the MVN assumption on the generation of the data matrix,
$\mathbf{X}_{n \times p}$, we know the sampling distribution of the vector of means:

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \\ \vdots \\ \hat{\mu}_p \end{bmatrix} \sim \mathrm{MVN}(\boldsymbol{\mu}, (1/n)\boldsymbol{\Sigma}).$$

In the univariate case, we have that $\hat{\mu} \sim N(\mu, (1/n)\sigma^2)$. As might be expected, a Central Limit Theorem (CLT) states that these results also hold asymptotically when the MVN assumption is relaxed. Also, if we estimate the variance-covariance matrix $\boldsymbol{\Sigma}$ with divisions by $n-1$ rather than $n$, and denote the result as $\hat{\boldsymbol{\Sigma}}$, then $E(\hat{\boldsymbol{\Sigma}}) = \boldsymbol{\Sigma}$.

Linear combinations of random variables form the backbone of all of multivariate statistics. For example, suppose $\mathbf{X}' = [X_1, X_2, \ldots, X_p] \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and consider the following $q$ linear combinations:

$$Z_1 = c_{11}X_1 + \cdots + c_{1p}X_p$$
$$\vdots$$
$$Z_q = c_{q1}X_1 + \cdots + c_{qp}X_p$$

Then the vector $\mathbf{Z}' = [Z_1, Z_2, \ldots, Z_q] \sim \mathrm{MVN}(\mathbf{C}\boldsymbol{\mu}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$, where

$$\mathbf{C}_{q \times p} = \begin{bmatrix} c_{11} & \cdots & c_{1p} \\ \vdots & & \vdots \\ c_{q1} & \cdots & c_{qp} \end{bmatrix}.$$

These same results hold in the sample if we observe the $q$ linear combinations, $[Z_1, Z_2, \ldots, Z_q]$: i.e., the sample mean vector is $\mathbf{C}\hat{\boldsymbol{\mu}}$ and the sample variance-covariance matrix is $\mathbf{C}\hat{\boldsymbol{\Sigma}}\mathbf{C}'$.
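The sample version of the linear-combination result can be verified numerically. In the sketch below (the matrix $\mathbf{C}$ and the population values are my own choices, not from the notes), the sample mean vector and covariance matrix of $\mathbf{Z} = \mathbf{C}\mathbf{X}$ match $\mathbf{C}\hat{\boldsymbol{\mu}}$ and $\mathbf{C}\hat{\boldsymbol{\Sigma}}\mathbf{C}'$ exactly, since in the sample this is an algebraic identity, not an approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 500, 3, 2
X = rng.multivariate_normal(mean=[1.0, 2.0, 3.0],
                            cov=[[2.0, 0.5, 0.3],
                                 [0.5, 1.0, 0.2],
                                 [0.3, 0.2, 1.5]], size=n)

C = np.array([[1.0, -1.0, 0.0],      # Z1 = X1 - X2
              [0.5,  0.5, 0.5]])     # Z2 = average of the three variables
Z = X @ C.T                          # each row of Z holds the q combinations

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X.T)              # divide-by-(n-1) estimate

# sample statistics of Z agree with C mu_hat and C Sigma_hat C'
assert np.allclose(Z.mean(axis=0), C @ mu_hat)
assert np.allclose(np.cov(Z.T), C @ Sigma_hat @ C.T)
```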
0.2 Multiple Regression and Partial Correlation (in the Population)

Suppose we partition our vector of $p$ random variables as follows:

$$\mathbf{X}' = [X_1, X_2, \ldots, X_p] = [[X_1, X_2, \ldots, X_k], [X_{k+1}, X_{k+2}, \ldots, X_p]] \equiv [\mathbf{X}_1', \mathbf{X}_2'].$$

Partitioning the mean vector and variance-covariance matrix in the same way,

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}; \quad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{12}' & \boldsymbol{\Sigma}_{22} \end{bmatrix},$$

we have $\mathbf{X}_1 \sim \mathrm{MVN}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$; $\mathbf{X}_2 \sim \mathrm{MVN}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$. $\mathbf{X}_1$ and $\mathbf{X}_2$ are statistically independent vectors if and only if $\boldsymbol{\Sigma}_{12} = \mathbf{0}_{k \times (p-k)}$, the $k \times (p-k)$ zero matrix.

What I call the Master Theorem refers to the conditional density of $\mathbf{X}_1$ given $\mathbf{X}_2 = \mathbf{x}_2$; it is multivariate normal with mean vector of order $k \times 1$:

$$\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),$$

and variance-covariance matrix of order $k \times k$:

$$\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{12}'.$$

If we denote the $(i,j)$th partial covariance element in this latter matrix ("holding all variables in the second set constant") as $\sigma_{ij \cdot (k+1) \cdots p}$, then the partial correlation of variables $i$ and $j$, holding all variables in the second set constant, is

$$\rho_{ij \cdot (k+1) \cdots p} = \sigma_{ij \cdot (k+1) \cdots p} \big/ \sqrt{\sigma_{ii \cdot (k+1) \cdots p}\,\sigma_{jj \cdot (k+1) \cdots p}}.$$
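The Master Theorem computations can be sketched numerically. In the toy example below (the covariance matrix is my own, not from the notes), the first set is $\{X_1, X_2\}$, the second set is $\{X_3\}$, and the partial correlation taken from the conditional covariance matrix is checked against the familiar closed form for $\rho_{12 \cdot 3}$:

```python
import numpy as np

# population covariance matrix for [X1, X2, X3]; first set = {X1, X2}, second = {X3}
Sigma = np.array([[4.0, 2.0, 1.5],
                  [2.0, 3.0, 1.0],
                  [1.5, 1.0, 2.0]])
S11 = Sigma[:2, :2]
S12 = Sigma[:2, 2:]
S22 = Sigma[2:, 2:]

# conditional (partial) covariance matrix of [X1, X2] given X3 = x3
S_cond = S11 - S12 @ np.linalg.inv(S22) @ S12.T

# partial correlation of X1 and X2, holding X3 constant
rho_12_3 = S_cond[0, 1] / np.sqrt(S_cond[0, 0] * S_cond[1, 1])

# closed-form check: rho_12.3 = (r12 - r13 r23) / sqrt((1 - r13^2)(1 - r23^2))
d = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(d, d)
check = (R[0, 1] - R[0, 2] * R[1, 2]) / np.sqrt((1 - R[0, 2]**2) * (1 - R[1, 2]**2))
assert np.isclose(rho_12_3, check)
```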
Notice that in the formulas just given, the inverse of $\boldsymbol{\Sigma}_{22}$ must exist; otherwise, we have what is called multicollinearity, and solutions don't exist (or are very unstable numerically if the inverse almost doesn't exist). So, as one moral: don't use both total and subtest scores in the same set of independent variables.

If the set $\mathbf{X}_1$ contains just one random variable, $X_1$, then the mean of the conditional distribution can be given as

$$E(X_1) + \boldsymbol{\sigma}_{12}'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),$$

where $\boldsymbol{\sigma}_{12}'$ is $1 \times (p-1)$ and of the form $[\mathrm{Cov}(X_1, X_2), \ldots, \mathrm{Cov}(X_1, X_p)]$. This is nothing but our old regression equation written out in matrix notation. If we let $\boldsymbol{\beta}' = \boldsymbol{\sigma}_{12}'\boldsymbol{\Sigma}_{22}^{-1}$, then the conditional mean (the predicted $X_1$) is equal to

$$E(X_1) + \boldsymbol{\beta}'(\mathbf{x}_2 - \boldsymbol{\mu}_2) = E(X_1) + \beta_1(x_2 - \mu_2) + \cdots + \beta_{p-1}(x_p - \mu_p).$$

The covariance matrix, $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{12}'$, takes on the form $\mathrm{Var}(X_1) - \boldsymbol{\sigma}_{12}'\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{12}$, i.e., the variation in $X_1$ that is not explained. If we take the explained variation, $\boldsymbol{\sigma}_{12}'\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{12}$, as a proportion of the total variance, $\mathrm{Var}(X_1)$, we define the squared multiple correlation coefficient:

$$\rho^2_{1 \cdot 2 \cdots p} = \frac{\boldsymbol{\sigma}_{12}'\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{12}}{\mathrm{Var}(X_1)}.$$

In fact, of all linear combinations of the variables in $\mathbf{X}_2$, the combination $\boldsymbol{\beta}'\mathbf{X}_2$ has the highest correlation with $X_1$; and this correlation is the positive square root of the squared multiple correlation coefficient.

0.3 Moving to the Sample

Up to this point, the concepts of regression and kindred ideas have been discussed only in terms of population parameters. We now have
the task of obtaining estimates of these various quantities based on a sample; also, associated inference procedures need to be developed. Generally, we will rely on maximum-likelihood estimation and the related likelihood-ratio tests.

To define how maximum-likelihood estimation proceeds, we will first give a series of general steps, and then operationalize them for a univariate normal distribution example.

A) Let $X_1, \ldots, X_n$ be univariate observations on $X$ (independent and continuous), depending on some parameters $\theta_1, \ldots, \theta_k$. The density function of $X_i$ is denoted as $f(x_i; \theta_1, \ldots, \theta_k)$.

B) The likelihood of observing $x_1, \ldots, x_n$ as the values of $X_1, \ldots, X_n$ is the joint density of the $n$ observations, and because the observations are independent,

$$L(\theta_1, \ldots, \theta_k) \equiv \prod_{i=1}^{n} f(x_i; \theta_1, \ldots, \theta_k),$$

i.e., we assume $x_1, \ldots, x_n$ are already observed, and that $L$ is a function of the parameters only. Now, choose parameter values such that $L(\theta_1, \ldots, \theta_k)$ is at a maximum, i.e., the probability of observing that particular sample is maximized by the choice of $\theta_1, \ldots, \theta_k$. Generally, it is easier, and equivalent, to maximize the log-likelihood, $l(\theta_1, \ldots, \theta_k) \equiv \log(L(\theta_1, \ldots, \theta_k))$.

C) Generally, the maximum values are found through differentiation, by setting the partial derivatives equal to zero. Explicitly,
we find $\theta_1, \ldots, \theta_k$ such that

$$\frac{\partial}{\partial\theta_j} l(\theta_1, \ldots, \theta_k) = 0, \quad \text{for } j = 1, \ldots, k.$$

(We should probably check second derivatives to see if we have a maximum, but this is almost always true anyway.)

Example: Suppose $X_i \sim N(\mu, \sigma^2)$, with density

$$f(x_i; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\{-(x_i - \mu)^2/2\sigma^2\}.$$

The likelihood has the form

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\exp\{-(x_i - \mu)^2/2\sigma^2\} = (1/\sqrt{2\pi})^n (1/\sigma^2)^{n/2}\exp\{-(1/2\sigma^2)\sum_{i=1}^{n}(x_i - \mu)^2\}.$$

The log-likelihood reduces:

$$l(\mu, \sigma^2) = \log L(\mu, \sigma^2) = \log(1/\sqrt{2\pi})^n + \log(1/\sigma^2)^{n/2} - (1/2\sigma^2)\sum_{i=1}^{n}(x_i - \mu)^2$$
$$= \text{constant} - (n/2)\log\sigma^2 - (1/2\sigma^2)\sum_{i=1}^{n}(x_i - \mu)^2.$$

The partial derivatives have the form:

$$\frac{\partial l(\mu, \sigma^2)}{\partial\mu} = -(1/2\sigma^2)\sum_{i=1}^{n} 2(x_i - \mu)(-1),$$
and

$$\frac{\partial l(\mu, \sigma^2)}{\partial\sigma^2} = -(n/2)(1/\sigma^2) - (1/2(\sigma^2)^2)(-1)\sum_{i=1}^{n}(x_i - \mu)^2.$$

Setting these two expressions to zero gives

$$\hat{\mu} = \sum_{i=1}^{n} x_i/n; \quad \hat{\sigma}^2 = (1/n)\sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$

Maximum likelihood (ML) estimates are generally consistent, asymptotically normal (for large $n$), and efficient; they are functions of the sufficient statistics; and they have an invariance to operations performed on them, e.g., the ML estimate for $\sigma$ is just $\sqrt{\hat{\sigma}^2}$. As the ML estimator for $\sigma^2$ shows, ML estimators are not necessarily unbiased.

Another Example: Suppose $X_1, \ldots, X_n$ are observations on the Poisson discrete random variable $X$ (having outcomes $0, 1, \ldots$):

$$P(X_i = x_i) = \frac{\exp(-\lambda)\lambda^{x_i}}{x_i!}.$$

The likelihood is

$$L(\lambda) = \prod_{i=1}^{n} P(X_i = x_i) = \frac{\exp(-n\lambda)\lambda^{\sum_i x_i}}{\prod_i x_i!},$$

and the log-likelihood

$$l(\lambda) = \log L(\lambda) = -n\lambda + \log(\lambda^{\sum_i x_i}) - \log\prod_i x_i!.$$

The partial derivative

$$\frac{\partial l(\lambda)}{\partial\lambda} = -n + (1/\lambda)\sum_{i=1}^{n} x_i,$$
and when set to zero, gives

$$\hat{\lambda} = \sum_{i=1}^{n} x_i/n.$$

0.3.1 Likelihood Ratio Tests

Besides using the likelihood concept to find good point estimates, the likelihood can also be used to find good tests of hypotheses. We will develop this idea in terms of a simple example: Suppose $X_1 = x_1, \ldots, X_n = x_n$ denote values obtained on $n$ independent observations on a $N(\mu, \sigma^2)$ random variable, where $\sigma^2$ is assumed known. The standard test of $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ would compare $(\bar{x} - \mu_0)^2/(\sigma^2/n)$ to a $\chi^2$ random variable with one degree of freedom. Or, if we chose the significance level to be .05, we could reject $H_0$ if $|\bar{x} - \mu_0|/(\sigma/\sqrt{n}) \ge 1.96$.

We now develop this same test using likelihood ideas: The likelihood of observing $x_1, \ldots, x_n$, $L(\mu)$, is a function of $\mu$ only (because $\sigma^2$ is assumed to be known):

$$L(\mu) = (1/\sigma\sqrt{2\pi})^n \exp\{-(1/2\sigma^2)\sum_{i=1}^{n}(x_i - \mu)^2\}.$$

Under $H_0$, $L(\mu)$ is at a maximum for $\mu = \mu_0$; under $H_1$, $L(\mu)$ is at a maximum for $\mu = \hat{\mu}$, the maximum likelihood estimate. If $H_0$ is true, $L(\hat{\mu})$ and $L(\mu_0)$ should be close to each other; if $H_0$ is false, $L(\hat{\mu})$ should be much larger than $L(\mu_0)$. Thus, the decision rule would be to reject $H_0$ if $L(\mu_0)/L(\hat{\mu}) \le \lambda$, where $\lambda$ is some number less than 1.00, chosen to obtain a particular $\alpha$ level. The ratio is

$$L(\mu_0)/L(\hat{\mu}) = \exp\{-\tfrac{1}{2}(\hat{\mu} - \mu_0)^2/(\sigma^2/n)\}.$$

Thus, we could rephrase the decision rule: reject $H_0$ if $(\hat{\mu} - \mu_0)^2/(\sigma^2/n) \ge$
$-2\log(\lambda)$, or if $|\bar{x} - \mu_0|/(\sigma/\sqrt{n}) \ge \sqrt{-2\log\lambda}$. Thus, for a .05 level of significance, choose $\lambda$ so that $1.96 = \sqrt{-2\log\lambda}$.

Generally, we can phrase likelihood-ratio tests as:

$$-2\log(\text{likelihood ratio}) \sim \chi^2_{\nu - \nu_0},$$

where $\nu$ is the dimension of the parameter space generally, and $\nu_0$ is the dimension under $H_0$.

0.3.2 Estimation

To obtain estimates of the various quantities we need, merely replace the variances and covariances by their maximum likelihood estimates. This process generates sample partial correlations and covariances; sample squared multiple correlations; sample regression parameters; and so on.
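The likelihood-ratio machinery developed above can be illustrated numerically. In the sketch below (simulated data of my own, not from the notes), the statistic $-2\log(L(\mu_0)/L(\hat{\mu}))$ is computed directly from the log-likelihood and shown to equal the usual $(\bar{x} - \mu_0)^2/(\sigma^2/n)$ statistic, exactly as the algebra above promises:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma2, n = 1.3, 4.0, 50
x = rng.normal(mu_true, np.sqrt(sigma2), size=n)

mu_hat = x.mean()                    # ML estimate of mu (sigma^2 known)

def log_lik(mu):
    # log-likelihood of a N(mu, sigma2) sample with sigma2 known
    return (-n / 2) * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu0 = 0.0                                        # hypothesized value under H0
lr_stat = -2 * (log_lik(mu0) - log_lik(mu_hat))  # -2 log(L(mu0)/L(mu_hat))
z2 = (mu_hat - mu0) ** 2 / (sigma2 / n)          # the standard chi-square(1) statistic

assert np.isclose(lr_stat, z2)       # the two test statistics coincide
# reject H0 at the .05 level when lr_stat >= 1.96**2 (about 3.84)
reject = lr_stat >= 1.96 ** 2
```

The identity holds because $-2[l(\mu_0) - l(\hat{\mu})] = [\sum(x_i - \mu_0)^2 - \sum(x_i - \hat{\mu})^2]/\sigma^2 = n(\hat{\mu} - \mu_0)^2/\sigma^2$.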