B90.330 / C.005 NOTES for Wednesday 0.APR.7

Suppose that the model is Y = X β + ε, but ε does not have the desired variance matrix. Say that ε is normal, but Var(ε) = σ² W. The form of W is

    W = [ w₁  0   0   …  0
          0   w₂  0   …  0
          0   0   w₃  …  0
          …
          0   0   0   …  wₙ ]

We assume that we know the wᵢ's, but we don't know σ². This would be described as W = diag(w₁, w₂, w₃, …, wₙ).

The solution is easy. Define W^(-1/2) = diag(1/√w₁, 1/√w₂, 1/√w₃, …, 1/√wₙ). Then W^(-1/2) Y = W^(-1/2) X β + W^(-1/2) ε. The distribution of W^(-1/2) ε is normal with mean 0 and variance σ² I. This gets us right back to the standard model. The problem is called weighted least squares.

The same idea works if Var(ε) = σ² W in which W is some other known symmetric positive definite matrix. The only hangup is that W^(-1/2) is more difficult to define and calculate. This problem is known as generalized least squares.
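Here is a minimal numerical sketch of the weighted least squares recipe above (the data, weights, and variable names are invented for illustration): scaling each row of Y and X by 1/√wᵢ and running ordinary least squares gives the same answer as the closed-form estimate (X′W⁻¹X)⁻¹X′W⁻¹Y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: Var(eps_i) = sigma^2 * w_i with known w_i (here sigma^2 = 1).
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
w = rng.uniform(0.5, 4.0, size=n)               # the known w_i's
y = X @ beta + rng.normal(scale=np.sqrt(w))

# Weighted least squares by the transformation in the notes:
# multiply each row of (y, X) by 1/sqrt(w_i), then run ordinary least squares.
scale = 1.0 / np.sqrt(w)
b_wls, *_ = np.linalg.lstsq(X * scale[:, None], y * scale, rcond=None)

# Equivalent closed form: b = (X' W^{-1} X)^{-1} X' W^{-1} y
Winv = np.diag(1.0 / w)
b_closed = np.linalg.solve(X.T @ Winv @ X, X.T @ Winv @ y)

print(np.allclose(b_wls, b_closed))   # True
```

For generalized least squares with a non-diagonal W, the same idea goes through with a matrix square root of W (for example a Cholesky factor) in place of the 1/√wᵢ scaling.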
Next is material on matrix rank. Please see the handout.

We need to ask what we lose if X does not have full rank. It's simple. The fitted vector Ŷ is still unique. The residual sum of squares is identified (though its degrees of freedom get confused). The estimated coefficient b is not identifiable.

Let's do the definition of positive definite. For ordinary numbers, the concept a > 0 is very clear. What should we mean by having a positive matrix? It's too much to ask that every entry be positive. We'll restrict our notions here to square matrices. There is no material consequence to the distinction symmetric versus non-symmetric, so we'll assume symmetric matrices.

An n × n matrix W is called positive definite if

    ( x is n × 1 and x ≠ 0 )  ⇒  ( x′ W x > 0 )

The matrix

    W = [ 4  0
          0  1 ]

is positive definite. If x = (x₁, x₂)′ and x ≠ 0, then x′ W x = 4x₁² + x₂² > 0. A second example, with 40 in the lower-right position and nonzero off-diagonal entries, is positive definite as well, and the check is the same: expand x′ W x and see that it is positive for every x ≠ 0.

An n × n matrix W is called positive semi-definite if

    ( x is n × 1 )  ⇒  ( x′ W x ≥ 0 )

The concepts negative definite and negative semi-definite are defined in the obvious way, but these ideas are much less useful.
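As a small computational aside (the matrices and vectors below are illustrative), positive definiteness of a symmetric matrix can be checked numerically by confirming that all eigenvalues are positive, which is equivalent to the quadratic-form condition above; eigenvalues are discussed on the following pages.

```python
import numpy as np

# Check positive definiteness of a symmetric matrix: all eigenvalues > 0,
# which is equivalent to x'Wx > 0 for every nonzero x.
def is_positive_definite(W):
    return bool(np.all(np.linalg.eigvalsh(W) > 0))

W1 = np.array([[4.0, 0.0],
               [0.0, 1.0]])
print(is_positive_definite(W1))      # True

# Spot-check the quadratic form x'Wx directly on many random nonzero vectors.
rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 2))
forms = np.einsum('ij,jk,ik->i', x, W1, x)   # the value x'W1x for each row x
print(np.all(forms > 0))                     # True
```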
There are these results about matrices that should be part of the knowledge base of every statistical person.

*  Covariance matrices are automatically positive semi-definite.

*  If Var(Y) = Ω, then Var(a′ Y) = a′ Ω a ≥ 0.

*  If there is a vector a ≠ 0 for which Var(a′ Y) = a′ Ω a = 0, then the linear combination a′ Y is a constant. Equivalently, the matrix Ω must be singular.

*  Square matrices have eigenvalues. An eigenvalue is any λ for which there is a non-zero vector u for which Ω u = λ u. Rewrite this as (Ω − λI) u = 0. Thus λ is an eigenvalue if and only if (Ω − λI) is a singular matrix. One consequence is that det(Ω − λI) = 0 is a condition that can be used to find eigenvalues. We won't prove that here. Also, this is usually not the best computational way to find eigenvalues, at least for large matrices. In passing, we observe that an n × n matrix has n eigenvalues, not necessarily all different.

Do eigenvalues have to exist? The matrix

    Ω = [ cos θ   −sin θ
          sin θ    cos θ ]

is a pure rotation. You can check that Ω u rotates u = (u₁, u₂)′ by angle θ. Consider then the condition det(Ω − λI) = 0. The condition is then

    det [ cos θ − λ   −sin θ
          sin θ        cos θ − λ ]  =  0

    (cos θ − λ)² + sin² θ = 0

This is cos² θ − 2λ cos θ + λ² + sin² θ = 0, or λ² − 2λ cos θ + 1 = 0. The roots for λ are

    λ = [ 2 cos θ ± √(4 cos² θ − 4) ] / 2 = cos θ ± √(cos² θ − 1) = cos θ ± √(−sin² θ) = cos θ ± i sin θ

It looks like eigenvalues can be imaginary!
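A quick numerical confirmation (the angle θ = π/3 is an arbitrary choice for illustration): the eigenvalues of the rotation matrix come out as the complex pair cos θ ± i sin θ.

```python
import numpy as np

# Eigenvalues of a pure rotation matrix; theta = pi/3 is just an illustrative angle.
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

vals = np.linalg.eigvals(R)
print(vals)   # approximately cos(theta) +/- i*sin(theta)
print(np.allclose(np.sort_complex(vals),
                  np.sort_complex(np.array([np.cos(theta) + 1j * np.sin(theta),
                                            np.cos(theta) - 1j * np.sin(theta)]))))   # True
```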
As a fun little proof, you can show that symmetric matrices only have real eigenvalues. (It's assumed that the matrix in question has all real entries.) Here's how. Say that A′ = A and that u is an eigenvector with eigenvalue λ. While A has all real entries, we have made no such assumptions on λ or u. Now A u = λ u. Write this in transpose form as u′ A = λ u′. Take complex conjugates throughout. The bar denotes complex conjugate; thus the conjugate of a + ib is a − ib. This operation for matrices and vectors is entry-by-entry. Note this about squared length: ū′ u = ‖u‖², which is a positive real number since u ≠ 0. Then ū′ A = λ̄ ū′. Multiply this into u, getting

    ū′ A u = λ̄ ū′ u = λ̄ ‖u‖²

In the original A u = λ u, multiply on the left by ū′ to get

    ū′ A u = ū′ λ u = λ ū′ u = λ ‖u‖²

This gets to λ̄ = λ, so λ must be a real number.

*  Matrix Ω is positive semi-definite if and only if all its eigenvalues are ≥ 0. An eigenvalue is any λ for which there is a non-zero vector u for which Ω u = λ u. Take this condition and multiply on the left by u′ to get u′ Ω u = λ u′ u = λ ‖u‖². Since ‖u‖², the squared length of u, is non-negative, u′ Ω u and λ have the same sign: if Ω is positive semi-definite then every eigenvalue satisfies λ ≥ 0, and conversely if every eigenvalue satisfies λ ≥ 0 then u′ Ω u ≥ 0 at each eigenvector. Every vector is a linear combination of the eigenvectors (for a symmetric Ω the eigenvectors can be taken orthogonal), so the result is true in general.

*  If M is idempotent, then its eigenvalues are all 0's and 1's. This is easy to prove. Suppose that M u = λ u. Multiply both sides by M to get M M u = M (λ u) = λ M u = λ (λ u) = λ² u. But M M = M, so M M u = M u = λ u. This shows that λ² = λ, which is satisfied only for 0 and 1. (See the numerical sketch below.)
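A numerical illustration of the last two bullets (the matrices below are randomly generated and purely illustrative): a matrix of the form B′B behaves like a covariance matrix, with real non-negative eigenvalues, and the least squares projection ("hat") matrix H = X(X′X)⁻¹X′ is idempotent with eigenvalues 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# A covariance-style matrix B'B is symmetric positive semi-definite:
# its eigenvalues are real and non-negative.
B = rng.normal(size=(10, 4))
Omega = B.T @ B
print(np.all(np.linalg.eigvalsh(Omega) >= -1e-12))   # True (allowing rounding error)

# The projection ("hat") matrix H = X (X'X)^{-1} X' is idempotent,
# so its eigenvalues are all 0's and 1's.
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(H @ H, H))                                          # idempotent
print(np.allclose(np.sort(np.linalg.eigvalsh(H)), [0]*7 + [1]*3))     # seven 0's, three 1's
```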
See the handout on the Gauss-Markov theorem. The important finding is that the least squares estimate b = (X′X)⁻¹ X′ Y is BLUE (best linear unbiased estimate). It's important to note that this

*  uses Var(ε) = σ² I
*  does not use normality (it's only about first and second moments)

We have a nice handout on the reduced sum of squares. Here's a matrix version. Let's suppose the usual model Y = X β + ε. Let's suppose that the matrix X is partitioned as

    X = [ U   V ]    with X of size n×p, U of size n×s, and V of size n×(p−s)

We'll suppose that the column of 1's is part of U. In similar style, partition the coefficient vector as

    β = [ β_U ]    with β of size p×1, β_U of size s×1, and β_V of size (p−s)×1
        [ β_V ]

We'll assume, to preserve sanity, that everything has full rank. We'd like to test the null hypothesis H₀: β_V = 0 versus H₁: β_V ≠ 0. The likelihood ratio test works very well for this. This comes down to comparing the two residual sums of squares:
    SS_Resid(H₀), based on the model Y = U β_U + ε

    SS_Resid(H₁), based on the model Y = X β + ε = [ U  V ] [ β_U ] + ε
                                                            [ β_V ]

Observe that SS_Resid(H₀) ≥ SS_Resid(H₁). The difference SS_Resid(H₀) − SS_Resid(H₁) will be distributed as σ² χ² with p − s degrees of freedom when H₀ is true. This will be independent of SS_Resid(H₁), as it is in an orthogonal space. Of course SS_Resid(H₁) ~ σ² χ² with n − p degrees of freedom. These two together give us the basis for the partial F test. This test is

    F = { [ SS_Resid(H₀) − SS_Resid(H₁) ] / (p − s) } / MS_Resid(H₁)

and it has (p − s, n − p) degrees of freedom. Since SS_Total = SS_Regr + SS_Resid (for any model), the test can also be written as

    F = { [ SS_Regr(H₁) − SS_Regr(H₀) ] / (p − s) } / MS_Resid(H₁)
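Here is a minimal sketch of the partial F computation (the sample size, dimensions, and simulated data are invented for illustration; s counts the columns of U, including the column of 1's).

```python
import numpy as np

rng = np.random.default_rng(3)

n, s, p = 60, 2, 4                      # U has s columns, the full X has p columns
U = np.column_stack([np.ones(n), rng.normal(size=(n, s - 1))])
V = rng.normal(size=(n, p - s))
X = np.column_stack([U, V])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)   # beta_V = 0, so H0 holds

def ss_resid(design, y):
    # residual sum of squares from a least squares fit to the given design matrix
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.sum((y - fitted) ** 2)

ss0 = ss_resid(U, y)                    # SS_Resid(H0): model with U only
ss1 = ss_resid(X, y)                    # SS_Resid(H1): full model
F = ((ss0 - ss1) / (p - s)) / (ss1 / (n - p))
print(F)                                # refer to the F distribution with (p - s, n - p) df
```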
Let's make a note on conditional mean and variance. Suppose that random variables X and Y have a joint density f_{X,Y}(x, y). The conditional mean E(Y | X = x) and conditional variance Var(Y | X = x) will come from the conditional density

    f_{Y|X}(y | x) = f_{X,Y}(x, y) / f_X(x)

What are these things for the normal distribution? Let's suppose that W is a k×1 random multivariate normal vector with mean E W = μ and Var(W) = Σ. This use of Σ is fairly common for this context. The density is

    f(w) = 1 / [ (2π)^(k/2) |Σ|^(1/2) ]  exp{ −½ (w − μ)′ Σ⁻¹ (w − μ) }

In the special case with two coordinates (k = 2) for the bivariate random variable (X, Y), write the covariance matrix as

    Σ = [ σ_X²        ρ σ_X σ_Y ]
        [ ρ σ_X σ_Y   σ_Y²      ]

The inverse is

    Σ⁻¹ = 1 / [ σ_X² σ_Y² (1 − ρ²) ]  [ σ_Y²         −ρ σ_X σ_Y ]
                                      [ −ρ σ_X σ_Y   σ_X²       ]

Then the density is

    f_{X,Y}(x, y) = 1 / [ 2π σ_X σ_Y √(1 − ρ²) ]  exp{ −z / [2(1 − ρ²)] }

where

    z = (y − μ_Y)²/σ_Y²  −  2ρ (y − μ_Y)(x − μ_X)/(σ_X σ_Y)  +  (x − μ_X)²/σ_X²

The marginal distribution of X is normal, with mean μ_X and variance σ_X². So here is the conditional distribution of Y | X = x:
    f_{Y|X}(y | x) = f_{X,Y}(x, y) / f_X(x)

                  = { 1 / [ 2π σ_X σ_Y √(1 − ρ²) ] exp{ −z / [2(1 − ρ²)] } }  /  { 1 / [ √(2π) σ_X ] exp{ −(x − μ_X)² / (2σ_X²) } }

                  = 1 / [ √(2π) σ_Y √(1 − ρ²) ]  exp{ −½ [ z/(1 − ρ²) − (x − μ_X)²/σ_X² ] }

We need to examine the exponent.

    z/(1 − ρ²) − (x − μ_X)²/σ_X²

      = 1/(1 − ρ²) [ (y − μ_Y)²/σ_Y² − 2ρ (y − μ_Y)(x − μ_X)/(σ_X σ_Y) + (x − μ_X)²/σ_X² − (1 − ρ²)(x − μ_X)²/σ_X² ]

      = 1/(1 − ρ²) [ (y − μ_Y)²/σ_Y² − 2ρ (y − μ_Y)(x − μ_X)/(σ_X σ_Y) + ρ² (x − μ_X)²/σ_X² ]

      = 1/[ σ_Y² (1 − ρ²) ] [ (y − μ_Y)² − 2ρ (σ_Y/σ_X)(y − μ_Y)(x − μ_X) + ρ² (σ_Y²/σ_X²)(x − μ_X)² ]

      = 1/[ σ_Y² (1 − ρ²) ] [ (y − μ_Y) − ρ (σ_Y/σ_X)(x − μ_X) ]²

As a convenient notation, let's use β = ρ σ_Y / σ_X. Finally,

    f_{Y|X}(y | x) = 1 / [ √(2π) σ_Y √(1 − ρ²) ]  exp{ − [ (y − μ_Y) − β (x − μ_X) ]² / [ 2 σ_Y² (1 − ρ²) ] }

This conditional density is of course normal. The mean, meaning E(Y | X = x), is μ_Y + β (x − μ_X). The variance, meaning Var(Y | X = x), is σ_Y² (1 − ρ²).
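A quick simulation check of this conditional mean and variance (the parameter values are arbitrary, and conditioning on X = x is approximated by keeping draws with X in a narrow window around x):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative bivariate normal parameters.
mu_x, mu_y = 1.0, 2.0
sd_x, sd_y, rho = 1.5, 0.8, 0.6
Sigma = np.array([[sd_x**2,           rho * sd_x * sd_y],
                  [rho * sd_x * sd_y, sd_y**2]])
xy = rng.multivariate_normal([mu_x, mu_y], Sigma, size=200_000)

# Condition on X near x0 by keeping the draws whose X falls in a narrow window.
x0 = 2.0
keep = np.abs(xy[:, 0] - x0) < 0.05
y_given_x = xy[keep, 1]

beta = rho * sd_y / sd_x
print(y_given_x.mean(), mu_y + beta * (x0 - mu_x))   # conditional mean: close
print(y_given_x.var(), sd_y**2 * (1 - rho**2))       # conditional variance: close
```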
This has many links to the model for simple linear regression!

*  The model is Y_i = β₀ + β₁ x_i + ε_i, where the ε_i's are independent, each with mean zero and with variance σ².

*  The estimate of slope is b₁ = S_xy / S_xx = (S_xy / n) / (S_xx / n). Observe that the numerator S_xy / n estimates Cov(X, Y) = ρ σ_X σ_Y, and the denominator S_xx / n estimates σ_X². Thus b₁ estimates ρ σ_X σ_Y / σ_X² = ρ σ_Y / σ_X = β. (See the numerical sketch after this list.)

*  Var(ε) = σ² = σ_Y² (1 − ρ²) = Var(Y | X = x). Curiously, this does not depend on the value (lower case) x on which the conditioning was done. (More just below.) In a simple linear regression R² = r², so that (1 − R²) Var(Y) estimates Var(Y | X = x). This is almost the definition of adjusted R².

*  The simple linear regression model gives the same variance to every ε_i, so that this model is consistent with a bivariate normal distribution. That is, the bivariate normal satisfies the condition that Var(Y | X = x) is the same for all (lower case) x. If it happens that Var(Y | X = x) seems to depend on x, then we try to transform the problem to achieve equi-variance.
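Here is the promised sketch (simulated bivariate normal data with arbitrary parameter values): the least squares slope lands near ρ σ_Y / σ_X and the residual variance near σ_Y² (1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated bivariate normal data; parameters chosen only for illustration.
mu_x, mu_y, sd_x, sd_y, rho = 0.0, 10.0, 2.0, 3.0, 0.7
Sigma = np.array([[sd_x**2,           rho * sd_x * sd_y],
                  [rho * sd_x * sd_y, sd_y**2]])
x, y = rng.multivariate_normal([mu_x, mu_y], Sigma, size=100_000).T

# Simple linear regression by the usual Sxy / Sxx formulas.
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

print(b1, rho * sd_y / sd_x)                 # slope: close to 1.05
print(resid.var(), sd_y**2 * (1 - rho**2))   # residual variance: close to 4.59
```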
Suppose that we have three or more coordinates to our multivariate normal random vector. We would then want Corr(X₁, X₂ | X₃ = x₃) to be the correlation in the conditional distribution. The result that makes all this go through is this one. Suppose that

    [ Y ]    with Y of size q×1 and X of size r×1
    [ X ]

is a (q + r)×1 multivariate normal random vector with mean

    [ μ_Y ]
    [ μ_X ]

and with variance matrix

    Σ = [ Σ_YY  Σ_YX ]    with blocks of sizes q×q, q×r, r×q, and r×r
        [ Σ_XY  Σ_XX ]

The conditional distribution of Y, given X = x, is then multivariate normal with mean

    μ_Y + Σ_YX Σ_XX⁻¹ (x − μ_X)

and with variance

    Σ_YY − Σ_YX Σ_XX⁻¹ Σ_XY

It should be regarded as amazing (!!!) that the conditional variance matrix does not depend on x, the value on which the conditioning was done.

Suppose that r = 1, so that X just represents one variable. Then Σ_XX is just a scalar, and Σ_XX⁻¹ is trivial to compute. This idea greatly simplifies the calculation of stepwise regression.

See the handout on the column space of X. Here's the definition:

    col(X) = { X a | a is any p-by-1 vector }

This is, of course, a p-dimensional space, since there are exactly p free choices in the vector a. Since E Y = X β, the expected value of the data vector Y lies in a p-dimensional space. However, Y contains n pieces of information, and n is much greater than p. The remaining n − p pieces of information in Y can be relevant only to noise, and indeed will furnish the basis for estimating σ, the noise standard deviation. This is exactly the reason that the residual line in the analysis of variance table has n − p = n − K − 1 degrees of freedom.
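Here is a minimal sketch of the partitioned-normal result above (the mean vector, covariance matrix, and block sizes are made up for illustration).

```python
import numpy as np

# Conditional mean and variance of Y given X = x for a partitioned multivariate normal:
# Y is the first q coordinates, X is the remaining r coordinates.
def conditional_normal(mu, Sigma, q, x):
    mu_y, mu_x = mu[:q], mu[q:]
    S_yy, S_yx = Sigma[:q, :q], Sigma[:q, q:]
    S_xy, S_xx = Sigma[q:, :q], Sigma[q:, q:]
    cond_mean = mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)   # mu_Y + S_YX S_XX^{-1}(x - mu_X)
    cond_var = S_yy - S_yx @ np.linalg.solve(S_xx, S_xy)        # S_YY - S_YX S_XX^{-1} S_XY
    return cond_mean, cond_var

mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.6, 0.4],
                  [0.6, 1.5, 0.3],
                  [0.4, 0.3, 1.0]])
mean, var = conditional_normal(mu, Sigma, q=2, x=np.array([0.5]))
print(mean)   # conditional mean of the first two coordinates given X = 0.5
print(var)    # conditional variance matrix; note it does not involve x at all
```

Using np.linalg.solve with Σ_XX, rather than forming its inverse explicitly, is the usual numerical choice; when r = 1, Σ_XX is a 1×1 block and the same code handles it.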