Stat 511 Outline (Spring 2008 Revision of 2004 Version)


Steve Vardeman
Iowa State University
February 1, 2008

Abstract

This outline summarizes the main points of lectures based on Ken Koehler's class notes and other sources.

1 Linear Models

The basic linear model structure is
$$ Y = X\beta + \epsilon \qquad (1) $$
for $Y$ an $n \times 1$ vector of observables, $X$ an $n \times k$ matrix of known constants, $\beta$ a $k \times 1$ vector of (unknown) constants (parameters), and $\epsilon$ an $n \times 1$ vector of unobservable random errors. Almost always one assumes that $E\epsilon = 0$. Often one also assumes that for an unknown constant (a parameter) $\sigma^2 > 0$, $\mathrm{Var}\,\epsilon = \sigma^2 I$ (these are the Gauss-Markov model assumptions) or, somewhat more generally, assumes that $\mathrm{Var}\,\epsilon = \eta^2 V$ (these are the Aitken model assumptions). These assumptions can be phrased as: the mean vector $EY$ is in the column space of the matrix $X$ ($EY \in C(X)$) and the variance-covariance matrix $\mathrm{Var}\,Y$ is known up to a multiplicative constant.

1.1 Ordinary Least Squares

The ordinary least squares estimate of $EY = X\beta$ is made by minimizing
$$ \left(Y - \hat{Y}\right)'\left(Y - \hat{Y}\right) $$
over choices of $\hat{Y} \in C(X)$; $\hat{Y}$ is then the (perpendicular) projection of $Y$ onto $C(X)$. This is minimization of the squared distance between $Y$ and a $\hat{Y}$ belonging to $C(X)$. Computation of this projection can be accomplished using a (unique) projection matrix $P_X$ as
$$ \hat{Y} = P_X Y. $$

There are various ways of constructing $P_X$. One is as
$$ P_X = X (X'X)^- X' $$
for $(X'X)^-$ any generalized inverse of $X'X$. As it turns out, $P_X$ is both symmetric and idempotent. It is sometimes called the "hat matrix" and written as $H$ rather than $P_X$ (it is used to compute the "$y$ hats").

The vector
$$ e = Y - \hat{Y} = (I - P_X) Y $$
is the vector of residuals. As it turns out, the matrix $I - P_X$ is also a perpendicular projection matrix. It projects onto the subspace of $R^n$ consisting of all vectors perpendicular to the elements of $C(X)$. That is, $I - P_X$ projects onto
$$ C(X)^\perp = \{ u \in R^n \mid u'v = 0 \ \forall v \in C(X) \}. $$
It is the case that $C(X)^\perp = C(I - P_X)$ and
$$ \mathrm{rank}(X) = \mathrm{rank}(X'X) = \mathrm{rank}(P_X) = \text{dimension of } C(X) = \mathrm{trace}(P_X) $$
and
$$ \mathrm{rank}(I - P_X) = \text{dimension of } C(X)^\perp = \mathrm{trace}(I - P_X) $$
and
$$ n = \mathrm{rank}(I) = \mathrm{rank}(P_X) + \mathrm{rank}(I - P_X). $$
Further, there is the Pythagorean Theorem/ANOVA identity
$$ Y'Y = (P_X Y)'(P_X Y) + ((I - P_X) Y)'((I - P_X) Y) = \hat{Y}'\hat{Y} + e'e. $$

When $\mathrm{rank}(X) = k$ (one has a full rank $X$) every $w \in C(X)$ has a unique representation as a linear combination of the columns of $X$. In this case there is a unique $b$ that solves
$$ Xb = P_X Y = \hat{Y} = \widehat{X\beta}. \qquad (2) $$
We can call this solution of equation (2) the ordinary least squares estimate of $\beta$. Notice that we then have
$$ X b_{OLS} = P_X Y = X (X'X)^- X' Y $$
so that
$$ X'X b_{OLS} = X'X (X'X)^- X' Y = X'Y. $$
These are the so-called normal equations. $X'X$ is $k \times k$ with the same rank as $X$ (namely $k$) and is thus non-singular. So $(X'X)^- = (X'X)^{-1}$ and the normal equations can be solved to give
$$ b_{OLS} = (X'X)^{-1} X'Y. $$
When $X$ is not of full rank, there are multiple $b$'s that will solve equation (2) and multiple $\beta$'s that could be used to represent $EY \in C(X)$. There is thus no sensible least squares estimate of $\beta$.
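As a concrete illustration of these projection facts, the following is a minimal R sketch. The simulated data, the use of MASS::ginv as a generalized inverse, and all object names are illustrative assumptions, not part of the notes.

```r
# Illustration of P_X, residuals, and the rank/trace identities
library(MASS)                           # for ginv(), a generalized inverse

set.seed(1)
n <- 20
x <- runif(n)
X <- cbind(1, x, 2 * x)                 # third column makes X rank-deficient (rank 2)
Y <- 3 + 2 * x + rnorm(n)

PX <- X %*% ginv(t(X) %*% X) %*% t(X)   # P_X = X (X'X)^- X'
all.equal(PX, t(PX))                    # symmetric
all.equal(PX, PX %*% PX)                # idempotent

Yhat <- PX %*% Y                        # perpendicular projection of Y onto C(X)
e    <- Y - Yhat                        # residuals, (I - P_X) Y

sum(diag(PX))                           # trace(P_X) = rank(X) = 2 here
sum(Y^2) - (sum(Yhat^2) + sum(e^2))     # Pythagorean/ANOVA identity, approximately 0
```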

1.2 Estimability and Testability

In full rank cases, for any $c \in R^k$ it makes sense to define the least squares estimate of the linear combination of parameters $c'\beta$ as
$$ \widehat{c'\beta}_{OLS} = c' b_{OLS} = c' (X'X)^{-1} X'Y. \qquad (3) $$
When $X$ is not of full rank, the above expression doesn't make sense. But even in such cases, for some $c \in R^k$ the linear combination of parameters $c'\beta$ can be unambiguous in the sense that every $\beta$ that produces a given $EY = X\beta$ produces the same value of $c'\beta$. As it turns out, the $c$'s that have this property can be characterized in several ways.

Theorem 1 The following 3 conditions on $c \in R^k$ are equivalent:
a) $\exists\, a \in R^n$ such that $a'X\beta = c'\beta \ \forall \beta \in R^k$,
b) $c \in C(X')$,
c) $X\beta_1 = X\beta_2$ implies that $c'\beta_1 = c'\beta_2$.

Condition c) says that $c'\beta$ is unambiguous in the sense mentioned before the statement of the result, condition a) says that there is a linear combination of the entries of $Y$ that is unbiased for $c'\beta$, and condition b) says that $c'$ is a linear combination of the rows of $X$. When $c \in R^k$ satisfies the characterizations of Theorem 1 it makes sense to try to estimate $c'\beta$. Thus, one is led to the following definition.

Definition 2 If $c \in R^k$ satisfies the characterizations of Theorem 1, the parametric function $c'\beta$ is said to be estimable.

If $c'\beta$ is estimable, $\exists\, a \in R^n$ such that $c'\beta = a'X\beta = a'EY \ \forall \beta \in R^k$, and so it makes sense to invent an ordinary least squares estimate of $c'\beta$ as
$$ \widehat{c'\beta}_{OLS} = a'\hat{Y} = a' P_X Y = a' X (X'X)^- X' Y = c' (X'X)^- X' Y, $$
which is a generalization of the full rank formula (3).

It is often of interest to simultaneously estimate several parametric functions $c_1'\beta, c_2'\beta, \ldots, c_l'\beta$. If each $c_i'\beta$ is estimable, it makes sense to assemble the matrix
$$ C = \begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_l' \end{pmatrix} \qquad (4) $$

and talk about estimating the vector
$$ C\beta = \begin{pmatrix} c_1'\beta \\ c_2'\beta \\ \vdots \\ c_l'\beta \end{pmatrix}. $$
An ordinary least squares estimator of $C\beta$ is then
$$ \widehat{C\beta}_{OLS} = C (X'X)^- X' Y. $$
Related to the notion of the estimability of $C\beta$ is the concept of testability of hypotheses. Roughly speaking, several hypotheses like $H_0\!: c_i'\beta = \#$ are (simultaneously) testable if each $c_i'\beta$ can be estimated and the hypotheses are not internally inconsistent. To be more precise, suppose that (as above) $C$ is an $l \times k$ matrix of constants.

Definition 3 For a matrix $C$ of the form (4), the hypothesis $H_0\!: C\beta = d$ is testable provided each $c_i'\beta$ is estimable and $\mathrm{rank}(C) = l$.

Cases of testing $H_0\!: C\beta = 0$ are of particular interest. If such a hypothesis is testable, $\exists\, a_i \in R^n$ such that $c_i' = a_i'X$ for each $i$, and thus with
$$ A = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_l' \end{pmatrix} $$
one can write $C = AX$. Then the basic linear model says that $EY \in C(X)$, while the hypothesis says that $EY \in C(X) \cap C(A')^\perp$. This is a subspace of $C(X)$ of dimension $\mathrm{rank}(X) - \mathrm{rank}(A') = \mathrm{rank}(X) - l$, and the hypothesis can thus be thought of in terms of specifying that the mean vector is in a subspace of $C(X)$.

1.3 Means and Variances for Ordinary Least Squares (Under the Gauss-Markov Assumptions)

Elementary rules about how means and variances of linear combinations of random variables are computed can be applied to find means and variances for OLS estimators. Under the Gauss-Markov model some of these are
$$ E\hat{Y} = X\beta \quad \text{and} \quad \mathrm{Var}\,\hat{Y} = \sigma^2 P_X, $$
$$ Ee = 0 \quad \text{and} \quad \mathrm{Var}\,e = \sigma^2 (I - P_X). $$

Further, for $l$ estimable functions $c_1'\beta, c_2'\beta, \ldots, c_l'\beta$ and
$$ C = \begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_l' \end{pmatrix}, $$
the estimator $\widehat{C\beta}_{OLS} = C (X'X)^- X' Y$ has mean and covariance matrix
$$ E\widehat{C\beta}_{OLS} = C\beta \quad \text{and} \quad \mathrm{Var}\,\widehat{C\beta}_{OLS} = \sigma^2 C (X'X)^- C'. $$
Notice that in the case that $X$ is full rank and $\beta = I\beta$ is estimable, the above says that
$$ E b_{OLS} = \beta \quad \text{and} \quad \mathrm{Var}\, b_{OLS} = \sigma^2 (X'X)^{-1}. $$
It is possible to use Theorem 5.2A of Rencher about the mean of a quadratic form to also argue that
$$ E e'e = E\left(Y - \hat{Y}\right)'\left(Y - \hat{Y}\right) = \sigma^2\left(n - \mathrm{rank}(X)\right). $$
This fact suggests the ratio
$$ MSE = \frac{e'e}{n - \mathrm{rank}(X)} $$
as an obvious estimate of $\sigma^2$.

In the Gauss-Markov model, ordinary least squares estimation has some optimality properties. Foremost there is the guarantee provided by the Gauss-Markov Theorem. This says that under the linear model assumptions with $\mathrm{Var}\,\epsilon = \sigma^2 I$, for estimable $c'\beta$ the ordinary least squares estimator $\widehat{c'\beta}_{OLS}$ is the Best (in the sense of minimizing variance) Linear (in the entries of $Y$) Unbiased (having mean $c'\beta$ for all $\beta$) Estimator of $c'\beta$.

1.4 Generalized Least Squares

For $V$ positive definite, suppose that $\mathrm{Var}\,\epsilon = \eta^2 V$. There exists a symmetric positive definite square root matrix for $V^{-1}$; call it $V^{-\frac12}$. Then $U = V^{-\frac12} Y$ satisfies the Gauss-Markov model assumptions with model matrix $W = V^{-\frac12} X$. It then makes sense to do ordinary least squares estimation of $EU \in C(W)$ (with $P_W U = \hat{U} = \widehat{W\beta}$). Note that for $c \in C(W')$ the BLUE of the parametric function $c'\beta$ is
$$ c' (W'W)^- W' U = c' (W'W)^- W' V^{-\frac12} Y $$

(any linear function of the elements of $U$ is a linear function of elements of $Y$ and vice versa). So, for example, in full rank cases, the best linear unbiased estimator of the $\beta$ vector is
$$ b_{OLS(U)} = (W'W)^{-1} W' U = (W'W)^{-1} W' V^{-\frac12} Y. \qquad (5) $$
Notice that ordinary least squares estimation of $EU$ is minimization of
$$ \left(V^{-\frac12} Y - \hat{U}\right)'\left(V^{-\frac12} Y - \hat{U}\right) = \left(Y - V^{\frac12}\hat{U}\right)' V^{-1}\left(Y - V^{\frac12}\hat{U}\right) $$
over choices of $\hat{U} \in C(W) = C\left(V^{-\frac12} X\right)$. This is minimization of
$$ \left(Y - \hat{Y}\right)' V^{-1}\left(Y - \hat{Y}\right) $$
over choices of $\hat{Y} \in C\left(V^{\frac12} W\right) = C(X)$. This is so-called generalized least squares estimation in the original variables $Y$.

1.5 Reparameterizations and Restrictions

When two superficially different linear models $Y = X\beta + \epsilon$ and $Y = W\gamma + \epsilon$ (with the same assumptions on $\epsilon$) have $C(X) = C(W)$, they are really fundamentally the same: $\hat{Y}$ and $Y - \hat{Y}$ are the same in the two models. Further,
$$ c'\beta \text{ is estimable} \iff \exists\, a \in R^n \text{ with } a'X\beta = c'\beta \ \forall \beta \iff \exists\, a \in R^n \text{ with } a'EY = c'\beta \ \forall \beta. $$
That is, $c'\beta$ is estimable exactly when it is a linear combination of the entries of $EY$. Though this is expressed differently in the two models, it must be the case that the two equivalent models produce the same set of estimable functions, i.e. that
$$ \{ c'\beta \mid c'\beta \text{ is estimable} \} = \{ d'\gamma \mid d'\gamma \text{ is estimable} \}. $$
The correspondence between estimable functions in the two model formulations is as follows. Since every column of $W$ is a linear combination of the columns of $X$, there must be a matrix $F$ such that $W = XF$. Then for an estimable $c'\beta$, $\exists\, a \in R^n$ such that $c' = a'X$. But then
$$ c'\beta = a'X\beta = a'W\gamma = a'XF\gamma = (c'F)\gamma. $$
So, for example, in the Gauss-Markov model, if $c \in C(X')$, the BLUE of $c'\beta = (c'F)\gamma$ is $\widehat{c'\beta}_{OLS}$.
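A quick R check of this equivalence, using a hypothetical one-way layout: the cell-means coding and R's default baseline (treatment) coding have the same column space, so they give identical fitted values and identical estimates of any estimable function (here a difference of two group means). The data and object names are illustrative assumptions.

```r
set.seed(2)
grp <- factor(rep(c("a", "b", "c"), each = 4))
y   <- rnorm(12, mean = c(1, 2, 4)[grp])

fit.X <- lm(y ~ grp - 1)   # cell-means parameterization: columns are group indicators
fit.W <- lm(y ~ grp)       # baseline parameterization: intercept + treatment contrasts

all.equal(fitted(fit.X), fitted(fit.W))   # same Y-hat, since C(X) = C(W)

# the estimable function mu_b - mu_a expressed in each parameterization
unname(coef(fit.X)["grpb"] - coef(fit.X)["grpa"])
unname(coef(fit.W)["grpb"])
```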

What then is there to choose between two fundamentally equivalent linear models? There are two issues. Computational/formula simplicity pushes one in the direction of using full rank versions. Sometimes, scientific interpretability of parameters pushes one in the opposite direction. It must be understood that the set of inferences one can make can ONLY depend on the column space of the model matrix, NOT on how that column space is represented.

1.6 Normal Distribution Theory and Inference

If one adds to the basic Gauss-Markov (or Aitken) linear model assumptions an assumption that $\epsilon$ (and therefore $Y$) is multivariate normal, inference formulas (for making confidence intervals, tests and predictions) follow. These are based primarily on two basic results.

Theorem 4 (Koehler 4.7, panel 309. See also Rencher Theorem 5.5A.) Suppose that $A$ is $n \times n$ and symmetric with $\mathrm{rank}(A) = k$, and $Y \sim \mathrm{MVN}_n(\mu, \Sigma)$ for $\Sigma$ positive definite. If $A\Sigma$ is idempotent, then $Y'AY \sim \chi^2_k(\mu'A\mu)$. (So, if in addition $A\mu = 0$, then $Y'AY \sim \chi^2_k$.)

Theorem 5 (Theorem 1.3.7 of Christensen) Suppose that $Y \sim \mathrm{MVN}\left(\mu, \sigma^2 I\right)$ and $BA = 0$.
a) If $A$ is symmetric, $Y'AY$ and $BY$ are independent, and
b) if both $A$ and $B$ are symmetric, then $Y'AY$ and $Y'BY$ are independent.
(Part b) is Koehler's 4.8. Part a) is a weaker form of Corollary 1 to Rencher's Theorem 5.6A and part b) is a weaker form of Corollary 1 to Rencher's Theorem 5.6B.)

Here are some implications of these theorems.

Example 6 In the normal Gauss-Markov model,
$$ \frac{1}{\sigma^2}\left(Y - \hat{Y}\right)'\left(Y - \hat{Y}\right) = \frac{SSE}{\sigma^2} \sim \chi^2_{n-\mathrm{rank}(X)}. $$
This leads, for example, to $1-\alpha$ level confidence limits for $\sigma^2$:
$$ \left( \frac{SSE}{\text{upper } \frac{\alpha}{2} \text{ point of } \chi^2_{n-\mathrm{rank}(X)}},\ \frac{SSE}{\text{lower } \frac{\alpha}{2} \text{ point of } \chi^2_{n-\mathrm{rank}(X)}} \right). $$

Example 7 (Estimation and testing for an estimable function) In the normal Gauss-Markov model, if $c'\beta$ is estimable,
$$ \frac{\widehat{c'\beta}_{OLS} - c'\beta}{\sqrt{MSE \cdot c'(X'X)^- c}} \sim t_{n-\mathrm{rank}(X)}. $$

This implies that $H_0\!: c'\beta = \#$ can be tested using the statistic
$$ T = \frac{\widehat{c'\beta}_{OLS} - \#}{\sqrt{MSE \cdot c'(X'X)^- c}} $$
and a $t_{n-\mathrm{rank}(X)}$ reference distribution. Further, if $t$ is the upper $\frac{\alpha}{2}$ point of the $t_{n-\mathrm{rank}(X)}$ distribution, $1-\alpha$ level two-sided confidence limits for $c'\beta$ are
$$ \widehat{c'\beta}_{OLS} \pm t\sqrt{MSE \cdot c'(X'X)^- c}. $$

Example 8 (Prediction) In the normal Gauss-Markov model, suppose that $c'\beta$ is estimable and $y \sim N\left(c'\beta, \gamma\sigma^2\right)$ independent of $Y$ is to be observed. (We assume that $\gamma$ is known.) Then
$$ \frac{\widehat{c'\beta}_{OLS} - y}{\sqrt{MSE\left(\gamma + c'(X'X)^- c\right)}} \sim t_{n-\mathrm{rank}(X)}. $$
This means that if $t$ is the upper $\frac{\alpha}{2}$ point of the $t_{n-\mathrm{rank}(X)}$ distribution, $1-\alpha$ level two-sided prediction limits for $y$ are
$$ \widehat{c'\beta}_{OLS} \pm t\sqrt{MSE\left(\gamma + c'(X'X)^- c\right)}. $$

Example 9 (Testing) In the normal Gauss-Markov model, suppose that the hypothesis $H_0\!: C\beta = d$ is testable. Then with
$$ SS_{H_0} = \left(\widehat{C\beta}_{OLS} - d\right)'\left(C (X'X)^- C'\right)^{-1}\left(\widehat{C\beta}_{OLS} - d\right) $$
it's easy to see that
$$ \frac{SS_{H_0}}{\sigma^2} \sim \chi^2_l\!\left(\delta^2\right) \quad \text{for} \quad \delta^2 = \frac{1}{\sigma^2}(C\beta - d)'\left(C (X'X)^- C'\right)^{-1}(C\beta - d). $$
This in turn implies that
$$ F = \frac{SS_{H_0}/l}{MSE} \sim F_{l,\,n-\mathrm{rank}(X)}\!\left(\delta^2\right). $$
So with $f$ the upper $\alpha$ point of the $F_{l,\,n-\mathrm{rank}(X)}$ distribution, an $\alpha$ level test of $H_0\!: C\beta = d$ can be made by rejecting if $\frac{SS_{H_0}/l}{MSE} > f$. The power of this test is
$$ \mathrm{power}\left(\delta^2\right) = P\left(\text{an } F_{l,\,n-\mathrm{rank}(X)}\!\left(\delta^2\right) \text{ random variable} > f\right). $$
Or, taking a significance testing point of view, a $p$-value for testing this hypothesis is
$$ P\left(\text{an } F_{l,\,n-\mathrm{rank}(X)} \text{ random variable exceeds the observed value of } \frac{SS_{H_0}/l}{MSE}\right). $$
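The following is a small R sketch of Example 9's F test for a testable $H_0\!: C\beta = d$, written directly from the formulas above; the simulated data and the use of MASS::ginv are assumptions made purely for illustration.

```r
library(MASS)                           # ginv() as a generalized inverse

set.seed(3)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
X  <- cbind(1, x1, x2)
Y  <- 1 + 0.5 * x1 + rnorm(n)           # true coefficient on x2 is 0

XtXi <- ginv(t(X) %*% X)
b    <- XtXi %*% t(X) %*% Y             # OLS estimate (X is full rank here)
e    <- Y - X %*% b
MSE  <- sum(e^2) / (n - qr(X)$rank)

C <- rbind(c(0, 1, 0),                  # H0: both slope coefficients are 0
           c(0, 0, 1))
d <- c(0, 0)
l <- nrow(C)

diff  <- C %*% b - d
SSH0  <- t(diff) %*% solve(C %*% XtXi %*% t(C)) %*% diff
Fstat <- (SSH0 / l) / MSE
pval  <- pf(Fstat, l, n - qr(X)$rank, lower.tail = FALSE)
c(F = Fstat, p = pval)
```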

1.7 Normal Theory Maximum Likelihood and Least Squares

One way to justify/motivate the use of least squares in the linear model is through appeal to the statistical principle of maximum likelihood. That is, in the normal Gauss-Markov model, the joint pdf for the $n$ observations is
$$ f\left(Y \mid X\beta, \sigma^2\right) = (2\pi)^{-\frac{n}{2}}\det\left(\sigma^2 I\right)^{-\frac12}\exp\left(-\tfrac12(Y - X\beta)'\left(\sigma^2 I\right)^{-1}(Y - X\beta)\right) = (2\pi)^{-\frac{n}{2}}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)\right). $$
Clearly, for fixed $\sigma$ this is maximized as a function of $X\beta \in C(X)$ by minimizing $(Y - X\beta)'(Y - X\beta)$, i.e. with $\hat{Y} = \widehat{X\beta}_{OLS}$. Then consider
$$ \ln f\left(Y \mid \hat{Y}, \sigma^2\right) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}SSE. $$
Setting the derivative with respect to $\sigma^2$ equal to 0 and solving shows this function of $\sigma^2$ to be maximized when $\sigma^2 = SSE/n$. That is, the (joint) maximum likelihood estimator of the parameter vector $\left(X\beta, \sigma^2\right)$ is $\left(\hat{Y}, SSE/n\right)$.

It is worth noting that
$$ \frac{SSE}{n} = \left(\frac{n - \mathrm{rank}(X)}{n}\right)MSE \quad \text{so that} \quad E\frac{SSE}{n} = \left(\frac{n - \mathrm{rank}(X)}{n}\right)\sigma^2 < \sigma^2 $$
and the MLE of $\sigma^2$ is biased low.

1.8 Linear Models and Regression

A most important application of the general framework of the linear model is to (multiple linear) regression analysis, usually represented in the form
$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_r x_{ri} + \epsilon_i $$
for predictor variables $x_1, x_2, \ldots, x_r$. This can, of course, be written in the usual linear model format (1) for $k = r + 1$ and
$$ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{r1} \\ 1 & x_{12} & x_{22} & \cdots & x_{r2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{rn} \end{pmatrix} = \left(\mathbf{1}\ x_1\ x_2\ \cdots\ x_r\right). $$

Unless $n$ is small or one is very unlucky, in regression contexts $X$ is of full rank (i.e. is of rank $r + 1$). A few specifics of what has gone before that are of particular interest in the regression context are as follows.

As always, $\hat{Y} = P_X Y$, and in the regression context it is especially common to call $P_X = H$ the hat matrix. It is $n \times n$ and its diagonal entries $h_{ii}$ are sometimes used as indices of influence or leverage of a particular case on the regression fit. It is the case that each $h_{ii} \ge 0$ and
$$ \sum_i h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(H) = r + 1, $$
so that the "hats" $h_{ii}$ average to $\frac{r+1}{n}$. In light of this, a case with $h_{ii} > \frac{2(r+1)}{n}$ is sometimes flagged as an influential case. Further,
$$ \mathrm{Var}\,\hat{Y} = \sigma^2 P_X = \sigma^2 H $$
so that $\mathrm{Var}\,\hat{y}_i = h_{ii}\sigma^2$ and an estimated standard deviation of $\hat{y}_i$ is $\sqrt{h_{ii}\,MSE}$. (This is useful, for example, in making confidence intervals for $Ey_i$.) Also as always, $e = (I - P_X)Y$ and $\mathrm{Var}\,e = \sigma^2(I - P_X) = \sigma^2(I - H)$. So $\mathrm{Var}\,e_i = (1 - h_{ii})\sigma^2$ and it is typical to compute and plot standardized versions of the residuals
$$ e_i^* = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}. $$

The general testing of hypotheses framework discussed in Section 1.6 has a particularly important specialization in regression contexts. That is, it is common in regression contexts (for $p < r$) to test
$$ H_0\!: \beta_{p+1} = \beta_{p+2} = \cdots = \beta_r = 0 \qquad (6) $$
and in first methods courses this is done using the full model/reduced model paradigm. With $X_i = \left(\mathbf{1}\ x_1\ x_2\ \cdots\ x_i\right)$, this is the hypothesis $H_0\!: EY \in C(X_p)$. It is also possible to write this hypothesis in the standard form $H_0\!: C\beta = 0$ using the matrix
$$ C_{(r-p)\times(r+1)} = \begin{pmatrix} 0_{(r-p)\times(p+1)} & I_{(r-p)\times(r-p)} \end{pmatrix}. $$
So from Section 1.6 the hypothesis can be tested using an $F$ test with numerator sum of squares
$$ SS_{H_0} = (C b_{OLS})'\left(C (X'X)^{-1} C'\right)^{-1}(C b_{OLS}). $$
What is interesting and perhaps not initially obvious is that
$$ SS_{H_0} = Y'\left(P_X - P_{X_p}\right)Y, \qquad (7) $$
and that this kind of sum of squares is the elementary $SSR_{full} - SSR_{reduced}$. (A proof of the equivalence (7) is on a handout posted on the course web page.)
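Here is a brief R illustration of the hat values, the standardized residuals, and the full/reduced-model F test of (6); the simulated data and variable names are assumptions for illustration only.

```r
set.seed(4)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)            # x3 is inactive

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)                          # H0: coefficients on x2 and x3 are 0

h <- hatvalues(full)                           # diagonal of H = P_X
mean(h)                                        # (r + 1)/n
which(h > 2 * mean(h))                         # cases flagged as high leverage

r.std <- rstandard(full)                       # e_i / sqrt(MSE (1 - h_ii))

anova(reduced, full)   # F = [(SSR_full - SSR_reduced)/2] / MSE_full, as in (6)-(7)
```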

Further, the sum of squares in display (7) can be made part of any number of interesting partitions of the (uncorrected) overall sum of squares $Y'Y$. For example, it is clear that
$$ Y'Y = Y'\left(P_{\mathbf 1} + \left(P_{X_p} - P_{\mathbf 1}\right) + \left(P_X - P_{X_p}\right) + \left(I - P_X\right)\right)Y \qquad (8) $$
so that
$$ Y'Y - Y'P_{\mathbf 1}Y = Y'\left(P_{X_p} - P_{\mathbf 1}\right)Y + Y'\left(P_X - P_{X_p}\right)Y + Y'\left(I - P_X\right)Y. $$
In elementary regression analysis notation,
$$ Y'Y - Y'P_{\mathbf 1}Y = SSTot\ \text{(corrected)}, $$
$$ Y'\left(P_{X_p} - P_{\mathbf 1}\right)Y = SSR_{reduced}, $$
$$ Y'\left(P_X - P_{X_p}\right)Y = SSR_{full} - SSR_{reduced}, $$
$$ Y'\left(I - P_X\right)Y = SSE_{full} $$
(and then of course $Y'\left(P_X - P_{\mathbf 1}\right)Y = SSR_{full}$). These four sums of squares are often arranged in an ANOVA table for testing the hypothesis (6).

It is common in regression analysis to use "reduction in sums of squares" notation and write
$$ R(\beta_0) = Y'P_{\mathbf 1}Y, $$
$$ R(\beta_1, \ldots, \beta_p \mid \beta_0) = Y'\left(P_{X_p} - P_{\mathbf 1}\right)Y, $$
$$ R(\beta_{p+1}, \ldots, \beta_r \mid \beta_0, \beta_1, \ldots, \beta_p) = Y'\left(P_X - P_{X_p}\right)Y, $$
so that in this notation, identity (8) becomes
$$ Y'Y = R(\beta_0) + R(\beta_1, \ldots, \beta_p \mid \beta_0) + R(\beta_{p+1}, \ldots, \beta_r \mid \beta_0, \beta_1, \ldots, \beta_p) + SSE. $$
And in fact, even more elaborate breakdowns of the overall sum of squares are possible. For example,
$$ R(\beta_0) = Y'P_{\mathbf 1}Y, $$
$$ R(\beta_1 \mid \beta_0) = Y'\left(P_{X_1} - P_{\mathbf 1}\right)Y, $$
$$ R(\beta_2 \mid \beta_0, \beta_1) = Y'\left(P_{X_2} - P_{X_1}\right)Y, $$
$$ \vdots $$
$$ R(\beta_r \mid \beta_0, \beta_1, \ldots, \beta_{r-1}) = Y'\left(P_X - P_{X_{r-1}}\right)Y $$
represents a "Type I" or "Sequential" sum of squares breakdown of $Y'Y$. (Note that these sums of squares are appropriate numerator sums of squares for testing significance of individual $\beta$'s in models that include terms only up to the one in question.)

The enterprise of trying to assign a sum of squares to a predictor variable strikes Vardeman as of little real interest, but is nevertheless a common one. Rather than think of
$$ R(\beta_i \mid \beta_0, \beta_1, \ldots, \beta_{i-1}) = Y'\left(P_{X_i} - P_{X_{i-1}}\right)Y $$

(which depends on the usually essentially arbitrary ordering of the columns of $X$) as "due to $x_i$," it is possible to invent other assignments of sums of squares. One is the SAS "Type II Sum of Squares" assignment. With
$$ X_{(i)} = \left(\mathbf{1}\ x_1\ \cdots\ x_{i-1}\ x_{i+1}\ \cdots\ x_r\right) $$
(the original model matrix with the column $x_i$ deleted), these are
$$ R(\beta_i \mid \beta_0, \beta_1, \ldots, \beta_{i-1}, \beta_{i+1}, \ldots, \beta_r) = Y'\left(P_X - P_{X_{(i)}}\right)Y $$
(appropriate numerator sums of squares for testing significance of individual $\beta$'s in the full model).

It is worth noting that Theorem B.47 of Christensen guarantees that any matrix like $P_X - P_{X_i}$ is a perpendicular projection matrix and that Theorem B.48 implies that $C\left(P_X - P_{X_i}\right) = C(X) \cap C(X_i)^\perp$ (the orthogonal complement of $C(X_i)$ with respect to $C(X)$ defined on page 395 of Christensen). Further, it is easy enough to argue using Christensen's Theorem 1.3.7b (Koehler's 4.7) that any set of sequential sums of squares has pair-wise independent elements. And one can apply Cochran's Theorem (Koehler's 4.9 on panel 333) to conclude that the whole set (including $SSE$) are mutually independent.

1.9 Linear Models and Two-Way Factorial Analyses

As a second application/specialization of the general linear model framework we consider the two-way factorial context. That is, for $I$ levels of Factor A and $J$ levels of Factor B, we consider situations where one can make observations under every combination of a level of A and a level of B. For $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$, let
$$ y_{ijk} = \text{the } k\text{th observation at level } i \text{ of A and level } j \text{ of B}. $$
The most straightforward way to think about such a situation is in terms of the cell means model
$$ y_{ijk} = \mu_{ij} + \epsilon_{ijk}. \qquad (9) $$
In this full rank model, provided every within-cell sample size $n_{ij}$ is positive, each $\mu_{ij}$ is estimable and thus so too is any linear combination of these means. It is standard to think of the $IJ$ means $\mu_{ij}$ as laid out in a two-way table
$$ \begin{matrix} \mu_{11} & \mu_{12} & \cdots & \mu_{1J} \\ \mu_{21} & \mu_{22} & \cdots & \mu_{2J} \\ \vdots & & & \vdots \\ \mu_{I1} & \mu_{I2} & \cdots & \mu_{IJ} \end{matrix} $$
Particularly interesting parametric functions ($c'\beta$'s) in model (9) are built on row, column and grand average means
$$ \bar{\mu}_{i\cdot} = \frac{1}{J}\sum_{j=1}^J \mu_{ij}, \quad \bar{\mu}_{\cdot j} = \frac{1}{I}\sum_{i=1}^I \mu_{ij}, \quad \text{and} \quad \bar{\mu}_{\cdot\cdot} = \frac{1}{IJ}\sum_{i,j} \mu_{ij}. $$

Each of these is a linear combination of the $IJ$ means $\mu_{ij}$ and is thus estimable. So are the linear combinations of them
$$ \alpha_i = \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot\cdot}, \quad \beta_j = \bar{\mu}_{\cdot j} - \bar{\mu}_{\cdot\cdot}, \quad \text{and} \quad \alpha\beta_{ij} = \mu_{ij} - \left(\bar{\mu}_{\cdot\cdot} + \alpha_i + \beta_j\right). \qquad (10) $$
The factorial effects (10) here are particular (estimable) linear combinations of the cell means. It is a consequence of how these are defined that
$$ \sum_i \alpha_i = 0, \quad \sum_j \beta_j = 0, \quad \sum_i \alpha\beta_{ij} = 0 \ \forall j, \quad \text{and} \quad \sum_j \alpha\beta_{ij} = 0 \ \forall i. \qquad (11) $$
An issue of particular interest in two-way factorials is whether the hypothesis
$$ H_0\!: \alpha\beta_{ij} = 0 \ \forall i \text{ and } j \qquad (12) $$
is tenable. (If it is, great simplification of interpretation is possible: changing levels of one factor has the same impact on mean response regardless of which level of the second factor is considered.) This hypothesis can be equivalently written as
$$ \mu_{ij} = \bar{\mu}_{\cdot\cdot} + \alpha_i + \beta_j \ \forall i \text{ and } j $$
or as
$$ \left(\mu_{ij} - \mu_{ij'}\right) - \left(\mu_{i'j} - \mu_{i'j'}\right) = 0 \ \forall i, i', j \text{ and } j' $$
and is a statement of parallelism on interaction plots of means. To test this, one could write the hypothesis in terms of $(I-1)(J-1)$ statements
$$ \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu}_{\cdot\cdot} = 0 $$
about the cell means and use the machinery for testing $H_0\!: C\beta = d$ from Example 9. In this case, $d = 0$ and the test is about $EY$ falling in some subspace of $C(X)$.

For thinking about the nature of this subspace and issues related to the hypothesis (12), it is probably best to back up and consider an alternative to the cell means model approach. Rather than begin with the cell means model, one might instead begin with the non-full-rank effects model
$$ y_{ijk} = \mu^* + \alpha_i^* + \beta_j^* + \alpha\beta_{ij}^* + \epsilon_{ijk}. \qquad (13) $$
I have put stars on the parameters to make clear that this is something different from beginning with cell means and defining effects as linear combinations of them. Here there are $k = 1 + I + J + IJ$ parameters for the means and only $IJ$ different means. A model including all of these parameters can not be of full rank. To get simple computations/formulas, one must impose some restrictions. There are several possibilities. In the first place, the facts (11) suggest the so-called "sum restrictions" in the effects model (13):
$$ \sum_i \alpha_i^* = 0, \quad \sum_j \beta_j^* = 0, \quad \sum_i \alpha\beta_{ij}^* = 0 \ \forall j, \quad \text{and} \quad \sum_j \alpha\beta_{ij}^* = 0 \ \forall i. $$

Alternative restrictions are so-called "baseline restrictions." SAS uses the baseline restrictions
$$ \alpha_I^* = 0, \quad \beta_J^* = 0, \quad \alpha\beta_{Ij}^* = 0 \ \forall j, \quad \text{and} \quad \alpha\beta_{iJ}^* = 0 \ \forall i, $$
while R and Splus use the baseline restrictions
$$ \alpha_1^* = 0, \quad \beta_1^* = 0, \quad \alpha\beta_{1j}^* = 0 \ \forall j, \quad \text{and} \quad \alpha\beta_{i1}^* = 0 \ \forall i. $$
Under any of these sets of restrictions one may write a full rank model matrix as
$$ X_{n\times IJ} = \left( \underset{n\times 1}{\mathbf 1} \ \ \underset{n\times(I-1)}{X_\alpha} \ \ \underset{n\times(J-1)}{X_\beta} \ \ \underset{n\times(I-1)(J-1)}{X_{\alpha\beta}} \right) $$
and the no-interaction hypothesis (12) is the hypothesis $H_0\!: EY \in C\left(\left(\mathbf{1}\ X_\alpha\ X_\beta\right)\right)$. So using the full model/reduced model paradigm from the regression discussion, one then has an appropriate numerator sum of squares
$$ SS_{H_0} = Y'\left(P_X - P_{\left(\mathbf{1}\ X_\alpha\ X_\beta\right)}\right)Y $$
and numerator degrees of freedom $(I-1)(J-1)$ (in complete factorials where every $n_{ij} > 0$).

Other hypotheses sometimes of interest are
$$ H_0\!: \alpha_i = 0 \ \forall i \quad \text{or} \quad H_0\!: \beta_j = 0 \ \forall j. \qquad (14) $$
These are the hypotheses that all row averages of cell means are the same and that all column averages of cell means are the same. That is, these hypotheses could be written as
$$ H_0\!: \bar{\mu}_{i\cdot} - \bar{\mu}_{i'\cdot} = 0 \ \forall i, i' \quad \text{or} \quad H_0\!: \bar{\mu}_{\cdot j} - \bar{\mu}_{\cdot j'} = 0 \ \forall j, j'. $$
It is possible to write the first of these in the cell means model as $H_0\!: C\beta = 0$ for $C$ that is $(I-1) \times k$ with each row of $C$ specifying $\alpha_i = 0$ for one of $i = 1, 2, \ldots, (I-1)$ (or equality of two row average means). Similarly, the second can be written in the cell means model as $H_0\!: C\beta = 0$ for $C$ that is $(J-1) \times k$ with each row of $C$ specifying $\beta_j = 0$ for one of $j = 1, 2, \ldots, (J-1)$ (or equality of two column average means). Appropriate numerator sums of squares and degrees of freedom for testing these hypotheses are then obvious using the material of Example 9. These sums of squares are often referred to as "Type III" sums of squares.

How to interpret standard partitions of sums of squares and to relate them to tests of hypotheses (12) and (14) is problematic unless all cell sample sizes are the same (all $n_{ij} = m$, the data are "balanced"). That is, depending upon what kind of partition one asks for in a call of a standard two-way ANOVA routine, the program produces the following breakdowns.

Source | Type I SS (in the order A, B, A×B) | Type II SS | Type III SS
A | $R(\alpha^*\text{'s} \mid \mu^*)$ | $R(\alpha^*\text{'s} \mid \mu^*, \beta^*\text{'s})$ | $SS_{H_0}$ for the type (14) hypothesis in model (9)
B | $R(\beta^*\text{'s} \mid \mu^*, \alpha^*\text{'s})$ | $R(\beta^*\text{'s} \mid \mu^*, \alpha^*\text{'s})$ | $SS_{H_0}$ for the type (14) hypothesis in model (9)
A×B | $R(\alpha\beta^*\text{'s} \mid \mu^*, \alpha^*\text{'s}, \beta^*\text{'s})$ | $R(\alpha\beta^*\text{'s} \mid \mu^*, \alpha^*\text{'s}, \beta^*\text{'s})$ | $SS_{H_0}$ for the type (12) hypothesis in model (9)

The A×B sums of squares of all three types are the same. When the data are balanced, the A and B sums of squares of all three types are also the same. But when the sample sizes are not the same, the A and B sums of squares of different types are generally not the same, only the Type III sums of squares are appropriate for testing hypotheses (14), and exactly what hypotheses could be tested using the Type I or Type II sums of squares is both hard to figure out and actually quite bizarre when one does figure it out. (They correspond to certain hypotheses about weighted averages of means that use sample sizes as weights. See Koehler's notes for more on this point.) From Vardeman's perspective, the Type III sums of squares have a reason to be in this context, while the Type I and Type II sums of squares do not. (This is in spite of the fact that the Type I breakdown is an honest partition of an overall sum of squares, while the Type III breakdown is not.)

An issue related to lack of balance in a two-way factorial is the issue of empty cells. That is, suppose that there are data for only $k < IJ$ combinations of levels of A and B. Full rank in this context is rank $k$ and the cell means model (9) can have only $k < IJ$ parameters for the means. However, if the $I \times J$ table of data isn't too sparse (if the number of empty cells isn't too large and those that are empty are not in a nasty configuration) it is still possible to fit both a cell means model and a no-interaction effects model to the data. This allows the testing of
$$ H_0\!: \text{the } \mu_{ij} \text{ for which one has data show no interactions,} $$
i.e.
$$ H_0\!: \left(\mu_{ij} - \mu_{ij'}\right) - \left(\mu_{i'j} - \mu_{i'j'}\right) = 0 \ \forall \text{ quadruples } (i,j), (i,j'), (i',j), (i',j') \text{ complete in the data set,} \qquad (15) $$
and the estimation of all cell means under an assumption that a no-interaction effects model appropriate to the cells where one has data extends to all $I \times J$ cells in the table.
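The distinctions in the table above can be seen numerically in R on a hypothetical unbalanced two-way layout: with unequal cell sizes the Type I (sequential) A and B sums of squares depend on the order in which the factors are entered, while the A×B sum of squares does not. The data, factor names, and the mention of car::Anova are illustrative assumptions, not part of these notes.

```r
set.seed(5)
A <- factor(rep(c("a1", "a2"), times = c(8, 12)))                      # unbalanced in A
B <- factor(rep(c("b1", "b2", "b1", "b2"), times = c(3, 5, 7, 5)))     # and in the cells
y <- rnorm(20, mean = 1 + (A == "a2") + 2 * (B == "b2"))

anova(lm(y ~ A * B))   # Type I (sequential) SS in the order A, B, A:B
anova(lm(y ~ B * A))   # same data, other order: the A and B lines change, A:B does not

# For Type II / Type III style tests one option is car::Anova(lm(y ~ A * B), type = 3)
# (the car package and suitable contrasts are assumed here, not described in the notes).
```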

To be precise about the empty-cell case, let $X$ be the cell means model matrix (for the $k$ non-empty cells) and
$$ X_* = \left( \underset{n\times 1}{\mathbf 1} \ \ \underset{n\times(I-1)}{X_\alpha} \ \ \underset{n\times(J-1)}{X_\beta} \right) $$
be an appropriate restricted version of an effects model model matrix (with no interaction terms). If the pattern of empty cells is such that $X_*$ is of full rank (has rank $I + J - 1$), the hypothesis (15) can be tested using
$$ F = \frac{Y'\left(P_X - P_{X_*}\right)Y / \left(k - (I + J - 1)\right)}{Y'\left(I - P_X\right)Y / (n - k)} $$
and an $F_{(k-(I+J-1)),\,(n-k)}$ reference distribution. Further, every $\mu^* + \alpha_i^* + \beta_j^*$ is estimable in the no-interaction effects model. Provided this model extends to all $I \times J$ combinations of levels of A and B, this provides estimates of mean responses for all cells. (Note that this is essentially the same kind of extrapolation one does in a regression context to sets of predictors not in the original data set. However, on an intuitive basis, the link supporting extrapolation is probably stronger with quantitative regressors than it is with the qualitative predictors of the present context.)

2 Nonlinear Models

A generalization of the linear model is the (potentially) nonlinear model that, for $\beta$ a $k \times 1$ vector of (unknown) constants (parameters) and for some function $f(x, \beta)$ that is smooth (differentiable) in the elements of $\beta$, says that what is observed can be represented as
$$ y_i = f(x_i, \beta) + \epsilon_i \qquad (16) $$
for each $x_i$ a known vector of constants. (The dimension of $x$ is fixed but basically irrelevant for what follows. In particular, it need not be $k$.) As is typical in the linear model, one usually assumes that $E\epsilon_i = 0 \ \forall i$, and it is also common to assume that for an unknown constant (a parameter) $\sigma^2 > 0$, $\mathrm{Var}\,\epsilon = \sigma^2 I$.

2.1 Ordinary Least Squares in the Nonlinear Model

In general (unlike the case when $f(x_i, \beta) = x_i'\beta$ and the model (16) is a linear model) there are typically no explicit formulas for least squares estimation of $\beta$. That is, minimization of
$$ g(b) = \sum_{i=1}^n \left(y_i - f(x_i, b)\right)^2 \qquad (17) $$

is a problem in numerical analysis. There are a variety of standard algorithms used for this purpose. They are all based on the fact that a necessary condition for $b_{OLS}$ to be a minimizer of $g(b)$ is that
$$ \left.\frac{\partial g}{\partial b_j}\right|_{b = b_{OLS}} = 0 \ \forall j, $$
so that in searching for an ordinary least squares estimator, one might try to find a simultaneous solution to these $k$ estimating equations. A bit of calculus and algebra shows that $b_{OLS}$ must then solve the matrix equation
$$ 0 = D'\left(Y - f(X, b)\right) \qquad (18) $$
where we use the notations
$$ D_{n\times k} = \left(\frac{\partial f(x_i, b)}{\partial b_j}\right) \quad \text{and} \quad f(X, b)_{n\times 1} = \begin{pmatrix} f(x_1, b) \\ f(x_2, b) \\ \vdots \\ f(x_n, b) \end{pmatrix}. $$
In the case of the linear model,
$$ D = \left(\frac{\partial\, x_i'b}{\partial b_j}\right) = (x_{ij}) = X \quad \text{and} \quad f(X, b) = Xb, $$
so that equation (18) is $0 = X'(Y - Xb)$, i.e. is the set of normal equations $X'Y = X'Xb$.

One of many iterative algorithms for searching for a solution to the equation (18) is the Gauss-Newton algorithm. It proceeds as follows. For $b^r =$ the approximate solution produced by the $r$th iteration of the algorithm ($b^0$ is some vector of starting values that must be supplied by the user), let
$$ D^r = \left.\left(\frac{\partial f(x_i, b)}{\partial b_j}\right)\right|_{b = b^r}. $$
The first order Taylor (linear) approximation to $f(X, \beta)$ at $b^r$ is
$$ f(X, \beta) \approx f(X, b^r) + D^r\left(\beta - b^r\right). $$
So the nonlinear model $Y = f(X, \beta) + \epsilon$ can be written as
$$ Y \approx f(X, b^r) + D^r\left(\beta - b^r\right) + \epsilon, $$

which can be written in linear model form (for the "response vector" $Y - f(X, b^r)$) as
$$ \left(Y - f(X, b^r)\right) \approx D^r\left(\beta - b^r\right) + \epsilon. $$
In this context, ordinary least squares would say that $\beta - b^r$ could be approximated as
$$ \widehat{\left(\beta - b^r\right)}_{OLS} = \left(D^{r\prime} D^r\right)^{-1} D^{r\prime}\left(Y - f(X, b^r)\right), $$
which in turn suggests that the $(r+1)$st iterate of $b$ be taken as
$$ b^{r+1} = b^r + \left(D^{r\prime} D^r\right)^{-1} D^{r\prime}\left(Y - f(X, b^r)\right). $$
These iterations are continued until some convergence criterion is satisfied. One type of convergence criterion is based on the "deviance" (what in a linear model would be called the error sum of squares). At the $r$th iteration this is
$$ SSE^r = g(b^r) = \sum_{i=1}^n \left(y_i - f(x_i, b^r)\right)^2 $$
and in rough terms, the convergence criterion is to stop when this ceases to decrease. Other criteria have to do with (relative) changes in the entries of $b^r$ (one stops when $b^r$ ceases to change much). The R function nls implements this algorithm (both in versions where the user must supply formulas for derivatives and where the algorithm itself computes approximate/numerical derivatives). Koehler's notes discuss two other algorithms, the Newton-Raphson algorithm and the Fisher scoring algorithm, as they are applied to the (iid normal errors version of the) loglikelihood function associated with the nonlinear model (16).

Our fundamental interest here is not in the numerical analysis of how one finds a minimizer of the function (17), $b_{OLS}$, but rather in how that estimator can be used in making inferences. There are two routes to inference based on complementary large sample theories connected with $b_{OLS}$. We outline those in the next two subsections.

2.2 Inference Methods Based on Large n Theory for the Distribution of MLEs

Just as in the linear model case discussed in Section 1.7, $b_{OLS}$ and $SSE/n$ are (joint) maximum likelihood estimators of $\beta$ and $\sigma^2$ under the iid normal errors nonlinear model. General theory about the consistency and asymptotic normality of maximum likelihood estimators (outlined, for example, in Section 7.3.1 of the Appendix) suggests the following.

Claim 10 Precise statements of the approximations below can be made in some large $n$ circumstances:
1) $b_{OLS} \approx \mathrm{MVN}_k\left(\beta, \sigma^2\left(D'D\right)^{-1}\right)$ where $D = \left.\left(\frac{\partial f(x_i, b)}{\partial b_j}\right)\right|_{b = \beta}$. Further,

$\left(D'D\right)^{-1}$ typically gets small with increasing sample size.
2) $MSE = \dfrac{SSE}{n - k} \approx \sigma^2$.
3) $\left(D'D\right)^{-1} \approx \left(\hat{D}'\hat{D}\right)^{-1}$ where $\hat{D} = \left.\left(\frac{\partial f(x_i, b)}{\partial b_j}\right)\right|_{b = b_{OLS}}$.
4) For a smooth (differentiable) function $h$ that maps $R^k \to R^q$, Claim 1) and the delta method (Taylor's Theorem of Section 7.2 of the Appendix) imply that
$$ h(b_{OLS}) \approx \mathrm{MVN}_q\left(h(\beta), \sigma^2 G\left(D'D\right)^{-1}G'\right) \quad \text{for} \quad G_{q\times k} = \left.\left(\frac{\partial h_i(b)}{\partial b_j}\right)\right|_{b = \beta}. $$
5) $\hat{G} \approx G$ where $\hat{G} = \left.\left(\frac{\partial h_i(b)}{\partial b_j}\right)\right|_{b = b_{OLS}}$.

Using this set of approximations, essentially exactly as in Section 1.6, one can develop inference methods. Some of these are outlined below.

Example 11 (Inference for a single $\beta_j$) From part 1) of Claim 10 we get the approximation
$$ \frac{b_{OLS,j} - \beta_j}{\sigma\sqrt{\eta_j}} \approx N(0,1) $$
for $\eta_j$ the $j$th diagonal entry of $\left(D'D\right)^{-1}$. But then from parts 2) and 3) of the claim,
$$ \frac{b_{OLS,j} - \beta_j}{\sigma\sqrt{\eta_j}} \approx \frac{b_{OLS,j} - \beta_j}{\sqrt{MSE\,\hat{\eta}_j}} $$
for $\hat{\eta}_j$ the $j$th diagonal entry of $\left(\hat{D}'\hat{D}\right)^{-1}$. In the (normal Gauss-Markov) linear model context, this last random variable is in fact exactly $t$ distributed for any $n$. Then, both so that the nonlinear model formulas reduce to the linear model formulas in that case and as a means of making the already very approximate inference formulas somewhat more conservative, it is standard to say
$$ \frac{b_{OLS,j} - \beta_j}{\sqrt{MSE\,\hat{\eta}_j}} \approx t_{n-k} $$
and thus to test $H_0\!: \beta_j = \#$ using
$$ T = \frac{b_{OLS,j} - \#}{\sqrt{MSE\,\hat{\eta}_j}} $$
and a $t_{n-k}$ reference distribution, and to use the values
$$ b_{OLS,j} \pm t\sqrt{MSE\,\hat{\eta}_j} $$
as confidence limits for $\beta_j$.
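To make the Gauss-Newton iteration of Section 2.1 and the Wald-type limits of Example 11 concrete, here is a minimal R sketch for a hypothetical exponential-decay mean function $f(x, \beta) = \beta_1 e^{-\beta_2 x}$; the data, starting values, and convergence tolerance are all illustrative assumptions.

```r
set.seed(6)
n <- 50
x <- runif(n, 0, 5)
y <- 10 * exp(-0.7 * x) + rnorm(n, sd = 0.3)

f  <- function(b, x) b[1] * exp(-b[2] * x)
Df <- function(b, x) cbind(exp(-b[2] * x),              # d f / d b1
                           -b[1] * x * exp(-b[2] * x))  # d f / d b2

b <- c(5, 1)                                   # user-supplied starting values b^0
for (r in 1:50) {
  D    <- Df(b, x)
  res  <- y - f(b, x)
  step <- solve(t(D) %*% D, t(D) %*% res)      # (D'D)^{-1} D' (Y - f(X, b^r))
  b    <- b + as.vector(step)
  if (sum(step^2) < 1e-12) break               # stop when b^r ceases to change much
}

k    <- 2
MSE  <- sum((y - f(b, x))^2) / (n - k)
Dhat <- Df(b, x)
se   <- sqrt(MSE * diag(solve(t(Dhat) %*% Dhat)))   # sqrt(MSE * eta_j-hat)
cbind(estimate = b,
      lower = b - qt(0.975, n - k) * se,
      upper = b + qt(0.975, n - k) * se)

# nls() implements essentially this algorithm:
# nls(y ~ b1 * exp(-b2 * x), start = list(b1 = 5, b2 = 1))
```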

Example 12 (Inference for a univariate function of $\beta$, including a single mean response) For $h$ that maps $R^k \to R^1$, consider inference for $h(\beta)$ (with application to $f(x, \beta)$ for a given set of predictor variables $x$). Facts 4) and 5) of Claim 10 suggest that
$$ \frac{h(b_{OLS}) - h(\beta)}{\sqrt{MSE\,\hat{G}\left(\hat{D}'\hat{D}\right)^{-1}\hat{G}'}} \approx t_{n-k}. $$
This leads (as in the previous application/example) to testing $H_0\!: h(\beta) = \#$ using
$$ T = \frac{h(b_{OLS}) - \#}{\sqrt{MSE\,\hat{G}\left(\hat{D}'\hat{D}\right)^{-1}\hat{G}'}} $$
and a $t_{n-k}$ reference distribution, and to use of the values
$$ h(b_{OLS}) \pm t\sqrt{MSE\,\hat{G}\left(\hat{D}'\hat{D}\right)^{-1}\hat{G}'} $$
as confidence limits for $h(\beta)$.

For a set of predictor variables $x$, this can then be applied to $h(\beta) = f(x, \beta)$ to produce inferences for the mean response at $x$. That is, with
$$ G_{1\times k} = \left.\left(\frac{\partial f(x, b)}{\partial b_j}\right)\right|_{b = \beta} \quad \text{and, as expected,} \quad \hat{G} = \left.\left(\frac{\partial f(x, b)}{\partial b_j}\right)\right|_{b = b_{OLS}}, $$
one may test $H_0\!: f(x, \beta) = \#$ using
$$ T = \frac{f(x, b_{OLS}) - \#}{\sqrt{MSE\,\hat{G}\left(\hat{D}'\hat{D}\right)^{-1}\hat{G}'}} $$
and a $t_{n-k}$ reference distribution, and use the values
$$ f(x, b_{OLS}) \pm t\sqrt{MSE\,\hat{G}\left(\hat{D}'\hat{D}\right)^{-1}\hat{G}'} $$
as confidence limits for $f(x, \beta)$.

Example 13 (Prediction) Suppose that in the future, $y$ normal with mean $h(\beta)$ and variance $\gamma\sigma^2$, independent of $Y$, will be observed. (The constant $\gamma$ is assumed to be known.) Approximate prediction limits for $y$ are then
$$ h(b_{OLS}) \pm t\sqrt{MSE\left(\gamma + \hat{G}\left(\hat{D}'\hat{D}\right)^{-1}\hat{G}'\right)}. $$
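Continuing the hypothetical exponential-decay example, Example 12's delta-method limits for a mean response $f(x_0, \beta)$ can be computed from an nls() fit as follows; $x_0 = 2$ and all object names are illustrative assumptions, and vcov(fit) here plays the role of $MSE(\hat{D}'\hat{D})^{-1}$.

```r
set.seed(6)
x <- runif(50, 0, 5)
y <- 10 * exp(-0.7 * x) + rnorm(50, sd = 0.3)

fit <- nls(y ~ b1 * exp(-b2 * x), start = list(b1 = 5, b2 = 1))
b   <- coef(fit)

x0   <- 2
fhat <- unname(b["b1"] * exp(-b["b2"] * x0))
Ghat <- c(exp(-b["b2"] * x0),                    # d f(x0, b) / d b1
          -b["b1"] * x0 * exp(-b["b2"] * x0))    # d f(x0, b) / d b2

se <- drop(sqrt(t(Ghat) %*% vcov(fit) %*% Ghat)) # delta-method standard error
tq <- qt(0.975, df = df.residual(fit))
c(estimate = fhat, lower = fhat - tq * se, upper = fhat + tq * se)
```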

2.3 Inference Methods Based on Large n Theory for Likelihood Ratio Tests/Shape of the Likelihood Function

Exactly as in Section 1.7 for the normal Gauss-Markov linear model, the (normal) likelihood function in the (iid errors) nonlinear model is
$$ L\left(\beta, \sigma^2 \mid Y\right) = (2\pi)^{-\frac{n}{2}}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i - f(x_i, \beta)\right)^2\right). $$
Suppressing display of the fact that this is a function of $Y$ (i.e. this is a random function of the parameters $\beta$ and $\sigma^2$), it is common to take a natural logarithm and consider the log-likelihood function
$$ l\left(\beta, \sigma^2\right) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i - f(x_i, \beta)\right)^2. $$
There is general large $n$ theory concerning the use of this kind of random function in testing (and the inversion of tests to get confidence sets) that leads to inference methods alternative to those presented in Section 2.2. That theory is summarized in Section 7.3.2 of the Appendix. Here we make applications of it in the nonlinear model. (In the notation of the Appendix, our applications to the nonlinear model will be to the case where $r = k + 1$ and $\theta$ stands for a vector including both $\beta$ and $\sigma^2$.)

Example 14 (Inference for $\sigma^2$) Here let $\theta_1$ in the general theory be $\sigma^2$. For any $\sigma^2$, $l\left(\beta, \sigma^2\right)$ is maximized as a function of $\beta$ by $b_{OLS}$. So then the profile loglikelihood is
$$ l^*\left(\theta_1\right) = l\left(b_{OLS}, \sigma^2\right) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{SSE}{2\sigma^2} $$
and
$$ l\left(\hat{\theta}_{MLE}\right) = l\left(b_{OLS}, \frac{SSE}{n}\right) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\frac{SSE}{n} - \frac{n}{2}, $$
so that, applying display (61), a large sample approximate confidence interval for $\sigma^2$ is
$$ \left\{ \sigma^2 \;\middle|\; -\frac{n}{2}\ln\sigma^2 - \frac{SSE}{2\sigma^2} > -\frac{n}{2}\ln\frac{SSE}{n} - \frac{n}{2} - \frac{1}{2}\chi^2_1 \right\} $$
(here $\chi^2_1$ denotes the upper $\alpha$ point of the chi-square distribution with 1 degree of freedom). This is an alternative to simply borrowing from the linear model context the approximation $SSE/\sigma^2 \approx \chi^2_{n-k}$ and the confidence limits in Example 6.

Example 15 (Inference for the whole vector $\beta$) Here we let $\theta_1$ in the general theory be $\beta$. For any $\beta$, $l\left(\beta, \sigma^2\right)$ is maximized as a function of $\sigma^2$ by
$$ \widehat{\sigma^2}(\beta) = \frac{\sum_{i=1}^n\left(y_i - f(x_i, \beta)\right)^2}{n}. $$

So then
$$ l^*\left(\theta_1\right) = l\left(\beta, \widehat{\sigma^2}(\beta)\right) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\widehat{\sigma^2}(\beta) - \frac{n}{2} $$
and (since $l\left(\hat{\theta}_{MLE}\right)$ is as before), applying display (61), a large sample approximate confidence set for $\beta$ is
$$ \left\{ \beta \;\middle|\; -\frac{n}{2}\ln\widehat{\sigma^2}(\beta) > -\frac{n}{2}\ln\frac{SSE}{n} - \frac{1}{2}\chi^2_k \right\} = \left\{ \beta \;\middle|\; \ln\widehat{\sigma^2}(\beta) < \ln\frac{SSE}{n} + \frac{1}{n}\chi^2_k \right\} = \left\{ \beta \;\middle|\; \sum_{i=1}^n\left(y_i - f(x_i, \beta)\right)^2 < SSE\exp\left(\frac{1}{n}\chi^2_k\right) \right\}. $$
Now as it turns out, in the linear model an exact confidence region for $\beta$ is of the form
$$ \left\{ \beta \;\middle|\; \sum_{i=1}^n\left(y_i - f(x_i, \beta)\right)^2 < SSE\left(1 + \frac{k}{n-k}F_{k,n-k}\right) \right\} \qquad (19) $$
for $F_{k,n-k}$ the upper $\alpha$ point. So it is common to carry over formula (19) to the nonlinear model setting. When this is done, the region prescribed by formula (19) is called the Beale region for $\beta$.

Example 16 (Inference for a single $\beta_j$) What is usually a better alternative to the methods in Example 11 (in terms of holding a nominal coverage probability) is the following application of the general likelihood analysis. $\theta_1$ from the general theory will now be $\beta_j$. Let $\beta_{(j)}$ stand for the vector $\beta$ with its $j$th entry deleted and let $\hat{\beta}_{(j)}(\beta_j)$ minimize $\sum_{i=1}^n\left(y_i - f\left(x_i, \beta_j, \beta_{(j)}\right)\right)^2$ over choices of $\beta_{(j)}$. Reasoning like that in Example 15 says that a large sample approximate confidence set for $\beta_j$ is
$$ \left\{ \beta_j \;\middle|\; \sum_{i=1}^n\left(y_i - f\left(x_i, \beta_j, \hat{\beta}_{(j)}(\beta_j)\right)\right)^2 < SSE\exp\left(\frac{1}{n}\chi^2_1\right) \right\}. $$
This is the set of $\beta_j$ for which there is a $\beta_{(j)}$ that makes $\sum_{i=1}^n\left(y_i - f\left(x_i, \beta_j, \beta_{(j)}\right)\right)^2$ not too much larger than $SSE$. It is common to again appeal to exact theory for the linear model and replace $\exp\left(\frac{1}{n}\chi^2_1\right)$ with something related to an $F$ percentage point, namely
$$ 1 + \frac{1}{n-k}F_{1,n-k} = 1 + \frac{t^2_{n-k}}{n-k} $$
(here the upper $\alpha$ point of the $F$ distribution and the upper $\frac{\alpha}{2}$ point of the $t$ distribution are under discussion).
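A direct way to see Example 16 at work in R: fix $\beta_2$ on a grid, minimize the error sum of squares over $\beta_1$ at each grid value, and keep the $\beta_2$ values whose profiled SSE stays below $SSE\left(1 + t^2/(n-k)\right)$. Everything below (the data, the grid, and the exponential-decay mean function) is an illustrative assumption.

```r
set.seed(6)
n <- 50
x <- runif(n, 0, 5)
y <- 10 * exp(-0.7 * x) + rnorm(n, sd = 0.3)

fit <- nls(y ~ b1 * exp(-b2 * x), start = list(b1 = 5, b2 = 1))
k   <- 2
SSE <- sum(resid(fit)^2)
cut <- SSE * (1 + qt(0.975, n - k)^2 / (n - k))    # SSE (1 + t^2/(n-k))

prof.sse <- function(b2) {
  # for fixed b2 the model is linear in b1, so b1 can be profiled out in closed form
  z  <- exp(-b2 * x)
  b1 <- sum(z * y) / sum(z^2)
  sum((y - b1 * z)^2)
}

grid   <- seq(0.5, 0.9, by = 0.001)
inside <- sapply(grid, prof.sse) < cut
range(grid[inside])                                 # approximate profile interval for b2
```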

3 Mixed Models

A second generalization of the linear model is the model
$$ Y = X\beta + Zu + \epsilon $$
for $Y$, $X$, $\beta$, and $\epsilon$ as in the linear model (1), $Z$ an $n \times q$ model matrix of known constants, and $u$ a $q \times 1$ random vector. It is common to assume that
$$ E\begin{pmatrix} u \\ \epsilon \end{pmatrix} = \mathbf{0} \quad \text{and} \quad \mathrm{Var}\begin{pmatrix} u \\ \epsilon \end{pmatrix} = \begin{pmatrix} \underset{q\times q}{G} & \underset{q\times n}{0} \\ \underset{n\times q}{0} & \underset{n\times n}{R} \end{pmatrix}. $$
This implies that
$$ EY = X\beta \quad \text{and} \quad \mathrm{Var}\,Y = ZGZ' + R. $$
The covariance matrix $V = ZGZ' + R$ is typically a function of several variances (called "variance components"). In simple standard models with balanced data, it is often a patterned matrix. In this model, some objects of interest are the entries of $\beta$ (the so-called "fixed effects"), the variance components, and sometimes (prediction of) the entries of $u$ (the "random effects").

3.1 Maximum Likelihood in Mixed Models

We will acknowledge that the covariance matrix $V = ZGZ' + R$ is typically a function of several (say $p$) variances by writing $V\left(\sigma^2\right)$ (thinking of $\sigma^2 = \left(\sigma^2_1, \sigma^2_2, \ldots, \sigma^2_p\right)$). The normal likelihood function here is
$$ L\left(X\beta, \sigma^2\right) = f\left(Y \mid X\beta, V\left(\sigma^2\right)\right) = (2\pi)^{-\frac{n}{2}}\det V\left(\sigma^2\right)^{-\frac12}\exp\left(-\tfrac12(Y - X\beta)'V\left(\sigma^2\right)^{-1}(Y - X\beta)\right). $$
Notice that for fixed $\sigma^2$ (and therefore fixed $V\left(\sigma^2\right)$), to maximize this as a function of $X\beta$, one wants to use
$$ \widehat{X\beta}\left(\sigma^2\right) = \hat{Y}\left(\sigma^2\right) = \text{the generalized least squares estimate of } X\beta \text{ based on } V\left(\sigma^2\right). $$
This is (after a bit of algebra beginning with the generalized least squares formula (5))
$$ \hat{Y}\left(\sigma^2\right) = X\left(X'V\left(\sigma^2\right)^{-1}X\right)^- X'V\left(\sigma^2\right)^{-1}Y. $$
Plugging this into the likelihood, one produces a "profile likelihood" for the vector $\sigma^2$,
$$ L^*\left(\sigma^2\right) = L\left(\hat{Y}\left(\sigma^2\right), \sigma^2\right) = (2\pi)^{-\frac{n}{2}}\det V\left(\sigma^2\right)^{-\frac12}\exp\left(-\tfrac12\left(Y - \hat{Y}\left(\sigma^2\right)\right)'V\left(\sigma^2\right)^{-1}\left(Y - \hat{Y}\left(\sigma^2\right)\right)\right). $$

At least in theory, this can be maximized to find MLEs for the variance components. And with $\widehat{\sigma^2}_{ML}$ a maximizer of $L^*\left(\sigma^2\right)$, the maximum likelihood estimate of the whole set of parameters is $\left(\widehat{X\beta}\left(\widehat{\sigma^2}_{ML}\right), \widehat{\sigma^2}_{ML}\right)$.

While maximum likelihood is supported by large sample theory, the folklore is that in small samples it tends to underestimate variance components. This is generally attributed to a failure to account for a reduction in degrees of freedom associated with estimation of the mean vector. Restricted Maximum Likelihood or REML estimation is a modification of maximum likelihood that, for the estimation of variance components, essentially replaces use of a likelihood based on $Y$ with one based on a vector of residuals that is known to have mean $\mathbf{0}$, and therefore requires no estimation of a mean vector. More precisely, suppose that $B$ is $m \times n$ of rank $m = n - \mathrm{rank}(X)$ and $BX = 0$. Define $r = BY$. A REML estimate of $\sigma^2$ is a maximizer of a likelihood based on $r$,
$$ L_r\left(\sigma^2\right) = f\left(r \mid \sigma^2\right) = (2\pi)^{-\frac{m}{2}}\det\left(BV\left(\sigma^2\right)B'\right)^{-\frac12}\exp\left(-\tfrac12 r'\left(BV\left(\sigma^2\right)B'\right)^{-1}r\right). $$
Typically the entries of a REML estimate of $\sigma^2$ are larger than those of a maximum likelihood estimate based on $Y$. And (happily) every $m \times n$ matrix $B$ of rank $m = n - \mathrm{rank}(X)$ with $BX = 0$ leads to the same (REML) estimate.

3.2 Estimation of an Estimable Vector of Parametric Functions $C\beta$

Estimability of a linear combination of the entries of $\beta$ means the same thing here as it always does (estimability has only to do with mean structure, not variance-covariance structure). For $C$ an $l \times k$ matrix with $C = AX$ for some $l \times n$ matrix $A$, the BLUE of $C\beta$ is
$$ \widehat{C\beta} = A\hat{Y}\left(\sigma^2\right). $$
This has covariance matrix
$$ \mathrm{Var}\,\widehat{C\beta} = C\left(X'V\left(\sigma^2\right)^{-1}X\right)^- C'. $$
The BLUE of $C\beta$ depends on the variance components, and is typically unavailable. If one estimates $\sigma^2$ with $\widehat{\sigma^2}$, say via REML or maximum likelihood, then one may estimate $V\left(\sigma^2\right)$ as $\hat{V} = V\left(\widehat{\sigma^2}\right)$, which makes available the approximation to the BLUE
$$ \widehat{C\beta} = A\hat{Y}\left(\widehat{\sigma^2}\right). \qquad (20) $$

It is then perhaps possible to estimate the variance-covariance matrix of the approximate BLUE (20) as
$$ \widehat{\mathrm{Var}\,\widehat{C\beta}} = C\left(X'V\left(\widehat{\sigma^2}\right)^{-1}X\right)^- C'. $$
(My intuition is that in small to moderate samples this will tend to produce standard errors that are better than nothing, but probably tend to be too small.)

3.3 Best Linear Unbiased Prediction and Related Inference in the Mixed Model

We now consider problems of prediction related to the random effects contained in $u$. Under the MVN assumption,
$$ E[u \mid Y] = GZ'V^{-1}(Y - X\beta), \qquad (21) $$
which, obviously, depends on the fixed effect vector $\beta$. (For what it is worth,
$$ \mathrm{Var}\begin{pmatrix} u \\ Y \end{pmatrix} = \begin{pmatrix} G & GZ' \\ ZG & ZGZ' + R \end{pmatrix} $$
and the $GZ'$ appearing in (21) is the covariance between $u$ and $Y$.) Something that is close to the conditional mean (21), but that does not depend on fixed effects, is
$$ \hat{u} = GZ'V^{-1}\left(Y - \hat{Y}\right) \qquad (22) $$
where $\hat{Y}$ is the generalized least squares (best linear unbiased) estimate of the mean of $Y$,
$$ \hat{Y} = X\left(X'V^{-1}X\right)^- X'V^{-1}Y. $$
The predictor (22) turns out to be the Best Linear Unbiased Predictor of $u$, and if we temporarily abbreviate
$$ B = X\left(X'V^{-1}X\right)^- X'V^{-1} \quad \text{and} \quad P = V^{-1}(I - B), \qquad (23) $$
this can be written as
$$ \hat{u} = GZ'V^{-1}(Y - BY) = GZ'V^{-1}(I - B)Y = GZ'PY. \qquad (24) $$
We consider here predictions based on $\hat{u}$ and the problems of quoting appropriate precision measures for them. To begin with the $u$ vector itself, $\hat{u}$ is an obvious approximation and a precision of prediction should be related to the variability in the difference
$$ u - \hat{u} = u - GZ'PY = u - GZ'P(X\beta + Zu + \epsilon). $$
This random vector has mean $\mathbf{0}$ and covariance matrix
$$ \mathrm{Var}(u - \hat{u}) = \left(I - GZ'PZ\right)G\left(I - GZ'PZ\right)' + GZ'PRPZG = G - GZ'PZG. \qquad (25) $$

(This last equality is not obvious to me, but is what McCulloch and Searle promise on page 170 of their book.)

Now $\hat{u}$ in (24) is not available unless one knows the covariance matrices $G$ and $V$. If one has estimated variance components and hence has estimates of $G$ and $V$ (and for that matter, $R$, $B$, and $P$), the approximate BLUP
$$ \hat{u} = \hat{G}Z'\hat{P}Y $$
may be used. A way of making a crude approximation to a measure of precision of the approximate BLUP (as a predictor of $u$) is to plug estimates into the relationship (25) to produce
$$ \widehat{\mathrm{Var}(u - \hat{u})} = \hat{G} - \hat{G}Z'\hat{P}Z\hat{G}. $$
Consider now the prediction of a quantity
$$ l = c'\beta + s'u $$
for an estimable $c'\beta$ (estimability has nothing to do with the covariance structure of $Y$, and so estimability means here what it always does). As it turns out, if $c' = a'X$, the BLUP of $l$ is
$$ \hat{l} = a'\hat{Y} + s'\hat{u} = a'BY + s'GZ'PY = \left(a'B + s'GZ'P\right)Y. \qquad (26) $$
To quantify the precision of this as a predictor of $l$ we must consider the random variable $\hat{l} - l$. The variance of this is the unpleasant (but not impossible) quantity
$$ \mathrm{Var}\left(\hat{l} - l\right) = a'X\left(X'V^{-1}X\right)^- X'a + s'\left(G - GZ'PZG\right)s - 2a'BZGs. \qquad (27) $$
(This variance is from page 256 of McCulloch and Searle.) Now $\hat{l}$ of (26) is not available unless one knows the covariance matrices. But with estimates of variance components and corresponding matrices, what is available as an approximation to $\hat{l}$ is
$$ \hat{l} = \left(a'\hat{B} + s'\hat{G}Z'\hat{P}\right)Y. $$
A way of making a crude approximation to a measure of precision of the approximate BLUP (as a predictor of $l$) is to plug estimates into the relationship (27) to produce
$$ \widehat{\mathrm{Var}\left(\hat{l} - l\right)} = a'X\left(X'\hat{V}^{-1}X\right)^- X'a + s'\left(\hat{G} - \hat{G}Z'\hat{P}Z\hat{G}\right)s - 2a'\hat{B}Z\hat{G}s. $$
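In practice the quantities of Sections 3.1 through 3.3 (REML variance component estimates, approximate BLUEs of fixed effects, and approximate BLUPs of random effects) are produced by software. The following is a hedged sketch with the lme4 package on a hypothetical one-way random effects data set; lme4, the data, and all object names are assumptions, not part of the notes.

```r
library(lme4)

set.seed(7)
group <- factor(rep(1:10, each = 5))
u     <- rnorm(10, sd = 2)                 # random group effects
y     <- 3 + u[group] + rnorm(50, sd = 1)
d     <- data.frame(y, group)

fit <- lmer(y ~ 1 + (1 | group), data = d, REML = TRUE)

fixef(fit)      # approximate BLUE of the fixed effect (here the overall mean)
VarCorr(fit)    # REML estimates of the variance components
ranef(fit)      # approximate BLUPs u-hat of the group effects
```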

3.4 Confidence Intervals and Tests for Variance Components

Section 2.4.3 of Pinheiro and Bates indicates that the following is the state of the art for interval estimation of the entries of $\sigma^2 = \left(\sigma^2_1, \sigma^2_2, \ldots, \sigma^2_p\right)$. (This is an application of the general theory outlined in Appendix 7.3.1.) Let
$$ \gamma = \left(\gamma_1, \gamma_2, \ldots, \gamma_p\right) = \left(\log\sigma^2_1, \log\sigma^2_2, \ldots, \log\sigma^2_p\right) $$
be the vector of log variance components. (This one-to-one transformation is used because for finite samples, likelihoods for $\gamma$ tend to be better behaved/more nearly quadratic than likelihoods for $\sigma^2$.) Let $\widehat{\sigma^2}$ be an ML or REML estimate of $\sigma^2$, and define the corresponding vector of estimated log variance components
$$ \hat{\gamma} = \left(\hat{\gamma}_1, \hat{\gamma}_2, \ldots, \hat{\gamma}_p\right) = \left(\log\widehat{\sigma^2_1}, \log\widehat{\sigma^2_2}, \ldots, \log\widehat{\sigma^2_p}\right). $$
Consider a full rank version of the fixed effects part of the mixed model and let
$$ l\left(\beta, \sigma^2\right) = \log L\left(\beta, \sigma^2\right) $$
be the loglikelihood and
$$ l_r\left(\sigma^2\right) = \log L_r\left(\sigma^2\right) $$
be the restricted loglikelihood. Define
$$ l^*(\beta, \gamma) = l\left(\beta, \left(\exp\gamma_1, \exp\gamma_2, \ldots, \exp\gamma_p\right)\right) \quad \text{and} \quad l_r^*(\gamma) = l_r\left(\exp\gamma_1, \exp\gamma_2, \ldots, \exp\gamma_p\right). $$
These are the loglikelihood and restricted loglikelihood for the $\gamma$ parameterization.

To first lay out confidence limits based on maximum likelihood, define a matrix of second partials
$$ M = \begin{pmatrix} \underset{p\times p}{M_{11}} & \underset{p\times k}{M_{12}} \\ \underset{k\times p}{M_{21}} & \underset{k\times k}{M_{22}} \end{pmatrix} $$
where
$$ M_{11} = -\left.\frac{\partial^2 l^*(\beta, \gamma)}{\partial\gamma_i\,\partial\gamma_j}\right|_{(\beta,\gamma)=(\hat\beta,\hat\gamma)_{ML}}, \quad M_{22} = -\left.\frac{\partial^2 l^*(\beta, \gamma)}{\partial\beta_i\,\partial\beta_j}\right|_{(\beta,\gamma)=(\hat\beta,\hat\gamma)_{ML}}, $$
and
$$ M_{12} = -\left.\frac{\partial^2 l^*(\beta, \gamma)}{\partial\gamma_i\,\partial\beta_j}\right|_{(\beta,\gamma)=(\hat\beta,\hat\gamma)_{ML}} = M_{21}'. $$

Then the matrix $Q = M^{-1}$ functions as an estimated variance-covariance matrix for $\left(\hat{\gamma}, \hat{\beta}\right)_{ML}$, which is approximately normal in large samples. So approximate confidence limits for the $i$th entry of $\gamma$ are
$$ \hat{\gamma}_i \pm z\sqrt{q_{ii}} \qquad (28) $$
(where $q_{ii}$ is the $i$th diagonal element of $Q$). Exponentiating, approximate confidence limits for $\sigma^2_i$ based on maximum likelihood are
$$ \left( \widehat{\sigma^2_i}\exp\left(-z\sqrt{q_{ii}}\right),\ \widehat{\sigma^2_i}\exp\left(z\sqrt{q_{ii}}\right) \right). \qquad (29) $$
A similar story can be told based on REML estimates. If one defines
$$ \underset{p\times p}{M_r} = -\left.\frac{\partial^2 l_r^*(\gamma)}{\partial\gamma_i\,\partial\gamma_j}\right|_{\gamma=\hat\gamma_{REML}} $$
and takes $Q_r = M_r^{-1}$, the formula (29) may be used, where $q_{ii}$ is the $i$th diagonal element of $Q_r$.

Rather than working through approximate normality of $\hat{\gamma}$ and quadratic shape for $l^*(\beta,\gamma)$ or $l_r^*(\gamma)$ near its maximum, it is possible to work directly with $\widehat{\sigma^2}$ and get a different formula parallel to (28) for intervals centered at $\widehat{\sigma^2_i}$. Although the two approaches are equivalent in the limit, in finite samples method (29) does a better job of holding its nominal confidence level, so it is the one commonly used.

As discussed in Section 2.4.1 of Pinheiro and Bates, it is common to want to compare two random effects structures for a given $Y$ and fixed effects structure, where the second is more general than the first (the models are "nested"). If $\lambda_1$ and $\lambda_2$ are the corresponding maximized log likelihoods (or maximized restricted log likelihoods), the test statistic
$$ 2(\lambda_2 - \lambda_1) $$
can be used to test the hypothesis that the smaller/simpler/less general model is adequate. If there are $p_1$ variance components in the first model and $p_2$ in the second, large sample theory (see Appendix 7.3.2) suggests a $\chi^2_{p_2 - p_1}$ reference distribution for testing the adequacy of the smaller model.
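A sketch of this likelihood ratio comparison of two nested random effects structures using lme4 follows; the data are hypothetical, anova() with refit = FALSE compares the two REML fits directly, and the chi-square reference with $p_2 - p_1 = 1$ degree of freedom is the large sample approximation described above.

```r
library(lme4)

set.seed(8)
A <- factor(rep(1:6, each = 10))
B <- factor(rep(rep(1:2, each = 5), times = 6))     # a second grouping within each A level
y <- 2 + rnorm(6, sd = 1)[A] + rnorm(12, sd = 0.7)[interaction(A, B)] + rnorm(60)
d <- data.frame(y, A, B)

m1 <- lmer(y ~ 1 + (1 | A),             data = d, REML = TRUE)  # smaller model
m2 <- lmer(y ~ 1 + (1 | A) + (1 | A:B), data = d, REML = TRUE)  # one more variance component

anova(m1, m2, refit = FALSE)   # 2(lambda2 - lambda1) compared to a chi-square with 1 df
```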

Suppose that MS 1,MS 2,,MS l are independent random variables and for constants df 1,df 2,,df l each Consider the random variable Clearly, Further, it is possible to show that (df i ) MS i EMS i χ 2 df i (30) S 2 = a 1 MS 1 + a 2 MS 2 + + a l MS l (31) ES 2 = a 1 EMS 1 + a 2 EMS 2 + + a l EMS l VarS 2 =2 X a 2 (EMS i ) 2 i df i and one might estimate this by plugging in estimates of the quantities (EMS i ) 2 The most obvious way of estimating (EMS i ) 2 is using (MS i ) 2,whichleadsto the estimator \VarS 2 =2 X a 2 (MS i ) 2 i df i A somewhat more refined (but less conservative) estimator follows from the observation that E(MS i ) 2 = (EMS i ) 2, which suggests df i+2 df i \VarS 2 =2 X a 2 (MS i ) 2 i df i +2 p q Clearly, either \VarS2 or \VarS 2 could function as a standard error for S 2 Beyond producing a standard error for S 2, it is common to make a distributional approximation for S 2 The famous Cochran-Satterthwaite approximation is most often used This treats S 2 as approximately a multiple of a chi-square random variable That is, while S 2 does not exactly have any standard distribution, for an appropriate ν one might hope that νs 2 ES 2 χ 2 ν (32) Notice that the variable νs 2 /ES 2 has mean ν Setting Var νs 2 /ES 2 =2ν (the variance of a χ 2 ν random variable) and solving for ν produces ν = ES 2 2 P (ai EMS i ) 2 df i (33) 29

Approximation (32) with degrees of freedom (33) is the Cochran-Satterthwaite approximation. This approximation leads to (unusable) confidence limits for $ES^2$ of the form
$$ \left( \frac{\nu S^2}{\text{upper } \frac{\alpha}{2} \text{ point of } \chi^2_\nu},\ \frac{\nu S^2}{\text{lower } \frac{\alpha}{2} \text{ point of } \chi^2_\nu} \right). \qquad (34) $$
These are unusable because $\nu$ depends on the unknown expected mean squares. But the degrees of freedom parameter may be estimated as
$$ \hat{\nu} = \frac{\left(S^2\right)^2}{\sum_i \left(a_i MS_i\right)^2 / df_i} \qquad (35) $$
and plugged into the limits (34) to get even more approximate (but usable) confidence limits for $ES^2$ of the form
$$ \left( \frac{\hat{\nu} S^2}{\text{upper } \frac{\alpha}{2} \text{ point of } \chi^2_{\hat{\nu}}},\ \frac{\hat{\nu} S^2}{\text{lower } \frac{\alpha}{2} \text{ point of } \chi^2_{\hat{\nu}}} \right). \qquad (36) $$
(Approximation (32) is also used to provide approximate $t$ intervals for various kinds of estimable functions in particular mixed models.)

3.6 Mixed Models and the Analysis of Balanced Two-Factor Nested Data

There are a variety of standard (balanced data) analyses of classical ANOVA that are based on particular mixed models. Koehler's mixed models notes treat several of these. One is the two-factor nested data analysis of Sections 28.1-28.5 of Neter et al. that we proceed to consider. We assume that Factor A has $a$ levels, within each one of which Factor B has $b$ levels, and that there are $c$ observations from each of the $a \cdot b$ instances of a level of A and a level of B within that level of A. Then a possible mixed effects model is that for $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$, and $k = 1, 2, \ldots, c$,
$$ y_{ijk} = \mu + \alpha_i + \beta_{ij} + \epsilon_{ijk} $$
where $\mu$ is an unknown constant (the model's only fixed effect) and the $a$ random effects $\alpha_i$ are iid $N\left(0, \sigma^2_\alpha\right)$, independent of the $ab$ random effects $\beta_{ij}$ that are iid $N\left(0, \sigma^2_\beta\right)$, and the set of $\alpha_i$ and $\beta_{ij}$ is independent of the set of $\epsilon_{ijk}$ that are iid $N\left(0, \sigma^2\right)$. This can be written in the basic mixed model form $Y = X\beta + Zu + \epsilon$,