2.7 Estimation with Linear Restriction
Proof (Method 1: first show that $a \in \mathcal{C}(W^T)$, which implies that a function estimable under the old model is also estimable under the new model; second, show that $E[a^T\hat\beta_G] = a^T\beta$ and that $a^T\hat\beta_G$ is a linear function of $Z$.)

First notice that $a^T\hat\beta_G$ is a linear function of $Y$. Rewrite the model as $Z = W\beta + \epsilon^*$, where $Z = \Sigma^{-1/2}Y$, $W = \Sigma^{-1/2}X$, $E[\epsilon^*] = 0$, and $\mathrm{Var}[\epsilon^*] = I_n$. It can be shown that $a^T\beta$ is also estimable under the new model; if this is true, $a^T\hat\beta_G$ is the BLUE for $a^T\beta$.

$a^T\beta$ is estimable $\iff a \in \mathcal{C}(X^T)$. Claim: $\mathcal{C}(X^T) = \mathcal{C}(X^T\Sigma^{-1/2}) = \mathcal{C}(W^T)$. Based on the claim, $a \in \mathcal{C}(W^T)$, which implies that $a^T\beta$ is estimable under $Z = W\beta + \epsilon^*$. Because $\hat\beta_G$ is the OLSE of $\beta$ under $Z = W\beta + \epsilon^*$, $a^T\hat\beta_G$ is the BLUE of $a^T\beta$ under $Z = W\beta + \epsilon^*$, which is equivalent to the original model.

Proof of the claim. For any $y \in \mathcal{C}(X^T\Sigma^{-1/2})$, there exists $c \in \mathbb{R}^n$ such that $y = X^T\Sigma^{-1/2}c = X^T(\Sigma^{-1/2}c)$, which implies $y \in \mathcal{C}(X^T)$. This proves $\mathcal{C}(X^T\Sigma^{-1/2}) \subseteq \mathcal{C}(X^T)$. Conversely, for any $y \in \mathcal{C}(X^T)$, there exists $c \in \mathbb{R}^n$ such that $y = X^Tc = X^T\Sigma^{-1/2}\Sigma^{1/2}c = (X^T\Sigma^{-1/2})(\Sigma^{1/2}c)$, which implies $\mathcal{C}(X^T) \subseteq \mathcal{C}(X^T\Sigma^{-1/2})$. Therefore $\mathcal{C}(X^T\Sigma^{-1/2}) = \mathcal{C}(X^T)$.

Proof (Method 2)

1. Since $\Sigma$ is positive definite, its square root exists and is nonsingular. Let $Z = \Sigma^{-1/2}Y$ and $W = \Sigma^{-1/2}X$. Then $Z = \Sigma^{-1/2}Y = \Sigma^{-1/2}X\beta + \Sigma^{-1/2}\epsilon$, and in class we showed that
$$\hat\beta_G = (X^T\Sigma^{-1}X)^-X^T\Sigma^{-1}Y = (W^TW)^-W^TZ.$$
Since $a^T\beta$ is estimable, there exists a $c$ such that $a = X^Tc = X^T\Sigma^{-1/2}\Sigma^{1/2}c = W^T\Sigma^{1/2}c \in \mathcal{C}(W^T)$, which implies that $a^T\beta$ is also estimable under the new model.

2. Writing $a = W^T\tilde c$ with $\tilde c = \Sigma^{1/2}c$,
$$a^T\hat\beta_G = a^T(W^TW)^-W^TZ = \tilde c^TW(W^TW)^-W^TZ = \tilde c^TP_WZ.$$
Since $P_W$ is invariant to the choice of $(W^TW)^-$, $a^T\hat\beta_G$ is also invariant to the choice of generalized inverse. Moreover,
$$E[a^T\hat\beta_G] = E[a^T(W^TW)^-W^TZ] = a^T(W^TW)^-W^TW\beta = a^T\beta,$$
since estimability of $a^T\beta$ implies $a^T(W^TW)^-W^TW = a^T$ (Thm 4.9 (2)).

3. Any linear estimator $d^TZ$ in the new model is a linear estimator in $Y$, since $d^TZ = (d^T\Sigma^{-1/2})Y$; conversely, any $b^TY$ equals $d^TZ$ with $d = \Sigma^{1/2}b$. Clearly $\hat\beta_G$ is the OLSE for $Z = W\beta + \epsilon^*$ with $\epsilon^* \sim (0, I)$. Thus $a^T\hat\beta_G$ is the BLUE.
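The equivalence above is easy to check numerically. Below is a minimal sketch (simulated data; the variable names are ours, not from the notes): it computes $\hat\beta_G$ directly from the GLS formula and again by OLS on the whitened model $Z = \Sigma^{-1/2}Y$, $W = \Sigma^{-1/2}X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))

# A positive definite error covariance matrix.
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.multivariate_normal(np.zeros(n), Sigma)

# GLS directly: beta_G = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y.
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)

# Whitening: Sigma^{-1/2} from the eigendecomposition, then plain OLS on (W, Z).
w, V = np.linalg.eigh(Sigma)
Sig_inv_half = V @ np.diag(w ** -0.5) @ V.T
W, Z = Sig_inv_half @ X, Sig_inv_half @ Y
beta_ols = np.linalg.lstsq(W, Z, rcond=None)[0]

print(np.allclose(beta_gls, beta_ols))  # True: the two routes agree
```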
2.7 Estimation with Linear Restriction

Design Matrix of Full Rank

Let $Y = X\beta + \epsilon$, where $X$ has full rank. Suppose we want to minimize $(Y - X\beta)^T(Y - X\beta)$ subject to the linear restriction $A\beta = c$, where $A$ is a known $q \times p$ matrix of rank $q$ and $c$ is a known $q \times 1$ vector. We will use Lagrange multipliers to find the least squares estimate subject to the linear restriction. Let
$$f(\beta, \lambda) = (Y - X\beta)^T(Y - X\beta) + \lambda^T(A\beta - c).$$
Taking the derivative with respect to $\beta$ and setting it to zero gives
$$-2X^TY + 2X^TX\beta + A^T\lambda = 0.$$
Let $\hat\beta_H$ denote the estimate subject to the linear restriction. Then
$$\hat\beta_H = (X^TX)^{-1}X^TY - \tfrac{1}{2}(X^TX)^{-1}A^T\hat\lambda_H = \hat\beta - \tfrac{1}{2}(X^TX)^{-1}A^T\hat\lambda_H.$$
The linear restriction gives
$$c = A\hat\beta_H = A\hat\beta - \tfrac{1}{2}A(X^TX)^{-1}A^T\hat\lambda_H,$$
which yields
$$-\tfrac{1}{2}\hat\lambda_H = [A(X^TX)^{-1}A^T]^{-1}(c - A\hat\beta).$$
Substituting into the formula for $\hat\beta_H$, we have
$$\hat\beta_H = \hat\beta + (X^TX)^{-1}A^T[A(X^TX)^{-1}A^T]^{-1}(c - A\hat\beta).$$
Note: when $X$ does not have full rank, not all linear restrictions have a solution.

Proposition 2.14 The estimate given above does minimize $\epsilon^T\epsilon$ subject to $A\beta = c$.

Proof First,
$$\epsilon^T\epsilon = \|Y - X\beta\|^2 = \|Y - \hat Y + \hat Y - X\beta\|^2 = \|Y - \hat Y\|^2 + \|\hat Y - X\beta\|^2 + 2(Y - \hat Y)^T(\hat Y - X\beta)$$
$$= \|Y - \hat Y\|^2 + \|\hat Y - X\beta\|^2 + 2Y^T(I - P_X)^TX(\hat\beta - \beta) = \|Y - \hat Y\|^2 + \|X(\hat\beta - \beta)\|^2,$$
since $(I - P_X)X = 0$. Second,
$$\|X(\hat\beta - \beta)\|^2 = (\hat\beta - \beta)^TX^TX(\hat\beta - \beta) = (\hat\beta - \hat\beta_H + \hat\beta_H - \beta)^TX^TX(\hat\beta - \hat\beta_H + \hat\beta_H - \beta) = \|X(\hat\beta - \hat\beta_H)\|^2 + \|X(\hat\beta_H - \beta)\|^2.$$
The last step is true because
$$2(\hat\beta - \hat\beta_H)^TX^TX(\hat\beta_H - \beta) = \hat\lambda_H^TA(X^TX)^{-1}X^TX(\hat\beta_H - \beta) = \hat\lambda_H^TA(\hat\beta_H - \beta) = \hat\lambda_H^T(c - c) = 0,$$
using $\hat\beta - \hat\beta_H = \tfrac{1}{2}(X^TX)^{-1}A^T\hat\lambda_H$ and $A\hat\beta_H = A\beta = c$. Thus,
$$\epsilon^T\epsilon = \|Y - \hat Y\|^2 + \|X(\hat\beta - \hat\beta_H)\|^2 + \|X(\hat\beta_H - \beta)\|^2.$$
Only the last term depends on the feasible $\beta$, so $\epsilon^T\epsilon$ is minimized when $\beta = \hat\beta_H$.
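The closed form is easy to verify numerically. Here is a minimal sketch (made-up data; the function name `restricted_lse` is ours): it checks that $A\hat\beta_H = c$ holds exactly and that $\hat\beta_H$ has no worse RSS than another feasible point.

```python
import numpy as np

def restricted_lse(X, Y, A, c):
    """beta_H = beta_hat + (X'X)^{-1} A' [A (X'X)^{-1} A']^{-1} (c - A beta_hat)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ Y
    K = A @ XtX_inv @ A.T
    return beta + XtX_inv @ A.T @ np.linalg.solve(K, c - A @ beta)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
Y = rng.standard_normal(30)
A = np.array([[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # q = 2 restrictions
c = np.array([0.0, 1.0])

bH = restricted_lse(X, Y, A, c)
print(np.allclose(A @ bH, c))  # True: the restriction holds exactly

# beta_H minimizes ||Y - X b||^2 among feasible b: perturb within the null space of A.
rss = lambda b: np.sum((Y - X @ b) ** 2)
null_dir = np.array([1.0, 1.0, 1.0, -1.0])  # A @ null_dir = 0, so still feasible
print(rss(bH) <= rss(bH + 0.1 * null_dir))  # True
```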
Here is another proof.

Proof
$$\|Y - X\beta\|^2 = \|Y - X\hat\beta + X\hat\beta - X\hat\beta_H + X\hat\beta_H - X\beta\|^2$$
$$= \|Y - X\hat\beta\|^2 + \|X\hat\beta - X\hat\beta_H\|^2 + \|X\hat\beta_H - X\beta\|^2 + 2[\text{cross terms}]$$
$$= \|Y - X\hat\beta\|^2 + \|X\hat\beta - X\hat\beta_H\|^2 + \|X\hat\beta_H - X\beta\|^2.$$
The last step is true because
$$Y - X\hat\beta = (I - P_X)Y, \qquad X\hat\beta - X\hat\beta_H = X(\hat\beta - \hat\beta_H) = \tfrac{1}{2}X(X^TX)^{-1}A^T\hat\lambda_H, \qquad X\hat\beta_H - X\beta = X(\hat\beta_H - \beta),$$
so all three cross terms vanish:
$$(Y - X\hat\beta)^T(X\hat\beta - X\hat\beta_H) = Y^T(I - P_X)X(\hat\beta - \hat\beta_H) = 0,$$
$$(Y - X\hat\beta)^T(X\hat\beta_H - X\beta) = Y^T(I - P_X)X(\hat\beta_H - \beta) = 0,$$
$$(X\hat\beta - X\hat\beta_H)^T(X\hat\beta_H - X\beta) = (\hat\beta - \hat\beta_H)^TX^TX(\hat\beta_H - \beta) = \tfrac{1}{2}\hat\lambda_H^TA(X^TX)^{-1}X^TX(\hat\beta_H - \beta) = \tfrac{1}{2}\hat\lambda_H^T(A\hat\beta_H - A\beta) = \tfrac{1}{2}\hat\lambda_H^T(c - c) = 0.$$

An interesting observation (Exercise 3.g.4 on page 62):
$$\|Y - \hat Y_H\|^2 - \|Y - \hat Y\|^2 = \sigma^2\hat\gamma_H^T(\mathrm{Var}[\hat\gamma_H])^{-1}\hat\gamma_H.$$

Example Consider the linear model
$$Y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1, 2, \; j = 1, 2,$$
where $\epsilon \sim (0, I\sigma^2)$. Consider the linear restriction $\mu_1 = \mu_2$, or $\mu_1 - \mu_2 = 0$. We can rewrite the restriction as $A\beta = (1\ {-1})\beta = 0$, so $A = (1\ {-1})$ and $\hat\beta = (\bar Y_{1\cdot}, \bar Y_{2\cdot})^T$. With $(X^TX)^{-1} = \mathrm{diag}(1/2, 1/2)$,
$$\hat\beta_H = \hat\beta + \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix}\left[(1\ {-1})\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix}\right]^{-1}\left(0 - (1\ {-1})\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}\right) = \hat\beta - \begin{pmatrix} 1/2 \\ -1/2 \end{pmatrix}(\hat\beta_1 - \hat\beta_2) = \begin{pmatrix} \tfrac{\bar Y_{1\cdot} + \bar Y_{2\cdot}}{2} \\ \tfrac{\bar Y_{1\cdot} + \bar Y_{2\cdot}}{2} \end{pmatrix}.$$

Example Consider the model $Y_i = \mu_i + \epsilon_i$, $i = 1, 2, 3, 4$, where $\epsilon \sim (0, I_4\sigma^2)$. We are interested in the restriction $\mu_1 + \mu_2 + \mu_3 + \mu_4 = 0$. Clearly $X = I_4$, $\hat\beta = (y_1, y_2, y_3, y_4)^T$, $A = (1, 1, 1, 1)$, and $c = 0$. Thus the restricted estimate is
$$\hat\beta_H = \hat\beta + (I_4)^{-1}A^T(AA^T)^{-1}(0 - A\hat\beta) = \hat\beta - A^T\,\tfrac{1}{4}\textstyle\sum_i y_i = (y_1 - \bar y,\ y_2 - \bar y,\ y_3 - \bar y,\ y_4 - \bar y)^T.$$
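The second example is easy to reproduce numerically. A short sketch (the data values below are arbitrary, chosen only for illustration):

```python
import numpy as np

y = np.array([2.0, -1.0, 4.0, 3.0])      # responses; X = I_4, so beta_hat = y
A = np.ones((1, 4))                       # restriction: mu1 + mu2 + mu3 + mu4 = 0
c = np.array([0.0])

# beta_H = beta_hat + A' (A A')^{-1} (c - A beta_hat), since X'X = I_4.
beta_H = y + A.T @ np.linalg.solve(A @ A.T, c - A @ y)
print(np.allclose(beta_H, y - y.mean()))  # True: centering, as derived above
```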
Design Matrix of Less Than Full Rank

Take a subset of $X$, $X_1$, and a subset of $Z$, $Z_1$.

2.8 Adding/Deleting Covariates/Cases

The technique introduced here is closely related to the sweep algorithm for linear model fitting. See Chapter 11 for more about the sweep algorithm.

2.8.1 Adding Covariates

Suppose we have fit the model $Y = X\beta + \epsilon$ and obtained the OLSE $\hat\beta$. Now we have a new set of variables $Z$ and we want to add it to the model. In other words, we want to obtain the OLSE of $\beta$ and $\gamma$ under the following model:
$$Y = X\beta + Z\gamma + \epsilon = (X\ Z)\begin{pmatrix}\beta \\ \gamma\end{pmatrix} + \epsilon = W\delta + \epsilon.$$
Here we assume that (1) the columns of $X$ and the columns of $Z$ are linearly independent; (2) both $X$ and $Z$ are full rank; (3) $\epsilon \sim (0, \sigma^2I)$.

One can refit the model and estimate $\beta$ and $\gamma$ using the new model. However, this is not efficient. A more efficient way is to use what we have learned from the old model to obtain the OLSE for the new model.

Lemma 2.15 Let $R = I - P_X$. Suppose both $X$ and $Z$ are full rank ($\mathrm{rank}(X) = p$, $\mathrm{rank}(Z) = t$) and the columns of $X$ and the columns of $Z$ are linearly independent. Then (1) $Z^TRZ$ is p.d.; (2) $\mathrm{rank}(Z^TRZ) = \mathrm{rank}(Z) = t$.

Before proving the lemma, let's review a property of p.d. matrices. We learned that if $A_{n \times p}$ has rank $p$, then $A^TA > 0$. Proof: $x^TA^TAx = y^Ty \ge 0$, with equality iff $Ax = 0$ iff $x = 0$; the last "iff" holds because $\mathrm{rank}(A) = p$, so the columns of $A$ are linearly independent.

Here is the proof of the lemma:

Proof Suppose the columns of $RZ$ are not linearly independent. Then there exists $a \ne 0$ such that $RZa = 0$, i.e., $Za = P_XZa = Xb$ for some $b$. Because the columns of $X$ and the columns of $Z$ are linearly independent, $Za - Xb = 0$ forces $a = b = 0$. The contradiction implies that the columns of $RZ$ are linearly independent and $\mathrm{rank}(RZ) = t$. In addition, since $R$ is symmetric and idempotent, $Z^TRZ = (RZ)^T(RZ)$. Therefore $(RZ)^T(RZ) = Z^TRZ$ is p.d. and $\mathrm{rank}(Z^TRZ) = \mathrm{rank}(RZ) = t$.

Theorem 2.16 Let $R_G = I - P_W$, $L = (X^TX)^{-1}X^TZ$, $M = (Z^TRZ)^{-1}$, and let $\hat\delta_G$ be the OLSE of $\delta$. Then

1. $\hat\gamma_G = (Z^TRZ)^{-1}Z^TRY = [(RZ)^T(RZ)]^{-1}(RZ)^TRY = MZ^TRY$
2. $\hat\beta_G = (X^TX)^{-1}X^T(Y - Z\hat\gamma_G) = \hat\beta - L\hat\gamma_G$

3. $\mathrm{RSS}_{new} = (Y - \hat Y_G)^T(Y - \hat Y_G) = (Y - Z\hat\gamma_G)^TR(Y - Z\hat\gamma_G) = Y^TRY - \hat\gamma_G^TZ^TRY$

4. $\mathrm{Var}(\hat\delta_G) = \sigma^2\begin{pmatrix}(X^TX)^{-1} + LML^T & -LM \\ -ML^T & M\end{pmatrix}$

Proof

1. First orthogonalize $Z$ by writing it as $Z = P_XZ + RZ$. Thus the new model is
$$Y = X\beta + (P_XZ + RZ)\gamma + \epsilon = X\alpha + RZ\gamma + \epsilon = (X\ RZ)\begin{pmatrix}\alpha \\ \gamma\end{pmatrix} + \epsilon = V\lambda + \epsilon,$$
where $\alpha = \beta + (X^TX)^{-1}X^TZ\gamma = \beta + L\gamma$. Note that $\mathcal{C}(X) \perp \mathcal{C}(RZ)$ (this is true because for any $a$ and $b$ we have $a^TX^TRZb = 0$, since $X^TR = X^T - X^TP_X = 0$); thus the columns of $X$ and $RZ$ are linearly independent and $V$ is full rank. The OLSE of $\lambda$ is
$$\hat\lambda = (V^TV)^{-1}V^TY = \begin{pmatrix}X^TX & X^TRZ \\ Z^TRX & Z^TRZ\end{pmatrix}^{-1}\begin{pmatrix}X^TY \\ Z^TRY\end{pmatrix} = \begin{pmatrix}X^TX & 0 \\ 0 & Z^TRZ\end{pmatrix}^{-1}\begin{pmatrix}X^TY \\ Z^TRY\end{pmatrix} = \begin{pmatrix}(X^TX)^{-1}X^TY \\ (Z^TRZ)^{-1}Z^TRY\end{pmatrix} = \begin{pmatrix}\hat\alpha_G \\ \hat\gamma_G\end{pmatrix}.$$
Thus $\hat\gamma_G = (Z^TRZ)^{-1}Z^TRY$. Because $R$ is a projection matrix, we can rewrite $\hat\gamma_G$ as
$$\hat\gamma_G = (Z^TRZ)^{-1}Z^TRY = [(RZ)^T(RZ)]^{-1}(RZ)^T(RY).$$
Note that if $Z$ were the only set of covariates in the model, we would solve $Z^TZ\gamma = Z^TY$. With $X$ already in the model, we replace $Z$ with $RZ$ and replace $Y$ with $RY$.

2. Since $\hat\alpha_G = \hat\beta_G + L\hat\gamma_G$, we have
$$\hat\beta_G = (X^TX)^{-1}X^TY - L\hat\gamma_G = \hat\beta - L\hat\gamma_G.$$
This relates the OLSE under the old model to that under the new model.

3. Here we want to show the relationship between the RSS of the old model and that of the new model. Note that
$$Y - \hat Y_{new} = Y - X\hat\beta_G - Z\hat\gamma_G = Y - X\hat\beta + XL\hat\gamma_G - Z\hat\gamma_G = (I - P_X)Y + (X(X^TX)^{-1}X^T - I)Z\hat\gamma_G = RY - RZ\hat\gamma_G.$$
Thus
$$\mathrm{RSS}_{new} = (RY - RZ\hat\gamma_G)^T(RY - RZ\hat\gamma_G) = (Y - Z\hat\gamma_G)^TR(Y - Z\hat\gamma_G).$$
Also note that $RZ\hat\gamma_G = RZ(Z^TRZ)^{-1}Z^TRY = P_{RZ}Y$. Hence,
$$\mathrm{RSS}_{new} = Y^TRY - 2Y^T(RZ\hat\gamma_G) + \hat\gamma_G^TZ^TRZ\hat\gamma_G = Y^TRY - 2Y^TP_{RZ}Y + Y^TP_{RZ}Y = \mathrm{RSS}_{old} - Y^TP_{RZ}Y.$$
This result indicates that adding new variables never increases the RSS: it decreases it by $Y^TP_{RZ}Y \ge 0$.
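The theorem translates directly into an update rule. Here is a minimal sketch (simulated data; variable names are ours) checking that the update reproduces a full refit and that the RSS drops by exactly $Y^TP_{RZ}Y$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, t = 50, 3, 2
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, t))
Y = rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
R = np.eye(n) - X @ XtX_inv @ X.T        # R = I - P_X
beta_old = XtX_inv @ X.T @ Y

# Update formulas from Theorem 2.16.
gamma_G = np.linalg.solve(Z.T @ R @ Z, Z.T @ R @ Y)
L = XtX_inv @ X.T @ Z
beta_G = beta_old - L @ gamma_G

# Compare with a full refit on W = (X Z).
W = np.hstack([X, Z])
delta = np.linalg.lstsq(W, Y, rcond=None)[0]
print(np.allclose(np.concatenate([beta_G, gamma_G]), delta))  # True

# RSS_new = RSS_old - Y' P_RZ Y.
RZ = R @ Z
P_RZ = RZ @ np.linalg.solve(Z.T @ R @ Z, RZ.T)
rss_old = Y @ R @ Y
rss_new = np.sum((Y - W @ delta) ** 2)
print(np.allclose(rss_new, rss_old - Y @ P_RZ @ Y))  # True
```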
4. Note that $\hat\gamma_G = (Z^TRZ)^{-1}Z^TRY = MZ^TRY$. Thus
$$\mathrm{Var}(\hat\gamma_G) = \mathrm{Var}(MZ^TRY) = MZ^TRZM\sigma^2 = \sigma^2M,$$
$$\mathrm{Cov}(\hat\beta, \hat\gamma_G) = \mathrm{Cov}((X^TX)^{-1}X^TY,\ MZ^TRY) = \sigma^2(X^TX)^{-1}X^TRZM = 0 \quad (\text{since } X^TR = 0),$$
$$\mathrm{Cov}(\hat\beta_G, \hat\gamma_G) = \mathrm{Cov}(\hat\beta - L\hat\gamma_G,\ \hat\gamma_G) = \mathrm{Cov}(\hat\beta, \hat\gamma_G) - L\,\mathrm{Var}(\hat\gamma_G) = -\sigma^2LM,$$
$$\mathrm{Var}(\hat\beta_G) = \mathrm{Var}(\hat\beta) - \mathrm{Cov}(\hat\beta, \hat\gamma_G)L^T - L\,\mathrm{Cov}(\hat\gamma_G, \hat\beta) + L\,\mathrm{Var}(\hat\gamma_G)L^T = \sigma^2(X^TX)^{-1} + \sigma^2LML^T = \sigma^2[(X^TX)^{-1} + LML^T].$$

These results imply that if we add one single covariate $z$ to the model, then
$$\hat\beta_{new} = \begin{pmatrix}\hat\beta_{old} - l\,m\,z^TRY \\ m\,z^TRY\end{pmatrix},$$
where $l = (X^TX)^{-1}X^Tz$ and $m = (z^TRz)^{-1}$. This is a homework assignment.

We can also use A.9.1 to prove the results. A.9.1 states that if all inverses exist, then
$$\begin{pmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{pmatrix}^{-1} = \begin{pmatrix}A^{11} & A^{12} \\ A^{21} & A^{22}\end{pmatrix} = \begin{pmatrix}A_{11}^{-1} + B_{12}B_{22}^{-1}B_{21} & -B_{12}B_{22}^{-1} \\ -B_{22}^{-1}B_{21} & B_{22}^{-1}\end{pmatrix},$$
where $B_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}$, $B_{12} = A_{11}^{-1}A_{12}$, and $B_{21} = A_{21}A_{11}^{-1}$.

Now let $A_{11} = X^TX$, $A_{12} = X^TZ$, $A_{21} = Z^TX$, $A_{22} = Z^TZ$, with $L = (X^TX)^{-1}X^TZ$ and $M = (Z^TRZ)^{-1}$ as before; note $B_{22} = Z^TZ - Z^TP_XZ = Z^TRZ = M^{-1}$. We can show that
$$A^{11} = (X^TX)^{-1} + LML^T, \quad A^{12} = -LM, \quad A^{21} = -ML^T, \quad A^{22} = M.$$
Because
$$\begin{pmatrix}\hat\beta_G \\ \hat\gamma_G\end{pmatrix} = \begin{pmatrix}A^{11} & A^{12} \\ A^{21} & A^{22}\end{pmatrix}\begin{pmatrix}X^TY \\ Z^TY\end{pmatrix},$$
we have $\hat\beta_G = \hat\beta - L\hat\gamma_G$.

Example of Seber (page 57). Let the columns of $X$ be denoted by $x_{(j)}$, $j = 0, 1, \dots, p-1$, so that
$$E[Y] = x_{(0)}\beta_0 + x_{(1)}\beta_1 + x_{(2)}\beta_2 + \dots + x_{(p-1)}\beta_{p-1}.$$
Suppose now that we want to introduce the explanatory variable $x_{(p)}$ into the model. Thus $Z^TRZ = x_{(p)}^TRx_{(p)}$ is a scalar. Hence,
$$\hat\beta_{p,G} = \hat\gamma_G = (Z^TRZ)^{-1}Z^TRY = \frac{x_{(p)}^TRY}{x_{(p)}^TRx_{(p)}},$$
$$(\hat\beta_{0,G}, \dots, \hat\beta_{p-1,G})^T = \hat\beta - (X^TX)^{-1}X^Tx_{(p)}\hat\beta_{p,G},$$
$$Y^TR_GY = Y^TRY - \hat\beta_{p,G}\,x_{(p)}^TRY.$$
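Seber's single-covariate case is worth checking on its own, since everything reduces to scalars. A sketch with simulated data (names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.standard_normal((n, p))
x_new = rng.standard_normal(n)            # the new column x_(p)
Y = rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
R = np.eye(n) - X @ XtX_inv @ X.T
beta_old = XtX_inv @ X.T @ Y

# Scalar update: beta_p = x'RY / x'Rx, then adjust the old coefficients.
beta_p = (x_new @ R @ Y) / (x_new @ R @ x_new)
beta_rest = beta_old - XtX_inv @ X.T @ x_new * beta_p

full = np.linalg.lstsq(np.column_stack([X, x_new]), Y, rcond=None)[0]
print(np.allclose(np.append(beta_rest, beta_p), full))  # True
```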
2.8.2 Removing Covariates

This is a homework problem. Assume that $X_1$ and $X_2$ are full rank and the columns of the two matrices are linearly independent. Suppose we have obtained the OLSEs of $\beta_1$ and $\beta_2$, denoted by $\hat\beta_1$ and $\hat\beta_2$ respectively, in the following full model:
$$Y = X_1\beta_1 + X_2\beta_2 + \epsilon.$$
Now consider removing $X_2$ from the model and fitting the following new model:
$$Y = X_1\beta_1 + \epsilon,$$
and denote the corresponding OLSE by $\tilde\beta_1$. Then
$$\tilde\beta_1 = \hat\beta_1 + (X_1^TX_1)^{-1}X_1^TX_2\hat\beta_2.$$

Question: When does adding/removing covariates not affect the OLSE of the covariates already in the model? Think about perpendicular covariates and (balanced) analysis of variance. This happens when the columns of $X$ and the columns of $Z$ are orthogonal.

Adding Subjects

Suppose we have fit a linear model using $n$ subjects with the design matrix $X$. Now data from $k$ new subjects are available; we denote the design matrix for them by $X_1$ and the responses by $Y_1$. How can we update our model efficiently? Let $\hat\beta_{old}$ denote the OLSE based on the first $n$ subjects, i.e., $\hat\beta_{old} = (X^TX)^{-1}X^TY$, and let $\hat\beta_{new}$ denote the OLSE using all available data. Then we have
$$\hat\beta_{new} = (X^TX + X_1^TX_1)^{-1}(X^TY + X_1^TY_1).$$
Recall the Sherman–Morrison–Woodbury formula (A.9.3):
$$(A + UBV)^{-1} = A^{-1} - A^{-1}UB(B + BVA^{-1}UB)^{-1}BVA^{-1}.$$
Here $A = X^TX$, $U = X_1^T$, $B = I_k$, and $V = X_1$. We have
$$(X^TX + X_1^TX_1)^{-1} = (X^TX)^{-1} - (X^TX)^{-1}X_1^T(I + X_1(X^TX)^{-1}X_1^T)^{-1}X_1(X^TX)^{-1}.$$
Thus,
$$\hat\beta_{new} = [(X^TX)^{-1} - (X^TX)^{-1}X_1^T(I + X_1(X^TX)^{-1}X_1^T)^{-1}X_1(X^TX)^{-1}](X^TY + X_1^TY_1)$$
$$= \hat\beta_{old} + (X^TX)^{-1}X_1^T(I + X_1(X^TX)^{-1}X_1^T)^{-1}[-X_1(X^TX)^{-1}X^TY - X_1(X^TX)^{-1}X_1^TY_1 + (I + X_1(X^TX)^{-1}X_1^T)Y_1]$$
$$= \hat\beta_{old} + (X^TX)^{-1}X_1^T(I + X_1(X^TX)^{-1}X_1^T)^{-1}[Y_1 - X_1\hat\beta_{old}].$$

When adding one subject We add a single row to $X$; let $x_1^T$ denote the row and $y_1$ the corresponding response. We have
$$\hat\beta_{new} = \hat\beta_{old} + \frac{(X^TX)^{-1}x_1}{1 + x_1^T(X^TX)^{-1}x_1}[y_1 - x_1^T\hat\beta_{old}].$$
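A sketch of the block update for $k$ new subjects (simulated data; names are ours), checked against a refit on the stacked data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 30, 5, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
X1 = rng.standard_normal((k, p))
Y1 = rng.standard_normal(k)

XtX_inv = np.linalg.inv(X.T @ X)
beta_old = XtX_inv @ X.T @ Y

# Block update: beta_new = beta_old + (X'X)^{-1} X1' (I_k + X1 (X'X)^{-1} X1')^{-1} (Y1 - X1 beta_old).
G = np.eye(k) + X1 @ XtX_inv @ X1.T
beta_new = beta_old + XtX_inv @ X1.T @ np.linalg.solve(G, Y1 - X1 @ beta_old)

# Refit on the stacked data for comparison.
beta_refit = np.linalg.lstsq(np.vstack([X, X1]), np.concatenate([Y, Y1]), rcond=None)[0]
print(np.allclose(beta_new, beta_refit))  # True
```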
Note, this can also be proved using A.9.4b. Let $C = (X^TX)^{-1}x_1$ and $D = (1 + x_1^T(X^TX)^{-1}x_1)^{-1} = (1 + x_1^TC)^{-1}$. By A.9.4,
$$(X^TX + x_1x_1^T)^{-1} = (X^TX)^{-1} - CDC^T.$$
Thus,
$$\hat\beta_{new} = (X^TX + x_1x_1^T)^{-1}(X^TY + x_1y_1) = \hat\beta_{old} - CDC^T(X^TY + x_1y_1) + Cy_1$$
$$= \hat\beta_{old} + CD[D^{-1}y_1 - C^TX^TY - C^Tx_1y_1] = \hat\beta_{old} + CD[y_1 - C^TX^TY] = \hat\beta_{old} + CD[y_1 - x_1^T\hat\beta_{old}]$$
$$= \hat\beta_{old} + \frac{(X^TX)^{-1}x_1}{1 + x_1^T(X^TX)^{-1}x_1}[y_1 - x_1^T\hat\beta_{old}],$$
using $D^{-1}y_1 - C^Tx_1y_1 = (1 + x_1^TC)y_1 - x_1^TCy_1 = y_1$.

Removing Subjects

We single out the situation where we want to remove the $i$-th subject. Use A.9.4, i.e.,
$$(A - uv^T)^{-1} = A^{-1} + \frac{A^{-1}uv^TA^{-1}}{1 - v^TA^{-1}u}.$$
Let $X_{(i)}$ denote the design matrix with the $i$-th observation removed. Note that $X_{(i)}^TX_{(i)} = X^TX - x_ix_i^T$. We have
$$(X_{(i)}^TX_{(i)})^{-1} = (X^TX - x_ix_i^T)^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}}{1 - x_i^T(X^TX)^{-1}x_i} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}}{1 - h_{ii}}.$$
The last equation is true because we define $h_{ii}$ as the $(i, i)$ element of $X(X^TX)^{-1}X^T$, so $h_{ii} = x_i^T(X^TX)^{-1}x_i$. Hence,
$$\hat\beta_{(i)} = [X_{(i)}^TX_{(i)}]^{-1}(X^TY - x_iy_i) = \left[(X^TX)^{-1} + \frac{(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}}{1 - h_{ii}}\right](X^TY - x_iy_i)$$
$$= \hat\beta - \frac{(X^TX)^{-1}x_i}{1 - h_{ii}}[y_i(1 - h_{ii}) - x_i^T\hat\beta + h_{ii}y_i] = \hat\beta - \frac{(X^TX)^{-1}x_i\,e_i}{1 - h_{ii}},$$
where $e_i = y_i - x_i^T\hat\beta$ is the $i$-th residual.

Note, this result is very useful in diagnostics, as it shows the influence of a single data point on the OLSE. Since $\sum_i h_{ii} = \mathrm{trace}(H) = p$, the average of the $h_{ii}$ is $p/n$. If $h_{ii}$ is close to 1, the $i$-th observation has a large influence on parameter estimation. The difference $\hat\beta - \hat\beta_{(i)}$ (called DFBETA) and its standardized version are often used to flag outliers. We will discuss this later. (A numerical sketch of this leave-one-out update is given at the end of this section.)

3 Hypothesis Testing

3.1 Introduction

Discuss all three tests: Wald, LRT, and Score? In this section we assume $Y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^2I_n)$ and $X$ is full rank, i.e., $\mathrm{rank}(X) = p$.
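As promised above, here is a numerical sketch of the leave-one-out update $\hat\beta_{(i)} = \hat\beta - (X^TX)^{-1}x_i\,e_i/(1 - h_{ii})$ (simulated data; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ Y
H = X @ XtX_inv @ X.T                      # hat matrix; trace(H) = p
e = Y - X @ beta                           # residuals

i = 7                                      # drop the i-th subject
beta_i = beta - XtX_inv @ X[i] * (e[i] / (1 - H[i, i]))

# Compare with refitting without row i (DFBETA = beta - beta_i).
beta_refit = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(Y, i), rcond=None)[0]
print(np.allclose(beta_i, beta_refit))     # True
print(np.isclose(np.trace(H), p))          # True: average leverage is p/n
```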