PARTIAL RIDGE REGRESSION¹

by

D. Raghavarao² and K. J. C. Smith
Department of Statistics
University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 863

February, 1973

¹This work was supported by NSF Grants GU-2059 and GU-19568 and by U.S. Air Force Grant No. AFOSR-68-1415. Reproduction in whole or in part is permitted for any purpose of the United States Government.
²On leave from Punjab Agricultural University (India).
ABSTRACT

A partial ridge estimator is proposed as a modification of the Hoerl and Kennard ridge regression estimator. It is shown that the proposed estimator has certain advantages over the ridge estimator. The problem of taking an additional observation to meet certain optimality criteria is also discussed.
1. Introduction. Consider the problem of fitting a linear model $y = X\beta + \epsilon$, where $y' = (y_1, y_2, \ldots, y_n)$ is a vector of $n$ observations on the dependent variable; $X = (x_{ij})$ is an $n \times p$ matrix of rank $p$, $x_i' = (x_{i1}, x_{i2}, \ldots, x_{ip})$ being the vector of $i$-th observations on the independent variables $(i = 1, 2, \ldots, n)$; $\beta' = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector of parameters to be estimated; and $\epsilon$ is an $n$-dimensional vector of random errors assumed to be distributed with mean vector $0$ and dispersion matrix $\sigma^2 I_n$, $0$ being a zero vector and $I_n$ the identity matrix of order $n$. Without loss of generality we assume that the dependent and independent variables are standardized so that $X'X$ is a correlation matrix.

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ be the eigenvalues of $X'X$ and let $\xi_1, \xi_2, \ldots, \xi_p$ be a set of orthonormal eigenvectors associated with the eigenvalues $\lambda_i$ $(i = 1, 2, \ldots, p)$. Let $\alpha_i = \xi_i' \beta$ for $i = 1, 2, \ldots, p$.

The usual least squares estimator of $\beta$ is given by

(1.1)    $\hat{\beta} = (X'X)^{-1} X'y$

and has the unsatisfactory property, when $X'X$ differs substantially from an identity matrix, that the mean squared error, or expected squared distance from $\hat{\beta}$ to $\beta$, tends to be large compared to that of an orthogonal system. Often an investigator is interested in obtaining a variance balanced design in which each parameter $\beta_i$ is estimated with equal precision. The departure of a design from variance balancedness increases the more $X'X$ differs from an identity matrix.

The ridge regression method proposed by Hoerl and Kennard (1970) estimates $\beta$ by the ridge estimator given by

(1.2)    $\beta^* = (X'X + k I_p)^{-1} X'y$,
where $k$ is a positive real number satisfying

(1.3)    $k < \sigma^2 / \alpha_{\max}^2$,

$\alpha_{\max}^2$ being the maximum of $\alpha_i^2$ $(i = 1, 2, \ldots, p)$. The estimator $\beta^*$ is a biased estimator of $\beta$ but has a smaller mean squared error than the least squares estimator $\hat{\beta}$. We propose here, as an alternative to the ridge estimator $\beta^*$, the estimator

(1.4)    $\hat{\beta}_p = (X'X + k_p \xi_p \xi_p')^{-1} X'y$,

where

(1.5)    $k_p = \sigma^2 / \alpha_p^2$.

This estimator may be called the partial ridge estimator of $\beta$. We show in Section 2 that the partial ridge estimator estimates $\xi_p' \beta$ with minimum mean squared error and estimates the $\xi_i' \beta$ $(i = 1, 2, \ldots, p-1)$ unbiasedly. In Section 3 we consider the problem of taking an additional observation so as to remove the bias of the partial ridge estimator and to attain certain optimality criteria.

2. Partial Ridge Estimator. To control the mean squared error of the estimator of the coefficient vector $\beta$ in the model $y = X\beta + \epsilon$, Hoerl and Kennard (1970) proposed the ridge estimator $\beta^*$ defined by (1.2) and showed that the mean squared error of $\beta^*$ was less than that of the least squares estimator $\hat{\beta}$ of $\beta$. Specifically, the mean squared error of $\beta^*$ is

(2.1)    $E[(\beta^* - \beta)'(\beta^* - \beta)] = \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + k)^2} + k^2 \sum_{i=1}^{p} \frac{\alpha_i^2}{(\lambda_i + k)^2} = \gamma_1(k) + \gamma_2(k)$, say,
where $E[\cdot]$ denotes the expected value of the term in brackets. The term $\gamma_1(k)$ is the sum of the variances of the components of $\beta^*$ and the term $\gamma_2(k)$ is the bias component of the mean squared error. When $k = 0$, the ridge estimator coincides with the least squares estimator.

We propose as an alternative to the ridge estimator of $\beta$ a partial ridge estimator of $\beta$, denoted by $\hat{\beta}_p$, defined by (1.4). The partial ridge estimator has the following property:

Theorem 2.1. The partial ridge estimator $\hat{\beta}_p = (X'X + k_p \xi_p \xi_p')^{-1} X'y$, where $k_p = \sigma^2 / \alpha_p^2$, is such that $\xi_p' \hat{\beta}_p$ is the linear estimator of $\xi_p' \beta$ with minimum mean squared error and $\xi_i' \hat{\beta}_p$ is the best linear unbiased estimator of $\xi_i' \beta$ $(i = 1, 2, \ldots, p-1)$.

Proof. Since $\xi_1, \xi_2, \ldots, \xi_p$ are a set of orthonormal eigenvectors associated with the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ of $X'X$, the vectors $X \xi_i$ will be eigenvectors associated with the eigenvalues $\lambda_i$ of $XX'$ $(i = 1, 2, \ldots, p)$. Let $\eta_1, \eta_2, \ldots, \eta_{n-p}$ be a set of orthonormal eigenvectors associated with the zero eigenvalue of multiplicity $n-p$ of $XX'$. The vectors $X \xi_i$ $(i = 1, 2, \ldots, p)$ and $\eta_j$ $(j = 1, 2, \ldots, n-p)$ form a basis of an $n$-dimensional vector space. Without loss of generality any linear estimator of $t = \xi_p' \beta$ can be taken to be

(2.2)    $\hat{t} = \sum_{i=1}^{p} c_i \xi_i' X'y + \sum_{j=1}^{n-p} d_j \eta_j' y$,

where the $c_i$ and $d_j$ are scalars. The mean squared error of $\hat{t}$ as an estimator of $t$ can be shown to be

(2.3)    $E[(\hat{t} - t)^2] = \sum_{i=1}^{p-1} c_i^2 (\lambda_i^2 \alpha_i^2 + \sigma^2 \lambda_i) + (c_p \lambda_p - 1)^2 \alpha_p^2 + \sigma^2 c_p^2 \lambda_p + \sigma^2 \sum_{j=1}^{n-p} d_j^2$.

Minimizing (2.3) with respect to the coefficients $c_i$ and $d_j$ we have
(2.4)    $c_1 = \cdots = c_{p-1} = 0$, $d_1 = \cdots = d_{n-p} = 0$, $c_p = \frac{1}{\lambda_p + \sigma^2/\alpha_p^2}$.

Choosing $k_p = \sigma^2 / \alpha_p^2$, the linear estimator of $\xi_p' \beta$ with least mean squared error is given by $(\lambda_p + k_p)^{-1} \xi_p' X'y$. The best linear unbiased estimators of $\xi_i' \beta$ are the least squares estimators $\lambda_i^{-1} \xi_i' X'y$ $(i = 1, 2, \ldots, p-1)$. Making a 1-1 correspondence of estimators of $\xi_i' \beta$ with estimators of $\beta_i$, the required estimator $\hat{\beta}_p$ of $\beta$ is given by

$\hat{\beta}_p = \left( \sum_{i=1}^{p-1} \lambda_i^{-1} \xi_i \xi_i' + (\lambda_p + k_p)^{-1} \xi_p \xi_p' \right) X'y = (X'X + k_p \xi_p \xi_p')^{-1} X'y$.

This completes the proof of Theorem 2.1.

The problem of estimating $k_p$ can be solved either by graphical or iterative procedures as described by Hoerl and Kennard (1970). From (2.1) and (2.3) we note that the bias component in the mean squared error of the partial ridge estimator is smaller than that of the ridge estimator.

3. Optimum choice of an additional observation. The equation (1.4) defining the partial ridge estimator suggests taking an additional observation $y_{n+1}$ on the dependent variable corresponding to some choice of values of the independent variables. Let us assume without loss of generality that the design matrix with an additional observation is

(3.1)    $X_1 = \begin{pmatrix} X \\ w x_{n+1}' \end{pmatrix}$,
where $x_{n+1}' x_{n+1} = 1$ and $w$ is a non-zero scalar. The least squares estimator of $\beta$ using the additional observation is

(3.2)    $\hat{\beta}_1 = (X'X + w^2 x_{n+1} x_{n+1}')^{-1} X_1' y_1$,

which is an unbiased estimator of $\beta$. Before discussing the optimal choice of the additional observation, we shall introduce the following:

Definition 3.1. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ be the eigenvalues of $X'X$, where $X$ is an $n \times p$ design matrix. The departure from variance balancedness of the design $X$ is measured by

(3.3)    $Q(X) = \sum_{i=1}^{p} (\lambda_i - \bar{\lambda})^2$, where $\bar{\lambda} = \frac{1}{p} \sum_{i=1}^{p} \lambda_i$.

An equivalent expression for $Q(X)$ is

(3.4)    $Q(X) = \mathrm{tr}[(X'X)^2] - \frac{1}{p} \left( \mathrm{tr}[X'X] \right)^2$,

where $\mathrm{tr}[A]$ denotes the trace of the matrix $A$.

Definition 3.2. [Kiefer (1959)] Of the class of all $n \times p$ design matrices $X$, the design $X$ is A-optimal if $\mathrm{tr}[(X'X)^{-1}]$ is minimum.

Definition 3.3. [Kiefer (1959)] Of the class of all $n \times p$ design matrices $X$, the design $X$ is D-optimal if $\det[(X'X)^{-1}]$ is minimum, where $\det[\cdot]$ denotes the determinant of the matrix in brackets.
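The equivalence of (3.3) and (3.4) is easy to verify numerically. The following sketch (the data and variable names are illustrative choices of ours, not from the paper) compares the eigenvalue form with the trace form, which is convenient because it requires no eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary n x p design, purely for illustration.
n, p = 20, 4
X = rng.normal(size=(n, p))
S = X.T @ X

# (3.3): sum of squared deviations of the eigenvalues from their mean.
lam = np.linalg.eigvalsh(S)
Q_eigen = np.sum((lam - lam.mean()) ** 2)

# (3.4): the same quantity via traces, since tr[S^2] = sum lam_i^2
# and (tr[S])^2 / p = p * lam_bar^2.
Q_trace = np.trace(S @ S) - np.trace(S) ** 2 / p

print(np.isclose(Q_eigen, Q_trace))   # the two expressions agree
```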
The following theorem gives the optimum choice of $w$ and $x_{n+1}$ for an additional observation:

Theorem 3.1. Given the $n \times p$ design matrix $X$, among possible choices of $w$ and $x_{n+1}$ in (3.1), the design

(3.5)    $X^* = \begin{pmatrix} X \\ w \xi_p' \end{pmatrix}$, with $w$ given by (3.7) below,

has the following properties:

(i) $Q(X^*) < Q(X)$;

(ii) among the class of designs $X_1$ in (3.1), $Q(X^*) \leq Q(X_1)$;

(iii) among the class of designs $X_1$ in (3.1) and subject to $Q(X_1)$ minimum, $X^*$ is A- and D-optimal.

Proof. For the design $X_1$ of (3.1),

(3.6)    $Q(X_1) = Q(X) + 2 w^2 x_{n+1}' X'X x_{n+1} + w^4 \left( 1 - \tfrac{1}{p} \right) - 2 w^2$.

The quadratic form $x_{n+1}' (X'X) x_{n+1}$ is minimized when $x_{n+1} = \xi_p$ and the minimum value is $\lambda_p$. (Note that $\lambda_p < 1$ unless $X'X = I_p$, since $\mathrm{tr}[X'X] = p$.) Substituting this least value of $x_{n+1}' X'X x_{n+1}$ in (3.6) and minimizing with respect to $w$, we obtain the stationary value of $w$ to be

(3.7)    $w^2 = \frac{1 - \lambda_p}{1 - \tfrac{1}{p}}$.
Substituting into (3.6), the minimum value of $Q(X_1)$ is

(3.8)    $Q_{\min}(X_1) = Q(X^*) = Q(X) - \frac{(1 - \lambda_p)^2}{1 - \tfrac{1}{p}}$.

Thus $Q(X^*) < Q(X)$. Moreover, $Q(X^*)$ is the minimum value of $Q(X_1)$.

Now

(3.9)    $\det[(X_1' X_1)^{-1}] = \det[(X'X)^{-1}] \left( 1 + w^2 x_{n+1}' (X'X)^{-1} x_{n+1} \right)^{-1}$.

The maximum value of $x_{n+1}' (X'X)^{-1} x_{n+1}$ is $1/\lambda_p$, attained for $x_{n+1} = \xi_p$; hence $\det[(X_1' X_1)^{-1}]$ is minimized with respect to $x_{n+1}$ when $x_{n+1} = \xi_p$. In order that $Q(X_1)$ be least, $w$ must be given by (3.7). Hence $X^*$ is D-optimal among the class of designs $X_1$ with minimum $Q(X_1)$.

To prove the A-optimality of $X^*$ among the class of designs $X_1$ with minimum $Q(X_1)$, we observe that

(3.10)    $\mathrm{tr}[(X_1' X_1)^{-1}] = \mathrm{tr}[(X'X)^{-1}] - \frac{w^2 x_{n+1}' (X'X)^{-2} x_{n+1}}{1 + w^2 x_{n+1}' (X'X)^{-1} x_{n+1}}$.

The maximum value of the second term on the right hand side of (3.10) is the maximum of

(3.11)    $\mu = \frac{1}{\lambda \left( \frac{\lambda}{w^2} + 1 \right)}$,

where the $\lambda$'s are the eigenvalues of $X'X$. In order that $Q(X_1)$ is least, $w$ is given by (3.7), and the maximum $\mu$ is attained when $\lambda = \lambda_p$ and $x_{n+1} = \xi_p$. Thus $X^*$ is A-optimal among the class of $X_1$ matrices with minimum $Q(X_1)$.
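The main results above can be illustrated numerically. The sketch below (the data, the dimensions, and the value of $k_p$ are illustrative choices of ours; in particular, $k_p = \sigma^2/\alpha_p^2$ is unknown in practice and a stand-in value is used) checks that the partial ridge estimator of Theorem 2.1 coincides with least squares along $\xi_1, \ldots, \xi_{p-1}$, and that appending the row $w \xi_p'$ of Theorem 3.1 reduces the balance measure by exactly the amount in (3.8).

```python
import numpy as np

rng = np.random.default_rng(1)

# Standardized design: columns centered and scaled so X'X is a correlation matrix.
n, p = 30, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] + 0.3 * X[:, 2]          # induce non-orthogonality
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=n)

S = X.T @ X
lam, V = np.linalg.eigh(S)                  # ascending order: lam[0] = lambda_p
lam_p, xi_p = lam[0], V[:, 0]

# Theorem 2.1: the partial ridge estimator (1.4) shrinks only the xi_p
# component; the other eigen-components equal least squares (hence unbiased).
k_p = 0.2                                   # illustrative stand-in for sigma^2/alpha_p^2
beta_ls = np.linalg.solve(S, X.T @ y)
beta_pr = np.linalg.solve(S + k_p * np.outer(xi_p, xi_p), X.T @ y)
for i in range(1, p):                       # eigenvectors other than xi_p
    assert np.isclose(V[:, i] @ beta_pr, V[:, i] @ beta_ls)

# Theorem 3.1: append the row w * xi_p' with w^2 = (1 - lambda_p)/(1 - 1/p).
def Q(M):
    """Departure from variance balancedness, definition (3.3)/(3.4)."""
    ev = np.linalg.eigvalsh(M.T @ M)
    return np.sum((ev - ev.mean()) ** 2)

w = np.sqrt((1.0 - lam_p) / (1.0 - 1.0 / p))
X_star = np.vstack([X, w * xi_p])

drop = (1.0 - lam_p) ** 2 / (1.0 - 1.0 / p)
print(Q(X), Q(X_star))                      # Q(X_star) = Q(X) - drop, per (3.8)
```

The reduction in (3.8) relies on $\mathrm{tr}[X'X] = p$, which the column standardization guarantees; with unstandardized columns, the appended row would have to be rescaled accordingly.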
References

Hoerl, Arthur E. and Kennard, Robert W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12, 55-67.

Kiefer, J. (1959). "Optimum Experimental Designs." J. Roy. Statist. Soc., Ser. B, 21, 272-304.