The successive raising estimator and its relation with the ridge estimator

Size: px
Start display at page:

Download "The successive raising estimator and its relation with the ridge estimator"

Transcription

1 The successive raising estimator and its relation with the ridge estimator J. García and D.E. Ramirez July 27, 206 Abstract The raise estimators are used to reduce collinearity in linear regression models by raising a column in the experimental data matrix which may be nearly linear with the other columns. The raising procedure has two components, namely stretching and rotating which we can analyze separately. We give the relationship between the raise estimators and the classical ridge estimators. Using a case study, we show how to determine the perturbation parameter for the raise estimators by controlling the amount of precision to be retained in the original data. J. García Department of Economics and Business, Almeria University, Almeria, Spain jgarcia@ual.es D. Ramirez Department of Mathematics, University of Virginia, Charlottesville, VA USA der@virginia.edu Keywords: Raise estimators, ridge estimators, collinearity, Stein estimators MSC: 62J05, 62J07 Introduction A technique to mitigate the existence of collinearity is constrained estimators such as ridge estimators and Stein estimators. In both cases, the goal is to diminish the Mean Square Error MSE and to mitigate the problems inherent with collinearity. These estimators improve the effect of collinearity by constraining the Mahalanobis distance of the estimator β under the constraint β Qβ = δ 2 with Q a positive definite matrix and δ > 0. From the model y = Xβ + u with uncorrelated, zero-mean, and homoscedastic errors, we know that the OLS estimators β are given by β = X X X y.

2 A direct application of Lagrange multipliers gives the constrained estimator β cs with β csq β cs = δ 2 as β cs = X X+kQ X y, 2 where k is the Lagrange parameter chosen to satisfy the constraint β csq β cs = δ 2. From the Loewner partial ordering X X L X X+kQ, the inverses will have the opposite partial order X X+kQ L X X. When Q and X X commute, for example with Q = ki p or Q = kx X, the two cases considered below, the pair X X, X X+kQ can be simultaneously diagonalized into the pair I p, Λ with Λ the diagonal matrix containing the characteristic roots of X X+kQ. From the Loewner partial ordering X X L X X+kQ, the Loewner partial ordering I p L Λ follows, and thus I 2 p L Λ 2 so X X 2 L X X+kQ 2, and it follows that β cs 2 β 2 with the constrained estimators shown to have decreased in length. We note that in general, A L B does not imply that A 2 L B 2, Puntanen et al. 20, p. 36. From this idea, we consider two types of constrained estimators:. Stein estimators: From Q = X X the expression 2 becomes: ˆβ s = X X + kx X X y = + k ˆβ. 3 For a detailed study of the Stein estimators we recommend the original reference of Stein 960. These estimators were also studied by Mayer and Willke 973. These estimators, which satisfy expression 3, define a whole class of biased estimators, the so called shrinkage estimators. They are obtained by shrinking the least squares estimator towards the origin, and they satisfy the M SE Admissibility Condition which assures an improvement decrease in MSE for some k 0,. 2. Ridge estimators: From Q = I p, the expression 2 becomes: ˆβ r = X X + ki p X y, 4 where I p is the matrix identity. Hoerl and Kennard 970 established that the ridge estimators also satisfy the M SE Admissibility Condition assuring an improvement in MSE for some k 0,. The earliest detailed expositions of these estimators are found in Marquardt 963 and Hoerl and Kennard 970 with Marquardt 963 acknowledging that Levenberg 944 had observed that a perturbation of the diagonal improved convergence in steepest descent algorithms. The history of the early use of matrix diagonal increments in statistical problems is given in the article by Piegorsh and Casella 989. Numerous papers have been written justifying the need to find values 0 < k i i p to perturb each of the characteristic roots of the matrix X X to 2

3 improve the condition number of the inverse X X. For a positive definite matrix A, we define the condition number κ as the ratio of the largest to smallest of the eigenvalues of A. From this, the generalized ridge estimator is established as β r = X X + K X y, where K is the diagonal matrix diagk,..., k p containing these perturbations. This regression was briefly commented on by Hoerl and Kennard 970 and Goldstein and Smith 974. As the ridge estimator is not equivariant under scaling, the common convention is to scale X X to correlation form with the explanatory variables centered and scaled to unit length and allowing for a constant term in the linear model. With X X in correlation form, the estimated variances varβ i of the estimates are equal, and the perturbation matrix is usually taken to be K =ki p with k > 0. The attempts to find the basis of the ridge estimator have been numerous; for example, by using the constrained regression imposing the constraint β β δ 2, Davidov 2006 used the Kuhn-Tucker Theorem to show that the conditioned least squared solution agrees with the ridge solution with constraint β β = δ 2. Some anomalies of the ridge methodology can be found in Jensen and Ramirez 2008, Kapat and Goel 200, and Jensen and Ramirez 200a. Using the notation from García et al. 205 define the augmented matrix for X and the augmented vector for y by X A = [X, ki p ] and y A = [y, 0 p], respectively. The least squared solutions for the constrained model X Xβ = X y subject to β β = δ 2 and the unconstrained model X A X Aβ = X A y A both have solution β = X X+kI p X y. Augmenting the matrix and the response vector is similar in spirit to imposing identifiability constraints on the parameters as in Seber 977. Hoerl and Kennard 970, Eq. 2.3 suggested the relationship between the ridge estimator and the OLS estimator as β r = [I p + kx X ] β. However, note that since β r is constrained to lie on the sphere β β = δ 2, β r has a joint singular distribution in R n with rank n while β has a joint non-singular distribution in R n with rank n. These and other anomalies with ridge regression are discussed in Jensen and Ramirez 200a. To avoid problems of singularity in the distribution of β r, one can consider a perturbation of the data X X S with X S X S = X X+kI p as in Jensen and Ramirez They call X S the surrogate matrix, and it is defined using a perturbation of the singular values of X, and yields the surrogate estimator β S from X S X Sβ = X Sy. Similarly, the raised estimators, discussed below, also improve the ill-conditioning X X with X used on both sides of the normal equation X Xβ = X y, an importance difference with ridge estimators. Some authors bypass the constrained optimization foundation of ridge estimators and adopt the view, from numerical analysis given in Riley 955, that perturbations of the diagonal improves the stability for matrix inverses. For the augmented model y A = X A β+u the error terms cannot be centered. This could be understood as a critique to ridge estimator obtained from the 3

4 augmented model. Minimizing the sum of the squared errors e i = y i ŷ i for the augmented rows p i= e2 i = k β β replaces the constraint β β = δ 2. Case studies in García et al. 205 show that the variance inflation factors V IF i i p computed for the surrogate model y = X S β + u and for the augmented model y A = X A β + u A are asymptotically equal n. To conclude, by using the words of Marshall and Olkin 979 it seems that the ridge estimator can find a justification in the result provided by Riley 955 where it is shown that the matrix X X + ki p is better conditioned than the matrix X X. In this paper, we present some properties of the raised estimators originally presented by García et al. 20 and its varieties: the stretched and the rotated estimators. We focus on finding an explication about the basis of the ridge and Stein estimators. Raised regression is similar to ridge estimator. One difference is that the ridge procedure seeks stability of X X on the left-hand side of the normal equations X X β= X y, transforming the normal equations into the ridge equations X X+kI p β= X y. Our procedure is in the spirit of the surrogate estimators of Jensen and Ramirez 200b where X is modified X X and the modified matrix is used on both sides of the normal equations. Thus the raised procedure allows the user to visualize the perturbed design matrix X, a feature not available with ridge regression. The paper is organized as follows: Section 2 presents the generalized raised estimator for two variables. Section 3 presents the Stretched and the Rotated estimators as varieties of the raised estimator. Section 4 studies the raised estimators in the case of two variables and shows the way to obtain X X+kI p by using the successive raising procedure. Section 5 studies the variance inflation factors and the metric number. Section 6 gives the relation between the raise and ridge estimators. Our case study is presented in Section 7. Finally, Section 8 contains the conclusions. Some technical results are shown in the two appendices. 2 The generalized raise method for two variables We consider the linear model y = Xβ + u where Eu = 0, Euu = σ 2 I n and X is a full rank matrix n p. For convenience, we have taken σ 2 =. We assume that the variables are centered and standardized, that is, X X is in correlation form. For the n k matrix A = [a, a 2,..., a p ], the columns span is denoted by SpA, with A j denoting the j th column vector a j, and A [j] denoting the n p matrix formed by deleting A j from A. For the linear model y = Xβ + u, central to a study of collinearity is the relationship between X j and SpX [j]. Indeed, there is a monotone relationship between the collinearity indices κ j of Stewart 987, the variance inflation factors V IF j, and the angle between X j, SpX [j], see Jensen and Ramirez 203, Theorem 4. 4

5 Let P [j] = X [j] X [j] X [j] X [j] be the projection operator onto the subspace SpX [j] R n spanned by the columns of the reduced or relaxed matrix X [j]. From the geometry of the right triangle formed by X j, P [j] X j, the squared lengths satisfy X j 2 = P [j] X j 2 + X j P [j] X j 2, where X j P [j] X j 2 = RSS j is the residual sum of squares from the regression X j = X [j] α + v with α = P [j] X j. The angle θ j between X j and SpX [j] is given by the angle between X j and P [j] X j and satisfies X j cosθ j = P X [j] j X j P [j] X j = P X [j] j. X j We show the details leading to the raise method with p = 2. In the case of two variables, X = [x, x 2 ], the model is given by y = β x + β 2 x 2 +u where the matrix X X is defined as X x2i x X = i = x2i x i ρ with ρ the correlation coeffi cient between the exogenous variables. ρ On the other hand, the matrix X y is given by X y = x i y i, x 2i y i. The collinearity problem occurs when the vectors x and x 2 are very close; that is, when the angle between the vectors is very small. To correct this problem, we will try to separate both vectors. The raise method begins with the regression between x and x 2 in the following way. Pick one of the columns of X, say X, and regress this column using the remaining columns: X = X [] α + v x = αx 2 + v. The matrix X 2 is X 2 = x 2, x 22,..., x 2n, with X 2 X 2 = x 2 2i = and X [] X = X 2 X = x 2i x i = ρ. Thus, ˆα = X [] X [] X [] X = [ ] [ ] X 2 X 2 X 2 X = ρ. 5 Summarizing, with x 2 = = x 2 2, we can write that x = ρx 2 + e with the estimated errors e orthogonal to SpX [] ; and, in particular, e x 2 with e = x ρx 2 with x 2 = P [] x 2 + e 2 x 2 = ρx e 2. As x 2 = x 2 2 =, the raise method transforms x x by the rule x = x + λe = x + λx ρx 2 = + λx λρx 2 5

6 with x 2 = + λ 2 x λλρx x 2 + λρ 2 x 2 2 = + λ 2 ρ 2 λ 2 + 2λ. 6 Equivalently with x 2 = +2λ ρ 2 +λ 2 ρ 2 with inner product x x 2 = x + λe x 2 = ρ + λe x 2 = ρ. Denote X = [ x, x 2]. With λ > 0, x 2 > x 2, and we say x has been raised from an angle θ to an angle θ with cosθ = P [] X / x and cos θ = P [] X / x. The Variance Inflation Factor V IF for X is functionally related to the angle θ by the rule θ = arccos /V IF, for example Jensen and Ramirez 203. Thus as λ, θ 90 and the variance inflation factor V IF converges to one indicating that collinearity is being eliminated, as in García et al. 20, Theorem 4.2. If we replace x by x in the original linear regression from Eq., the raised model will be y = β λ x +β 2 λx 2 +u where the estimated parameters depend on λ and will be denoted as β λ and β 2 λ. Summarizing, in the case of two variables, X = [x, x 2 ], we regressed the first column using the remaining second column from X = X [] α + v with x = ρx 2 + e and x = x + λe. With the first column raised, the raised transformation X X <> has the raised design matrix given by X <> = x, x 2 = + λ x λ ρx 2, x 2, 7 using <,..., k > to indicate the columns that have been raised, notation that will be helpful in the general case. We note that X + 2λ ρ <>X <> = 2 + λ 2 ρ 2 ρ, 8 ρ with only the, element of X X being effected by raising the first column of X. We consider additional columns to be raised in Section 4. With ridge regression, the coeffi cient of determination R 2 k monotonically decreases as k, MacDonald 200. We give the parallel result for surrogate regression for RS 2 k in Appendix A. The importance of these results is that it allows the user to set a lower bound for changes to the original model, such as R 2 k 0.95 or RS 2 k One very desirable property of the raised regression method is that the coeffi cient of determination does not change with R 2 λ = R 2 0 for all λ 0,. Additionally, the predicted values with the OLS regression X β and the predicted values with the raised regression X βλ are the same: X β = X βλ. 9 Thus raising a column vector in X is not effecting the basic OLS regression model. These results follow from noting that the raised vector remains in the original SpX, e j = X j P [j] X j SpX so Sp X = SpX, as shown in García et al

7 3 The stretched and rotated estimators The raised column x j = x j + λe j with λ 0 has two effects; namely, stretching the column and rotating the column. We study the effects separately with a case study. Set X 0 = [, x, x 2, x 3 ] the 20 4 matrix from Rawlings 998, Ex.. which has been designed to be highly collinear. The pattern has x as the sequence from 20 to 29 and repeated; x 2 is x 25 with the st and th term changed from 5 to 4; x 3 is the repeated sequence 5, 4, 3, 2,, 2, 3, 4, 5, 6. The eigenvalues for X 0X 0 are [2439, 52.4, 40., ] with condition number for X 0X 0 as λ max /λ min = Scaling X 0 X u to have unit length reduces the eigenvalues for X ux u to [2.898,.007, , ] with condition number for X ux u as Centering and rescaling X u X c into correlation form without the constant term reduces the eigenvalues for X cx c to [2.67, 0.830, ] with condition number for X cx c as The correlation between column of X c and column 2 of X c2 is ρx c, X c2 = The variance inflation factors V IF j for X c are the diagonal elements of X cx c and are [69.4, 75.7,.688] from which the angles in radians between the column vector X cj and the span of the remaining columns SpX c[j] can be computed by θ j = arccos /V IF j, for example Jensen and Ramirez 203. Converting to degree measure, the angles are [4.407, 4.327, ] supporting that one should first try to improve collinearity by modifying the second column. Following the case study in García et al. 20, the responses are generated for the linear model y = X u, with u i generated from a normal N0, distribution. Using this data set, the centered and standardize model is y y n = X c β c + u = X c u, with OLS estimates β = 4.48, 3.8, The raised column will be the second column of the centered and standardized X c with x c2 = x c2 + λe 2 with λ 0. The empirical risk for the OLS estimator given the true value β c is β β c β β c = β β c 2 = The goal is to improve, that is decrease, the empirical risk by using the raised estimators and its two components separately; namely, the stretched estimators and the rotated estimators. Following the Stein procedure, we consider the stretched estimators based on the transformation X c2 + λx c2 ; that is, the second column of X c will be stretched by the factor + λ with λ 0. Denote the stretched estimators by β st λ. They are given by β st λ = β, β 2 / + λ, β 3 with β st λ 2 monotonically decreasing to β, 0, β 3 2. For this case study, the 7

8 empirical risk for the stretched estimators β st λ β c 2 can be reduced from β st 0 β c 2 = to min λ 0 β st λ β c 2 = for + λ = 5.402, noting that = 3.8/ Using the Jackknife procedure, the minimum for the average empirical risk for the stretched estimators over the n = 20 replications is min λ 0 avg n=20 β st λ β c 2 = for +λ = As this is a scaling procedure, the angle between + λx c2 and SpX c[2] has not changed, and thus the V IF s have not changed. The rotated estimators are similar to the raised estimators with the added constraint that the columns length remains at unit length. This allows one to track the improvement in ill-conditioning without confounding with a change in scale. The rotated estimators are based on the transformation X c2 [X c2 + λe 2 ]/ X c2 + λe 2 with λ 0; that is, the second column of X c is raised, but then scaled back to unit length. For the same λ, both the rotated model and the raised model will have the same correlation matrix, the same angle between the second column of the design matrix and the span of the remaining columns, and the same variance inflation factors. Denote the rotated estimators by β rot λ. As the second column is perturbed by a multiple of e 2, and since e 2 is orthogonal to the other columns, both the rotated estimators β rot λ and the raised estimators βλ have the same values except in the row corresponding to the raised column. For this case study, the empirical risk for the rotated estimators β rot λ β c 2 can be reduced from β rot 0 β c 2 = to min λ 0 β rot λ β c 2 = for λ = For λ = 3.502, the V IF s are [9.394, 9.68,.2] with associated angle θ 2 = 8.8. Using the Jackknife procedure, the minimum for the average empirical risk for the rotated estimators over the n = 20 replications is min λ 0 avg n=20 β rot λ β c 2 = for λ = For this case study, the empirical risk for the raised estimators βλ β c 2 can be reduced from β0 β c 2 = to min λ 0 βλ β c 2 = for λ = For λ = 3.40, V IF s are [9.744, 9.98,.22] with associated angle θ 2 = Using the Jackknife procedure, the minimum for the average empirical risk for the raised estimators over the n = 20 replications is min λ 0 avg n=20 βλ β c 2 = for λ = These values are comparable to those using the rotated estimator. Summarizing, the raised estimators are based on stretching and rotating the column to be improved. Both effects should be positive as shown with the stretching estimators and the rotating estimators. Computationally, the raised estimators are easier to compute lacking the square root in the denominator of the transformed columns. For this case study, the rotated and raised methods yielded comparable results and both performed better than the stretching method. The Jackknife procedure showed that the average over the replications for the raised estimators was smaller better than the average over the replications for the stretched estimators. 8

9 4 The MSE risk for two variables We consider the linear model y = Xβ + u where Eu = 0, Euu = σ 2 I n and X is a full rank matrix n p. In this section p = 2. For convenience, we often take σ 2 =, however, for these calculations we need to track the value of σ 2. We assume that the variables are centered and standardized, that is, X X is in correlation form. With respect to the MSE risk, Risk θ = E θ θ E θ θ + trvar θ, we compare the three estimators: the raised βλ, the stretched β st λ, and the rotated β rot λ. For the raised model with column raised with X X <>, the design matrix and moment matrix are given in Eqs. 7 and 8. The calculations follow from standard results for OLS estimators and where we denote E β = δ. From Eq. 9 the predicted values using OLS estimators and raised estimators are the same with β x + β 2 x 2 = X β = X βλ = β λ x + β 2 λx 2 = β λ[ + λx λρx 2 ] + β 2 λx 2 = + λ β λx + [ β 2 λ λρ β λ]x 2 and thus β λ = + λ β β 2 λ = β 2 + λρ + λ β. A more general result is given in García et al. 20. It follows that E β λ = +λ δ, E β 2 λ = δ 2 + λρ +λ δ, with squared bias E βλ δ E βλ δ = λ2 +ρ 2 +λ δ 2. The variances are given by 2 var β λ σ 2 = var β 2 λ σ 2 = 2 + λ ρ 2 2 λρ ρ λρ ρ λ ρ λ ρ 2. The minimum value for Riskβλ can be shown to be For the stretched estimator β st λ, σ 2 λ = ρ 2 δ 2. β st, λ = + λ β β st,2 λ = β 2, with E β st, λ = +λ δ, E β st,2 λ = δ 2, with squared bias E β st λ 9

10 δ E β st λ δ = λ2 +λ 2 δ 2, and with variances var β st, λ σ 2 = var β st,2 λ σ 2 = with the minimum value for Risk β st λ when + λ 2 ρ 2 ρ 2, σ 2 λ st = ρ 2 δ 2. We note that this is the same λ value as with the raised estimator. For the rotated estimator β rot λ the calculations are more tedious. The design matrix is X <>rot = x +λe / x +λe, x 2 = +λx λρx 2 / x + λe, x 2 with moment matrix X <>rotx <>rot = ρ x +λe ρ x +λe recalling that e x 2. As the predicted responses are the same for both the OLS model and the rotated model, β x + β 2 x 2 = X β = X <>rot βrot λ = β rot, λ x / x +λe + β rot,2 λx 2 = β rot, λ[+λx λρx 2 / x +λe ]+ β rot,2 λx 2 = +λ β λρ x +λe rot, λx +[ β rot,2 λ β x +λe rot, λ]x 2. And thus β rot, λ = x + λe β + λ β rot,2 λ = β 2 + λρ + λ β, with E β rot, = x+λe +λ δ, E β rot,2 λ = δ 2 + λρ +λ δ, with squared bias 2 2 E β rot λ δ E β rot λ δ = x+λe +λ δ 2 + λρ +λ δ 2, and with variances var β rot, λ 2 x + λe σ 2 = + λ ρ 2 var β rot,2 λ σ 2 = ρ λρ + λ ρ ρ 2, 2 λρ + + λ ρ 2. The value of λ rot for the minimum of Risk β rot λ was found using Maple. First one solves for the positive root t 0 of the quartic polynomial p 4 t = a 0 t 4 + a t 3 + a 2 t 2 + a 3 t + a 4 with a 0 = ρ 2, a = 2 ρ 2, a 2 = ρ ρ 2 + σ δ 2 2, a 3 = 2ρ 2 ρ 2, a 4 = ρ 2 ρ 2. Appendix B establishes that 0

11 real roots exist for p 4 t. The value of λ rot is then given by 2 t 0 ρ 2 σ + 2t 0 δ λ rot = t 0 + ρ 2. As an example, with {ρ = 0.95, δ = 2, σ = }: for the raised estimator the Riskβ0 = 20.5 and decreases to Riskβ λ = with λ = for the stretched estimator, the risk decreases to Risk β st λ = 3.3 with λ st = and 3 for the rotated estimator p 4 t = t t t t with positive root t 0 =.69, λ rot = 3.479, and Risk β st λ rot = In summary, the raised and stretched estimators have their minimum risk at the same λ value; the raised and rotated estimators have comparable results; and the rotated estimators are more complex to compute. 5 Variance inflation factors and the metric number We consider the linear model y = Xβ + u where Eu = 0, Euu = σ 2 I n and X is a full rank matrix n p. We assume that the variables are centered. Reorder X = [X [p], X p ] with X p = x p the p th column and X [p], the design matrix X without the p th column, the resting columns. The problem in this section is to measure the effect of adding the last column X p to X [p]. An ideal column would be orthogonal to the previous columns with the entries in the off diagonal elements of the p th row and p th column of X X all zeros. Denote by M p the idealized moment matrix [ ] X M p = [p] X [p] 0 p 0 p x. px p The metric number associated to x p is defined by detx MNx p = X detm p. The metric number is easy to compute and has been used in García et al. 20 as a measure of collinearity. A similar measure of collinearity is mentioned in Footnote 2 in Wichers 975 and Theorem of Berk 977. The geometry for the metric number has been shown in García et al The case study in García et al. 20 suggests the functional relationship between MNx p and the variance inflation factor for β p as V IF β p = MNx p 2.

12 We note that this relationship holds, and so the metric number MNx j is also functionally equivalent to the collinearity indices κ j of Stewart 987, the variance inflation factors V IF j, and the angle between X j, SpX [j]. To evaluate V IF β p = V IF j, transform X X into correlation form R with S = diagx x,..., x px p the diagonal matrix with entries from the diagonal of X X with R = S /2 X XS /2. The V IF s are the diagonal entries of R = S /2 X X S /2. It remains to note that the inverse R can be computed using cofactors C i,j, and, in particular, V IF β p = R p,p = [S /2 X X S /2 ] p,p = x px p /2 detc p,p detx X x px p /2 = detm p detx X = MNx p 2. With the example in the previous section with {p = 2, ρ = 0.95}, V IF 2 is easy to compute from Eq. with: for the raised estimator V IF β 2 0 = 0.26 and decreases to V IF β 2 λ =.729 with λ = for the stretched estimator, V IF β st,2 λ st = 0.26 for any λ st 0, since there is no change in the angle between X 2, SpX [2] 3 for the rotated estimator V IF β rot,2 λ rot =.080 for λ rot = 3.479, noting that V IF for the rotated estimator is smaller than the raised estimator since the rotated estimator has the larger angle between X 2, SpX [2] as λ rot > λ. 6 How to overcome the ill-conditioned of the matrix X X by successive raises We will focus on finding a natural way to get the perturbation matrix X X+kI p from the raised estimators as this matrix is an essential component in ridge regression. We will show that the raised estimators allow an estimation by OLS which retains the coeffi cient of determination. Firstly, we will present the estimation for two variables and, secondly, to the general case. 6. For two variables In the case of two variables, X = [x, x 2 ] with the first column raised, X X <> with the raised design matrix given by X <> = x, x 2 = + λ x λ ρx 2, x 2 and X + 2λ ρ <>X <> = 2 + λ 2 ρ 2 ρ. ρ 2

13 After having raised column with λ > 0, we will raise column 2 with λ 2 > 0. First regress the second column of X <> using the remaining columns as X <>2 = αx <>[2] + v; that is, x 2 = α x + v. Scaling x to unit length gives x 2 = α x x / x + v with α x = ρ as in Eq. 5; and with x 2 = + λ 2 ρ 2 λ 2 + 2λ from Eq. 6. Thus x 2 = ρ x x = ρ x + λ x λ ρx 2, with residual e 2 from x 2 = ˆx 2 + e 2 = ρ x x + e 2 so e 2 = x 2 ρ x / x with e 2 x. We now raise the vector x 2 using the residual e 2 by the rule x 2 = x 2 + λ 2 e 2 = x 2 + λ 2 x 2 ρ x x = + λ 2 x 2 λ 2ρ x x. The doubly raised matrix is given by X <,2> = + λ x λ ρx 2 + λ 2 x 2 λ2ρ x x For any values k > 0, k 2 > 0, it is possible to show that there exist raising parameters λ > 0, λ 2 > 0 such that X + k ρ <,2>X <,2> =, 2 ρ + k 2 the matrix used in generalized ridge regression. This follows from noting that x 2 = +2 ρ 2 λ + ρ 2 λ 2 viewed as a polynomial in λ has positive coeffi cients so it can achieve any value greater than one; and similarly for x 2 2 = +2 ρ2 +k λ 2 + ρ 2 2 +k λ 2 2. The off-diagonal entries in Eq. 2 are not effected by raising column 2 since x x 2 = x x 2 + λ 2 e 2 = x x 2 since e 2 x ; and earlier we had established that x x 2 = x + λ e x 2 = x x 2 = ρ since e x 2. Thus, in particular, given k > 0 there are raising parameters λ and λ 2 such that X + k ρ <,2>X <,2> = = X X + ki ρ + k p, the perturbation matrix usually used in ridge regression. Some advantages of the raise method for addressing the problem of collinearity are: The raise estimators are estimated by OLS and thus confidence intervals can be computed. It retains the coeffi cient of determination of the initial regression. The V IF s associated with the raise estimators are monotone functions decreasing with k, see García et al

14 6.2 For the general case In the general case we start with a standardized matrix X with the following column vectors X = x, x 2,..., x p, and the raising procedure will be given by the following steps: Step : Firstly, we raise the vector x from the regression of x with the resting vectors X [] = x 2, x 3,...x p. For this, we take the residual e from x = x + λ e such that e SpX []. The raised design matrix is denoted X <> = x, x 2,..., x p. Step 2: Next, we raise the vector x 2 from the regression of x 2 with the resting vectors from X <>, namely X <>[2] = x, x 3,..., x p. Then, we get the residual e 2 from x 2 = x 2 + λ 2 e 2 such that e 2 SpX <>[2]. The raised design matrix is denoted X <,2> = x, x 2,..., x p. Step j: We raise the vector x j from the regression of x j with the resting vectors from X <,...,j >, namely X <,...,j >[j] = x,..., x j, x j+,..., x p. Then, we get the residual e j from x j = x j + λ j e j such that e j SpX <,...,j >[j]. The raised design matrix is denoted X <,...,j> = x,..., x j, x j+,..., x p. Step p: We raise the vector x p from the regression of x p with the resting vectors from X <,...,p >, namely X <,...,p >[p] = x,..., x p. Then, we get the residual e p from x p = x p +λ p e p such that e p SpX <,...,p >[p]. The raised design matrix is denoted X <,...,p> = x,..., x p. Suppose columns,..., j have been raised to form X <,...,j >. The j th column of X <,...,j > is raised to define X <,...,j>, with X <,...,j >[j] the resting columns, by the rule X <,...,j>j = X <,...,j >j + λ j e j with e j SpX <,...,j >[j]. Noted that X <,...,j> X <,...,j> will differ from X <,...,j > X <,...,j > only in the j, j entry as the j th raised column is perturbed by the vector λ j e j orthogonal to the span of the remaining columns. The [j, j] entry for the j th raised moment matrix is X <,...,j>x <,...,j> [j, j] = X <,...,j >j X <,...,j >j + 2λ j X <,...,j >j e j + λ 2 je je j, with the surplus term denoted by k j = 2λ j X <,...,j >j e j + λ 2 je j e j. We view X <,...,j> X <,...,j>[j, j] as a quadratic polynomial in λ j ; and since the coeffi - cients of λ 2 j and λ j are both positive, the quadratic polynomial can achieve any value larger than the original value X <,...,j >j X <,...,j >j. Let P [j] be the projection matrix for SpX <,...,j >[j]. That the coeffi cient of λ j is positive follows from 4

15 X <,...,j >j e j = X <,...,j >j I p P [j] X <,...,j >j 3 = X <,...,j >j I p P [j] 2 X <,...,j >j = I p P [j] X <,...,j >j 2 = e j 2 > 0. We started with X X in correlation form and with final raising matrix having moment matrix + k ρ 2 ρ 2... ρ p ρ 2 + k 2 ρ ρ 2p X <,2,...,p>X <,2,...,p> = ρ 3 ρ 23 + k 3... ρ 3p ρ p ρ 2p ρ 3p... + k p = X X + K, with K the diagonal matrix K = diagk,..., k p with k j 0 for j p. Thus the raised regression perturbation matrix is equivalent to a generalized ridge regression perturbation matrix. And conversely, any generalized ridge regression matrix has a corresponding raised regression matrix. Given ridge parameters k,..., k p, the associated raised parameters are given from Eq. 3 by k j = 2λ j X <,...,j >j e j + λ 2 je j e j = 2λ j e j 2 + λ 2 j e j 2 = 2λ j + λ 2 j e j 2, and thus λ j = 7 Empirical application + k j e j 2 4 k j = e j 2 λ 2 j + 2λ j. 5 We have chosen the following data set for an empirical application because it is relatively recent and presents a high level of collinearity. In the course of the financial crisis in the United States and over the whole world there is a big discussion about the life of the American people on credit. The original data set is presented in Table and it is taken from the Economic Report of the President 2008 where y is the outstanding Mortgage Debt in trillions of dollars and the three independent variables are X Personal Consumption in trillions of dollars, X 2 Personal Income in trillions of dollars, and X 3 Consumer credit outstanding in trillions of dollars. 5

16 Table. y Mortgage debt, X Personal Consumption, X 2 Personal Income, and X 3 Consumer Credit. year y X X 2 X The ridge procedure perturbs the moment matrix X X while the raised and surrogate procedures perturbs the design matrix X. This second approach has the distinct advantage of allowing the user to specify, for each of the variables, a precision that the data will retain during the perturbation stages by restricting the mean absolute perturbations λ j n n e j,i = π j. 6 i= For the case study, we set the precision π j = in the first trial, and for the second trial we set the precision π j = Thus given a specified precision π j > 0, we raise column j in X <,...,j> to x j = x j + λ j e j where λ j is solved from Eq. 6. The precision values should be based on the researcher s belief in the accuracy of the data. The raised parameters λ j are constrained to assure that the original data has not been perturbed more than what the researcher has permitted. The original data X 0 is centered and standardized X 0 X. For convenience, the columns are raised in sequence, 2, 3. In Table 2 with precision π j = for all j and in Table 3 with precision π j = 0.05 for all j, we report the raised parameters λ j, the corresponding generalized ridge parameters k j the V IF s, the associated angles θ, the condition number κ for the moment matrices as the ratio of the largest to smallest eigenvalue for the moment matrices, and 6

17 the squared length of the coeffi cient vector βλ j 2. Recall from Eq. 7 that the final raised matrix X <,..,p> has moment matrix X X + diagk,..., k p. Note that the initial V IF s present high values, and this fact justifies the need to correct the collinearity. With precision π = 0.05, after the successive raising process all the V IF s from X <,2,3> are less than 0, namely 7.063, 6.2, But what is more important is that the matrix we need to invert, namely X X + diagk,..., k p, is better conditioned than the initial matrix X X ; and it has been obtained by adding a parameter k i 0 in the main diagonal of the matrix X X, in a way similar to ridge estimation, and where we can preserve the original data to a specified precision. It is important to note that our methodology offers a concrete method to determine the perturbation parameter λ j for the raise estimators, and therefore of k, while the choice of the ridge parameter remains an open problem. As noted by one of the referees, "the authors have obtained, by using an intuitive way, the matrix X X+kI p with the successive raising estimator, while other proposed justifications, are a posteriori." Table 2. With precision π j = 0.005, raised parameters λ j, ridge parameters k j, V IF j, associated angles θ j, condition numbers κ = λ max /λ min and βλ j 2. X X <> X <,2> X <,2,3> λ j k j V IF θ κ βλ j

18 Table 3. With precision π j = 0.05, raised parameters λ j, ridge parameters k j, V IF j, associated angles θ j, condition numbers κ = λ max /λ min and βλ j 2. X X <> X <,2> X <,2,3> λ j k j V IF θ κ βλ j Simulation Example To determine if collinearity is present in a design matrix X, a common criterion is max{v IF j : i j p} > 0.0. As shown in Table 2, for X <,2,3> with precision π j = 0.005, max{v IF j } = 45.6 so the precision needs to be increased. From Table 3, for X <,2,3> with precision π j = 0.05, max{v IF j } = and collinearity has been eliminated in the design matrix X <,2,3> ; and the matrix X <,2,3> is the recommended design matrix. We return to the simulated data example considered in Section 3. This example is from Rawlings 998, and it has been studied in Garcia et al. 20. The example will demonstrate how the raise estimator βλ j will reduce collinearity and can also improve the Mean Squared Error βλ j β 2, a measure of the squared distance between the raise estimators and the true parameters. From Eq. 0 the centered and scaled model has true coeffi cients β = [5.38, 2.440, 2.683]. A measure of how the original design has been perturbed is the Mean Absolute Deviation X X <,...,j> MAD = nk i,l Xi, l X <,...,j>i, l. An important feature with the raise procedure is that the new design matrix can be explicitly computed; and that the perturbations used can be controlled column by column. This is not possible with the ridge procedure which perturbs X X. Table 4 reports the results using precision π j = As successive columns are raised the V IF s decrease, the corresponding angles θ increase, and X X <,2,3> MAD = the required precision. The matrix X <,2> satisfies the criterion max{v IF j } = < 0 and unknown to the user shows a decrease in Mean Squared Error from down to Thus X <,2> is the recommended design with reduced collinearity with the measure of perturbation in the design given by X X <,2> MAD = Note that column 3 8

19 of both X and X <,2> are the same since columns that need not be perturbed can be left as is, a feature not possible with the ridge or surrogate procedures. Table 5 reports the results using precision π j = As in Table 4, as successive columns are raised, the V IF s decrease, the corresponding angles θ increase, and X X <,2,3> MAD = the required precision. The matrix X <> satisfies the criterion max{v IF j } = 7.76 < 0 and shows a decrease in Mean Squared Error from down to For the raised matrix X <>, the Mean Absolute Deviation X X <> MAD = with columns 2 and 3 the same for both X and X <>. The raise procedure is able to perturb some or all columns and with varying precisions a feature not available with the ridge or surrogate procedures. The recommended new design is either X <,2> from Table 4 or X <> from Table 5 both having identical MADs. In summary, the raise procedure has identified candidates for the perturbed design matrix which have eliminated collinearity and are nearly identical to the original design matrix. 9

20 Table 4. With precision π j = 0.025, raised parameters λ j, ridge parameters k j, V IF j, associated angles θ j, condition numbers κ = λ max /λ min, βλ j 2, βλ j, βλ j β 2, and Mean Absolute Deviation X X <,...,j> MAD X X <> X <,2> X <,2,3> λ j k j V IF θ κ βλ j βλ j βλ j β M AD Table 5. With precision π j = 0.05, raised parameters λ j, ridge parameters k j, V IF j, associated angles θ j, condition numbers κ = λ max /λ min, βλ j 2, βλ j, βλ j β 2, and Mean Absolute Deviation X X <,...,j> MAD X X <> X <,2> X <,2,3> λ j k j V IF θ κ βλ j βλ j βλ j β M AD

21 9 Conclusions The ridge procedure perturbs the moment matrix X X X X + λi p but does not allow the user to compute the perturbed design so the changes to X will be unknown to the user. The surrogate procedure has the advantage of allowing the user to explicitly compute the surrogate design X S in terms of the singular values of X. Thus i,j Xi, j X Si, j can the Mean Absolute Deviation X X S MAD = nk be computed showing the average change in the design X X S. However, the surrogate procedure does not allow for varying perturbations for the varying explanatory variables. The surrogate procedure is based on perturbating the singular values of X and thus it is not intuitive nor geometric. On the other hand, the raise procedure it both intuitive and geometric. It is important to note that the raise procedure does yield the explicit new design with X X, and that the mean absolute perturbations given in Eq. 6 are permitted to be different for each explanatory variable. Thus the user can set the mean absolute deviation to be smaller for variables which are known to be accurate and allow larger deviations for variables which are known to be less accurate. The raised estimator βλ = X X X y obtained from the successive raising process in Section 4 improves collinearity. The raised regression retains the value of the coeffi cient of determination, contrary to what happens with the ridge and surrogate estimators. Thus, when the collinearity is mitigated with the ridge or surrogate estimator, the model can become not globally significant while with the successive raising estimator the model will remain globally significant, if initially it was, since the coeffi cient of determination is invariant. The estimator obtained from the successive raising process is given by βλ = X X + K X y with K a diagonal matrix. If we want K =ki p to be a constant diagonal matrix then this can be achieved with the proper choice of λ = λ,..., λ p using Eq. 4. To conclude, if the user wants a better conditioned matrix than X X, the proposed methodology offers a procedure to improve the collinearity in the spirit of Levenberg 944 with a perturbation matrix similar to the perturbation matrix used in ridge regression. Unlike ridge regression, the user can visualize the perturbations of the underlying model and easily control the amount of perturbations to the original data retaining a specified precision in the data. Unlike ridge regression where the choice of the ridge parameter remains an open problem, our methodology offers a concrete procedure for determining the perturbation parameter for the raise estimators. 0 Appendix A In a full rank model {Y = Xβ + ε} having zero mean, uncorrelated, and homoscedastic errors, the Ordinary Least Squares OLS estimator β solves the p equations {X Xβ = X y}. These solutions are unbiased with dispersion ma- 2

22 trix V β = σ 2 X X. Near dependency among the columns of X, as ill conditioning, often causes the OLS solutions to have inflated size β 2, and of questionable signs, and very sensitive to small changes in X Belsley, 986. Among standard remedies are the ridge system {X X + ki p β = X y; k 0} Hoerl and Kennard, 970 with solutions { β R k; k 0}. The OLS solution at k = 0 is known to be unstable with a slight movement away from k = 0 giving completely different estimates of the coeffi cients. McDonald 2009, 200 showed that the square of the correlation coeffi cient R 2 k of the observed values y and the predicted values ŷ R k = X β R k for the ridge regression estimators β R k is a monotone decreasing function of the ridge parameter k. We show the corresponding result for the surrogate estimators, Jensen and Ramirez 2008, 200b. The singular decomposition X = P D ξ Q, with P P = I p, together with θ = Q β as an orthogonal reparametrization, give y = Xβ + ε y = P D ξ Q β + ε U = P y = D ξ θ + P ε as a canonical and equivalent model on R p. Gauss Markov assumptions regarding y = Xβ + ε stipulate that Eε = 0 R n and V ε = σ 2 I n. From these assumptions, it follows that EP ε = 0 R p and V P ε = σ 2 P I n P = σ 2 I p, so that EU = D ξ θ and V U = σ 2 I p. Specifically, the singular decomposition X = P D ξ Q, and the surrogate X k = P Diagξ 2 + k 2,..., ξ 2 p + k 2 Q with singular values {ξ 2 ξ ξ 2 p > 0}, give X kx k = X X + ki p with {X kx k β = X y} for the ridge estimators. For the surrogate model, {y = X k β + ε} is taken as an approximation, or surrogate, for the ill conditioned model {y = Xβ + ε} itself with {X kx k β = X ky} for the ridge estimators. The canonical estimators { θ, θ R, θ S } in terms of {k, ξ i } are given by Estimator Definition θ D ξ U i θr D ξ i U ξ 2 i +k θs D U ξ 2 i +k Following McDonald 2009, 200, we assumed the model has been standardized with y = 0 and y y = and with X X having correlation form; in particular, X = 0 with ŷ R = X β R = 0 = X β S = ŷ S. It follows that the squared correlation coeffi cient RS 2 k of the observed values y and the 22

23 predicted values ŷ S k = X β S in terms of the canonical variables is [y y ŷ S k ŷ S k ] 2 RSk 2 = [ ] [y y y y] ŷ S k ŷ S k ŷ S k ŷ S k 2 = y ŷ S k y 2 X β U D ŷ S kŷ Sk. = S β = SX X β S R 2 Rk = ξ i ξ 2 i +k 2 U U D ξ 2 i ξ 2 i +k U with RS 2 0 = U U 2 /U U =. For ridge estimators, McDonald 200 gives the parallel result for ridge estimators U ξ 2 2 i D ξ 2 i +k U with drr 2 k/dk < 0 for k > 0. For the surrogate estimators, with dr 2 S k dk = 2G 2 H J G = p i= u2 i J = p i= u2 i ξ i ξ 2 i +k ξ 2 i ξ 2 i +k U D ξ 4 i ξ 2 i +k2 U G2 K J 2 We show that HJ + GK < 0 for k > 0 : p p HJ + GK = H i i= = H = p i= u2 i j= K = p i= u2 i J j + with H i J i + G i K i = 0. For i j, we consider, G HJ + GK J 2 p i= ξ i ξ 2 i +k3/2 ξ 2 i ξ 2 i +k2 G i p j= K j, H i J j H j J i + G i K j + G j K i = u 2 ξ i i ξ 2 u 2 ξ 2 j i + k 3/2 j u 2 ξ j j ξ j + k ξ 2 j + k 3/2 u 2 i ξ i ξ 2 j + u 2 ξ 2 j i + k ξ 2 j + k 2 = u2 i u2 j ξ iξ j ξ 2 i + k ξ 2 j + k ξ 2 i + k 5/2 ξ 2 j + k 5/2 = u2 i u2 j ξ iξ j ξ 2 i + k ξ 2 j + k ξ 2 i + k 5/2 ξ 2 j + k 5/2 + u 2 j ξ j ξ 2 j + k u 2 i u 2 i ξ 2 i ξ i + k ξ 2 i ξ i + k 2 [ ξ 2 i ξ 2 ] j ξ j ξ 2 i + k ξ i ξ 2 j + k [ ξ 2 i ξ 2 ] j ξ 2 i ξ 2 j + kξ 2 j ξ 2 i ξ 2 j + kξ 2 i, 23

24 with the term in the brackets negative as the two terms in parentheses will have opposite signs. Appendix B With p = 2, to find the minimum MSE risk for the rotated estimators required solving for the roots of the quartic polynomial p 4 t = a 0 t 4 +a t 3 +a 2 t 2 +a 3 t+a with a 0 = ρ 2, a = 2 ρ 2, a 2 = ρ ρ 2 σ + δ, a 3 = 2ρ 2 ρ 2, a 4 = ρ 2 ρ 2. Yang 999 has given the root classification for quartic polynomials with general coeffi cients. Following his notation, we compute D 4 = 256a 3 0a a 2 0a a 2 0a 3 a 2 4a 27a 4 a 2 4 6a 0 a 2 a 4 a a 2 2a 2 3a 2 4a 0 a 3 2a a 2 a 4 a 3 a 3 +44a 0 a 2 a 2 4a 2 80a 0 a 2 2a 4 a a 3 + 8a 0 a 2 a 3 3a 4a 3 2a 4 a 2 4a 3 a a 0 a 4 2a 4 28a 2 0a 2 2a a 2 0a 2 a 4 a 2 3. A quartic polynomial has two distinct roots when D 4 < 0. For p 4 t, this follows from factoring D 4 with A, B, C, D, E, F, G, H, I > 0 as D 4 = AB C 4σ 2 D 4σ 4 E 6σ 6 F 6σ 8 G 64σ 0 H σ 2 I where A = 64 ρ2 2 ρ 2 > 0 δ 6 B = δ 2 δ 2 ρ 2 + σ 2 2 > 0 C = 27δ 2 + ρ 2 2 ρ 4 + ρ 4 > 0 D = 54δ 0 ρ 2 + ρ 4 ρ 3 + ρ 3 > 0 E = 9δ ρ 2 + 9ρ 4 ρ 2 + ρ 2 > 0 F = 68δ 6 ρ 3 + ρ 3 > 0 G = 57δ 4 ρ 2 + ρ 2 > 0 H = 6δ 2 ρ + ρ > 0 I = 64 > 0. 2 References. Belsley, D.A Centering, the constant, first differencing, and assessing conditioning. In Model Reliability E. Kuh and D.A. Belsley, D.A. eds.. Cambridge: MIT Press. 2. Berk, K Tolerance and condition in regression computations. J. Amer. Stat. Assoc. 72:

25 3. Davidov, O Constrained estimation and the theorem of Kuhn- Tucker. J. Appl. Math. Decis. Sci. Article ID 92970: García, J., Andújar, A., Soto, J Acerca de la detección de la colinealidad en los modelos de regresión lineal. XIII Reunión Asepelt- España. Universidad de Burgos. 5. García, C.B., García, J. and Soto, J. 20. The raise method: An alternative procedure to estimate the parameters in presence of collinearity. Quality and Quantity 45: García, C.B., García, J., López Martin, M.M. and Salmeron, R Collinearity: Revisiting the variance inflation factor in ridge regression. Journal of Applied Statistics 42: Goldstein, M. and Smith, A. F. M Ridge-type estimators for regression analysis. J. of R. Stat. Soc. B 36: Hoerl, A.E Ridge Analysis. Chemical Engineering Progress, Symposium Series 60: Hoerl, A.E. and Kennard, R.W Ridge regression: biased estimation for nonorthogonal problems. Technometrics 2: Jensen, D.R. and Ramirez, D.E Anomalies in the foundations of ridge regression. Int. Stat. Rev. 76: Jensen, D.R. and Ramirez, D.E. 200a. Tracking MSE effi ciencies in ridge regression. Advances and Applications in Statistical Sciences : Jensen, D.R. and Ramirez, D.E. 200b. Surrogate models in ill-conditioned systems. Journal of Statistical Planning and Inference 40: Jensen, D.R. and Ramirez, D.E Revision: Variance inflation in regression. Adv. Decis. Sci. Article ID 67204: Kapat, P. and Goel, P.K Anomalies in the foundations of ridge Regression: Some Clarifications. Int. Stat. Rev. 78: Levenberg, K A method for the solution of certain non-linear problems in least-squares. Quart. of Appl. Math. 2: Marshal, A.W. and Olkin, I Inequalities : Theory of Majorization and Its Applications. New York: Academic Press. 7. McDonald, G.C Ridge regression. Wiley Interdisciplinary Reviews: Computational Statistics : McDonald, G.C Tracing ridge regression coeffi cients. Wiley Interdisciplinary Review: Computational Statistics 2:

26 9. Marquardt, D.W An algorithm for least-squares estimation for nonlinear parameters. J. Soc. Ind. Appl. Math. : Mayer, L.S. and Willke, T.A.973. On Biased Estimation in Linear Models. Technometrics 5: The Economic Report of the President Piegorsch, W. and Casella, G The early use of matrix diagonal increments in statistical problems. SIAM Review 3: Puntanen, S., Styan, G.P.H., and Isotalo, J. 20. Matrix Tricks for Linear Statistical Models. Berlin: Springer-Verlag. 24. Rawlings, J.O., Pentula, S.G., and Dickey, D.A Applied Regression Analysis: A Research Tool - 2 nd ed. New York: Springer. 25. Riley, J Solving Systems of linear equations with a positive definite, symmetric but possibly ill-conditioned matrix. Math Tables Aids Comput. 9: Seber, G.A.F Linear Regression Analysis. New York: John Wiley and Sons. 27. Stein, C.M. 960 Multiple regression, contribution to probability and Statistics. Chapter 37 in Essays in Honor of Harold Hotelling I. Olkin, S. Ghurye, W. Hoeffding, W. Madow and H. Mann eds. Stanford: Stanford University Press. 28. Stewart, G.W Collinearity and least squares regression. Stat. Sci. 2: Wichers, R The detection of multicollinearity: A comment. Review Econ. Stat. 57: Yang, L Recent advances on determining the number of real roots of parametric polynomials. J. Symbolic Computation 28:

Mitigating collinearity in linear regression models using ridge, surrogate and raised estimators

Mitigating collinearity in linear regression models using ridge, surrogate and raised estimators STATISTICS RESEARCH ARTICLE Mitigating collinearity in linear regression models using ridge, surrogate and raised estimators Diarmuid O Driscoll 1 and Donald E. Ramirez 2 * Received: 21 October 2015 Accepted:

More information

Revisting Some Design Criteria

Revisting Some Design Criteria Revisting Some Design Criteria Diarmuid O Driscoll diarmuid.odriscoll@mic.ul.ie Head, Department of Mathematics and Computer Studies Mary Immaculate College, Ireland Donald E. Ramirez der@virginia.edu

More information

Response surface designs using the generalized variance inflation factors

Response surface designs using the generalized variance inflation factors STATISTICS RESEARCH ARTICLE Response surface designs using the generalized variance inflation factors Diarmuid O Driscoll and Donald E Ramirez 2 * Received: 22 December 204 Accepted: 5 May 205 Published:

More information

APPLICATION OF RIDGE REGRESSION TO MULTICOLLINEAR DATA

APPLICATION OF RIDGE REGRESSION TO MULTICOLLINEAR DATA Journal of Research (Science), Bahauddin Zakariya University, Multan, Pakistan. Vol.15, No.1, June 2004, pp. 97-106 ISSN 1021-1012 APPLICATION OF RIDGE REGRESSION TO MULTICOLLINEAR DATA G. R. Pasha 1 and

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Multicollinearity and A Ridge Parameter Estimation Approach

Multicollinearity and A Ridge Parameter Estimation Approach Journal of Modern Applied Statistical Methods Volume 15 Issue Article 5 11-1-016 Multicollinearity and A Ridge Parameter Estimation Approach Ghadban Khalaf King Khalid University, albadran50@yahoo.com

More information

Ridge Regression Revisited

Ridge Regression Revisited Ridge Regression Revisited Paul M.C. de Boer Christian M. Hafner Econometric Institute Report EI 2005-29 In general ridge (GR) regression p ridge parameters have to be determined, whereas simple ridge

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 3-24-2015 Comparison of Some Improved Estimators for Linear Regression Model under

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Singular Value Decomposition Compared to cross Product Matrix in an ill Conditioned Regression Model

Singular Value Decomposition Compared to cross Product Matrix in an ill Conditioned Regression Model International Journal of Statistics and Applications 04, 4(): 4-33 DOI: 0.593/j.statistics.04040.07 Singular Value Decomposition Compared to cross Product Matrix in an ill Conditioned Regression Model

More information

ISyE 691 Data mining and analytics
