Focused fine-tuning of ridge regression

1 Focused fine-tuning of ridge regression

Kristoffer Hellton
Department of Mathematics, University of Oslo

May 9, 2016

2 Penalized regression

The least-squares (LS) estimate β̂_LS = (XᵀX)⁻¹XᵀY is sensitive to random errors, or not unique, when XᵀX becomes close to or exactly singular, for instance when p > n. Ridge regression addresses this problem by penalizing the residual sum of squares,

  β̂_ridge = argmin_β Σᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)² + λ Σⱼ₌₁ᵖ βⱼ²,

introducing a tuning parameter λ. The solution, and its relation to the LS estimate (p < n):

  β̂ = (XᵀX + λIₚ)⁻¹XᵀY = (XᵀX + λIₚ)⁻¹XᵀX β̂_LS.
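For reference, the closed form is a single linear solve; a minimal numpy sketch (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge solution (X'X + lam*I_p)^{-1} X'y, via a linear solve
    rather than an explicit matrix inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Solving the linear system instead of forming the inverse is the standard numerically stable choice.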

3 How to choose the tuning parameter?

Many different procedures exist:
- K-fold cross-validation (Hastie et al., 2009)
- Allen's PRESS statistic, i.e. leave-one-out CV (Allen, 1974)
- generalized cross-validation (Golub et al., 1979)
- bootstrap (Delaney and Chatterjee, 1986)
- small-sample corrected AIC (Hurvich and Tsai, 1989)
- marginal maximum likelihood

Already Golub et al. (1979) provide references to more than 15 procedures. The current unchallenged favorite: 10-fold cross-validation.

4 Thought experiment

A medical study has measured covariates and observed outcomes for a group of patients. A new patient enters the doctor's office, and we wish to predict the outcome for his/her specific set of covariates, x₀, as well as possible.

Cross-validation would select λ to be optimal for the overall distribution of covariates, but not for the specific x₀. Instead, frame the specific prediction µ = x₀ᵀβ as a focus parameter:
1. find the bias and variance of µ̂,
2. minimize the estimated mean squared error (MSE) of µ̂ as a function of λ to find λ̂.

6 Given the standard fixed-design regression model

  yᵢ = xᵢᵀβ + εᵢ,  εᵢ ∼ N(0, σ²),

and the focus µ₀ = E y₀ = x₀ᵀβ, the MSE of µ̂ is a function of λ and the model parameters:

  MSE(µ̂; λ, x₀, β, σ²) = { x₀ᵀ((XᵀX + λIₚ)⁻¹XᵀX − Iₚ)β }² + σ² x₀ᵀ(XᵀX + λIₚ)⁻¹XᵀX(XᵀX + λIₚ)⁻¹x₀.

For each covariate vector x₀, the MSE curve as a function of λ has a different minimum.

[Figure: root MSE as a function of the tuning parameter λ, one curve per x₀.]
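The formula can be evaluated directly; a small numpy sketch under the same fixed-design model (names are mine):

```python
import numpy as np

def mse_mu(lam, x0, X, beta, sigma2):
    """Exact MSE of mu_hat = x0' beta_ridge(lam): squared bias plus variance."""
    p = X.shape[1]
    XtX = X.T @ X
    v = np.linalg.solve(XtX + lam * np.eye(p), x0)   # (X'X + lam*I)^{-1} x0
    bias = v @ XtX @ beta - x0 @ beta                # x0'((X'X+lam*I)^{-1}X'X - I) beta
    var = sigma2 * v @ XtX @ v                       # sigma^2 x0'(..)^{-1} X'X (..)^{-1} x0
    return bias**2 + var
```

Sweeping lam over a grid for two different x₀ vectors reproduces the point of the figure: the curves bottom out at different λ.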

9 The minimizer of the theoretical MSE defines the oracle tuning

  λ_best = argmin_λ MSE(µ̂; λ, x₀).

But an estimator of λ_best requires pilot estimates of β and σ². Our proposed estimator uses the LS estimates (for p < n):

  λ̂_best = argmin_λ M̂SE(µ̂; λ, x₀)
         = argmin_λ { V̂ar(µ̂; λ, x₀) + b̂ias²(µ̂; λ, x₀) }
         = argmin_λ { V̂ar + max{ (b̂ias)² − V̂ar[b̂ias], 0 } },

or a simplified version without the bias correction, written out as

  λ̂_best = argmin_λ { (x₀ᵀ((XᵀX + λIₚ)⁻¹XᵀX − Iₚ)β̂)² + σ² x₀ᵀ(XᵀX + λIₚ)⁻¹XᵀX(XᵀX + λIₚ)⁻¹x₀ }.
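A sketch of the plug-in estimator with the bias correction, assuming p < n so the LS pilot exists; the grid search, the names, and the residual-variance estimate of σ² are my choices, not prescribed by the talk:

```python
import numpy as np

def focused_lambda(x0, X, y, grid=None):
    """Focused tuning: minimize the estimated MSE of x0'beta_ridge(lam)
    over a lambda grid, using LS pilot estimates and the bias correction."""
    n, p = X.shape
    if grid is None:
        grid = np.logspace(-4, 6, 400)
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)
    beta_ls = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_ls
    sigma2 = resid @ resid / (n - p)                 # pilot noise estimate

    def mse_hat(lam):
        v = np.linalg.solve(XtX + lam * np.eye(p), x0)
        a = XtX @ v - x0                             # contrast defining the bias
        bias_hat = a @ beta_ls                       # plug-in bias estimate
        var_mu = sigma2 * v @ XtX @ v                # Var(mu_hat)
        var_bias = sigma2 * a @ XtX_inv @ a          # Var(bias_hat), for the correction
        return var_mu + max(bias_hat**2 - var_bias, 0.0)

    return min(grid, key=mse_hat)
```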

10 In one dimension

Suppose p = 1. Then β̂ = M/(M + λ) β̂_LS with M = Σᵢ₌₁ⁿ xᵢ², and the mean squared error is

  MSE(µ̂; λ) = x₀² β² (λ/(M + λ))² + x₀² σ² M (1/(M + λ))²,

which is minimized at λ_best = σ²/β², so the minimizer does not depend on x₀! Our proposed estimator loses its focus; the case p ≥ 2 is more interesting.

13 Orthogonal case

For general p, assume σ² to be known and the design matrix to be orthogonal with equal Mⱼ:

  XᵀX = diag(M₁, ..., Mₚ) = M Iₚ.

Then the estimated bias based on the LS pilot estimate is

  b̂ias = (λ/(M + λ)) x₀ᵀβ̂ ∼ N( (λ/(M + λ)) x₀ᵀβ, σ²λ² x₀ᵀx₀ / (M(M + λ)²) ),

giving the estimated mean squared error as

  M̂SE(µ̂; λ) = ( (x₀ᵀβ̂)² − σ² x₀ᵀx₀/M )₊ (λ/(M + λ))² + σ² (x₀ᵀx₀/M) (M/(M + λ))²,

where (·)₊ = max(·, 0).

14 Focused tuning estimate

The minimizer, our focused tuning estimator, is

  λ̂_best = σ² M x₀ᵀx₀ / ( M(x₀ᵀβ̂)² − σ² x₀ᵀx₀ )₊,

resulting in the focused prediction

  ŷ(λ̂_best) = 0                                             if |x₀ᵀβ̂| ≤ σ‖x₀‖/√M,
  ŷ(λ̂_best) = [( M(x₀ᵀβ̂)² − σ² x₀ᵀx₀ ) / ( M(x₀ᵀβ̂)² )] x₀ᵀβ̂   if |x₀ᵀβ̂| > σ‖x₀‖/√M,

when σ² is known and XᵀX = diag(M₁, ..., Mₚ) = M Iₚ.
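In code, the orthogonal-case rule is a hard threshold followed by shrinkage; a sketch assuming XᵀX = M·Iₚ and known σ² (names are mine):

```python
import numpy as np

def focused_prediction_orthogonal(x0, M, beta_ls, sigma2):
    """Closed-form focused prediction when X'X = M*I_p and sigma^2 is known:
    predict 0 for weak signals, otherwise shrink the LS prediction."""
    mu_hat = x0 @ beta_ls                     # LS prediction x0' beta_hat
    signal = M * mu_hat**2 - sigma2 * (x0 @ x0)
    if signal <= 0:                           # |x0'beta_hat| <= sigma*||x0||/sqrt(M)
        return 0.0
    return signal / (M * mu_hat**2) * mu_hat  # shrinkage factor in (0, 1)
```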

16 Simplified focused tuning estimate

In the orthogonal case, the simplified mean squared error without the bias correction is

  M̂SE(µ̂; λ) = (x₀ᵀβ̂)² (λ/(M + λ))² + σ² (x₀ᵀx₀/M) (M/(M + λ))²,

with minimizer the simplified focused tuning

  λ̂_best = σ² x₀ᵀx₀ / (x₀ᵀβ̂)².

The prediction has a geometric interpretation in terms of α, the angle between the vectors x₀ and β̂, and R² = ‖x₀‖²/‖β̂‖²:

  ŷ(λ̂_best) = [ M(x₀ᵀβ̂)² / ( M(x₀ᵀβ̂)² + σ² x₀ᵀx₀ ) ] x₀ᵀβ̂ = [ M cos²α / ( M cos²α + σ² R²/‖x₀‖² ) ] x₀ᵀβ̂.
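The simplified rule never predicts exactly zero; it shrinks smoothly instead. A companion sketch under the same assumptions (inputs as numpy arrays, as above):

```python
def simplified_focused_prediction(x0, M, beta_ls, sigma2):
    """Simplified focused prediction: smooth shrinkage, no hard threshold."""
    mu_hat = x0 @ beta_ls
    return M * mu_hat**2 / (M * mu_hat**2 + sigma2 * (x0 @ x0)) * mu_hat
```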

17 Prediction risk: p = 1

Risk as a function of β with the original and the simplified focused tuning:

[Figure: risk for the original FIC tuning (black) and the adjusted/simplified version (red), as a function of the regression parameter β.]

Both do better than least squares (scaled to 1) for small β. The original is better than the simplified version for β close to zero, but worse for medium β. NB: for p = 1 the risk does not depend on x₀.

18 Prediction risk: p = 2

For p ≥ 2, the prediction risk risk(x₀ᵀβ̂(λ̂_best)) will vary with x₀. For the simplified tuning, the risk surface as a function of β (for fixed x₀) shows a trench orthogonal to x₀.

[Figure: risk surface as a function of β, two views.]

19 Prediction risk: p = 2

For fixed β, the risk surface for the simplified tuning as a function of x₀ (scaled by the least-squares risk):

[Figure: risk surface for the focused tuning as a function of x₀ (variables 1 and 2), relative to the LS risk.]

Around the line in x₀-space orthogonal to β, the risk of the focused-tuning ridge is smaller than the LS risk.
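Such risk surfaces have no closed form once λ̂ is data-driven, but they can be approximated by simulation; a Monte Carlo sketch, where `tuner` is a hypothetical hook for any rule mapping (x₀, X, y) to a λ value (focused, simplified, or CV):

```python
import numpy as np

def mc_risk(x0, X, beta, sigma2, tuner, reps=2000, seed=0):
    """Monte Carlo prediction risk of x0'beta_ridge(lambda_hat) when
    lambda_hat = tuner(x0, X, y) is chosen from the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu0 = x0 @ beta                                   # true focus x0'beta
    sq_err = np.empty(reps)
    for r in range(reps):
        y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
        lam = tuner(x0, X, y)
        b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        sq_err[r] = (x0 @ b - mu0) ** 2
    return sq_err.mean()
```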

20 Cross-validation

K-fold cross-validation has emerged as a standard for selecting tuning parameters. We will study n-fold, or leave-one-out, cross-validation,

  λ̂_CV = argmin_λ Σᵢ₌₁ⁿ ( yᵢ − xᵢᵀβ̂₍₋ᵢ₎,λ )²,

where the ith observation is removed from the model fitting when predicting yᵢ. For ridge regression, the leave-one-out cross-validation criterion simplifies to a weighted sum of squared residuals,

  CV(λ) = Σᵢ₌₁ⁿ ( (yᵢ − xᵢᵀβ̂_λ) / (1 − Hᵢᵢ,λ) )²,  where H = X(XᵀX + λIₚ)⁻¹Xᵀ,

and the weights quantify the distance of the covariates from the centroid (Golub et al., 1979).
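This shortcut makes leave-one-out CV as cheap as one fit per λ; a numpy sketch of the criterion and its grid minimizer (names are illustrative):

```python
import numpy as np

def ridge_loocv(lam, X, y):
    """Golub et al. (1979) shortcut: LOOCV for ridge from one full-data fit."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix H_lambda
    resid = y - H @ y                                        # full-data residuals
    return np.sum((resid / (1.0 - np.diag(H))) ** 2)

def cv_lambda(X, y, grid=np.logspace(-4, 6, 400)):
    """lambda_hat_CV: grid minimizer of the LOOCV criterion."""
    return min(grid, key=lambda lam: ridge_loocv(lam, X, y))
```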

21 Risk for cross-validation

The prediction risk (for p = 1) as a function of β shows that the focused tuning, λ̂_best, gives smaller risk than cross-validation, λ̂_CV, for small β.

[Figure: risk functions for the focused MSE tuning (black) and CV (red), as functions of the regression parameter β.]

22 For a given data set (Xᵢ, yᵢ), i = 1, ..., n, the cross-validation tuning parameter estimate λ̂_CV depends on the interplay between the residuals and the position of the covariates:
- Large residuals around the covariate centroid result in a smaller than average λ̂_CV.
- Large residuals for the covariate outliers result in a larger than average λ̂_CV.

27 To compare prediction error as a function of x₀: fix X, simulate four different outcome vectors {yᵢ}, i = 1, ..., 40, and color the difference according to whether focused tuning or cross-validation predicts better. Thick line: x₀ᵀβ = 0.

[Figure: four panels showing the difference in minimal MSE between the focused and the CV prediction, over the (variable 1, variable 2) plane of x₀.]

28 The pudding

Averaging over 2000 simulated outcome vectors {yᵢ}, i = 1, ..., 40, the focused approach proves better than CV along the line x₀ᵀβ = 0.

[Figure: average difference in minimal MSE between the focused and the CV prediction, over the (variable 1, variable 2) plane of x₀.]

29 The future

Our goal is to be able to use this approach in a high-dimensional situation, p ≫ n.

A motivating example: gene expression and weight change. Cashion et al. (2013) measured gene expression profiles in adipose (fat) tissue taken from kidney transplant recipients to study the association between gene activity and change in body weight:
- p = … genes
- n = 25 patients

Can the prediction-focused tuning estimate gain something compared to cross-validation in this high-dimensional setting?

30 To illustrate the procedure, we simulate outcomes based on the gene expression data,

  y = Xβ + ε,  βⱼ = 0.01,  εᵢ ∼ N(0, 1),

and compare the cross-validation and the focused (oracle) prediction.

[Figure: root prediction error, focused tuning parameter (red) versus CV (blue).]

In the case of known β, 60% of the predictions improved. But a real example needs a new pilot: ridge? Principal component regression?

31 Concluding remarks

As shown, the focused tuning estimator gives lower prediction error than cross-validation for certain x₀.

Future work:
- controlling overfitting
- finding the best pilot estimate in high dimensions: two-step ridge regression with cross-validation? PCR?
- exploiting the low dimensionality of the projection x₀ᵀβ as p grows
- exploring further foci: βᵀβ, individual βᵢ, P[ŷ₀ > y_thres] = α

32 Thank you!
