Direct Learning: Linear Regression
Parametric learning
We consider the core function in the prediction rule to be a parametric function. The most commonly used choice is a linear function: for squared loss, $f(X) = \beta_0 + X^T\beta$; for a classification problem, $I(\beta_0 + X^T\beta > 0)$. The optimal prediction rule is the one in this class that minimizes the expected prediction error (EPE).
Why linear rules
They are simple and easy to interpret: the coefficients indicate the importance of each input feature. Estimation using linear rules is less variable. Although the best linear rule may not be the Bayes rule, its prediction performance is usually satisfactory in practice, especially with high-dimensional and noisy feature variables. Linear rules can be generalized to allow nonlinear effects and interactions by replacing X with basis functions of X (e.g., tensor splines).
Linear rule with squared loss
Training data: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Learning the optimal linear rule is equivalent to minimizing the least-squares error (LSE)
$$\sum_{i=1}^{n}\Big(Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Big)^2.$$
The optimal rule is $\hat\beta_0 + x^T\hat\beta$, where $(\hat\beta_0, \hat\beta^T)^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, $\mathbf{X}$ is the feature variable matrix (with a leading column of ones), and $\mathbf{Y}$ is the response vector. For any future subject with features $x$ (augmented by a leading 1), the prediction is $x^T(\hat\beta_0, \hat\beta^T)^T = x^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. For Gaussian linear models, inference for the finite-sample distribution of $\hat\beta$ has been well established.
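As a concrete illustration, here is a minimal R sketch of the least-squares fit computed directly from the closed form $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ and checked against lm(); the simulated data, sample size, and coefficient values are made up purely for illustration.

```r
## Closed-form least-squares fit on simulated data (illustrative only)
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)                      # feature matrix
beta_true <- c(2, -1, 0.5)
Y <- drop(1 + X %*% beta_true + rnorm(n))            # true intercept = 1

Xmat <- cbind(1, X)                                  # add intercept column
beta_hat <- solve(t(Xmat) %*% Xmat, t(Xmat) %*% Y)   # (X^T X)^{-1} X^T Y

x0 <- c(1, rnorm(p))                                 # a future subject (with leading 1)
pred <- sum(x0 * beta_hat)                           # prediction x0^T beta_hat

coef(lm(Y ~ X))                                      # same estimates via lm()
```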
Analysis of prostate cancer data
Improve LSE for linear rules
LSE often has low bias but large variance. It is often of interest to identify a small subset of the feature variables that are most predictive of the outcome. The latter becomes even more important when the dimensionality of the feature space is high or $p \gg n$.
Best-subset selection
This is an exhaustive search method to identify the subset of $\{X_1, \ldots, X_p\}$ that optimizes a given criterion. The procedure: for each subset size $k \in \{0, 1, 2, \ldots, p\}$, use an efficient algorithm (the leaps and bounds procedure; Furnival and Wilson, 1974) to identify the best subset of size $k$, $\mathcal{C}_k$, that gives the smallest RSS; we then select $k$ such that $\mathcal{C}_k$ minimizes a given criterion (to be discussed later).
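A possible R sketch of best-subset selection with regsubsets() from the leaps package (assumed installed), which implements the leaps-and-bounds search; X and Y are the simulated data from the sketch above, and the choice of criterion for the final $k$ is illustrative.

```r
## Best subset of each size k via leaps-and-bounds
library(leaps)
fit_bs <- regsubsets(x = X, y = Y, nvmax = ncol(X))   # best model of each size
summ <- summary(fit_bs)
summ$which                                            # variables included for each k
k_best <- which.min(summ$cp)                          # e.g., pick k by Mallow's Cp
```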
Best-subset selection for prostate cancer data
Suboptimal-subset selection
Best-subset selection is infeasible unless $p$ is small (e.g., it becomes impractical for $p > 40$). Suboptimal but computationally efficient subset selection includes forward-stepwise selection, a greedy algorithm that searches along an optimal sequence of increasing models, and backward-stepwise selection, which starts from the full model and sequentially eliminates one variable at a time. Stepwise selection adds or deletes one variable based on certain significance testing, so it is a locally optimal search.
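A rough R sketch of both stepwise directions using step(); note that step() ranks candidate moves by AIC rather than by significance testing, and the data frame dat below is an assumed construction from the simulated X and Y above.

```r
## Forward- and backward-stepwise selection (AIC-based)
colnames(X) <- paste0("x", 1:ncol(X))
dat <- data.frame(y = Y, X)
null_fit <- lm(y ~ 1, data = dat)                     # intercept-only model
full_fit <- lm(y ~ ., data = dat)                     # full model
upper_form <- reformulate(colnames(X), response = "y")

fwd <- step(null_fit, scope = upper_form, direction = "forward", trace = 0)
bwd <- step(full_fit, direction = "backward", trace = 0)
```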
Comparing different subset selection
Shrinkage methods
Subset selection is a discrete process, so the variability from jumping between models is high, and the discrete search algorithm is usually computationally intensive. Shrinkage methods provide smoother search procedures for identifying the best models. Such methods are often carried out via smooth regularization/penalization.
Ridge regression
The estimate for $\beta$ is obtained by minimizing
$$\sum_{i=1}^{n}\big(Y_i - \beta_0 - X_i^T\beta\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$$
resulting in $\hat\beta = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T\mathbf{Y}$. That is, we add an additional $L_2$-penalty to shrink coefficients towards zero. The optimization is equivalent to
$$\min \sum_{i=1}^{n}\big(Y_i - \beta_0 - X_i^T\beta\big)^2, \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le C.$$
Both $\lambda$ and $C$ are regularization parameters (tuning parameters) and will be chosen data-dependently based on certain criteria (to be discussed later).
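A minimal R sketch of the ridge closed form, assuming the conventional choice of standardizing the features and leaving the intercept unpenalized; lambda = 5 is an arbitrary illustrative value and X, Y come from the earlier simulated example.

```r
## Ridge estimate (X^T X + lambda I)^{-1} X^T Y on standardized features
lambda <- 5
Xs <- scale(X)                            # center and scale the features
Yc <- Y - mean(Y)                         # center the response
beta_ridge <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% Yc)
beta0_ridge <- mean(Y)                    # unpenalized intercept on this scale
```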
Prostate cancer example (ridge regression)
Lasso shrinkage
The lasso minimizes
$$\frac{1}{2}\sum_{i=1}^{n}\big(Y_i - \beta_0 - X_i^T\beta\big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|,$$
so the regularization is an $L_1$-penalty on $\beta$. The objective function is convex and the computation is a quadratic programming problem. The whole solution path can be solved efficiently using the Least Angle Regression (LAR) algorithm, which is a forward stepwise selection procedure.
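A possible R sketch of the lasso solution path via the LAR algorithm in the lars package (assumed installed); glmnet() with alpha = 1 would give an equivalent lasso fit.

```r
## Whole lasso solution path via LAR
library(lars)
fit_lar <- lars(X, Y, type = "lasso")            # full coefficient path
plot(fit_lar)                                    # coefficients vs. shrinkage fraction
coef(fit_lar, s = 0.5, mode = "fraction")        # coefficients at one point on the path
```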
LAR algorithm
Prostate cancer example (Lasso)
Comparison among shrinkage methods
Comparison among shrinkage methods in prostate cancer data
Structure shrinkage methods
Structural regularization:
Group lasso for $L$ groups of features:
$$\min \sum_{i=1}^{n}\Big(Y_i - \beta_0 - \sum_{l=1}^{L} X_l^T\beta_l\Big)^2 + \lambda\sum_{l=1}^{L}\sqrt{p_l}\,\|\beta_l\|_{2}.$$
Elastic-net penalty:
$$\lambda\sum_{j=1}^{p}\big(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\big).$$
Laplacian regularization with a Laplacian eigenmap matrix $D$, $\beta^T D\beta$, to encourage similar coefficients for two variables on the same edge of a network.
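A short R sketch of the elastic-net penalty via glmnet (assumed installed); note that glmnet's alpha weights the $L_1$ part, the reverse of the mixing convention written above, and packages such as grpreg or gglasso provide group-lasso fits in the same spirit.

```r
## Elastic net via glmnet: alpha = 1 is the lasso, alpha = 0 is ridge
library(glmnet)
fit_en <- glmnet(X, Y, alpha = 0.5)       # equal mix of L1 and L2 penalties
coef(fit_en, s = 0.1)                     # coefficients at lambda = 0.1
```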
Sparsity shrinkage methods
Oracle selection regularization: it identifies the feature variables with $\beta_j = 0$ (setting their estimates to zero) with probability tending to 1 (the oracle property); it also shrinks non-zero coefficients towards zero; the regularization is a non-convex function of $\beta$, such as the smoothly clipped absolute deviation (SCAD) penalty, whose derivative is
$$\frac{\partial q_\lambda(\beta)}{\partial \beta} = \lambda\,\mathrm{sign}(\beta)\Big[I(|\beta| \le \lambda) + \frac{(\alpha\lambda - |\beta|)_+}{(\alpha - 1)\lambda}\, I(|\beta| > \lambda)\Big].$$
Computation for such regularization relies on local approximation, so it may not achieve the global minimum.
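One possible implementation route (an assumption, not prescribed by the slides) is the ncvreg package, which fits SCAD and other nonconvex penalties using coordinate-descent-type local approximations, so the global minimum is not guaranteed.

```r
## SCAD-penalized regression over a grid of lambda values
library(ncvreg)
fit_scad <- ncvreg(X, Y, penalty = "SCAD")
coef(fit_scad, lambda = 0.1)              # coefficients at lambda = 0.1
```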
Sparsity shrinkage methods: continued
Alternatively, the adaptive lasso (ALasso) method uses the regularization term
$$\lambda\sum_{j=1}^{p}\frac{|\beta_j|}{|\tilde\beta_j|^{\gamma}}, \quad \gamma > 0,$$
where $\tilde\beta_j$ is a consistent initial estimator of $\beta_j$. ALasso requires initial estimates, so it may not be applicable when $p \gg n$.
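A minimal R sketch of the adaptive lasso using glmnet's penalty.factor argument, with a ridge fit as the (assumed) consistent initial estimator and an illustrative choice gamma = 1.

```r
## Adaptive lasso via weighted L1 penalty
library(glmnet)
gamma <- 1
ridge_init <- glmnet(X, Y, alpha = 0, lambda = 0.1)        # initial ridge fit
beta_init <- as.vector(coef(ridge_init))[-1]               # drop the intercept
w <- 1 / (abs(beta_init)^gamma + 1e-8)                     # adaptive weights
fit_alasso <- glmnet(X, Y, alpha = 1, penalty.factor = w)  # weighted lasso
```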
Graphic comparisons among all penalties
Sample R code for implementation
Best-subset selection; ridge regression; lasso.
Tuning parameter selection
There are often tuning parameters that need to be chosen: the subset size $k$ in best-subset selection; the regularization parameters in all shrinkage methods. Larger regularization parameters lead to more strongly shrunk coefficients (a sparser model) and hence less variable prediction; however, the resulting model yields higher bias in prediction. There is thus a bias-variance trade-off in tuning parameter selection. Model selection criteria such as AIC and BIC can be used; however, they are information criteria and so do not directly serve the purpose of prediction.
Mallow's CP criterion function for subset selection
The criterion is based on the prediction error $E[(Y - \hat f_k(x_0))^2 \mid X = x_0]$, where $\hat f_k$ is the estimated function from the $k$ best feature variables. Assuming $\mathrm{Var}(Y - f(X)) = \sigma^2$, this prediction error is
$$\sigma^2 + \big(f(x_0) - E[\hat f_k(x_0)]\big)^2 + \mathrm{Var}\big(\hat f_k(x_0)\big),$$
so, when averaged over $x_0$ from the empirical data, it is
$$\sigma^2 + \frac{1}{n}\sum_{i=1}^{n}\big(f(X_i) - E[\hat f_k(X_i)]\big)^2 + \frac{\sigma^2}{n}\,\mathrm{Trace}\big(\mathbf{X}_k(\mathbf{X}_k^T\mathbf{X}_k)^{-1}\mathbf{X}_k^T\big).$$
Mallow's CP: continued
Since the in-sample error, $n^{-1}\sum_{i=1}^{n}(Y_i - \hat f_k(X_i))^2$, has an expectation approximated by
$$\sigma^2 + \frac{1}{n}\sum_{i=1}^{n}\big(f(X_i) - E[\hat f_k(X_i)]\big)^2 - \frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}\big(\hat f_k(X_i)\big),$$
the expectation of the prediction error is equal to the expectation of the in-sample error plus
$$\frac{2\sigma^2}{n}\,\mathrm{Trace}\big(\mathbf{X}_k(\mathbf{X}_k^T\mathbf{X}_k)^{-1}\mathbf{X}_k^T\big) = 2\sigma^2 k/n.$$
Mallow's CP selects $k$ to minimize
$$\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \hat f_k(X_i)\big)^2 + 2\hat\sigma^2 k/n.$$
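A short R sketch of Mallow's CP over subset sizes, reusing the regsubsets() fit from the earlier sketch and estimating $\sigma^2$ from the full least-squares model (an assumed but standard choice).

```r
## Mallow's CP: (1/n) RSS_k + 2 * sigma^2 * k / n for each subset size k
library(leaps)
summ <- summary(regsubsets(x = X, y = Y, nvmax = ncol(X)))
sigma2_hat <- summary(lm(Y ~ X))$sigma^2          # sigma^2 from the full fit
n <- length(Y)
cp <- summ$rss / n + 2 * sigma2_hat * seq_along(summ$rss) / n
k_cp <- which.min(cp)                             # subset size minimizing CP
```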
Data-adaptive selection: Cross-validation
The goal is to mimic the scenario of learning prediction rules on training samples and then evaluating their performance on future data. The idea is to randomly split the data into a training sample and a testing sample: the training sample is used to train prediction rules with the learning methods; the testing sample is used to evaluate the prediction errors of the learned rules. To avoid depending on a single good or bad split, this procedure is repeated multiple times. Common recommendations are leave-one-out, 5-fold, or 10-fold cross-validation. The best tuning parameters are chosen to minimize the average of the prediction errors.
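A minimal R sketch of 10-fold cross-validation for the lasso tuning parameter with cv.glmnet (assumed installed), using the simulated X and Y from above.

```r
## 10-fold CV over the lasso path; lambda.min minimizes the CV prediction error
library(glmnet)
set.seed(2)
cvfit <- cv.glmnet(X, Y, alpha = 1, nfolds = 10)
cvfit$lambda.min                          # selected tuning parameter
coef(cvfit, s = "lambda.min")             # coefficients at that lambda
```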
Generalized cross-validation
CV is computationally costly, especially leave-one-out CV. An approximation, called generalized cross-validation (GCV), is often used in practice:
$$\frac{1}{n}\sum_{i=1}^{n}\left[\frac{Y_i - \hat f(X_i)}{1 - \mathrm{trace}(\Sigma)/n}\right]^2,$$
where $\Sigma\mathbf{Y}$ is the vector of predictions for all the subjects (i.e., $\Sigma$ is the smoother matrix).
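A rough R sketch of GCV for ridge regression, where Sig below plays the role of the smoother matrix $\Sigma$ mapping the centered responses to fitted values; lambda = 5 and the standardization mirror the earlier ridge sketch.

```r
## GCV for ridge: trace(Sigma)/n replaces the leave-one-out factors
lambda <- 5
Xs <- scale(X); Yc <- Y - mean(Y)                  # as in the ridge sketch
Sig <- Xs %*% solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs))) %*% t(Xs)
Yhat <- mean(Y) + Sig %*% Yc                       # predictions for all subjects
gcv <- mean(((Y - Yhat) / (1 - sum(diag(Sig)) / length(Y)))^2)
```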