Regularized Regression

David M. Blei
Columbia University
December 5, 2015

Modern regression problems are high dimensional, which means that the number of covariates p is large. In practice statisticians regularize their models, veering away from the MLE solution to one where the coefficients have smaller magnitude. This lecture is about regularization. It draws on the ideas and treatment in Hastie et al. (2009) (referred to below as ESL).

1 The bias-variance trade-off

We first discuss an important concept, the bias-variance trade-off. In this discussion we will take a frequentist perspective.

Consider a set of random responses drawn from a linear regression with true parameter $\beta$,

\[ Y_n \mid x_n, \beta \sim \mathcal{N}(\beta x_n, \sigma^2). \]  (1)

The data are $\mathcal{D} = \{(x_n, Y_n)\}$. Note that we are holding the covariates $x_n$ fixed; only the responses are random. (We are also assuming $x_n$ is a single covariate; in general, it is p-dimensional and we replace $\beta x_n$ with $\beta^\top x_n$.)

With this data set, the maximum likelihood estimate $\hat{\beta}(\mathcal{D})$ is a random variable whose distribution is governed by the distribution of the data. Recall that $\beta$ is the true parameter that generated the responses. How close do we expect $\hat{\beta}(\mathcal{D})$ to be to $\beta$?

We can answer this question in a couple of ways. First, suppose we observe a new data input $x$. We consider the mean squared error of our estimate of $E_\beta[y \mid x] = \beta x$. This is the difference between our predicted expectation of the response and the true expectation of the response,

\[ \mathrm{MSE} = E\left[ \left( \hat{\beta}(\mathcal{D})^\top x - \beta^\top x \right)^2 \right]. \]  (2)

It is important to keep track of which variables are random. The coefficient $\beta$ is not random;
it is the true parameter that generated the data. The coefficient $\hat{\beta}(\mathcal{D})$ is random; it depends on the randomly generated data set $\mathcal{D}$. The expectation in this equation is with respect to the randomly generated data set. (For simplicity, we will sometimes suppress this notation below.)

The MSE decomposes in an interesting way,

\[
\begin{aligned}
\mathrm{MSE} &= E\left[ (\hat{\beta} x - \beta x)^2 \right] \\
&= E\left[ (\hat{\beta} x)^2 \right] - 2 E\left[ \hat{\beta} x \right] \beta x + (\beta x)^2 \\
&= E\left[ (\hat{\beta} x)^2 \right] - E\left[ \hat{\beta} x \right]^2 + E\left[ \hat{\beta} x \right]^2 - 2 E\left[ \hat{\beta} x \right] \beta x + (\beta x)^2 \\
&= \left( E\left[ (\hat{\beta} x)^2 \right] - E\left[ \hat{\beta} x \right]^2 \right) + \left( E\left[ \hat{\beta} x \right] - \beta x \right)^2.
\end{aligned}
\]  (3)

The second term is the squared bias,

\[ \mathrm{bias} = E\left[ \hat{\beta} x \right] - \beta x. \]  (4)

An estimate for which this term is zero is an unbiased estimate. The first term is the variance,

\[ \mathrm{variance} = E\left[ (\hat{\beta} x)^2 \right] - E\left[ \hat{\beta} x \right]^2. \]  (5)

This reflects the spread of the estimates we might find on account of the randomness inherent in the data. Note that the decomposition holds for any linear function of the coefficients.

A famous result in statistics is the Gauss-Markov theorem. Recall that the MLE $\hat{\beta}$ is an unbiased estimate. The theorem states that the MLE is the unbiased estimate with the smallest variance. If you insist on unbiasedness, and you care about the MSE, then you can do no better than the MLE.

Often we care about expected prediction error. Suppose we observe a new input $x$. How wrong will we be on average when we predict the true $y \mid x$ with $E_{\hat{\beta}}[y \mid x]$ from a fitted regression? The expected squared prediction error is

\[ \mathcal{E} = E\left[ E_Y\left[ (\hat{\beta} x - Y)^2 \right] \right]. \]

The first expectation is taken for the randomness of $\hat{\beta}$, which is a function of the data. The
second is taken for the randomness of $Y$ given $x$, which comes from the true model. This decomposes as follows,

\[
\begin{aligned}
\mathcal{E} &= E\left[ E_Y\left[ (\hat{\beta} x - Y)^2 \right] \right] \\
&= \mathrm{Var}(Y) + \mathrm{MSE}(\hat{\beta} x) \\
&= \sigma^2 + \mathrm{Bias}^2(\hat{\beta} x) + \mathrm{Var}(\hat{\beta} x).
\end{aligned}
\]  (6, 7)

The first term is the inherent uncertainty around the true mean; the second two terms are the bias-variance decomposition of the estimator. We cannot do anything about the inherent uncertainty; thus reducing the MSE also reduces expected prediction error.

Classical statistics cared only about unbiased estimators. Modern statistics has explored the trade-off, where it may be worth accepting some bias for a reduction in variance. This can reduce the MSE and, consequently, the expected prediction error on future data. Here is a simple picture to illustrate why:

[Figure: the sampling distributions of an unbiased estimator and of a biased, lower-variance estimator of $\beta$.]

It may be that the MSE is smaller for the biased estimator, because it never veers as far away from the truth as the unbiased estimator does.

2 Ridge regression

Regularization. In regression, we can make this trade-off with regularization, which means placing constraints on the coefficients. Here is a picture from ESL for our first example.
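As a quick detour before the picture: the decomposition in equation (3) is easy to verify by simulation. The sketch below (every numerical value, including the 0.8 shrinkage factor, is a made-up illustration) repeatedly draws data sets from the model (1), computes the MLE and a deliberately shrunk estimate, and checks that squared bias plus variance equals the MSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design with a single covariate; the true parameter values are
# illustrative choices, not taken from the notes.
x = np.linspace(-2.0, 2.0, 50)
beta_true, sigma = 1.5, 2.0
x_new = 1.0          # new input where we evaluate the MSE
n_reps = 20000       # number of simulated data sets D

def mle(y):
    # OLS through the origin: beta_hat = sum(x * y) / sum(x^2)
    return np.sum(x * y) / np.sum(x * x)

estimates = {"MLE": [], "shrunk": []}
for _ in range(n_reps):
    y = beta_true * x + sigma * rng.standard_normal(x.shape)
    b = mle(y)
    estimates["MLE"].append(b)
    estimates["shrunk"].append(0.8 * b)   # deliberately biased, lower variance

results = {}
for name, ests in estimates.items():
    pred = np.array(ests) * x_new          # estimates of E[y | x_new]
    truth = beta_true * x_new
    bias2 = (pred.mean() - truth) ** 2
    var = pred.var()
    mse = np.mean((pred - truth) ** 2)
    results[name] = (bias2, var, mse)
    print(f"{name:7s} bias^2={bias2:.5f} var={var:.5f} mse={mse:.5f}")
```

The shrunk estimator has nonzero bias but strictly smaller variance; whether its MSE beats the MLE's depends on how large the true coefficient is relative to the noise.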
[Figure 3.12 from ESL (Hastie, Tibshirani & Friedman): Estimation picture for the lasso (left) and ridge regression (right). The red ellipses are the contours of the least squares error function; the solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$.]

In this picture, contours represent values of $\beta$ with equal RSS (or, equivalently, likelihood). Our procedure finds the best value that is within the blue circle.

This reduces the variance because it limits the space that the parameter vector can live in. If the true MLE lives outside that space, then the resulting estimate must be biased because of the Gauss-Markov theorem.

The picture also shows how regularization encourages smaller and perhaps simpler models. Simpler models are more robust to overfitting, that is, generalizing poorly because of too close a match to the training data. Simpler models can also be more interpretable, which is another goal of regression. (This is particularly true for the lasso, which we will talk about later.)

Ridge regression. Let's discuss the details of ridge regression. We optimize the RSS subject to a constraint on the sum of squares of the coefficients,

\[
\begin{aligned}
\text{minimize} \quad & \textstyle\sum_{n=1}^N (y_n - \beta^\top x_n)^2 \\
\text{subject to} \quad & \textstyle\sum_{i=1}^p \beta_i^2 \le s.
\end{aligned}
\]  (8)

This constrains the coefficients to live within a sphere of radius $\sqrt{s}$. (See the picture.) Question: What happens as the radius increases? Answer: Variance goes up; bias goes down.

With some calculus, the ridge regression estimate can also be expressed as

\[ \hat{\beta}^{\text{ridge}} = \arg\min_\beta \sum_{n=1}^N (y_n - \beta^\top x_n)^2 + \lambda \sum_{i=1}^p \beta_i^2. \]  (9)

This is nice because the problem is convex. Further, it has an analytic solution. (See the reading.) Question: Is it sensitive to scaling? Answer: Yes, in practice we center and scale
the covariates.

There is a 1-1 mapping between the radius $s$ and the complexity parameter $\lambda$. Either of these parameters trades off an increase in bias for a decrease in variance. From ESL:

[Figure: profiles of the ridge coefficients as the regularization parameter varies, plotted against the norm of the coefficient vector.]

How do we choose $\lambda$? As we see, the value of the complexity parameter affects our estimate. Question: What would happen if we used training error as the criterion? (Look at the picture to see the answer.)

In practice, we choose $\lambda$ by cross validation. This is an attempt to minimize expected test error. (But later on we will discuss hierarchical models. This can be another way to choose the regularization parameter.) Here is how it works:

- Divide the data into K folds (e.g., K = 10).
- Decide on candidate values of $\lambda$ (e.g., a grid between 0 and 1).
- For each fold k and value of $\lambda$:
  - Estimate $\hat{\beta}^{\text{ridge}}_k$ on the out-of-fold samples.
  - For each $x_n$ assigned to fold $k$, compute its squared error
    \[ \epsilon_n = (\hat{y}_n - y_n)^2, \]  (10)
    where $\hat{y}_n = E_{\hat{\beta}^{\text{ridge}}_k}[Y \mid x_n]$. Note that this estimate of the coefficients did not use $(x_n, y_n)$ as part of its training data.
- We now aggregate the individual errors. The score for $\lambda$ is
  \[ \mathrm{MSE}(\lambda) = \frac{1}{N} \sum_{n=1}^N \epsilon_n. \]  (11)
  This is an estimate of the test error.
- Choose the $\lambda$ that minimizes this score.

Aside: Connection to Bayesian statistics. We have motivated regularized regression via frequentist thinking, i.e., the bias-variance trade-off and an appeal to the true model. Regularized regression, in general, has connections to Bayesian modeling.

We have discussed two common ways of using the posterior to obtain an estimate. The first is maximum a posteriori (MAP) estimation,

\[ \beta_{\text{MAP}} = \arg\max_\beta \; p(\beta \mid y_1, \ldots, y_N, \lambda). \]  (12)

The second is the posterior mean,

\[ \beta_{\text{mean}} = E[\beta \mid y_1, \ldots, y_N, \lambda]. \]  (13)

Question: How are these different from the MLE?

Ridge regression and Bayesian methods. Ridge regression corresponds to MAP estimation in the following model:

\[ \beta_i \sim \mathcal{N}(0, 1/\lambda) \]  (14)
\[ y_n \mid x_n, \beta \sim \mathcal{N}(\beta^\top x_n, \sigma^2). \]  (15)

Here is the corresponding graphical model:

[Graphical model: a plate over $n$ containing $X_n$ and $Y_n$, with $\beta$ and the hyperparameter $\lambda$ outside the plate. A note in the original remarks that the hyperparameter should be drawn as a small dot.]
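Before deriving the relationship, we can check it numerically. The sketch below (dimensions and hyperparameter values are made up for illustration) simulates from the model in (14)-(15) and computes the MAP estimate, which for this Gaussian model is available in closed form; it is the ridge estimate of (9) with penalty $\lambda \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 5
lam, sigma = 2.0, 1.0   # prior precision and noise scale (illustrative)

# Generate from the model: beta_i ~ N(0, 1/lam), y_n | x_n ~ N(beta' x_n, sigma^2).
X = rng.standard_normal((N, p))
beta = rng.normal(0.0, 1.0 / np.sqrt(lam), size=p)
y = X @ beta + sigma * rng.standard_normal(N)

# Closed-form MAP estimate for this Gaussian model: the ridge solution
# with penalty lam * sigma^2.
beta_map = np.linalg.solve(X.T @ X + lam * sigma**2 * np.eye(p), X.T @ y)

# Compare with the unpenalized MLE: the MAP estimate is pulled toward 0.
beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]
print("||beta_map|| =", np.linalg.norm(beta_map))
print("||beta_mle|| =", np.linalg.norm(beta_mle))
```

The MAP estimate always has a smaller norm than the MLE, and as `lam` grows it is pulled further toward the prior mean of zero.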
We will derive the relationship. First, note that

\[ p(\beta_i \mid \lambda) = \sqrt{\frac{\lambda}{2\pi}} \exp\left\{ -\frac{\lambda}{2} \beta_i^2 \right\}. \]  (16)

We now compute the MAP estimate of $\beta$,

\[
\begin{aligned}
\max_\beta \; p(\beta \mid \mathcal{D}, \lambda)
&= \max_\beta \; \log p(\beta \mid y_{1:N}, x_{1:N}, \lambda) \\
&= \max_\beta \; \log p(\beta, y_{1:N} \mid x_{1:N}, \lambda) \\
&= \max_\beta \; \log \left[ p(y_{1:N} \mid x_{1:N}, \beta) \prod_{i=1}^p p(\beta_i \mid \lambda) \right] \\
&= \max_\beta \; -\frac{1}{2\sigma^2} \mathrm{RSS}(\beta; \mathcal{D}) - \frac{\lambda}{2} \sum_{i=1}^p \beta_i^2.
\end{aligned}
\]  (17-20)

Up to a rescaling of the penalty, this is the negative of the ridge objective (9): ridge regression is equivalent to MAP estimation in the model.

Observe that the hyperparameter $\lambda$ controls how far away the estimate will be from the MLE. A small hyperparameter (large prior variance) will choose the MLE; the data totally determine the estimate. As the hyperparameter gets larger, the estimate moves further from the MLE; the prior ($E[\beta] = 0$) becomes more influential. This matches our recurring theme in Bayesian estimation: both the data and the prior influence the answer.

Finally, note that a true Bayesian would not set the hyperparameter by cross-validation. This uses the data to set the prior. However, I think it is a good idea. It is an instance of a more general principle called empirical Bayes.

Summary of ridge regression.

1. We constrain $\beta$ to be in a hypersphere around 0.
2. This is equivalent to minimizing the RSS plus a regularization term.
3. We no longer find the $\hat{\beta}$ that minimizes the RSS. (Contours illustrate constant RSS.)
4. Ridge regression is a kind of shrinkage, so called because it reduces the components to be close to 0 and close to each other.
5. Ridge estimates trade off bias for variance.

3 The lasso

A closely related regularization method is called the lasso. The lasso optimizes the RSS subject to a different constraint,

\[
\begin{aligned}
\text{minimize} \quad & \textstyle\sum_{n=1}^N (y_n - \beta^\top x_n)^2 \\
\text{subject to} \quad & \textstyle\sum_{i=1}^p |\beta_i| \le s.
\end{aligned}
\]  (21)

This small change yields very different estimates. Here is the picture of the constraint, from ESL:

[Figure 3.12 from ESL (Hastie, Tibshirani & Friedman): Estimation picture for the lasso (left) and ridge regression (right). The red ellipses are the contours of the least squares error function; the solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$.]

Question: What happens as $s$ increases? Question: With $s$ fixed, where is the solution going to lie?

It's a fact: unless it chooses $\hat{\beta}$ (the MLE), the lasso (with p large) will set some of the coefficients to exactly zero. The intuitions come from ESL:

"Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter $\beta_j$ equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero." (p. 90)

In a sense, the lasso is a form of feature selection, identifying a relevant subset of the covariates with which to predict. Like ridge regression, it trades off an increase in bias with a decrease in variance. Further, by zeroing out some of the covariates, it provides interpretable (as in, sparse) models.
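The corner intuition is visible even in one dimension. The sketch below (the values of $z$ and $\lambda$ are made up for illustration) uses a brute-force grid search to minimize the one-dimensional penalized problems $(b - z)^2 + \lambda |b|$ (lasso-style) and $(b - z)^2 + \lambda b^2$ (ridge-style), where $z$ plays the role of the unpenalized estimate:

```python
import numpy as np

# Fine grid of candidate coefficients in [-3, 3]; constructed so that 0.0
# is a grid point, letting us observe exact zeros.
grid = np.arange(-60000, 60001) / 20000.0
lam = 1.0

results = {}
for z in [2.0, 0.4, -0.1]:   # hypothetical unpenalized (OLS) estimates
    ridge_obj = (grid - z) ** 2 + lam * grid ** 2
    lasso_obj = (grid - z) ** 2 + lam * np.abs(grid)
    results[z] = (grid[np.argmin(ridge_obj)], grid[np.argmin(lasso_obj)])
    print(f"z={z:5.2f}  ridge={results[z][0]:7.4f}  lasso={results[z][1]:7.4f}")
```

For $|z|$ below the threshold $\lambda/2$, the lasso minimizer sits exactly at zero, while the ridge minimizer $z/(1+\lambda)$ only shrinks and is never exactly zero. This is the one-dimensional shadow of the corner of the diamond.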
Sparse models can also be important in real systems that might depend on many inputs. Once the sparse solution is found, we need only measure a few of the inputs in order to make predictions. This speeds up the performance of the system.

The lasso is equivalent to

\[ \hat{\beta}^{\text{lasso}} = \arg\min_\beta \sum_{n=1}^N (y_n - \beta^\top x_n)^2 + \lambda \sum_{i=1}^p |\beta_i|. \]  (22)

Again, there is a 1-1 mapping between $\lambda$ and $s$. This objective, though it does not have an analytic solution, is still convex.

Why is the lasso exciting? Prior to the lasso, the only sparse method was subset selection, finding the best subset of features with which to model the data. But subset selection has problems: searching over all subsets (of a fixed size) is computationally expensive. In contrast, the lasso efficiently finds a sparse solution by using convex optimization. In a sense, it is akin to a smooth version of subset selection. Note the lasso won't consider all possible subsets.

From ESL:

[Figure: profiles of the lasso coefficients as the regularization parameter varies, plotted against the L1 norm of the coefficient vector; coefficients enter the model one at a time, and some are exactly zero along the path.]
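Although (22) has no analytic solution, it is easy to solve numerically. A common approach is coordinate descent, cycling through the coefficients and solving each one-dimensional problem exactly via soft-thresholding. A minimal sketch (the data and the value of $\lambda$ are made up for illustration; the objective is scaled by 1/2 for convenience):

```python
import numpy as np

def soft_threshold(rho, lam):
    # Scalar lasso solution: shrink toward zero, clipping at zero.
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for (1/2) ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual excluding coordinate j.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

# Made-up sparse problem: only the first two covariates matter.
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 10))
y = X @ np.array([3.0, -2.0] + [0.0] * 8) + 0.5 * rng.standard_normal(100)

b = lasso_cd(X, y, lam=20.0)
print(np.round(b, 3))   # most of the irrelevant coefficients are exactly zero
```

The irrelevant coefficients land at exactly zero, illustrating the feature-selection behavior described above; the relevant ones are shrunk slightly toward zero, which is the bias the lasso accepts in exchange for lower variance.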
The Bayesian interpretation of the lasso. Like ridge regression, lasso regression corresponds to MAP estimation in a Bayesian model. For the lasso, the model is:

\[ \beta_i \sim \mathrm{Laplace}(\lambda) \]  (23)
\[ Y_n \mid x_n, \beta \sim \mathcal{N}(\beta^\top x_n, \sigma^2). \]  (24)

Here the coefficients come from a Laplace distribution,

\[ p(\beta_i \mid \lambda) = \frac{\lambda}{2} \exp\{ -\lambda |\beta_i| \}. \]  (25)

The lasso, and the general idea of L1-penalized models, has become a cottage industry in modern statistics and machine learning. The reason is that we often want sparse solutions to high-dimensional problems, and we want convex objective functions when analyzing data. L1-penalized methods give us both. Recent research indicates that they have good theoretical properties to boot.

4 (Optional) Generalized regularization

In general, regularization can be seen as minimizing the RSS with a constraint on a q-norm,

\[
\begin{aligned}
\text{minimize} \quad & \textstyle\sum_{n=1}^N (y_n - \beta^\top x_n)^2 \\
\text{subject to} \quad & \|\beta\|_q \le s,
\end{aligned}
\]

where the penalty is

\[ \|\beta\|_q = \left( \sum_{i=1}^p |\beta_i|^q \right)^{1/q}. \]

The methods we discussed so far are

- q = 2: ridge regression
- q = 1: lasso
- q = 0: subset selection

Here is the picture from ESL:
[Figure 3.13 from ESL: Contours of constant value of $\sum_j |\beta_j|^q$ for q = 4, 2, 1, 0.5, and 0.1.]

This brings us away from the minimum-RSS solution, but might provide better test prediction via the bias/variance trade-off. Complex models have less bias; simpler models have less variance. Regularization encourages simpler models. Note that each of these methods corresponds to a Bayesian solution with a different choice of prior.

\[ \hat{\beta} = \arg\min_\beta \sum_{n=1}^N (y_n - \beta^\top x_n)^2 + \lambda \|\beta\|_q^q \]

The complexity parameter can be chosen with cross validation. The lasso (q = 1) is the only norm that provides both sparsity and convexity.

And there are other variants, useful in the literature. Of note:

- The elastic net is a convex combination of the L1 and L2 penalties.
- The grouped lasso finds sparse groups of covariates to include.

Finally, the glmnet package in R is amazing. It efficiently computes models for a regularization path using L2 or L1 penalization. It uses the same model syntax as lm or glm.

References

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, 2nd edition.