Regularized Regression


David M. Blei
Columbia University
December 5, 2015

Modern regression problems are high dimensional, which means that the number of covariates $p$ is large. In practice statisticians regularize their models, veering away from the MLE solution to one where the coefficients have smaller magnitude. This lecture is about regularization. It draws on the ideas and treatment in Hastie et al. (2009) (referred to below as ESL).

1 The bias-variance trade-off

We first discuss an important concept, the bias-variance trade-off. In this discussion we will take a frequentist perspective. Consider a set of random responses drawn from a linear regression with true parameter $\beta$,

  Y_n \mid x_n, \beta \sim \mathcal{N}(\beta x_n, \sigma^2).   (1)

The data are $D = \{(x_n, Y_n)\}$. Note that we are holding the covariates $x_n$ fixed; only the responses are random. (We are also assuming $x_n$ is a single covariate; in general, it is $p$-dimensional and we replace $\beta x_n$ with $\beta^\top x_n$.)

With this data set, the maximum likelihood estimate $\hat{\beta}(D)$ is a random variable whose distribution is governed by the distribution of the data. Recall that $\beta$ is the true parameter that generated the responses. How close do we expect $\hat{\beta}(D)$ to be to $\beta$?

We can answer this question in a couple of ways. First, suppose we observe a new data input $x$. We consider the mean squared error of our estimate $\mathbb{E}_{\hat{\beta}}[y \mid x] = \hat{\beta} x$. This is the difference between our predicted expectation of the response and the true expectation of the response,

  \mathrm{MSE} = \mathbb{E}\big[ (\hat{\beta}(D)\, x - \beta x)^2 \big].   (2)

It is important to keep track of which variables are random. The coefficient $\beta$ is not random; it is the true parameter that generated the data. The coefficient $\hat{\beta}(D)$ is random; it depends on the randomly generated data set $D$. The expectation in this equation is with respect to the randomly generated data set. (For simplicity, we will sometimes suppress this notation below.)

The MSE decomposes in an interesting way,

  \mathrm{MSE} = \mathbb{E}\big[(\hat{\beta} x - \beta x)^2\big]
              = \mathbb{E}\big[(\hat{\beta} x)^2\big] - 2\,\mathbb{E}\big[\hat{\beta} x\big]\,\beta x + (\beta x)^2
              = \Big(\mathbb{E}\big[(\hat{\beta} x)^2\big] - \mathbb{E}\big[\hat{\beta} x\big]^2\Big) + \Big(\mathbb{E}\big[\hat{\beta} x\big] - \beta x\Big)^2.   (3)

The second term is the squared bias,

  \mathrm{bias} = \mathbb{E}\big[\hat{\beta} x\big] - \beta x.   (4)

An estimate for which this term is zero is an unbiased estimate. The first term is the variance,

  \mathrm{variance} = \mathbb{E}\big[(\hat{\beta} x)^2\big] - \mathbb{E}\big[\hat{\beta} x\big]^2.   (5)

This reflects the spread of the estimates we might find on account of the randomness inherent in the data. Note that the decomposition holds for any linear function of the coefficients.

A famous result in statistics is the Gauss-Markov theorem. Recall that the MLE $\hat{\beta}$ is an unbiased estimate. The theorem states that the MLE is the unbiased estimate with the smallest variance. If you insist on unbiasedness, and you care about the MSE, then you can do no better than the MLE.

Often we care about expected prediction error. Suppose we observe a new input $x$. How wrong will we be on average when we predict the true $y \mid x$ with $\mathbb{E}_{\hat{\beta}}[y \mid x]$ from a fitted regression? The expected squared prediction error is

  \mathbb{E}_D\big[\mathbb{E}_Y\big[(\hat{\beta} x - Y)^2\big]\big].

The first expectation is taken over the randomness of $\hat{\beta}$, which is a function of the data. The second is taken over the randomness of $Y$ given $x$, which comes from the true model.

This decomposes as follows,

  \mathbb{E}_D\big[\mathbb{E}_Y\big[(\hat{\beta} x - Y)^2\big]\big] = \mathrm{Var}(Y) + \mathrm{MSE}(\hat{\beta} x)   (6)
                                                                   = \sigma^2 + \mathrm{Bias}^2(\hat{\beta} x) + \mathrm{Var}(\hat{\beta} x).   (7)

The first term is the inherent uncertainty around the true mean; the second two terms are the bias-variance decomposition of the estimator. We cannot do anything about the inherent uncertainty; thus reducing the MSE also reduces expected prediction error.

Classical statistics cared only about unbiased estimators. Modern statistics has explored the trade-off, where it may be worth accepting some bias for a reduction in variance. This can reduce the MSE and, consequently, the expected prediction error on future data. Here is a simple picture to illustrate why:

[Figure: sampling densities of two estimators of $\beta$, plotted over $\hat{\beta}$: an unbiased, higher-variance estimator and a biased, lower-variance one.]

It may be that the MSE is smaller for the biased estimator, because it never veers as far away from the truth as the unbiased estimator does.
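
To make the trade-off concrete, here is a minimal simulation sketch (not from the original notes) under the model in (1). It compares the MLE with a deliberately biased, shrunken estimator over repeated data sets; the shrinkage factor 0.7 and all other constants are illustrative assumptions, and the MSE is measured on the coefficient itself rather than on $\hat{\beta} x$ (the two differ only by the factor $x^2$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (assumptions, not from the lecture notes).
beta_true, sigma, N, n_reps = 2.0, 3.0, 20, 5000
x = rng.uniform(-1.0, 1.0, size=N)       # fixed covariates, as in the setup of (1)

mle, shrunk = [], []
for _ in range(n_reps):
    y = beta_true * x + sigma * rng.normal(size=N)   # Y_n ~ N(beta x_n, sigma^2)
    b_mle = (x @ y) / (x @ x)                        # MLE (least squares) for a single covariate
    mle.append(b_mle)
    shrunk.append(0.7 * b_mle)                       # a deliberately biased, lower-variance estimator

for name, est in [("MLE", np.array(mle)), ("shrunken", np.array(shrunk))]:
    bias2 = (est.mean() - beta_true) ** 2
    var = est.var()
    print(f"{name:9s} bias^2 = {bias2:.3f}   variance = {var:.3f}   MSE = {bias2 + var:.3f}")
```

With these settings the shrunken estimator typically has the smaller MSE, which is exactly the situation sketched above.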

2 Ridge regression

Regularization. In regression, we can make this trade-off with regularization, which means placing constraints on the coefficients. Here is a picture from ESL for our first example.

[Figure from ESL, Chapter 3: estimation picture for the lasso (left) and ridge regression (right), showing contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \leq t$ and $\beta_1^2 + \beta_2^2 \leq t^2$, respectively, while the red ellipses are the contours of the least squares error function.]

In this picture, contours represent values of $\beta$ with equal RSS (or, equivalently, likelihood). Our procedure finds the best value that is within the blue circle. This reduces the variance because it limits the space that the parameter vector can live in. If the true MLE of $\beta$ lives outside that space, then the resulting estimate must be biased, because of the Gauss-Markov theorem.

The picture also shows how regularization encourages smaller and perhaps simpler models. Simpler models are more robust to overfitting, where a model generalizes poorly because it matches the training data too closely. Simpler models can also be more interpretable, which is another goal of regression. (This is particularly true for the lasso, which we will talk about later.)

Ridge regression. Let's discuss the details of ridge regression. We optimize the RSS subject to a constraint on the sum of squares of the coefficients,

  \text{minimize}_{\beta} \; \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2 \quad \text{subject to} \quad \sum_{i=1}^{p} \beta_i^2 \leq s.   (8)

This constrains the coefficients to live within a sphere of radius $s$. (See the picture.) Question: What happens as the radius increases? Answer: Variance goes up; bias goes down.

With some calculus, the ridge regression estimate can also be expressed as

  \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2 + \lambda \sum_{i=1}^{p} \beta_i^2.   (9)

This is nice because the problem is convex. Further, it has an analytic solution. (See the reading.) Question: Is it sensitive to scaling? Answer: Yes, in practice we center and scale the covariates.
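
As a concrete illustration of the analytic solution (a sketch under assumed synthetic data, not code from the notes): for the objective in (9), setting the gradient to zero gives $\hat{\beta}_{\text{ridge}} = (X^\top X + 2\lambda I)^{-1} X^\top y$, which the numpy snippet below implements after centering and scaling as recommended.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimizer of (1/2)||y - X b||^2 + lam ||b||^2, as in equation (9):
    setting the gradient to zero gives b = (X^T X + 2 lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(p), X.T @ y)

# Illustrative synthetic data (an assumption, not from the notes).
rng = np.random.default_rng(1)
N, p = 50, 10
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=N)

# Center the response and center/scale the covariates, as recommended above.
y = y - y.mean()
X = (X - X.mean(axis=0)) / X.std(axis=0)

for lam in [0.0, 1.0, 10.0, 100.0]:
    b = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||b||_2 = {np.linalg.norm(b):.3f}")   # norm shrinks as lambda grows
```

Setting $\lambda = 0$ recovers the least squares (MLE) solution; larger values pull the coefficient vector toward zero.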

There is a 1-1 mapping between the radius $s$ and the complexity parameter $\lambda$. Either of these parameters trades off an increase in bias for a decrease in variance. From ESL:

[Figure: profiles of the ridge coefficients plotted against the L1 norm of the coefficient vector; the y-axis shows the coefficients and the top axis shows the number of nonzero coefficients (8 throughout).]

How do we choose $\lambda$? As we see, the value of the complexity parameter affects our estimate. Question: What would happen if we used training error as the criterion? (Look at the picture to see the answer.)

In practice, we choose $\lambda$ by cross validation. This is an attempt to minimize expected test error. (But later on we will discuss hierarchical models. This can be another way to choose the regularization parameter.) Here is how it works:

- Divide the data into $K$ folds (e.g., $K = 10$).
- Decide on candidate values of $\lambda$ (e.g., a grid between 0 and some maximum).
- For each fold $k$ and value of $\lambda$, estimate $\hat{\beta}^{-k}_{\text{ridge}}$ on the out-of-fold samples. Then, for each $x_n$ assigned to fold $k$, compute its squared error

    \epsilon_n = (\hat{y}_n - y_n)^2,   (10)

  where $\hat{y}_n = \mathbb{E}_{\hat{\beta}^{-k}_{\text{ridge}}}[Y \mid x_n]$. Note that this estimate of the coefficients did not use $(x_n, y_n)$ as part of its training data.

We now aggregate the individual errors. The score for $\lambda$ is

  \mathrm{MSE}(\lambda) = \frac{1}{N} \sum_{n=1}^{N} \epsilon_n.   (11)

This is an estimate of the test error. Choose the $\lambda$ that minimizes this score.
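
Here is a minimal numpy sketch of that cross-validation procedure (an illustration with assumed synthetic data and an assumed grid of $\lambda$ values, not code from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, K = 100, 10, 10                      # K = 10 folds, as in the procedure above
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=N)
lambdas = np.logspace(-2, 3, 20)           # an assumed grid of candidate values

folds = rng.permutation(N) % K             # assign each observation to one of the K folds

def ridge_fit(Xtr, ytr, lam):
    # minimizer of (1/2)||y - X b||^2 + lam ||b||^2, as in equation (9)
    return np.linalg.solve(Xtr.T @ Xtr + 2.0 * lam * np.eye(Xtr.shape[1]), Xtr.T @ ytr)

cv_mse = []
for lam in lambdas:
    errs = np.empty(N)
    for k in range(K):
        held_out = folds == k
        b = ridge_fit(X[~held_out], y[~held_out], lam)         # fit on the out-of-fold samples
        errs[held_out] = (y[held_out] - X[held_out] @ b) ** 2  # squared errors, equation (10)
    cv_mse.append(errs.mean())                                 # MSE(lambda), equation (11)

print(f"chosen lambda = {lambdas[int(np.argmin(cv_mse))]:.3g}")
```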

Aside: Connection to Bayesian statistics. We have motivated regularized regression via frequentist thinking, i.e., the bias-variance trade-off and an appeal to the true model. Regularized regression, in general, has connections to Bayesian modeling. We have discussed two common ways of using the posterior to obtain an estimate. The first is maximum a posteriori (MAP) estimation,

  \beta_{\text{MAP}} = \arg\max_{\beta} \; p(\beta \mid y_1, \ldots, y_N, \lambda).   (12)

The second is the posterior mean,

  \beta_{\text{mean}} = \mathbb{E}[\beta \mid y_1, \ldots, y_N, \lambda].   (13)

Question: How are these different from the MLE?

Ridge regression and Bayesian methods. Ridge regression corresponds to MAP estimation in the following model:

  \beta_i \sim \mathcal{N}(0, 1/\lambda)   (14)
  y_n \mid x_n, \beta \sim \mathcal{N}(\beta^\top x_n, \sigma^2).   (15)

Here is the corresponding graphical model:

[Graphical model with nodes $x_n$ and $Y_n$ inside a plate of size $N$, and $\beta$ and $\lambda$ outside the plate. This isn't quite right; $\lambda$ should be a small dot.]

We will derive the relationship. First, note that

  p(\beta_i \mid \lambda) = \sqrt{\lambda / (2\pi)} \, \exp\{-\tfrac{\lambda}{2} \beta_i^2\}.   (16)

We now compute the MAP estimate of $\beta$,

  \max_{\beta} \; p(\beta \mid D, \lambda)
    = \max_{\beta} \; \log p(\beta \mid y_{1:N}, x_{1:N}, \lambda)   (17)
    = \max_{\beta} \; \log p(\beta, y_{1:N} \mid x_{1:N}, \lambda)   (18)
    = \max_{\beta} \; \log \Big[ p(y_{1:N} \mid x_{1:N}, \beta) \prod_{i=1}^{p} p(\beta_i \mid \lambda) \Big]   (19)
    = \max_{\beta} \; -\tfrac{1}{2\sigma^2} \mathrm{RSS}(\beta; D) - \tfrac{\lambda}{2} \sum_{i=1}^{p} \beta_i^2.   (20)

Ridge regression is equivalent to MAP estimation in this model: maximizing (20) is the same as minimizing the penalized RSS in (9), up to a rescaling of the penalty parameter.

Observe that the hyperparameter $\lambda$ controls how far away the estimate will be from the MLE. A small hyperparameter (large prior variance) will choose the MLE; the data totally determine the estimate. As the hyperparameter gets larger, the estimate moves further from the MLE; the prior ($\mathbb{E}[\beta] = 0$) becomes more influential. This matches our recurring theme in Bayesian estimation: both the data and the prior influence the answer.

Finally, note that a true Bayesian would not set the hyperparameter by cross-validation. This uses the data to set the prior. However, I think it is a good idea. It is an instance of a more general principle called empirical Bayes.

Summary of ridge regression.

1. We constrain $\beta$ to be in a hypersphere around 0.
2. This is equivalent to minimizing the RSS plus a regularization term.
3. We no longer find the $\hat{\beta}$ that minimizes the RSS. (Contours illustrate constant RSS.)
4. Ridge regression is a kind of shrinkage, so called because it reduces the components to be close to 0 and close to each other.
5. Ridge estimates trade off bias for variance.
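
Before moving on, here is a quick numerical check of the MAP/ridge equivalence derived above (a sketch, not from the notes; the data, $\sigma^2$, and $\lambda$ values are assumptions). It minimizes the negative log posterior directly and compares the result with the corresponding closed-form ridge estimate:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, p, sigma, lam = 40, 5, 1.5, 3.0                   # illustrative values (assumptions)
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + sigma * rng.normal(size=N)

def neg_log_posterior(beta):
    # -log p(y | X, beta) - sum_i log p(beta_i | lam), dropping additive constants:
    #   (1 / (2 sigma^2)) RSS(beta) + (lam / 2) sum_i beta_i^2
    resid = y - X @ beta
    return resid @ resid / (2.0 * sigma**2) + 0.5 * lam * beta @ beta

beta_map = minimize(neg_log_posterior, np.zeros(p), method="L-BFGS-B").x

# The same objective in closed form: (X^T X + lam sigma^2 I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * sigma**2 * np.eye(p), X.T @ y)

print(np.max(np.abs(beta_map - beta_ridge)))          # should be numerically negligible
```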

3 The lasso

A closely related regularization method is called the lasso. The lasso optimizes the RSS subject to a different constraint,

  \text{minimize}_{\beta} \; \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2 \quad \text{subject to} \quad \sum_{i=1}^{p} |\beta_i| \leq s.   (21)

This small change yields very different estimates. Here is the picture of the constraint. From ESL:

[Figure from ESL, as above: the lasso constraint region $|\beta_1| + |\beta_2| \leq t$ (a diamond) alongside the ridge constraint region $\beta_1^2 + \beta_2^2 \leq t^2$ (a disk), with the red ellipses showing contours of the least squares error.]

Question: What happens as $s$ increases? Question: Where is the solution going to lie with $s$ fixed?

It's a fact: unless it chooses the unconstrained estimate $\hat{\beta}$, the lasso (with $p$ large) will set some of the coefficients exactly zero. The intuitions come from ESL: "Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter $\beta_j$ equal to zero. When $p > 2$, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero." (p. 90)

In a sense, the lasso is a form of feature selection, identifying a relevant subset of the covariates with which to predict. Like ridge regression, it trades off an increase in bias with a decrease in variance. Further, by zeroing out some of the covariates, it provides interpretable (as in, sparse) models.
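
Here is a small sketch of that zeroing-out behavior using scikit-learn (an illustration; the synthetic data and penalty values are assumptions, and scikit-learn's alpha plays the role of $\lambda$ up to a scaling of the objective):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
N, p = 100, 20
X = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                     # only 3 of the 20 covariates matter
y = X @ beta_true + rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)                   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)                  # L2 penalty

print("lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))   # typically close to 3
print("ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))   # typically all 20
```

The ridge fit shrinks the irrelevant coefficients toward zero but never exactly to zero; the lasso sets most of them exactly to zero.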

Sparse models can also be important in real systems that might depend on many inputs. Once the sparse solution is found, we need only measure a few of the inputs in order to make predictions. This speeds up the performance of the system.

The lasso is equivalent to

  \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2 + \lambda \sum_{i=1}^{p} |\beta_i|.   (22)

Again, there is a 1-1 mapping between $\lambda$ and $s$. This objective, though it does not have an analytic solution, is still convex.

Why is the lasso exciting? Prior to the lasso, the only sparse method was subset selection, finding the best subset of features with which to model the data. But subset selection has problems: searching over all subsets (of a fixed size) is computationally expensive. In contrast, the lasso efficiently finds a sparse solution by using convex optimization. In a sense, it is akin to a smooth version of subset selection. Note the lasso won't consider all possible subsets. From ESL:

[Figure: profiles of the lasso coefficients plotted against the L1 norm of the coefficient vector; the top axis shows the number of nonzero coefficients (0, 2, 3, 5, 7).]
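
Coefficient profiles like those in the figure can be computed directly. The sketch below (synthetic data assumed, not from the notes) uses scikit-learn's lasso_path, which solves the lasso over a decreasing grid of penalties; coefficients enter the model one by one as the penalty shrinks:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
N, p = 100, 8
X = rng.normal(size=(N, p))
beta_true = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=N)

# alphas come back in decreasing order; coefs has shape (p, number of alphas).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

for j in range(0, 50, 10):
    nonzero = int(np.sum(coefs[:, j] != 0))
    l1_norm = np.abs(coefs[:, j]).sum()
    print(f"alpha = {alphas[j]:8.4f}   nonzero = {nonzero}   L1 norm = {l1_norm:6.3f}")
```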

The Bayesian interpretation of the lasso. Like ridge regression, lasso regression corresponds to MAP estimation in a Bayesian model. For the lasso, the model is:

  \beta_i \sim \text{Laplace}(\lambda)   (23)
  Y_n \mid x_n, \beta \sim \mathcal{N}(\beta^\top x_n, \sigma^2).   (24)

Here the coefficients come from a Laplace distribution,

  p(\beta_i \mid \lambda) = \tfrac{\lambda}{2} \exp\{-\lambda |\beta_i|\}.   (25)

The lasso, and the general idea of L1-penalized models, has become a cottage industry in modern statistics and machine learning. The reason is that we often want sparse solutions to high-dimensional problems, and we want convex objective functions when analyzing data. L1-penalized methods give us both. Recent research indicates that they have good theoretical properties to boot.
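
Mirroring the ridge derivation in (17)-(20), here is a short sketch of why this prior yields the L1 penalty (constants that do not depend on $\beta$ are dropped):

  \log p(\beta \mid y_{1:N}, x_{1:N}, \lambda)
    = \log p(y_{1:N} \mid x_{1:N}, \beta) + \sum_{i=1}^{p} \log p(\beta_i \mid \lambda) + \text{const}
    = -\tfrac{1}{2\sigma^2} \mathrm{RSS}(\beta; D) - \lambda \sum_{i=1}^{p} |\beta_i| + \text{const},

so maximizing the posterior is the same as minimizing the penalized RSS in (22), again up to a rescaling of the penalty parameter.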

4 (Optional) Generalized regularization

In general, regularization can be seen as minimizing the RSS with a constraint on a q-norm,

  \text{minimize}_{\beta} \; \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2 \quad \text{subject to} \quad \|\beta\|_q \leq s,

where the penalty is

  \|\beta\|_q = \Big( \sum_{i=1}^{p} |\beta_i|^q \Big)^{1/q}.

The methods we discussed so far are

  q = 2 : ridge regression
  q = 1 : lasso
  q = 0 : subset selection

Here is the picture from ESL:

[Figure from ESL: contours of constant value of $\sum_j |\beta_j|^q$ for q = 4, 2, 1, 0.5, and 0.1.]

This brings us away from the minimum-RSS solution, but might provide better test prediction via the bias/variance trade-off. Complex models have less bias; simpler models have less variance. Regularization encourages simpler models.

Note that each of these methods corresponds to a Bayesian solution with a different choice of prior. In penalized form,

  \hat{\beta} = \arg\min_{\beta} \; \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2 + \lambda \|\beta\|_q.

The complexity parameter $\lambda$ can be chosen with cross validation. Lasso ($q = 1$) is the only norm that provides both sparsity and convexity.

And there are other variants, useful in the literature. Of note:

- The elastic net is a convex combination of L1 and L2.
- The grouped lasso finds sparse groups of covariates to include.

Finally, the glmnet package in R is amazing. It efficiently computes models for a regularization path using L2 or L1 penalization. It uses the same model syntax as lm or glm.

References

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, 2nd edition.