CSE446: Linear Regression, Regularization, Bias/Variance Tradeoff. Winter 2015


1 CSE446: Linear Regression, Regularization, Bias/Variance Tradeoff. Winter 2015. Luke Zettlemoyer. Slides adapted from Carlos Guestrin

2 Prediction of continuous variables. Billionaire says: Wait, that's not what I meant! You say: Chill out, dude. He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA. You say: I can regress that

3 Linear Regression. [Figure: data points with a fitted regression line; the line's value at each input is the prediction.]

4 Ordinary Least Squares (OLS). [Figure: fitted line through data; for one point, the observation, the prediction, and the error (residual) between them are labeled.]

5 The regression problem. Instances: $\langle x_j, t_j \rangle$. Learn: a mapping from x to t(x). Hypothesis space: given basis functions $\{h_1, \ldots, h_k\}$ with $h_i(x) \in \mathbb{R}$, find coefficients $w = \{w_1, \ldots, w_k\}$. Why is this usually called linear regression? The model is linear in the parameters. Can we estimate functions that are not lines???

6 Linear basis: 1D input. Need a bias term: $\{h_1(x) = x, h_2(x) = 1\}$. [Figure: y vs. x with a fitted line.]

7 Parabola: $\{h_1(x) = x^2, h_2(x) = x, h_3(x) = 1\}$. 2D input: $\{h_1(x) = x_1^2, h_2(x) = x_2^2, h_3(x) = x_1 x_2, \ldots\}$. Can define any basis functions $h_i(x)$ for n-dimensional input $x = \langle x_1, \ldots, x_n \rangle$. [Figure: y vs. x with a fitted parabola.]
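
As a concrete sketch (not from the slides; all names are illustrative), the basis functions above just define the columns of a design matrix:

```python
import numpy as np

def design_matrix(x, basis_fns):
    """Stack the basis-function values h_i(x_j) into the N x k matrix H."""
    return np.column_stack([h(x) for h in basis_fns])

# Parabola basis in 1D, including the bias feature h_3(x) = 1.
basis = [lambda x: x**2, lambda x: x, lambda x: np.ones_like(x)]
x = np.linspace(-1.0, 1.0, 20)
H = design_matrix(x, basis)   # shape (20, 3): 20 data points, 3 basis functions
```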

8 The regression problem. Instances: $\langle x_j, t_j \rangle$. Learn: a mapping from x to t(x). Hypothesis space: given basis functions $\{h_1, \ldots, h_k\}$ with $h_i(x) \in \mathbb{R}$, find coefficients $w = \{w_1, \ldots, w_k\}$. Why is this usually called linear regression? The model is linear in the parameters. Can we estimate functions that are not lines??? Precisely, minimize the residual squared error: $w^* = \arg\min_w \sum_{j=1}^N \big(t(x_j) - \sum_{i=1}^k w_i h_i(x_j)\big)^2$

9 Regression: matrix notation. $t \approx Hw$: H is the N x k matrix of basis-function values (N data points, k basis functions), w is the k x 1 vector of weights, and t is the N x 1 vector of observed outputs (measurements).

10 Regression: closed-form solution.
$w^* = \arg\min_w (Hw - t)^T (Hw - t)$
$F(w) = (Hw - t)^T (Hw - t)$
$\nabla_w F(w) = 0 \;\Rightarrow\; 2H^T(Hw - t) = 0 \;\Rightarrow\; H^T H w - H^T t = 0 \;\Rightarrow\; w = (H^T H)^{-1} H^T t$

11 Regression solution: simple matrix math. $w = (H^T H)^{-1} H^T t$, where $H^T H$ is a k x k matrix (for k basis functions) and $H^T t$ is a k x 1 vector.
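
A minimal runnable sketch of that closed form (reusing the hypothetical H from the sketch above, with synthetic targets); np.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
t = H @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=H.shape[0])  # synthetic targets

# w = (H^T H)^{-1} H^T t, computed via a linear solve rather than an inverse
w_hat = np.linalg.solve(H.T @ H, H.T @ t)
```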

12 But, why? Billionaire (again) says: Why sum squared error??? You say: Gaussians, Dr. Gateson, Gaussians... Model: the prediction is a linear function plus Gaussian noise: $t(x) = \sum_i w_i h_i(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$. Learn w using MLE: $p(t \mid x, w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\big(-\frac{[t - \sum_i w_i h_i(x)]^2}{2\sigma^2}\big)$

13 Maximizing log-likelihood. Maximize w.r.t. w:
$\arg\max_w \ln \prod_{j=1}^N \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{[t_j - \sum_i w_i h_i(x_j)]^2}{2\sigma^2}} = \arg\max_w -\sum_{j=1}^N \frac{[t_j - \sum_i w_i h_i(x_j)]^2}{2\sigma^2} = \arg\min_w \sum_{j=1}^N \big[t_j - \sum_i w_i h_i(x_j)\big]^2$
Least-squares linear regression is MLE for Gaussians!!!

14 Regularization in linear regression. One sign of overfitting: large parameter values! Regularized or penalized regression modifies the learning objective to penalize large parameters.

15 Ridge regression. Introduce a new objective function:
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \big(t(x_j) - (w_0 + \sum_{i=1}^k w_i h_i(x_j))\big)^2 + \lambda \sum_{i=1}^k w_i^2$
Prefer low error, but also add a squared penalty on large weights. λ is a hyperparameter that balances the tradeoff. The bias feature (essentially $h_0 = 1$) is written out explicitly and is not penalized.

16 Ridge regression: matrix notation.
$\hat{w}_{ridge} = \arg\min_w (Hw - t)^T (Hw - t) + \lambda w^T I_{0+k} w$
Here H is the N x (k+1) matrix of measurements (a bias column plus k basis functions, for N data points), w is the (k+1) x 1 weight vector, t holds the N observed outputs, and $I_{0+k}$ is the (k+1) x (k+1) identity matrix with a 0 in the upper-left entry, so the bias weight $w_0$ is not penalized.

17 Ridge regression: closed-form solution.
$F(w) = (Hw - t)^T (Hw - t) + \lambda w^T I_{0+k} w$
$\nabla_w F(w) = 0 \;\Rightarrow\; 2H^T(Hw - t) + 2\lambda I_{0+k} w = 0$
$w_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t$

18 Regression solution: simple matrix math.
$w_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t$
Compare to unregularized regression: $w = (H^T H)^{-1} H^T t$
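
A hedged NumPy sketch of the ridge solution under the slide's convention (bias column first, left unpenalized via the zeroed upper-left entry of $I_{0+k}$); the function name is illustrative:

```python
import numpy as np

def ridge_closed_form(H, t, lam):
    """w_ridge = (H^T H + lam * I_{0+k})^{-1} H^T t,
    where the first column of H is the bias feature h_0 = 1."""
    I0k = np.eye(H.shape[1])
    I0k[0, 0] = 0.0    # do not penalize the bias weight w_0
    return np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)
```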

19 Ridge regression: how does varying lambda change w? $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \big(t(x_j) - (w_0 + \sum_{i=1}^k w_i h_i(x_j))\big)^2 + \lambda \sum_{i=1}^k w_i^2$. Larger λ? Smaller λ? As λ → 0: becomes the same as unregularized MLE. As λ → ∞: all (penalized) weights will be 0!

20 Ridge coefficient path. [Figure from the Kevin Murphy textbook: each feature weight plotted against the regularization level, shrinking smoothly toward 0 as λ grows (larger λ on the left, smaller λ on the right).]
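
Such a path can be traced by re-solving over a grid of λ values; a sketch reusing the hypothetical ridge_closed_form above on a small polynomial problem:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
H = np.column_stack([x**d for d in range(6)])         # bias column (d = 0) plus degrees 1..5
t = np.sin(2 * x) + 0.1 * rng.normal(size=x.size)

lambdas = np.logspace(-4, 4, 50)
path = np.array([ridge_closed_form(H, t, lam) for lam in lambdas])
# Each column of path traces one weight; the penalized weights shrink toward 0 as lambda grows.
```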

21 How to pick lambda? Experimentation cycle: select a hypothesis f to best match the training set; tune hyperparameters on a held-out (development) set; try many different values of lambda and pick the best one. Or, do k-fold cross validation: no held-out set; divide the training set into k subsets; repeatedly train on k-1 of them and test on the remaining one; average the results. [Figure: a Training / Held-Out (Development) / Test split, and the k-fold layout Training Part 1 ... Training Part K with Test Data.]
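
A sketch of the k-fold recipe for picking λ (again assuming the H, t, and ridge_closed_form defined in the earlier sketches):

```python
import numpy as np

def cv_error(H, t, lam, k=5, seed=2):
    """Average held-out squared error over k folds: train on k-1 folds, test on the rest."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(t)), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(t)), fold)
        w = ridge_closed_form(H[train], t[train], lam)
        errs.append(np.mean((H[fold] @ w - t[fold]) ** 2))
    return np.mean(errs)

best_lam = min(np.logspace(-4, 4, 50), key=lambda lam: cv_error(H, t, lam))
```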

22 Why squared regularization?
Ridge: $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \big(t(x_j) - (w_0 + \sum_{i=1}^k w_i h_i(x_j))\big)^2 + \lambda \sum_{i=1}^k w_i^2$
LASSO: $\hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \big(t(x_j) - (w_0 + \sum_{i=1}^k w_i h_i(x_j))\big)^2 + \lambda \sum_{i=1}^k |w_i|$
The linear penalty pushes more weights to zero, allowing a type of feature selection. But it is not differentiable, and there is no closed-form solution.
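
With no closed form, LASSO must be solved iteratively. One standard choice (not named on the slide) is proximal gradient descent, i.e. ISTA: a gradient step on the squared error followed by soft-thresholding of the penalized weights. A minimal sketch with illustrative step size and iteration count:

```python
import numpy as np

def lasso_ista(H, t, lam, n_iter=1000):
    """LASSO via ISTA; the first column of H is the unpenalized bias feature."""
    w = np.zeros(H.shape[1])
    step = 0.5 / np.linalg.norm(H, 2) ** 2        # 1/L for the gradient 2 H^T (Hw - t)
    for _ in range(n_iter):
        w = w - step * 2 * H.T @ (H @ w - t)      # gradient step on the squared error
        # soft-threshold the penalized weights; leave the bias w[0] untouched
        w[1:] = np.sign(w[1:]) * np.maximum(np.abs(w[1:]) - step * lam, 0.0)
    return w
```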

23 Geometric intuition. [Figure from Rob Tibshirani's slides: in $(w_1, w_2)$ space, elliptical error contours around the MLE solution $\hat{w}_{MLE}$ meet the circular L2 constraint region (ridge) versus the diamond-shaped L1 constraint region (Lasso); the diamond's corners sit on the axes, so the Lasso solution often has exact zeros.]

24 LASSO coefficient path. [Figure from the Kevin Murphy textbook: feature weights vs. regularization level; unlike ridge, weights hit exactly 0 one at a time as λ grows (larger λ on the left, smaller λ on the right).]

25 Bias-variance tradeoff: intuition. Model too simple: does not fit the data well; a biased solution. Model too complex: small changes to the data change the solution a lot; a high-variance solution. [Figure: degree M = 0 and M = 9 polynomial fits to the same points, underfitting and overfitting respectively.]

26 (Squared) bias of a learner. Given: a dataset D with m samples. Learn: for different data D, you will learn different $h_D(x)$. Expected prediction (averaged over hypotheses): $\bar{h}(x) = E_D[h_D(x)]$. Bias: the expected difference between the expected prediction and the truth (here we square it): $E_x[(t(x) - \bar{h}(x))^2]$. Measures how well you expect to represent the true solution. Decreases with a more complex model. Zero bias typically means good predictions as m → ∞.

27 Variance of a learner. Given: a dataset D with m samples, and $\bar{h}(x) = E_D[h_D(x)]$. Learn: for different datasets D, you will learn different $h_D(x)$. Variance: the difference between what you expect to learn and what you learn from a particular dataset: $E_D[E_x[(h_D(x) - \bar{h}(x))^2]]$. Measures how sensitive the learner is to the specific dataset. Decreases with a simpler model.

28 Bias-variance decomposition of error.
$MSE = E_D[E_x[(t(x) - h_D(x))^2]]$
$= E_D[E_x[(t(x) - \bar{h}(x) + \bar{h}(x) - h_D(x))^2]]$
$= E_x[(t(x) - \bar{h}(x))^2] + E_D[E_x[(\bar{h}(x) - h_D(x))^2]] + 2E_D[E_x[(t(x) - \bar{h}(x))(\bar{h}(x) - h_D(x))]]$
The cross term is 0, because $\bar{h}(x) = E_D[h_D(x)]$, leaving
$= E_x[(t(x) - \bar{h}(x))^2]$ (bias$^2$) $+\; E_D[E_x[(h_D(x) - \bar{h}(x))^2]]$ (variance)
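
A hedged simulation of this decomposition (everything here, including the choice of true function, is illustrative): draw many datasets from a known t(x) plus noise, fit a polynomial to each, and average over the fits.

```python
import numpy as np

rng = np.random.default_rng(2)
t_true = lambda x: np.sin(2 * x)              # the (normally unknown) true function
x_grid = np.linspace(-1.0, 1.0, 200)
degree, n_datasets, m = 3, 500, 20

fits = []
for _ in range(n_datasets):
    x = rng.uniform(-1.0, 1.0, m)
    y = t_true(x) + 0.3 * rng.normal(size=m)
    coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit h_D
    fits.append(np.polyval(coefs, x_grid))    # evaluate h_D(x) on a fixed grid
fits = np.array(fits)

h_bar = fits.mean(axis=0)                     # estimate of E_D[h_D(x)]
bias2 = np.mean((t_true(x_grid) - h_bar) ** 2)    # E_x[(t - h_bar)^2]
variance = np.mean((fits - h_bar) ** 2)           # E_D[E_x[(h_D - h_bar)^2]]
```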

29 Bias-variance tradeoff. The choice of hypothesis class introduces learning bias. A more complex class: less bias. A more complex class: more variance.

30 Training set error. Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression. Training error: for a particular set of parameters, the loss function on the training data: $error_{train}(w) = \frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \big(t(x_j) - \sum_i w_i h_i(x_j)\big)^2$

31 Training error as a function of model complexity. [Figure: training error decreases steadily as model complexity grows.]

32 Prediction error. Training set error can be a poor measure of the quality of the solution. Prediction error (true error): we really care about the error over all possibilities: $error_{true}(w) = E_x\big[\big(t(x) - \sum_i w_i h_i(x)\big)^2\big] = \int \big(t(x) - \sum_i w_i h_i(x)\big)^2 p(x)\, dx$

33 Prediction error as a function of model complexity. [Figure: prediction error first falls, then rises again, as model complexity grows.]

34 Computing prediction error. To correctly compute prediction error: a hard integral! We may not know t(x) for every x, and may not know p(x). Monte Carlo integration (sampling approximation): sample a set of i.i.d. points $\{x_1, \ldots, x_M\}$ from p(x) and approximate the integral with the sample average: $error_{true}(w) \approx \frac{1}{M} \sum_{j=1}^M \big(t(x_j) - \sum_i w_i h_i(x_j)\big)^2$
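
A small sketch of that sampling approximation, under the assumption (realistic only in simulation) that we can draw fresh points from p(x) and evaluate the true t(x); all names are illustrative:

```python
import numpy as np

def mc_prediction_error(w, basis_fns, t_true, sample_px, M=10_000):
    """Monte Carlo estimate of E_x[(t(x) - sum_i w_i h_i(x))^2]."""
    x = sample_px(M)                                        # i.i.d. draws from p(x)
    preds = np.column_stack([h(x) for h in basis_fns]) @ w
    return np.mean((t_true(x) - preds) ** 2)

# e.g., with p(x) uniform on [-1, 1] and true function t(x) = sin(2x):
rng = np.random.default_rng(3)
# err = mc_prediction_error(w_hat, basis, lambda x: np.sin(2 * x),
#                           lambda M: rng.uniform(-1.0, 1.0, M))
```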

35 Why doesn't training set error approximate prediction error? Sampling approximation of prediction error: $\frac{1}{M} \sum_{j=1}^M \big(t(x_j) - \sum_i w_i h_i(x_j)\big)^2$ over fresh samples from p(x). Training error: $\frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \big(t(x_j) - \sum_i w_i h_i(x_j)\big)^2$ over the training points. Very similar equations!!! So why is the training set a bad measure of prediction error???

36 Why doesn't training set error approximate prediction error? Because you cheated!!! The training error is a good estimate for a single fixed w. But you optimized w with respect to the training error, finding a w that is good for this particular set of samples. Training error is therefore an (optimistically) biased estimate of prediction error.

37 Test set error. Given a dataset, randomly split it into two parts: training data $\{x_1, \ldots, x_{N_{train}}\}$ and test data $\{x_1, \ldots, x_{N_{test}}\}$. Use the training data to optimize the parameters w. Test set error: for the final solution w*, evaluate the error using $error_{test}(w^*) = \frac{1}{N_{test}} \sum_{j=1}^{N_{test}} \big(t(x_j) - \sum_i w_i^* h_i(x_j)\big)^2$
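
A brief sketch of that protocol, reusing the hypothetical H, t, and ridge_closed_form from the earlier sketches: fit on the training part only, then evaluate once on the untouched test part.

```python
import numpy as np

rng = np.random.default_rng(4)
idx = rng.permutation(len(t))
n_train = int(0.8 * len(t))
train, test = idx[:n_train], idx[n_train:]

w_star = ridge_closed_form(H[train], t[train], lam=1.0)    # learn on training data only
test_error = np.mean((H[test] @ w_star - t[test]) ** 2)    # final measure, on the test set
```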

38 Test set error as a function of model complexity. [Figure: training error keeps falling with complexity, while test error falls and then rises.]

39 Overfitting: this slide is so important we are looking at it again! Assume: data generated from a distribution D(X,Y), and a hypothesis space H. Define errors for a hypothesis $h \in H$: training error $error_{train}(h)$ and data (true) error $error_{true}(h)$. We say h overfits the training data if there exists an $h' \in H$ such that: $error_{train}(h) < error_{train}(h')$ and $error_{true}(h) > error_{true}(h')$

40 Summary: error estimators. Gold standard: the true prediction error. Training error: optimistically biased. Test error: our final measure.

41 Error as a function of the number of training examples, for fixed model complexity. [Figure: with little data, errors are high and far apart; with infinite data, training and true error converge toward the bias level.]

42 Error as a function of the regularization parameter, for fixed model complexity. [Figure: error plotted from λ = ∞ down to λ = 0, with the best true error at an intermediate λ.]

43 Summary: error estimators. Gold standard: the true prediction error. Training error: optimistically biased. Test error: our final measure. Be careful!!! The test set is only unbiased if you never, never, ever, ever do any learning on the test data. For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!! (We will address this problem later in the semester.)

44 What you need to know. Regression: basis functions = features; optimizing sum squared error; the relationship between regression and Gaussians. Regularization: ridge regression math; the LASSO formulation; how to set lambda. The bias-variance tradeoff.
