Ridge Regression: Regulating overfitting when using many features
Emily Fox, University of Washington, April 10, 2018

Training, true, & test error vs. model complexity
[Figure: training, true, and test error curves plotted against model complexity. Axes: Error (y), Model complexity (x). Slide prompt: "Overfitting if:"]
Bias-variance tradeoff
[Figure: bias and variance curves plotted against model complexity.]

Error vs. amount of data
[Figure: error plotted against # data points in the training set.]
Summary of assessing performance
What you can do now:
- Describe what a loss function is and give examples
- Contrast training and test error
- Compute training and test error given a loss function
- Discuss the issue of assessing performance on the training set
- Describe tradeoffs in forming training/test splits
- List and interpret the 3 sources of avg. prediction error: irreducible error, bias, and variance
Overfitting of polynomial regression
Flexibility of high-order polynomials:

$y_i = w_0 + w_1 x_i + w_2 x_i^2 + \dots + w_p x_i^p + \varepsilon_i$

[Figure: two fits of price ($) vs. square feet (sq.ft.): a reasonable fit $f_{\hat w}$, and a high-order polynomial fit labeled OVERFIT.]
Symptom of overfitting
Often, overfitting is associated with very large estimated parameters $\hat w$.

Overfitting of linear regression models more generically
Overfitting with many features
Not unique to polynomial regression; it also arises with lots of inputs (d large) or, generically, lots of features (D large):

$y_i = \sum_{j=0}^{D} w_j h_j(x_i) + \varepsilon_i$

- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …

How does # of observations influence overfitting?
- Few observations (N small): rapidly overfit as model complexity increases
- Many observations (N very large): harder to overfit

[Figure: price ($) vs. square feet (sq.ft.) fits $f_{\hat w}$ for small N and large N.]
How does # of inputs influence overfitting?
1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting. HARD

[Figure: price ($) vs. square feet (sq.ft.), fit $f_{\hat w}$.]

d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …): Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting. MUCH!!! HARDER

[Figure: 3-D surface of price ($) vs. square feet x[1] and # bathrooms x[2].]
Adding a term to the cost-of-fit to prefer small coefficients

[Diagram: ML workflow — Training Data → Feature extraction h(x) → ML model → ŷ, with the ML algorithm fitting ŵ against a Quality metric.]
Desired total cost format
Want to balance:
i. How well the function fits the data
ii. Magnitude of the coefficients

Total cost = measure of fit + measure of magnitude of coefficients
- measure of fit: small # = good fit to training data
- measure of magnitude of coefficients: small # = not overfit

Measure of fit to training data:

$RSS(w) = \sum_{i=1}^{N} (y_i - h(x_i)^T w)^2 = \sum_{i=1}^{N} (y_i - \hat y_i(w))^2$
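The RSS formula above can be sketched in a few lines of NumPy. This is an illustrative example (the feature matrix, targets, and coefficients are hypothetical, not from the lecture's housing data):

```python
import numpy as np

def rss(H, y, w):
    """Residual sum of squares for a linear model y ≈ h(x)^T w.
    H: (N, D+1) feature matrix, y: (N,) targets, w: (D+1,) coefficients."""
    residuals = y - H @ w
    return float(residuals @ residuals)

# Hypothetical data: constant feature h_0(x)=1 plus square feet
H = np.array([[1.0, 1000.0],
              [1.0, 2000.0],
              [1.0, 3000.0]])
y = np.array([300000.0, 450000.0, 600000.0])   # prices ($)
w = np.array([150000.0, 150.0])                # hypothetical coefficients

print(rss(H, y, w))  # → 0.0, since these w fit the toy data exactly
```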
Measure of magnitude of regression coefficients
What summary # is indicative of the size of the regression coefficients?
- Sum?
- Sum of absolute values?
- Sum of squares ($L_2$ norm)

Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
           = $RSS(w) + \|w\|_2^2$

Consider resulting objective
What if $\hat w$ is selected to minimize

$RSS(w) + \lambda \|w\|_2^2$

where λ is a tuning parameter balancing fit and magnitude?
- If λ = 0: reduces to minimizing $RSS(w)$ alone (ordinary least squares)
- If λ = ∞: only the magnitude penalty matters, so $\hat w = 0$
- If λ in between: balances fit to data against coefficient magnitude
Consider resulting objective
What if $\hat w$ is selected to minimize

$RSS(w) + \lambda \|w\|_2^2$,   λ a tuning parameter balancing fit and magnitude?

This is ridge regression (a.k.a. $L_2$ regularization).

Bias-variance tradeoff
- Large λ: high bias, low variance (e.g., $\hat w = 0$ for λ = ∞)
- Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of high-order polynomial for λ = 0)
In essence, λ controls model complexity.
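The lecture does not show an implementation here, but the ridge objective above has the well-known closed-form minimizer $\hat w = (H^T H + \lambda I)^{-1} H^T y$. A minimal NumPy sketch (synthetic data; names like `ridge_fit` are my own) makes the λ = 0 vs. large-λ behavior concrete:

```python
import numpy as np

def ridge_fit(H, y, lam):
    """Closed-form ridge solution w_hat = (H^T H + lam*I)^{-1} H^T y."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# Synthetic data (illustrative, not the lecture's housing data)
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.5, 1.0, -1.5])
y = H @ w_true + 0.1 * rng.normal(size=50)

w_ls = ridge_fit(H, y, lam=0.0)    # λ = 0: ordinary least squares fit
w_big = ridge_fit(H, y, lam=1e6)   # large λ: coefficients shrink toward 0

print(np.linalg.norm(w_ls), np.linalg.norm(w_big))
```

With λ = 0 the coefficients are close to `w_true`; with λ = 1e6 their norm collapses toward zero, matching the "high bias, low variance" end of the tradeoff.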
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression? Will consider a few settings of λ.

Coefficient path
[Figure: coefficients $\hat w_j$ (roughly -100,000 to 200,000) plotted against λ (0 to 200,000) for the features: bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovated, waterfront.]
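A coefficient path like the plot above can be traced by refitting ridge over a grid of λ values and recording the coefficients. A small sketch on synthetic data (the λ grid and data are illustrative) shows the shrinkage the plot depicts:

```python
import numpy as np

# Synthetic data (illustrative stand-in for the housing features)
rng = np.random.default_rng(4)
H = rng.normal(size=(100, 6))
y = H @ np.array([5.0, -3.0, 2.0, -1.0, 0.5, 4.0]) + 0.2 * rng.normal(size=100)

# Trace the coefficient path: one ridge fit per λ on the grid
norms = []
for lam in [0.0, 10.0, 100.0, 1000.0, 10000.0]:
    w = np.linalg.solve(H.T @ H + lam * np.eye(6), H.T @ y)
    norms.append(np.linalg.norm(w))

print(norms)  # coefficient magnitudes decrease monotonically as λ grows
```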
How to choose λ
The regression/ML workflow:
1. Model selection: need to choose tuning parameters λ controlling model complexity
2. Model assessment: having selected a model, assess generalization error
Hypothetical implementation
Split data: Training set | Test set

1. Model selection: For each considered λ:
   i. Estimate parameters $\hat w_\lambda$ on training data
   ii. Assess performance of $\hat w_\lambda$ on test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment: Compute test error of $\hat w_{\lambda^*}$ (fitted model for selected λ*) to approximate true error

Overly optimistic!

Issue: Just like fitting $\hat w$ and assessing its performance both on training data, λ* was selected to minimize test error (i.e., λ* was fit on test data). If the test data is not representative of the whole world, then $\hat w_{\lambda^*}$ will typically perform worse than the test error indicates.
Practical implementation
Split data: Training set | Validation set | Test set

Solution: Create two "test" sets!
1. Select λ* such that $\hat w_{\lambda^*}$ minimizes error on the validation set
2. Approximate true error of $\hat w_{\lambda^*}$ using the test set

- Training set: fit $\hat w_\lambda$
- Validation set: test performance of $\hat w_\lambda$ to select λ*
- Test set: assess true error of $\hat w_{\lambda^*}$
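The training/validation/test workflow above can be sketched end to end. This is a hypothetical Python/NumPy example (the data, split sizes, and λ grid are my own choices; the closed-form ridge solution stands in for whatever solver is used):

```python
import numpy as np

def ridge_fit(H, y, lam):
    """Closed-form ridge: w_hat = (H^T H + lam*I)^{-1} H^T y."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def mse(H, y, w):
    r = y - H @ w
    return float(r @ r) / len(y)

# Synthetic data (illustrative)
rng = np.random.default_rng(1)
H = rng.normal(size=(300, 10))
y = H @ rng.normal(size=10) + 0.5 * rng.normal(size=300)

# 80% / 10% / 10% split, as in the "typical splits" slide
H_tr, y_tr = H[:240], y[:240]
H_va, y_va = H[240:270], y[240:270]
H_te, y_te = H[270:], y[270:]

# 1. Model selection: pick λ* minimizing validation error
lams = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
lam_star = min(lams, key=lambda lam: mse(H_va, y_va, ridge_fit(H_tr, y_tr, lam)))

# 2. Model assessment: test error of w_hat_{λ*} approximates true error
w_star = ridge_fit(H_tr, y_tr, lam_star)
print(lam_star, mse(H_te, y_te, w_star))
```

The key design point mirrors the slide: λ* is chosen only on the validation set, so the test set remains untouched until the final assessment step.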
Typical splits
Training set / Validation set / Test set:
- 80% / 10% / 10%
- 50% / 25% / 25%

PRACTICALITIES: How to handle the intercept
Recall multiple regression model
Model:

$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \dots + w_D h_D(x_i) + \varepsilon_i = \sum_{j=0}^{D} w_j h_j(x_i) + \varepsilon_i$

- feature 1 = $h_0(x)$ … often 1 (constant)
- feature 2 = $h_1(x)$ … e.g., x[1]
- feature 3 = $h_2(x)$ … e.g., x[2]
- …
- feature D+1 = $h_D(x)$ … e.g., x[d]

Do we penalize the intercept?
Standard ridge regression cost: $RSS(w) + \lambda \|w\|_2^2$, where λ is the strength of the penalty. This encourages the intercept $w_0$ to also be small. Do we want a small intercept? Conceptually, a large intercept is not indicative of overfitting.
Option 1: Don't penalize the intercept
Modified ridge regression cost:

$RSS(w_0, w_{rest}) + \lambda \|w_{rest}\|_2^2$

Option 2: Center data first
If the data are first centered about 0, then favoring a small intercept is not so worrisome.
- Step 1: Transform data to have 0 mean
- Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
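Option 2 can be sketched as follows: center the features and targets, run ordinary (penalized) ridge on the centered data, and then recover the intercept from the means. This is a minimal illustration with synthetic data (variable names and the choice λ = 1 are mine):

```python
import numpy as np

# Synthetic data with a clearly nonzero intercept (7.0) and nonzero feature means
rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, size=(100, 3))
y = 7.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Step 1: transform features and targets to have 0 mean
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Step 2: run ridge regression as normal on the centered data (no intercept term)
lam = 1.0
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(3), Xc.T @ yc)

# The intercept is recovered afterwards, so it was never penalized
w0 = y_mean - x_mean @ w
print(w0, w)
```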
PRACTICALITIES: Feature normalization

Normalizing features (Feat. 0, Feat. 1, Feat. 2, …, Feat. D)
Scale training columns (not rows!) as:

$\underline{h}_j(x_k) = \dfrac{h_j(x_k)}{\sqrt{\sum_{i=1}^{N} h_j(x_i)^2}}$

where the sum in the normalizer $z_j = \sqrt{\sum_{i=1}^{N} h_j(x_i)^2}$ runs over the training points. Apply the same training scale factors to test data:

$\underline{h}_j(x_k) = \dfrac{h_j(x_k)}{z_j}$   (applied to each test point, reusing the training normalizer $z_j$)
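The column-wise normalization above is a couple of lines in NumPy. A sketch with illustrative data (columns deliberately given very different scales), showing the training normalizers being reused on the test features:

```python
import numpy as np

# Illustrative feature matrices with wildly different column scales
rng = np.random.default_rng(3)
scales = np.array([1.0, 10.0, 100.0, 1000.0])
H_train = rng.normal(size=(100, 4)) * scales
H_test = rng.normal(size=(20, 4)) * scales

# Normalizer z_j = sqrt(sum_i h_j(x_i)^2), computed from TRAINING columns only
z = np.sqrt((H_train ** 2).sum(axis=0))

H_train_n = H_train / z   # scale training columns (not rows!)
H_test_n = H_test / z     # apply the SAME training scale factors to test data

print((H_train_n ** 2).sum(axis=0))  # each normalized training column sums to 1
```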
Summary for ridge regression
What you can do now:
- Describe what happens to the magnitude of estimated coefficients when a model is overfit
- Motivate the form of the ridge regression cost function
- Describe what happens to the estimated coefficients of ridge regression as the tuning parameter λ is varied
- Interpret a coefficient path plot
- Use a validation set to select the ridge regression tuning parameter λ
- Handle the intercept and the scale of features with care