Ridge Regression: Regulating overfitting when using many features

Emily Fox, University of Washington, April 10, 2018

Training, true, & test error vs. model complexity
Overfitting if: [Plot: error vs. model complexity, comparing the training, true, and test error curves]

Bias-variance tradeoff
[Plot: bias and variance vs. model complexity]

Error vs. amount of data
[Plot: error vs. # data points in training set]

Summary of assessing performance
What you can do now:
- Describe what a loss function is and give examples
- Contrast training and test error
- Compute training and test error given a loss function
- Discuss the issue of assessing performance on the training set
- Describe tradeoffs in forming training/test splits
- List and interpret the 3 sources of avg. prediction error: irreducible error, bias, and variance

Overfitting of polynomial regression
Flexibility of high-order polynomials:
y_i = w_0 + w_1 x_i + w_2 x_i^2 + … + w_p x_i^p + ε_i
[Plots: price ($) vs. square feet (sq.ft.), fitted f_ŵ for a low-order vs. a high-order polynomial; the high-order fit is labeled OVERFIT]

Symptom of overfitting
Often, overfitting is associated with very large estimated parameters ŵ, as the sketch below illustrates.
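To make this symptom concrete, here is a minimal sketch (not from the slides; the data and polynomial degree are arbitrary choices) that fits an unregularized high-order polynomial to a handful of noisy points and prints the largest coefficient magnitude, which typically blows up:

```python
import numpy as np

# Hypothetical toy data: a few noisy observations of a smooth function
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 12))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

# Degree-10 polynomial features: 11 coefficients for only 12 points (very flexible)
H = np.vander(x, N=11, increasing=True)

# Unregularized least-squares fit
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

# Symptom of overfitting: some estimated coefficients are enormous
print("max |w_j| =", np.abs(w_hat).max())
```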

Overfitting of linear regression models more generically

Overfitting with many features
Not unique to polynomial regression; the same issue arises with lots of inputs (d large) or, generically, lots of features (D large):
y_i = Σ_{j=0}^{D} w_j h_j(x_i) + ε_i
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
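To make the h_j(x) notation concrete, a small sketch (hypothetical inputs and feature choices, not from the slides) builds a design matrix whose entry (i, j) is h_j(x_i):

```python
import numpy as np

# Hypothetical raw inputs per house: [sq.ft., #bathrooms, #bedrooms, lot size, year built]
X = np.array([[1430.0, 1.5, 3.0, 5000.0, 1965.0],
              [2950.0, 3.0, 4.0, 8100.0, 2001.0],
              [1710.0, 2.0, 3.0, 4200.0, 1987.0]])

def features(x):
    """h_0(x)=1 (constant), h_1..h_5(x)=raw inputs, h_6(x)=log(sq.ft.) as an example derived feature."""
    return np.concatenate(([1.0], x, [np.log(x[0])]))

# Row i of H holds h_0(x_i), ..., h_D(x_i); the model is y_i = sum_j w_j h_j(x_i) + eps_i
H = np.array([features(x) for x in X])
print(H.shape)  # (N, D+1)
```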

How does # of observations influence overfitting?
Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit
[Plots: price ($) vs. square feet (sq.ft.), fitted f_ŵ with few vs. many observations]

How does # of inputs influence overfitting?
1 input (e.g., sq.ft.): data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting. HARD
d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …): data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting. MUCH!!! HARDER
[Plots: price ($) vs. square feet x[1], and vs. square feet x[1] and # bathrooms x[2]]

Adding a term to the cost-of-fit to prefer small coefficients
[Block diagram: Training Data → Feature extraction h(x) → ML model → ŷ, with the ML algorithm choosing ŵ according to a quality metric]

Desired total cost format
Want to balance:
i. How well the function fits the data
ii. Magnitude of coefficients
Total cost = measure of fit + measure of magnitude of coefficients
(measure of fit: small # = good fit to training data; measure of magnitude: small # = not overfit)

Measure of fit to training data
RSS(w) = Σ_{i=1}^{N} (y_i − h(x_i)^T w)^2 = Σ_{i=1}^{N} (y_i − ŷ_i(w))^2
[Plot: price ($) vs. square feet x[1] and # bathrooms x[2]]
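A small sketch of evaluating this measure of fit; the feature matrix H and the weights are hypothetical stand-ins, with row i of H holding h(x_i)^T:

```python
import numpy as np

def rss(H, y, w):
    """Residual sum of squares: sum_i (y_i - h(x_i)^T w)^2."""
    residuals = y - H @ w          # y_hat_i(w) = h(x_i)^T w
    return float(residuals @ residuals)

# Hypothetical N=3 observations with features [1, sq.ft., #bathrooms]
H = np.array([[1.0, 1430.0, 1.5],
              [1.0, 2950.0, 3.0],
              [1.0, 1710.0, 2.0]])
y = np.array([310000.0, 650000.0, 350000.0])
w = np.array([20000.0, 200.0, 5000.0])
print(rss(H, y, w))
```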

Measure of magnitude of regression coefficients
What summary # is indicative of the size of the regression coefficients?
- Sum?
- Sum of absolute values?
- Sum of squares (L2 norm)

Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients

Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients = RSS(w) + ||w||_2^2

Consider resulting objective
What if ŵ is selected to minimize RSS(w) + λ||w||_2^2, where the tuning parameter λ balances fit and magnitude?
If λ = 0: the objective reduces to RSS(w), i.e., the standard least squares fit
If λ = ∞: ŵ = 0, since any nonzero w incurs infinite cost
If λ in between: solutions that trade off fit against coefficient magnitude

Consider resulting objective
Selecting ŵ to minimize RSS(w) + λ||w||_2^2 (tuning parameter λ = balance of fit and magnitude) is ridge regression (a.k.a. L2 regularization).

Bias-variance tradeoff
Large λ: high bias, low variance (e.g., ŵ = 0 for λ = ∞)
Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of a high-order polynomial for λ = 0)
In essence, λ controls model complexity.
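The slides later mention running ridge regression via "closed-form or gradient algorithms"; below is a minimal sketch of one such closed form, ŵ = (H^T H + λI)^{-1} H^T y, on hypothetical data (the intercept and feature-scaling practicalities from later slides are ignored here):

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """Minimize RSS(w) + lam * ||w||_2^2 via w = (H^T H + lam*I)^{-1} H^T y."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# Hypothetical data just to exercise the function
rng = np.random.default_rng(1)
H = rng.normal(size=(50, 5))
y = H @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=50)

# As lam grows, the coefficient norm shrinks (high bias, low variance)
for lam in [0.0, 1.0, 100.0]:
    w_hat = ridge_closed_form(H, y, lam)
    print(lam, round(float(np.linalg.norm(w_hat)), 3))
```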

Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression? Will consider a few settings of λ.

Coefficient path
[Plot: coefficients ŵ_j vs. λ (0 to 200000) for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovated, waterfront; all coefficients shrink toward 0 as λ grows]
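A sketch of how a coefficient path like this could be produced with scikit-learn's Ridge; the feature names mirror the plot, but since the housing data itself is not included here, random stand-in data is used:

```python
import numpy as np
from sklearn.linear_model import Ridge

features = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot",
            "floors", "yr_built", "yr_renovated", "waterfront"]

# Stand-in data; in the demo this would be the house sales training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(features)))
y = X @ rng.normal(size=len(features)) + rng.normal(scale=0.5, size=200)

# Fit ridge regression across a range of penalties and record the coefficients
for lam in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=lam).fit(X, y)    # scikit-learn calls the lambda penalty "alpha"
    print(lam, np.round(model.coef_, 3))  # each row traces the coefficients shrinking toward 0
```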

How to choose λ
The regression/ML workflow:
1. Model selection: need to choose tuning parameters λ controlling model complexity
2. Model assessment: having selected a model, assess the generalization error

Hypothetical implementation (data split: Training set | Test set)
1. Model selection: for each considered λ:
   i. Estimate parameters ŵ_λ on training data
   ii. Assess performance of ŵ_λ on test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment: compute the test error of ŵ_λ* (the fitted model for the selected λ*) to approximate the true error
Overly optimistic!

Hypothetical implementation (data split: Training set | Test set)
Issue: just as when fitting ŵ and assessing its performance both on the training data, λ* was selected to minimize test error (i.e., λ* was fit on the test data). If the test data are not representative of the whole world, then ŵ_λ* will typically perform worse than its test error indicates.

Practical implementation (data split: Training set | Validation set | Test set)
Solution: create two "test" sets!
1. Select λ* such that ŵ_λ* minimizes error on the validation set
2. Approximate the true error of ŵ_λ* using the test set

Practical implementation (data split: Training set | Validation set | Test set)
- Training set: fit ŵ_λ
- Validation set: test performance of ŵ_λ to select λ*
- Test set: assess the true error of ŵ_λ*
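A minimal sketch of this protocol with scikit-learn on hypothetical synthetic data (in the lecture this would be the housing data), using the typical 80%/10%/10% split shown on the next slide:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for house features / prices
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=1000)

# 80% training / 10% validation / 10% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. Model selection: choose lambda* by minimizing validation error
best_lam, best_err = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)          # fit w_hat_lambda on training set
    err = mean_squared_error(y_val, model.predict(X_val))   # assess on validation set
    if err < best_err:
        best_lam, best_err = lam, err

# 2. Model assessment: estimate true error of w_hat_{lambda*} on the untouched test set
final = Ridge(alpha=best_lam).fit(X_train, y_train)
print("lambda* =", best_lam, " test MSE =", mean_squared_error(y_test, final.predict(X_test)))
```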

Typical splits (Training set | Validation set | Test set):
- 80% | 10% | 10%
- 50% | 25% | 25%

PRACTICALITIES: How to handle the intercept

Recall multiple regression model
Model: y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + … + w_D h_D(x_i) + ε_i = Σ_{j=0}^{D} w_j h_j(x_i) + ε_i
- feature 1 = h_0(x), often 1 (constant)
- feature 2 = h_1(x), e.g., x[1]
- feature 3 = h_2(x), e.g., x[2]
- …
- feature D+1 = h_D(x), e.g., x[d]

Do we penalize the intercept?
Standard ridge regression cost: RSS(w) + λ||w||_2^2 (λ = strength of penalty)
This encourages the intercept w_0 to also be small. Do we want a small intercept? Conceptually, a large intercept is not indicative of overfitting.

Option 1: Don't penalize the intercept
Modified ridge regression cost: RSS(w_0, w_rest) + λ||w_rest||_2^2

Option 2: Center the data first
If the data are first centered about 0, then favoring a small intercept is not so worrisome.
Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
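A minimal sketch of Option 2 on hypothetical data; as one common variant, it centers both the features and y, runs the ridge closed form as normal, then recovers an unpenalized intercept from the means:

```python
import numpy as np

def ridge_centered(H, y, lam):
    """Center the data, run ridge as normal, then recover the (unpenalized) intercept."""
    H_mean, y_mean = H.mean(axis=0), y.mean()
    Hc, yc = H - H_mean, y - y_mean                                        # Step 1: transform to 0 mean
    w = np.linalg.solve(Hc.T @ Hc + lam * np.eye(Hc.shape[1]), Hc.T @ yc)  # Step 2: ridge as normal
    w0 = y_mean - H_mean @ w                                               # intercept, not penalized
    return w0, w

# Hypothetical data with a large intercept
rng = np.random.default_rng(2)
H = rng.normal(loc=5.0, size=(100, 3))
y = 40.0 + H @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)
w0, w = ridge_centered(H, y, lam=1.0)
print(round(float(w0), 2), np.round(w, 2))
```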

PRACTICALITIES: Feature normalization

Normalizing features
Scale training columns (not rows!) as:
h̃_j(x_k) = h_j(x_k) / z_j, with normalizer z_j = sqrt( Σ_{i=1}^{N} h_j(x_i)^2 )
Apply the same training scale factors to the test data:
h̃_j(x_k) = h_j(x_k) / z_j for each test point, where z_j sums over the training points only.
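A small sketch of this column scaling with hypothetical feature matrices: the normalizers z_j are computed from the training features and then reused, unchanged, on the test features:

```python
import numpy as np

# Hypothetical training and test feature matrices (rows = points, columns = features h_j)
rng = np.random.default_rng(3)
H_train = rng.normal(size=(100, 4))
H_test = rng.normal(size=(20, 4))

# Normalizer z_j = sqrt(sum over TRAINING points of h_j(x_i)^2), one per column
z = np.sqrt((H_train ** 2).sum(axis=0))

# Scale the training columns ...
H_train_norm = H_train / z
# ... and apply the SAME training scale factors to the test features
H_test_norm = H_test / z

print(np.round(np.sqrt((H_train_norm ** 2).sum(axis=0)), 6))  # each training column now has norm 1
```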

Summary for ridge regression
What you can do now:
- Describe what happens to the magnitude of estimated coefficients when the model is overfit
- Motivate the form of the ridge regression cost function
- Describe what happens to the estimated coefficients of ridge regression as the tuning parameter λ is varied
- Interpret a coefficient path plot
- Use a validation set to select the ridge regression tuning parameter λ
- Handle the intercept and scale of features with care