Learning From Data Lecture 12 Regularization


Learning From Data, Lecture 12: Regularization. Constraining the Model, Weight Decay, the Augmented Error. M. Magdon-Ismail, CSCI 4100/6100.

recap: Overfitting. Fitting the data more than is warranted. [Figure: a fit that chases the noisy data points away from the target curve; legend: Data, Target, Fit.]

recap: Noise is Part of y We Cannot Model. Stochastic noise: y = f(x) + stochastic noise. Deterministic noise: y = h*(x) + deterministic noise, where h* is the best approximation to f in H. Both stochastic and deterministic noise hurt learning. Human: good at extracting the simple pattern, ignoring the noise and complications. Computer: pays equal attention to all pixels; needs help simplifying (features, regularization).

Regularization. What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving E_out. How does it work? By constraining the model so that we cannot fit the noise: putting on the brakes. Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal)?

Constraining the Model: Does it Help? [Figure: the unconstrained fit to the data.] ...and the winner is:

Constraining the Model: Does it Help? [Figure: the unconstrained fit next to a fit with the weights constrained to be smaller.] ...and the winner is: the fit with smaller weights.

Bias Goes Up a Little. [Figure: the average hypothesis ḡ(x) versus the target sin(x), without and with regularization.] No regularization: bias = 0.21. Regularization: bias = 0.23 (the side effect). (The constant model had bias = 0.5 and var = 0.25.)

Variance Drop is Dramatic! [Figure: ḡ(x) versus sin(x), without and with regularization.] No regularization: bias = 0.21, var = 1.69. Regularization: bias = 0.23, var = 0.33. The small bias increase is the side effect; the large variance drop is the treatment. (The constant model had bias = 0.5 and var = 0.25.)

Regularization in a Nutshell. VC analysis: E_out(g) ≤ E_in(g) + Ω(H). If you use a simpler H and get a good fit, then your E_out is better. Regularization takes this a step further: if you use a simpler h and get a good fit, then is your E_out better?

Polynomials of Order Q: A Useful Testbed.
H_Q: polynomials of order Q.
Standard polynomial: z = (1, x, x^2, ..., x^Q), so h(x) = w^t z(x) = w_0 + w_1 x + ... + w_Q x^Q.
Legendre polynomial: z = (1, L_1(x), L_2(x), ..., L_Q(x)), so h(x) = w^t z(x) = w_0 + w_1 L_1(x) + ... + w_Q L_Q(x). (We are using linear regression; the Legendre basis allows us to treat the weights independently.)
The first few Legendre polynomials: L_1(x) = x, L_2(x) = (1/2)(3x^2 - 1), L_3(x) = (1/2)(5x^3 - 3x), L_4(x) = (1/8)(35x^4 - 30x^2 + 3), L_5(x) = (1/8)(63x^5 - 70x^3 + 15x).
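As a concrete aid (not from the lecture), here is a minimal Python sketch of the Legendre feature transform z(x), using NumPy's built-in Legendre routines; the function name is illustrative.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(x, Q):
    """Rows are z(x_n) = (L_0(x_n), L_1(x_n), ..., L_Q(x_n)) for inputs x in [-1, 1]."""
    return legendre.legvander(np.asarray(x, dtype=float), Q)  # shape (N, Q+1)

# Quick check against the table above: L_2(0.5) = (1/2)(3*0.25 - 1) = -0.125.
print(legendre_features([0.5], 5))
```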

recap: Linear Regression.
The data (x_1, y_1), ..., (x_N, y_N) become, after the transform, (z_1, y_1), ..., (z_N, y_N), collected into the matrix Z and the vector y.
min: E_in(w) = (1/N) Σ_{n=1}^{N} (w^t z_n - y_n)^2 = (1/N)(Zw - y)^t(Zw - y).
Solution: w_lin = (Z^t Z)^{-1} Z^t y (the linear regression fit).
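A minimal sketch of this unregularized fit on a toy data set; the target sin(πx), the noise level, and the sizes N = 20, Q = 10 are assumptions for illustration, not the lecture's exact experiment.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
N, Q = 20, 10
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.5 * rng.standard_normal(N)   # target plus stochastic noise

Z = legendre.legvander(x, Q)                    # N x (Q+1) Legendre feature matrix
w_lin = np.linalg.lstsq(Z, y, rcond=None)[0]    # minimizes (1/N)(Zw - y)^t (Zw - y)
E_in = np.mean((Z @ w_lin - y) ** 2)
print("E_in =", E_in, " ||w_lin||^2 =", w_lin @ w_lin)
```

Here lstsq is used rather than forming (Z^t Z)^{-1} explicitly; it computes the same w_lin but is the numerically safer route.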

Constraining the Model: H_10 vs. H_2.
H_10 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x) }.
H_2 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x) such that w_3 = w_4 = ... = w_10 = 0 }, a hard order constraint that sets some weights to zero.
H_2 ⊂ H_10.

Soft Order Constraint. Don't set weights explicitly to zero (e.g., w_3 = 0). Give a budget and let the learning choose:
Σ_{q=0}^{Q} w_q^2 ≤ C (a budget C for the weights).
The soft order constraint allows intermediate models between H_2 and H_10.

Soft Order Constrained Model H_C.
H_10 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x) }.
H_2 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x) such that w_3 = w_4 = ... = w_10 = 0 }.
H_C = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_10 Φ_10(x) such that Σ_{q=0}^{10} w_q^2 ≤ C }, a soft budget constraint on the weights.
VC perspective: H_C is smaller than H_10 ⇒ better generalization.

Fitting the Data. The optimal (regularized) weights w_reg ∈ H_C should minimize the in-sample error while staying within the budget. w_reg is a solution to
min: E_in(w) = (1/N)(Zw - y)^t(Zw - y) subject to: w^t w ≤ C.

Solving for w_reg.
min: E_in(w) = (1/N)(Zw - y)^t(Zw - y) subject to: w^t w ≤ C.
Observations (picture the elliptical level sets E_in = const. centered at w_lin, and the sphere w^t w = C):
1. The optimal w tries to get as close to w_lin as possible, so it uses the full budget and lies on the surface w^t w = C.
2. At the optimal w, the surface w^t w = C must be perpendicular to ∇E_in; otherwise one could move along the surface and decrease E_in.
3. The normal to the surface w^t w = C at w is the vector w itself.
4. The surface is perpendicular to ∇E_in and the surface is perpendicular to its normal, so ∇E_in is parallel to the normal (but points in the opposite direction).
Hence ∇E_in(w_reg) = -2λ_C w_reg, where λ_C, the Lagrange multiplier, is positive (the 2 is for mathematical convenience).

Solving for w_reg (continued).
E_in(w) is minimized subject to: w^t w ≤ C
⇔ ∇E_in(w_reg) + 2λ_C w_reg = 0
⇔ ∇(E_in(w) + λ_C w^t w) |_{w = w_reg} = 0
⇔ E_in(w) + λ_C w^t w is minimized unconditionally.
There is a correspondence: C ↔ λ_C.

The Augmented Error.
Pick a C and minimize E_in(w) subject to: w^t w ≤ C.
Equivalently, pick a λ_C and minimize E_aug(w) = E_in(w) + λ_C w^t w unconditionally; the term λ_C w^t w is a penalty for the complexity of h, measured by the size of the weights.
We can pick any budget C; translation: we are free to pick any multiplier λ_C. What's the right C? What's the right λ_C?

Linear Regression with the Soft Order Constraint.
E_aug(w) = (1/N)(Zw - y)^t(Zw - y) + λ_C w^t w.
It is convenient to set λ_C = λ/N, giving
E_aug(w) = (1/N)[(Zw - y)^t(Zw - y) + λ w^t w],
called weight decay, as the penalty encourages smaller weights. Unconditionally minimize E_aug(w).

The Solution for w_reg.
∇E_aug(w) = 2Z^t(Zw - y) + 2λw = 2(Z^t Z + λI)w - 2Z^t y (dropping the constant factor 1/N, which does not affect the minimizer).
Setting ∇E_aug(w) = 0 gives
w_reg = (Z^t Z + λI)^{-1} Z^t y,
where λ determines the amount of regularization. Recall the unconstrained solution (λ = 0): w_lin = (Z^t Z)^{-1} Z^t y.
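A hedged sketch of this closed-form solution on the same kind of toy data as the earlier sketch; the data set, the λ value, and the function name are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

def weight_decay_fit(Z, y, lam):
    """w_reg = (Z^t Z + lam*I)^{-1} Z^t y, minimizing (1/N)[(Zw - y)^t (Zw - y) + lam * w^t w]."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
y = np.sin(np.pi * x) + 0.5 * rng.standard_normal(20)
Z = legendre.legvander(x, 10)

w_lin = weight_decay_fit(Z, y, lam=0.0)    # recovers the unregularized solution
w_reg = weight_decay_fit(Z, y, lam=0.01)   # a little weight decay
print(w_lin @ w_lin, w_reg @ w_reg)        # the regularized weights have a much smaller norm
```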

A Little Regularization...
Minimizing E_in(w) + (λ/N) w^t w with different λ's.
[Figure: the λ = 0 fit, which overfits badly; legend: Data, Target, Fit.]

...Goes a Long Way.
Minimizing E_in(w) + (λ/N) w^t w with different λ's.
[Figure: λ = 0 (overfitting) next to λ = 0.0001 (wow!); legend: Data, Target, Fit.]

Don't Overdose.
Minimizing E_in(w) + (λ/N) w^t w with different λ's.
[Figure: four fits for λ = 0, 0.0001, 0.01, 1, going from overfitting at λ = 0 to underfitting at λ = 1; legend: Data, Target, Fit.]
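To see the overdose numerically, here is a sketch that sweeps λ and estimates E_out on a large held-out test set; the λ grid, data sizes, and target are illustrative assumptions, so the exact numbers will differ from the lecture's figures.

```python
import numpy as np
from numpy.polynomial import legendre

def weight_decay_fit(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

rng = np.random.default_rng(1)
def make_data(n, Q=10):
    x = rng.uniform(-1, 1, n)
    return legendre.legvander(x, Q), np.sin(np.pi * x) + 0.5 * rng.standard_normal(n)

Z_train, y_train = make_data(20)
Z_test, y_test = make_data(2000)       # large test set to estimate E_out

for lam in [0.0, 1e-4, 1e-2, 1.0, 10.0]:
    w = weight_decay_fit(Z_train, y_train, lam)
    print(f"lambda = {lam:g}  estimated E_out = {np.mean((Z_test @ w - y_test) ** 2):.3f}")
```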

Overfitting and Underfitting.
[Figure: expected E_out versus the regularization parameter λ (0 to 2); E_out first decreases (the overfitting region, λ too small) and then increases (the underfitting region, λ too large).]

More Noise Needs More Medicine.
[Figure: expected E_out versus the regularization parameter λ for stochastic noise levels σ^2 = 0, 0.25, 0.5; the noisier the data, the larger the optimal λ.]

...Even for Deterministic Noise.
[Figure: left, expected E_out versus λ for stochastic noise σ^2 = 0, 0.25, 0.5; right, expected E_out versus λ for target complexities Q_f = 15, 30, 100. More deterministic noise (larger Q_f) also calls for more regularization.]

Variations on Weight Decay.
[Figure: expected E_out versus the regularization parameter λ for three penalties. Uniform weight decay, Σ_{q=0}^{Q} w_q^2, and the low order fit penalty, Σ_{q=0}^{Q} q w_q^2, both show the overfitting-to-underfitting trade-off; the weight growth penalty, Σ_{q=0}^{Q} 1/w_q^2, only makes expected E_out worse as λ increases, unlike weight decay.]
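The two quadratic penalties above fit one template, Σ_q γ_q w_q^2 = w^t Γ w with a diagonal Γ (a special case of the Tikhonov regularizer); below is a minimal sketch of that generalization, with illustrative names. The weight-growth penalty is not quadratic, so it is not covered by this form.

```python
import numpy as np

def generalized_weight_decay_fit(Z, y, lam, gamma):
    """Minimize (1/N)[(Zw - y)^t (Zw - y) + lam * w^t diag(gamma) w]."""
    return np.linalg.solve(Z.T @ Z + lam * np.diag(gamma), Z.T @ y)

Q = 10
gamma_uniform = np.ones(Q + 1)                    # uniform weight decay: sum_q w_q^2
gamma_low_order = np.arange(Q + 1, dtype=float)   # low order fit: sum_q q * w_q^2
# Usage: w = generalized_weight_decay_fit(Z, y, lam=0.01, gamma=gamma_low_order)
```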

Choosing a Regularizer: A Practitioner's Guide.
The perfect regularizer: constrain in the direction of the target function. But the target function is unknown (we are going around in circles).
The guiding principle: constrain in the direction of smoother (usually simpler) hypotheses. Smoothness hurts your ability to fit the high-frequency noise, and smaller weights usually mean smoother and simpler hypotheses, so choose weight decay, not weight growth.
What if you choose the wrong regularizer? You still have λ to play with: validation.

How Does Regularization Work?
Stochastic noise: nothing you can do about that. Good features help to reduce deterministic noise. Regularization helps to combat whatever noise remains, especially when N is small. Typical modus operandi: sacrifice a little bias for a huge improvement in var. VC angle: you are using a smaller H without sacrificing too much E_in.

Augmented Error as a Proxy for E_out.
E_aug(h) = E_in(h) + (λ/N) Ω(h), where Ω(h) was w^t w.
Compare the VC bound: E_out(h) ≤ E_in(h) + Ω(H), where Ω(H) was O(sqrt((d_vc/N) ln N)).
E_aug can beat E_in as a proxy for E_out; how well it does depends on the choice of λ.