Learning From Data Lecture 13 Validation and Model Selection


[Slide 1/29] Learning From Data, Lecture 13: Validation and Model Selection
The Validation Set / Model Selection / Cross Validation
M. Magdon-Ismail, CSCI 4100/6100

[Slide 2/29] recap: Regularization

Regularization combats the effects of noise by putting a leash on the algorithm:

E_aug(h) = E_in(h) + (λ/N) Ω(h)

Ω(h) favors smooth, simple h; noise is rough and complex. Different regularizers give different results, and we can choose λ, the amount of regularization.

[Figure: the same data and target fit with λ = 0, 0.0001, 0.01, 1, moving from overfitting to underfitting.]

The optimal λ balances approximation and generalization, bias and variance.

[Slide 3/29] Validation: A Sneak Peek at E_out

E_out(g) = E_in(g) + overfit penalty.

The VC analysis bounds the overfit penalty using a complexity error bar for H; regularization estimates it through a heuristic complexity penalty for g. Validation goes directly for the jugular: it estimates E_out itself.

An in-sample estimate of E_out is the Holy Grail of learning from data.

[Slide 4/29] The Test Set

Set aside a test set D_test of K points, not used for training, and let g be the final hypothesis. With e_k = e(g(x_k), y_k) for the K test points,

E_test = (1/K) Σ_{k=1}^{K} e_k.

E_test is an unbiased estimate of E_out(g): since E_{D_test}[e_k] = E_out(g),

E[E_test] = (1/K) Σ_{k=1}^{K} E[e_k] = (1/K) Σ_{k=1}^{K} E_out(g) = E_out(g).

Because e_1, ..., e_K are independent,

Var[E_test] = (1/K²) Σ_{k=1}^{K} Var[e_k] = (1/K) Var[e],

which decreases like 1/K: bigger K means a more reliable E_test.
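As a concrete illustration, here is a minimal numpy sketch of this computation for a binary error e(g(x), y) = [g(x) ≠ y]; the function name and interface are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def test_error(g, X_test, y_test):
    """E_test = (1/K) sum e_k for a fixed hypothesis g on K held-out points."""
    errors = np.array([float(g(x) != t) for x, t in zip(X_test, y_test)])
    E_test = errors.mean()                               # unbiased estimate of E_out(g)
    std_err = errors.std(ddof=1) / np.sqrt(len(errors))  # shrinks like 1/sqrt(K)
    return E_test, std_err
```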

[Slide 5/29] The Validation Set

Split D (N data points) into D_train (N − K training points) and D_val (K validation points):

1. Remove K points from D, so that D = D_train ∪ D_val.
2. Learn using D_train only, producing g⁻.
3. Test g⁻ on D_val: with e_k = e(g⁻(x_k), y_k), compute E_val = (1/K) Σ_{k=1}^{K} e_k.
4. Use the error E_val to estimate E_out(g⁻), as in the sketch below.
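The four steps translate directly into code. A hedged sketch, assuming a generic training routine learn(X, y) → g and a pointwise error err(prediction, target); both names are placeholders.

```python
import numpy as np

def validation_estimate(learn, err, X, y, K, seed=0):
    """Steps 1-4: split off K points, train g_minus on the rest, return (g_minus, E_val)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    val, train = idx[:K], idx[K:]               # 1. remove K points: D = D_train u D_val
    g_minus = learn(X[train], y[train])         # 2. learn on D_train -> g^-
    E_val = np.mean([err(g_minus(x), t)         # 3. test g^- on D_val
                     for x, t in zip(X[val], y[val])])
    return g_minus, E_val                       # 4. E_val estimates E_out(g^-)
```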

[Slide 7/29] Reliability of Validation

E_val is an estimate of E_out(g⁻): since E_{D_val}[e_k] = E_out(g⁻),

E[E_val] = (1/K) Σ_{k=1}^{K} E[e_k] = E_out(g⁻).

Because e_1, ..., e_K are independent,

Var[E_val] = (1/K²) Σ_{k=1}^{K} Var[e_k] = (1/K) Var[e(g⁻)],

which decreases like 1/K, but now depends on g⁻, not on H. So does bigger K mean a more reliable E_val? Not necessarily: a larger K leaves fewer points on which to train g⁻.

[Slide 8/29] Choosing K

[Figure: expected E_val versus the size of the validation set, K.]

Rule of thumb: K = N/5.

[Slide 9/29] Restoring D

Primary goal: output the best hypothesis. That is g, trained on all the data; g⁻ stays behind closed doors.
Secondary goal: estimate E_out(g).

So: train on D_train (N − K points) to get g⁻ and its estimate E_val(g⁻), then retrain on all of D (N points) to get g, and give the customer g. Which error estimate should accompany it, E_in(g) or E_val(g⁻)?

[Slide 10/29] E_val Versus E_in

E_out(g) ≤ E_in(g) + O(√((d_vc log N)/N))    ← biased; error bar depends on H.

E_out(g) ≤ E_out(g⁻) ≤ E_val(g⁻) + O(1/√K)   ← unbiased; error bar depends on g⁻.

The first step in the second chain, E_out(g) ≤ E_out(g⁻), holds because the learning curve is decreasing (a practical truth, not a theorem). E_val(g⁻) usually wins as an estimate of E_out(g), especially when the learning curve is not steep.

[Slide 11/29] Model Selection

The most important use of validation. Given M models H_1, H_2, ..., H_M, train each on D_train to get finalists g⁻_1, g⁻_2, ..., g⁻_M, then evaluate each on D_val.

[Slide 12-13/29] Validation Estimate for (H_1, g⁻_1)

Train H_1 on D_train to get g⁻_1 and evaluate it on D_val, giving the validation error E_val(g⁻_1). Call it E_1.

[Slide 14/29] Compute Validation Estimates for All Models

Repeat for every model: train H_m on D_train to get g⁻_m, then evaluate on D_val to get E_m, for m = 1, ..., M. The result is the validation errors E_1, E_2, ..., E_M.

[Slide 15/29] Pick the Best Model According to Validation Error

Select the model with the smallest validation error: m* = argmin_m E_m.

[Slide 16/29] E_val(g⁻_{m*}) is Not Unbiased for E_out(g⁻_{m*})

...because we chose the best of the M finalists.

[Figure: expected E_out(g⁻_{m*}) and E_val(g⁻_{m*}) versus validation set size K; the validation error of the winner is optimistically biased.]

E_out(g⁻_{m*}) ≤ E_val(g⁻_{m*}) + O(√(ln M / K)),

the VC error bar for selecting one hypothesis from M using a data set of size K.
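The ln M in the error bar comes from a Hoeffding bound plus a union bound over the M finalists, exactly as in the VC analysis of a finite hypothesis set. A sketch, assuming the pointwise error is bounded in [0, 1]:

```latex
% Hoeffding for each finalist, then a union bound over all M of them:
\Pr\Big[\max_{m}\big|E_{\mathrm{val}}(g^-_m) - E_{\mathrm{out}}(g^-_m)\big| > \epsilon\Big]
  \;\le\; 2M e^{-2\epsilon^2 K}.
% Setting the right-hand side to \delta and solving for \epsilon gives
\epsilon \;=\; \sqrt{\tfrac{1}{2K}\ln\tfrac{2M}{\delta}}
         \;=\; O\!\Big(\sqrt{\tfrac{\ln M}{K}}\Big).
```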

[Slide 17/29] Restoring D

After selecting m*, retrain H_{m*} on all of D and output g_{m*}.

Leap of faith: the model with the best g⁻ also has the best g. Granting it, we can find the model with the best g using validation; this is true modulo the E_val error bar.
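A sketch of the whole pipeline, selection plus restoring D; `models` is an assumed list of training routines, one per hypothesis set H_m, and `err` is a pointwise error, both placeholders.

```python
import numpy as np

def select_model(models, err, X_tr, y_tr, X_val, y_val, X, y):
    """Train each H_m on D_train, pick the best E_val, retrain the winner on all of D."""
    E = []
    for learn in models:                            # one training routine per H_m
        g_minus = learn(X_tr, y_tr)
        E.append(np.mean([err(g_minus(x), t)
                          for x, t in zip(X_val, y_val)]))
    m_star = int(np.argmin(E))                      # best validation error
    g_final = models[m_star](X, y)                  # restore D: retrain H_{m*} on everything
    return m_star, g_final, E[m_star]
```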

[Slide 18/29] Comparing E_in and E_val for Model Selection

[Figure: expected E_out versus validation set size K for model selection by in-sample error (output g_{m*}), by validation (output g⁻_{m*}), and by validation followed by retraining on all of D (output g_{m*}), against the optimal choice.]

Procedure: train g⁻_1, ..., g⁻_M on D_train, pick the best (H_{m*}, E_{m*}) by validation on D_val, then retrain on D to output g_{m*}.

[Slide 19/29] Application to Selecting λ

Which regularization parameter should we use out of λ_1, λ_2, ..., λ_M? This is a special case of model selection over M models,

(H, λ_1), (H, λ_2), (H, λ_3), ..., (H, λ_M) → g⁻_1, g⁻_2, g⁻_3, ..., g⁻_M,

and picking a model amounts to choosing the optimal λ.
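For linear regression with weight decay this is a few lines; a sketch using the w_reg formula from the later slide on computation, with an illustrative (assumed) grid of λ values.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Weight decay solution w_reg = (Z^T Z + lam*I)^{-1} Z^T y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def select_lambda(Z_tr, y_tr, Z_val, y_val, lambdas=(0.0, 1e-4, 1e-2, 1.0)):
    """Pick the lambda whose g^- has the smallest squared validation error."""
    E_val = [np.mean((Z_val @ ridge_fit(Z_tr, y_tr, lam) - y_val) ** 2)
             for lam in lambdas]
    return lambdas[int(np.argmin(E_val))]
```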

[Slide 20/29] The Dilemma When Choosing K

Validation relies on the following chain of reasoning:

E_out(g) ≈ E_out(g⁻) ≈ E_val(g⁻),

where the first approximation needs small K (so g⁻ is trained on almost all the data) and the second needs large K (so E_val is reliable).

[Slide 21/29] Can We Get Away With K = 1?

Yes, almost!

[Slide 22/29] The Leave-One-Out Error (K = 1)

Leave out the single point (x_1, y_1), train g_1 on the remaining N − 1 points, and set e_1 = e(g_1(x_1), y_1).

[Figure: a fit g_1 to the data with (x_1, y_1) removed, and the resulting pointwise error e_1.]

E[e_1] = E_out(g_1), but a single point gives a wild estimate.

[Slide 23/29] The Leave-One-Out Errors

Repeat for every point: leaving out (x_n, y_n) gives g_n and e_n = e(g_n(x_n), y_n).

[Figure: three such fits, each leaving out a different point, with errors e_1, e_2, e_3.]

E_cv = (1/N) Σ_{n=1}^{N} e_n.
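A minimal leave-one-out sketch, assuming a training routine learn(X, y) → g and squared pointwise error; any learner with this interface works.

```python
import numpy as np

def cross_validation_error(learn, X, y):
    """E_cv = (1/N) sum_n e(g_n(x_n), y_n), leaving out one point per round."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n            # leave (x_n, y_n) out
        g_n = learn(X[mask], y[mask])       # train on the other N - 1 points
        e[n] = (g_n(X[n]) - y[n]) ** 2      # e_n, here with squared error
    return e.mean()
```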

[Slide 24/29] Cross Validation is Unbiased

Theorem. E_cv is an unbiased estimate of Ē_out(N − 1), the expected E_out when learning with N − 1 points.
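A one-line proof sketch: the point (x_n, y_n) is statistically independent of the N − 1 points (write them D_n) that produced g_n, so

```latex
\mathbb{E}_{\mathcal{D}}[e_n]
  = \mathbb{E}_{\mathcal{D}_n}\,\mathbb{E}_{(x_n,y_n)}\big[e(g_n(x_n), y_n)\big]
  = \mathbb{E}_{\mathcal{D}_n}\big[E_{\mathrm{out}}(g_n)\big]
  = \bar{E}_{\mathrm{out}}(N-1),
% and averaging over n:
\mathbb{E}[E_{\mathrm{cv}}]
  = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[e_n]
  = \bar{E}_{\mathrm{out}}(N-1).
```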

[Slide 25/29] Reliability of E_cv

The e_n are not independent: e_m is evaluated on (x_m, y_m), while e_n depends on g_n, which was trained on (x_m, y_m).

[Figure: the effective number of fresh examples giving a comparable estimate of E_out.]

[Slide 26/29] Cross Validation is Computationally Intensive

Leave-one-out requires N rounds of learning, each on a data set of size N − 1. Two ways to lighten the load:

Analytic shortcuts. For linear regression with weight decay,

w_reg = (ZᵀZ + λI)⁻¹ Zᵀy,  H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ,

a single fit gives all the leave-one-out errors:

E_cv = (1/N) Σ_{n=1}^{N} ( (ŷ_n − y_n) / (1 − H_nn(λ)) )².

10-fold cross validation. Partition D into ten folds D_1, ..., D_10; validate on one fold while training on the other nine, and rotate through the folds.
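The analytic shortcut is short enough to write out; a sketch implementing exactly the E_cv formula above via the hat matrix H(λ).

```python
import numpy as np

def E_cv_analytic(Z, y, lam):
    """Leave-one-out error for linear regression with weight decay, from one fit."""
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)  # H(lambda)
    y_hat = H @ y
    # E_cv = (1/N) sum_n ((y_hat_n - y_n) / (1 - H_nn(lambda)))^2
    return np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)
```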

[Slide 27/29] Restoring D

Compute e_1, ..., e_N by leaving out each point (x_n, y_n) in turn, take the average to get E_cv, then retrain on all of D to output g to the customer, with E_cv as the error estimate.

E_out(g^(N)) ≈ Ē_out(N − 1) ≈ E_cv, where the first approximation relies on the learning curve (g^(N) is trained on only one more point) and the second on the e_n being nearly independent.

E_cv can be used for model selection just as E_val, for example to choose λ.
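Putting the pieces together, a usage sketch on synthetic data (the grid and shapes are illustrative assumptions, and E_cv_analytic is the sketch from the previous slide): pick λ by E_cv, then restore D by refitting on everything.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 20))              # e.g. 20-dimensional feature vectors
y = Z[:, 1] + 0.1 * rng.standard_normal(50)    # noisy linear target
lambdas = [1e-4, 1e-2, 1.0]
best = min(lambdas, key=lambda lam: E_cv_analytic(Z, y, lam))
w = np.linalg.solve(Z.T @ Z + best * np.eye(20), Z.T @ y)  # restore D: refit on all of it
```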

[Slide 28/29] Digits Problem: 1 Versus Not-1

[Figure: left, E_in, E_cv and E_out versus the number of features used; right, the digits data plotted by average intensity and symmetry, with the 1s separated from the not-1s.]

Input: x = (1, x_1, x_2), the average intensity and symmetry features. Apply the 5th-order polynomial transform:

z = (1, x_1, x_2, x_1², x_1x_2, x_2², x_1³, x_1²x_2, x_1x_2², x_2³, ..., x_1⁵, x_1⁴x_2, x_1³x_2², x_1²x_2³, x_1x_2⁴, x_2⁵),

a 20-dimensional nonlinear feature space.
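The transform is easy to mechanize; a sketch (the function name is mine) that emits the monomials x_1^i x_2^j with i + j ≤ 5 in the slide's order, bias first.

```python
import numpy as np

def poly5(x1, x2):
    """5th-order polynomial transform: all monomials x1^i * x2^j with i + j <= 5."""
    z = [1.0]                                   # bias coordinate
    for degree in range(1, 6):
        for j in range(degree + 1):
            z.append(x1 ** (degree - j) * x2 ** j)
    return np.array(z)                          # bias + 20 nonlinear features
```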

[Slide 29/29] Validation Wins in the Real World

[Figure: decision boundaries in the (average intensity, symmetry) plane for the two approaches.]

No validation (all 20 features): E_in = 0%, E_out = 2.5%.
Cross validation (6 features): E_in = 0.8%, E_out = 1.5%.