Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Resampling Methods: Cross-validation, Bootstrapping. Marek Petrik, 2/21/2017. Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

So Far in ML: regression vs. classification; linear regression; logistic regression; linear discriminant analysis and QDA; maximum likelihood.

Discriminative vs. Generative Models. Discriminative models estimate conditional models Pr[Y | X]: linear regression, logistic regression. Generative models estimate the joint probability Pr[Y, X] = Pr[Y | X] Pr[X]: they estimate not only the probability of the labels but also of the features, and once the model is fit, it can be used to generate data. Examples: LDA, QDA, Naive Bayes.

Logistic Regression. Y = 1 if default, 0 otherwise. [Figure: probability of default vs. balance, fit by linear regression (left) and by logistic regression (right).] Predict: Pr[default = yes | balance].

LDA: Linear Discriminant Analysis. Generative model: capture the probability of the predictors for each label. [Figure: class-conditional densities of the predictor for the two classes.] Predict using: 1. Pr[balance | default = yes] and Pr[default = yes]; 2. Pr[balance | default = no] and Pr[default = no]. Classes are normal: Pr[balance | default = yes] is Gaussian.

Bayes Theorem. Classification from label distributions:

\Pr[Y = k \mid X = x] = \frac{\Pr[X = x \mid Y = k] \, \Pr[Y = k]}{\Pr[X = x]}

Example:

\Pr[\text{default} = \text{yes} \mid \text{balance} = \$100] = \frac{\Pr[\text{balance} = \$100 \mid \text{default} = \text{yes}] \, \Pr[\text{default} = \text{yes}]}{\Pr[\text{balance} = \$100]}

Notation:

\Pr[Y = k \mid X = x] = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
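
To make the π_k f_k(x) notation concrete, here is a minimal sketch (in Python with SciPy, an assumption; the course material uses R) that computes the posterior with Gaussian class-conditional densities. The priors, means, and shared standard deviation are made-up illustrative values, not estimates from the Default data.

```python
# Bayes-theorem classification with Gaussian class-conditional densities
# (the 1-D LDA setup): Pr[Y=k | X=x] = pi_k f_k(x) / sum_l pi_l f_l(x).
from scipy.stats import norm

priors = {"yes": 0.03, "no": 0.97}      # pi_k: class priors (illustrative)
means = {"yes": 1800.0, "no": 800.0}    # class-conditional means of balance
sigma = 400.0                           # shared std. dev. (LDA assumption)

def posterior(x):
    """Posterior class probabilities at a single balance value x."""
    weighted = {k: priors[k] * norm.pdf(x, means[k], sigma) for k in priors}
    total = sum(weighted.values())
    return {k: v / total for k, v in weighted.items()}

print(posterior(1500.0))  # e.g. Pr[default = yes | balance = 1500]
```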

LDA with Multiple Features. Multivariate normal distributions: [Figure: two bivariate normal densities over (x_1, x_2), one with uncorrelated and one with correlated components.] Multivariate normal density (mean vector \mu, covariance matrix \Sigma):

p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)

Multivariate Classification Using LDA. Linear: the decision boundaries are linear. [Figure: two predictors X_1, X_2 with linear decision boundaries between the classes.]

QDA: Quadratic Discriminant Analysis. [Figure: decision boundaries over (X_1, X_2); QDA boundaries are quadratic rather than linear.]

Confusion Matrix: Predicting default.

                True Yes   True No   Total
Predicted Yes   a          b         a + b
Predicted No    c          d         c + d
Total           a + c      b + d     n

Result of LDA classification, predicting default if Pr[default = yes | balance] > 1/2:

                True Yes   True No   Total
Predicted Yes       81         23      104
Predicted No       252      9,644    9,896
Total              333      9,667   10,000
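
A minimal sketch of tabulating such a confusion matrix with scikit-learn (an assumption; any cross-tabulation works), using toy labels in place of the Default data. Note that scikit-learn puts true classes on the rows and predicted classes on the columns, transposed relative to the table above.

```python
from sklearn.metrics import confusion_matrix

y_true = ["yes", "yes", "no", "no", "no", "yes"]   # toy true labels
y_pred = ["yes", "no",  "no", "no", "yes", "yes"]  # toy predicted labels

# Rows: true class; columns: predicted class, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["yes", "no"])
print(cm)
```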

Today: successfully using basic machine learning methods. Problems: 1. How well is the machine learning method doing? 2. Which method is best for my problem? 3. How many features (and which ones) to use? 4. What is the uncertainty in the learned parameters? Methods: 1. Validation set. 2. Leave-one-out cross-validation. 3. k-fold cross-validation. 4. Bootstrapping.

Problem: How to Design Features? [Figure: miles per gallon vs. horsepower, with linear, degree-2, and degree-5 polynomial fits.]

Benefit of Good Features. [Figure: simulated data with fits of several flexibilities (left); mean squared error vs. flexibility (right). Gray: training error; red: test error.]

Just Use Training Data? Using more features will always reduce the training MSE, but the error on the test set will be greater. [Figure: mean squared error vs. flexibility. Gray: training error; red: test error.]

Solution 1: Validation Set. Just evaluate how well the method works on a held-out set. Randomly split the data into: 1. Training set: about half of all data. 2. Validation set (AKA hold-out set): the remaining half. Choose the number of features / the representation by minimizing the error on the validation set.
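
A minimal validation-set sketch on synthetic data (the data generator and the candidate polynomial degrees are illustrative assumptions): fit each candidate on the training half and pick the degree with the lowest validation MSE.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)

# Random half/half split into training and validation (hold-out) sets.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in range(1, 6):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    mse = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    print(degree, mse)   # choose the degree minimizing validation MSE
```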

Feature Selection Using a Validation Set. [Figure: mean squared error vs. flexibility. Gray: training error; red: test error measured on the validation set.]

Problems Using a Validation Set. 1. Highly variable (imprecise) estimates: each line shows the validation error for one possible division of the data. [Figure: validation MSE vs. degree of polynomial for one split (left) and for many different random splits (right).] 2. Only a subset of the data is used: with the validation set excluded, only about half of the data is used for training.

Solution 2: Leave-One-Out. Addresses the problems with the validation set. Split the data set into 2 parts: 1. Training: size n − 1. 2. Validation: size 1. Repeat n times to get n learning problems.

Leave-One-Out. Get n learning problems: train on n − 1 instances (blue), test on the 1 held-out instance (red), with test error

\text{MSE}_i = (y_i - \hat{y}_i)^2

LOOCV estimate:

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i
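
A minimal LOOCV sketch (scikit-learn and the synthetic data are assumptions): fit the model n times, each time holding out one observation, and average the n squared errors.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=50)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # n - 1 points
    y_hat = model.predict(X[test_idx])                           # 1 held-out point
    errors.append((y[test_idx][0] - y_hat[0]) ** 2)              # MSE_i

print(np.mean(errors))   # CV_(n) = (1/n) sum_i MSE_i
```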

Leave-One-Out vs. Validation Set. Advantages: 1. Uses almost all of the data, not just half. 2. Stable results: there is no randomness in the split. 3. Evaluation is performed with more test data. Disadvantage: it can be very computationally expensive, since it fits the model n times.

Speeding Up Leave-One-Out. 1. Solve each fit independently and distribute the computation. 2. Linear regression: solve only one linear regression using all the data, then compute the leave-one-out error as

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2

where y_i is the true value, \hat{y}_i the prediction, and h_i the leverage of data point i:

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
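
A minimal sketch of this shortcut for simple linear regression (synthetic data assumed): one least-squares fit on all the data, plus the leverages h_i, reproduces the LOOCV estimate without refitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(scale=1.0, size=50)

# Single least-squares fit on all n points.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

# Leverage of each point: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
n = len(x)
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

cv_n = np.mean(((y - y_hat) / (1.0 - h)) ** 2)   # LOOCV from one fit
print(cv_n)
```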

Solution 3: k-fold Cross-validation. A hybrid between the validation set and LOO. Split the training set into k subsets, giving k learning problems, each with: 1. Training set: n − n/k instances. 2. Test set: n/k instances. Cross-validation error:

CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i
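
A minimal k-fold sketch with k = 5 (scikit-learn and synthetic data assumed); cross_val_score reports negative MSE by convention, so the sign is flipped back.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(-scores.mean())   # CV_(k) = (1/k) sum_i MSE_i
```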

Cross-validation vs. Leave-One-Out. [Figure: mean squared error vs. degree of polynomial for LOOCV (left) and for 10-fold CV across several random fold assignments (right).]

Empirical Evaluation: 3 Examples. [Figure: mean squared error vs. flexibility on three simulated examples.] Blue: true test error; dashed: LOOCV estimate; orange: 10-fold CV estimate.

How to Choose k in CV? As k increases we have: 1. Increasing computational complexity. 2. Decreasing bias (more training data). 3. Increasing variance (bigger overlap between training sets). Empirically good values: 5-10.

Cross-validation in Classification

Logistic Regression. Predict the probability of a class: p(X). Example: p(balance) is the probability of default for a person with a given balance. Linear regression:

p(X) = \beta_0 + \beta_1 X

Logistic regression:

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}

which is the same as:

\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X

The decision boundary is linear (derive this from the log-odds above).

Features in Logistic Regression. The logistic regression decision boundary is also linear... how do we get nonlinear decision boundaries? [Figure: decision boundaries obtained with polynomial features of degree 1, 2, 3, and 4.]

Logistic Regression with Nonlinear Features. Linear:

\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X

Nonlinear odds:

\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3

Nonlinear probability:

p(X) = \frac{e^{\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3}}{1 + e^{\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3}}
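
A minimal sketch of logistic regression with polynomial features (scikit-learn and the synthetic, non-linearly-separable labels are assumptions): the pipeline expands X to (X, X^2, X^3) before the logistic fit, giving a nonlinear boundary in the original X.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = (np.abs(X[:, 0]) > 1.5).astype(int)   # not separable by a linear boundary in X

model = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression())
model.fit(X, y)
print(model.predict_proba([[0.0], [2.5]]))   # p(X) at two test points
```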

Cross-validation in Classification. Works the same as for regression, but instead of the MSE use

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{Err}_i

where the error is an indicator function: \text{Err}_i = I(y_i \neq \hat{y}_i).

K in KNN. How to decide on the right k to use in KNN? Cross-validation! [Figure: error rate vs. order of polynomial for logistic regression (left) and vs. 1/K for KNN (right).] Brown: test error; blue: training error; black: CV error.
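
A minimal sketch of selecting k by cross-validated error rate (scikit-learn and synthetic data assumed): the CV error rate is one minus the cross-validated accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

for k in [1, 3, 5, 10, 25, 50]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(k, 1.0 - acc.mean())   # CV estimate of the error rate Err
```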

Overfitting and CV. Is it possible to overfit when using cross-validation? Yes! Inferring k in KNN using cross-validation is itself a form of learning. Insightful theoretical analysis: Probably Approximately Correct (PAC) learning. Cross-validation will not overfit when learning simple concepts.

Overfitting with Cross-validation. Task: predict mpg from power. Define a new feature, for some βs:

f = \beta_0 + \beta_1 \text{power} + \beta_2 \text{power}^2 + \beta_3 \text{power}^3 + \beta_4 \text{power}^4 + \ldots

Linear regression: find α such that mpg = α f. Cross-validation: find the values of the βs. This will overfit: it gives the same solution as fitting the linear regression on the entire data set (no cross-validation).

Preventing Overfitting. Gold standard: have a test set that is used only once. This is rarely possible. The $1M Netflix prize design: 1. Publicly available training set. 2. Leaderboard results computed on a test set. 3. A private data set used to determine the final winner.

Bootstrap. Goal: understand the confidence in learned parameters. Most useful in inference: how confident are we in the learned values of β in mpg = β_0 + β_1 power? Approach: run the learning algorithm multiple times with different data sets, creating each new data set by sampling with replacement from the original one.
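
A minimal bootstrap sketch (the synthetic mpg/power stand-ins are assumptions): resample the data with replacement B times, refit the regression on each resample, and use the spread of the fitted slopes as a measure of confidence in β_1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=(100, 1))                    # stand-in for power
y = 40.0 - 0.15 * X[:, 0] + rng.normal(scale=2, size=100)  # stand-in for mpg

B, n = 1000, len(y)
slopes = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)     # sample n indices with replacement
    fit = LinearRegression().fit(X[idx], y[idx])
    slopes.append(fit.coef_[0])

print(np.mean(slopes), np.std(slopes))   # bootstrap mean and std. error of beta_1
```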

Bootstrap Illustration. [Figure: the original data set Z with three observations,

Obs   X     Y
1     4.3   2.4
2     2.1   1.1
3     5.3   2.8

is resampled with replacement to produce bootstrap data sets Z*1, Z*2, ..., Z*B; each bootstrap data set yields an estimate α̂*1, α̂*2, ..., α̂*B.]

Bootstrap Results. [Figure: histograms of the estimates of α over the range 0.3-0.9. Left: estimates from independently simulated ("true") data sets; right: bootstrap estimates from resamples of a single data set.]