Resampling Methods: Cross-validation, Bootstrapping
Marek Petrik, 2/21/2017
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
So Far in ML
- Regression vs classification
- Linear regression
- Logistic regression
- Linear discriminant analysis, QDA
- Maximum likelihood
Discriminative vs Generative Models
Discriminative models:
- Estimate conditional models Pr[Y | X]
- Linear regression, logistic regression
Generative models:
- Estimate the joint probability Pr[Y, X] = Pr[Y | X] Pr[X]
- Estimate not only the probability of the labels but also of the features
- Once the model is fit, it can be used to generate data
- LDA, QDA, Naive Bayes
Logistic Regression
Y = 1 if default, 0 otherwise
[Figure: probability of default vs balance; linear regression fit (left) vs logistic regression fit (right).]
Predict: Pr[default = yes | balance]
LDA: Linear Discriminant Analysis
Generative model: capture the probability of the predictors for each label.
[Figure: per-class normal densities and the resulting decision boundary.]
Predict:
1. Pr[balance | default = yes] and Pr[default = yes]
2. Pr[balance | default = no] and Pr[default = no]
Classes are normal: Pr[balance | default = yes]
Bayes Theorem
Classification from label distributions:
Pr[Y = k | X = x] = Pr[X = x | Y = k] Pr[Y = k] / Pr[X = x]
Example:
Pr[default = yes | balance = $100] = Pr[balance = $100 | default = yes] Pr[default = yes] / Pr[balance = $100]
Notation:
Pr[Y = k | X = x] = π_k f_k(x) / Σ_{l=1}^K π_l f_l(x)
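As a quick numerical sketch of Bayes' theorem above: all the probabilities below are made-up values for illustration, not estimates from the Default data.

```python
# Hypothetical inputs for Pr[default = yes | balance = $100].
p_x_given_yes = 0.02   # Pr[balance = $100 | default = yes]  (assumed)
p_x_given_no = 0.10    # Pr[balance = $100 | default = no]   (assumed)
p_yes = 0.03           # Pr[default = yes]                   (assumed)
p_no = 1 - p_yes

# Denominator Pr[balance = $100] via the law of total probability.
p_x = p_x_given_yes * p_yes + p_x_given_no * p_no

# Bayes' theorem: posterior probability of default.
posterior_yes = p_x_given_yes * p_yes / p_x
```

Even though balance = $100 is five times likelier under defaulters here, the small prior Pr[default = yes] keeps the posterior below 1%.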
LDA with Multiple Features
[Figure: two multivariate normal densities over (x1, x2), uncorrelated and correlated.]
Multivariate normal density (mean vector µ, covariance matrix Σ):
p(x) = (2π)^(−p/2) |Σ|^(−1/2) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))
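For a diagonal covariance the density above simplifies enough to evaluate by hand; the µ and Σ values below are illustrative, not from any data set.

```python
import math

# Bivariate normal density with diagonal covariance Sigma = diag(1, 4).
mu = [0.0, 0.0]
var = [1.0, 4.0]   # diagonal entries of Sigma (assumed values)
p = 2              # number of features

def density(x):
    det = var[0] * var[1]                     # |Sigma| for a diagonal matrix
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), term by term.
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (p / 2) * math.sqrt(det))
```

At the mean the quadratic form vanishes, so the density is just the normalizing constant 1 / (2π · √|Σ|).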
Multivariate Classification Using LDA
Linear: the decision boundaries are linear.
[Figure: LDA decision boundaries over (X1, X2) for three classes.]
QDA: Quadratic Discriminant Analysis
[Figure: quadratic decision boundaries over (X1, X2).]
Confusion Matrix: Predict default

              True
Predicted   Yes      No     Total
  Yes        a        b     a + b
  No         c        d     c + d
  Total    a + c    b + d      N

Result of LDA classification (predict default if Pr[default = yes | balance] > 1/2):

Predicted   Yes      No     Total
  Yes        81      23       104
  No        252   9,644     9,896
  Total     333   9,667    10,000
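Filling the four cells a, b, c, d is just counting label pairs; a minimal sketch with toy labels:

```python
# Toy true and predicted labels (illustrative, not the Default data).
y_true = ["yes", "no", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "no", "no", "yes", "yes"]

pairs = list(zip(y_true, y_pred))
a = sum(t == "yes" and p == "yes" for t, p in pairs)  # predicted yes, truly yes
b = sum(t == "no" and p == "yes" for t, p in pairs)   # predicted yes, truly no
c = sum(t == "yes" and p == "no" for t, p in pairs)   # predicted no, truly yes
d = sum(t == "no" and p == "no" for t, p in pairs)    # predicted no, truly no
```

The row sums give how often each label was predicted; the column sums give how often each label actually occurred.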
Today
Successfully using basic machine learning methods.
Problems:
1. How well is the machine learning method doing?
2. Which method is best for my problem?
3. How many features (and which ones) to use?
4. What is the uncertainty in the learned parameters?
Methods:
1. Validation set
2. Leave-one-out cross-validation
3. k-fold cross-validation
4. Bootstrapping
Problem: How to Design Features?
[Figure: miles per gallon vs horsepower with linear, degree-2, and degree-5 polynomial fits.]
Benefit of Good Features
[Figure: fitted curves (left) and mean squared error vs flexibility (right); gray: training error, red: test error.]
Just Use Training Data?
Using more features will always reduce the training MSE, but the error on the test set will be greater.
[Figure: mean squared error vs flexibility; gray: training error, red: test error.]
Solution 1: Validation Set
Just evaluate how well the method works on a held-out set.
Randomly split the data into:
1. Training set: about half of all data
2. Validation set (AKA hold-out set): the remaining half
Choose the number of features/representation by minimizing the error on the validation set.
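The random half/half split can be sketched in a few lines; the data here is a stand-in index list for real (x, y) pairs.

```python
import random

random.seed(0)                     # fixed seed so the split is reproducible
indices = list(range(100))         # stand-in for indices of (x, y) pairs
random.shuffle(indices)

# Roughly half for training, the rest held out for validation.
train_idx = indices[:50]
valid_idx = indices[50:]
```

Each candidate feature set would then be fit on `train_idx` and scored on `valid_idx`, keeping the candidate with the smallest validation error.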
Feature Selection Using Validation Set
[Figure: mean squared error vs flexibility; gray: training error, red: test error (validation set).]
Problems Using Validation Set
1. Highly variable (imprecise) estimates.
[Figure: validation MSE vs degree of polynomial; each line shows the validation error for one possible division of the data.]
2. Only a subset of the data is used (the validation set is excluded, so only about half of the data is used for training).
Solution 2: Leave-one-out
Addresses the problems with the validation set.
Split the data set into 2 parts:
1. Training: size n − 1
2. Validation: size 1
Repeat n times to get n learning problems.
Leave-one-out
Get n learning problems:
- Train on n − 1 instances (blue)
- Test on 1 instance (red)
MSE_i = (y_i − ŷ_i)²
LOOCV estimate:
CV(n) = (1/n) Σ_{i=1}^n MSE_i
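The n-fold loop is short enough to write out directly; as a stand-in "model" this sketch predicts the training mean, which keeps the arithmetic checkable by hand.

```python
# Brute-force LOOCV with a trivial model: predict the mean of the training set.
ys = [2.0, 4.0, 6.0, 8.0]

mses = []
for i in range(len(ys)):
    train = ys[:i] + ys[i + 1:]          # the n-1 training instances
    y_hat = sum(train) / len(train)      # "fit": the training mean
    mses.append((ys[i] - y_hat) ** 2)    # MSE_i on the one held-out point

cv_n = sum(mses) / len(mses)             # CV(n) = (1/n) * sum of MSE_i
```

A real application would replace the mean with any learner refit on each reduced training set.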
Leave-one-out vs Validation Set
Advantages:
1. Uses almost all data, not just half
2. Stable results: does not have any randomness
3. Evaluation is performed with more test data
Disadvantages:
- Can be very computationally expensive: fits the model n times
Speeding Up Leave-One-Out
1. Solve each fit independently and distribute the computation.
2. Linear regression: solve only one linear regression using all data and compute the leave-one-out error as
CV(n) = (1/n) Σ_{i=1}^n ((y_i − ŷ_i) / (1 − h_i))²
True value: y_i, prediction: ŷ_i.
h_i is the leverage of data point i:
h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²
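The leverage shortcut can be checked against brute-force refitting; this sketch does both for simple linear regression on a small made-up data set, and the two CV values agree exactly (up to rounding).

```python
# Compare the closed-form LOOCV formula with brute-force refitting.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(xs)

def fit(x, y):
    """Least-squares intercept and slope for y ~ b0 + b1 * x."""
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
          / sum((xi - xb) ** 2 for xi in x))
    return yb - b1 * xb, b1

# Shortcut: one fit on all data, residuals reweighted by leverage h_i.
b0, b1 = fit(xs, ys)
xbar = sum(xs) / n
sxx = sum((xi - xbar) ** 2 for xi in xs)
cv_short = 0.0
for xi, yi in zip(xs, ys):
    resid = yi - (b0 + b1 * xi)
    h = 1 / n + (xi - xbar) ** 2 / sxx       # leverage of this point
    cv_short += (resid / (1 - h)) ** 2
cv_short /= n

# Brute force: refit n times, each time leaving one point out.
cv_brute = 0.0
for i in range(n):
    a0, a1 = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    cv_brute += (ys[i] - (a0 + a1 * xs[i])) ** 2
cv_brute /= n
```

The agreement is not a coincidence: for least squares, the held-out residual equals the full-data residual divided by 1 − h_i, so one fit suffices.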
Solution 3: k-fold Cross-validation
Hybrid between validation set and LOO. Split the training set into k subsets:
1. Training set: size n − n/k
2. Test set: size n/k
This gives k learning problems. Cross-validation error:
CV(k) = (1/k) Σ_{i=1}^k MSE_i
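The k-fold procedure above can be sketched the same way as LOOCV, again with the training mean as a stand-in model; real implementations would shuffle before partitioning.

```python
# k-fold CV: partition into k folds, average the per-fold MSE.
k = 4
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
folds = [ys[i::k] for i in range(k)]     # simple striped partition into k folds

fold_mses = []
for i in range(k):
    test = folds[i]                                       # the n/k held-out points
    train = [y for j in range(k) if j != i for y in folds[j]]
    y_hat = sum(train) / len(train)                       # "fit": training mean
    fold_mses.append(sum((y - y_hat) ** 2 for y in test) / len(test))

cv_k = sum(fold_mses) / k                # CV(k) = (1/k) * sum of MSE_i
```

With k = n this degenerates to leave-one-out; with k = 2 it resembles a validation-set split evaluated both ways.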
Cross-validation vs Leave-One-Out
[Figure: MSE vs degree of polynomial; LOOCV estimate (left) and 10-fold CV estimates for several random splits (right).]
Empirical Evaluation: 3 Examples
[Figure: MSE vs flexibility on three simulated data sets. Blue: true error; dashed: LOOCV estimate; orange: 10-fold CV.]
How to Choose k in CV?
As k increases we have:
1. Increasing computational complexity
2. Decreasing bias (more training data)
3. Increasing variance (bigger overlap between training sets)
Empirically good values: 5-10
Cross-validation in Classification
Logistic Regression
Predict the probability of a class: p(X).
Example: p(balance) is the probability of default for a person with the given balance.
Linear regression:
p(X) = β0 + β1 X
Logistic regression:
p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))
which is the same as:
log(p(X) / (1 − p(X))) = β0 + β1 X
Linear decision boundary (derive from the log-odds).
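A short sketch of the logistic curve; the β values below are illustrative, not coefficients fitted to the Default data.

```python
import math

# Illustrative coefficients: the linear score crosses 0 at x = 2000.
b0, b1 = -10.0, 0.005

def p(x):
    """Logistic regression probability: e^z / (1 + e^z) with z = b0 + b1*x."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))
```

The probability always stays in (0, 1), and the log-odds log(p / (1 − p)) recover the linear score exactly, which is why the decision boundary p(x) = 1/2 is linear in x.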
Features in Logistic Regression
The logistic regression decision boundary is also linear... how to get non-linear decision boundaries?
[Figure: decision boundaries with polynomial features of degree 1, 2, 3, and 4.]
Logistic Regression with Nonlinear Features
Linear:
log(p(X) / (1 − p(X))) = β0 + β1 X
Nonlinear odds:
log(p(X) / (1 − p(X))) = β0 + β1 X + β2 X² + β3 X³
Nonlinear probability:
p(X) = e^(β0 + β1 X + β2 X² + β3 X³) / (1 + e^(β0 + β1 X + β2 X² + β3 X³))
Cross-validation in Classification
Works the same as for regression, but do not use MSE; instead:
CV(n) = (1/n) Σ_{i=1}^n Err_i
where the error is an indicator function: Err_i = I(y_i ≠ ŷ_i)
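The 0-1 error above is just a misclassification count; a minimal sketch with toy labels:

```python
# Indicator errors Err_i = I(y_i != y_hat_i), averaged into the CV estimate.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]   # two mistakes

errs = [int(t != p) for t, p in zip(y_true, y_pred)]
cv_err = sum(errs) / len(errs)   # fraction of misclassified points
```

In a full cross-validation, `y_pred` would come from a model refit without the corresponding point (or fold).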
K in KNN
How to decide on the right k to use in KNN? Cross-validation!
[Figure: error rate for logistic regression (vs order of polynomials used) and for KNN (vs 1/K). Brown: test error; blue: training error; black: CV error.]
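Selecting k by cross-validation can be sketched end to end for a tiny 1-D problem; the data set below is made up, with two well-separated classes so the results are easy to verify by hand.

```python
# Choose k for 1-D k-NN by leave-one-out cross-validation (toy data).
xs = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
ys = [0, 0, 0, 1, 1, 1]

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    votes = [y for _, y in sorted(train, key=lambda pt: abs(pt[0] - x))[:k]]
    return max(set(votes), key=votes.count)

def loocv_error(k):
    errors = 0
    for i in range(len(xs)):
        train = [(xs[j], ys[j]) for j in range(len(xs)) if j != i]
        errors += knn_predict(train, xs[i], k) != ys[i]
    return errors / len(xs)

best_k = min([1, 3, 5], key=loocv_error)
```

With only 6 points, k = 5 forces every held-out point to be outvoted by the opposite class, so CV correctly steers toward a smaller k.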
Overfitting and CV
Is it possible to overfit when using cross-validation? Yes!
- Inferring k in KNN using cross-validation is itself learning.
- Insightful theoretical analysis: Probably Approximately Correct (PAC) Learning
- Cross-validation will not overfit when learning simple concepts.
Overfitting with Cross-validation
Task: predict mpg from power.
Define a new feature for some βs:
f = β0 + β1 power + β2 power² + β3 power³ + β4 power⁴ + ...
Linear regression: find α such that mpg = α f.
Cross-validation: find the values of the βs.
This will overfit: it yields the same solution as running linear regression on the entire data set (no cross-validation).
Preventing Overfitting
Gold standard: have a test set that is used only once. Rarely possible.
$1M Netflix prize design:
1. Publicly available training set
2. Leaderboard results using a test set
3. Private data set used to determine the final winner
Bootstrap
Goal: understand the confidence in the learned parameters. Most useful in inference.
How confident are we in the learned values of β:
mpg = β0 + β1 power
Approach: run the learning algorithm multiple times with different data sets:
- Create a new data set by sampling with replacement from the original one.
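The resampling step can be sketched with the sample mean standing in for a learned parameter; the data values are made up for illustration.

```python
import random

# Bootstrap the sampling distribution of a statistic (here, the mean).
random.seed(42)                       # fixed seed for reproducibility
data = [2.1, 4.3, 5.3, 3.9, 4.8, 2.7, 5.1, 3.2]
B = 1000                              # number of bootstrap data sets

boot_means = []
for _ in range(B):
    # Same size as the original, sampled WITH replacement.
    sample = [random.choice(data) for _ in data]
    boot_means.append(sum(sample) / len(sample))

# The spread of the replicates estimates the uncertainty of the estimate.
mu = sum(boot_means) / B
se = (sum((m - mu) ** 2 for m in boot_means) / (B - 1)) ** 0.5
```

For the regression setting above, each bootstrap data set would be refit to produce β̂*_b, and the standard deviation of the β̂*_b values estimates the standard error of β̂.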
Bootstrap Illustration
[Figure: the original data set Z is resampled with replacement into bootstrap data sets Z*1, Z*2, ..., Z*B; each yields an estimate α̂*1, α̂*2, ..., α̂*B.]
Bootstrap Results
[Figure: histograms and boxplots of the α estimates from the true sampling distribution and from the bootstrap.]