Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

1 Resampling Methods: Cross-validation, Bootstrapping
Marek Petrik, 2/21/2017
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with Applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

2 So Far in ML
- Regression vs classification
- Linear regression
- Logistic regression
- Linear discriminant analysis, QDA
- Maximum likelihood

3 Discriminative vs Generative Models
Discriminative models:
- Estimate conditional models \(\Pr[Y \mid X]\)
- Linear regression
- Logistic regression
Generative models:
- Estimate joint probability \(\Pr[Y, X] = \Pr[Y \mid X] \Pr[X]\)
- Estimate not only the probability of labels but also of the features
- Once the model is fit, it can be used to generate data
- LDA, QDA, Naive Bayes

4 Logistic Regression
\[ Y = \begin{cases} 1 & \text{if default} \\ 0 & \text{otherwise} \end{cases} \]
[Figure: probability of default vs. balance; panels: linear regression, logistic regression]
Predict: \(\Pr[\text{default} = \text{yes} \mid \text{balance}]\)

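5 LDA: Linear Discriminant Analysis
Generative model: capture the probability of predictors for each label
Predict:
1. \(\Pr[\text{balance} \mid \text{default} = \text{yes}]\) and \(\Pr[\text{default} = \text{yes}]\)
2. \(\Pr[\text{balance} \mid \text{default} = \text{no}]\) and \(\Pr[\text{default} = \text{no}]\)
Classes are normal: \(\Pr[\text{balance} \mid \text{default} = \text{yes}]\)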

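6 Bayes Theorem
Classification from label distributions:
\[ \Pr[Y = k \mid X = x] = \frac{\Pr[X = x \mid Y = k] \, \Pr[Y = k]}{\Pr[X = x]} \]
Example:
\[ \Pr[\text{default} = \text{yes} \mid \text{balance} = \$100] = \frac{\Pr[\text{balance} = \$100 \mid \text{default} = \text{yes}] \, \Pr[\text{default} = \text{yes}]}{\Pr[\text{balance} = \$100]} \]
Notation:
\[ \Pr[Y = k \mid X = x] = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} \]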
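The Bayes theorem computation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming univariate normal class densities; the function names `normal_pdf` and `posterior` and all numeric values (priors, means, standard deviations) are hypothetical and are not taken from the slides.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, priors, means, stds):
    """Pr[Y = k | X = x] for every class k, using pi_k * f_k(x) / sum_l pi_l * f_l(x)."""
    joint = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    total = sum(joint)  # Pr[X = x], the denominator in Bayes theorem
    return [j / total for j in joint]

# Hypothetical values for illustration only: class 0 = no default, class 1 = default.
priors = [0.97, 0.03]
means = [800.0, 1750.0]
stds = [450.0, 450.0]
post = posterior(100.0, priors, means, stds)  # posterior for balance = $100
```

The list `post` holds one posterior probability per class and sums to one, because the denominator is the sum of the numerators over all classes.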

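7 LDA with Multiple Features
Multivariate Normal Distributions:
[Figure: two multivariate normal densities over \(x_1\) and \(x_2\)]
Multivariate normal distribution density (mean vector \(\mu\), covariance matrix \(\Sigma\)):
\[ p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right) \]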
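As a rough check of the density formula, the following sketch evaluates it with NumPy. The function name `mvn_density` is an assumption made for this example, and the standard normal in two dimensions is used only because its density at the mean is known to be 1/(2π).

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density with mean vector mu and covariance matrix sigma."""
    p = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed without an explicit inverse.
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * quad) / norm)

mu = np.zeros(2)
sigma = np.eye(2)
density_at_mean = mvn_density(mu, mu, sigma)  # equals 1 / (2 * pi) for this choice
```

Solving the linear system instead of inverting the covariance matrix is a common numerical choice; it is not something the slide prescribes.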

8 Multivariate Classification Using LDA
Linear: Decision boundaries are linear
[Figure: LDA decision boundaries for two features]

9 QDA: Quadratic Discriminant Analysis
[Figure: QDA decision boundaries for two features]

10 Confusion Matrix: Predict default
                    True
Predicted    Yes      No       Total
Yes          a        b        a + b
No           c        d        c + d
Total        a + c    b + d    N
Result of LDA classification: predict default if \(\Pr[\text{default} = \text{yes} \mid \text{balance}] > 1/2\)
[Table: predicted vs. true default counts (Yes, No, Total); cell values not preserved]
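The cells a, b, c, d of the confusion matrix can be counted directly from predictions and true labels. The sketch below assumes boolean labels (True = default) and uses hypothetical helper names and made-up probabilities; it applies the 1/2 threshold stated on the slide.

```python
def predict_default(probabilities, threshold=0.5):
    """Predict default whenever the estimated probability exceeds the threshold."""
    return [p > threshold for p in probabilities]

def confusion_counts(predicted, actual):
    """Return (a, b, c, d): rows are predicted yes/no, columns are true yes/no."""
    pairs = list(zip(predicted, actual))
    a = sum(1 for p, t in pairs if p and t)          # predicted yes, true yes
    b = sum(1 for p, t in pairs if p and not t)      # predicted yes, true no
    c = sum(1 for p, t in pairs if not p and t)      # predicted no, true yes
    d = sum(1 for p, t in pairs if not p and not t)  # predicted no, true no
    return a, b, c, d

# Made-up probabilities and labels, for illustration only.
probabilities = [0.9, 0.2, 0.6, 0.4, 0.1]
actual = [True, False, False, True, False]
counts = confusion_counts(predict_default(probabilities), actual)
```

The four counts always add up to N, the number of observations, which matches the bottom-right cell of the table.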

12 Today
Successfully using basic machine learning methods
Problems:
1. How well is the machine learning method doing?
2. Which method is best for my problem?
3. How many features (and which ones) to use?
4. What is the uncertainty in the learned parameters?
Methods:
1. Validation set
2. Leave-one-out cross-validation
3. k-fold cross-validation
4. Bootstrapping

13 Problem: How to design features?
[Figure: miles per gallon vs. horsepower, with linear, degree-2, and higher-degree polynomial fits]

14 Benefit of Good Features
[Figure: left panel Y vs. X; right panel mean squared error vs. flexibility. Gray: training error; red: test error]

17 Just Use Training Data?
- Using more features will always reduce the MSE on the training data
- Error on the test set will be greater
[Figure: left panel Y vs. X; right panel mean squared error vs. flexibility. Gray: training error; red: test error]

19 Solution 1: Validation Set
Just evaluate how well the method works on the test set
Randomly split data into:
1. Training set: about half of all data
2. Validation set (AKA hold-out set): remaining half
Choose the number of features/representation based on minimizing error on the validation set
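The validation set approach can be sketched as follows. This is an illustrative example under stated assumptions: the data are synthetic (a noisy quadratic), the "features" are polynomial degrees fit with `np.polyfit`, and the helper names `split_half` and `validation_mse` are hypothetical.

```python
import numpy as np

def split_half(n, rng):
    """Randomly split indices 0..n-1 into a training half and a validation half."""
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2:]

def validation_mse(x, y, train, val, degree):
    """Fit a polynomial on the training half and return the MSE on the validation half."""
    coefs = np.polyfit(x[train], y[train], degree)
    residuals = y[val] - np.polyval(coefs, x[val])
    return float(np.mean(residuals ** 2))

# Synthetic data, for illustration only: a quadratic signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.5, size=200)

train, val = split_half(len(x), rng)
errors = {d: validation_mse(x, y, train, val, d) for d in range(1, 6)}
best_degree = min(errors, key=errors.get)  # degree with the smallest validation error
```

All candidate degrees are scored on the same split, so differences in `errors` reflect the models rather than the random division of the data.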

20 Feature Selection Using Validation Set
[Figure: left panel Y vs. X; right panel mean squared error vs. flexibility. Gray: training error; red: test error (validation set)]

22 Problems using Validation Set
1. Highly variable (imprecise) estimates: each line shows the validation error for one possible division of the data
[Figure: mean squared error vs. degree of polynomial, two panels]
2. Only a subset of the data is used (the validation set is excluded, so only about half of the data is used for training)

23 Solution 2: Leave-one-out
Addresses problems with the validation set
Split the data set into 2 parts:
1. Training: size \(n - 1\)
2. Validation: size 1
Repeat n times to get n learning problems

24 Leave-one-out
Get n learning problems:
- Train on \(n - 1\) instances (blue)
- Test on 1 instance (red)
\[ \mathrm{MSE}_i = (y_i - \hat{y}_i)^2 \]
LOOCV estimate:
\[ CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i \]
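A direct implementation of the LOOCV estimate refits the model n times, once per held-out point. The sketch below assumes a polynomial model fit with `np.polyfit`; the function name `loocv_mse` and the synthetic data are hypothetical.

```python
import numpy as np

def loocv_mse(x, y, degree=1):
    """Leave-one-out CV: average of the n squared errors on the held-out points."""
    n = len(x)
    squared_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # train on all points except i
        coefs = np.polyfit(x[keep], y[keep], degree)
        prediction = np.polyval(coefs, x[i])
        squared_errors[i] = (y[i] - prediction) ** 2  # MSE_i for the single test point
    return float(squared_errors.mean())

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 3.0 * x + rng.normal(scale=0.2, size=30)
cv_n = loocv_mse(x, y, degree=1)
```

Because every point is held out exactly once, the result does not depend on any random split.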

30 Leave-one-out vs Validation Set
Advantages:
1. Using almost all data, not just half
2. Stable results: does not have any randomness
3. Evaluation is performed with more test data
Disadvantages:
- Can be very computationally expensive: fits the model n times

36 Speeding Up Leave-One-Out
1. Solve each fit independently and distribute the computation
2. Linear regression:
- Solve only one linear regression using all data
- Compute the leave-one-out error as:
\[ CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2 \]
- True value: \(y_i\), prediction: \(\hat{y}_i\)
- \(h_i\) is the leverage of data point i:
\[ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \]
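The leverage shortcut can be checked numerically against brute-force leave-one-out for simple linear regression. The following is a sketch on synthetic data; the function names are hypothetical, and `np.polyfit` stands in for whatever least-squares solver is actually used.

```python
import numpy as np

def loocv_shortcut(x, y):
    """LOOCV for simple linear regression from a single fit, using the leverages h_i."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)  # one fit on all data
    fitted = intercept + slope * x
    x_centered = x - x.mean()
    leverage = 1.0 / n + x_centered ** 2 / np.sum(x_centered ** 2)
    return float(np.mean(((y - fitted) / (1.0 - leverage)) ** 2))

def loocv_brute_force(x, y):
    """LOOCV by refitting the regression n times, for comparison."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        errors[i] = (y[i] - (intercept + slope * x[i])) ** 2
    return float(errors.mean())

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=40)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=40)
fast, slow = loocv_shortcut(x, y), loocv_brute_force(x, y)
```

For least-squares linear regression the two values agree up to floating-point rounding, while the shortcut needs one fit instead of n.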

37 Solution 3: k-fold Cross-validation
Hybrid between validation set and LOO
Split the training set into k subsets:
1. Training set: \(n - n/k\)
2. Test set: \(n/k\)
This gives k learning problems
Cross-validation error:
\[ CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i \]
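A minimal k-fold cross-validation loop might look like the sketch below. It assumes polynomial regression with `np.polyfit` on synthetic data; `make_folds` and `kfold_mse` are hypothetical helper names.

```python
import numpy as np

def make_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k folds of nearly equal size."""
    return np.array_split(rng.permutation(n), k)

def kfold_mse(x, y, degree, folds):
    """Average over the folds of the MSE on the held-out fold."""
    n = len(x)
    fold_mses = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False  # train on the other k - 1 folds
        coefs = np.polyfit(x[train_mask], y[train_mask], degree)
        residuals = y[test_idx] - np.polyval(coefs, x[test_idx])
        fold_mses.append(np.mean(residuals ** 2))
    return float(np.mean(fold_mses))

# Synthetic data, for illustration only: a quadratic signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.5, size=200)
folds = make_folds(len(x), 10, rng)
cv_errors = {d: kfold_mse(x, y, d, folds) for d in range(1, 6)}
```

With k equal to n, each fold holds a single point and the loop reduces to leave-one-out.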

38 Cross-validation vs Leave-One-Out
[Figure panels: k-fold cross-validation; leave-one-out]

39 Cross-validation vs Leave-One-Out
[Figure: mean squared error vs. degree of polynomial; panels: LOOCV, 10-fold CV]

40 Empirical Evaluation: 3 Examples
[Figure: mean squared error vs. flexibility, three panels. Blue: true error; dashed: LOOCV estimate; orange: 10-fold CV]

41 How to Choose k in CV?
As k increases we have:
1. Increasing computational complexity
2. Decreasing bias (more training data)
3. Increasing variance (bigger overlap between training sets)
Empirically good values: 5-10

42 Cross-validation in Classification

43 Logistic Regression
Predict probability of a class: \(p(X)\)
Example: \(p(\text{balance})\), the probability of default for a person with the given balance
Linear regression:
\[ p(X) = \beta_0 + \beta_1 X \]
Logistic regression:
\[ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
the same as:
\[ \log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X \]
Linear decision boundary (derive from log odds: \(p(x_1)\), \(p(x_2)\))

44 Features in Logistic Regression
Logistic regression decision boundary is also linear... non-linear decisions?
[Figure panels: Degree=1, Degree=2, Degree=3, Degree=4]

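45 Logistic Regression with Nonlinear Features
Linear:
\[ \log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X \]
Nonlinear odds:
\[ \log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \]
Nonlinear probability:
\[ p(X) = \frac{e^{\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3}}{1 + e^{\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3}} \]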
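The relationship between polynomial log odds and the class probability can be illustrated with a short function. This is a sketch only; the name `logistic_probability` and the coefficient values are hypothetical.

```python
import math

def logistic_probability(x, betas):
    """p(x) when the log odds are the polynomial betas[0] + betas[1]*x + betas[2]*x**2 + ..."""
    log_odds = sum(b * x ** j for j, b in enumerate(betas))
    # 1 / (1 + e^{-z}) is algebraically the same as e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical cubic coefficients beta_0..beta_3, for illustration only.
betas = [0.1, 0.2, 0.3, 0.4]
p = logistic_probability(1.0, betas)      # log odds are 0.1 + 0.2 + 0.3 + 0.4 = 1.0
recovered_log_odds = math.log(p / (1.0 - p))
```

Taking the log odds of the returned probability recovers the polynomial value, which is the equivalence stated on the slide.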

46 Cross-validation in Classification
Works the same as for regression
Do not use MSE but:
\[ CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i \]
Error is an indicator function: \(\mathrm{Err}_i = I(y_i \neq \hat{y}_i)\)

48 K in KNN
How to decide on the right k to use in KNN? Cross-validation!
[Figure: error rate vs. order of polynomials used (logistic regression) and error rate vs. 1/K (KNN). Brown: test error; blue: training error; black: CV error]
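Choosing k for KNN by cross-validation can be sketched as below, using the misclassification rate instead of MSE. This is an illustration under stated assumptions: one-dimensional features, 0/1 labels, synthetic data, odd values of k to avoid ties, and hypothetical helper names. It averages the error rate per fold, which matches the per-observation average when folds have equal size.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Majority vote among the k nearest training points (1-D features, labels 0/1)."""
    distances = np.abs(x_train[None, :] - x_query[:, None])
    nearest = np.argsort(distances, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)  # fraction of neighbors with label 1
    return (votes > 0.5).astype(int)

def cv_error_rate(x, y, k_neighbors, folds):
    """Cross-validated misclassification rate: mean of I(y_i != yhat_i) on held-out folds."""
    n = len(x)
    fold_errors = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        predicted = knn_predict(x[train_mask], y[train_mask], x[test_idx], k_neighbors)
        fold_errors.append(np.mean(predicted != y[test_idx]))
    return float(np.mean(fold_errors))

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
folds = np.array_split(rng.permutation(len(x)), 10)
errors = {k: cv_error_rate(x, y, k, folds) for k in (1, 5, 15, 45)}
best_k = min(errors, key=errors.get)  # k with the smallest CV error rate
```

The same folds are reused for every candidate k so that the comparison is not affected by different random splits.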

53 Overfitting and CV
Is it possible to overfit when using cross-validation? Yes!
- Inferring k in KNN using cross-validation is learning
- Insightful theoretical analysis: Probably Approximately Correct (PAC) Learning
- Cross-validation will not overfit when learning simple concepts

55 Overfitting with Cross-validation
Task: Predict mpg from power
Define a new feature for some βs:
\[ f = \beta_0 + \beta_1 \, \text{power} + \beta_2 \, \text{power}^2 + \beta_3 \, \text{power}^3 + \beta_4 \, \text{power}^4 \]
Linear regression: find α such that:
\[ \text{mpg} = \alpha f \]
Cross-validation: find the values of the βs
- Will overfit
- Same solution as using linear regression on the entire data (no cross-validation)

56 Preventing Overfitting
Gold standard: have a test set that is used only once
Rarely possible
$1M Netflix prize design:
1. Publicly available training set
2. Leader-board results using a test set
3. Private data set used to determine the final winner

59 Bootstrap
Goal: Understand the confidence in learned parameters
Most useful in inference
How confident are we in the learned values of β:
\[ \text{mpg} = \beta_0 + \beta_1 \, \text{power} \]
Approach: Run the learning algorithm multiple times with different data sets:
- Create a new data set by sampling with replacement from the original one
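The bootstrap procedure for the regression of mpg on power can be sketched as follows. The data here are synthetic stand-ins (the true intercept 40 and slope -0.15 are made-up values), and the function name `bootstrap_coefficients` is hypothetical.

```python
import numpy as np

def bootstrap_coefficients(x, y, n_boot, rng):
    """Refit the linear regression on n_boot data sets resampled with replacement."""
    n = len(x)
    estimates = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # n draws with replacement from 0..n-1
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        estimates[b] = (intercept, slope)  # columns: beta_0, beta_1
    return estimates

# Synthetic stand-in for the mpg/power data, for illustration only.
rng = np.random.default_rng(0)
power = rng.uniform(50.0, 200.0, size=100)
mpg = 40.0 - 0.15 * power + rng.normal(scale=3.0, size=100)

estimates = bootstrap_coefficients(power, mpg, 1000, rng)
std_errors = estimates.std(axis=0, ddof=1)  # bootstrap standard errors of beta_0, beta_1
```

The spread of the bootstrap estimates gives a rough measure of how much each coefficient would vary if the data were collected again.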

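60 Bootstrap Illustration
[Figure: original data Z (columns Obs, X, Y) resampled into bootstrap data sets \(Z^{*1}, Z^{*2}, \ldots, Z^{*B}\), yielding estimates \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}\)]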

61 Bootstrap Results
[Figure: distributions of the estimates of α; panels: True, Bootstrap]
