Cross Validation & Ensembling


Shan-Hung Wu, Department of Computer Science, National Tsing Hua University, Taiwan. Machine Learning.

Outline

1 Cross Validation
    How Many Folds?
2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Cross Validation

So far we have used the hold-out method for:
    Hyperparameter tuning: validation set
    Performance reporting: testing set
What if we get an unfortunate split?

K-fold cross validation:
1 Split the data set X evenly into K subsets X^(i) (called folds)
2 For i = 1, ..., K, train f_N^(i) using all data but the i-th fold (X \ X^(i))
3 Report the cross-validation error C_CV by averaging the K test errors C[f_N^(i)], each evaluated on the held-out fold X^(i)
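Below is a minimal NumPy sketch of this procedure. The train(X, y) and error(model, X, y) callables are hypothetical placeholders for whatever learner and cost C[·] are in use; they are not defined in the slides.

    import numpy as np

    def k_fold_cv_error(X, y, train, error, K=5, seed=0):
        """Return C_CV, the average held-out error over K folds."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), K)        # X^(1), ..., X^(K)
        errs = []
        for i in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != i])
            model = train(X[tr], y[tr])                           # f_N^(i): trained on X \ X^(i)
            errs.append(error(model, X[folds[i]], y[folds[i]]))   # C[f_N^(i)] on X^(i)
        return np.mean(errs)                                      # C_CV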

Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting, e.g., 5×2 nested CV:
1 Inner loop (2 folds): select the hyperparameters giving the lowest C_CV (can be wrapped by grid search)
2 Train the final model on both the training and validation sets with the selected hyperparameters
3 Outer loop (5 folds): report C_CV as the test error
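One way to realize 5×2 nested CV is with scikit-learn, as sketched below; the library calls are standard, but the SVC base model and its C grid are illustrative choices of mine, not part of the slides.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)
    inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=2)  # inner 2-fold: tune C
    scores = cross_val_score(inner, X, y, cv=5)                        # outer 5-fold: test accuracy
    print("estimated test error:", 1 - scores.mean())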


How Many Folds K? I

The cross-validation error C_CV is an average of the C[f_N^(i)]'s.
Regard each C[f_N^(i)] as an estimator of the expected generalization error E_X(C[f_N]).
C_CV is an estimator too, and we have
    MSE(C_CV) = E_X[(C_CV - E_X(C[f_N]))^2] = Var_X(C_CV) + bias(C_CV)^2

Point Estimation Revisited: Mean Square Error

Let q̂_n be an estimator of a quantity q related to a random variable x, computed from n i.i.d. samples of x.
Mean square error of q̂_n: MSE(q̂_n) = E_X[(q̂_n - q)^2]
It can be decomposed into the bias and the variance:
    E_X[(q̂_n - q)^2]
      = E[(q̂_n - E[q̂_n] + E[q̂_n] - q)^2]
      = E[(q̂_n - E[q̂_n])^2 + (E[q̂_n] - q)^2 + 2(q̂_n - E[q̂_n])(E[q̂_n] - q)]
      = E[(q̂_n - E[q̂_n])^2] + E[(E[q̂_n] - q)^2] + 2 E[q̂_n - E[q̂_n]] (E[q̂_n] - q)
      = E[(q̂_n - E[q̂_n])^2] + (E[q̂_n] - q)^2          (the cross term vanishes since E[q̂_n - E[q̂_n]] = 0)
      = Var_X(q̂_n) + bias(q̂_n)^2
The MSE of an unbiased estimator is its variance.
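The small Monte Carlo check below (my own illustration, not from the slides) recovers this identity numerically for the biased variance estimator, which divides by n rather than n - 1:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials, q = 10, 100_000, 1.0                  # q = Var(x) = 1 for x ~ N(0, 1)
    est = np.array([rng.standard_normal(n).var() for _ in range(trials)])  # q_hat_n (divides by n)

    mse = np.mean((est - q) ** 2)
    bias = est.mean() - q                            # about -q/n for this estimator
    print(mse, est.var() + bias ** 2)                # the two numbers agree up to MC noise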

Example: 5-Fold vs. 10-Fold CV

MSE(C_CV) = E_X[(C_CV - E_X(C[f_N]))^2] = Var_X(C_CV) + bias(C_CV)^2
Consider polynomial regression where y = sin(x) + ε, ε ~ N(0, σ^2).
Let C[·] be the MSE of a function's predictions against the true labels.
In the accompanying figure (not reproduced in this transcription):
    E_X(C[f_N]): red line
    bias(C_CV): gaps between the red line and the other solid lines (E_X[C_CV])
    Var_X(C_CV): shaded areas
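The rough simulation below is in the spirit of this example (my own sketch; the sample size of 30, the degree-3 polynomial, and the noise level are arbitrary choices). It draws many datasets from y = sin(x) + ε and compares the mean and spread of C_CV for K = 5 versus K = 10.

    import numpy as np

    def cv_error(x, y, K, degree=3):
        folds = np.array_split(np.random.permutation(len(x)), K)
        errs = []
        for i in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != i])
            coef = np.polyfit(x[tr], y[tr], degree)                    # polynomial regression
            errs.append(np.mean((np.polyval(coef, x[folds[i]]) - y[folds[i]]) ** 2))
        return np.mean(errs)

    np.random.seed(0)
    for K in (5, 10):
        cvs = []
        for _ in range(200):                                           # many random datasets
            x = np.random.uniform(0, 2 * np.pi, 30)
            y = np.sin(x) + 0.3 * np.random.randn(30)
            cvs.append(cv_error(x, y, K))
        print(K, np.mean(cvs), np.std(cvs))                            # E[C_CV] and its spread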

Decomposing Bias and Variance

C_CV is an estimator of the expected generalization error E_X(C[f_N]):
    MSE(C_CV) = Var_X(C_CV) + bias(C_CV)^2, where

    bias(C_CV) = E_X(C_CV) - E_X(C[f_N])
               = E[(1/K) Σ_i C[f_N^(i)]] - E(C[f_N])
               = (1/K) Σ_i E[C[f_N^(i)]] - E(C[f_N])
               = E[C[f_N^(s)]] - E(C[f_N]), ∀s
               = bias(C[f_N^(s)]), ∀s

    Var_X(C_CV) = Var[(1/K) Σ_i C[f_N^(i)]] = (1/K^2) Var[Σ_i C[f_N^(i)]]
                = (1/K^2) [Σ_i Var(C[f_N^(i)]) + 2 Σ_{i,j: j>i} Cov_X(C[f_N^(i)], C[f_N^(j)])]
                = (1/K) Var(C[f_N^(s)]) + (2/K^2) Σ_{i,j: j>i} Cov(C[f_N^(i)], C[f_N^(j)]), ∀s

How Many Folds K? II

MSE(C_CV) = Var_X(C_CV) + bias(C_CV)^2, where
    bias(C_CV) = bias(C[f_N^(s)]), ∀s
    Var_X(C_CV) = (1/K) Var(C[f_N^(s)]) + (2/K^2) Σ_{i,j: j>i} Cov(C[f_N^(i)], C[f_N^(j)]), ∀s

By learning theory, we can reduce both bias(C_CV) and Var(C_CV) by:
    choosing the right model complexity, avoiding both underfitting and overfitting;
    collecting more training examples (N).
Furthermore, we can reduce Var(C_CV) by making f_N^(i) and f_N^(j) uncorrelated.

How Many Folds K? III

With a large K, C_CV tends to have:
    low bias(C[f_N^(s)]) and Var(C[f_N^(s)]), as each f_N^(s) is trained on more examples;
    high Cov(C[f_N^(i)], C[f_N^(j)]), as the training sets X\X^(i) and X\X^(j) are more similar, so C[f_N^(i)] and C[f_N^(j)] are more positively correlated.

How Many Folds K? IV

Conversely, with a small K, the cross-validation error tends to have high bias(C[f_N^(s)]) and Var(C[f_N^(s)]) but low Cov(C[f_N^(i)], C[f_N^(j)]).
In practice, we usually set K = 5 or 10.

Leave-One-Out CV

For a very small dataset, MSE(C_CV) is dominated by bias(C[f_N^(s)]) and Var(C[f_N^(s)]), not by Cov(C[f_N^(i)], C[f_N^(j)]).
In this case we can choose K = N, which is called leave-one-out CV.
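Leave-one-out CV is available directly in scikit-learn, as sketched below; the ridge regression model and the diabetes dataset are illustrative choices of mine.

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_diabetes(return_X_y=True)
    scores = cross_val_score(Ridge(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")   # one fold per example (K = N)
    print("LOOCV MSE:", -scores.mean())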


Ensemble Methods

No free lunch theorem: there is no single ML algorithm that always outperforms the others in all domains/tasks.
Can we combine multiple base-learners to improve
    applicability across different domains, and/or
    generalization performance in a specific task?
These are the goals of ensemble learning.
How do we combine multiple base-learners?


Voting

Voting: linearly combine the predictions of the base-learners for each x:
    ỹ_k = Σ_j w_j ŷ_k^(j), where w_j ≥ 0 and Σ_j w_j = 1
If all learners are given equal weight w_j = 1/L, we have the plurality vote (the multi-class version of the majority vote).

Voting rules and their formulas:
    Sum:           ỹ_k = (1/L) Σ_{j=1}^{L} ŷ_k^(j)
    Weighted sum:  ỹ_k = Σ_j w_j ŷ_k^(j), with w_j ≥ 0 and Σ_j w_j = 1
    Median:        ỹ_k = median_j ŷ_k^(j)
    Minimum:       ỹ_k = min_j ŷ_k^(j)
    Maximum:       ỹ_k = max_j ŷ_k^(j)
    Product:       ỹ_k = Π_j ŷ_k^(j)
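The NumPy sketch below applies these combination rules to a toy preds array of shape (L, K), one row of per-class scores ŷ^(j) per base-learner; the numbers are made up for illustration.

    import numpy as np

    preds = np.array([[0.7, 0.3],            # scores from learner 1 over K = 2 classes
                      [0.6, 0.4],
                      [0.2, 0.8]])           # L = 3 base-learners
    w = np.array([0.5, 0.3, 0.2])            # w_j >= 0, sum to 1

    print("sum:         ", preds.mean(axis=0))
    print("weighted sum:", w @ preds)
    print("median:      ", np.median(preds, axis=0))
    print("minimum:     ", preds.min(axis=0))
    print("maximum:     ", preds.max(axis=0))
    print("product:     ", preds.prod(axis=0))
    print("plurality:   ", np.bincount(preds.argmax(axis=1), minlength=2))  # votes per class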

Why Voting Works? I

Assume that each ŷ^(j) has expected value E_X[ŷ^(j) | x] and variance Var_X[ŷ^(j) | x].
When w_j = 1/L, we have:
    E_X(ỹ | x) = E[(1/L) Σ_j ŷ^(j) | x] = (1/L) Σ_j E[ŷ^(j) | x] = E[ŷ^(j) | x]
    Var_X(ỹ | x) = Var[(1/L) Σ_j ŷ^(j) | x] = (1/L^2) Var[Σ_j ŷ^(j) | x]
                 = (1/L) Var(ŷ^(j) | x) + (2/L^2) Σ_{i,j: i<j} Cov(ŷ^(i), ŷ^(j) | x)
The expected value doesn't change, so the bias doesn't change.

Why Voting Works? II

If the ŷ^(i)'s and ŷ^(j)'s are uncorrelated, the variance is reduced.
Unfortunately, the ŷ^(j)'s may not be i.i.d. in practice: if the voters are positively correlated, the variance increases.
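A quick numeric check of this effect (my own illustration): average L predictors of unit variance that are either uncorrelated or pairwise correlated with covariance 0.5, and compare the empirical variance of the vote with (1/L) Var + (2/L^2) Σ Cov.

    import numpy as np

    rng = np.random.default_rng(0)
    L, n = 5, 200_000
    indep = rng.standard_normal((n, L))                       # uncorrelated y_hat^(j), Var = 1 each
    shared = rng.standard_normal((n, 1))
    corr = np.sqrt(0.5) * shared + np.sqrt(0.5) * rng.standard_normal((n, L))  # pairwise Cov = 0.5

    print(indep.mean(axis=1).var())   # about 1/L = 0.2
    print(corr.mean(axis=1).var())    # about 1/L + (L-1)/L * 0.5 = 0.6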


Bagging

Bagging (short for bootstrap aggregating) is a voting method, but the base-learners are made different deliberately.
How? Why not train them on slightly different training sets?
1 Generate L slightly different samples from the given sample by the bootstrap: given X of size N, draw N points randomly from X with replacement to get X^(j).
    It is possible that some instances are drawn more than once and some are not drawn at all.
2 Train a base-learner on each X^(j).
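A minimal sketch of bagging for classification follows, assuming the same hypothetical train(X, y) callable as in the K-fold sketch above and models that expose a predict method; class labels are assumed to be non-negative integers so that plurality voting can use bincount.

    import numpy as np

    def bagging_fit(X, y, train, L=10, seed=0):
        """Train L base-learners, each on a bootstrap sample X^(j) of size N."""
        rng = np.random.default_rng(seed)
        N = len(X)
        return [train(X[idx], y[idx])
                for idx in (rng.integers(0, N, size=N) for _ in range(L))]  # draw with replacement

    def bagging_predict(models, X):
        votes = np.stack([m.predict(X) for m in models])                    # shape (L, n_test)
        return np.array([np.bincount(col).argmax() for col in votes.T])     # plurality vote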


Boosting

In bagging, generating uncorrelated base-learners is left to chance and to the instability of the learning method.
In boosting, we generate complementary base-learners. How? Why not train the next learner on the mistakes of the previous learners?
For simplicity, consider binary classification here: d^(j)(x) ∈ {1, -1}.
The original boosting algorithm combines three weak learners to generate a strong learner:
    a weak learner has error probability less than 1/2 (better than random guessing);
    a strong learner has arbitrarily small error probability.

Original Boosting Algorithm

Training:
1 Given a large training set, randomly divide it into three parts X^(1), X^(2), X^(3)
2 Use X^(1) to train the first learner d^(1), and feed X^(2) to d^(1)
3 Use the points of X^(2) misclassified by d^(1) to train d^(2), then feed X^(3) to d^(1) and d^(2)
4 Use the points on which d^(1) and d^(2) disagree to train d^(3)

Testing:
1 Feed a point x to d^(1) and d^(2) first; if their outputs agree, use that as the final prediction
2 Otherwise, take the output of d^(3)

Example

Assuming X^(1), X^(2), and X^(3) are the same (illustrative figure not reproduced in this transcription).
Disadvantage: the algorithm requires a large training set to afford the three-way split.

AdaBoost

AdaBoost uses the same training set over and over again.
How can we make some points count more? Modify the probabilities of drawing the instances as a function of error.
Notation:
    Pr^(i,j): probability that example (x^(i), y^(i)) is drawn to train the j-th base-learner d^(j)
    ε^(j) = Σ_i Pr^(i,j) 1(y^(i) ≠ d^(j)(x^(i))): error rate of d^(j) on its training set

Algorithm

Training:
1 Initialize Pr^(i,1) = 1/N for all i
2 Starting from j = 1:
    1 Randomly draw N examples from X with probabilities Pr^(i,j) and use them to train d^(j)
    2 Stop adding new base-learners if ε^(j) ≥ 1/2
    3 Define α_j = (1/2) log((1 - ε^(j)) / ε^(j)) > 0 and set Pr^(i,j+1) = Pr^(i,j) exp(-α_j y^(i) d^(j)(x^(i))) for all i
    4 Normalize the Pr^(i,j+1), ∀i, by multiplying them by 1 / Σ_i Pr^(i,j+1)

Testing:
1 Given x, calculate ŷ^(j) = d^(j)(x) for all j
2 Make the final prediction ỹ by weighted voting: ỹ = Σ_j α_j d^(j)(x)
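The sketch below follows this loop, using scikit-learn decision stumps as the weak learners d^(j) (an assumption of mine, not specified in the slides) and labels y in {-1, +1}; the error is clipped away from zero to keep the log well defined.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, L=20, seed=0):
        """y must take values in {-1, +1}. Returns the weak learners d^(j) and weights alpha_j."""
        rng = np.random.default_rng(seed)
        N = len(X)
        p = np.full(N, 1.0 / N)                          # Pr^(i,1) = 1/N
        learners, alphas = [], []
        for _ in range(L):
            idx = rng.choice(N, size=N, p=p)             # draw N examples with probabilities Pr^(i,j)
            d = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
            pred = d.predict(X)
            eps = max(np.sum(p * (pred != y)), 1e-12)    # epsilon^(j), clipped to avoid log(1/0)
            if eps >= 0.5:                               # stop adding new base-learners
                break
            alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_j > 0
            p = p * np.exp(-alpha * y * pred)            # Pr^(i,j+1)
            p /= p.sum()                                 # normalize
            learners.append(d)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        f = sum(a * d.predict(X) for a, d in zip(alphas, learners))
        return np.sign(f)                                # sign of the weighted vote sum_j alpha_j d^(j)(x)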

Example

d^(j+1) complements d^(j) and d^(j-1) by focusing on the predictions on which they disagree.
The voting weights α_j = (1/2) log((1 - ε^(j)) / ε^(j)) grow with each base-learner's accuracy: the smaller the error ε^(j), the larger α_j.


Why AdaBoost Works

Why does AdaBoost improve performance? By increasing model complexity? Not exactly.
Empirical studies show that AdaBoost keeps reducing the test error as L grows, even after the training error reaches zero, rather than overfitting.
AdaBoost increases the margin [1, 2].

Margin as Confidence of Predictions

Recall that in SVC, a larger margin improves generalizability, due to higher-confidence predictions over the training examples.
We can define a margin for AdaBoost similarly. In binary classification, define the margin of the prediction for an example (x^(i), y^(i)) ∈ X as:
    margin(x^(i), y^(i)) = y^(i) f(x^(i)) = Σ_{j: y^(i) = d^(j)(x^(i))} α_j - Σ_{j: y^(i) ≠ d^(j)(x^(i))} α_j

Margin Distribution

Margin distribution over θ:
    Pr_X(y^(i) f(x^(i)) ≤ θ) = |{(x^(i), y^(i)) : y^(i) f(x^(i)) ≤ θ}| / |X|
A complementary learner:
    clarifies low-confidence areas;
    increases the margin of the points in these areas.
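Given the learners and alphas from the AdaBoost sketch above (hypothetical names from that sketch), the margins and their empirical distribution can be computed directly from these definitions; margins are often additionally normalized by Σ_j α_j so that they lie in [-1, 1].

    import numpy as np

    def margins(learners, alphas, X, y):
        """margin(x_i, y_i) = y_i * f(x_i): alpha-weighted votes for the true label minus those against it."""
        f = sum(a * d.predict(X) for a, d in zip(alphas, learners))
        return y * f

    def margin_distribution(m, theta):
        """Empirical Pr_X(y f(x) <= theta): fraction of training points with margin at most theta."""
        return np.mean(m <= theta)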

References

[1] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14:1612.
[2] Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. On the margin explanation of boosting algorithms. In COLT, 2008.


More information

Model Averaging With Holdout Estimation of the Posterior Distribution

Model Averaging With Holdout Estimation of the Posterior Distribution Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Ensemble Learning in the Presence of Noise

Ensemble Learning in the Presence of Noise Universidad Autónoma de Madrid Master s Thesis Ensemble Learning in the Presence of Noise Author: Maryam Sabzevari Supervisors: Dr. Gonzalo Martínez Muñoz, Dr. Alberto Suárez González Submitted in partial

More information

Fundamentals of Machine Learning. Mohammad Emtiyaz Khan EPFL Aug 25, 2015

Fundamentals of Machine Learning. Mohammad Emtiyaz Khan EPFL Aug 25, 2015 Fundamentals of Machine Learning Mohammad Emtiyaz Khan EPFL Aug 25, 25 Mohammad Emtiyaz Khan 24 Contents List of concepts 2 Course Goals 3 2 Regression 4 3 Model: Linear Regression 7 4 Cost Function: MSE

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Monte Carlo Theory as an Explanation of Bagging and Boosting

Monte Carlo Theory as an Explanation of Bagging and Boosting Monte Carlo Theory as an Explanation of Bagging and Boosting Roberto Esposito Università di Torino C.so Svizzera 185, Torino, Italy esposito@di.unito.it Lorenza Saitta Università del Piemonte Orientale

More information

http://imgs.xkcd.com/comics/electoral_precedent.png Statistical Learning Theory CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell Chapter 7 (not 7.4.4 and 7.5)

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information