Cross Validation & Ensembling


Shan-Hung Wu, Department of Computer Science, National Tsing Hua University, Taiwan. Machine Learning.

Outline

1 Cross Validation
    How Many Folds?
2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Cross Validation

So far we have used the hold-out method for:
    Hyperparameter tuning: validation set
    Performance reporting: testing set
What if we get an unfortunate split?

K-fold cross validation:
1 Split the data set X evenly into K subsets X^(i) (called folds)
2 For i = 1, ..., K, train f_N^(i) using all data but the i-th fold (X \ X^(i))
3 Report the cross-validation error C_CV by averaging the K test errors C[f_N^(i)], each evaluated on the held-out fold X^(i)
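Below is a minimal NumPy sketch of this procedure. The train(X, y) and error(model, X, y) callables are hypothetical placeholders for whatever learner and cost C[·] are in use; they are not defined in the slides.

    import numpy as np

    def k_fold_cv_error(X, y, train, error, K=5, seed=0):
        """Return C_CV, the average held-out error over K folds."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), K)        # X^(1), ..., X^(K)
        errs = []
        for i in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != i])
            model = train(X[tr], y[tr])                           # f_N^(i): trained on X \ X^(i)
            errs.append(error(model, X[folds[i]], y[folds[i]]))   # C[f_N^(i)] on X^(i)
        return np.mean(errs)                                      # C_CV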

Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting, e.g., 5×2 nested CV:
1 Inner loop (2 folds): select the hyperparameters giving the lowest C_CV (can be wrapped by grid search)
2 Train the final model on both the training and validation sets with the selected hyperparameters
3 Outer loop (5 folds): report C_CV as the test error
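One way to realize 5×2 nested CV is with scikit-learn, as sketched below; the library calls are standard, but the SVC base model and its C grid are illustrative choices of mine, not part of the slides.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)
    inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=2)  # inner 2-fold: tune C
    scores = cross_val_score(inner, X, y, cv=5)                        # outer 5-fold: test accuracy
    print("estimated test error:", 1 - scores.mean())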


How Many Folds K? I

The cross-validation error C_CV is an average of the C[f_N^(i)]'s.
Regard each C[f_N^(i)] as an estimator of the expected generalization error E_X(C[f_N]).
C_CV is an estimator too, and we have
    MSE(C_CV) = E_X[(C_CV - E_X(C[f_N]))^2] = Var_X(C_CV) + bias(C_CV)^2

Point Estimation Revisited: Mean Square Error

Let q̂_n be an estimator of a quantity q related to a random variable x, computed from n i.i.d. samples of x.
Mean square error of q̂_n: MSE(q̂_n) = E_X[(q̂_n - q)^2]
It can be decomposed into the bias and the variance:
    E_X[(q̂_n - q)^2]
      = E[(q̂_n - E[q̂_n] + E[q̂_n] - q)^2]
      = E[(q̂_n - E[q̂_n])^2 + (E[q̂_n] - q)^2 + 2(q̂_n - E[q̂_n])(E[q̂_n] - q)]
      = E[(q̂_n - E[q̂_n])^2] + E[(E[q̂_n] - q)^2] + 2 E[q̂_n - E[q̂_n]] (E[q̂_n] - q)
      = E[(q̂_n - E[q̂_n])^2] + (E[q̂_n] - q)^2          (the cross term vanishes since E[q̂_n - E[q̂_n]] = 0)
      = Var_X(q̂_n) + bias(q̂_n)^2
The MSE of an unbiased estimator is its variance.
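The small Monte Carlo check below (my own illustration, not from the slides) recovers this identity numerically for the biased variance estimator, which divides by n rather than n - 1:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials, q = 10, 100_000, 1.0                  # q = Var(x) = 1 for x ~ N(0, 1)
    est = np.array([rng.standard_normal(n).var() for _ in range(trials)])  # q_hat_n (divides by n)

    mse = np.mean((est - q) ** 2)
    bias = est.mean() - q                            # about -q/n for this estimator
    print(mse, est.var() + bias ** 2)                # the two numbers agree up to MC noise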

Example: 5-Fold vs. 10-Fold CV

MSE(C_CV) = E_X[(C_CV - E_X(C[f_N]))^2] = Var_X(C_CV) + bias(C_CV)^2
Consider polynomial regression where y = sin(x) + ε, ε ~ N(0, σ^2).
Let C[·] be the MSE of a function's predictions against the true labels.
In the accompanying figure (not reproduced in this transcription):
    E_X(C[f_N]): red line
    bias(C_CV): gaps between the red line and the other solid lines (E_X[C_CV])
    Var_X(C_CV): shaded areas
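The rough simulation below is in the spirit of this example (my own sketch; the sample size of 30, the degree-3 polynomial, and the noise level are arbitrary choices). It draws many datasets from y = sin(x) + ε and compares the mean and spread of C_CV for K = 5 versus K = 10.

    import numpy as np

    def cv_error(x, y, K, degree=3):
        folds = np.array_split(np.random.permutation(len(x)), K)
        errs = []
        for i in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != i])
            coef = np.polyfit(x[tr], y[tr], degree)                    # polynomial regression
            errs.append(np.mean((np.polyval(coef, x[folds[i]]) - y[folds[i]]) ** 2))
        return np.mean(errs)

    np.random.seed(0)
    for K in (5, 10):
        cvs = []
        for _ in range(200):                                           # many random datasets
            x = np.random.uniform(0, 2 * np.pi, 30)
            y = np.sin(x) + 0.3 * np.random.randn(30)
            cvs.append(cv_error(x, y, K))
        print(K, np.mean(cvs), np.std(cvs))                            # E[C_CV] and its spread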

Decomposing Bias and Variance

C_CV is an estimator of the expected generalization error E_X(C[f_N]):
    MSE(C_CV) = Var_X(C_CV) + bias(C_CV)^2, where

    bias(C_CV) = E_X(C_CV) - E_X(C[f_N])
               = E[(1/K) Σ_i C[f_N^(i)]] - E(C[f_N])
               = (1/K) Σ_i E[C[f_N^(i)]] - E(C[f_N])
               = E[C[f_N^(s)]] - E(C[f_N]), ∀s
               = bias(C[f_N^(s)]), ∀s

    Var_X(C_CV) = Var[(1/K) Σ_i C[f_N^(i)]] = (1/K^2) Var[Σ_i C[f_N^(i)]]
                = (1/K^2) [Σ_i Var(C[f_N^(i)]) + 2 Σ_{i,j: j>i} Cov_X(C[f_N^(i)], C[f_N^(j)])]
                = (1/K) Var(C[f_N^(s)]) + (2/K^2) Σ_{i,j: j>i} Cov(C[f_N^(i)], C[f_N^(j)]), ∀s

How Many Folds K? II

MSE(C_CV) = Var_X(C_CV) + bias(C_CV)^2, where
    bias(C_CV) = bias(C[f_N^(s)]), ∀s
    Var_X(C_CV) = (1/K) Var(C[f_N^(s)]) + (2/K^2) Σ_{i,j: j>i} Cov(C[f_N^(i)], C[f_N^(j)]), ∀s

By learning theory, we can reduce both bias(C_CV) and Var(C_CV) by:
    choosing the right model complexity, avoiding both underfitting and overfitting;
    collecting more training examples (N).
Furthermore, we can reduce Var(C_CV) by making f_N^(i) and f_N^(j) uncorrelated.

How Many Folds K? III

With a large K, C_CV tends to have:
    low bias(C[f_N^(s)]) and Var(C[f_N^(s)]), as each f_N^(s) is trained on more examples;
    high Cov(C[f_N^(i)], C[f_N^(j)]), as the training sets X\X^(i) and X\X^(j) are more similar, so C[f_N^(i)] and C[f_N^(j)] are more positively correlated.

How Many Folds K? IV

Conversely, with a small K, the cross-validation error tends to have high bias(C[f_N^(s)]) and Var(C[f_N^(s)]) but low Cov(C[f_N^(i)], C[f_N^(j)]).
In practice, we usually set K = 5 or 10.

Leave-One-Out CV

For a very small dataset, MSE(C_CV) is dominated by bias(C[f_N^(s)]) and Var(C[f_N^(s)]), not by Cov(C[f_N^(i)], C[f_N^(j)]).
In this case we can choose K = N, which is called leave-one-out CV.
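Leave-one-out CV is available directly in scikit-learn, as sketched below; the ridge regression model and the diabetes dataset are illustrative choices of mine.

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_diabetes(return_X_y=True)
    scores = cross_val_score(Ridge(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")   # one fold per example (K = N)
    print("LOOCV MSE:", -scores.mean())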


Ensemble Methods

No free lunch theorem: there is no single ML algorithm that always outperforms the others in all domains/tasks.
Can we combine multiple base-learners to improve
    applicability across different domains, and/or
    generalization performance in a specific task?
These are the goals of ensemble learning.
How do we combine multiple base-learners?


Voting

Voting: linearly combine the predictions of the base-learners for each x:
    ỹ_k = Σ_j w_j ŷ_k^(j), where w_j ≥ 0 and Σ_j w_j = 1
If all learners are given equal weight w_j = 1/L, we have the plurality vote (the multi-class version of the majority vote).

Voting rules and their formulas:
    Sum:           ỹ_k = (1/L) Σ_{j=1}^{L} ŷ_k^(j)
    Weighted sum:  ỹ_k = Σ_j w_j ŷ_k^(j), with w_j ≥ 0 and Σ_j w_j = 1
    Median:        ỹ_k = median_j ŷ_k^(j)
    Minimum:       ỹ_k = min_j ŷ_k^(j)
    Maximum:       ỹ_k = max_j ŷ_k^(j)
    Product:       ỹ_k = Π_j ŷ_k^(j)
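The NumPy sketch below applies these combination rules to a toy preds array of shape (L, K), one row of per-class scores ŷ^(j) per base-learner; the numbers are made up for illustration.

    import numpy as np

    preds = np.array([[0.7, 0.3],            # scores from learner 1 over K = 2 classes
                      [0.6, 0.4],
                      [0.2, 0.8]])           # L = 3 base-learners
    w = np.array([0.5, 0.3, 0.2])            # w_j >= 0, sum to 1

    print("sum:         ", preds.mean(axis=0))
    print("weighted sum:", w @ preds)
    print("median:      ", np.median(preds, axis=0))
    print("minimum:     ", preds.min(axis=0))
    print("maximum:     ", preds.max(axis=0))
    print("product:     ", preds.prod(axis=0))
    print("plurality:   ", np.bincount(preds.argmax(axis=1), minlength=2))  # votes per class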

Why Voting Works? I

Assume that each ŷ^(j) has expected value E_X[ŷ^(j) | x] and variance Var_X[ŷ^(j) | x].
When w_j = 1/L, we have:
    E_X(ỹ | x) = E[(1/L) Σ_j ŷ^(j) | x] = (1/L) Σ_j E[ŷ^(j) | x] = E[ŷ^(j) | x]
    Var_X(ỹ | x) = Var[(1/L) Σ_j ŷ^(j) | x] = (1/L^2) Var[Σ_j ŷ^(j) | x]
                 = (1/L) Var(ŷ^(j) | x) + (2/L^2) Σ_{i,j: i<j} Cov(ŷ^(i), ŷ^(j) | x)
The expected value doesn't change, so the bias doesn't change.

Why Voting Works? II

If the ŷ^(i)'s and ŷ^(j)'s are uncorrelated, the variance is reduced.
Unfortunately, the ŷ^(j)'s may not be i.i.d. in practice: if the voters are positively correlated, the variance increases.
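A quick numeric check of this effect (my own illustration): average L predictors of unit variance that are either uncorrelated or pairwise correlated with covariance 0.5, and compare the empirical variance of the vote with (1/L) Var + (2/L^2) Σ Cov.

    import numpy as np

    rng = np.random.default_rng(0)
    L, n = 5, 200_000
    indep = rng.standard_normal((n, L))                       # uncorrelated y_hat^(j), Var = 1 each
    shared = rng.standard_normal((n, 1))
    corr = np.sqrt(0.5) * shared + np.sqrt(0.5) * rng.standard_normal((n, L))  # pairwise Cov = 0.5

    print(indep.mean(axis=1).var())   # about 1/L = 0.2
    print(corr.mean(axis=1).var())    # about 1/L + (L-1)/L * 0.5 = 0.6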


Bagging

Bagging (short for bootstrap aggregating) is a voting method, but the base-learners are made different deliberately.
How? Why not train them on slightly different training sets?
1 Generate L slightly different samples from the given sample by the bootstrap: given X of size N, draw N points randomly from X with replacement to get X^(j).
    It is possible that some instances are drawn more than once and some are not drawn at all.
2 Train a base-learner on each X^(j).
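A minimal sketch of bagging for classification follows, assuming the same hypothetical train(X, y) callable as in the K-fold sketch above and models that expose a predict method; class labels are assumed to be non-negative integers so that plurality voting can use bincount.

    import numpy as np

    def bagging_fit(X, y, train, L=10, seed=0):
        """Train L base-learners, each on a bootstrap sample X^(j) of size N."""
        rng = np.random.default_rng(seed)
        N = len(X)
        return [train(X[idx], y[idx])
                for idx in (rng.integers(0, N, size=N) for _ in range(L))]  # draw with replacement

    def bagging_predict(models, X):
        votes = np.stack([m.predict(X) for m in models])                    # shape (L, n_test)
        return np.array([np.bincount(col).argmax() for col in votes.T])     # plurality vote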


Boosting

In bagging, generating uncorrelated base-learners is left to chance and to the instability of the learning method.
In boosting, we generate complementary base-learners. How? Why not train the next learner on the mistakes of the previous learners?
For simplicity, consider binary classification here: d^(j)(x) ∈ {1, -1}.
The original boosting algorithm combines three weak learners to generate a strong learner:
    a weak learner has error probability less than 1/2 (better than random guessing);
    a strong learner has arbitrarily small error probability.

Original Boosting Algorithm

Training:
1 Given a large training set, randomly divide it into three parts X^(1), X^(2), X^(3)
2 Use X^(1) to train the first learner d^(1), and feed X^(2) to d^(1)
3 Use the points of X^(2) misclassified by d^(1) to train d^(2), then feed X^(3) to d^(1) and d^(2)
4 Use the points on which d^(1) and d^(2) disagree to train d^(3)

Testing:
1 Feed a point x to d^(1) and d^(2) first; if their outputs agree, use that as the final prediction
2 Otherwise, take the output of d^(3)

Example

Assuming X^(1), X^(2), and X^(3) are the same (illustrative figure not reproduced in this transcription).
Disadvantage: the algorithm requires a large training set to afford the three-way split.

AdaBoost

AdaBoost uses the same training set over and over again.
How can we make some points count more? Modify the probabilities of drawing the instances as a function of error.
Notation:
    Pr^(i,j): probability that example (x^(i), y^(i)) is drawn to train the j-th base-learner d^(j)
    ε^(j) = Σ_i Pr^(i,j) 1(y^(i) ≠ d^(j)(x^(i))): error rate of d^(j) on its training set

Algorithm

Training:
1 Initialize Pr^(i,1) = 1/N for all i
2 Starting from j = 1:
    1 Randomly draw N examples from X with probabilities Pr^(i,j) and use them to train d^(j)
    2 Stop adding new base-learners if ε^(j) ≥ 1/2
    3 Define α_j = (1/2) log((1 - ε^(j)) / ε^(j)) > 0 and set Pr^(i,j+1) = Pr^(i,j) exp(-α_j y^(i) d^(j)(x^(i))) for all i
    4 Normalize the Pr^(i,j+1), ∀i, by multiplying them by 1 / Σ_i Pr^(i,j+1)

Testing:
1 Given x, calculate ŷ^(j) = d^(j)(x) for all j
2 Make the final prediction ỹ by weighted voting: ỹ = Σ_j α_j d^(j)(x)
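The sketch below follows this loop, using scikit-learn decision stumps as the weak learners d^(j) (an assumption of mine, not specified in the slides) and labels y in {-1, +1}; the error is clipped away from zero to keep the log well defined.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, L=20, seed=0):
        """y must take values in {-1, +1}. Returns the weak learners d^(j) and weights alpha_j."""
        rng = np.random.default_rng(seed)
        N = len(X)
        p = np.full(N, 1.0 / N)                          # Pr^(i,1) = 1/N
        learners, alphas = [], []
        for _ in range(L):
            idx = rng.choice(N, size=N, p=p)             # draw N examples with probabilities Pr^(i,j)
            d = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
            pred = d.predict(X)
            eps = max(np.sum(p * (pred != y)), 1e-12)    # epsilon^(j), clipped to avoid log(1/0)
            if eps >= 0.5:                               # stop adding new base-learners
                break
            alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_j > 0
            p = p * np.exp(-alpha * y * pred)            # Pr^(i,j+1)
            p /= p.sum()                                 # normalize
            learners.append(d)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        f = sum(a * d.predict(X) for a, d in zip(alphas, learners))
        return np.sign(f)                                # sign of the weighted vote sum_j alpha_j d^(j)(x)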

Example

d^(j+1) complements d^(j) and d^(j-1) by focusing on the predictions on which they disagree.
The voting weights α_j = (1/2) log((1 - ε^(j)) / ε^(j)) grow with each base-learner's accuracy: the smaller the error ε^(j), the larger α_j.


Why AdaBoost Works

Why does AdaBoost improve performance? By increasing model complexity? Not exactly.
Empirical studies show that AdaBoost keeps reducing the test error as L grows, even after the training error reaches zero, rather than overfitting.
AdaBoost increases the margin [1, 2].

Margin as Confidence of Predictions

Recall that in SVC, a larger margin improves generalizability, due to higher-confidence predictions over the training examples.
We can define a margin for AdaBoost similarly. In binary classification, define the margin of the prediction for an example (x^(i), y^(i)) ∈ X as:
    margin(x^(i), y^(i)) = y^(i) f(x^(i)) = Σ_{j: y^(i) = d^(j)(x^(i))} α_j - Σ_{j: y^(i) ≠ d^(j)(x^(i))} α_j

Margin Distribution

Margin distribution over θ:
    Pr_X(y^(i) f(x^(i)) ≤ θ) = |{(x^(i), y^(i)) : y^(i) f(x^(i)) ≤ θ}| / |X|
A complementary learner:
    clarifies low-confidence areas;
    increases the margin of the points in these areas.
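Given the learners and alphas from the AdaBoost sketch above (hypothetical names from that sketch), the margins and their empirical distribution can be computed directly from these definitions; margins are often additionally normalized by Σ_j α_j so that they lie in [-1, 1].

    import numpy as np

    def margins(learners, alphas, X, y):
        """margin(x_i, y_i) = y_i * f(x_i): alpha-weighted votes for the true label minus those against it."""
        f = sum(a * d.predict(X) for a, d in zip(alphas, learners))
        return y * f

    def margin_distribution(m, theta):
        """Empirical Pr_X(y f(x) <= theta): fraction of training points with margin at most theta."""
        return np.mean(m <= theta)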

References

[1] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14:1612.
[2] Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. On the margin explanation of boosting algorithms. In COLT, 2008.


More information

Model Averaging With Holdout Estimation of the Posterior Distribution

Model Averaging With Holdout Estimation of the Posterior Distribution Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Ensemble Learning in the Presence of Noise

Ensemble Learning in the Presence of Noise Universidad Autónoma de Madrid Master s Thesis Ensemble Learning in the Presence of Noise Author: Maryam Sabzevari Supervisors: Dr. Gonzalo Martínez Muñoz, Dr. Alberto Suárez González Submitted in partial

More information

Fundamentals of Machine Learning. Mohammad Emtiyaz Khan EPFL Aug 25, 2015

Fundamentals of Machine Learning. Mohammad Emtiyaz Khan EPFL Aug 25, 2015 Fundamentals of Machine Learning Mohammad Emtiyaz Khan EPFL Aug 25, 25 Mohammad Emtiyaz Khan 24 Contents List of concepts 2 Course Goals 3 2 Regression 4 3 Model: Linear Regression 7 4 Cost Function: MSE

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Monte Carlo Theory as an Explanation of Bagging and Boosting

Monte Carlo Theory as an Explanation of Bagging and Boosting Monte Carlo Theory as an Explanation of Bagging and Boosting Roberto Esposito Università di Torino C.so Svizzera 185, Torino, Italy esposito@di.unito.it Lorenza Saitta Università del Piemonte Orientale

More information

http://imgs.xkcd.com/comics/electoral_precedent.png Statistical Learning Theory CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell Chapter 7 (not 7.4.4 and 7.5)

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information