Cross Validation & Ensembling
Shan-Hung Wu, Department of Computer Science, National Tsing Hua University, Taiwan (Machine Learning course slides).
Outline
1. Cross Validation: How Many Folds?
2. Ensemble Methods: Voting, Bagging, Boosting, Why AdaBoost Works?
Cross Validation

So far, we have used the holdout method for:
- Hyperparameter tuning: validation set
- Performance reporting: testing set
What if we get an unfortunate split?

K-fold cross validation:
1. Split the data set X evenly into K subsets X^(i) (called folds)
2. For i = 1, ..., K, train f_N^(i) using all data but the i-th fold (X \ X^(i))
3. Report the cross-validation error C_CV by averaging the test errors C[f_N^(i)] measured on the held-out folds X^(i)
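The three steps above can be sketched in pure Python. This is a minimal illustration, not the slides' code; the helper names `train_fn` and `error_fn` (a training procedure and a test-error function) are assumptions introduced here.

```python
import random

def k_fold_cv(X, y, train_fn, error_fn, K=5, seed=0):
    """Estimate generalization error by K-fold cross validation.

    train_fn(Xtr, ytr) returns a predictor f; error_fn(f, Xte, yte)
    returns f's test error on the held-out fold.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)        # step 1: random, even split
    folds = [idx[i::K] for i in range(K)]
    errors = []
    for i in range(K):                      # step 2: hold out the i-th fold
        test = set(folds[i])
        Xtr = [X[j] for j in idx if j not in test]
        ytr = [y[j] for j in idx if j not in test]
        f = train_fn(Xtr, ytr)              # f_N^(i), trained on X \ X^(i)
        errors.append(error_fn(f, [X[j] for j in folds[i]],
                               [y[j] for j in folds[i]]))
    return sum(errors) / K                  # step 3: C_CV = average test error

# Toy example: a model that predicts the training mean, squared-error loss.
X = list(range(20))
y = [float(v % 3) for v in X]
train_mean = lambda Xtr, ytr: sum(ytr) / len(ytr)
sq_err = lambda m, Xte, yte: sum((v - m) ** 2 for v in yte) / len(yte)
c_cv = k_fold_cv(X, y, train_mean, sq_err, K=5)
```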
Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting, e.g., 5x2 nested CV:
1. Inner loop (2 folds): select the hyperparameters giving the lowest C_CV; can be wrapped by grid search
2. Train the final model on both the training and validation sets with the selected hyperparameters
3. Outer loop (5 folds): report C_CV as the test error
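The 5x2 scheme can be sketched as two nested loops. Again a hedged illustration rather than the slides' implementation: `make_train_fn` (a factory mapping a hyperparameter to a training procedure) and the shrunken-mean toy model are assumptions made up for this example.

```python
import random

def cv_error(X, y, train_fn, error_fn, K, seed=0):
    # Plain K-fold CV error (C_CV) of one learning procedure on (X, y).
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::K] for i in range(K)]
    errs = []
    for i in range(K):
        test = set(folds[i])
        tr = [j for j in idx if j not in test]
        f = train_fn([X[j] for j in tr], [y[j] for j in tr])
        errs.append(error_fn(f, [X[j] for j in folds[i]],
                             [y[j] for j in folds[i]]))
    return sum(errs) / K

def nested_cv(X, y, make_train_fn, error_fn, grid, outer=5, inner=2):
    # Outer loop reports the test error; the inner loop, wrapped by a
    # grid search over `grid`, selects the hyperparameter.
    idx = list(range(len(X)))
    random.Random(1).shuffle(idx)
    folds = [idx[i::outer] for i in range(outer)]
    outer_errs = []
    for i in range(outer):
        test = set(folds[i])
        tr = [j for j in idx if j not in test]
        Xtr, ytr = [X[j] for j in tr], [y[j] for j in tr]
        best_h = min(grid, key=lambda h: cv_error(Xtr, ytr, make_train_fn(h),
                                                  error_fn, inner))
        f = make_train_fn(best_h)(Xtr, ytr)   # retrain on all non-test data
        outer_errs.append(error_fn(f, [X[j] for j in folds[i]],
                                   [y[j] for j in folds[i]]))
    return sum(outer_errs) / outer

# Toy model family: predict lam * (training mean); lam is the hyperparameter.
make_shrunk_mean = lambda lam: (lambda Xtr, ytr: lam * sum(ytr) / len(ytr))
sq_err = lambda m, Xte, yte: sum((v - m) ** 2 for v in yte) / len(yte)

y = [2.0 + 0.1 * (i % 5) for i in range(30)]
err = nested_cv(y, y, make_shrunk_mean, sq_err, grid=[0.5, 0.9, 1.0])
```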
How Many Folds K? I

The cross-validation error C_CV is an average of the C[f_N^(i)]'s. Regard each C[f_N^(i)] as an estimator of the expected generalization error E_X(C[f_N]). Then C_CV is an estimator too, and

$$\mathrm{MSE}(C_{CV}) = E_X\big[(C_{CV} - E_X(C[f_N]))^2\big] = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$$
Point Estimation Revisited: Mean Square Error

Let $\hat{\theta}_n$ be an estimator of a quantity $\theta$ related to a random variable x, computed from n i.i.d. samples of x. The mean square error of $\hat{\theta}_n$ is $\mathrm{MSE}(\hat{\theta}_n) = E_X[(\hat{\theta}_n - \theta)^2]$, which decomposes into bias and variance:

$$\begin{aligned}
E_X[(\hat{\theta}_n - \theta)^2]
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n] + E[\hat{\theta}_n] - \theta)^2\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2 + (E[\hat{\theta}_n] - \theta)^2
   + 2(\hat{\theta}_n - E[\hat{\theta}_n])(E[\hat{\theta}_n] - \theta)\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + (E[\hat{\theta}_n] - \theta)^2
   + 2\,E\big[\hat{\theta}_n - E[\hat{\theta}_n]\big](E[\hat{\theta}_n] - \theta) \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + (E[\hat{\theta}_n] - \theta)^2 \\
&= \mathrm{Var}_X(\hat{\theta}_n) + \mathrm{bias}(\hat{\theta}_n)^2,
\end{aligned}$$

where the cross term vanishes because $E[\hat{\theta}_n - E[\hat{\theta}_n]] = 0$. The MSE of an unbiased estimator is therefore its variance.
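The decomposition can be checked numerically. The sketch below (an illustration invented here, not from the slides) uses a deliberately biased estimator of a uniform distribution's mean and verifies that the empirical MSE equals empirical variance plus squared empirical bias.

```python
import random
import statistics

rng = random.Random(42)
n, trials, theta = 30, 5000, 2.0     # x ~ Uniform(0, 4), so the true mean is 2

# A deliberately biased estimator: shrink the sample mean toward 0 by 0.8.
estimates = []
for _ in range(trials):
    xs = [rng.uniform(0.0, 4.0) for _ in range(n)]
    estimates.append(0.8 * sum(xs) / n)

mse = sum((t - theta) ** 2 for t in estimates) / trials
var = statistics.pvariance(estimates)    # spread of the estimator
bias = sum(estimates) / trials - theta   # E[estimator] = 1.6, so bias ~ -0.4
```

Over the empirical distribution of estimates, mse = var + bias² holds exactly (up to floating point), mirroring the derivation above.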
Example: 5-Fold vs. 10-Fold CV

$$\mathrm{MSE}(C_{CV}) = E_X\big[(C_{CV} - E_X(C[f_N]))^2\big] = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$$

Consider polynomial regression where $y = \sin(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and let C[.] be the MSE of a function's predictions against the true labels. In the accompanying figure (not reproduced here):
- $E_X(C[f_N])$: the red line
- $\mathrm{bias}(C_{CV})$: the gaps between the red line and the other solid lines ($E_X[C_{CV}]$)
- $\mathrm{Var}_X(C_{CV})$: the shaded areas
Decomposing Bias and Variance

C_CV is an estimator of the expected generalization error E_X(C[f_N]), with $\mathrm{MSE}(C_{CV}) = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$, where

$$\begin{aligned}
\mathrm{bias}(C_{CV}) &= E_X(C_{CV}) - E_X(C[f_N])
 = E\Big[\tfrac{1}{K}\textstyle\sum_i C[f_N^{(i)}]\Big] - E(C[f_N]) \\
&= \tfrac{1}{K}\textstyle\sum_i E\big[C[f_N^{(i)}]\big] - E(C[f_N])
 = E\big[C[f_N^{(s)}]\big] - E(C[f_N]), \ \forall s \\
&= \mathrm{bias}\big(C[f_N^{(s)}]\big), \ \forall s
\end{aligned}$$

$$\begin{aligned}
\mathrm{Var}_X(C_{CV}) &= \mathrm{Var}\Big(\tfrac{1}{K}\textstyle\sum_i C[f_N^{(i)}]\Big)
 = \tfrac{1}{K^2}\mathrm{Var}\Big(\textstyle\sum_i C[f_N^{(i)}]\Big) \\
&= \tfrac{1}{K^2}\Big(\textstyle\sum_i \mathrm{Var}\big(C[f_N^{(i)}]\big)
 + 2\sum_{i,j,\,j>i} \mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big)\Big) \\
&= \tfrac{1}{K}\mathrm{Var}\big(C[f_N^{(s)}]\big)
 + \tfrac{2}{K^2}\textstyle\sum_{i,j,\,j>i} \mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big), \ \forall s
\end{aligned}$$

(The fold index s is arbitrary because the folds are identically distributed.)
How Many Folds K? II

We can reduce both $\mathrm{bias}(C_{CV})$ and $\mathrm{Var}(C_{CV})$ by applying learning theory:
- choosing the right model complexity, avoiding both underfitting and overfitting
- collecting more training examples (larger N)
Furthermore, we can reduce $\mathrm{Var}(C_{CV})$ by making $f_N^{(i)}$ and $f_N^{(j)}$ uncorrelated.
How Many Folds K? III

With a large K, C_CV tends to have:
- low $\mathrm{bias}(C[f_N^{(s)}])$ and $\mathrm{Var}(C[f_N^{(s)}])$, since each $f_N^{(s)}$ is trained on more examples
- high $\mathrm{Cov}(C[f_N^{(i)}], C[f_N^{(j)}])$, since the training sets X \ X^(i) and X \ X^(j) overlap more, making $C[f_N^{(i)}]$ and $C[f_N^{(j)}]$ more positively correlated
How Many Folds K? IV

Conversely, with a small K, the cross-validation error tends to have high $\mathrm{bias}(C[f_N^{(s)}])$ and $\mathrm{Var}(C[f_N^{(s)}])$ but low $\mathrm{Cov}(C[f_N^{(i)}], C[f_N^{(j)}])$. In practice, we usually set K = 5 or 10.
Leave-One-Out CV

For a very small dataset, $\mathrm{MSE}(C_{CV})$ is dominated by $\mathrm{bias}(C[f_N^{(s)}])$ and $\mathrm{Var}(C[f_N^{(s)}])$ rather than by $\mathrm{Cov}(C[f_N^{(i)}], C[f_N^{(j)}])$. In this case we can choose K = N, which is called leave-one-out CV.
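Leave-one-out CV is just K-fold CV where each example is its own fold. A minimal sketch (the mean-predictor toy model is an assumption made up for illustration):

```python
def loo_cv(X, y, train_fn, error_fn):
    # Leave-one-out CV: K = N, so the i-th "fold" is the single example i.
    N = len(X)
    errs = []
    for i in range(N):
        Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        f = train_fn(Xtr, ytr)                  # trained on all but example i
        errs.append(error_fn(f, [X[i]], [y[i]]))
    return sum(errs) / N

# Toy example: predict the training mean, squared-error loss.
y = [1.0, 2.0, 3.0, 4.0]
c_loo = loo_cv(y, y,
               lambda Xt, yt: sum(yt) / len(yt),
               lambda m, Xe, ye: (ye[0] - m) ** 2)
```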
Ensemble Methods

No free lunch theorem: there is no single ML algorithm that always outperforms the others in all domains/tasks. Can we combine multiple base-learners to improve:
- applicability across different domains, and/or
- generalization performance in a specific task?
These are the goals of ensemble learning. How do we combine multiple base-learners?
Voting

Voting: linearly combine the predictions of the base-learners for each x:

$$\tilde{y}_k = \sum_j w_j\, \hat{y}_k^{(j)}, \qquad w_j \ge 0,\ \sum_j w_j = 1$$

If all learners are given equal weight $w_j = 1/L$, we have the plurality vote (the multi-class version of majority vote).

Voting rules:
- Sum: $\tilde{y}_k = \frac{1}{L}\sum_{j=1}^{L} \hat{y}_k^{(j)}$
- Weighted sum: $\tilde{y}_k = \sum_j w_j \hat{y}_k^{(j)}$, with $w_j \ge 0$, $\sum_j w_j = 1$
- Median: $\tilde{y}_k = \mathrm{median}_j\, \hat{y}_k^{(j)}$
- Minimum: $\tilde{y}_k = \min_j \hat{y}_k^{(j)}$
- Maximum: $\tilde{y}_k = \max_j \hat{y}_k^{(j)}$
- Product: $\tilde{y}_k = \prod_j \hat{y}_k^{(j)}$
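The six rules in the table can be computed directly from a matrix of per-class scores. The scores below are hypothetical numbers for three base-learners on one two-class input, chosen only to illustrate the rules:

```python
import math
from statistics import median

# Per-class scores y_hat_k^(j) from L = 3 base-learners for one x
# (rows: learners j, columns: classes k). Hypothetical values.
yhat = [[0.6, 0.4],
        [0.7, 0.3],
        [0.4, 0.6]]
L = len(yhat)
w = [1 / L] * L                                    # equal weights -> plurality vote

cols = list(zip(*yhat))                            # per-class score tuples
sum_rule    = [sum(c) / L for c in cols]
weighted    = [sum(wj * s for wj, s in zip(w, c)) for c in cols]
median_rule = [median(c) for c in cols]
min_rule    = [min(c) for c in cols]
max_rule    = [max(c) for c in cols]
prod_rule   = [math.prod(c) for c in cols]
winner = max(range(len(sum_rule)), key=lambda k: sum_rule[k])   # class 0 here
```

With equal weights, the sum and weighted-sum rules coincide, and here every rule favors class 0.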
Why Voting Works? I

Assume that each $\hat{y}^{(j)}$ has expected value $E_X(\hat{y}^{(j)}\,|\,x)$ and variance $\mathrm{Var}_X(\hat{y}^{(j)}\,|\,x)$. When $w_j = 1/L$, we have:

$$E_X(\tilde{y}\,|\,x) = E\Big(\tfrac{1}{L}\textstyle\sum_j \hat{y}^{(j)}\,\Big|\,x\Big)
 = \tfrac{1}{L}\textstyle\sum_j E\big(\hat{y}^{(j)}\,|\,x\big) = E\big(\hat{y}^{(j)}\,|\,x\big)$$

$$\mathrm{Var}_X(\tilde{y}\,|\,x) = \mathrm{Var}\Big(\tfrac{1}{L}\textstyle\sum_j \hat{y}^{(j)}\,\Big|\,x\Big)
 = \tfrac{1}{L^2}\mathrm{Var}\Big(\textstyle\sum_j \hat{y}^{(j)}\,\Big|\,x\Big)
 = \tfrac{1}{L}\mathrm{Var}\big(\hat{y}^{(j)}\,|\,x\big)
 + \tfrac{2}{L^2}\textstyle\sum_{i,j,\,i<j}\mathrm{Cov}\big(\hat{y}^{(i)}, \hat{y}^{(j)}\,|\,x\big)$$

The expected value doesn't change, so the bias doesn't change.
Why Voting Works? II

$$\mathrm{Var}_X(\tilde{y}\,|\,x) = \tfrac{1}{L}\mathrm{Var}\big(\hat{y}^{(j)}\,|\,x\big)
 + \tfrac{2}{L^2}\textstyle\sum_{i,j,\,i<j}\mathrm{Cov}\big(\hat{y}^{(i)}, \hat{y}^{(j)}\,|\,x\big)$$

If $\hat{y}^{(i)}$ and $\hat{y}^{(j)}$ are uncorrelated, the variance is reduced. Unfortunately, the $\hat{y}^{(j)}$'s may not be i.i.d. in practice: if the voters are positively correlated, the variance increases.
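The effect of correlation on the vote's variance can be checked by simulation. In this made-up construction, each learner's prediction mixes a shared noise source (correlation rho) with its own noise, so every individual prediction has variance 1; theory then gives Var(vote) = 1/L for rho = 0 and 1/L + rho(L-1)/L for rho > 0:

```python
import random
import statistics

rng = random.Random(0)
L, trials = 5, 20000

def avg_prediction(rho):
    # Each learner's prediction shares a common noise component with
    # pairwise correlation rho; every individual prediction has variance 1.
    shared = rng.gauss(0.0, 1.0)
    preds = [rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
             for _ in range(L)]
    return sum(preds) / L                  # the equally-weighted vote

var_uncorr = statistics.pvariance([avg_prediction(0.0) for _ in range(trials)])
var_corr   = statistics.pvariance([avg_prediction(0.9) for _ in range(trials)])
# Theory: 1/L = 0.2 when uncorrelated; 0.2 + 0.9 * 4/5 = 0.92 at rho = 0.9.
```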
Bagging

Bagging (short for bootstrap aggregating) is a voting method, but the base-learners are deliberately made different. How? Train them on slightly different training sets:
1. Generate L slightly different samples from the given sample by bootstrapping: given X of size N, draw N points randomly from X with replacement to get X^(j). Some instances may be drawn more than once, and some not at all.
2. Train a base-learner on each X^(j).
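Both steps, plus the final vote, fit in a few lines. A hedged sketch: the threshold-stump base-learner (`train_stump`) is a toy model invented for this example, not part of the slides.

```python
import random

def bagging(X, y, train_fn, L=10, seed=0):
    """Train L base-learners, each on a bootstrap resample of (X, y)."""
    rng = random.Random(seed)
    N = len(X)
    learners = []
    for _ in range(L):
        idx = [rng.randrange(N) for _ in range(N)]   # N draws WITH replacement
        learners.append(train_fn([X[j] for j in idx], [y[j] for j in idx]))
    return learners

def vote(learners, x):
    # Plurality vote over the base-learners' {-1, +1} predictions.
    return 1 if sum(f(x) for f in learners) >= 0 else -1

# Toy base-learner: a decision stump on 1-D inputs that splits at the
# bootstrap sample's mean (a deliberately crude, unstable learner).
def train_stump(Xb, yb):
    t = sum(Xb) / len(Xb)
    return lambda x, t=t: 1 if x >= t else -1

X = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [-1, -1, -1, 1, 1, 1]
ens = bagging(X, y, train_stump, L=25)
```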
Boosting

In bagging, generating uncorrelated base-learners is left to chance and to the instability of the learning method. In boosting, we generate complementary base-learners. How? Train the next learner on the mistakes of the previous learners.

For simplicity, consider binary classification here: $d^{(j)}(x) \in \{1, -1\}$. The original boosting algorithm combines three weak learners to generate a strong learner:
- A weak learner has error probability less than 1/2 (better than random guessing)
- A strong learner has arbitrarily small error probability
Original Boosting Algorithm

Training:
1. Given a large training set, randomly divide it into three sets X^(1), X^(2), X^(3)
2. Use X^(1) to train the first learner d^(1), then feed X^(2) to d^(1)
3. Use the points of X^(2) misclassified by d^(1) to train d^(2); then feed X^(3) to d^(1) and d^(2)
4. Use the points on which d^(1) and d^(2) disagree to train d^(3)

Testing:
1. Feed a point x to d^(1) and d^(2) first; if their outputs agree, use that as the final prediction
2. Otherwise, take the output of d^(3)
Example (in the accompanying figure, X^(1), X^(2), and X^(3) are assumed to be the same). Disadvantage: the algorithm requires a large training set to afford the three-way split.
AdaBoost

AdaBoost uses the same training set over and over again. How do we make some points count more? Modify the probabilities of drawing the instances as a function of error.

Notation:
- $\Pr^{(i,j)}$: probability that example $(x^{(i)}, y^{(i)})$ is drawn to train the j-th base-learner $d^{(j)}$
- $\epsilon^{(j)} = \sum_i \Pr^{(i,j)}\, 1\big(y^{(i)} \ne d^{(j)}(x^{(i)})\big)$: error rate of $d^{(j)}$ on its training set
Algorithm

Training:
1. Initialize $\Pr^{(i,1)} = \frac{1}{N}$ for all i
2. Starting from j = 1:
   a. Randomly draw N examples from X with probabilities $\Pr^{(i,j)}$ and use them to train $d^{(j)}$
   b. Stop adding new base-learners if $\epsilon^{(j)} \ge \frac{1}{2}$
   c. Define $\alpha_j = \frac{1}{2}\log\frac{1-\epsilon^{(j)}}{\epsilon^{(j)}} > 0$ and set $\Pr^{(i,j+1)} = \Pr^{(i,j)} \exp\big(-\alpha_j\, y^{(i)} d^{(j)}(x^{(i)})\big)$ for all i
   d. Normalize the $\Pr^{(i,j+1)}$'s by dividing each by $\sum_i \Pr^{(i,j+1)}$

Testing:
1. Given x, calculate $\hat{y}^{(j)}$ for all j
2. Make the final prediction $\tilde{y}$ by weighted voting: $\tilde{y} = \sum_j \alpha_j\, d^{(j)}(x)$
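The training loop above can be sketched compactly. One hedged deviation from the slides: instead of resampling N points by the probabilities, this sketch passes the weights directly to a weighted weak learner (the common deterministic "reweighting" variant); the exhaustive threshold-stump weak learner is a toy assumption.

```python
import math

def adaboost(X, y, train_weak, L=3):
    """AdaBoost (reweighting variant). y[i] in {-1, +1};
    train_weak(X, y, p) returns a weak classifier d^(j)."""
    N = len(X)
    p = [1.0 / N] * N                       # Pr^(i,1) = 1/N
    learners, alphas = [], []
    for _ in range(L):
        d = train_weak(X, y, p)
        eps = sum(pi for pi, xi, yi in zip(p, X, y) if d(xi) != yi)
        if eps >= 0.5:                      # stop: no better than chance
            break
        a = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
        p = [pi * math.exp(-a * yi * d(xi)) for pi, xi, yi in zip(p, X, y)]
        Z = sum(p)
        p = [pi / Z for pi in p]            # renormalize to a distribution
        learners.append(d)
        alphas.append(a)
    return lambda x: 1 if sum(a * d(x) for a, d in zip(alphas, learners)) >= 0 else -1

# Weak learner: the best weighted threshold stump on 1-D inputs.
def train_stump(X, y, p):
    best = None
    for t in sorted(set(X)):
        for s in (1, -1):
            err = sum(pi for pi, xi, yi in zip(p, X, y)
                      if (s if xi >= t else -s) != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    _, t, s = best
    return lambda x, t=t, s=s: (s if x >= t else -s)

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, -1, -1, 1, 1]                    # not separable by any single stump
f = adaboost(X, y, train_stump, L=3)
train_err = sum(f(xi) != yi for xi, yi in zip(X, y)) / len(X)
```

No single stump classifies this pattern, yet three boosted stumps reach zero training error, illustrating how each learner complements the previous ones.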
Example

$d^{(j+1)}$ complements $d^{(j)}$ and $d^{(j-1)}$ by focusing on the predictions they disagree on. The voting weights $\alpha_j = \frac{1}{2}\log\frac{1-\epsilon^{(j)}}{\epsilon^{(j)}}$ grow with the base-learner's accuracy (i.e., shrink as $\epsilon^{(j)}$ grows).
Why AdaBoost Works

Why does AdaBoost improve performance? By increasing model complexity? Not exactly. Empirical studies show that AdaBoost keeps reducing overfitting as L grows, even after the training error reaches zero. The explanation: AdaBoost increases the margin [1, 2].
Margin as Confidence of Predictions

Recall that in SVC a larger margin improves generalizability, due to higher-confidence predictions over the training examples. We can define a margin for AdaBoost similarly. In binary classification, define the margin of the prediction for an example $(x^{(i)}, y^{(i)}) \in X$ as:

$$\mathrm{margin}(x^{(i)}, y^{(i)}) = y^{(i)} f(x^{(i)})
 = \sum_{j\,:\,y^{(i)} = d^{(j)}(x^{(i)})} \alpha_j
 \;-\; \sum_{j\,:\,y^{(i)} \ne d^{(j)}(x^{(i)})} \alpha_j$$
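The margin of one example is a one-liner given the weighted vote. The alpha values and base-learner outputs below are hypothetical numbers, chosen only to show the two equivalent forms of the definition:

```python
# Hypothetical ensemble: three base-learners with weights alpha_j,
# and their {-1, +1} predictions d^(j)(x) on one example.
alphas = [0.8, 0.5, 0.3]
preds = [1, -1, 1]
y_true = 1

f_x = sum(a * d for a, d in zip(alphas, preds))   # weighted vote f(x)
margin = y_true * f_x
# Equivalent form: (sum of alphas of agreeing learners)
#                - (sum of alphas of disagreeing learners) = 0.8 - 0.5 + 0.3
```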
Margin Distribution

The margin distribution over $\theta$:

$$\Pr_X\big(y^{(i)} f(x^{(i)}) \le \theta\big)
 = \frac{\big|\{(x^{(i)}, y^{(i)}) : y^{(i)} f(x^{(i)}) \le \theta\}\big|}{|X|}$$

A complementary learner clarifies low-confidence areas, increasing the margin of the points in those areas.
References

[1] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.

[2] Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. On the margin explanation of boosting algorithms. In COLT, 2008.
Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected
More informationCS7267 MACHINE LEARNING
CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning
More informationHypothesis Evaluation
Hypothesis Evaluation Machine Learning Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall 1395 1 / 31 Table of contents 1 Introduction
More informationVariance Reduction and Ensemble Methods
Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis
More informationThe AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m
) Set W () i The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m m =.5 1 n W (m 1) i y i h(x i ; 2 ˆθ
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationDecision Trees: Overfitting
Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9
More informationLecture 13: Ensemble Methods
Lecture 13: Ensemble Methods Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu 1 / 71 Outline 1 Bootstrap
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationAn Introduction to Statistical Machine Learning - Theoretical Aspects -
An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,
More informationEnsembles of Classifiers.
Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting
More informationNeural Networks and Ensemble Methods for Classification
Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationBoosting & Deep Learning
Boosting & Deep Learning Ensemble Learning n So far learning methods that learn a single hypothesis, chosen form a hypothesis space that is used to make predictions n Ensemble learning à select a collection
More informationChapter 14 Combining Models
Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients
More informationBias-Variance in Machine Learning
Bias-Variance in Machine Learning Bias-Variance: Outline Underfitting/overfitting: Why are complex hypotheses bad? Simple example of bias/variance Error as bias+variance for regression brief comments on
More informationData Mining und Maschinelles Lernen
Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting
More informationAlgorithm-Independent Learning Issues
Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 9 Learning Classifiers: Bagging & Boosting Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you
More informationRegularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline
Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement
More informationBig Data Analytics. Special Topics for Computer Science CSE CSE Feb 24
Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal
More informationAnnouncements Kevin Jamieson
Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationPDEEC Machine Learning 2016/17
PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /
More informationI D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69
R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationEnsemble Methods and Random Forests
Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization
More informationCOMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017
COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University BOOSTING Robert E. Schapire and Yoav
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate
More informationEnsemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan
Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite
More informationEnsembles. Léon Bottou COS 424 4/8/2010
Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationBoosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi
Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier
More information1/sqrt(B) convergence 1/B convergence B
The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been
More informationAdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague
AdaBoost Lecturer: Jan Šochman Authors: Jan Šochman, Jiří Matas Center for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Motivation Presentation 2/17 AdaBoost with trees
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-
More information8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7]
8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7] While cross-validation allows one to find the weight penalty parameters which would give the model good generalization capability, the separation of
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationData Warehousing & Data Mining
13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.
More informationcxx ab.ec Warm up OH 2 ax 16 0 axtb Fix any a, b, c > What is the x 2 R that minimizes ax 2 + bx + c
Warm up D cai.yo.ie p IExrL9CxsYD Sglx.Ddl f E Luo fhlexi.si dbll Fix any a, b, c > 0. 1. What is the x 2 R that minimizes ax 2 + bx + c x a b Ta OH 2 ax 16 0 x 1 Za fhkxiiso3ii draulx.h dp.d 2. What is
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More informationOutline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual
Outline: Ensemble Learning We will describe and investigate algorithms to Ensemble Learning Lecture 10, DD2431 Machine Learning A. Maki, J. Sullivan October 2014 train weak classifiers/regressors and how
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationCombining Classifiers
Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/
More informationIntroduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research
Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:
More informationStatistics and learning: Big Data
Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees
More informationBAGGING PREDICTORS AND RANDOM FOREST
BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGIGNG PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS
More informationEmpirical Risk Minimization, Model Selection, and Model Assessment
Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,
More informationNotation P(Y ) = X P(X, Y ) = X. P(Y X )P(X ) Teorema de Bayes: P(Y X ) = CIn/UFPE - Prof. Francisco de A. T. de Carvalho P(X )
Notation R n : feature space Y : class label set m : number of training examples D = {x i, y i } m i=1 (x i R n ; y i Y}: training data set H hipothesis space h : base classificer H : emsemble classifier
More informationFrank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c15 2013/9/9 page 331 le-tex 331 15 Ensemble Learning The expression ensemble learning refers to a broad class
More informationMachine learning - HT Basis Expansion, Regularization, Validation
Machine learning - HT 016 4. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford Feburary 03, 016 Outline Introduce basis function to go beyond linear regression Understanding
More informationEvaluation requires to define performance measures to be optimized
Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation
More informationThe Boosting Approach to. Machine Learning. Maria-Florina Balcan 10/31/2016
The Boosting Approach to Machine Learning Maria-Florina Balcan 10/31/2016 Boosting General method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets
More informationSample questions for Fundamentals of Machine Learning 2018
Sample questions for Fundamentals of Machine Learning 2018 Teacher: Mohammad Emtiyaz Khan A few important informations: In the final exam, no electronic devices are allowed except a calculator. Make sure
More informationVC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms
03/Feb/2010 VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms Presented by Andriy Temko Department of Electrical and Electronic Engineering Page 2 of
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationCS 231A Section 1: Linear Algebra & Probability Review
CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability
More informationBoosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13
Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More informationAnalysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example
Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationChapter 6. Ensemble Methods
Chapter 6. Ensemble Methods Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Introduction
More informationLossless Online Bayesian Bagging
Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu
More informationNumerical Optimization
Numerical Optimization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationLarge-Margin Thresholded Ensembles for Ordinal Regression
Large-Margin Thresholded Ensembles for Ordinal Regression Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology, U.S.A. Conf. on Algorithmic Learning Theory, October 9,
More informationAdvanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting)
Advanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting) Professor: Aude Billard Assistants: Nadia Figueroa, Ilaria Lauzana and Brice Platerrier E-mails: aude.billard@epfl.ch,
More informationModel Averaging With Holdout Estimation of the Posterior Distribution
Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationHierarchical Boosting and Filter Generation
January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers
More informationEnsemble Learning in the Presence of Noise
Universidad Autónoma de Madrid Master s Thesis Ensemble Learning in the Presence of Noise Author: Maryam Sabzevari Supervisors: Dr. Gonzalo Martínez Muñoz, Dr. Alberto Suárez González Submitted in partial
More informationFundamentals of Machine Learning. Mohammad Emtiyaz Khan EPFL Aug 25, 2015
Fundamentals of Machine Learning Mohammad Emtiyaz Khan EPFL Aug 25, 25 Mohammad Emtiyaz Khan 24 Contents List of concepts 2 Course Goals 3 2 Regression 4 3 Model: Linear Regression 7 4 Cost Function: MSE
More informationA Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007
Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized
More informationMonte Carlo Theory as an Explanation of Bagging and Boosting
Monte Carlo Theory as an Explanation of Bagging and Boosting Roberto Esposito Università di Torino C.so Svizzera 185, Torino, Italy esposito@di.unito.it Lorenza Saitta Università del Piemonte Orientale
More informationhttp://imgs.xkcd.com/comics/electoral_precedent.png Statistical Learning Theory CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell Chapter 7 (not 7.4.4 and 7.5)
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationCART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )
CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive
More information