Cross Validation & Ensembling


Cross Validation & Ensembling
Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning

Outline

1 Cross Validation
  How Many Folds?
2 Ensemble Methods
  Voting
  Bagging
  Boosting
  Why AdaBoost Works?

Cross Validation

So far, we have used the hold-out method for:
- Hyperparameter tuning: validation set
- Performance reporting: testing set
What if we get an unfortunate split?

K-fold cross validation:
1 Split the data set X evenly into K subsets X^{(i)} (called folds)
2 For i = 1, ..., K, train f_N^{(i)} using all data but the i-th fold (X \ X^{(i)})
3 Report the cross-validation error C_CV by averaging the K test errors C[f_N^{(i)}] measured on X^{(i)}
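To make the procedure concrete, here is a minimal NumPy sketch of K-fold CV; the train and error callbacks (and X, y as arrays) are hypothetical placeholders for any learner and any cost C[.] of interest, not something defined on the slides.

```python
import numpy as np

def kfold_cv_error(X, y, train, error, K=5, seed=0):
    """Return C_CV: the average test error over K folds.

    train(X, y) -> model and error(model, X, y) -> float are placeholders
    for the learning algorithm and the cost C[.] of interest.
    """
    N = len(X)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)                  # X^(1), ..., X^(K)
    errors = []
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(K) if j != i])
        model = train(X[train_idx], y[train_idx])   # f_N^(i), trained on X \ X^(i)
        errors.append(error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)                          # C_CV
```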

Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting.
E.g., 5x2 nested CV:
1 Inner loop (2 folds): select the hyperparameters giving the lowest C_CV; this can be wrapped by grid search
2 Train the final model using both the training and validation sets with the selected hyperparameters
3 Outer loop (5 folds): report C_CV as the test error
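Below is a sketch of the 5x2 nested scheme, reusing the hypothetical kfold_cv_error helper above; the grid of hyperparameter settings and the train_with/error callbacks are illustrative assumptions rather than anything defined on the slides.

```python
import numpy as np

def nested_cv_error(X, y, train_with, error, grid, outer_K=5, inner_K=2, seed=0):
    """Outer loop reports the test error; inner loop picks hyperparameters.

    train_with(hp) -> a train(X, y) function for hyperparameter setting hp.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    outer_folds = np.array_split(idx, outer_K)
    outer_errors = []
    for i in range(outer_K):
        test_idx = outer_folds[i]
        dev_idx = np.hstack([outer_folds[j] for j in range(outer_K) if j != i])
        # Inner 2-fold CV, wrapped by grid search, selects the hyperparameters
        best_hp = min(grid, key=lambda hp: kfold_cv_error(
            X[dev_idx], y[dev_idx], train_with(hp), error, K=inner_K, seed=seed))
        # Retrain on the whole outer training portion with the chosen hyperparameters
        model = train_with(best_hp)(X[dev_idx], y[dev_idx])
        outer_errors.append(error(model, X[test_idx], y[test_idx]))
    return np.mean(outer_errors)    # reported as the test error C_CV
```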

How Many Folds K? I

The cross-validation error C_CV is an average of the C[f_N^{(i)}]'s.
Regard each C[f_N^{(i)}] as an estimator of the expected generalization error E_X(C[f_N]).
Then C_CV is an estimator too, and we have

MSE(C_CV) = E_X[(C_CV - E_X(C[f_N]))^2] = Var_X(C_CV) + bias(C_CV)^2

Point Estimation Revisited: Mean Square Error

Let \hat{q}_n be an estimator of a quantity q related to a random variable x, computed from n i.i.d. samples of x.
Mean square error of \hat{q}_n: MSE(\hat{q}_n) = E_X[(\hat{q}_n - q)^2]
It can be decomposed into bias and variance:

E_X[(\hat{q}_n - q)^2]
  = E[(\hat{q}_n - E[\hat{q}_n] + E[\hat{q}_n] - q)^2]
  = E[(\hat{q}_n - E[\hat{q}_n])^2 + (E[\hat{q}_n] - q)^2 + 2(\hat{q}_n - E[\hat{q}_n])(E[\hat{q}_n] - q)]
  = E[(\hat{q}_n - E[\hat{q}_n])^2] + E[(E[\hat{q}_n] - q)^2] + 2 E[\hat{q}_n - E[\hat{q}_n]] (E[\hat{q}_n] - q)
  = E[(\hat{q}_n - E[\hat{q}_n])^2] + (E[\hat{q}_n] - q)^2 + 2 * 0 * (E[\hat{q}_n] - q)
  = Var_X(\hat{q}_n) + bias(\hat{q}_n)^2

The MSE of an unbiased estimator is simply its variance.
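The identity can also be checked numerically; the quick Monte Carlo sketch below uses the sample mean of n Gaussian draws as the estimator, an illustrative choice that is not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
q, n, trials = 2.0, 20, 100_000            # true quantity, sample size, repetitions

# hat_q_n = sample mean of n i.i.d. N(q, 1) draws, recomputed over many datasets
estimates = rng.normal(q, 1.0, size=(trials, n)).mean(axis=1)

mse = np.mean((estimates - q) ** 2)
var = np.var(estimates)
bias = np.mean(estimates) - q
print(mse, var + bias ** 2)                # the two numbers agree up to sampling noise
```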

Example: 5-Fold vs. 10-Fold CV

MSE(C_CV) = E_X[(C_CV - E_X(C[f_N]))^2] = Var_X(C_CV) + bias(C_CV)^2

Consider polynomial regression where y = sin(x) + \epsilon, \epsilon ~ N(0, \sigma^2).
Let C[.] be the MSE of a function's predictions against the true labels.
In the figure on the slide:
- E_X(C[f_N]): the red line
- bias(C_CV): the gaps between the red line and the other solid lines (E_X[C_CV])
- Var_X(C_CV): the shaded areas

Decomposing Bias and Variance

C_CV is an estimator of the expected generalization error E_X(C[f_N]):

MSE(C_CV) = Var_X(C_CV) + bias(C_CV)^2, where

bias(C_CV) = E_X(C_CV) - E_X(C[f_N])
           = E[(1/K) \sum_i C[f_N^{(i)}]] - E(C[f_N])
           = (1/K) \sum_i E[C[f_N^{(i)}]] - E(C[f_N])
           = E[C[f_N^{(s)}]] - E(C[f_N])   for any s
           = bias(C[f_N^{(s)}])            for any s

Var_X(C_CV) = Var((1/K) \sum_i C[f_N^{(i)}])
            = (1/K^2) Var(\sum_i C[f_N^{(i)}])
            = (1/K^2) [\sum_i Var(C[f_N^{(i)}]) + 2 \sum_{i,j: j>i} Cov_X(C[f_N^{(i)}], C[f_N^{(j)}])]
            = (1/K) Var(C[f_N^{(s)}]) + (2/K^2) \sum_{i,j: j>i} Cov(C[f_N^{(i)}], C[f_N^{(j)}])   for any s

How Many Folds K? II

MSE(C_CV) = Var_X(C_CV) + bias(C_CV)^2, where
bias(C_CV) = bias(C[f_N^{(s)}]) for any s
Var_X(C_CV) = (1/K) Var(C[f_N^{(s)}]) + (2/K^2) \sum_{i,j: j>i} Cov(C[f_N^{(i)}], C[f_N^{(j)}]) for any s

Following learning theory, we can reduce both bias(C_CV) and Var(C_CV) by:
- choosing the right model complexity, avoiding both underfitting and overfitting
- collecting more training examples (N)

Furthermore, we can reduce Var(C_CV) by making f_N^{(i)} and f_N^{(j)} uncorrelated.

How Many Folds K? III

With a large K, C_CV tends to have:
- low bias(C[f_N^{(s)}]) and Var(C[f_N^{(s)}]), since each f_N^{(s)} is trained on more examples
- high Cov(C[f_N^{(i)}], C[f_N^{(j)}]), since the training sets X \ X^{(i)} and X \ X^{(j)} are more similar, so C[f_N^{(i)}] and C[f_N^{(j)}] are more positively correlated

How Many Folds K? IV

Conversely, with a small K, the cross-validation error tends to have high bias(C[f_N^{(s)}]) and Var(C[f_N^{(s)}]) but low Cov(C[f_N^{(i)}], C[f_N^{(j)}]).
In practice, we usually set K = 5 or 10.

Leave-One-Out CV

For a very small dataset, MSE(C_CV) is dominated by bias(C[f_N^{(s)}]) and Var(C[f_N^{(s)}]), not by Cov(C[f_N^{(i)}], C[f_N^{(j)}]).
In this case we can choose K = N, which is called leave-one-out CV.
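With the hypothetical kfold_cv_error sketch from earlier, leave-one-out CV is simply the K = N special case:

```python
# Leave-one-out CV: one fold per example.
# Assumes X, y, train, error and kfold_cv_error from the earlier K-fold sketch.
loocv_error = kfold_cv_error(X, y, train, error, K=len(X))
```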

Ensemble Methods

No free lunch theorem: there is no single ML algorithm that always outperforms the others in all domains/tasks.
Can we combine multiple base-learners to improve:
- applicability across different domains, and/or
- generalization performance in a specific task?
These are the goals of ensemble learning.
How do we combine multiple base-learners?

Voting

Voting: linearly combine the predictions of the base-learners for each x:

\tilde{y}_k = \sum_j w_j \hat{y}_k^{(j)},  where w_j >= 0 and \sum_j w_j = 1

If all learners are given equal weight w_j = 1/L, we have the plurality vote (the multi-class version of the majority vote).

Voting rule     Formula
Sum             \tilde{y}_k = (1/L) \sum_{j=1}^{L} \hat{y}_k^{(j)}
Weighted sum    \tilde{y}_k = \sum_j w_j \hat{y}_k^{(j)},  w_j >= 0, \sum_j w_j = 1
Median          \tilde{y}_k = median_j \hat{y}_k^{(j)}
Minimum         \tilde{y}_k = min_j \hat{y}_k^{(j)}
Maximum         \tilde{y}_k = max_j \hat{y}_k^{(j)}
Product         \tilde{y}_k = \prod_j \hat{y}_k^{(j)}
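A small sketch of the combination rules in the table; preds is assumed to hold the base-learners' per-class scores \hat{y}_k^{(j)} as an (L, K) array, and all names here are illustrative rather than taken from the slides.

```python
import numpy as np

def combine(preds, rule="sum", w=None):
    """preds: array of shape (L, K) -- L base-learners, K classes."""
    if rule == "sum":       return preds.mean(axis=0)
    if rule == "weighted":  return np.average(preds, axis=0, weights=w)  # w_j >= 0, sum to 1
    if rule == "median":    return np.median(preds, axis=0)
    if rule == "min":       return preds.min(axis=0)
    if rule == "max":       return preds.max(axis=0)
    if rule == "product":   return preds.prod(axis=0)
    raise ValueError(rule)

# Plurality vote: predict the class with the largest equal-weight combined score
# y_tilde = np.argmax(combine(preds, "sum"))
```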

Why Voting Works? I

Assume each \hat{y}^{(j)} has expected value E_X[\hat{y}^{(j)} | x] and variance Var_X[\hat{y}^{(j)} | x], the same for every j.
When w_j = 1/L, we have:

E_X(\tilde{y} | x) = E[(1/L) \sum_j \hat{y}^{(j)} | x] = (1/L) \sum_j E[\hat{y}^{(j)} | x] = E[\hat{y}^{(j)} | x]

Var_X(\tilde{y} | x) = Var((1/L) \sum_j \hat{y}^{(j)} | x)
                     = (1/L^2) Var(\sum_j \hat{y}^{(j)} | x)
                     = (1/L) Var(\hat{y}^{(j)} | x) + (2/L^2) \sum_{i,j: j>i} Cov(\hat{y}^{(i)}, \hat{y}^{(j)} | x)

The expected value doesn't change, so the bias doesn't change.

Why Voting Works? II

Var_X(\tilde{y} | x) = (1/L) Var(\hat{y}^{(j)} | x) + (2/L^2) \sum_{i,j: j>i} Cov(\hat{y}^{(i)}, \hat{y}^{(j)} | x)

If \hat{y}^{(i)} and \hat{y}^{(j)} are uncorrelated, the variance can be reduced.
Unfortunately, the \hat{y}^{(j)}'s may not be i.i.d. in practice: if the voters are positively correlated, the variance increases.
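The variance formula can be illustrated with a short simulation: averaging L voters with pairwise correlation rho drives the variance toward sigma^2/L + ((L-1)/L) * rho * sigma^2 rather than toward zero. The numbers and names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
L, sigma2, trials = 10, 1.0, 200_000

for rho in (0.0, 0.5):                 # uncorrelated vs. positively correlated voters
    cov = sigma2 * ((1 - rho) * np.eye(L) + rho * np.ones((L, L)))
    votes = rng.multivariate_normal(np.zeros(L), cov, size=trials)
    # Variance of the equal-weight vote ~ sigma2/L + (L-1)/L * rho * sigma2
    print(rho, votes.mean(axis=1).var())
```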

Bagging

Bagging (short for bootstrap aggregating) is a voting method, but the base-learners are made different deliberately.
How? Why not train them using slightly different training sets?
1 Generate L slightly different samples from the given sample by bootstrapping: given X of size N, draw N points randomly from X with replacement to get X^{(j)}. Some instances may be drawn more than once and some not at all.
2 Train a base-learner on each X^{(j)}.
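A minimal bagging sketch in NumPy; train stands for any base-learning routine, a returned model is assumed to be callable, and the ensemble votes by averaging its members' predictions. These are assumptions for illustration, not the slides' definitions.

```python
import numpy as np

def bagging(X, y, train, L=10, seed=0):
    """Train L base-learners on bootstrap resamples X^(1), ..., X^(L)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(L):
        idx = rng.integers(0, N, size=N)       # draw N points with replacement
        models.append(train(X[idx], y[idx]))   # some points repeat, some are left out
    return models

def bagged_predict(models, x):
    # Equal-weight vote: average of the base-learners' predictions
    return np.mean([m(x) for m in models], axis=0)
```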

Boosting

In bagging, generating uncorrelated base-learners is left to chance and to the instability of the learning method.
In boosting, we generate complementary base-learners. How? Why not train the next learner on the mistakes of the previous learners?
For simplicity, consider binary classification here: d^{(j)}(x) \in {-1, 1}.
The original boosting algorithm combines three weak learners to generate a strong learner:
- a weak learner has error probability less than 1/2 (better than random guessing)
- a strong learner has arbitrarily small error probability

Original Boosting Algorithm

Training
1 Given a large training set, randomly divide it into three sets X^{(1)}, X^{(2)}, X^{(3)}
2 Use X^{(1)} to train the first learner d^{(1)}, then feed X^{(2)} to d^{(1)}
3 Use the points of X^{(2)} misclassified by d^{(1)} to train d^{(2)}. Then feed X^{(3)} to d^{(1)} and d^{(2)}
4 Use the points on which d^{(1)} and d^{(2)} disagree to train d^{(3)}

Testing
1 Feed a point to d^{(1)} and d^{(2)} first. If their outputs agree, use that as the final prediction
2 Otherwise, take the output of d^{(3)}

Example

Assuming X^{(1)}, X^{(2)}, and X^{(3)} are the same (as illustrated on the slide).
Disadvantage: the method requires a large training set to afford the three-way split.

AdaBoost

AdaBoost uses the same training set over and over again.
How do we make some points count more than others? Modify the probabilities of drawing the instances as a function of error.
Notation:
- Pr^{(i,j)}: the probability that example (x^{(i)}, y^{(i)}) is drawn to train the j-th base-learner d^{(j)}
- e^{(j)} = \sum_i Pr^{(i,j)} 1(y^{(i)} \neq d^{(j)}(x^{(i)})): the error rate of d^{(j)} on its training set

Algorithm

Training
1 Initialize Pr^{(i,1)} = 1/N for all i
2 Starting from j = 1, repeat:
  1 Randomly draw N examples from X with probabilities Pr^{(i,j)} and use them to train d^{(j)}
  2 Stop adding new base-learners if e^{(j)} >= 1/2
  3 Define \alpha_j = (1/2) log((1 - e^{(j)}) / e^{(j)}) > 0 and set
    Pr^{(i,j+1)} = Pr^{(i,j)} exp(-\alpha_j y^{(i)} d^{(j)}(x^{(i)})) for all i
  4 Normalize the Pr^{(i,j+1)}'s by multiplying each by 1 / \sum_i Pr^{(i,j+1)}

Testing
1 Given x, calculate \hat{y}^{(j)} = d^{(j)}(x) for all j
2 Make the final prediction \tilde{y} by weighted voting: \tilde{y} = \sum_j \alpha_j d^{(j)}(x)
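A compact sketch of the training loop above for labels y in {-1, +1}; train_weak is a hypothetical placeholder for any weak-learning routine, and a trained learner is assumed to be callable on an array of inputs.

```python
import numpy as np

def adaboost(X, y, train_weak, L=50, seed=0):
    """Return base-learners d^(j) and weights alpha_j for f(x) = sum_j alpha_j d^(j)(x)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    p = np.full(N, 1.0 / N)                        # Pr^(i,1) = 1/N
    learners, alphas = [], []
    for _ in range(L):
        idx = rng.choice(N, size=N, p=p)           # draw N examples with probabilities Pr^(i,j)
        d = train_weak(X[idx], y[idx])             # d^(j)
        pred = d(X)                                # predictions in {-1, +1}
        err = np.sum(p * (pred != y))              # e^(j)
        if err >= 0.5:                             # no weak edge left: stop adding learners
            break
        err = max(err, 1e-12)                      # guard against a perfect learner
        alpha = 0.5 * np.log((1 - err) / err)      # alpha_j > 0
        p = p * np.exp(-alpha * y * pred)          # up-weight the mistakes
        p = p / p.sum()                            # normalize Pr^(i,j+1)
        learners.append(d)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    f = sum(a * d(X) for a, d in zip(alphas, learners))   # weighted vote
    return np.sign(f)                                      # predicted label in {-1, +1}
```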

Example

d^{(j+1)} complements d^{(j)} and d^{(j-1)} by focusing on the predictions they disagree on.
The voting weights \alpha_j = (1/2) log((1 - e^{(j)}) / e^{(j)}) used in prediction grow with the base-learner's accuracy, i.e., they shrink as the error e^{(j)} grows.

Why AdaBoost Works

Why does AdaBoost improve performance? By increasing model complexity? Not exactly.
Empirical studies show that AdaBoost keeps improving generalization as L grows, even after the training error reaches zero.
AdaBoost increases the margin [1, 2].

Margin as Confidence of Predictions

Recall that in SVC, a larger margin improves generalizability, due to higher-confidence predictions over the training examples.
We can define the margin for AdaBoost similarly. In binary classification, define the margin of the prediction on an example (x^{(i)}, y^{(i)}) \in X as:

margin(x^{(i)}, y^{(i)}) = y^{(i)} f(x^{(i)}) = \sum_{j: y^{(i)} = d^{(j)}(x^{(i)})} \alpha_j - \sum_{j: y^{(i)} \neq d^{(j)}(x^{(i)})} \alpha_j

Margin Distribution

Margin distribution over \theta:

Pr_X(y^{(i)} f(x^{(i)}) <= \theta) = |{(x^{(i)}, y^{(i)}) : y^{(i)} f(x^{(i)}) <= \theta}| / |X|

A complementary learner:
- clarifies low-confidence areas
- increases the margin of the points in these areas
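A short sketch computing the margins and their empirical distribution for an ensemble like the AdaBoost sketch above; here the voting weights are normalized to sum to one so margins lie in [-1, 1], a common convention that the slide's formula leaves implicit.

```python
import numpy as np

def margins(learners, alphas, X, y):
    """margin(x, y) = y * f(x), with the alpha_j normalized to sum to 1."""
    a = np.asarray(alphas, dtype=float)
    a = a / a.sum()
    f = sum(aj * d(X) for aj, d in zip(a, learners))
    return y * f                          # large positive = confident and correct

def margin_distribution(m, theta):
    """Empirical Pr_X(y f(x) <= theta): fraction of points with margin <= theta."""
    return np.mean(m <= theta)
```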

Reference I

[1] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.

[2] Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. On the margin explanation of boosting algorithms. In COLT, pages 479-490. Citeseer, 2008.