TDT4173 Machine Learning
Lecture 9: Learning Classifiers: Bagging & Boosting
Norwegian University of Science and Technology
Helge Langseth, IT-VEST 310, helgel@idi.ntnu.no
Outline
1 Wrap-up from last time
2 Ensemble methods: Background, Bagging, Boosting
Wrap-up from last time
Summary points from last lesson:
  Evaluating hypotheses: sample error and true error; estimators; confidence intervals for the observed hypothesis error; the central limit theorem; comparing hypotheses; comparing learners.
  Computational learning theory: bounding the true error; PAC learning.
Ensemble methods: Background
Fundamental setup
1 Generate a collection (ensemble) of classifiers from a given dataset.
2 Each classifier is given a different dataset. The datasets are generated by resampling.
3 The final classification is decided by voting.
Nice property
Each classifier in the ensemble may be fairly simple; the ensemble of classifiers can still be made to learn anything well. The only requirement on the weak learner is that it does better than choosing blindly.
Weak learner example (Background)
Consider a two-dimensional dataset with two classes.
[Figure: data for a two-class problem]
What is a weak classifier in this case? One example: the half-space classifier, which also aggregates well.
Background: How to learn the half-space classifier
Consider a dataset D = {D_1, D_2, ..., D_m}, where each D_j is an observation ((x_1, x_2), c) and c is either red or blue.
For each dimension d:
  For each observation j:
    1 Let b(d, j) be x_d from observation D_j.
    2 r_l <- (no. of red observations with x_d <= b(d, j)) + (no. of blue observations with x_d > b(d, j))
    3 quality(d, j) <- max{ r_l, m - r_l }
(d_0, j_0) <- argmax_{d, j} quality(d, j)
Define the half-plane classifier at x_{d_0} = b(d_0, j_0).
[Figure: the data and the classifier learned in iteration 1]
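In code, the procedure above amounts to an exhaustive search over axis-aligned thresholds. Below is a minimal sketch, not part of the lecture material: it assumes the data come as a NumPy array X of shape (m, 2) and a label vector y with values +1/-1 standing in for red/blue, and the name learn_stump is illustrative.

```python
# Minimal sketch of the half-space (axis-parallel threshold) weak learner.
# Assumptions (not from the slides): X is an (m, n) float array, y holds
# +1/-1 labels, and ties are broken arbitrarily.
import numpy as np

def learn_stump(X, y):
    m, n_dims = X.shape
    best = None
    for d in range(n_dims):                  # "for d in each dimension"
        for b in X[:, d]:                    # candidate thresholds b(d, j)
            pred = np.where(X[:, d] <= b, 1, -1)
            r = np.sum(pred == y)            # correctly classified in this orientation
            # quality(d, j) = max{r_l, m - r_l}; remember which orientation won
            quality, sign = max((r, 1), (m - r, -1))
            if best is None or quality > best[0]:
                best = (quality, d, b, sign)
    _, d0, b0, sign = best
    # The learned classifier: a function of a single new point x
    return lambda x: sign * (1 if x[d0] <= b0 else -1)
```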
Background: Two methods of ensemble learning
We will consider two methods of ensemble learning:
Bagging (bootstrap aggregating): new datasets are generated by drawing at random from the original data (with replacement).
Boosting: the data for the classifiers are adaptively resampled (re-weighted), so that data points the classifiers have had problems classifying are given more weight.
Bootstrapping (Bagging)
What is a bootstrap sample?
Consider a dataset D = {D_1, D_2, ..., D_m} with m data points.
A bootstrap sample D_i can be created from D by choosing m points from D at random. The selection is done with replacement.
[Figure: the original dataset and one bootstrap sample]
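As a concrete illustration (again assuming the data are stored as a NumPy array X with labels y; the helper name is ours, not the lecture's), one bootstrap sample can be drawn like this:

```python
# Sketch: draw one bootstrap sample of the same size as the original dataset.
import numpy as np

def bootstrap_sample(X, y, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    idx = rng.integers(0, m, size=m)   # m indices drawn with replacement
    return X[idx], y[idx]              # some rows repeat, others are left out
```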
Bagging
Bagging (= bootstrap aggregating) produces replications of the training set by sampling with replacement.
Each replication of the training set has the same size as the original set, but some examples can appear more than once while others don't appear at all.
A classifier is generated from each replication.
All classifiers are used to classify each sample from the test set, using a majority voting scheme.
Bagging more formally
1 Start with dataset D.
2 Generate B bootstrap samples D_1, D_2, ..., D_B.
3 Learn a weak classifier from each dataset: c_b <- L(D_b), where L(.) is the weak learner and c_b(.) : X -> {-1, +1}.
4 The aggregated classification of a new instance x is simply found by taking a majority vote over the classifiers:
      c(x) <- sign( sum_{b=1}^{B} c_b(x) )
DEMO
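A compact sketch of the whole procedure, reusing the hypothetical learn_stump and bootstrap_sample helpers from the earlier sketches (names, signatures, and the default B are assumptions, not part of the lecture):

```python
# Sketch of bagging: B bootstrap samples, one weak classifier each, majority vote.
import numpy as np

def bag(X, y, weak_learner, B=25, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    classifiers = []
    for _ in range(B):
        Xb, yb = bootstrap_sample(X, y, rng)      # step 2: bootstrap sample D_b
        classifiers.append(weak_learner(Xb, yb))  # step 3: c_b <- L(D_b)
    # Step 4: majority vote over the ensemble (a tie yields 0 here).
    return lambda x: int(np.sign(sum(c(x) for c in classifiers)))

# Usage (with the earlier stump sketch):
# c = bag(X, y, weak_learner=learn_stump, B=25); c(np.array([0.1, -0.3]))
```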
Bagging Summary
When is it useful?
Bagging should only be used if the learner is unstable. A learner is said to be unstable if a small change in the training set yields large variations in the classification.
Examples of unstable learners: neural networks, decision trees.
Example of a stable learner: k-nearest neighbours.
Bagging properties
1 Improves the estimate if the learning algorithm is unstable.
2 Reduces the variance of the predictions.
3 Degrades the estimate if the learning algorithm is stable.
Boosting
1 Main idea as for bagging: invent new datasets, and learn a separate classifier for each dataset.
2 Clever trick: the data is now adaptively resampled, so that previously misclassified examples count more in the next dataset, while previously well-classified observations are weighted less.
3 The classifiers are aggregated by adding them together, but each classifier is weighted differently.
Boosting: How to learn the WEIGHTED half-space classifier
Consider a dataset D = {D_1, D_2, ..., D_m}, where each D_j is an observation ((x_1, x_2), c) and c is either red or blue.
For each dimension d:
  For each observation j:
    1 Let b(d, j) be x_d from observation D_j.
    2 r_l <- (sum of weights of red observations with x_d <= b(d, j)) + (sum of weights of blue observations with x_d > b(d, j))
    3 quality(d, j) <- max{ r_l, sum_i w(i) - r_l }
(d_0, j_0) <- argmax_{d, j} quality(d, j)
Define the half-plane classifier at x_{d_0} = b(d_0, j_0).
[Figure: the weighted data and the learned classifier]
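The only change relative to the unweighted sketch shown earlier is that correctness is accumulated in observation weights rather than counted; a hedged sketch (the function name and the weight vector w are our assumptions):

```python
# Sketch: weighted half-space learner; quality sums weights instead of counting.
import numpy as np

def learn_weighted_stump(X, y, w):
    m, n_dims = X.shape
    best = None
    for d in range(n_dims):
        for b in X[:, d]:
            pred = np.where(X[:, d] <= b, 1, -1)
            r = np.sum(w[pred == y])                        # weight classified correctly
            quality, sign = max((r, 1), (w.sum() - r, -1))  # max{r_l, sum_i w(i) - r_l}
            if best is None or quality > best[0]:
                best = (quality, d, b, sign)
    _, d0, b0, sign = best
    return lambda x: sign * (1 if x[d0] <= b0 else -1)
```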
Boosting
Boosting produces replications of the training set by re-weighting the importance of each observation in the original dataset.
A classifier is generated from each replication, using the weighted dataset as input.
All classifiers are used to classify each sample from the test set. Each classifier gets a weight based on its calculated accuracy on the weighted training set.
Boosting more formally
1 Start with dataset D and weights w_1 = [1/m, 1/m, ..., 1/m].
2 Repeat for t = 1, ..., T:
  1 c_t is the classifier learned from the weighted data (D, w_t).
  2 eps_t is the error of c_t measured on the weighted data (D, w_t).
  3 Calculate alpha_t <- (1/2) ln((1 - eps_t) / eps_t).
  4 Update weights (1):
        w_{t+1}(i) <- w_t(i) * exp(-alpha_t) if x_i was classified correctly,
        w_{t+1}(i) <- w_t(i) * exp(+alpha_t) otherwise.
  5 Update weights (2): normalize w_{t+1}.
3 The aggregated classification of a new instance x is found by taking a weighted vote over the classifiers:
      c(x) <- sign( sum_{t=1}^{T} alpha_t c_t(x) )
DEMO
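A sketch of this loop (AdaBoost with +1/-1 labels), using the hypothetical learn_weighted_stump from the previous sketch as the weak learner; the early stop when eps_t reaches 0 or 1/2 is our addition, since alpha_t is not defined in those cases:

```python
# Sketch of the boosting loop: reweight the data each round, then combine the
# classifiers with weights alpha_t.
import numpy as np

def boost(X, y, weak_learner, T=25):
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                  # w_1 = [1/m, ..., 1/m]
    classifiers, alphas = [], []
    for _ in range(T):
        c_t = weak_learner(X, y, w)          # classifier from weighted data (D, w_t)
        pred = np.array([c_t(x) for x in X])
        eps = float(np.sum(w[pred != y]))    # weighted error eps_t
        if eps <= 0.0:                       # perfect classifier: keep it and stop
            classifiers.append(c_t); alphas.append(1.0)
            break
        if eps >= 0.5:                       # weak-learner assumption broken: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)    # down-weight correct, up-weight mistakes
        w /= w.sum()                         # normalise w_{t+1}
        classifiers.append(c_t)
        alphas.append(alpha)
    # Weighted vote: c(x) = sign(sum_t alpha_t * c_t(x)).
    return lambda x: int(np.sign(sum(a * c(x) for a, c in zip(alphas, classifiers))))
```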
Boosting Summary
Boosting properties
The error rate of c on the un-weighted training instances approaches zero exponentially quickly as T increases.
Boosting forces the weak classifier to have at least 50% accuracy on the re-weighted training instances. This causes the learning system to produce a quite different classifier on the following trial (since next time, we focus on the misclassifications of this round). This leads to an extensive exploration of the classifier space.
Boosting is more clever than bagging, in that it will not reduce classification performance.
Plans for the remainder of the course
Next week off: Since we have covered both boosting and SVMs today, we are ahead of schedule. Therefore, the lecture on November 14th is cancelled, and the next time we meet is for the last lecture (November 21st).
Plans for the last lecture:
  Wrap-up of the course
  Q & A
  I will go through last year's examination, which is also the last assignment.