Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda
  Combining Classifiers
    Empirical view
    Theoretical view
  Resampling
    Bagging
    Boosting
    Adaboost
Combining Classifiers (Empirical View) Just like different features capturing different properties of a pattern, different classifiers also capture different structures and relationships of these patterns in the feature space. An empirical comparison of different classifiers can help us choose one of them as the best classifier for the problem at hand. 3
Combining Classifiers (Empirical View) However, although most of the classifiers may have similar error rates, sets of patterns misclassified by different classifiers do not necessarily overlap. Not relying on a single decision but rather combining the advantages of different classifiers is intuitively promising to improve the overall accuracy of classification. Such combinations are variously called combined classifiers, ensemble classifiers, mixture-of-expert models, or pooled classifiers. 4
Combining Classifiers (Empirical View)
Some reasons for combining multiple classifiers to solve a given classification problem can be stated as follows:
  Access to different classifiers, each developed in a different context and for an entirely different representation/description of the same problem.
  Availability of multiple training sets, each collected at a different time or in a different environment, possibly even using different features.
  Local performance of different classifiers, where each classifier may have its own region in the feature space in which it performs best.
  Different performance due to different initializations and the randomness inherent in the training procedure.
Combining Classifiers (Theoretical View)
At a single data point, the quadratic error of the ensemble, $(f_{ens} - d)^2$, is less than or equal to the weighted average quadratic error of the individuals, $(f_i - d)^2$:
$$(f_{ens} - d)^2 = \sum_i w_i (f_i - d)^2 - \sum_i w_i (f_i - f_{ens})^2$$
where
$$f_{ens} = \sum_i w_i f_i$$
The first term is the weighted average error of the individuals.
The second term is the diversity term, measuring the amount of variability among the ensemble member answers for this pattern.
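The decomposition can be checked numerically at one data point. The sketch below is only an illustration: the three predictions, the weights (assumed non-negative and summing to 1) and the target are made-up values, and the function name is hypothetical.

```python
# Numerical check of the error decomposition at a single data point.
# All numbers below are made-up illustrative values, not taken from the lecture.

def ambiguity_decomposition(preds, weights, target):
    """Return (ensemble_error, avg_individual_error, diversity)."""
    f_ens = sum(w * f for w, f in zip(weights, preds))
    ensemble_error = (f_ens - target) ** 2
    avg_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    diversity = sum(w * (f - f_ens) ** 2 for w, f in zip(weights, preds))
    return ensemble_error, avg_error, diversity

preds = [0.9, 1.4, 0.5]      # f_i: outputs of three hypothetical predictors
weights = [0.5, 0.3, 0.2]    # w_i: assumed non-negative and summing to 1
target = 1.0                 # d: desired output

ens_err, avg_err, div = ambiguity_decomposition(preds, weights, target)
print("ensemble error:          ", ens_err)
print("average individual error:", avg_err)
print("diversity term:          ", div)
```

Because the diversity term is never negative, the ensemble error can never exceed the weighted average individual error, and the gap grows as the members disagree more.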
Combining Classifiers (Theoretical View)
The decomposition tells us that, averaged over several patterns, the combination of several predictors is better than a method that selects one of the predictors at random.
We need to strike the right balance between diversity (the diversity term) and individual accuracy (the average error term) in order to achieve the lowest overall ensemble error.
All successful ensemble methods encourage diversity to some extent.
Combining Classifiers In summary, we may have different feature sets, training sets, classification methods, and training sessions, all resulting in a set of classifiers whose outputs may be combined. Combination architectures can be grouped as: Parallel: all classifiers are invoked independently and then their results are combined by a combiner. Serial (cascading): individual classifiers are invoked in a linear sequence where the number of possible classes for a given pattern is gradually reduced. Hierarchical (tree): individual classifiers are combined into a structure, which is similar to that of a decision tree, where the nodes are associated with the classifiers. 8
Combining Classifiers
Selecting and training the individual classifiers:
  Combination of classifiers is especially useful if the individual classifiers are largely independent.
  This can be explicitly forced by using different training sets, different features and different classifiers.
Combiner:
  Some combiners are static, with no training required, while others are trainable.
  Some are adaptive, where the decisions of individual classifiers are evaluated (weighted) depending on the input pattern, whereas non-adaptive ones treat all input patterns the same.
  Different combiners use different types of output from the individual classifiers: confidence, rank, or abstract.
Combining Classifiers Examples of classifier combination schemes are: Majority voting (each classifier makes a binary decision (vote) about each class and the final decision is made in favor of the class with the largest number of votes), Sum, product, maximum, minimum and median of the posterior probabilities computed by individual classifiers, Class ranking (each class receives m ranks from m classifiers, the highest (minimum) of these ranks is the final score for that class), Weighted combination of classifiers. We will study different combination schemes using a Bayesian framework and resampling. 10
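A minimal sketch of the fixed combination rules listed above (majority voting and the sum, product, maximum, minimum and median of posteriors). The posterior values are invented for illustration, and the helper names `majority_vote` and `combine` are hypothetical.

```python
# Fixed (non-trainable) combination rules applied to the outputs of m classifiers.
# posteriors[j][c] is classifier j's estimate of P(class c | x); values are made up.
from collections import Counter
from math import prod
from statistics import median

def majority_vote(posteriors):
    """Each classifier votes for its most probable class; the most voted class wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return Counter(votes).most_common(1)[0][0]

def combine(posteriors, rule):
    """Apply `rule` (sum, prod, max, min, median) per class and pick the best class."""
    n_classes = len(posteriors[0])
    scores = [rule([p[c] for p in posteriors]) for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)

posteriors = [
    [0.9, 0.1],   # classifier 1 is very confident in class 0
    [0.4, 0.6],   # classifiers 2 and 3 weakly prefer class 1
    [0.4, 0.6],
]

rules = {"sum": sum, "product": prod, "max": max, "min": min, "median": median}
results = {name: combine(posteriors, rule) for name, rule in rules.items()}
results["majority vote"] = majority_vote(posteriors)
for name, label in results.items():
    print(f"{name:>13}: class {label}")
```

In this example the rules disagree: majority voting and the median follow the two weak votes for class 1, while the sum, product, maximum and minimum rules follow the single confident classifier.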
Resampling
Resampling is a well-known method for generating training data and evaluating the accuracy of different classifiers. It can also be used to build classifier ensembles.
We will study:
  bagging, where multiple classifiers are built by bootstrapping the original training set, and
  boosting, where a sequence of classifiers is built by training each classifier using data sampled from a distribution derived from the empirical misclassification rate of the previous classifier.
Bagging Bagging (bootstrap aggregating) uses multiple versions of the training set, each created by bootstrapping the original training data. Each of these bootstrap data sets is used to train a different component classifier. The final classification decision is based on the vote of each component classifier. Traditionally, the component classifiers are of the same general form (e.g., all neural networks, all decision trees, etc.) where their differences are in the final parameter values due to their different sets of training patterns. 12
Bagging A classifier/learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifiers and relatively large changes in accuracy. Decision trees and neural networks are examples of unstable classifiers where a slight change in training patterns can result in radically different classifiers. In general, bagging improves recognition for unstable classifiers because it effectively averages over such discontinuities. 13
Boosting In boosting, each training pattern receives a weight that determines its probability of being selected for the training set for an individual component classifier. If a training pattern is accurately classified, its chance of being used again in a subsequent component classifier is reduced. Conversely, if the pattern is not accurately classified, its chance of being used again is increased. The final classification decision is based on the weighted sum of the votes of the component classifiers where the weight for each classifier is a function of its accuracy. 14
Adaboost
The popular AdaBoost (adaptive boosting) algorithm allows classifiers to be added continually until some desired low training error has been achieved.
Let $\alpha_t(x_i)$ denote the weight of pattern $x_i$ at trial $t$, where $\alpha_1(x_i) = 1/n$ for every $x_i$.
At each trial $t = 1, \dots, T$, a classifier $C_t$ is constructed from the given patterns under the distribution $\alpha_t$, where $\alpha_t(x_i)$ reflects the occurrence probability of $x_i$.
The error $\varepsilon_t$ of this classifier is also measured with respect to the weights, and consists of the sum of the weights of the patterns that it misclassifies.
If $\varepsilon_t$ is greater than 0.5, the trials terminate and $T$ is set to $t - 1$.
Conversely, if $C_t$ correctly classifies all patterns so that $\varepsilon_t$ is zero, the trials also terminate and $T$ becomes $t$.
Otherwise, the weights $\alpha_{t+1}$ for the next trial are generated by multiplying the weights of the patterns that $C_t$ classifies correctly by the factor $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$, and are then renormalized so that $\sum_{i=1}^{n} \alpha_{t+1}(x_i) = 1$.
The boosted classifier $C$ is obtained by summing the votes of the classifiers $C_1, \dots, C_T$, where the vote of classifier $C_t$ is weighted by $\log(1/\beta_t)$.
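The procedure described on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: the weak learner is a one-dimensional decision stump, the ten data points are a toy set, labels are in {-1, +1}, training stops when the weighted error reaches 0.5 (not only when it exceeds it), and the vote is capped when the error is exactly zero. All function names are hypothetical.

```python
# AdaBoost with the beta-factor weight update, using 1-D decision stumps.
import math

def stump_predict(thr, sign, x):
    """A stump predicts `sign` when x <= thr and `-sign` otherwise."""
    return sign if x <= thr else -sign

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best stump under `weights`."""
    sorted_x = sorted(xs)
    thresholds = [sorted_x[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(sorted_x, sorted_x[1:])]
    thresholds.append(sorted_x[-1] + 1.0)
    best = None
    for thr in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(thr, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, max_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                 # alpha_1(x_i) = 1/n
    ensemble = []                           # list of (vote, threshold, sign)
    for _ in range(max_rounds):
        eps, thr, sign = train_stump(xs, ys, weights)
        if eps >= 0.5:                      # assumption: also stop at exactly 0.5
            break
        beta = eps / (1.0 - eps)
        # Vote log(1/beta); the cap for beta = 0 is an assumption of this sketch.
        vote = math.log(1.0 / max(beta, 1e-10))
        ensemble.append((vote, thr, sign))
        if eps == 0.0:                      # perfect classifier: stop, T = t
            break
        # Shrink the weights of correctly classified patterns, then renormalize.
        weights = [w * beta if stump_predict(thr, sign, x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(vote * stump_predict(thr, sign, x) for vote, thr, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single stump separates the classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
ensemble = adaboost(xs, ys, max_rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set every individual stump misclassifies at least three points, but the weighted vote of three stumps classifies all ten training points correctly.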
Adaboost Provided that ε t is always less than 0.5, it was shown that the error rate of C on the given patterns under the original uniform distribution α 1 approaches zero exponentially quickly as T increases. A succession of weak classifiers {C t } can thus be boosted to a strong classifier C that is at least as accurate as, and usually much more accurate than, the best weak classifier on the training data. However, note that there is no guarantee of the generalization performance of a bagged or boosted classifier on unseen patterns. 16
Any Questions? End of Lecture 10. Thank you! Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/
Machine Learning Ensemble Learning II Hamid R. Rabiee Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda
  Bias-Variance-Noise Analysis
  Bootstrap
  Bagging
  AdaBoost
Bias-Variance Analysis
Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S).
The expected prediction error:
$$E[(y^* - h(x^*))^2]$$
Decompose this into bias, variance, and noise.
Bias and Variance Adapted from A Unified Overview of Ensemble Methods. 4
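Bias-Variance Analysis
Lemma: let $Z$ be a random variable with mean $\bar{Z} = E[Z]$. Then
$$E[(Z - \bar{Z})^2] = E[Z^2] - \bar{Z}^2, \quad \text{or equivalently} \quad E[Z^2] = E[(Z - \bar{Z})^2] + \bar{Z}^2$$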
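Error Decomposition
Using $E[y^*] = f(x^*)$ and the independence of $h(x^*)$ and $y^*$:
$$
\begin{aligned}
E[(h(x^*) - y^*)^2] &= E[h(x^*)^2 - 2\,h(x^*)\,y^* + y^{*2}] \\
&= E[h(x^*)^2] - 2\,E[h(x^*)]\,E[y^*] + E[y^{*2}] \\
&= E[(h(x^*) - E(h(x^*)))^2] + E(h(x^*))^2 && \text{(lemma)} \\
&\quad - 2\,E(h(x^*))\,f(x^*) \\
&\quad + E[(y^* - f(x^*))^2] + f(x^*)^2 && \text{(lemma)} \\
&= \underbrace{E[(h(x^*) - E(h(x^*)))^2]}_{\text{variance}} + \underbrace{(E(h(x^*)) - f(x^*))^2}_{\text{bias}^2} + \underbrace{E[(y^* - f(x^*))^2]}_{\text{noise}}
\end{aligned}
$$
Error Decomposition
$$
\begin{aligned}
E[(h(x^*) - y^*)^2] &= E[(h(x^*) - E(h(x^*)))^2] + (E(h(x^*)) - f(x^*))^2 + E[(y^* - f(x^*))^2] \\
&= Var(h(x^*)) + Bias(h(x^*))^2 + E[\varepsilon^2] \\
&= Var(h(x^*)) + Bias(h(x^*))^2 + \sigma^2
\end{aligned}
$$
Expected prediction error = Variance + Bias$^2$ + Noise$^2$
Bias-Variance-Noise Analysis
Variance: $E[(h(x^*) - E(h(x^*)))^2]$
  Describes how much $h(x^*)$ varies from one training set S to another.
Bias: $E(h(x^*)) - f(x^*)$
  Describes the average error of $h(x^*)$.
Noise: $E[(y^* - f(x^*))^2] = E[\varepsilon^2] = \sigma^2$
  Describes how much $y^*$ varies from $f(x^*)$.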
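A small Monte Carlo experiment can make the three terms concrete. Everything in the sketch below is an assumption chosen for illustration: the true function f(x) = x^2, Gaussian noise with standard deviation 0.3, training inputs uniform on [0, 1], a deliberately crude learner that predicts the mean training label, and the test point x* = 0.9.

```python
# Monte Carlo estimate of variance, bias and noise at a single test point x*.
import random

def f(x):
    return x * x                      # "true" function, chosen only for illustration

def simulate(x_star=0.9, n=10, sigma=0.3, trials=5000, seed=0):
    rng = random.Random(seed)
    preds, sq_errors, sq_noise = [], [], []
    for _ in range(trials):
        # Draw a fresh training sample S and fit the learner: a constant
        # predictor equal to the mean training label (a deliberately biased h).
        ys = [f(rng.random()) + rng.gauss(0.0, sigma) for _ in range(n)]
        h = sum(ys) / n
        # Draw an independent noisy test label y* at x*.
        eps = rng.gauss(0.0, sigma)
        y_star = f(x_star) + eps
        preds.append(h)
        sq_errors.append((h - y_star) ** 2)
        sq_noise.append(eps ** 2)
    mean_h = sum(preds) / trials
    variance = sum((h - mean_h) ** 2 for h in preds) / trials
    bias = mean_h - f(x_star)
    noise = sum(sq_noise) / trials
    error = sum(sq_errors) / trials
    return error, variance, bias, noise

error, variance, bias, noise = simulate()
print(f"expected prediction error : {error:.4f}")
print(f"variance + bias^2 + noise : {variance + bias ** 2 + noise:.4f}")
```

The two printed numbers agree only approximately, because the cross terms that vanish in expectation are estimated from a finite number of trials.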
Supervised Ensemble Methods
Given a data set $D = \{x_1, x_2, \dots, x_n\}$ and the corresponding labels $L = \{l_1, l_2, \dots, l_n\}$, an ensemble approach computes:
  A set of classifiers $\{f_1, f_2, \dots, f_k\}$, each of which maps data to a class label: $f_j(x) = l$
  A combination of classifiers $f^*$ which minimizes generalization error: $f^*(x) = w_1 f_1(x) + w_2 f_2(x) + \dots + w_k f_k(x)$
Bootstrap
Let the original sample be $L = (x_1, x_2, \dots, x_n)$.
Repeat B times:
  Generate a sample $L_k$ of size n from L by sampling with replacement.
  Compute $w_k$ for $f_k(x)$.
Now we end up with the bootstrap values $W = (w_1, w_2, \dots, w_B)$.
Use these values for calculating all the quantities of interest (e.g., standard deviation, confidence intervals).
Bootstrap-Example
Original sample:
  X = (3.12, 0, 1.57, 19.67, 0.22, 2.20), Mean = 4.46
Bootstrap samples:
  X1 = (1.57, 0.22, 19.67, 0, 0.22, 3.12), Mean = 4.13
  X2 = (0, 2.20, 2.20, 2.20, 19.67, 1.57), Mean = 4.64
  X3 = (0.22, 3.12, 1.57, 3.12, 2.20, 0.22), Mean = 1.74
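The example can be reproduced with a short resampling loop. The sketch below uses the original sample from the slide; the number of bootstrap samples (B = 1000) and the random seed are arbitrary choices, and the function name is hypothetical.

```python
# Bootstrap distribution of the sample mean for the sample on the slide.
import random
from statistics import mean, stdev

def bootstrap_means(sample, B, rng):
    """Draw B resamples of size n with replacement and return their means."""
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(B)]

X = [3.12, 0, 1.57, 19.67, 0.22, 2.20]      # original sample from the slide
rng = random.Random(0)                      # fixed seed, only for repeatability
boot = bootstrap_means(X, B=1000, rng=rng)

print("mean of the original sample:", round(mean(X), 2))
print("bootstrap estimate of the standard error of the mean:", round(stdev(boot), 2))
```

The spread of the bootstrap means estimates how much the sample mean would vary from sample to sample; the single large value 19.67 makes that spread wide for a sample of only six points.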
Bootstrap
The bootstrap does not replace or add to the original data.
We use the bootstrap distribution as a way to estimate the variation in a statistic based on the original data.
Bootstrapping: one original sample → B bootstrap samples → bootstrap distribution
Bootstrap distributions usually approximate the shape, spread, and bias of the actual sampling distribution.
Bootstrap distributions are centered at the value of the statistic from the original sample plus any bias.
Bootstrap
Cases where the bootstrap does not apply:
  Small data sets: the original sample is not a good approximation of the population.
  Dirty data: outliers add variability to our estimates.
  Dependence structures (e.g., time series, spatial problems): the bootstrap is based on the assumption of independence.
Bootstrap
How many bootstrap samples are needed? The choice of B depends on:
  Computer availability
  Type of the problem: standard errors, confidence intervals, ...
  Complexity of the problem
Bagging
Bagging stands for bootstrap aggregating. It is an ensemble method: a method of combining multiple predictors.
Let the original training data be L.
Repeat B times:
  Get a bootstrap sample $L_k$ from L.
  Train a predictor using $L_k$.
Combine the B predictors by:
  Voting (for a classification problem)
  Averaging (for an estimation problem)
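The bagging loop above can be sketched as follows. The base learner (a one-dimensional threshold classifier), the toy data with two flipped labels, B = 25 and the seed are all assumptions made for illustration; the function names are hypothetical.

```python
# Bagging: train B predictors on bootstrap samples and combine them by voting.
import random
from collections import Counter

def train_stump(data):
    """Fit a 1-D threshold classifier on (x, label) pairs with labels in {0, 1}."""
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for thr in thresholds:
        for right_label in (0, 1):
            errors = sum((right_label if x > thr else 1 - right_label) != y
                         for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, thr, right_label)
    _, thr, right_label = best
    return lambda x: right_label if x > thr else 1 - right_label

def bagging(data, B, rng):
    """Train B predictors, each on a bootstrap sample L_k drawn from `data`."""
    predictors = []
    for _ in range(B):
        sample = rng.choices(data, k=len(data))   # sampling with replacement
        predictors.append(train_stump(sample))
    return predictors

def vote(predictors, x):
    """Majority vote of the component classifiers."""
    return Counter(p(x) for p in predictors).most_common(1)[0][0]

# Toy data: label 1 when x > 5, with two labels flipped to mimic noise.
data = [(i * 0.5, 1 if i * 0.5 > 5 else 0) for i in range(21)]
data[4] = (2.0, 1)
data[16] = (8.0, 0)

predictors = bagging(data, B=25, rng=random.Random(0))
print("vote at x = 1.0:", vote(predictors, 1.0))
print("vote at x = 9.0:", vote(predictors, 9.0))
```

For an estimation (regression) problem the `vote` step would be replaced by the average of the B predictions.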
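Bagging-Voting
Linear combination (here $L$ denotes the number of combined predictors and $d_j$ the output of predictor $j$):
$$y = \sum_{j=1}^{L} w_j d_j, \qquad w_j \ge 0 \ \text{ and } \ \sum_{j=1}^{L} w_j = 1$$
Classification (with $d_{ji}$ the output of predictor $j$ for class $i$):
$$y_i = \sum_{j=1}^{L} w_j d_{ji}$$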
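Bagging Error Reduction
Under mean squared error, bagging reduces variance and leaves bias unchanged.
Consider the idealized bagging estimator $\bar{f}(x) = E[\hat{f}_z(x)]$.
The error is
$$
\begin{aligned}
E[Y - \hat{f}_z(x)]^2 &= E[Y - \bar{f}(x) + \bar{f}(x) - \hat{f}_z(x)]^2 \\
&= E[Y - \bar{f}(x)]^2 + E[\bar{f}(x) - \hat{f}_z(x)]^2 \\
&\ge E[Y - \bar{f}(x)]^2
\end{aligned}
$$
The cross term vanishes because $E[\hat{f}_z(x)] = \bar{f}(x)$.
Bagging usually decreases MSE.
Bagging reduces the variance of high-variance learners (e.g., decision trees).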
Boosting Boosting reduces the bias of high bias learners. 18
AdaBoost AdaBoost algorithm Some slides have been adapted from slides of Tommi Jaakkola, MIT CSAIL 19
AdaBoost algorithm 20
AdaBoost Original training set: equal weights to all training samples Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 21
AdaBoost ROUND 1 ε = error rate of classifier α = weight of classifier Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 22
AdaBoost ROUND 2 Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 23
AdaBoost ROUND 3 Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 24
AdaBoost Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 25
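Boosting
For classifier $i$, its error is:
$$\varepsilon_i = \frac{\sum_{j=1}^{N} w_j\, \delta(C_i(x_j) \ne y_j)}{\sum_{j=1}^{N} w_j}$$
The classifier's importance is represented as:
$$\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$$
The weight of each record is updated as:
$$w_j^{(i+1)} = \frac{w_j^{(i)} \exp\left(-\alpha_i\, y_j\, C_i(x_j)\right)}{Z^{(i)}}$$
where $Z^{(i)}$ is the normalization factor that makes the updated weights sum to 1.
Final combination:
$$C^*(x) = \arg\max_{y} \sum_{i=1}^{K} \alpha_i\, \delta(C_i(x) = y)$$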
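One round of these updates, worked through on made-up numbers. The labels, the classifier outputs (in {-1, +1}) and the uniform starting weights are assumptions for illustration; `boosting_round` is a hypothetical helper name.

```python
# One boosting round: weighted error, classifier importance, and weight update.
import math

def boosting_round(weights, labels, predictions):
    """Labels and predictions take values in {-1, +1}."""
    eps = (sum(w for w, y, c in zip(weights, labels, predictions) if c != y)
           / sum(weights))
    alpha = 0.5 * math.log((1 - eps) / eps)
    # exp(-alpha * y * C(x)) is exp(-alpha) when correct and exp(+alpha) when wrong.
    unnorm = [w * math.exp(-alpha * y * c)
              for w, y, c in zip(weights, labels, predictions)]
    z = sum(unnorm)                      # normalization factor Z
    return eps, alpha, [u / z for u in unnorm]

labels      = [+1, +1, +1, -1, -1, -1, -1, +1, +1, +1]
predictions = [+1, +1, +1, -1, -1, -1, -1, -1, -1, -1]   # hypothetical classifier
weights = [0.1] * 10

eps, alpha, new_weights = boosting_round(weights, labels, predictions)
print(f"error = {eps:.3f}, importance = {alpha:.4f}")
print("updated weights:", [round(w, 4) for w in new_weights])
```

After the update, the three misclassified records together carry half of the total weight, so the next classifier cannot do well by repeating the same mistakes.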
Boosting
Among the classifiers of the form
$$f(x) = \sum_{i=1}^{K} \alpha_i C_i(x)$$
we seek to minimize the exponential loss function:
$$\sum_{j=1}^{N} \exp\left(-y_j f(x_j)\right)$$
Not robust in noisy settings.
Adaboost properties: exponential loss After each boosting iteration, assuming we can find a component classifier whose weighted error is better than chance, the combined classifier is guaranteed to have a lower exponential loss over the training examples 29
Adaboost properties: training error The boosting iterations also decrease the classification error of the combined classifier over the training examples. 30
Adaboost properties: training error
The training classification error has to go down exponentially fast if the weighted errors of the component classifiers, $\epsilon_k$, are strictly better than chance, $\epsilon_k < 0.5$:
$$err(\hat{h}_m) \le \prod_{k=1}^{m} 2\sqrt{\epsilon_k (1 - \epsilon_k)}$$
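To see how quickly this bound shrinks, it can be evaluated for a sequence of weighted errors. The constant weighted error of 0.3 per round is a made-up value (in practice the weighted errors change from round to round), and the function name is hypothetical.

```python
# Upper bound on the training error: product over rounds of 2*sqrt(eps*(1 - eps)).
import math

def training_error_bound(weighted_errors):
    return math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in weighted_errors)

for m in (1, 10, 50):
    bound = training_error_bound([0.3] * m)   # assume every round achieves 0.3
    print(f"m = {m:2d} rounds: bound = {bound:.4f}")
```

Each factor is strictly below 1 whenever the weighted error is strictly below 0.5, so the product decays geometrically in the number of rounds; at exactly 0.5 the factor equals 1 and the bound stops improving.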
Adaboost properties: weighted error Weighted error of each new component classifier tends to increase as a function of boosting iterations. 32
Training and test errors Training and test errors of the combined classifier Why should the test error go down after we already have zero training error? 33
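AdaBoost and margin
We can write the combined classifier in a more useful form by dividing the predictions by the total number of votes:
$$\hat{h}_m(x) = \frac{\sum_{k=1}^{m} \alpha_k h_k(x)}{\sum_{k=1}^{m} \alpha_k}$$
where $h_k(x) \in \{-1, +1\}$ is the prediction of the $k$-th component classifier and $\alpha_k$ its number of votes.
This allows us to define a clear notion of the voting margin that the combined classifier achieves for each training example:
$$margin(x_i) = y_i\, \hat{h}_m(x_i)$$
The margin lies in $[-1, 1]$ and is negative for all misclassified examples.
Successive boosting iterations still improve the majority vote or margin for the training examples.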
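A minimal sketch of the margin computation: the weighted vote is normalized by the total number of votes and multiplied by the true label. The three component classifiers, their vote weights and the labels are made-up values, and `margins` is a hypothetical helper name.

```python
# Voting margins of a combined classifier on the training examples.

def margins(alphas, component_preds, labels):
    """component_preds[k][i] is the {-1, +1} prediction of classifier k on example i."""
    total = sum(alphas)
    result = []
    for i, y in enumerate(labels):
        h = sum(a * preds[i] for a, preds in zip(alphas, component_preds)) / total
        result.append(y * h)             # positive when the vote agrees with y
    return result

labels = [+1, +1, +1, -1, -1, -1, -1, +1, +1, +1]
component_preds = [
    [+1, +1, +1, -1, -1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1, -1, +1, +1, +1],
    [+1, +1, +1, +1, +1, +1, +1, +1, +1, +1],
]
alphas = [0.42, 0.65, 0.75]              # hypothetical vote weights

m = margins(alphas, component_preds, labels)
print([round(v, 3) for v in m])
```

All ten margins are positive here, so every example is classified correctly, but none of them is close to 1: further boosting rounds can still increase the margins after the training error has reached zero.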
AdaBoost and margin Cumulative distributions of margin values: 35
Any Questions? End of Lecture 11. Thank you! Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/