
Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 /

Agenda Combining Classifiers Empirical view Theoretical view Resampling Bagging Boosting Adaboost 2

Combining Classifiers (Empirical View) Just as different features capture different properties of a pattern, different classifiers capture different structures and relationships among these patterns in the feature space. An empirical comparison of different classifiers can help us choose the best one for the problem at hand. 3

Combining Classifiers (Empirical View) However, although most of the classifiers may have similar error rates, the sets of patterns misclassified by different classifiers do not necessarily overlap. Rather than relying on a single decision, combining the advantages of different classifiers is an intuitively promising way to improve the overall accuracy of classification. Such combinations are variously called combined classifiers, ensemble classifiers, mixture-of-experts models, or pooled classifiers. 4

Combining Classifiers (Empirical View) Some reasons for combining multiple classifiers to solve a given classification problem are: Access to different classifiers, each developed in a different context and for an entirely different representation/description of the same problem. Availability of multiple training sets, each collected at a different time or in a different environment, possibly even using different features. Local performance of different classifiers, where each classifier may have its own region of the feature space in which it performs best. Different performances due to different initializations and the randomness inherent in the training procedure. 5

Combining Classifiers (Theoretical View) At a single data point, the quadratic error of the ensemble, $(f_{ens} - d)^2$, is less than or equal to the weighted average quadratic error of the individuals, $\sum_i w_i (f_i - d)^2$:

$(f_{ens} - d)^2 = \sum_i w_i (f_i - d)^2 - \sum_i w_i (f_i - f_{ens})^2$, where $f_{ens} = \sum_i w_i f_i$ and the weights $w_i$ are non-negative and sum to one.

The first term is the weighted average error of the individuals. The second term is the diversity term, measuring the amount of variability among the ensemble members' answers for this pattern. 6

Combining Classifiers (Theoretical View) This tells us that taking the combination of several predictors is, on average over many patterns, better than a method that selects one of the predictors at random. We need to get the right balance between diversity (the diversity term) and individual accuracy (the average error term) in order to achieve the lowest overall ensemble error. All successful ensemble methods encourage diversity to some extent. 7
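
To make the decomposition concrete, here is a minimal NumPy sketch that checks the identity $(f_{ens} - d)^2 = \sum_i w_i (f_i - d)^2 - \sum_i w_i (f_i - f_{ens})^2$ at a single data point; the predictions, weights, and target below are made-up numbers, not values from the lecture.

```python
import numpy as np

# Hypothetical example: three regressors predict the same point with target d.
f = np.array([2.1, 1.8, 2.6])   # individual predictions f_i
w = np.array([0.5, 0.3, 0.2])   # convex combination weights (sum to 1)
d = 2.0                         # true target at this point

f_ens = np.dot(w, f)            # ensemble prediction f_ens = sum_i w_i f_i

ensemble_err = (f_ens - d) ** 2               # left-hand side
avg_indiv_err = np.dot(w, (f - d) ** 2)       # weighted average individual error
diversity = np.dot(w, (f - f_ens) ** 2)       # diversity (ambiguity) term

# The decomposition says the two quantities below agree exactly (~0.0121 here),
# so the ensemble error is never larger than the weighted average error.
print(ensemble_err, avg_indiv_err - diversity)
```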

Combining Classifiers In summary, we may have different feature sets, training sets, classification methods, and training sessions, all resulting in a set of classifiers whose outputs may be combined. Combination architectures can be grouped as: Parallel: all classifiers are invoked independently and then their results are combined by a combiner. Serial (cascading): individual classifiers are invoked in a linear sequence where the number of possible classes for a given pattern is gradually reduced. Hierarchical (tree): individual classifiers are combined into a structure, which is similar to that of a decision tree, where the nodes are associated with the classifiers. 8

Combining Classifiers Selecting and training individual classifiers: combination of classifiers is especially useful if the individual classifiers are largely independent; this can be explicitly forced by using different training sets, different features, and different classifiers. Combiner: some combiners are static, with no training required, while others are trainable. Some are adaptive, in that the decisions of individual classifiers are evaluated (weighted) depending on the input pattern, whereas non-adaptive ones treat all input patterns the same. Different combiners use different types of output from the individual classifiers: confidence, rank, or abstract. 9

Combining Classifiers Examples of classifier combination schemes are: Majority voting (each classifier makes a binary decision (vote) about each class, and the final decision is made in favor of the class with the largest number of votes); Sum, product, maximum, minimum, and median of the posterior probabilities computed by the individual classifiers; Class ranking (each class receives m ranks from m classifiers, and the best (minimum) of these ranks is the final score for that class); Weighted combination of classifiers. We will study different combination schemes using a Bayesian framework and resampling. 10
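
As a small illustration of these schemes, the sketch below applies majority voting, the sum, product, and median rules, and class ranking to the hypothetical posterior outputs of three classifiers over four classes (the numbers are invented for the example).

```python
import numpy as np

# Hypothetical outputs of m = 3 classifiers for one pattern over c = 4 classes:
# posteriors[i, k] is classifier i's estimate of P(class k | x).
posteriors = np.array([[0.60, 0.20, 0.10, 0.10],
                       [0.30, 0.40, 0.20, 0.10],
                       [0.50, 0.10, 0.30, 0.10]])

# Majority voting: each classifier votes for its top class.
votes = posteriors.argmax(axis=1)
majority = np.bincount(votes, minlength=4).argmax()

# Sum, product, and median rules applied to the posteriors.
sum_rule = posteriors.sum(axis=0).argmax()
product_rule = posteriors.prod(axis=0).argmax()
median_rule = np.median(posteriors, axis=0).argmax()

# Class ranking: each class receives a rank from each classifier (1 = best);
# the best (minimum) rank a class obtains is its final score.
ranks = (-posteriors).argsort(axis=1).argsort(axis=1) + 1
rank_rule = ranks.min(axis=0).argmin()

print(majority, sum_rule, product_rule, median_rule, rank_rule)
```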

Resampling Resampling is a well-known method for generating training data and evaluating the accuracy of different classifiers. It can also be used to build classifier ensembles. We will study: bagging, where multiple classifiers are built by bootstrapping the original training set, and boosting, where a sequence of classifiers is built by training each classifier on data sampled from a distribution derived from the empirical misclassification rate of the previous classifier. 11

Bagging Bagging (bootstrap aggregating) uses multiple versions of the training set, each created by bootstrapping the original training data. Each of these bootstrap data sets is used to train a different component classifier. The final classification decision is based on the vote of each component classifier. Traditionally, the component classifiers are of the same general form (e.g., all neural networks, all decision trees, etc.) where their differences are in the final parameter values due to their different sets of training patterns. 12
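
A minimal sketch of this procedure, assuming decision stumps as the component classifiers and a made-up two-dimensional data set: the training set is bootstrapped B times, one stump is fit per replicate, and the final decision is a majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Pick the (feature, threshold, sign) decision stump with lowest training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                err = np.mean(np.where(X[:, j] > t, s, -s) != y)
                if best is None or err < best[0]:
                    best = (err, (j, t, s))
    return best[1]

def predict_stump(stump, X):
    j, t, s = stump
    return np.where(X[:, j] > t, s, -s)

# Hypothetical two-dimensional toy data with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Bagging: train one component classifier per bootstrap replicate, then vote.
B = 25
stumps = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.sum([predict_stump(s, X) for s in stumps], axis=0)
print("bagged training accuracy:", np.mean(np.sign(votes) == y))
```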

Bagging A classifier/learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifiers and relatively large changes in accuracy. Decision trees and neural networks are examples of unstable classifiers where a slight change in training patterns can result in radically different classifiers. In general, bagging improves recognition for unstable classifiers because it effectively averages over such discontinuities. 13

Boosting In boosting, each training pattern receives a weight that determines its probability of being selected for the training set for an individual component classifier. If a training pattern is accurately classified, its chance of being used again in a subsequent component classifier is reduced. Conversely, if the pattern is not accurately classified, its chance of being used again is increased. The final classification decision is based on the weighted sum of the votes of the component classifiers where the weight for each classifier is a function of its accuracy. 14

Adaboost The popular AdaBoost (adaptive boosting) algorithm allows classifiers to be added continuously until some desired low training error has been achieved. Let $\alpha_t(x_i)$ denote the weight of pattern $x_i$ at trial $t$, where $\alpha_1(x_i) = 1/n$ for every $x_i$. At each trial $t = 1, \dots, T$, a classifier $C_t$ is constructed from the given patterns under the distribution $\alpha_t$, where $\alpha_t(x_i)$ reflects the occurrence probability of $x_i$. The error $\varepsilon_t$ of this classifier is also measured with respect to the weights, and is the sum of the weights of the patterns that it misclassifies. If $\varepsilon_t$ is greater than 0.5, the trials terminate and $T$ is set to $t - 1$. Conversely, if $C_t$ correctly classifies all patterns, so that $\varepsilon_t$ is zero, the trials also terminate and $T$ becomes $t$. Otherwise, the weights $\alpha_{t+1}$ for the next trial are generated by multiplying the weights of the patterns that $C_t$ classifies correctly by the factor $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and then renormalizing so that $\sum_{i=1}^{n} \alpha_{t+1}(x_i) = 1$. The boosted classifier $C$ is obtained by summing the votes of the classifiers $C_1, \dots, C_T$, where the vote of classifier $C_t$ is weighted by $\log(1/\beta_t)$. 15
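
A minimal sketch of the procedure just described, assuming decision stumps as the component classifiers $C_t$ and a made-up toy data set: correctly classified patterns have their weights multiplied by $\beta_t = \varepsilon_t/(1-\varepsilon_t)$, the weights are renormalized, and each classifier votes with weight $\log(1/\beta_t)$. For simplicity the loop simply stops when $\varepsilon_t \ge 0.5$ or $\varepsilon_t = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_weighted_stump(X, y, w):
    """Decision stump minimizing the weighted error; stands in for C_t."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                err = np.sum(w[np.where(X[:, j] > t, s, -s) != y])
                if best is None or err < best[0]:
                    best = (err, (j, t, s))
    return best   # (weighted error, stump)

def predict_stump(stump, X):
    j, t, s = stump
    return np.where(X[:, j] > t, s, -s)

# Hypothetical toy data with labels in {-1, +1}.
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

n, T = len(X), 20
alpha = np.full(n, 1.0 / n)            # alpha_1(x_i) = 1/n
classifiers, votes = [], []

for t in range(T):
    eps, stump = fit_weighted_stump(X, y, alpha)
    if eps >= 0.5 or eps == 0.0:       # stopping rules from the slide (simplified)
        break
    beta = eps / (1.0 - eps)
    pred = predict_stump(stump, X)
    alpha[pred == y] *= beta           # down-weight correctly classified patterns
    alpha /= alpha.sum()               # renormalize so the weights sum to 1
    classifiers.append(stump)
    votes.append(np.log(1.0 / beta))   # vote weight log(1/beta_t)

# Boosted classifier C: weighted vote of the component classifiers.
score = sum(v * predict_stump(c, X) for c, v in zip(classifiers, votes))
print("boosted training accuracy:", np.mean(np.sign(score) == y))
```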

Adaboost Provided that $\varepsilon_t$ is always less than 0.5, it was shown that the error rate of C on the given patterns under the original uniform distribution $\alpha_1$ approaches zero exponentially quickly as T increases. A succession of weak classifiers {C_t} can thus be boosted to a strong classifier C that is at least as accurate as, and usually much more accurate than, the best weak classifier on the training data. However, note that there is no guarantee of the generalization performance of a bagged or boosted classifier on unseen patterns. 16

Any Question? End of Lecture 10 Thank you! Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/ 17

Machine Learning Ensemble Learning II Hamid R. Rabiee Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 /

Agenda Bias-Variance-Noise Analysis Bootstrap Bagging AdaBoost 2

Bias-Variance Analysis Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S). The expected prediction error: $E[(y^* - h(x^*))^2]$. Decompose this into bias, variance, and noise. 3

Bias and Variance Adapted from A Unified Overview of Ensemble Methods. 4

Bias-Variance Analysis Lemma: for any random variable Z with $\bar{Z} = E[Z]$ and any constant a, $E[(Z - a)^2] = E[(Z - \bar{Z})^2] + (\bar{Z} - a)^2$. 5

Error Decomposition
$E[(h(x^*) - y^*)^2] = E[(h(x^*) - \bar{h}(x^*))^2] + E[(\bar{h}(x^*) - y^*)^2]$ (lemma, with $Z = h(x^*)$ and $a = y^*$)
$= E[(h(x^*) - \bar{h}(x^*))^2] + (\bar{h}(x^*) - f(x^*))^2 + E[(y^* - f(x^*))^2]$ (lemma, with $Z = y^*$, $a = \bar{h}(x^*)$, and $E[y^*] = f(x^*)$)
$=$ variance $+$ bias$^2$ $+$ noise, where $\bar{h}(x^*) = E[h(x^*)]$. 6

Error Decomposition
$E[(h(x^*) - y^*)^2] = E[(h(x^*) - E[h(x^*)])^2] + (E[h(x^*)] - f(x^*))^2 + E[(y^* - f(x^*))^2]$
$= \mathrm{Var}(h(x^*)) + \mathrm{Bias}(h(x^*))^2 + E[\varepsilon^2]$
Expected prediction error = Variance + Bias$^2$ + Noise 7

Bias-Variance-Noise Analysis Variance: $E[(h(x^*) - E[h(x^*)])^2]$ describes how much $h(x^*)$ varies from one training set S to another. Bias: $E[h(x^*)] - f(x^*)$ describes the average error of $h(x^*)$. Noise: $E[(y^* - f(x^*))^2] = E[\varepsilon^2] = \sigma^2$ describes how much $y^*$ varies from $f(x^*)$. 8
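
These three terms can be estimated by simulation. The sketch below assumes a known population (a sine function plus Gaussian noise, with a cubic polynomial fit playing the role of h); it draws many training sets, computes variance, bias², and noise at a fixed test point x*, and compares their sum with a direct Monte Carlo estimate of the expected prediction error.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # hypothetical true regression function
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # noise standard deviation, so noise = sigma**2
x_star = 0.25                    # fixed test input x*
S, n, degree = 2000, 30, 3       # number of training sets, their size, model degree

h_star = np.empty(S)
for s in range(S):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)       # training sample drawn from P(S)
    coef = np.polyfit(x, y, degree)          # fit h on this training set
    h_star[s] = np.polyval(coef, x_star)     # h(x*) for this training set

variance = np.var(h_star)                    # E[(h(x*) - E[h(x*)])^2]
bias2 = (h_star.mean() - f(x_star)) ** 2     # (E[h(x*)] - f(x*))^2
noise = sigma ** 2                           # E[(y* - f(x*))^2]

# Direct Monte Carlo estimate of the expected prediction error E[(h(x*) - y*)^2].
y_star = f(x_star) + rng.normal(0, sigma, S)
expected_err = np.mean((h_star - y_star) ** 2)

print(expected_err, variance + bias2 + noise)   # the two should be close
```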

Supervised Ensemble Methods Given a data set $D = \{x_1, x_2, \dots, x_n\}$ and the corresponding labels $L = \{l_1, l_2, \dots, l_n\}$, an ensemble approach computes: a set of classifiers $\{f_1, f_2, \dots, f_k\}$, each of which maps data to a class label, $f_j(x) = l$; and a combination of classifiers $f^*$ which minimizes the generalization error: $f^*(x) = w_1 f_1(x) + w_2 f_2(x) + \dots + w_k f_k(x)$. 9

Bootstrap Let the original sample be $L = (x_1, x_2, \dots, x_n)$. Repeat B times: generate a sample $L_k$ of size n from L by sampling with replacement, and compute the statistic of interest $w_k$ on $L_k$. We end up with the bootstrap values $W = (w_1, w_2, \dots, w_B)$. Use these values for calculating all the quantities of interest (e.g., standard deviation, confidence intervals). 10

Bootstrap-Example Original sample: X = (3.12, 0, 1.57, 19.67, 0.22, 2.20), mean = 4.46. Bootstrap samples: X1 = (1.57, 0.22, 19.67, 0, 0.22, 3.12), mean = 4.13; X2 = (0, 2.20, 2.20, 2.20, 19.67, 1.57), mean = 4.64; X3 = (0.22, 3.12, 1.57, 3.12, 2.20, 0.22), mean = 1.74. 11
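
The example can be reproduced in code. The sketch below draws B = 1000 bootstrap samples from the original sample shown above and summarizes the bootstrap distribution of the mean; the standard error and interval it prints are illustrative outputs, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)

# Original sample from the slide; its mean is about 4.46.
X = np.array([3.12, 0.0, 1.57, 19.67, 0.22, 2.20])

B = 1000                                              # number of bootstrap samples
boot_means = np.array([
    rng.choice(X, size=len(X), replace=True).mean()   # resample with replacement
    for _ in range(B)
])

# The bootstrap distribution of the mean gives, e.g., a standard error estimate
# and a percentile confidence interval for the population mean.
print("original mean:", X.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_means, [2.5, 97.5]))
```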

Bootstrap The bootstrap does not replace or add to the original data. We use the bootstrap distribution as a way to estimate the variation of a statistic computed from the original data. Bootstrapping: one original sample → B bootstrap samples → bootstrap distribution. Bootstrap distributions usually approximate the shape, spread, and bias of the actual sampling distribution. Bootstrap distributions are centered at the value of the statistic from the original sample, plus any bias. 12

Bootstrap Cases where the bootstrap does not apply: Small data sets: the original sample is not a good approximation of the population. Dirty data: outliers add variability to our estimates. Dependence structures (e.g., time series, spatial problems): the bootstrap is based on the assumption of independence. 13

Bootstrap How many bootstrap samples are needed? The choice of B depends on the available computing resources, the type of problem (standard errors, confidence intervals, etc.), and the complexity of the problem. 14

Bagging Bagging stands for bootstrap aggregating. It is an ensemble method: a method of combining multiple predictors. Let the original training data be L Repeat B times: Get a bootstrap sample L k from L. Train a predictor using L k. Combine B predictors by Voting (for classification problem) Averaging (for estimation problem) 15

Bagging-Voting Linear combination: $y = \sum_{j=1}^{L} w_j d_j$, with $w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$. Classification: $y_i = \sum_{j=1}^{L} w_j d_{ji}$. 16

Bagging Error Reduction Under mean squared error, bagging reduces variance and leaves bias unchanged. Consider the idealized bagging estimator $f_{ag}(x) = E_Z[\hat{f}_Z(x)]$. The error is
$E[(Y - \hat{f}_Z(x))^2] = E[(Y - f_{ag}(x) + f_{ag}(x) - \hat{f}_Z(x))^2] = E[(Y - f_{ag}(x))^2] + E[(f_{ag}(x) - \hat{f}_Z(x))^2] \ge E[(Y - f_{ag}(x))^2]$
so bagging usually decreases MSE. Bagging reduces the variance of high-variance learners (e.g., decision trees). 17
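
A small simulation illustrating the claim, under assumptions of my own choosing: a 1-nearest-neighbour regressor (a stand-in for a high-variance learner) is compared with its bagged version at a single test point over many training sets drawn from a known population; the bagged version should show noticeably lower variance with roughly unchanged bias.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(2 * np.pi * x)

def one_nn(x_train, y_train, x0):
    """1-nearest-neighbour regression: a simple high-variance learner."""
    return y_train[np.argmin(np.abs(x_train - x0))]

sigma, n, B, S = 0.3, 50, 50, 500
x_star = 0.3                                  # fixed test point

single, bagged = np.empty(S), np.empty(S)
for s in range(S):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)        # a fresh training set each time

    single[s] = one_nn(x, y, x_star)

    # Bagged predictor: average the 1-NN predictions over B bootstrap replicates.
    idx = rng.integers(0, n, size=(B, n))
    bagged[s] = np.mean([one_nn(x[i], y[i], x_star) for i in idx])

for name, h in (("single", single), ("bagged", bagged)):
    print(name,
          "variance:", round(float(np.var(h)), 4),
          "bias^2:", round(float((h.mean() - f(x_star)) ** 2), 4))
```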

Boosting Boosting reduces the bias of high bias learners. 18

AdaBoost AdaBoost algorithm Some slides have been adapted from slides of Tommi Jaakkola, MIT CSAIL 19

AdaBoost algorithm 20

AdaBoost Original training set: equal weights to all training samples Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 21

AdaBoost ROUND 1 ε = error rate of classifier α = weight of classifier Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 22

AdaBoost ROUND 2 Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 23

AdaBoost ROUND 3 Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 24

AdaBoost Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire 25

Boosting For classifier i, its error is
$\varepsilon_i = \frac{\sum_{j=1}^{N} w_j \,\delta\!\left(C_i(x_j) \neq y_j\right)}{\sum_{j=1}^{N} w_j}$
The classifier's importance is represented as
$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$
The weight of each record is updated as
$w_j^{(i+1)} = \frac{w_j^{(i)} \exp\!\left(-\alpha_i \, y_j \, C_i(x_j)\right)}{Z^{(i)}}$
Final combination:
$C^*(x) = \arg\max_{y} \sum_{i=1}^{K} \alpha_i \,\delta\!\left(C_i(x) = y\right)$ 26

Boosting Among the classifiers of the form
$f(x) = \sum_{i=1}^{K} \alpha_i C_i(x)$
we seek to minimize the exponential loss function
$\sum_{j=1}^{N} \exp\!\left(-y_j f(x_j)\right)$
which is not robust in noisy settings. 27

Boosting In boosting, each training pattern receives a weight that determines its probability of being selected for the training set for an individual component classifier. If a training pattern is accurately classified, its chance of being used again in a subsequent component classifier is reduced. Conversely, if the pattern is not accurately classified, its chance of being used again is increased. The final classification decision is based on the weighted sum of the votes of the component classifiers where the weight for each classifier is a function of its accuracy. 28

Adaboost properties: exponential loss After each boosting iteration, assuming we can find a component classifier whose weighted error is better than chance, the combined classifier is guaranteed to have a lower exponential loss over the training examples 29

Adaboost properties: training error The boosting iterations also decrease the classification error of the combined classifier over the training examples. 30

Adaboost properties: training error The training classification error has to go down exponentially fast if the weighted errors $\varepsilon_k$ of the component classifiers are strictly better than chance, $\varepsilon_k < 0.5$:
$\mathrm{err}(\hat{h}_m) \le \prod_{k=1}^{m} 2\sqrt{\varepsilon_k (1 - \varepsilon_k)}$ 31
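
This bound can be checked numerically. The sketch below runs AdaBoost with decision stumps (using the error, importance, and weight-update formulas given earlier) on a made-up toy data set and prints the training error of the combined classifier next to the bound $\prod_{k=1}^{m} 2\sqrt{\varepsilon_k(1-\varepsilon_k)}$.

```python
import numpy as np

rng = np.random.default_rng(5)

def fit_weighted_stump(X, y, w):
    """Decision stump minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                err = np.sum(w[np.where(X[:, j] > t, s, -s) != y])
                if best is None or err < best[0]:
                    best = (err, (j, t, s))
    return best

def predict_stump(stump, X):
    j, t, s = stump
    return np.where(X[:, j] > t, s, -s)

# Hypothetical toy data (labels in {-1, +1}) that no single stump can separate.
X = rng.normal(size=(150, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2, 1, -1)

n, M = len(X), 30
w = np.full(n, 1.0 / n)        # pattern weights
F = np.zeros(n)                # running combined score
bound = 1.0

for m in range(M):
    eps, stump = fit_weighted_stump(X, y, w)      # weighted error (<= 0.5, both signs tried)
    a = 0.5 * np.log((1 - eps) / eps)             # classifier importance
    pred = predict_stump(stump, X)
    w = w * np.exp(-a * y * pred)                 # weight update
    w /= w.sum()
    F += a * pred
    bound *= 2 * np.sqrt(eps * (1 - eps))
    if (m + 1) % 10 == 0:
        train_err = np.mean(np.sign(F) != y)
        print(f"round {m + 1}: training error {train_err:.3f} <= bound {bound:.3f}")
```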

Adaboost properties: weighted error Weighted error of each new component classifier tends to increase as a function of boosting iterations. 32

Training and test errors Training and test errors of the combined classifier Why should the test error go down after we already have zero training error? 33

AdaBoost and margin We can write the combined classifier in a more useful form by dividing the predictions by the total sum of the votes: $\tilde{h}_m(x) = \sum_{k=1}^{m} \alpha_k C_k(x) \,/\, \sum_{k=1}^{m} \alpha_k$. This allows us to define a clear notion of the voting margin that the combined classifier achieves for each training example: $\mathrm{margin}(x_i) = y_i \,\tilde{h}_m(x_i)$. The margin lies in $[-1, 1]$ and is negative for all misclassified examples. Successive boosting iterations still improve the majority vote, or margin, for the training examples. 34
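
A tiny numeric illustration of the margin, with invented vote weights and component predictions: the combined score is normalized by the sum of the α's, so every margin lies in [-1, 1], and the margin is negative exactly for the misclassified examples.

```python
import numpy as np

# Hypothetical boosting output: vote weights alpha_k and the +/-1 predictions
# of each component classifier on five training examples.
alphas = np.array([0.9, 0.6, 0.4])
preds = np.array([[+1, +1, -1, +1, -1],     # classifier 1
                  [+1, -1, -1, +1, +1],     # classifier 2
                  [+1, +1, +1, -1, -1]])    # classifier 3
y = np.array([+1, +1, -1, -1, -1])

# Normalize the combined prediction by the total sum of the votes so that it
# lies in [-1, 1]; the voting margin of example i is y_i times this value.
combined = alphas @ preds / alphas.sum()
margins = y * combined

print("margins:", np.round(margins, 3))
print("misclassified:", margins < 0)        # negative margin <=> misclassified
```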

AdaBoost and margin Cumulative distributions of margin values: 35

Any Question? End of Lecture 11 Thank you! Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/ 36