Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /


2 Agenda: Combining Classifiers (Empirical view, Theoretical view); Resampling; Bagging; Boosting; AdaBoost

3 Combining Classifiers (Empirical View) Just as different features capture different properties of a pattern, different classifiers capture different structures and relationships of these patterns in the feature space. An empirical comparison of different classifiers can help us choose one of them as the best classifier for the problem at hand.

4 Combining Classifiers (Empirical View) However, although most of the classifiers may have similar error rates, the sets of patterns misclassified by different classifiers do not necessarily overlap. Combining the advantages of different classifiers, rather than relying on a single decision, is therefore an intuitively promising way to improve the overall classification accuracy. Such combinations are variously called combined classifiers, ensemble classifiers, mixture-of-experts models, or pooled classifiers.

5 Combining Classifiers (Empirical View) Some reasons for combining multiple classifiers to solve a given classification problem are: (1) Access to different classifiers, each developed in a different context and for an entirely different representation/description of the same problem. (2) Availability of multiple training sets, each collected at a different time or in a different environment, possibly even using different features. (3) Local performance differences, where each classifier may have its own region in the feature space where it performs best. (4) Different performances due to different initializations and the randomness inherent in the training procedure.

6 Combining Classifiers (Theoretical View) At a single data point, the quadratic error of the ensemble, $(f_{ens} - d)^2$, is less than or equal to the weighted average quadratic error of the individuals, $\sum_i w_i (f_i - d)^2$:
$(f_{ens} - d)^2 = \sum_i w_i (f_i - d)^2 - \sum_i w_i (f_i - f_{ens})^2$
where $f_{ens} = \sum_i w_i f_i$, with $w_i \ge 0$ and $\sum_i w_i = 1$. The first term is the weighted average error of the individuals. The second term is the diversity term, measuring the amount of variability among the ensemble member answers for this pattern.
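The identity can be checked numerically at a single data point. The following is a minimal sketch; the member outputs, weights, and target are made-up values chosen only for illustration.

```python
def ambiguity_decomposition(preds, weights, d):
    """Return (ensemble error, weighted average error, diversity) at one point."""
    f_ens = sum(w * f for w, f in zip(weights, preds))
    ens_err = (f_ens - d) ** 2
    avg_err = sum(w * (f - d) ** 2 for w, f in zip(weights, preds))
    diversity = sum(w * (f - f_ens) ** 2 for w, f in zip(weights, preds))
    return ens_err, avg_err, diversity


preds = [0.9, 1.4, 0.6]    # hypothetical member outputs f_i
weights = [0.5, 0.3, 0.2]  # convex weights w_i (non-negative, summing to 1)
d = 1.0                    # hypothetical target value

ens_err, avg_err, diversity = ambiguity_decomposition(preds, weights, d)
print(ens_err, avg_err - diversity)  # the two values agree up to rounding
```

Because the diversity term is never negative, the ensemble error can never exceed the weighted average error of the members.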

7 Combining Classifiers (Theoretical View) This decomposition tells us that combining several predictors is better, on average over several patterns, than a method that selects one of the predictors at random. We need to strike the right balance between diversity (the diversity term) and individual accuracy (the average error term) in order to achieve the lowest overall ensemble error. All successful ensemble methods encourage diversity to some extent.

8 Combining Classifiers In summary, we may have different feature sets, training sets, classification methods, and training sessions, all resulting in a set of classifiers whose outputs may be combined. Combination architectures can be grouped as follows. Parallel: all classifiers are invoked independently and their results are then combined by a combiner. Serial (cascading): individual classifiers are invoked in a linear sequence in which the number of possible classes for a given pattern is gradually reduced. Hierarchical (tree): individual classifiers are combined into a structure similar to a decision tree, where the nodes are associated with the classifiers.

9 Combining Classifiers Selecting and training individual classifiers: combining classifiers is especially useful if the individual classifiers are largely independent. This can be explicitly forced by using different training sets, different features, and different classifiers. Combiner: some combiners are static, with no training required, while others are trainable. Some are adaptive, where the decisions of individual classifiers are evaluated (weighed) depending on the input pattern, whereas non-adaptive ones treat all input patterns the same. Different combiners use different types of output from individual classifiers: confidence, rank, or abstract.

10 Combining Classifiers Examples of classifier combination schemes are: majority voting (each classifier makes a binary decision (vote) about each class, and the final decision is made in favor of the class with the largest number of votes); sum, product, maximum, minimum, and median of the posterior probabilities computed by the individual classifiers; class ranking (each class receives m ranks from m classifiers, and the highest of these ranks, i.e., the minimum rank value, is the final score for that class); and weighted combination of classifiers. We will study different combination schemes using a Bayesian framework and resampling.
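A minimal sketch of the voting and posterior-based rules listed above. The three posterior vectors are invented for illustration, and the helper names `majority_vote` and `combine_posteriors` are hypothetical.

```python
from collections import Counter
from math import prod
from statistics import median


def majority_vote(labels):
    """labels: one predicted class label per classifier."""
    return Counter(labels).most_common(1)[0][0]


def combine_posteriors(posteriors, rule):
    """posteriors: one dict per classifier mapping class -> P(class | x).

    rule: a function reducing a list of probabilities to one score,
    e.g. sum, math.prod, max, min, or statistics.median.
    """
    classes = posteriors[0].keys()
    scores = {c: rule([p[c] for p in posteriors]) for c in classes}
    return max(scores, key=scores.get)


# Hypothetical posteriors from three classifiers for one input pattern.
posteriors = [
    {"a": 0.60, "b": 0.40},
    {"a": 0.55, "b": 0.45},
    {"a": 0.10, "b": 0.90},
]
labels = [max(p, key=p.get) for p in posteriors]  # each classifier's hard decision

for name, rule in [("sum", sum), ("product", prod), ("max", max),
                   ("min", min), ("median", median)]:
    print(name, combine_posteriors(posteriors, rule))
print("majority", majority_vote(labels))
```

In this example the rules disagree: two weakly confident classifiers win the majority vote, while one highly confident classifier dominates the sum and product rules. This is the kind of behavior that makes the choice of combiner matter.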

11 Resampling Resampling is a well-known method for generating training data and evaluating the accuracy of different classifiers. It can also be used to build classifier ensembles. We will study: bagging, where multiple classifiers are built by bootstrapping the original training set, and boosting, where a sequence of classifiers is built by training each classifier using data sampled from a distribution derived from the empirical misclassification rate of the previous classifier.

12 Bagging Bagging (bootstrap aggregating) uses multiple versions of the training set, each created by bootstrapping the original training data. Each of these bootstrap data sets is used to train a different component classifier. The final classification decision is based on the vote of each component classifier. Traditionally, the component classifiers are of the same general form (e.g., all neural networks, all decision trees, etc.), and they differ in their final parameter values because of their different sets of training patterns.

13 Bagging A classifier/learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifiers and relatively large changes in accuracy. Decision trees and neural networks are examples of unstable classifiers, where a slight change in the training patterns can result in radically different classifiers. In general, bagging improves recognition for unstable classifiers because it effectively averages over such discontinuities.

14 Boosting In boosting, each training pattern receives a weight that determines its probability of being selected for the training set of an individual component classifier. If a training pattern is accurately classified, its chance of being used again in a subsequent component classifier is reduced. Conversely, if the pattern is not accurately classified, its chance of being used again is increased. The final classification decision is based on the weighted sum of the votes of the component classifiers, where the weight for each classifier is a function of its accuracy.

15 Adaboost The popular AdaBoost (adaptive boosting) algorithm allows classifiers to be added continually until some desired low training error has been achieved. Let $\alpha_t(x_i)$ denote the weight of pattern $x_i$ at trial $t$, where $\alpha_1(x_i) = 1/n$ for every $x_i$. At each trial $t = 1, \ldots, T$, a classifier $C_t$ is constructed from the given patterns under the distribution $\alpha_t$, where $\alpha_t(x_i)$ reflects the occurrence probability of $x_i$. The error $\epsilon_t$ of this classifier is also measured with respect to the weights, and consists of the sum of the weights of the patterns that it misclassifies. If $\epsilon_t$ is greater than 0.5, the trials terminate and $T$ is set to $t - 1$. Conversely, if $C_t$ correctly classifies all patterns so that $\epsilon_t$ is zero, the trials also terminate and $T$ becomes $t$. Otherwise, the weights $\alpha_{t+1}$ for the next trial are generated by multiplying the weights of the patterns that $C_t$ classifies correctly by the factor $\beta_t = \epsilon_t / (1 - \epsilon_t)$, and are then renormalized so that $\sum_{i=1}^{n} \alpha_{t+1}(x_i) = 1$. The boosted classifier $C$ is obtained by summing the votes of the classifiers $C_1, \ldots, C_T$, where the vote of classifier $C_t$ is weighted by $\log(1/\beta_t)$.
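The procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the weak learner is a hypothetical 1-D threshold classifier (`train_stump`), it uses the weights directly instead of sampling from the distribution, labels are in {-1, +1}, and the toy data set is made up.

```python
import math


def stump_predict(thr, sign, x):
    """Threshold classifier: predicts `sign` left of thr and `-sign` otherwise."""
    return sign if x < thr else -sign


def train_stump(xs, ys, w):
    """Return (weighted error, thr, sign) of the best stump under weights w."""
    best = None
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    for thr in thresholds:
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict(thr, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost_m1(xs, ys, max_trials):
    n = len(xs)
    w = [1.0 / n] * n                      # alpha_1(x_i) = 1/n
    ensemble = []                          # entries: (vote weight, thr, sign)
    for _ in range(max_trials):
        eps, thr, sign = train_stump(xs, ys, w)
        if eps > 0.5:                      # worse than chance: stop, T = t - 1
            break
        if eps == 0:                       # perfect classifier: stop, T = t
            # Assumption: log(1/beta_t) is unbounded here, so this classifier
            # would dominate every vote; return it on its own.
            return [(1.0, thr, sign)]
        beta = eps / (1.0 - eps)
        # Shrink the weights of correctly classified patterns, then renormalize.
        w = [wi * beta if stump_predict(thr, sign, x) == y else wi
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((math.log(1.0 / beta), thr, sign))
    return ensemble


def predict(ensemble, x):
    score = sum(v * stump_predict(thr, sign, x) for v, thr, sign in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_m1(xs, ys, max_trials=3)
train_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
print(len(ensemble), train_errors)
```

On this data a single stump misclassifies at least three patterns, while the weighted vote of three stumps fits the training set.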

16 Adaboost Provided that $\epsilon_t$ is always less than 0.5, it was shown that the error rate of $C$ on the given patterns under the original uniform distribution $\alpha_1$ approaches zero exponentially quickly as $T$ increases. A succession of weak classifiers $\{C_t\}$ can thus be boosted to a strong classifier $C$ that is at least as accurate as, and usually much more accurate than, the best weak classifier on the training data. However, note that there is no guarantee on the generalization performance of a bagged or boosted classifier on unseen patterns.

17 End of Lecture 10

18 Machine Learning Ensemble Learning II Hamid R. Rabiee Spring /

19 Agenda: Bias-Variance-Noise Analysis; Bootstrap; Bagging; AdaBoost

20 Bias-Variance Analysis Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S). The expected prediction error at a test point $(x^*, y^*)$ is $E[(y^* - h(x^*))^2]$. Decompose this into bias, variance, and noise.

21 Bias and Variance [figure]. Adapted from A Unified Overview of Ensemble Methods.

22 Bias-Variance Analysis Lemma: for a random variable $Z$ with mean $\bar{Z} = E[Z]$, $E[(Z - \bar{Z})^2] = E[Z^2] - \bar{Z}^2$.

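23 Error Decomposition
$E[(h(x^*) - y^*)^2] = E[h(x^*)^2 - 2 h(x^*) y^* + (y^*)^2]$
$= E[h(x^*)^2] - 2 E[h(x^*)] E[y^*] + E[(y^*)^2]$
$= E[(h(x^*) - E[h(x^*)])^2] + (E[h(x^*)])^2 - 2 E[h(x^*)] f(x^*) + E[(y^* - f(x^*))^2] + f(x^*)^2$ (applying the lemma, $E[Z^2] = E[(Z - E[Z])^2] + (E[Z])^2$, to $h(x^*)$ and to $y^*$, with $E[y^*] = f(x^*)$)
$= E[(h(x^*) - E[h(x^*)])^2] + (E[h(x^*)] - f(x^*))^2 + E[(y^* - f(x^*))^2]$
= variance + bias$^2$ + noise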

24 Error Decomposition $E[(h(x^*) - y^*)^2] = E[(h(x^*) - E[h(x^*)])^2] + (E[h(x^*)] - f(x^*))^2 + E[(y^* - f(x^*))^2] = Var(h(x^*)) + Bias(h(x^*))^2 + E[\varepsilon^2] = Var(h(x^*)) + Bias(h(x^*))^2 + \sigma^2$. Expected prediction error = Variance + Bias$^2$ + Noise.

25 Bias-Variance-Noise Analysis Variance: $E[(h(x^*) - E[h(x^*)])^2]$ describes how much $h(x^*)$ varies from one training set S to another. Bias: $E[h(x^*)] - f(x^*)$ describes the average error of $h(x^*)$. Noise: $E[(y^* - f(x^*))^2] = E[\varepsilon^2] = \sigma^2$ describes how much $y^*$ varies from $f(x^*)$.
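A small Monte Carlo experiment makes the three terms concrete. This is a sketch under invented assumptions: the true value `F_STAR`, the noise level `SIGMA`, and the deliberately biased toy learner (a shrunken mean of the training labels) are all hypothetical and serve only to produce non-zero bias and variance.

```python
import random

random.seed(0)

F_STAR = 2.0    # assumed true value f(x*) at the test point
SIGMA = 1.0     # assumed noise standard deviation, y = f(x) + eps
N_TRAIN = 5     # size of each training sample S
SHRINK = 0.8    # toy learner: h(x*) = 0.8 * mean of the training labels
TRIALS = 20000  # number of independent training samples drawn


def sample_y():
    """Draw one noisy observation y* = f(x*) + eps."""
    return F_STAR + random.gauss(0.0, SIGMA)


h_values, squared_errors, noise_terms = [], [], []
for _ in range(TRIALS):
    train = [sample_y() for _ in range(N_TRAIN)]
    h = SHRINK * sum(train) / N_TRAIN   # prediction learned from this S
    y_star = sample_y()                 # independent test label
    h_values.append(h)
    squared_errors.append((h - y_star) ** 2)
    noise_terms.append((y_star - F_STAR) ** 2)

mean_h = sum(h_values) / TRIALS
variance = sum((h - mean_h) ** 2 for h in h_values) / TRIALS
bias_sq = (mean_h - F_STAR) ** 2
noise = sum(noise_terms) / TRIALS
expected_error = sum(squared_errors) / TRIALS

print(expected_error, variance + bias_sq + noise)
```

For this toy learner the exact values are variance = 0.8^2 / 5 = 0.128, bias^2 = 0.4^2 = 0.16, and noise = 1, so both printed numbers should be close to 1.288, up to sampling error.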

26 Supervised Ensemble Methods Given a data set $D = \{x_1, x_2, \ldots, x_n\}$ and the corresponding labels $L = \{l_1, l_2, \ldots, l_n\}$, an ensemble approach computes: a set of classifiers $\{f_1, f_2, \ldots, f_k\}$, each of which maps data to a class label, $f_j(x) = l$; and a combination of classifiers $f^*$ that minimizes the generalization error, $f^*(x) = w_1 f_1(x) + w_2 f_2(x) + \ldots + w_k f_k(x)$.

27 Bootstrap Let the original sample be $L = (x_1, x_2, \ldots, x_n)$. Repeat B times: generate a sample $L_k$ of size n from L by sampling with replacement, and compute the statistic of interest $w_k$ on $L_k$. We end up with the bootstrap values $W = (w_1, w_2, \ldots, w_B)$. Use these values to calculate the quantities of interest (e.g., standard deviation, confidence intervals).

28 Bootstrap Example Original sample: X = (3.12, 0, 1.57, 19.67, 0.22, 2.20), Mean = 4.46. Bootstrap samples: X1 = (1.57, 0.22, 19.67, 0, 0.22, 3.12), Mean = 4.13; X2 = (0, 2.20, 2.20, 2.20, 19.67, 1.57), Mean = 4.64; X3 = (0.22, 3.12, 1.57, 3.12, 2.20, 0.22), Mean = 1.74.
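The same experiment can be repeated for many bootstrap samples. The sketch below reuses the original sample X from the example; the number of replicates B = 1000, the seed, and the helper name `bootstrap_means` are arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev


def bootstrap_means(sample, B, rng):
    """Draw B bootstrap samples (size n, with replacement); return their means."""
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(B)]


X = [3.12, 0, 1.57, 19.67, 0.22, 2.20]   # original sample from the example
rng = random.Random(0)                    # fixed seed, chosen arbitrarily

means = bootstrap_means(X, B=1000, rng=rng)

print("mean of original sample:", round(mean(X), 2))
print("bootstrap estimate of the standard error of the mean:", stdev(means))
```

The spread of the bootstrap means estimates how much the sample mean would vary across samples. The single large value 19.67 makes that spread wide for such a small sample.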

29 Bootstrap The bootstrap does not replace or add to the original data. We use the bootstrap distribution as a way to estimate the variation in a statistic based on the original data. Bootstrapping: one original sample → B bootstrap samples → bootstrap distribution. Bootstrap distributions usually approximate the shape, spread, and bias of the actual sampling distribution. Bootstrap distributions are centered at the value of the statistic from the original sample plus any bias.

30 Bootstrap Cases where the bootstrap does not apply: Small data sets: the original sample is not a good approximation of the population. Dirty data: outliers add variability to our estimates. Dependence structures (e.g., time series, spatial problems): the bootstrap is based on the assumption of independence.

31 Bootstrap How many bootstrap samples are needed? The choice of B depends on: computer availability; the type of the problem (e.g., standard errors, confidence intervals); and the complexity of the problem.

32 Bagging Bagging stands for bootstrap aggregating. It is an ensemble method: a method of combining multiple predictors. Let the original training data be L. Repeat B times: get a bootstrap sample $L_k$ from L, and train a predictor using $L_k$. Combine the B predictors by voting (for classification problems) or averaging (for estimation problems).
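The bagging loop can be sketched as follows. The base learner here is a hypothetical 1-D threshold classifier (`train_stump`) and the data set is a made-up toy example; any learner that maps a training set to a predictor could be passed in its place.

```python
import random
from collections import Counter


def train_stump(data):
    """Fit a 1-D threshold classifier; data is a list of (x, label in {-1, +1})."""
    xs = sorted({x for x, _ in data})
    best = None
    for thr in [xs[0] - 0.5] + [x + 0.5 for x in xs]:
        for sign in (1, -1):
            errors = sum(1 for x, y in data
                         if (sign if x < thr else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x < thr else -sign


def bagging(data, B, train, rng):
    """Train B predictors on bootstrap samples and combine them by voting."""
    predictors = []
    for _ in range(B):
        boot = rng.choices(data, k=len(data))   # bootstrap sample L_k
        predictors.append(train(boot))

    def vote(x):
        return Counter(p(x) for p in predictors).most_common(1)[0][0]

    return vote


# Toy data: label +1 for x < 5, otherwise -1.
data = [(x, 1 if x < 5 else -1) for x in range(10)]
vote = bagging(data, B=25, train=train_stump, rng=random.Random(0))
print(vote(0), vote(9))
```

An odd B avoids ties in two-class voting. For an estimation problem, the vote would be replaced by the average of the B predictions.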

33 Bagging-Voting Linear combination: $y = \sum_{j=1}^{L} w_j d_j$, with $w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$. Classification: $y_i = \sum_{j=1}^{L} w_j d_{ji}$.

34 Bagging Error Reduction Under mean squared error, bagging reduces variance and leaves bias unchanged. Consider the idealized bagging estimator $\bar{f}(x) = E[\hat{f}_z(x)]$. The error is
$E[Y - \hat{f}_z(x)]^2 = E[Y - \bar{f}(x) + \bar{f}(x) - \hat{f}_z(x)]^2 = E[Y - \bar{f}(x)]^2 + E[\bar{f}(x) - \hat{f}_z(x)]^2 \ge E[Y - \bar{f}(x)]^2$.
Bagging usually decreases MSE. Bagging reduces the variance of high-variance learners (e.g., decision trees).

35 Boosting Boosting reduces the bias of high-bias learners.

36 AdaBoost algorithm [figure]. Some slides have been adapted from slides of Tommi Jaakkola, MIT CSAIL.

38-42 AdaBoost example [figures]: original training set (equal weights to all training samples); Round 1; Round 2; Round 3. ε = error rate of classifier, α = weight of classifier. Adapted from A Tutorial on Boosting by Yoav Freund and Rob Schapire.

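43 Boosting For classifier $i$, its error is $\epsilon_i = \frac{\sum_{j=1}^{N} w_j \, \delta(C_i(x_j) \ne y_j)}{\sum_{j=1}^{N} w_j}$. The classifier's importance is represented as $\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$. The weight of each record is updated as $w_j^{(i+1)} = \frac{w_j^{(i)} \exp(-\alpha_i y_j C_i(x_j))}{Z^{(i)}}$, where $Z^{(i)}$ is the normalization factor. Final combination: $C^*(x) = \arg\max_y \sum_{i=1}^{K} \alpha_i \, \delta(C_i(x) = y)$.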
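One round of these updates, worked through on made-up numbers: ten records with equal weights, of which a hypothetical classifier misclassifies three. The function name `boosting_round` is an illustrative choice, and labels and predictions are assumed to be in {-1, +1}.

```python
import math


def boosting_round(weights, y_true, y_pred):
    """Return (error, importance, updated normalized weights) for one round.

    Assumes 0 < error < 1; labels and predictions are in {-1, +1}.
    """
    total = sum(weights)
    eps = sum(w for w, y, p in zip(weights, y_true, y_pred) if p != y) / total
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    unnormalized = [w * math.exp(-alpha * y * p)
                    for w, y, p in zip(weights, y_true, y_pred)]
    z = sum(unnormalized)                      # normalization factor Z
    return eps, alpha, [u / z for u in unnormalized]


y_true = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]   # wrong on three records
weights = [0.1] * 10

eps, alpha, new_weights = boosting_round(weights, y_true, y_pred)
misclassified_mass = sum(w for w, y, p in zip(new_weights, y_true, y_pred)
                         if p != y)
print(eps, alpha, misclassified_mass)
```

After the update the misclassified records carry half of the total weight, so the classifier that was just trained is no better than chance on the reweighted data, which forces the next classifier to focus on different records.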

44 Boosting Among the classifiers of the form $f(x) = \sum_{i=1}^{K} \alpha_i C_i(x)$, we seek to minimize the exponential loss function $\sum_{j=1}^{N} \exp(-y_j f(x_j))$. Not robust in noisy settings.

46 Adaboost properties: exponential loss After each boosting iteration, assuming we can find a component classifier whose weighted error is better than chance, the combined classifier is guaranteed to have a lower exponential loss over the training examples.

47 Adaboost properties: training error The boosting iterations also decrease the classification error of the combined classifier over the training examples.

48 Adaboost properties: training error The training classification error has to go down exponentially fast if the weighted errors of the component classifiers, $\epsilon_k$, are strictly better than chance, $\epsilon_k < 0.5$: $err(\hat{h}_m) \le \prod_{k=1}^{m} 2\sqrt{\epsilon_k (1 - \epsilon_k)}$.
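To see how quickly this bound shrinks, the product can be evaluated for a sequence of weighted errors. The constant error of 0.4 per round used below is a hypothetical value, not a result from the lecture.

```python
import math


def training_error_bound(weighted_errors):
    """Product of 2 * sqrt(eps_k * (1 - eps_k)) over the boosting rounds."""
    bound = 1.0
    for eps in weighted_errors:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound


for rounds in (10, 100, 500):
    print(rounds, training_error_bound([0.4] * rounds))
```

Each factor is below 1 whenever the weighted error differs from 0.5, so even weak classifiers that are only slightly better than chance drive the bound toward zero. With errors of exactly 0.5 every factor equals 1 and the bound gives no improvement.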

49 Adaboost properties: weighted error The weighted error of each new component classifier tends to increase as a function of boosting iterations.

50 Training and test errors Training and test errors of the combined classifier [figure]. Why should the test error go down after we already have zero training error?

51 AdaBoost and margin We can write the combined classifier in a more useful form by dividing the predictions by the total number of votes. This allows us to define a clear notion of voting margin that the combined classifier achieves for each training example. The margin lies in [-1, 1] and is negative for all misclassified examples. Successive boosting iterations still improve the majority vote or margin for the training examples.
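One way to write the normalized classifier and the margin is sketched below. The formulas on the slide were not recovered, so the notation is an assumption: component classifiers $h(x; \hat{\theta}_j)$ with outputs in {-1, +1}, non-negative votes $\hat{\alpha}_j$, and labels $y_i$ in {-1, +1}.

```latex
\hat{h}_m(x) = \frac{\sum_{j=1}^{m} \hat{\alpha}_j \, h(x; \hat{\theta}_j)}
                    {\sum_{j=1}^{m} \hat{\alpha}_j},
\qquad
\mathrm{margin}(x_i, y_i) = y_i \, \hat{h}_m(x_i) \in [-1, 1]
```

Under these assumptions the margin equals +1 when every component classifier votes for the correct label, and it is negative exactly when the weighted majority vote is wrong.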

52 AdaBoost and margin Cumulative distributions of margin values [figure].

53 End of Lecture 11


More information

Data Warehousing & Data Mining

Data Warehousing & Data Mining 13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Bagging and Other Ensemble Methods

Bagging and Other Ensemble Methods Bagging and Other Ensemble Methods Sargur N. Srihari srihari@buffalo.edu 1 Regularization Strategies 1. Parameter Norm Penalties 2. Norm Penalties as Constrained Optimization 3. Regularization and Underconstrained

More information

i=1 = H t 1 (x) + α t h t (x)

i=1 = H t 1 (x) + α t h t (x) AdaBoost AdaBoost, which stands for ``Adaptive Boosting", is an ensemble learning algorithm that uses the boosting paradigm []. We will discuss AdaBoost for binary classification. That is, we assume that

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Boos$ng Can we make dumb learners smart?

Boos$ng Can we make dumb learners smart? Boos$ng Can we make dumb learners smart? Aarti Singh Machine Learning 10-601 Nov 29, 2011 Slides Courtesy: Carlos Guestrin, Freund & Schapire 1 Why boost weak learners? Goal: Automa'cally categorize type

More information

Notation P(Y ) = X P(X, Y ) = X. P(Y X )P(X ) Teorema de Bayes: P(Y X ) = CIn/UFPE - Prof. Francisco de A. T. de Carvalho P(X )

Notation P(Y ) = X P(X, Y ) = X. P(Y X )P(X ) Teorema de Bayes: P(Y X ) = CIn/UFPE - Prof. Francisco de A. T. de Carvalho P(X ) Notation R n : feature space Y : class label set m : number of training examples D = {x i, y i } m i=1 (x i R n ; y i Y}: training data set H hipothesis space h : base classificer H : emsemble classifier

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University BOOSTING Robert E. Schapire and Yoav

More information

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c15 2013/9/9 page 331 le-tex 331 15 Ensemble Learning The expression ensemble learning refers to a broad class

More information

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13 Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

2D1431 Machine Learning. Bagging & Boosting

2D1431 Machine Learning. Bagging & Boosting 2D1431 Machine Learning Bagging & Boosting Outline Bagging and Boosting Evaluating Hypotheses Feature Subset Selection Model Selection Question of the Day Three salesmen arrive at a hotel one night and

More information

1/sqrt(B) convergence 1/B convergence B

1/sqrt(B) convergence 1/B convergence B The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Machine Learning Lecture 10

Machine Learning Lecture 10 Machine Learning Lecture 10 Neural Networks 26.11.2018 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Today s Topic Deep Learning 2 Course Outline Fundamentals Bayes

More information
