Ensembles of Classifiers.

Ensembles of Classifiers
www.biostat.wisc.edu/~dpage/cs760/

Goals for the lecture

You should understand the following concepts:
- ensemble
- bootstrap sample
- bagging
- boosting
- random forests
- error correcting output codes

What is an ensemble?

[Diagram: an input x is fed to models h_1(x), ..., h_5(x), whose outputs are combined into a single prediction h(x)]

An ensemble is a set of learned models whose individual decisions are combined in some way to make predictions for new instances.

When can an ensemble be more accurate?

- when the errors made by the individual predictors are (somewhat) uncorrelated, and
- the predictors' error rates are better than guessing (< 0.5 for a 2-class problem)

Consider an idealized case: the error rate of the ensemble is the probability mass in the region of the figure where a majority of the individual classifiers err = 0.026.

[Figure from Dietterich, AI Magazine, 1997]
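The sketch below computes this idealized-case number: the probability that a majority of T independent classifiers, each with error rate p, are wrong on a given instance. The parameters T = 21 and p = 0.3 are assumed here (they match Dietterich's standard example but are not spelled out in this transcript); with those values the result is roughly 0.026.

from math import comb

def majority_vote_error(T, p):
    """Probability that a majority of T independent classifiers,
    each with error rate p, are wrong on a given instance."""
    k_min = T // 2 + 1   # smallest number of wrong votes that sways the vote
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(k_min, T + 1))

print(majority_vote_error(21, 0.3))   # ~0.026 under the independence assumption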

How can we get diverse classifiers?

In practice, we can't get classifiers whose errors are completely uncorrelated, but we can encourage diversity in their errors by:
- choosing a variety of learning algorithms
- choosing a variety of settings (e.g. # hidden units in neural nets) for the learning algorithm
- choosing different subsamples of the training set (bagging)
- using different probability distributions over the training instances (boosting)
- choosing different features and subsamples (random forests)

Bagging (Bootstrap Aggregation) [Breiman, Machine Learning 1996]

learning:
  given: learner L, training set D = { (x_1, y_1), ..., (x_m, y_m) }
  for i ← 1 to T do
    D^(i) ← m instances randomly drawn with replacement from D
    h_i ← model learned using L on D^(i)

classification:
  given: test instance x
  predict y ← plurality_vote( h_1(x), ..., h_T(x) )

regression:
  given: test instance x
  predict y ← mean( h_1(x), ..., h_T(x) )
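A minimal Python sketch of the procedure above, assuming scikit-learn's DecisionTreeClassifier as the base learner L and integer class labels for the vote count; any learner with fit/predict would do, and the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumed base learner L

def bagging_fit(X, y, T=25, seed=0):
    """Learn T models, each on a bootstrap replicate of the training set."""
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)           # m draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: plurality vote over the T models' predictions."""
    votes = np.stack([h.predict(X) for h in models]).astype(int)   # shape (T, n); assumes integer labels
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)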

Bagging

- each sampled training set is a bootstrap replicate
  - it contains m instances (the same as the original training set)
  - on average it includes 63.2% of the original training instances (the chance an instance is never drawn is (1 - 1/m)^m ≈ e^(-1) ≈ 0.368)
  - some instances appear multiple times
- can be used with any base learner
- works best with unstable learning methods: those for which small changes in D result in relatively large changes in learned models
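A quick check of the 63.2% figure; the value of m below is arbitrary and just for illustration.

import numpy as np

m = 1000
print(1 - (1 - 1/m) ** m)                  # analytic probability an instance appears at least once, ~0.632

rng = np.random.default_rng(0)
replicate = rng.integers(0, m, size=m)     # one bootstrap replicate of indices
print(len(np.unique(replicate)) / m)       # empirical fraction of distinct instances, ~0.63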

Empirical evaluation of bagging with C4.5

[Figure from Dietterich, AI Magazine, 1997]

Bagging reduced the error of C4.5 on most data sets; it wasn't harmful on any.

Boosting

- Boosting came out of the PAC learning community.
- A weak PAC learning algorithm is one that cannot PAC learn for arbitrary ε and δ, although its hypotheses are slightly better than random guessing.
- Suppose we have a weak PAC learning algorithm L for a concept class C. Can we use L as a subroutine to create a strong PAC learner for C? Yes, by boosting! [Schapire, Machine Learning 1990]
- The original boosting algorithm was of theoretical interest, but assumed an unbounded source of training instances.
- A later boosting algorithm, AdaBoost, has had notable practical success.

AdaBoost [Freund & Schapire, Journal of Computer and System Sciences, 1997]

given: learner L, # stages T, training set D = { (x_1, y_1), ..., (x_m, y_m) }

for all i: w_1(i) ← 1/m                                  // initialize instance weights
for t ← 1 to T do
  for all i: p_t(i) ← w_t(i) / Σ_j w_t(j)                // normalize weights
  h_t ← model learned using L on D and p_t
  ε_t ← Σ_i p_t(i) (1 - δ(h_t(x_i), y_i))                // calculate weighted error
  if ε_t > 0.5 then
    T ← t - 1
    break
  β_t ← ε_t / (1 - ε_t)
  for all i where h_t(x_i) = y_i:                        // downweight correct examples
    w_{t+1}(i) ← w_t(i) β_t
  (weights of misclassified instances are carried over unchanged)

return: h(x) = argmax_y Σ_{t=1}^{T} log(1/β_t) δ(h_t(x), y)
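A minimal Python sketch of the pseudocode above, assuming scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1) as the weighted base learner; the function names and the tie-breaking constant for a perfect round are illustrative choices, not part of the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumed weighted base learner

def adaboost_fit(X, y, T=50):
    """AdaBoost.M1 as in the pseudocode above; y may hold arbitrary class labels."""
    m = len(X)
    w = np.full(m, 1.0 / m)                        # initialize instance weights
    models, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                            # normalize weights
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        correct = (h.predict(X) == y)
        eps = p[~correct].sum()                    # weighted error
        if eps > 0.5:                              # weak-learning assumption violated: stop
            break
        if eps == 0:                               # perfect round: give it a large finite vote
            models.append(h); alphas.append(1e10)
            break
        beta = eps / (1 - eps)
        models.append(h); alphas.append(np.log(1 / beta))
        w = np.where(correct, w * beta, w)         # downweight correctly classified instances
    return models, alphas

def adaboost_predict(models, alphas, X, classes):
    """h(x) = argmax_y sum_t log(1/beta_t) * [h_t(x) == y]"""
    scores = np.zeros((len(X), len(classes)))
    for h, a in zip(models, alphas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            scores[:, j] += a * (pred == c)
    return np.array(classes)[scores.argmax(axis=1)]

Usage would look like: models, alphas = adaboost_fit(X, y); yhat = adaboost_predict(models, alphas, X, classes=np.unique(y)).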

Implementing weighted instances with AdaBoost

AdaBoost calls the base learner L with a probability distribution p_t specified by weights on the instances. There are two ways to handle this:
1. Adapt L to learn from weighted instances; straightforward for decision trees and naïve Bayes, among others.
2. Sample a large (>> m) unweighted set of instances according to p_t, and run L in the ordinary manner (see the sketch below).
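A sketch of option 2, assuming NumPy arrays and the normalized weights p_t from the loop above; the helper name is hypothetical.

import numpy as np

def resample_by_weight(X, y, p, n_samples, seed=0):
    """Draw an unweighted training set of size n_samples (>> m) according to p."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_samples, replace=True, p=p)
    return X[idx], y[idx]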

Empirical evaluation of boosting with C4.5

[Figure from Dietterich, AI Magazine, 1997]

Bagging and boosting with C4.5

[Figure from Dietterich, AI Magazine, 1997]

Empirical study of bagging vs. boosting [Opitz & Maclin, JAIR 1999]

- 23 data sets, with C4.5 and neural nets as base learners
- bagging is almost always better than a single decision tree or neural net
- boosting can be much better than bagging
- however, boosting can sometimes reduce accuracy (too much emphasis on outliers?)

Random forests [Breiman, Machine Learning 2001]

given: candidate feature splits F, training set D = { (x_1, y_1), ..., (x_m, y_m) }
for i ← 1 to T do
  D^(i) ← m instances randomly drawn with replacement from D
  h_i ← randomized decision tree learned with F, D^(i)

randomized decision tree learning:
  to select a split at a node:
    R ← randomly select (without replacement) f feature splits from F (where f << |F|)
    choose the best feature split in R
  do not prune trees

classification/regression: as in bagging
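One way to sketch this in Python is to reuse the bagging loop above with a tree learner restricted to a random subset of candidate features at each split; scikit-learn's DecisionTreeClassifier exposes this via max_features, and "sqrt" (roughly f = √|F|) is an assumed choice here, not something the slide specifies.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T=100, seed=0):
    """Bootstrap replicates + trees that choose each split from a random feature subset."""
    rng = np.random.default_rng(seed)
    m = len(X)
    forest = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)                     # bootstrap replicate D^(i)
        tree = DecisionTreeClassifier(max_features="sqrt")   # random f-subset of splits at each node
        forest.append(tree.fit(X[idx], y[idx]))              # trees are left unpruned
    return forest

Prediction is as in bagging (plurality vote or mean), so the bagging_predict sketch above can be reused.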

Learning models for multi-class problems

- consider a learning task with k > 2 classes
- with some learning methods, we can learn one model to predict the k classes
- an alternative approach is to learn k models, each representing one class vs. the rest (one-vs-rest)
- but we could learn models to represent other encodings as well
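A sketch of the one-vs-rest alternative, assuming a binary base learner that exposes a real-valued score; scikit-learn's LogisticRegression and its decision_function are used purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression   # assumed binary base learner

def one_vs_rest_fit(X, y):
    """Learn k binary models, each separating one class from all the others."""
    classes = np.unique(y)
    return classes, [LogisticRegression().fit(X, (y == c).astype(int)) for c in classes]

def one_vs_rest_predict(classes, models, X):
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[scores.argmax(axis=1)]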

Error correcting output codes [Dietterich & Bakiri, JAIR 1995]

- an ensemble method devised specifically for problems with many classes
- represent each class by a multi-bit code word
- learn a classifier to represent each bit function

Classification with ECOC

To classify a test instance x using an ECOC ensemble with T classifiers:
1. form a vector h(x) = ⟨ h_1(x), ..., h_T(x) ⟩, where h_i(x) is the prediction of the model for the i-th bit
2. find the codeword c with the smallest Hamming distance to h(x)
3. predict the class associated with c

If the minimum Hamming distance between any pair of codewords is d, we can still get the right classification with up to ⌊(d - 1)/2⌋ single-bit errors.
(Recall: ⌊x⌋ is the largest integer not greater than x.)
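A minimal sketch of the decoding step. The 4-class, 7-bit code below is made up for illustration; codewords has one row per class and one column per bit classifier.

import numpy as np

codewords = np.array([        # hypothetical code: row = class, column = bit
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 0],
])

def ecoc_decode(bit_predictions, codewords):
    """Return the index of the codeword closest in Hamming distance to h(x)."""
    h = np.asarray(bit_predictions)
    distances = (codewords != h).sum(axis=1)   # Hamming distance to each class codeword
    return int(distances.argmin())

print(ecoc_decode([0, 1, 1, 0, 1, 0, 1], codewords))   # -> 1 (closest codeword)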

Error correcting code design

A good ECOC should satisfy two properties:
1. row separation: each codeword should be well separated in Hamming distance from every other codeword
2. column separation: each bit position should be uncorrelated with the other bit positions

[Figure: example code with rows 7 bits apart and columns 6 bits apart]

Here d = 7, so this code can correct ⌊(7 - 1)/2⌋ = 3 errors.

ECOC evaluation with C4.5

[Figure from Dietterich & Bakiri, JAIR, 1995]

ECOC evaluation with neural nets

[Figure from Dietterich & Bakiri, JAIR, 1995]

Comments on ensembles

- They very often provide a boost in accuracy over the base learner.
- It's a good idea to evaluate an ensemble approach for almost any practical learning problem.
- They increase runtime over the base learner, but compute cycles are usually much cheaper than training instances.
- Some ensemble approaches (e.g. bagging, random forests) are easily parallelized.
- Prediction contests (e.g. Kaggle, the Netflix Prize) are usually won by ensemble solutions.
- Ensemble models are usually low on the comprehensibility scale, although see work by [Craven & Shavlik, NIPS 1996], [Domingos, Intelligent Data Analysis 1998], and [Van Assche & Blockeel, ECML 2007].