Ensembles Léon Bottou COS 424 4/8/2010

Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon Bottou 2/33 COS 424 4/8/2010

Summary 1. Why ensembles? 2. Combining outputs. 3. Constructing ensembles. 4. Boosting. Léon Bottou 3/33 COS 424 4/8/2010

I. Ensembles Léon Bottou 4/33 COS 424 4/8/2010

Ensemble of classifiers. Consider a set of classifiers $h_1, h_2, \ldots, h_L$. Construct a classifier by combining their individual decisions, for example by voting their outputs. Accuracy: the ensemble works if the classifiers have low error rates. Diversity: no gain if all classifiers make the same mistakes. What if classifiers make different mistakes? Léon Bottou 5/33 COS 424 4/8/2010

Uncorrelated classifiers. Assume that for all $r \neq s$: $\mathrm{Cov}\big[\,\mathbb{1}\{h_r(x) = y\},\ \mathbb{1}\{h_s(x) = y\}\,\big] = 0$. The tally of classifier votes then follows a binomial distribution. Example: twenty-one uncorrelated classifiers, each with a 30% error rate. Léon Bottou 6/33 COS 424 4/8/2010
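
To make the example concrete, here is a quick check of the binomial argument (my own sketch, not from the slides): the majority vote of 21 independent classifiers, each wrong with probability 0.3, errs only when at least 11 of the 21 votes are wrong.

```python
from scipy.stats import binom

n_voters, p_err = 21, 0.3            # twenty-one uncorrelated classifiers, 30% error rate

# The majority vote is wrong only when 11 or more of the 21 votes are wrong.
ensemble_err = binom.sf(10, n_voters, p_err)   # P(#wrong >= 11)
print(f"single classifier: {p_err:.3f}, majority vote: {ensemble_err:.3f}")
# -> the vote errs roughly 2.6% of the time, far below any individual classifier
```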

Statistical motivation. (Figure: blue = classifiers that work well on the training set(s); f = the best classifier.) Léon Bottou 7/33 COS 424 4/8/2010

Computational motivation. (Figure: blue = the classifier search may reach local optima; f = the best classifier.) Léon Bottou 8/33 COS 424 4/8/2010

Representational motivation. (Figure: blue = the classifier space may not contain the best classifier; f = the best classifier.) Léon Bottou 9/33 COS 424 4/8/2010

Practical success: recommendation systems. Netflix suggests movies you may like; customers sometimes rate the movies they rent. Input: (movie, customer). Output: rating. The Netflix competition offered $1M to the first team to do 10% better than Netflix's own system. Winner: the BellKor team and friends, with an ensemble of more than 800 rating systems. Runner-up: everybody else, with an ensemble of all the rating systems built by the other teams. Léon Bottou 10/33 COS 424 4/8/2010

Bayesian ensembles. Let $D$ represent the training data. Enumerating all the classifiers:
$$P(y \mid x, D) = \sum_h P(y, h \mid x, D) = \sum_h P(h \mid x, D)\, P(y \mid h, x, D) = \sum_h P(h \mid D)\, P(y \mid x, h)$$
$P(h \mid D)$: how well $h$ matches the training data. $P(y \mid x, h)$: what $h$ predicts for pattern $x$. Note that this is a weighted average. Léon Bottou 11/33 COS 424 4/8/2010

II. Combining Outputs Léon Bottou 12/33 COS 424 4/8/2010

Simple averaging Léon Bottou 13/33 COS 424 4/8/2010

Weighted averaging a priori. Weights derived from the training errors, e.g. $\exp\big(-\beta\, \mathrm{TrainingError}(h_t)\big)$. Approximate Bayesian ensemble. Léon Bottou 14/33 COS 424 4/8/2010
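
A minimal sketch of this weighting scheme (my own illustration; the inverse temperature `beta` and the helper name are assumptions, not from the lecture):

```python
import numpy as np

def weighted_vote(predictions, train_errors, beta=5.0):
    """Combine the +/-1 outputs of L classifiers, weighting classifier t
    by exp(-beta * TrainingError(h_t)) as on the slide."""
    P = np.asarray(predictions, dtype=float)              # shape (L, n_samples)
    w = np.exp(-beta * np.asarray(train_errors, float))   # one weight per classifier
    return np.sign(w @ P)                                  # weighted vote

# Example: the most accurate of three classifiers dominates the combined vote.
preds  = [[+1, -1, +1],
          [+1, +1, -1],
          [-1, +1, +1]]
errors = [0.10, 0.35, 0.40]
print(weighted_vote(preds, errors))    # -> [ 1. -1.  1.]
```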

Weighted averaging with trained weights Train weights on the validation set. Training weights on the training set overfits easily. You need another validation set to estimate the performance! Léon Bottou 15/33 COS 424 4/8/2010

Stacked classifiers Second tier classifier trained on the validation set. You need another validation set to estimate the performance! Léon Bottou 16/33 COS 424 4/8/2010

III. Constructing Ensembles Léon Bottou 17/33 COS 424 4/8/2010

Diversification. Cause of the mistake / diversification strategy: the pattern was difficult → hopeless; overfitting → vary the training sets; some features were noisy → vary the set of input features; multiclass decisions were inconsistent → vary the class encoding. Léon Bottou 18/33 COS 424 4/8/2010

Manipulating the training examples. Bootstrap replication simulates training set selection: given a training set of size n, construct a new training set by sampling n examples with replacement. On average about a third of the examples (≈ 1/e ≈ 37%) are excluded from each replicate. Bagging: create bootstrap replicates of the training set, build a decision tree for each replicate, estimate each tree's performance using out-of-bootstrap data, and average the outputs of all the decision trees (see the sketch below). Boosting: see part IV. Léon Bottou 19/33 COS 424 4/8/2010
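
A compact bagging sketch along these lines (my own illustration; it borrows scikit-learn's DecisionTreeClassifier as the base learner and omits the out-of-bootstrap performance estimate):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=25, seed=0):
    """Train one decision tree per bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # n examples drawn with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Average (majority-vote) the tree outputs; labels are assumed to be +/-1."""
    return np.sign(np.mean([t.predict(X) for t in trees], axis=0))
```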

Manipulating the features Random forests Construct decision trees on bootstrap replicas. Restrict the node decisions to a small subset of features picked randomly for each node. Do not prune the trees. Estimate tree performance using out-of-bootstrap data. Average the outputs of all decision trees. Multiband speech recognition Filter speech to eliminate a random subset of the frequencies. Train speech recognizer on filtered data. Repeat and combine with a second tier classifier. Resulting recognizer is more robust to noise. Léon Bottou 20/33 COS 424 4/8/2010
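
For reference, scikit-learn's RandomForestClassifier follows essentially this recipe (bootstrap replicates, a random feature subset at each node, unpruned trees); a minimal usage sketch with arbitrarily chosen hyperparameters:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameter values here are illustrations, not values from the lecture.
forest = RandomForestClassifier(
    n_estimators=200,      # number of bootstrap-replicate trees
    max_features="sqrt",   # random feature subset considered at each node
    oob_score=True,        # estimate performance on out-of-bootstrap data
    random_state=0,
)
# forest.fit(X_train, y_train); forest.oob_score_ then gives the out-of-bag accuracy.
```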

Manipulating the output codes Reducing multiclass problems to binary classification We have seen one versus all. We have seen all versus all. Error correcting codes for multiclass problems Code the class numbers with an error correcting code. Construct a binary classifier for each bit of the code. Run the error correction algorithm on the binary classifier outputs. Léon Bottou 21/33 COS 424 4/8/2010
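
A rough sketch of the error-correcting-code recipe (my own illustration, with logistic regression standing in for the per-bit binary classifier and nearest-code-word decoding as the error correction step):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_fit(X, y, n_classes, n_bits=15, seed=0):
    """Train one binary classifier per bit of a random class code."""
    rng = np.random.default_rng(seed)
    code = rng.integers(0, 2, size=(n_classes, n_bits))       # one code word per class
    # Re-draw any bit column that fails to split the classes into two groups.
    for b in range(n_bits):
        while code[:, b].min() == code[:, b].max():
            code[:, b] = rng.integers(0, 2, size=n_classes)
    clfs = [LogisticRegression(max_iter=1000).fit(X, code[np.asarray(y), b])
            for b in range(n_bits)]
    return code, clfs

def ecoc_predict(code, clfs, X):
    """Decode by choosing the class whose code word is nearest in Hamming distance."""
    bits = np.column_stack([c.predict(X) for c in clfs])      # (n_samples, n_bits)
    hamming = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)                             # predicted class index
```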

IV. Boosting Léon Bottou 22/33 COS 424 4/8/2010

Motivation. It is easy to come up with rough rules of thumb for classifying data: the email contains more than 50% capital letters; the email contains the expression "buy now". Each rule alone isn't great, but it is better than random. Boosting converts rough rules of thumb into an accurate classifier. Boosting was invented by Prof. Schapire. Léon Bottou 23/33 COS 424 4/8/2010

Adaboost. Given examples $(x_1, y_1), \ldots, (x_n, y_n)$ with $y_i = \pm 1$. Let $D_1(i) = 1/n$ for $i = 1 \ldots n$.
For $t = 1 \ldots T$ do:
  run the weak learner using the examples with weights $D_t$ and get a weak classifier $h_t$;
  compute the error $\varepsilon_t = \sum_i D_t(i)\, \mathbb{1}\{h_t(x_i) \neq y_i\}$;
  compute the magic coefficient $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$;
  update the weights $D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$.
Output the final classifier $f_T(x) = \mathrm{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$. Léon Bottou 24/33 COS 424 4/8/2010
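
A self-contained sketch of the algorithm above (my simplification: the weak learner is an exhaustive search over one-feature threshold stumps, and the weights are renormalized explicitly, which amounts to dividing by $Z_t$):

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: exhaustive search for the single-feature threshold stump
    with the smallest weighted error under the distribution D."""
    n, d = X.shape
    best_err, best_params = np.inf, None
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    j, thr, sign = best_params
    return lambda Z: sign * np.where(Z[:, j] > thr, 1, -1)

def adaboost(X, y, T=20):
    """Adaboost as on the slide above: returns the coefficients alpha_t and classifiers h_t."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)   # labels are +/-1
    n = len(y)
    D = np.full(n, 1.0 / n)                        # D_1(i) = 1/n
    alphas, hs = [], []
    for _ in range(T):
        h = best_stump(X, y, D)                    # weak classifier h_t
        pred = h(X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-12, 1 - 1e-12)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)      # the "magic" coefficient
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                               # normalization, i.e. divide by Z_t
        alphas.append(alpha)
        hs.append(h)
    return alphas, hs

def boosted_predict(alphas, hs, X):
    X = np.asarray(X, dtype=float)
    return np.sign(sum(a * h(X) for a, h in zip(alphas, hs)))
```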

Toy example Weak classifiers: vertical or horizontal half-planes. Léon Bottou 25/33 COS 424 4/8/2010

Adaboost round 1 Léon Bottou 26/33 COS 424 4/8/2010

Adaboost round 2 Léon Bottou 27/33 COS 424 4/8/2010

Adaboost round 3 Léon Bottou 28/33 COS 424 4/8/2010

Adaboost final classifier Léon Bottou 29/33 COS 424 4/8/2010

From weak learner to strong classifier (1)
Preliminary: $D_{T+1}(i) = D_1(i)\,\frac{e^{-\alpha_1 y_i h_1(x_i)}}{Z_1} \cdots \frac{e^{-\alpha_T y_i h_T(x_i)}}{Z_T} = \frac{1}{n}\,\frac{e^{-y_i f_T(x_i)}}{\prod_t Z_t}$.
Bounding the training error: $\frac{1}{n}\sum_i \mathbb{1}\{f_T(x_i) \neq y_i\} \;\le\; \frac{1}{n}\sum_i e^{-y_i f_T(x_i)} \;=\; \sum_i D_{T+1}(i) \prod_t Z_t \;=\; \prod_t Z_t$.
Idea: make $Z_t$ as small as possible, where $Z_t = \sum_{i=1}^n D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t}$.
1. Pick $h_t$ to minimize $\varepsilon_t$. 2. Pick $\alpha_t$ to minimize $Z_t$. Léon Bottou 30/33 COS 424 4/8/2010

From weak learner to strong classifier (2)
Pick $\alpha_t$ to minimize $Z_t$ (the magic coefficient): $\frac{\partial Z_t}{\partial \alpha_t} = -(1-\varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t} = 0 \;\Longrightarrow\; \alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.
Weak learner assumption: $\gamma_t = \frac{1}{2} - \varepsilon_t$ is positive (and possibly small).
Then $Z_t = (1-\varepsilon_t)\sqrt{\tfrac{\varepsilon_t}{1-\varepsilon_t}} + \varepsilon_t\sqrt{\tfrac{1-\varepsilon_t}{\varepsilon_t}} = \sqrt{4\varepsilon_t(1-\varepsilon_t)} = \sqrt{1-4\gamma_t^2} \le \exp(-2\gamma_t^2)$, hence
$\mathrm{TrainingError}(f_T) \;\le\; \prod_{t=1}^T Z_t \;\le\; \exp\Big(-2\sum_{t=1}^T \gamma_t^2\Big)$.
The training error decreases exponentially if $\inf_t \gamma_t > 0$. But that does not happen beyond a certain point... Léon Bottou 31/33 COS 424 4/8/2010
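
For concreteness (the numbers here are my own illustration, not from the slides): if every weak classifier manages $\varepsilon_t = 0.4$, i.e. $\gamma_t = 0.1$, then $Z_t = \sqrt{1 - 0.04} \approx 0.98$, and after $T = 200$ rounds the bound gives a training error of at most $\exp(-2 \cdot 200 \cdot 0.01) = e^{-4} \approx 0.018$: rules only slightly better than random compound into a strong classifier.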

Boosting and exponential loss
Proofs are instructive. We obtain the bound $\mathrm{TrainingError}(f_T) \le \frac{1}{n}\sum_i e^{-y_i f_T(x_i)} = \prod_{t=1}^T Z_t$ without saying how $D_t$ relates to $h_t$ and without using the value of $\alpha_t$. (Figure: the exponential loss plotted as a function of $y\,\hat{y}(x)$.)
Conclusion: round $T$ chooses the $h_T$ and $\alpha_T$ that maximize the exponential-loss reduction from $f_{T-1}$ to $f_T$.
Exercise: tweak Adaboost to minimize the log loss instead of the exp loss. Léon Bottou 32/33 COS 424 4/8/2010

Boosting and margins
$\mathrm{margin}_H(x, y) = \frac{y\, H(x)}{\sum_t \alpha_t} = \frac{\sum_t \alpha_t\, y\, h_t(x)}{\sum_t \alpha_t}$
Remember support vector machines? Léon Bottou 33/33 COS 424 4/8/2010