Ensemble Methods for Machine Learning


COMBINING CLASSIFIERS: ENSEMBLE APPROACHES

Common ensemble classifiers: bagging/random forests, bucket of models, stacking, boosting.

Ensemble classifiers we've studied so far. Bagging: build many bootstrap replicates of the data, train many classifiers, vote the results.

Bagged trees to random forests. Bagged decision trees: build many bootstrap replicates of the data, train many tree classifiers, vote the results. Random forest: build many bootstrap replicates of the data, train many tree classifiers (no pruning), and for each split restrict to a random subset of the original feature set (usually a random subset of size sqrt(d), where there are d original features); vote the results. Like bagged decision trees, fast and easy to use.
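A minimal sketch of the contrast, assuming scikit-learn and a toy dataset (not from the slides): the only difference below is that the forest restricts each split to a random subset of about sqrt(d) features.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# Bagged trees: bootstrap samples, full feature set considered at every split.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
# Random forest: bootstrap samples, but only sqrt(d) randomly chosen features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagged trees :", cross_val_score(bagged_trees, X, y).mean())
print("random forest:", cross_val_score(forest, X, y).mean())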

A bucket of models. Scenario: I have three learners: logistic regression, Naïve Bayes, backprop. I'm going to embed a learning machine in some system (e.g., a Genius-like music recommender). Each system has a different user, hence a different dataset D. In pilot studies there's no single best system. Which learner do I embed to get the best result? Can I combine all of them somehow and do better?

Simple learner combination: a bucket of models. Input: your top T favorite learners (or tunings) L1,...,LT, and a dataset D. Learning algorithm: use 10-fold cross-validation (10-CV) to estimate the error of L1,...,LT, pick the best (lowest 10-CV error) learner L*, then train L* on D and return its hypothesis h*.
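A minimal sketch of this recipe, assuming scikit-learn; the three learners and the dataset are placeholders standing in for "your top T favorites" and D.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
learners = [LogisticRegression(max_iter=1000), GaussianNB(),
            MLPClassifier(max_iter=2000, random_state=0)]

# Use 10-fold CV to estimate each learner's error, pick the best L*,
# then train L* on all of D and return its hypothesis h*.
cv_errors = [1 - cross_val_score(L, X, y, cv=10).mean() for L in learners]
best = learners[int(np.argmin(cv_errors))]
h_star = best.fit(X, y)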

Pros and cons of a bucket of models. Pros: simple; will give results not much worse than the best of the base learners. Cons: what if there's not a single best learner? Other approaches: vote the hypotheses (how would you weight them?), or combine them some other way. How about learning to combine the hypotheses?

Common ensemble classifiers: bagging/random forests, bucket of models, stacking, boosting.

Stacked learners: first attempt. Input: your top T favorite learners (or tunings) L1,...,LT, and a dataset D containing (x,y),... Learning algorithm: train L1,...,LT on D to get h1,...,hT. Create a new dataset D' containing (x',y'),... where x' is a vector of the T predictions h1(x),...,hT(x) and y' is the label y for x. Train YFCL (your favorite classifier learner) on D' to get h' --- which combines the predictions! To predict on a new x: construct x' as before and predict h'(x'). Problem: if Li overfits the data D, then hi(x) could be almost always the same as y in D. But that won't be the case on an out-of-sample test example. The fix: make an x' in D' look more like the out-of-sample test cases.

Stacked learners: the right way. Input: your top T favorite learners (or tunings) L1,...,LT, and a dataset D containing (x,y),... Notation: D-x is the CV training set for x (the folds that do not contain x), and Li(D*,x) is the label predicted for x by the hypothesis learned from D* by Li. Learning algorithm: train L1,...,LT on D to get h1,...,hT. Also run 10-CV and collect the CV test predictions (x, Li(D-x,x)) for each example x and each learner Li. Create a new dataset D' containing (x',y'),... where x' is the vector of CV test predictions (L1(D-x,x), L2(D-x,x), ...) and y' is the label y for x. Train YFCL on D' to get h' --- which combines the predictions! To predict on a new x: we don't do any CV at prediction time; construct x' using h1,...,hT (as before) and predict h'(x').
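A minimal sketch of this procedure, assuming scikit-learn: cross_val_predict gives each example the prediction of a model trained on the folds that do not contain it, i.e. Li(D-x, x), and logistic regression stands in for YFCL.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, random_state=0)
base = [GaussianNB(), DecisionTreeClassifier(max_depth=3, random_state=0)]

# x' = vector of CV test predictions, y' = original label.
X_prime = np.column_stack([cross_val_predict(L, X, y, cv=10) for L in base])
combiner = LogisticRegression().fit(X_prime, y)          # h'

# Also train each base learner on all of D, giving h1,...,hT for prediction time.
h = [L.fit(X, y) for L in base]

def predict(X_new):
    # No CV at prediction time: build x' from h1,...,hT, then apply h'.
    X_new_prime = np.column_stack([m.predict(X_new) for m in h])
    return combiner.predict(X_new_prime)

(scikit-learn's StackingClassifier packages essentially this cross-validated stacking procedure.)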

Pros and cons of stacking. Pros: fairly simple; slow, but easy to parallelize. Cons: what if there's not a single best combination scheme? E.g., for movie recommendation, sometimes L1 is best for users with many ratings and L2 is best for users with few ratings.

Multi-level stacking/blending. (As before, D-x is the CV training set for x and Li(D*,x) is a predicted label.) Learning algorithm: train L1,...,LT on D to get h1,...,hT. Also run 10-CV and collect the CV test predictions (x, Li(D-x,x)) for each example x and each learner Li. Create a new dataset D' containing (x',y'),... where x' is the CV test predictions (L1(D-x,x), L2(D-x,x), ...) combined with additional meta-features from x (e.g., numratings, userage, ...), and y' is the label y for x. This might create features like "L1(D-x,x)=y* and numratings>20". Train YFCL on D' to get h' --- which combines the predictions, where the choice of classifier to rely on depends on meta-features of x. To predict on a new x: construct x' using h1,...,hT (as before) and predict h'(x').
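Continuing the stacking sketch from above (my own illustration, not from the slides): blending with meta-features just means augmenting x' with a few features of x itself, so the combiner can learn which base learner to trust for which kind of example. The "number of ratings" column below is a hypothetical placeholder.

num_ratings = X[:, 0]                              # placeholder meta-feature of x
X_blend = np.column_stack([X_prime, num_ratings])  # CV predictions + meta-features
blender = LogisticRegression().fit(X_blend, y)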

Comments. Ensembles based on blending/stacking were key approaches used in the Netflix competition; winning entries blended many types of classifiers. Ensembles based on stacking are the main architecture used in Watson. Not all of the base classifiers/rankers are learned, however; some are hand-programmed.

Common ensemble classifiers: bagging/random forests, bucket of models, stacking, boosting.

Boosting. Thanks to: A Short Introduction to Boosting, Yoav Freund and Robert E. Schapire, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.

Valiant (CACM 1984) and PAC learning: partly inspired by Turing. [Diagram: a formal/informal view of complexity and AI, with the Turing machine (1936) as the formal account of computation, the Turing test (1950) as the informal account of AI, and Valiant (1984) as a formal account of learning.] Question: what sort of AI questions can we formalize and study with formal methods? (The timeline sidebar running through these slides marks 1936 and 1950 (Turing), 1984 (Valiant), 1988 (Kearns & Valiant), 1989 (Schapire), 1993 (Drucker, Schapire & Simard), and 1995 (Freund & Schapire).)

Weak PAC-learning (Kearns & Valiant 88): the learner only needs to achieve error slightly below 1/2 on every distribution (say, ε = 0.49).

Weak PAC-learning is equivalent to strong PAC-learning (!) (Schapire 89): a learner that only achieves, say, ε = 0.49 can be boosted into one with arbitrarily small error.

The basic idea exploits the fact that you can learn a little on every distribution. Learn h1 from D0 with error < 49%. Modify D0 so that h1 has error 50% (call this D1): flip a coin; if heads, wait for an example x where h1(x) = f(x), otherwise wait for an example where h1(x) != f(x). Learn h2 from D1 with error < 49%. Modify D1 so that h1 and h2 always disagree (call this D2). Learn h3 from D2 with error < 49%. Now vote h1, h2, and h3. This has error better than any of the weak hypotheses h1, h2, or h3. Repeat this as needed to lower the error rate more.
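A rough sketch of this construction (my own illustration, assuming labels in {-1,+1}, a stream of examples from D0 via draw_example(), and a placeholder weak_learn() that returns a hypothesis function):

import random

def schapire_boost(draw_example, weak_learn, n_per_learner=1000):
    # Learn h1 on examples drawn directly from D0.
    h1 = weak_learn([draw_example() for _ in range(n_per_learner)])

    # D1: flip a coin, then wait for an example h1 gets right (heads)
    # or wrong (tails), so h1's error on D1 is exactly 50%.
    def draw_from_D1():
        want_correct = random.random() < 0.5
        while True:
            x, y = draw_example()
            if (h1(x) == y) == want_correct:
                return x, y
    h2 = weak_learn([draw_from_D1() for _ in range(n_per_learner)])

    # D2: wait for examples where h1 and h2 disagree.
    def draw_from_D2():
        while True:
            x, y = draw_example()
            if h1(x) != h2(x):
                return x, y
    h3 = weak_learn([draw_from_D2() for _ in range(n_per_learner)])

    # Final hypothesis: majority vote of h1, h2, h3.
    return lambda x: 1 if (h1(x) + h2(x) + h3(x)) > 0 else -1

The while loops make the cost visible: most drawn examples are thrown away, which is exactly the objection on the next slide.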

Boosting can actually help experimentally, but (Drucker, Schapire, Simard) this construction is very wasteful of examples: modifying D0 so that h1 has error 50% means waiting for an example where h1(x) = f(x) or h1(x) != f(x) depending on the coin flip, and modifying D1 so that h1 and h2 always disagree means waiting for examples where they disagree!?

AdaBoost (Freund and Schapire).

Sample with replacement according to the current example weights (or, equivalently, reweight the examples directly). Increase the weight of x_i if h_t is wrong on it, decrease its weight if h_t is right. The final classifier is a linear combination of the base hypotheses; the best weight α_t depends on the error of h_t.
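A minimal sketch of this update, assuming X and y are numpy arrays with labels in {-1, +1} and using scikit-learn decision stumps as the weak learner (T, X, y are placeholders):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    D = np.ones(n) / n                                # start with uniform weights
    hypotheses, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)       # a decision stump
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = max(D[pred != y].sum(), 1e-12)          # weighted error of h_t (guarded)
        alpha = 0.5 * np.log((1 - eps) / eps)         # best weight depends on error of h_t
        D *= np.exp(-alpha * y * pred)                # up-weight mistakes, down-weight correct
        D /= D.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    # Sign of the weighted linear combination of base hypotheses.
    return np.sign(sum(a * h.predict(X) for h, a in zip(hypotheses, alphas)))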

AdaBoost: Adaptive Boosting (Freund & Schapire, 1995).

Theoretically, one can upper bound the training error of boosting.

Boosting: A toy example (thanks, Rob Schapire). [Five figure slides stepping through successive rounds of boosting; figures not transcribed.]

Boosting improved decision trees. [Experimental results figure.]

Boosting: Analysis. Theorem: if the error at round t of the base classifier is ε_t = 1/2 - γ_t, then the training error of the boosted classifier is bounded as shown below. The algorithm doesn't need to know any bound on γ_t in advance, though: it adapts to the actual sequence of errors.
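The bound itself did not survive the transcription; the standard statement from Freund & Schapire, which I believe is what the slide shows, is

\[
\widehat{\mathrm{err}}(H) \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^{2}}
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),
\]

so as long as each γ_t stays bounded away from zero, the training error drops exponentially fast in the number of rounds T.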

BOOSTING AS OPTIMIZATION

Even boosting single features worked well (Reuters newswire corpus).

Some background facts. Coordinate descent optimization to minimize f(w), where w = <w1,...,wN>: for t = 1,...,T (or until convergence), for i = 1,...,N: pick w* to minimize f(<w1,...,wi-1, w*, wi+1,...,wN>) and set wi = w*.
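A minimal sketch of this loop (my own toy illustration, not from the slides): the exact per-coordinate minimization is replaced by a grid search, and f is a small least-squares objective.

import numpy as np

def coordinate_descent(f, w, n_iters=100, grid=np.linspace(-10, 10, 2001)):
    w = w.copy()
    for t in range(n_iters):
        for i in range(len(w)):
            # Pick w* minimizing f(<w1,...,wi-1, w*, wi+1,...,wN>), holding the rest fixed.
            candidates = np.tile(w, (len(grid), 1))
            candidates[:, i] = grid
            w[i] = grid[np.argmin([f(c) for c in candidates])]
    return w

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w_hat = coordinate_descent(lambda w: np.sum((A @ w - b) ** 2), np.zeros(2))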

Boosting as optimization using coordinate descent. With a small number of possible h's, you can think of boosting as finding a linear combination of these: H(x) = sign( Σ_i w_i h_i(x) ). So boosting is sort of like stacking: the (stacked) instance vector is h(x) = (h_1(x),...,h_N(x)) and the weight vector is w = (α_1,...,α_N). Boosting uses coordinate descent to minimize an upper bound on the training error rate: Σ_i exp( -y_i Σ_t w_t h_t(x_i) ).
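Why this is an upper bound on the number of training errors (my one-line gloss, not on the original slide): a mistake on (x_i, y_i) means y_i Σ_t w_t h_t(x_i) ≤ 0, and exp(-z) ≥ 1 whenever z ≤ 0, so

\[
\sum_i \mathbf{1}\big[\, y_i \ne H(x_i) \,\big] \;\le\; \sum_i \exp\!\Big(-y_i \sum_t w_t h_t(x_i)\Big).
\]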

Boosting and optimization. Jerome Friedman, Trevor Hastie and Robert Tibshirani, Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 2000. Compared using AdaBoost to set feature weights vs. direct optimization of feature weights to minimize log-likelihood, squared error, etc.

BOOSTING AS MARGIN LEARNING

Boosting didn't seem to overfit (!). [Figure: training error and test error as a function of the number of rounds of boosting.]

...because it turned out to be increasing the margin of the classifier. [Figure: margin distributions after 100 rounds and after 1000 rounds.]

Boosting movie

Some background facts (repeated). Coordinate descent optimization to minimize f(w): for t = 1,...,T (or until convergence), for i = 1,...,N, pick w* to minimize f(<w1,...,wi-1, w*, wi+1,...,wN>) and set wi = w*. Norms: ||w||_2 = sqrt(w1^2 + ... + wT^2); more generally ||w||_k = (|w1|^k + ... + |wT|^k)^(1/k); ||w||_1 = |w1| + ... + |wT|; ||w||_inf = max(|w1|,...,|wT|).

Boosting is closely related to margin classifiers like SVMs, the voted perceptron, ... (!). In both cases the (stacked) instance vector is h(x) = (h_1(x),...,h_T(x)) and the weight vector is w = (α_1,...,α_T). Boosting solves max_w min_x [ w·h(x) y / ( ||w||_1 * ||h(x)||_inf ) ], where ||w||_1 = Σ_t α_t and ||h(x)||_inf = max_t h_t(x); this is optimized by coordinate descent, and the coordinates are extended by one in each round of boosting --- usually, unless you happen to generate the same tree twice. Linear SVMs solve max_w min_x [ w·h(x) y / ( ||w||_2 * ||h(x)||_2 ) ], where ||w||_2 = sqrt(Σ_t α_t^2) and ||h(x)||_2 = sqrt(Σ_t h_t(x)^2); this is optimized by QP, ...

BOOSTING VS BAGGING

Boosting vs Bagging. Both build weighted combinations of classifiers, each classifier being learned on a sample of the original data. Boosting reweights the distribution in each iteration; bagging doesn't.
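A minimal sketch of that contrast, assuming scikit-learn and a toy dataset: bagging trains each stump on a bootstrap sample with uniform weights, while AdaBoost reweights the examples after every iteration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)

bagging = BaggingClassifier(stump, n_estimators=200, random_state=0)    # fixed weights
boosting = AdaBoostClassifier(stump, n_estimators=200, random_state=0)  # reweights each round

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())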

Boosting vs Bagging. Boosting finds a linear combination of weak hypotheses. Theory says it reduces bias; in practice it also reduces variance, especially in later iterations. [Experimental figures on the Splice and Audiology datasets.]

Boosting vs Bagging

WRAPUP ON BOOSTING

Boosting in the real world: now William's wrap-up. Boosting is not discussed much in the ML research community any more; it's much too well understood. It's really useful in practice as a meta-learning method. E.g., boosted Naïve Bayes usually beats Naïve Bayes. Boosted decision trees are almost always competitive with respect to accuracy; very robust against rescaling numeric features, extra features, non-linearities, ...; somewhat slower to learn and use than many linear classifiers. But getting probabilities out of them is a little less reliable.