Ensemble learning, 11/19/13. The wisdom of the crowds. Chapter 11: Ensemble methods.

The wisdom of the crowds
Sir Francis Galton discovered in the early 1900s that a collection of educated guesses can add up to very accurate predictions!
The paper in which he describes these findings: Vox Populi. Nature 75, pages 450-451, 1907. http://galton.org/essays/1900-1911/galton-1907-vox-populi.pdf
Image from http://www.pbs.org/wgbh/nova/physics/wisdom-crowds.html

Ensemble methods
Intuition: averaging measurements can lead to more reliable estimates.
Need: an ensemble of different models from the same training data.
How to achieve diversity? By training models on random subsets of the data or on random subsets of the features.
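To make the averaging intuition concrete, here is a tiny simulation with made-up numbers (not Galton's actual data): many noisy, unbiased guesses of a quantity, once averaged, land far closer to the truth than a typical single guess.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 550.0                                      # hypothetical quantity being guessed
guesses = true_value + rng.normal(0.0, 50.0, size=800)  # noisy, unbiased individual guesses

print(f"typical individual error: {np.abs(guesses - true_value).mean():.1f}")
print(f"error of the crowd average: {abs(guesses.mean() - true_value):.1f}")
```

The averaging only helps because the individual errors partly cancel; the same idea drives the committee analysis later in the chapter.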

Ensemble methods
The general strategy:
Construct multiple, diverse predictive models from adapted versions of the training data.
Combine the predictions.

Bootstrap samples
A bootstrap sample is a sample with replacement from a dataset.
The probability that a given example is not selected for a bootstrap sample of size n is (1 - 1/n)^n.
This has a limit as n goes to infinity of 1/e ≈ 0.37.
Conclusion: each bootstrap sample is likely to leave out about a third of the examples.

Bagging
Algorithm Bagging(D, T, A)
Input: data set D; ensemble size T; learning algorithm A.
Output: ensemble of models whose predictions are to be combined by voting or averaging.
    for t = 1 to T do
        build a bootstrap sample D_t from D by sampling |D| data points with replacement;
        run A on D_t to produce a model M_t;
    end
    return {M_t | 1 ≤ t ≤ T}

Comments:
How to combine the classifiers: average the raw classifier scores, convert to probabilities before averaging, or use voting.
The decision boundary of a bagged classifier is more complex than that of the underlying classifiers.
Breiman, Leo (1996). Bagging predictors. Machine Learning 24(2): 123-140.
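A minimal from-scratch sketch of the bagging loop above, with a scikit-learn decision tree standing in for the learning algorithm A (an illustrative choice, not prescribed by the pseudocode):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, T=25, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # bootstrap sample: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])    # shape (T, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)      # majority vote for binary labels 0/1

X, y = make_classification(n_samples=300, random_state=0)
ensemble = bagging(X, y)
print("training accuracy:", (predict_vote(ensemble, X) == y).mean())
```

As a quick check of the bootstrap claim, (1 - 1/1000)**1000 ≈ 0.368, consistent with the "about a third" figure.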

Random forests
Random forests are a special case of bagging with the following additional features:
Use decision trees as the base classifier.
Sample the features for each bootstrap sample.

Algorithm RandomForest(D, T, d)
Input: data set D; ensemble size T; subspace dimension d.
Output: ensemble of tree models whose predictions are to be combined by voting or averaging.
    for t = 1 to T do
        build a bootstrap sample D_t from D by sampling |D| data points with replacement;
        select d features at random and reduce the dimensionality of D_t accordingly;
        train a tree model M_t on D_t without pruning;
    end
    return {M_t | 1 ≤ t ≤ T}
Random forests. L. Breiman. Machine Learning, 2001.

Variable importance: the values of the i-th feature are permuted among the training data and the out-of-bag error is computed on this perturbed data set. The importance score for the feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees.
Error estimation during training using out-of-bag data.
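As a concrete illustration of out-of-bag error estimation and permutation-based variable importance, here is a short sketch using scikit-learn (an assumed choice of library). Note that Breiman's recipe permutes each feature within every tree's out-of-bag samples; for brevity this simplified version permutes on a held-out split instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X_tr, y_tr)
print("out-of-bag accuracy:", rf.oob_score_)    # error estimate without a separate test set

# Permutation importance: how much does accuracy drop when a feature is scrambled?
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1][:3]:
    print(f"feature {i}: importance {imp.importances_mean[i]:.3f}")
```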

Why do committees work?
Let's focus on regression to illustrate this point. The output produced by a committee of M models:
f_{\mathrm{COM}}(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)
Let's assume the output of each model can be represented as the true value plus an error term:
f_m(x) = y(x) + \epsilon_m(x)
The expected squared error of an individual model:
E\left[(f_m(x) - y(x))^2\right] = E\left[\epsilon_m(x)^2\right]
The average of the individual errors (corresponds to the situation where each model predicts independently):
E_{\mathrm{AV}} = \frac{1}{M} \sum_{m=1}^{M} E\left[\epsilon_m(x)^2\right]
Compare that to the expected error of the committee:
E_{\mathrm{COM}} = E\left[\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x) - y(x)\right)^2\right] = E\left[\left(\frac{1}{M}\sum_{m=1}^{M} \epsilon_m(x)\right)^2\right]
Let's assume the errors have zero mean and are uncorrelated:
E[\epsilon_m(x)] = 0, \quad E[\epsilon_m(x)\,\epsilon_l(x)] = 0 \text{ for } m \neq l
Under these assumptions the cross terms vanish when the square is expanded, leaving \frac{1}{M^2}\sum_m E[\epsilon_m(x)^2], so:
E_{\mathrm{COM}} = \frac{1}{M} E_{\mathrm{AV}}
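A quick numeric sanity check of this result under the stated idealized assumptions (zero-mean, uncorrelated errors; the data here are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 10, 100_000                              # ensemble size, number of test points
errors = rng.normal(0.0, 1.0, size=(M, n))      # epsilon_m(x): independent, zero-mean

E_AV  = np.mean(errors ** 2)                    # average individual squared error
E_COM = np.mean(errors.mean(axis=0) ** 2)       # squared error of the averaged committee

print(f"E_AV  = {E_AV:.3f}")                    # ~1.0
print(f"E_COM = {E_COM:.3f}")                   # ~0.1, i.e. roughly E_AV / M
```

In practice the errors of models trained on overlapping data are correlated, so the reduction is smaller than 1/M, which is exactly why ensemble methods work to make the models diverse.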

Mixtures of experts
Bagging produces a result by averaging the predictions of the underlying classifiers. An alternative is a mixture of experts: have each classifier be responsible for a small part of the input space.

Boosting
Similar to bagging, but uses a more sophisticated method for constructing its diverse training sets. Main ideas:
Train the next classifier on examples that previous classifiers made errors on.
Assign each classifier a confidence value that depends on its accuracy.

Algorithm Boosting(D, T, A)
Input: data set D; ensemble size T; learning algorithm A.
Output: weighted ensemble of models.
    w_{1i} ← 1/|D| for all x_i in D;                              // start with uniform weights
    for t = 1 to T do
        run A on D with weights w_{ti} to produce a model M_t;
        calculate the weighted error ε_t;
        if ε_t ≥ 1/2 then set T ← t - 1 and break;
        α_t ← (1/2) ln((1 - ε_t)/ε_t);                            // confidence for this model
        w_{(t+1)i} ← w_{ti} / (2ε_t) for misclassified instances x_i in D;            // increase weight
        w_{(t+1)j} ← w_{tj} / (2(1 - ε_t)) for correctly classified instances x_j in D; // decrease weight
    end
    return M(x) = \sum_{t=1}^{T} α_t M_t(x)

Boosting and the exponential loss
What does boosting do? It turns out that boosting minimizes the exponential loss and leads to large-margin solutions:
L_{\exp}(y, f) = \exp(-y f)
Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.

Which classifier to boost?
Typically choose a simple classifier such as a decision stump or a simple linear classifier.
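A minimal sketch of the boosting loop above, with scikit-learn decision stumps as the weak learner and labels encoded as {-1, +1} (an encoding chosen here for the final weighted vote, not stated on the slide):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def boosting(X, y, T=50):
    n = len(X)
    w = np.full(n, 1.0 / n)                         # start with uniform weights
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = pred != y
        eps = w[miss].sum()                         # weighted error
        if eps >= 0.5:                              # no better than chance: stop
            break
        eps = max(eps, 1e-12)                       # guard against a perfect weak learner
        alpha = 0.5 * np.log((1 - eps) / eps)       # confidence for this model
        w = np.where(miss, w / (2 * eps), w / (2 * (1 - eps)))  # re-weight: misclassified up, correct down
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def predict(models, alphas, X):
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)                          # sign of the weighted vote

X, y = make_classification(n_samples=300, random_state=0)
y = 2 * y - 1                                       # map labels {0, 1} -> {-1, +1}
models, alphas = boosting(X, y)
print("training accuracy:", (predict(models, alphas, X) == y).mean())
```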

The bias-variance decomposition
The expected loss can be decomposed as follows (page 9):
expected loss = (bias)^2 + variance
bias: the extent to which the average prediction differs from the true label.
variance: the variability of the classifier when trained on different training sets.
Very flexible models have very low bias but high variance. Rigid models have high bias and low variance. The best model will achieve a good balance between the two.

Bagging vs. boosting
Bagging is a variance reduction technique, whereas boosting acts to reduce bias.
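One rough way to see this empirically, assuming scikit-learn's BaggingClassifier and AdaBoostClassifier as stand-ins (the slides do not reference a library, and exact numbers will vary with the data): bagging helps most with a flexible, high-variance base model, while boosting helps most with a rigid, high-bias one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "single deep tree":  DecisionTreeClassifier(),             # flexible: low bias, high variance
    "bagged deep trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
    "single stump":      DecisionTreeClassifier(max_depth=1),  # rigid: high bias, low variance
    "boosted stumps":    AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:20s} accuracy: {score:.3f}")
```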