Ensemble Methods: Jay Hyer


Ensemble Methods: committee-based learning. Jay Hyer (linkedin.com/in/jayhyer, @adatahead)

Overview: Why Ensemble Learning? (What is learning? How is ensemble learning different?) Boosting (Weak and Strong Learners; AdaBoost). Bagging (Bootstrapping and Aggregating; Out-of-Bag Error; Random Forest). Stacking (Beyond Averaging). Special Applications (Proximity and outliers; Class imbalance). Conclusion. Questions?

Why Ensemble Learning? What is learning? 𝒟 is the phenomenon under investigation: credit card transactions, clinical trials, handwritten digits, etc. D is the dataset of observations of that phenomenon: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. f is the target or ground-truth function: f(x_i) = y_i. H is the hypothesis space, the collection of all possible estimates of the target function f. We will refer to a specific estimate of f from H as h_i. [Abu-Mostafa et al., 2012] (Chapter 1), [Zhou, 2012] (Chapter 1)

[Slide diagram: the hypothesis space H containing candidate models (h_logistic, h_SVM, h_decision tree), the dataset D observed from the phenomenon 𝒟, and the unknown target function f.]

Why Ensemble Learning? What is learning? h_logistic: 1/(1 + e^(-y)), where y = b_0 + b_1 x_1 + ... + b_n x_n. h_SVM: max_α Σ_i α_i − 0.5 Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j). h_decision tree: Ent(S) − Σ_k (|S_k|/|S|) Ent(S_k), the information gain. [Hastie et al., 2009] (Chapters 4, 9 & 12), [Mitchell, 1997] (Chapter 3)
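
To make these three hypothesis families concrete, here is a minimal Python/scikit-learn sketch (my own illustration, not part of the original slides) that fits a logistic regression, an SVM, and a decision tree to the same synthetic dataset; each fitted model is one candidate h drawn from its hypothesis space.

# Illustration only: three different hypothesis families fit to the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

h_logistic = LogisticRegression(max_iter=1000).fit(X, y)        # sigmoid of b_0 + b_1 x_1 + ... + b_n x_n
h_svm = SVC(kernel="rbf").fit(X, y)                             # solves the dual problem over the alphas
h_tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # splits chosen by information gain

for name, h in [("logistic", h_logistic), ("SVM", h_svm), ("tree", h_tree)]:
    print(name, h.score(X, y))  # each h is one estimate of the target function f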

[Slide sequence (three diagrams): model training. Candidate hypotheses h_1, h_2, ..., h_n (e.g. logistic, SVM, and decision-tree models) are fit to the dataset D, each an estimate of the target function f drawn from the hypothesis space H.]

Why Ensemble Learning? How is ensemble learning different? Typically, we choose a single model from the pool of candidates {h_1, h_2, ..., h_m} produced by the different model families. With ensemble learning, we instead combine these models with an ensemble learning algorithm. - Boosting: Σ_{j=1:m} a_j h_j(x) (weighted average). - Bagging: arg max_y Σ_{j=1:m} I(h_j(x) = y) (majority vote). - Stacking: w_1 h_1(x) + ... + w_m h_m(x), with the weights learned by a second-level model (meta-learning).
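
As a rough sketch of these three combination rules (my own illustration; the predictions and weights below are made up), assume we already have the base models' outputs on a few examples:

# Hypothetical predictions of m = 3 base models on 4 examples, labels in {-1, +1}.
import numpy as np

H = np.array([[ 1, -1,  1,  1],
              [ 1,  1, -1,  1],
              [-1, -1,  1,  1]])          # row j holds the predictions of h_j
a = np.array([0.5, 0.3, 0.2])             # boosting-style model weights a_j

boosted = np.sign(a @ H)                  # boosting: sign of the weighted sum of the h_j(x)
bagged = np.sign(H.sum(axis=0))           # bagging: simple majority vote over the models
Z = H.T                                   # stacking: the base outputs become the input features
# ... of a second-level (meta) learner, e.g. a logistic regression fit on (Z, y).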

[Slide diagram: traditional learning. A single hypothesis h_1 is selected from H to approximate f given the dataset D.]

[Slide diagram: ensemble learning. Many hypotheses h_1, h_2, ..., h_m are drawn from H and combined, collectively surrounding the target function f.]

Why Ensemble Learning? How is ensemble learning different? Ensembles can reduce bias by increasing the coverage of the hypothesis space with many diverse models, and can reduce variance by averaging the output of many diverse models. So what's the secret to ensemble learning? Diversity. Diversity can be achieved by augmenting the input data with weights (boosting) or resampling (bagging), by selecting different input variables (Random Forest), or simply by trying many different models and combining them (stacking).

Boosting Weak and Strong Learners A weak learner performs only slightly better than random guessing. A strong learner closely predicts the target function. Can a group of weak learners perform as well as a single strong learner? Yes! These two notions of learnability are equivalent [Schapire, 1990].

Boosting AdaBoost Input: dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with y_i ∈ {-1, 1}; a base learning algorithm L (e.g. a decision tree); the number of base learners M. Process: 1. Initialize all example weights equally: w_i = 1/n. 2. For j in 1:M, fit a learner with the current weights: h_j = L(D, w). 3. Compute the weighted error of h_j: e_j = Σ_{i=1:n} w_i I(h_j(x_i) ≠ y_i) / Σ_{i=1:n} w_i. 4. If e_j > 0.5 (no better than chance), stop. 5. Set the learner weight: a_j = 0.5 ln((1 − e_j)/e_j). 6. Update the example weights: w_i ← w_i exp(a_j I(h_j(x_i) ≠ y_i)). 7. Combine the M base learners: H(x) = sign(Σ_{j=1:M} a_j h_j(x)). [Hastie et al., 2009] (Chapter 9), [Zhou, 2012] (Chapter 2)
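
The steps above translate fairly directly into Python. The following is a minimal sketch of my own (not the slide author's code), using a depth-1 decision tree ("stump") from scikit-learn as the weak learner and assuming labels in {-1, +1}:

# Minimal AdaBoost sketch following the steps above; assumes y takes values in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                          # step 1: equal initial weights
    learners, alphas = [], []
    for _ in range(M):
        h = DecisionTreeClassifier(max_depth=1)      # weak learner: a decision stump
        h.fit(X, y, sample_weight=w)                 # step 2: fit using the current weights
        miss = (h.predict(X) != y).astype(float)
        e = np.dot(w, miss) / w.sum()                # step 3: weighted error e_j
        if e > 0.5:                                  # step 4: stop if worse than chance
            break
        e = max(e, 1e-10)                            # guard against a perfect stump (e = 0)
        a = 0.5 * np.log((1 - e) / e)                # step 5: learner weight a_j
        w = w * np.exp(a * miss)                     # step 6: up-weight the misclassified points
        w = w / w.sum()
        learners.append(h)
        alphas.append(a)
    def H(X_new):                                    # step 7: weighted vote of the learners
        score = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(score)
    return H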

Bagging Bootstrapping and Aggregating Given a dataset of n observations D = {x_1, x_2, ..., x_n}, we can draw n samples from D with replacement, and repeat this to obtain a total of B bootstrap samples. For example, with D = {1, 2, 3, 4, 5}: D_1* = {5, 5, 1, 3, 3}, D_2* = {2, 3, 2, 4, 1}, D_3* = {2, 3, 4, 3, 2}, D_4* = {4, 1, 1, 5, 2}, ..., D_B* = {4, 4, 1, 5, 3}. [L. Breiman, 1996], [Hastie et al., 2009] (Chapter 7), [Efron and Tibshirani, 1993] (Chapter 6)
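
In Python, drawing bootstrap samples is a one-liner with numpy; this sketch mirrors the toy example above (the actual random draws will of course differ):

# Sketch: B bootstrap samples of size n, drawn with replacement from D.
import numpy as np

rng = np.random.default_rng(0)
D = np.array([1, 2, 3, 4, 5])
B = 4
bootstrap_samples = [rng.choice(D, size=len(D), replace=True) for _ in range(B)]
print(bootstrap_samples)   # e.g. [array([5, 5, 1, 3, 3]), array([2, 3, 2, 4, 1]), ...]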

Bagging Bootstrapping and Aggregating Input: dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}; a base learning algorithm L (e.g. a decision tree); the number of base learners M. Process: 1. For j in 1:M, fit a learner on a bootstrap sample: h_j = L(D_j*). 2. Combine the M base learners by majority vote: H(x) = arg max_y Σ_{j=1:M} I(h_j(x) = y). [Hastie et al., 2009] (Chapter 8), [Zhou, 2012] (Chapter 3)
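
A minimal bagging sketch in Python (my own illustration of the pseudocode above): M decision trees fit to bootstrap samples and combined by majority vote. It assumes integer class labels 0..K-1.

# Minimal bagging sketch: M trees on bootstrap samples, combined by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, M=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    learners = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)                 # indices of bootstrap sample D_j*
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    def H(X_new):
        votes = np.stack([h.predict(X_new) for h in learners])   # shape (M, n_new)
        # majority vote per example; assumes labels are integers 0..K-1
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return H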

Bagging Out-of-Bag Error For each learner h_j in the bagging ensemble, we keep track of the data points that are not in its bootstrap sample D_j* and refer to them as the out-of-bag samples, D_j*oob. The out-of-bag error is the proportion of out-of-bag predictions that are misclassified, averaged over the entire dataset D: (1/|D|) Σ_j (1/|D_j*oob|) Σ_{(x,y) ∈ D_j*oob} I(h_j(x) ≠ y).
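
One simple way to compute this in Python, assuming the bagging loop also stored each learner's bootstrap indices (a sketch of my own; here the per-learner out-of-bag error rates are averaged over the ensemble):

# Sketch: out-of-bag error, given each learner and the indices of its bootstrap sample.
import numpy as np

def oob_error(learners, bootstrap_indices, X, y):
    n = len(y)
    per_learner = []
    for h, idx in zip(learners, bootstrap_indices):
        oob = np.setdiff1d(np.arange(n), idx)       # D_j*oob: points the learner never saw
        if len(oob) == 0:
            continue                                # rare, but possible for tiny datasets
        per_learner.append(np.mean(h.predict(X[oob]) != y[oob]))
    return float(np.mean(per_learner))              # misclassification rate averaged over learners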

Bagging Random Forest Input: dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}; the full feature set F; a random subset f of the features F. Process (for j in 1:M, on bootstrap sample D_j*): 1. Create a node of tree h_j using D_j* and a random feature subset f_i. 2. If all points at the node belong to the same class, return h_j. 3. Otherwise build the list F' of features that can still be split on. 4. If F' is empty, return the tree h_j. 5. Select a new random subset f_{i+1} from F' and continue growing the tree (repeat from step 1). 6. Combine the M trees by majority vote: H(x) = arg max_y Σ_{j=1:M} I(h_j(x) = y). [Hastie et al., 2009] (Chapter 9), [Zhou, 2012] (Chapter 2)
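
In practice one rarely codes this by hand. A sketch using scikit-learn's implementation (parameter values here are arbitrary, not from the slides) shows the two sources of randomness, bootstrap sampling and per-split feature subsets, plus the out-of-bag estimate from the previous slide:

# Sketch: Random Forest = bagged trees + a random feature subset considered at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200,      # M trees
                            max_features="sqrt",   # random subset f of the features at each split
                            bootstrap=True,        # each tree sees a bootstrap sample D_j*
                            oob_score=True,        # out-of-bag estimate, cf. the previous slide
                            random_state=0).fit(X, y)
print(rf.oob_score_)                               # OOB accuracy = 1 - OOB error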

Stacking Beyond Averaging Input: dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}; a set of base learning algorithms L_j; a learning algorithm L' to combine them. Process: 1. Train the base learners: h_j = L_j(D) for j in 1:M. 2. Predict on the training data with the trained learners: z_ij = h_j(x_i). 3. Build a new dataset from the base-learner outputs: D' = {((z_11, ..., z_1M), y_1), ..., ((z_n1, ..., z_nM), y_n)}. 4. Train the stacking learner: h' = L'(D'). 5. The ensemble prediction uses the base learners' outputs as input to the stacking learner: H(x) = h'(h_1(x), ..., h_M(x)) = h'(z_1, ..., z_M). [Hastie et al., 2009] (Chapter 9), [Zhou, 2012] (Chapter 2)
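
A minimal stacking sketch in Python (my own illustration). One common refinement over the bare pseudocode above, used here, is to build the meta-features z_j from out-of-fold predictions, so the second-level learner is not trained on predictions the base learners made on their own training points:

# Minimal stacking sketch: base-learner outputs become the features of a second-level model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
base = [DecisionTreeClassifier(max_depth=3), SVC(), LogisticRegression(max_iter=1000)]

# D': column j holds z_j = h_j(x_i), computed out-of-fold to limit label leakage.
Z = np.column_stack([cross_val_predict(h, X, y, cv=5) for h in base])
meta = LogisticRegression().fit(Z, y)              # h' = L'(D')

for h in base:                                     # refit the base learners on all of D
    h.fit(X, y)

def H(X_new):                                      # H(x) = h'(h_1(x), ..., h_M(x))
    Z_new = np.column_stack([h.predict(X_new) for h in base])
    return meta.predict(Z_new)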

Advances and Special Applications Proximity and Outlier Detection After building a Random Forest, run all data (training and OOB data) down the trees and count the number of times each pair of data points falls within the same terminal node. This is the pairwise proximity. Comparing an observation's proximity to data points within the same class can detect outliers: they are the data points with below-average proximity.
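
A sketch of the proximity computation with scikit-learn (my own illustration; apply() returns, for each sample, the index of the leaf it lands in for every tree):

# Sketch: pairwise Random Forest proximity and a simple within-class outlier score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)                                # shape (n_samples, n_trees): leaf index per tree
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)   # share of trees sharing a leaf

# Points whose average proximity to their own class is unusually low are outlier candidates.
within_class = np.array([prox[i, y == y[i]].mean() for i in range(len(y))])
outlier_candidates = np.argsort(within_class)[:5]   # the five least-typical points of their class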

Advances and Special Applications Class Imbalance Most real-world datasets do not have evenly distributed classes, and some of the most interesting datasets contain a relatively small population for a specific class. Balancing classes in the training set can be done by downsampling, but this discards valuable training data. Instead, an ensemble of models can be created from many different downsampled training sets. [Zhou, 2012] (Chapter 2)
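
A sketch of that idea (my own illustration, assuming a binary problem where the rare class is labeled 1 and the majority class 0): each base model sees all of the minority class but a different random downsample of the majority class, and the models are combined by majority vote.

# Sketch: ensemble of models trained on different balanced downsamples of an imbalanced dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def downsampled_ensemble(X, y, M=10, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)               # assumed rare class
    majority = np.flatnonzero(y == 0)               # assumed common class
    learners = []
    for _ in range(M):
        keep = rng.choice(majority, size=len(minority), replace=False)  # fresh downsample
        idx = np.concatenate([minority, keep])
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    def H(X_new):
        votes = np.mean([h.predict(X_new) for h in learners], axis=0)
        return (votes >= 0.5).astype(int)           # majority vote across the M balanced models
    return H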

Why Ensemble Methods? "My first option with a dataset is almost always tree-based (boosted or bagged)." Jose Guerrero, #1 ranked on Kaggle

Conclusion The key ingredient in ensemble learning is creating diversity among the models: augmenting the input, selecting input variables, or using completely different models. Understand the inherent biases of your constituent models, and lower overall variance by increasing the number of models in the ensemble. Understand the quality and deficiencies of your input data; this can affect which ensemble method you will be able to use.

Thank You! Questions?

Bibliography Y.S. Abu-Mostafa, M. Magdon-Ismail and H.T. Lin. Learning From Data: A Short Course. AMLBook, 2012. B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993. L. Breiman. Bagging Predictors. Machine Learning, 24(2): 123-140, 1996. L. Breiman. Random Forests. Machine Learning, 45(1): 5-32, 2001. Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997. T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning, 2nd ed. Springer, 2009. T. Mitchell. Machine Learning. McGraw-Hill, 1997. R.E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2): 197-227, 1990. Z.H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

Further Reading / R packages Boosting: G. Ridgeway. Generalized Boosted Models: A Guide to the gbm Package. 2007. Available online: http://cran.open-source-solution.org/web/packages/gbm/vignettes/gbm.pdf. J. Zhu, S. Rosset, H. Zou and T. Hastie. Multi-Class AdaBoost. Technical report, Department of Statistics, University of Michigan, Ann Arbor, MI, 2006. R packages: ada, adabag, gbm and mboost. Random Forest: R packages: randomForest, party and Boruta (feature selection with Random Forest). Comparing different ensemble methods: T.G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40(2): 139-157, 2000.