Boosting
Ryan Tibshirani
Data Mining: 36-462/36-662, April 25 2013
Optional reading: ISL 8.2; ESL 10.1-10.4, 10.7, 10.13

Reminder: classification trees

Suppose that we are given training data $(x_i, y_i)$, $i = 1, \ldots, n$, with $y_i \in \{1, \ldots, K\}$ the class label and $x_i \in \mathbb{R}^p$ the associated features.

Recall that the CART algorithm to fit a classification tree starts by considering splitting on variable $j$ and split point $s$, and defines the regions
$$R_1 = \{X \in \mathbb{R}^p : X_j \leq s\}, \qquad R_2 = \{X \in \mathbb{R}^p : X_j > s\}$$
Within each region the proportion of points of class $k$ is
$$\hat{p}_k(R_m) = \frac{1}{n_m} \sum_{x_i \in R_m} 1\{y_i = k\}$$
with $n_m$ the number of points in $R_m$. The most common class is
$$c_m = \operatorname*{argmax}_{k = 1, \ldots, K} \hat{p}_k(R_m)$$

The feature and split point $j, s$ are chosen greedily by minimizing the misclassification error
$$\operatorname*{argmin}_{j, s} \; \Big( \big[1 - \hat{p}_{c_1}(R_1)\big] + \big[1 - \hat{p}_{c_2}(R_2)\big] \Big)$$
This strategy is repeated within each of the newly formed regions.
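
As a concrete illustration, here is a minimal R sketch of one greedy split search following the criterion above. It is not the CART implementation itself: the function name best_split and the assumption that x is a numeric matrix and y a vector of class labels are illustrative choices.

# Sketch of one greedy CART-style split (illustration only).
# x: n x p matrix of features, y: length-n vector of class labels.
best_split <- function(x, y) {
  best <- list(err = Inf, j = NA, s = NA)
  for (j in 1:ncol(x)) {
    for (s in sort(unique(x[, j]))) {
      left <- x[, j] <= s
      if (!any(left) || all(left)) next
      # misclassification error of the majority class in each region
      err_left  <- 1 - max(table(y[left]))  / sum(left)
      err_right <- 1 - max(table(y[!left])) / sum(!left)
      if (err_left + err_right < best$err)
        best <- list(err = err_left + err_right, j = j, s = s)
    }
  }
  best
}

Real CART implementations typically split on the Gini index or cross-entropy and weight the two regions by their sizes; the sketch simply mirrors the unweighted criterion displayed above.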

Classification trees with observation weights

Now suppose that we are additionally given observation weights $w_i$, $i = 1, \ldots, n$. Each $w_i \geq 0$, and a higher value means that we place more importance on correctly classifying that observation.

The CART algorithm can be easily adapted to use the weights. All that changes is that our sums become weighted sums. I.e., we now use the weighted proportion of points of class $k$ in region $R_m$:
$$\hat{p}_k(R_m) = \frac{\sum_{x_i \in R_m} w_i 1\{y_i = k\}}{\sum_{x_i \in R_m} w_i}$$
As before, we let $c_m = \operatorname*{argmax}_{k = 1, \ldots, K} \hat{p}_k(R_m)$, and hence $1 - \hat{p}_{c_m}(R_m)$ is the weighted misclassification error.
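
In R, the weighted proportions can be computed in one line; the sketch below uses illustrative names (y_m and w_m for the labels and weights of the points in region R_m). In practice, tree fitters such as rpart() accept a weights argument of case weights for exactly this purpose.

# Weighted class proportions in a region (sketch of the formula above).
# y_m: class labels of the points in R_m, w_m: their observation weights.
weighted_props <- function(y_m, w_m) {
  tapply(w_m, y_m, sum) / sum(w_m)          # hat{p}_k(R_m) for each class k
}
# weighted majority class and weighted misclassification error:
# c_m <- names(which.max(weighted_props(y_m, w_m)))
# err <- 1 - max(weighted_props(y_m, w_m))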

Boosting

Boosting[1] is similar to bagging in that we combine the results of several classification trees. However, boosting does something fundamentally different, and can work a lot better.

As usual, we start with training data $(x_i, y_i)$, $i = 1, \ldots, n$. We'll assume that $y_i \in \{-1, +1\}$ and $x_i \in \mathbb{R}^p$. Hence we suppose that a classification tree (fit, e.g., to the training data) will return a prediction of the form $\hat{f}_{\mathrm{tree}}(x) \in \{-1, +1\}$ for an input $x \in \mathbb{R}^p$.

In boosting we combine a weighted sum of $B$ different tree classifiers,
$$\hat{f}_{\mathrm{boost}}(x) = \mathrm{sign}\Big( \sum_{b=1}^{B} \alpha_b \hat{f}_{\mathrm{tree},b}(x) \Big)$$

[1] Freund and Schapire (1995), "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting". Similar ideas were around earlier.

One of the key differences between boosting and bagging is how the individual classifiers $\hat{f}_{\mathrm{tree},b}$ are fit. Unlike in bagging, in boosting we fit the tree to the entire training set, but adaptively weight the observations to encourage better predictions for points that were previously misclassified. (From ESL page 338)

The basic boosting algorithm (AdaBoost)

Given training data $(x_i, y_i)$, $i = 1, \ldots, n$, the basic boosting method is called AdaBoost, and can be described as:

Initialize the weights by $w_i = 1/n$ for each $i$

For $b = 1, \ldots, B$:
1. Fit a classification tree $\hat{f}_{\mathrm{tree},b}$ to the training data with weights $w_1, \ldots, w_n$
2. Compute the weighted misclassification error
   $$e_b = \frac{\sum_{i=1}^{n} w_i 1\{y_i \neq \hat{f}_{\mathrm{tree},b}(x_i)\}}{\sum_{i=1}^{n} w_i}$$
3. Let $\alpha_b = \log\{(1 - e_b)/e_b\}$
4. Update the weights as $w_i \leftarrow w_i \exp\big(\alpha_b 1\{y_i \neq \hat{f}_{\mathrm{tree},b}(x_i)\}\big)$ for each $i$

Return $\hat{f}_{\mathrm{boost}}(x) = \mathrm{sign}\big( \sum_{b=1}^{B} \alpha_b \hat{f}_{\mathrm{tree},b}(x) \big)$
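
For concreteness, here is a minimal from-scratch sketch of the algorithm in R, using rpart stumps as the base classifier. The function names (adaboost_stumps, predict_adaboost), the choice of stumps, and the default B = 100 are illustrative assumptions; the sketch assumes y is coded in {-1, +1}.

# Minimal AdaBoost sketch with rpart stumps (illustration, not a polished implementation).
library(rpart)

adaboost_stumps <- function(x, y, B = 100) {
  n <- nrow(x)
  w <- rep(1 / n, n)                              # initialize weights w_i = 1/n
  trees <- vector("list", B)
  alpha <- numeric(B)
  dat <- data.frame(y = factor(y), x)
  for (b in 1:B) {
    # 1. fit a stump (one split) to the weighted training data
    trees[[b]] <- rpart(y ~ ., data = dat, weights = w, method = "class",
                        control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- as.numeric(as.character(predict(trees[[b]], dat, type = "class")))
    miss <- as.numeric(pred != y)                 # 1{y_i != f_b(x_i)}
    e <- sum(w * miss) / sum(w)                   # 2. weighted misclassification error
    alpha[b] <- log((1 - e) / e)                  # 3. classifier weight
    w <- w * exp(alpha[b] * miss)                 # 4. upweight misclassified points
  }
  list(trees = trees, alpha = alpha)
}

predict_adaboost <- function(fit, newx) {
  newdat <- data.frame(newx)
  scores <- sapply(seq_along(fit$trees), function(b) {
    pred <- predict(fit$trees[[b]], newdat, type = "class")
    fit$alpha[b] * as.numeric(as.character(pred))
  })
  sign(rowSums(matrix(scores, nrow = nrow(newdat))))
}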

Example: boosting stumps

Example (from ESL page 339): here $n = 1000$ points were drawn from the model
$$Y_i = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_{ij}^2 > \chi^2_{10}(0.5) \\ -1 & \text{otherwise} \end{cases} \qquad \text{for } i = 1, \ldots, n$$
where each $X_{ij} \sim N(0, 1)$ independently, $j = 1, \ldots, 10$.

A stump classifier was computed: this is just a classification tree with one split (two leaves). This has a bad misclassification rate on independent test data, of about 45.8%. (Why is this not surprising?)

On the other hand, boosting stumps achieves an excellent error rate of about 5.8% after $B = 400$ iterations. This is much better than, e.g., a single large tree (error rate 24.7%).
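
The simulation setup is easy to reproduce. The sketch below generates data from the model above and reuses the adaboost_stumps() sketch from the previous slide; the seed, test-set size, and exact error you get are not from the lecture and will vary.

# Sketch: nested-spheres simulation from ESL, boosted with the earlier sketch.
set.seed(1)
gen_data <- function(n, p = 10) {
  x <- matrix(rnorm(n * p), n, p)
  y <- ifelse(rowSums(x^2) > qchisq(0.5, df = p), 1, -1)
  list(x = x, y = y)
}
train <- gen_data(1000)
test  <- gen_data(10000)

fit  <- adaboost_stumps(train$x, train$y, B = 400)
yhat <- predict_adaboost(fit, test$x)
mean(yhat != test$y)   # test error: far below the ~45.8% of a single stump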


Why does boosting work?

The intuition behind boosting is simple: we weight misclassified observations in such a way that they get properly classified in future iterations. But there are many ways to do this, so why does this particular weighting seem to work so well?

One nice discovery was the connection between boosting and forward stepwise modeling.[2] To understand this connection, it helps to recall forward stepwise linear regression. Given a continuous response $y \in \mathbb{R}^n$ and predictors $X_1, \ldots, X_p \in \mathbb{R}^n$, we:

Choose the predictor $X_j$ giving the smallest squared error loss $\sum_{i=1}^{n} (y_i - \hat{\beta}_j X_{ij})^2$ ($\hat{\beta}_j$ is obtained by regressing $y$ on $X_j$)

Choose the predictor $X_k$ giving the smallest additional loss $\sum_{i=1}^{n} (r_i - \hat{\beta}_k X_{ik})^2$ ($\hat{\beta}_k$ is obtained by regressing $r$ on $X_k$), where $r$ is the residual $r = y - \hat{\beta}_j X_j$

Repeat the last step

[2] Friedman et al. (2000), "Additive Logistic Regression: A Statistical View of Boosting"
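
Here is a minimal R sketch of one step of this procedure, exactly as described above (each step regresses the current residual on a single predictor, with no intercept); the function name forward_step and the usage pattern are illustrative.

# One step of forward stepwise selection on the residual (sketch).
# X: n x p matrix of predictors, r: current residual (initially r = y).
forward_step <- function(X, r) {
  sse <- sapply(1:ncol(X), function(j) {
    beta_j <- sum(X[, j] * r) / sum(X[, j]^2)   # least squares on X_j alone
    sum((r - beta_j * X[, j])^2)
  })
  j <- which.min(sse)
  beta_j <- sum(X[, j] * r) / sum(X[, j]^2)
  list(j = j, beta = beta_j, residual = r - beta_j * X[, j])
}
# Usage: step1 <- forward_step(X, y); step2 <- forward_step(X, step1$residual); ...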

Boosting fits an additive model

In the classification setting, $y_i$ is not continuous but discrete, taking values in $\{-1, +1\}$. Therefore squared error loss is not appropriate. Instead we use exponential loss:
$$L(y_i, f(x_i)) = \exp(-y_i f(x_i))$$
Furthermore, boosting doesn't model $f(x_i)$ as a linear function of the variables $x_{i1}, \ldots, x_{ip}$, but rather as a linear function of trees $T_1(x_i), \ldots, T_M(x_i)$ of a certain size. (Note that the number $M$ of trees of a given size is finite.) Hence the analogous forward stepwise modeling is as follows:

Choose the tree $T_j$ and coefficient $\beta_j$ giving the smallest exponential loss $\sum_{i=1}^{n} \exp(-y_i \beta_j T_j(x_i))$

Choose the tree $T_k$ and coefficient $\beta_k$ giving the smallest additional loss
$$\sum_{i=1}^{n} \exp\big(-y_i \{\beta_j T_j(x_i) + \beta_k T_k(x_i)\}\big) = \sum_{i=1}^{n} \exp(-y_i \beta_j T_j(x_i)) \exp(-y_i \beta_k T_k(x_i))$$

Repeat the last step
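
The factorization in the second step is exactly where the observation weights come from. A brief sketch, in the spirit of ESL 10.4 (the notation $f_{m-1}$ for the fit after $m - 1$ steps is introduced here only for this derivation): writing the loss of adding a new tree $\beta_m T_m$ on top of the current fit $f_{m-1}$,
$$\sum_{i=1}^{n} \exp\big(-y_i [f_{m-1}(x_i) + \beta_m T_m(x_i)]\big) = \sum_{i=1}^{n} w_i^{(m)} \exp\big(-y_i \beta_m T_m(x_i)\big), \qquad w_i^{(m)} = \exp\big(-y_i f_{m-1}(x_i)\big),$$
so each step is a weighted fit in which observations that the current model classifies badly (large $-y_i f_{m-1}(x_i)$) receive large weights. Working out the optimal $\beta_m$ for this weighted problem recovers the AdaBoost quantities $e_b$ and $\alpha_b$ (see ESL 10.4 for the details).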

This forward stepwise procedure can be tied concretely to the AdaBoost algorithm that we just learned (see ESL 10.4 for details). This connection brought boosting from an unfamiliar algorithm to a more-or-less familiar statistical concept. Consequently, there were many suggested ways to extend boosting:

Other losses: exponential loss is not really used anywhere else. It leads to a computationally efficient boosting algorithm (AdaBoost), but binomial deviance and SVM loss are two alternatives that can work better. (General-purpose algorithms for boosting with arbitrary losses are now called gradient boosting)

Shrinkage: instead of adding one tree at a time (thinking in the forward stepwise perspective), why don't we add a small multiple of a tree? This is called shrinkage, a form of regularization for the model building process. This can also be very helpful in terms of prediction accuracy
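
For reference, a hedged sketch of how these extensions look in the gbm package (assuming a data frame dat whose response y is coded 0/1; the parameter values are illustrative, not recommendations):

# Gradient boosting with a binomial deviance loss and shrinkage (sketch).
library(gbm)
fit <- gbm(y ~ ., data = dat,
           distribution = "bernoulli",   # binomial deviance; "adaboost" gives exponential loss
           n.trees = 1000,               # number of trees B
           interaction.depth = 2,        # number of splits per tree (small trees; see next slide)
           shrinkage = 0.01,             # add only a small multiple of each tree
           cv.folds = 5)
best_B <- gbm.perf(fit, method = "cv")   # choose the number of trees by cross-validation
phat <- predict(fit, newdata = dat, n.trees = best_B, type = "response")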

Choosing the size of the trees

How big should the trees used in the boosting procedure be? Remember that in bagging we grew large trees, and then either stopped (no pruning) or pruned back using the original training data as a validation set. In boosting, the best method is actually to grow small trees with no pruning.

In ESL 10.11, it is argued that the right size actually depends on the level of interaction between the predictor variables. Generally trees with 2-8 leaves work well; rarely are more than 10 leaves needed.

Disadvantages

As with bagging, a major downside is that we lose the simple interpretability of classification trees. The final classifier is a weighted sum of trees, which cannot necessarily be represented by a single tree.

Computation is also somewhat more involved. However, with AdaBoost the computation is straightforward (more so than with gradient boosting). Further, because we are growing small trees, each step can be done relatively quickly (compared to, say, bagging).

To deal with the first downside, a measure of variable importance has been developed for boosting (and it can be applied to bagging, also).

Variable importance for trees

For a single decision tree $\hat{f}_{\mathrm{tree}}$, we define the squared importance for variable $j$ as
$$\mathrm{Imp}_j^2(\hat{f}_{\mathrm{tree}}) = \sum_{k=1}^{m} \hat{d}_k \, 1\{\text{split at node } k \text{ is on variable } j\}$$
where $m$ is the number of internal nodes (non-leaves), and $\hat{d}_k$ is the improvement in training misclassification error from making the $k$th split.

Variable importance for boosting

For boosting, we define the squared importance for a variable $j$ by simply averaging the squared importance over all of the fitted trees:
$$\mathrm{Imp}_j^2(\hat{f}_{\mathrm{boost}}) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{Imp}_j^2(\hat{f}_{\mathrm{tree},b})$$
(To get the importance, we just take the square root.) We also usually set the largest importance to 100, and scale all of the other variable importances accordingly; these are then called relative importances.

This averaging stabilizes the variable importances, which means that they tend to be much more accurate for boosting than for any single tree.
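
With a gbm fit like the one sketched a few slides back, relative importances scaled so that the largest is 100 can be computed as follows. This is a sketch: gbm's relative.influence() measures importance via summed split improvements in the fitted loss, which differs in detail from the squared-importance formula on this slide.

# Relative variable importances from a gbm fit (sketch; see ?relative.influence).
imp <- relative.influence(fit, n.trees = best_B)
rel_imp <- 100 * imp / max(imp)          # scale so the largest importance is 100
sort(rel_imp, decreasing = TRUE)
# summary(fit) produces a similar table and a barplot of relative influence.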

Example: spam data

Example from ESL 10.8: recall the spam data set, with $n = 4601$ emails (1813 spam emails), and $p = 58$ attributes measured on each email.

A classification tree, grown by CART and pruned to 15 leaves, had a classification error of 8.7% on an independent test set. Gradient boosting with stumps (only two leaves per tree) achieved an error rate of 4.7% on an independent test set. This is nearly an improvement by a factor of 2! (It also outperforms additive logistic regression and MARS.)

So what does the boosting classifier look like? This question is hard to answer, because it's not a tree, but we can look at the relative variable importances.
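
A hedged sketch of running boosted stumps on this data set, assuming the copy of the spam data shipped with the kernlab package (a 4601-row data frame with a factor column type); the train/test split, seed, and tuning values are illustrative, and the resulting error will not match ESL's numbers exactly.

# Sketch: boosting stumps on the spam data via gbm.
library(kernlab)
library(gbm)
data(spam)
spam$y <- as.numeric(spam$type == "spam")   # 0/1 response for gbm
spam$type <- NULL
set.seed(1)
tr <- sample(nrow(spam), 3000)
fit_spam <- gbm(y ~ ., data = spam[tr, ], distribution = "bernoulli",
                n.trees = 2000, interaction.depth = 1, shrinkage = 0.05)
phat <- predict(fit_spam, spam[-tr, ], n.trees = 2000, type = "response")
mean((phat > 0.5) != spam$y[-tr])           # test misclassification error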

(Figure from ESL 10.8: relative variable importances for the spam data.)

Bagging and boosting in R

The package adabag implements both bagging and boosting (AdaBoost) for trees, via the functions bagging and boosting. The package ipred also performs bagging, with the bagging function.

The package gbm implements both AdaBoost and gradient boosting; see the function gbm and its distribution option.

The package mboost is an alternative for boosting that is quite flexible; see the function blackboost for boosting trees, and the function mboost for boosting arbitrary base learners.
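
As a quick usage sketch of the first of these, assuming a data frame train with a factor response class and a data frame test with the same columns (the names are illustrative):

# AdaBoost for trees via the adabag package (sketch).
library(adabag)
fit <- boosting(class ~ ., data = train, mfinal = 100)   # 100 boosting iterations
pred <- predict(fit, newdata = test)
pred$error                                               # test misclassification error
fit$importance                                           # relative variable importances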

Recap: boosting

In this lecture we learned boosting in the context of classification via decision trees. The basic concept is to repeatedly fit classification trees to weighted versions of the training data. In each iteration we update the weights in order to better classify previously misclassified observations. The final classifier is a weighted sum of the decisions made by the trees.

Boosting shares a close connection to forward stepwise model building. In a precise way, it can be viewed as forward stepwise model building where the base learners are trees and the loss is exponential. Using other losses can improve performance, as can employing shrinkage. Boosting is a very general, powerful tool.

Although the final classifier is not as simple as a tree itself, relative variable importances can be computed by looking at how much a given variable contributes to the splits, across all of the trees.

Next time: final projects!

The final project has you perform crime mining (data mining on a crime data set).