CART: Classification and Regression Trees
Niels Richard Hansen (Univ. Copenhagen), Statistical Learning, June 8, 2009


Trees can be viewed as basis expansions of simple functions

f(x) = \sum_{m=1}^{M} c_m 1(x \in R_m)

with R_1, ..., R_M \subseteq \mathbb{R}^p disjoint. The CART algorithm is a heuristic, adaptive algorithm for basis function selection.

A recursive, binary partition (a tree) is given by a list of splits

\{(t_{01}), (t_{11}, t_{12}), (t_{21}, t_{22}, t_{23}, t_{24}), ..., (t_{n1}, ..., t_{n 2^n})\}

and corresponding split variable indices

\{(i_{01}), (i_{11}, i_{12}), (i_{21}, i_{22}, i_{23}, i_{24}), ..., (i_{n1}, ..., i_{n 2^n})\}.

Then

R_1 = (x_{i_{01}} < t_{01}) \cap (x_{i_{11}} < t_{11}) \cap ... \cap (x_{i_{n1}} < t_{n1}),

and we can determine whether x \in R_1 in n steps, even though M = 2^n.
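To make the point concrete that n yes/no questions locate x among up to 2^n regions, here is a minimal Python sketch; the nested-dict tree representation and the depth-2 example tree are illustrative assumptions, not notation from the slides.

```python
import numpy as np

def assign_region(x, node):
    """Follow the recursive binary splits (x[var] < split goes left) until a
    leaf is reached: a depth-n tree needs only n comparisons even though it
    can encode up to M = 2**n regions."""
    while "leaf" not in node:
        node = node["left"] if x[node["var"]] < node["split"] else node["right"]
    return node["leaf"]

# Hypothetical depth-2 tree on [0, 1]^2 encoding four regions R_1, ..., R_4.
tree = {"var": 0, "split": 0.5,
        "left":  {"var": 1, "split": 0.3, "left": {"leaf": 1}, "right": {"leaf": 2}},
        "right": {"var": 1, "split": 0.7, "left": {"leaf": 3}, "right": {"leaf": 4}}}

print(assign_region(np.array([0.2, 0.9]), tree))  # 2, after two comparisons
```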

Figure 9.2: Recursive Binary Partitions

The recursive partition of [0, 1]^2 and its representation by a tree. A binary tree of depth n can represent up to 2^n partitions/basis functions. We can determine which R_j an x belongs to by n recursive yes/no questions.

Figure 9.2: General Partitions

A general partition that cannot be represented by binary splits. With M sets in a general partition we would in general need on the order of M yes/no questions to determine which of the sets an x belongs to.

Figure 9.2: Recursive Binary Partitions

For a fixed partition R_1, ..., R_M the least squares estimates are

\hat{c}_m = \bar{y}(R_m) = \frac{1}{N_m} \sum_{i: x_i \in R_m} y_i,    N_m = |\{i \mid x_i \in R_m\}|.

The recursive partition allows for rapid computation of the estimates and rapid prediction of new observations.
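As a small illustration of the least squares estimates, the following sketch computes \hat{c}_m as the mean of y over each region, assuming the regions are given as an index vector (a hypothetical encoding, chosen for brevity):

```python
import numpy as np

def region_means(y, region_index):
    """Least squares estimates c_m = mean of y over each region R_m,
    given region_index[i] = m whenever x_i lies in R_m."""
    return {m: y[region_index == m].mean() for m in np.unique(region_index)}

y = np.array([1.0, 1.2, 3.1, 2.9, 0.8])
region_index = np.array([1, 1, 2, 2, 1])
print(region_means(y, region_index))  # {1: 1.0, 2: 3.0}
```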

Greedy Splitting Algorithm

With squared error loss and an unknown partition R_1, ..., R_M we would seek to minimize

\sum_i (y_i - \bar{y}(R_{m(i)}))^2

over the possible binary, recursive partitions. But this is computationally difficult.

An optimal single split on a region R is determined by

\min_j \min_s \Big[ \sum_{i: x_i \in R(j,s)} (y_i - \bar{y}(R(j,s)))^2 + \sum_{i: x_i \in R(j,s)^c} (y_i - \bar{y}(R(j,s)^c))^2 \Big]

with R(j, s) = \{x \in R \mid x_j < s\}; the inner minimization over s is a univariate optimization problem.

The tree growing algorithm recursively does single, optimal splits on each of the partitions obtained in the previous step.
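The single optimal split can be sketched directly as a double loop over split variables j and candidate split points s. This naive version is quadratic in the number of observations per variable, whereas a sorted implementation would be cheaper; it is a sketch of the criterion above, not the CART implementation itself.

```python
import numpy as np

def best_split(X, y):
    """Optimal single split: minimize SSE(R(j, s)) + SSE(R(j, s)^c) over
    split variables j and candidate split points s (the observed x_ij)."""
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            if left.sum() == 0 or left.sum() == len(y):
                continue                      # skip empty splits
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, s)
    return best  # (total SSE, split variable j, split point s)

X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([1.0, 1.1, 3.0, 3.2])
print(best_split(X, y))  # approximately (0.025, 0, 0.6): split on variable 0 at s = 0.6
```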

Tree Pruning

The full binary tree, T_0, representing the partitions R_1, ..., R_M with M = 2^n may be too large. We prune it by snipping off leaves or subtrees.

For any subtree T of T_0 with |T| leaves and partition R_1(T), ..., R_{|T|}(T), the cost-complexity of T is

C_\alpha(T) = \sum_i (y_i - \bar{y}(R_{m(i)}(T)))^2 + \alpha |T|.

Theorem: There is a finite set of subtrees T_0 \supseteq T_{\alpha_1} \supseteq T_{\alpha_2} \supseteq ... \supseteq T_{\alpha_r} with 0 = \alpha_1 < \alpha_2 < ... < \alpha_r such that T_{\alpha_i} minimizes C_\alpha(T) for \alpha \in [\alpha_i, \alpha_{i+1}).
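A small numeric sketch of the cost-complexity criterion: the candidate subtrees and their per-leaf residual sums of squares below are made up, but they show how the minimizer of C_\alpha(T) moves through a nested sequence of subtrees as \alpha grows.

```python
def cost_complexity(leaf_sse, alpha):
    """C_alpha(T) = sum of the per-leaf residual sums of squares + alpha * |T|,
    where leaf_sse holds one SSE per leaf of the subtree T."""
    return sum(leaf_sse) + alpha * len(leaf_sse)

# Hypothetical nested subtrees of T_0, described only by their leaf SSEs.
subtrees = {"T0 (4 leaves)": [0.2, 0.1, 0.3, 0.2],
            "T1 (3 leaves)": [0.2, 0.1, 0.6],
            "T2 (2 leaves)": [0.2, 1.5]}

for alpha in (0.0, 0.5, 1.0):
    best = min(subtrees, key=lambda name: cost_complexity(subtrees[name], alpha))
    print(alpha, best)   # T0, then T1, then T2 as alpha increases
```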

Node Impurities and Classification Trees

Define the node impurity as the average loss for the node R:

Q(R) = \frac{1}{N(R)} \sum_{i: x_i \in R} (y_i - \bar{y}(R))^2.

The greedy split of R is found by

\min_j \min_s \big[ N(R(j,s)) Q(R(j,s)) + N(R(j,s)^c) Q(R(j,s)^c) \big]

with R(j, s) = \{x \in R \mid x_j < s\}, and we have

C_\alpha(T) = \sum_{m=1}^{|T|} N(R_m(T)) Q(R_m(T)) + \alpha |T|.

If Y takes K discrete values we focus on the node estimate for R_m(T) in tree T,

\hat{p}_m(T)(k) = \frac{1}{N_m} \sum_{i: x_i \in R_m(T)} 1(y_i = k).
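The node estimate \hat{p}_m(T)(k) is simply the vector of class proportions in the node; a minimal sketch, assuming the classes are coded 1, ..., K:

```python
import numpy as np

def node_proportions(y_node, K):
    """Class proportions p_m(k) = (1/N_m) * sum 1(y_i = k) for one node,
    with classes coded 1, ..., K."""
    return np.array([(y_node == k).mean() for k in range(1, K + 1)])

y_node = np.array([1, 1, 2, 3, 1])
print(node_proportions(y_node, K=3))  # [0.6 0.2 0.2]
```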

Node Impurities and Classification Trees

The loss functions for classification enter in the specification of the node impurities used for splitting and cost-complexity computations. Examples:

0-1 loss gives the misclassification error impurity:

Q(R_m(T)) = 1 - \max\{\hat{p}(R_m(T))(1), ..., \hat{p}(R_m(T))(K)\}

Likelihood loss gives the entropy impurity:

Q(R_m(T)) = -\sum_{k=1}^{K} \hat{p}(R_m(T))(k) \log \hat{p}(R_m(T))(k)

The Gini index impurity:

Q(R_m(T)) = \sum_{k=1}^{K} \hat{p}(R_m(T))(k) (1 - \hat{p}(R_m(T))(k))
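The three impurities as functions of a node's class-proportion vector, a direct transcription of the formulas above:

```python
import numpy as np

def misclassification(p):
    """0-1 loss impurity: 1 - max_k p(k)."""
    return 1.0 - p.max()

def entropy(p):
    """Entropy impurity: -sum_k p(k) log p(k), with 0 log 0 taken as 0."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def gini(p):
    """Gini index impurity: sum_k p(k) (1 - p(k))."""
    return (p * (1.0 - p)).sum()

p = np.array([0.6, 0.2, 0.2])   # node class proportions
print(misclassification(p), entropy(p), gini(p))
```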

Figure 9.3: Node Impurities

Spam Example

Sensitivity and Specificity

The sensitivity is the probability of predicting 1 given that the true value is 1 (predict a case given that there is a case):

sensitivity = P(f(X) = 1 | Y = 1) = \frac{P(Y = 1, f(X) = 1)}{P(Y = 1, f(X) = 1) + P(Y = 1, f(X) = 0)}

The specificity is the probability of predicting 0 given that the true value is 0 (predict that there is no case given that there is no case):

specificity = P(f(X) = 0 | Y = 0) = \frac{P(Y = 0, f(X) = 0)}{P(Y = 0, f(X) = 0) + P(Y = 0, f(X) = 1)}
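Empirical versions of the two quantities, computed from 0/1 coded labels and predictions (the coding is an assumption; the slides use a generic classifier f):

```python
import numpy as np

def sensitivity_specificity(y, pred):
    """Empirical sensitivity P(f(X)=1 | Y=1) and specificity P(f(X)=0 | Y=0)
    for 0/1 coded true labels y and predictions pred."""
    y, pred = np.asarray(y), np.asarray(pred)
    sens = (pred[y == 1] == 1).mean()
    spec = (pred[y == 0] == 0).mean()
    return sens, spec

y    = [1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1, 0]
print(sensitivity_specificity(y, pred))  # (0.666..., 0.75)
```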

ROC Curves

The receiver operating characteristic, or ROC curve, plots sensitivity against 1 - specificity as the classification threshold varies.
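A sketch of how the ROC curve can be traced out empirically: sweep a threshold over predicted scores and record (1 - specificity, sensitivity) at each threshold. The score-based setup is an assumption; the slides only name the curve.

```python
import numpy as np

def roc_curve(y, score):
    """ROC points (1 - specificity, sensitivity) as the threshold on the
    score sweeps over all observed score values, from high to low."""
    y, score = np.asarray(y), np.asarray(score)
    points = [(0.0, 0.0)]
    for t in np.sort(np.unique(score))[::-1]:
        pred = (score >= t).astype(int)
        sens = (pred[y == 1] == 1).mean()
        fpr = (pred[y == 0] == 1).mean()     # 1 - specificity
        points.append((fpr, sens))
    return points

y     = [1, 1, 0, 1, 0, 0]
score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_curve(y, score))
```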

Ensembles of Weak Predictors

A weak predictor is a predictor that performs only a little better than random guessing. With an ensemble, or collection, of weak predictors \hat{f}_1, ..., \hat{f}_B we seek to combine their predictions, e.g. as

\hat{f}_B = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b,

hoping to improve performance.

Bootstrap aggregation, or bagging, is an example where the ensemble of predictors is obtained by estimating the predictor on bootstrapped data sets.
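A minimal bagging sketch, assuming scikit-learn's DecisionTreeRegressor as the base predictor (the slides do not prescribe a particular learner) and simple averaging of the bootstrap fits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed base predictor, not from the slides

def bagging(X, y, B=50, seed=0):
    """Fit B predictors on bootstrap samples of (X, y); the bagged
    predictor averages their predictions."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample, with replacement
        fits.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return lambda Z: np.mean([f.predict(Z) for f in fits], axis=0)

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=200)
f_bag = bagging(X, y)
print(f_bag(X[:3]))
```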

Combining Weak Predictors

Recall that

V(\hat{f}_B(x)) = \frac{1}{B^2} \sum_{b=1}^{B} V(\hat{f}_b(x)) + \frac{1}{B^2} \sum_{b \neq b'} \mathrm{cov}(\hat{f}_b(x), \hat{f}_{b'}(x)),

hence bagging can be improved if the predictors can be de-correlated. Random Forests (Chapter 15) is a modification of bagging for trees where the bagged trees are de-correlated.

The problem of ensemble learning is broken down into:
- the selection of the base learners,
- the combination of the base learners.

Trees as Ensemble Learners

The basis expansion techniques can be seen as ensemble learning (with or without regularization) where we have specified the base learners a priori. For trees we build and combine, sequentially and recursively, the simplest base learners: the stumps or single splits.

Are there general ways to search the space of learners and combinations of simple learners?

Stagewise Additive Modeling

With b(\cdot, \gamma) for \gamma \in \Gamma a parameterized family of basis functions we can seek expansions of the form

\sum_{m=1}^{M} \beta_m b(x, \gamma_m).

With fixed \gamma_m this is standard; with unrestricted \gamma_m it is in general very difficult numerically.

Suggestion: Evolve the expansion in stages, where (\beta_m, \gamma_m) are estimated in step m and then fixed forever.

Boosting

With any loss function L the forward stagewise additive model is estimated by the algorithm:

1. Set m = 1 and initialize with \hat{f}_0(x) = 0.
2. Compute
   (\hat{\beta}_m, \hat{\gamma}_m) = \mathrm{argmin}_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, \hat{f}_{m-1}(x_i) + \beta b(x_i, \gamma)).
3. Set \hat{f}_m = \hat{f}_{m-1} + \hat{\beta}_m b(\cdot, \hat{\gamma}_m), set m = m + 1 and return to 2.

Note that with squared error loss

L(y_i, \hat{f}_{m-1}(x_i) + \beta b(x_i, \gamma)) = ((y_i - \hat{f}_{m-1}(x_i)) - \beta b(x_i, \gamma))^2,

so every estimation step is a re-estimation on the residuals.
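A sketch of forward stagewise fitting with squared error loss and stumps as the basis functions b(x, \gamma): each stage refits a stump to the current residuals, and \beta_m is absorbed into the stump's fitted leaf means. This illustrates the algorithm above; it is not a reference implementation.

```python
import numpy as np

def fit_stump(X, r):
    """Least squares stump b(x, gamma): gamma = (split variable, split point)
    plus the two leaf means, fitted to the working response r."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            if 0 < left.sum() < len(r):
                pred = np.where(left, r[left].mean(), r[~left].mean())
                sse = ((r - pred) ** 2).sum()
                if sse < best[0]:
                    best = (sse, (j, s, r[left].mean(), r[~left].mean()))
    j, s, cl, cr = best[1]
    return lambda Z: np.where(Z[:, j] < s, cl, cr)

def forward_stagewise(X, y, M=50):
    """Squared error loss: stage m refits a stump to the residuals y - f_{m-1}(x)."""
    f_hat, stumps = np.zeros(len(y)), []
    for _ in range(M):
        b = fit_stump(X, y - f_hat)       # re-estimation on the residuals
        stumps.append(b)
        f_hat += b(X)
    return lambda Z: sum(b(Z) for b in stumps)
```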

Base Classifiers

With Y \in \{-1, 1\} and any classifier G(x) \in \{-1, 1\} the misclassification error is

err(G) = \frac{1}{N} \sum_{i=1}^{N} 1(y_i \neq G(x_i)) = \frac{1}{2N} \sum_{i=1}^{N} (1 - y_i G(x_i)).

With \mathcal{G} a class of classifiers the (unweighted) optimal classifier is

\hat{G} = \mathrm{argmin}_{G \in \mathcal{G}} err(G).

With weights w_1, ..., w_N \geq 0 the weighted optimal classifier is

\hat{G} = \mathrm{argmin}_{G \in \mathcal{G}} \sum_{i=1}^{N} w_i 1(y_i \neq G(x_i)).

Surrogate Loss Functions

The most important property of the surrogate loss functions is that they are convexifications of the 0-1 loss.

AdaBoost: Classification with Exponential Loss

With exponential loss L(y, f(x)) = \exp(-y f(x)) and with w_i^{(m)} = \exp(-y_i \hat{f}_{m-1}(x_i)),

\sum_{i=1}^{N} L(y_i, \hat{f}_{m-1}(x_i) + \beta G(x_i)) = \sum_{i=1}^{N} w_i^{(m)} \exp(-y_i \beta G(x_i))
  = (e^{\beta} - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)} 1(y_i \neq G(x_i)) + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}.

The minimizer is

\hat{G}_m = \mathrm{argmin}_{G \in \mathcal{G}} \sum_{i=1}^{N} w_i^{(m)} 1(y_i \neq G(x_i)),    \hat{\beta}_m = \frac{1}{2} \log \frac{1 - err_m}{err_m}.

The updated weights in step m + 1 are

w_i^{(m+1)} = w_i^{(m)} \exp(-y_i \hat{\beta}_m \hat{G}_m(x_i)) = w_i^{(m)} \exp(2 \hat{\beta}_m 1(y_i \neq \hat{G}_m(x_i))) \exp(-\hat{\beta}_m).

Figure 10.1: Schematic AdaBoost

G(x) = \sum_{m=1}^{M} \alpha_m G_m(x)

1. Initialize with weights w_i = 1/N, set m = 1 and fix M.
2. Fit a classifier G_m using the weights w_i.
3. Recompute the weights as
   w_i \leftarrow w_i \exp(\alpha_m 1(y_i \neq G_m(x_i))),
   where \alpha_m = \log((1 - err_m)/err_m) and
   err_m = \frac{\sum_{i=1}^{N} w_i 1(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}.
4. Stop if m = M, or set m = m + 1 and return to 2.
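A compact AdaBoost sketch with weighted decision stumps as the base classifiers; the choice of stump, and thresholding the final sum at zero to get a -1/+1 prediction, are assumptions for the example.

```python
import numpy as np

def weighted_stump(X, y, w):
    """Classifier G in {-1, +1}: the single split minimizing the weighted
    misclassification error sum_i w_i 1(y_i != G(x_i)); y coded -1/+1."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] < s, sign, -sign)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, s, sign))
    j, s, sign = best[1]
    return best[0], lambda Z: np.where(Z[:, j] < s, sign, -sign)

def adaboost(X, y, M=20):
    """Schematic AdaBoost: reweight, refit, and combine sum_m alpha_m G_m(x)."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(M):
        err, G = weighted_stump(X, y, w)
        err = np.clip(err / w.sum(), 1e-12, 1 - 1e-12)
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * (G(X) != y))       # upweight the misclassified points
        ensemble.append((alpha, G))
    return lambda Z: np.sign(sum(a * G(Z) for a, G in ensemble))

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)
G_hat = adaboost(X, y)
print((G_hat(X) == y).mean())   # training accuracy of the combined classifier
```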

Figures 10.2 and 10.3

Boosting using stumps only can outperform even large trees in terms of test error (simulation). Even when the misclassification error is 0 on the training data it can pay to continue boosting, and the exponential loss will continue to decrease.

More Boosting

The computational problem in boosting is the minimization of

\sum_{i=1}^{N} L(y_i, \hat{f}_{m-1}(x_i) + \beta b(x_i, \gamma)).

For classification with exponential loss this simplifies to weighted optimal classification. For regression with squared error loss it is re-estimation based on the residuals.

With the notation L(\mathbf{f}) = \sum_{i=1}^{N} L(y_i, f_i) for \mathbf{f} = (f_1, ..., f_N) \in \mathbb{R}^N we aim at finding steps h_1, ..., h_M and, with h_0 the initial guess, an approximate minimizer of the form

f_M = \sum_{m=0}^{M} h_m.

Gradient Boosting

The gradient of L : \mathbb{R}^N \to \mathbb{R} is

\nabla L(\mathbf{f}) = (\partial_z L(y_1, f_1), ..., \partial_z L(y_N, f_N))^T.

Gradient descent algorithms suggest steps from \mathbf{f}_m in the direction of the negative gradient -\nabla L(\mathbf{f}_m):

h_m = -\rho_m \nabla L(\mathbf{f}_m).

Problem: -\rho_m \nabla L(\mathbf{f}_m) is most likely not obtainable as a prediction within the class of base learners; it is not of the form \beta (b(x_1, \gamma), ..., b(x_N, \gamma))^T.

Solution: Fit a base learner \hat{h}_m to -\nabla L(\mathbf{f}_m) and compute by iteration the expansion

\hat{f}_M = \sum_{m=0}^{M} \rho_m \hat{h}_m.

This is gradient boosting as implemented in the mboost library.
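A minimal sketch of the generic recipe, not of the mboost implementation: the step length \rho_m is taken as a fixed shrinkage value rather than a per-step line search, and scikit-learn's DecisionTreeRegressor stands in for the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in base learner

def gradient_boosting(X, y, loss_grad, M=100, rho=0.1):
    """At step m, fit a base learner h_m to the negative gradient of the loss
    at the current fit, then take the step f_{m+1} = f_m + rho * h_m."""
    f = np.full(len(y), y.mean())                 # h_0: constant initial guess
    learners = []
    for _ in range(M):
        neg_grad = -loss_grad(y, f)
        h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
        learners.append(h)
        f += rho * h.predict(X)
    return lambda Z: y.mean() + rho * sum(h.predict(Z) for h in learners)

# Squared error loss L(y, f) = (y - f)^2 / 2 has gradient f - y,
# so the negative gradient is simply the residual.
squared_error_grad = lambda y, f: f - y

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=200)
f_boost = gradient_boosting(X, y, squared_error_grad)
print(f_boost(X[:3]))
```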