Random Forests


One of the best known classifiers is the random forest. It is very simple and effective, but there is still a large gap between theory and practice. Basically, a random forest is an average of tree estimators. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

1 Partitions and Trees

We begin by reviewing trees. As with nonparametric regression, simple and interpretable classifiers can be derived by partitioning the range of $X$. Let $\Pi_n = \{A_1, \ldots, A_N\}$ be a partition of the range of $X$ and let $A_j$ be the partition element that contains $x$. Then $\hat{h}(x) = 1$ if $\sum_{X_i \in A_j} Y_i \geq \sum_{X_i \in A_j} (1 - Y_i)$ and $\hat{h}(x) = 0$ otherwise. This is nothing other than the plug-in classifier based on the partition regression estimator
$$\hat{m}(x) = \sum_{j=1}^{N} \overline{Y}_j \, I(x \in A_j)$$
where $\overline{Y}_j = n_j^{-1} \sum_{i=1}^{n} Y_i I(X_i \in A_j)$ is the average of the $Y_i$'s in $A_j$ and $n_j = \#\{X_i \in A_j\}$. (We define $\overline{Y}_j$ to be $0$ if $n_j = 0$.) Recall from the results on regression that if $m \in H(1, L)$ (that is, $m$ is Lipschitz) and the binwidth $b$ of a regular partition satisfies $b \asymp n^{-1/(d+2)}$, then
$$\mathbb{E}\,\|\hat{m} - m\|_P^2 \leq \frac{c}{n^{2/(d+2)}}. \qquad (1)$$
We conclude that the corresponding classification risk satisfies $R(\hat{h}) - R(h^*) = O(n^{-1/(d+2)})$.

Regression trees and classification trees (also called decision trees) are partition classifiers where the partition is built recursively. For illustration, suppose there are two covariates, $X_1 =$ age and $X_2 =$ blood pressure. Figure 1 shows a classification tree using these variables. The tree is used in the following way. If a subject has Age $\geq 50$ then we classify him as $Y = 1$. If a subject has Age $< 50$ then we check his blood pressure. If systolic blood pressure is $< 100$ then we classify him as $Y = 1$, otherwise we classify him as $Y = 0$. Figure 2 shows the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that there is only a single covariate $X$. We choose a split point $t$ that divides the real line into two sets $A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$. Let $\overline{Y}_1$ be the mean of the $Y_i$'s in $A_1$ and let $\overline{Y}_2$ be the mean of the $Y_i$'s in $A_2$.
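Before describing how the split point is chosen, here is a minimal sketch of the regular-partition estimator $\hat{m}$ and its plug-in classifier from the start of this section. The grid, binwidth and simulated data below are illustrative assumptions, not part of the notes.

```python
import numpy as np

def partition_estimator(x_grid, X, Y, b):
    """Regular-partition regression estimate m_hat and plug-in classifier h_hat.

    The partition is the regular grid of binwidth b on [0, 1]; each query point gets
    the average of the Y_i's in its bin (0 if the bin is empty, as in the notes)."""
    bins = np.arange(0.0, 1.0 + b, b)              # cell boundaries A_1, ..., A_N
    cell_of = lambda pts: np.clip(np.digitize(pts, bins) - 1, 0, len(bins) - 2)
    cx, cq = cell_of(X), cell_of(x_grid)
    m_hat = np.zeros(len(x_grid))
    for j in np.unique(cq):
        in_cell = cx == j
        if in_cell.any():                          # bar{Y}_j = 0 if n_j = 0
            m_hat[cq == j] = Y[in_cell].mean()
    return m_hat, (m_hat > 0.5).astype(int)        # h_hat(x) = I(m_hat(x) > 1/2)

# Toy binary-regression data with P(Y = 1 | X = x) = m(x)
rng = np.random.default_rng(0)
n, d = 500, 1
X = rng.uniform(size=n)
m = lambda x: 0.2 + 0.6 * (x > 0.5)
Y = rng.binomial(1, m(X))
b = n ** (-1 / (d + 2))                            # binwidth matching the rate in (1)
x_grid = np.linspace(0, 1, 200)
m_hat, h_hat = partition_estimator(x_grid, X, Y, b)
```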

Figure 1: A simple classification tree.

Figure 2: Partition representation of the classification tree.

For continuous $Y$ (regression), the split is chosen to minimize the training error. For binary $Y$ (classification), the split is chosen to minimize a surrogate for classification error. A common choice is the impurity defined by
$$I(t) = \sum_{s=1}^{2} \gamma_s \quad \text{where} \quad \gamma_s = 1 - \left[\overline{Y}_s^2 + (1 - \overline{Y}_s)^2\right]. \qquad (2)$$
This particular measure of impurity is known as the Gini index. If a partition element $A_s$ contains all 0's or all 1's, then $\gamma_s = 0$. Otherwise, $\gamma_s > 0$. We choose the split point $t$ to minimize the impurity. Other indices of impurity besides the Gini index can be used, such as entropy. The reason for using impurity rather than classification error is that impurity is a smooth function and hence is easy to minimize. Now we continue recursively splitting until some stopping criterion is met. For example, we might stop when every partition element has fewer than $n_0$ data points, where $n_0$ is some fixed number. The bottom nodes of the tree are called the leaves. Each leaf has an estimate $\hat{m}(x)$ which is the mean of the $Y_i$'s in that leaf. For classification, we take $\hat{h}(x) = I(\hat{m}(x) > 1/2)$. When there are several covariates, we choose whichever covariate and split that leads to the lowest impurity. The result is a piecewise constant estimator that can be represented as a tree.

2 Example

The following data are from simulated images of gamma ray events for the Major Atmospheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The data are from archive.ics.uci.edu/ml/datasets/magic+gamma+telescope. The telescope studies gamma ray bursts, active galactic nuclei and supernova remnants. The goal is to predict if an event is real or is background (hadronic shower). There are 10 predictors that are numerical summaries of the images. We randomly selected 400 training points (200 positive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of various methods are in Table 1. See Figures 3, 4, 5, 6.
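As a concrete illustration of the greedy, impurity-driven construction just described, and of choosing the tree size by cross-validation as in Figure 6, here is a minimal sketch using scikit-learn. The synthetic data, features and parameter grid are made-up stand-ins for the MAGIC analysis, not the code that produced Table 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 10 numeric predictors, binary label (real event vs. background).
X, y = make_classification(n_samples=1400, n_features=10, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=400,
                                                    stratify=y, random_state=0)

# Grow trees with Gini-impurity splits; pick the tree size (number of leaves)
# by 5-fold cross-validation, in the spirit of Figure 6.
grid = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=0),
                    param_grid={"max_leaf_nodes": [2, 4, 8, 16, 32, 64]},
                    cv=5)
grid.fit(X_train, y_train)
tree = grid.best_estimator_
print("chosen number of leaves:", tree.get_n_leaves())
print("test error:", 1 - tree.score(X_test, y_test))
```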

Method                   Test Error
Logistic regression      0.23
SVM (Gaussian Kernel)    0.2
Kernel Regression        0.24
Additive Model           0.2
Reduced Additive Model   0.2
k-NN                     0.25
Trees                    0.2

Table 1: Various methods on the MAGIC data. The reduced additive model is based on using the three most significant variables from the additive model.

Figure 3: Estimated functions for the additive model.

Figure 4: Test error versus k for the nearest neighbor estimator.

Figure 5: Full tree.

Figure 6: Classification tree. The size of the tree was chosen by cross-validation.

3 Bagging

Trees are useful for their simplicity and interpretability. But the prediction error can be reduced by combining many trees. A common approach, called bagging, is as follows. Suppose we draw $B$ bootstrap samples and each time we construct a classifier. This gives tree classifiers $h_1, \ldots, h_B$. (The same idea applies to regression.) We now classify by combining them:
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \frac{1}{B} \sum_j h_j(x) \geq \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
This is called bagging, which stands for bootstrap aggregation. A variation is sub-bagging, where we use subsamples instead of bootstrap samples.

To get some intuition about why bagging is useful, consider this example from Buhlmann and Yu (2002). Suppose that $x \in \mathbb{R}$ and consider the simple decision rule $\hat{\theta}_n = I(\overline{Y}_n \leq x)$. Let $\mu = \mathbb{E}[Y_i]$ and for simplicity assume that $\mathrm{Var}(Y_i) = 1$. Suppose that $x$ is close to $\mu$ relative to the sample size. We can model this by setting $x \equiv x_n = \mu + c/\sqrt{n}$. Then $\hat{\theta}_n$ converges to $I(Z \leq c)$ where $Z \sim N(0, 1)$. So the limiting mean and variance of $\hat{\theta}_n$ are $\Phi(c)$ and $\Phi(c)(1 - \Phi(c))$. Now the bootstrap distribution of $\overline{Y}^*$ (conditional on $Y_1, \ldots, Y_n$) is approximately $N(\overline{Y}, 1/n)$. That is, $\sqrt{n}(\overline{Y}^* - \overline{Y}) \approx N(0, 1)$. Let $\mathbb{E}^*$ denote the average with respect to the bootstrap randomness. Then, if $\tilde{\theta}_n$ is the bagged estimator, we have
$$\tilde{\theta}_n = \mathbb{E}^*\left[I(\overline{Y}^* \leq x_n)\right] = \mathbb{E}^*\left[I\left(\sqrt{n}(\overline{Y}^* - \overline{Y}) \leq \sqrt{n}(x_n - \overline{Y})\right)\right] = \Phi\left(\sqrt{n}(x_n - \overline{Y})\right) + o(1) = \Phi(c + Z) + o(1)$$
where $Z \sim N(0, 1)$, and we used the fact that $\overline{Y} \approx N(\mu, 1/n)$. To summarize, $\hat{\theta}_n \rightsquigarrow I(Z \leq c)$ while $\tilde{\theta}_n \rightsquigarrow \Phi(c + Z)$, which is a smoothed version of $I(Z \leq c)$.
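A quick Monte Carlo check of this smoothing effect (a sketch under the assumptions above, with Gaussian $Y_i$; the sample sizes and number of replications are arbitrary choices, not from the notes): compare the variance of the plug-in rule $\hat{\theta}_n$ with that of the bagged rule $\tilde{\theta}_n$ at $c = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B, reps, c, mu = 100, 200, 1000, 0.0, 0.0
x_n = mu + c / np.sqrt(n)

theta_hat = np.empty(reps)   # plug-in rule I(Ybar <= x_n)
theta_bag = np.empty(reps)   # bagged rule: average of the plug-in rule over bootstrap samples
for r in range(reps):
    Y = rng.normal(mu, 1.0, size=n)
    theta_hat[r] = float(Y.mean() <= x_n)
    boot_means = rng.choice(Y, size=(B, n), replace=True).mean(axis=1)
    theta_bag[r] = np.mean(boot_means <= x_n)

print("var plug-in:", theta_hat.var(), " var bagged:", theta_bag.var())
```

With these settings the two variances should come out roughly near 1/4 and 1/12, matching the limits described next.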

In other words, bagging is a smoothing operator. In particular, suppose we take $c = 0$. Then $\hat{\theta}_n$ converges to a Bernoulli with mean $1/2$ and variance $1/4$. The bagged estimator converges to $\Phi(Z) \sim \mathrm{Unif}(0, 1)$, which has mean $1/2$ and variance $1/12$. The reduction in variance is due to the smoothing effect of bagging.

4 Random Forests

Finally we get to random forests. These are bagged trees except that we also choose random subsets of features for each tree. The estimator can be written as
$$\hat{m}(x) = \frac{1}{M} \sum_{j=1}^{M} \hat{m}_j(x)$$
where $\hat{m}_j$ is a tree estimator based on a subsample (or bootstrap sample) of size $a$ using $p$ randomly selected features. The trees are usually required to have some number $k$ of observations in the leaves. There are three tuning parameters: $a$, $p$ and $k$. You could also think of $M$ as a tuning parameter, but generally we can think of $M$ as tending to $\infty$. For each tree, we can estimate the prediction error on the unused data. (The tree is built on a subsample.) Averaging these prediction errors gives an estimate called the out-of-bag error estimate.

Unfortunately, it is very difficult to develop theory for random forests since the splitting is done using greedy methods. Much of the theoretical analysis is done using simplified versions of random forests. For example, the centered forest is defined as follows. Suppose the data are on $[0, 1]^d$. Choose a random feature and split at the center of the cell. Repeat until there are $k$ leaves. This defines one tree; now we average $M$ such trees. Breiman (2004) and Biau (2012) proved the following.

Theorem 1 If each feature is selected with probability $1/d$, $k = o(n)$ and $k \to \infty$, then $\mathbb{E}[|\hat{m}(x) - m(x)|^2] \to 0$ as $n \to \infty$.

Under stronger assumptions we can say more:

Theorem 2 Suppose that $m$ is Lipschitz, that $m$ only depends on a subset $S$ of the features, and that the probability of selecting $j \in S$ is $(1/|S|)(1 + o(1))$. Then
$$\mathbb{E}\,|\hat{m}(X) - m(X)|^2 = O\left( \left(\frac{1}{n}\right)^{\frac{3}{4 |S| \log 2 + 3}} \right).$$

This is better than the usual Lipschitz rate $n^{-2/(d+2)}$ when $|S|$ is small relative to $d$ (roughly, $|S| \leq d/2$). But the condition that we select relevant variables with high probability is very strong, and proving that it holds is a research problem.
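These theorems concern stylized forests, but the tuning parameters $a$ (subsample size), $p$ (number of candidate features), $k$ (leaf size) and $M$ (number of trees) map directly onto the knobs of standard software. A minimal sketch with scikit-learn (an assumed, illustrative configuration rather than anything from the notes), including the out-of-bag error estimate:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# M trees, each grown on a random sample of size a, with p candidate features
# (scikit-learn randomizes features per split rather than per tree) and at least
# k observations per leaf.  oob_score=True reports the out-of-bag estimate.
M, a, p, k = 500, 1000, 5, 5
forest = RandomForestRegressor(n_estimators=M,
                               max_samples=a,        # sample size a per tree
                               max_features=p,       # features considered at each split
                               min_samples_leaf=k,   # leaf size k
                               bootstrap=True,
                               oob_score=True,
                               random_state=0)
forest.fit(X, y)
print("out-of-bag R^2:", forest.oob_score_)
```

For regression, scikit-learn reports the out-of-bag estimate as an $R^2$ value rather than a squared-error risk, but it plays the same role as the out-of-bag error estimate described above.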

A significant step forward was made by Scornet, Biau and Vert (2015). Here is their result.

Theorem 3 Suppose that $Y = \sum_j m_j(X(j)) + \epsilon$ where $X \sim \mathrm{Uniform}[0, 1]^d$, $\epsilon \sim N(0, \sigma^2)$ and each $m_j$ is continuous. Assume that the split is chosen using the maximum drop in sums of squares. Let $t_n$ be the number of leaves on each tree and let $a_n$ be the subsample size. If $t_n \to \infty$, $a_n \to \infty$ and $t_n (\log a_n)^9 / a_n \to 0$, then $\mathbb{E}[|\hat{m}(X) - m(X)|^2] \to 0$ as $n \to \infty$.

Again, the theorem has strong assumptions but it does allow a greedy split selection. Scornet, Biau and Vert (2015) provide another interesting result. Suppose that (i) there is a subset $S$ of relevant features, (ii) $p = d$, and (iii) $m_j$ is not constant on any interval for $j \in S$. Then, with high probability, we always split only on relevant variables.

5 Connection to Nearest Neighbors

Lin and Jeon (2006) showed that there is a connection between random forests and $k$-NN methods. We say that $X_i$ is a layered nearest neighbor (LNN) of $x$ if the hyper-rectangle defined by $x$ and $X_i$ contains no data points other than $X_i$. Now note that if a tree is grown until each leaf has one point, then $\hat{m}(x)$ is simply a weighted average of LNNs. More generally, Lin and Jeon (2006) call $X_i$ a $k$-potential nearest neighbor ($k$-PNN) if there are fewer than $k$ samples in the hyper-rectangle defined by $x$ and $X_i$. If we restrict to random forests whose leaves have $k$ points, then it follows easily that $\hat{m}(x)$ is some weighted average of the $k$-PNNs.

Let us now return to LNNs. Let $\mathcal{L}_n(x)$ denote the set of all LNNs of $x$ and let $L_n(x) = |\mathcal{L}_n(x)|$. We could directly define
$$\hat{m}(x) = \frac{1}{L_n(x)} \sum_i Y_i I(X_i \in \mathcal{L}_n(x)).$$
Biau and Devroye (2010) showed that, if $X$ has a continuous density,
$$\frac{(d-1)! \, \mathbb{E}[L_n(x)]}{2^d (\log n)^{d-1}} \to 1.$$
Moreover, if $Y$ is bounded and $m$ is continuous then, for all $p \geq 1$, $\mathbb{E}\,|\hat{m}_n(X) - m(X)|^p \to 0$ as $n \to \infty$.
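The layered nearest neighbor set is easy to compute directly from its definition: $X_i$ is an LNN of $x$ exactly when no other sample lies in the axis-aligned rectangle with corners $x$ and $X_i$. A small sketch (illustrative code, not from the notes; the simulated data are an assumption):

```python
import numpy as np

def layered_nearest_neighbors(x, X):
    """Indices i such that X[i] is a layered nearest neighbor (LNN) of x, i.e. the
    axis-aligned rectangle with corners x and X[i] contains no other sample."""
    lo = np.minimum(x, X)            # rectangle corners for each candidate i
    hi = np.maximum(x, X)
    lnn = []
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False            # X_i itself does not count
        if not inside.any():
            lnn.append(i)
    return np.array(lnn)

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 2))
Y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)
x = np.array([0.5, 0.5])
idx = layered_nearest_neighbors(x, X)
m_hat = Y[idx].mean()                # the LNN average defined above
print(len(idx), "LNNs; estimate:", m_hat)
```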

Unfortunately, the rate of convergence is slow. Suppose that $\mathrm{Var}(Y \mid X = x) = \sigma^2$ is constant. Then
$$\mathbb{E}\,|\hat{m}_n(X) - m(X)|^2 \geq \frac{\sigma^2}{\mathbb{E}[L_n(x)]} \approx \frac{\sigma^2 (d-1)!}{2^d (\log n)^{d-1}}.$$
If we use $k$-PNNs, with $k \to \infty$ and $k = o(n)$, then the results of Lin and Jeon (2006) show that the estimator is consistent and has variance of order $O(1/(k (\log n)^{d-1}))$. As an aside, Biau and Devroye (2010) also show that if we apply bagging to the usual 1-NN rule on subsamples of size $k$ and then average over subsamples, then, as long as $k \to \infty$ and $k = o(n)$, for all $p \geq 1$ and all distributions $P$ we have $\mathbb{E}\,|\hat{m}(X) - m(X)|^p \to 0$. So bagged 1-NN is universally consistent. But at this point, we have wandered quite far from random forests.

6 Connection to Kernel Methods

There is also a connection between random forests and kernel methods (Scornet 2016). Let $A_j(x)$ be the cell containing $x$ in the $j$-th tree. Then we can write the tree estimator as
$$\hat{m}(x) = \frac{1}{M} \sum_j \sum_i \frac{Y_i I(X_i \in A_j(x))}{N_j(x)} = \frac{1}{M} \sum_j \sum_i W_{ij} Y_i$$
where $N_j(x)$ is the number of data points in $A_j(x)$ and $W_{ij} = I(X_i \in A_j(x))/N_j(x)$. This suggests that a cell $A_j$ with low density (and hence small $N_j(x)$) has a high weight. Based on this observation, Scornet (2016) defined the kernel based random forest (KeRF) by
$$\hat{m}(x) = \frac{\sum_j \sum_i Y_i I(X_i \in A_j(x))}{\sum_j N_j(x)}.$$
With this modification, $\hat{m}(x)$ is the average of each $Y_i$ weighted by how often it appears in the trees. The KeRF can be written as
$$\hat{m}(x) = \frac{\sum_i Y_i K_n(x, X_i)}{\sum_s K_n(x, X_s)} \quad \text{where} \quad K_n(x, z) = \frac{1}{M} \sum_j I(z \in A_j(x)).$$
The trees are random, so let us write the $j$-th tree as $T_j = T(\Theta_j)$ for some random quantity $\Theta_j$; the forest is built from $T(\Theta_1), \ldots, T(\Theta_M)$, and we can write $A_j(x)$ as $A(x, \Theta_j)$. Then $K_n(x, z)$ converges almost surely (as $M \to \infty$) to $\kappa_n(x, z) = P_\Theta(z \in A(x, \Theta))$, which is just the probability that $x$ and $z$ are connected, in the sense that they are in the same cell.
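The forest kernel $K_n$ is easy to extract from a fitted forest: most implementations expose the leaf index of each tree for any query point, so $K_n(x, z)$ is just the fraction of trees in which $z$ lands in the same leaf as $x$. A minimal sketch with scikit-learn's `apply` method (an illustrative computation, not Scornet's code; the data and settings are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=5, noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                               random_state=0).fit(X, y)

x = X[:1]                                  # a query point (as a 1 x d array)
leaf_x = forest.apply(x)[0]                # leaf index of x in each of the M trees
leaf_X = forest.apply(X)                   # leaf indices of all training points

# K_n(x, X_i) = fraction of trees in which X_i falls in the same cell as x
K = (leaf_X == leaf_x).mean(axis=1)

# KeRF estimate: each Y_i weighted by how often it shares a cell with x
kerf = np.sum(y * K) / np.sum(K)
print("KeRF estimate:", kerf, " usual forest prediction:", forest.predict(x)[0])
```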

Under some assumptions, Scornet (2016) showed that KeRFs and forests are close to each other, thus providing a kernel interpretation of forests. Recall the centered forest we discussed earlier. This is a stylized forest, quite different from the forests used in practice, but it provides a nice way to study their properties. In the case of KeRFs, Scornet (2016) shows that if $m$ is Lipschitz and $X \sim \mathrm{Unif}([0, 1]^d)$ then
$$\mathbb{E}[(\hat{m}(x) - m(x))^2] \leq C (\log n)^2 \left(\frac{1}{n}\right)^{\frac{1}{3 + d \log 2}}.$$
This is slower than the minimax rate $n^{-2/(d+2)}$, but this probably reflects the difficulty of analyzing forests.

7 Variable Importance

Let $\hat{m}$ be a random forest estimator. How important is feature $X(j)$?

LOCO. One way to answer this question is to fit the forest with all the data and then fit it again without using $X(j)$. Since we randomly select features for each tree when constructing a forest, this second forest can be obtained by simply averaging the trees where feature $j$ was not selected. Call this $\hat{m}^{(-j)}$. Let $H$ be a hold-out sample of size $m$. Then let
$$\hat{\Delta}_j = \frac{1}{m} \sum_{i \in H} W_i \quad \text{where} \quad W_i = (Y_i - \hat{m}^{(-j)}(X_i))^2 - (Y_i - \hat{m}(X_i))^2.$$
Then $\hat{\Delta}_j$ is a consistent estimate of the prediction risk inflation that occurs by not having access to $X(j)$. Formally, if $\mathcal{T}$ denotes the training data, then
$$\mathbb{E}[\hat{\Delta}_j \mid \mathcal{T}] = \mathbb{E}\left[ (Y - \hat{m}^{(-j)}(X))^2 - (Y - \hat{m}(X))^2 \,\Big|\, \mathcal{T} \right] \equiv \Delta_j.$$
In fact, since $\hat{\Delta}_j$ is simply an average, we can easily construct a confidence interval. This approach is called LOCO (Leave-Out-COvariates). Of course, it is easily extended to sets of features. The method is explored in Lei, G'Sell, Rinaldo, Tibshirani and Wasserman (2017) and Rinaldo, Tibshirani and Wasserman (2015).
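A direct, if computationally crude, way to get a LOCO-style importance with standard software is to refit the forest without column $j$ and compare hold-out squared errors, together with a normal-approximation confidence interval for $\Delta_j$. A minimal sketch (illustrative; the notes' construction averages the trees that never selected $X(j)$ rather than refitting, and the data here are simulated):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, n_informative=4,
                       noise=1.0, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)

full = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
err_full = (y_ho - full.predict(X_ho)) ** 2

j = 3                                              # feature whose importance we want
keep = np.delete(np.arange(X.shape[1]), j)
reduced = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr[:, keep], y_tr)
err_reduced = (y_ho - reduced.predict(X_ho[:, keep])) ** 2

W = err_reduced - err_full                         # W_i as defined above
delta_hat = W.mean()
se = W.std(ddof=1) / np.sqrt(len(W))
print(f"LOCO importance of X({j}): {delta_hat:.3f} +/- {1.96 * se:.3f}")
```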

Permutation importance. A different approach is to permute the values of $X(j)$ for the out-of-bag observations, for each tree. Let $\mathcal{O}_t$ be the out-of-bag observations for tree $t$ and let $\mathcal{O}_t^*$ be the out-of-bag observations for tree $t$ with $X(j)$ permuted. Define
$$\hat{\Gamma}_j = \frac{1}{M} \sum_{t=1}^{M} \left[ \frac{1}{m_t} \sum_{i \in \mathcal{O}_t^*} (Y_i - \hat{m}_t(X_i))^2 - \frac{1}{m_t} \sum_{i \in \mathcal{O}_t} (Y_i - \hat{m}_t(X_i))^2 \right]$$
where $m_t = |\mathcal{O}_t|$. This avoids using a hold-out sample. It is estimating
$$\Gamma_j = \mathbb{E}[(Y - \hat{m}(X_j^*))^2] - \mathbb{E}[(Y - \hat{m}(X))^2]$$
where $X_j^*$ has the same distribution as $X$ except that $X_j^*(j)$ is an independent draw from the distribution of $X(j)$. This is a lot like LOCO, but its meaning is less clear. Note that $\hat{m}_t$ is not changed when $X(j)$ is permuted. Gregorutti, Michel and Saint Pierre (2013) show that if $(X, \epsilon)$ is Gaussian, $\mathrm{Var}(X) = (1 - c) I + c \mathbf{1}\mathbf{1}^T$ and $\mathrm{Cov}(Y, X(j)) = \tau$ for all $j$, then
$$\Gamma_j = 2 \left( \frac{\tau}{1 - c + dc} \right)^2.$$
It is not clear how this connects to the actual importance of $X(j)$. In the case where $Y = \sum_j m_j(X(j)) + \epsilon$ with $\mathbb{E}[\epsilon \mid X] = 0$ and $\mathbb{E}[\epsilon^2 \mid X] < \infty$, they show that $\Gamma_j = 2\,\mathrm{Var}(m_j(X(j)))$.

8 Inference

Using the theory of infinite order U-statistics, Mentch and Hooker (2015) showed that $\sqrt{n}(\hat{m}(x) - \mathbb{E}[\hat{m}(x)])/\sigma$ converges to a $N(0, 1)$, and they show how to estimate $\sigma$. Wager and Athey (2017) show asymptotic normality if we use sample splitting: part of the data are used to build the tree and part is used to estimate the average in the leaves of the tree. Under a number of technical conditions, including the requirement that we use subsamples of size $s = n^\beta$ with $\beta < 1$, they show that
$$\frac{\hat{m}(x) - m(x)}{\sigma_n(x)} \rightsquigarrow N(0, 1)$$
and they show how to estimate $\sigma_n(x)$. Specifically,
$$\hat{\sigma}_n^2(x) = \frac{n-1}{n} \left( \frac{n}{n-s} \right)^2 \sum_{i=1}^{n} \mathrm{Cov}\big( \hat{m}_j(x), N_{ij} \big)^2$$
where the covariance is taken with respect to the trees in the forest and $N_{ij} = 1$ if $(X_i, Y_i)$ was in the $j$-th subsample and $0$ otherwise.
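The variance estimate above only requires, for each training point $i$, the covariance across trees between the tree predictions at $x$ and the subsample-membership indicators $N_{ij}$. A from-scratch sketch for a plain subsampled forest (an illustrative application of the formula, not Wager and Athey's code, and omitting the sample-splitting "honesty" step):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1000, n_features=5, noise=1.0, random_state=0)
n = len(y)
s = int(n ** 0.7)                     # subsample size s = n^beta with beta < 1
B = 2000                              # number of trees
x0 = X[:1]                            # query point

preds = np.empty(B)                   # m_hat_j(x) for each tree j
N = np.zeros((n, B))                  # N_ij = 1 if observation i is in subsample j
for b in range(B):
    idx = rng.choice(n, size=s, replace=False)
    N[idx, b] = 1.0
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx])
    preds[b] = tree.predict(x0)[0]

m_hat = preds.mean()                  # forest prediction at x
# covariance over trees between the predictions and membership of observation i
cov_i = ((N - N.mean(axis=1, keepdims=True)) * (preds - m_hat)).mean(axis=1)
sigma2 = (n - 1) / n * (n / (n - s)) ** 2 * np.sum(cov_i ** 2)
print(f"prediction at x: {m_hat:.3f}, approx. 95% CI half-width: {1.96 * np.sqrt(sigma2):.3f}")
```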

9 Summary

Random forests are considered one of the best all-purpose classifiers. But it is still a mystery why they work so well. The situation is very similar to deep learning. We have seen that there are now many interesting theoretical results about forests. But the results make strong assumptions that create a gap between practice and theory. Furthermore, there is no theory to say why forests outperform other methods. The gap between theory and practice is due to the fact that forests, as actually used in practice, are complex functions of the data.

References

Biau, G., Devroye, L. and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research.

Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227.

Biau, G. (2012). Analysis of a random forests model. arXiv:1005.0208.

Buhlmann, P. and Yu, B. (2002). Analyzing bagging. Annals of Statistics, 927-961.

Gregorutti, B., Michel, B. and Saint Pierre, P. (2013). Correlation and variable importance in random forests. arXiv:1310.5726.

Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R.J. and Wasserman, L. (2017). Distribution-free predictive inference for regression. Journal of the American Statistical Association.

Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101, p. 578.

Mentch, L. and Hooker, G. (2015). Ensemble trees and CLTs: Statistical inference for supervised learning. Journal of Machine Learning Research.

Rinaldo, A., Tibshirani, R. and Wasserman, L. (2015). Uniform asymptotic inference and the bootstrap after model selection. arXiv:1506.06266.

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.

Wager, S. (2014). Asymptotic theory for random forests. arXiv:1405.0352.

Wager, S. (2015). Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388.

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.