Ensemble Methods and Random Forests

Vaishnavi S

May 2017

1 Introduction

We have seen various analyses for classification and regression in the course. One of the common ways to reduce the generalization error is to use ensembles; we saw an example of this in the AdaBoost algorithm in one of the assignments. A way of understanding how classifiers and regressors work is to look at the bias-variance decomposition. After visiting that, we can view ensemble methods as trying to reduce the bias, the variance, or both. A particular ensemble method, the random forest, is then introduced, and some theoretical results on its generalization error and variance are presented in the final sections.

2 Bias Variance Decomposition

The bias-variance decomposition is a way of decomposing the generalization error of an algorithm. It is more natural in the context of regression than of classification; however, such decompositions have also been developed and used for classifiers, under certain assumptions. In the context of regression, the decomposition can be used to understand whether the set of regressors being used is overkill, or insufficient, for the task at hand.

2.1 Bias Variance Decomposition - Regression

In the regression problem, given a model \hat{f}_n, the expected generalization error at a point X = x is

Err(\hat{f}_n(x)) = E_{Y|X=x}(y - \hat{f}_n(x))^2
                  = E_{Y|X=x}(y - f_B(x) + f_B(x) - \hat{f}_n(x))^2
                  = E_{Y|X=x}(y - f_B(x))^2 + E_{Y|X=x}(f_B(x) - \hat{f}_n(x))^2 + 2 E_{Y|X=x}[(y - f_B(x))(f_B(x) - \hat{f}_n(x))]
                  = E_{Y|X=x}(y - f_B(x))^2 + E_{Y|X=x}(f_B(x) - \hat{f}_n(x))^2
                  = Err(f_B(x)) + (f_B(x) - \hat{f}_n(x))^2,

where f_B(x) is the Bayes model in regression (the MMSE estimator); the cross term vanishes because E_{Y|X=x}(y - f_B(x)) = 0. Here, the first term is the irreducible error of any regressor, and the second term represents how different \hat{f}_n is from the Bayes model. Further, taking expectation with respect to the distribution of training samples, the second term becomes

E(f_B(x) - \hat{f}_n(x))^2 = E(f_B(x) - E\hat{f}_n(x) + E\hat{f}_n(x) - \hat{f}_n(x))^2
                           = E(f_B(x) - E\hat{f}_n(x))^2 + E(E\hat{f}_n(x) - \hat{f}_n(x))^2
                           = (f_B(x) - E\hat{f}_n(x))^2 + E(E\hat{f}_n(x) - \hat{f}_n(x))^2,

since E(E\hat{f}_n(x) - \hat{f}_n(x)) = 0. Through this expression, we can understand the generalization error better. The first term measures the discrepancy between the average prediction and the Bayes prediction at the fixed point x, that is, how good the expected predictor is at x; this is termed the bias of the predictor. The second term accounts for the variability of the predictions at X = x over all models, and it will be low if the variance of the estimators is low.

Figure 1: Irreducible error (noise), bias and variance at X = x, Ref [1]

Combining the above expressions, we have

E(Err(\hat{f}_n(x))) = Err(f_B(x)) + (f_B(x) - E\hat{f}_n(x))^2 + E(E\hat{f}_n(x) - \hat{f}_n(x))^2.

That is, the expected generalization error decomposes into three terms: (1) the irreducible error, (2) the bias between the predictor and the Bayes optimal predictor, and (3) the variance of the estimators at the point. This is also depicted in Figure 1 above. Therefore, to reduce the generalization error we can (1) reduce the bias and (2) reduce the variance as defined above. It should be noted that in this framework X = x is considered fixed, and the variation with respect to the noise is what is being recorded.

2.2 Understanding the fitting ability of regressors

The bias-variance decomposition can be used in practice to figure out whether a set of regressors is overfitting or underfitting the desired function. To gain insight, consider the case of predicting a cosine function with polynomials of varying degree. Figure 2 below shows the bias and variance for degrees 1, 5 and 15.

Figure 2: Bias-variance decomposition of the expected generalization error for polynomials of degree 1, 5 and 15 fitting a cosine function, Ref [1]

In the figure, the top row depicts the regressors. The blue curve is the Bayes optimal regressor, the light red curves are the regressors fitted on different training sets, and the dark red curve is the average of all those regressors. From the top row alone we can see that a degree-1 polynomial is not flexible enough to capture the cosine function, while a degree-15 polynomial picks up the tiny variations due to noise. The bottom row depicts the bias-variance decomposition. The degree-1 polynomial has high bias and low variance, the degree-15 polynomial has high variance and somewhat low bias, while the degree-5 polynomial has both low variance and low bias. Therefore, the degree-5 case is the desired one: degree 1 is a case of underfitting, and degree 15 of overfitting.

2.3 Bias Variance Decomposition - Classification

The bias-variance decomposition can also be applied in the context of classifiers, under some conditions. For example, for the 0-1 loss and binary classification (when some averaging process is involved), the expected generalization error at X = x is

E(Err(\hat{f}_n(x))) = P(f_B(x) \neq y) + Q\!\left( \frac{0.5 - E(\hat{p}_L(Y = f_B(x)))}{\sqrt{V(\hat{p}_L(Y = f_B(x)))}} \right) \big(2 P(f_B(x) = y) - 1\big),

where Q is the cumulative distribution function of the standard normal distribution and \hat{p}_L denotes the conditional class probability estimate. For more details, please refer to [1]. From the above expression, if the expected probability estimate E(\hat{p}_L(Y = f_B(x))) for the true class is greater than 0.5, a reduction in variance results in a reduced total error.
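
To make the decomposition concrete, here is a minimal Monte Carlo sketch of the polynomial-versus-cosine experiment of Section 2.2, assuming NumPy; the target function, noise level, query point and number of repetitions are illustrative choices, not taken from the report.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_bayes(x):
    # Target regression function standing in for the Bayes model f_B.
    return np.cos(1.5 * np.pi * x)

def fit_polynomial(degree, n_train=30, noise_std=0.1):
    # Draw one noisy training set and fit a polynomial regressor to it.
    x = rng.uniform(0.0, 1.0, n_train)
    y = f_bayes(x) + noise_std * rng.standard_normal(n_train)
    return np.polynomial.Polynomial.fit(x, y, degree)

x0 = 0.3  # the fixed query point X = x
for degree in (1, 5, 15):
    # Refit on 200 independent training sets and record the predictions at x0.
    preds = np.array([fit_polynomial(degree)(x0) for _ in range(200)])
    bias_sq = (f_bayes(x0) - preds.mean()) ** 2   # (f_B(x) - E f_n(x))^2
    variance = preds.var()                        # E(f_n(x) - E f_n(x))^2
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Run over many independent training sets, the degree-1 fit should show a large squared bias and the degree-15 fit a large variance, in line with the qualitative picture of Figure 2.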

3 Ensemble Methods

With the insights obtained from the bias-variance decomposition, we have an idea of how to reduce the generalization error: (1) reduce the bias and/or (2) reduce the variance. The idea behind ensemble methods is to achieve this by

- training multiple learners to solve the same problem,
- introducing random perturbations into the learning procedure to produce several different models from the same training set, and
- combining the predictions of these models to form the prediction of the ensemble.

3.1 Examples of Ensemble Methods

Popular ensemble methods include

- Boosting: combining multiple weak learners, boosting them based on their cumulative performance (e.g. AdaBoost, which adaptively boosts learners). These methods aim to reduce the bias and thus achieve a reduced generalization error.
- Bagging/Bootstrap Aggregating: combining multiple learners with the aim of reducing the variance while leaving the bias unaffected. These methods provide training data to the learners by picking uniformly at random, with replacement, from the set of all training samples (e.g. random forests, introduced by Breiman).

3.2 Why do ensembles work?

Before introducing random forests and decision trees, we look at possible explanations of why ensembles work. As presented by Thomas Dietterich in [2], there are three possible reasons (Figure 3):

- Statistical: the learning algorithm may find many hypotheses that all give the same accuracy on the training data. Combining these hypotheses might do better.
- Computational: many algorithms can get stuck at local optima. Starting from many different points makes it unlikely that all learners get stuck at the same local optimum, so multiple learners could help.
- Representational: the true function may not be in the chosen hypothesis space. Weighted sums expand the space of representable functions, increasing the chance of including the true function in the search space.

Figure 3: Why ensembles work
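
As a toy numerical illustration of the statistical argument (not taken from the report; the per-learner accuracy of 0.6 and the vote counts are arbitrary), the sketch below simulates learners whose errors are independent and shows that a majority vote over more of them becomes increasingly accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.6      # accuracy of each individual learner (arbitrary choice)
n_trials = 100_000   # number of simulated test points

for n_learners in (1, 5, 25, 101):
    # Each learner is correct independently with probability p_correct;
    # the ensemble predicts by majority vote over the learners.
    correct_votes = rng.random((n_trials, n_learners)) < p_correct
    majority_correct = correct_votes.sum(axis=1) > n_learners / 2
    print(f"{n_learners:3d} learners: majority-vote accuracy = {majority_correct.mean():.3f}")
```

The gain relies on the learners' errors being close to independent, which is exactly what the random perturbations described above try to encourage.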

4 Random Forests

So far we have seen the intuition behind ensemble methods. We now present the constituent of random forests (the decision tree), how random forests are constructed, and some theoretical results.

4.1 Decision Trees

Decision trees are tree-based classifiers/regressors. They arise naturally in the context of classification. To build one:

- Start at a root node corresponding to all samples.
- Split the current node on the variable that maximizes a split criterion.
- Split all such nodes at the current level.
- Repeat at the children nodes.
- Continue until a stopping criterion is met.

Commonly used split criteria include information gain and Gini impurity. It should be noted that decision trees have a high chance of overfitting: they have high variance and low bias. Hence they are good candidates for ensembles that aim to reduce the variance while affecting the bias only slightly.

Figure 4: A decision tree

5 Random Forests

Random forests are learners formed from a collection of decision trees. Breiman introduced bagging in 1996 and random forests in 2001, and random forests have since become a popular learning technique. They are particularly useful because, in addition to their good performance in practice, one can inspect the variables and variable values corresponding to the nodes and therefore make sense of what the model is doing. This is unlike the case of neural networks, where the learner is very often treated as a magical black box.

5.1 Bagging

The idea of bagging is to generate training sets for the individual learners of the ensemble. As an example, consider a random forest to be a collection of, say, M decision trees. Bagging generates a training set for each decision tree as follows:

- Choose N cases by drawing uniformly at random with replacement from the training set of samples (Z_1, ..., Z_N).
- Repeat for each tree T_i, i = 1, ..., M.

Further constraints are imposed on the constituent decision trees, namely to split only along a subset of all possible variables; random forests therefore also fall under the category of subspace sampling methods. After the trees are constructed, the output of the random forest is obtained by aggregating the results of all decision trees. For example, the random forest regressor averages the outputs of all decision trees,

f_RF(x) = \frac{1}{M} \sum_{m=1}^{M} f(x, \Theta_m),

where \Theta_m parametrizes the m-th tree.
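
A minimal sketch of this construction, assuming scikit-learn and NumPy (the toy data, the number of trees and the "sqrt" feature budget are illustrative choices): bootstrap sampling supplies each tree's training set, max_features restricts each split to a random subset of the variables, and the predictions are averaged as in the formula above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=100, max_features="sqrt"):
    # Bagging: each tree is trained on N cases drawn uniformly at random
    # with replacement from (Z_1, ..., Z_N); max_features restricts every
    # split to a random subset of the variables (subspace sampling).
    trees = []
    n = len(X)
    for m in range(n_trees):
        idx = rng.integers(0, n, size=n)   # bootstrap sample for tree T_m
        tree = DecisionTreeRegressor(max_features=max_features, random_state=m)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # f_RF(x) = (1/M) * sum_m f(x, Theta_m): average the trees' outputs.
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Toy data: the response depends only on the first two of five features.
X = rng.standard_normal((300, 5))
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(300)
forest = fit_forest(X, y)
print(predict_forest(forest, X[:3]))
```

In practice one would use sklearn.ensemble.RandomForestRegressor directly, which packages exactly this bootstrap-plus-feature-subsampling recipe.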

5.2 An upper bound on generalization error

In the case of regression, an upper bound on the generalization error can be obtained in terms of the individual trees and the amount of correlation between them. While this bound is loose, it gives an idea of how to construct random forests with low generalization error. For a regression problem, the mean-squared generalization error of a predictor f(x) is E_Z(y - f(x))^2. Thus, for random forests,

E_Z(y - f_RF(x))^2 = E_Z\Big(y - \frac{1}{M}\sum_{m=1}^{M} f(x, \Theta_m)\Big)^2 \;\xrightarrow[M \to \infty]{a.s.}\; E_Z\big(y - E_\Theta f(x, \Theta)\big)^2.

The last term in the above expression is defined to be the optimal generalization error of the random forest, L*_RF = E_Z(y - E_\Theta f(x, \Theta))^2. The optimal generalization error of a tree is L*_tree = E_\Theta E_Z(y - f(x, \Theta))^2. The following theorem relates these two quantities.

Theorem 5.1 ([3] Breiman, 2001). Let L*_RF = E_Z(y - E_\Theta f(x, \Theta))^2 and L*_tree = E_\Theta E_Z(y - f(x, \Theta))^2. Assume that for all \Theta, E_Z(y - f(x, \Theta)) = 0. Then

L*_RF \le \bar{\rho} \, L*_tree,

where \bar{\rho} is the weighted correlation between residuals defined below, and \Theta, \Theta' are independent.

Proof:

L*_RF = E_Z(y - E_\Theta f(x, \Theta))^2
      = E_Z(E_\Theta(y - f(x, \Theta)))^2
      = E_\Theta E_{\Theta'} E_Z\big[(y - f(x, \Theta))(y - f(x, \Theta'))\big]
  (a) = E_\Theta E_{\Theta'}\big[\rho(\Theta, \Theta') \, sd(\Theta) \, sd(\Theta')\big]
  (b) = \bar{\rho} \, (E_\Theta sd(\Theta))^2
  (c) \le \bar{\rho} \, E_\Theta E_Z(y - f(x, \Theta))^2 = \bar{\rho} \, L*_tree,

where

\rho(\Theta, \Theta') = \frac{E_Z\big[(y - f(x, \Theta))(y - f(x, \Theta'))\big]}{sd(\Theta) \, sd(\Theta')}, \qquad sd(\Theta) = \sqrt{E_Z(y - f(x, \Theta))^2}, \qquad \bar{\rho} = \frac{E_\Theta E_{\Theta'}\big[\rho(\Theta, \Theta') \, sd(\Theta) \, sd(\Theta')\big]}{(E_\Theta sd(\Theta))^2}.

The correlation coefficient \rho(\Theta, \Theta') is the correlation between the residuals y - f(x, \Theta) and y - f(x, \Theta'), and the weighted correlation coefficient \bar{\rho} weights \rho(\Theta, \Theta') by the standard deviations. Step (a) follows by introducing the correlation coefficient, (b) by introducing the weighted correlation coefficient, and (c) from the non-negativity of variance, i.e. (E_\Theta sd(\Theta))^2 \le E_\Theta sd(\Theta)^2.

The theorem suggests that as long as the correlation between the errors of two different trees is low, it is alright for the individual decision trees to overfit. Therefore, the randomization in the construction of the random forest should ensure a low correlation.

5.3 Variance of random forests estimates (Regression)

Does the variance of the random forest estimate actually go to zero, as desired? The following theorem is presented as part of a proof that the generalization error goes to zero. In particular, the authors employ the bias-variance decomposition and show that both terms go to zero. However, the rate of decay is poor, and the result is presented in a very restricted context.

Theorem 5.2 (Proposition 2, [4] Gérard Biau, 2012). Assume that X is uniformly distributed on [0, 1]^d, sampling is without replacement, and some other constraints hold. Then

E(E\hat{f}_n(x) - \hat{f}_n(x))^2 = O\!\left( \frac{k_n}{n \, (\log k_n)^{S/2d}} \right),

where k_n is the number of leaves and S is the sparsity of the function to be estimated.

In the extreme case where k_n = n and S = d, the variance term is O(1/\log n). While this indicates a very slow rate of convergence, there is often sparsity in the representation of the function to be estimated, so that S < d/2, which yields better rates. According to the authors, the proof reveals that the log term is a by-product of the \Theta-averaging process, which appears when taking into consideration the correlation between trees.

6 Simulations

As part of learning about random forests, some simulations are presented here.

6.1 Bias-Variance Decomposition

Figure 5 depicts a simulation of regression using trees (left) and bagged trees (right). There are significantly fewer peaks in the red and green curves, corresponding to the error and the variance respectively, in the case of bagged trees.

Figure 5: Bias-variance decomposition of trees without and with bagging

6.2 Classifiers on the Iris dataset

We have worked with the Iris dataset in several of our Python problems. Here, the classification of the Iris dataset based on a subset of the features is presented for the decision tree, random forest, and AdaBoost algorithms.
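
A sketch of how such a comparison could be set up with scikit-learn; the feature subset, estimator counts and cross-validation protocol are assumptions, since the report does not state the exact settings behind Figure 6.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Use only the first two features (sepal length and width), mirroring the
# two-feature decision surfaces compared in Figure 6.
X, y = load_iris(return_X_y=True)
X = X[:, :2]

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost":      AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name:14s} mean accuracy: {scores.mean():.3f}")
```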

Figure 6: Comparing Decision Trees, Random Forests and AdaBoost

It is observed that decision trees have sharp decision boundaries, often overfitting in multiple places. Random forests have a smoother transition between two classes. AdaBoost appears to perform almost as well as random forests on this dataset.

6.3 Out of bag estimate of error

In the case of online learning algorithms we saw that the error on a new point gives an idea of the generalization error of the learner. The same is true for random forests: each tree can be tested on the samples that bagging did not allot to it, which yields the out-of-bag (OOB) estimate of the error. In Figure 7, the out-of-bag error estimates are compared for random forests constructed with different numbers of features/variables to split along during each split of the tree, as a function of the number of estimators used. Looking at such a plot helps in picking the number of trees to include in the random forest. It is interesting to note that the OOB error when max-features is none, that is, when all features are used, is higher than in the other two scenarios.
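
A sketch in the spirit of Figure 7, assuming scikit-learn; the estimator grid and random seed are illustrative choices. The oob_score_ attribute reports OOB accuracy, so the OOB error is one minus it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Track the out-of-bag error for different per-split feature budgets
# (max_features) as the number of trees in the forest grows.
for max_features in ("sqrt", "log2", None):
    oob_errors = []
    for n_estimators in range(15, 151, 15):
        forest = RandomForestClassifier(
            n_estimators=n_estimators,
            max_features=max_features,
            oob_score=True,   # score each sample only with trees that did not see it
            bootstrap=True,
            random_state=0,
        )
        forest.fit(X, y)
        oob_errors.append(round(1.0 - forest.oob_score_, 3))  # OOB error = 1 - OOB accuracy
    print(f"max_features={max_features}: {oob_errors}")
```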

Figure 7: Out of Bag error estimate

7 Conclusions

In this project, one of the popular learning algorithms, the random forest, is studied. Generalization error bounds for certain cases are presented. While the framework is similar to the one seen in class, the proof methodology is somewhat different: the proofs rely on the bias-variance decomposition presented here, as well as on existing results on decision trees and ensemble methods in the literature. The claim that the random forest should reduce the variance of the learner is also presented; this turns out to hold in a certain regression setting, while in other contexts the performance of random forests has not been sufficiently established in theory. Some simulations are presented.

References

[1] Gilles Louppe. Understanding random forests: From theory to practice, chapters 3-4. arXiv preprint arXiv:1407.7502, 2014.

[2] Thomas G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1-15. Springer, 2000.

[3] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[4] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(Apr):1063-1095, 2012.