BAGGING PREDICTORS AND RANDOM FOREST


BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGGING PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS 8, 9, 15) / HASTIE, TIBSHIRANI, FRIEDMAN

TABLE OF CONTENTS Bagging predictors - Introduction. The algorithm. Justification. Examples: classification and regression trees, variable selection. Random forest - Decision trees. The algorithm. And more...

BAGGING - INTRODUCTION A method based on bootstrap sampling for generating multiple versions of a predictor and combining them to obtain an improved predictor. The method mainly helps unstable procedures (trees, neural nets). The evidence, both experimental and theoretical, is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of stable procedures.

AGGREGATED PREDICTORS Consider a learning set L drawn from distribution F, and a procedure that forms a predictor θ̂ = φ(x, L) at a new input x. Now imagine we could take K samples of N independent observations each from F. To get a better prediction, we would combine the K resulting predictors into an aggregated predictor (by averaging or voting). Great. So what exactly is the problem?

BAGGING - THE ALGORITHM Usually we have only a single learning set L drawn from distribution F, and a procedure that forms a predictor θ̂ = φ(x, L) at a new input x. We take B bootstrap samples L_b: each consists of N i.i.d. observations drawn at random, with replacement, from L. For each sample b = 1, ..., B we form a predictor φ(x, L_b). Numerical response: φ_B(x) = avg_b φ(x, L_b). Categorical response: φ_B(x) is the majority vote of the φ(x, L_b).
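
A minimal sketch of this procedure in Python, assuming scikit-learn decision trees as the unstable base predictor (the function names bag_trees and bagged_predict are illustrative, not from the papers):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, B=50, seed=0):
    # Fit B regression trees, each on a bootstrap sample of (X, y) (numpy arrays).
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # N draws with replacement from L
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X_new):
    # Numerical response: average the B tree predictions.
    # (For a categorical response one would take the majority vote instead.)
    return np.mean([t.predict(X_new) for t in trees], axis=0)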

BETWEEN BAGGING AND BOOTSTRAP Sampling from F: L = {(x_1, y_1), ..., (x_N, y_N)} is drawn from F, and a single predictor is θ̂ = φ(x, L). For repeated samplings of L from F, the aggregated estimator is φ_A(x) = E_L[φ(x, L)] for a numerical response, or argmax_j P_L(φ(x, L) = j) for a categorical one. Sampling from F̂ (the bootstrap): L_b = {(x_1*, y_1*), ..., (x_N*, y_N*)} is drawn with replacement from L. From B bootstrap samples we get θ̂_b = φ(x, L_b), and the bagged estimator is φ_B(x) = avg_b φ(x, L_b), or the majority vote of the B predictors.

BAGGING JUSTIFICATION - NUMERIC PREDICTION Consider a numeric aggregated predictor based on replications of L from the same distribution F: φ_A(x) = E_L[φ(x, L)]. For fixed (x, y), Jensen's inequality (E[Z²] ≥ (E[Z])²) gives E_L[(y − φ(x, L))²] ≥ (E_L[y − φ(x, L)])² = (y − φ_A(x))². Integrating both sides over the joint distribution of (x, y): E_{x,y,L}[(y − φ(x, L))²] ≥ E_{x,y}[(y − φ_A(x))²], i.e. the MSE of φ(x, L), averaged over samples L, is at least the MSE of φ_A(x). Therefore φ_A is better than φ, a predictor based on a single sample from F.

BAGGING JUSTIFICATION - NUMERIC PREDICTION The improvement comes from the gap between E_{x,y}[E_L(y − φ(x, L))²] and E_{x,y}[(E_L(y − φ(x, L)))²], which is the variance of φ(x, L) over replicates of L (averaged over x). If E_L[φ(x, L)²] ≈ (E_L[φ(x, L)])², i.e. the variance of φ over replicates of L is small, aggregation will not help. The more highly variable the φ(x, L) are over different replicates of L, the more improvement aggregation may produce.
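
A small simulation illustrating the inequality, assuming a sine target with Gaussian noise and unpruned scikit-learn trees as the unstable base procedure (all choices here are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)                  # true regression function
x_test = np.linspace(0, 1, 200)[:, None]
y_test = f(x_test.ravel())                   # fixed targets at the test points

preds = []
for _ in range(200):                         # 200 replicates of L drawn from F
    X = rng.uniform(0, 1, (50, 1))
    y = f(X.ravel()) + rng.normal(0, 0.3, 50)
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test))
preds = np.array(preds)

mse_single = np.mean((preds - y_test) ** 2)                    # avg MSE of phi(x, L) over L
mse_aggregated = np.mean((preds.mean(axis=0) - y_test) ** 2)   # MSE of phi_A(x)
print(mse_single, mse_aggregated)            # the first is always >= the second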

BAGGING JUSTIFICATION - NUMERIC PREDICTION Write φ_A(x) := φ_A(x, F). The bagged predictor is φ_B(x) := φ_A(x, F̂), where F̂ is the empirical distribution of L, so φ_B improves on φ only insofar as φ_A(x, F̂) approximates φ_A(x, F). There is a crossover point between instability and stability at which φ_B stops improving on φ(x, L) and does worse. On the one hand, if the procedure φ is unstable, it can give improvement through aggregation (the step from one sample to many bootstrap samples). On the other hand, if the procedure φ is stable, then φ_B will not be as accurate for data drawn from F as φ_A(x, F) ≈ φ(x, L) (the step from F to F̂).

BAGGING JUSTIFICATION - CLASSIFICATION Q(j|x) = P_L(φ(x, L) = j): the frequency of class j over replicates of L for the predictor φ. P(j|x): the true distribution of the class j given x. The probability that the predictor φ classifies x correctly: P(correct classification | x) = Σ_j Q(j|x) P(j|x). The overall probability of correct classification: r = ∫ [Σ_j Q(j|x) P(j|x)] P_X(dx).

BAGGING JUSTIFICATION - CLASSIFICATION Σ_j Q(j|x) P(j|x) ≤ max_j P(j|x), with equality when Q(j|x) = 1{P(j|x) = max_i P(i|x)}. The theoretical best (Bayes) predictor: φ*(x) = argmax_j P(j|x). The highest attainable correct-classification rate: r* = ∫ max_j P(j|x) P_X(dx).

BAGGING JUSTIFICATION - CLASSIFICATION Call φ order-correct at input x if argmax_j Q(j|x) = argmax_j P(j|x): the class that is most likely given x is also the class that φ(x, L) predicts most often over replicates of L. The aggregated predictor: φ_A(x) = argmax_j Q(j|x), i.e. Q_A(j|x) = 1{argmax_i Q(i|x) = j}. Then P(φ_A classifies x correctly | x) = Σ_j 1{argmax_i Q(i|x) = j} P(j|x). If φ_A is order-correct at x: P(φ_A classifies x correctly | x) = max_j P(j|x).
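
A small worked example (the numbers are illustrative, not from the papers): suppose at some x we have P(1|x) = 0.8, P(2|x) = 0.2, and over replicates of L the predictor votes Q(1|x) = 0.6, Q(2|x) = 0.4. A single predictor is then correct with probability Σ_j Q(j|x) P(j|x) = 0.6·0.8 + 0.4·0.2 = 0.56, while the aggregated predictor always predicts class 1 (φ is order-correct at x) and is correct with probability max_j P(j|x) = 0.8.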

BAGGING JUSTIFICATION - CLASSIFICATION Let C be the set of all x at which φ_A is order-correct. The correct-classification rate of φ_A is r_A = ∫_C max_j P(j|x) P_X(dx) + ∫_{C^c} [Σ_j 1{φ_A(x) = j} P(j|x)] P_X(dx). Reminder: r* = ∫ max_j P(j|x) P_X(dx). If a predictor is good in the sense that it is order-correct for most inputs x, then aggregation can transform it into a nearly optimal predictor.

BAGGING - REGRESSION TREES EXAMPLE Each data set is divided into a test set T and a learning set L, usually 10% and 90% respectively. A regression tree is constructed from L using 10-fold CV, and its squared error on T is e_S(L, T). Then 25 bootstrap samples L_b are drawn from L, giving predictors {φ_1(x, L_1), ..., φ_25(x, L_25)}. For each (x_j, y_j) ∈ T the bagged prediction is ŷ_j = (1/B) Σ_{b=1}^B φ_b(x_j, L_b), and the bagged squared error is e_B(L, T) = (1/|T|) Σ_{j∈T} (y_j − ŷ_j)². The random division of the data is repeated 100 times and the errors are averaged, giving (ē_S, ē_B).
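
A condensed sketch of this experiment in Python, using scikit-learn trees and the simulated Friedman #1 data (one of the regression data sets Breiman used); the pruning-by-CV step is simplified to a default tree:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=200, noise=1.0, random_state=0)
rng = np.random.default_rng(0)
e_s, e_b = [], []
for rep in range(100):                     # 100 random 90%/10% divisions
    X_L, X_T, y_L, y_T = train_test_split(X, y, test_size=0.1, random_state=rep)
    single = DecisionTreeRegressor(random_state=0).fit(X_L, y_L)
    e_s.append(np.mean((y_T - single.predict(X_T)) ** 2))
    preds = np.zeros_like(y_T)
    for b in range(25):                    # 25 bootstrap trees, averaged
        idx = rng.integers(0, len(y_L), len(y_L))
        preds += DecisionTreeRegressor().fit(X_L[idx], y_L[idx]).predict(X_T)
    e_b.append(np.mean((y_T - preds / 25) ** 2))
print(np.mean(e_s), np.mean(e_b))          # estimates of (e_S-bar, e_B-bar); bagging is typically much lower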

BAGGING - REGRESSION TREES EXAMPLE The results (ē_S, ē_B) for each data set: [table of errors, not reproduced in this transcription].

BAGGING - CLASSIFICATION TREES EXAMPLE Each data set is divided into a test set T and a learning set L (10% and 90%). A classification tree is constructed from L using 10-fold CV, and its misclassification rate on T is e_S(L, T). Then 50 bootstrap samples L_b are drawn from L, giving predictors {φ_1(x, L_1), ..., φ_50(x, L_50)}. For each (x_j, y_j) ∈ T, the estimated class is the one having the plurality of votes in {φ_1(x_j, L_1), ..., φ_50(x_j, L_50)}; the bagged misclassification rate on T is e_B(L, T). The random division of the data is repeated 100 times and the rates are averaged, giving (ē_S, ē_B).

BAGGING - CLASSIFICATION TREES EXAMPLE The results (ē_S, ē_B) for each data set: [table of misclassification rates, not reproduced in this transcription].

BAGGING - FORWARD STEPWISE SELECTION Step m: given a predictor φ_m based on the already-selected variables x_(1), ..., x_(m−1), form a regression of y on x_(1), ..., x_(m−1), x_(m) for each candidate x_(m) not yet chosen, and select the candidate that minimizes RSS(m). The output is a sequence of models, one for each m. Subset selection is nearly optimal if there are only a few large nonzero {β_i}.

BAGGING - FORWARD STEPWISE SELECTION For 3, 15 and 27 nonzero {β_i}, run 250 simulations of y = Σ_{i=1}^{P=30} β_i x_i + ε, ε ~ N(0, 1), with L = {(x_1, y_1), ..., (x_60, y_60)}. Run FSS on L to get predictors {φ_1(x), ..., φ_P(x)} with mean squared errors {e_1^S, ..., e_P^S}. Draw 50 bootstrap samples; for each b, FSS gives predictors {φ_1(x, L_b), ..., φ_P(x, L_b)}, and the bagged predictors {φ_1^B(x), ..., φ_P^B(x)} have mean squared errors {e_1^B, ..., e_P^B}. Average over the 250 repetitions to get {ē_m^S} and {ē_m^B}, m = 1, ..., P = 30.
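
A minimal sketch of the greedy selection step in Python (a plain least-squares/RSS implementation written for this summary; the intercept and the bagging loop around it are omitted for brevity):

import numpy as np

def forward_stepwise_order(X, y):
    # Greedily add the variable that most reduces the residual sum of squares.
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = chosen + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen      # chosen[:m] defines the m-variable model phi_m

Bagging this procedure would repeat forward_stepwise_order on each bootstrap sample L_b, fit the m-variable least-squares model on L_b, and average the m-variable predictions over b.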

BAGGING - FORWARD STEPWISE SELECTION [Plots of ē_m^S and ē_m^B against m for 3, 15 and 27 nonzero coefficients, not reproduced in this transcription.] FSS alone does better when only a small number of coefficients are nonzero; the larger the error, the less stable the procedure. Bagging is good for unstable procedures (linear regression using all the coefficients is stable).

BAGGING AND RANDOM FOREST Consider B bootstrap samples drawn i.i.d. from F, and B tree models based on them. Bias(average of the predictions) = Bias(a single prediction), so our only hope is to reduce the variance. Assume σ² is the variance of a single tree and ρ is the pairwise correlation between trees. Then the variance of the average of the B predictions is ρσ² + ((1 − ρ)/B)σ², which tends to ρσ² as B grows. Random forest: a modification of bagging that builds a large collection of decorrelated trees and then averages them. In other words, the idea is to reduce ρ without increasing σ² (or the MSE) too much.
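
A quick numerical check of this formula, simulating B = 25 equicorrelated "tree predictions" as Gaussians with variance σ² and pairwise correlation ρ (purely illustrative numbers):

import numpy as np

rho, sigma, B = 0.6, 2.0, 25
# covariance matrix of B equicorrelated predictions with variance sigma^2
cov = sigma**2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(B), cov, size=200_000)
print(draws.mean(axis=1).var())                    # empirical variance of the average
print(rho * sigma**2 + (1 - rho) * sigma**2 / B)   # rho*sigma^2 + (1 - rho)*sigma^2 / B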

DECISION TREES - CART - INTRO We define a tree by Θ = {(R_m, c_m)}_{m=1}^M. The tree prediction is f̂(x) = Σ_{m=1}^M c_m 1{x ∈ R_m}. Choosing c_m (given R_m): regression - the average of {y_i : x_i ∈ R_m}; classification - the majority vote of {y_i : x_i ∈ R_m}. Choosing R = {R_1, ..., R_M} - a greedy algorithm: at each stage, minimize the selected error over the splitting variable x_j and the split point s of x_j.
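
A sketch of one greedy regression split in Python (squared error as the "selected error"; stopping rules, pruning and the recursion are omitted):

import numpy as np

def best_split(X, y):
    # Return the (variable j, split point s) that minimizes the summed
    # squared error around the two half-region means.
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s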

REGRESSION TREES SIMULATION (ESL) Simulation of L with |L| = 30: x_i ∈ R^p, y_i ∈ {0, 1}, with P(y_i = 1 | x_{i1} ≤ 0.5) = 0.2 and P(y_i = 1 | x_{i1} > 0.5) = 0.8.

RANDOM FOREST - THE ALGORITHM For b = 1 to B: 1. Draw a bootstrap sample L_b = {(x_1*, y_1*), ..., (x_N*, y_N*)} from L. 2. Grow an RF tree on L_b by repeating the following at each terminal node until the minimum node size n_min is reached: select m variables at random from the p variables; pick the best variable and split point among the m; split the node into two daughter nodes. Output: the B trees {T_b(x, Θ_b)}_{b=1}^B, where Θ_b = {(R_m, c_m)}_{m=1}^M. Prediction at a new point x: regression - f̂_rf^B(x) = (1/B) Σ_b T_b(x, Θ_b); classification - Ĉ_rf^B(x) = majority vote of {T_b(x, Θ_b)}_{b=1}^B.
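
The same algorithm is available off the shelf; a sketch with scikit-learn, mapping the slide's B, m and n_min onto its parameters (the data set and numbers are illustrative):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_L, X_T, y_L, y_T = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=500,    # B trees
                           max_features=1/3,    # m, as a fraction of p
                           min_samples_leaf=5,  # n_min
                           random_state=0).fit(X_L, y_L)
print(rf.score(X_T, y_T))    # R^2 of the averaged tree predictions on the test set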

RANDOM FOREST - REGRESSION TREE f̂_rf^B(x) = (1/B) Σ_b T_b(x, Θ_b); as B → ∞, f̂_rf(x) = E_{Θ|L}[T(x, Θ(L))]. Var[f̂_rf^B(x)] = ρ(x)σ²(x) + ((1 − ρ(x))/B)σ²(x); as B → ∞, Var[f̂_rf(x)] = ρ(x)σ²(x). Here ρ(x) = corr(T(x, θ_1(L)), T(x, θ_2(L))), where θ_1(L), θ_2(L) are two RF trees grown on the randomly sampled L; in other words, ρ(x) is the theoretical correlation between trees, induced by repeatedly drawing training samples L from the population. σ²(x) = Var[T(x, θ(L))].

RANDOM FOREST - REGRESSION TREE Var_{θ,L}[T(x, θ(L))] = Var_L[E_{θ|L} T(x, θ(L))] + E_L[Var_{θ|L} T(x, θ(L))]: the total variance of a single RF tree decomposes into Var_L[f̂_rf(x)], the variance of the RF ensemble, plus the within-L variance of a tree prediction, which averaging removes. Hence the RF ensemble is better than one RF tree, and the first term shrinks as m decreases. As for the bias: Bias(x) = μ(x) − E_L[f̂_rf(x)] = μ(x) − E_L E_{θ|L}[T(x, Θ(L))]. Although for different models the shape and rate of the bias curves may differ, the general trend is that as m decreases, the bias increases.

RANDOM FOREST - REGRESSION TREE Usually, the default value for m is p/3 for regression and √p for classification.

OUT OF BAG SAMPLES It turns out that roughly 37% of the examples in L do not appear in a particular bootstrap training set, since the probability that a given observation is never drawn in N draws with replacement is (1 − 1/N)^N ≈ e^{−1} ≈ 0.37. OOB samples: for each bootstrap sample, the OOB set is the set of observations (x_i, y_i) that did not appear in that sample. The OOB samples can be used to form estimates of important quantities - error estimates, variable importance and more (Breiman, 1996b, Out-of-bag estimation).
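
A sketch of how the OOB error estimate is used in practice with scikit-learn (which computes it when oob_score=True):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
# each observation is predicted only by the trees whose bootstrap sample omitted it
rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                           random_state=0).fit(X, y)
print(rf.oob_score_)    # OOB R^2: an almost-free estimate of generalization performance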

VARIABLE IMPORTANCE RF also uses the OOB samples to construct an alternative way of computing the variable importance of features. Gini importance: the mean gain in the Gini impurity criterion produced by splits on x_j, over all trees. OOB permutation VI: when the b-th tree is grown, the OOB samples are passed down the tree and the misclassification rate of the predictions is recorded. Then the values of the m-th variable are randomly permuted in the OOB samples and the rate is computed again. The VI of feature m is the average increase (over all trees) in the permuted misclassification rate compared with the original out-of-bag misclassification rate.
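
Both kinds of importance can be inspected with scikit-learn; a sketch on a synthetic classification data set (note that permutation_importance permutes on a held-out set rather than on each tree's own OOB samples, which is a close but not identical computation):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
X_L, X_T, y_L, y_T = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_L, y_L)

print(rf.feature_importances_)          # Gini (mean impurity-decrease) importance
result = permutation_importance(rf, X_T, y_T, n_repeats=20, random_state=0)
print(result.importances_mean)          # drop in accuracy when each feature is shuffled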

VARIABLE IMPORTANCE (SPAM DATA) [Variable-importance plot for the spam data, not reproduced in this transcription.]