BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGGING PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS 8,9,15) / HASTIE, TIBSHIRANI, FRIEDMAN
TABLE OF CONTENTS Bagging predictors - Introduction. The algorithm. Justification. Examples: classification and regression trees, variable selection. Random forest - Decision trees. The algorithm. And more...
BAGGING - INTRODUCTION A method based on Bootstrap sampling for generating multiple versions of a predictor, and using them in order to get an improved predictor. Usually the algorithm works for unstable procedures (trees, neural nets). The evidence, both experimental and theoretical, is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of stable procedures.
AGGREGATED PREDICTORS Consider a learning set L drawn from distribution F, and a procedure that forms a predictor φ(x, L) for an unknown x. Now, imagine we can take K samples of N independent observations from F. In order to get a better prediction, we calculate an aggregated predictor φ_A(x) = E_L[φ(x, L)]. Great. So what exactly is the problem?
BAGGING - THE ALGORITHM Usually, we have a single learning set L drawn from distribution F, and a procedure that forms a predictor φ(x, L) for an unknown x. We'll take B Bootstrap samples from the empirical distribution F̂: N i.i.d. observations drawn at random with replacement from L. For each sample b = 1,…,B, we'll form a predictor φ(x, L^(b)). Numerical value: φ_B(x) = avg_b φ(x, L^(b)). Categorical value: the majority vote of the φ(x, L^(b)).
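The algorithm above can be sketched in a few lines of Python. The 1-NN base learner, the toy data set, and B = 25 are illustrative assumptions, not values from the paper:

```python
import random

random.seed(0)

# Toy learning set L of (x, class) pairs; 1-NN is the (unstable) base procedure.
L = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def one_nn(x, data):
    """Predict the class of the nearest training point."""
    return min(data, key=lambda pt: abs(pt[0] - x))[1]

def bagged_predict(x, L, B=25):
    """Majority vote over B predictors, each fit to a bootstrap sample of L."""
    votes = []
    for _ in range(B):
        Lb = [random.choice(L) for _ in L]   # N draws with replacement from L
        votes.append(one_nn(x, Lb))
    return max(set(votes), key=votes.count)  # plurality vote (categorical case)

print(bagged_predict(-1.5, L))  # deep in the class-0 region
print(bagged_predict(1.5, L))   # deep in the class-1 region
```

For a numerical response, the last line of `bagged_predict` would instead return the average of the B predictions.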
BETWEEN BAGGING AND BOOTSTRAP From F: L = {(x_1, y_1),…,(x_N, y_N)}, and a single predictor θ̂ = φ(x, L). For repeated samplings of L from F, the aggregated estimator is φ_A(x) = E_L[φ(x, L)] (numeric) or argmax_j P(φ(x, L) = j) (categorical). From F̂: B bootstrap samples L^(b) = {(x*_1, y*_1),…,(x*_N, y*_N)}, giving θ̂_b = φ(x, L^(b)). The bagged estimator: φ_B(x) = avg_b φ(x, L^(b)), or the majority vote of the B predictors.
BAGGING JUSTIFICATION - NUMERIC PREDICTION Consider a numeric aggregated predictor, based on replications of L from the same distribution F: φ_A(x) = E_L[φ(x, L)]. Given fixed (x, y), since E[Z²] ≥ (E[Z])² for any random variable Z: E_L[(y − φ(x, L))²] ≥ (E_L[y − φ(x, L)])² = (y − E_L[φ(x, L)])² = (y − φ_A(x))². Integrating both sides over the joint distribution of (x, y): E_{x,y,L}[(y − φ(x, L))²] ≥ E_{x,y}[(y − φ_A(x))²], i.e. the MSE of φ(x, L), averaged over samples L, is at least MSE(φ_A(x)). Therefore φ_A is better than φ, a predictor based on one sample from F.
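A minimal numeric check of this inequality, assuming a toy fixed target μ and an unstable "predictor" given by the mean of a tiny sample (all values illustrative):

```python
import random

random.seed(0)

mu = 3.0  # the fixed quantity y we are predicting (illustrative)

def phi():
    """An unstable predictor: the mean of a tiny sample around mu."""
    s = [random.gauss(mu, 1.0) for _ in range(5)]
    return sum(s) / 5

K = 2000
preds = [phi() for _ in range(K)]  # replicates of phi(x, L) over many L
phi_A = sum(preds) / K             # aggregated predictor E_L[phi(x, L)]

mse_single = sum((mu - p) ** 2 for p in preds) / K  # E_L[(y - phi)^2]
mse_agg = (mu - phi_A) ** 2                         # (y - phi_A)^2
print(mse_single, mse_agg)
```

The averaged per-sample MSE comes out near the single-predictor variance (about 0.2 here), while the aggregated predictor's error is far smaller, as the inequality predicts.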
BAGGING JUSTIFICATION - NUMERIC PREDICTION If E_L[φ²(x, L)] ≈ (E_L[φ(x, L)])², i.e. the variance of φ over replicates of L is small, aggregation will not help. The more highly variable the φ(x, L) are (over different replicates of L), the more improvement aggregation may produce.
BAGGING JUSTIFICATION - NUMERIC PREDICTION Write φ_A(x) := φ_A(x, F) and φ_B(x) := φ_A(x, F̂). There is a cross-over point between instability and stability at which φ_B stops improving on φ(x, L) and does worse: On the one hand, if the procedure φ is unstable over F, bagging gives improvement through aggregation - from one sample to many bootstrap samples. On the other hand, if the procedure φ is stable, then φ_B will not be as accurate for data drawn from F as φ(x, L), since φ_A(x, F̂) aggregates over F̂ rather than F.
BAGGING JUSTIFICATION - CLASSIFICATION Q_j(x) = P(φ(x, L) = j): the frequency of class j over replicates of L, for the tree predictor φ. P_j(x): the real distribution of j given x. The probability that the predictor φ classifies x correctly: P(correct classification | x) = Σ_j Q_j(x) P_j(x). The overall probability of correct classification: r = ∫ [Σ_j Q_j(x) P_j(x)] P(dx).
BAGGING JUSTIFICATION - CLASSIFICATION Σ_j Q_j(x) P_j(x) ≤ max_j P_j(x), with equality iff Q_j(x) = I(P_j(x) = max_i P_i(x)). Theoretical best (Bayes) predictor: φ*(x) = argmax_j P_j(x). The highest attainable correct-classification rate: r* = ∫ max_j P_j(x) P(dx).
BAGGING JUSTIFICATION - CLASSIFICATION Call φ order-correct at input x if argmax_j Q_j(x) = argmax_j P_j(x): the class j that is most likely given x is also the class that φ(x, L) predicts most often over many replicates of L. The aggregated predictor: φ_A(x) = argmax_j Q_j(x), i.e. Q_{A,j}(x) = I(argmax_i Q_i(x) = j). P(φ_A will classify x correctly) = Σ_j I(argmax_i Q_i(x) = j) P_j(x). If φ_A is order-correct at x: P(φ_A will classify x correctly) = max_j P_j(x).
BAGGING JUSTIFICATION - CLASSIFICATION Let C be the set of all x at which φ_A is order-correct. The correct-classification rate for φ_A: r_A = ∫_C max_j P_j(x) P(dx) + ∫_{C^c} [Σ_j I(φ_A(x) = j) P_j(x)] P(dx). Reminder: r* = ∫ max_j P_j(x) P(dx). If a predictor is good in the sense that it is order-correct for most inputs x, then aggregation can transform it into a nearly optimal predictor.
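A small simulation of this effect, assuming illustrative values for P_1(x) and Q_1(x) at a single input x where φ is order-correct but weak:

```python
import random

random.seed(1)

# Illustrative two-class setup at one input x:
p1 = 0.6   # P_1(x): true probability of class 1
q1 = 0.55  # Q_1(x): how often phi(x, L) says class 1 (order-correct, barely)

# Single-predictor accuracy: sum_j Q_j(x) P_j(x)
acc_single = q1 * p1 + (1 - q1) * (1 - p1)  # = 0.51

B, trials = 301, 5000

def phi_A_vote():
    """Majority vote of B independent replicates of phi(x, L)."""
    votes = sum(random.random() < q1 for _ in range(B))
    return 1 if votes > B / 2 else 0

share_1 = sum(phi_A_vote() for _ in range(trials)) / trials
acc_agg = share_1 * p1 + (1 - share_1) * (1 - p1)
print(acc_single, acc_agg)  # acc_agg moves toward max_j P_j(x) = 0.6
```

Even though each replicate is barely better than a coin flip at ranking the classes, the vote concentrates on the most frequent class, pushing the aggregated accuracy toward the Bayes rate max_j P_j(x).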
BAGGING - REGRESSION TREES EXAMPLE Each data set is divided into a test set T and a learning set L, usually 10% and 90% respectively. A regression tree is constructed from L using 10-fold CV; its squared error on T is e_S(L, T). 25 Bootstrap samples L^(b) are drawn from L, giving predictors {φ_1(x, L^(1)),…,φ_25(x, L^(25))}. For each (x_j, y_j) ∈ T: ŷ_j = (1/B) Σ_{b=1}^B φ_b(x_j, L^(b)), and e_B(L, T) = (1/|T|) Σ_{j=1}^{|T|} (y_j − ŷ_j)². The random division of the data is repeated 100 times, and the errors are averaged: (ē_S, ē_B).
BAGGING - REGRESSION TREES EXAMPLE The results (ē_S, ē_B):
BAGGING - CLASSIFICATION TREES EXAMPLE Each data set is divided into a test set T and a learning set L, 10% - 90%. A classification tree is constructed from L using 10-fold CV; its misclassification rate on T is e_S(L, T). 50 Bootstrap samples L^(b) are drawn from L, giving predictors {φ_1(x, L^(1)),…,φ_50(x, L^(50))}. For each (x_j, y_j) ∈ T, the estimated class is the one having the plurality in {φ_1(x_j, L^(1)),…,φ_50(x_j, L^(50))}; the bagging misclassification rate on T is e_B(L, T). The random division of the data is repeated 100 times, and the errors are averaged: (ē_S, ē_B).
BAGGING - CLASSIFICATION TREES EXAMPLE The results (ē_S, ē_B):
BAGGING - FORWARD STEPWISE SELECTION Step m: given a predictor φ_m based on x_(1),…,x_(m−1): Form a regression of y on x_(1),…,x_(m−1), x_(m) for each variable x_(m) not yet chosen. Select the variable that minimizes RSS(m). The output: a sequence of models, one for each subset size m. Subset selection is nearly optimal if there are only a few large non-zero {β_i}.
BAGGING - FORWARD STEPWISE SELECTION For 3, 15 and 27 non-zero {β_i} - 250 simulations: y = Σ_{i=1}^{P=30} β_i x_i + ε, ε ~ N(0,1), L = {(x_1, y_1),…,(x_60, y_60)}. Run FSS on L: predictors {φ_1(x),…,φ_P(x)} with mean squared errors {e_1^S,…,e_P^S}. 50 BS samples; for each b: predictors {φ_1(x, L^(b)),…,φ_P(x, L^(b))}, giving bagged predictors {φ_1^B(x),…,φ_P^B(x)} with mean squared errors {e_1^B,…,e_P^B}. Average over the 250 repetitions: {ē_m^S}, {ē_m^B}, m = 1,…,P = 30.
BAGGING - FORWARD STEPWISE SELECTION [Figure: ē_m^S and ē_m^B against subset size m, one panel per number of non-zero coefficients.] FSS alone is better for a small number of non-zero coefficients. Bigger error means less stability; bagging is good for unstable procedures (linear regression with all coefficients is stable).
BAGGING AND RANDOM FOREST Consider B bootstrap samples drawn i.i.d. from F, and B tree models based on them. Bias(average of predictions) = Bias(single prediction), so our only hope is to reduce the variance. Assume σ² is the single-tree variance and ρ is the pairwise correlation between trees. Then the variance of the average of the B predictions is ρσ² + ((1 − ρ)/B)σ² → ρσ² as B grows. Random forest: a modification of bagging that builds a large collection of de-correlated trees, and then averages them. In other words, the idea is to reduce ρ without increasing σ² (or the MSE) too much.
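The variance formula can be verified empirically by averaging equicorrelated variables; the correlation ρ = 0.4, σ² = 1 and B = 25 here are arbitrary illustrative choices:

```python
import random

random.seed(2)

rho, B = 0.4, 25               # illustrative correlation and ensemble size; sigma^2 = 1
theory = rho + (1 - rho) / B   # rho*sigma^2 + (1 - rho)*sigma^2/B

trials = 20000
means = []
for _ in range(trials):
    z = random.gauss(0, 1)     # shared component inducing pairwise correlation rho
    xs = [rho ** 0.5 * z + (1 - rho) ** 0.5 * random.gauss(0, 1)
          for _ in range(B)]
    means.append(sum(xs) / B)  # the "average of B tree predictions"

m = sum(means) / trials
emp = sum((v - m) ** 2 for v in means) / trials
print(emp, theory)
```

Note that increasing B only shrinks the (1 − ρ)σ²/B term; the ρσ² floor remains, which is why random forests attack ρ itself.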
DECISION TREES - CART - INTRO We define a tree by Θ = {(R_m, c_m)}_{m=1}^M. The tree prediction: f̂(x) = Σ_{m=1}^M c_m I(x ∈ R_m). Choosing c_m (given R_m): Regression: the average of {y_i : x_i ∈ R_m}. Classification: the majority vote of {y_i : x_i ∈ R_m}. Choosing R = {R_1,…,R_M} - greedy algorithm: at each stage, minimize the selected error criterion over the splitting variable x_j and the split point s of x_j.
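One greedy stage of this algorithm, for regression with squared error, can be sketched as follows (the toy data at the bottom is an illustrative assumption):

```python
def best_split(X, y):
    """Greedy CART step: best (variable j, split point s) by squared error."""
    best = None  # (sse, j, s, left_mean, right_mean)
    n, p = len(X), len(X[0])
    for j in range(p):
        for s in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= s]
            right = [y[i] for i in range(n) if X[i][j] > s]
            if not left or not right:
                continue  # a split must produce two non-empty regions
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, s, ml, mr)
    return best

# A step function in the first variable: the best split should be at s = 1.0
# on variable 0, with leaf constants 0 and 10 (the second variable is noise).
X = [[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 2.0]]
y = [0.0, 0.0, 10.0, 10.0]
print(best_split(X, y))
```

A full CART tree applies this step recursively to each resulting region.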
REGRESSION TREES SIMULATION (ESL) Simulation of L, |L| = 30. x_i ∈ ℝ^p, y_i ∈ {0,1}, with P(y_i = 1 | x_i ≤ 0.5) = 0.2 and P(y_i = 1 | x_i > 0.5) = 0.8.
RANDOM FOREST - THE ALGORITHM For b = 1 to B: 1. Draw a bootstrap sample L^(b) = {(x*_1, y*_1),…,(x*_N, y*_N)} from L. 2. Grow an RF tree on L^(b), by repeating at each terminal node until n_min is reached: Select m variables at random from the p variables of x_i. Pick the best variable and split point among the m. Split the node into two daughter nodes. Output: B tree predictors {T_b(x; Θ_b)}_{b=1}^B, where Θ_b = {(R_m, c_m)}_{m=1}^M. Prediction at a new point x: regression: f̂_rf^B(x) = (1/B) Σ_b T_b(x; Θ_b); classification: Ĉ_rf^B(x) = majority vote of {T_b(x; Θ_b)}_{b=1}^B.
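A compact sketch of this algorithm for regression; the shallow depth limit, toy data, and parameter values (B = 30, m = 1, n_min = 2) are simplifying assumptions for illustration:

```python
import random

random.seed(5)

def grow_tree(X, y, m, n_min=2, depth=3):
    """Grow one RF tree: at each node, try only m randomly chosen variables."""
    if depth == 0 or len(y) <= n_min or len(set(y)) == 1:
        return {"leaf": sum(y) / len(y)}
    p = len(X[0])
    best = None
    for j in random.sample(range(p), m):      # random subset of the p variables
        for s in {row[j] for row in X}:
            li = [i for i in range(len(y)) if X[i][j] <= s]
            ri = [i for i in range(len(y)) if X[i][j] > s]
            if not li or not ri:
                continue
            ml = sum(y[i] for i in li) / len(li)
            mr = sum(y[i] for i in ri) / len(ri)
            sse = (sum((y[i] - ml) ** 2 for i in li)
                   + sum((y[i] - mr) ** 2 for i in ri))
            if best is None or sse < best[0]:
                best = (sse, j, s, li, ri)
    if best is None:
        return {"leaf": sum(y) / len(y)}
    _, j, s, li, ri = best
    return {"j": j, "s": s,
            "l": grow_tree([X[i] for i in li], [y[i] for i in li], m, n_min, depth - 1),
            "r": grow_tree([X[i] for i in ri], [y[i] for i in ri], m, n_min, depth - 1)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["l"] if x[tree["j"]] <= tree["s"] else tree["r"]
    return tree["leaf"]

def random_forest(X, y, B=30, m=1):
    """B trees, each grown on a bootstrap sample of L = (X, y)."""
    n, forest = len(y), []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m))
    return forest

def rf_predict(forest, x):
    return sum(predict(t, x) for t in forest) / len(forest)  # regression average

# Toy data: the target depends on the first variable only; the second is noise.
X = [[float(i), float(i % 2)] for i in range(10)]
y = [float(i >= 5) for i in range(10)]
forest = random_forest(X, y)
print(rf_predict(forest, [9.0, 1.0]), rf_predict(forest, [0.0, 0.0]))
```

For classification one would replace the leaf mean and the final average by majority votes.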
RANDOM FOREST - REGRESSION TREE f̂_rf^B(x) = (1/B) Σ_b T_b(x; Θ_b); as B → ∞, f̂_rf(x) = E_{Θ|L}[T(x; Θ(L))]. Var f̂_rf^B(x) = ρ(x)σ²(x) + ((1 − ρ(x))/B)σ²(x); as B → ∞, Var f̂_rf(x) = ρ(x)σ²(x). Here ρ(x) = corr(T(x; Θ_1(L)), T(x; Θ_2(L))), where Θ_1(L), Θ_2(L) are two RF trees grown on the same randomly sampled L. In other words, ρ(x) is the theoretical correlation between trees, induced by repeatedly drawing training samples L from the population. σ²(x) = Var(T(x; Θ(L))).
RANDOM FOREST - REGRESSION TREE Var_{Θ,L}[T(x; Θ(L))] = Var_L[E_{Θ|L} T(x; Θ(L))] + E_L[Var_{Θ|L} T(x; Θ(L))], i.e. Total variance = Var_L[f̂_rf(x)] + within-L variance of a single tree prediction. As m decreases, Var_L[f̂_rf(x)] decreases: the RF ensemble is better than one RF tree. As for the bias: Bias(x) = μ(x) − E_L[f̂_rf(x)] = μ(x) − E_L E_{Θ|L}[T(x; Θ(L))]. Although for different models the shape and rate of the bias curves may differ, the general trend is that as m decreases, the bias increases.
RANDOM FOREST - REGRESSION TREE Usually, the default value for m is ⌊p/3⌋ for regression and ⌊√p⌋ for classification.
OUT OF BAG SAMPLES Roughly 37% of the examples in L do not appear in a particular bootstrap training set: each example is left out with probability (1 − 1/N)^N → e^(−1) ≈ 0.368. OOB samples: for each bootstrap sample, the OOB observations are the (x_i, y_i) which did not appear in the sample. The OOB samples can be used to form estimates of important quantities - error estimate, variable importance and more (Breiman, 1996b, OOB estimation).
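The 37% figure is easy to check by simulation (N = 1000 and 200 repetitions are arbitrary choices):

```python
import random

random.seed(3)

N, trials = 1000, 200
fracs = []
for _ in range(trials):
    drawn = {random.randrange(N) for _ in range(N)}  # indices in one bootstrap sample
    fracs.append(1 - len(drawn) / N)                 # fraction left out-of-bag
oob = sum(fracs) / trials
print(oob)  # close to (1 - 1/N)^N, about 1/e
```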
VARIABLE IMPORTANCE RF offers ways to compute the variable importance of features, some of them based on the OOB samples. Gini importance: the mean gain in the Gini impurity criterion produced by x_j over all trees. OOB permutation VI: when the b-th tree is grown, the OOB samples are passed down the tree, and the misclassification rate is recorded. Then the values of the m-th variable are randomly permuted in the OOB samples, and the rate is computed again. The VI of feature m is the average increase in misclassification rate (over all trees) compared to the out-of-bag misclassification rate.
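The permutation idea can be demonstrated without a forest at all; here a fixed toy model stands in for a fitted tree, and the data, model and noise level are illustrative assumptions (squared error replaces the misclassification rate, since the toy response is numeric):

```python
import random

random.seed(4)

# A stand-in "fitted model" that truly uses only the first feature.
def model(x):
    return 2 * x[0]

X = [[random.random(), random.random()] for _ in range(200)]
y = [2 * x[0] + random.gauss(0, 0.1) for x in X]

def mse(Xs):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(Xs, y)) / len(y)

base = mse(X)
importance = []
for j in range(2):
    Xp = [row[:] for row in X]
    col = [row[j] for row in Xp]
    random.shuffle(col)                # permute the values of feature j
    for row, v in zip(Xp, col):
        row[j] = v
    importance.append(mse(Xp) - base)  # increase in error after permuting j
print(importance)  # feature 0 matters; feature 1 does not
```

Permuting the informative feature inflates the error noticeably, while permuting the noise feature leaves it unchanged; averaging this increase over the trees of a forest, on their OOB samples, gives the OOB permutation VI.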
VARIABLE IMPORTANCE (SPAM DATA)