SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classification 20 February 2017 1 / 30
Today Overview Regression trees Pruning Bagging, random forests 3 / 30
Setup Let Y be a real-valued random variable. Given $x \in \mathcal{X} \subseteq \mathbb{R}^d$, we want to predict the value of Y. To understand how different values of x influence Y, we have $N \in \mathbb{N}$ examples $(x_i, y_i)_{i \in [1..N]}$. The variable Y can take discrete or continuous values, corresponding respectively to classification and regression problems. We consider here binary decision trees. A decision tree partitions $\mathcal{X}$ into simple regions $(R_j)_{j \in [1..J]}$. For $x = (x^1, \dots, x^d) \in \mathbb{R}^d$ and j such that $x \in R_j$, the value predicted by the tree is: the most represented class among $\{y_i : x_i \in R_j\}$ in classification problems; the average of $\{y_i : x_i \in R_j, i \in [1..N]\}$ in regression problems. 4 / 30
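The two prediction rules above (most represented class for classification, average for regression) can be sketched as follows; `region_prediction` is an illustrative helper name, not from the slides.

```python
import numpy as np
from collections import Counter

def region_prediction(ys, task):
    """Predict from the training responses {y_i : x_i in R_j} of one region:
    the most represented class (classification) or the average (regression)."""
    if task == "classification":
        return Counter(ys).most_common(1)[0][0]
    return float(np.mean(ys))
```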
Setup Each node n of the tree is associated with a subset $R_n$ of $\mathcal{X}$, which is further split into two regions through a test comparing the coordinate $x^k$, for some $k \in [1..d]$, with a value $c \in \mathbb{R}$. Then, for l and r the left and right children of n, $R_l := \{x \in R_n : x^k < c\}$ and $R_r := \{x \in R_n : x^k \geq c\}$. Figure: A regression tree for predicting the log salary (in \$1,000) of a baseball player, based on the number of years played and the number of hits made in the previous year; splits Years < 4.5 and Hits < 117.5, with leaf values 5.11, 6.00, and 6.74. 5 / 30
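As a concrete illustration, the two-split baseball tree in the figure can be written directly as nested comparisons (a sketch; the leaf values 5.11, 6.00, and 6.74 are the log salaries from the figure):

```python
def predict_log_salary(years, hits):
    """Regression tree from the figure: first split on Years, then on Hits."""
    if years < 4.5:
        return 5.11
    if hits < 117.5:
        return 6.00
    return 6.74
```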
Figure: The three-region partition $R_1, R_2, R_3$ provided by the regression tree in the previous figure (axes: Years and Hits; splits at Years = 4.5 and Hits = 117.5). 6 / 30
Example: spam classification Figure: Tree for spam classification. Split variables are shown in blue. CAPMAX and CAPAVE = maximal and average length of the longest uninterrupted sequences of capital letters, respectively. 7 / 30
Questions How do we construct such trees? How large should a tree be? What are the pros and cons of tree-based methods? 8 / 30
Today Overview Regression trees Pruning Bagging, random forests 9 / 30
Regression trees vs. linear regression We generally want to approximate a relation $Y = f(x) + \varepsilon$. Linear regression involves linear functions $f(x) = \beta_0 + \sum_{j=1}^p \beta_j x^j$. We will instead consider $f(x) = \sum_{j=1}^J m_j \mathbf{1}_{R_j}(x)$, for a partition $\{R_j\}_{j \in [1..J]}$ of $\mathcal{X}$. 10 / 30
Building a regression tree: cost function We will construct the tree by minimizing $\mathrm{RSS}(m_1, \dots, m_J, R_1, \dots, R_J) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{j=1}^J \sum_{i: x_i \in R_j} (y_i - m_j)^2$. For each j, the $m_j$ minimizing the sum of squares $\sum_{i: x_i \in R_j} (y_i - m_j)^2$ is always $m_j = \bar{y}_{R_j}$, the average over all $y_i$ such that $x_i \in R_j$. Finding the minimizing partition $\{R_j\}$ is, however, computationally infeasible in general. 11 / 30
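The RSS of a given partition, with each $m_j$ set to the region average, can be computed as in this sketch (`partition_rss` and the label encoding of regions are illustrative choices, not from the slides):

```python
import numpy as np

def partition_rss(y, region_labels):
    """RSS = sum_j sum_{i: x_i in R_j} (y_i - ybar_{R_j})^2,
    where region_labels[i] identifies the region containing x_i."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(region_labels)
    rss = 0.0
    for j in np.unique(labels):
        yj = y[labels == j]
        rss += np.sum((yj - yj.mean()) ** 2)
    return float(rss)
```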
Building a regression tree: top-down minimization Thus, in order to find regions $\{R_j\}_{j=1}^J$ minimizing $\sum_{j=1}^J \underbrace{\textstyle\sum_{i: x_i \in R_j} (y_i - \bar{y}_{R_j})^2}_{Q_{R_j}}$, we proceed with a top-down, greedy approach. In the first step, when splitting $\mathcal{X}$ into $R_1(k, s) = \{x \in \mathcal{X} : x^k < s\}$ and $R_2(k, s) = \{x \in \mathcal{X} : x^k \geq s\}$, we minimize $Q_{R_1(k,s)} + Q_{R_2(k,s)}$ with respect to k and s. 12 / 30
Building a regression tree: top-down minimization (cont.) For a given k, the optimal s is quite easily found by scanning through the data. Repeating this for all $k \in [1..d]$ yields an optimal pair (k, s). Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of them, and so forth. We stop growing the tree when some stopping criterion is reached, e.g., when no region contains more than five training data points. 13 / 30
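The greedy split search described above (scan each coordinate k, and for each k scan candidate thresholds s) might look like the following sketch, where candidate thresholds are midpoints between consecutive sorted values, an implementation choice not prescribed by the slides:

```python
import numpy as np

def best_split(X, y):
    """Return the (k, s) minimizing Q_{R1(k,s)} + Q_{R2(k,s)} over all
    coordinates k and candidate thresholds s."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    best_k, best_s, best_q = None, None, np.inf
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])
        for s in (values[:-1] + values[1:]) / 2:  # midpoints: both sides nonempty
            left, right = y[X[:, k] < s], y[X[:, k] >= s]
            q = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if q < best_q:
                best_k, best_s, best_q = k, s, q
    return best_k, best_s
```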
Bias versus variance The tree size is a tuning parameter governing the model's complexity. Too large a tree leads to overfitting (high variance on test sets), while too small a tree leads to high bias. The optimal tree size should be determined adaptively from the data. One possibility is to split a node only if the split yields an RSS reduction exceeding a certain threshold; this is however treacherous in top-down minimization, as an apparently worthless split may enable a good split further down. 14 / 30
Today Overview Regression trees Pruning Bagging, random forests 15 / 30
Cost complexity pruning Grow a large tree $T_0$ while storing the different $Q_{R_j}$ associated with each split. We define a subtree $T \subseteq T_0$ obtained by pruning $T_0$ in a bottom-up fashion. Let $|T|$ denote the number of leaves $R_m(T)$ in T. Define, for some $\alpha \geq 0$, the cost function $C_\alpha(T) = \sum_{m=1}^{|T|} Q_{R_m(T)} + \alpha |T|$, where the $Q_{R_m(T)}$ are available from the first step. The idea is to find a subtree $T_\alpha$ minimizing $C_\alpha(T)$. 16 / 30
Cost complexity pruning (cont.) Large values of $\alpha$ lead to smaller trees. For $\alpha = 0$, the solution is $T_0$. To find $T_\alpha$, we create a sequence of subtrees by successively removing the leaves whose elimination leads to the smallest increase of $\sum_{m=1}^{|T|} Q_{R_m(T)}$. To find the optimal $\alpha$, apply K-fold cross-validation. 17 / 30
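Given the leaf costs $Q_{R_m(T)}$ of a few candidate subtrees, $C_\alpha(T)$ and the minimizing subtree can be computed directly. A sketch, where each candidate subtree is represented simply by its list of leaf costs (in practice the candidates would come from the weakest-link sequence described above):

```python
def cost_complexity(leaf_qs, alpha):
    """C_alpha(T) = sum of leaf costs + alpha * number of leaves."""
    return sum(leaf_qs) + alpha * len(leaf_qs)

def select_subtree(candidates, alpha):
    """Return the candidate (list of leaf costs) minimizing C_alpha."""
    return min(candidates, key=lambda qs: cost_complexity(qs, alpha))
```

Note how the selection shifts toward smaller trees as the penalty grows: with no penalty the three-leaf candidate below wins, while a large alpha favors the single leaf.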
Example: baseball player salaries (cont.) Figure: Unpruned tree for the log salary (in \$1,000) of a baseball player, with splits on Years, Hits, RBI, Putouts, Walks, and Runs. 18 / 30
Trees vs. linear models Figure: The true decision boundary is indicated by the shaded regions (axes $X_1$, $X_2$); a linear decision boundary is compared with the box-shaped boundary of a decision tree. 19 / 30
Node impurity The task of growing a classification tree is very similar to that of growing a regression tree. The only difference is that the RSS is no longer meaningful. We thus need to find alternative measures $Q_j(T)$ of node impurity. For this purpose, define $\hat{p}_{j,k} = \frac{1}{|R_j|} \sum_{i: x_i \in R_j} \mathbf{1}_{\{y_i = k\}}$ and $k(j) = \arg\max_k \hat{p}_{j,k}$, and let, e.g., $Q_j(T) = 1 - \hat{p}_{j,k(j)}$ (misclassification error), $Q_j(T) = \sum_{k=1}^K \hat{p}_{j,k} (1 - \hat{p}_{j,k})$ (Gini index), or $Q_j(T) = -\sum_{k=1}^K \hat{p}_{j,k} \log \hat{p}_{j,k}$ (cross-entropy). 20 / 30
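The three impurity measures can be evaluated from the class proportions $\hat{p}_{j,k}$ of a single node, as in this sketch (`node_impurities` is an illustrative name):

```python
import numpy as np
from collections import Counter

def node_impurities(labels):
    """Misclassification error, Gini index, and cross-entropy of one node,
    computed from the empirical class proportions p_hat."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    misclass = float(1.0 - p.max())
    gini = float(np.sum(p * (1.0 - p)))
    entropy = float(-np.sum(p * np.log(p)))  # all p > 0 by construction
    return misclass, gini, entropy
```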
Node impurity measures Figure: Node impurity measures for two-class classification, as a function of the proportion p in class 2. (Cross-entropy has been scaled.) 21 / 30
Pros and cons of tree-based methods Pros: easy to explain; able to handle categorical or numerical data; scale well with N. Cons: do not, in their most basic form, have the same level of predictive accuracy as some other regression and classification approaches; can be very non-robust (i.e., small changes in the data lead to completely different trees); highly sensitive to the data representation, making them ill-suited to many problems. The predictive accuracy can however be drastically improved using ideas such as bagging and random forests, discussed in the following. 22 / 30
Today Overview Regression trees Pruning Bagging, random forests 23 / 30
Prelude: the bootstrap Given an i.i.d. sample $\{x_i\}_{i=1}^N$ from a probability density p, the bootstrap algorithm is based on the approximation $\hat{p}(x) = \frac{1}{N} \sum_{i=1}^N \delta_{x_i}(x)$ of p. Given $\hat{p}$, a new sample set $\{x_i^*\}_{i=1}^N$ with approximately the same distribution as $\{x_i\}_{i=1}^N$ can be formed by sampling from $\hat{p}$, i.e., by drawing N times among $\{x_i\}_{i=1}^N$ with replacement. In machine learning applications, this can be used for creating artificial replicates of the training set. 24 / 30
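Drawing one bootstrap replicate (N draws with replacement from the original sample) is a short exercise with NumPy; this is a sketch, and the seed is arbitrary:

```python
import numpy as np

def bootstrap_sample(data, rng):
    """Draw N times among {x_i} with replacement, where N = len(data)."""
    idx = rng.integers(0, len(data), size=len(data))
    return [data[i] for i in idx]

rng = np.random.default_rng(0)
replicate = bootstrap_sample(list(range(10)), rng)
```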
Bagging In the case of regression, we have approximated the target function using $f(x) = \sum_{j=1}^J \bar{y}_{R_j} \mathbf{1}_{R_j}(x)$, where the partition $\{R_j\}_{j=1}^J$ is formed using a decision tree. The decision trees discussed above generally suffer from high variance, i.e., building the tree on a different training set may lead to a quite different f. It would be better to use B independent training sets of similar size, each yielding a tree $\hat{f}_b$, and use the model $\hat{f}_{\mathrm{avg}}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}_b(x)$. 25 / 30
Bagging (cont.) We use bootstrap sampling to create B such training sets artificially, each yielding a tree $\hat{f}_b^*$, and use the model $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}_b^*(x)$. The trees $\hat{f}_b^*$ are grown deep without pruning, implying that each tree has high variance but low bias. In the classification case, we can record the class predicted by each of the B trees and take the majority vote. 26 / 30
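The bagging recipe (fit one model per bootstrap replicate, then average the predictions) can be sketched generically. Here the base learner `fit` stands in for a deep, unpruned tree; for brevity, the toy base learner below just predicts the replicate's mean, an assumption made purely to keep the example self-contained:

```python
import numpy as np

def bagged_predict(x, X, y, fit, B, rng):
    """Average the predictions of B models, each fit on a bootstrap replicate."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))
        model = fit(X[idx], y[idx])
        preds.append(model(x))
    return float(np.mean(preds))

# Toy base learner: a "tree" of depth 0, predicting the replicate's mean.
mean_fit = lambda Xb, yb: (lambda x: yb.mean())

rng = np.random.default_rng(0)
pred = bagged_predict(0.0, [[0.0], [1.0]], [0.0, 10.0], mean_fit, B=50, rng=rng)
```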
Bagging (cont.) The number of trees B is not a critical parameter with bagging; using a very large value of B will not lead to overfitting. The reason is that the probability of making an error converges as $B \to \infty$. This probability also depends on the expected correlation between the errors of different trees $\hat{f}_b(x)$ and $\hat{f}_{b'}(x)$ as (x, Y) varies. 27 / 30
Some theory Indeed, in the classification case, let the margin function be $m(x, y) = \frac{1}{B} \sum_{b=1}^B \mathbf{1}_{\{\hat{f}_b(x) = y\}} - \max_{k \neq y} \frac{1}{B} \sum_{b=1}^B \mathbf{1}_{\{\hat{f}_b(x) = k\}}$. Moreover, define the generalization error by $\mathrm{PE} = \mathbb{P}(m(x, Y) < 0)$. Breiman (2001) shows that PE converges as $B \to \infty$. In addition, PE can be bounded as $\mathrm{PE} \leq (1 - s^2)\bar{\rho}/s^2$, where, roughly, s is the limiting margin function (the strength) and $\bar{\rho}$ describes the expected correlation between the trees' errors as (x, Y) varies. 28 / 30
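The margin function above can be estimated at a single point directly from the B trees' votes; a sketch, where `margin` is an illustrative name:

```python
from collections import Counter

def margin(votes, y):
    """Fraction of trees voting the true class y, minus the largest
    fraction voting any other single class."""
    B = len(votes)
    counts = Counter(votes)
    correct = counts.get(y, 0) / B
    best_wrong = max((n / B for k, n in counts.items() if k != y), default=0.0)
    return correct - best_wrong
```

A negative margin at (x, y) means the ensemble misclassifies that point, which is exactly the event whose probability defines PE.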
Random forests In order to decorrelate the trees, it is desirable to build them differently. Say that there is one strong predictor $x^k$ along with a number of other moderately strong predictors: most or all of the trees will then use this strong predictor in the top split, so all trees will look quite similar, yielding highly correlated predictions. Thus, at each split, consider only splitting on a randomly chosen subset of $m \leq d$ predictors. On average, $(d - m)/d$ of the splits will not even consider the strong predictor, so other predictors get more of a chance. Typically, one uses $m \approx \sqrt{d}$; m = d corresponds to standard bagging. 29 / 30
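The per-split feature subsampling is the only change relative to bagging; a sketch, using the common default $m \approx \sqrt{d}$:

```python
import numpy as np

def candidate_features(d, rng, m=None):
    """Randomly choose m of the d predictors to consider at one split."""
    if m is None:
        m = max(1, int(round(np.sqrt(d))))  # common default m ~ sqrt(d)
    return rng.choice(d, size=m, replace=False)

rng = np.random.default_rng(0)
feats = candidate_features(500, rng)  # d = 500, as in the example that follows
```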
Example: cancer-type prediction Figure: Test classification error as a function of the number of trees, for $m = p$, $m = p/2$, and $m = \sqrt{p}$; here p = 500. A single classification tree has an error rate of 45.7%. 30 / 30