Statistical Consulting Topics Classification and Regression Trees (CART)


Catherine Carr

Suppose the main goal of a data analysis is to predict a categorical outcome, as in the examples below.

Given a set of known characteristics $x_i$ for a person, will this person...
... vote for Hillary or for Bernie?
... enroll at the University of Iowa or not?
... graduate from high school or not?

A good prediction tool (or classifier) will have high accuracy in its predictions. But we often compare classifiers by their lack of accuracy, measured as a misclassification rate (i.e., how often the classifier makes a wrong prediction; lower is better).
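As a small illustration of the misclassification rate, the sketch below compares predicted and observed 0-1 outcomes. The two vectors are made-up values chosen only for illustration; they are not data from these slides.

```r
# Hypothetical observed outcomes and classifier predictions (1 = yes, 0 = no).
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0)

# Misclassification rate: proportion of predictions that are wrong.
misclass_rate <- mean(predicted != actual)
accuracy      <- 1 - misclass_rate

# Cross-tabulation of observed vs. predicted classes.
confusion <- table(actual = actual, predicted = predicted)

print(confusion)
print(c(misclass_rate = misclass_rate, accuracy = accuracy))
```

Two of the ten hypothetical predictions disagree with the observed outcome, so the misclassification rate here is 2/10.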


A classification tree (response is categorical) or regression tree (response is continuous) is a prediction model that can be represented with a decision tree.

Example data for a classification tree:
- $y_i$ is a 0-1 variable (classification)
- $x_i$ is a set of candidate predictors (continuous or categorical)

The classical statistical analysis would look like a logistic regression model.
- Parameter estimates are of high importance (i.e., interpretation of the log of the odds ratio, hypothesis tests).
- If interested in testing hypotheses (with p-values), then model assumptions are a concern (such as the sigmoidal shape).
- It can be used for prediction, but prediction is probably not the main goal of the modeling.

A classification tree is formed by repeatedly splitting the data into parts. The splits are chosen so that the items within each subset become more homogeneous; that is, the chosen split maximizes the reduction in impurity (e.g., the reduction in misclassification rate).

Example: Two continuous predictors.
- Split on x1 at a.
- For observations with x1 < a, split on x2 at b.
- For observations with x1 ≥ a, split on x2 at c.

[Figure: the (x1, x2) plane divided by a vertical line at x1 = a, with horizontal lines at x2 = b on one side and x2 = c on the other.]
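To make "maximizes the reduction in impurity" concrete, here is a minimal base-R sketch of a search for the best split point on a single continuous predictor, using the binomial deviance as the impurity measure. The helper names `node_deviance` and `best_split` and the toy data are assumptions made for illustration; this is not the code used inside the `tree` or `rpart` packages.

```r
# Binomial deviance of one node with 0-1 responses y:
# -2 * n * [p*log(p) + (1-p)*log(1-p)], taken as 0 for a pure node.
node_deviance <- function(y) {
  n <- length(y)
  if (n == 0) return(0)
  p <- mean(y)
  if (p == 0 || p == 1) return(0)
  -2 * n * (p * log(p) + (1 - p) * log(1 - p))
}

# Try every midpoint between consecutive distinct x values and keep the
# cut with the largest reduction in deviance (assumes >= 2 distinct values).
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  parent <- node_deviance(y)
  gain <- sapply(cuts, function(a) {
    parent - node_deviance(y[x < a]) - node_deviance(y[x >= a])
  })
  list(cut = cuts[which.max(gain)], gain = max(gain))
}

# Toy data (hypothetical): the response switches from mostly 0 to mostly 1.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 0, 1, 0, 1, 1)
result <- best_split(x, y)
print(result)
```

For this toy data the search picks the cut at 4.5, which leaves a pure group of zeros on the left. A full tree would apply the same search recursively to each resulting subset and to every candidate predictor.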

With two continuous predictors, CART partitions the 2D predictor space into rectangles, and each rectangle is associated with a specific probability of being a 1 or, in other words, a specific $\hat{p}$. The response surface looks like rectangular plateaus.

> library(rpart)
> library(plotmo)
> tree=rpart(group~x1+x2)
> plotmo(tree, type="prob", nresponse="1")

[Figure: plotmo output titled "Group (0/1) prediction based on x1,x2", showing the predicted probability of group 1 as a surface over x1 and x2.]
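The "rectangular plateaus" idea can be sketched directly: once the split points are fixed, prediction is just a lookup of the rectangle an observation falls into. In the sketch below, the function `predict_rect`, the cut points, and the four $\hat{p}$ values are all hypothetical; it assumes one split on x1 followed by a split on x2 within each side.

```r
# Piecewise-constant prediction over four rectangles:
# split on x1 at cut_a; if x1 < cut_a split on x2 at cut_b,
# otherwise split on x2 at cut_c.
predict_rect <- function(x1, x2, cut_a, cut_b, cut_c, p_hat) {
  region <- ifelse(x1 < cut_a,
                   ifelse(x2 < cut_b, "R1", "R2"),
                   ifelse(x2 < cut_c, "R3", "R4"))
  unname(p_hat[region])
}

# Hypothetical estimated probabilities of a 1 in each rectangle.
p_hat <- c(R1 = 0.10, R2 = 0.70, R3 = 0.35, R4 = 0.90)

# One test point inside each rectangle.
preds <- predict_rect(x1 = c(-1, -1, 3, 3),
                      x2 = c(0, 5, 0, 5),
                      cut_a = 1, cut_b = 2, cut_c = 4,
                      p_hat = p_hat)
print(preds)
```

Every observation inside the same rectangle receives the same predicted probability, which is why the fitted surface is flat within each region.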

Example: Many predictor variables. Can we predict who is a smoker?

Predictor Variables
- Sex: M, F
- Age: continuous
- Marital status: Divorced, Married, Separated, Single, Widowed
- Education level: 9 categorical levels
- Nationality: 8 categorical levels
- Ethnicity: 7 categorical levels
- Region: 7 categorical levels

Response Variable
- Smoker: yes, no

N=1693 subjects. The overall proportion of smokers is about 0.25 (423 of 1693). So, if you had to predict for a person without any personal information, there's a probability of about 0.25 that they are a smoker.

The best first split of the data comes from the Age variable at Age=51.5 years.
- The older group (n=764) has 15.1% smokers.
- The younger group (n=929) has 33.2% smokers.

What does the suggested tree look like?
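The overall smoking proportion is a weighted average of the two subgroup proportions, which gives a quick arithmetic check on the first split. The sketch below uses only the group sizes and rounded percentages quoted above, so the result is approximate.

```r
# Group sizes and (rounded) smoking proportions after the split at Age = 51.5.
n_old   <- 764;  p_old   <- 0.151
n_young <- 929;  p_young <- 0.332

# Overall proportion = weighted average of the subgroup proportions.
overall <- (n_old * p_old + n_young * p_young) / (n_old + n_young)

print(n_old + n_young)    # total sample size
print(round(overall, 3))  # approximate overall proportion of smokers
```

The two groups add up to the full sample of 1693, and the weighted average comes out at roughly one quarter.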

[Figure: fitted classification tree, with the predicted class and the smoker count over node size at each leaf.]

Age < 51.5
  yes: Highest.Qualification: A Levels,Degree,Higher/Sub Degree
    yes: No (69/343)
    no: Marital.Status: Married
      yes: No (73/245)
      no: No (166/341)
  no: No (115/764)

At each node, the observation goes to the left branch if and only if the stated condition is satisfied. The majority class of the subset is shown at each terminal node (in this case, the probability of being a smoker was less than 0.5 in every subset, so every leaf predicts "No").

Proportion of smokers shown at the leaves:

Are you under 51.5?
  No (Age > 51.5): 15.05%
  Yes: split on highest qualification
    A Levels, Degree, Higher/Sub Degree: 20.12%
    GCSE/CSE, GCSE/O Level, No Qualification, ONC/BTEC, Other/Sub Degree: Are you married?
      Yes: 29.80%
      No (Divorced, Separated, Single, Widowed): 48.68%

R code to get the previously shown tree using the tree package by Brian Ripley.

> library(tree)
> tree.output=tree(Smoke. ~ Sex + Age + Marital.Status + Highest.Qualification + Nationality + Ethnicity + Region, split="deviance")
> plot(tree.output)
> text(tree.output,pretty=0)

There are many things you can get from the output, though I've found the format somewhat hard to work with.

> summary(tree.output)

Classification tree:
tree(formula = Smoke. ~ Sex + Age + Marital.Status + Highest.Qualification +
    Nationality + Ethnicity + Region, split = "deviance")
Variables actually used in tree construction:
[1] "Age"                   "Highest.Qualification" "Marital.Status"
Number of terminal nodes:  4
Residual mean deviance:  1.044 = 1763 / 1689
Misclassification error rate: 0.2499 = 423 / 1693

> tree.output$frame[,c(1,2,5)]
          var    n splits.cutleft splits.cutright
1         Age 1693          <51.5           >51.5
2  Highst.Qul  929           :bcf         :adeghi
4      <leaf>  343
5  Marit.Stat  586             :b           :acde
10     <leaf>  245
11     <leaf>  341
3      <leaf>  764

> tree.output$frame[,c(1,2,6)]
                     var    n yprob.no yprob.yes
1                    Age 1693
2  Highest.Qualification  929
4                 <leaf>  343
5         Marital.Status  586
10                <leaf>  245
11                <leaf>  341
3                 <leaf>  764

## How many were misclassified?
> misclass.tree(tree.output)
[1] 423

## Looking at the tree.output$frame, I can see
## that the end nodes are in rows 3,5,6,7.

## How many were misclassified at each leaf?
> misclass.tree(tree.output,detail=TRUE)[c(3,5,6,7)]

## Verify overall number of misclassifications:
> sum(misclass.tree(tree.output,detail=TRUE)[c(3,5,6,7)])
[1] 423

## Hand-calculate the misclassification rate
## at each end node (count wrong/total count at node):
> misclass.tree(tree.output,detail=TRUE)[c(3,5,6,7)]/
    tree.output$frame[c(3,5,6,7),2]

Notice how these misclassification rates match our $\hat{p}$ values at each node. This happens because every leaf predicts "No", so the smokers at a leaf are exactly the misclassified observations. At the older age node, we predict all individuals to be nonsmokers (as $\hat{p} = 0.1505 < 0.5$); in other words, we predict $\hat{Y} = 0$ for everyone over 51.5 years old. Thus, we will get 15.05% of those incorrect.
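The link between leaf proportions and misclassification can be checked by hand. Under the majority rule, the misclassification rate at a leaf is the smaller of $\hat{p}$ and $1-\hat{p}$. The sketch below uses the four leaf proportions reported in these slides; the leaf sizes 245 and 341 are taken from the deviance calculation for this example, and the leaf labels are names chosen here for readability.

```r
# Leaf sizes and estimated smoking proportions from the fitted tree.
n_leaf   <- c(older = 764, higher_qual = 343, married = 245, not_married = 341)
p_smoker <- c(0.1505, 0.2012, 0.2980, 0.4868)

# Majority-rule prediction and the resulting error rate at each leaf.
pred      <- ifelse(p_smoker >= 0.5, "Yes", "No")
leaf_rate <- pmin(p_smoker, 1 - p_smoker)

# Misclassified counts per leaf (rounded, since the proportions are rounded).
leaf_wrong   <- round(n_leaf * leaf_rate)
total_wrong  <- sum(leaf_wrong)
overall_rate <- total_wrong / sum(n_leaf)

print(pred)
print(leaf_wrong)
print(c(total_wrong = total_wrong, overall_rate = round(overall_rate, 4)))
```

Because every leaf has $\hat{p} < 0.5$, all four leaves predict "No", and the per-leaf error counts add up to the 423 misclassifications reported by `misclass.tree`.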

As for misclassification, you may find that classification rates are very good at some nodes and not so good at others. This can be very useful to the researcher. Perhaps the researcher wants to find certain subgroups who have very little chance of being a 1 (of enrolling at the university, for example).

As a note, classification trees inherently allow for interactions and complex relationships. For instance, you can split on the same covariate farther down the tree (at a different threshold).

Splitting of the data continues until the terminal nodes are too small or too few to be split, or until no gain (i.e., no further reduction in impurity) can be made with more splitting.

In this example, we chose split="deviance" as our criterion for splitting. Our final tree had a deviance of about 1763 and a residual mean deviance of about 1.044 (= 1763 / 1689).

> deviance(tree.output)

The deviance is calculated from the set of specific Bernoulli models represented at the end nodes of the given tree.

Deviance in the 0-1 response case: 2 classes at each end node.

At a given end node $i$, there are $n_i$ observations. For node $i$, let $n_{i0}$ = # of no's (coded 0), $n_{i1}$ = # of yes's (coded 1), and $n_i = n_{i0} + n_{i1}$.

$Y_i \mid x_i \sim \text{Bernoulli}(p_{i1})$, where $x_i$ represents the predictors used in the tree, and $p_{i1}$ is the probability of a yes.

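All individuals at the same end node have the same $p_{i1}$.

Likelihood for the $n_i$ observations (one node):

\[
L(p_{i1}) = \prod_{j=1}^{n_i} \left[ p_{i1}^{y_j} (1 - p_{i1})^{(1 - y_j)} \right] = p_{i1}^{n_{i1}} (1 - p_{i1})^{n_{i0}}
\]

Deviance for node $i$ is $-2$ times the log-likelihood:

\[
\begin{aligned}
-2LL(p_{i1}) &= -2 \log\left[ p_{i1}^{n_{i1}} (1 - p_{i1})^{n_{i0}} \right] \\
&= -2 \left[ n_{i1} \log(p_{i1}) + n_{i0} \log(1 - p_{i1}) \right] \\
&\quad \text{\{inputting the } \hat{p}_{i1} \text{ estimate\}} \\
&= -2 n_i \left[ \frac{n_{i1}}{n_i} \log(\hat{p}_{i1}) + \frac{n_{i0}}{n_i} \log(1 - \hat{p}_{i1}) \right] \\
&= -2 n_i \left[ \hat{p}_{i1} \log(\hat{p}_{i1}) + \hat{p}_{i0} \log(\hat{p}_{i0}) \right] = D_i
\end{aligned}
\]

Deviance, $D$, for the whole tree:

\[
D = \sum_i D_i = \sum_i -2 n_i \left[ \hat{p}_{i1} \log(\hat{p}_{i1}) + \hat{p}_{i0} \log(\hat{p}_{i0}) \right]
\]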
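The equivalence between the observation-level likelihood and the closed-form node deviance can be checked numerically. The sketch below uses a made-up node of ten 0-1 responses (not data from the smoking example) and compares $-2$ times the summed Bernoulli log-likelihood with $-2 n_i [\hat{p}_{i1}\log\hat{p}_{i1} + \hat{p}_{i0}\log\hat{p}_{i0}]$.

```r
# Hypothetical 0-1 responses at a single end node.
y  <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
n  <- length(y)
p1 <- mean(y)   # estimated probability of a yes
p0 <- 1 - p1    # estimated probability of a no

# Route 1: -2 * sum of Bernoulli log-likelihood contributions.
dev_from_likelihood <- -2 * sum(dbinom(y, size = 1, prob = p1, log = TRUE))

# Route 2: closed form derived for a node.
dev_closed_form <- -2 * n * (p1 * log(p1) + p0 * log(p0))

print(c(likelihood = dev_from_likelihood, closed_form = dev_closed_form))
print(all.equal(dev_from_likelihood, dev_closed_form))
```

Both routes give the same value, which is the point of the last step of the derivation: replacing counts by $n_i \hat{p}$ does not change the deviance.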

And for the smoking example with 4 nodes...

\[
\begin{aligned}
&-2(343)\left[ 0.2012 \log(0.2012) + 0.7988 \log(0.7988) \right] \\
&-2(245)\left[ 0.2980 \log(0.2980) + 0.7020 \log(0.7020) \right] \\
&-2(341)\left[ 0.4868 \log(0.4868) + 0.5132 \log(0.5132) \right] \\
&-2(764)\left[ 0.1505 \log(0.1505) + 0.8495 \log(0.8495) \right] = \sum_i D_i \approx 1763
\end{aligned}
\]

> deviance(tree.output)

And the residual mean deviance...

\[
\frac{D}{n - \#\text{end nodes}} = \frac{1763}{1689} \approx 1.044
\]

Smaller deviance (less impurity) is better, and this will occur when you have a tree whose conditional $\hat{p}_{i1}$ values are closer to 0 or 1 compared to a tree that does not have this characteristic.

But the issue of overfitting the sample still exists for classification trees.
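The hand calculation for the smoking tree can be reproduced in a few lines of base R. This sketch uses only the leaf sizes and rounded $\hat{p}$ values shown in the calculation, so the total agrees with the reported deviance only up to rounding.

```r
# Leaf sizes and estimated smoking proportions for the four end nodes.
n_i <- c(343, 245, 341, 764)
p1  <- c(0.2012, 0.2980, 0.4868, 0.1505)
p0  <- 1 - p1

# Node deviances D_i = -2 * n_i * [p1*log(p1) + p0*log(p0)].
D_i <- -2 * n_i * (p1 * log(p1) + p0 * log(p0))

# Tree deviance and residual mean deviance D / (n - number of end nodes).
D <- sum(D_i)
resid_mean_dev <- D / (sum(n_i) - length(n_i))

print(round(D_i, 1))
print(round(D, 1))
print(round(resid_mean_dev, 3))
```

The node closest to $\hat{p} = 0.5$ (the 341 unmarried, lower-qualification, younger subjects) contributes the most deviance per observation, which matches the remark that leaves with $\hat{p}$ near 0 or 1 are the "purest".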

A tree shouldn't be so specific (i.e., have so many splits) that it only predicts well for the sample. The goal is for it to perform well for the general population of interest.

Cross-validation (training set/test set) can be used to decide where to prune the tree.

> cv.output=cv.tree(tree.output)
> plot(cv.output)

[Figure: cross-validated deviance (vertical axis) plotted against tree size (horizontal axis).]
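The idea behind choosing a tree size by cross-validation can be sketched without any packages: estimate $\hat{p}$ on the training folds, then score the held-out fold by its deviance. The sketch below is a deliberate simplification and is not the algorithm used by `cv.tree`: it uses simulated data (not the smoking survey), compares only a size-1 tree (no split) with a size-2 tree, and keeps the split point fixed at 51.5 rather than re-estimating it in each fold.

```r
set.seed(1)

# Simulated data (hypothetical): younger people smoke more often.
n      <- 200
age    <- runif(n, 20, 80)
smoker <- rbinom(n, 1, ifelse(age < 51.5, 0.33, 0.15))

# Assign each observation to one of K folds of equal size.
K    <- 5
fold <- sample(rep(1:K, length.out = n))

# Held-out deviance: -2 * log-likelihood of test outcomes under training p-hat.
heldout_dev <- function(y_test, p_hat) {
  p_hat <- pmin(pmax(p_hat, 1e-6), 1 - 1e-6)  # guard against log(0)
  -2 * sum(y_test * log(p_hat) + (1 - y_test) * log(1 - p_hat))
}

cv_dev <- c(size1 = 0, size2 = 0)
for (k in 1:K) {
  train <- fold != k
  test  <- !train

  # Size 1: a single p-hat for everyone.
  p_root <- mean(smoker[train])
  cv_dev["size1"] <- cv_dev["size1"] + heldout_dev(smoker[test], p_root)

  # Size 2: separate p-hat on each side of the fixed age split.
  p_young <- mean(smoker[train & age < 51.5])
  p_old   <- mean(smoker[train & age >= 51.5])
  p_test  <- ifelse(age[test] < 51.5, p_young, p_old)
  cv_dev["size2"] <- cv_dev["size2"] + heldout_dev(smoker[test], p_test)
}

print(cv_dev)  # total held-out deviance for each candidate size
```

The candidate size with the smaller held-out deviance would be preferred. A plot of these totals against size is the same kind of display that `plot(cv.output)` produces for the full sequence of pruned trees.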

> pruned.tree=prune.tree(tree.output, best=3)
> plot(pruned.tree)
> text(pruned.tree,pretty=0)

[Figure: pruned tree with three end nodes.]

Age < 51.5
  yes: Highest.Qualification: A Levels,Degree,Higher/Sub Degree
    yes: No
    no: No
  no: No

You can find other packages for plotting.

[Figure: alternative plot of the pruned tree. Node 1 splits on Age (≤ 51 vs. > 51); node 2 splits on Highest.Qualification (GCSE/CSE, GCSE/O Level, No Qualification, ONC/BTEC, Other/Sub Degree vs. A Levels, Degree, Higher/Sub Degree). Each end node shows the proportions of Yes and No: Node 3 (n = 586), Node 4 (n = 343), Node 5 (n = 764).]

Splits are chosen to minimize the deviance $D$.

R help says (kind of vaguely): "The split which maximizes the reduction in impurity is chosen, the data set split and the process repeated. Splitting continues until the terminal nodes are too small or too few to be split."

Hastie, et al. (2009) and other references mention using a cost-complexity function to choose the number of end nodes:

\[
C_\alpha(T) = \sum_i n_i m_i(T) + \alpha \cdot \text{size}(T)
\]

- $C$ is the cost function; its input is a tree.
- $\alpha$ is a tuning parameter.
- $\text{size}(T)$ is the number of end nodes in the tree.
- $\sum_i n_i m_i(T)$ is the measure of impurity for tree $T$, where lower is better.
- $\alpha \cdot \text{size}(T)$ is a penalty for too large a tree (it has the same flavor as BIC or AIC).
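The cost-complexity trade-off can be illustrated with the nested trees from the smoking example. The sketch below takes the tree deviance as the impurity term $\sum_i n_i m_i(T)$, which is an assumption made for illustration, and evaluates $C_\alpha(T)$ for trees with 1 to 4 end nodes. The smoker counts per node (423, 115, 308, 69, 239, 73, 166) are derived from the node sizes and proportions reported in these slides, and the $\alpha$ values are arbitrary.

```r
# Deviance of one node with n_yes smokers out of n subjects.
leaf_dev <- function(n_yes, n) {
  p <- n_yes / n
  -2 * n * (p * log(p) + (1 - p) * log(1 - p))
}

# Deviance of the nested trees with 1, 2, 3 and 4 end nodes.
dev <- c(
  leaf_dev(423, 1693),                                              # no split
  leaf_dev(115, 764) + leaf_dev(308, 929),                          # Age
  leaf_dev(115, 764) + leaf_dev(69, 343) + leaf_dev(239, 586),      # + Qualification
  leaf_dev(115, 764) + leaf_dev(69, 343) +
    leaf_dev(73, 245) + leaf_dev(166, 341)                          # + Marital status
)
size <- 1:4

# Cost-complexity C_alpha(T) = impurity + alpha * size, minimized over size.
best_size <- function(alpha) size[which.min(dev + alpha * size)]

alphas <- c(0, 30, 100)
chosen <- sapply(alphas, best_size)
print(round(dev, 1))
print(data.frame(alpha = alphas, best_size = chosen))
```

With no penalty the largest tree always wins, since deviance can only decrease with more splits. A moderate penalty selects the three-leaf tree, and a large penalty collapses the tree to the root.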

For a given $\alpha$, we expect a plot of $C_\alpha(T)$ vs. $\text{size}(T)$ to decrease initially with $\text{size}(T)$, hit a minimum, and then start to increase.

R help says $\alpha$ is determined algorithmically (see prune.tree), but I can't directly inspect the code, as it is written in C.

There is also a package called rpart (by Terry Therneau and Beth Atkinson, with the R port by Brian Ripley) that will do CART analysis. I've found other packages that will make prettier trees from rpart objects, such as the rpart.plot package.

References:

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.
