Data Mining Classification Trees (2)


1 Data Mining Classification Trees (2)
Ad Feelders, Universiteit Utrecht
September 14, 2017

2 Basic Tree Construction Algorithm

Construct tree:
  nodelist ← {{training sample}}
  Repeat
    current node ← select node from nodelist
    nodelist ← nodelist − current node
    if impurity(current node) > 0 then
      S ← candidate splits in current node
      s* ← arg max_{s ∈ S} impurity reduction(s, current node)
      child nodes ← apply(s*, current node)
      nodelist ← nodelist ∪ child nodes
    fi
  Until nodelist = ∅
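The select/split steps are easiest to see on a single numeric feature. Below is a minimal R sketch of the inner step, the arg max over candidate splits, using Gini impurity as the impurity measure (an assumption; the slides leave the measure abstract). The function names (gini, best_split) and the toy data are illustrative, not from the slides.

  # Gini impurity of a vector of class labels
  gini <- function(y) {
    p <- table(y) / length(y)
    1 - sum(p^2)
  }

  # Find the split threshold on numeric feature x that maximizes
  # the impurity reduction for labels y (search over midpoints)
  best_split <- function(x, y) {
    xs <- sort(unique(x))
    thresholds <- head(xs, -1) + diff(xs) / 2
    parent <- gini(y); n <- length(y)
    best <- list(threshold = NA, reduction = -Inf)
    for (c in thresholds) {
      left <- y[x <= c]; right <- y[x > c]
      child <- (length(left) * gini(left) + length(right) * gini(right)) / n
      if (parent - child > best$reduction)
        best <- list(threshold = c, reduction = parent - child)
    }
    best
  }

  x <- c(1, 2, 3, 10, 11, 12)
  y <- c("a", "a", "a", "b", "b", "b")
  best_split(x, y)   # threshold 6.5, impurity reduction 0.5

A full tree grower applies this to every feature in every impure node, exactly as in the pseudocode above.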

3 Overfitting and Pruning
We continue splitting until all leaf nodes of T contain examples of a single class (i.e. resubstitution error R(T) = 0).
Is this a good tree for predicting the class of new examples? Not unless the problem is truly deterministic! This is the problem of overfitting.

4 Proposed Solutions
- Stopping rules: e.g. don't expand a node if the impurity reduction of the best split is below some threshold.
- Pruning: grow a very large tree T_max and merge back nodes.

5 Stopping Rules
Disadvantage: sometimes you first have to make a weak split to be able to follow up with a good split. Since we only look one step ahead, we may miss the good follow-up split.
[Figure: scatter plot of the training data on axes x1 and x2.]

6 Pruning
To avoid the problem of stopping rules, we first grow a very large tree on the training sample, and then prune this large tree.
Objective: select the pruned subtree that has the lowest true error rate.
Problem: how do we find this pruned subtree?

7 Pruning Methods
- Cost-complexity pruning (Breiman et al.; CART), also called weakest-link pruning.
- Reduced-error pruning (Quinlan).
- Pessimistic pruning (Quinlan; C4.5).
- ...

8 Terminology: Tree T
[Figure: binary tree with root t1; t1 has children t2 and t3, t2 has children t4 and t5, t3 has children t6 and t7, t4 has children t8 and t9.]
T̃ denotes the collection of leaf nodes of tree T.
T̃ = {t5, t6, t7, t8, t9}, |T̃| = 5

9 Terminology: Pruning T in node t2
[Figure: the same tree, with the branch below t2 (nodes t4, t5, t8, t9) marked for removal.]

10 Terminology: T after pruning in t2: T − T_t2
[Figure: the tree that remains: t1, t2, t3, t6, t7, with t2 now a leaf node.]

11 Terminology: Branch T_t2
[Figure: the branch rooted at t2, consisting of t2, t4, t5, t8, t9.]
T̃_t2 = {t5, t8, t9}, |T̃_t2| = 3

12 Cost-complexity pruning
The total number of pruned subtrees of a balanced binary tree with l leaves grows exponentially in l (roughly like 1.5^l). With just 40 leaf nodes we already have approximately 12 million pruned subtrees, so exhaustive search is not recommended.
Basic idea of cost-complexity pruning: reduce the number of pruned subtrees we have to consider by selecting the ones that are the best of their kind (in a sense to be defined shortly...).
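A one-line R check of the 12 million figure; the constant 1.5028369 is the growth rate I recall from the CART literature, so treat it as an assumption:

  # Number of pruned subtrees grows roughly like 1.5^l;
  # with l = 40 leaves this is already on the order of 10^7
  1.5028369^40   # about 1.2e7, the "approximately 12 million" above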

13 Total cost of a tree
Strike a balance between fit and complexity. Total cost C_α(T) of tree T:

  C_α(T) = R(T) + α|T̃|

Total cost consists of two components: the resubstitution error R(T), and a penalty α|T̃| (α ≥ 0) for the complexity of the tree.
Note:
  R(T) = (number of wrong classifications made by T) / (number of examples in the training sample)
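In code the trade-off is a one-liner; this tiny helper (my own naming, not from the slides) makes the hand computations on the later slides easy to replay:

  # Total cost of a tree: resubstitution error plus a complexity
  # penalty proportional to the number of leaf nodes
  total_cost <- function(resub_error, n_leaves, alpha) {
    resub_error + alpha * n_leaves
  }

  # Example: a pure 3-leaf branch vs. collapsing it to one leaf with R = 3/10
  total_cost(0,    3, alpha = 3/20)   # 0.45
  total_cost(3/10, 1, alpha = 3/20)   # 0.45, equal cost at alpha = 3/20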

14 Tree with lowest total cost
Depending on the value of α, different pruned subtrees will have the lowest total cost.
For α = 0 (no complexity penalty) the tree with the smallest resubstitution error wins. For higher values of α, a less complex tree that makes a few more errors might win.
As it turns out, we can find a nested sequence of pruned subtrees of T_max such that the trees in the sequence minimize total cost for consecutive intervals of α values.

15 Smallest minimizing subtree
For any value of α, there exists a smallest minimizing subtree T(α) of T_max that satisfies the following conditions:
1. C_α(T(α)) = min over T ≤ T_max of C_α(T) (that is, T(α) minimizes total cost for that value of α).
2. If C_α(T) = C_α(T(α)) then T(α) ≤ T (that is, T(α) is a pruned subtree of all trees that minimize total cost).
Note: T' ≤ T means T' is a pruned subtree of T, i.e. it can be obtained by pruning T in 0 or more nodes.

16 Sequence of subtrees
Construct a decreasing sequence of pruned subtrees of T_max:
  T_max ≥ T_1 > T_2 > T_3 > ... > {t1}
(where t1 is the root node of the tree) such that T_k is the smallest minimizing subtree for α ∈ [α_k, α_{k+1}).
Note: From a computational viewpoint, the important property is that T_{k+1} is a pruned subtree of T_k, i.e. it can be obtained by pruning T_k. No backtracking is required.

17 Decomposition of total cost
Total cost has an additive decomposition over the leaf nodes of a tree:

  C_α(T) = Σ_{t ∈ T̃} (R(t) + α)

R(t) is the number of errors we make in node t if we predict the majority class, divided by the total number of observations in the training sample.

18 Finding the T_k and corresponding α_k
T_t: branch of T with root node t. After pruning in t, its contribution to total cost is
  C_α({t}) = R(t) + α.
The contribution of T_t to the total cost is
  C_α(T_t) = R(T_t) + α|T̃_t|, where R(T_t) = Σ_{t' ∈ T̃_t} R(t').
T − T_t becomes better than T when C_α({t}) = C_α(T_t).

19 Computing contributions to total cost of T_t2
[Figure: the example tree with class counts in each node; the leaves of T_t2 (t5, t8, t9) are all pure.]
C_α({t2}) = R(t2) + α = 3/10 + α
C_α(T_t2) = R(T_t2) + α|T̃_t2| = Σ_{t' ∈ T̃_t2} R(t') + α|T̃_t2| = 0 + 3α

20 Solving for α
The total cost of T and T − T_t become equal when C_α({t}) = C_α(T_t). At what value of α does this happen?
  R(t) + α = R(T_t) + α|T̃_t|
Solving for α we get
  α = (R(t) − R(T_t)) / (|T̃_t| − 1)

21 Computing g(t): the critical α value for node t
For each non-terminal node t we compute its critical alpha value:
  g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1)
In words:
  g(t) = (increase in error due to pruning in t) / (decrease in number of leaf nodes due to pruning in t)
Subsequently, we prune in the nodes for which g(t) is smallest (the "weakest links"). This process is repeated until we reach the root node.
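A direct transcription of g(t) into R, replayed on the t2 example from the surrounding slides (the function and argument names are mine):

  # Critical alpha value for pruning in node t:
  # increase in resubstitution error divided by decrease in leaf count
  g <- function(R_t, R_branch, n_branch_leaves) {
    (R_t - R_branch) / (n_branch_leaves - 1)
  }

  # Node t2: R(t2) = 3/10, the branch below it is pure (R = 0) with 3 leaves
  g(R_t = 3/10, R_branch = 0, n_branch_leaves = 3)   # 0.15 = 3/20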

22 Computing g(t): the critical α value for node t
[Figure: the example tree with class counts in each node.]
g(t1) = 1/8, g(t2) = 3/20, g(t3) = 1/20, g(t4) = 1/20

23 Finding the weakest links
[Figure: the same tree, with the weakest links t3 and t4 (g = 1/20) highlighted.]
g(t1) = 1/8, g(t2) = 3/20, g(t3) = 1/20, g(t4) = 1/20

24 Pruning in the weakest links
[Figure: the pruned tree with nodes t1, t2, t3, t4, t5; t3 and t4 are now leaf nodes.]
By pruning in the weakest links we obtain the next tree in the sequence.

25 Repeating the same procedure
[Figure: the pruned tree, whose internal nodes are t1 and t2.]
g(t1) = 2/10, g(t2) = 1/4.

26 Going back to the root
[Figure: only the root node t1 remains.]
We have arrived at the root, so we're done.

27 The best tree for α ∈ [0, 1/20)
[Figure: the full tree T_1 with leaves t5, t6, t7, t8, t9.]
The big tree is the best for values of α below 1/20.

28 The best tree for α ∈ [1/20, 2/10)
[Figure: the tree T_2 with leaves t3, t4, t5.]
When α reaches 1/20 this tree becomes the best.

29 The best tree for α ∈ [2/10, ∞)
[Figure: the root node t1 alone.]
When α reaches 2/10 the root wins and we're done.

30 Compute Tree Sequence

  T_1 ← T(α = 0); α_1 ← 0; k ← 1
  While T_k > {t1} do
    For all non-terminal nodes t ∈ T_k:
      g_k(t) ← (R(t) − R(T_k,t)) / (|T̃_k,t| − 1)
    α_{k+1} ← min_t g_k(t)
    Visit the nodes in top-down order and prune whenever g_k(t) = α_{k+1}, to obtain T_{k+1}
    k ← k + 1
  od

Note: T_k,t is the branch of T_k with root node t, and T_k is the pruned tree in iteration k.
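To make the loop concrete, here is a minimal R sketch run on the 9-node example tree from the preceding slides. The flat-table representation, the helper names, and the node errors R(t) are my own reconstruction, derived from the g values quoted above (R(t1) = 1/2, R(t2) = 3/10, R(t3) = R(t4) = 1/20, all leaves pure), not code from the slides; it also prunes all tied weakest links at once rather than visiting them top-down.

  tree <- data.frame(
    id    = 1:9,
    left  = c(2, 4, 6, 8, NA, NA, NA, NA, NA),
    right = c(3, 5, 7, 9, NA, NA, NA, NA, NA),
    r     = c(1/2, 3/10, 1/20, 1/20, 0, 0, 0, 0, 0)
  )

  is_leaf <- function(tr, i) is.na(tr$left[i])

  branch_leaves <- function(tr, i) {   # ids of the leaf nodes below node i
    if (is_leaf(tr, i)) return(i)
    c(branch_leaves(tr, tr$left[i]), branch_leaves(tr, tr$right[i]))
  }

  g <- function(tr, i) {               # critical alpha value of internal node i
    lv <- branch_leaves(tr, i)
    (tr$r[i] - sum(tr$r[lv])) / (length(lv) - 1)
  }

  # Repeatedly prune in all nodes that attain the minimal g (the weakest links)
  while (!is_leaf(tree, 1)) {
    internal <- which(!is.na(tree$left))
    gs <- sapply(internal, function(i) g(tree, i))
    weakest <- internal[gs == min(gs)]
    cat("alpha =", min(gs), "- pruning in nodes", weakest, "\n")
    tree$left[weakest] <- NA           # pruning in t turns t into a leaf
    tree$right[weakest] <- NA
  }
  # Printed sequence: alpha = 0.05 (prune t3, t4), then alpha = 0.2 (prune t1),
  # matching the intervals [0, 1/20), [1/20, 2/10), [2/10, Inf) above.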

31 Algorithm to compute T_1 from T_max
If we don't continue splitting until all nodes are pure, then T_1 = T(α = 0) may not be the same as T_max.

Compute T_1 from T_max:
  T ← T_max
  Repeat
    Pick any pair of terminal nodes l and r with common parent t in T
    such that R(t) = R(l) + R(r), and set T ← T − T_t (i.e. prune T in t)
  Until no more such pair exists
  T_1 ← T
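Under the same flat-table representation as the earlier sketch, this collapse step is a small fixed-point loop; a hedged sketch (the helper name is mine, and the equality test on error rates is exact, as in the slide):

  # Repeatedly prune any internal node whose two children are leaves
  # and whose own error equals the sum of its children's errors
  collapse_equal <- function(tr) {
    repeat {
      cand <- which(!is.na(tr$left) &
                    is.na(tr$left[tr$left]) & is.na(tr$left[tr$right]) &
                    tr$r == tr$r[tr$left] + tr$r[tr$right])
      if (length(cand) == 0) return(tr)
      tr$left[cand[1]] <- NA; tr$right[cand[1]] <- NA
    }
  }

  # Usage sketch: tree1 <- collapse_equal(tree_max)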

32 Selection of the final tree: using a test set
Pick the tree T from the sequence with the lowest error rate R^ts(T) on the test set. This is an estimate of the true error rate R*(T) of T. The standard error of this estimate is

  SE(R^ts) = sqrt( R^ts (1 − R^ts) / n_test )
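The standard error is just the binomial formula; a small sketch (my own helper name and illustrative numbers) that also computes the cutoff used by the 1-SE rule on the next slide:

  # Standard error of a test-set error-rate estimate (binomial formula)
  se_error <- function(r_ts, n_test) sqrt(r_ts * (1 - r_ts) / n_test)

  # Suppose the best tree in the sequence has 20% test error on 200 cases
  r_min <- 0.20
  se    <- se_error(r_min, 200)   # about 0.028
  r_min + se                      # 1-SE cutoff: pick the smallest tree
                                  # with test error below about 0.228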

33 Selection of the final tree: the 1-SE rule
[Figure: estimated error rate against size of tree in the sequence, with the selected tree marked.]
1-SE rule: select the smallest tree with R^ts within one standard error of the minimum.

34 Selection of the final tree: using cross-validation
When the data set is relatively small, it is a bit of a waste to set aside part of the data for testing. A way to avoid this problem is to use cross-validation.

35 v-fold cross-validation (general)
Let C be a complexity parameter of a learning algorithm (like α in the classification tree algorithm). To select the best value of C from a range of values c_1, ..., c_m we proceed as follows (see the sketch after this list):
1. Divide the data into v groups G_1, ..., G_v.
2. For each value c_i of C:
   a. For j = 1, ..., v:
      i. Train with C = c_i on all data except group G_j.
      ii. Predict on group G_j.
   b. Compute the CV prediction error for C = c_i.
3. Select the value c* of C with the smallest CV prediction error.
4. Train on the complete training sample with C = c*.
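A minimal R sketch of this recipe for rpart's cp parameter; the data set, response column, and cp grid in the usage lines are illustrative assumptions:

  library(rpart)

  cv_error <- function(formula, data, cp_grid, v = 10) {
    folds <- sample(rep(1:v, length.out = nrow(data)))   # random fold labels
    sapply(cp_grid, function(cp) {
      errs <- sapply(1:v, function(j) {
        fit  <- rpart(formula, data = data[folds != j, ],
                      method = "class", cp = cp)
        pred <- predict(fit, newdata = data[folds == j, ], type = "class")
        sum(pred != data[folds == j, all.vars(formula)[1]])
      })
      sum(errs) / nrow(data)   # CV error rate for this cp value
    })
  }

  # Usage sketch: pick the cp with the smallest CV error, then refit on all data
  # cp_grid <- c(0.001, 0.01, 0.02, 0.05)
  # err  <- cv_error(class ~ ., pima.dat, cp_grid)
  # best <- cp_grid[which.min(err)]
  # final <- rpart(class ~ ., data = pima.dat, method = "class", cp = best)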

36 Using cross-validation: Step 1
Construct a tree on the full data set, and compute α_1, α_2, ..., α_K and T_1 > T_2 > ... > T_K. Estimate the error of a tree T_k from this sequence as follows.
Set β_1 = 0, β_2 = sqrt(α_2 α_3), β_3 = sqrt(α_3 α_4), ..., β_{K−1} = sqrt(α_{K−1} α_K), β_K = ∞.
β_k is a typical value for [α_k, α_{k+1}).
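The β values are geometric means of consecutive α values; a two-line sketch, using the example α sequence from the earlier slides (variable names are mine, and my understanding is that rpart applies the same convention to its CP values):

  # Geometric means of consecutive alpha values, with beta_1 = 0 and beta_K = Inf
  alpha <- c(0, 1/20, 2/10)                      # alpha_1, ..., alpha_K
  beta  <- c(0, sqrt(head(alpha[-1], -1) * tail(alpha, -2)), Inf)
  beta                                           # 0.0 0.1 Inf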

37 Using cross-validation: Step 2
Divide the data set into v groups G_1, G_2, ..., G_v (of approximately equal size), and for each group G_j:
1. Build a tree on all data except G_j, and determine the smallest minimizing subtrees T^(j)(β_1), T^(j)(β_2), ..., T^(j)(β_K) for this reduced data set.
2. Compute the error of T^(j)(β_k) (k = 1, ..., K) on G_j.

38 Using cross-validation: Step 3
1. For each β_k, sum the errors of T^(j)(β_k) over G_j (j = 1, ..., v).
2. Let β_h be the one with the lowest overall error. Select T_h as the best tree.
3. Use the error rate computed with cross-validation as an estimate of its error rate.
Remark: alternatively, we could again use the 1-SE rule in the final step to select the final tree from the sequence.

39 Using cross-validation: Step 1
Tree sequence constructed on the full data set:
- T_1 is the best tree for α ∈ [0, 1/20).
- T_2 is the best tree for α ∈ [1/20, 2/10).
- T_3 is the best tree for α ∈ [2/10, ∞).
Set:
- β_1 = 0, value corresponding to T_1
- β_2 = sqrt(1/20 × 2/10) = 1/10, value corresponding to T_2
- β_3 = ∞, value corresponding to T_3 (the root).

40 Using cross-validation: Step 2
Divide the data set into v = 4 groups G_1, G_2, G_3, G_4 of size 50 each.
First CV run:
1. Build a tree on all data except G_1, and determine the smallest minimizing subtrees T^(1)(0), T^(1)(1/10) and T^(1)(∞).
2. Compute the error of those trees on G_1.
Repeat this procedure for G_2, G_3 and G_4.

41 Using cross-validation: Step 3
[Table: number of errors per CV run for β_1 = 0, β_2 = 1/10 and β_3 = ∞, with column totals.]
β_2 wins (40 errors in total), so T_2 gets selected. We estimate the error rate of T_2 at 40/200 = 20%.

42 Building Trees in R: Rpart
Pima Indians Diabetes Database from the UCI ML Repository:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Class distribution (class value 1 is interpreted as "tested positive for diabetes"): 500 instances of class 0, 268 instances of class 1.

43 Building Trees in R: Rpart

> pima.dat[1:5,]
  npreg plasma bp triceps serum bmi pedigree age class
> library(rpart)
> pima.tree <- rpart(class ~ ., data = pima.dat, cp = 0, minbucket = 1, minsplit = 2, method = "class")
> printcp(pima.tree)

Classification tree:
rpart(formula = class ~ ., data = pima.dat, method = "class",
    cp = 0, minbucket = 1, minsplit = 2)

Variables actually used in tree construction:
[1] age bmi bp npreg pedigree plasma serum triceps

Root node error: 268/768 = 0.34896

n= 768

44 cptable: the pruning sequence

  CP  nsplit  rel error  xerror  xstd

CP is α divided by the resubstitution error in the root node.
Example: the tree with 2 splits is best for CP values in the corresponding half-open interval of the table; its cross-validation error rate is its xerror value multiplied by the root node error (0.34896).
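Selecting a cp value from this table is usually done programmatically; a sketch of both the minimum-xerror rule and the 1-SE rule (the column names are the actual cptable columns, and pima.tree is the object fitted on the previous slide):

  cptab <- pima.tree$cptable

  # Rule 1: cp value of the tree with the smallest cross-validation error
  best   <- which.min(cptab[, "xerror"])
  cp_min <- cptab[best, "CP"]

  # Rule 2 (1-SE): smallest tree whose xerror is within one standard
  # error of the minimum (rows are ordered from smallest to largest tree)
  cutoff <- cptab[best, "xerror"] + cptab[best, "xstd"]
  cp_1se <- cptab[min(which(cptab[, "xerror"] <= cutoff)), "CP"]

  pima.pruned <- prune(pima.tree, cp = cp_1se)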

45 Plot of pruning sequence
[Figure: plot of the pruning sequence: X-val relative error against cp (starting at Inf), with size of tree on the top axis.]

46 Selecting the best tree

> pima.pruned <- prune(pima.tree, cp = 0.02)
Endpoint = class
> post(pima.pruned)

[Figure: the pruned tree drawn by post(): the root (predict 0; 500/268) splits on plasma, and one of its children splits further on bmi.]
