Knowledge Discovery and Data Mining
Lecture 06 - Regression & Decision Trees
Tom Kelsey, School of Computer Science, University of St Andrews
http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk
Tom Kelsey ID5059-06-RDT 13 Feb 2015 1 / 31
Some preliminary jargon
Training data: the cases used to construct a tree
Validation data: unseen cases used to assess it
Resubstitution error: the error rate of a tree on the cases from which it was constructed
Generalisation error: the error rate of a tree on unseen cases
Purity: how homogeneous (un-mixed-up) the training data is at a node
Tree Generation
All tree methods follow this basic scheme:
We have a set of mixed-up data, so immediately we need some measure of how mixed-up the data is
Find the covariate-value pair (x_j, s_j) that produces the most separation in the data X, y
Split the data into two subsets: rows in which x_j < s_j, and rows in which x_j >= s_j
The data in each subset is less mixed-up
Each split forms the root of a new tree
Recurse by repeating for each subtree
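The scheme above can be sketched in a few dozen lines of Python. This is a minimal illustration, not any library's implementation; all names (`best_split`, `grow`, `gini_impurity`) are invented for the sketch, and `impurity` stands in for whichever mixed-up-ness measure a method uses.

```python
def gini_impurity(labels):
    """One example impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y, impurity):
    """Return the (covariate index, value) pair (x_j, s_j) giving the largest
    drop in size-weighted impurity, or None if no split helps."""
    n = len(y)
    best, best_gain = None, 0.0
    for j in range(len(X[0])):                       # each covariate x_j
        for s in sorted(set(row[j] for row in X)):   # each candidate value s_j
            left = [y[i] for i in range(n) if X[i][j] < s]
            right = [y[i] for i in range(n) if X[i][j] >= s]
            if not left or not right:
                continue
            gain = (impurity(y) - len(left) / n * impurity(left)
                    - len(right) / n * impurity(right))
            if gain > best_gain:
                best_gain, best = gain, (j, s)
    return best

def grow(X, y, impurity, min_size=4):
    """Recurse until no split helps or the node is small; leaves keep the cases."""
    split = best_split(X, y, impurity)
    if split is None or len(y) <= min_size:          # simple stopping rule
        return ("leaf", y)
    j, s = split
    L = [i for i in range(len(y)) if X[i][j] < s]
    R = [i for i in range(len(y)) if X[i][j] >= s]
    return ("node", j, s,
            grow([X[i] for i in L], [y[i] for i in L], impurity, min_size),
            grow([X[i] for i in R], [y[i] for i in R], impurity, min_size))
```

Note that the three choices the methods differ on (the impurity measure, the stopping rule, and what a leaf stores) appear as the `impurity` argument, the `min_size` test, and the `("leaf", y)` node respectively.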
Tree Generation
The methods differ based on three choices:
How "mixed-up" is measured: we need to measure randomness or, equivalently, levels of node purity
How we decide when to stop splitting: the heuristics for this are common to all methods, often as simple as "stop when 4 or fewer items are in the subset"
How we condense the instances that fall into each split, i.e. what is the actual output at a terminal node (predictions)
Regression Trees
Classical statistical approach:
Mixed-up is measured by standard deviation (or any measure of variability)
In ANOVA terms, find nodes with minimal within-node variance and hence maximal between-node variance
Condense using the average of the instances that fall into each split (i.e. predict with the mean)
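A sketch of this split criterion for a single numeric covariate: the quality of a threshold is the drop in size-weighted standard deviation, and a leaf predicts with the mean. The names `sd_reduction` and `leaf_prediction` are illustrative, not from any library.

```python
import statistics

def sd(values):
    """Population standard deviation; zero for a single value."""
    return statistics.pstdev(values) if len(values) > 1 else 0.0

def sd_reduction(x, y, s):
    """Drop in size-weighted SD from splitting covariate x at threshold s."""
    left = [yi for xi, yi in zip(x, y) if xi < s]
    right = [yi for xi, yi in zip(x, y) if xi >= s]
    n = len(y)
    return sd(y) - len(left) / n * sd(left) - len(right) / n * sd(right)

def leaf_prediction(y):
    """Condense a terminal node by predicting the mean response."""
    return sum(y) / len(y)
```

For a response that falls into two tight clusters, the threshold separating them reduces the SD to zero in both children, which is the regression-tree analogue of a pure node.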
Decision Trees
Information Theory approach:
Mixed-up is measured by amount of (im)purity
Condense via majority class (predict the most common)
Purity can be measured many ways: entropy, the Gini index & the twoing rule
Gini can produce pure but small nodes; the twoing rule is a tradeoff between purity and equality of the amount of data on either side of the split
Twoing isn't part of the R package rpart that implements CART, so we don't consider it in detail
We also need pruning criteria, since the plan is always to (a) construct a big (overfitting) tree, then (b) reduce tree complexity to get a good tradeoff between resubstitution error and generalisation error
Before we look at tree construction we need to learn more about randomness and the differences between categorical and numeric data
Purity
A table or subtable is pure if it contains only one class
In regression tree terminology, the SD of the outputs is zero
In classification tree terminology, all cases at the node belong to a single class
We split and resplit in order to increase node purity
Complete tree purity is analogous to overfitting
Standard Deviations
Types of correct/incorrect classification
When considering a two-class problem, the range of prediction outcomes is small:
False positive or false negative
Correct negative or correct positive
For a multi-class problem this range of outcomes is considerably larger:
Incorrectly predicting the class as j when the correct class is i (for i, j in {1, ..., J}, i != j)
Correctly predicting the class as j (j in {1, ..., J})
So we have J types of correct classification, and J^2 - J types of incorrect classification
Purity of nodes
Intuitively, we want to optimise for some measure of the purity of nodes
Consider a J-level categorical response variable. A node gives a vector of proportions p = [p_1, ..., p_J] for the levels of our response
Sum over j of p_j is 1, so the vector p is a probability distribution of the response classes within the node
Node purity
We can list some desirable/necessary properties of an impurity measure, which will be a function φ(p) of these proportions
φ(p) will be a maximum when p = [1/J, ..., 1/J]. This is our definition of the least pure node, i.e. there is an equal mixture of all classes
φ(p) will be a minimum when p_j = 1 for some j (and therefore all the others are zero). This is our definition of the most pure node: only one class exists
Node purity
Our measure of the impurity of a node t will be given by i(t) = φ(p_1|t, ..., p_J|t), where p_j|t is the proportion of class j within node t
A measure of the decrease of impurity resulting from splitting node t into a left node t_L and a right node t_R will be given by
δi(t) = i(t) - p_L i(t_L) - p_R i(t_R)
where p_L and p_R are the proportions of points in t that go to the left and right respectively
Logarithms: refresher
b^x = y if and only if log_b y = x
To be precise one needs to distinguish special cases: y cannot be 0, and the log of a negative number is complex; neither is needed here
Inverse of exponentiation: log_b(b^x) = x = b^(log_b x)
Base change: log_a x = log_b x / log_b a (this is why you don't need a log_2 button on your calculator)
Useful identity: log_b(x^(-1)) = -log_b x
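These identities can be checked numerically; the value x = 5.0 and the bases here are arbitrary choices for illustration.

```python
import math

x = 5.0

# Inverse of exponentiation: log_b(b^x) = x and b^(log_b x) = x (b = 2)
assert math.isclose(math.log2(2 ** x), x)
assert math.isclose(2 ** math.log2(x), x)

# Base change: log_2 x = log_10 x / log_10 2, so a base-10 button suffices
assert math.isclose(math.log2(x), math.log10(x) / math.log10(2))

# log_b(x^(-1)) = -log_b(x)
assert math.isclose(math.log2(1 / x), -math.log2(x))
```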
Entropy Definition
We work in base 2, taking the bit as our unit. We can now precisely define the entropy of a set of output classes as:
H(p_1, ..., p_n) = Σ p_i log_2(1/p_i) = -Σ p_i log_2 p_i
Example
Class        Bus  Car  Train
Probability  0.4  0.3  0.3
H(0.4, 0.3, 0.3) = -0.4 log_2 0.4 - 0.3 log_2 0.3 - 0.3 log_2 0.3 ≈ 1.571
We say our output class data has entropy of about 1.57 bits.
Example
Class        Bus  Car  Train
Probability  0    1    0
H(0, 1, 0) = -1 log_2 1 = 0 (using the convention 0 log_2 0 = 0)
We say our output class data has zero entropy, meaning zero randomness
Example
Class        Bus  Car  Train
Probability  1/3  1/3  1/3
H(1/3, 1/3, 1/3) = -3 × (1/3) log_2(1/3) = 3 × 0.528321 ≈ 1.584963 = log_2 3
We say our output class data has maximum entropy for three classes, meaning the most randomness
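The three worked examples can be verified with a short function (an illustrative sketch, using the convention 0 log_2 0 = 0):

```python
import math

def entropy(probs):
    """H(p) = -sum p_i log2 p_i, skipping zero terms (0 * log2 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.4, 0.3, 0.3]))    # mixed: about 1.571 bits
print(entropy([0.0, 1.0, 0.0]))    # pure: 0 bits
print(entropy([1/3, 1/3, 1/3]))    # uniform: log2(3), about 1.585, the maximum
```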
Gini Index
One minus the sum of the squared output probabilities: 1 - Σ p_j²
In our example, 1 - (0.4² + 0.3² + 0.3²) = 0.660
Minimum Gini index is zero
Maximum Gini index is 1 - n(1/n)² = 1 - 1/n: two thirds in our example (n = 3)
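The same three examples under the Gini index (again an illustrative sketch):

```python
def gini(probs):
    """Gini index: 1 minus the sum of the squared class proportions."""
    return 1 - sum(p * p for p in probs)

print(gini([0.4, 0.3, 0.3]))    # 0.66, as in the example
print(gini([0.0, 1.0, 0.0]))    # 0: a pure node
print(gini([1/3, 1/3, 1/3]))    # 1 - 1/3 = 2/3: maximum for three classes
```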
Entropy and Gini compared
Node impurity measures versus class proportion for 2-class problem
Misclassification error rate
Defined as the number of incorrect classifications divided by the number of all classifications
Hence, using terminology from Lecture 4, equal to 1 minus the accuracy of a classification predictor:
MER = 1 - (a + d)/(a + b + c + d)
In the context of analysing nodes, this is 1 minus the maximum proportion in p = [p_1, ..., p_J]:
MER = 1 - max_j(p_j)
From the chart, entropy and Gini capture more of the notion of node impurity, and so are the preferred measures for tree growth
Misclassification is used extensively in tree pruning
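For the two-class case the three measures can be compared directly. This sketch tabulates them at a few illustrative proportions p; all three vanish at a pure node and peak at p = 0.5, but entropy and Gini are strictly concave, which is why they distinguish degrees of impurity that MER cannot.

```python
import math

def entropy(p):
    """Two-class entropy in bits, with 0 * log2(0) = 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gini(p):
    """Two-class Gini index."""
    return 1 - (p * p + (1 - p) ** 2)

def mer(p):
    """Misclassification error rate: 1 minus the majority proportion."""
    return 1 - max(p, 1 - p)

for p in (0.5, 0.7, 0.9):
    print(p, round(entropy(p), 3), round(gini(p), 3), round(mer(p), 3))
```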
Binary output, numeric attributes, Gini index
Worked example from the published literature, with incomplete calculations
Shows how to obtain the Gini gain at a potential split position
See RegTree.xlsx
Data from a study published in the European Journal of Cancer
The current gold-standard diagnostic predictor is age
Can AMH improve prediction? If so, by how much?
Study Cohort Pretreatment anti-müllerian hormone predicts for loss of ovarian function after chemotherapy for early breast cancer. RA Anderson, M Rosendahl, TW Kelsey and DA Cameron, European Journal of Cancer 49(16): 3404-3411, 2013 Tom Kelsey ID5059-06-RDT 13 Feb 2015 22 / 31
Our Data
Binary output, numeric attributes, Gini index: Recipe
1. Use a pivot table to get output (i.e. A & M) proportions; the base Gini index is 1 - (p(A)² + p(M)²)
2. Choose a split position for covariate AMH
3. Work out the numbers of A & M above and below the split
4. Create a contingency matrix
5. Gini gain is: original Gini - p(above) × Gini(above) - p(below) × Gini(below)
6. Repeat for all other candidate split positions
7. Repeat for Age instead of AMH
8. Select the split position with the greatest Gini gain
9. Start all over again at the split nodes
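The recipe above can be sketched as follows. The AMH values and 'A'/'M' outcomes here are invented for illustration and are not the study cohort, and the function names are assumptions, not taken from RegTree.xlsx.

```python
def gini_counts(counts):
    """Base Gini index from class counts: 1 - sum (n_j / n)^2 (recipe step 1)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_gain(values, labels, split):
    """Steps 2-5: Gini gain for splitting a numeric covariate at `split`,
    i.e. base Gini minus the size-weighted Gini of the two halves."""
    classes = sorted(set(labels))
    below = [l for v, l in zip(values, labels) if v < split]
    above = [l for v, l in zip(values, labels) if v >= split]
    counts = lambda ls: [ls.count(c) for c in classes]   # contingency row
    n = len(labels)
    return (gini_counts(counts(labels))
            - len(below) / n * gini_counts(counts(below))
            - len(above) / n * gini_counts(counts(above)))

# Toy AMH-like covariate with a binary outcome; NOT the study's data.
amh = [0.1, 0.3, 0.5, 1.2, 2.0, 3.5, 4.1, 5.0]
out = ['A', 'A', 'A', 'A', 'M', 'M', 'M', 'M']

# Steps 6 and 8: evaluate every candidate split position, keep the best.
best = max(sorted(set(amh))[1:], key=lambda s: gini_gain(amh, out, s))
print(best, gini_gain(amh, out, best))   # → 2.0 0.5
```

Here the split at 2.0 produces two pure halves, so the whole base Gini of 0.5 is recovered as gain; step 9 would then repeat the search within each half.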
Contingency Matrix
Grown Tree
Partition of the Euclidean Plane
Initial Validation Analysis
Pruned Tree
Pruned Partition of the Euclidean Plane
Next Lecture
Carl Donovan is standing in for me
More on tree building
Internal validation: pruning the trees