Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro
Goal
} Classification without models (well, partially without a model)
} Today: Decision Trees
Why Trees?
} Interpretable and intuitive; popular in medical applications because they mimic the way a doctor thinks
} Model discrete outcomes nicely
} Can be very powerful; can be as complex as you need them
} C4.5 and CART: judging from top-10 entries on Kaggle, decision trees are very effective and popular
Sure, But Why Trees?
} Easy-to-understand knowledge representation
} Can handle mixed variables
} Recursive, divide-and-conquer learning method
} Efficient inference
} Top-down recursive divide-and-conquer algorithm:
  - Start with all examples at the root
  - Select the best attribute/feature
  - Grow the tree by splitting on attributes one by one; to determine which attribute to split, look at node impurity
  - Recurse and repeat
} Other issues:
  - How to construct features
  - When to stop growing
  - Pruning irrelevant parts of the tree
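The top-down procedure above can be sketched in a few lines of Python. This is a minimal illustration, not any particular system (C4.5, CART): helper names are mine, impurity is measured by entropy, and there is no stopping test beyond purity or running out of attributes.

```python
import math
from collections import Counter

def entropy(labels):
    # impurity of a node: entropy (in bits) of its label distribution
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    # pick the attribute whose split yields the highest information gain
    base = entropy(labels)
    def gain(a):
        g = base
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g
    return max(attrs, key=gain)

def grow(rows, labels, attrs):
    # recursive divide and conquer: stop at pure nodes or when attributes run out
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    a = best_attribute(rows, labels, attrs)
    rest = [x for x in attrs if x != a]
    branches = {}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        branches[v] = grow([rows[i] for i in idx], [labels[i] for i in idx], rest)
    return (a, branches)
```

A tree is represented here as a nested `(attribute, {value: subtree})` pair with majority-class labels at the leaves.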
Fraud  Age  Degree  StartYr  Series7
  +     22    Y       2005      N
  -     25    N       2003      Y
  -     31    Y       1995      Y
  -     27    Y       1999      Y
  +     24    N       2006      N
  -     29    N       2003      N

Score each attribute split for these instances (Age, Degree, StartYr, Series7): choose the split on Series7.

Series7 = Y (all negative, pure):
  -     25    N       2003      Y
  -     31    Y       1995      Y
  -     27    Y       1999      Y

Series7 = N (score each remaining attribute split for these instances: Age, Degree, StartYr): choose the split on Age > 28.
  +     22    Y       2005      N
  +     24    N       2006      N
  -     29    N       2003      N

Age > 28 (negative):
  -     29    N       2003      N

Age <= 28 (all positive):
  +     22    Y       2005      N
  +     24    N       2006      N
Overview (with two features X1, X2 and a 1D target Y)

[FIGURE 9.2, "Partitions and CART": the top right panel shows a partition of a two-dimensional feature space by recursive binary splitting (at thresholds t1..t4, giving regions R1..R5), as used in CART, applied to some fake data. The top left panel shows a general partition that cannot be obtained from recursive binary splitting. The bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.]
} Most well-known systems:
  - CART: Breiman, Friedman, Olshen and Stone
  - ID3, C4.5: Quinlan
} How do they differ?
  - Split scoring function
  - Stopping criterion
  - Pruning mechanism
  - Predictions in leaf nodes
} Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible, ideally into pure sets of "all positive" or "all negative"
  (Restaurant example: splitting on Patrons? {None, Some, Full} vs. Type? {French, Italian, Thai, Burger})
} Bias-variance tradeoff: choosing the most discriminating attribute first may not give the best tree (bias), but it can make the tree small (low variance)
Association between attribute and class label

Data: 14 examples with attribute Income in {High, Med, Low} and class label Buy / No buy.

Contingency table (class label counts per Income value):

         Buy   No buy
  High    2      2
  Med     4      2
  Low     3      1
Mathematically Defining a Good Split
} We start with information theory: how uncertain will the answer be if we split the tree this way?
} Say we need to decide among k options. Uncertainty in the answer Y in {1, ..., k} when the probabilities are p_1, ..., p_k can be quantified via entropy:

  H(<p_1, ..., p_k>) = - Sum_i p_i log2 p_i

} Convenient notation: B(p) = H(<p, 1-p>), the number of bits necessary to encode a binary outcome with probability p
Amount of Information in the Tree
} Suppose we have p positive and n negative examples at the root
  - B(p/(p + n)) bits are needed to classify a new example
} Information is always conserved:
  - If encoding the information in the leaves is lossless, then the tree has a lossless encoding
  - The entropy of the leaves (bits) + the information carried in the tree structure (bits) = total information in the data
} Let branch i of a split have p_i positive and n_i negative examples
  - B(p_i/(p_i + n_i)) bits are needed to classify a new example in that branch
  - The expected number of bits per example over all branches is

      Sum_i ((p_i + n_i)/(p + n)) B(p_i/(p_i + n_i))

  - Choose the next attribute to split so as to minimize this remaining information, which maximizes the information in the tree (as information is conserved)
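As a sanity check, the expected-bits formula can be evaluated on the Series7 split from the fraud example (branch counts taken from that slide: the Y branch has 0 positives and 3 negatives, the N branch 2 positives and 1 negative):

```python
import math

def B(p):
    # bits to encode a binary outcome with probability p
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def remainder(branches):
    # branches: list of (p_i, n_i) counts; expected bits per example after the split
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * B(p / (p + n)) for p, n in branches)

# Series7 split from the fraud example: Y branch (0+, 3-), N branch (2+, 1-)
rem = remainder([(0, 3), (2, 1)])
```

The pure Y branch contributes 0 bits, so the remainder is just half of B(2/3), about 0.459 bits per example.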
Information gain
} Information gain (Gain) is the amount of information that the tree structure encodes
} H[S] is the entropy: the expected number of bits to encode a randomly selected example of S

  Gain(S, A) = H[S] - Sum_{a in values(A)} (|S_a| / |S|) H[S_a]

H[buys_computer] = -9/14 log 9/14 - 5/14 log 5/14 = 0.940
Gain(S, A) = H[S] - Sum_{a in values(A)} (|S_a| / |S|) H[S_a]

  Income = High: no, no, yes, yes
  Income = Med:  yes, no, yes, yes, yes, no
  Income = Low:  yes, no, yes, yes

  Entropy(Income=high) = -2/4 log 2/4 - 2/4 log 2/4 = 1
  Entropy(Income=med)  = -4/6 log 4/6 - 2/6 log 2/6 = 0.9183
  Entropy(Income=low)  = -3/4 log 3/4 - 1/4 log 1/4 = 0.8113

  Gain(D, Income) = 0.940 - (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) = 0.029
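The same numbers can be reproduced directly from the counts on the slide (14 examples, 9 buy / 5 no-buy; per-Income counts high (2,2), med (4,2), low (3,1)):

```python
import math

def H(ps):
    # entropy in bits; zero-probability terms contribute nothing
    return -sum(p * math.log2(p) for p in ps if p > 0)

# 14 examples overall: 9 buy, 5 no-buy
base = H([9/14, 5/14])

# per-Income-value (buy, no-buy) counts: high (2,2), med (4,2), low (3,1)
splits = [(2, 2), (4, 2), (3, 1)]
gain = base - sum((b + n) / 14 * H([b/(b+n), n/(b+n)]) for b, n in splits)
```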
Gini gain
} Similar to information gain
} Uses the Gini index instead of entropy
} Measures the decrease in the Gini index after the split:

  Gain(S, A) = Gini(S) - Sum_{a in values(A)} (|S_a| / |S|) Gini(S_a)
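Gini gain has the same weighted-average shape as information gain, with the Gini index in place of entropy. A minimal sketch, taking nodes as (positive, negative)-style count tuples (applied below to the Income example's counts for illustration):

```python
def gini(ps):
    # Gini index of a distribution: 1 - sum of squared class proportions
    return 1.0 - sum(p * p for p in ps)

def gini_gain(parent_counts, branch_counts):
    # parent_counts: class counts at the node; branch_counts: one count tuple per branch
    def g(counts):
        n = sum(counts)
        return gini([c / n for c in counts])
    total = sum(sum(c) for c in branch_counts)
    return g(parent_counts) - sum(sum(c) / total * g(c) for c in branch_counts)
```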
[Plot: entropy and Gini index as functions of the class proportion p]
[Plot: entropy compared with Gini x 2]
} The Gini score can produce a larger gain
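The comparison in the plot is easy to reproduce numerically: doubling the two-class Gini index makes both curves peak at 1 when p = 0.5, and away from the peak the scaled Gini sits below the entropy. A quick sketch:

```python
import math

def entropy(p):
    # two-class entropy as a function of the positive-class proportion p
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    # two-class Gini index: 1 - p^2 - (1-p)^2 = 2p(1-p)
    return 2 * p * (1 - p)
```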
Chi-Square score
} Widely used to test independence between two categorical attributes (e.g., feature and class label)
} Considers the counts in a contingency table and calculates the normalized squared deviation of the observed (actual) counts from the counts expected under independence:

  chi^2 = Sum_{i=1}^{k} (o_i - e_i)^2 / e_i

} The sampling distribution is known to be chi-square distributed, given that the cell counts are above minimum thresholds
         Buy   No buy | total
  High    2      2    |   4
  Med     4      2    |   6
  Low     3      1    |   4
  total   9      5    |  14
                 Class
                  +    -
  Attribute = 0   a    b
  Attribute = 1   c    d
                          (N total)

  chi^2 = Sum_{i=1}^{k} (o_i - e_i)^2 / e_i
  Observed:                 Expected:
         Buy   No buy              Buy    No buy
  High    2      2          High   2.57    1.43
  Med     4      2          Med    3.86    2.14
  Low     3      1          Low    2.57    1.43
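The expected counts are row total times column total over N, and the statistic sums the normalized squared deviations cell by cell. A sketch on the Income table above:

```python
def chi_square(observed, expected):
    # sum of (o - e)^2 / e over all contingency-table cells
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Income x Buy table, row-major (buy, no-buy) per income level: (2,2), (4,2), (3,1)
obs = [2, 2, 4, 2, 3, 1]
# expected under independence: row_total * col_total / N, with N = 14, column totals 9 and 5
exp = [4*9/14, 4*5/14, 6*9/14, 6*5/14, 4*9/14, 4*5/14]
stat = chi_square(obs, exp)
```

The statistic comes out to roughly 0.57 here, matching the small deviations between the two tables.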
Tree learning
} Top-down recursive divide-and-conquer algorithm:
  - Start with all examples at the root
  - Select the best attribute/feature
  - Partition the examples by the selected attribute
  - Recurse and repeat
} Other issues:
  - How to construct features
  - When to stop growing
  - Pruning irrelevant parts of the tree
Overfitting
} Consider a distribution D of data representing a population and a sample DS drawn from D, which is used as training data
} Given a model space M, a score function S, and a learning algorithm that returns a model m in M, the algorithm overfits the training data DS if:

  there exists m' in M such that S(m, DS) > S(m', DS) but S(m, D) < S(m', D)

  In other words, there is another model (m') that is better on the entire distribution; if we had learned from the full data, we would have selected it instead
Knowledge representation: if-then rules
} Example rule: if x > 25 then + else -
  [Figure: 1D axis of x with + and - examples separated by a threshold]
} What is the model space? All possible thresholds on x
} What score function? Prediction error rate
Search procedure? Try all thresholds, select the one with the lowest score

[Figure: score vs. threshold curves for a small sample, a biased sample, a large sample, and the full data; the threshold chosen on a small or biased sample differs from the best result on the full data — an overfitting threshold]
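The search procedure for this tiny model space is exhaustive: score every candidate threshold and keep the best. A sketch for the rule "if x > t then + else -" (the toy data in the test is made up for illustration):

```python
def best_threshold(xs, ys):
    # try every observed value of x as a threshold; return the one with
    # the lowest training error for the rule "if x > t then + else -"
    candidates = sorted(set(xs))
    def err(t):
        return sum((x > t) != (y == '+') for x, y in zip(xs, ys))
    return min(candidates, key=err)
```

On a small sample this exhaustive search happily picks a threshold that fits noise, which is exactly the overfitting picture above.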
Approaches to avoid overfitting
} Regularization (priors)
  - e.g., Dirichlet prior in NBC
} Hold out an evaluation set, used to adjust the structure of the learned model
  - e.g., pruning in decision trees
} Statistical tests during learning to only include structure with significant associations
  - e.g., pre-pruning in decision trees
} Penalty term in the classifier scoring function
  - i.e., change the score function to prefer simpler models
How to avoid overfitting in decision trees
} Postpruning
  - Use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown)
} Prepruning
  - Apply a statistical test to decide whether to expand a node
  - Use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length)
Algorithm comparison
} CART
  - Evaluation criterion: Gini index
  - Search algorithm: simple-to-complex, hill-climbing search
  - Stopping criterion: when leaves are pure
  - Pruning mechanism: cross-validation to select a Gini threshold
} C4.5
  - Evaluation criterion: information gain
  - Search algorithm: simple-to-complex, hill-climbing search
  - Stopping criterion: when leaves are pure
  - Pruning mechanism: reduced error pruning
Example: reduced error pruning
} Use a pruning set to estimate the accuracy of sub-trees and of individual nodes
} Let T' be the sub-tree rooted at node v; define the gain of pruning at v as the reduction in error on the pruning set when T' is replaced by a leaf
} Repeat: prune at the node with the largest gain, until only negative-gain nodes remain
} Bottom-up restriction: T' can only be pruned if it does not contain a sub-tree with lower error than T'
Source: www.ailab.si/blaz/predavanja/uisp/slides/uisp05-postpruning.ppt
Pre-pruning methods
} Stop growing the tree at some point during top-down construction, when there is no longer sufficient data to make reliable decisions
} Approach:
  - Choose a threshold on the feature score
  - Stop splitting if the best feature score is below the threshold
Determine the chi-square threshold analytically
} Stop growing when the chi-square feature score is not statistically significant
} Chi-square has a known sampling distribution, so we can look up the significance threshold for

  chi^2 = Sum_{i=1}^{k} (o_i - e_i)^2 / e_i

  Degrees of freedom = (#rows - 1)(#cols - 1)
  For a 2x2 table (df = 1): 3.84 is the 95% critical value
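This pre-pruning test is a table lookup once the statistic and degrees of freedom are known. A sketch with 95% critical values hard-coded from standard chi-square tables (only a few df values shown; the function name is mine):

```python
# 95% critical values of the chi-square distribution by degrees of freedom
CRITICAL_95 = {1: 3.84, 2: 5.99, 3: 7.81}

def significant_split(chi2_stat, n_rows, n_cols, critical=CRITICAL_95):
    # expand the node only if the chi-square score exceeds the critical value
    df = (n_rows - 1) * (n_cols - 1)
    return chi2_stat > critical[df]
```

For the 3x2 Income table earlier (df = 2), the statistic of about 0.57 is far below 5.99, so that split would be rejected at the 95% level.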
K-fold cross validation
} Randomly partition the training data into k folds
} For i = 1 to k:
  - Learn a model on D minus the i-th fold; evaluate the model on the i-th fold
} Average the results from all k trials

[Figure: dataset split into Train1/Test1 through Train6/Test6 pairs, each with columns Y, X1, X2]
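The partitioning step can be sketched in a few lines (helper names are mine; real pipelines would typically use a library routine for this):

```python
import random

def kfold_indices(n, k, seed=0):
    # shuffle the n example indices, then deal them round-robin into k folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_splits(n, k, seed=0):
    # yield (train_indices, test_indices) for each of the k trials
    folds = kfold_indices(n, k, seed)
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```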
Choosing a Gini threshold with cross validation
} For i in 1..k:
  - For t in a threshold set (e.g., [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]):
    - Learn a decision tree on Train_i with Gini gain threshold t (i.e., stop growing when the max Gini gain is less than t)
    - Evaluate the learned tree on Test_i (e.g., with accuracy)
  - Set t_max,i to be the t with the best performance on Test_i
} Set t_max to the average of t_max,i over the k trials
} Relearn the tree on all the data using t_max as the Gini gain threshold
Modification of the score function to better represent model value

[Figure: score vs. threshold curves for a small sample, a biased sample, a large sample, and the full data, as before, under the modified score function]