Information Theory & Decision Trees


Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr

Decision tree classifiers
Decision tree representation for modeling dependencies among input variables using elements of information theory
How to learn decision trees from data
Over-fitting and how to minimize it
How to deal with missing values in the data
Learning decision trees from distributed data
Learning decision trees at multiple levels of abstraction

Decision tree representation
In the simplest case:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node corresponds to a class label
In general:
Each internal node corresponds to a test on input instances, with mutually exclusive and exhaustive outcomes; tests may be univariate or multivariate
Each branch corresponds to an outcome of a test
Each leaf node corresponds to a class label

Decision tree representation
[Figure: a small data set with attributes x and y and class c, together with two decision trees (Tree 1 and Tree 2) that are both consistent with it. Should we choose Tree 1 or Tree 2? Why?]

Decision tree representation
Any boolean function can be represented by a decision tree.
More generally, any function f : A_1 × A_2 × ... × A_n → C, where each A_i is the domain of the i-th attribute and C is a discrete set of values (class labels), can be represented by a decision tree.
In general, the inputs need not be discrete valued.

Learning decision tree classifiers
Decision trees are especially well suited for representing simple rules for classifying instances that are described by discrete attribute values.
Decision tree learning algorithms:
Implement Ockham's razor as a preference bias: simpler decision trees are preferred over more complex trees
Are relatively efficient: linear in the size of the decision tree and the size of the data set
Produce comprehensible results
Are often among the first to be tried on a new data set

Learning decision tree classifiers
Ockham's razor recommends that we pick the simplest decision tree that is consistent with the training set.
The simplest tree is the one that takes the fewest bits to encode (why? information theory).
There are far too many trees that are consistent with a training set, and searching for the simplest consistent tree is typically not computationally feasible.
Solution:
Use a greedy algorithm: not guaranteed to find the simplest tree, but works well in practice
Or restrict the hypothesis space to a subset of simple trees

Information: some intuitions
Information reduces uncertainty.
Information is relative to what you already know.
The information content of a message is related to how surprising the message is.
Information depends on context.

Information and Uncertainty
[Diagram: a Sender passes a Message to a Receiver.]
You are stuck inside. You send me out to report back to you on what the weather is like. I do not lie, so you trust me. You and I are both generally familiar with the weather in Korea.
On a July afternoon in Korea, I walk into the room and tell you it is hot outside.
On a January afternoon in Korea, I walk into the room and tell you it is hot outside.

Information and Uncertainty
How much information does a message contain?
If my message to you describes a scenario that you expect with certainty, the information content of the message for you is zero.
The more surprising the message is to the receiver, the greater the amount of information conveyed by the message.
What does it mean for a message to be surprising?

Information and Uncertainty
Suppose I have a coin with heads on both sides, and you know that I have a coin with heads on both sides. I toss the coin and, without showing you the outcome, tell you that it came up heads. How much information did I give you?
Suppose I have a fair coin, and you know that I have a fair coin. I toss the coin and, without showing you the outcome, tell you that it came up heads. How much information did I give you?

Information & Coding Theory Without loss of generality, assume that messages are binary made of 0s and s Conveying the outcome of a fair coin toss requires bit of information need to identify one out of two equally likely outcomes Conveying the outcome of an experiment with 8 equally likely outcomes requires 3 bits.. Conveying an outcome of that is certain takes 0 bits In general, if an outcome has a probability p, the information content of the corresponding message is I p log p, I0 0

Information is Subjective
Suppose there are 3 agents, Kim, Lee, and Park, in a world where a die has been tossed. Kim observes that the outcome is a 6 and whispers to Lee that the outcome is even, but Park knows nothing about the outcome.
The probability assigned by Lee to the event "6" is a subjective measure of Lee's belief about the state of the world.
Information gained by Kim by looking at the outcome of the die = log₂ 6 bits
Information conveyed by Kim to Lee = log₂ 6 − log₂ 3 = 1 bit
Information conveyed by Kim to Park = 0 bits

Information and Shannon Entropy
Suppose we have a message that conveys the result of a random experiment with m possible discrete outcomes, with probabilities p_1, p_2, ..., p_m.
The expected information content of such a message is called the entropy of the probability distribution (p_1, p_2, ..., p_m):
H(p_1, ..., p_m) = Σ_{i=1..m} p_i I(p_i)
where I(p_i) = −log₂ p_i provided p_i > 0, and p_i I(p_i) is taken to be 0 otherwise.

Entropy and Coding Theory
Entropy is a measure of the uncertainty associated with a random variable.
Noiseless coding theorem: the Shannon entropy is the minimum average message length, in bits, that must be sent to communicate the true value of a random variable to a recipient.

Shannon's entropy as a measure of information
Let P = (p_1, p_2, ..., p_n) be a discrete probability distribution. The entropy of the distribution P is given by
H(P) = −Σ_{i=1..n} p_i log₂ p_i, with p_i log₂ p_i taken to be 0 when p_i = 0.
Examples:
H(1, 0) = −1·log₂ 1 − 0·log₂ 0 = 0 bits
H(1/2, 1/2) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1 bit
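
A small Python sketch of this definition (the helper name entropy is my own) that reproduces the two examples above:

```python
import math

def entropy(probs):
    """Shannon entropy H(P) = -sum_i p_i log2 p_i, in bits; 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # 0.0 bits: the outcome is certain
print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.25] * 4))   # 2.0 bits: four equally likely outcomes
```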

The entropy of a binary variable
[Figure: plot of the binary entropy function H(p) = −p log₂ p − (1 − p) log₂(1 − p), which is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 1/2.]

Properties of Shannon's entropy
a. H(P) ≥ 0, and if there are N possible outcomes, H(P) ≤ log₂ N
b. If p_i = 1/N for all i, then H(P) = log₂ N
c. If there exists an i such that p_i = 1, then H(P) = 0
d. H(P) is a continuous function of P

Shannon's entropy as a measure of information
For any distribution P, H(P) is the optimal number of binary questions required, on average, to determine an outcome drawn from P.
We can extend these ideas to talk about how much information the observation of the outcome of one experiment conveys about the possible outcomes of another (mutual information).
We can also quantify the difference between two probability distributions (Kullback-Leibler divergence, or relative entropy).

Coding theory perspective
Suppose you and I both know the distribution P, and I choose an outcome according to P.
Suppose I want to send you a message about the outcome.
You and I could agree in advance on the questions; I can then simply send you the answers.
The optimal message length, on average, is H(P).
This generalizes to noisy communication.

Entropy and Coding Theory
x      a    b    c    d     e       f       g       h
P(x)   1/2  1/4  1/8  1/16  1/64    1/64    1/64    1/64
code   0    10   110  1110  111100  111101  111110  111111

H(X) = (1/2) log₂ 2 + (1/4) log₂ 4 + (1/8) log₂ 8 + (1/16) log₂ 16 + 4·(1/64) log₂ 64 = 2 bits
Average code length = 1·(1/2) + 2·(1/4) + 3·(1/8) + 4·(1/16) + 4·6·(1/64) = 2 bits
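
A quick check of this example in Python, assuming the code-word lengths reconstructed above (1, 2, 3, 4, 6, 6, 6, 6 bits):

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # lengths of the prefix code above

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, code_lengths))

print(entropy)   # 2.0 bits
print(avg_len)   # 2.0 bits: this code meets the entropy bound exactly
```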

Entropy of random variables and sets of random variables
For a random variable X taking values a_1, ..., a_n:
H(X) = −Σ_{i=1..n} P(X = a_i) log₂ P(X = a_i)
If X is a set of random variables:
H(X) = −Σ_x P(x) log₂ P(x), where x ranges over the joint assignments to the variables in X

Joint entropy and conditional entropy
For random variables X and Y, the joint entropy is
H(X, Y) = −Σ_x Σ_y P(x, y) log₂ P(x, y)
The conditional entropy of Y given X is
H(Y | X) = Σ_x P(X = x) H(Y | X = x) = −Σ_x Σ_y P(x, y) log₂ P(y | x)

Joint entropy and conditional entropy
Chain rule for entropy: H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)
Some useful results: H(Y | X) ≤ H(Y), i.e., the entropy never increases after conditioning: observing the value of X reduces our uncertainty about Y, unless X and Y are independent.
When do we have equality?

Example of entropy calculations
P(X=H, Y=H) = 0.2; P(X=H, Y=T) = 0.4; P(X=T, Y=H) = 0.3; P(X=T, Y=T) = 0.1
H(X, Y) = −0.2 log₂ 0.2 − 0.4 log₂ 0.4 − 0.3 log₂ 0.3 − 0.1 log₂ 0.1 = 1.85 bits
P(X=H) = 0.6, so H(X) = 0.97 bits
P(Y=H) = 0.5, so H(Y) = 1.0 bit
P(Y=H | X=H) = 0.2/0.6 = 0.333; P(Y=T | X=H) = 1 − 0.333 = 0.667
P(Y=H | X=T) = 0.3/0.4 = 0.75; P(Y=T | X=T) = 0.1/0.4 = 0.25
H(Y | X) = 0.6·H(0.333, 0.667) + 0.4·H(0.75, 0.25) ≈ 0.88 bits
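
These numbers can be verified with a short Python sketch (the variable names X and Y and the helper H are my own):

```python
import math

def H(probs):
    """Shannon entropy in bits; 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution P(X, Y) from the example above.
joint = {('H', 'H'): 0.2, ('H', 'T'): 0.4, ('T', 'H'): 0.3, ('T', 'T'): 0.1}

H_XY = H(joint.values())                        # ~1.85 bits
p_x = {'H': 0.2 + 0.4, 'T': 0.3 + 0.1}          # marginal of X: (0.6, 0.4)
p_y = {'H': 0.2 + 0.3, 'T': 0.4 + 0.1}          # marginal of Y: (0.5, 0.5)
H_X, H_Y = H(p_x.values()), H(p_y.values())     # ~0.97 and 1.0 bits

# Conditional entropy H(Y|X) = sum_x P(X=x) * H(Y | X=x)
H_Y_given_X = sum(p_x[x] * H([joint[(x, y)] / p_x[x] for y in ('H', 'T')])
                  for x in ('H', 'T'))

print(round(H_XY, 2), round(H_X, 2), round(H_Y, 2), round(H_Y_given_X, 2))
print(round(H_X + H_Y_given_X, 2) == round(H_XY, 2))   # chain rule: True
```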

Mutual information
For random variables X and Y, the average mutual information between X and Y is
I(X; Y) = H(Y) − H(Y | X) = H(X) − H(X | Y)   (using the chain rule)
or, in terms of the probability distributions,
I(X; Y) = Σ_a Σ_b P(a, b) log₂ [ P(a, b) / (P(a) P(b)) ]
Mutual information measures the extent to which observing one variable will reduce the uncertainty in another.
Mutual information is nonnegative, and equal to zero iff X and Y are independent.

Relative entropy
Let P and Q be two distributions over a random variable X. The relative entropy (Kullback-Leibler distance) is a measure of the "distance" from P to Q:
D(P ‖ Q) = Σ_x P(x) log₂ [ P(x) / Q(x) ]
Note: D(P ‖ Q) ≠ D(Q ‖ P), D(P ‖ Q) ≥ 0, and D(P ‖ P) = 0.
KL divergence is not a true distance measure in that it is not symmetric.
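
A tiny Python illustration of the asymmetry (the example distributions P and Q are my own):

```python
import math

def kl(p, q):
    """D(P || Q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl(P, Q), kl(Q, P))   # two different values: D(P||Q) != D(Q||P)
print(kl(P, P))             # 0.0: a distribution has zero divergence from itself
```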

Learning decision tree classifiers
[Diagram: Nature generates instances; a classifier trained on the training data S = S_1 ∪ ... ∪ S_m assigns class labels.]
On average, the information needed to convey the class membership of a random instance drawn from nature is H(P), where the class label is a random variable with distribution P.
H(P̂) = −Σ_{i=1..m} p̂_i log₂ p̂_i, where P̂ is an estimate of P.
S_i is the multi-set of examples belonging to class C_i.

Learning decision tree classifiers
The task of the learner is then to extract the needed information from the training set and store it in the form of a decision tree for classification.
Information-gain-based decision tree learner:
Start with the entire training set at the root.
Recursively add nodes to the tree corresponding to tests that yield the greatest expected reduction in entropy (the largest expected information gain), until some termination criterion is met (e.g., the training data at every leaf node has zero entropy). A sketch of this procedure is given below.
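
The following Python sketch illustrates this greedy, information-gain-based procedure. It is my own illustration of the idea, not code from the slides: examples are assumed to be dicts mapping attribute names to discrete values, and no pruning or stopping rule beyond pure/empty nodes is implemented.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of the class labels of a (multi)set of examples."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Expected reduction in entropy obtained by splitting on attribute `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def grow_tree(examples, labels, attributes):
    """Greedy information-gain tree learner (an ID3-style sketch, no pruning)."""
    if entropy(labels) == 0.0:                    # pure node: make it a leaf
        return labels[0]
    if not attributes:                            # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = grow_tree([examples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [a for a in attributes if a != best])
    return tree
```

The recursion stops when a node is pure or no attributes remain; a practical learner would add a pruning step such as those discussed later in these slides.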


Which attribute to choose?
[Figure: candidate attribute tests and the class distributions and entropies of the subsets they produce.]


Learning decision tree classifiers: Example
Instances: ordered 3-tuples of attribute values corresponding to
Height ∈ {tall, short}, Hair ∈ {dark, blonde, red}, Eye ∈ {blue, brown}

Training Data
Instance  Height  Hair    Eye    Class label
I1        tall    dark    blue   +
I2        short   dark    blue   +
I3        tall    blonde  blue   −
I4        tall    red     blue   −
I5        short   blonde  blue   −
I6        tall    blonde  brown  +
I7        tall    dark    brown  +
I8        short   blonde  brown  +

Learning decision tree classifiers: Example
H(S) = −(3/8) log₂(3/8) − (5/8) log₂(5/8) = 0.954 bits
S_tall = {I1, I3, I4, I6, I7}, S_short = {I2, I5, I8}
H(S | Height = tall) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971 bits
H(S | Height = short) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918 bits
H(S | Height) = (5/8)·H(S | Height = tall) + (3/8)·H(S | Height = short) = (5/8)(0.971) + (3/8)(0.918) = 0.951 bits
Similarly, H(S | Hair) = (3/8)·H(S | Hair = dark) + (4/8)·H(S | Hair = blonde) + (1/8)·H(S | Hair = red) = 0.5 bits, and H(S | Eye) = 0.607 bits
Hair is the most informative attribute because it yields the largest reduction in entropy; a test on the value of Hair is chosen to correspond to the root of the decision tree.
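
A short, self-contained Python check of these numbers (the tuple encoding of the training data is mine, following the table above):

```python
import math

def H(pos, neg):
    """Entropy, in bits, of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

# (Height, Hair, Eye, class) for I1..I8, as in the table above.
data = [('t', 'd', 'blue', '+'), ('s', 'd', 'blue', '+'), ('t', 'b', 'blue', '-'),
        ('t', 'r', 'blue', '-'), ('s', 'b', 'blue', '-'), ('t', 'b', 'brown', '+'),
        ('t', 'd', 'brown', '+'), ('s', 'b', 'brown', '+')]

for name, col in (('Height', 0), ('Hair', 1), ('Eye', 2)):
    remainder = 0.0
    for value in set(row[col] for row in data):
        subset = [row for row in data if row[col] == value]
        pos = sum(1 for row in subset if row[3] == '+')
        remainder += len(subset) / len(data) * H(pos, len(subset) - pos)
    print(name, round(H(5, 3) - remainder, 3))   # information gain; Hair is largest
```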

Learning decision tree classifiers: Example
The resulting tree: the root tests Hair; Hair = dark → +, Hair = red → −, and Hair = blonde → test Eye, with Eye = blue → − and Eye = brown → +.
Compare the result with Naïve Bayes.
In practice, we need some way to prune the tree to avoid overfitting the training data.

Learning, generalization, overfitting
Consider the error of a hypothesis h over
the training data: error_train(h)
the entire distribution D of data: error_D(h)
Hypothesis h overfits the training data if there is an alternative hypothesis h′ such that
error_train(h) < error_train(h′) and error_D(h) > error_D(h′)

Overfitting in decision tree learning
[Figure: accuracy on training and test data as a function of tree size, e.g. on the diabetes dataset.]

Causes of overfitting
As we move further away from the root, the data set used to choose the best test becomes smaller, leading to poor estimates of entropy.
Noisy examples can further exacerbate overfitting.

Minimizing overfitting
Use roughly the same size sample at every node to estimate entropy, when there is a large data set from which we can sample.
Stop when a further split fails to yield a statistically significant information gain (estimated from a validation set).
Grow the full tree, then prune.
Minimize size(tree) + size(exceptions(tree)).

Reduced error pruning
Each decision node in the tree is considered as a candidate for pruning.
Pruning a decision node consists of
removing the subtree rooted at that node,
making it a leaf node, and
assigning it the most common label at that node (or storing the class distribution at that node, for probabilistic classification).

Reduced error pruning
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each candidate node.
2. Greedily select the node which most improves performance on the validation set when the subtree rooted at that node is pruned.
Drawback: holding back the validation set limits the amount of training data available; this is not desirable when the data set is small.

Reduced error pruning: Example
[Figure: a table of the validation-set accuracy gain obtained by pruning each candidate node (pruning node A reduces accuracy, pruning node B improves it), together with the tree before pruning and the tree after the subtree rooted at B is replaced by a leaf.]

Reduced error pruning
[Figure: the pruned tree.]

χ² pruning
Evaluate a candidate split to decide if the resulting information gain is significantly greater than zero, as determined using a suitable hypothesis test at a desired significance level.
Example: χ² statistic, 2-class, binary [L (left), R (right)] split case
n_1 of class 1, n_2 of class 2; N = n_1 + n_2
Assume a candidate split s sends pN instances to L and (1 − p)N to R.
A random split having this probability (i.e., the null hypothesis) would send p·n_1 of class 1 to L and p·n_2 of class 2 to L.
Deviation of the results due to s from the random split:
χ² = Σ_i (n_iL − n_iE)² / n_iE
n_iL: number of patterns in class i sent to the left under decision s
n_iE = p·n_i: number expected under the random rule

χ² pruning
The critical value of χ² depends on the number of degrees of freedom, which is 1 in this case: for a given p, n_1L fully specifies n_2L, n_1R, and n_2R.
In general, the number of degrees of freedom can be > 1.

Pruning based on whether information gain is significantly greater than zero
A node receives 100 instances, 50 of class 1 and 50 of class 2. The candidate split sends 50 instances to L (all of class 1) and 50 to R (all of class 2), so n_1L = 50, n_2L = 0, n_L = 50, and p = 0.5.
Under the null hypothesis: n_1E = p·n_1 = 25 and n_2E = p·n_2 = 25.
χ² = (50 − 25)²/25 + (0 − 25)²/25 = 25 + 25 = 50
This split is significantly better than random with confidence > 99%, because χ² = 50 > 6.64 (the 1% critical value for 1 degree of freedom).
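
A sketch of this computation in Python; scipy (assumed to be available) is used only to look up the critical value:

```python
from scipy.stats import chi2

n1, n2 = 50, 50             # class counts at the node
n1_left, n2_left = 50, 0    # observed counts sent to the left branch by the split
p = (n1_left + n2_left) / (n1 + n2)   # fraction of all instances sent left: 0.5

# Chi-squared deviation of the observed split from the random split, summed over
# the left-branch counts as in the two-class formula above.
chi_sq = sum((obs - p * n) ** 2 / (p * n) for obs, n in [(n1_left, n1), (n2_left, n2)])
critical = chi2.ppf(0.99, df=1)       # ~6.63 for 1 degree of freedom at the 1% level

print(chi_sq, critical, chi_sq > critical)   # 50.0 6.63... True: keep the split
```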

χ² critical value table
[Table of χ² critical values by significance level and degrees of freedom; e.g., for 1 degree of freedom the 1% critical value is 6.64.]

Pruning based on whether information gain is significantly greater than zero
In general, for |Classes| classes and |Branches| branches:
χ² = Σ_{i=1..|Classes|} Σ_{j=1..|Branches|} (n_ij − n_ijE)² / n_ijE,  where n_ijE = p_j · n_i
and p_j is the fraction of instances sent down branch j (Σ_j p_j = 1).
Degrees of freedom = (|Classes| − 1)(|Branches| − 1)
The greater the value of χ², the less likely it is that the split is random; for a sufficiently high value of χ², the difference between the expected (random) and the observed split is statistically significant, and we reject the null hypothesis that the split is random.

Rule post-pruning
Convert the tree to an equivalent set of rules:
IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
...

Rule post-pruning
1. Convert the tree to an equivalent set of rules.
2. Prune each rule independently of the others, by dropping one condition at a time if doing so does not reduce the estimated accuracy at the desired confidence level.
3. Sort the final rules in order of lowest to highest error for classifying new instances.
Advantage: can potentially correct bad choices made close to the root.
Post-pruning based on a validation set is the most commonly used method in practice.

Classification of instances
Unique classification: possible when each leaf has zero entropy and there are no missing attribute values.
Most likely or probabilistic classification: based on the distribution of classes at a node, when there are no missing attribute values.

Handling different types of attribute values
Types of attributes:
Nominal: values are names
Ordinal: values are ordered
Cardinal (numeric): values are numbers, hence ordered

Handling numeric attributes
Attribute T   40  48  50  54  60  70
Class         N   N   Y   Y   Y   N
Candidate splits: T < (48+50)/2 = 49?  T < (60+70)/2 = 65?
E(S, T: 49?) = (2/6)·0 + (4/6)·[−(3/4) log₂(3/4) − (1/4) log₂(1/4)] = 0.541 bits
Sort the instances by the value of the numeric attribute under consideration.
For each attribute, find the test which yields the lowest entropy.
Greedily choose the best test across all attributes.
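
A small Python sketch of the threshold search (for simplicity it tries the midpoint between every adjacent pair of distinct values, rather than only the points where the class changes):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted entropy) of the best binary split on a numeric attribute."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # no threshold between equal values
        thr = (pairs[i][0] + pairs[i + 1][0]) / 2  # midpoint candidate
        left = [lab for v, lab in pairs if v < thr]
        right = [lab for v, lab in pairs if v >= thr]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or e < best[1]:
            best = (thr, e)
    return best

# The temperature example above: the best threshold is 49, with entropy ~0.54 bits.
print(best_threshold([40, 48, 50, 54, 60, 70], ['N', 'N', 'Y', 'Y', 'Y', 'N']))
```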

Handling numeric attributes
[Figure: a two-dimensional data set with classes C1 and C2, comparing an axis-parallel split with an oblique split.]
Oblique splits cannot be realized by univariate tests.

Two-way versus multi-way splits
The entropy criterion favors many-valued attributes.
Pathological behavior: what if, in a medical diagnosis data set, social security number is one of the candidate attributes?
Solutions:
Only two-way splits (CART): A = value versus A ≠ value
Gain ratio (C4.5):
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = −Σ_{i ∈ Values(A)} (|S_i| / |S|) log₂ (|S_i| / |S|)
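
A minimal sketch of the gain-ratio correction; the gain value 0.454 for Hair is taken from the earlier worked example, and the subset sizes 3, 4, 1 are the sizes of its dark, blonde, and red branches:

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i (|S_i|/|S|) log2(|S_i|/|S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

print(round(gain_ratio(0.454, [3, 4, 1]), 3))   # ~0.32: gain normalized by split info
```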

Alternative split criteria
Consider a split of the set S based on attribute A:
Impurity(S, A) = Σ_{j ∈ Values(A)} (|S_j| / |S|) · Impurity(S_j)
Entropy impurity: Impurity(Z) = −Σ_{i ∈ Classes} (|Z_i| / |Z|) log₂ (|Z_i| / |Z|)
Gini impurity: Impurity(Z) = Σ_{i ≠ j} (|Z_i| / |Z|)(|Z_j| / |Z|) = 1 − Σ_{i ∈ Classes} (|Z_i| / |Z|)²
The Gini impurity is the expected rate of error if the class label is picked randomly according to the distribution of instances in the set.
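
Both impurity measures are a few lines of Python (my own helpers, written over class counts):

```python
import math

def entropy_impurity(class_counts):
    """Entropy(Z) = -sum_i (|Z_i|/|Z|) log2(|Z_i|/|Z|)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def gini_impurity(class_counts):
    """Gini(Z) = 1 - sum_i (|Z_i|/|Z|)^2: expected error rate of a random labeling."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(entropy_impurity([5, 3]), gini_impurity([5, 3]))   # an impure node
print(entropy_impurity([8, 0]), gini_impurity([8, 0]))   # a pure node: both are 0
```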

Incorporating attribute costs
Not all attribute measurements are equally costly or risky. Example, in medical diagnosis:
A blood test has a cost of $150
Exploratory surgery may have a cost of $3000
Goal: learn a decision tree classifier which minimizes the cost of classification.
Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost

Incorporating different misclassification costs for different classes
Not all misclassifications are equally costly: an occasional false alarm about a nuclear power plant meltdown is less costly than the failure to alert when there is a chance of a meltdown.
Weighted Gini impurity:
Impurity(S) = Σ_{i,j} λ_ij (|S_i| / |S|)(|S_j| / |S|)
where λ_ij is the cost of wrongly assigning an instance belonging to class i to class j.

Dealing with missing attribute values
Solution 1:
Sometimes, the fact that an attribute value is missing might itself be informative: a missing blood sugar level might imply that the physician had reason not to measure it.
Introduce a new value (one per attribute) to denote a missing value.
Decision tree construction and use of the tree for classification proceed as before.

Dealing with missing attribute values
Solution 2:
During decision tree construction: replace a missing attribute value in a training example with the most frequent value found among the instances at the node.
During use of the tree for classification: replace a missing attribute value in an instance to be classified with the most frequent value found among the training instances at the node.

Dealing with missing attribute values
Solution 3:
During decision tree construction: replace a missing attribute value in a training example with the most frequent value found among the instances at the node that have the same class label as the training example.
During use of the tree for classification: same as solution 2.

Dealing with missing attribute values
Solution 4:
During decision tree construction: generate several fractionally weighted training examples based on the distribution of values for the corresponding attribute at the node.
During use of the tree for classification: generate multiple instances by assigning candidate values for the missing attribute based on the distribution of instances at the node; sort each such instance through the tree to generate candidate labels, and assign the most probable class label (or probabilistically assign a class label).
Used in C4.5.

Dealing with missing attribute values
[Tree: the root tests A. A = 1 leads to a + leaf; A = 0 leads to a test on B, with B = 1 leading to a − leaf and B = 0 leading to a + leaf. Counts: n+ = 60, n− = 40; n+(A=1) = 50; n+(A=0) = 10, n−(A=0) = 40; n−(A=0, B=1) = 40; n+(A=0, B=0) = 10.]
Suppose B is missing:
Replacement with the most frequent value at the node: B = 1
Replacement with the most frequent value if the class is +: B = 0

Dealing with missing attribute values
[Same tree as above: n+ = 60, n− = 40; n+(A=1) = 50; n+(A=0) = 10, n−(A=0) = 40; n−(A=0, B=1) = 40; n+(A=0, B=0) = 10.]
Suppose B is missing:
Fractional instances based on the distribution at the node: weight 4/5 for B = 1, 1/5 for B = 0
Fractional instances based on the distribution at the node for class +: weight 1 for B = 0, 0 for B = 1

See5/C5.0 (1997)
Boosting
New data types (e.g., dates), N/A values, variable misclassification costs, attribute pre-filtering
Unordered rulesets: all applicable rules are found and voted
Improved scalability: multi-threading, multi-core/CPUs

Recent developments in decision trees
Learning decision trees from distributed data
Learning decision trees from attribute value taxonomies and partially specified data
Learning decision trees from relational data
Decision trees for multi-label classification tasks
Learning decision trees from very large data sets
Learning decision trees from data streams
Other research issues:
Stable trees (e.g., against leave-one-out cross-validation)
Decomposing complex trees for comprehensibility
Gradient Boosted Decision Trees (GBDT)

Summary of decision trees
Simple
Fast: linear in the size of the tree, linear in the size of the training set, linear in the number of attributes
Produce easy-to-interpret rules
Good for generating simple predictive rules from data with lots of attributes