Decision Trees Tirgul 5
Using Decision Trees: It can be difficult to decide which pet is right for you. We'll find a nice algorithm that helps us decide what to choose without having to think about it.
Using Decision Trees: Another example: the tree helps a student decide what to do in the evening. Features: <Party? (Yes, No), Deadline (Urgent, Near, None), Lazy? (Yes, No)>. Labels: Party / Study / TV / Pub.
Definition: A decision tree is a flowchart-like structure which provides a useful way to describe a hypothesis h from a domain X to a label set Y = {0, 1, ..., k}. Given x ∈ X, the prediction h(x) of a decision tree h corresponds to a path from the root of the tree to a leaf. Most of the time we do not distinguish between the representation (the tree) and the hypothesis. An internal node corresponds to a question, a branch corresponds to an answer, and a leaf corresponds to a label.
Should I study? Equivalent to: (Party == No ∧ Deadline == Urgent) ∨ (Party == No ∧ Deadline == Near ∧ Lazy == No).
Constructing the Tree: Features: Party, Deadline, Lazy. Based on these features, how do we construct the tree? The DT algorithm uses the following principle: build the tree in a greedy manner; starting at the root, choose the most informative feature at each step.
Constructing the Tree: Informative features? Choosing which feature to use next in the decision tree can be thought of as playing the game "20 Questions". At each stage, you choose a question that gives you the most information given what you know already. Thus, you would ask "Is it an animal?" before you ask "Is it a cat?".
Constructing the Tree: a 20 Questions example (Akinator's questions vs. my character).
Constructing the Tree: The idea is to quantify how much information is provided. Mathematically: Information Theory.
Pivot example: the "play tennis" dataset (14 examples; attributes: Outlook, Temperature, Humidity, Wind; label: Yes/No).
Quick Aside: Information Theory
Entropy: S is a sample of training examples. $p_+$ is the proportion of positive examples in S; $p_-$ is the proportion of negative examples in S. Entropy measures the impurity of S:
$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
The smaller, the better (we define $0 \log 0 = 0$). In general, for a distribution $p$: $\mathrm{Entropy}(p) = -\sum_i p_i \log_2 p_i$.
Entropy: Generally, entropy refers to disorder or uncertainty. Entropy = 0 if the outcome is certain. E.g., consider a fair coin toss: the probability of heads equals the probability of tails, so the entropy of the coin toss is as high as it can be. This is because there is no way to predict the outcome of the toss ahead of time: the best we can do is predict, say, heads, and the prediction will be correct with probability 1/2.
Entropy: Example. The dataset has 14 examples: 9 positive and 5 negative.
$\mathrm{Entropy}(S) = -p_{\text{yes}} \log_2 p_{\text{yes}} - p_{\text{no}} \log_2 p_{\text{no}} = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.409 + 0.530 = 0.939$
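A minimal sketch of this computation in Python (the helper name `entropy` and the use of `collections.Counter` are my own choices, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of labels: -sum_i p_i * log2(p_i), with 0*log(0) = 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The 14-example dataset above: 9 positive ("yes") and 5 negative ("no") labels.
labels = ["yes"] * 9 + ["no"] * 5
print(round(entropy(labels), 3))  # 0.94 (the slides round the intermediate terms and get 0.939)
```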
Information Gain: Important idea: find how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is called Information Gain, defined as the entropy of the whole set minus the entropy when a particular feature is chosen.
Information Gain: The information gain is the expected reduction in entropy caused by partitioning the examples with respect to an attribute. Given $S$, the set of examples (at the current node), $A$ the attribute, and $S_v$ the subset of $S$ for which attribute $A$ has value $v$:
$IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$
That is, current entropy minus new entropy.
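The same formula as a sketch in code, reusing the `entropy` function from the previous sketch and assuming examples are given as (feature-dict, label) pairs; all names here are illustrative, not from the slides:

```python
def information_gain(examples, attribute):
    """IG(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v).

    `examples` is a list of (features, label) pairs, where `features` is a dict
    mapping attribute names to values; `entropy` is the function sketched above.
    """
    labels = [y for _, y in examples]
    total, n = entropy(labels), len(examples)
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder
```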
Information Gain: Example. Attribute A: Outlook; Values(A) = {Sunny, Overcast, Rain}. Splitting the examples on Outlook gives: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-].
Information Gain: Example (cont.). $\mathrm{Entropy}(S) = 0.939$; $\mathrm{Values}(\text{Outlook}) = \{\text{Sunny}, \text{Overcast}, \text{Rain}\}$; S = [9+, 5-] splits into Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]. The weighted branch entropies are:
$\frac{5}{14}\,\mathrm{Entropy}(S_{\text{sunny}}) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}\right) = \frac{5}{14}\cdot 0.970$
$\frac{4}{14}\,\mathrm{Entropy}(S_{\text{overcast}}) = \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}-\frac{0}{4}\log_2\frac{0}{4}\right) = \frac{4}{14}\cdot 0$
$\frac{5}{14}\,\mathrm{Entropy}(S_{\text{rain}}) = \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right) = \frac{5}{14}\cdot 0.970$
Information Gain: Example (cont.).
$\frac{5}{14}\,\mathrm{Entropy}(S_{\text{sunny}}) = \frac{5}{14}\cdot 0.970 = 0.346$, $\frac{4}{14}\,\mathrm{Entropy}(S_{\text{overcast}}) = 0$, $\frac{5}{14}\,\mathrm{Entropy}(S_{\text{rain}}) = \frac{5}{14}\cdot 0.970 = 0.346$
$IG(S, \text{Outlook}) = 0.939 - (0.346 + 0 + 0.346) = 0.247$
Information Gain: Example. Attribute A: Wind; Values(A) = {Weak, Strong}. Splitting the [9+, 5-] examples on Wind gives: Weak [6+, 2-], Strong [3+, 3-].
Information Gain: Example (cont.). $\mathrm{Entropy}(S) = 0.939$; $\mathrm{Values}(\text{Wind}) = \{\text{Weak}, \text{Strong}\}$.
$\frac{8}{14}\,\mathrm{Entropy}(S_{\text{weak}}) = \frac{8}{14}\left(-\frac{6}{8}\log_2\frac{6}{8}-\frac{2}{8}\log_2\frac{2}{8}\right) = \frac{8}{14}\cdot 0.811$
$\frac{6}{14}\,\mathrm{Entropy}(S_{\text{strong}}) = \frac{6}{14}\left(-\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6}\right) = \frac{6}{14}\cdot 1$
Information Gain: Example (cont.).
$\frac{8}{14}\,\mathrm{Entropy}(S_{\text{weak}}) = \frac{8}{14}\cdot 0.811 = 0.463$, $\frac{6}{14}\,\mathrm{Entropy}(S_{\text{strong}}) = \frac{6}{14}\cdot 1 = 0.428$
$IG(S, \text{Wind}) = 0.939 - (0.463 + 0.428) = 0.048$
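The two worked examples can be checked numerically. Below is a small sketch that recomputes them directly from the +/- counts shown above, reusing the `entropy` function from the earlier sketch; the helper `entropy_counts` is my own:

```python
def entropy_counts(pos, neg):
    """Entropy of a node containing `pos` positive and `neg` negative examples."""
    return entropy(["+"] * pos + ["-"] * neg)

S = entropy_counts(9, 5)  # ~0.940

# Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
ig_outlook = S - (5/14) * entropy_counts(2, 3) \
               - (4/14) * entropy_counts(4, 0) \
               - (5/14) * entropy_counts(3, 2)
print(round(ig_outlook, 3))  # 0.247

# Wind: Weak [6+, 2-], Strong [3+, 3-]
ig_wind = S - (8/14) * entropy_counts(6, 2) - (6/14) * entropy_counts(3, 3)
print(round(ig_wind, 3))     # 0.048
```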
Information Gain: The smaller $\sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$ is, the larger IG becomes ($\frac{|S_v|}{|S|}$ is the probability of reaching node $v$):
$IG(S, A) = \underbrace{\mathrm{Entropy}(S)}_{\text{before}} - \underbrace{\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)}_{\text{after}}$
The ID3 algorithm computes this IG for each attribute and chooses the one that produces the highest value (greedy).
ID3 Algorithm (pseudocode figure from Artificial Intelligence: A Modern Approach).
ID3 Algorithm: Majority-value function: returns the label of the majority of the training examples in the current subtree. Choose-attribute function: chooses the attribute that maximizes the information gain. (Measures other than IG could also be used.)
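A compact sketch of ID3 along these lines: recursive, greedy by information gain, reusing the `information_gain` function sketched earlier. Representing a tree as a nested dict and the helper names are my own choices, not the slides' code:

```python
from collections import Counter

def majority_value(examples):
    """Label of the majority of the training examples at the current node."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def id3(examples, attributes):
    """Grow a decision tree. `examples` is a list of (features-dict, label) pairs.

    Returns either a label (leaf) or a node of the form
    {"attribute": name, "branches": {value: subtree, ...}, "default": majority label}.
    """
    labels = {y for _, y in examples}
    if len(labels) == 1:          # all examples share one label -> leaf
        return labels.pop()
    if not attributes:            # no attributes left to split on -> majority leaf
        return majority_value(examples)

    # Greedy step: choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    node = {"attribute": best, "branches": {}, "default": majority_value(examples)}
    remaining = [a for a in attributes if a != best]
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        node["branches"][v] = id3(subset, remaining)
    return node
```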
Back to the example: $IG(S, \text{Outlook}) = 0.247$ (maximal value), $IG(S, \text{Wind}) = 0.048$, $IG(S, \text{Temperature}) = 0.028$, $IG(S, \text{Humidity}) = 0.151$. So Outlook is placed at the root of the tree (for now).
Decision Tree after the first step: Outlook is at the root; the Overcast branch is already pure (all Yes) and becomes a leaf, while the Sunny and Rain branches still contain a mix of positive and negative examples. (The figure lists the Temperature values of the examples reaching each branch.)
The second step: split the Sunny branch ([2+, 3-]). Candidate attributes:
- Temperature: Hot [0+, 2-], Mild [1+, 1-], Cool [1+, 0-]
- Humidity: High [0+, 3-], Normal [2+, 0-] (both children have Entropy = 0, so IG is maximal)
- Wind: Weak [1+, 2-], Strong [1+, 1-]
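These numbers can be verified with the `entropy_counts` helper from the earlier sketch; the split on Humidity indeed has the largest information gain on the Sunny branch:

```python
S_sunny = entropy_counts(2, 3)                                        # ~0.971
ig_humidity    = S_sunny - (3/5) * entropy_counts(0, 3) - (2/5) * entropy_counts(2, 0)
ig_temperature = S_sunny - (2/5) * entropy_counts(0, 2) \
                         - (2/5) * entropy_counts(1, 1) - (1/5) * entropy_counts(1, 0)
ig_wind        = S_sunny - (3/5) * entropy_counts(1, 2) - (2/5) * entropy_counts(1, 1)
print(round(ig_humidity, 3), round(ig_temperature, 3), round(ig_wind, 3))
# 0.971 0.571 0.02  -> Humidity maximizes the information gain
```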
Decision Tree after the second step: Humidity is chosen under the Sunny branch, and both of its children (High → No, Normal → Yes) are pure leaves.
Next: repeat the procedure for the remaining impure branch (Rain).
Final DT: Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rain → Wind (Strong: No, Weak: Yes).
Minimal Description Length: The attribute we choose is the one with the highest information gain, i.e., we minimize the amount of information that is left. Thus, the algorithm is biased towards smaller trees, consistent with the well-known principle that short solutions are usually better than longer ones.
Minimal Description Length: MDL: the shortest description of something, i.e., the most compressed one, is the best description. A.k.a. Occam's Razor: among competing hypotheses, the one with the fewest assumptions should be selected.
New data: <Rain, Mild, High, Weak>, with label False. Following the tree (Sunny → Humidity (High: False, Normal: True); Overcast → True; Rain → Wind (Weak: True, Strong: False)): Outlook = Rain, Wind = Weak, so the prediction is True. Wait, what!? The tree's prediction (True) disagrees with the actual label (False).
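For concreteness, here is this tree written in the nested-dict form used in the ID3 sketch above, together with a small prediction routine; this is an illustrative sketch, not the slides' code:

```python
tree = {
    "attribute": "Outlook", "default": "True",
    "branches": {
        "Sunny":    {"attribute": "Humidity", "default": "False",
                     "branches": {"High": "False", "Normal": "True"}},
        "Overcast": "True",
        "Rain":     {"attribute": "Wind", "default": "True",
                     "branches": {"Weak": "True", "Strong": "False"}},
    },
}

def predict(tree, features):
    """Walk from the root to a leaf, answering one attribute question per level."""
    while isinstance(tree, dict):
        value = features[tree["attribute"]]
        tree = tree["branches"].get(value, tree["default"])
    return tree

x = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"}
print(predict(tree, x))  # "True" -- even though the actual label of this example is False
```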
Overfitting: Learning was performed for too long, or the learner adjusts to very specific random features of the training data that have no causal relation to the target function.
Overfitting: The performance on the training examples keeps increasing while the performance on unseen data becomes worse.
Pruning: to combat overfitting, the fully grown tree is pruned: subtrees whose removal does not hurt accuracy on held-out validation data are replaced by leaves labeled with the majority value.
ID3 for real-valued features: Until now, we assumed that the splitting rules are of the form $\mathbb{1}[x_i = 1]$. For real-valued features, use threshold-based splitting rules: $\mathbb{1}[x_i < \theta]$. (* $\mathbb{1}$(boolean expression) is the indicator function: it equals 1 if the expression is true and 0 otherwise.)
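One common way to choose $\theta$ for a single real-valued feature is to sort the observed values and evaluate the information gain of the split $\mathbb{1}[x_i < \theta]$ at midpoints between consecutive distinct values. A minimal sketch, reusing the `entropy` function from earlier (the function name and return convention are my own):

```python
def best_threshold(examples, attribute):
    """Return (theta, gain) maximizing the IG of the binary split x[attribute] < theta.

    `examples` is a list of (features-dict, label) pairs with numeric values
    for `attribute`; `entropy` is the function sketched earlier.
    """
    pairs = sorted(((x[attribute], y) for x, y in examples), key=lambda p: p[0])
    labels = [y for _, y in pairs]
    total, n = entropy(labels), len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:      # only split between distinct values
            continue
        theta = (pairs[i - 1][0] + pairs[i][0]) / 2
        left, right = labels[:i], labels[i:]
        gain = total - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (theta, gain)
    return best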
Random forest
Random forest: A collection of decision trees. Prediction: a majority vote over the predictions of the individual trees. Constructing the trees: take a random subsample (of size m′) from S, with replacement; construct a sequence I_1, I_2, ..., where I_t is a subset of the attributes (of size k). The algorithm grows a DT (using ID3) based on the sample; at each splitting stage, the algorithm chooses a feature that maximizes IG from I_t.
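A sketch of this construction, reusing `majority_value`, `information_gain`, and `predict` from the earlier sketches; `id3_rf` is a variant of the `id3` sketch where each split considers only a random subset I_t of k attributes. All names and the exact sampling details are illustrative assumptions, not the slides' code:

```python
import random
from collections import Counter

def id3_rf(examples, attributes, k):
    """Like id3, but every split considers only a random subset I_t of k attributes."""
    labels = {y for _, y in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return majority_value(examples)
    candidates = random.sample(attributes, min(k, len(attributes)))   # I_t
    best = max(candidates, key=lambda a: information_gain(examples, a))
    node = {"attribute": best, "branches": {}, "default": majority_value(examples)}
    remaining = [a for a in attributes if a != best]
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        node["branches"][v] = id3_rf(subset, remaining, k)
    return node

def random_forest(examples, attributes, n_trees, sample_size, k):
    """Grow n_trees trees, each on a bootstrap sample drawn with replacement."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(examples) for _ in range(sample_size)]
        forest.append(id3_rf(sample, attributes, k))
    return forest

def forest_predict(forest, x):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(predict(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```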
Weka: http://www.cs.waikato.ac.nz/ml/weka/
Summary: Intro: Decision Trees. Constructing the tree: Information Theory (Entropy, IG), the ID3 algorithm, MDL. Overfitting: Pruning. ID3 for real-valued features. Random Forests. Weka.