EECS 349: Machine Learning
Bryan Pardo
Topic 2: Decision Trees
(Includes content provided by: Russell & Norvig, D. Downie, P. Domingos)
General Learning Task
There is a set of possible examples $X = \{x_1, \dots, x_n\}$.
Each example is a tuple of attribute values $x = \langle a_1, \dots, a_k \rangle$.
There is a target function that maps $X$ onto some finite set $Y$: $f : X \rightarrow Y$.
The DATA is a set of pairs of examples and target function values: $D = \{\langle x_1, f(x_1) \rangle, \dots, \langle x_m, f(x_m) \rangle\}$.
Find a hypothesis $h$ such that $\forall x,\ h(x) \approx f(x)$.
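As a concrete illustration, these definitions map directly onto simple Python data structures (a hypothetical sketch; the attribute values and labels below are invented for illustration):

```python
# Examples are tuples of attribute values (invented weather-style data).
X = [("sunny", "weak"), ("rain", "strong"), ("overcast", "weak")]

# The target function f assigns each example a label from a finite set Y.
Y = {"yes", "no"}
f = {("sunny", "weak"): "yes", ("rain", "strong"): "no", ("overcast", "weak"): "yes"}

# The DATA is a set of <example, target function value> pairs.
D = [(x, f[x]) for x in X]

# Learning seeks a hypothesis h that agrees with f, at least on the data.
def h(x):
    return "no" if x[1] == "strong" else "yes"

assert all(h(x) == y for x, y in D)
```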
Attribute-based representations
Decision Tree
Expressiveness of D-Trees
Decision Trees represent disjunctions of conjunctions:
$f(x) = \text{yes}$ iff $(\text{Outlook} = \text{Sunny} \wedge \text{Humidity} = \text{Normal}) \vee (\text{Outlook} = \text{Overcast}) \vee (\text{Outlook} = \text{Rain} \wedge \text{Wind} = \text{Weak})$
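Written out as code (a hypothetical sketch using the attribute names from the formula above), the tree's decision function is exactly this disjunction:

```python
def f(x):
    """Return True iff x satisfies one of the tree's root-to-leaf 'yes' paths.
    x is assumed to be a dict mapping attribute names to values."""
    return ((x["Outlook"] == "Sunny" and x["Humidity"] == "Normal")
            or x["Outlook"] == "Overcast"
            or (x["Outlook"] == "Rain" and x["Wind"] == "Weak"))
```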
Decision Tree Boundaries
A learned decision tree
Choosing an attribute
The more skewed the examples in a bin, the better. We're going to use ENTROPY as a measure of how skewed each bin is.
Counts as probabilities
$P_1$ = probability I will wait for a table; $P_2$ = probability I will NOT wait for a table.
- $P_1 = 0.5$, $P_2 = 0.5$
- $P_1 = 0$, $P_2 = 1$
- $P_1 = 1$, $P_2 = 0$
- $P_1 = 0.333$, $P_2 = 0.667$
Information
About ID3
A recursive, greedy algorithm for building a decision tree. At each step it picks the best variable to split the data on, and then moves on. It is greedy because it makes the optimal choice at the current step, without considering anything beyond that step. This can lead to trouble when a good split requires considering multiple variables jointly. (Try it on XOR; see the sketch below.)
Decision Tree Learning (ID3)
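A minimal Python sketch of this recursion, assuming examples are represented as dicts mapping attribute names to values (the representation and function names are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(examples, labels, attr):
    """Weighted entropy of the subsets produced by splitting on attr."""
    total = 0.0
    for v in set(x[attr] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attr] == v]
        total += len(subset) / len(labels) * entropy(subset)
    return total

def id3(examples, labels, attributes):
    """Return a dict-of-dicts tree: {attribute: {value: subtree-or-leaf}}."""
    if len(set(labels)) == 1:          # pure node: return a leaf label
        return labels[0]
    if not attributes:                 # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the lowest post-split entropy
    # (equivalently, the highest information gain).
    best = min(attributes, key=lambda a: split_entropy(examples, labels, a))
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for v in set(x[best] for x in examples):
        sub = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        tree[best][v] = id3([x for x, _ in sub], [y for _, y in sub], rest)
    return tree
```

On XOR-style data, every single attribute has zero information gain on its own, so the greedy first split is uninformed; with irrelevant attributes present, ID3 may split on them just as readily as on the relevant ones.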
Choosing an attribute in ID3
For each attribute, find the entropy H of the example set AFTER splitting on that attribute. (Note: this means taking the entropy of each subset created by splitting on the attribute, and then combining these entropies, weighted by the size of each subset.) Pick the attribute that creates the lowest overall entropy.
Entropy prior to splitting
$P_1$ = probability I will wait for a table (instances where I waited); $P_2$ = probability I will NOT wait for a table (instances where I didn't). Before splitting, $P_1 = P_2 = 0.5$, so
$$H_0(P_1, P_2) = -\sum_j P_j \log_2 P_j = -P_1 \log_2 P_1 - P_2 \log_2 P_2 = 1$$
If we split on Patrons
$$H_{\text{Patrons}} = W_{\text{none}} H_{\text{none}} + W_{\text{some}} H_{\text{some}} + W_{\text{full}} H_{\text{full}} = \frac{2}{12} \cdot 0 + \frac{4}{12} \cdot 0 + \frac{6}{12}\left(-\frac{2}{6} \log_2 \frac{2}{6} - \frac{4}{6} \log_2 \frac{4}{6}\right) \approx 0.459$$
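A quick check of this arithmetic (a minimal sketch; the counts come from the worked example above):

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Patrons splits the 12 examples into: none (2 examples, all "won't wait"),
# some (4 examples, all "wait"), and full (6 examples: 2 wait, 4 don't).
h_patrons = (2/12) * entropy([1.0]) + (4/12) * entropy([1.0]) \
          + (6/12) * entropy([2/6, 4/6])
print(round(h_patrons, 3))  # 0.459
```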
If we split on Type
$$H_{\text{Type}} = W_{\text{french}} H_{\text{french}} + W_{\text{italian}} H_{\text{italian}} + W_{\text{thai}} H_{\text{thai}} + W_{\text{burger}} H_{\text{burger}} = \frac{2}{12} \cdot 1 + \frac{2}{12} \cdot 1 + \frac{4}{12} \cdot 1 + \frac{4}{12} \cdot 1 = 1$$
Splitting on Patrons gives lower post-split entropy (0.459 vs. 1), so ID3 splits on Patrons.
Measuring Performance
What the learning curve tells us
Rule #2 of Machine Learning
The best hypothesis (i.e., the one that generalizes well) almost never achieves 100% accuracy on the training data.
(Rule #1 was: you can't learn anything without inductive bias.)
Overfitting
Avoiding Overfitting
Approaches:
- Stop splitting when information gain is low or when the split is not statistically significant.
- Grow the full tree and then prune it when done.
How to pick the best tree?
- Performance on training data?
- Performance on validation data?
- Complexity penalty?
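One concrete way to carry out the grow-then-prune approach is reduced-error pruning against a held-out validation set. The sketch below is a minimal, simplified version, assuming the dict-of-dicts trees built by the ID3 sketch earlier; the function names are illustrative, the replacement leaf is the majority label of the validation examples reaching a node, and unseen attribute values are not handled.

```python
from collections import Counter

def classify(tree, x):
    # Walk the tree (or a bare leaf label) down to a leaf.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]
    return tree

def accuracy(tree, xs, ys):
    return sum(classify(tree, x) == y for x, y in zip(xs, ys)) / len(ys)

def prune(tree, val_x, val_y):
    """Bottom-up: replace a subtree with a leaf whenever the leaf is at
    least as accurate on the validation examples that reach this node."""
    if not isinstance(tree, dict) or not val_y:
        return tree
    attr = next(iter(tree))
    for v in list(tree[attr]):
        branch = [(x, y) for x, y in zip(val_x, val_y) if x[attr] == v]
        tree[attr][v] = prune(tree[attr][v],
                              [x for x, _ in branch], [y for _, y in branch])
    leaf = Counter(val_y).most_common(1)[0][0]
    if accuracy(leaf, val_x, val_y) >= accuracy(tree, val_x, val_y):
        return leaf
    return tree
```

Pruning bottom-up means each subtree is already as small as the validation set justifies before its parent is considered for collapse.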
Effect of Reduced Error Pruning
C4.5 Algorithm
Builds a decision tree from labeled training data. Also by Ross Quinlan. Generalizes ID3 by:
- Allowing continuous-valued attributes
- Allowing missing attribute values in examples
- Pruning the tree after building, to improve generality
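For continuous-valued attributes, the usual approach (sketched below; the function name and representation are illustrative, not C4.5's actual code) is to sort the values and consider a binary split at each boundary between adjacent distinct values, keeping the threshold with the lowest weighted entropy:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick a threshold for a continuous attribute: try the midpoint between
    each pair of adjacent distinct sorted values, and keep the one whose
    binary split has the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best_t, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best_t, best_h = t, h
    return best_t

# e.g. best_threshold([48, 60, 72, 80, 90],
#                     ["no", "no", "yes", "yes", "no"])  # -> 66.0
```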
Rule Post-Pruning (used in C4.5)
Steps:
1. Build the decision tree
2. Convert it to a set of logical rules
3. Prune each rule independently
4. Sort rules into desired sequence for use
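A minimal sketch of step 2, again assuming the dict-of-dicts trees from the ID3 sketch: each root-to-leaf path becomes one rule.

```python
def tree_to_rules(tree, path=()):
    """Flatten a tree into rules: each rule is a (preconditions, label)
    pair, where preconditions is a list of (attribute, value) tests."""
    if not isinstance(tree, dict):
        return [(list(path), tree)]
    attr = next(iter(tree))
    rules = []
    for v, subtree in tree[attr].items():
        rules.extend(tree_to_rules(subtree, path + ((attr, v),)))
    return rules
```

Each rule can then be pruned independently (step 3) by deleting preconditions whose removal does not hurt estimated accuracy; this is more flexible than pruning the tree itself, since removing a test node from the tree would affect every path through it.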
Takeaways about decision trees
- Used as classifiers
- Supervised learning algorithms (ID3, C4.5)
- (Mostly) batch processing
- Good for situations where:
  - The classification categories are finite
  - The data can be represented as vectors of attributes
  - You want to be able to UNDERSTAND how the classifier makes its choices