Decision Trees Lewis Fishgold (Material in these slides adapted from Ray Mooney's slides on Decision Trees)
Classification using Decision Trees
Nodes test features; there is one branch for each value of the feature; leaves specify the classification.

color?
  red   -> shape?
             circle   -> +
             square   -> -
             triangle -> -
  blue  -> -
  green -> +

f([color = red, shape = circle, size = big]) = +

Logical view: (red ∧ circle) ∨ green
Geometrical view: the tree represents axis-parallel decision boundaries.
[Figure: shape-by-color grid showing the axis-parallel +/- regions]
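To make this concrete, here is a minimal Python sketch (not from the slides) of the example tree above, encoded as nested dicts, with a classify function that walks from the root to a leaf. The dict representation is my assumption.

# Internal nodes map a feature name to {value: subtree}; leaves are labels.
tree = {
    "color": {
        "red": {"shape": {"circle": "+", "square": "-", "triangle": "-"}},
        "blue": "-",
        "green": "+",
    }
}

def classify(tree, example):
    """Walk from the root to a leaf, following the branch for each feature value."""
    while isinstance(tree, dict):
        feature = next(iter(tree))              # the feature this node tests
        tree = tree[feature][example[feature]]  # follow the matching branch
    return tree

print(classify(tree, {"color": "red", "shape": "circle", "size": "big"}))  # +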
Top-Down Decision Tree Induction
Recursively build a tree by finding good splits and partitioning the examples.

Training examples:
<big, red, circle>:    +
<small, red, circle>:  +
<small, red, square>:  -
<big, blue, circle>:   -
Top-Down Decision Tree Induction
Recursively build a tree by finding good splits and partitioning the examples.

First split, on color:
color?
  red   -> <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -
  blue  -> <big, blue, circle>: -
  green -> (no examples)
Top-Down Decision Tree Induction
Recursively build a tree by finding good splits and partitioning the examples.

Second split, on shape (within the red branch):
color?
  red   -> shape?
             circle   -> +  (<big, red, circle>: +, <small, red, circle>: +)
             square   -> -  (<small, red, square>: -)
             triangle -> -
  blue  -> -  (<big, blue, circle>: -)
  green -> -
Top-Down Decision Tree Induction Pseudocode

DTree(examples, features) returns a tree
  If all examples are in one category, return a leaf node with that category label.
  Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
  Else pick a good feature F and create a node R for it.
  For each possible value v_i of F:
    Let examples_i be the subset of examples that have value v_i for F.
    Add an outgoing edge E to node R labeled with the value v_i.
    If examples_i is empty,
      then attach a leaf node to edge E labeled with the category that is the most common in examples,
      else call DTree(examples_i, features - {F}) and attach the resulting tree as the subtree under edge E.
  Return the subtree rooted at R.
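A runnable Python sketch of this pseudocode, using the nested-dict tree representation from the earlier sketch. The (feature_dict, label) example representation and the pick_feature placeholder are my assumptions, not from the slides.

from collections import Counter

def pick_feature(examples, features):
    # Placeholder: a real implementation picks the feature with the highest
    # information gain (defined two slides later). Here: just the first feature.
    return next(iter(features))

def dtree(examples, features):
    # examples: list of (feature_dict, label) pairs
    # features: dict mapping feature name -> set of possible values
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                        # all in one category
        return labels[0]
    if not features:                                 # no features left
        return Counter(labels).most_common(1)[0][0]  # majority label
    F = pick_feature(examples, features)
    majority = Counter(labels).most_common(1)[0][0]
    remaining = {g: vals for g, vals in features.items() if g != F}
    node = {F: {}}
    for v in features[F]:                            # one branch per value
        subset = [(x, y) for x, y in examples if x[F] == v]
        node[F][v] = majority if not subset else dtree(subset, remaining)
    return node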
Picking good features to split on
Goal: make the resulting tree as small as possible, in accord with Occam's razor.
Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem.
So we use greedy heuristic search, which might find suboptimal solutions.
Heuristic: pick a feature that creates subsets of examples that are relatively pure in a single class, so they are closer to being leaf nodes.
Sounds like a job for information theory.
Entropy
Entropy (i.e., impurity) of a set of examples S, for binary classification:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

where p+ is the fraction of positive examples in S and p- is the fraction of negative examples.
If all examples are in one category, entropy is zero (we define 0 log(0) = 0).
If examples are equally mixed (p+ = p- = 0.5), entropy is at its maximum of 1.
For multi-class problems with c categories, entropy generalizes to:

Entropy(S) = -sum_{i=1..c} p_i log2(p_i)
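The binary and multi-class definitions collapse into one function over class counts. A minimal Python sketch (counting over observed classes means the 0 log 0 case never arises):

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels; classes with count 0 never appear."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+", "+", "-", "-"]))  # 1.0 (evenly mixed)
print(entropy(["+", "+", "+", "+"]))  # 0.0 (pure)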
Entropy Plot for Binary Classification
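The plot shows entropy as a function of p+: zero at the pure extremes (p+ = 0 or 1) and maximal (1 bit) at p+ = 0.5. A quick sketch to reproduce it, assuming numpy and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)  # avoid log2(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
plt.plot(p, H)
plt.xlabel("p+ (fraction of positive examples)")
plt.ylabel("Entropy(S)")
plt.show()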
Information Gain
The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:

Gain(S, F) = Entropy(S) - sum_{v in Values(F)} (|S_v| / |S|) * Entropy(S_v)

where S_v is the subset of S having value v for feature F. The second term is the entropy of each resulting subset, weighted by its relative size.

Example (S has 2+, 2-, so E = 1):
<big, red, circle>:    +
<small, red, circle>:  +
<small, red, square>:  -
<big, blue, circle>:   -

Split on size:  big -> 1+,1- (E=1);        small -> 1+,1- (E=1)
  Gain = 1 - (0.5*1 + 0.5*1) = 0
Split on color: red -> 2+,1- (E=0.918);    blue -> 0+,1- (E=0)
  Gain = 1 - (0.75*0.918 + 0.25*0) = 0.311
Split on shape: circle -> 2+,1- (E=0.918); square -> 0+,1- (E=0)
  Gain = 1 - (0.75*0.918 + 0.25*0) = 0.311
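The worked example can be checked in a few lines of Python. This sketch repeats the entropy function from the previous slide so it runs standalone; the (feature_dict, label) example representation is again my assumption:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    # Gain(S, F) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    gain = entropy([y for _, y in examples])
    for v in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

examples = [
    ({"size": "big",   "color": "red",  "shape": "circle"}, "+"),
    ({"size": "small", "color": "red",  "shape": "circle"}, "+"),
    ({"size": "small", "color": "red",  "shape": "square"}, "-"),
    ({"size": "big",   "color": "blue", "shape": "circle"}, "-"),
]
print(information_gain(examples, "size"))   # 0.0
print(information_gain(examples, "color"))  # ~0.311
print(information_gain(examples, "shape"))  # ~0.311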
Decision Trees in the Real World
In real-world data, we can't expect leaves to be pure: the features might not be adequate for perfect classification.
So the leaves contain probability distributions over classes.
When classifying, we pick the class with the greatest probability in the leaf we reach.
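A small sketch of such a probabilistic leaf, assuming leaves are stored as collections.Counter objects (my choice of representation):

from collections import Counter

leaf = Counter({"+": 8, "-": 2})           # 8 positive, 2 negative examples reached this leaf
p_plus = leaf["+"] / sum(leaf.values())    # P(+ | leaf) = 0.8
prediction = leaf.most_common(1)[0][0]     # predict the most probable class: "+"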
When to stop growing the tree?
Use a (chi-squared) statistical significance test: are the post-split distributions significantly different from the pre-split distribution? If not, stop splitting.
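One possible implementation of this check, assuming SciPy; the 0.05 threshold and the contingency-table formulation (testing whether class counts are independent of the branch, which is equivalent to comparing the branches against the pooled pre-split distribution) are my assumptions, not the slides':

from scipy.stats import chi2_contingency

# Rows are candidate children, columns are class counts (+, -).
children = [[9, 1],   # child 1: 9 positive, 1 negative
            [4, 6]]   # child 2: 4 positive, 6 negative
chi2, p_value, dof, expected = chi2_contingency(children)
if p_value > 0.05:
    print("split not significant; make this node a leaf")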
Relation to other methods

Perceptrons
- Can learn non-axis-parallel decision boundaries
- Can't learn nonlinear decision boundaries
- Better suited for continuous input

SVMs
- Can learn nonlinear functions, but you have to pick a good kernel
- Can efficiently find a global solution to the optimization problem
- Better suited for continuous input

Decision Trees
- Can learn nonlinear decision boundaries (but have trouble with non-axis-parallel boundaries)
- Better suited for discrete input
- Hill climbing can get stuck in local optima
- Automatically perform variable selection
- Classifier is human-readable (sometimes)