Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme (Decision Support Systems), SS 18
Supervised segmentation
An intuitive way of extracting patterns from data in a supervised manner is to segment the population into subgroups that have different values for the target variable (and within which the instances have similar values for the target variable). If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable.
E.g. one such segment expressed in English might be: "Middle-aged professionals who reside in New York City on average have a churn rate of 5%." Here, "middle-aged professionals who reside in New York City" is the definition of the segment (which references some particular attributes) and "a churn rate of 5%" is the predicted value of the target variable for that segment.
The question at hand: how do we know whether a segment is good for describing the target variable, and can we rank the variables by how well they predict it?
Classification using Decision Trees
The figure shows the instance space broken up into regions by horizontal and vertical decision boundaries that partition it into similar regions. Examples in each region should have similar values for the target variable. The main purpose of creating homogeneous regions is that we can predict the target variable of a new, unseen instance by determining which segment it falls into. E.g. if a new customer falls into the lower-left segment, we can conclude that its target value is very likely to be the negative class; similarly, if it falls into the upper-right segment, we can predict its value as +.
Loan Write-Off Example
Consider 12 people represented as stick figures below:
Head-shape: square or circular
Body-shape: rectangular or oval
Body-color: gray or white
Target variable: Yes or No, indicating whether the person becomes a loan write-off
Selecting Informative Attributes
Which of the attributes would be best to segment these people into groups, in a way that will distinguish write-offs from non-write-offs? Ideally, we would like the resulting groups to be as pure as possible, i.e. homogeneous with respect to the target variable. If every member of a group has the same value for the target, the group is pure; if at least one member of the group has a different value for the target variable than the rest, the group is impure.
Selecting Informative Attributes
The most common splitting criterion is called information gain, and it is based on a purity measure called entropy.
Entropy
Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments. Disorder corresponds to how mixed (impure) the segment is with respect to the properties of interest. So, for example, a mixed-up segment with lots of write-offs and lots of non-write-offs would have high entropy.
entropy = -p_1 log2(p_1) - p_2 log2(p_2) - ...
Each p_i is the probability (the relative percentage) of property i within the set, ranging from p_i = 1 (all members of the set have property i) to p_i = 0 (no members of the set have property i).
Entropy
The plot shows the entropy of a set containing 10 instances of two classes, + and -. Starting with all negative instances at the lower left (p+ = 0), the set has minimal disorder (it is pure) and the entropy is zero. If we start to switch class labels of elements of the set from - to +, the entropy increases. Entropy is maximized at 1 when the instance classes are balanced (five of each) and p+ = p- = 0.5. As more class labels are switched, the + class starts to predominate and the entropy lowers again. When all instances are positive, p+ = 1 and entropy is minimal again at zero.
Entropy Example
Consider a set S of 10 people, seven of the non-write-off class and three of the write-off class:
p(non-write-off) = 7/10 = 0.7
p(write-off) = 3/10 = 0.3
entropy(S) = -0.7 log2(0.7) - 0.3 log2(0.3)
           ≈ -0.7 × (-0.51) - 0.3 × (-1.74)
           ≈ 0.88
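The entropy computation above can be checked with a short function (a sketch in Python; the helper name `entropy` is ours, not from the slides):

```python
import math

def entropy(probs):
    """Entropy of a class distribution: -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The set S from the slide: 7 non-write-offs and 3 write-offs out of 10 people.
print(round(entropy([7 / 10, 3 / 10]), 2))  # 0.88
```

A pure set (all one class) gives entropy 0, and a 50/50 split gives the maximum of 1.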
Information Gain
Entropy only tells us how pure or impure a subset is; we would also like to measure how informative an attribute is with respect to the target variable. Information gain measures the change in entropy due to new information being added. In the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute.
Suppose the attribute we split on has k different values. Call the original set the parent set and the split sets the k children sets. The information gain due to this split is:
IG(parent, children) = entropy(parent) - [p(c_1) × entropy(c_1) + p(c_2) × entropy(c_2) + ... + p(c_k) × entropy(c_k)]
Information Gain Example
Consider a parent set of 30 instances being split into two children sets (c_1: Balance < 50K and c_2: Balance >= 50K) based on the attribute Balance:
p(c_1) = 13/30 ≈ 0.43
p(c_2) = 17/30 ≈ 0.57
For the parent set, p(write-off) = 16/30 and p(non-write-off) = 14/30, so:
entropy(parent) = -p(write-off) log2 p(write-off) - p(non-write-off) log2 p(non-write-off)
                ≈ -0.53 × (-0.9) - 0.47 × (-1.1)
                ≈ 0.99 (very impure)
Information Gain Example
entropy(parent) ≈ 0.99, p(c_1) = 13/30 ≈ 0.43, p(c_2) = 17/30 ≈ 0.57
The entropy of the left child (c_1) is:
entropy(Balance < 50K) = -p(write-off) log2 p(write-off) - p(non-write-off) log2 p(non-write-off)
                       ≈ -0.92 × (-0.12) - 0.08 × (-3.7)
                       ≈ 0.39
The entropy of the right child (c_2) is:
entropy(Balance >= 50K) ≈ -0.24 × (-2.1) - 0.76 × (-0.39)
                        ≈ 0.79
Information gain can then be calculated:
IG = entropy(parent) - [p(Balance < 50K) × entropy(Balance < 50K) + p(Balance >= 50K) × entropy(Balance >= 50K)]
   ≈ 0.99 - (0.43 × 0.39 + 0.57 × 0.79)
   ≈ 0.37
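The same split can be computed from the raw class counts (a sketch; the function name `information_gain` and the count layout are our own conventions). The left child holds 12 write-offs and 1 non-write-off, the right child 4 and 13:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_counts, children_counts):
    """IG = entropy(parent) - weighted sum of the children's entropies."""
    n = sum(parent_counts)
    parent_e = entropy([c / n for c in parent_counts])
    weighted = sum(
        (sum(child) / n) * entropy([c / sum(child) for c in child])
        for child in children_counts
    )
    return parent_e - weighted

# Parent: 16 write-offs, 14 non-write-offs.
# Children from splitting on Balance: (<50K: 12/1) and (>=50K: 4/13).
ig = information_gain([16, 14], [[12, 1], [4, 13]])
print(round(ig, 2))  # ~0.38 with exact arithmetic (the slide's 0.37 uses rounded intermediates)
```

A perfectly separating split would give an information gain equal to the parent's entropy.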
Attribute Selection with Information Gain (Mushroom Example)
We will try to find the attribute that is most informative with respect to estimating the value of the target variable; we can also rank a set of attributes by their informativeness. We take a dataset with 20-odd attributes related to mushrooms (such as cap shape, veil type, and stalk color), where the target variable is whether the mushroom is edible or not.
Information gain was calculated by splitting the parent set into child sets for each attribute. The entropy of each such split is represented by an entropy graph, where the x-axis is the proportion of the dataset (0 to 1) and the y-axis is the entropy (also 0 to 1) of a given piece of the data. The amount of shaded area in each graph represents the amount of entropy in the dataset.
Attribute Selection with Information Gain
[Entropy graphs: parent set (entropy = 0.96); splits on the attributes GILL COLOR, SPORE PRINT COLOR, and ODOR]
Attribute Selection with Information Gain
The letters in each graph represent different values of the attribute on which the dataset has been split. It can be seen that ODOR has the highest information gain of any attribute in the Mushroom dataset: it reduces the dataset's total entropy to about 0.1, which gives it an information gain of 0.96 - 0.1 = 0.86. We can infer that many odors are completely characteristic of poisonous or edible mushrooms, so odor is a very informative attribute to check when considering mushroom edibility. Thus, if we want to build a model to determine mushroom edibility using only a single feature, we should choose odor.
Supervised Segmentation with Tree-Structured Models
Consider a segmentation of the data that takes the form of a tree, drawn upside down with the root at the top. The tree is made up of nodes (interior nodes and terminal nodes) and branches emanating from the interior nodes. Each interior node contains a test of an attribute, with each branch from the node representing a distinct value of the attribute. Following the branches from the root node down, each path eventually terminates at a terminal node, or leaf. The tree creates a segmentation of the data: every data point corresponds to one and only one path in the tree, and thereby to one and only one leaf. When each leaf contains a classification of its segment, the tree is referred to as a classification tree.
Supervised Segmentation with Tree-Structured Models
Consider the 12 stick-people example. Tree induction takes a divide-and-conquer approach: starting with the whole dataset, it applies variable selection to try to create the purest subgroups possible using the attributes. One way is to separate people based on their body type, rectangular versus oval. The rectangular-body people on the left are mostly Yes, with a single No person, so that group is mostly pure. Similarly, the oval-body group on the right is mostly No people, but with two Yes people. Doing this recursively gives the four pure segments shown.
Visualizing Segmentation
Classification trees can be visualized as segments of the instance space (the space described by the data features). A common form of instance-space visualization is a scatterplot on some pair of features, used to compare one variable against another to detect correlations and relationships. Consider a classification tree predicting write-off probability based on features like age and account balance; it can be represented as segments, as shown in the figure.
Visualizing segmentation The black dots correspond to instances of the class Write-off, the plus signs correspond to instances of class non-write-off. The shading shows how the tree leaves correspond to segments of the population in instance space.
Trees as Sets of Rules
Trees can also be represented as logical statements. If one traces down a single path from the root node to a leaf, collecting the conditions along the way, a rule can be generated. Considering the tree in the last slide and starting at the root node, choosing the left branches of the tree gives the rule:
IF (Balance < 50K) AND (Age < 50) THEN Class = Write-off
Doing this for all the leaves, we get three more rules:
IF (Balance < 50K) AND (Age >= 50) THEN Class = No Write-off
IF (Balance >= 50K) AND (Age < 45) THEN Class = Write-off
IF (Balance >= 50K) AND (Age >= 45) THEN Class = No Write-off
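Because each leaf corresponds to one rule, the whole tree can be encoded directly as nested conditions (a sketch; the function name and the example inputs are our own, following the book's rule set with the last rule's condition read as Age >= 45):

```python
def predict_write_off(balance, age):
    """The four classification-tree rules encoded as nested conditions."""
    if balance < 50_000:
        return "Write-off" if age < 50 else "No Write-off"
    else:
        return "Write-off" if age < 45 else "No Write-off"

print(predict_write_off(30_000, 40))  # Write-off
print(predict_write_off(80_000, 60))  # No Write-off
```

This illustrates why trees are easy to interpret: every prediction comes with the exact conjunction of conditions that produced it.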
Probability Estimation
It is often preferable to get a more informative prediction than just a classification. E.g. if a company is trying to predict the churn of its customers, it would much rather have an estimate of the probability that each one will leave. This can help in many ways, such as allocating the incentive budget to the instances with the highest expected loss. This can be done in classification trees by using the instance counts at each leaf to compute a class-probability estimate: if a leaf contains n positive instances and m negative instances, the probability of any new instance in that leaf being positive may be estimated as n/(n + m). This is called a frequency-based estimate of class-membership probability.
Now consider a leaf with only one instance. By this formula, we would claim a 100% probability that members of that segment have the class of that single instance.
Probability Estimation
To overcome this problem, a smoothed version of the frequency-based estimate, called the Laplace correction, is used:
p(c) = (n + 1) / (n + m + 2)
where n is the number of instances in the leaf belonging to class c and m is the number belonging to the other class. For the single-instance leaf described earlier, the probability is now (1 + 1) / (1 + 0 + 2) = 2/3 ≈ 0.67 rather than 100%. As the number of instances increases, the Laplace equation converges to the frequency-based estimate.
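The Laplace correction is one line of code (a sketch; the function name is ours):

```python
def laplace_estimate(n, m):
    """Laplace-corrected probability of the positive class for a leaf
    with n positive and m negative instances: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf holding a single positive instance: 2/3, not the frequency-based 100%.
print(round(laplace_estimate(1, 0), 2))  # 0.67
# With many instances the estimate approaches the raw frequency n / (n + m).
print(round(laplace_estimate(700, 300), 2))  # 0.7
```

Note that an empty leaf gets the uninformative prior 0.5, which is exactly the behavior we want when there is no evidence.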
Addressing the Churn problem through Decision Tree Consider a problem wherein we try to predict the churn probability of cell phone subscribers The attributes of the dataset of 20000 customers are as below:
Addressing the Churn problem through Decision Tree
To start building a classification tree, the information gain obtained by dividing the dataset on each attribute was calculated. The feature with the highest information gain (HOUSE) will be at the root of the tree. The classification tree for the dataset can be seen on the next slide. Notice that the order in which features are chosen for the tree doesn't exactly correspond to their ranking by information gain.
Addressing the Churn problem through Decision Tree This is because the table ranks each feature by how good it is independently, evaluated separately on the entire population of instances. Nodes in a classification tree depend on the instances above them in the tree.
Playing Tennis Example (Quinlan 1986)
Day    Outlook   Temperature  Humidity  Windy  Play Tennis
Day1   Sunny     Hot          High      False  N
Day2   Sunny     Hot          High      True   N
Day3   Overcast  Hot          High      False  P
Day4   Rain      Mild         High      False  P
Day5   Rain      Cool         Normal    False  P
Day6   Rain      Cool         Normal    True   N
Day7   Overcast  Cool         Normal    True   P
Day8   Sunny     Mild         High      False  N
Day9   Sunny     Cool         Normal    False  P
Day10  Rain      Mild         Normal    False  P
Day11  Sunny     Mild         Normal    True   P
Day12  Overcast  Mild         High      True   P
Day13  Overcast  Hot          Normal    False  P
Day14  Rain      Mild         High      True   N
A Simple Decision Tree
A Complex Decision Tree
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
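The greedy, recursive procedure above can be sketched in a few dozen lines (a minimal illustration with categorical attributes; all function names and the toy data are our own, and real implementations add pruning and continuous-attribute handling):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the least remaining entropy
    (equivalently, the highest information gain)."""
    def info_after(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(subset) / len(labels) * entropy(subset)
        return total
    return min(attributes, key=info_after)

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:              # all samples in one class: leaf
        return labels[0]
    if not attributes:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    tree = {}
    for v in set(r[a] for r in rows):       # partition recursively on a's values
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        tree[v] = build_tree([rows[i] for i in idx],
                             [labels[i] for i in idx],
                             [x for x in attributes if x != a])
    return (a, tree)

# Toy demo: attribute "a" perfectly separates the classes, so it becomes the root.
rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
tree = build_tree(rows, ["yes", "yes", "no", "no"], ["a", "b"])
print(tree[0])  # a
```

The three stopping conditions from the slide appear directly: a pure node returns a class label, an exhausted attribute list triggers majority voting, and recursion only descends into non-empty partitions.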
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D| (the proportion of C_i in the sample).
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -Σ_{i=1..m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
Class P: play tennis = yes (9 tuples); Class N: play tennis = no (5 tuples)
Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Splitting on Outlook:
Outlook   #Yes (P_i)  #No (N_i)  I(P_i, N_i)
Sunny     2           3          0.971
Overcast  4           0          0
Rain      3           2          0.971
Info_outlook(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Here (5/14) I(2,3) means Outlook = Sunny has 5 of the 14 samples, with 2 yes's and 3 no's.
Hence, Gain(outlook) = Info(D) - Info_outlook(D) = 0.246
Similarly, Gain(temperature) = 0.029, Gain(humidity) = 0.151, Gain(windy) = 0.048
Computing Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
- The point with the minimum expected information requirement for A is selected as the split point for A
Split: D_1 is the set of tuples in D satisfying A <= split-point, and D_2 is the set of tuples in D satisfying A > split-point.
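Generating the candidate split points is the mechanical part of this procedure (a sketch; the function name and the sample values are ours):

```python
def candidate_split_points(values):
    """Midpoints between adjacent distinct sorted values of a continuous attribute."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

# Duplicate values collapse, so 5 raw values yield 3 candidate split points.
print(candidate_split_points([70, 64, 68, 70, 72]))  # [66.0, 69.0, 71.0]
```

Each candidate would then be scored by the expected information (Info_A) of the resulting two-way split, and the minimizer chosen.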
Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
SplitInfo_A(D) = -Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. SplitInfo_temperature(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
gain_ratio(temperature) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
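The temperature example works out as follows (a sketch; the function name is ours):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes)

# Temperature splits the 14 tennis examples into 4 hot, 6 mild, and 4 cool.
si = split_info([4, 6, 4])
print(round(si, 3))          # 1.557
print(round(0.029 / si, 3))  # 0.019
```

Note that SplitInfo ignores the class labels entirely; it penalizes attributes just for fragmenting the data into many partitions, which is exactly the bias C4.5 is correcting.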
Gini Index (used by CART)
If a data set D contains examples from n classes, the Gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1..n} p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D_1 and D_2, the Gini index of the split is defined as
gini_A(D) = (|D_1| / |D|) gini(D_1) + (|D_2| / |D|) gini(D_2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
The attribute providing the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (this requires enumerating all the possible splitting points for each attribute).
Computation of Gini Index
Ex. D has 9 tuples with play_tennis = yes and 5 with no:
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute temperature partitions D into 10 tuples in D_1: {cool, mild} and 4 in D_2: {hot}:
gini_temperature in {cool,mild}(D) = (10/14) gini(D_1) + (4/14) gini(D_2) = 0.443
The gini for the split {cool, hot} is 0.458 and for {mild, hot} is 0.450. Thus, we split on {cool, mild} (vs. {hot}) since it has the lowest Gini index.
All attributes are assumed continuous-valued; other tools, e.g. clustering, may be needed to get the possible split values. The method can be modified for categorical attributes.
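The Gini numbers above can be verified from the class counts (a sketch; the function names and count layout are ours):

```python
def gini(counts):
    """gini(D) = 1 - sum(p_j^2) over class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(child_counts):
    """Weighted Gini of a split; each child is a list of class counts."""
    n = sum(sum(c) for c in child_counts)
    return sum(sum(c) / n * gini(c) for c in child_counts)

# Tennis data overall: 9 yes / 5 no.
print(round(gini([9, 5]), 3))  # 0.459
# Split on temperature {cool, mild} (7 yes / 3 no) vs {hot} (2 yes / 2 no).
print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
```

As with entropy, a pure node has Gini index 0, so lower weighted Gini means a better split.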
Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions
Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data, growing too many branches, some of which reflect anomalies due to noise or outliers, leading to poor accuracy on unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which pruned tree is best
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication
Sources
F. Provost and T. Fawcett, Data Science for Business
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques
Quinlan, J. Ross. "Induction of decision trees." Machine Learning 1.1 (1986): 81-106.