Chapter ML:III: Decision Trees

- Decision Trees Basics
- Impurity Functions
- Decision Tree Algorithms
- Decision Tree Pruning

ML:III-34 Decision Trees STEIN/LETTMANN 2005-2017
Splitting

Let t be a leaf node of an incomplete decision tree, and let D(t) be the subset of the example set D that is represented by t. [Illustration]

Possible criteria for deciding whether to split X(t) further:

1. Size of D(t). D(t) will not be partitioned further if the number of examples, |D(t)|, is below a certain threshold.
2. Purity of D(t). D(t) will not be partitioned further if all examples in D(t) are members of the same class.
3. Ockham's Razor. D(t) will not be partitioned further if the resulting decision tree is not improved significantly by the splitting.
Splitting (continued)

Let D be a set of examples over a feature space X and a set of classes C = {c_1, c_2, c_3, c_4}.

[Illustration: distribution of D for two possible splittings (a) and (b) of X; classes c_1, c_2, c_3, c_4]

Splitting (a) minimizes the impurity of the subsets of D in the leaf nodes and should be preferred over splitting (b). This argument presumes that the misclassification costs are independent of the classes.

The impurity is a function defined on P(D), the set of all subsets of an example set D.
Definition 4 (Impurity Function ι)

Let k ∈ ℕ. An impurity function ι : [0, 1]^k → ℝ is a partial function defined on the standard (k−1)-simplex, denoted Δ^{k−1}, for which the following properties hold:

(a) ι becomes minimum at the points (1, 0, …, 0), (0, 1, …, 0), …, (0, …, 0, 1).
(b) ι is symmetric with regard to its arguments p_1, …, p_k.
(c) ι becomes maximum at the point (1/k, …, 1/k).
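Properties (a)-(c) can be checked numerically for a concrete candidate. A minimal sketch (the helper name `iota_misclass` is illustrative) using the misclassification-based impurity 1 − max(p), which is treated later in this section:

```python
from itertools import permutations

def iota_misclass(p):
    """A simple impurity function on the simplex: 1 - max(p_1, ..., p_k)."""
    return 1 - max(p)

k = 3

# (a) minimum (value 0) at the corner points (1,0,0), (0,1,0), (0,0,1)
corners = [tuple(1.0 if i == j else 0.0 for i in range(k)) for j in range(k)]
assert all(iota_misclass(p) == 0.0 for p in corners)

# (b) symmetry with regard to the arguments p_1, ..., p_k
p = (0.5, 0.3, 0.2)
assert all(iota_misclass(q) == iota_misclass(p) for q in permutations(p))

# (c) maximum at the uniform distribution (1/k, ..., 1/k)
assert iota_misclass((1/3, 1/3, 1/3)) >= iota_misclass(p)
```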
Definition 5 (Impurity of an Example Set ι(D))

Let D be a set of examples, let C = {c_1, …, c_k} be a set of classes, and let c : X → C be the ideal classifier for X. Moreover, let ι : [0, 1]^k → ℝ be an impurity function. Then the impurity of D, denoted as ι(D), is defined as follows:

ι(D) = ι( |{(x, c(x)) ∈ D : c(x) = c_1}| / |D|, …, |{(x, c(x)) ∈ D : c(x) = c_k}| / |D| )

Definition 6 (Impurity Reduction Δι)

Let D_1, …, D_s be a partitioning of an example set D, which is induced by a splitting of a feature space X. Then the resulting impurity reduction, denoted as Δι(D, {D_1, …, D_s}), is defined as follows:

Δι(D, {D_1, …, D_s}) = ι(D) − Σ_{j=1}^{s} (|D_j| / |D|) · ι(D_j)
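Definitions 5 and 6 translate directly into code. A minimal sketch (function names are illustrative, not from the slides) that represents an example set by its list of class labels:

```python
from collections import Counter

def class_distribution(labels):
    """Relative class frequencies of an example set, i.e. its point in the simplex."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def impurity_reduction(iota, labels, parts):
    """Delta-iota(D, {D_1, ..., D_s}) = iota(D) - sum_j |D_j|/|D| * iota(D_j)."""
    n = len(labels)
    weighted = sum(len(part) / n * iota(class_distribution(part)) for part in parts)
    return iota(class_distribution(labels)) - weighted

# usage with the misclassification-based impurity as iota:
iota = lambda p: 1 - max(p)
D = ['c1'] * 4 + ['c2'] * 4
pure_split = [['c1'] * 4, ['c2'] * 4]
print(impurity_reduction(iota, D, pure_split))   # 0.5
```

Splitting into two pure subsets removes all impurity; splitting into subsets that replicate the class distribution of D removes none.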
Remarks:

- The standard (k−1)-simplex comprises all k-tuples with non-negative elements that sum to 1: Δ^{k−1} = {(p_1, …, p_k) ∈ ℝ^k : Σ_{i=1}^{k} p_i = 1 and p_i ≥ 0 for all i}.
- Observe the different domains of the impurity function ι in Definitions 4 and 5, namely [0, 1]^k and P(D). The domains correspond to each other: the set of examples, D, defines via its class portions an element from [0, 1]^k, and vice versa.
- The properties in the definition of ι suggest minimizing the external path length of T with respect to D in order to minimize the overall impurity characteristics of T.
- Within the DT-construct algorithm, usually a greedy strategy (local optimization) is employed to minimize the overall impurity characteristics of a decision tree T.
Impurity Functions Based on the Misclassification Rate

Definition for two classes:

ι_misclass(p_1, p_2) = 1 − max{p_1, p_2} = p_1 if 0 ≤ p_1 ≤ 0.5, and 1 − p_1 otherwise

ι_misclass(D) = 1 − max{ |{(x, c(x)) ∈ D : c(x) = c_1}| / |D|, |{(x, c(x)) ∈ D : c(x) = c_2}| / |D| }

[Graph of ι_misclass(p_1, 1 − p_1): a tent over [0, 1] with maximum 0.5 at p_1 = p_2 = 0.5. Graph: Entropy, Gini]
Impurity Functions Based on the Misclassification Rate (continued)

Definition for k classes:

ι_misclass(p_1, …, p_k) = 1 − max_{i=1,…,k} p_i

ι_misclass(D) = 1 − max_{c ∈ C} |{(x, c(x)) ∈ D : c(x) = c}| / |D|
Impurity Functions Based on the Misclassification Rate (continued)

Problems:

- Δι_misclass = 0 may hold for all possible splittings.
- The impurity function that is induced by the misclassification rate underestimates pure nodes, as illustrated by splitting (b):

[Illustration: two splittings (a) and (b) of an example set with classes c_1, c_2]

Δι_misclass = ι_misclass(D) − ( (|D_1| / |D|) · ι_misclass(D_1) + (|D_2| / |D|) · ι_misclass(D_2) )

Splitting (a): Δι_misclass = 1/2 − (1/2 · 1/4 + 1/2 · 1/4) = 1/4
Splitting (b): Δι_misclass = 1/2 − (3/4 · 1/3 + 1/4 · 0) = 1/4
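The two computations can be replayed in code. A sketch assuming |D| = 8, so that the stated fractions correspond to integral counts (the concrete counts are reconstructed from the fractions, not given on the slide):

```python
def iota_misclass(labels):
    """1 minus the largest relative class frequency."""
    n = len(labels)
    return 1 - max(labels.count(c) / n for c in set(labels))

def delta_iota(labels, parts):
    """Impurity reduction of a partitioning, weighted by subset sizes."""
    n = len(labels)
    return iota_misclass(labels) - sum(len(p) / n * iota_misclass(p) for p in parts)

D = ['c1'] * 4 + ['c2'] * 4                       # iota_misclass(D) = 1/2

# splitting (a): two balanced children, each with a 3:1 class majority
split_a = [['c1'] * 3 + ['c2'], ['c1'] + ['c2'] * 3]
# splitting (b): one impure child (4:2) and one pure child
split_b = [['c1'] * 4 + ['c2'] * 2, ['c2'] * 2]

print(delta_iota(D, split_a), delta_iota(D, split_b))   # both approximately 1/4
```

Despite containing a pure node, splitting (b) receives the same score as splitting (a), which is the underestimation problem described above.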
Definition 7 (Strict Impurity Function)

Let ι : [0, 1]^k → ℝ be an impurity function and let p, p′ ∈ Δ^{k−1}. Then ι is called strict if it is strictly concave, i.e., if Property (c) is strengthened to (c′):

(c′) ι(λp + (1 − λ)p′) > λ · ι(p) + (1 − λ) · ι(p′), where 0 < λ < 1 and p ≠ p′

Lemma 8

Let ι be a strict impurity function and let D_1, …, D_s be a partitioning of an example set D, which is induced by a splitting of a feature space X. Then the following inequality holds:

Δι(D, {D_1, …, D_s}) ≥ 0

Equality holds iff for all i ∈ {1, …, k} and j ∈ {1, …, s}:

|{(x, c(x)) ∈ D : c(x) = c_i}| / |D| = |{(x, c(x)) ∈ D_j : c(x) = c_i}| / |D_j|
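Lemma 8 also explains the problem noted for the misclassification rate: a strict impurity function such as the Gini index (introduced below) yields Δι = 0 only when every child replicates the class distribution of the parent, and a strictly positive reduction otherwise. A small numeric sketch with hypothetical counts:

```python
def iota_gini(labels):
    """Gini impurity: 1 - sum of squared relative class frequencies (strictly concave)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def delta_iota(iota, labels, parts):
    """Impurity reduction of a partitioning, weighted by subset sizes."""
    n = len(labels)
    return iota(labels) - sum(len(p) / n * iota(p) for p in parts)

D = ['c1'] * 4 + ['c2'] * 4

# children replicate the class distribution of D -> equality case of Lemma 8
proportional = [['c1'] * 2 + ['c2'] * 2, ['c1'] * 2 + ['c2'] * 2]
# children deviate from the distribution of D -> strictly positive reduction
skewed = [['c1'] * 3 + ['c2'], ['c1'] + ['c2'] * 3]

print(delta_iota(iota_gini, D, proportional))   # 0.0
print(delta_iota(iota_gini, D, skewed))         # 0.125
```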
Remarks:

- Strict concavity entails Property (c) of the impurity function definition.
- For two classes, strict concavity implies ι(p_1, 1 − p_1) > 0 for 0 < p_1 < 1.
- If ι is twice differentiable, strict concavity is equivalent to a negative definite Hessian of ι.
- With properly chosen coefficients, polynomials of second degree fulfill properties (a) and (b) of the impurity function definition as well as strict concavity. See the impurity functions based on the Gini index in this regard.
- The impurity function that is induced by the misclassification rate is concave, but not strictly concave.
- The proof of Lemma 8 exploits the strict concavity of ι.
Impurity Functions Based on Entropy

Definition 9 (Entropy)

Let A denote an event and let P(A) denote the occurrence probability of A. Then the entropy (self-information, information content) of A is defined as −log_2(P(A)).

Let A be an experiment with the exclusive outcomes (events) A_1, …, A_k. Then the mean information content of A, denoted as H(A), is called Shannon entropy or entropy of experiment A and is defined as follows:

H(A) = −Σ_{i=1}^{k} P(A_i) · log_2(P(A_i))
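A direct transcription of the Shannon entropy, with the convention 0 · log_2(0) = 0 handled by skipping zero probabilities (a sketch; the function name is illustrative):

```python
from math import log2

def shannon_entropy(probs):
    """H(A) = -sum_i P(A_i) * log2(P(A_i)), with 0 * log2(0) taken as 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))                 # 1.0  (fair coin: one bit)
print(shannon_entropy([1.0]))                      # 0.0  (certain outcome)
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0  (fair four-sided die)
```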
Remarks:

- The smaller the occurrence probability of an event, the larger its entropy. An event that is certain has zero entropy.
- The Shannon entropy combines the entropies of an experiment's outcomes, using the outcome probabilities as weights.
- In the entropy definition we stipulate the identity 0 · log_2(0) = 0.
Impurity Functions Based on Entropy (continued)

Definition 10 (Conditional Entropy, Information Gain)

Let A be an experiment with the exclusive outcomes (events) A_1, …, A_k, and let B be another experiment with the outcomes B_1, …, B_s. Then the conditional entropy of the combined experiment (A, B) is defined as follows:

H(A | B) = Σ_{j=1}^{s} P(B_j) · H(A | B_j),  where  H(A | B_j) = −Σ_{i=1}^{k} P(A_i | B_j) · log_2(P(A_i | B_j))

The information gain due to experiment B is defined as follows:

H(A) − H(A | B) = H(A) − Σ_{j=1}^{s} P(B_j) · H(A | B_j)
Remarks [Bayes for classification]:

- Information gain is defined as reduction in entropy.
- In the context of decision trees, experiment A corresponds to classifying feature vector x with regard to the target concept. A possible question whose answer will inform us about which event A_i occurred is: Does x belong to class c_i?
- Likewise, experiment B corresponds to evaluating feature B of feature vector x. A possible question whose answer will inform us about which event B_j occurred is: Does x have value b_j for feature B?
- Rationale: Typically, the events "target concept class" and "feature value" are statistically dependent. Hence, the entropy of the event c(x) will become smaller if we learn the value of some feature of x (recall that the class of x is unknown). We experience an information gain with regard to the outcome of experiment A, which is rooted in our information about the outcome of experiment B.
- Under no circumstances will the information gain be negative; it is zero iff the involved events are statistically independent: P(A_i) = P(A_i | B_j) for all i ∈ {1, …, k}, j ∈ {1, …, s}, which leads to a split of the kind characterized in the equality case of Lemma 8.
Remarks (continued):

- Since H(A) is constant, the feature that provides the maximum information gain (= the maximally informative feature) is found by minimizing H(A | B). The expanded form of H(A | B) reads as follows:

H(A | B) = −Σ_{j=1}^{s} P(B_j) · Σ_{i=1}^{k} P(A_i | B_j) · log_2(P(A_i | B_j))
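The minimization of H(A | B) can be sketched as follows; the distributions in the usage example are hypothetical, chosen to be consistent (P(A_i) = Σ_j P(B_j) · P(A_i | B_j)):

```python
from math import log2

def entropy(probs):
    """H = -sum p * log2(p), with 0 * log2(0) = 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

def conditional_entropy(p_b, p_a_given_b):
    """H(A|B) = sum_j P(B_j) * H(A|B_j); p_a_given_b[j] lists P(A_1|B_j), ..., P(A_k|B_j)."""
    return sum(pb * entropy(dist) for pb, dist in zip(p_b, p_a_given_b))

# hypothetical two-class problem with a binary feature B:
p_a = [0.5, 0.5]
p_b = [0.5, 0.5]
p_a_given_b = [[0.75, 0.25], [0.25, 0.75]]    # B is informative about A

gain = entropy(p_a) - conditional_entropy(p_b, p_a_given_b)
print(round(gain, 4))   # about 0.19 bits
```

With P(A_i | B_j) = P(A_i) for all i, j (independence), the gain collapses to zero.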
Impurity Functions Based on Entropy (continued)

Definition for two classes:

ι_entropy(p_1, p_2) = −(p_1 · log_2(p_1) + p_2 · log_2(p_2))

ι_entropy(D) = −( (|{(x, c(x)) ∈ D : c(x) = c_1}| / |D|) · log_2(|{(x, c(x)) ∈ D : c(x) = c_1}| / |D|)
              + (|{(x, c(x)) ∈ D : c(x) = c_2}| / |D|) · log_2(|{(x, c(x)) ∈ D : c(x) = c_2}| / |D|) )

[Graph of ι_entropy(p_1, 1 − p_1): maximum 1.0 at p_1 = 0.5. Graph: Misclassification, Gini]
Impurity Functions Based on Entropy (continued)

[3D graph of ι_entropy(p_1, p_2, 1 − p_1 − p_2) over the standard 2-simplex]
Impurity Functions Based on Entropy (continued)

Definition for k classes:

ι_entropy(p_1, …, p_k) = −Σ_{i=1}^{k} p_i · log_2(p_i)

ι_entropy(D) = −Σ_{i=1}^{k} (|{(x, c(x)) ∈ D : c(x) = c_i}| / |D|) · log_2(|{(x, c(x)) ∈ D : c(x) = c_i}| / |D|)
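For an example set given as a list of class labels, the k-class entropy impurity can be computed from the class counts (a sketch; the names are illustrative):

```python
from math import log2
from collections import Counter

def iota_entropy(labels):
    """Entropy impurity of an example set, using relative class frequencies."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

print(iota_entropy(['c1', 'c1', 'c2', 'c2']))   # 1.0  (balanced two-class set)
print(iota_entropy(['c1', 'c1', 'c1', 'c1']))   # 0.0  (pure set)
```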
Impurity Functions Based on Entropy (continued)

Δι_entropy corresponds to the information gain H(A) − H(A | B):

Δι_entropy = ι_entropy(D) − Σ_{j=1}^{s} (|D_j| / |D|) · ι_entropy(D_j)

where ι_entropy(D) plays the role of H(A) and the weighted sum plays the role of H(A | B).

Legend:

- A_i, i = 1, …, k, denotes the event that x ∈ X(t) belongs to class c_i. The experiment A corresponds to the classification c : X(t) → C.
- B_j, j = 1, …, s, denotes the event that x ∈ X(t) has value b_j for feature B. The experiment B corresponds to evaluating feature B and entails the following splitting:
  X(t) = X(t_1) ∪ … ∪ X(t_s) = {x ∈ X(t) : x|_B = b_1} ∪ … ∪ {x ∈ X(t) : x|_B = b_s}
- ι_entropy(D) = ι_entropy(P(A_1), …, P(A_k)) = −Σ_{i=1}^{k} P(A_i) · log_2(P(A_i)) = H(A)
- (|D_j| / |D|) · ι_entropy(D_j) = P(B_j) · ι_entropy(P(A_1 | B_j), …, P(A_k | B_j)), j = 1, …, s
- P(A_i), P(B_j), and P(A_i | B_j) are estimated as relative frequencies based on D.
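This correspondence is exactly the greedy splitting step of ID3-style decision tree construction: evaluate each candidate feature B and keep the one maximizing Δι_entropy. A sketch on a small hypothetical example set (feature names and labels are invented for illustration):

```python
from math import log2
from collections import Counter, defaultdict

def iota_entropy(labels):
    """Entropy impurity via relative class frequencies."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, feature):
    """Delta-iota_entropy of splitting `examples` on `feature`."""
    labels = [cls for _, cls in examples]
    groups = defaultdict(list)
    for x, cls in examples:
        groups[x[feature]].append(cls)
    weighted = sum(len(g) / len(labels) * iota_entropy(g) for g in groups.values())
    return iota_entropy(labels) - weighted

D = [({'outlook': 'sunny', 'windy': True},  'no'),
     ({'outlook': 'sunny', 'windy': False}, 'no'),
     ({'outlook': 'rainy', 'windy': True},  'yes'),
     ({'outlook': 'rainy', 'windy': False}, 'yes')]

best = max(['outlook', 'windy'], key=lambda f: info_gain(D, f))
print(best)   # 'outlook' separates the classes perfectly
```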
Impurity Functions Based on the Gini Index

Definition for two classes:

ι_Gini(p_1, p_2) = 1 − (p_1^2 + p_2^2) = 2 · p_1 · p_2

ι_Gini(D) = 2 · (|{(x, c(x)) ∈ D : c(x) = c_1}| / |D|) · (|{(x, c(x)) ∈ D : c(x) = c_2}| / |D|)

[Graph of ι_Gini(p_1, 1 − p_1): maximum 0.25 at p_1 = 0.5. Graph: Misclassification, Entropy]
Impurity Functions Based on the Gini Index (continued)

Definition for k classes:

ι_Gini(p_1, …, p_k) = 1 − Σ_{i=1}^{k} p_i^2

ι_Gini(D) = ( (Σ_{i=1}^{k} |{(x, c(x)) ∈ D : c(x) = c_i}|)^2 − Σ_{i=1}^{k} |{(x, c(x)) ∈ D : c(x) = c_i}|^2 ) / |D|^2
          = 1 − Σ_{i=1}^{k} ( |{(x, c(x)) ∈ D : c(x) = c_i}| / |D| )^2
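The k-class Gini impurity in code (a sketch; the helper name is illustrative):

```python
from collections import Counter

def iota_gini(labels):
    """Gini impurity: 1 - sum over classes of the squared relative frequency."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(iota_gini(['c1', 'c2']))                # 0.5    (two balanced classes)
print(iota_gini(['c1', 'c1', 'c1', 'c1']))    # 0.0    (pure set)
print(iota_gini(['c1', 'c1', 'c2', 'c3']))    # 0.625
```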