Chapter ML:III
III. Decision Trees
  Decision Trees Basics
  Impurity Functions
  Decision Tree Algorithms
  Decision Tree Pruning

ML:III-34 Decision Trees  STEIN/LETTMANN 2005-2017

Splitting

Let t be a leaf node of an incomplete decision tree, and let D(t) be the subset of the example set D that is represented by t. [Illustration]

Possible criteria for deciding whether X(t) should be split further:

1. Size of D(t). D(t) will not be partitioned further if the number of examples, |D(t)|, is below a certain threshold.

2. Purity of D(t). D(t) will not be partitioned further if all examples in D(t) are members of the same class.

3. Ockham's Razor. D(t) will not be partitioned further if the resulting decision tree is not improved significantly by the splitting.
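The three criteria above translate directly into a stopping test that a tree-construction procedure can apply at each leaf. The following Python sketch is not part of the slides; the function name, the representation of examples as (x, label) pairs, and the threshold parameters are illustrative assumptions.

```python
from collections import Counter

def stop_splitting(examples, min_size=5, min_improvement=1e-3, best_gain=None):
    """Hypothetical stopping test for a leaf t with example set D(t)."""
    # 1. Size of D(t): do not split if the leaf holds too few examples.
    if len(examples) < min_size:
        return True
    # 2. Purity of D(t): do not split if all examples share the same class.
    if len(Counter(label for _, label in examples)) <= 1:
        return True
    # 3. Ockham's Razor: do not split if the best candidate split barely improves the tree.
    if best_gain is not None and best_gain < min_improvement:
        return True
    return False

examples = [((1.0, 0.5), "c1"), ((0.9, 0.7), "c1")]   # hypothetical (x, label) pairs
print(stop_splitting(examples, min_size=5))           # True: fewer than min_size examples
```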

Splitting (continued)

Let D be a set of examples over a feature space X and a set of classes C = {c_1, c_2, c_3, c_4}. Distribution of D for two possible splittings of X:

[Figure: class distributions of D in the leaf nodes for splittings (a) and (b); legend: c_1, c_2, c_3, c_4]

Splitting (a) minimizes the impurity of the subsets of D in the leaf nodes and should be preferred over splitting (b). This argument presumes that the misclassification costs are independent of the classes.

The impurity is a function defined on P(D), the set of all subsets of an example set D.

Definition 4 (Impurity Function ι)

Let k ∈ N. An impurity function ι : [0; 1]^k → R is a partial function defined on the standard (k-1)-simplex, denoted Δ_{k-1}, for which the following properties hold:

(a) ι attains its minimum at the points (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, ..., 0, 1).

(b) ι is symmetric with regard to its arguments p_1, ..., p_k.

(c) ι attains its maximum at the point (1/k, ..., 1/k).

Definition 5 (Impurity of an Example Set ι(D))

Let D be a set of examples, let C = {c_1, ..., c_k} be a set of classes, and let c : X → C be the ideal classifier for X. Moreover, let ι : [0; 1]^k → R be an impurity function. Then the impurity of D, denoted as ι(D), is defined as follows:

  ι(D) = ι( |{(x, c(x)) ∈ D : c(x) = c_1}| / |D|, ..., |{(x, c(x)) ∈ D : c(x) = c_k}| / |D| )

Definition 6 (Impurity Reduction Δι)

Let D_1, ..., D_s be a partitioning of an example set D, which is induced by a splitting of a feature space X. Then the resulting impurity reduction, denoted as Δι(D, {D_1, ..., D_s}), is defined as follows:

  Δι(D, {D_1, ..., D_s}) = ι(D) - ∑_{j=1}^{s} (|D_j| / |D|) · ι(D_j)
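Definitions 5 and 6 can be expressed in a few lines of code. The sketch below is an illustration rather than part of the lecture material; it assumes examples are (x, label) pairs and takes the impurity function ι as a parameter, so it works with any of the concrete impurity functions introduced later.

```python
from collections import Counter

def class_distribution(examples, classes):
    """Relative class frequencies (p_1, ..., p_k) of an example set D."""
    counts = Counter(label for _, label in examples)
    return [counts[c] / len(examples) for c in classes]

def impurity(examples, classes, iota):
    """Definition 5: iota(D) is iota applied to the class distribution of D."""
    return iota(class_distribution(examples, classes))

def impurity_reduction(examples, parts, classes, iota):
    """Definition 6: iota(D) - sum_j |D_j|/|D| * iota(D_j) for a partitioning of D."""
    weighted = sum(
        len(part) / len(examples) * impurity(part, classes, iota) for part in parts
    )
    return impurity(examples, classes, iota) - weighted

# Example: D with 4 examples of c1 and 4 of c2, split into two pure halves.
D = [("x%d" % i, "c1") for i in range(4)] + [("x%d" % i, "c2") for i in range(4, 8)]
D1, D2 = D[:4], D[4:]
print(impurity_reduction(D, [D1, D2], ["c1", "c2"], lambda p: 1.0 - max(p)))  # 0.5
```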

Remarks:

- The standard (k-1)-simplex comprises all k-tuples with non-negative elements that sum to 1:
  Δ_{k-1} = { (p_1, ..., p_k) ∈ R^k : ∑_{i=1}^{k} p_i = 1 and p_i ≥ 0 for all i }

- Observe the different domains of the impurity function ι in Definitions 4 and 5, namely [0; 1]^k and P(D). The domains correspond to each other: a set of examples D defines, via its class portions, an element of [0; 1]^k, and vice versa.

- The properties in the definition of ι suggest minimizing the external path length of T with respect to D in order to minimize the overall impurity characteristics of T.

- Within the DT-construct algorithm, a greedy strategy (local optimization) is usually employed to minimize the overall impurity characteristics of a decision tree T.

Impurity Functions Based on the Misclassification Rate

Definition for two classes:

  ι_misclass(p_1, p_2) = 1 - max{p_1, p_2} = p_1 if 0 ≤ p_1 ≤ 0.5, and 1 - p_1 otherwise

  ι_misclass(D) = 1 - max{ |{(x, c(x)) ∈ D : c(x) = c_1}| / |D|, |{(x, c(x)) ∈ D : c(x) = c_2}| / |D| }

Graph of the function ι_misclass(p_1, 1 - p_1):

[Figure: tent-shaped curve over p_1 ∈ [0, 1], maximum value 0.5 at p_1 = 0.5] [Graph: Entropy, Gini]

Impurity Functions Based on the Misclassification Rate (continued)

Definition for k classes:

  ι_misclass(p_1, ..., p_k) = 1 - max_{i=1,...,k} p_i

  ι_misclass(D) = 1 - max_{c ∈ C} |{(x, c(x)) ∈ D : c(x) = c}| / |D|
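As a small illustration (not from the slides), the k-class misclassification impurity is a one-liner on a class distribution:

```python
def iota_misclass(p):
    """Misclassification-rate impurity for a class distribution p = (p_1, ..., p_k)."""
    return 1.0 - max(p)

print(iota_misclass([0.25, 0.25, 0.5]))  # 0.5
print(iota_misclass([1.0, 0.0, 0.0]))    # 0.0 for a pure distribution
```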

Impurity Functions Based on the Misclassification Rate (continued)

Problems:

- Δι_misclass = 0 may hold for all possible splittings.

- The impurity function that is induced by the misclassification rate underestimates pure nodes, as illustrated by splitting (b):

[Figure: class distributions for two splittings (a) and (b); legend: c_1, c_2]

  Δι_misclass = ι_misclass(D) - ( (|D_1| / |D|) · ι_misclass(D_1) + (|D_2| / |D|) · ι_misclass(D_2) )

  Left splitting (a):  Δι_misclass = 1/2 - (1/2 · 1/4 + 1/2 · 1/4) = 1/4
  Right splitting (b): Δι_misclass = 1/2 - (3/4 · 1/3 + 1/4 · 0) = 1/4
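The following snippet reproduces the computation above; the absolute class counts (four examples per class in D) are an assumption chosen to match the ratios on the slide.

```python
def iota_misclass(p):
    return 1.0 - max(p)

def delta_misclass(parent, parts):
    """Impurity reduction from class-count vectors, e.g. parent = [4, 4]."""
    n = sum(parent)
    weighted = sum(
        sum(part) / n * iota_misclass([c / sum(part) for c in part]) for part in parts
    )
    return iota_misclass([c / n for c in parent]) - weighted

# Hypothetical counts matching the ratios on the slide (4 + 4 examples in D):
print(delta_misclass([4, 4], [[3, 1], [1, 3]]))  # splitting (a): 0.25
print(delta_misclass([4, 4], [[4, 2], [0, 2]]))  # splitting (b): 0.25 as well
```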

Definition 7 (Strict Impurity Function)

Let ι : [0; 1]^k → R be an impurity function and let p, p' ∈ Δ_{k-1}. Then ι is called strict if it is strictly concave, i.e., if Property (c) is strengthened to (c'):

  (c') ι(λ·p + (1 - λ)·p') > λ·ι(p) + (1 - λ)·ι(p'),  for 0 < λ < 1 and p ≠ p'

Lemma 8

Let ι be a strict impurity function and let D_1, ..., D_s be a partitioning of an example set D, which is induced by a splitting of a feature space X. Then the following inequality holds:

  Δι(D, {D_1, ..., D_s}) ≥ 0

Equality holds iff for all i ∈ {1, ..., k} and j ∈ {1, ..., s}:

  |{(x, c(x)) ∈ D : c(x) = c_i}| / |D| = |{(x, c(x)) ∈ D_j : c(x) = c_i}| / |D_j|

Remarks:

- Strict concavity entails Property (c) of the impurity function definition.

- For two classes, strict concavity implies ι(p_1, 1 - p_1) > 0 for 0 < p_1 < 1.

- If ι is twice differentiable, strict concavity is equivalent to the Hessian of ι being negative definite.

- With properly chosen coefficients, polynomials of second degree fulfill Properties (a) and (b) of the impurity function definition as well as strict concavity. See the impurity functions based on the Gini index in this regard.

- The impurity function that is induced by the misclassification rate is concave, but it is not strictly concave.

- The proof of Lemma 8 exploits the strict concavity of ι.

Impurity Functions Based on Entropy

Definition 9 (Entropy)

Let A denote an event and let P(A) denote the occurrence probability of A. Then the entropy (self-information, information content) of A is defined as -log2(P(A)).

Let A be an experiment with the exclusive outcomes (events) A_1, ..., A_k. Then the mean information content of A, denoted as H(A), is called Shannon entropy or entropy of the experiment A and is defined as follows:

  H(A) = -∑_{i=1}^{k} P(A_i) · log2(P(A_i))
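A direct transcription of Definition 9 into Python (an illustrative sketch, not part of the slides), with the convention 0 · log2(0) = 0 handled by skipping zero probabilities:

```python
import math

def shannon_entropy(probs):
    """H(A) = -sum_i P(A_i) * log2(P(A_i)), skipping outcomes with probability 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit for a fair coin
print(shannon_entropy([1.0, 0.0]))  # 0.0 for a certain outcome
```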

Remarks:

- The smaller the occurrence probability of an event, the larger its entropy. An event that is certain has zero entropy.

- The Shannon entropy combines the entropies of an experiment's outcomes, using the outcome probabilities as weights.

- In the entropy definition we stipulate the identity 0 · log2(0) = 0.

Impurity Functions Based on Entropy (continued)

Definition 10 (Conditional Entropy, Information Gain)

Let A be an experiment with the exclusive outcomes (events) A_1, ..., A_k, and let B be another experiment with the outcomes B_1, ..., B_s. Then the conditional entropy of the combined experiment (A, B) is defined as follows:

  H(A | B) = ∑_{j=1}^{s} P(B_j) · H(A | B_j),  where  H(A | B_j) = -∑_{i=1}^{k} P(A_i | B_j) · log2(P(A_i | B_j))

The information gain due to experiment B is defined as follows:

  H(A) - H(A | B) = H(A) - ∑_{j=1}^{s} P(B_j) · H(A | B_j)
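Estimating the probabilities as relative frequencies, conditional entropy and information gain can be computed from a contingency table. The sketch below is not from the slides; it assumes a table whose rows correspond to the outcomes B_j and whose columns correspond to the classes A_i, and the counts are illustrative.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(table):
    """H(A) - H(A|B), where table[j][i] counts examples with feature value b_j and class c_i."""
    n = sum(sum(row) for row in table)
    k = len(table[0])
    h_a = entropy([sum(row[i] for row in table) / n for i in range(k)])                 # H(A)
    h_a_b = sum(sum(row) / n * entropy([c / sum(row) for c in row]) for row in table)   # H(A|B)
    return h_a - h_a_b

# Two feature values (rows) and two classes (columns):
print(information_gain([[3, 1], [1, 3]]))  # about 0.19 bits
```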

Remarks [Bayes for classification]:

- Information gain is defined as reduction in entropy.

- In the context of decision trees, experiment A corresponds to classifying feature vector x with regard to the target concept. A possible question whose answer informs us about which event A_i of experiment A occurred is the following: "Does x belong to class c_i?"

- Likewise, experiment B corresponds to evaluating feature B of feature vector x. A possible question whose answer informs us about which event B_j of experiment B occurred is the following: "Does x have value b_j for feature B?"

- Rationale: Typically, the events "target concept class" and "feature value" are statistically dependent. Hence, the entropy of the event c(x) will become smaller if we learn about the value of some feature of x (recall that the class of x is unknown). We experience an information gain with regard to the outcome of experiment A, which is rooted in our information about the outcome of experiment B.

- Under no circumstances will the information gain be negative; it is zero if the involved events are statistically independent, i.e., P(A_i) = P(A_i | B_j) for all i ∈ {1, ..., k} and j ∈ {1, ..., s}, which corresponds to the equality case in Lemma 8.

Remarks (continued):

- Since H(A) is constant, the feature that provides the maximum information gain (i.e., the maximally informative feature) is found by minimizing H(A | B). The expanded form of H(A | B) reads as follows:

  H(A | B) = -∑_{j=1}^{s} P(B_j) ∑_{i=1}^{k} P(A_i | B_j) · log2(P(A_i | B_j))

Impurity Functions Based on Entropy (continued)

Definition for two classes:

  ι_entropy(p_1, p_2) = -(p_1 · log2(p_1) + p_2 · log2(p_2))

  ι_entropy(D) = -( |{(x, c(x)) ∈ D : c(x) = c_1}| / |D| · log2( |{(x, c(x)) ∈ D : c(x) = c_1}| / |D| )
                  + |{(x, c(x)) ∈ D : c(x) = c_2}| / |D| · log2( |{(x, c(x)) ∈ D : c(x) = c_2}| / |D| ) )

Graph of the function ι_entropy(p_1, 1 - p_1):

[Figure: concave curve over p_1 ∈ [0, 1], maximum value 1.0 at p_1 = 0.5] [Graph: Misclassification, Gini]

Impurity Functions Based on Entropy (continued)

Graph of the function ι_entropy(p_1, p_2, 1 - p_1 - p_2):

[Figure: surface plot of the three-class entropy over (p_1, p_2); values range up to about 1.6]

Impurity Functions Based on Entropy (continued)

Definition for k classes:

  ι_entropy(p_1, ..., p_k) = -∑_{i=1}^{k} p_i · log2(p_i)

  ι_entropy(D) = -∑_{i=1}^{k} |{(x, c(x)) ∈ D : c(x) = c_i}| / |D| · log2( |{(x, c(x)) ∈ D : c(x) = c_i}| / |D| )
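Computed directly from the class labels of an example set (again an illustrative sketch, not part of the slides):

```python
import math
from collections import Counter

def iota_entropy(labels):
    """Entropy impurity of an example set, from its class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

print(iota_entropy(["c1", "c1", "c2", "c3"]))  # 1.5 bits for the distribution (1/2, 1/4, 1/4)
```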

Impurity Functions Based on Entropy (continued)

Δι_entropy corresponds to the information gain H(A) - H(A | B):

  Δι_entropy = ι_entropy(D) - ∑_{j=1}^{s} (|D_j| / |D|) · ι_entropy(D_j),

where ι_entropy(D) corresponds to H(A) and the weighted sum corresponds to H(A | B).

Legend:

- A_i, i = 1, ..., k, denotes the event that x ∈ X(t) belongs to class c_i. The experiment A corresponds to the classification c : X(t) → C.

- B_j, j = 1, ..., s, denotes the event that x ∈ X(t) has value b_j for feature B. The experiment B corresponds to evaluating feature B and entails the following splitting:
  X(t) = X(t_1) ∪ ... ∪ X(t_s) = {x ∈ X(t) : x|_B = b_1} ∪ ... ∪ {x ∈ X(t) : x|_B = b_s}

- ι_entropy(D) = ι_entropy(P(A_1), ..., P(A_k)) = -∑_{i=1}^{k} P(A_i) · log2(P(A_i)) = H(A)

- (|D_j| / |D|) · ι_entropy(D_j) = P(B_j) · ι_entropy(P(A_1 | B_j), ..., P(A_k | B_j)),  j = 1, ..., s

- P(A_i), P(B_j), and P(A_i | B_j) are estimated as relative frequencies based on D.
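The correspondence can be checked numerically: computing Δι_entropy for a split on a feature B gives exactly the information gain H(A) - H(A | B). The data and helper names below are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def iota_entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical leaf data: (value of feature B, class label).
data = [("b1", "c1"), ("b1", "c1"), ("b1", "c2"), ("b2", "c2"), ("b2", "c2"), ("b2", "c1")]
labels = [c for _, c in data]
parts = [[c for b, c in data if b == value] for value in ("b1", "b2")]

gain = iota_entropy(labels) - sum(len(p) / len(labels) * iota_entropy(p) for p in parts)
print(gain)  # about 0.082 bits: the information gained by evaluating feature B
```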

Impurity Functions Based on the Gini Index

Definition for two classes:

  ι_Gini(p_1, p_2) = 1 - (p_1² + p_2²) = 2 · p_1 · p_2

  ι_Gini(D) = 2 · |{(x, c(x)) ∈ D : c(x) = c_1}| / |D| · |{(x, c(x)) ∈ D : c(x) = c_2}| / |D|

Graph of the function ι_Gini(p_1, 1 - p_1):

[Figure: concave parabola over p_1 ∈ [0, 1] with maximum at p_1 = 0.5] [Graph: Misclassification, Entropy]

Impurity Functions Based on the Gini Index (continued)

Definition for k classes:

  ι_Gini(p_1, ..., p_k) = 1 - ∑_{i=1}^{k} p_i²

  ι_Gini(D) = ( (∑_{i=1}^{k} |{(x, c(x)) ∈ D : c(x) = c_i}|)² - ∑_{i=1}^{k} |{(x, c(x)) ∈ D : c(x) = c_i}|² ) / |D|²

            = 1 - ∑_{i=1}^{k} ( |{(x, c(x)) ∈ D : c(x) = c_i}| / |D| )²
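And the corresponding sketch for the Gini index (illustrative, not from the slides):

```python
from collections import Counter

def iota_gini(labels):
    """Gini impurity of an example set: 1 - sum_i (|D_ci| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(iota_gini(["c1", "c1", "c2", "c2"]))  # 0.5, the two-class maximum
print(iota_gini(["c1", "c1", "c1", "c1"]))  # 0.0 for a pure example set
```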