Decision trees
Special Course in Computer and Information Science II
Adam Gyenge, Helsinki University of Technology
6.2.2008

Introduction
Outline:
- Definition of decision trees
- ID3
- Pruning methods
Bibliography:
- J. Ross Quinlan: Induction of Decision Trees. Machine Learning 1(1): 81-106 (1986)
- J. Ross Quinlan: Simplifying Decision Trees. International Journal of Man-Machine Studies 27(3): 221-234 (1987)
- T. Mitchell: Machine Learning. McGraw Hill (1997)

Decision trees
There are objects in the world, and each object has many attributes. Each attribute takes its values from a given set (its type: categorical, integer, or real). A decision tree uses the attributes to assign a category to an object (classification). Each non-leaf node of the tree is a test on an attribute, with one child for each possible outcome of the test. For an incoming object, we first perform the test at the root and then follow the corresponding branches down towards the leaves. Each leaf is labelled with a category; when we reach a leaf, the object is assigned the category belonging to that leaf.
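To make this traversal concrete, here is a minimal Python sketch (my own illustration, not part of the original slides). It assumes a tree stored as nested dictionaries of the form {attribute: {value: subtree-or-label}}; the small example tree is only a hypothetical one, consistent with the PlayTennis data shown later.

```python
def classify(tree, obj):
    """Follow the tests from the root down to a leaf and return its category.
    `tree` is either a leaf label or a dict {attribute: {value: subtree}}."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))            # the test at this node
        tree = tree[attribute][obj[attribute]]  # follow the matching branch
    return tree

# A possible PlayTennis tree (hypothetical, consistent with the later example)
tennis_tree = {"outlook": {
    "overcast": "P",
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "rain": {"windy": {"true": "N", "false": "P"}},
}}
print(classify(tennis_tree, {"outlook": "sunny", "humidity": "normal"}))  # P
```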

A simple example
Whether to play tennis or not. Attributes:
- outlook ∈ {sunny, overcast, rain}
- humidity ∈ {high, normal}
- windy ∈ {true, false}
A possible tree: (figure)

Decision rules
A path from the root to a leaf is a decision rule:
if $T_1 \wedge T_2 \wedge \dots \wedge T_n$ then $Y = y_i$,
where $T_j$ is a test on the j-th attribute (it may be "always true") and $y_i$ is the i-th category. The decision tree represents a consistent set of decision rules.

Learning a decision tree 1
Input: a training set of objects whose class is known.
Output: a decision tree that can categorize any object. This process is called induction.
For simplicity, from now on there will be only two classes (Positive and Negative), and all attributes are of category type.

Learning a decision tree 2
- If there are two objects with identical attribute values but different categories, no correct tree exists.
- There can be several trees that classify the entire training set correctly.
- Aim: construct the simplest tree. We expect it to classify more objects outside the training set correctly.

ID3
Recursive algorithm:
1. If all the training examples are from the same class, then let this be a category node and RETURN. Otherwise:
2. If there are no more attributes left, then let this be a category node using majority voting and RETURN. Otherwise:
3. Choose the best attribute for the current node.
4. For each value of the attribute, create a child node.
5. Split the training set: assign each example to the corresponding child node.
6. For each child node:
   6.1 If the node is empty, then let it be a category node using majority voting over the examples of its parent.
   6.2 If the node is not empty, then call this algorithm on it.
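A minimal Python sketch of this recursion (my own illustration, not code from the course). It uses the same nested-dictionary trees as the classify sketch above, represents examples as (attribute-dict, label) pairs, and leaves the selection criterion as a choose_attribute parameter, since information gain is only introduced on the following slides.

```python
from collections import Counter

def majority_label(examples):
    return Counter(lbl for _, lbl in examples).most_common(1)[0][0]

def id3(examples, attributes, choose_attribute, parent_examples=None):
    # Step 6.1: an empty node becomes a leaf by majority vote in its parent
    if not examples:
        return majority_label(parent_examples)
    labels = {lbl for _, lbl in examples}
    # Step 1: all examples belong to one class -> leaf with that class
    if len(labels) == 1:
        return labels.pop()
    # Step 2: no attributes left -> leaf by majority vote
    if not attributes:
        return majority_label(examples)
    # Step 3: choose the best attribute for the current node
    best = choose_attribute(examples, attributes)
    remaining = [a for a in attributes if a != best]
    # Steps 4-6: one child per value of the chosen attribute, split and recurse
    tree = {best: {}}
    for value in {obj[best] for obj, _ in examples}:
        subset = [(obj, lbl) for obj, lbl in examples if obj[best] == value]
        tree[best][value] = id3(subset, remaining, choose_attribute, examples)
    return tree
```

Even a trivial chooser such as lambda ex, attrs: attrs[0] makes the sketch run; plugging in the information-gain criterion of the next slides turns it into ID3 proper. The sketch only branches on values that occur in the examples, so step 6.1 applies when the full value domains are supplied instead.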

Choosing the best attribute 1
Which is the best attribute? We want the classes of the examples in the child nodes to be as homogeneous as possible. We assume that the training set represents the original population well:
$\Pr(C = \text{Positive}) = \frac{p}{p+n}, \qquad \Pr(C = \text{Negative}) = \frac{n}{p+n}$
where C is the class attribute, p is the number of samples with positive class, and n is the number of samples with negative class.

Choosing the best attribute 2
Entropy: the expected number of bits required to encode the class label of a sample drawn randomly from S:
$H(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
Less entropy means less uncertainty, i.e. a more homogeneous sample.
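A tiny helper (my own sketch) that computes this two-class entropy directly from the counts p and n, treating 0·log 0 as 0:

```python
from math import log2

def entropy(p, n):
    """H(S) for p positive and n negative samples; 0 * log2(0) is taken as 0."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

print(round(entropy(9, 5), 2))  # 0.94, the H(S) of the upcoming tennis example
```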

Choosing the best attribute 3
Assume attribute A takes v different values, and let $\{S_1, \dots, S_v\}$ be the sample sets belonging to each value. The average entropy after sorting the samples by A is
$E(S, A) = \sum_{i=1}^{v} \frac{|S_i|}{|S|} H(S_i)$
The information gained by choosing A as the branching attribute is
$\text{gain}(A) = H(S) - E(S, A)$
The best attribute is the one for which gain(A) is maximal, or equivalently (since H(S) is the same for each attribute) the one for which E(S, A) is minimal.
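Putting the two formulas together, here is a sketch of gain(A) for the (attribute-dict, label) representation used in the ID3 sketch above (again an illustration, not the original course code):

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    """H(S) over an arbitrary list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """gain(A) = H(S) - E(S, A), where E(S, A) = sum_i |S_i|/|S| * H(S_i)."""
    partitions = {}
    for obj, lbl in examples:                      # split S into S_1 ... S_v
        partitions.setdefault(obj[attribute], []).append(lbl)
    expected = sum(len(part) / len(examples) * entropy_of(part)
                   for part in partitions.values())
    return entropy_of([lbl for _, lbl in examples]) - expected
```

With choose_attribute = lambda ex, attrs: max(attrs, key=lambda a: information_gain(ex, a)), the ID3 sketch above selects attributes exactly as described here.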

A simple example

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          high      false  N
2   sunny     hot          high      true   N
3   overcast  hot          high      false  P
4   rain      mild         high      false  P
5   rain      cool         normal    false  P
6   rain      cool         normal    true   N
7   overcast  cool         normal    true   P
8   sunny     mild         high      false  N
9   sunny     cool         normal    false  P
10  rain      mild         normal    false  P
11  sunny     mild         normal    true   P
12  overcast  mild         high      true   P
13  overcast  hot          normal    false  P
14  rain      mild         high      true   N

Calculations
There are 9 positive and 5 negative examples (9 + 5 = 14).
$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$
For the attribute Outlook:
- sunny: 2 positive, 3 negative (2 + 3 = 5), $H(S_{\text{sunny}}) = 0.971$
- overcast: 4 positive, 0 negative (4 + 0 = 4), $H(S_{\text{overcast}}) = 0$
- rain: 3 positive, 2 negative (3 + 2 = 5), $H(S_{\text{rain}}) = 0.971$
$E(S, \text{Outlook}) = \frac{5}{14} H(S_{\text{sunny}}) + \frac{4}{14} H(S_{\text{overcast}}) + \frac{5}{14} H(S_{\text{rain}}) = 0.694$
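These numbers can be reproduced from the counts alone; a short check (the entropy helper is repeated so the snippet runs on its own):

```python
from math import log2

def entropy(p, n):
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

h_s = entropy(9, 5)                                   # ~0.940
branches = [(2, 3), (4, 0), (3, 2)]                   # sunny, overcast, rain
e_outlook = sum((p + n) / 14 * entropy(p, n) for p, n in branches)
print(round(h_s, 2), round(e_outlook, 3))             # 0.94 0.694
```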

Gain of attributes
The information gain of the Outlook attribute:
gain(Outlook) = 0.940 - 0.694 = 0.246
Similarly:
- gain(Temperature) = 0.029
- gain(Humidity) = 0.151
- gain(Windy) = 0.048
Result: Outlook is the best attribute for the root node.

Problem of irrelevant attributes
- There may be attributes that are irrelevant for the decision.
- In these cases the information gain is small (although not 0).
- We can define a threshold for the gain (absolute or percentage).
- If no attribute exceeds the threshold, we stop the recursion.

Problem with information gain
Let A be an attribute with the set of possible values $\{A_1, \dots, A_v\}$. Create a new attribute B by splitting one of the values into two (e.g. $A_v$ splits into $B_v$ and $B_{v+1}$; the proportion does not matter). It can be shown that $\text{gain}(B) \geq \text{gain}(A)$ always holds. So ID3 prefers attributes with more values and builds flat trees. It is better to use the gain ratio as the attribute-selection function:
$IV(A) = -\sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \log_2 \frac{p_i + n_i}{p + n}$
$\text{gain ratio}(A) = \frac{\text{gain}(A)}{IV(A)}$
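A small sketch of the gain ratio (illustrative; function and parameter names are my own). IV(A) depends only on the sizes of the subsets S_1 ... S_v:

```python
from math import log2

def split_information(sizes):
    """IV(A): entropy of the partition induced by A, computed from the |S_i|."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """gain ratio(A) = gain(A) / IV(A); returns 0 when IV(A) is 0."""
    iv = split_information(sizes)
    return gain / iv if iv else 0.0

# Outlook in the tennis example: subsets of size 5, 4, 5 and gain 0.246
print(round(gain_ratio(0.246, [5, 4, 5]), 3))  # ~0.156
```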

Problem of overfitting
The training data may contain noise. We want to learn the general distribution, so as to reduce the classification error on any data.
- Error of hypothesis h on the training data: $\text{error}_{\text{train}}(h)$
- Error of hypothesis h on the entire distribution D: $\text{error}_D(h)$
Hypothesis $h$ overfits the training data if there is an alternative hypothesis $h' \in H$ for which $\text{error}_{\text{train}}(h) < \text{error}_{\text{train}}(h')$ but $\text{error}_D(h) > \text{error}_D(h')$.

Avoiding overfitting
Pruning:
- Prepruning: stop growing when a data split is not significant (covered earlier as the problem of irrelevant attributes).
- Postpruning: grow the full tree, then substitute some subtrees by single nodes. A few methods:
  1. Reduced Error Pruning
  2. Cost-Complexity Pruning
  3. Pessimistic Pruning
We usually split the input set into 3 parts:
1. Training set: used for constructing the tree
2. Pruning set: independent of the training set, used by some pruning methods
3. Test set: used to estimate accuracy on real data

Reduced Error Pruning
Algorithm:
1. Let S be a subtree in T.
2. Let T' be the tree in which S is substituted with the best leaf (majority voting on the training set).
3. Let E be the number of misclassified items of T on the pruning set.
4. Let E' be the number of misclassified items of T' on the pruning set.
5. If E' ≤ E, then prune S.
6. Repeat this as long as possible.
In what order are the nodes considered? Bottom-up.
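A sketch of reduced error pruning over the nested-dictionary trees used earlier (my own illustration, with one simplification: the comparison is made locally on the pruning examples that reach each subtree, which is equivalent to comparing E' and E for the whole tree, since examples routed elsewhere are unaffected):

```python
from collections import Counter

def classify(tree, obj):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(obj.get(attr))
    return tree

def errors(tree, examples):
    return sum(classify(tree, obj) != lbl for obj, lbl in examples)

def reduced_error_prune(tree, train, prune_set):
    """Bottom-up REP: replace a subtree by its training-majority leaf whenever
    the error on the pruning examples reaching it does not increase."""
    if not isinstance(tree, dict) or not train:
        return tree
    attr = next(iter(tree))
    for value, child in tree[attr].items():
        sub_train = [(o, l) for o, l in train if o.get(attr) == value]
        sub_prune = [(o, l) for o, l in prune_set if o.get(attr) == value]
        tree[attr][value] = reduced_error_prune(child, sub_train, sub_prune)
    leaf = Counter(l for _, l in train).most_common(1)[0][0]
    if errors(leaf, prune_set) <= errors(tree, prune_set):
        return leaf
    return tree
```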

Example of REP
An example tree and pruning set (figure).

Cost-complexity pruning
Considers:
- the error on the training set
- the error on the pruning set
- the size of the tree
Tries to minimize both error and complexity.

Notations
- T is a decision tree
- N is the size of the training set used for generating T
- L(T) is the number of leaves in T
- E is the number of misclassified training examples
Cost-complexity (or total cost) = error cost + cost of complexity:
$CC(T) = \frac{E}{N} + \alpha L(T)$
where α is the cost of complexity per leaf.
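As a one-line sketch (names are my own):

```python
def cost_complexity(E, N, n_leaves, alpha):
    """CC(T) = E/N + alpha * L(T)."""
    return E / N + alpha * n_leaves
```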

CC Pruning
Let S be a subtree of T; we substitute it with a single leaf chosen by majority voting among its descendant leaves.
- M = (the new number of misclassified training examples) - E
- The new tree has $L(T) - L(S) + 1$ leaves.
If $CC(T_{\text{old}}) = CC(T_{\text{new}})$, then
$\alpha = \frac{M}{N (L(S) - 1)}$
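The value of α follows directly from equating the two costs; a short derivation using only the definitions above:

```latex
CC(T_{\text{old}}) = CC(T_{\text{new}})
\;\Longrightarrow\;
\frac{E}{N} + \alpha L(T) = \frac{E + M}{N} + \alpha\bigl(L(T) - L(S) + 1\bigr)
\;\Longrightarrow\;
\alpha\,(L(S) - 1) = \frac{M}{N}
\;\Longrightarrow\;
\alpha = \frac{M}{N\,(L(S) - 1)}
```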

CC Pruning - Algorithm
Algorithm:
1. Compute α for each node.
2. Find the minimum, and prune the subtrees having this value.
3. Repeat steps 1 and 2 until only one leaf is left; this gives a series of trees $T_0, \dots, T_k$.
4. Compute the standard error (where N' is the size of the pruning set and E' is the smallest error of all the trees on the pruning set):
$se = \sqrt{\frac{E'(N' - E')}{N'}}$
5. Choose the smallest tree whose observed error on the pruning set does not exceed $E' + se$ (i.e. it is not very far from the minimum error).
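A sketch of steps 4 and 5 (parameter names are my own; the candidate trees T_0 ... T_k are described only by their pruning-set errors and leaf counts):

```python
from math import sqrt

def select_tree(prune_errors, leaf_counts, n_prune):
    """Return the index of the smallest tree whose pruning-set error is within
    one standard error of the best one (steps 4 and 5 above)."""
    e_best = min(prune_errors)
    se = sqrt(e_best * (n_prune - e_best) / n_prune)
    ok = [i for i, e in enumerate(prune_errors) if e <= e_best + se]
    return min(ok, key=lambda i: leaf_counts[i])   # "smallest" = fewest leaves
```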

A medical example
4 categories: primary, secondary, compensated, negative. The numbers of training examples used for generating the tree are shown in parentheses (figure).

Example of CCP
Let's consider the subtree of the node [T4U]. There is only one example in it that is not negative. If we replace this subtree with a leaf (labelled negative, of course), then:
- M = 1
- L(S) = 4
- N = 2018
$\alpha = \frac{1}{2018 \cdot 3} \approx 0.00017$, which is the smallest value in the whole tree.
So at first we replace this subtree by a leaf labelled as negative.

Pessimistic Pruning
Notations again:
- N is the number of training examples
- K of them correspond to a specific leaf
- J of them are misclassified if we use the majority label (J = K - number of majority examples)
Estimation of the error rate on the leaf: $\frac{J}{K}$. A better one (with the continuity correction of the binomial distribution): $\frac{J + 1/2}{K}$. Expected number of errors for K unseen samples: $J + 1/2$.
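A small helper (illustrative, names my own) computing the corrected error estimate J + 1/2 for a leaf from its training labels:

```python
from collections import Counter

def pessimistic_leaf_error(labels):
    """Expected number of errors J + 1/2 at a leaf that predicts the majority
    label of its K training examples (continuity correction included)."""
    counts = Counter(labels)
    j = len(labels) - counts.most_common(1)[0][1]   # misclassified by majority
    return j + 0.5
```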

Pessimistic Pruning
If we think pessimistically:
- S is a subtree containing L(S) leaves
- $\Sigma J$ and $\Sigma K$ are the sums of J and K over these leaves
- Expected number of errors on S for $\Sigma K$ unseen samples: $\Sigma J + L(S)/2$
- Let E be the number of misclassified training examples if we substitute S by a majority-vote leaf
- If $E + 1/2 \leq (\Sigma J + L(S)/2) + se$, then prune S (se is the standard error, as before)
Algorithm:
1. Top-down checking of the nodes (all non-leaf nodes are examined just once).
2. If possible, substitute the subtree with a majority-labelled leaf.
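The pruning test itself can be sketched as follows (my own names; the standard-error formula matches the worked example on the next slide):

```python
from math import sqrt

def should_prune(sum_j, n_leaves, sum_k, e_majority):
    """Prune subtree S if E + 1/2 <= (sum J + L(S)/2) + se."""
    expected = sum_j + n_leaves / 2                    # pessimistic error on S
    se = sqrt(expected * (sum_k - expected) / sum_k)   # standard error
    return e_majority + 0.5 <= expected + se

# The [T4U] subtree of the medical example: sum J = 0, L(S) = 4, sum K = 2018, E = 1
print(should_prune(0, 4, 2018, 1))  # True
```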

A medical example - revisited (figure).

Example of PP
For the node [T4U]:
- $\Sigma K = 2018$
- $\Sigma J = 0$
- L(S) = 4
Estimate of error on S: $0 + 4/2 = 2$
Standard error: $\sqrt{\frac{2 (2018 - 2)}{2018}} = 1.41$
Result of the majority vote: negative, with E = 1.
Since $1 + 1/2 < 2.0 + 1.41$, we substitute the subtree of [T4U] with a [negative] leaf.

Summary
- Decision trees for classification
- ID3 uses entropy-based information gain for selecting the best attribute to split on
- Problem of irrelevant attributes
- Overfitting can be avoided by postpruning:
  1. Reduced error pruning
  2. Cost-complexity pruning
  3. Pessimistic pruning
Thank you for your attention. Questions?