Decision Trees

Introduction

Some facts about decision trees: they represent data-classification models. An internal node of the tree represents a question about a feature-vector attribute, and its answer dictates which child node is queried next. Each leaf node represents a potential classification of the feature vector. Alternatively (from a logical perspective), decision trees allow classes of elements to be represented as logical disjunctions of conjunctions of attribute values.
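To make the structure concrete, here is a small Python sketch (my illustration, not part of the notes) of a decision tree stored as nested dictionaries, together with the classification procedure that follows questions from the root to a leaf. The particular attributes and values are chosen to be consistent with the fruit data of Example 1 below.

```python
# Internal nodes ask about an attribute, edges are the possible answers,
# and leaves are class labels.  Attribute names are illustrative.
tree = {
    "color": {
        "green":  {"weight": {"heavy": "melon", "medium": "apple"}},
        "red":    {"texture": {"smooth": "apple", "bumpy": "berry"}},
        "orange": "orange",
        "yellow": {"weight": {"heavy": "melon", "medium": "orange"}},
    }
}

def classify(tree, example):
    """Follow questions from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # the question asked at this node
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(tree, {"color": "red", "texture": "bumpy", "weight": "light"}))  # berry
```

Read logically, the leaf reached above corresponds to the conjunction (color = red) AND (texture = bumpy); a class reachable from several leaves is the disjunction of such conjunctions.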


Here are some characteristics of data universes that admit good decision-tree models:

Instances are represented by feature vectors whose attributes are preferably discrete.
Instances are discretely classified.
A disjunctive description is a reasonable means of representing a class of elements.
Errors may exist, such as missing attribute values or erroneous classifications of some training vectors.

Entropy of a collection of classified feature vectors. Given a set of training vectors S, suppose that the vectors are classified in c different ways, and that p_i represents the proportion of vectors in S that belong to the i-th class. Then the classification entropy of S is defined as

H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i.

Example 1. Calculate the classification entropy for the following set of feature vectors.

Weight   Color    Texture   Classification
medium   orange   smooth    orange
heavy    green    smooth    melon
medium   green    smooth    apple
light    red      bumpy     berry
medium   orange   bumpy     orange
light    red      bumpy     berry
heavy    green    rough     melon
medium   red      smooth    apple
heavy    yellow   smooth    melon
medium   yellow   smooth    orange
medium   red      smooth    apple
medium   green    smooth    apple
medium   orange   rough     orange
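As a quick check on the definition, the following Python sketch (not part of the original notes; the label list is simply the Classification column of the table above) computes H(S) for the fruit data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Classification entropy: H(S) = -sum_i p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# Classification column of the fruit table in Example 1.
fruit_labels = ["orange", "melon", "apple", "berry", "orange", "berry", "melon",
                "apple", "melon", "orange", "apple", "apple", "orange"]
print(entropy(fruit_labels))   # roughly 1.95 bits for these 13 labels
```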

Quinlan's ID3 algorithm. At each phase of the construction of the decision tree, and for each branch of the decision tree under construction, the attribute A that is considered next is the one which

1. has yet to be considered along that branch; and
2. minimizes the conditional classification entropy H(S|A).

Indeed, for a given feature/attribute A,

H(S|A) = \sum_{a \in A} \frac{|S_a|}{|S|} H(S_a),

where S is the set of training vectors that reach the current branch under construction and, for each a \in A, S_a is the set of feature vectors v \in S such that v_A = a. Clearly, the smaller H(S|A), the less classification information remains in the vectors of S once they are divided according to their A-attribute.

Example 2. Using the table of feature vectors from Example 1 and the concept of conditional classification entropy, construct a decision tree for classifying fruit as either apple, orange, melon, or berry.
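The greedy choice above translates directly into code. The sketch below is an illustrative implementation, not the notes' own; the nested-dict tree representation and the encoding of the fruit data are my assumptions. It builds a tree by repeatedly picking the attribute that minimizes H(S|A):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def conditional_entropy(examples, attr):
    """H(S|A) = sum over a of (|S_a|/|S|) * H(S_a)."""
    groups = defaultdict(list)
    for features, label in examples:
        groups[features[attr]].append(label)
    total = len(examples)
    return sum(len(g) / total * entropy(g) for g in groups.values())

def id3(examples, attributes):
    """Return a nested-dict decision tree; leaves are class labels."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                       # pure node: stop splitting
        return labels[0]
    if not attributes:                              # nothing left to ask: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = min(attributes, key=lambda a: conditional_entropy(examples, a))
    subsets = defaultdict(list)
    for features, label in examples:
        subsets[features[best]].append((features, label))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(subset, remaining) for value, subset in subsets.items()}}

# Fruit data of Example 1 as ({attribute: value}, class) pairs.
fruit = [
    ({"weight": "medium", "color": "orange", "texture": "smooth"}, "orange"),
    ({"weight": "heavy",  "color": "green",  "texture": "smooth"}, "melon"),
    ({"weight": "medium", "color": "green",  "texture": "smooth"}, "apple"),
    ({"weight": "light",  "color": "red",    "texture": "bumpy"},  "berry"),
    ({"weight": "medium", "color": "orange", "texture": "bumpy"},  "orange"),
    ({"weight": "light",  "color": "red",    "texture": "bumpy"},  "berry"),
    ({"weight": "heavy",  "color": "green",  "texture": "rough"},  "melon"),
    ({"weight": "medium", "color": "red",    "texture": "smooth"}, "apple"),
    ({"weight": "heavy",  "color": "yellow", "texture": "smooth"}, "melon"),
    ({"weight": "medium", "color": "yellow", "texture": "smooth"}, "orange"),
    ({"weight": "medium", "color": "red",    "texture": "smooth"}, "apple"),
    ({"weight": "medium", "color": "green",  "texture": "smooth"}, "apple"),
    ({"weight": "medium", "color": "orange", "texture": "rough"},  "orange"),
]
print(id3(fruit, ["weight", "color", "texture"]))
```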

Split Information. Let A be an attribute, considered here as a discrete set of possible values. Then the split information relative to A and a set of feature vectors S is defined as

SI(A, S) = -\sum_{a \in A} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|},

where S_a is the set of feature vectors v \in S such that v_A = a. For attributes that take on many values, using the gain ratio IG(S, A)/SI(A, S) instead of IG(S, A) can help avoid favoring many-valued attributes which may not perform well in classifying the data. Here IG(S, A) is defined as H(S) - H(S|A).

Example 3. Suppose attribute A has 8 possible values. If selecting A for the next node of a decision tree yields IG(S, A) = 2 bits of information, then compute IG(S, A)/SI(A, S).
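In Example 3 the split information depends on how the 8 values are distributed among the training vectors. The sketch below is my own illustration, assuming for concreteness that the 8 values occur equally often, so that SI(A, S) = log2 8 = 3 bits:

```python
from collections import Counter
from math import log2

def split_information(values):
    """SI(A, S) = -sum_a (|S_a|/|S|) * log2(|S_a|/|S|), over the values a of A seen in S."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(information_gain, values):
    return information_gain / split_information(values)

# Hypothetical attribute column: 8 distinct values, each appearing twice.
values = [v for v in range(8) for _ in range(2)]
print(split_information(values))      # 3.0 bits
print(gain_ratio(2.0, values))        # 2/3 when IG(S, A) = 2 bits
```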

Classifier Selection and the Bias-Variance Tradeoff

The hypothesis space of decision trees is complete in the sense that, for every set S of training vectors, there exists a decision tree T_S which correctly classifies all of the training vectors. Moreover, that tree can be constructed quite easily, assuming discrete features and a finite number of classes. So why bother with ID3 when there is already a tree that will correctly classify all the training data? The problem, of course, is that this tree does not inform us about how to classify any of the data that is not part of the training set. For those cases let us suppose that (assuming two classes that are equally probable) the tree is designed so that non-training vectors are classified by the toss of a coin.

Such a classifier, which correctly classifies all training vectors and randomly classifies non-training vectors, represents an extreme example of an unbiased classifier, in that it makes no assumptions about correlations between a vector's class and its attribute values. In practice, being unbiased implies a lack of learning. For example, if you touch a very hot baking dish just removed from the oven, chances are you will think twice about touching it again the next time you see it on the counter. In other words, your next encounter with that dish on the counter will be biased by the past encounter.

Another shortcoming of T_S is that it possesses a high amount of variance in how a vector is classified, depending on the training set. When a vector is in the training set, it is correctly classified, but in all other cases it receives a random classification whose variance grows with the number of possible classes. Ideally, a good learning algorithm should keep the variance low, meaning that the classification of a vector does not change much from training set to training set. For example, given a large enough basket of fruit to learn from, we would expect that our concept of an orange (e.g. medium-sized, orange-colored, and smooth) would not change from basket to basket. In this case our learning algorithm would display low variance.

It should also be noted that attempting to minimize variance can sometimes lead to an increase in bias, which in turn may increase the overall classification error. For example, suppose we are biased towards classifying medium-sized, orange-colored, smooth fruit as oranges. Doing so may cause the misclassification of some smaller orange-complexioned grapefruits. In other words, by increasing bias for the sake of reducing variance, we sometimes make errors on the exceptions to the rule. The following mathematical derivation suggests that the ideal learning algorithm is one that strikes an optimal balance when attempting to reduce both bias and variance.

For a given vector x, let P(c|x) denote the classification probability distribution associated with x. Let \gamma be a classifier, and let \gamma(x) denote the class that \gamma assigns to x. Then the mean squared error of \gamma, denoted mse(\gamma), is defined as

mse(\gamma) = E_x[(\gamma(x) - P(c|x))^2],

where the expectation is taken over a probability distribution on the data universe X. Now let \Gamma denote a learning algorithm, and \Gamma_D the particular classifier that the algorithm derives upon input of a randomly drawn training sample D. Assume that all training samples have a fixed size, and that they are obtained by independent sampling from the distribution over X.
Define the learning error of \Gamma, denoted learning-error(\Gamma), as

learning-error(\Gamma) = E_D[mse(\Gamma_D)] = E_D E_x[(\Gamma_D(x) - P(c|x))^2] = E_x E_D[(\Gamma_D(x) - P(c|x))^2],

where the last equality is a change in the order of summation. Now the inner expectation can be simplified using the following claim.

Claim. E[(x - k)^2] = (E[x] - k)^2 + E[(x - E[x])^2], where E[x] denotes the expectation of the random variable x and k is some constant. The term (E[x] - k)^2 is called the bias term. Here we think of k as representing a desired target value that x is attempting to attain, while E[x] denotes the average of what x actually attains in practice. Finally, E[(x - E[x])^2] is the definition of the variance of the random variable x.

Proof of Claim. By linearity of expectation,

E[(x - k)^2] = E[x^2] - 2k E[x] + k^2 = [(E[x])^2 - 2k E[x] + k^2] + [E[x^2] - 2(E[x])^2 + (E[x])^2] = (E[x] - k)^2 + E[(x - E[x])^2].

Applying the claim to the expectation E_D[(\Gamma_D(x) - P(c|x))^2], we get

learning-error(\Gamma) = E_x[bias(\Gamma, x) + variance(\Gamma, x)],

where

bias(\Gamma, x) = (E_D[\Gamma_D(x)] - P(c|x))^2

and

variance(\Gamma, x) = E_D[(\Gamma_D(x) - E_D[\Gamma_D(x)])^2].

Example 4. Suppose X = {1, 2, 3, 4} and that 1 and 2 are in class 0, while 3 and 4 are in class +1. Suppose that |D| = 2 (one training vector from each class) and that our learning algorithm uses a nearest-neighbor rule, in that the resulting classifier classifies a number based on which training point it is nearest to, breaking ties by tossing a coin. Compute the learning error for this nearest-neighbor algorithm.
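The decomposition can also be checked numerically. The following Monte Carlo sketch is my own illustration of the setting of Example 4, under the assumptions that x is drawn uniformly from X, the target value P(c|x) is taken to be the (deterministic) class label of x, and distance ties are broken by a fair coin; it estimates bias(\Gamma, x), variance(\Gamma, x), and the learning error:

```python
import random
from statistics import mean

X = [1, 2, 3, 4]
target = {1: 0, 2: 0, 3: 1, 4: 1}      # class label of x, used as the target value

def draw_training_set():
    """One training vector from each class, chosen uniformly within the class."""
    return [(random.choice([1, 2]), 0), (random.choice([3, 4]), 1)]

def nearest_neighbor(train, x):
    """Predict the class of the nearest training point; break distance ties with a coin."""
    return min(train, key=lambda p: (abs(x - p[0]), random.random()))[1]

trials = 200_000
preds = {x: [] for x in X}
for _ in range(trials):
    D = draw_training_set()
    for x in X:
        preds[x].append(nearest_neighbor(D, x))

errors = []
for x in X:
    avg = mean(preds[x])
    bias = (avg - target[x]) ** 2
    variance = mean((p - avg) ** 2 for p in preds[x])
    errors.append(bias + variance)
    print(f"x = {x}: bias ~ {bias:.3f}, variance ~ {variance:.3f}")
print("estimated learning error ~", mean(errors))
```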

In light of the bias-variance tradeoff, we see that the ID3 algorithm attempts to reduce the length of branches in the decision tree. This has the effect of reducing variance, at the expense of increasing bias. To further this goal, the ID3 algorithm is usually followed by a rule-pruning phase in which an attempt is made to shorten rules.

Rule post-pruning steps (see the sketch below):

1. develop the decision tree without any concern for overfitting;
2. convert the tree into an equivalent set of rules;
3. prune each rule by removing any preconditions whose removal improves accuracy;
4. sort the rules by their estimated accuracy and use them in this sequence.
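Here is a minimal sketch of step 3, under my own assumptions about the data structures (a rule is a list of (attribute, value) preconditions paired with a class, and accuracy is estimated on a held-out validation set of labeled examples):

```python
def matches(preconditions, example):
    return all(example[attr] == value for attr, value in preconditions)

def rule_accuracy(preconditions, predicted_class, validation):
    """Accuracy of the rule over the validation examples it covers."""
    covered = [(x, y) for x, y in validation if matches(preconditions, x)]
    if not covered:
        return 0.0
    return sum(y == predicted_class for _, y in covered) / len(covered)

def prune_rule(preconditions, predicted_class, validation):
    """Greedily drop any precondition whose removal improves estimated accuracy."""
    preconditions = list(preconditions)
    improved = True
    while improved:
        improved = False
        current = rule_accuracy(preconditions, predicted_class, validation)
        for p in preconditions:
            trial = [q for q in preconditions if q != p]
            if rule_accuracy(trial, predicted_class, validation) > current:
                preconditions = trial
                improved = True
                break
    return preconditions

# Hypothetical example: prune the rule (weight=medium AND color=red) -> apple.
validation = [({"weight": "heavy",  "color": "red"},   "apple"),
              ({"weight": "medium", "color": "red"},   "apple"),
              ({"weight": "medium", "color": "red"},   "berry"),
              ({"weight": "medium", "color": "green"}, "berry"),
              ({"weight": "heavy",  "color": "green"}, "melon")]
print(prune_rule([("weight", "medium"), ("color", "red")], "apple", validation))
# the weight precondition is dropped, leaving (color=red) -> apple
```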

Exercises

1. Draw a minimum-sized decision tree for the three-input XOR function, which produces a 1 iff an odd number of the inputs evaluate to one.

2. Provide decision trees to represent the following Boolean functions: A and not B; A or (B and C); A xor B; (A and B) or (C and D).

3. Consider the following set of training examples:

Instance  Classification  a1  a2
1         +               T   T
2         +               T   T
3         -               T   F
4         +               F   F
5         -               F   T
6         -               F   T

Calculate the entropy of the collection with respect to the classification, and determine which of the two attributes provides the most information gain.

4. Repeat Example 2, but instead use the measure IG(S, A)/SI(A, S) to calculate the attribute to use at a given node of the tree.

5. Create a decision tree using the ID3 algorithm for the following table of data.

Vector  A1  A2  A3  Class
v1      1   0   0   0
v2      1   0   1   0
v3      0   1   0   0
v4      1   1   1   1
v5      1   1   0   1

6. Suppose X = {1, 2, 3, 4, 5, 6} and that 1, 2, 3 are in class 0, while 4, 5, 6 are in class 1. Suppose that each training set S has |S| = 2 (one training vector from each class) and that the learning algorithm \Gamma is again the nearest-neighbor algorithm (see Example 4). Compute learning-error(\Gamma). Also, compute bias(\Gamma, 1) and variance(\Gamma, 1). Hint: there are only nine possible training sets to consider.