Learning Classification Trees. Sargur Srihari (srihari@cedar.buffalo.edu)

Topics in CART: CART as an adaptive basis function model; Classification and Regression Tree basics; growing a tree.

A Classification Tree: the PlayTennis training data. The learned function is a tree.

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Components of a Classification Tree: instances are classified by sorting them down the tree from the root to a leaf node. Each internal node specifies a test of an attribute, and each branch descending from a node corresponds to a possible value of that attribute.

Classification Tree Learning: approximates discrete-valued target functions as trees. It is robust to noisy data and capable of learning disjunctive expressions. The family of decision tree learning algorithms includes ID3, ASSISTANT and C4.5. These algorithms use a completely expressive hypothesis space.

Graphical Model for PlayTennis: Outlook (3 values), Temp (3), Humidity (2), Wind (2), PlayTennis (2). This model needs 76 probabilities. Can we do better?

Four Tree Stumps for PlayTennis (figure).

Good and Poor Attributes. An attribute is good when, for one value, all instances are positive and, for the other value, all instances are negative. An attribute is poor when it provides no discrimination: the attribute is immaterial to the decision, and for each value we have the same number of positive and negative instances.

Entropy: a Measure of the Homogeneity of Examples. Entropy characterizes the (im)purity of an arbitrary collection of examples. Given a collection S of positive and negative examples, the entropy of S relative to this boolean classification is

Entropy(S) = - p_+ log2(p_+) - p_- log2(p_-)

where p_+ is the proportion of positive examples and p_- is the proportion of negative examples.

Entropy Function Relative to a Boolean Classification (figure: entropy as a function of the proportion p_+ of positive examples; it is 0 when p_+ is 0 or 1 and peaks at 1 bit when p_+ = 0.5).

Entropy Illustration: S is a collection of 14 examples, with 9 positive and 5 negative. The entropy of S relative to the boolean classification is

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Entropy is zero if all members of S belong to the same class.

Entropy for a Multi-Valued Target Function: if the target attribute can take on c different values, the entropy of S relative to this c-wise classification is

Entropy(S) = - sum_{i=1}^{c} p_i log2(p_i)

where p_i is the proportion of S belonging to class i.
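As a concrete illustration of this formula, here is a minimal Python sketch of the entropy computation (the helper name `entropy` and the list-of-labels representation are my own choices for illustration, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2(p_i), over the classes observed in `labels`."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

# The boolean example from the slides: 9 positive and 5 negative examples.
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))  # prints 0.94
```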

Information Gain: entropy measures the impurity of a collection; information gain is defined in terms of entropy as the expected reduction in entropy caused by partitioning the examples according to an attribute.

Information Gain Measures the Expected Reduction in Entropy. The information gain of attribute A is the reduction in entropy caused by partitioning the set of examples S:

Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
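A small Python sketch of this definition (restating the entropy helper so the snippet is self-contained; the dict-per-example representation and the name `info_gain` are assumptions for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v).

    `examples` is a list of dicts; `attribute` and `target` are keys into them.
    """
    total = entropy([e[target] for e in examples])
    n = len(examples)
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder
```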

Measure of purity: information (in bits), illustrated for the Outlook attribute. Info([2,3]) = entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97 bits. For the three Outlook branches: Info([2,3]) = 0.97 bits (Sunny), Info([4,0]) = 0 bits (Overcast), Info([3,2]) = 0.97 bits (Rain). Weighted average information of the subtrees = 0.97 x 5/14 + 0 x 4/14 + 0.97 x 5/14 = 0.693 bits. Information of all training examples, Info([9,5]) = 0.94 bits, so gain(outlook) = 0.94 - 0.693 = 0.247 bits.

Expanded Tree Stumps for PlayTennis, for Outlook = Sunny (5 examples, [2+, 3-], Info([2,3]) = 0.97).
Temperature: Info([0,2]) = Info([1,0]) = 0, Info([1,1]) = 1; weighted average info = (2/5)(1) = 0.4; Gain(temperature) = 0.97 - 0.4 = 0.57 bits.
Wind: Info([1,2]) = 0.92, Info([1,1]) = 1; weighted average info = (3/5)(0.92) + (2/5)(1) = 0.95; Gain(wind) = 0.97 - 0.95 = 0.02 bits.
Humidity: Info([0,3]) = Info([2,0]) = 0; Gain(humidity) = 0.97 bits.
Since Gain(humidity) is highest, select Humidity as the splitting attribute; its branches are pure, so there is no need to split further.
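To double-check these branch gains, here is a short Python sketch over just the five Outlook = Sunny rows, reusing the entropy helper sketched earlier (the variable names below are illustrative only):

```python
# The five Outlook = Sunny examples as (Temp, Humidity, Wind, PlayTennis).
sunny = [("Hot", "High", "Weak", "No"), ("Hot", "High", "Strong", "No"),
         ("Mild", "High", "Weak", "No"), ("Cool", "Normal", "Weak", "Yes"),
         ("Mild", "Normal", "Strong", "Yes")]
target = [row[3] for row in sunny]

for idx, name in [(0, "Temperature"), (1, "Humidity"), (2, "Wind")]:
    remainder = 0.0
    for value in {row[idx] for row in sunny}:
        subset = [row[3] for row in sunny if row[idx] == value]
        remainder += (len(subset) / len(sunny)) * entropy(subset)
    print(name, round(entropy(target) - remainder, 2))
# Temperature 0.57, Humidity 0.97, Wind 0.02
```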

Information gain for each attribute:
Gain(outlook) = 0.94 - 0.693 = 0.247
Gain(temperature) = 0.94 - 0.911 = 0.029
Gain(humidity) = 0.94 - 0.788 = 0.152
Gain(windy) = 0.94 - 0.892 = 0.048
argmax{0.247, 0.029, 0.152, 0.048} = outlook, so select Outlook as the splitting attribute at the root of the tree.
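The following sketch reproduces these four root-level gains from the PlayTennis table shown earlier, assuming the info_gain helper from the earlier sketch is in scope (the data encoding below is my own, not the slides'):

```python
# Columns: Outlook, Temp, Humidity, Wind, PlayTennis.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "PlayTennis")
examples = [dict(zip(COLUMNS, row)) for row in ROWS]

for attribute in COLUMNS[:-1]:
    print(attribute, round(info_gain(examples, attribute, "PlayTennis"), 3))
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048 -> Outlook is chosen
```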

Decision Tree for the Weather Data (figure: Outlook at the root, with Humidity tested under Sunny and Wind tested under Rain).

The Basic Decision Tree Learning Algorithm (ID3): a top-down, greedy search (with no backtracking) through the space of possible decision trees. It begins with the question: which attribute should be tested at the root of the tree? The answer: evaluate each attribute to see how well it alone classifies the training examples. The best attribute is used as the root node, and a descendant of the root is created for each possible value of that attribute.

Which Attribute Is Best for the Classifier? Select the attribute that is most useful for classification. ID3 uses information gain as a quantitative measure of an attribute. Information gain is a statistical property that measures how well a given attribute separates the training examples according to their target classification.

ID3 Algorithm Notation (figure): the attribute A that best classifies the examples is placed at the root node; branches correspond to the values v1, v2 and v3 of A; leaves carry labels + and -; the recursion continues with the remaining attributes, {Attributes} - A.

ID3 Algorithm, to learn boolean-valued functions.
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute (or feature) whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree (actually the root node of the tree) that correctly classifies the given Examples.
Create a Root node for the tree.
If all Examples are positive, return the single-node tree Root, with label = +.
If all Examples are negative, return the single-node tree Root, with label = -.
If Attributes is empty, return the single-node tree Root, with label = the most common value of Target_attribute in Examples.
(Note: otherwise, as in the continuation below, the returned node is named by a feature.)

ID3 Algorithm, continued.
Otherwise begin:
  A <- the attribute from Attributes that best* classifies Examples
  The decision attribute (feature) for Root <- A
  For each possible value v_i of A:
    Add a new tree branch below Root, corresponding to the test A = v_i
    Let Examples_vi be the subset of Examples that have value v_i for A
    If Examples_vi is empty, then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples
    Else, below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
End
Return Root
*The best attribute is the one with the highest information gain, as defined by the equation Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) Entropy(S_v).
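A compact Python rendering of this procedure, under the same representation as the earlier sketches (the `examples` list and `info_gain` helper are assumed to be in scope; the nested-dict tree format is my own illustration, not the slides' notation):

```python
from collections import Counter

def id3(examples, target, attributes):
    """Recursive ID3 sketch: returns a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # all examples positive or all negative
        return labels[0]
    if not attributes:                 # no attributes left: most common label
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    # Branch on the values of `best` observed in the data (the slides' empty-branch
    # case only arises if the attribute's full value set is supplied instead).
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree

print(id3(examples, "PlayTennis", ["Outlook", "Temp", "Humidity", "Wind"]))
# e.g. {'Outlook': {'Overcast': 'Yes',
#                   'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                   'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}
```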

A Perfect Feature: suppose Outlook had only two values, Sunny and Rainy; for Sunny all 5 values of PlayTennis are Yes, and for Rainy all 9 values are No. Then Gain(S, outlook) = 0.94 - (5/14)(0) - (9/14)(0) = 0.94.

Training Examples for the Target Concept PlayTennis (Mitchell, Table 3.2)

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Stepping through ID3 for the example, top level (with S = {D1, ..., D14}):
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook gives the best prediction of the target attribute.
Example computation of the gain for Wind: Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-], so Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048.

Sample calculations for Humidity and Wind (Mitchell, Figure 3.3).

The Partially Learned Decision Tree: within the Outlook = Sunny branch, Humidity is a perfect feature, since no uncertainty is left once its value is known.

Hypothesis Space Search in Decision Tree Learning: ID3 searches a hypothesis space for a hypothesis that fits the training examples. The hypothesis space searched is the set of possible decision trees. ID3 performs hill climbing, starting from the empty tree and considering progressively more elaborate hypotheses, in order to find a tree that correctly classifies the training data. The hill climbing is guided by an evaluation function: the information gain measure.

Hypothesis Space Search by ID3 (Mitchell, Figure 3.5): the information gain heuristic guides ID3's search of the hypothesis space.

Decision Trees and Rule-Based Systems: learned trees can also be re-represented as sets of if-then rules to improve human readability.
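As a sketch of that re-representation, the nested-dict tree produced by the id3 sketch above can be flattened into if-then rules (the name tree_to_rules and the (conditions, class) tuple format are illustrative assumptions):

```python
def tree_to_rules(tree, conditions=()):
    """Flatten a nested-dict decision tree into (conditions, class) rules."""
    if not isinstance(tree, dict):                 # leaf node: one finished rule
        return [(list(conditions), tree)]
    ((attribute, branches),) = tree.items()        # the single attribute tested here
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

# For the PlayTennis tree this yields rules such as:
#   IF Outlook = Sunny AND Humidity = Normal THEN PlayTennis = Yes
#   IF Outlook = Overcast THEN PlayTennis = Yes
#   IF Outlook = Rain AND Wind = Weak THEN PlayTennis = Yes
```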

Decision trees represent a disjunction of conjunctions: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak).

Appropriate Problems for Decision Tree Learning. Instances are represented by attribute-value pairs: each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cool); extensions allow real-valued attributes as well (e.g., temperature). The target function has discrete output values, e.g., a boolean classification (yes or no); this is easily extended to multiple-valued functions and can be extended to real-valued outputs as well.

Appropriate Problems for Decision Tree Learning (2). Disjunctive descriptions may be required: trees naturally represent disjunctive expressions. The training data may contain errors: tree learning is robust to errors both in classifications and in attribute values. The training data may contain missing attribute values, e.g., humidity may be known for only some of the training examples.

Appropriate Problems for Decision Tree Learning (3). Practical problems that fit these characteristics include: learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on payments.

Capabilities and Limitations of ID3. Its hypothesis space is a complete space of all discrete-valued functions. It cannot determine how many alternative trees are consistent with the training data (this follows from maintaining a single current hypothesis). ID3 in its pure form performs no backtracking, with the usual risk of hill climbing: it may converge to a local optimum. ID3 uses all training examples at each step to make statistically based decisions about how to refine its current hypothesis, which makes it more robust than Find-S and Candidate Elimination, which are incremental.