Lecture 7 Decision Tree Classifier


Machine Learning. Dr. Ammar Mohammed. Lecture 7: Decision Tree Classifier

Decision Tree
A decision tree is a simple classifier in the form of a hierarchical tree structure that performs supervised classification using a divide-and-conquer strategy. The root and internal nodes test an attribute, branches correspond to the possible answers (outcomes) of the test, and a leaf node indicates the class (category). Classification is performed by routing an example from the root node until it arrives at a leaf node.

Decision Tree: Example. A decision tree for determining whether to play tennis.

Decision Tree: Training Dataset
Training examples:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Decision Tree: Training Dataset. Output: a decision tree for buys_computer. The root tests age?: the <=30 branch leads to a test on student? (no leads to class no, yes leads to class yes), the 31..40 branch leads directly to the leaf yes, and the >40 branch leads to a test on credit rating? (excellent leads to class no, fair leads to class yes).

Decision Tree: How to Build the Tree
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the best selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf).
- There are no samples left.

Example

Example: a possible answer. One tree first tests EyeColor: on the brown branch it tests Married (yes leads to football; no leads to a further test on HairLength), and on the blue branch it tests HairLength directly, with the long and short branches ending in leaves labelled football or Netball.

Example: another answer. A much simpler tree tests only Sex: male leads to football, female leads to Netball. The questions are: I. What is the best tree? II. Which attribute should be selected for splitting the branches?

How to Construct a Tree
Algorithm: BuildSubtree
Require: Node n, Data D
  (n.split, D_L, D_R) = FindBestSplit(D)
  if StoppingCriteria(D_L) then
    n.left_prediction = FindPrediction(D_L)
  else
    BuildSubtree(n.left, D_L)
  if StoppingCriteria(D_R) then
    n.right_prediction = FindPrediction(D_R)
  else
    BuildSubtree(n.right, D_R)
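The following is a minimal, self-contained Python sketch of the same recursive idea, specialized to categorical attributes and an information-gain split criterion. The helper names (build_subtree, best_attribute, entropy) are illustrative and not part of the lecture's pseudocode.

from collections import Counter
from math import log2

def entropy(labels):
    # Info of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # Pick the attribute with the highest information gain.
    base = entropy(labels)
    def gain(a):
        expected = 0.0
        for v in set(r[a] for r in rows):
            subset = [labels[i] for i, r in enumerate(rows) if r[a] == v]
            expected += len(subset) / len(rows) * entropy(subset)
        return base - expected
    return max(attributes, key=gain)

def build_subtree(rows, labels, attributes):
    # Stopping criteria: pure node, or no remaining attributes (majority vote).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {a: {}}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node[a][v] = build_subtree([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   [x for x in attributes if x != a])
    return node

Here rows is a list of dictionaries mapping attribute names to values, labels holds the corresponding class labels, and the returned tree is a nested dictionary with class labels at the leaves.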

Information, Entropy and Impurity
Entropy is a measure of the disorder or unpredictability in a system. Given a binary (two-class) classification C and a set of examples S, the class distribution at any node can be written as (p0, p1), where p1 = 1 - p0, and the entropy of S is the sum of the information:
Info(S) = -p0 log2(p0) - p1 log2(p1)
For example, splitting 10 examples on an attribute A:
- a (5, 5) node has Entropy = 1: highest impurity, a useless classifier;
- a (6, 4) node has Entropy = 0.97: high impurity, a poor classifier;
- a (10, 0) node has Entropy = 0 and Impurity = 0: a good classifier.
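As a quick check of the three node distributions above, here is a short Python snippet (the counts (5,5), (6,4), and (10,0) are taken from the slide):

from math import log2

def binary_entropy(p0):
    # Equivalently -p0 log2(p0) - p1 log2(p1), with 0 log 0 treated as 0.
    p1 = 1 - p0
    return sum(p * log2(1 / p) for p in (p0, p1) if p > 0)

for pos, neg in [(5, 5), (6, 4), (10, 0)]:
    print(pos, neg, round(binary_entropy(pos / (pos + neg)), 2))
# prints 1.0 (highest impurity), 0.97 (high impurity), and 0.0 (pure node)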

Information, Entropy and Impurity
In the general case, the target attribute can take on m different values (a multi-way split), and the entropy of S relative to this m-wise classification is given by:
Info(S) = - sum over i = 1..m of p_i log2(p_i)
Example: consider eight examples with attribute X in {AI, Math, CS} and class Y in {Yes, No}:
X:  AI   Math  CS   AI   AI   CS   AI   Math
Y:  Yes  No    Yes  No   No   Yes  Yes  No
Info(Y) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit
P(X=AI) = 0.5, P(X=Math) = 0.25, P(X=CS) = 0.25
Info(Y|X=AI)   = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1 bit
Info(Y|X=Math) = -(2/2) log2(2/2) - (0/2) log2(0/2) = 0 bits
Info(Y|X=CS)   = -(2/2) log2(2/2) - (0/2) log2(0/2) = 0 bits
Info(Y|X) = P(X=AI) Info(Y|X=AI) + P(X=Math) Info(Y|X=Math) + P(X=CS) Info(Y|X=CS)

v     P(X=v)  Info(Y|X=v)
AI    0.50    1
Math  0.25    0
CS    0.25    0

Info(Y|X) = 1*0.5 + 0*0.25 + 0*0.25 = 0.5
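The same conditional entropy can be computed directly in Python for the eight examples above:

from collections import Counter
from math import log2

X = ["AI", "Math", "CS", "AI", "AI", "CS", "AI", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

info_y_given_x = sum(
    (X.count(v) / len(X)) * entropy([y for x, y in zip(X, Y) if x == v])
    for v in set(X)
)
print(entropy(Y), info_y_given_x)   # 1.0 and 0.5, as computed above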

Information, Entropy and Impurity
Information gain measures how much a given attribute X tells us about the class Y. The conditional entropy is
Info(Y|X) = sum over j of P(X=v_j) Info(Y|X=v_j)
We can write Info(Y|X) shortly as Info_X(Y). The term Info(Y|X=v_j) is the entropy of Y computed over the examples for which X = v_j, and the sum runs over the possible values v_j of attribute X.

Attribute Selection Measure: Information Gain
Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D (a data set) belongs to class C_i, estimated by |C_i,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - sum over i = 1..m of p_i log2(p_i)
Information needed (after using attribute A to split D into v partitions) to classify D:
Info_A(D) = sum over j = 1..v of (|D_j| / |D|) Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)

Example: For the buys_computer training set, Gain(age) = 0.246 bits. Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute.

Attribute Selection: Information Gain
The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples are partitioned accordingly.
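The gains quoted above can be reproduced with a few lines of Python over the training table (attribute order as in the table; small differences in the third decimal are due to rounding):

from collections import Counter
from math import log2

data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, a in enumerate(attributes):
    info_a = sum(
        sum(1 for r in data if r[i] == v) / len(data)
        * entropy([r[-1] for r in data if r[i] == v])
        for v in set(r[i] for r in data)
    )
    print(a, round(entropy(labels) - info_a, 3))
# age has the highest gain and is chosen as the root split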

Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point (a small sketch of this procedure follows below).
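A sketch of the candidate-midpoint search, with made-up values for a hypothetical continuous attribute A:

from collections import Counter
from math import log2

values = [60, 70, 75, 85, 90, 95]          # hypothetical values of A
labels = ["no", "no", "yes", "yes", "yes", "no"]

def entropy(ls):
    n = len(ls)
    return -sum((c / n) * log2(c / n) for c in Counter(ls).values())

pairs = sorted(zip(values, labels))
candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2 for i in range(len(pairs) - 1)]

def expected_info(split):
    left = [l for v, l in pairs if v <= split]
    right = [l for v, l in pairs if v > split]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)

best = min(candidates, key=expected_info)
print(best, expected_info(best))   # the split point with minimum expected information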

Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
SplitInfo_A(D) = - sum over j = 1..v of (|D_j| / |D|) log2(|D_j| / |D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Example (income, with 4 low, 6 medium, and 4 high tuples):
SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute (see the computation sketch below).
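A short check of the numbers above in Python (income has 4 low, 6 medium, and 4 high tuples):

from math import log2

counts = {"low": 4, "medium": 6, "high": 4}
n = sum(counts.values())
split_info = -sum((c / n) * log2(c / n) for c in counts.values())
print(round(split_info, 3), round(0.029 / split_info, 3))   # 1.557 and 0.019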

GINI Index: IBM IntelligentMiner
If a data set D contains examples from n classes, the Gini index gini(D) is defined as
gini(D) = 1 - sum over j = 1..n of p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the Gini index of the split is defined as
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node. This requires enumerating all the possible splitting points for each attribute.

GINI Index
Example: D has 9 tuples with buys_computer = yes and 5 with no:
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
gini_income in {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443
The Gini index of the split on {low, high} and {medium} is 0.458, and on {medium, high} and {low} it is 0.450; the split on {low, medium} and {high} is therefore the best because it has the lowest Gini index (a computation sketch follows below).
- All attributes are assumed continuous-valued.
- Other tools, e.g., clustering, may be needed to get the possible split values.
- The measure can be modified for categorical attributes.
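The same numbers in Python, using the class counts from the training table:

def gini(yes, no):
    n = yes + no
    return 1 - (yes / n) ** 2 - (no / n) ** 2

print(round(gini(9, 5), 3))                    # gini(D) = 0.459
d1 = gini(7, 3)    # income in {low, medium}: 10 tuples, 7 yes / 3 no
d2 = gini(2, 2)    # income = high: 4 tuples, 2 yes / 2 no
print(round(10 / 14 * d1 + 4 / 14 * d2, 3))    # 0.443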

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both partitions.

Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy for unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees, and use a set of data different from the training data to decide which is the best pruned tree (see the sketch below).
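One common way to realize postpruning in practice is cost-complexity pruning with a held-out validation set. The sketch below uses scikit-learn and one of its built-in datasets purely for illustration; it is not the lecture's own method.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    # max(alpha, 0.0) guards against tiny negative values from floating point.
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=max(alpha, 0.0))
    score = tree.fit(X_train, y_train).score(X_val, y_val)   # accuracy on unseen data
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)

Each candidate alpha yields a progressively smaller pruned tree; the validation set (data not used for training) decides which pruned tree to keep, in the spirit of the postpruning bullet above.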

Exercise
1. Suppose that the probabilities of five events are P(1) = 0.5 and P(2) = P(3) = P(4) = P(5) = 0.125. Calculate the entropy.
2. Three binary nodes, N1, N2, and N3, split examples into (0, 6), (1, 5), and (3, 3), respectively. For each node, calculate its entropy and its Gini impurity.
3. Build a decision tree that computes the logical AND function.
4. Given the list (shown on the slide) of everything someone has done in the past 10 days, which feature do you use as the starting (root) node? For this you need to compute the entropy and then find out which feature has the maximal information gain.