Machine Learning. Dr. Ammar Mohammed. Lecture 7: Decision Tree Classifier
Decision Tree
A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification using a divide-and-conquer strategy.
- The root and each internal node test an attribute.
- Branches correspond to the possible answers of the test.
- A leaf node indicates the class (category).
Classification is performed by routing from the root node until arriving at a leaf node.
Decision Tree: Example decision tree for determining whether to play tennis.
Decision Tree: Training Dataset
Training examples:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Decision Tree: Training Dataset
Output: a decision tree for buys_computer

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
Decision Tree: How to Build the Tree
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the best selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf).
- There are no samples left.
Example
Example: possible answer

EyeColor?
  brown -> Married?
             yes -> football
             no  -> HairLength?
                      long  -> football
                      short -> Netball
  blue  -> HairLength?
             short -> football
             long  -> Netball
Example: Another answer

Sex?
  male   -> football
  female -> Netball

The question is:
I. What is the best tree?
II. Which attribute should be selected for splitting the branches?
How to Construct a Tree
Algorithm: BuildSubtree
Require: Node n, Data D
1: (n.split, D_L, D_R) = FindBestSplit(D)
2: if StoppingCriteria(D_L) then
3:   n.left_prediction = FindPrediction(D_L)
4: else
5:   BuildSubtree(n.left, D_L)
6: if StoppingCriteria(D_R) then
7:   n.right_prediction = FindPrediction(D_R)
8: else
9:   BuildSubtree(n.right, D_R)
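The recursion above can be sketched in Python. This is a minimal illustration, not the lecture's code: it assumes a single numeric feature, uses entropy as the split criterion, and represents a node as either a class label (leaf) or a (threshold, left, right) tuple.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def find_best_split(data):
    """Return (threshold, D_left, D_right) minimising the weighted entropy.
    data is a list of (x, label) pairs with a single numeric feature x."""
    best = None
    xs = sorted({x for x, _ in data})
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2                       # midpoint candidate split
        left = [(x, y) for x, y in data if x <= t]
        right = [(x, y) for x, y in data if x > t]
        w = (len(left) * entropy([y for _, y in left])
             + len(right) * entropy([y for _, y in right])) / len(data)
        if best is None or w < best[0]:
            best = (w, t, left, right)
    _, t, left, right = best
    return t, left, right

def stopping_criteria(data):
    return len({y for _, y in data}) <= 1     # stop when the node is pure

def find_prediction(data):
    return Counter(y for _, y in data).most_common(1)[0][0]  # majority vote

def build_subtree(data):
    """Mirror of the BuildSubtree pseudocode."""
    if stopping_criteria(data):
        return find_prediction(data)
    t, d_left, d_right = find_best_split(data)
    return (t, build_subtree(d_left), build_subtree(d_right))

def predict(node, x):
    """Classify by routing from the root until arriving at a leaf."""
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x <= t else right
    return node
```

For example, `build_subtree([(1, "no"), (2, "no"), (3, "yes"), (4, "yes")])` splits at 2.5 and classifies x = 1.5 as "no".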
Information, Entropy and Impurity
Entropy is a measure of the disorder or unpredictability in a system. Given a binary (two-class) classification C and a set of examples S, the class distribution at any node can be written as (p0, p1), where p1 = 1 - p0, and the entropy of S is the sum of the information:

Info(S) = -p0 log2(p0) - p1 log2(p1)

Example: three nodes, each holding 10 examples split on attribute A:
- (5, 5):  Entropy = 1, highest impurity, useless classifier
- (6, 4):  Entropy = 0.97, high impurity, poor classifier
- (10, 0): Entropy = 0, impurity = 0, good classifier
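The three entropy values can be checked numerically; the helper below (not part of the lecture) treats 0 * log2(0) as 0:

```python
import math

def entropy(p0, p1):
    """Info(S) = -p0*log2(p0) - p1*log2(p1), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

# The three 10-example nodes from the slide:
print(round(entropy(5/10, 5/10), 2))    # (5, 5):  1.0  -> highest impurity
print(round(entropy(6/10, 4/10), 2))    # (6, 4):  0.97 -> high impurity
print(round(entropy(10/10, 0/10), 2))   # (10, 0): 0.0  -> pure node
```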
Information, Entropy and Impurity
In the general case, the target attribute can take on m different values (multi-way split), and the entropy of S relative to this m-wise classification is:

Info(S) = -sum_{i=1}^{m} p_i log2(p_i)

Example: eight training examples with attribute X and class Y:

X: AI   Math  CS   AI   AI   CS   AI   Math
Y: Yes  No    Yes  No   No   Yes  Yes  No

Info(Y) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
P(X=AI) = 0.5, P(X=Math) = 0.25, P(X=CS) = 0.25
Info(Y|X=AI)   = -2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
Info(Y|X=Math) = -2/2 log2(2/2) - 0/2 log2(0/2) = 0 bit
Info(Y|X=CS)   = -0/2 log2(0/2) - 2/2 log2(2/2) = 0 bit

Info(Y|X) = P(X=AI) Info(Y|X=AI) + P(X=Math) Info(Y|X=Math) + P(X=CS) Info(Y|X=CS)

v     P(X=v)  Info(Y|X=v)
AI    0.50    1
Math  0.25    0
CS    0.25    0

Info(Y|X) = 1*0.5 + 0*0.25 + 0*0.25 = 0.5
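The worked example can be reproduced with a short sketch (helper names are my own, not from the lecture):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_given(xs, ys):
    """Conditional entropy Info(Y|X) = sum_v P(X=v) * Info(Y|X=v)."""
    n = len(xs)
    total = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        total += (len(ys_v) / n) * info(ys_v)
    return total

X = ["AI", "Math", "CS", "AI", "AI", "CS", "AI", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"]
print(info(Y))           # 1.0 bit
print(info_given(X, Y))  # 0.5 bit
```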
Information, Entropy and Impurity
Information gain measures how much a given attribute X tells us about the class Y:

Info(Y|X) = sum_j P(X=v_j) Info(Y|X=v_j)

We can write Info(Y|X) shortly as Info_X(Y). Here the values v_j range over the possible values (partitions) of attribute X, and Info(Y|X=v_j) is the entropy of Y restricted to the examples with X = v_j.
Attribute Selection Measure: Information Gain
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D (a data set) belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:

Info(D) = -sum_{i=1}^{m} p_i log2(p_i)

Information needed (after using attribute A to split D into v partitions) to classify D:

Info_A(D) = sum_{j=1}^{v} (|D_j| / |D|) Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
Example:
Example (continued): Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
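The gains can be verified with a short script over the training table from the earlier slide. Note that computing at full precision gives Gain(age) ≈ 0.247 and Gain(student) ≈ 0.152; the slide's 0.246 and 0.151 come from rounding the intermediate entropies:

```python
import math
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
names = ["age", "income", "student", "credit_rating"]

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column col."""
    labels = [r[-1] for r in rows]
    info_a = 0.0
    for v in {r[col] for r in rows}:
        part = [r[-1] for r in rows if r[col] == v]
        info_a += len(part) / len(rows) * info(part)
    return info(labels) - info_a

for i, name in enumerate(names):
    print(name, round(gain(i), 3))
# age has the highest gain (about 0.25 bits), so it becomes the root
```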
Attribute Selection: Information Gain The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age. The tuples are shown partitioned accordingly.
Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
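The three steps above can be sketched as follows; the six (age, class) pairs are invented for illustration:

```python
import math

def info(pos, neg):
    """Binary entropy of a node with pos/neg class counts."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

# toy continuous attribute: (age, buys) pairs
samples = [(23, "no"), (25, "no"), (31, "yes"), (35, "yes"), (42, "yes"), (46, "no")]
samples.sort()                        # step 1: sort by attribute value

values = [a for a, _ in samples]
best = None
for a, b in zip(values, values[1:]):
    split = (a + b) / 2               # step 2: midpoint candidate
    d1 = [y for v, y in samples if v <= split]
    d2 = [y for v, y in samples if v > split]
    info_a = (len(d1) * info(d1.count("yes"), d1.count("no"))
              + len(d2) * info(d2.count("yes"), d2.count("no"))) / len(samples)
    if best is None or info_a < best[1]:
        best = (split, info_a)        # step 3: keep the minimum expected info

print(best)  # the midpoint 28.0 minimises the expected information here
```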
Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):

SplitInfo_A(D) = -sum_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. income splits the 14 training tuples into partitions of sizes 4, 6 and 4:

SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

gain_ratio(income) = 0.029 / 1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute.
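The arithmetic can be checked with a small helper (not from the lecture):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes)

# income splits the 14 training tuples into partitions of sizes 4, 6 and 4
si = split_info([4, 6, 4])
print(round(si, 3))          # 1.557
print(round(0.029 / si, 3))  # gain_ratio(income) = Gain(income) / SplitInfo = 0.019
```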
Gini Index (IBM IntelligentMiner)
If a data set D contains examples from n classes, the Gini index gini(D) is defined as:

gini(D) = 1 - sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as:

gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:

delta_gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).
Gini Index
Ex. D has 9 tuples with buys_computer = yes and 5 with no:

gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

gini_income in {low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

The splits on {low, high} and {medium, high} give Gini index values of 0.458 and 0.450, respectively; thus the split on {low, medium} (and {high}) is the best since it is the lowest.

- All attributes are assumed continuous-valued.
- We may need other tools, e.g., clustering, to get the possible split values.
- The method can be modified for categorical attributes.
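A quick check of both values (the class counts per partition, 7 yes / 3 no for {low, medium} and 2 yes / 2 no for {high}, are read off the training table):

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2 for the class counts in D."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# D: 9 buys_computer = yes, 5 no
print(round(gini([9, 5]), 3))  # 0.459

# binary split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no)
g = (10 / 14) * gini([7, 3]) + (4 / 14) * gini([2, 2])
print(round(g, 3))             # 0.443
```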
Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.
Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy for unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the best pruned tree.
Exercise
1. Suppose that the probabilities of five events are P(1) = 0.5 and P(2) = P(3) = P(4) = P(5) = 0.125. Calculate the entropy.
2. Three binary nodes, N1, N2, and N3, split examples into (0, 6), (1, 5), and (3, 3), respectively. For each node, calculate its entropy and its Gini impurity.
3. Build a decision tree that computes the logical AND function.
4. Given, on the side, a list of everything someone has done in the past 10 days: which feature do you use as the starting (root) node? For this you need to compute the entropy and then find the feature with the maximal information gain.