Decision Trees: a supervised approach, used for classification (categorical values) or regression (continuous values). Decision trees are learned from class-labeled training tuples. The result is a flowchart-like structure: each internal node tests an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.
Example: a decision tree to decide whether a student will attend the lecture or not. [Tree diagram: internal nodes test "PQR College?", "Teacher?" (Prof. ABC), and "Industry?"; the leaves are Yes/No labels for attending.]
How to classify with decision trees? For a given tuple X, trace its attribute values down the tree, testing them against the decision nodes until a leaf is reached. Decision trees can also be converted to classification rules.
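As a minimal sketch (not from the slides), a tree can be represented as nested dicts and a tuple X classified by tracing its attribute values; the attribute names and the simplified "attend lecture" tree below are hypothetical.

```python
# Hypothetical, simplified version of the "attend lecture" tree as nested dicts.
tree = {
    "attr": "College",
    "branches": {
        "PQR": {"attr": "Teacher",
                "branches": {"Prof. ABC": "Yes", "Other": "No"}},
        "Not PQR": "No",
    },
}

def classify(node, x):
    """Trace x's attribute values down the tree until a leaf (class label) is reached."""
    while isinstance(node, dict):              # internal node: test an attribute
        node = node["branches"][x[node["attr"]]]
    return node                                # leaf node: class label

print(classify(tree, {"College": "PQR", "Teacher": "Prof. ABC"}))   # -> Yes
```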
Available information:

Where  When   Sachin opening  Dhoni wicketkeeper  Against     Outcome
Home   5 pm   Yes             Yes                 Australia   Lost
Away   7 pm   No              Yes                 Sri Lanka   Won
Home   9 pm   Yes             Yes                 Australia   Won

What we know (new example, outcome unknown):

Away   4 pm   No              No                  Australia   ?

What we want: classify the new example, i.e. generalize the rules learned from the known examples.
Why use decision trees? They do not require domain knowledge, can handle multi-dimensional data, and their representation is simple and easy for users to understand. They are fast, have good classification accuracy, are robust to outliers, and are non-parametric (no assumptions about the classifier structure).
Decision tree algorithms: ID3 (Iterative Dichotomiser) was developed in the early 1980s; C4.5 and CART (Classification and Regression Trees) were presented later. ID3 and CART follow a similar approach.
Basic decision tree algorithm, parameters: Algo(D, attribute_list, attribute_selection_method), where D is the data partition (initially the complete training set), attribute_list is the list of candidate attributes, and attribute_selection_method specifies the heuristic procedure for selecting the attribute that best discriminates the given tuples.
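A minimal sketch of the recursive procedure, assuming D is passed as a list of (attribute dict, class label) pairs and attribute_selection_method is any callable that picks the best attribute (e.g. by highest information gain); the names here are illustrative, not the textbook's.

```python
from collections import Counter

def build_tree(D, attribute_list, attribute_selection_method):
    """D is a data partition: a list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in D]
    if len(set(labels)) == 1:                     # all tuples have the same class: leaf
        return labels[0]
    if not attribute_list:                        # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    A = attribute_selection_method(D, attribute_list)   # attribute that best discriminates D
    node = {"attr": A, "branches": {}}
    for value in {x[A] for x, _ in D}:            # one branch per outcome of the test on A
        Dv = [(x, label) for x, label in D if x[A] == value]
        remaining = [a for a in attribute_list if a != A]
        node["branches"][value] = build_tree(Dv, remaining, attribute_selection_method)
    return node
```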
Algorithm issues.
Random split: if the attribute at each node is chosen at random, the tree can grow huge. Such trees are hard to understand, and larger trees are typically less accurate than smaller trees.
Principal criterion: selecting an attribute to test at each node, i.e. choosing the most useful attribute for classifying examples. Information gain measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree.
What does information gain actually tell us? Which oval do you think can be described in a simpler way? Why?
Answer: the first one, because it is homogeneous (more pure). Information gain is based on entropy, a measure of the degree of disorganization in a system. If the sample is completely homogeneous, the entropy is 0; if the sample is equally divided, the entropy is 1.
Entropy: given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of S relative to this binary classification is
E(S) = -p(P) log2 p(P) - p(N) log2 p(N)
where p(P) and p(N) are the proportions of positive and negative samples in S. Entropy is generally represented as H(attribute).
Entropy example: suppose S (the sample space) has 25 examples, 15 positive and 10 negative [15+, 10-]. Then the entropy of S relative to this classification is
E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971
The entropy is 0 (zero) if the outcome is certain. The entropy is maximum if we have no knowledge of the system, i.e. every outcome is equally likely.
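A small sketch of the same calculation in Python (the function name and the counts-as-a-list interface are just one possible choice):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node, given its class counts, e.g. [15, 10]."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([15, 10]))   # ~0.971, the [15+, 10-] example above
print(entropy([25, 0]))    # 0.0: certain outcome
print(entropy([10, 10]))   # 1.0: equally divided, maximum uncertainty for 2 classes
```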
Steps to calculate entropy for a split: 1. Calculate the entropy of the parent node. 2. (i) Calculate the entropy of each individual node of the split; (ii) calculate the weighted average over all sub-nodes of the split.
Information Gain: information gain measures the expected reduction in entropy, or uncertainty:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v, i.e. Sv = {s ∈ S | A(s) = v}. The first term in the equation for Gain is just the entropy of the original collection S; the second term is the expected value of the entropy after S is partitioned using attribute A. So, in short: Gain = entropy(original collection) - expected/weighted entropy.
Example of information gain and entropy calculation:
1) Entropy of the parent node: H(10/20, 10/20) = -10/20 log(10/20) - 10/20 log(10/20) = 1
2)(i) Entropy of each node of the split: using the 'where' attribute, divide into 2 subsets:
H(home) = -6/12 log(6/12) - 6/12 log(6/12) = 1
H(away) = -4/8 log(4/8) - 4/8 log(4/8) = 1
(ii) Weighted/expected entropy after partitioning: 12/20 · H(home) + 8/20 · H(away) = 1. The expected entropy for the sample space is the sum, over the subsets, of the probability of each subset times its entropy.
Gain = 1 - expected/weighted entropy = 0!
Using the 'when' attribute, divide into 3 subsets:
H(5 pm) = -1/4 log(1/4) - 3/4 log(3/4) = 0.811
H(7 pm) = -9/12 log(9/12) - 3/12 log(3/12) = 0.811
H(9 pm) = -0/4 log(0/4) - 4/4 log(4/4) = 0
Expected entropy after partitioning: 4/20 · H(5 pm) + 12/20 · H(7 pm) + 4/20 · H(9 pm) = 0.65
Information gain = 1 - 0.65 = 0.35, which is higher than the gain for 'where'. Calculate the gain for all attributes in the same way; 'when' has the highest gain, so we select it as the root node.
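Building on the entropy() sketch above, a weighted-entropy/information-gain helper can reproduce the numbers from this example. The branch counts below are read off the fractions shown above (which class is positive does not affect the entropy values):

```python
def information_gain(parent_counts, branch_counts):
    """Gain = entropy(parent) - weighted (expected) entropy of the branches."""
    total = sum(parent_counts)
    expected = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - expected

# 'where': home = [6, 6], away = [4, 4]              -> gain 0.0
print(information_gain([10, 10], [[6, 6], [4, 4]]))
# 'when': 5 pm = [1, 3], 7 pm = [9, 3], 9 pm = [0, 4] -> gain ~0.35
print(information_gain([10, 10], [[1, 3], [9, 3], [0, 4]]))
```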
Another example (the weather data):

ID code  Outlook   Temperature  Humidity  Windy  Play
a        Sunny     Hot          High      False  No
b        Sunny     Hot          High      True   No
c        Overcast  Hot          High      False  Yes
d        Rainy     Mild         High      False  Yes
e        Rainy     Cool         Normal    False  Yes
f        Rainy     Cool         Normal    True   No
g        Overcast  Cool         Normal    True   Yes
h        Sunny     Mild         High      False  No
i        Sunny     Cool         Normal    False  Yes
j        Rainy     Mild         Normal    False  Yes
k        Sunny     Mild         Normal    True   Yes
l        Overcast  Mild         High      True   Yes
m        Overcast  Hot          Normal    False  Yes
n        Rainy     Mild         High      True   No
Building a decision tree: select an attribute to place at the root of the tree and make one branch for every possible value. Repeat the process recursively for each branch.
Entropy prior to partitioning: in the weather data example, there are 9 instances for which the decision to play is yes and 5 instances for which it is no. The entropy of the full set is therefore
E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy after splitting on Outlook:
sunny:    [2 yes, 3 no] → E = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
overcast: [4 yes, 0 no] → E = 0
rainy:    [3 yes, 2 no] → E = 0.971
Expected entropy = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693
Always remember entropy is measured in bits, i.e. the unit of measurement is bits.
Information gained by placing each of the 4 attributes at the root:
Gain(outlook) = 0.940 - 0.693 = 0.247
Gain(temperature) = 0.029
Gain(humidity) = 0.152
Gain(windy) = 0.048
Outlook has the highest gain, so select it as the root node.
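These numbers can be checked with the helpers sketched earlier; each branch is given as [yes, no] counts read off the weather table:

```python
parent = [9, 5]                                              # 9 yes, 5 no overall
print(information_gain(parent, [[2, 3], [4, 0], [3, 2]]))    # outlook      -> ~0.247
print(information_gain(parent, [[2, 2], [4, 2], [3, 1]]))    # temperature  -> ~0.029
print(information_gain(parent, [[3, 4], [6, 1]]))            # humidity     -> ~0.152
print(information_gain(parent, [[6, 2], [3, 3]]))            # windy        -> ~0.048
```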
Decision tree after step 1: the root node tests Outlook. The sunny branch contains [2 yes, 3 no], the overcast branch [4 yes] (a pure leaf: yes), and the rainy branch [3 yes, 2 no].
The recursive procedure for constructing a decision tree: the operation discussed above is applied to each branch recursively. For example, for the branch Outlook = sunny, we evaluate the information gained by each of the remaining 3 attributes:
Gain(Outlook=sunny; Temperature) = 0.971 - 0.4 = 0.571
Gain(Outlook=sunny; Humidity) = 0.971 - 0 = 0.971
Gain(Outlook=sunny; Windy) = 0.971 - 0.951 = 0.02
Similarly, we evaluate the information gained by each of the remaining 3 attributes for the branch Outlook = rainy:
Gain(Outlook=rainy; Temperature) = 0.971 - 0.951 = 0.02
Gain(Outlook=rainy; Humidity) = 0.971 - 0.951 = 0.02
Gain(Outlook=rainy; Windy) = 0.971 - 0 = 0.971
So Humidity is chosen under the sunny branch and Windy under the rainy branch.
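A quick check of these branch-level gains with the same helpers, using [yes, no] counts from the corresponding subsets of the weather table:

```python
sunny = [2, 3]                                               # Outlook = sunny subset
print(information_gain(sunny, [[0, 2], [1, 1], [1, 0]]))     # temperature -> ~0.571
print(information_gain(sunny, [[0, 3], [2, 0]]))             # humidity    -> ~0.971
rainy = [3, 2]                                               # Outlook = rainy subset
print(information_gain(rainy, [[3, 0], [0, 2]]))             # windy       -> ~0.971
```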
The tree generated further: root node Outlook. Sunny branch → Humidity (high → no, normal → yes). Overcast branch → yes. Rainy branch → Windy (false → yes, true → no).
When to stop? Stopping rule: every attribute has already been included along this path through the tree, or the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero). A branch with entropy greater than 0 requires further splitting. This is the ID3 algorithm that we have studied. Besides information gain, the Gini index or chi-square can also be used as the splitting measure. Decision trees can be converted to decision rules, e.g. If Outlook = rainy and Windy = false then Play = yes.
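For comparison with entropy, a small sketch of the Gini index (the measure used by CART); chi-square tests are the other alternative mentioned above:

```python
def gini(counts):
    """Gini impurity of a node from its class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([10, 10]))   # 0.5: maximum impurity for two classes
print(gini([4, 0]))     # 0.0: pure node
```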
Problems with decision trees: if multiple classes exist and the data size is small, we get replication and repetition. Can the same attribute be repeated across branches? Yes! A major drawback of decision trees is that they can end up creating a leaf node for every observation; if the tree is fully grown, it loses its generalization capability. This is overfitting.
How to handle overfitting? Set constraints on tree size, or use tree pruning (pre- or post-pruning). Pre-pruning: stop growing a branch when the information becomes unreliable. Post-pruning: take a fully grown decision tree and discard the unreliable parts.
Which options are available to set constraints on tree size? Minimum samples for a node split, minimum samples for a leaf node, maximum depth of the tree, maximum features to consider per split, and so on (see the sketch below).
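A minimal sketch of how these constraints look in scikit-learn, assuming that library is used; the parameter names map directly onto the options above, and the tiny X/y data is made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",       # information-gain-style splitting
    min_samples_split=4,       # minimum samples required to split a node
    min_samples_leaf=2,        # minimum samples required at a leaf node
    max_depth=3,               # maximum depth of the tree
    max_features=2,            # maximum features considered per split
)
X = [[0, 1, 0], [1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
y = [0, 1, 1, 0, 1, 0]
clf.fit(X, y)                  # constraints keep the tree from growing too large
```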
Difference between setting constraints and pruning? Setting constraints is a short-term (greedy) measure applied while the tree is being grown; pruning works from a long-term perspective, judging the tree after it has been built.
More about post-pruning: grow the full tree, then check it on validation data (the training data is split into training and validation sets). Remove the leaves or subtrees that lead to worse validation results. The two common operations are subtree replacement and subtree raising.
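scikit-learn's built-in post-pruning is cost-complexity pruning (the ccp_alpha parameter), which is a different scheme from the subtree replacement/raising described above, but the same grow-then-prune-against-validation-data workflow can be sketched with it (toy data, illustrative only):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data, purely illustrative.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 1]]
y = [0, 0, 1, 1, 1, 0, 0, 1]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate pruning levels for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the pruned tree that scores best on the held-out validation split.
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
```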