Decision Tree and Random Forest
Dr. Ammar Mohammed, Associate Professor of Computer Science, ISSR, Cairo University
PhD in CS (Uni. Koblenz-Landau, Germany)
Spring 2019
Contact: Ammar@cu.edu.eg, Drammarcu@gmail.com
Decision Tree
A decision tree is a simple classifier in the form of a hierarchical tree structure that performs supervised classification using a divide-and-conquer strategy. The root and internal nodes test attributes, branches correspond to the possible answers of those tests, and each leaf node indicates a class (category). Classification is performed by routing an example from the root node until it arrives at a leaf node.
Decision Tree: Example
A decision tree for determining whether to play tennis. The root tests Outlook (sunny / overcast / rain); the sunny branch tests Humidity (high -> No, normal -> Yes); the rain branch tests Windy (true -> No, false -> Yes); overcast leads directly to Yes. Equivalent rules:
(Outlook == overcast) -> Yes
(Outlook == rain) and (Windy == false) -> Yes
(Outlook == sunny) and (Humidity == normal) -> Yes
Decision Tree: Training Dataset
Training examples:
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Decision Tree: Training Dataset
Output: a decision tree for buys_computer:
age?
  <=30   -> student?        (no -> no, yes -> yes)
  31..40 -> yes
  >40    -> credit_rating?  (excellent -> no, fair -> yes)
Decision Tree: How to Build the Tree
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the best selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.
Example
Example: possible answer
One possible tree first splits on EyeColor (blue vs. brown), then on Married (yes vs. no) and HairLength (long vs. short), with Football and Netball as the leaf classes.
Example: another answer
A single split on Sex: male -> Football, female -> Netball.
The questions are:
I. What is the best tree?
II. Which attribute should be selected for splitting the branches?
How to Construct a Tree
Algorithm: BuildSubtree
Require: Node n, Data D
1: (n.split, DL, DR) = FindBestSplit(D)
2: if StoppingCriteria(DL) then
3:   n.left_prediction = FindPrediction(DL)
4: else
5:   BuildSubtree(n.left, DL)
6: if StoppingCriteria(DR) then
7:   n.right_prediction = FindPrediction(DR)
8: else
9:   BuildSubtree(n.right, DR)
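The pseudocode above uses a binary split; a minimal Python sketch of the same idea, using multi-way categorical splits and information gain as the split heuristic (the helper names mirror the pseudocode, not any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def find_best_split(rows, labels, features):
    """Pick the feature whose split gives the highest information gain."""
    base = entropy(labels)
    def gain(f):
        rem = 0.0
        for v in set(r[f] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[f] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return base - rem
    return max(features, key=gain)

def build_subtree(rows, labels, features):
    """Top-down recursive divide-and-conquer construction (BuildSubtree)."""
    if len(set(labels)) == 1:        # stopping criterion: node is pure
        return labels[0]
    if not features:                 # stopping criterion: no attributes left
        return Counter(labels).most_common(1)[0][0]   # majority vote
    f = find_best_split(rows, labels, features)
    node = {}
    for v in sorted(set(r[f] for r in rows)):
        sub = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
        srows, slabels = zip(*sub)
        node[(f, v)] = build_subtree(list(srows), list(slabels),
                                     [g for g in features if g != f])
    return node

# Demo on the football/netball example: a single split on Sex is found.
rows = [{'Sex': 'male'}, {'Sex': 'male'}, {'Sex': 'female'}, {'Sex': 'female'}]
labels = ['Football', 'Football', 'Netball', 'Netball']
print(build_subtree(rows, labels, ['Sex']))
```

Internal nodes are dictionaries keyed by (feature, value); leaves are class labels.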
Information, Entropy and Impurity
- Entropy is a measure of the disorder or unpredictability in a system.
- Given a binary (two-class) classification C and a set of examples S, the class distribution at any node can be written as (p0, p1), where p1 = 1 - p0, and the entropy of S is the sum of the information:
  E(S) = - p0 log2 p0 - p1 log2 p1
[Figure: three example class distributions at a node split on attribute A]
- A (10, 10) split: Entropy = 1, highest impurity, useless classifier.
- A (6, 4) split: Entropy = 0.97, high impurity, poor classifier.
- A (10, 0) split: Entropy = 0, impurity = 0, good classifier.
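The three node distributions above can be checked with a small helper (a sketch; the function name is ours):

```python
import math

def entropy2(p, n):
    """Entropy of a binary node holding p positive and n negative examples."""
    total = p + n
    e = 0.0
    for c in (p, n):
        if c:  # convention: 0 * log2(0) = 0
            e -= (c / total) * math.log2(c / total)
    return e

print(round(entropy2(10, 10), 2))  # (10, 10): 1.0  -> highest impurity
print(round(entropy2(6, 4), 2))    # (6, 4):   0.97 -> high impurity
print(round(entropy2(10, 0), 2))   # (10, 0):  0.0  -> pure node
```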
Information, Entropy and Impurity
In the general case, the target attribute can take on m different values (multi-way split), and the entropy of T (one feature) relative to this m-wise classification is given by
  E(T) = - sum_{i=1}^{m} p_i log2(p_i)
Entropy of two features, X and Y:
  X:  AI   Math  CS   AI   AI   CS   AI   Math
  Y:  Yes  No    Yes  No   No   Yes  Yes  No
E(Y) = - 1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
For each value v of feature X:
  P(X=AI) = 0.5, P(X=Math) = 0.25, P(X=CS) = 0.25
  E(Y|X=AI)   = - 2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
  E(Y|X=Math) = - 2/2 log2(2/2) - 0/2 log2(0/2) = 0 bit   (taking 0 log2 0 = 0)
  E(Y|X=CS)   = - 0/2 log2(0/2) - 2/2 log2(2/2) = 0 bit
E(Y, X) = P(X=AI) E(Y|X=AI) + P(X=Math) E(Y|X=Math) + P(X=CS) E(Y|X=CS)
        = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
This conditional entropy measures how much uncertainty about the class Y remains once the attribute X is known.
Information, Entropy and Impurity
  Entropy(Y, X) = sum_j P(X = v_j) E(Y | X = v_j)
Here the term E(Y | X = v_j) denotes the entropy of the partition of examples in which attribute X takes the value v_j.
Information gain measures how much a given attribute X tells us about the class Y. The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches):
  Gain(X) = Entropy(T) - Entropy(T, X)
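The two-feature example above can be verified directly (a sketch; the function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """E(T) = -sum p_i log2(p_i) over the class frequencies in labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """Entropy(Y, X) = sum_v P(X=v) * E(Y | X=v)."""
    n = len(xs)
    total = 0.0
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        total += len(sub) / n * entropy(sub)
    return total

# The toy data from the slide:
X = ['AI', 'Math', 'CS', 'AI', 'AI', 'CS', 'AI', 'Math']
Y = ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No']

print(entropy(Y))                              # 1.0 bit
print(conditional_entropy(X, Y))               # 0.5 bit
print(entropy(Y) - conditional_entropy(X, Y))  # Gain(X) = 0.5
```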
Attribute Selection Measure: Information Gain
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D (a data set) belongs to class C_i, estimated by |C_i,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
  Entropy(D) = - sum_{i=1}^{m} p_i log2(p_i)
Information needed (after using attribute A to split D into v partitions) to classify D:
  Entropy(D, A) = sum_{j=1}^{v} (|D_j| / |D|) Entropy(D_j)
Information gained by branching on attribute A:
  Gain(A) = Entropy(D) - Entropy(D, A)
Example:
Example: Answer
Entropy(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
Entropy(D, Age) = (5/14) E(2,3) + (4/14) E(4,0) + (5/14) E(3,2) = 0.694 bits
(where E(p, n) is the entropy of a partition with p yes and n no tuples)
Gain(age) = Entropy(D) - Entropy(D, age) = 0.940 - 0.694 = 0.246 bits
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
- Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
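These gains can be reproduced on the full buys_computer table (a sketch; small differences in the last digit come from rounding intermediate values on the slide):

```python
import math
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ('<=30',   'high',   'no',  'fair',      'no'),
    ('<=30',   'high',   'no',  'excellent', 'no'),
    ('31..40', 'high',   'no',  'fair',      'yes'),
    ('>40',    'medium', 'no',  'fair',      'yes'),
    ('>40',    'low',    'yes', 'fair',      'yes'),
    ('>40',    'low',    'yes', 'excellent', 'no'),
    ('31..40', 'low',    'yes', 'excellent', 'yes'),
    ('<=30',   'medium', 'no',  'fair',      'no'),
    ('<=30',   'low',    'yes', 'fair',      'yes'),
    ('>40',    'medium', 'yes', 'fair',      'yes'),
    ('<=30',   'medium', 'yes', 'excellent', 'yes'),
    ('31..40', 'medium', 'no',  'excellent', 'yes'),
    ('31..40', 'high',   'yes', 'fair',      'yes'),
    ('>40',    'medium', 'no',  'excellent', 'no'),
]
COLS = {'age': 0, 'income': 1, 'student': 2, 'credit_rating': 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = Entropy(D) - Entropy(D, A) for one categorical attribute."""
    i = COLS[attr]
    labels = [r[-1] for r in rows]
    rem = 0.0
    for v in set(r[i] for r in rows):
        sub = [r[-1] for r in rows if r[i] == v]
        rem += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - rem

for a in COLS:
    print(a, round(gain(a), 3))
# age is the winner (~0.247; income ~0.029, student ~0.152, credit_rating ~0.048)
```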
Attribute Selection: Information Gain The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age. The tuples are shown partitioned accordingly.
Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
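The midpoint search above can be sketched as follows (the data at the bottom is hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return the one minimising the expected information (weighted entropy)."""
    pairs = sorted(zip(values, labels))
    best, best_rem = None, float('inf')
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        mid = (a + b) / 2
        left = [l for v, l in pairs if v <= mid]    # D1: A <= split-point
        right = [l for v, l in pairs if v > mid]    # D2: A >  split-point
        rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if rem < best_rem:
            best, best_rem = mid, rem
    return best

# Hypothetical ages: the younger examples are 'no', the older ones 'yes'.
print(best_split_point([22, 25, 30, 35, 40, 45],
                       ['no', 'no', 'no', 'yes', 'yes', 'yes']))  # 32.5
```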
Gain Ratio for Attribute Selection (C4.5)
Information gain is biased towards attributes with a large number of values. The C4.5 decision tree algorithm uses the gain ratio to overcome this problem (a normalization of information gain). The gain ratio is the ratio of the information gain to the intrinsic information (split info):
  SplitInfo_A(D) = - sum_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. income splits D into 4 high, 6 medium, and 4 low tuples:
  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
  gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
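The split-info computation for income can be checked directly (a sketch; the function name is ours):

```python
import math

def split_info(counts):
    """SplitInfo_A(D) = -sum_j |Dj|/|D| * log2(|Dj|/|D|)."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts)

# income splits D into 4 'high', 6 'medium' and 4 'low' tuples:
si = split_info([4, 6, 4])
print(round(si, 3))          # 1.557
print(round(0.029 / si, 3))  # gain_ratio(income) ~ 0.019
```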
Gini Index (IBM IntelligentMiner)
If a data set D contains examples from n classes, the Gini index gini(D) is defined as
  gini(D) = 1 - sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on a feature A into two subsets D1 and D2, the Gini index gini_A(D) is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
Gini Index
Ex. D has 9 tuples with buys_computer = yes and 5 with no:
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} (7 yes, 3 no) and 4 tuples in D2: {high} (2 yes, 2 no):
  gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443
The splits {low, high}/{medium} and {medium, high}/{low} give 0.458 and 0.450, respectively, so {low, medium} (and {high}) minimizes the Gini index and is the best binary split on income.
Continuous-valued attributes require enumerating candidate split values (other tools, e.g., clustering, may be used to obtain them); the measure can also be applied to categorical attributes, as here.
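The Gini figures above can be reproduced from the class counts in the training table (a sketch; the function name is ours):

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2 over the class counts in D."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# D: 9 'yes' and 5 'no' tuples
print(round(gini([9, 5]), 3))  # 0.459

# Binary split on income: D1 = {low, medium} has 7 yes / 3 no,
# D2 = {high} has 2 yes / 2 no (counts taken from the training table).
g = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(g, 3))             # 0.443
```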
Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.
Decision Tree Over-fitting
Overfitting: an induced tree may overfit the training data. A sufficiently deep tree can perfectly fit any training data (zero bias, high variance), but too many branches, some of which may reflect anomalies due to noise or outliers, lead to poor accuracy on unseen samples. Two approaches:
1. Pre-pruning: stop growing the tree when further splitting the data does not yield an improvement.
2. Post-pruning: grow a full tree, then prune it by eliminating nodes.
Exercise
- Suppose that the probabilities of five events are P(1) = 0.5 and P(2) = P(3) = P(4) = P(5) = 0.125. Calculate the entropy.
- Three binary nodes, N1, N2, and N3, split examples into (0, 6), (1, 5), and (3, 3), respectively. For each node, calculate its entropy and Gini impurity.
- Build a decision tree that computes the logical AND function.
- On the side is a list of everything someone has done in the past 10 days. Which feature do you use as the starting (root) node? For this you need to compute the entropy and then find out which feature has the maximal information gain.
Bagging
Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function. For classification, a committee of trees each casts a vote for the predicted class.
Bagging
Create bootstrap samples from the training data (each sample drawn from the N examples with M features).
Bagging
Construct a decision tree for each bootstrap sample, then take the majority vote of the trees.
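The bootstrap-and-vote loop above can be sketched from scratch. As a stand-in base learner we use a hypothetical 1-nearest-neighbour rule (`train_1nn`) instead of a decision tree, and the data is made up:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) examples with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def bag_predict(models, x):
    """Each model in the committee casts one vote; the majority wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

def train_1nn(sample):
    """Hypothetical base learner: predict the label of the nearest value."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

rng = random.Random(0)
data = [(1, 'a'), (2, 'a'), (3, 'a'), (7, 'b'), (8, 'b'), (9, 'b')]
models = [train_1nn(bootstrap(data, rng)) for _ in range(25)]
print(bag_predict(models, 2), bag_predict(models, 8))
```

Each model sees a slightly different sample, so individual votes can differ, but the committee's majority vote is much more stable.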
Weakness of Bagging
- Bagging gives a distinct advantage in machine learning when it is used.
- Analyzing the details of many models, it was found that the trees in the bagger were too similar to each other.
- This calls for a way to make the trees dramatically more different.
Introducing Randomness
- The new idea was to introduce randomness not just into the training samples but also into the actual tree growing as well.
- Suppose that instead of always picking the best splitter we picked the splitter at random; this would guarantee that different trees would be quite dissimilar to each other.
Random Forests
Random forests are one of the most powerful, fully automated machine learning techniques. A random forest is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models.
Random Forest
Create bootstrap samples from the training data (N examples, M features).
Random Forest
At each node, when choosing the split feature, choose only among a random subset of m < M features.
Random Forest
Repeat for each tree in the forest, then take the majority vote.
Random Forest Algorithm
Given a training set S:
For i = 1 to k do:
  Build subset Si by sampling with replacement from S
  Learn tree Ti from Si:
    At each node, choose the best split from a random subset of F features
    Each tree is grown to the largest extent possible; no pruning
Make predictions according to the majority vote of the set of k trees.
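A compact from-scratch sketch of this algorithm for categorical features (not the lecture's reference implementation; names and the toy data are ours):

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features, m, rng):
    """Grow a tree fully (no pruning); at each node only a random
    subset of m features is considered for the split."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    candidates = rng.sample(features, min(m, len(features)))
    def gain(f):
        rem = 0.0
        for v in set(r[f] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[f] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - rem
    f = max(candidates, key=gain)
    node = {'feature': f, 'children': {},
            'default': Counter(labels).most_common(1)[0][0]}
    for v in set(r[f] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
        srows, slabels = zip(*sub)
        node['children'][v] = build_tree(list(srows), list(slabels),
                                         [g for g in features if g != f],
                                         m, rng)
    return node

def tree_predict(node, x):
    while isinstance(node, dict):
        node = node['children'].get(x[node['feature']], node['default'])
    return node

def random_forest(rows, labels, features, k, m, rng):
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 features, m, rng))
    return forest

def forest_predict(forest, x):
    """Majority vote of the k trees."""
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]

# Demo on a toy dataset where 'sex' determines the class:
rng = random.Random(0)
data = [{'sex': 'm', 'tall': 'y'}, {'sex': 'm', 'tall': 'n'},
        {'sex': 'f', 'tall': 'y'}, {'sex': 'f', 'tall': 'n'}] * 3
targets = ['football', 'football', 'netball', 'netball'] * 3
forest = random_forest(data, targets, ['sex', 'tall'], k=25, m=1, rng=rng)
print(forest_predict(forest, {'sex': 'm', 'tall': 'y'}))
```

With m=1, individual trees sometimes split on the uninformative 'tall' feature first, but the majority vote across the forest still recovers the correct class.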
Random Forest vs Bagging
Pros:
- More robust.
- Faster to train (no reweighting; each split considers only a small subset of the data and features).
Cons:
- The feature selection process is not explicit.
- Feature fusion is also less obvious.
- Weaker performance on small training data sets.