Decision Tree And Random Forest


Decision Tree and Random Forest
Dr. Ammar Mohammed, Associate Professor of Computer Science, ISSR, Cairo University
PhD in CS (Uni. Koblenz-Landau, Germany)
Spring 2019
Contact: Ammar@cu.edu.eg, Drammarcu@gmail.com

Decision Tree
A decision tree is a simple classifier in the form of a hierarchical tree structure that performs supervised classification using a divide-and-conquer strategy.
- The root and internal nodes test an attribute.
- Branches correspond to the possible answers (outcomes) of the test.
- A leaf node indicates the class (category).
Classification is performed by routing an example from the root node until it arrives at a leaf node.

Decision Tree: Example
An example decision tree for determining whether to play tennis:
- Outlook = sunny -> test Humidity: high -> No, normal -> Yes
- Outlook = overcast -> Yes
- Outlook = rain -> test Windy: true -> No, false -> Yes
The same tree expressed as rules:
- (Outlook == overcast) -> yes
- (Outlook == rain) and (Windy == false) -> yes
- (Outlook == sunny) and (Humidity == normal) -> yes

Decision Tree: Training Dataset
Training examples:

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no

Decision Tree: Output
A decision tree for buys_computer learned from the training data:
- age <= 30 -> test student: no -> no, yes -> yes
- age 31..40 -> yes
- age > 40 -> test credit_rating: excellent -> no, fair -> yes

Decision Tree: How to Build the Tree
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the best selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.

Example

Example: A Possible Answer
One possible tree splits first on EyeColor (blue / brown), then on Married (no / yes) and HairLength (long / short), before reaching the leaves football or Netball.

Example: Another Answer
A much simpler tree splits only on Sex: male -> football, female -> Netball.
The questions are:
I. What is the best tree?
II. Which attribute should be selected for splitting the branches?

How to Construct a Tree
Algorithm: BuildSubtree
Required: node n, data D
1: (n.split, DL, DR) = FindBestSplit(D)
2: if StoppingCriteria(DL) then
3:     n.left_prediction = FindPrediction(DL)
4: else
5:     BuildSubtree(n.left, DL)
6: if StoppingCriteria(DR) then
7:     n.right_prediction = FindPrediction(DR)
8: else
9:     BuildSubtree(n.right, DR)
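To make the recursion concrete, here is a minimal Python sketch of the same top-down idea for categorical attributes, using the entropy-based information gain defined on the following slides as the splitting heuristic. This is my own illustration, not the lecturer's code; all names (build_subtree, info_gain, and the toy attribute names in the comments) are assumptions.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Information gain of splitting (rows, labels) on attribute attr.
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def build_subtree(rows, labels, attrs):
    # Recursive top-down induction; returns a nested dict or a class label.
    if len(set(labels)) == 1:                 # all samples in one class -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_subtree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attrs if a != best],
        )
    return node

# Example (hypothetical data): build_subtree(rows, labels, ["outlook", "windy"])
# returns a nested dict such as {"outlook": {"overcast": "yes", "sunny": {...}}}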

Information, Entropy and Impurity
- Entropy is a measure of the disorder or unpredictability in a system.
- Given a binary (two-class) classification C and a set of examples S, the class distribution at any node can be written as (p0, p1), where p1 = 1 - p0, and the entropy of S is:
  E(S) = -p0 log2(p0) - p1 log2(p1)
- (Figure: three example splits on an attribute A.) A node with an even class distribution, e.g. (5, 5), has Entropy = 1: highest impurity, a useless classifier. A slightly uneven distribution, e.g. (4, 6), has Entropy = 0.97: still high impurity, a poor classifier. A pure node, e.g. (4, 0), has Entropy = 0 and impurity 0: a good classifier.

Information, Entropy and Impurity
In the general case, the target attribute can take on m different values (a multi-way split), and the entropy of T (one feature) relative to this m-wise classification is given by:
  E(T) = -Σ_{i=1..m} p_i log2(p_i)

Entropy of two features. Consider a class Y and a feature X:

X      Y
AI     Yes
Math   No
CS     Yes
AI     No
AI     No
CS     Yes
AI     Yes
Math   No

E(Y) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit

For each value v of feature X:
P(X=AI) = 0.5, P(X=Math) = 0.25, P(X=CS) = 0.25
E(Y | X=AI)   = -2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
E(Y | X=Math) = -2/2 log2(2/2) - 0/2 log2(0/2) = 0 bit
E(Y | X=CS)   = -0/2 log2(0/2) - 2/2 log2(2/2) = 0 bit

E(Y, X) = P(X=AI) E(Y|X=AI) + P(X=Math) E(Y|X=Math) + P(X=CS) E(Y|X=CS)

v      P(X=v)   E(Y|X=v)
AI     0.5      1
Math   0.25     0
CS     0.25     0

E(Y|X) = 1*0.5 + 0*0.25 + 0*0.25 = 0.5

This measures how much a given attribute X tells us about the class Y.
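To make the arithmetic concrete, here is a short Python check of this toy example. The code and names are my own illustration, not from the slides.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Toy data from the slide: feature X in {AI, Math, CS}, class Y in {Yes, No}
X = ["AI", "Math", "CS", "AI", "AI", "CS", "AI", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"]

print(entropy(Y))                                   # E(Y)   = 1.0 bit

# Conditional entropy E(Y|X) = sum over v of P(X=v) * E(Y | X=v)
cond = sum(
    (X.count(v) / len(X)) * entropy([y for x, y in zip(X, Y) if x == v])
    for v in set(X)
)
print(cond)                                         # E(Y|X) = 0.5 bit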

Information, Entropy and Impurity
  Entropy(Y, X) = Σ_j P(X = v_j) E(Y | X = v_j)
The term E(Y | X = v_j) denotes the entropy of the partition of the examples with X = v_j, so Entropy(Y, X) is the weighted average over all possible partitions of attribute X.
Information gain measures how much a given attribute X tells us about the class Y:
- The information gain is based on the decrease in entropy after a dataset is split on an attribute.
- Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
  Gain(X) = Entropy(T) - Entropy(T, X)

Attribute Selection Measure: Information Gain
- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D (a data set) belongs to class C_i, estimated by |C_{i,D}| / |D|.
- Expected information (entropy) needed to classify a tuple in D:
  Entropy(D) = -Σ_{i=1..m} p_i log2(p_i)
- Information needed (after using attribute A to split D into v partitions) to classify D:
  Entropy(D, A) = Σ_{j=1..v} (|D_j| / |D|) Entropy(D_j)
- Information gained by branching on attribute A:
  Gain(A) = Entropy(D) - Entropy(D, A)

Example:

Example: Answer
Entropy(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
Entropy(D, age) = (5/14) E(2,3) + (4/14) E(4,0) + (5/14) E(3,2) = 0.694 bits
where E(p, n) is the entropy of a partition containing p "yes" and n "no" tuples.
Gain(age) = Entropy(D) - Entropy(D, age) = 0.940 - 0.694 = 0.246 bits
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, Gain(credit_rating) = 0.048 bits.
- Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
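For readers who want to reproduce these numbers, here is a self-contained Python sketch (my own code, not from the lecture) that recomputes the gains on the buys_computer data:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in split.values())

cols = ["age", "income", "student", "credit_rating"]
data = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]
rows = [dict(zip(cols, r[:4])) for r in data]
labels = [r[4] for r in data]

for a in cols:
    print(a, round(gain(rows, labels, a), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048

The third decimal differs slightly from the slide's 0.246 and 0.151 because the slide rounds the intermediate entropies before subtracting.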

Attribute Selection: Information Gain The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age. The tuples are shown partitioned accordingly.

Information Gain for Continuous-Valued Attributes
Let A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
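A minimal Python sketch of this midpoint search, assuming a numeric attribute and binary classes. The function name best_split_point and the small ages/buys dataset are hypothetical, for illustration only.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # Return (split_point, expected_info) minimizing the information
    # needed after the binary split value <= t vs. value > t.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no midpoint between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best[1]:
            best = (t, info)
    return best

ages = [23, 25, 31, 35, 40, 45, 52, 60]
buys = ["no", "no", "yes", "yes", "yes", "yes", "no", "no"]
print(best_split_point(ages, buys))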

Gain Ratio for Attribute Selection (C4.5)
- Information gain is biased towards attributes with a large number of values.
- The C4.5 decision tree algorithm uses the gain ratio to overcome this problem (a normalization of information gain).
- The gain ratio is the ratio of the information gain to the intrinsic information (split info):
  SplitInfo_A(D) = -Σ_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Ex.: income splits D into partitions of sizes 4, 6 and 4, so
  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
  gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute.
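A small sketch (my own code) of the split-info and gain-ratio computation for income, using the partition sizes 4, 6 and 4 from the training data and the Gain(income) = 0.029 computed earlier:

import math

def split_info(partition_sizes):
    # Intrinsic information of a split, given the sizes of its partitions.
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])          # income: 4 high, 6 medium, 4 low
print(round(si, 3))                 # SplitInfo_income(D) ~ 1.557
print(round(0.029 / si, 3))         # gain ratio for income ~ 0.019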

Gini Index (IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - Σ_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in D.
- If a data set D is split on a feature A into two subsets D1 and D2, the gini index gini_A(D) is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
- Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
- The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).

Gini Index: Example
- D has 9 tuples with buys_computer = yes and 5 with buys_computer = no:
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
  gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443
- The other binary splits on income give 0.458 ({low, high} vs {medium}) and 0.450 ({medium, high} vs {low}), so the split on {low, medium} is the best since it is the lowest.
- All attributes are assumed continuous-valued.
- We may need other tools, e.g., clustering, to get the possible split values.
- The measure can be modified for categorical attributes.
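A quick Python check of these Gini values (my own illustration), using the per-income class counts taken from the training data (low: 3 yes / 1 no, medium: 4 yes / 2 no, high: 2 yes / 2 no):

def gini(counts):
    # Gini index from a list of class counts, e.g. [yes_count, no_count].
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

d1 = [3 + 4, 1 + 2]      # D1 = {low, medium}: 7 yes, 3 no
d2 = [2, 2]              # D2 = {high}:        2 yes, 2 no
n1, n2 = sum(d1), sum(d2)
g = n1 / (n1 + n2) * gini(d1) + n2 / (n1 + n2) * gini(d2)

print(round(gini([9, 5]), 3))   # gini(D)                      ~0.459
print(round(g, 3))              # gini_income∈{low,medium}(D)  ~0.443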

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.

Decision Tree Over-fitting
Overfitting: an induced tree may overfit the training data.
- You can fit any training data perfectly: zero bias, high variance.
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy on unseen samples.
Two approaches (see the sketch below):
1. Stop growing the tree when further splitting the data does not yield an improvement.
2. Grow a full tree, then prune the tree by eliminating nodes.
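As an aside, both strategies are available in common libraries. The lecture does not mention scikit-learn; the snippet below is only a hedged sketch of how the two approaches might look with that library's decision tree implementation.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Stop growing early (pre-pruning) via depth / leaf-size limits
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# 2. Grow fully, then prune (post-pruning) with cost-complexity pruning
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))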

Exercise
- Suppose that the probabilities of five events are P(1) = 0.5 and P(2) = P(3) = P(4) = P(5) = 0.125. Calculate the entropy.
- Three binary nodes, N1, N2, and N3, split examples into (0, 6), (1, 5), and (3, 3), respectively. For each node, calculate its entropy and Gini impurity.
- Build a decision tree that computes the logical AND function.
- On the side is a list of everything someone has done in the past 10 days. Which feature do you use as the starting (root) node? For this you need to compute the entropy and then find out which feature has the maximal information gain.

Bagging
Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function. For classification, a committee of trees each cast a vote for the predicted class.

Bagging
Create bootstrap samples from the training data (each a random sample, with replacement, of the N examples over M features).

Bagging
Construct a decision tree for each bootstrap sample, then take the majority vote of the trees' predictions (a code sketch follows below).
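A minimal sketch of this procedure (bootstrap samples, one full tree per sample, majority vote), assuming scikit-learn trees as the base learners; the code and names are my own, not from the lecture.

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Train one full tree per bootstrap sample (N examples drawn with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    # Each tree casts a vote; return the majority class.
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]), y[0])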

Weakness of Bagging
- Bagging gives a distinct advance in machine learning when it is used.
- However, analyzing the details of many models, it was found that the trees in the bagger were too similar to each other.
- This calls for a way to make the trees dramatically more different.

Introducing Randomness
- The new idea was to introduce randomness not just into the training samples but also into the actual tree growing.
- Suppose that instead of always picking the best splitter we picked the splitter at random.
- This would guarantee that different trees would be quite dissimilar to each other.

Random Forests
Random forests are one of the most powerful, fully automated machine learning techniques. A random forest is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models. (Figure: training data with N examples and M features.)

Random Forest
Create bootstrap samples from the training data (N examples, M features), as in bagging.

Random Forest
At each node, when choosing the split feature, choose only among a random subset of m < M features.

Random Forest
Grow a tree on each bootstrap sample and take the majority vote of the trees' predictions.

Random Forest Algorithm
Given a training set S:
For i = 1 to k do:
    Build subset Si by sampling with replacement from S
    Learn tree Ti from Si
        At each node: choose the best split from a random subset of F features
        Each tree grows to the largest extent possible; no pruning
Make predictions according to the majority vote of the set of k trees (see the sketch below).
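This recipe (k bootstrap samples, a random subset of features tried at each split, fully grown trees, majority vote) is what off-the-shelf random forest implementations provide. The lecture does not name a library; the following is only a sketch using scikit-learn's RandomForestClassifier as one such implementation.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,      # k trees
    max_features="sqrt",   # F = sqrt(M) random features considered per split
    random_state=0,        # trees are grown fully by default (no pruning)
)
print(cross_val_score(forest, X, y, cv=5).mean())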

Random Forest vs. Bagging
Pros:
- It is more robust.
- It is faster to train (no reweighting; each split considers only a small subset of the data and features).
Cons:
- The feature selection process is not explicit.
- Feature fusion is also less obvious.
- It has weaker performance on small training data.