Decision Trees


Decision Trees. A supervised approach, used for classification (categorical target values) or regression (continuous target values). Decision trees are learned from class-labeled training tuples. The model has a flowchart-like structure: each internal node tests an attribute, each branch corresponds to an outcome of the test, and each leaf (terminal) node holds a class label.

Example: a decision tree for deciding whether a student will attend the lecture or not, with internal nodes testing College (PQR / not PQR), Teacher (Prof. ABC / other) and Industry (yes / no), and leaves labelled Yes or No. [Tree diagram not reproduced in this transcript.]

How to classify with decision trees? For a given tuple X, trace a path from the root by testing X's attribute values at each internal node until a leaf is reached; the leaf's class label is the prediction. Decision trees can also be converted into classification rules.
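As a rough illustration of this tracing procedure (not from the slides), the sketch below stores a tree as a nested Python dictionary and classifies a tuple by following its attribute values from the root to a leaf. The representation is my own choice; the example tree happens to match the weather-data tree constructed later in these slides.

```python
def classify(tree, x):
    """Follow the attribute values of tuple x from the root of `tree` to a leaf label."""
    # Leaves are stored as plain class labels (strings); internal nodes are dicts.
    while isinstance(tree, dict):
        attribute = tree["attribute"]        # attribute tested at this node
        value = x[attribute]                 # outcome of the test for tuple x
        tree = tree["branches"][value]       # follow the matching branch
    return tree                              # class label at the leaf

# Illustrative tree (same shape as the weather-data tree built later in the slides).
example_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "windy",
                     "branches": {"false": "yes", "true": "no"}},
    },
}

print(classify(example_tree, {"outlook": "rainy", "windy": "false"}))  # -> yes
```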

Available information (a toy cricket example):

Where  When   Sachin opening  Dhoni wicketkeeper  Against     Outcome
Home   5 pm   Yes             Yes                 Australia   Lost
Away   7 pm   No              Yes                 Sri Lanka   Won
Home   9 pm   Yes             Yes                 Australia   Won

What we know: a new match (Away, 4 pm, No, No, Australia) with an unknown outcome. What we want: classify this new example, i.e. generalize the rules learned from past examples to new ones.

Why use decision trees? They do not require domain knowledge, can handle multi-dimensional data, have a representation that is simple and easy for users to understand, are fast with good classification accuracy, are robust to outliers, and are non-parametric (they make no assumptions about the classifier structure).

Decision tree algorithms. ID3 (Iterative Dichotomiser) was developed in the early 1980s; later, C4.5 and CART were presented. ID3 and CART (Classification and Regression Trees) follow the same basic approach.

Basic decision tree algorithm: parameters. Algo(D, attribute_list, attribute_selection_method), where D is the data partition (initially the complete training set), attribute_list is the list of candidate attributes, and attribute_selection_method specifies the heuristic procedure for selecting the attribute that best discriminates the given tuples.
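A minimal sketch of this generic top-down procedure, assuming categorical attributes; the data representation and function names are my own, since the slides only name the parameters.

```python
from collections import Counter

def build_tree(D, attribute_list, attribute_selection_method):
    """Generic top-down decision tree induction (sketch).

    D: list of (attribute_dict, class_label) training tuples.
    attribute_list: candidate attributes still available for splitting.
    attribute_selection_method: function (D, attribute_list) -> best attribute.
    """
    labels = [label for _, label in D]
    # Stop if all tuples agree on the class, or no attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]   # majority class

    best = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    branches = {}
    for value in {x[best] for x, _ in D}:
        subset = [(x, y) for x, y in D if x[best] == value]
        branches[value] = build_tree(subset, remaining, attribute_selection_method)
    return {"attribute": best, "branches": branches}
```

With information gain as the attribute_selection_method, this skeleton is essentially what ID3 (discussed below) instantiates.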

Algorithm issues.

Random splits. If the splitting attribute is chosen at random, the tree can grow huge; such trees are hard to understand, and larger trees are typically less accurate than smaller trees.

Principal criterion: selection of an attribute to test at each node, i.e. choosing the attribute most useful for classifying the examples. Information gain measures how well a given attribute separates the training examples according to their target classification; this measure is used to select among the candidate attributes at each step while growing the tree.

What does information gain actually tell us? Which of the two ovals (figure not reproduced) do you think can be described in a simpler way, and why?

Answer: the first one, because it is homogeneous (more pure). Information gain is based on a measure of the degree of disorganization in a system, called entropy. If the sample is completely homogeneous, the entropy is 0; if the sample is equally divided between the two classes, the entropy is 1.

Entropy. Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of S relative to this binary classification is E(S) = -p(+) log2 p(+) - p(-) log2 p(-), where p(+) is the proportion of positive examples in S and p(-) the proportion of negative examples. It is generally denoted H (e.g. H of the attribute being considered).

Entropy example. Suppose S (the sample space) has 25 examples, 15 positive and 10 negative [15+, 10-]. The entropy of S relative to this classification is E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971. The entropy is 0 (zero) if the outcome is certain; it is maximal if we have no knowledge of the system, i.e. every outcome is equally likely.
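A small sketch of this calculation in Python (the function name is my own):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([15, 10]))  # the [15+, 10-] set above: ~0.971
print(entropy([10, 10]))  # equally divided: 1.0
print(entropy([25, 0]))   # completely homogeneous: 0.0
```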

Steps to calculate the entropy for a split: 1. Calculate the entropy of the parent node. 2. (i) Calculate the entropy of each individual child node of the split, and (ii) calculate the weighted average of the entropies of all sub-nodes in the split.

Information Gain. Information gain measures the expected reduction in entropy (uncertainty): Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv), where Values(A) is the set of all possible values of attribute A and Sv is the subset of S for which attribute A has value v, i.e. Sv = {s in S | A(s) = v}. The first term is just the entropy of the original collection S; the second term is the expected entropy after S is partitioned using attribute A. In short: Gain = entropy(original collection) - expected (weighted) entropy.
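A sketch of this formula in Python; the data representation (a list of (attribute_dict, label) tuples) is my own choice, not from the slides.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(D, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [y for _, y in D]
    gain = entropy(labels)
    for value in {x[attribute] for x, _ in D}:
        subset = [y for x, y in D if x[attribute] == value]
        gain -= (len(subset) / len(D)) * entropy(subset)
    return gain
```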

Example of information gain and entropy calculation (assume the full cricket dataset has 20 matches, 10 won and 10 lost). 1) Entropy of the parent node: H(10/20, 10/20) = -(10/20) log2(10/20) - (10/20) log2(10/20) = 1. 2)(i) Entropy of each node of the split: using the 'where' attribute, divide into 2 subsets: H(home) = -(6/12) log2(6/12) - (6/12) log2(6/12) = 1, and H(away) = -(4/8) log2(4/8) - (4/8) log2(4/8) = 1. (ii) Weighted (expected) entropy after partitioning: (12/20) * H(home) + (8/20) * H(away) = 1. The expected entropy for the split is the sum, over the subsets, of the probability of falling in each subset times that subset's entropy. Gain = 1 - expected (weighted) entropy = 0.

Using the 'when' attribute, divide into 3 subsets. Entropy of 5 pm: H(5 pm) = -(1/4) log2(1/4) - (3/4) log2(3/4). Entropy of 7 pm: H(7 pm) = -(9/12) log2(9/12) - (3/12) log2(3/12). Entropy of 9 pm: H(9 pm) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0. Expected entropy after partitioning: (4/20) * H(5 pm) + (12/20) * H(7 pm) + (4/20) * H(9 pm) ≈ 0.65. Information gain: 1 - 0.65 = 0.35, which is higher than the gain for 'where'. After calculating the gain for all attributes, 'when' has the highest gain, so we select 'when' as the root node.
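A quick numeric check of these two splits; the counts come from the two slides above, the helper function is my own.

```python
from math import log2

def H(p):  # entropy of a two-class node with class proportions p and 1 - p
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

parent = H(10 / 20)                                       # 1.0

# 'where' split: home has 6 of 12 in one class, away has 4 of 8
where = (12 / 20) * H(6 / 12) + (8 / 20) * H(4 / 8)
# 'when' split: 5 pm has 1 of 4, 7 pm has 9 of 12, 9 pm has 0 of 4
when = (4 / 20) * H(1 / 4) + (12 / 20) * H(9 / 12) + (4 / 20) * H(0 / 4)

print(parent - where)   # gain for 'where': 0.0
print(parent - when)    # gain for 'when': ~0.35
```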

Another example: the weather data.

ID code  Outlook   Temperature  Humidity  Windy  Play
a        Sunny     Hot          High      False  No
b        Sunny     Hot          High      True   No
c        Overcast  Hot          High      False  Yes
d        Rainy     Mild         High      False  Yes
e        Rainy     Cool         Normal    False  Yes
f        Rainy     Cool         Normal    True   No
g        Overcast  Cool         Normal    True   Yes
h        Sunny     Mild         High      False  No
i        Sunny     Cool         Normal    False  Yes
j        Rainy     Mild         Normal    False  Yes
k        Sunny     Mild         Normal    True   Yes
l        Overcast  Mild         High      True   Yes
m        Overcast  Hot          Normal    False  Yes
n        Rainy     Mild         High      True   No

Building a decision tree. Select an attribute to place at the root of the decision tree and make one branch for every possible value. Repeat the process recursively for each branch.

Entropy prior to partitioning. In the weather data example there are 9 instances for which the decision to play is yes and 5 instances for which it is no. The entropy before partitioning (the information gained by knowing the result of the decision) is -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940.

Entropy for outlook. The outlook attribute splits the data into sunny (2 yes, 3 no), overcast (4 yes, 0 no) and rainy (3 yes, 2 no). The expected entropy after the split is (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 ≈ 0.693, where 0.971 = -(2/5) log2(2/5) - (3/5) log2(3/5). Always remember that entropy is measured in bits, i.e. the unit of measurement is bits.

Information gained by placing each of the 4 attributes at the root: Gain(outlook) = 0.940 - 0.693 = 0.247, Gain(temperature) = 0.029, Gain(humidity) = 0.152, Gain(windy) = 0.048. Outlook has the highest gain, so select it.
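A sketch that reproduces these numbers on the weather data above; the encoding of the table as Python tuples and the function names are my own.

```python
from math import log2

# Weather data from the table above: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "false", "no"),     ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def gain(rows, col):
    labels = [r[-1] for r in rows]
    g = entropy(labels)
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g

for col, name in enumerate(attributes):
    print(name, round(gain(data, col), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```

Running the same gain function on just the sunny or rainy rows reproduces the branch-level gains quoted on the later slides.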

Decision tree, step 1: the root tests Outlook. The sunny branch contains 2 yes / 3 no, the overcast branch 4 yes / 0 no, and the rainy branch 3 yes / 2 no.

The recursive procedure for constructing a decision tree. The operation discussed above is applied to each branch recursively to construct the decision tree. For example, for the branch Outlook = sunny, we evaluate the information gained by applying each of the remaining 3 attributes: Gain(Outlook=sunny; Temperature) = 0.971 - 0.4 = 0.571, Gain(Outlook=sunny; Humidity) = 0.971 - 0 = 0.971, Gain(Outlook=sunny; Windy) = 0.971 - 0.951 = 0.02.

Similarly, we evaluate the information gained by applying each of the remaining 3 attributes for the branch Outlook = rainy: Gain(Outlook=rainy; Temperature) = 0.971 - 0.951 = 0.02, Gain(Outlook=rainy; Humidity) = 0.971 - 0.951 = 0.02, Gain(Outlook=rainy; Windy) = 0.971 - 0 = 0.971.

The tree generated so far: the root tests Outlook. For sunny, test Humidity (high -> no, normal -> yes); for overcast, predict yes; for rainy, test Windy (false -> yes, true -> no).

When to stop? Stopping rule: either every attribute has already been included along this path through the tree, or the training examples associated with the leaf node all have the same target attribute value (i.e., their entropy is zero). A branch with entropy greater than 0 requires further splitting. What we have studied is the ID3 algorithm. Other than information gain, the Gini index or chi-square can also be used as the splitting criterion. Decision trees can be converted to decision rules, e.g. IF Outlook = rainy AND Windy = false THEN Play = yes.
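As a rough sketch (not from the slides), the following walks a nested-dictionary tree like the one used earlier and prints one IF ... THEN rule per root-to-leaf path.

```python
def tree_to_rules(tree, conditions=()):
    """Print one IF ... THEN rule for every root-to-leaf path in the tree."""
    if not isinstance(tree, dict):                 # leaf: emit the accumulated rule
        print("IF " + " AND ".join(conditions) + " THEN play = " + tree)
        return
    for value, subtree in tree["branches"].items():
        tree_to_rules(subtree, conditions + (f"{tree['attribute']} = {value}",))

# The weather-data tree built above.
weather_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "windy",
                     "branches": {"false": "yes", "true": "no"}},
    },
}
tree_to_rules(weather_tree)
# e.g. IF outlook = sunny AND humidity = high THEN play = no, and so on, one rule per leaf.
```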

Problems with decision trees. They struggle when many classes exist and the data size is small. Replication and repetition: can the same attribute be repeated across different branches? Yes, and this is a major drawback of decision trees. A tree can end up creating a leaf node for every observation, so if the tree is fully grown it loses its generalization capability: overfitting.

How to handle overfitting? Set constraints on tree size, or use tree pruning (pre- or post-pruning). Pre-pruning: stop growing a branch when the information becomes unreliable. Post-pruning: take the fully grown decision tree and discard the unreliable parts.

Which options are available to set constraints on tree size? Minimum samples for a node split, minimum samples for a leaf node, maximum depth of the tree, maximum number of features to consider for a split, and so on.
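As an illustration only (scikit-learn is not mentioned in the slides), these size constraints map directly onto parameters of sklearn.tree.DecisionTreeClassifier; the toy numeric data below is made up.

```python
from sklearn.tree import DecisionTreeClassifier

# Constraints on tree size expressed as scikit-learn parameters.
clf = DecisionTreeClassifier(
    criterion="entropy",      # split on information gain, as in ID3/C4.5
    min_samples_split=4,      # minimum samples required to split a node
    min_samples_leaf=2,       # minimum samples required at a leaf node
    max_depth=3,              # maximum depth of the tree
    max_features=2,           # maximum features considered for each split
)

# Made-up numeric training data: 2 features, binary labels.
X = [[0, 1], [1, 1], [1, 0], [0, 0], [2, 1], [2, 0], [1, 2], [0, 2]]
y = [0, 0, 1, 1, 1, 1, 0, 0]

clf.fit(X, y)
print(clf.predict([[2, 2]]))
```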

Difference between setting constraints and pruning? Setting constraints is a short-term (greedy) measure applied while the tree is being grown; pruning takes a longer-term perspective, judging parts of the tree after it has been grown.

More about post-pruning. Grow the full tree, then check it on validation data (the training data is split into a training part and a validation part). Remove the leaves or subtrees that lead to worse results on the validation data. Two common operations are subtree replacement and subtree raising.
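C4.5-style subtree replacement and raising are not available in scikit-learn; as a loosely related illustration of post-pruning (my choice, not from the slides), the sketch below uses cost-complexity pruning and picks the pruning strength on a held-out validation split, with the Iris data standing in for a real training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the training data into a training part and a validation part.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths (alphas) derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the pruned tree that scores best on the validation data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda clf: clf.score(X_val, y_val),
)
print(best.get_depth(), best.score(X_val, y_val))
```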

To summarize, why use decision trees? They do not require domain knowledge, can handle multi-dimensional data, have a representation that is simple and easy for users to understand, and are fast with good classification accuracy.

References. Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Elsevier, 2011. Mitchell, Tom M. Machine Learning. WCB/McGraw-Hill, 1997. www.tutorialspot.com/datamining/dm_classification_prediction http://www.ccs.neu.edu/home/mirek/classes/2011-s-CS6220/Slides/Lecture2-ClassificationPrediction-Large.pdf