Decision Trees. Rodrigo Fernandes de Mello, Invited Professor at Télécom ParisTech, Associate Professor at Universidade de São Paulo, ICMC, Brazil. http://www.icmc.usp.br/~mello mello@icmc.usp.br

Decision Trees. Decision trees organize attributes as internal nodes and answers (class labels) as leaves. The importance of each attribute is related to its height in the tree. They are intended for discrete data; if the data are continuous, one needs to discretize them first. Common applications: health diagnosis systems and bank credit analysis.

Decision Trees. For example, consider the problem of playing some sport: let us classify whether a day is good to play. The example tree tests Outlook at the root: Sunny leads to a test on Moisture (High -> No, Normal -> Yes), Cloudy leads directly to Yes, and Rainy leads to a test on Wind (Strong -> No, Weak -> Yes). For example, given the query instance <Outlook=Sunny, Temperature=Hot, Moisture=High>, the output is No.
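
To make the classification step concrete, here is a minimal Python sketch (my own, not the lecture's code) that represents the slide's tree as nested dictionaries and walks it for the query instance; the names TREE and classify are illustrative assumptions.

```python
# A minimal sketch (not from the lecture) of the slide's example tree as nested
# Python dictionaries: internal nodes map an attribute name to its branches,
# and leaves are plain class labels ("Yes"/"No").
TREE = {
    "Outlook": {
        "Sunny":  {"Moisture": {"High": "No", "Normal": "Yes"}},
        "Cloudy": "Yes",
        "Rainy":  {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Walk the tree, testing one attribute per internal node, until a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

# The query instance from the slide; Temperature is ignored because the tree never tests it.
query = {"Outlook": "Sunny", "Temperature": "Hot", "Moisture": "High"}
print(classify(TREE, query))  # -> "No"
```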

Decision Trees. After building up the tree, one can organize the rules it employs for a subsequent classification stage. For the same tree (Outlook at the root, Sunny -> Moisture, Cloudy -> Yes, Rainy -> Wind), the rules for the Yes class are: <Outlook=Sunny AND Moisture=Normal> OR <Outlook=Cloudy> OR <Outlook=Rainy AND Wind=Weak>.
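
Under the same assumed nested-dictionary representation, a short sketch (again my own, not the lecture's code) can enumerate the root-to-leaf paths that end in Yes, reproducing the rules listed on the slide.

```python
# A minimal sketch (not from the lecture) that extracts the rules for the "Yes"
# class from the same nested-dictionary tree used above.
TREE = {
    "Outlook": {
        "Sunny":  {"Moisture": {"High": "No", "Normal": "Yes"}},
        "Cloudy": "Yes",
        "Rainy":  {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def rules_for(tree, label, conditions=()):
    """Collect every root-to-leaf path that ends in the given class label."""
    if not isinstance(tree, dict):
        return [conditions] if tree == label else []
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += rules_for(subtree, label, conditions + (f"{attribute}={value}",))
    return rules

for rule in rules_for(TREE, "Yes"):
    print("<" + " AND ".join(rule) + ">")
# <Outlook=Sunny AND Moisture=Normal>
# <Outlook=Cloudy>
# <Outlook=Rainy AND Wind=Weak>
```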

Decision Trees. The most well-known algorithms are ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and J48. The ID3 algorithm builds up a tree following a top-down approach: it attempts to position the most discriminative attribute as close as possible to the tree root, so every attribute must be tested in order to define its relevance. For each selected attribute, branches are created according to its possible values.

Decision Trees. ID3 employs the Information Gain approach to decide on the most important attributes, which depends on Shannon's Entropy.

Shannon's Entropy. Entropy is a thermodynamic property that determines the amount of useful energy in a given system. Gibbs stated that the best interpretation of Entropy in Statistical Mechanics is as an uncertainty measure. A little history: Entropy starts with Lazare Carnot (1803); Rudolf Clausius (1850s-1860s) brings new interpretations in physics; Claude Shannon (1948) designs the concept of Entropy in the context of Information Theory.

Shannon's Entropy. Let the system start with a single possible outcome of probability 1, and then change so that two outcomes each have probability 0.5.

Shannon's Entropy. This is the equation proposed by Shannon: E = - sum_i p_i log2(p_i), where p_i is the probability of the system being at state i. It measures the total "energy" (uncertainty) of a system as it transitions among its states, and the log2 function lets us measure the amount of information in bits.

Shannon's Entropy. Thus, the original system (one outcome with probability 1) has Entropy 0, while the modified system (two outcomes with probability 0.5 each) has Entropy 1 bit. After modifying its behavior, the system aggregated a greater level of uncertainty, or "energy" (e.g., winning the lottery).

ID3 using the Entropy. Let S be a collection of instances including positive and negative examples, i.e., two distinct and discrete classes/labels, and let p+ and p- be the probabilities of belonging to each class of S. The Entropy is given by: E(S) = -p+ log2(p+) - p- log2(p-).

ID3 using the Entropy. For illustration purposes, consider that S has 14 examples: 9 positives and 5 negatives. Then the Entropy of such a set is E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940. Observe that a binary-class scenario has a maximum Entropy equal to 1.

ID3 using the Entropy. Other scenarios: having [7+, 7-], the Entropy is 1 (maximum uncertainty); having [0+, 14-] or [14+, 0-], the Entropy is 0. Entropy measures the level of uncertainty about some event or source.
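
As a quick numeric check of these values, here is a small Python sketch (my own, not from the lecture); the helper name entropy is an assumption.

```python
# A quick check of the binary Entropy formula E(S) = -p+ log2(p+) - p- log2(p-).
from math import log2

def entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * log2(p)
    return result

print(round(entropy(9, 5), 3))   # 0.94
print(round(entropy(7, 7), 3))   # 1.0  (maximum for two classes)
print(round(entropy(0, 14), 3))  # 0.0  (no uncertainty)
```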

ID3 using the Entropy. We may generalize Entropy to c classes: E(S) = - sum_{i=1}^{c} p_i log2(p_i). Several other systems can be studied using Entropy, for instance time series, after some quantization process. Why is the log function used? It allows us to measure information in bits. A great example is encoding some message, e.g., encoding the word TORONTO.
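
To make the TORONTO example concrete, here is a small Python sketch (my own, not the lecture's code) that applies the generalized Entropy to the letter frequencies of the word.

```python
# Generalized Entropy E(S) = -sum_i p_i log2(p_i), applied to "TORONTO".
from collections import Counter
from math import log2

def entropy(probabilities):
    """Entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

message = "TORONTO"
counts = Counter(message)                        # letter counts: T:2, O:3, R:1, N:1
probs = [c / len(message) for c in counts.values()]
print(round(entropy(probs), 3))                  # ~1.842 bits per symbol
```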

ID3 uses Information Gain. After defining Entropy, we can define Information Gain, which measures how effective an attribute is at classifying some dataset. From a different point of view, it measures the Entropy reduction obtained when partitioning the dataset using a given attribute A: IG(S, A) = E(S) - sum_{v in Values(A)} (|S_v| / |S|) E(S_v). The second term measures the Entropy after partitioning the dataset with attribute A. Therefore, IG measures the Entropy reduction, i.e., the uncertainty decrease, after selecting attribute A to compose the tree.

ID3 uses Information Gain. To illustrate, take S and the attribute Wind (Weak or Strong). S contains 14 instances [9+, 5-]. Now consider that 6 of the positive examples and 2 of the negative ones have Wind=Weak (8 in total), and that there are 3 instances with Wind=Strong for both the positive and the negative classes (6 in total). Then the Information Gain of selecting the attribute Wind for the root of our decision tree is given by: IG(S, Wind) = E(S) - (8/14) E(S_Weak) - (6/14) E(S_Strong).

ID3 uses Information Gain. So that: E(S) ≈ 0.940, E(S_Weak) = E([6+, 2-]) ≈ 0.811, and E(S_Strong) = E([3+, 3-]) = 1.000, giving IG(S, Wind) ≈ 0.940 - (8/14)(0.811) - (6/14)(1.000).

ID3 uses Information Gain. Therefore: IG(S, Wind) ≈ 0.048. This is the Information Gain measure used by ID3 along the iterations to build up a decision tree. Observe that this scenario has only reduced the uncertainty a little. Thus, is this attribute good enough to take the root? You will see it is not!
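
The same computation can be checked numerically with a short Python sketch (my own, with a hypothetical entropy helper), using the counts from the slide.

```python
# Numeric check of the Wind example:
# S = [9+, 5-]; Wind=Weak covers [6+, 2-] and Wind=Strong covers [3+, 3-].
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))  # ~0.048: Wind barely reduces the uncertainty
```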

Running ID3. Consider the concept of playing some sport:

Day  Outlook  Temperature  Moisture  Wind    Play Tennis
D1   Sunny    Warm         High      Weak    No
D2   Sunny    Warm         High      Strong  No
D3   Cloudy   Warm         High      Weak    Yes
D4   Rainy    Pleasant     High      Weak    Yes
D5   Rainy    Cold         Normal    Weak    Yes
D6   Rainy    Cold         Normal    Strong  No
D7   Cloudy   Cold         Normal    Strong  Yes
D8   Sunny    Pleasant     High      Weak    No
D9   Sunny    Cold         Normal    Weak    Yes
D10  Rainy    Pleasant     Normal    Weak    Yes
D11  Sunny    Pleasant     Normal    Strong  Yes
D12  Cloudy   Pleasant     High      Strong  Yes
D13  Cloudy   Warm         Normal    Weak    Yes
D14  Rainy    Pleasant     High      Strong  No

Running ID3. First step: compute the Information Gain for each of its attributes: IG(S, Outlook) ≈ 0.247, IG(S, Moisture) ≈ 0.152, IG(S, Wind) ≈ 0.048, and IG(S, Temperature) ≈ 0.029. The attribute with the greatest IG is selected to be the tree root; this is the one that reduces the uncertainty the most! The children nodes are then created according to the possible values of the root attribute.
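
For reference, here is a small Python sketch (my own, not the lecture's code; DATA, entropy, and info_gain are illustrative names) that recomputes these gains from the table above.

```python
# Information Gain of every attribute over the 14 training examples,
# confirming that Outlook should take the root.
from collections import Counter
from math import log2

DATA = [  # (Outlook, Temperature, Moisture, Wind, PlayTennis) -- D1..D14
    ("Sunny", "Warm", "High", "Weak", "No"),
    ("Sunny", "Warm", "High", "Strong", "No"),
    ("Cloudy", "Warm", "High", "Weak", "Yes"),
    ("Rainy", "Pleasant", "High", "Weak", "Yes"),
    ("Rainy", "Cold", "Normal", "Weak", "Yes"),
    ("Rainy", "Cold", "Normal", "Strong", "No"),
    ("Cloudy", "Cold", "Normal", "Strong", "Yes"),
    ("Sunny", "Pleasant", "High", "Weak", "No"),
    ("Sunny", "Cold", "Normal", "Weak", "Yes"),
    ("Rainy", "Pleasant", "Normal", "Weak", "Yes"),
    ("Sunny", "Pleasant", "Normal", "Strong", "Yes"),
    ("Cloudy", "Pleasant", "High", "Strong", "Yes"),
    ("Cloudy", "Warm", "Normal", "Weak", "Yes"),
    ("Rainy", "Pleasant", "High", "Strong", "No"),
]
ATTRIBUTES = {"Outlook": 0, "Temperature": 1, "Moisture": 2, "Wind": 3}

def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, index):
    subsets = Counter(row[index] for row in rows)
    remainder = sum(
        (n / len(rows)) * entropy([r for r in rows if r[index] == value])
        for value, n in subsets.items()
    )
    return entropy(rows) - remainder

for name, index in ATTRIBUTES.items():
    print(f"IG(S, {name}) = {info_gain(DATA, index):.3f}")
# -> Outlook ~0.247 (largest), Temperature ~0.029, Moisture ~0.152, Wind ~0.048
```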

Running ID3. Having the tree root, we must proceed in the same way for each of its branches, considering only the examples filtered by each branch and only while there is still divergence among their labels. With Outlook at the root of S = {D1, ..., D14}: the Sunny branch keeps S = {D1, D2, D8, D9, D11}, the Cloudy branch keeps S = {D3, D7, D12, D13}, and the Rainy branch keeps S = {D4, D5, D6, D10, D14}; the Sunny and Rainy branches are still undecided.

Running ID3. Observe there is no divergence for one of the branches (Cloudy, S = {D3, D7, D12, D13}), so its Entropy is zero and a leaf labeled Yes is defined, while the Sunny and Rainy branches remain open. The attributes already incorporated along the path used to reach a node are not considered again in the Information Gain; observe that Outlook will never be assessed again in either of the remaining branches.

Running ID3. Carrying on, with S = {D1, ..., D14} [9+, 5-] and Outlook at the root: Sunny keeps S = {D1, D2, D8, D9, D11} [2+, 3-], Cloudy keeps S = {D3, D7, D12, D13} [4+, 0-] (leaf Yes), and Rainy keeps S = {D4, D5, D6, D10, D14} [3+, 2-]. Computing the Information Gain for the Sunny branch, its Entropy is E(S_Sunny) = E([2+, 3-]) ≈ 0.971.

Running ID3. Computing the Information Gain for the Sunny branch: IG(S_Sunny, Moisture) ≈ 0.971, IG(S_Sunny, Temperature) ≈ 0.571, and IG(S_Sunny, Wind) ≈ 0.020. Observe that Moisture is the best choice!

Running ID3. Carrying on: under the Sunny branch, Moisture is selected. Moisture=High keeps S = {D1, D2, D8} [0+, 3-] and becomes a No leaf, while Moisture=Normal keeps S = {D9, D11} [2+, 0-] and becomes a Yes leaf. Cloudy remains a Yes leaf, and the Rainy branch (S = {D4, D5, D6, D10, D14} [3+, 2-]) is still open.

Running ID3. And then, under the Rainy branch, Wind is selected: Wind=Strong keeps S = {D6, D14} [0+, 2-] and becomes a No leaf, while Wind=Weak keeps S = {D4, D5, D10} [3+, 0-] and becomes a Yes leaf. The final tree is: Outlook at the root; Sunny -> Moisture (High -> No, Normal -> Yes); Cloudy -> Yes; Rainy -> Wind (Strong -> No, Weak -> Yes).

Running ID3. ID3 carries on until one of the following conditions is satisfied: (1) all attributes have already been included in the path from the root to the leaf, or (2) the training examples at a given branch share a single class/label. Observe that ID3 starts with a null tree, builds the tree up from scratch, and produces a single tree.
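
As a rough illustration of the whole procedure and of these two stopping conditions, a minimal recursive ID3 sketch in Python (my own, not the lecture's implementation) might look as follows; it assumes each example is a dictionary mapping attribute names to values, plus a class-label key.

```python
# A minimal recursive ID3 sketch; it stops exactly on the two conditions above.
from collections import Counter
from math import log2

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(examples, attribute, target):
    values = Counter(e[attribute] for e in examples)
    remainder = sum(
        (n / len(examples)) * entropy([e for e in examples if e[attribute] == v], target)
        for v, n in values.items()
    )
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # condition (2): a single class remains
        return labels[0]
    if not attributes:                 # condition (1): no attribute left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```

Run on the 14-example table above (as a list of dicts with a "Play" label key), it should reproduce the same tree derived in the previous slides.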

Discussion. C4.5 is an extension of ID3: it proceeds in the same way but adds a backtracking (pruning) step afterwards that attempts to reduce the number of tree nodes. It is a copyrighted solution; J48 is based on the C4.5 documentation and is open. The Information Gain is a way of reducing the tree height. But why prefer shorter trees? William of Occam, c. 1320 (Occam's razor: prefer the simplest hypothesis that fits the data).

Suggestion. Implement ID3 (a discrete version and a continuous version) and test it with the UCI datasets: http://archive.ics.uci.edu/ml/

Complementary references:
Tom Mitchell, Machine Learning, McGraw-Hill, 1997.
Hudson (2006), Signal Processing Using Mutual Information, IEEE Signal Processing Magazine.
Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Statistics/Probability Series, Wadsworth Publishing Company, Belmont, California, U.S.A., 1984.