Decision Trees Part 1. Rao Vemuri University of California, Davis


Overview: What is a Decision Tree; Sample Decision Trees; How to Construct a Decision Tree; Problems with Decision Trees; Classification vs. Regression; Summary

What is a Decision Tree? An inductive learning task: we use particular facts to make generalized conclusions; that is why we say inductive. It is a predictive model based on a branching series of Boolean tests, and these smaller Boolean tests are less complex than a one-stage classifier. Let us look at a sample decision tree.

Predicting Commute Time. [Decision tree diagram: the root node tests Hour, with branches 8 AM, 9 AM, and 10 AM. 8 AM leads to the leaf Long; 9 AM leads to an Accident? test (Yes -> Long, No -> Medium); 10 AM leads to a Stall? test (Yes -> Long, No -> Short). The commute-time labels are the leaf nodes.] If we leave home at 10 AM and there are no cars stalled on the road, what will the commute time be? => Short

Inductive Learning. In this decision tree, we made a series of Boolean decisions and followed the corresponding branch: Did we leave at 10 AM? (Yes/No) Did a car stall on the road? (Yes/No) Is there an accident on the road? (Yes/No) By answering each of these YES/NO questions, we came to a conclusion about how long our commute might take.

Decision Trees as a Set of Rules. It is not necessary to represent this tree graphically, as we did; we could have represented it as a set of rules. However, rules may be much harder to read.

Decision Tree as a Rule Set

if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short

Notice that not all attributes need to be used in each path of the decision. As we will see, some attributes may not even appear in the tree.
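As a minimal illustration, the same rule set can be written as a small Python function. The string encoding of the attribute values is an assumption made here for the sketch, not something prescribed by the slides.

```python
def commute_time(hour, accident, stall):
    """Predict commute time for one example.

    hour: "8am", "9am", or "10am"; accident, stall: "yes" or "no".
    A direct transcription of the rule set above.
    """
    if hour == "8am":
        return "long"
    elif hour == "9am":
        return "long" if accident == "yes" else "medium"
    elif hour == "10am":
        return "long" if stall == "yes" else "short"
    raise ValueError(f"unexpected hour: {hour}")

# Example from the earlier slide: leave at 10 AM, no stalled cars, no accident.
print(commute_time("10am", accident="no", stall="no"))  # -> short
```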

How to Create a Decision Tree. We first make a list of attributes that we can measure; these attributes (for now) must be discrete, categorical or Boolean. We then choose a target attribute that we want to predict, and then create an experience table that lists what we have seen in the past.

Start with an Experience Table. Creating an experience table is the first step for any machine learning method.

Experience Table: Commute Time Dataset

Example  Hour   Weather  Accident  Stall  Commute (target)
D1       8 AM   Sunny    No        No     Long
D2       8 AM   Cloudy   No        Yes    Long
D3       10 AM  Sunny    No        No     Short
D4       9 AM   Rainy    Yes       No     Long
D5       9 AM   Sunny    Yes       Yes    Long
D6       10 AM  Sunny    No        No     Short
D7       10 AM  Cloudy   No        No     Short
D8       9 AM   Rainy    No        No     Medium
D9       9 AM   Sunny    Yes       No     Long
D10      10 AM  Cloudy   Yes       Yes    Long
D11      10 AM  Rainy    No        No     Short
D12      8 AM   Cloudy   Yes       No     Long
D13      9 AM   Sunny    No        No     Medium
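For later calculations it helps to have this table in machine-readable form. One possible encoding, assumed here as a Python list of dictionaries (the field names are a choice made for this sketch, not part of the slides):

```python
# Commute Time experience table, one dict per example (D1-D13).
commute_data = [
    {"hour": "8am",  "weather": "sunny",  "accident": "no",  "stall": "no",  "commute": "long"},    # D1
    {"hour": "8am",  "weather": "cloudy", "accident": "no",  "stall": "yes", "commute": "long"},    # D2
    {"hour": "10am", "weather": "sunny",  "accident": "no",  "stall": "no",  "commute": "short"},   # D3
    {"hour": "9am",  "weather": "rainy",  "accident": "yes", "stall": "no",  "commute": "long"},    # D4
    {"hour": "9am",  "weather": "sunny",  "accident": "yes", "stall": "yes", "commute": "long"},    # D5
    {"hour": "10am", "weather": "sunny",  "accident": "no",  "stall": "no",  "commute": "short"},   # D6
    {"hour": "10am", "weather": "cloudy", "accident": "no",  "stall": "no",  "commute": "short"},   # D7
    {"hour": "9am",  "weather": "rainy",  "accident": "no",  "stall": "no",  "commute": "medium"},  # D8
    {"hour": "9am",  "weather": "sunny",  "accident": "yes", "stall": "no",  "commute": "long"},    # D9
    {"hour": "10am", "weather": "cloudy", "accident": "yes", "stall": "yes", "commute": "long"},    # D10
    {"hour": "10am", "weather": "rainy",  "accident": "no",  "stall": "no",  "commute": "short"},   # D11
    {"hour": "8am",  "weather": "cloudy", "accident": "yes", "stall": "no",  "commute": "long"},    # D12
    {"hour": "9am",  "weather": "sunny",  "accident": "no",  "stall": "no",  "commute": "medium"},  # D13
]
```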

Exp. Table: Play Tennis Data

Day  Outlook  Temp  Humidity  Wind    Tennis?
D1   Sunny    Hot   High      Weak    NO
D2   Sunny    Hot   High      Strong  NO
D3   Cloudy   Hot   High      Weak    YES
D4   Rain     Mild  High      Weak    YES
D5   Rain     Cool  Normal    Weak    YES
D6   Rain     Cool  Normal    Strong  NO
D7   Cloudy   Cool  Normal    Strong  YES
D8   Sunny    Mild  High      Weak    NO
D9   Sunny    Cool  Normal    Weak    YES
D10  Rain     Mild  Normal    Weak    YES
D11  Sunny    Mild  Normal    Strong  YES
D12  Cloudy   Mild  High      Strong  YES
D13  Cloudy   Hot   Normal    Weak    YES
D14  Rain     Mild  High      Strong  NO

Choosing Attributes. The Commute Time experience table showed 4 attributes: Hour, Weather, Accident and Stall. But the decision tree only showed 3 attributes: Hour, Accident and Stall. No Weather. Why is that?

Choosing Attributes - 1. Methods for selecting attributes (which will be described later) show that Weather is not a discriminating attribute. That is, it doesn't matter what the weather is (in this example).

Choosing Attributes - 2. The basic method of creating a decision tree is the same for most decision tree algorithms. The difference lies in how we select the attributes for the nodes of the tree.

Decision Tree Algorithms. The basic idea behind any decision tree algorithm is as follows: choose the best attribute to split on and make that attribute a decision node; repeat this process for each child node; stop when (a) all the instances have the same target attribute value, (b) there are no more attributes, or (c) there are no more instances. A small recursive sketch of this skeleton is given below.
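To make the recursion concrete, here is a minimal sketch in Python. It assumes examples are dictionaries like the commute table encoding above, and it leaves the choice of the "best" attribute to a function passed in as a parameter, since that splitting criterion (information gain, discussed next) is what distinguishes one decision tree algorithm from another. All names here are illustrative, not from the slides.

```python
from collections import Counter

def majority_class(examples, target):
    """Most common target value; used when we must stop without a pure node."""
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def build_tree(examples, attributes, target, choose_attribute, default=None):
    """Return a nested-dict decision tree, or a class label at a leaf."""
    if not examples:                       # no more instances
        return default
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:                  # all instances share the target value
        return classes.pop()
    if not attributes:                     # no more attributes to split on
        return majority_class(examples, target)

    best = choose_attribute(examples, attributes, target)
    tree = {best: {}}
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target,
                                       choose_attribute,
                                       default=majority_class(examples, target))
    return tree

# Example usage (with a naive selector that just takes the first attribute;
# the information-gain selector comes later in the lecture):
# tree = build_tree(commute_data, ["hour", "weather", "accident", "stall"],
#                   "commute", lambda exs, attrs, tgt: attrs[0])
```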

Identifying the Best Attributes. Refer back to our original decision tree (Hour at the root, with 8 AM -> Long, 9 AM -> Accident?, and 10 AM -> Stall?). How did we know to split on Hour first, and then on Stall and Accident, but not on Weather?

ID3 Heuristic: Entropy. To determine the best attribute, we look at the ID3 heuristic. ID3 chooses the attribute to split on based on entropy. Entropy is a measure of impurity (or non-homogeneity), a term borrowed from physics. (ID3: Iterative Dichotomizer 3.)

Entropy. Entropy is minimized when all values of the target attribute are of the same class (homogeneous): if we know that commute time will always be short, then entropy = 0. Entropy is maximized when there is an equal chance for the target attribute to assume any value (i.e., the result is random): if commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized (for three equally likely classes it equals log2(3), about 1.58) and the set is not homogeneous.
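A quick check of those two extremes, using the entropy formula introduced on the next slides (the helper function is written inline here for convenience):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts.

    Uses p * log2(1/p), which equals -p * log2(p) term by term.
    """
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

print(entropy([9]))        # all one class -> 0.0 (homogeneous)
print(entropy([3, 3, 3]))  # three equally likely classes -> log2(3) ~ 1.585
```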

Entropy Formula in Boolean Classification. Given a collection S containing some positive and some negative examples of a target concept (e.g., Play Tennis? YES/NO), the entropy of S is:

E(S) = -(p+) log2(p+) - (p-) log2(p-)

where p+ is the proportion of examples in the positive class and p- is the proportion in the negative class.

Entropy Formula for Many Classes.

E(S) = - Σ (i = 1 to c) p_i log2(p_i),   with p_i = |S_i| / |S|

equivalently, E(S) = - Σ (i = 1 to c) (|S_i| / |S|) log2(|S_i| / |S|)

where |S| is the cardinality of the set of examples, |S_i| is the cardinality of the subset of S that belongs to class i, and c is the number of classes.
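The same formula, written for a list of examples like the experience tables above (the dictionary encoding is an assumption carried over from the earlier sketch):

```python
import math
from collections import Counter

def entropy_of(examples, target):
    """E(S) for a list of example dicts, computed over the `target` attribute."""
    total = len(examples)
    counts = Counter(ex[target] for ex in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```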

A Useful Formula for log2. If you use a hand calculator, this identity will be useful: log2(x) = ln(x) / ln(2) = log10(x) / log10(2).
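The identity is easy to check in Python with the standard-library math functions:

```python
import math

x = 14 / 9
print(math.log2(x))                   # direct base-2 logarithm
print(math.log(x) / math.log(2))      # via the natural log
print(math.log10(x) / math.log10(2))  # via the base-10 log; all three agree
```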

Graph of log. (The slide shows a plot of the logarithm function; the figure is not reproduced in this transcription.)

Review of log to base 2. log2(0) is undefined (it tends to minus infinity, so the formula cannot be applied directly to a probability of 0; by convention we take 0 * log2(0) = 0); log2(1) = 0; log2(2) = 1; log2(4) = 2; log2(1/2) = -1; log2(1/4) = -2; (1/2) log2(1/2) = (1/2)(-1) = -1/2.

Play Tennis Data Set (repeated for reference)

Day  Outlook  Temp  Humidity  Wind    Tennis?
D1   Sunny    Hot   High      Weak    NO
D2   Sunny    Hot   High      Strong  NO
D3   Cloudy   Hot   High      Weak    YES
D4   Rain     Mild  High      Weak    YES
D5   Rain     Cool  Normal    Weak    YES
D6   Rain     Cool  Normal    Strong  NO
D7   Cloudy   Cool  Normal    Strong  YES
D8   Sunny    Mild  High      Weak    NO
D9   Sunny    Cool  Normal    Weak    YES
D10  Rain     Mild  Normal    Weak    YES
D11  Sunny    Mild  Normal    Strong  YES
D12  Cloudy   Mild  High      Strong  YES
D13  Cloudy   Hot   Normal    Weak    YES
D14  Rain     Mild  High      Strong  NO

Entropy of Play Tennis Data. The data set S has 14 instances: 9 positive and 5 negative; we use the notation S = [9+, 5-]. Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 (verify!)
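A two-line check of that number (this just evaluates the formula; it is not part of the original slides):

```python
import math

e = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)
print(round(e, 3))  # 0.94
```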

Information Gain. How do we use the entropy idea to build a decision tree? Ask the question: how much entropy reduction is achieved by partitioning the data set on a given attribute A? We call this entropy reduction the information gain, Gain(S, A).

Formula for Gain(S, A). Gain(S, A) = (entropy of the original data set S) - (expected value of the entropy after the data set S is partitioned using attribute A):

Gain(S, A) = E(S) - Σ (v ∈ Values(A)) (|S_v| / |S|) E(S_v)

where S_v is the subset of S for which attribute A has the value v.

Example Calculation. Consider the attribute Wind. It has two values: Strong and Weak. When Wind = Weak, 6 of the examples are + and 2 are -, so S_weak = [6+, 2-]. Similarly, S_strong = [3+, 3-]. (Refer back to the data set S.)

Gain Calculation. Gain(S, Wind) = E(S) - (8/14) E(S_weak) - (6/14) E(S_strong) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048 (verify!). (Look back at the formula for Gain.)
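The same calculation done from the raw counts, as a small self-contained sketch (the entropy helper mirrors the formula from the earlier slides):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_s      = entropy([9, 5])   # whole Play Tennis set, [9+, 5-]
e_weak   = entropy([6, 2])   # Wind = Weak subset
e_strong = entropy([3, 3])   # Wind = Strong subset

gain_wind = e_s - (8/14) * e_weak - (6/14) * e_strong
print(round(gain_wind, 3))   # 0.048
```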

Home Work 1. For the Play Tennis data set, calculate Gain(S, Humidity), Gain(S, Wind) and Gain(S, Outlook). Using these gains, determine which attribute should be used to split the table. Using that attribute as the root node, draw a partial decision tree. At this stage, are there any leaf nodes?

Summary. Decision trees can be used to help predict the future. The trees are easy to understand. Decision trees work more efficiently with discrete attributes. The trees may suffer from error propagation.

Example 2 (Lab Session): Build a decision tree to classify the following.

(x1, x2, x3)   Class Label
(0, 0, 0)      0
(0, 0, 1)      0
(0, 1, 0)      0
(0, 1, 1)      0
(1, 0, 0)      0
(1, 0, 1)      1
(1, 1, 0)      0
(1, 1, 1)      1

Solution: Step 1. 6 rows are Class 0 and 2 rows are Class 1. The initial entropy of all 8 points is -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81. Suppose we divide the points in half by drawing a plane parallel to the x2-x3 plane (i.e., we test on x1). Then the left branch (x1 = 0) has 4 points, all belonging to the same class, so the entropy of the left branch is -(4/4) log2(4/4) - (0/4) log2(0/4) = 0 (using the convention 0 * log2(0) = 0).

Solution: Step 2. The right branch (x1 = 1) has two members of each class. The uncertainty of the right branch is -(2/4) log2(2/4) - (2/4) log2(2/4) = 1. The average entropy after the first test (on x1) is (1/2)(0) + (1/2)(1) = 1/2. The entropy reduction achieved is 0.81 - 0.5 = 0.31.

Solution: Step 3. Do a similar thing along x2 and x3. You will find that a test along x3 gives exactly the same entropy reduction, and a test along x2 gives no improvement at all. So for the first split choose either x1 or x3. (Do the rest of the calculations!) The decision tree really implements f = x1 AND x3.
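A small sketch that reproduces these numbers for the lab data set (the tuple encoding of the table is an assumption made for this check):

```python
import math

# (x1, x2, x3) -> class label, from the lab-session table.
data = [((0, 0, 0), 0), ((0, 0, 1), 0), ((0, 1, 0), 0), ((0, 1, 1), 0),
        ((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 0), ((1, 1, 1), 1)]

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

all_labels = [label for _, label in data]
for i in range(3):  # test on x1, x2, x3 in turn
    split = {0: [], 1: []}
    for x, label in data:
        split[x[i]].append(label)
    avg = sum(len(side) / len(data) * entropy(side) for side in split.values())
    print(f"x{i + 1}: gain = {entropy(all_labels) - avg:.2f}")
# Expected output: x1 and x3 each give a gain of about 0.31, x2 gives 0.00.
```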

Another Entropy Notation. Entropy characterizes the (im)purity, or homogeneity, of a collection of items. n_b = number of instances in branch b; n_bc = number of instances in branch b that belong to class c (note: n_bc <= n_b); n_t = total number of instances in all branches.

Another Entropy Formula.

E_b = - Σ (over classes c) (n_bc / n_b) log2(n_bc / n_b)

As you move between perfect homogeneity and perfect balance, entropy varies smoothly between zero and one (for a two-class problem). The entropy is zero when the set is perfectly homogeneous; it is one when the set is perfectly inhomogeneous (an even split between the two classes).

References
For incremental induction of decision trees: http://wwwlrn.cs.umass.edu/papers/mlj-id5r.pdf
For Massive Online Analysis (MOA): http://moa.cms.waikato.ac.nz/
For extensions of the MOA package, particularly iOVFDT (Incrementally Optimized Very Fast Decision Tree): http://www.cs.washington.edu/dm/vfml/
An ID3 tutorial: http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html

Home Work 2: Python Program. A sample Python program for the Play Tennis example can be found at https://github.com/lbntly/decision-tree. Learn to run this program. Try the program on the Iris data set; you can get this data set from the UC Irvine (UCI) Machine Learning Repository.

Other Python Codes for DTs
http://www.patricklamle.com/tutorials/decision%20tree%20python/tuto_decision%20tree.html
http://codereview.stackexchange.com/questions/109089/id3-decision-tree-in-python
http://kldavenport.com/pure-python-decision-trees/
https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/

Classification vs. Regression. In the toolbox of predictive analytics there are many types of trees: classification trees (decision trees) and regression trees.

Classification Trees. Given a data set {x, y}, a classification tree is used to separate (split) the data into the classes to which y (the target variable) can belong. If y belongs to one of two classes, Class 1 (YES) or Class 0 (NO), then we use the ID3 algorithm; if y belongs to one of K classes, then we use the C4.5 algorithm. Classification trees are used when y is categorical in nature.

Regression Trees. Regression trees are used when the target variable y is numeric or continuous. The independent variables x may be categorical or numeric. It is the target variable that determines the type of tree needed.

Decision Trees: Using WEKA. In this laboratory session, I want you to learn WEKA. WEKA is a data mining / machine learning tool developed by the Department of Computer Science at the University of Waikato, New Zealand. Learn WEKA and solve the Play Tennis example using WEKA.

Special Situations (not covered in class): large data sets; incremental decision trees; windowing; pruning; concepts that change over time; noisy data; different class distributions.

Entropy and Gini Index. There are rigorous methods of evaluating the degree of impurity, i.e., of quantifying the homogeneity of each group and subgroup before and after a split: entropy, which we have seen, and the Gini index, which I have not covered.