Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees


Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis: Decision Trees. Paul Prasse, Niels Landwehr, Tobias Scheffer

Decision Trees. One of many applications: credit risk. [Figure: a decision tree whose test nodes ask "Employed longer than 3 months?", "Positive credit report?", "Collateral > 2x disposable income?", "Unemployed?", "Collateral > 5x disposable income?", and "Student?"; its leaves carry the decision, e.g., Accepted.]

Decision Trees: Why? Simple to interpret. Provides a classification plus a justification for it, e.g., a rejection "because the subject has been employed for less than 3 months and has collateral < 2x disposable income". Can be learned from samples; the learning algorithm is simple, efficient, and scalable. Classification and regression trees: classification, regression, and model trees are often components of complex (e.g., risk) models.

Classification. Input: instance (object) $x \in X$. Instances are represented as a vector of attributes (e.g., features like color or test results); an instance is an assignment of values to the attributes. Instances are also called feature vectors: $x = (x_1, \dots, x_m)^\top$. Output: class $y \in Y$, where $Y$ is a finite set, e.g., {accepted, rejected} or {spam, not spam}. The class is also referred to as the target attribute.

Classifier Learning. Input: training data $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with $x_i = (x_{i1}, \dots, x_{im})^\top$. Output: classifier $f: X \to Y$, e.g., a decision tree: the path along the edges to a leaf provides the classification.

Regression. Input: instance (object) $x \in X$. Instances are represented as vectors (bold letters) of attributes (italic letters); an instance is an assignment of attribute values. Output: function value $y \in Y$, continuous-valued. Learning problem: continuous-valued training data $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$, e.g., $\{(x_1, 3.5), \dots, (x_n, 2.8)\}$.

Decision Trees. Test nodes: test whether the value of an attribute satisfies a condition and follow the branch corresponding to the outcome of this test. Terminal nodes: provide the value to output. [Figure: the credit-risk tree from above with its test nodes (Employed longer than 3 months?, Positive credit report?, Collateral > 2x disposable income?, Unemployed?, Collateral > 5x disposable income?, Student?) and Accepted leaves.]

Decision Trees. Decision trees can be represented as logical sentences over the attribute tests, built from operations such as conjunction, disjunction, and XOR, e.g., a formula over attributes A, B, C, D, E. [Figure: the credit-risk tree from above.]

Application of Decision Trees. At test nodes: conduct the test, choose the appropriate branch, and proceed recursively. At terminal nodes: return the node's value as the class. [Figure: the credit-risk tree from above.]

Application of Decision Trees. Decision trees make sense if: instances can be described as attribute-value pairs; the target attribute has a discrete range of values (model trees extend this to a continuous range of values); interpretability of the predictions is desirable. Application areas: medical diagnosis, credit (risk) assessment, object recognition in computer vision.

Learning of Decision Trees. Training data with target attribute Tennis?:

Day  Outlook   Temperature  Humidity  Wind    Tennis?
1    sunny     hot          high      light
2    sunny     hot          high      strong
3    overcast  hot          high      light
4    rainy     mild         high      light
5    rainy     cool         normal    light
6    rainy     cool         normal    strong
7    overcast  cool         normal    strong
8    sunny     mild         high      light
9    sunny     cool         normal    light
10   rainy     mild         normal    light

Find a decision tree that at least provides the correct class for the training data. Trivial way: create a tree that merely reproduces the training data.

Learning of Decision Trees. A smaller training sample with target attribute Tennis?:

Day  Outlook   Temperature  Wind    Tennis?
1    sunny     hot          light
2    sunny     hot          strong
3    overcast  hot          light
4    overcast  cool         strong
5    overcast  cool         light

[Figure: the trivial tree that reproduces this table, splitting first on Outlook (sunny / overcast), then on Temperature (hot / cool), then on Wind (light / strong); branches not covered by the training data end in unknown leaves (???).]

Learning of Decision Trees. A more elegant way: from the trees that are consistent with the training data, choose the smallest (as few nodes as possible). Small trees are good because they are easier to interpret and, in many cases, generalize better: there are more samples per leaf node, so the class decision in each leaf is supported by more samples.

Complete Search for the Smallest Tree. How many decision trees are there? Assume m binary attributes and two classes. What is the complexity of a complete search for the smallest tree?

Learning of Decision Trees. A greedy algorithm finds a small tree (instead of the smallest tree) but is polynomial in the number of attributes. Idea for the algorithm?

Greedy Algorithm: Top-Down Construction. As long as the training data are not perfectly classified: choose the best attribute A that separates the training data; create a new successor node for every value of attribute A and repartition the training data accordingly into the resulting tree. Problem: which attribute is the best? [Example: a sample with class distribution [29+, 35-] is split on x1 into [21+, 5-] and [8+, 30-], or split on x2 into [18+, 33-] and [11+, 2-].]

Split Criterion: Entropy. Entropy is a measure of uncertainty. Let S be a set of training data, $p_+$ the fraction of positive samples in S, and $p_-$ the fraction of negative samples in S. The entropy measures the uncertainty in S (with respect to +/-): $H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$.

Split Criterion: Entropy. $H(S)$ is the expected number of bits required to encode the target class of a sample drawn randomly from the set S (under an optimal, shortest code). Entropy of a random variable y: $H(y) = -\sum_v p(y = v) \log_2 p(y = v)$. Empirical entropy of a random variable y on data L: $H(L, y) = -\sum_v p_L(y = v) \log_2 p_L(y = v)$.
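
The empirical entropy can be computed in a few lines; the following is a minimal sketch (not part of the slides), assuming the class labels are given as a plain Python list:

```python
from collections import Counter
from math import log2

def empirical_entropy(labels):
    """Empirical entropy H(L, y) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A 50/50 class distribution has maximal uncertainty: 1 bit.
print(empirical_entropy(["+", "+", "-", "-"]))  # 1.0
```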

Split Criterion: Conditional Entropy. Conditional entropy of a random variable y given the event x = v: $H_{x=v}(y) = H(y \mid x = v) = -\sum_{v'} p(y = v' \mid x = v) \log_2 p(y = v' \mid x = v)$. Empirical conditional entropy on data L: $H_{x=v}(L, y) = H(L, y \mid x = v) = -\sum_{v'} p_L(y = v' \mid x = v) \log_2 p_L(y = v' \mid x = v)$.
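
A matching sketch for the empirical conditional entropy $H_{x=v}(L, y)$, reusing `empirical_entropy` from the sketch above; storing each sample as a dict from attribute names to values is an assumption for illustration, not the slides' notation:

```python
def empirical_conditional_entropy(xs, ys, attribute, v):
    """H_{x=v}(L, y): entropy of the labels of the samples with x = v."""
    subset = [y for x, y in zip(xs, ys) if x[attribute] == v]
    return empirical_entropy(subset)
```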

Split Criterion: Information Gain. Entropy of the class variable: uncertainty about the correct classification. Information gain: the mutual information of an attribute, i.e., the reduction of the entropy of the class variable after testing the value of this attribute. $IG(x) = H(y) - \sum_v p(x = v) \, H_{x=v}(y)$. Information gain on data L: $IG(L, x) = H(L, y) - \sum_v p_L(x = v) \, H_{x=v}(L, y)$.
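
A sketch of the empirical information gain, combining the two helpers above (the dict-per-sample data layout is again an assumption for illustration):

```python
def information_gain(xs, ys, attribute):
    """IG(L, x) = H(L, y) - sum_v p_L(x = v) * H_{x=v}(L, y)."""
    n = len(ys)
    gain = empirical_entropy(ys)
    for v, count in Counter(x[attribute] for x in xs).items():
        gain -= (count / n) * empirical_conditional_entropy(xs, ys, attribute, v)
    return gain
```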

Example: Information Gain.

Sample  x1: Credit > 3x Income?  x2: Employed Longer than 3 Months?  x3: Student?  y: Credit Repaid?
1       Yes                      Yes                                  No            Yes
2       Yes                      No                                   No            No
3       No                       Yes                                  Yes           Yes
4       No                       No                                   Yes           No
5       No                       Yes                                  No            Yes
6       No                       No                                   No            Yes

$H(L, y) = -\sum_v p_L(y = v) \log_2 p_L(y = v) = 0.92$. With $IG(L, x) = H(L, y) - \sum_v p_L(x = v) \, H_{x=v}(L, y)$, the gains are $IG(L, x_1) \approx 0.05$, $IG(L, x_2) \approx 0.46$, $IG(L, x_3) \approx 0.05$.
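
Running the sketches above on this table reproduces the stated gains up to rounding:

```python
xs = [
    {"x1": "Yes", "x2": "Yes", "x3": "No"},
    {"x1": "Yes", "x2": "No",  "x3": "No"},
    {"x1": "No",  "x2": "Yes", "x3": "Yes"},
    {"x1": "No",  "x2": "No",  "x3": "Yes"},
    {"x1": "No",  "x2": "Yes", "x3": "No"},
    {"x1": "No",  "x2": "No",  "x3": "No"},
]
ys = ["Yes", "No", "Yes", "No", "Yes", "Yes"]   # y: Credit Repaid?

print(round(empirical_entropy(ys), 2))          # 0.92
for a in ("x1", "x2", "x3"):
    print(a, round(information_gain(xs, ys, a), 2))
# x1 0.04, x2 0.46, x3 0.04 (matches the slide up to rounding;
# the slide reports 0.05 for the two small gains)
```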

Example II: Information Gain. $IG(L, x)$ is the expected reduction in uncertainty through the split on attribute x. Which split is better? The root distribution is [29+, 35-]; x1 splits it into [21+, 5-] and [8+, 30-], x2 into [18+, 33-] and [11+, 2-]. $H(L, y) = -(\frac{29}{64} \log_2 \frac{29}{64} + \frac{35}{64} \log_2 \frac{35}{64}) = 0.99$. $IG(L, x_1) = 0.99 - (\frac{26}{64} H(L_{x_1=\text{yes}}, y) + \frac{38}{64} H(L_{x_1=\text{no}}, y)) = 0.26$. $IG(L, x_2) = 0.99 - (\frac{51}{64} H(L_{x_2=\text{yes}}, y) + \frac{13}{64} H(L_{x_2=\text{no}}, y)) = 0.11$.

Information Gain / Gain Ratio. Motivation: predicting whether a student will pass an exam. How high is the information gain of the attribute "matriculation number"? The information content of this test is huge (it identifies every student uniquely), yet it does not generalize to new students. Idea: penalize the information content of a test. $GainRatio(L, x) = \frac{IG(L, x)}{SplitInfo(L, x)}$, with $SplitInfo(L, x) = -\sum_v \frac{|L_{x=v}|}{|L|} \log_2 \frac{|L_{x=v}|}{|L|}$.
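
A corresponding sketch of SplitInfo and GainRatio, reusing `Counter`, `log2`, and `information_gain` from the earlier sketches:

```python
def split_info(xs, attribute):
    """SplitInfo(L, x): entropy of the split proportions |L_{x=v}| / |L|."""
    n = len(xs)
    counts = Counter(x[attribute] for x in xs)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(xs, ys, attribute):
    """GainRatio(L, x) = IG(L, x) / SplitInfo(L, x)."""
    return information_gain(xs, ys, attribute) / split_info(xs, attribute)
```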

Example: Information Gain Ratio. Which split is better? Attribute x1 has four values (1, 2, 3, 4) and splits [2+, 2-] into [1+, 0-], [1+, 0-], [0+, 1-], [0+, 1-]: $IG(L, x_1) = 1$, $SplitInfo(L, x_1) = -4 \cdot \frac{1}{4} \log_2 \frac{1}{4} = 2$, $GainRatio(L, x_1) = 0.5$. Attribute x2 splits [2+, 2-] into [2+, 1-] and [0+, 1-]: $IG(L, x_2) = 1 - \frac{3}{4} \cdot 0.92 = 0.31$, $SplitInfo(L, x_2) = -(\frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4}) = 0.81$, $GainRatio(L, x_2) = 0.38$.

Algorithm ID3. Preconditions: classifier learning; all attributes have a fixed, discrete range of values. Idea: a recursive algorithm. Choose the attribute that maximally reduces the uncertainty about the target class, measured by information gain, gain ratio, or the Gini index; then call the algorithm recursively on the data split according to the values of the chosen attribute. A branch stops growing once all samples remaining in it belong to the same class. Original reference: J. R. Quinlan: Induction of Decision Trees, 1986.

Algorithm ID3. Input: $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$, available attributes $= \{x_1, \dots, x_m\}$. If all samples in L have the same class y, return a terminal node with class y. If available attributes $= \emptyset$, return a terminal node with the most prevalent class in L. Otherwise construct and return a test node: determine the best available attribute, $x = \arg\max_{x_i \in \text{available}} IG(L, x_i)$; for each value $v_j$ of this attribute, split the training set, $L_j = \{(x_k, y_k) \in L \mid x_k = v_j\}$, and recurse: the branch for value $v_j$ = ID3$(L_j, \text{available} \setminus \{x\})$.
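
A compact sketch of ID3 on top of the helpers above; representing a test node as a pair `(attribute, branches)` and a terminal node as a bare class label is an implementation choice for illustration, not part of the slides:

```python
def majority_class(ys):
    """Most prevalent class label in ys."""
    return Counter(ys).most_common(1)[0][0]

def id3(xs, ys, available):
    """Returns a class label (terminal node) or (attribute, {value: subtree})."""
    if len(set(ys)) == 1:                  # all samples share one class
        return ys[0]
    if not available:                      # no attributes left to test
        return majority_class(ys)
    # best attribute by information gain (gain ratio or Gini would also fit here)
    best = max(available, key=lambda a: information_gain(xs, ys, a))
    branches = {}
    for v in set(x[best] for x in xs):
        rows = [(x, y) for x, y in zip(xs, ys) if x[best] == v]
        branches[v] = id3([x for x, _ in rows], [y for _, y in rows],
                          available - {best})
    return (best, branches)

# tree = id3(xs, ys, {"x1", "x2", "x3"})   # credit sample from the example above
```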

Algorithm ID3: Example. The ID3 algorithm from the previous slide applied to the credit sample above (x1: Credit > 3x Income?, x2: Employed Longer than 3 Months?, x3: Student?, y: Credit Repaid?): the best available attribute is determined by information gain, the training set is split on its values, and the procedure recurses on each branch.

Continuous Attributes. ID3 chooses the attribute with the largest information content and then constructs a branch for every value of this attribute. Problem: that only works for discrete attributes; attributes like size, income, and distance have infinitely many values. Ideas?

Continuous Attributes. Idea: binary decision trees with threshold tests of the form $x \le v$. Problem: infinitely many thresholds for binary tests. Idea: only finitely many values occur in the training data. Example: a continuous attribute takes the values 0.2, 0.4, 0.7, 0.9; possible splits: $x \le 0.2$, $x \le 0.4$, $x \le 0.7$, $x \le 0.9$. Other possibility: map the attribute into a discrete range of values.
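
A tiny sketch of the finite set of candidate thresholds for one continuous attribute, using the observed values themselves as on the slide (midpoints between neighbouring values are a common alternative):

```python
def candidate_thresholds(values):
    """One <=-test per distinct observed value of the attribute."""
    return sorted(set(values))

print(candidate_thresholds([0.9, 0.2, 0.7, 0.4, 0.7]))  # [0.2, 0.4, 0.7, 0.9]
```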

Algorithm C4.5. A further enhancement of ID3. Improvements: allows for continuous-valued attributes; allows for training data with missing attribute values; allows for attributes with costs; pruning. Not the last version: see C5.0. Original reference: J. R. Quinlan: C4.5: Programs for Machine Learning, 1993.

Algorithm C4.5. Input: $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$. If all samples in L have the same class y, return a terminal node with class y. If all samples in L are identical, return a terminal node with the most prevalent class in L. Otherwise construct the best test node by iterating over all attributes. Discrete attributes are treated as in ID3; for continuous attributes, choose $(x, v) = \arg\max_{x_i,\, v \in L} IG(L, x_i \le v)$. If the best attribute is discrete, split recursively as in ID3. If the best attribute is continuous, split the training set into $L_{\text{left}} = \{(x_k, y_k) \in L \mid x_k \le v\}$ and $L_{\text{right}} = \{(x_k, y_k) \in L \mid x_k > v\}$; recursion: left branch = C4.5$(L_{\text{left}})$, right branch = C4.5$(L_{\text{right}})$.

Information Gain for Continuous Attributes. Information gain of a test $x \le v$: $IG(x \le v) = H(y) - p(x \le v) \, H_{x \le v}(y) - p(x > v) \, H_{x > v}(y)$. Empirical information gain: $IG(L, x \le v) = H(L, y) - p_L(x \le v) \, H_{x \le v}(L, y) - p_L(x > v) \, H_{x > v}(L, y)$.
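
A sketch of the empirical information gain of a threshold test and of the search over the finitely many candidate thresholds, reusing `empirical_entropy` and `candidate_thresholds` from the earlier sketches:

```python
def threshold_information_gain(values, ys, v):
    """IG(L, x <= v) for one continuous attribute column and threshold v."""
    n = len(ys)
    left = [y for x, y in zip(values, ys) if x <= v]
    right = [y for x, y in zip(values, ys) if x > v]
    return (empirical_entropy(ys)
            - (len(left) / n) * empirical_entropy(left)
            - (len(right) / n) * empirical_entropy(right))

def best_threshold(values, ys):
    """Threshold with maximal empirical information gain."""
    candidates = candidate_thresholds(values)[:-1]   # 'x <= max' splits off nothing
    if not candidates:                               # attribute is constant
        return None
    return max(candidates, key=lambda v: threshold_information_gain(values, ys, v))
```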

C4.5: Example. [Figure: a two-dimensional training set over the continuous attributes x1 and x2, and the binary decision tree learned by C4.5 with threshold tests such as $x_1 \le 0.7$, $x_2 \le 0.5$, $x_2 \le 1.5$, $x_1 \le 0.5$, and $x_1 \le 0.6$; the leaves carry the classes t and f.]

Pruning. Problem: leaf nodes that are supported by only a single (or very few) samples often do not provide a good classification. Pruning: remove test nodes whose leaves have fewer than a minimum number of samples. This policy results in leaf nodes labeled with the most prevalent class.

Pruning with a Threshold. For all leaf nodes: if fewer than r training samples fall into a leaf node, remove the parent test node of this leaf and replace it with a new leaf node that predicts the majority class of its training data. The regularization parameter r is adjusted via cross-validation.
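
A simplified sketch of this pruning rule for the `(attribute, branches)` trees and `majority_class` helper from the ID3 sketch above; here the training data are passed down the tree to count how many samples reach each leaf (the counts could equally well be stored during training):

```python
def prune_with_threshold(tree, xs, ys, r):
    """Collapse a test node into a majority-class leaf if one of its
    leaves receives fewer than r training samples."""
    if not isinstance(tree, tuple):              # terminal node: nothing to prune
        return tree
    attribute, branches = tree
    pruned = {}
    for v, subtree in branches.items():
        rows = [(x, y) for x, y in zip(xs, ys) if x[attribute] == v]
        sub_xs, sub_ys = [x for x, _ in rows], [y for _, y in rows]
        child = prune_with_threshold(subtree, sub_xs, sub_ys, r)
        if not isinstance(child, tuple) and len(sub_ys) < r:
            return majority_class(ys)            # under-supported leaf: drop this test node
        pruned[v] = child
    return (attribute, pruned)
```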

Reduced Error Pruning. Split the training data into a training set and a pruning validation set. After constructing the tree with the training set: attempt to merge two leaf nodes by removing their adjoining test node; continue as long as the error rate on the pruning validation set is reduced.

Conversion of Trees into Rules. The path through the tree yields the conditions of a rule; the class at the leaf specifies the rule's conclusion. Pruning of rules: test which conditions can be omitted without increasing the error rate.

Conversion of Trees into Rules. [Figure: the PlayTennis tree with root Outlook (sunny / overcast / rainy); the sunny branch tests Humidity (high: No, normal: Yes), the overcast branch is a Yes leaf, and the rainy branch tests Wind (strong: No, light: Yes).]
R1 = If Outlook = sunny ∧ Humidity = high, then don't play tennis.
R2 = If Outlook = sunny ∧ Humidity = normal, then play tennis.
R3 = If Outlook = overcast, then play tennis.
R4 = If Outlook = rainy ∧ Wind = strong, then don't play tennis.
R5 = If Outlook = rainy ∧ Wind = light, then play tennis.
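
A sketch of the path-to-rule conversion for the `(attribute, branches)` tree representation used in the earlier sketches; the PlayTennis tree from this slide is written out by hand as the input:

```python
def tree_to_rules(tree, conditions=()):
    """Every root-to-leaf path becomes one rule: (list of conditions, class)."""
    if not isinstance(tree, tuple):                       # leaf: emit one rule
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

play_tennis_tree = ("Outlook", {
    "sunny":    ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rainy":    ("Wind", {"strong": "No", "light": "Yes"}),
})

for conds, label in tree_to_rules(play_tennis_tree):
    print("If " + " and ".join(f"{a} = {v}" for a, v in conds) + f", then {label}")
```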