Lecture 3: Decision Trees

Cognitive Systems II - Machine Learning, SS 2005
Part I: Basic Approaches of Concept Learning: ID3, Information Gain, Overfitting, Pruning

Decision Tree Representation
- classification of instances by sorting them down the tree from the root to some leaf node
  - node: test of some attribute
  - branch: one of the possible values for that attribute
- decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances, i.e., (... ∧ ... ∧ ...) ∨ (... ∧ ... ∧ ...) ∨ ...
- equivalent to a set of if-then rules
  - each branch represents one if-then rule
  - if-part: conjunction of the attribute tests along the nodes of the branch
  - then-part: classification of the branch

Decision Tree Representation
(figure: decision tree for PlayTennis)
Outlook
- Sunny → Humidity
  - High → No
  - Normal → Yes
- Overcast → Yes
- Rain → Wind
  - Strong → No
  - Weak → Yes

This decision tree is equivalent to:
if (Outlook = Sunny) ∧ (Humidity = Normal) then Yes;
if (Outlook = Overcast) then Yes;
if (Outlook = Rain) ∧ (Wind = Weak) then Yes;
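
The equivalence between the tree and its rule set is easy to see in code. The sketch below hard-codes the PlayTennis tree above as nested if-then rules; the function name and the dictionary-based instance format are illustrative choices, not from the lecture.

```python
# A minimal sketch: the PlayTennis tree above written as nested if-then rules.
def classify_play_tennis(instance):
    """Classify an instance such as {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}."""
    if instance["Outlook"] == "Sunny":
        return "Yes" if instance["Humidity"] == "Normal" else "No"
    if instance["Outlook"] == "Overcast":
        return "Yes"
    if instance["Outlook"] == "Rain":
        return "Yes" if instance["Wind"] == "Weak" else "No"
    raise ValueError("unknown Outlook value")

print(classify_play_tennis({"Outlook": "Rain", "Humidity": "High", "Wind": "Strong"}))  # -> No
```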

Appropriate Problems
- instances are represented by attribute-value pairs, e.g. (Temperature, Hot)
- the target function has discrete output values, e.g. yes or no
- disjunctive descriptions may be required
- the training data may contain errors
- the training data may contain missing attribute values
The last three points make Decision Tree Learning more attractive than CANDIDATE-ELIMINATION.

ID3
- learns decision trees by constructing them top-down
- employs a greedy search without backtracking through the space of all possible decision trees
- tends to find a short, but not necessarily the best, decision tree
- key idea: selection of the next attribute according to a statistical measure
- all examples are considered at the same time (simultaneous covering)
- applied recursively, with a shrinking set of selectable attributes, until each training example can be classified unambiguously

ID3 Algorithm
ID3(Examples, Target_attribute, Attributes)
  Create a Root node for the tree
  If all Examples are positive, Return the single-node tree Root with label = +
  If all Examples are negative, Return the single-node tree Root with label = −
  If Attributes is empty, Return the single-node tree Root with label = most common value of Target_attribute in Examples
  Otherwise, Begin
    A ← the attribute in Attributes that best classifies Examples
    The decision attribute for Root ← A
    For each possible value v_i of A
      Add a new branch below Root corresponding to the test A = v_i
      Let Examples_vi be the subset of Examples that have value v_i for A
      If Examples_vi is empty
        Then below this branch add a leaf node with label = most common value of Target_attribute in Examples
        Else below this branch add the subtree ID3(Examples_vi, Target_attribute, Attributes − {A})
  End
  Return Root
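
Below is a minimal Python rendering of this pseudocode, assuming each training example is a dict mapping attribute names (plus the target attribute) to values. The attribute-selection measure is passed in as a function `gain(examples, attribute, target)`, since information gain is only introduced on the following slides; all names are illustrative, not part of the lecture.

```python
from collections import Counter

def id3(examples, target, attributes, gain):
    """Return a nested dict {attribute: {value: subtree_or_label}} or a class label."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                      # all examples positive or all negative
        return labels[0]
    if not attributes:                             # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        # Branch values are taken from the examples themselves, so subset is never
        # empty here; the pseudocode's empty-branch case would get the majority label.
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best], gain)
    return tree
```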

The Best Classifier
central choice: Which attribute classifies the examples best?
ID3 uses the information gain, a statistical measure that indicates how well a given attribute separates the training examples according to their target classification:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
the first term is the original entropy of S, the second the relative entropy of S after partitioning it by A
interpretation: the reduction in entropy caused by partitioning S according to A
alternative view: the number of saved yes/no questions (i.e., bits)
the attribute A with maximal Gain(S, A) is selected!

Entropy
statistical measure from information theory that characterizes the (im-)purity of an arbitrary collection of examples S
for binary classification: $H(S) \equiv -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
for n-ary classification: $H(S) \equiv -\sum_{i=1}^{n} p_i \log_2 p_i$
interpretation: the minimum number of bits of information needed to encode the classification of an arbitrary member of S
alternative view: the number of yes/no questions needed

Entropy
(figure: Entropy(S) plotted against the proportion p+ of positive examples)
minimum of H(S): for minimal impurity (point distribution), H(S) = 0
maximum of H(S): for maximal impurity (uniform distribution)
  for binary classification: H(S) = 1
  for n-ary classification: $H(S) = \log_2 n$
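
These extreme cases can be checked with a few lines of Python (a small illustrative snippet, not from the slides):

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))    # pure collection (point distribution): H = 0
print(entropy([0.5, 0.5]))    # maximal impurity, binary: H = 1
print(entropy([0.25] * 4))    # maximal impurity, n = 4 classes: H = log2(4) = 2
```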

Illustrative Example: training examples

Day   Outlook   Temp.  Humidity  Wind    PlayTennis
D1    Sunny     Hot    High      Weak    No
D2    Sunny     Hot    High      Strong  No
D3    Overcast  Hot    High      Weak    Yes
D4    Rain      Mild   High      Weak    Yes
D5    Rain      Cool   Normal    Weak    Yes
D6    Rain      Cool   Normal    Strong  No
D7    Overcast  Cool   Normal    Strong  Yes
D8    Sunny     Mild   High      Weak    No
D9    Sunny     Cool   Normal    Weak    Yes
D10   Rain      Mild   Normal    Weak    Yes
D11   Sunny     Mild   Normal    Strong  Yes
D12   Overcast  Mild   High      Strong  Yes
D13   Overcast  Hot    Normal    Weak    Yes
D14   Rain      Mild   High      Strong  No

Illustrative Example
entropy of S:
  S = {D1, ..., D14} = [9+, 5−]
  $H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$
information gain (e.g. for Wind):
  S_Weak = {D1, D3, D4, D5, D8, D9, D10, D13} = [6+, 2−]
  S_Strong = {D2, D6, D7, D11, D12, D14} = [3+, 3−]
  $$\mathrm{Gain}(S, \mathrm{Wind}) = H(S) - \sum_{v \in \mathrm{values}(\mathrm{Wind})} \tfrac{|S_v|}{|S|} H(S_v) = H(S) - \tfrac{8}{14} H(S_{\mathrm{Weak}}) - \tfrac{6}{14} H(S_{\mathrm{Strong}}) = 0.940 - \tfrac{8}{14} \cdot 0.811 - \tfrac{6}{14} \cdot 1.000 = 0.048$$

Illustrative Example
Which attribute is the best classifier?
(figure: the sample S = [9+, 5−] with E = 0.940 split by Humidity and by Wind)
Humidity: High → [3+, 4−], E = 0.985; Normal → [6+, 1−], E = 0.592
  Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Wind: Weak → [6+, 2−], E = 0.811; Strong → [3+, 3−], E = 1.000
  Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.000 = 0.048

Illustrative Example
information gains for the four attributes:
  Gain(S, Outlook) = 0.246
  Gain(S, Humidity) = 0.151
  Gain(S, Wind) = 0.048
  Gain(S, Temperature) = 0.029
Outlook is selected as the best classifier and therefore becomes the root of the tree
branches are now created below the root for each of its possible values
because every example with Outlook = Overcast is positive, this node becomes a leaf node with classification Yes
the other descendants are still ambiguous (i.e., H(S) ≠ 0), hence the decision tree has to be further elaborated below them
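
These gains can be reproduced from the training table. The following snippet (variable and function names are illustrative) encodes the 14 examples and computes the information gain of each attribute; passing this `gain` function and the example list to the `id3` sketch from the ID3-algorithm slide should reproduce the tree shown on the following slides.

```python
import math
from collections import Counter

columns = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
examples = [dict(zip(columns, row)) for row in rows]

def entropy(examples, target):
    """H(S) = -sum_i p_i log2 p_i over the target values in the examples."""
    counts = Counter(e[target] for e in examples)
    return -sum(c / len(examples) * math.log2(c / len(examples)) for c in counts.values())

def gain(examples, attribute, target):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

for attribute in ["Outlook", "Humidity", "Wind", "Temperature"]:
    print(attribute, round(gain(examples, attribute, "PlayTennis"), 3))
# Outlook 0.247, Humidity 0.152, Wind 0.048, Temperature 0.029
# (the slide values 0.246 and 0.151 are truncated rather than rounded)
```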


Illustrative Example
resulting decision tree:
Outlook
- Sunny → Humidity
  - High → No
  - Normal → Yes
- Overcast → Yes
- Rain → Wind
  - Strong → No
  - Weak → Yes

Hypothesis Space Search
(figure: ID3's search through the space of partial decision trees, growing trees by adding attribute tests such as A1, A2, A3, A4)
H is the complete space of finite discrete-valued functions, relative to the available attributes (i.e., all possible decision trees)
capabilities and limitations:
- returns just one single consistent hypothesis
- performs a greedy search (i.e., choosing the attribute A that maximizes Gain(S, A)), susceptible to the usual risks of hill-climbing without backtracking
- uses all training examples at each step (simultaneous covering)

Inductive Bias
as mentioned above, ID3 searches the complete space of possible decision trees, but searches it incompletely → preference bias
inductive bias: shorter trees are preferred to longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not
Why prefer shorter hypotheses?
Occam's Razor: prefer the simplest hypothesis that fits the data! (see also the Minimum Description Length principle in Bayesian Learning)
e.g., given two consistent decision trees, one with 500 nodes and another with 5 nodes, the second one should be preferred: better chance of avoiding overfitting

Overfitting
(figure: accuracy over the size of the tree (number of nodes), on training data vs. on test data; training accuracy keeps improving while test accuracy eventually drops)
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the training examples, but h′ has smaller error than h over the entire distribution of instances.

Overfitting
reasons for overfitting:
- noise in the data
- the number of training examples is too small to produce a representative sample of the target function
how to avoid overfitting:
- stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
- allow overfitting and then post-prune the tree (more successful in practice!)
how to determine the right tree size:
- use a separate validation set to evaluate the utility of post-pruning
- apply a statistical test to estimate whether expanding (or pruning) a node produces an improvement

Reduced Error Pruning
each of the decision nodes is considered to be a candidate for pruning
pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node
nodes are removed only if the resulting tree performs no worse than the original tree over the validation set
pruning starts with the node whose removal most increases accuracy and continues until further pruning is harmful
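
A compact sketch of the idea, assuming the nested-dict trees produced by the `id3` sketch earlier and that the training set passed in is the one the tree was grown from. For brevity it prunes bottom-up rather than in the best-first order described above, and all names are illustrative.

```python
from collections import Counter

def classify(tree, example, default="Yes"):
    """Walk a nested-dict tree {attribute: {value: subtree_or_label}} down to a label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(example[attribute], default)
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def reduced_error_prune(tree, training_examples, validation_examples, target):
    """Replace a subtree by the most common training label reaching it whenever
    this does not decrease accuracy on the validation set."""
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        subset = [e for e in training_examples if e[attribute] == value]
        if subset:                                    # branch values stem from the training data
            tree[attribute][value] = reduced_error_prune(subset_tree := subtree, subset,
                                                         validation_examples, target)
    majority = Counter(e[target] for e in training_examples).most_common(1)[0][0]
    if accuracy(majority, validation_examples, target) >= accuracy(tree, validation_examples, target):
        return majority                               # the leaf does at least as well on the validation set
    return tree
```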

Reduced Error Pruning
effect of reduced error pruning:
(figure: accuracy over the size of the tree (number of nodes), on training data, on test data, and on test data during pruning)
any node added to fit coincidental regularities in the training set is likely to be pruned

Rule Post-Pruning
rule post-pruning involves the following steps:
1. Infer the decision tree from the training set (overfitting allowed!)
2. Convert the tree into an equivalent set of rules
3. Prune each rule by removing any preconditions whose removal improves its estimated accuracy
4. Sort the pruned rules by their estimated accuracy
one method to estimate rule accuracy is to use a separate validation set
pruning individual rules is more fine-grained than pruning the tree itself, because each root-to-leaf path is pruned independently
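
Step 2 is mechanical once the tree is available; here is a small sketch for the nested-dict representation used in the earlier ID3 sketch (names are illustrative).

```python
def tree_to_rules(tree, preconditions=()):
    """Yield one (preconditions, classification) pair per root-to-leaf path of a
    nested-dict tree {attribute: {value: subtree_or_label}}."""
    if not isinstance(tree, dict):               # leaf: emit the accumulated rule
        yield list(preconditions), tree
        return
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        yield from tree_to_rules(subtree, preconditions + ((attribute, value),))

# e.g. one resulting rule for the PlayTennis tree:
# ([("Outlook", "Sunny"), ("Humidity", "Normal")], "Yes")
```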

Alternative Measures
natural bias in the information gain: it favors attributes with many values over those with few values
e.g. the attribute Date has a very large number of values (e.g. March 21, 2005)
inserted into the above example, it would have the highest information gain, because it perfectly separates the training data, but the classification of unseen examples would be impossible
alternative measure: the gain ratio
$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$
$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
SplitInformation(S, A) is sensitive to how broadly and uniformly A splits S (it is the entropy of S with respect to the values of A)
GainRatio penalizes attributes such as Date
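
A short sketch of the split information and the resulting penalty on a Date-like attribute (names and the toy data are illustrative; the gain in the numerator would be computed as in the earlier information-gain snippet).

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|), i.e. the entropy
    of S with respect to the values of attribute A."""
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain_ratio(gain, examples, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(examples, attribute)

# An attribute with a distinct value per example (like Date) has the maximal
# possible split information, log2 |S|, which drives its gain ratio down.
examples = [{"Date": f"day {i}"} for i in range(14)]
print(split_information(examples, "Date"))   # log2(14) ≈ 3.81
```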

Summary
- decision tree learning is a practical and intuitively understandable method for concept learning
- able to learn disjunctive, discrete-valued concepts
- noise in the data is allowed
- ID3 is a simultaneous covering algorithm based on information gain that performs a greedy top-down search through the space of possible decision trees
- inductive bias: short trees are preferred (Occam's razor)
- overfitting is an important issue and can be reduced by pruning