Decision Trees (Cont.)


R&N Chapter 18.2, 18.3

Side example with discrete (categorical) attributes: predicting age (3 values: less than 30, 30-45, more than 45 years old) from census data. Attributes (split in that order): Married, Has a child, Widowed, Wealth (rich/poor), Employment type (private or not private), etc.

Side example with both discrete and continuous attributes: predicting MPG (GOOD or BAD) from the attributes Cylinders, Horsepower, Acceleration, Maker (discrete), and Displacement.

The Overfitting Problem: Example
Suppose that, in an ideal world, class B is everything such that X2 >= 0.5 and class A is everything with X2 < 0.5. Note that attribute X1 is irrelevant. It seems like generating a decision tree would be trivial.

The Overfitting Problem: Example
However, we collect training examples from this perfect world through some imperfect observation device; as a result, the training data is corrupted by noise.

Because of the noise, the resulting decision tree is far more complicated than it should be. This is because the learning algorithm tries to classify all of the training set perfectly. This is a fundamental problem in learning: overfitting.

The effect of overfitting is that the tree is guaranteed to classify the training data perfectly, but it may do a terrible job at classifying new test data. Example: the point (0.6, 0.9) is classified as A, even though in the ideal world it belongs to class B (since X2 >= 0.5).

It would be nice to identify automatically that splitting this node is pointless. Possible criterion: figure out that splitting this node will lead to a complicated tree, suggesting noisy data.
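As a concrete illustration of this effect, here is a minimal sketch (my own construction, not from the lecture) that reproduces the experiment with scikit-learn: labels follow the rule X2 >= 0.5, a fraction of them are flipped to simulate the imperfect observation device, and an unpruned tree fits the noise, scoring perfectly on the training set but noticeably worse on clean test data. The make_data helper, the noise level, and the sample sizes are all illustrative choices.

    # Sketch only: noisy-labels overfitting experiment with an unpruned tree.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def make_data(n, noise=0.15):
        X = rng.random((n, 2))            # two attributes X1, X2 in [0, 1]
        y = (X[:, 1] >= 0.5).astype(int)  # "ideal world": class depends only on X2
        flip = rng.random(n) < noise      # imperfect observation device: flip some labels
        y[flip] = 1 - y[flip]
        return X, y

    X_train, y_train = make_data(40)            # small, noisy training set
    X_test, y_test = make_data(2000, noise=0)   # large, clean test set

    tree = DecisionTreeClassifier()             # unpruned: grows until training data is perfect
    tree.fit(X_train, y_train)

    print("training accuracy:", tree.score(X_train, y_train))  # 1.0 by construction
    print("test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower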

Note that, even though the attribute X1 is completely irrelevant in the original distribution, it is used to make the decision at that node.

Possible Overfitting Solutions
Grow the tree based on the training data (unpruned tree), then prune the tree by removing useless nodes based on either:
- additional test data (not used for training), or
- statistical significance tests.

[Figure: the training data with the partitions induced by the unpruned decision tree learned from it. Notice the tiny regions at the top, necessary to correctly classify the class-A outliers.]

Unpruned decision tree from training data: performance (% correctly classified) is 100% on the training data and 77.5% on the test data.

Pruned decision tree from training data: performance is 95% on the training data and 80% on the test data.

Pruning further: performance is 80% on the training data and 97.5% on the test data.

[Figure: % of data correctly classified versus the size of the decision tree, showing the performance on the training set, the performance on the test set, and the tree with the best performance on the test set.]

Using Test Data
[Figure: % correct classification versus size of tree, with the classification rate on the training data and on the test data. In the region of large trees, the tree overfits the training data (including the noise!) and starts doing poorly on the test data.]

General principle: as the complexity of the classifier increases (e.g., the depth of the decision tree), the performance on the training data increases, while the performance on the test data decreases once the classifier overfits the training data. A short sketch illustrating this curve follows below.

Decision Tree Pruning
Construct the entire tree as before. Then, starting at the leaves, recursively eliminate splits:
- Evaluate the performance of the tree on test data (also called validation data, or a hold-out data set), both with the split (1) and with the split removed (2).
- Prune the node if the classification performance on the test set is greater for (2) than for (1), i.e., if removing the split improves performance.
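The "performance versus tree size" curve can be reproduced with a short sweep over tree complexity. The following is a minimal sketch, reusing the make_data generator from the earlier sketch; limiting the number of leaves with scikit-learn's max_leaf_nodes is my stand-in for tree size, and the specific sizes are illustrative.

    # Sketch only: train/test accuracy as a function of tree size.
    from sklearn.tree import DecisionTreeClassifier

    X_train, y_train = make_data(200)             # noisy training set
    X_valid, y_valid = make_data(2000, noise=0)   # held-out data

    for n_leaves in (2, 4, 8, 16, 32, 64):
        tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves).fit(X_train, y_train)
        print(f"{n_leaves:3d} leaves: "
              f"train {tree.score(X_train, y_train):.2f}  "
              f"test {tree.score(X_valid, y_valid):.2f}")
    # Training accuracy keeps climbing with tree size; test accuracy peaks at a
    # small tree and then degrades as larger trees start fitting the noise.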

A Criterion to Detect Useless Splits
The problem is that we split whenever the information gain (IG) increases, but we never check whether the change in entropy is statistically significant.

Reasoning, on a small example:
- The number of class A points in the root node is N_A = 2; the number of class B points is N_B = 7.
- The number of class A points in the left child is N_AL = 1; the number of class B points is N_BL = 4.
- The proportion of the data going to the left child is p_L = (N_AL + N_BL)/(N_A + N_B) = 5/9.

Suppose now that the data were completely randomly distributed (i.e., it does not make sense to split). The expected number of class A points in the left child would be N'_AL = N_A x p_L = 10/9, and the expected number of class B points in the left child would be N'_BL = N_B x p_L = 35/9.

Question: are N_AL and N_BL sufficiently different from N'_AL and N'_BL? If not, the split is not statistically significant and we should not split the root: the resulting children are not significantly different from what we would get by splitting a random distribution at the root node.

A Criterion to Detect Useless Splits
Measure of statistical significance:
K = (N_AL - N'_AL)^2/N'_AL + (N_BL - N'_BL)^2/N'_BL + (N_AR - N'_AR)^2/N'_AR + (N_BR - N'_BR)^2/N'_BR

K measures how much the split deviates from what we would get if the data were random. If K is small, the increase in IG from the split is not significant. In this example:
K = (10/9 - 1)^2/(10/9) + (35/9 - 4)^2/(35/9) + ... = 0.0321

χ2 Criterion: General Case
A node with N data points is split into children with N_L and N_R data points, receiving proportions P_L and P_R of the data. More generally, for any number of children:
K = sum over all classes i and all children j of (N_ij - N'_ij)^2 / N'_ij
where N_ij is the number of points from class i in child j, and N'_ij = N_i x P_j is the number of points from class i in child j assuming a random selection.
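To check the arithmetic, here is a small sketch (mine, not part of the lecture) that evaluates K for the worked example above, where the root holds 2 class-A and 7 class-B points and the left child holds 1 class-A and 4 class-B points; it reproduces the value 0.0321. The function name and the list-of-tuples representation are hypothetical conveniences.

    # Sketch only: computes the chi-square-style statistic K for the slide's example.
    def chi_square_K(counts_per_child):
        """counts_per_child[j][i] = number of points of class i in child j."""
        n_classes = len(counts_per_child[0])
        class_totals = [sum(child[i] for child in counts_per_child) for i in range(n_classes)]
        total = sum(class_totals)
        K = 0.0
        for child in counts_per_child:
            p_j = sum(child) / total                # proportion of data going to this child
            for i in range(n_classes):
                expected = class_totals[i] * p_j    # N'_ij = N_i x P_j
                K += (child[i] - expected) ** 2 / expected
        return K

    # Left child: 1 point of class A, 4 of class B; right child: 1 of class A, 3 of class B.
    print(round(chi_square_K([(1, 4), (1, 3)]), 4))  # -> 0.0321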

χ2 Criterion: General Case
Small K (chi-square) values indicate low statistical significance. Remove the splits whose K is lower than a threshold t (K < t). A lower t gives bigger trees (more overfitting); a larger t gives smaller trees (less overfitting, but worse classification error). Each term (N_ij - N'_ij)^2 / N'_ij of K measures the difference between the distribution of class i under the proposed split and the distribution obtained by randomly drawing data points in the same proportions as the proposed split.

Decision Tree Pruning
Construct the entire tree as before. Then, starting at the leaves, recursively eliminate splits:
- At a leaf N, compute the K value for N and its parent P.
- If the K value is lower than the threshold t, eliminate all of the children of P; P becomes a leaf.
- Repeat until no more splits can be eliminated. A sketch of this procedure follows below.
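The following is a small sketch of this bottom-up procedure (my own formulation, reusing the chi_square_K function from the previous sketch). The nested-dict tree representation and the threshold value are assumptions for illustration; 3.84 is the standard 95% chi-square table value for 1 degree of freedom, which applies to a 2-class, 2-child split.

    # Sketch only: bottom-up chi-square pruning on a toy tree representation.
    # A node is a dict {"counts": [per-class counts], "children": [child nodes]};
    # a leaf has an empty children list.
    def prune(node, t):
        for child in node["children"]:
            prune(child, t)                                   # prune the subtrees first
        if node["children"] and all(not c["children"] for c in node["children"]):
            K = chi_square_K([c["counts"] for c in node["children"]])
            if K < t:                                         # split not significant:
                node["children"] = []                         # eliminate it, node becomes a leaf
        return node

    # The toy split from the slides: root [2, 7] split into [1, 4] and [1, 3].
    root = {"counts": [2, 7],
            "children": [{"counts": [1, 4], "children": []},
                         {"counts": [1, 3], "children": []}]}
    prune(root, t=3.84)            # 3.84 ~ 95% chi-square threshold, 1 degree of freedom
    print(root["children"])        # [] -> the insignificant split (K = 0.0321) was removed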

[Figure: the unpruned tree with the K value of each split annotated: K = 10.58, K = 0.0321, and K = 0.83. The gains obtained by the splits with small K are not significant.]
By thresholding K we end up with the decision tree that we would expect, i.e., one that does not overfit the data. Note: the approach is presented with continuous attributes in this example, but it works just as well with discrete attributes.

χ2 Pruning
The test on K is a version of a standard statistical test, the χ2 ("chi-square") test. The value of t is retrieved from statistical tables; for example, K > t means that, with 95% confidence, the information gain due to the split is significant. If K < t, then with high confidence the information gain will be 0 over very large training samples. Chi-square pruning reduces overfitting and eliminates irrelevant attributes.

Example (a short fitting sketch follows the table):
Class       Sepal Length (SL)   Sepal Width (SW)   Petal Length (PL)   Petal Width (PW)
Setosa      5.1                 3.5                1.4                 0.2
Setosa      4.9                 3.0                1.4                 0.2
Setosa      5.4                 3.9                1.7                 0.4
Versicolor  5.2                 2.7                3.9                 1.4
Versicolor  5.0                 2.0                3.5                 1.0
Versicolor  6.0                 2.2                4.0                 1.0
Virginica   6.4                 2.8                5.6                 2.1
Virginica   7.2                 3.0                5.8                 1.6
(50 examples from each class.)
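Here is a minimal sketch of this example in code; scikit-learn and scipy are my stand-ins, since the lecture does not show its own implementation. It fits an unpruned tree on the two petal measurements (as in the plots that follow) and looks up the 95% chi-square threshold the slides would read from a table, assuming the usual degrees of freedom (classes - 1) x (children - 1) = (3 - 1) x (2 - 1) = 2 for a binary split of three classes.

    # Sketch only: unpruned tree on the iris data, plus the tabled chi-square threshold.
    from scipy.stats import chi2
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X = iris.data[:, 2:4]                     # petal length (PL), petal width (PW)
    y = iris.target                           # Setosa / Versicolor / Virginica, 50 each

    full_tree = DecisionTreeClassifier().fit(X, y)   # unpruned: fits every training point
    print(export_text(full_tree, feature_names=["PL", "PW"]))

    print("t =", chi2.ppf(0.95, df=2))        # ~5.99, the 95% threshold for 2 degrees of freedom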

Full Decision Tree
[Figure: the full decision tree and the induced partition of the (petal length, petal width) plane.]

Pruning One Level
[Figure: the tree after pruning one level.]

Pruning Two Levels
[Figure: the tree after pruning two levels and its partition of the (petal length, petal width) plane.]

Unpruned
[Figure: the unpruned tree. One node has a 27% probability of being a chance node according to the χ2 test; that node should be pruned.]

Pruned
[Figure: the resulting pruned tree.]

Note: Inductive Learning
The decision tree approach is one example of an inductive learning technique:
- Suppose that the data x is related to the output y by an unknown function y = f(x).
- Suppose that we have observed training examples {(x_1, y_1), ..., (x_n, y_n)}.
- Inductive learning problem: recover a function h (the "hypothesis") such that h(x) ≈ f(x), so that y = h(x) predicts y from the input data x.

The challenge: the hypothesis space (the space of all hypotheses h of a given form, for example the space of all possible decision trees over a set of M attributes) is huge, and many different hypotheses may agree with the training data.

[Figure: training data x_1, x_2, ..., x_n with their y values, and a hypothesis curve h(x) fit through them.]
What property should h have? It should agree with the training data.

Inductive Learning
[Figure: two "stupid" hypotheses that fit the training data perfectly.]
What property should h have? It should agree with the training data. But that can lead to arbitrarily complex hypotheses, and there are many of them; which one should we choose?

[Figure: at a new input x_o, the value predicted by the hypothesis, h(x_o), is far from the true value f(x_o).]
Agreeing with the training data can lead to arbitrarily complex hypotheses, which in turn leads to completely wrong predictions on new test data: the model does not generalize beyond the training data, it overfits the training data.

Inductive Learning
[Figure: a complex hypothesis with poor generalization versus a simpler hypothesis with better generalization at a new input x_o.]
Simplicity principle (Occam's razor): "entities are not to be multiplied beyond necessity." The simpler hypothesis is preferred. We compromise between:
- the error on the data under hypothesis h, and
- the complexity of hypothesis h.

[Figure: a different illustration of the same concept, in a classification setting (y = A vs. y = B): a complex decision boundary with poor generalization versus a simpler boundary with better generalization.]
A sketch of this trade-off, applied to decision trees, follows below.
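For decision trees specifically, one concrete realization of this compromise is cost-complexity pruning, which minimizes (error on the data) + alpha x (number of leaves). The sketch below uses scikit-learn's implementation, my choice of tool rather than the lecture's, and reuses the make_data generator from the earlier overfitting sketch; the sample sizes are illustrative.

    # Sketch only: cost-complexity pruning trades training error against tree size.
    from sklearn.tree import DecisionTreeClassifier

    X_train, y_train = make_data(200)
    X_test, y_test = make_data(2000, noise=0)

    path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)
        print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves():3d}  "
              f"train={tree.score(X_train, y_train):.2f}  "
              f"test={tree.score(X_test, y_test):.2f}")
    # As alpha grows, the tree shrinks: training accuracy drops slightly while
    # test accuracy typically improves, until the tree becomes too simple.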

Inductive Learning
The decision tree is one example of inductive learning. To be covered next: instance-based learning and clustering, and neural networks. In all cases, we minimize: error on the data + complexity of the model.

Decision Trees
- Information Gain (IG) criterion for choosing the split at each level of the tree, with versions for continuous and for discrete (categorical) attributes.
- The basic tree-learning algorithm leads to overfitting of the training data.
- Pruning with: additional test data (not used for training), or statistical significance tests.
- Decision tree learning is an example of inductive learning.