Generalization Error on Pruning Decision Trees


Ryan R. Rosario
Computer Science 269, Fall 2010

A decision tree is a predictive model that can be used for either classification or regression [3]. Decision tree construction algorithms take a set of observations X and a vector of labels Y and build a tree that models the data through specific splits on the values of the input variables. Each interior node of such a tree tests a single input variable, and each leaf holds the predicted class label for any observation whose values follow the path from the root to that leaf.

One of the major problems with decision trees is the tendency of a raw decision tree to overfit the training data. In such cases the tree contains many root-to-leaf paths that have no statistical validity and add little predictive power to the model [2]; that is, the model does not generalize well to new, unseen values of X. If T is a decision tree and S is the sample of observations used to build T, then pruning is the process of finding a subtree of T whose generalization error bound is not much worse than that of the optimal pruning, the one that actually minimizes the generalization error associated with T [4]. In this paper, I describe two benchmark pruning methods whose error bounds depend on the data itself, not merely on the size of the tree or of the sample.

The layout of this paper is as follows. Section 1 gives a short taxonomy of decision trees and the mechanics behind their construction. Section 2 discusses two methods for pruning decision trees, along with an analysis of the generalization error associated with each. Section 3 concludes with a comparison of the algorithms and of the generalization error bounds of the two pruning methods.

1 Introduction to Decision Trees

In the decision tree problem, we have a set of attributes X and a set of discrete labels Y. In this paper, I focus on classification trees, where each observation has a categorical label associated with it.
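To make this structure concrete, the following minimal sketch (my own illustration, not code from any of the papers discussed here) shows one way to represent such a tree and to classify an observation by following the path from the root to a leaf:

    # Minimal binary classification tree: internal nodes test one attribute,
    # leaves store a class label. (Illustrative sketch only.)
    class Node:
        def __init__(self, attribute=None, threshold=None,
                     left=None, right=None, label=None):
            self.attribute = attribute    # index of the attribute tested at this node
            self.threshold = threshold    # split point for that attribute
            self.left = left              # subtree for x[attribute] <= threshold
            self.right = right            # subtree for x[attribute] > threshold
            self.label = label            # class label if this node is a leaf

        def is_leaf(self):
            return self.label is not None

    def predict(node, x):
        """Follow the path from the root to a leaf and return the leaf's label."""
        while not node.is_leaf():
            node = node.left if x[node.attribute] <= node.threshold else node.right
        return node.label

In this representation, pruning amounts to replacing an internal Node and everything below it with a single leaf carrying a label.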

Another form of decision tree, the regression tree, attempts to predict a real-valued label for each observation [3].

There are several ways to construct decision trees and several metrics that can be used. At each iteration, a decision tree algorithm chooses an attribute on which to split the data and assigns it to a node. Several loss functions can be used to decide which variable to assign to a node, including entropy [7] and the Gini index [6]. Common algorithms for constructing decision trees include ID3 [7], CART [1], and C4.5 [8]. Figure 1 shows a trivial decision tree that maps a weather outlook to a decision to play tennis or not.

Figure 1: a decision tree that takes as input a weather outlook and outputs whether or not to play tennis. Source: http://www.cs.cmu.edu/afs/cs/project/theo-20/www/mlbook/ch3.pdf
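As an aside, both impurity measures are simple to state in code. The sketch below is my own illustration and does not claim to match any particular implementation of ID3, CART, or C4.5:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (base 2) of the empirical class distribution."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        """Gini impurity of the empirical class distribution."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    # Example: a mixed node has positive impurity under both measures.
    print(entropy(["yes", "yes", "no"]), gini(["yes", "yes", "no"]))

A split is then chosen to maximize the reduction in impurity it produces (the information gain, in the case of entropy).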

2 Pruning Decision Trees

Much research has been done on pruning decision trees to prevent overfitting, the situation in which too many parameters (variable splits) are used to build the model. According to [5], the standard technique for dealing with this problem is to prove a bound on the generalization error as a function of both the training error and the concept size or VC dimension, and then to select the concept that minimizes this bound. A typical error bound, proven in [5], is

    \epsilon(T) \le \hat{\epsilon}(T) + \sqrt{\frac{(\log 2)\,|T| + \log(1/\delta)}{2\,|S|}}    (1)

which states that, for a concept class C in which every concept T can be represented by a bit string of length |T|, with probability at least 1 - δ over the sample S the true error of every concept T is bounded above by the sum of its training error ε̂(T) and a penalty that depends only on |S|, |T|, and δ.

2.1 Pruning by Finding Optimal Root Fragments

The major shortcoming of the bound in (1) is that it depends only on the size of T and on the number of observations |S|. Statisticians, particularly those who study non-parametric statistics, would argue that such error bounds should be constructed from the data itself and should not depend solely on the size of the model or of the data.

2.1.1 Error Bound on Pruning

In [5], an error bound that depends on the concept T as well as on the sample S is derived. First, the nodes of T are divided, arbitrarily, into shallow nodes and deep nodes. Take R to be a set of shallow nodes that forms a subtree of T in which each node is either a leaf of R or the parent of two other nodes in R; such an R is called a root fragment. Let L(R) be the set of leaves of the subtree R. Then, with probability at least 1 - δ, every decision tree T (in the class T(H, S) of trees over the given predicate class H) satisfies

    \epsilon(T) \le \hat{\epsilon}(T) + \min_{R} f(T, S, R, \delta)    (2)

where

    f(T, S, R, \delta) = \sum_{v \in L(R)} \frac{|S_v|}{|S|} \left( \sqrt{\frac{|T_v| \log 2}{2\,|S_v|}} + \sqrt{\frac{|A_v| + 2}{2\,|S_v|}} + \sqrt{\frac{2 \log(2/\delta)}{|S_v|}} \right).

Here v denotes a vertex of the tree, A_v is the path of predicates from the root to v, |S_v| is the number of observations that reach v, |S| is the size of the sample, |T_v| is the number of nodes in the subtree of T rooted at v, and |A_v| is the number of terms in the predicate that leads to v.

To prune the decision tree T efficiently, we want to select subtrees T_v that minimize ε̂(T) + f(T, S, R, δ), and each subtree T_v can be optimized independently. Being able to optimize these subtrees independently is a big win, because the optimizations can be parallelized to speed up computation. The authors indicate that the penalty they devise is designed to penalize prunings that yield overly complex decision tree models and to favor smaller, simpler ones.
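To make the role of each quantity explicit, here is a small sketch of the contribution of a single root-fragment leaf v to f(T, S, R, δ). It follows the reconstruction of the bound given above, so the exact constants should be checked against [5]; the function and argument names are mine, and its three square-root terms are the ones discussed in the next paragraph:

    import math

    def leaf_penalty(S_v, S, T_v, A_v, delta):
        """Contribution of one root-fragment leaf v to f(T, S, R, delta).

        S_v : number of training observations reaching v
        S   : total sample size
        T_v : number of nodes in the subtree of T rooted at v
        A_v : number of predicate terms on the path from the root to v
        """
        weight = S_v / S
        size_term = math.sqrt(T_v * math.log(2) / (2 * S_v))
        path_term = math.sqrt((A_v + 2) / (2 * S_v))
        conf_term = math.sqrt(2 * math.log(2 / delta) / S_v)
        return weight * (size_term + path_term + conf_term)

    # f(T, S, R, delta) is the sum of leaf_penalty over all leaves of the root fragment R.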

The leading factor |S_v|/|S| weights the penalty at each fragment leaf by the fraction of the sample that reaches it; the net effect is that the bound favors smaller trees. The first term in the parentheses adds a penalty at nodes where the size of the subtree rooted at v is large relative to the number of observations reaching v; if |S_v| >> |T_v|, only a small (perhaps negligible) amount is added. The second term penalizes subtrees containing nodes with a large number of predicates relative to the number of observations reaching v. In statistical modeling it is preferable to have n >> p [3]; that is, we want the number of observations to be much greater than the number of parameters, otherwise we suffer from overfitting. The requirement that n >> p is analogous to |S_v| >> |A_v|. The final term relates the number of observations reaching v to δ; since δ is very small, this term is generally negligible when |S_v| is large.

2.1.2 Choosing an Optimal Root Fragment

The error bound discussed above assumes that we know the best root fragment R. To take advantage of it, we must find the best root fragment of T. The authors note that the best root fragment of T either consists of the root alone or consists of the root together with the best root fragments of the left and right subtrees of T. This observation yields a bottom-up algorithm for finding the optimal root fragment. Let compute_R(T_v, S, A_v, δ) compute the best root fragment of the subtree T_v, and let eval(T_v, S_v, A_v, δ) compute the penalty incurred by terminating the root fragment at the root of T_v. The authors then define

    \mathrm{compute\_R}(\mathrm{IF}(B, T_l, T_r), S, d, \delta) = \min\Big\{ \mathrm{eval}(\mathrm{IF}(B, T_l, T_r), S, d, \delta),
        \frac{|S_B|}{|S|}\,\mathrm{compute\_R}(T_l, S_B, d + |B| + 2, \delta) + \frac{|S_{\bar{B}}|}{|S|}\,\mathrm{compute\_R}(T_r, S_{\bar{B}}, d + |B| + 2, \delta) \Big\}

where B is the predicate at the root, T_l and T_r are the left and right subtrees, and S_B and S_{\bar{B}} are the observations that satisfy and fail B, respectively. If the predicates in T can be evaluated in constant time, then this recurrence runs in time proportional to the sum, over all nodes, of the number of observations reaching each node. Using this algorithm, we obtain the error bound (2).
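The shape of this recursion can be sketched as follows. This is my own rendering, with eval left abstract exactly as in the text and a hypothetical tuple representation for internal nodes:

    def compute_R(tree, S, d, delta, eval_fn):
        """Minimum penalty over root fragments of `tree` (sketch of the recurrence).

        tree    : a leaf object, or a tuple (B, B_size, T_l, T_r) where B is the root
                  predicate (a boolean function of an observation) and B_size is the
                  number of terms in B
        S       : list of observations reaching this node
        d       : number of predicate terms accumulated on the path so far
        eval_fn : penalty for terminating the root fragment at this node
        """
        if not isinstance(tree, tuple):             # a leaf: the fragment must stop here
            return eval_fn(tree, S, d, delta)
        B, B_size, T_l, T_r = tree
        stop_here = eval_fn(tree, S, d, delta)      # terminate the fragment at this node
        S_B = [x for x in S if B(x)]                # observations satisfying B
        S_not_B = [x for x in S if not B(x)]        # observations failing B
        grow = 0.0
        for S_side, T_side in ((S_B, T_l), (S_not_B, T_r)):
            if S_side:                              # an empty side contributes nothing
                grow += (len(S_side) / len(S)) * compute_R(
                    T_side, S_side, d + B_size + 2, delta, eval_fn)
        return min(stop_here, grow)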

2.2 Sample-Based Pruning Based on Locality

Kearns and Mansour propose an alternative method in [4] in which the decision to prune a subtree is based solely on the data reaching the root of that subtree, with a new penalty that depends only on the size of the pruned tree. They also prove a strict performance guarantee for the algorithm stemming from this analysis.

First, the authors hypothesize an optimal pruning of T, denoted T_opt, that minimizes the generalization error and against which all candidate prunings are compared. Iterating through the nodes of the tree, the question of interest at each node v is whether to prune T_v and replace it with a leaf labeled by a majority vote of the data reaching v. Let P_v be the distribution over only those observations x whose path through T passes through v. The observed error incurred by replacing T_v with its best leaf is

    \hat{\epsilon}_v(\emptyset) = \frac{1}{|S_v|} \min\big\{ |\{x \in S_v : f(x) = 0\}|,\ |\{x \in S_v : f(x) = 1\}| \big\},

which is simply the fraction of the observations reaching v that lie in the minority class. The tree is traversed bottom-up, so by the time node v is reached every node below v has already been processed. The subtree T_v is pruned if

    \hat{\epsilon}_v(T_v) + \alpha(m_v, s_v, \ell_v, \delta) \ge \hat{\epsilon}_v(\emptyset),

where ε̂_v(T_v) is the fraction of errors T_v makes on S_v and, when the number of attributes n is finite,

    \alpha(m_v, s_v, \ell_v, \delta) = c \sqrt{\frac{(\ell_v + s_v)\log n + \log(m/\delta)}{m_v}}

for a constant c, with m_v = |S_v|, s_v the number of nodes in T_v, ℓ_v the depth of node v in T, and m the total sample size. An important feature of this algorithm is that the comparison between the subtree T_v and the best leaf is made in terms of how the generalization error changes locally as a result of the pruning decision, not in terms of the entire tree. Using local uniform convergence, one obtains the generalization error bound

    \epsilon(T) \le \epsilon_{\mathrm{opt}} + O\!\left( \sqrt{\frac{s_{\mathrm{opt}} \left( \log\frac{s_{\mathrm{opt}}}{\delta} + \ell_{\mathrm{opt}} \log n + \log m \right)}{m}} \right),

where s_opt and ℓ_opt are the size and depth of T_opt. This bound has two major consequences. First, as the optimal pruning gets deeper, the order of the generalization error grows. This makes sense because deeper trees are more complex than shallower trees, and overfitting is typically associated with overly complex, often overly deep, trees. Second, the larger the sample size m, the smaller the generalization error. This again recalls the classic model fitting rule in statistics that we want n >> p: a large sample and a comparatively small set of parameters.
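A schematic sketch of this pruning test follows. It is written in the spirit of the rule above and is not the authors' implementation; the constant c, the helper names, and the data representation are my own assumptions:

    import math
    from collections import Counter

    def leaf_error(labels):
        """Error of the best single leaf at v: fraction of labels in the minority class."""
        counts = Counter(labels)
        return 1.0 - max(counts.values()) / len(labels)

    def alpha(m_v, s_v, l_v, delta, n, m, c=1.0):
        """Local penalty at v, finite-attribute case, per the reconstruction above.

        m_v: observations reaching v; s_v: nodes in T_v; l_v: depth of v;
        n: number of attributes; m: total sample size; c: unspecified constant.
        """
        return c * math.sqrt(((l_v + s_v) * math.log(n) + math.log(m / delta)) / m_v)

    def should_prune(subtree_error, labels_at_v, s_v, l_v, delta, n, m):
        """Collapse T_v into its best leaf if the leaf is not much worse than the subtree."""
        m_v = len(labels_at_v)
        return subtree_error + alpha(m_v, s_v, l_v, delta, n, m) >= leaf_error(labels_at_v)

In a full bottom-up pass, subtree_error at v would be recomputed after the nodes below v have already been (possibly) pruned, which is exactly the processing order described above.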

3 Discussion and Conclusion

In this paper, I presented two benchmark papers on the analysis of pruning for decision trees. Both methods use pruning as a vehicle for reducing the generalization error and the degree to which a model T overfits the data S. Both feature a bottom-up sweep over the nodes of T in linear time, and both prune the decision tree as they go, computing a loss that determines when to prune.

There are some subtle differences between the two methods. The method proposed by Mansour and McAllester relies on first computing the optimal root fragment R of T, via an abstract procedure compute_R that takes a subtree T_v as input. The method moves through the tree and computes a penalty, eval, for terminating the root fragment at a given node. In this sense, the algorithm has a generative feel: it builds up a subtree and terminates when a condition is met. The bound on the generalization error embeds several penalties: a penalty on complicated predicates, a penalty on the size of a subtree relative to the number of observations reaching a particular node v, and an overall weighting by the proportion of observations reaching v.

The method proposed by Kearns and Mansour does not rely on finding any root fragment. Instead, the algorithm decides at each node whether to replace the current subtree with its best leaf. Unlike Mansour and McAllester's penalty, the loss depends on the depth of the node in addition to the size of the subtree. In this sense, the algorithm has a more destructive feel: it collapses subtrees into leaf nodes when a certain condition is met.

One important difference between the two approaches is the amount of information on which the error bound depends. [5] uses much more information about the intricacies of the data within the decision tree, whereas [4] relies only on the sample size, the tree size, and the depth. I suspect that, over the set of all trees T, the error bound presented in [5] is quite jagged, and getting a handle on its numerical value would be somewhat unpredictable unless data analysis is performed beforehand. The error bound presented in [4], on the other hand, requires less information, appears to be smoother, and does not place such an emphasis on the intricacies of the data. From these observations, I suspect that Mansour and McAllester's method would yield superior performance on small decision trees, whereas Kearns and Mansour's method would yield superior performance on much larger decision trees built from larger samples.

References

[1] H. R. Bittencourt and R. T. Clarke. Use of classification and regression trees (CART) to classify remotely-sensed digital images. In Geoscience and Remote Sensing Symposium (IGARSS '03), Proceedings, 2003 IEEE International, volume 6, pages 3751-3753. IEEE, 2004.

[2] M. Bramer. Using J-pruning to reduce overfitting in classification trees. Knowledge-Based Systems, 15(5-6):301-308, 2002.

[3] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83-85, 2005.

[4] M. Kearns and Y. Mansour. A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Proceedings of the 15th International Conference on Machine Learning, pages 269-277, 1998.

[5] Y. Mansour and D. McAllester. Generalization bounds for decision trees. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 69-74, 2000.

[6] J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4):319-342, 1989.

[7] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[8] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the National Conference on Artificial Intelligence, pages 725-730, 1996.