SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION


SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Alin Viorel Dobra
August 2003

© 2003 Alin Viorel Dobra
ALL RIGHTS RESERVED


SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION
Alin Viorel Dobra, Ph.D.
Cornell University 2003

Automating the learning process is one of the long-standing goals of Artificial Intelligence and its more recent specialization, Machine Learning. Supervised learning is a particular learning task in which the goal is to establish the connection between some of the attributes of the data made available for learning, called attribute variables, and the remaining attributes, called predicted attributes. This thesis is concerned exclusively with supervised learning using tree-structured models: classification trees for predicting discrete outputs and regression trees for predicting continuous outputs.

In the case of classification and regression trees, most methods for selecting the split variable have a strong preference for variables with large domains. Our first contribution is a theoretical characterization of this preference and a general corrective method that can be applied to any split selection method. We further show how the general corrective method can be applied to the Gini gain for discrete variables when building k-ary splits.

In the presence of large amounts of data, efficiency of the learning algorithms with respect to the computational effort and memory requirements becomes very important. Our second contribution is a scalable construction algorithm for regression trees with linear models in the leaves. The key to scalability is to use the EM Algorithm for Gaussian Mixtures to locally reduce the regression problem to a much easier classification problem.

The use of strict split predicates in classification and regression trees has undesirable properties like data fragmentation and sharp decision boundaries, properties that result in decreased accuracy. Our third contribution is a generalization of the classic classification and regression trees that allows probabilistic splits in a manner that significantly improves the accuracy but, at the same time, does not significantly increase the computational effort required to build these types of models.

Biographical Sketch

Alin Dobra was born on September 20th, 1974 in Bistriţa, Romania. He received a B.S. in Computer Science from the Technical University of Cluj-Napoca, Romania, in June. He expects to receive a Ph.D. in Computer Science from Cornell University in August 2003. In the summers of 1991 and 1992, he interned at Bell Laboratories in Murray Hill, NJ, where he worked with Minos Garofalakis and Rajeev Rastogi. In Fall 2003 he is joining the Department of Computer and Information Science and Engineering at the University of Florida, Gainesville, as an Assistant Professor.

Părinţilor mei (To my parents)

Acknowledgements

First and foremost I would like to thank my thesis adviser, Professor Johannes Gehrke. This thesis would not have been possible without his guidance and support over the last three years. Many thanks and my love go to my wife, Delia, who has been there for me all these years. I do not even want to imagine how my life would have been without her and her support. My eternal gratitude goes to my parents, especially my father, who put my education above their personal comfort for more than 20 years. They encouraged and supported my scientific curiosity from an early age even though they never got the chance to pursue their own scientific dreams. I hope this thesis will bring them much personal satisfaction and pride. I met many great people during my five-year stay at Cornell University. I thank them all.

Table of Contents

1 Introduction
  1.1 Our Contributions
    Bias and bias correction in classification tree construction
    Scalable linear regression tree construction
    Probabilistic classification and regression trees
  1.2 Thesis Overview and Prerequisites
    Prerequisites
    Thesis Overview

2 Classification and Regression Trees
  2.1 Classification
  2.2 Classification Trees
  2.3 Building Classification Trees
    Tree Growing Phase
    Pruning Phase
  2.4 Regression Trees

3 Bias Correction in Classification Tree Construction
  Introduction
  Preliminaries
  Split Selection
  Bias in Split Selection
  A Definition of Bias
  Experimental Demonstration of the Bias
  Correction of the Bias
  A Tight Approximation of the distribution of Gini Gain
  Computation of the Expected Value and Variance of Gini Gain
  Approximating the Distribution of Gini Gain with Parametric Distributions
  Experimental Evaluation
  Discussion

4 Scalable Linear Regression Trees
  Introduction
  Preliminaries: EM Algorithm for Gaussian Mixtures
  Previous solutions to linear regression tree construction
  Quinlan's construction algorithm
  Karalic's construction algorithm
  Chaudhuri et al.'s construction algorithm
  Scalable Linear Regression Trees
  Efficient Implementation of the EM Algorithm
  Split Point and Attribute Selection
  Empirical Evaluation
  Experimental testbed and methodology
  Experimental results: Accuracy
  Experimental results: Scalability
  Discussion

5 Probabilistic Decision Trees
  Introduction
  Probabilistic Decision Trees (PDTs)
  Generalized Decision Trees (GDTs)
  From Generalized Decision Trees to Probabilistic Decision Trees
  Speeding up Inference with PDTs
  Learning PDTs
  Computing sufficient statistics for PDTs
  Adapting DT algorithms to PDTs
  Split Point Fluctuations
  Empirical Evaluation
  Experimental Setup
  Experimental Results: Accuracy
  Experimental Results: Running Time
  Related Work
  Discussion

6 Conclusions

A Probability and Statistics Notions
  A.1 Basic Probability Notions
    A.1.1 Probabilities and Random Variables
  A.2 Basic Statistical Notions
    A.2.1 Discrete Distributions
    A.2.2 Continuous Distributions

B Proofs for Chapter 3
  B.0.3 Variance of the Gini gain random variable
  B.0.4 Mean and Variance of χ²-test for two class case

List of Tables

1.1 Example Training Database
 P-values at point x for parametric distributions as a function of expected value, µ, and variance, σ²
 Experimental moments and predictions of moments for N = 100, n = 2, p_1 = .5 obtained by Monte Carlo simulation with repetitions. -T are theoretical approximations, -E are experimental approximations
 Experimental moments and predictions of moments for N = 100, n = 10, p_1 = .5 obtained by Monte Carlo simulation with repetitions. -T are theoretical approximations, -E are experimental approximations
 Experimental moments and predictions of moments for N = 100, n = 2, p_1 = .01 obtained by Monte Carlo simulation with repetitions. -T are theoretical approximations, -E are experimental approximations
 Accuracy on real (upper part) and synthetic (lower part) datasets of GUIDE and SECRET. In parenthesis we indicate O for orthogonal splits. The winner is in bold font if it is statistically significant and in italics otherwise
 Datasets used in experiments; top for classification and bottom for regression
 Classification tree experimental results
 Constant regression trees experimental results
 Linear regression trees experimental results
B.1 Formulae for expressions over random vector [X_1 ... X_k] distributed Multinomial(N, p_1, ..., p_k)

List of Figures

1.1 Example of classification tree for training data in Table 1.1
 Classification Tree Induction Schema
 Summary of notation for Chapter 3
 Contingency table for a generic dataset D and attribute variable X
 The bias of the Gini gain
 The bias of the information gain
 The bias of the gain ratio
 The bias of the p-value of the χ²-test (using a χ²-distribution)
 The bias of the p-value of the G²-test (using a χ²-distribution)
 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p_1 = .5
 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 10 and p_1 = .5
 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p_1 = .01
 Probability density function of Gini gain for attribute variables X_1 and X_2
 Bias of the p-value of the Gini gain using the gamma correction
 Example of situation where average based decision is different from linear regression based decision
 Example where classification on sign of residuals is unintuitive
 SECRET algorithm
 Projection on X_r, Y space of training data
 Projection on X_d, X_r, Y space of same training data as in the previous figure
 Separator hyperplane for two Gaussian distributions in two dimensional space
4.7 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset 3DSin (3 continuous attributes)
 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset Fried (11 continuous attributes)
 Running time of SECRET with linear regressors as a function of the number of attributes for dataset 3Dsin
 Accuracy of the best quadratic approximation of the running time for dataset 3Dsin
 Running time of SECRET with linear regressors as a function of the size of the 3Dsin dataset
 Accuracy as a function of learning time for SECRET and GUIDE with four sampling proportions
 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with constant regressors, for synthetic dataset Fried (11 continuous attributes)
 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with linear regressors, for synthetic dataset Fried (11 continuous attributes)
B.1 Dependency of the function (1 − 6p_1 + 6p_1²)/(p_1(1 − p_1)) on p_1

Chapter 1

Introduction

Automating the learning process is one of the long-standing goals of Artificial Intelligence and its more recent specialization, Machine Learning, but it is also the core goal of newer research areas like Data-mining. The ability to learn from examples has found numerous applications in the scientific and business communities; the applications include scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Brachman et al., 1996; Inman, 1996; Fayyad et al., 1996), since it allows the identification of interesting patterns or connections either in the examples provided or, more importantly, in the natural or artificial process that generated the data.

In this thesis we are only concerned with data presented in tabular format; we call each column an attribute and we associate a name with it. Attributes whose domain is numerical are called numerical attributes, whereas attributes whose domain is not numerical are called categorical attributes. An example of a dataset about people living in a metropolitan area is depicted in Table 1.1. Attribute Car Type of this dataset is categorical and attribute Age is numerical.

Two types of learning tasks have been identified in the literature: unsupervised and supervised learning.

Table 1.1: Example Training Database

  Car Type   Driver Age   Children   Lives in Suburb?
  sedan      23           0          yes
  sports     31           1          no
  sedan      36           1          no
  truck      25           2          no
  sports     30           0          no
  sedan      36           0          no
  sedan      25           0          yes
  truck      36           1          no
  sedan      30           2          yes
  sedan      31           1          yes
  sports     25           0          no
  sedan      45           1          yes
  sports     23           2          no
  truck      45           0          yes

They differ in the semantics associated with the attributes of the learning examples and their goals. The general goal of unsupervised learning is to find interesting patterns in the data, patterns that are useful for a higher level understanding of the structure of the data. Types of interesting patterns that are useful are: groupings or clusters in the data as found by various clustering algorithms (see for example the excellent surveys (Berkhin, 2002; Jain et al., 1999)), and frequent item-sets (Agrawal & Srikant, 1994; Hipp et al., 2000). Unsupervised learning techniques usually assign the same role to all the attributes. Supervised learning tries to determine a connection between a subset of the attributes, called the inputs or attribute variables, and the dependent attribute or outputs.¹

¹ It is possible to have more dependent attributes, but for the purpose of this thesis we consider only one.

Two of the central problems in supervised learning, the only ones we are concerned with in this thesis, are classification and regression. Both problems

have as goal the construction of a succinct model that can predict the value of the dependent attribute from the attribute variables. The difference between the two tasks is the fact that the dependent attribute is categorical for classification and numerical for regression.

Many classification and regression models have been proposed in the literature: Neural networks (Sarle, 1994; Kohonen, 1995; Bishop, 1995; Ripley, 1996), genetic algorithms (Goldberg, 1989), Bayesian methods (Cheeseman et al., 1988; Cheeseman & Stutz, 1996), log-linear models and other statistical methods (James, 1985; Agresti, 1990; Christensen, 1997), decision tables (Kohavi, 1995), and tree-structured models, so-called classification and regression trees (Sonquist et al., 1971; Gillo, 1972; Morgan & Messenger, 1973; Breiman et al., 1984). Excellent overviews of classification and regression methods were given by Weiss and Kulikowski (1991), Michie et al. (1994) and Hand (1997).

Classification and regression trees, which we call collectively decision trees, are especially attractive in a data mining environment for several reasons. First, due to their intuitive representation, the resulting model is easy to assimilate by humans (Breiman et al., 1984; Mehta et al., 1996). Second, decision trees are non-parametric and thus especially suited for exploratory knowledge discovery. Third, decision trees can be constructed relatively fast compared to other methods (Mehta et al., 1996; Shafer et al., 1996; Lim et al., 1997). And last, the accuracy of decision trees is comparable or superior to other classification and regression models (Murthy, 1995; Lim et al., 1997; Hand, 1997). In this thesis, we restrict our attention exclusively to classification and regression trees.

Figure 1.1 depicts a classification tree, built based on the data in Table 1.1, that predicts if a person lives in a suburb based on other information about the person.

The predicates that label the edges (e.g., Age ≤ 30) are called split predicates, and the attributes involved in such predicates, split attributes. In traditional classification and regression trees only deterministic split predicates are used (i.e., given the split predicate and the value of the attributes, we can determine if the predicate is true or false). Prediction with classification trees is done by navigating the tree on true predicates until a leaf is reached, when the prediction in the leaf (YES or NO in our example) is returned. The regions of the attribute variable space where the decision is given by the same leaf will be called, throughout the thesis, decision regions, and the boundaries between such regions decision boundaries.

[Figure 1.1: Example of classification tree for training data in Table 1.1. The root splits on Age (≤ 30 vs. > 30), with further splits on Car Type and number of children leading to YES/NO leaves.]

As can be observed from the figure, classification trees are easy to understand (we immediately observe, for example, that people younger than 30 who drive sports cars tend not to live in suburbs) and have a very compact representation. For these reasons and others, detailed in Chapter 2, classification and regression trees have been the subject of much research for the last two decades.

Nevertheless, at least in our opinion, more research is still necessary to fully understand and develop these types of learning models, especially from a statistical perspective.

The synergy of Statistics, Machine Learning and Data-mining methods, when applied to classification and regression tree construction, is the main theme of this thesis. The overall goal of our work was to design learning algorithms that have good statistical properties and good accuracy and that require reasonable computational effort, even for large data-sets.

1.1 Our Contributions

Three problems in classification and regression tree construction received our attention:

1.1.1 Bias and bias correction in classification tree construction

Often, learning algorithms have undesirable preferences, especially in the presence of large amounts of noise. In the case of classification and regression trees, most methods for selecting the split variable have a strong preference for variables with large domains. In this thesis we provide a theoretical characterization of this preference and a general corrective method that can be applied to any split selection criterion to remove this undesirable bias. We show how the general corrective method can be applied to the Gini gain for discrete variables when building k-ary splits.

1.1.2 Scalable linear regression tree construction

In the presence of large amounts of data, efficiency of the learning algorithms with respect to the computational effort and memory requirements becomes very important. Part of this thesis is concerned with the scalable construction of regression trees with linear models in the leaves. The key to scalability is to use the EM Algorithm for Gaussian Mixtures to locally, at the level of each node being built, reduce the regression problem to a classification problem. As a side benefit, regression trees with oblique splits (involving a linear combination of predictor attributes instead of a single attribute) can be easily built.

1.1.3 Probabilistic classification and regression trees

The use of strict split predicates in classification and regression trees has two undesirable consequences. First, data is fragmented at an exponential rate and therefore decisions in leaves are based on a small number of samples. Second, decision boundaries are sharp because a single leaf is responsible for prediction. One principled way to address both these problems is to generalize classification and regression trees to make probabilistic decisions. More specifically, a probabilistic model is assigned to each branch and it is used to determine the probability of following the branch. Instead of using a single leaf to predict the output for a given input, all leaves are used, but their contributions are weighted by the probability of reaching them when starting from the root. In this thesis we show how to find well-motivated probabilistic models and how to design scalable algorithms for building such probabilistic classification and regression trees.
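To make the weighted-leaf prediction concrete, the following minimal sketch (not code from the thesis; the node representation and the logistic branch model are assumptions made only for illustration) shows how a probabilistic tree could combine all leaf predictions, each weighted by the probability of reaching it from the root.

    # Hypothetical sketch of prediction with probabilistic splits: every branch
    # carries a function giving the probability of following it, and the final
    # prediction is the probability-weighted combination of all leaf predictions.
    import math

    class PNode:
        def __init__(self, prediction=None, children=None):
            # Leaf: `prediction` is a value. Internal: `children` is a list of
            # (branch_probability_fn, child) pairs; the fns sum to 1 for any x.
            self.prediction = prediction
            self.children = children or []

        def predict(self, x, weight=1.0):
            if not self.children:                      # leaf node
                return weight * self.prediction
            total = 0.0
            for branch_prob, child in self.children:   # weight each subtree by the
                p = branch_prob(x)                     # probability of following it
                if p > 0.0:
                    total += child.predict(x, weight * p)
            return total

    # Example: a soft split on Age around 30 (logistic branch probabilities).
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    left, right = PNode(prediction=1.0), PNode(prediction=0.0)
    root = PNode(children=[(lambda x: sigmoid(30 - x["Age"]), left),
                           (lambda x: sigmoid(x["Age"] - 30), right)])
    print(root.predict({"Age": 28}))   # close to 1, but not exactly 1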

1.2 Thesis Overview and Prerequisites

1.2.1 Prerequisites

This thesis requires relatively few prerequisites. We assume the reader is familiar with basic linear algebra and calculus, in particular the notions of equations, vectors, matrices and Riemann integrals. Standard textbooks on Linear Algebra (our favorite reference is (Hefferon, 2003)) and Calculus (for example (Swokowski, 1991)) suffice. The thesis relies heavily on notions of Probability Theory and Statistics. In Appendix A we provide an overview of the notions and results necessary for reading this thesis. Certainly, readers familiar with these topics will find it easier to follow the presentation, especially the proofs, but the exposition in Appendix A should suffice.

1.2.2 Thesis Overview

Chapter 2 provides a broad introduction to classification and regression tree construction. In the rest of the thesis we assume that the reader is familiar with these notions. In Chapter 3 we address the bias and bias correction problem for classification tree construction. We provide proofs of the results in this chapter in Appendix B. Chapter 4 is dedicated to the linear regression tree construction problem, and Chapter 5 to probabilistic decision trees. Concluding remarks and directions for future research are given in Chapter 6.

Chapter 2

Classification and Regression Trees

In this chapter we give an introduction to classification and regression trees. We first formally introduce classification trees and present some construction algorithms for building such classifiers. Then, we explain how regression trees differ. As mentioned in the introduction, we collectively refer to these types of models as decision trees.

2.1 Classification

Let X_1, ..., X_m, C be random variables, where X_i has domain Dom(X_i). The random variable C has domain Dom(C) = {1, ..., k}. We call X_1, ..., X_m attribute variables (m is the number of such attribute variables) and C the class label or predicted attribute. A classifier C is a function C : Dom(X_1) × ... × Dom(X_m) → Dom(C).

Let Ω = Dom(X_1) × ... × Dom(X_m) × Dom(C) be the set of events. The underlying assumption in classification is the fact that the generative process for the data is probabilistic; it generates the datasets according to an unknown probability distribution P over the set of events Ω.

For a given classifier C and a given probability distribution P over Ω, we can introduce the functional

    R_P(C) = P[C(X_1, ..., X_m) ≠ C],

called the generalization error of the classifier C. Given some information about P in the form of a set of samples, we would like to build a classifier that best approximates P. This leads us to the following:

Classifier Construction Problem: Given a training dataset D of N independent identically distributed samples from Ω, sampled according to probability distribution P, find a function C that minimizes the functional R_P(C), where P is the probability distribution used to generate D.

In general, the classifier construction problem is very hard to solve if we allow the classifier to be an arbitrary function. Arguments rooted in statistical learning theory (Vapnik, 1998) suggest that we have to restrict the class of classifiers that we allow in order to hope to solve this problem. For this reason we restrict our attention to a special type of classifier: classification trees.

2.2 Classification Trees

A classification tree is a directed, acyclic graph T with tree shape. The root of the tree, denoted by Root(T), does not have any incoming edges. Every other node has exactly one incoming edge and may have 0, 2 or more outgoing edges. We call a node T without outgoing edges a leaf node; otherwise T is called an internal node. Each leaf node is labeled with one class label.

Each internal node T is labeled with one attribute variable X_T, called the split attribute. We denote the class label associated with a leaf node T by Label(T). Each edge (T, T′) from an internal node T to one of its children T′ has a predicate q_(T,T′) associated with it, where q_(T,T′) involves only the splitting attribute X_T of node T. The set of predicates Q_T on the outgoing edges of an internal node T must contain disjoint predicates involving the split attribute such that, for any value of the split attribute, exactly one of the predicates in Q_T is true. We will refer to the set of predicates in Q_T as the splitting predicates of T.

Given a classification tree T, we can define the associated classifier C_T(x_1, ..., x_m) in the following recursive manner:

    C(x_1, ..., x_m, T) =
        Label(T)                  if T is a leaf node
        C(x_1, ..., x_m, T_j)     if T is an internal node, X_i is the label of T,
                                  and q_(T,T_j)(x_i) = true                          (2.1)

    C_T(x_1, ..., x_m) = C(x_1, ..., x_m, Root(T))                                   (2.2)

Thus, to make a prediction, we start at the root node and navigate the tree on true predicates until a leaf is reached, when the class label associated with it is returned as the result of the prediction. If the tree T is a well-formed classification tree (as defined above), then the function C_T() is also well defined and, by our definition, a classifier, which we call a classification tree classifier, or in short a classification tree.
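As an illustration of the recursive definition in Equations 2.1 and 2.2, the following minimal sketch (not from the thesis; the node representation is an assumption made for illustration) navigates a tree on true predicates until a leaf is reached.

    # Minimal sketch of the classifier defined by Equations 2.1 and 2.2, assuming a
    # node is either a leaf with a class label or an internal node with a split
    # attribute and a list of (predicate, child) pairs with mutually exclusive,
    # exhaustive predicates.
    class Node:
        def __init__(self, label=None, split_attribute=None, branches=None):
            self.label = label                    # class label (leaf nodes only)
            self.split_attribute = split_attribute
            self.branches = branches or []        # list of (predicate, child) pairs

    def classify(node, x):
        """Return C_T(x): follow the unique true predicate until a leaf."""
        if not node.branches:                     # leaf node: return Label(T)
            return node.label
        value = x[node.split_attribute]
        for predicate, child in node.branches:
            if predicate(value):                  # exactly one predicate is true
                return classify(child, x)
        raise ValueError("predicates are not exhaustive for value %r" % value)

    # A small tree in the spirit of Figure 1.1, using the attribute names of Table 1.1.
    young = Node(split_attribute="Car Type",
                 branches=[(lambda v: v == "sedan", Node(label="YES")),
                           (lambda v: v in ("sports", "truck"), Node(label="NO"))])
    root = Node(split_attribute="Driver Age",
                branches=[(lambda v: v <= 30, young),
                          (lambda v: v > 30, Node(label="NO"))])
    print(classify(root, {"Driver Age": 25, "Car Type": "sedan"}))  # YES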

Two main variations have been proposed for classification trees, and both are in extensive use. If we allow at most two branches for any of the intermediate nodes we get a binary classification tree; otherwise we get a k-ary classification tree. Binary classification trees were introduced by Breiman et al. (1984); k-ary classification trees were introduced by Quinlan (1986). The main difference between these types of trees is in what predicates are allowed for discrete attribute variables (for continuous attribute variables both allow only predicates of the form X > c, where c is a constant). For binary classification trees, predicates of the form X ∈ S, with S a subset of the possible values of the attribute, are allowed. This means that for each node we have to determine both a split attribute and a split set. For discrete attributes in k-ary classification trees, there are as many split predicates as there are values for the attribute variable and all are of the form X = x_i, with x_i one of the possible values of X. In this situation, no split set has to be determined, but the fanout of the tree can be very large. For continuous attribute variables, both types of classification trees split a node into two parts on predicates of the form X ≤ s and its complement X > s, where the real number s is called the split point. Figure 1.1 shows an example of a binary classification tree that is built to predict the data in the dataset in Table 1.1.

2.3 Building Classification Trees

Now that we have introduced classification trees, we can formally state the classification tree construction problem by instantiating the general classifier construction problem:

Classification Tree Construction Problem: Given a training dataset D of N independent identically distributed samples from Ω, sampled according to probability distribution P, find a classification tree T such that the misclassification rate functional R_P(C_T) of the corresponding classifier C_T is minimized.

The main issue with solving the classification tree problem in particular, and the classifier problem in general, is the fact that the classifier has to be a good predictor for the distribution, not for the sample made available from the distribution. This means that we cannot simply build a classifier that is as good as possible with respect to the available sample; it is easy to see that we can achieve zero error with arbitrary classification trees if we do not have contradicting examples, since the noise in the data will be learned as well. This noise learning phenomenon, called overfitting, is one of the main problems in classification.

For this reason, classification trees are built in two phases. In the first phase a tree as large as possible is constructed in a manner that minimizes the error with respect to some subset of the available data, a subset that we call training data. In the second phase the remaining samples, which we call the pruning data, are used to prune the large tree by removing subtrees in a manner that reduces the estimate of the generalization error computed using the pruning data. We discuss each of these two phases individually in what follows.

2.3.1 Tree Growing Phase

Several aspects of decision tree construction have been shown to be NP-hard. Some of these are: building optimal trees from decision tables (Hyafil & Rivest, 1976), constructing a minimum cost classification tree to represent a simple function (Cox et al., 1989), and building classification trees that are optimal in terms of the size needed to store the information in a dataset (Murphy & Mccraw, 1991).

In order to deal with the complexity of choosing the split attributes and split sets and points, most of the classification tree construction algorithms use the greedy induction schema in Figure 2.1.

Input: node T, data-partition D, split selection method V
Output: classification tree T for D rooted at T

Top-Down Classification Tree Induction Schema:
BuildTree(Node T, data-partition D, split attribute selection method V)
(1)  Apply V to D to find the split attribute X for node T.
(2)  Let n be the number of children of T.
(3)  if (T splits)
(4)      Partition D into D_1, ..., D_n and label node T with split attribute X
(5)      Create children nodes T_1, ..., T_n of T and label the edge (T, T_i) with predicate q_(T,T_i)
(6)      foreach i ∈ {1, ..., n}
(7)          BuildTree(T_i, D_i, V)
(8)      endforeach
(9)  else
(10)     Label T with the majority class label of D
(11) endif

Figure 2.1: Classification Tree Induction Schema

It consists in deciding, at each step, upon a split attribute and a split set or point, if necessary, partitioning the data according to the newly determined split predicates, and recursively repeating the process on these partitions, one for each child. The construction process at a node is terminated when a termination condition is satisfied. The only difference between the two types of classification trees is the fact that for k-ary trees no split set needs to be determined for discrete attributes.
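To make the schema of Figure 2.1 concrete, here is a minimal sketch, not taken from the thesis, of greedy top-down induction for a binary tree; the split selection method V is assumed to be a function returning an (attribute, predicate) pair or None, the stopping rule is a simple minimum node size, and the Node class from the earlier prediction sketch is reused.

    # Hypothetical sketch of the top-down induction schema: pick a split with the
    # supplied selection method, partition the data, and recurse on the partitions.
    from collections import Counter

    def build_tree(data, select_split, min_size=5):
        """data: list of (attribute_dict, class_label) pairs.
        select_split(data) returns (attribute_name, predicate) or None."""
        labels = [label for _, label in data]
        if len(data) < min_size or len(set(labels)) == 1:
            return Node(label=Counter(labels).most_common(1)[0][0])   # leaf
        split = select_split(data)
        if split is None:                                             # no useful split
            return Node(label=Counter(labels).most_common(1)[0][0])
        attribute, predicate = split
        left = [(x, c) for x, c in data if predicate(x[attribute])]
        right = [(x, c) for x, c in data if not predicate(x[attribute])]
        if not left or not right:                                     # degenerate split
            return Node(label=Counter(labels).most_common(1)[0][0])
        return Node(split_attribute=attribute,
                    branches=[(predicate, build_tree(left, select_split, min_size)),
                              (lambda v: not predicate(v), build_tree(right, select_split, min_size))])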

We now discuss how the split attribute and split set or point are picked at each step in the recursive construction process, then show some common termination conditions.

Split Attribute Selection

At each step in the recursive construction algorithm, we have to decide on what attribute variable to split. The purpose of the split is to separate, as much as possible, the class labels from each other. To make this intuition useful, we need a metric that estimates how much the separation of the classes is improved when a particular split is performed. We call such a metric a split criterion or a split selection method. There is extensive research in the machine learning and statistics literature on devising split selection criteria that produce classification trees with high predictive accuracy (Murthy, 1997). We briefly discuss here only the ones relevant for our work.

A very popular class of split selection methods are the impurity-based ones (Breiman et al., 1984; Quinlan, 1986). The popularity is well deserved since studies have shown that this class of split selection methods has high predictive accuracy (Lim et al., 1997), and at the same time they are simple and intuitive. Each impurity-based split selection criterion is based on an impurity function Φ(p_1, ..., p_k), with p_j interpreted as the probability of seeing the class label c_j. Intuitively, the impurity function measures how impure the data is. It is required to have the following properties (Breiman et al., 1984):

1. to be concave: ∂²Φ(p_1, ..., p_k) / ∂p_i² < 0

2. to be symmetric in all its arguments, i.e., for π a permutation, Φ(p_1, ..., p_k) = Φ(p_π1, ..., p_πk)

3. to have a unique maximum at (1/k, ..., 1/k), when the mix of class labels is most impure

4. to achieve the minimum for (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1), when the mix of class labels is the most pure

With this, for a node T of the classification tree being built, the impurity at node T is:

    i(T) = Φ(P[C = c_1 | T], ..., P[C = c_k | T])

where P[C = c_j | T] is the probability that the class label is c_j given that the data reaches node T. We defer the discussion on how these statistics are computed to the end of this section.

Given a set Q of split predicates on attribute variable X that split a node T into nodes T_1, ..., T_n, we can define the reduction in impurity as:

    Δi(T, X, Q) = i(T) − Σ_{i=1}^{n} P[T_i | T] · i(T_i)
                = i(T) − Σ_{i=1}^{n} P[q_(T,T_i)(X) | T] · i(T_i)                    (2.3)

Intuitively, the reduction in impurity is the amount of purity gained by splitting, where the impurity after the split is the weighted sum of the impurities of the child nodes. By instantiating the impurity function we get the first two split selection criteria:

Gini Gain. This split criterion was introduced by Breiman et al. (1984). By setting the impurity function to be the Gini index,

    gini(T) = 1 − Σ_{j=1}^{k} P[C = c_j | T]²,

and plugging it into Equation 2.3 we get the Gini gain split criterion:

    GG(T, X, Q) = gini(T) − Σ_{i=1}^{n} P[q_(T,T_i)(X) | T] · gini(T_i)              (2.4)

For two class labels, the Gini gain takes the more compact form:

    GG_b(T, X, Q) = 2 (P[C = c_0 | T_1] P[T_1 | T] − P[C = c_0 | T] P[T_1 | T])² / (P[T_1 | T] (1 − P[T_1 | T]))    (2.5)

Information Gain. This split criterion was introduced by Quinlan (1986). By setting the impurity function to be the entropy of the dataset,

    entropy(T) = − Σ_{j=1}^{k} P[C = c_j | T] log P[C = c_j | T],

and plugging it into Equation 2.3 we get the information gain split criterion:

    IG(T, X, Q) = entropy(T) − Σ_{j=1}^{n} P[q_j(X) | T] · entropy(T_j)              (2.6)

Gain Ratio. Quinlan introduced this adjusted version of the information gain to remove the preference of information gain for attribute variables with large domains (Quinlan, 1986):

    GR(T, X, Q) = IG(T, X, Q) / ( − Σ_{j=1}^{|Dom(X)|} P[X = x_j | T] log P[X = x_j | T] )    (2.7)
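The following small sketch, not from the thesis, shows how the Gini gain and information gain of a candidate split could be computed from class counts; the function names and the use of empirical frequencies as the probabilities are assumptions made for illustration.

    # Hypothetical helpers computing impurity-based split criteria from class counts.
    # Each partition is a list of class counts; probabilities are empirical estimates.
    import math

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log(c / n, 2) for c in counts if c > 0)

    def impurity_gain(parent_counts, children_counts, impurity=gini):
        """Reduction in impurity (Equation 2.3) for a candidate split."""
        n = sum(parent_counts)
        weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
        return impurity(parent_counts) - weighted

    # Split on Car Type (sedan vs. sports/truck) for Table 1.1, class = Lives in
    # Suburb?: the parent has 6 "yes" and 8 "no"; the sedan child has (5 yes, 2 no)
    # and the other child (1 yes, 6 no).
    print(impurity_gain([6, 8], [[5, 2], [1, 6]]))                    # Gini gain
    print(impurity_gain([6, 8], [[5, 2], [1, 6]], impurity=entropy))  # information gain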

Two other popular split selection methods come from the statistics literature:

The χ² Statistic (test).

    χ²(T, X) = Σ_{i=1}^{|Dom(X)|} Σ_{j=1}^{k} (P[X = x_i | T] P[C = c_j | T] − P[X = x_i, C = c_j | T])² / (P[X = x_i | T] P[C = c_j | T])    (2.8)

estimates how much the class labels depend on the value of the split attribute. Notice that the χ²-test does not depend on the set Q of split predicates. A known result in the statistics literature, see for example (Shao, 1999), is the fact that the χ²-test has, asymptotically, a χ² distribution with (|Dom(X)| − 1)(k − 1) degrees of freedom.

The G²-statistic.

    G²(T, X, Q) = 2 · N_T · IG(T, X, Q) · log_e 2,                                    (2.9)

where N_T is the number of records at node T. Asymptotically, the G²-statistic also has a χ² distribution (Mingers, 1987). Interestingly, it is identical to the information gain up to a multiplicative constant, which immediately gives an asymptotic approximation for the distribution of the information gain.

Note that all split criteria except the χ²-test take the set of split predicates as an argument. For discrete attribute variables in k-ary classification trees, the set of predicates is completely determined by specifying the attribute variable, but this is not the case for discrete variables in binary trees or for continuous variables. In these last two situations we also have to determine the best split set or point in order to evaluate how good a split on a particular attribute variable is.
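As a small illustration, not from the thesis, the dependence measure of Equation 2.8 can be computed from the contingency table of an attribute against the class labels; the function name is an assumption, and multiplying the result by the number of records N_T gives the familiar Pearson form of the statistic.

    # Hypothetical sketch: the chi-square style dependence measure of Equation 2.8,
    # computed from a contingency table of attribute values (rows) by class labels
    # (columns), using empirical probabilities.
    def chi_square_measure(table):
        n = sum(sum(row) for row in table)                 # N_T, number of records
        row_p = [sum(row) / n for row in table]            # P[X = x_i | T]
        col_p = [sum(col) / n for col in zip(*table)]      # P[C = c_j | T]
        total = 0.0
        for i, row in enumerate(table):
            for j, count in enumerate(row):
                expected = row_p[i] * col_p[j]             # under independence
                observed = count / n                       # P[X = x_i, C = c_j | T]
                total += (expected - observed) ** 2 / expected
        return total

    # Car Type vs. Lives in Suburb? from Table 1.1: sedan (5 yes, 2 no),
    # sports (0 yes, 4 no), truck (1 yes, 2 no).
    table = [[5, 2], [0, 4], [1, 2]]
    print(chi_square_measure(table), 14 * chi_square_measure(table))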

Split Set Selection for Discrete Attributes

Most of the split set selection methods proposed in the literature use the same split criterion used for split attribute selection in order to evaluate all possible splits and select the best one as the split set. This method is referred to as exhaustive search, since all possible splits of the set of values of an attribute variable are evaluated, at least in principle. In general, this process of finding the split set is computationally intensive except when the domain of the split attribute and the number of class labels are small.

There is, though, a notable exception due to Breiman et al. (1984), where there is an efficient algorithm to find the split set: the case when there are only two class labels and an impurity-based selection criterion is used. Since this algorithm is relevant for some parts of our work, we describe it here. Let us first start with the following:

Theorem 1 (Breiman et al. (1984)). Let I be a finite set, q_i, r_i, i ∈ I be positive quantities and Φ(x) a concave function. For I_1, I_2 a partitioning of I, an optimum of the problem

    argmin_{I_1, I_2}  (Σ_{i ∈ I_1} q_i) Φ( (Σ_{i ∈ I_1} q_i r_i) / (Σ_{i ∈ I_1} q_i) )  +  (Σ_{i ∈ I_2} q_i) Φ( (Σ_{i ∈ I_2} q_i r_i) / (Σ_{i ∈ I_2} q_i) )

has the property that:

    ∀ i ∈ I_1, ∀ j ∈ I_2 :  r_i < r_j

A direct consequence of this theorem is an efficient algorithm to solve this type of optimization problem: order the elements of I in increasing order of r_i and consider only the splits of I that respect this order. The correctness of the algorithm is guaranteed by the fact that the optimum split will be among the splits considered.

With this, setting I = Dom(X), q_i = P[X = x_i | T], r_i = P[C = c_0 | X = x_i, T], and Φ(x) to be the Gini index or the entropy for the two class label case (both are concave),

    gini(T) = 2 P[C = c_0 | T] (1 − P[C = c_0 | T])
    entropy(T) = − P[C = c_0 | T] ln(P[C = c_0 | T]) − (1 − P[C = c_0 | T]) ln(1 − P[C = c_0 | T]),

the optimization criterion, up to a constant factor, is exactly the Gini gain or the information gain. Thus, to efficiently find the best split set, we order the elements of Dom(X) in increasing order of r_i = P[C = c_0 | X = x_i, T] and consider splits only in this order. Since all the split criteria we introduced, except the χ²-test, use either the Gini gain or the information gain multiplied by a factor that does not depend on the split set, this fast split set selection method can be used for all of them.
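A minimal sketch, not from the thesis, of this fast split set selection for two class labels: attribute values are ordered by the empirical estimate of P[C = c_0 | X = x_i, T] and only order-respecting splits are evaluated. The function name is an assumption, and the impurity_gain helper from the sketch earlier in this section is reused.

    # Hypothetical sketch of split set selection for a discrete attribute with two
    # class labels: sort the attribute values by the fraction of class c_0 and
    # evaluate only order-respecting splits (Theorem 1).
    def best_split_set(counts):
        """counts: dict value -> (count_c0, count_c1). Returns (best_set, best_gain)."""
        order = sorted(counts, key=lambda v: counts[v][0] / sum(counts[v]))
        total = [sum(c[i] for c in counts.values()) for i in (0, 1)]
        best_set, best_gain = None, -1.0
        for cut in range(1, len(order)):                 # |Dom(X)| - 1 candidate splits
            left_set = order[:cut]
            left = [sum(counts[v][i] for v in left_set) for i in (0, 1)]
            right = [total[i] - left[i] for i in (0, 1)]
            gain = impurity_gain(total, [left, right])   # Gini gain from earlier sketch
            if gain > best_gain:
                best_set, best_gain = set(left_set), gain
        return best_set, best_gain

    # Car Type counts from Table 1.1: value -> (yes, no), with c_0 = "yes".
    print(best_split_set({"sedan": (5, 2), "sports": (0, 4), "truck": (1, 2)}))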

It is worth mentioning that Loh and Shih (1997) proposed a different technique that consists in transforming the values of discrete attributes into continuous values and using split point selection methods for continuous attributes to obtain the split for discrete attributes.

Split Point Selection for Continuous Attributes

Two methods have been proposed in the literature to deal with the split point selection problem for continuous attributes: exhaustive search and Quadratic Discriminant Analysis.

Exhaustive search uses the same split selection criterion as the split attribute selection method and consists in evaluating all the possible ways to split the domain of the continuous attribute in two parts. To make the process efficient, the available data is first sorted on the attribute variable that is being evaluated and then traversed in order. At the same time, the sufficient statistics are incrementally maintained and the value of the split criterion is computed for each split point. This means that the overall process requires a sort and a linear traversal with constant processing time per value. Most of the classification tree construction algorithms proposed in the literature use exhaustive search.
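A minimal sketch, not from the thesis, of exhaustive split point search for a continuous attribute with two classes: sort once, sweep the sorted values while maintaining running class counts, and score each candidate split point, here with the impurity_gain helper from the earlier sketch. The function and variable names are illustrative.

    # Hypothetical sketch of exhaustive split point selection: one sort plus one
    # linear sweep, updating the left/right class counts incrementally.
    def best_split_point(values, labels, classes=("yes", "no")):
        data = sorted(zip(values, labels))                       # sort on the attribute
        total = [sum(1 for l in labels if l == c) for c in classes]
        left = [0, 0]
        best_point, best_gain = None, -1.0
        for idx in range(len(data) - 1):
            value, label = data[idx]
            left[classes.index(label)] += 1                      # move record to the left
            if value == data[idx + 1][0]:
                continue                                         # not a valid cut point
            right = [total[i] - left[i] for i in (0, 1)]
            gain = impurity_gain(total, [left[:], right])        # Gini gain of X <= s
            if gain > best_gain:
                # split point s halfway between consecutive distinct values
                best_point, best_gain = (value + data[idx + 1][0]) / 2.0, gain
        return best_point, best_gain

    # Driver Age vs. Lives in Suburb? from Table 1.1.
    ages = [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 25, 45, 23, 45]
    suburb = ["yes", "no", "no", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
    print(best_split_point(ages, suburb))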

Loh and Shih (1997) proposed using Quadratic Discriminant Analysis (QDA) to find the split point for continuous attributes, and showed that, from the point of view of the accuracy of the produced trees, it is as good as exhaustive search. An apparent problem with QDA is that it works only for two class label problems. Loh and Shih (1997) suggested a solution to this problem: group the class labels into two super-classes based on some class similarity and define QDA and the split set problem in terms of these super-classes. This method can be used to deal with the intractability of finding splits for categorical attributes when the number of classes is larger than two.

We now briefly describe QDA. The idea is to approximate the distribution of the data-points with the same class label with a normal distribution, and to take as the split point the point between the centers of the two distributions that has equal probability of belonging to each of the distributions. More precisely, for a continuous attribute X, the parameters of the two normal distributions, the probability α_i of belonging to the distribution, the mean µ_i and the variance σ_i², are determined with the formulae:

    α_i = P[C = c_i | T]
    µ_i = E[X | C = c_i, T]
    σ_i² = E[X² | C = c_i, T] − µ_i²

and the equation of the split point µ is:

    (α_1 / (σ_1 √(2π))) e^(−(µ − µ_1)² / (2σ_1²)) = (α_2 / (σ_2 √(2π))) e^(−(µ − µ_2)² / (2σ_2²))

which reduces to the following quadratic equation for the split point:

    µ² (1/σ_1² − 1/σ_2²) − 2µ (µ_1/σ_1² − µ_2/σ_2²) + (µ_1²/σ_1² − µ_2²/σ_2²) = 2 ln(α_1/α_2) − ln(σ_1²/σ_2²)    (2.10)

If σ_1² is very close to σ_2², solving the second order equation is not numerically stable. In this case it is preferable to solve the linear equation:

    2µ(µ_1 − µ_2) = µ_1² − µ_2² − 2σ_1² ln(α_1/α_2)

which is numerically solvable as long as µ_1 ≠ µ_2.

To compute the Gini gain of the variable X with split point µ we just need to compute the sufficient statistics P[C = c_i | X ≤ µ, T], P[C = c_i | X > µ, T], and

    P[X ≤ µ | T] = P[C = c_0 | T] P[X ≤ µ | C = c_0, T] + P[C = c_1 | T] P[X ≤ µ | C = c_1, T],

and plug them into Equation 2.5. The probability P[X ≤ µ | C = c_0, T] is nothing but the cumulative distribution function (c.d.f.) of the normal distribution with mean µ_1 and variance σ_1² at point µ, that is:

    P[X ≤ µ | C = c_0, T] = ∫_{−∞}^{µ} (1/(σ_1 √(2π))) e^(−(x − µ_1)²/(2σ_1²)) dx = (1/2) (1 + Erf((µ − µ_1)/(σ_1 √2)))
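The following sketch, not from the thesis, computes a QDA split point for one continuous attribute and two classes by fitting a Gaussian per class and solving Equation 2.10, falling back to the linear equation when the variances are nearly equal; the function name, class encoding and tie-breaking between the two roots are assumptions made only to illustrate the computation.

    # Hypothetical sketch of QDA split point selection for two classes: fit one
    # Gaussian per class, then solve the equal-density equation (2.10).
    import math

    def qda_split_point(values, labels, c0="yes"):
        x0 = [v for v, l in zip(values, labels) if l == c0]
        x1 = [v for v, l in zip(values, labels) if l != c0]
        n = float(len(values))
        a0, a1 = len(x0) / n, len(x1) / n                       # alpha_i = P[C = c_i | T]
        m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)           # mu_i
        v0 = sum(v * v for v in x0) / len(x0) - m0 * m0         # sigma_i^2
        v1 = sum(v * v for v in x1) / len(x1) - m1 * m1
        if abs(v0 - v1) < 1e-9 * max(v0, v1):                   # nearly equal variances:
            return (m0 * m0 - m1 * m1 - 2 * v0 * math.log(a0 / a1)) / (2 * (m0 - m1))
        A = 1 / v0 - 1 / v1                                     # quadratic of Eq. (2.10)
        B = -2 * (m0 / v0 - m1 / v1)
        C = m0 * m0 / v0 - m1 * m1 / v1 - (2 * math.log(a0 / a1) - math.log(v0 / v1))
        disc = math.sqrt(max(B * B - 4 * A * C, 0.0))
        roots = [(-B - disc) / (2 * A), (-B + disc) / (2 * A)]
        mid = (m0 + m1) / 2                                     # take the root closer to
        return min(roots, key=lambda r: abs(r - mid))           # the class means

    ages = [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 25, 45, 23, 45]
    suburb = ["yes", "no", "no", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
    print(qda_split_point(ages, suburb))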

P[X ≤ µ | C = c_1, T] is similarly obtained. The advantage of QDA is the fact that no sorting of the data is necessary. The sufficient statistics (see the next section) can be easily computed in a single pass over the data in any order, and solving the quadratic equation gives the split point.

Stopping Criteria

The recursive process of constructing classification trees has to be eventually stopped. The most popular stopping criterion, and the one we use throughout the thesis, is to stop the growth of the tree when the number of data-points on which the decision is based goes below a prescribed minimum. By stopping the growth when only a small amount of data is available, we avoid taking statistically insignificant decisions that are likely to be very noisy, thus wrong. Other possibilities are to stop the tree growth when no predictive attribute can be found (this can be quite damaging to the construction algorithm, since no single variable might be predictive but a combination of variables can be) or when the tree has reached a maximum height.

Computing the Sufficient Statistics

So far, we have seen how the classification tree construction process can be reduced to sufficient statistics computation for every node. Here we explain how the sufficient statistics can be estimated using the training data. The idea is to use the usual empirical estimates; throughout the thesis we use the symbol =ᵉ to denote the empirical estimate of a probability or expectation. This means that:

1. for probabilities of the form P[p(X_j) | T], with p(X_j) some predicate on attribute variable X_j, the estimate is simply the number of data-points in the training dataset at node T, D_T, for which the predicate p(X_j) holds, over the overall number of data-points in D_T:

    P[p(X_j) | T] =ᵉ |{(x, c) ∈ D_T : p(x_j)}| / |D_T|

2. for conditional probabilities of the form P[p(X_j) | C = c_0, T], the estimate is:

    P[p(X_j) | C = c_0, T] =ᵉ |{(x, c_0) ∈ D_T : p(x_j)}| / |{(x, c_0) ∈ D_T}|

3. for expectations of functions of attributes, like E[f(X_j) | T], the estimate is simply the average value of the function applied to the attribute for the data-points in D_T:

    E[f(X_j) | T] =ᵉ ( Σ_{(x,c) ∈ D_T} f(x_j) ) / |D_T|

where f(x) is the function whose expectation is being estimated

4. for expectations of the form E[f(X_j) | C = c_0, T], the estimate is:

    E[f(X_j) | C = c_0, T] =ᵉ ( Σ_{(x,c_0) ∈ D_T} f(x_j) ) / |{(x, c_0) ∈ D_T}|

Note that the estimates for all these sufficient statistics can be computed in a single pass over the data. Gehrke et al. (1998) explain how these sufficient statistics can be efficiently computed using limited memory and secondary storage.
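As a small illustration, not from the thesis, the following sketch accumulates in one pass over the records of a node the counts from which all of the empirical estimates above can be formed; the function name and the dictionary layout are assumptions made for illustration.

    # Hypothetical single-pass accumulation of sufficient statistics at a node:
    # per (attribute value, class) counts for discrete attributes and per-class
    # sums / sums of squares for continuous attributes.
    from collections import defaultdict

    def sufficient_statistics(records, discrete, continuous, class_attr):
        stats = {"n": 0,
                 "class_counts": defaultdict(int),
                 "value_class_counts": {a: defaultdict(int) for a in discrete},
                 "moments": {a: defaultdict(lambda: [0, 0.0, 0.0]) for a in continuous}}
        for row in records:                       # one pass, any order
            c = row[class_attr]
            stats["n"] += 1
            stats["class_counts"][c] += 1
            for a in discrete:
                stats["value_class_counts"][a][(row[a], c)] += 1
            for a in continuous:
                m = stats["moments"][a][c]        # [count, sum, sum of squares]
                m[0] += 1
                m[1] += row[a]
                m[2] += row[a] ** 2
        return stats

    # e.g. the per-class mean and variance of Driver Age follow from the moments.
    rows = [{"Car Type": "sedan", "Driver Age": 23, "Lives in Suburb?": "yes"},
            {"Car Type": "sports", "Driver Age": 31, "Lives in Suburb?": "no"}]
    s = sufficient_statistics(rows, ["Car Type"], ["Driver Age"], "Lives in Suburb?")
    print(s["class_counts"])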

2.3.2 Pruning Phase

In this thesis we use exclusively Quinlan's re-substitution error pruning (Quinlan, 1993a). A comprehensive overview of other pruning techniques can be found in (Murthy, 1997).

Re-substitution error pruning consists in eliminating subtrees in order to obtain the tree with the smallest error on the pruning set, a separate part of the data used only for pruning. To achieve this, every node estimates its contribution to the error on the pruning data when the majority class is used as the prediction. Then, starting from the leaves and going upward, every node compares its contribution to the error when using the local prediction with the smallest possible contribution to the error of its children (if a node is not a leaf in the final tree, it has no contribution to the error; only leaves contribute), and prunes the tree if the local error contribution is smaller, which results in the node becoming a leaf. Since, after any node is visited, the subtree rooted at it is optimally pruned (this is the invariant maintained), when the overall process finishes the whole tree is optimally pruned.
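A minimal sketch, not from the thesis, of bottom-up re-substitution error pruning: each node's error on the pruning data when predicting its majority class is compared with the sum of its children's pruned errors, and the subtree is cut when the local error is no larger. The node fields are the illustrative ones used in the earlier sketches.

    # Hypothetical sketch of re-substitution error pruning: returns the number of
    # pruning-set errors of the optimally pruned subtree rooted at `node`, turning
    # the node into a leaf whenever predicting its majority class is at least as good.
    from collections import Counter

    def prune(node, pruning_data):
        labels = [c for _, c in pruning_data]
        majority, majority_count = Counter(labels).most_common(1)[0] if labels else (None, 0)
        local_error = len(labels) - majority_count            # errors if node becomes a leaf
        if not node.branches:                                 # already a leaf
            return sum(1 for _, c in pruning_data if c != node.label)
        children_error = 0
        for predicate, child in node.branches:                # route pruning data downward
            subset = [(x, c) for x, c in pruning_data if predicate(x[node.split_attribute])]
            children_error += prune(child, subset)
        if labels and local_error <= children_error:          # cut the subtree
            node.branches, node.label = [], majority
            return local_error
        return children_error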

2.4 Regression Trees

We start with the formal definition of the regression problem and we present regression trees, a particular type of regressors. We have the random variables X_1, ..., X_m as in the previous section, to which we add the random variable Y, with the real line as its domain, that we call the predicted attribute or output. A regressor R is a function R : Dom(X_1) × ... × Dom(X_m) → Dom(Y). Now, if we let the set of events be Ω = Dom(X_1) × ... × Dom(X_m) × Dom(Y), we can define probability measures P over Ω. Using such a probability measure and some loss function L (e.g., the square loss function L(a, x) = (a − x)²), we can define the regressor error as

    R_P(R) = E_P[L(Y, R(X_1, ..., X_m))]

where E_P is the expectation with respect to probability measure P. In this thesis we use only the square loss function. With this we have:

Regressor Construction Problem: Given a training dataset D of N independent identically distributed samples from Ω, sampled according to probability distribution P, find a function R that minimizes the functional R_P(R).

Regression trees, the particular type of regressors we are interested in, are the natural generalization of classification trees for regression problems. Instead of associating a class label with every node, a real value or a functional dependency on some of the inputs is used. Regression trees were introduced by Breiman et al. (1984) and implemented in their CART system. Regression trees in CART are binary trees, have a constant numerical value in the leaves and use the variance as the measure of impurity. Thus the split selection measure is:

    Err(T) = Σ_{i=1}^{N_T} (y_i − ȳ_T)²                                              (2.11)

    ΔErr(T) = Err(T) − Err(T_1) − Err(T_2)                                           (2.12)

where ȳ_T is the average of the predicted attribute over the examples that correspond to node T. The reason for using the variance as the impurity measure is the fact that the best constant predictor in a node is the average of the value of the predicted variable on the examples that correspond to the node; the variance is thus the mean square error of the average used as a predictor.
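As a small illustration, not from the thesis, the following sketch scores a candidate regression split by the reduction in the sum of squared deviations from the node averages (Equations 2.11 and 2.12); the function names are illustrative.

    # Hypothetical sketch of the CART regression split measure: Err(T) is the sum of
    # squared deviations of y from the node mean, and a split is scored by
    # Delta Err = Err(T) - Err(T_1) - Err(T_2).
    def sse(ys):
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    def variance_reduction(ys, left_mask):
        left = [y for y, m in zip(ys, left_mask) if m]
        right = [y for y, m in zip(ys, left_mask) if not m]
        if not left or not right:
            return 0.0
        return sse(ys) - sse(left) - sse(right)

    # Score the split Children <= 0 when predicting Driver Age from Table 1.1.
    ages = [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 25, 45, 23, 45]
    children = [0, 1, 1, 2, 0, 0, 0, 1, 2, 1, 0, 1, 2, 0]
    print(variance_reduction(ages, [c <= 0 for c in children]))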

An alternative split criterion proposed by Breiman et al. (1984), and also used in (Torgo, 1997a), is based on the sample variance as the impurity measure:

    Err_S(T) = Var(Y | T) =ᵉ (1/N_T) Err(T)

    ΔErr_S(T) = Err_S(T) − P[T_1 | T] Err_S(T_1) − P[T_2 | T] Err_S(T_2)

Interestingly, if the maximum likelihood estimate is used for all the probabilities and expectations, as is usually done in practice, we have the following connection between the variance and sample variance criteria:

    ΔErr_S(T) =ᵉ Err(T)/N_T − (N_{T_1}/N_T)(Err(T_1)/N_{T_1}) − (N_{T_2}/N_T)(Err(T_2)/N_{T_2}) = ΔErr(T)/N_T

Due to this connection, if there are no missing values, optimizing one of the criteria also results in optimizing the other.

For a categorical attribute variable X, the split that maximizes ΔErr_S(T) can be found very efficiently, since the objective function in Theorem 1 with

    Φ(x) = −x²,   q_i = P[X = x_i | T],   r_i = E[Y | X = x_i, T]

is exactly this criterion up to additive and multiplicative constants that do not influence the solution (Breiman et al., 1984). This means that we can simply order the elements in Dom(X) in increasing order of E[Y | X = x_i, T] and consider splits only in this order. If the empirical estimates are used for q_i = P[X = x_i | T] and r_i = E[Y | X = x_i, T], the criterion ΔErr(T) is optimized as well.

As in the case of classification trees, prediction is made by navigating the tree


Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

Review of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations

Review of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations Review of Lecture 1 This course is about finding novel actionable patterns in data. We can divide data mining algorithms (and the patterns they find) into five groups Across records Classification, Clustering,

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

Basing Decisions on Sentences in Decision Diagrams

Basing Decisions on Sentences in Decision Diagrams Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Basing Decisions on Sentences in Decision Diagrams Yexiang Xue Department of Computer Science Cornell University yexiang@cs.cornell.edu

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4] 1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Definitions Data. Consider a set A = {A 1,...,A n } of attributes, and an additional

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

Information Theory & Decision Trees

Information Theory & Decision Trees Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr Decision tree classifiers Decision tree representation for modeling dependencies among input variables using

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Day 3: Classification, logistic regression

Day 3: Classification, logistic regression Day 3: Classification, logistic regression Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 20 June 2018 Topics so far Supervised

More information

UVA CS 4501: Machine Learning

UVA CS 4501: Machine Learning UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course

More information

Learning Methods for Linear Detectors

Learning Methods for Linear Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han Copyright 05 Richard Han All rights reserved. CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Decision Tree And Random Forest

Decision Tree And Random Forest Decision Tree And Random Forest Dr. Ammar Mohammed Associate Professor of Computer Science ISSR, Cairo University PhD of CS ( Uni. Koblenz-Landau, Germany) Spring 2019 Contact: mailto: Ammar@cu.edu.eg

More information

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

More information

Machine Learning 2nd Edi7on

Machine Learning 2nd Edi7on Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edi7on CHAPTER 9: Decision Trees ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr h1p://www.cmpe.boun.edu.tr/~ethem/i2ml2e

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Data classification (II)

Data classification (II) Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

More information

CSCI 5622 Machine Learning

CSCI 5622 Machine Learning CSCI 5622 Machine Learning DATE READ DUE Mon, Aug 31 1, 2 & 3 Wed, Sept 2 3 & 5 Wed, Sept 9 TBA Prelim Proposal www.rodneynielsen.com/teaching/csci5622f09/ Instructor: Rodney Nielsen Assistant Professor

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Lecture VII: Classification I. Dr. Ouiem Bchir

Lecture VII: Classification I. Dr. Ouiem Bchir Lecture VII: Classification I Dr. Ouiem Bchir 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University 2018 CS420, Machine Learning, Lecture 5 Tree Models Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html ML Task: Function Approximation Problem setting

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Predictive Modeling: Classification. KSE 521 Topic 6 Mun Yi

Predictive Modeling: Classification. KSE 521 Topic 6 Mun Yi Predictive Modeling: Classification Topic 6 Mun Yi Agenda Models and Induction Entropy and Information Gain Tree-Based Classifier Probability Estimation 2 Introduction Key concept of BI: Predictive modeling

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Supervised Learning via Decision Trees

Supervised Learning via Decision Trees Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions

More information

Machine Learning 3. week

Machine Learning 3. week Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

Classification and Prediction

Classification and Prediction Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect

More information

Classification and regression trees

Classification and regression trees Classification and regression trees Pierre Geurts p.geurts@ulg.ac.be Last update: 23/09/2015 1 Outline Supervised learning Decision tree representation Decision tree learning Extensions Regression trees

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Dyadic Classification Trees via Structural Risk Minimization

Dyadic Classification Trees via Structural Risk Minimization Dyadic Classification Trees via Structural Risk Minimization Clayton Scott and Robert Nowak Department of Electrical and Computer Engineering Rice University Houston, TX 77005 cscott,nowak @rice.edu Abstract

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Decision Trees Part 1. Rao Vemuri University of California, Davis

Decision Trees Part 1. Rao Vemuri University of California, Davis Decision Trees Part 1 Rao Vemuri University of California, Davis Overview What is a Decision Tree Sample Decision Trees How to Construct a Decision Tree Problems with Decision Trees Classification Vs Regression

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Tutorial 6. By:Aashmeet Kalra

Tutorial 6. By:Aashmeet Kalra Tutorial 6 By:Aashmeet Kalra AGENDA Candidate Elimination Algorithm Example Demo of Candidate Elimination Algorithm Decision Trees Example Demo of Decision Trees Concept and Concept Learning A Concept

More information