SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION


SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Alin Viorel Dobra
August 2003

© 2003 Alin Viorel Dobra
ALL RIGHTS RESERVED


SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION
Alin Viorel Dobra, Ph.D.
Cornell University 2003

Automating the learning process is one of the long-standing goals of Artificial Intelligence and its more recent specialization, Machine Learning. Supervised learning is a particular learning task in which the goal is to establish the connection between some of the attributes of the data made available for learning, called attribute variables, and the remaining attributes, called predicted attributes. This thesis is concerned exclusively with supervised learning using tree-structured models: classification trees for predicting discrete outputs and regression trees for predicting continuous outputs.

In the case of classification and regression trees, most methods for selecting the split variable have a strong preference for variables with large domains. Our first contribution is a theoretical characterization of this preference and a general corrective method that can be applied to any split selection method. We further show how the general corrective method can be applied to the Gini gain for discrete variables when building k-ary splits.

In the presence of large amounts of data, efficiency of the learning algorithms with respect to the computational effort and memory requirements becomes very important. Our second contribution is a scalable construction algorithm for regression trees with linear models in the leaves. The key to scalability is to use the EM Algorithm for Gaussian Mixtures to locally reduce the regression problem to a much easier classification problem.

The use of strict split predicates in classification and regression trees has undesirable properties like data fragmentation and sharp decision boundaries, properties that result in decreased accuracy. Our third contribution is a generalization of the classic classification and regression trees that allows probabilistic splits in a manner that significantly improves the accuracy but, at the same time, does not significantly increase the computational effort required to build these types of models.

Biographical Sketch

Alin Dobra was born on September 20th, 1974 in Bistriţa, Romania. He received a B.S. in Computer Science from the Technical University of Cluj-Napoca, Romania, in June. He expects to receive a Ph.D. in Computer Science from Cornell University in August 2003. In the summers of 1991 and 1992, he interned at Bell Laboratories in Murray Hill, NJ, where he worked with Minos Garofalakis and Rajeev Rastogi. In Fall 2003 he is joining the Department of Computer and Information Science and Engineering at the University of Florida, Gainesville, as an Assistant Professor.

Părinţilor mei (To my parents)

Acknowledgements

First and foremost I would like to thank my thesis adviser, Professor Johannes Gehrke. This thesis would not have been possible without his guidance and support over the last three years. Many thanks and my love go to my wife, Delia, who has been there for me all these years. I do not even want to imagine how my life would have been without her and her support. My eternal gratitude goes to my parents, especially my father, who put my education above their personal comfort for more than 20 years. They encouraged and supported my scientific curiosity from an early age even though they never got the chance to pursue their own scientific dreams. I hope this thesis will bring them much personal satisfaction and pride. I met many great people during my five-year stay at Cornell University. I thank them all.

Table of Contents

1 Introduction
  1.1 Our Contributions
    Bias and bias correction in classification tree construction
    Scalable linear regression tree construction
    Probabilistic classification and regression trees
  1.2 Thesis Overview and Prerequisites
    Prerequisites
    Thesis Overview

2 Classification and Regression Trees
  2.1 Classification
  2.2 Classification Trees
  2.3 Building Classification Trees
    Tree Growing Phase
    Pruning Phase
  2.4 Regression Trees

3 Bias Correction in Classification Tree Construction
  Introduction
  Preliminaries
  Split Selection
  Bias in Split Selection
  A Definition of Bias
  Experimental Demonstration of the Bias
  Correction of the Bias
  A Tight Approximation of the distribution of Gini Gain
  Computation of the Expected Value and Variance of Gini Gain
  Approximating the Distribution of Gini Gain with Parametric Distributions
  Experimental Evaluation
  Discussion

4 Scalable Linear Regression Trees
  Introduction
  Preliminaries: EM Algorithm for Gaussian Mixtures
  Previous solutions to linear regression tree construction
  Quinlan's construction algorithm
  Karalic's construction algorithm
  Chaudhuri et al.'s construction algorithm
  Scalable Linear Regression Trees
  Efficient Implementation of the EM Algorithm
  Split Point and Attribute Selection
  Empirical Evaluation
  Experimental testbed and methodology
  Experimental results: Accuracy
  Experimental results: Scalability
  Discussion

5 Probabilistic Decision Trees
  Introduction
  Probabilistic Decision Trees (PDTs)
  Generalized Decision Trees (GDTs)
  From Generalized Decision Trees to Probabilistic Decision Trees
  Speeding up Inference with PDTs
  Learning PDTs
  Computing sufficient statistics for PDTs
  Adapting DT algorithms to PDTs
  Split Point Fluctuations
  Empirical Evaluation
  Experimental Setup
  Experimental Results: Accuracy
  Experimental Results: Running Time
  Related Work
  Discussion

6 Conclusions

A Probability and Statistics Notions
  A.1 Basic Probability Notions
    A.1.1 Probabilities and Random Variables
  A.2 Basic Statistical Notions
    A.2.1 Discrete Distributions
    A.2.2 Continuous Distributions

B Proofs for Chapter 3
  B.0.3 Variance of the Gini gain random variable
  B.0.4 Mean and Variance of χ²-test for two class case

List of Tables

1.1 Example Training Database
 P-values at point x for parametric distributions as a function of expected value, µ, and variance, σ²
 Experimental moments and predictions of moments for N = 100, n = 2, p_1 = .5 obtained by Monte Carlo simulation with repetitions. -T are theoretical approximations, -E are experimental approximations
 Experimental moments and predictions of moments for N = 100, n = 10, p_1 = .5 obtained by Monte Carlo simulation with repetitions. -T are theoretical approximations, -E are experimental approximations
 Experimental moments and predictions of moments for N = 100, n = 2, p_1 = .01 obtained by Monte Carlo simulation with repetitions. -T are theoretical approximations, -E are experimental approximations
 Accuracy on real (upper part) and synthetic (lower part) datasets of GUIDE and SECRET. In parenthesis we indicate O for orthogonal splits. The winner is in bold font if it is statistically significant and in italics otherwise
 Datasets used in experiments; top for classification and bottom for regression
 Classification tree experimental results
 Constant regression trees experimental results
 Linear regression trees experimental results
B.1 Formulae for expressions over random vector [X_1 ... X_k] distributed Multinomial(N, p_1, ..., p_k)

List of Figures

1.1 Example of classification tree for training data in Table 1.1
 Classification Tree Induction Schema
 Summary of notation for Chapter 3
 Contingency table for a generic dataset D and attribute variable X
 The bias of the Gini gain
 The bias of the information gain
 The bias of the gain ratio
 The bias of the p-value of the χ²-test (using a χ²-distribution)
 The bias of the p-value of the G²-test (using a χ²-distribution)
 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p_1 = .5
 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 10 and p_1 = .5
 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p_1 = .01
 Probability density function of Gini gain for attribute variables X_1 and X_2
 Bias of the p-value of the Gini gain using the gamma correction
 Example of situation where average based decision is different from linear regression based decision
 Example where classification on sign of residuals is unintuitive
 SECRET algorithm
 Projection on X_r, Y space of training data
 Projection on X_d, X_r, Y space of same training data as in the previous figure
 Separator hyperplane for two Gaussian distributions in two dimensional space
4.7 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset 3DSin (3 continuous attributes)
 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset Fried (11 continuous attributes)
 Running time of SECRET with linear regressors as a function of the number of attributes for dataset 3Dsin
 Accuracy of the best quadratic approximation of the running time for dataset 3Dsin
 Running time of SECRET with linear regressors as a function of the size of the 3Dsin dataset
 Accuracy as a function of learning time for SECRET and GUIDE with four sampling proportions
 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with constant regressors, for synthetic dataset Fried (11 continuous attributes)
 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with linear regressors, for synthetic dataset Fried (11 continuous attributes)
B.1 Dependency of the function (1 − 6p_1 + 6p_1²)/(p_1(1 − p_1)) on p_1

Chapter 1

Introduction

Automating the learning process is one of the long-standing goals of Artificial Intelligence and its more recent specialization, Machine Learning, but it is also the core goal of newer research areas like Data-mining. The ability to learn from examples has found numerous applications in the scientific and business communities; the applications include scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Brachman et al., 1996; Inman, 1996; Fayyad et al., 1996), since it allows the identification of interesting patterns or connections either in the examples provided or, more importantly, in the natural or artificial process that generated the data.

In this thesis we are only concerned with data presented in tabular format; we call each column an attribute and we associate a name with it. Attributes whose domain is numerical are called numerical attributes, whereas attributes whose domain is not numerical are called categorical attributes. An example of a dataset about people living in a metropolitan area is depicted in Table 1.1. Attribute Car Type of this dataset is categorical and attribute Age is numerical.

Two types of learning tasks have been identified in the literature: unsupervised and supervised learning.

Table 1.1: Example Training Database

  Car Type   Driver Age   Children   Lives in Suburb?
  sedan      23           0          yes
  sports     31           1          no
  sedan      36           1          no
  truck      25           2          no
  sports     30           0          no
  sedan      36           0          no
  sedan      25           0          yes
  truck      36           1          no
  sedan      30           2          yes
  sedan      31           1          yes
  sports     25           0          no
  sedan      45           1          yes
  sports     23           2          no
  truck      45           0          yes

They differ in the semantics associated with the attributes of the learning examples and their goals. The general goal of unsupervised learning is to find interesting patterns in the data, patterns that are useful for a higher level understanding of the structure of the data. Types of interesting patterns that are useful are: groupings or clusters in the data as found by various clustering algorithms (see for example the excellent surveys (Berkhin, 2002; Jain et al., 1999)), and frequent item-sets (Agrawal & Srikant, 1994; Hipp et al., 2000). Unsupervised learning techniques usually assign the same role to all the attributes. Supervised learning tries to determine a connection between a subset of the attributes, called the inputs or attribute variables, and the dependent attribute or outputs.¹

¹ It is possible to have more dependent attributes, but for the purpose of this thesis we consider only one.

Two of the central problems in supervised learning, the only ones we are concerned with in this thesis, are classification and regression. Both problems

have as goal the construction of a succinct model that can predict the value of the dependent attribute from the attribute variables. The difference between the two tasks is the fact that the dependent attribute is categorical for classification and numerical for regression.

Many classification and regression models have been proposed in the literature: Neural networks (Sarle, 1994; Kohonen, 1995; Bishop, 1995; Ripley, 1996), genetic algorithms (Goldberg, 1989), Bayesian methods (Cheeseman et al., 1988; Cheeseman & Stutz, 1996), log-linear models and other statistical methods (James, 1985; Agresti, 1990; Christensen, 1997), decision tables (Kohavi, 1995), and tree-structured models, so-called classification and regression trees (Sonquist et al., 1971; Gillo, 1972; Morgan & Messenger, 1973; Breiman et al., 1984). Excellent overviews of classification and regression methods were given by Weiss and Kulikowski (1991), Michie et al. (1994) and Hand (1997).

Classification and regression trees, which we call collectively decision trees, are especially attractive in a data mining environment for several reasons. First, due to their intuitive representation, the resulting model is easy to assimilate by humans (Breiman et al., 1984; Mehta et al., 1996). Second, decision trees are non-parametric and thus especially suited for exploratory knowledge discovery. Third, decision trees can be constructed relatively fast compared to other methods (Mehta et al., 1996; Shafer et al., 1996; Lim et al., 1997). And last, the accuracy of decision trees is comparable or superior to other classification and regression models (Murthy, 1995; Lim et al., 1997; Hand, 1997). In this thesis, we restrict our attention exclusively to classification and regression trees.

Figure 1.1 depicts a classification tree, built based on the data in Table 1.1, that predicts if a person lives in a suburb based on other information about the person.

The predicates that label the edges (e.g., Age ≤ 30) are called split predicates, and the attributes involved in such predicates, split attributes. In traditional classification and regression trees only deterministic split predicates are used (i.e., given the split predicate and the value of the attributes, we can determine if the predicate is true or false). Prediction with classification trees is done by navigating the tree on true predicates until a leaf is reached, when the prediction in the leaf (YES or NO in our example) is returned. The regions of the attribute variable space where the decision is given by the same leaf will be called, throughout the thesis, decision regions, and the boundaries between such regions decision boundaries.

[Figure 1.1: Example of classification tree for training data in Table 1.1. The root splits on Age (≤ 30 vs. > 30), with further splits on Car Type and number of children leading to YES/NO leaves.]

As can be observed from the figure, classification trees are easy to understand (we immediately observe, for example, that people younger than 30 who drive sports cars tend not to live in suburbs) and have a very compact representation. For these reasons and others, detailed in Chapter 2, classification and regression trees have been the subject of much research for the last two decades.

Nevertheless, at least in our opinion, more research is still necessary to fully understand and develop these types of learning models, especially from a statistical perspective.

The synergy of Statistics, Machine Learning and Data-mining methods, when applied to classification and regression tree construction, is the main theme of this thesis. The overall goal of our work was to design learning algorithms that have good statistical properties and good accuracy and that require reasonable computational effort, even for large data-sets.

1.1 Our Contributions

Three problems in classification and regression tree construction received our attention:

1.1.1 Bias and bias correction in classification tree construction

Often, learning algorithms have undesirable preferences, especially in the presence of large amounts of noise. In the case of classification and regression trees, most methods for selecting the split variable have a strong preference for variables with large domains. In this thesis we provide a theoretical characterization of this preference and a general corrective method that can be applied to any split selection criterion to remove this undesirable bias. We show how the general corrective method can be applied to the Gini gain for discrete variables when building k-ary splits.

1.1.2 Scalable linear regression tree construction

In the presence of large amounts of data, efficiency of the learning algorithms with respect to the computational effort and memory requirements becomes very important. Part of this thesis is concerned with the scalable construction of regression trees with linear models in the leaves. The key to scalability is to use the EM Algorithm for Gaussian Mixtures to locally, at the level of each node being built, reduce the regression problem to a classification problem. As a side benefit, regression trees with oblique splits (involving a linear combination of predictor attributes instead of a single attribute) can be easily built.

1.1.3 Probabilistic classification and regression trees

The use of strict split predicates in classification and regression trees has two undesirable consequences. First, data is fragmented at an exponential rate and therefore decisions in leaves are based on a small number of samples. Second, decision boundaries are sharp because a single leaf is responsible for prediction. One principled way to address both these problems is to generalize classification and regression trees to make probabilistic decisions. More specifically, a probabilistic model is assigned to each branch and it is used to determine the probability of following the branch. Instead of using a single leaf to predict the output for a given input, all leaves are used, but their contributions are weighted by the probability of reaching them when starting from the root. In this thesis we show how to find well-motivated probabilistic models and how to design scalable algorithms for building such probabilistic classification and regression trees.
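To make the weighted-leaf prediction concrete, the following minimal sketch (not code from the thesis; the node representation and the logistic branch model are assumptions made only for illustration) shows how a probabilistic tree could combine all leaf predictions, each weighted by the probability of reaching it from the root.

    # Hypothetical sketch of prediction with probabilistic splits: every branch
    # carries a function giving the probability of following it, and the final
    # prediction is the probability-weighted combination of all leaf predictions.
    import math

    class PNode:
        def __init__(self, prediction=None, children=None):
            # Leaf: `prediction` is a value. Internal: `children` is a list of
            # (branch_probability_fn, child) pairs; the fns sum to 1 for any x.
            self.prediction = prediction
            self.children = children or []

        def predict(self, x, weight=1.0):
            if not self.children:                      # leaf node
                return weight * self.prediction
            total = 0.0
            for branch_prob, child in self.children:   # weight each subtree by the
                p = branch_prob(x)                     # probability of following it
                if p > 0.0:
                    total += child.predict(x, weight * p)
            return total

    # Example: a soft split on Age around 30 (logistic branch probabilities).
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    left, right = PNode(prediction=1.0), PNode(prediction=0.0)
    root = PNode(children=[(lambda x: sigmoid(30 - x["Age"]), left),
                           (lambda x: sigmoid(x["Age"] - 30), right)])
    print(root.predict({"Age": 28}))   # close to 1, but not exactly 1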

1.2 Thesis Overview and Prerequisites

1.2.1 Prerequisites

This thesis requires relatively few prerequisites. We assume the reader is familiar with basic linear algebra and calculus, in particular the notions of equations, vectors, matrices and Riemann integrals. Standard textbooks on Linear Algebra (our favorite reference is (Hefferon, 2003)) and Calculus (for example (Swokowski, 1991)) suffice. The thesis relies heavily on notions of Probability Theory and Statistics. In Appendix A we provide an overview of the notions and results necessary for reading this thesis. Certainly, readers familiar with these topics will find it easier to follow the presentation, especially the proofs, but the exposition in Appendix A should suffice.

1.2.2 Thesis Overview

Chapter 2 provides a broad introduction to classification and regression tree construction. In the rest of the thesis we assume that the reader is familiar with these notions. In Chapter 3 we address the bias and bias correction problem for classification tree construction. We provide proofs of the results in this chapter in Appendix B. Chapter 4 is dedicated to the linear regression tree construction problem, and Chapter 5 to probabilistic decision trees. Concluding remarks and directions for future research are given in Chapter 6.

Chapter 2

Classification and Regression Trees

In this chapter we give an introduction to classification and regression trees. We first formally introduce classification trees and present some construction algorithms for building such classifiers. Then, we explain how regression trees differ. As mentioned in the introduction, we collectively refer to these types of models as decision trees.

2.1 Classification

Let X_1, ..., X_m, C be random variables, where X_i has domain Dom(X_i). The random variable C has domain Dom(C) = {1, ..., k}. We call X_1, ..., X_m attribute variables (m is the number of such attribute variables) and C the class label or predicted attribute. A classifier C is a function C : Dom(X_1) × ... × Dom(X_m) → Dom(C).

Let Ω = Dom(X_1) × ... × Dom(X_m) × Dom(C) be the set of events. The underlying assumption in classification is the fact that the generative process for the data is probabilistic; it generates the datasets according to an unknown probability distribution P over the set of events Ω.

For a given classifier C and a given probability distribution P over Ω, we can introduce the functional

    R_P(C) = P[C(X_1, ..., X_m) ≠ C],

called the generalization error of the classifier C. Given some information about P in the form of a set of samples, we would like to build a classifier that best approximates P. This leads us to the following:

Classifier Construction Problem: Given a training dataset D of N independent identically distributed samples from Ω, sampled according to probability distribution P, find a function C that minimizes the functional R_P(C), where P is the probability distribution used to generate D.

In general, the classifier construction problem is very hard to solve if we allow the classifier to be an arbitrary function. Arguments rooted in statistical learning theory (Vapnik, 1998) suggest that we have to restrict the class of classifiers that we allow in order to hope to solve this problem. For this reason we restrict our attention to a special type of classifier: classification trees.

2.2 Classification Trees

A classification tree is a directed, acyclic graph T with tree shape. The root of the tree, denoted by Root(T), does not have any incoming edges. Every other node has exactly one incoming edge and may have 0, 2 or more outgoing edges. We call a node T without outgoing edges a leaf node; otherwise T is called an internal node. Each leaf node is labeled with one class label.

Each internal node T is labeled with one attribute variable X_T, called the split attribute. We denote the class label associated with a leaf node T by Label(T). Each edge (T, T′) from an internal node T to one of its children T′ has a predicate q_(T,T′) associated with it, where q_(T,T′) involves only the splitting attribute X_T of node T. The set of predicates Q_T on the outgoing edges of an internal node T must contain disjoint predicates involving the split attribute such that, for any value of the split attribute, exactly one of the predicates in Q_T is true. We will refer to the set of predicates in Q_T as the splitting predicates of T.

Given a classification tree T, we can define the associated classifier C_T(x_1, ..., x_m) in the following recursive manner:

    C(x_1, ..., x_m, T) =
        Label(T)                  if T is a leaf node
        C(x_1, ..., x_m, T_j)     if T is an internal node, X_i is the label of T,
                                  and q_(T,T_j)(x_i) = true                          (2.1)

    C_T(x_1, ..., x_m) = C(x_1, ..., x_m, Root(T))                                   (2.2)

Thus, to make a prediction, we start at the root node and navigate the tree on true predicates until a leaf is reached, when the class label associated with it is returned as the result of the prediction. If the tree T is a well-formed classification tree (as defined above), then the function C_T() is also well defined and, by our definition, a classifier, which we call a classification tree classifier, or in short a classification tree.
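As an illustration of the recursive definition in Equations 2.1 and 2.2, the following minimal sketch (not from the thesis; the node representation is an assumption made for illustration) navigates a tree on true predicates until a leaf is reached.

    # Minimal sketch of the classifier defined by Equations 2.1 and 2.2, assuming a
    # node is either a leaf with a class label or an internal node with a split
    # attribute and a list of (predicate, child) pairs with mutually exclusive,
    # exhaustive predicates.
    class Node:
        def __init__(self, label=None, split_attribute=None, branches=None):
            self.label = label                    # class label (leaf nodes only)
            self.split_attribute = split_attribute
            self.branches = branches or []        # list of (predicate, child) pairs

    def classify(node, x):
        """Return C_T(x): follow the unique true predicate until a leaf."""
        if not node.branches:                     # leaf node: return Label(T)
            return node.label
        value = x[node.split_attribute]
        for predicate, child in node.branches:
            if predicate(value):                  # exactly one predicate is true
                return classify(child, x)
        raise ValueError("predicates are not exhaustive for value %r" % value)

    # A small tree in the spirit of Figure 1.1, using the attribute names of Table 1.1.
    young = Node(split_attribute="Car Type",
                 branches=[(lambda v: v == "sedan", Node(label="YES")),
                           (lambda v: v in ("sports", "truck"), Node(label="NO"))])
    root = Node(split_attribute="Driver Age",
                branches=[(lambda v: v <= 30, young),
                          (lambda v: v > 30, Node(label="NO"))])
    print(classify(root, {"Driver Age": 25, "Car Type": "sedan"}))  # YES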

Two main variations have been proposed for classification trees, and both are in extensive use. If we allow at most two branches for any of the intermediate nodes we get a binary classification tree; otherwise we get a k-ary classification tree. Binary classification trees were introduced by Breiman et al. (1984); k-ary classification trees were introduced by Quinlan (1986). The main difference between these types of trees is in what predicates are allowed for discrete attribute variables (for continuous attribute variables both allow only predicates of the form X > c, where c is a constant). For binary classification trees, predicates of the form X ∈ S, with S a subset of the possible values of the attribute, are allowed. This means that for each node we have to determine both a split attribute and a split set. For discrete attributes in k-ary classification trees, there are as many split predicates as there are values for the attribute variable and all are of the form X = x_i, with x_i one of the possible values of X. In this situation, no split set has to be determined, but the fanout of the tree can be very large. For continuous attribute variables, both types of classification trees split a node into two parts on predicates of the form X ≤ s and its complement X > s, where the real number s is called the split point. Figure 1.1 shows an example of a binary classification tree that is built to predict the data in the dataset in Table 1.1.

2.3 Building Classification Trees

Now that we have introduced classification trees, we can formally state the classification tree construction problem by instantiating the general classifier construction problem:

Classification Tree Construction Problem: Given a training dataset D of N independent identically distributed samples from Ω, sampled according to probability distribution P, find a classification tree T such that the misclassification rate functional R_P(C_T) of the corresponding classifier C_T is minimized.

The main issue with solving the classification tree problem in particular, and the classifier problem in general, is the fact that the classifier has to be a good predictor for the distribution, not for the sample made available from the distribution. This means that we cannot simply build a classifier that is as good as possible with respect to the available sample; it is easy to see that we can achieve zero error with arbitrary classification trees if we do not have contradicting examples, since the noise in the data will be learned as well. This noise learning phenomenon, called overfitting, is one of the main problems in classification.

For this reason, classification trees are built in two phases. In the first phase a tree as large as possible is constructed in a manner that minimizes the error with respect to some subset of the available data, a subset that we call training data. In the second phase the remaining samples, which we call the pruning data, are used to prune the large tree by removing subtrees in a manner that reduces the estimate of the generalization error computed using the pruning data. We discuss each of these two phases individually in what follows.

2.3.1 Tree Growing Phase

Several aspects of decision tree construction have been shown to be NP-hard. Some of these are: building optimal trees from decision tables (Hyafil & Rivest, 1976), constructing a minimum cost classification tree to represent a simple function (Cox et al., 1989), and building classification trees that are optimal in terms of the size needed to store the information in a dataset (Murphy & Mccraw, 1991).

In order to deal with the complexity of choosing the split attributes and split sets and points, most of the classification tree construction algorithms use the greedy induction schema in Figure 2.1.

Input: node T, data-partition D, split selection method V
Output: classification tree T for D rooted at T

Top-Down Classification Tree Induction Schema:
BuildTree(Node T, data-partition D, split attribute selection method V)
(1)  Apply V to D to find the split attribute X for node T.
(2)  Let n be the number of children of T.
(3)  if (T splits)
(4)      Partition D into D_1, ..., D_n and label node T with split attribute X
(5)      Create children nodes T_1, ..., T_n of T and label the edge (T, T_i) with predicate q_(T,T_i)
(6)      foreach i ∈ {1, ..., n}
(7)          BuildTree(T_i, D_i, V)
(8)      endforeach
(9)  else
(10)     Label T with the majority class label of D
(11) endif

Figure 2.1: Classification Tree Induction Schema

It consists in deciding, at each step, upon a split attribute and a split set or point, if necessary, partitioning the data according to the newly determined split predicates, and recursively repeating the process on these partitions, one for each child. The construction process at a node is terminated when a termination condition is satisfied. The only difference between the two types of classification trees is the fact that for k-ary trees no split set needs to be determined for discrete attributes.
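To make the schema of Figure 2.1 concrete, here is a minimal sketch, not taken from the thesis, of greedy top-down induction for a binary tree; the split selection method V is assumed to be a function returning an (attribute, predicate) pair or None, the stopping rule is a simple minimum node size, and the Node class from the earlier prediction sketch is reused.

    # Hypothetical sketch of the top-down induction schema: pick a split with the
    # supplied selection method, partition the data, and recurse on the partitions.
    from collections import Counter

    def build_tree(data, select_split, min_size=5):
        """data: list of (attribute_dict, class_label) pairs.
        select_split(data) returns (attribute_name, predicate) or None."""
        labels = [label for _, label in data]
        if len(data) < min_size or len(set(labels)) == 1:
            return Node(label=Counter(labels).most_common(1)[0][0])   # leaf
        split = select_split(data)
        if split is None:                                             # no useful split
            return Node(label=Counter(labels).most_common(1)[0][0])
        attribute, predicate = split
        left = [(x, c) for x, c in data if predicate(x[attribute])]
        right = [(x, c) for x, c in data if not predicate(x[attribute])]
        if not left or not right:                                     # degenerate split
            return Node(label=Counter(labels).most_common(1)[0][0])
        return Node(split_attribute=attribute,
                    branches=[(predicate, build_tree(left, select_split, min_size)),
                              (lambda v: not predicate(v), build_tree(right, select_split, min_size))])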

We now discuss how the split attribute and split set or point are picked at each step in the recursive construction process, then show some common termination conditions.

Split Attribute Selection

At each step in the recursive construction algorithm, we have to decide on what attribute variable to split. The purpose of the split is to separate, as much as possible, the class labels from each other. To make this intuition useful, we need a metric that estimates how much the separation of the classes is improved when a particular split is performed. We call such a metric a split criterion or a split selection method. There is extensive research in the machine learning and statistics literature on devising split selection criteria that produce classification trees with high predictive accuracy (Murthy, 1997). We briefly discuss here only the ones relevant for our work.

A very popular class of split selection methods are the impurity-based ones (Breiman et al., 1984; Quinlan, 1986). The popularity is well deserved since studies have shown that this class of split selection methods has high predictive accuracy (Lim et al., 1997), and at the same time they are simple and intuitive. Each impurity-based split selection criterion is based on an impurity function Φ(p_1, ..., p_k), with p_j interpreted as the probability of seeing the class label c_j. Intuitively, the impurity function measures how impure the data is. It is required to have the following properties (Breiman et al., 1984):

1. to be concave: ∂²Φ(p_1, ..., p_k) / ∂p_i² < 0

2. to be symmetric in all its arguments, i.e., for π a permutation, Φ(p_1, ..., p_k) = Φ(p_π1, ..., p_πk)

3. to have a unique maximum at (1/k, ..., 1/k), when the mix of class labels is most impure

4. to achieve the minimum for (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1), when the mix of class labels is the most pure

With this, for a node T of the classification tree being built, the impurity at node T is:

    i(T) = Φ(P[C = c_1 | T], ..., P[C = c_k | T])

where P[C = c_j | T] is the probability that the class label is c_j given that the data reaches node T. We defer the discussion on how these statistics are computed to the end of this section.

Given a set Q of split predicates on attribute variable X that split a node T into nodes T_1, ..., T_n, we can define the reduction in impurity as:

    Δi(T, X, Q) = i(T) − Σ_{i=1}^{n} P[T_i | T] · i(T_i)
                = i(T) − Σ_{i=1}^{n} P[q_(T,T_i)(X) | T] · i(T_i)                    (2.3)

Intuitively, the reduction in impurity is the amount of purity gained by splitting, where the impurity after the split is the weighted sum of the impurities of the child nodes. By instantiating the impurity function we get the first two split selection criteria:

Gini Gain. This split criterion was introduced by Breiman et al. (1984). By setting the impurity function to be the Gini index,

    gini(T) = 1 − Σ_{j=1}^{k} P[C = c_j | T]²,

and plugging it into Equation 2.3 we get the Gini gain split criterion:

    GG(T, X, Q) = gini(T) − Σ_{i=1}^{n} P[q_(T,T_i)(X) | T] · gini(T_i)              (2.4)

For two class labels, the Gini gain takes the more compact form:

    GG_b(T, X, Q) = 2 (P[C = c_0 | T_1] P[T_1 | T] − P[C = c_0 | T] P[T_1 | T])² / (P[T_1 | T] (1 − P[T_1 | T]))    (2.5)

Information Gain. This split criterion was introduced by Quinlan (1986). By setting the impurity function to be the entropy of the dataset,

    entropy(T) = − Σ_{j=1}^{k} P[C = c_j | T] log P[C = c_j | T],

and plugging it into Equation 2.3 we get the information gain split criterion:

    IG(T, X, Q) = entropy(T) − Σ_{j=1}^{n} P[q_j(X) | T] · entropy(T_j)              (2.6)

Gain Ratio. Quinlan introduced this adjusted version of the information gain to remove the preference of information gain for attribute variables with large domains (Quinlan, 1986):

    GR(T, X, Q) = IG(T, X, Q) / ( − Σ_{j=1}^{|Dom(X)|} P[X = x_j | T] log P[X = x_j | T] )    (2.7)
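The following small sketch, not from the thesis, shows how the Gini gain and information gain of a candidate split could be computed from class counts; the function names and the use of empirical frequencies as the probabilities are assumptions made for illustration.

    # Hypothetical helpers computing impurity-based split criteria from class counts.
    # Each partition is a list of class counts; probabilities are empirical estimates.
    import math

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log(c / n, 2) for c in counts if c > 0)

    def impurity_gain(parent_counts, children_counts, impurity=gini):
        """Reduction in impurity (Equation 2.3) for a candidate split."""
        n = sum(parent_counts)
        weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
        return impurity(parent_counts) - weighted

    # Split on Car Type (sedan vs. sports/truck) for Table 1.1, class = Lives in
    # Suburb?: the parent has 6 "yes" and 8 "no"; the sedan child has (5 yes, 2 no)
    # and the other child (1 yes, 6 no).
    print(impurity_gain([6, 8], [[5, 2], [1, 6]]))                    # Gini gain
    print(impurity_gain([6, 8], [[5, 2], [1, 6]], impurity=entropy))  # information gain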

Two other popular split selection methods come from the statistics literature:

The χ² Statistic (test).

    χ²(T, X) = Σ_{i=1}^{|Dom(X)|} Σ_{j=1}^{k} (P[X = x_i | T] P[C = c_j | T] − P[X = x_i, C = c_j | T])² / (P[X = x_i | T] P[C = c_j | T])    (2.8)

estimates how much the class labels depend on the value of the split attribute. Notice that the χ²-test does not depend on the set Q of split predicates. A known result in the statistics literature, see for example (Shao, 1999), is the fact that the χ²-test has, asymptotically, a χ² distribution with (|Dom(X)| − 1)(k − 1) degrees of freedom.

The G²-statistic.

    G²(T, X, Q) = 2 · N_T · IG(T, X, Q) · log_e 2,                                    (2.9)

where N_T is the number of records at node T. Asymptotically, the G²-statistic also has a χ² distribution (Mingers, 1987). Interestingly, it is identical to the information gain up to a multiplicative constant, which immediately gives an asymptotic approximation for the distribution of the information gain.

Note that all split criteria except the χ²-test take the set of split predicates as an argument. For discrete attribute variables in k-ary classification trees, the set of predicates is completely determined by specifying the attribute variable, but this is not the case for discrete variables in binary trees or for continuous variables. In these last two situations we also have to determine the best split set or point in order to evaluate how good a split on a particular attribute variable is.
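As a small illustration, not from the thesis, the dependence measure of Equation 2.8 can be computed from the contingency table of an attribute against the class labels; the function name is an assumption, and multiplying the result by the number of records N_T gives the familiar Pearson form of the statistic.

    # Hypothetical sketch: the chi-square style dependence measure of Equation 2.8,
    # computed from a contingency table of attribute values (rows) by class labels
    # (columns), using empirical probabilities.
    def chi_square_measure(table):
        n = sum(sum(row) for row in table)                 # N_T, number of records
        row_p = [sum(row) / n for row in table]            # P[X = x_i | T]
        col_p = [sum(col) / n for col in zip(*table)]      # P[C = c_j | T]
        total = 0.0
        for i, row in enumerate(table):
            for j, count in enumerate(row):
                expected = row_p[i] * col_p[j]             # under independence
                observed = count / n                       # P[X = x_i, C = c_j | T]
                total += (expected - observed) ** 2 / expected
        return total

    # Car Type vs. Lives in Suburb? from Table 1.1: sedan (5 yes, 2 no),
    # sports (0 yes, 4 no), truck (1 yes, 2 no).
    table = [[5, 2], [0, 4], [1, 2]]
    print(chi_square_measure(table), 14 * chi_square_measure(table))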

Split Set Selection for Discrete Attributes

Most of the split set selection methods proposed in the literature use the same split criterion used for split attribute selection in order to evaluate all possible splits and select the best one as the split set. This method is referred to as exhaustive search, since all possible splits of the set of values of an attribute variable are evaluated, at least in principle. In general, this process of finding the split set is computationally intensive except when the domain of the split attribute and the number of class labels are small.

There is, though, a notable exception due to Breiman et al. (1984), where there is an efficient algorithm to find the split set: the case when there are only two class labels and an impurity-based selection criterion is used. Since this algorithm is relevant for some parts of our work, we describe it here. Let us first start with the following:

Theorem 1 (Breiman et al. (1984)). Let I be a finite set, q_i, r_i, i ∈ I be positive quantities and Φ(x) a concave function. For I_1, I_2 a partitioning of I, an optimum of the problem

    argmin_{I_1, I_2}  (Σ_{i ∈ I_1} q_i) Φ( (Σ_{i ∈ I_1} q_i r_i) / (Σ_{i ∈ I_1} q_i) )  +  (Σ_{i ∈ I_2} q_i) Φ( (Σ_{i ∈ I_2} q_i r_i) / (Σ_{i ∈ I_2} q_i) )

has the property that:

    ∀ i ∈ I_1, ∀ j ∈ I_2 :  r_i < r_j

A direct consequence of this theorem is an efficient algorithm to solve this type of optimization problem: order the elements of I in increasing order of r_i and consider only the splits of I that respect this order. The correctness of the algorithm is guaranteed by the fact that the optimum split will be among the splits considered.

With this, setting I = Dom(X), q_i = P[X = x_i | T], r_i = P[C = c_0 | X = x_i, T], and Φ(x) to be the Gini index or the entropy for the two class label case (both are concave),

    gini(T) = 2 P[C = c_0 | T] (1 − P[C = c_0 | T])
    entropy(T) = − P[C = c_0 | T] ln(P[C = c_0 | T]) − (1 − P[C = c_0 | T]) ln(1 − P[C = c_0 | T]),

the optimization criterion, up to a constant factor, is exactly the Gini gain or the information gain. Thus, to efficiently find the best split set, we order the elements of Dom(X) in increasing order of r_i = P[C = c_0 | X = x_i, T] and consider splits only in this order. Since all the split criteria we introduced, except the χ²-test, use either the Gini gain or the information gain multiplied by a factor that does not depend on the split set, this fast split set selection method can be used for all of them.
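A minimal sketch, not from the thesis, of this fast split set selection for two class labels: attribute values are ordered by the empirical estimate of P[C = c_0 | X = x_i, T] and only order-respecting splits are evaluated. The function name is an assumption, and the impurity_gain helper from the sketch earlier in this section is reused.

    # Hypothetical sketch of split set selection for a discrete attribute with two
    # class labels: sort the attribute values by the fraction of class c_0 and
    # evaluate only order-respecting splits (Theorem 1).
    def best_split_set(counts):
        """counts: dict value -> (count_c0, count_c1). Returns (best_set, best_gain)."""
        order = sorted(counts, key=lambda v: counts[v][0] / sum(counts[v]))
        total = [sum(c[i] for c in counts.values()) for i in (0, 1)]
        best_set, best_gain = None, -1.0
        for cut in range(1, len(order)):                 # |Dom(X)| - 1 candidate splits
            left_set = order[:cut]
            left = [sum(counts[v][i] for v in left_set) for i in (0, 1)]
            right = [total[i] - left[i] for i in (0, 1)]
            gain = impurity_gain(total, [left, right])   # Gini gain from earlier sketch
            if gain > best_gain:
                best_set, best_gain = set(left_set), gain
        return best_set, best_gain

    # Car Type counts from Table 1.1: value -> (yes, no), with c_0 = "yes".
    print(best_split_set({"sedan": (5, 2), "sports": (0, 4), "truck": (1, 2)}))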

It is worth mentioning that Loh and Shih (1997) proposed a different technique that consists in transforming the values of discrete attributes into continuous values and using split point selection methods for continuous attributes to obtain the split for discrete attributes.

Split Point Selection for Continuous Attributes

Two methods have been proposed in the literature to deal with the split point selection problem for continuous attributes: exhaustive search and Quadratic Discriminant Analysis.

Exhaustive search uses the same split selection criterion as the split attribute selection method and consists in evaluating all the possible ways to split the domain of the continuous attribute in two parts. To make the process efficient, the available data is first sorted on the attribute variable that is being evaluated and then traversed in order. At the same time, the sufficient statistics are incrementally maintained and the value of the split criterion is computed for each split point. This means that the overall process requires a sort and a linear traversal with constant processing time per value. Most of the classification tree construction algorithms proposed in the literature use exhaustive search.
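A minimal sketch, not from the thesis, of exhaustive split point search for a continuous attribute with two classes: sort once, sweep the sorted values while maintaining running class counts, and score each candidate split point, here with the impurity_gain helper from the earlier sketch. The function and variable names are illustrative.

    # Hypothetical sketch of exhaustive split point selection: one sort plus one
    # linear sweep, updating the left/right class counts incrementally.
    def best_split_point(values, labels, classes=("yes", "no")):
        data = sorted(zip(values, labels))                       # sort on the attribute
        total = [sum(1 for l in labels if l == c) for c in classes]
        left = [0, 0]
        best_point, best_gain = None, -1.0
        for idx in range(len(data) - 1):
            value, label = data[idx]
            left[classes.index(label)] += 1                      # move record to the left
            if value == data[idx + 1][0]:
                continue                                         # not a valid cut point
            right = [total[i] - left[i] for i in (0, 1)]
            gain = impurity_gain(total, [left[:], right])        # Gini gain of X <= s
            if gain > best_gain:
                # split point s halfway between consecutive distinct values
                best_point, best_gain = (value + data[idx + 1][0]) / 2.0, gain
        return best_point, best_gain

    # Driver Age vs. Lives in Suburb? from Table 1.1.
    ages = [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 25, 45, 23, 45]
    suburb = ["yes", "no", "no", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
    print(best_split_point(ages, suburb))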

Loh and Shih (1997) proposed using Quadratic Discriminant Analysis (QDA) to find the split point for continuous attributes, and showed that, from the point of view of the accuracy of the produced trees, it is as good as exhaustive search. An apparent problem with QDA is that it works only for two class label problems. Loh and Shih (1997) suggested a solution to this problem: group the class labels into two super-classes based on some class similarity and define QDA and the split set problem in terms of these super-classes. This method can be used to deal with the intractability of finding splits for categorical attributes when the number of classes is larger than two.

We now briefly describe QDA. The idea is to approximate the distribution of the data-points with the same class label with a normal distribution, and to take as the split point the point between the centers of the two distributions that has equal probability of belonging to each of the distributions. More precisely, for a continuous attribute X, the parameters of the two normal distributions, the probability α_i of belonging to the distribution, the mean µ_i and the variance σ_i², are determined with the formulae:

    α_i = P[C = c_i | T]
    µ_i = E[X | C = c_i, T]
    σ_i² = E[X² | C = c_i, T] − µ_i²

and the equation of the split point µ is:

    (α_1 / (σ_1 √(2π))) e^(−(µ − µ_1)² / (2σ_1²)) = (α_2 / (σ_2 √(2π))) e^(−(µ − µ_2)² / (2σ_2²))

which reduces to the following quadratic equation for the split point:

    µ² (1/σ_1² − 1/σ_2²) − 2µ (µ_1/σ_1² − µ_2/σ_2²) + (µ_1²/σ_1² − µ_2²/σ_2²) = 2 ln(α_1/α_2) − ln(σ_1²/σ_2²)    (2.10)

If σ_1² is very close to σ_2², solving the second order equation is not numerically stable. In this case it is preferable to solve the linear equation:

    2µ(µ_1 − µ_2) = µ_1² − µ_2² − 2σ_1² ln(α_1/α_2)

which is numerically solvable as long as µ_1 ≠ µ_2.

To compute the Gini gain of the variable X with split point µ we just need to compute the sufficient statistics P[C = c_i | X ≤ µ, T], P[C = c_i | X > µ, T], and

    P[X ≤ µ | T] = P[C = c_0 | T] P[X ≤ µ | C = c_0, T] + P[C = c_1 | T] P[X ≤ µ | C = c_1, T],

and plug them into Equation 2.5. The probability P[X ≤ µ | C = c_0, T] is nothing but the cumulative distribution function (c.d.f.) of the normal distribution with mean µ_1 and variance σ_1² at point µ, that is:

    P[X ≤ µ | C = c_0, T] = ∫_{−∞}^{µ} (1/(σ_1 √(2π))) e^(−(x − µ_1)²/(2σ_1²)) dx = (1/2) (1 + Erf((µ − µ_1)/(σ_1 √2)))
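The following sketch, not from the thesis, computes a QDA split point for one continuous attribute and two classes by fitting a Gaussian per class and solving Equation 2.10, falling back to the linear equation when the variances are nearly equal; the function name, class encoding and tie-breaking between the two roots are assumptions made only to illustrate the computation.

    # Hypothetical sketch of QDA split point selection for two classes: fit one
    # Gaussian per class, then solve the equal-density equation (2.10).
    import math

    def qda_split_point(values, labels, c0="yes"):
        x0 = [v for v, l in zip(values, labels) if l == c0]
        x1 = [v for v, l in zip(values, labels) if l != c0]
        n = float(len(values))
        a0, a1 = len(x0) / n, len(x1) / n                       # alpha_i = P[C = c_i | T]
        m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)           # mu_i
        v0 = sum(v * v for v in x0) / len(x0) - m0 * m0         # sigma_i^2
        v1 = sum(v * v for v in x1) / len(x1) - m1 * m1
        if abs(v0 - v1) < 1e-9 * max(v0, v1):                   # nearly equal variances:
            return (m0 * m0 - m1 * m1 - 2 * v0 * math.log(a0 / a1)) / (2 * (m0 - m1))
        A = 1 / v0 - 1 / v1                                     # quadratic of Eq. (2.10)
        B = -2 * (m0 / v0 - m1 / v1)
        C = m0 * m0 / v0 - m1 * m1 / v1 - (2 * math.log(a0 / a1) - math.log(v0 / v1))
        disc = math.sqrt(max(B * B - 4 * A * C, 0.0))
        roots = [(-B - disc) / (2 * A), (-B + disc) / (2 * A)]
        mid = (m0 + m1) / 2                                     # take the root closer to
        return min(roots, key=lambda r: abs(r - mid))           # the class means

    ages = [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 25, 45, 23, 45]
    suburb = ["yes", "no", "no", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
    print(qda_split_point(ages, suburb))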

P[X ≤ µ | C = c_1, T] is similarly obtained. The advantage of QDA is the fact that no sorting of the data is necessary. The sufficient statistics (see the next section) can be easily computed in a single pass over the data in any order, and solving the quadratic equation gives the split point.

Stopping Criteria

The recursive process of constructing classification trees has to be eventually stopped. The most popular stopping criterion, and the one we use throughout the thesis, is to stop the growth of the tree when the number of data-points on which the decision is based goes below a prescribed minimum. By stopping the growth when only a small amount of data is available, we avoid taking statistically insignificant decisions that are likely to be very noisy, thus wrong. Other possibilities are to stop the tree growth when no predictive attribute can be found (this can be quite damaging to the construction algorithm, since no single variable might be predictive but a combination of variables can be) or when the tree has reached a maximum height.

Computing the Sufficient Statistics

So far, we have seen how the classification tree construction process can be reduced to sufficient statistics computation for every node. Here we explain how the sufficient statistics can be estimated using the training data. The idea is to use the usual empirical estimates; throughout the thesis we use the symbol =ᵉ to denote the empirical estimate of a probability or expectation. This means that:

1. for probabilities of the form P[p(X_j) | T], with p(X_j) some predicate on attribute variable X_j, the estimate is simply the number of data-points in the training dataset at node T, D_T, for which the predicate p(X_j) holds, over the overall number of data-points in D_T:

    P[p(X_j) | T] =ᵉ |{(x, c) ∈ D_T : p(x_j)}| / |D_T|

2. for conditional probabilities of the form P[p(X_j) | C = c_0, T], the estimate is:

    P[p(X_j) | C = c_0, T] =ᵉ |{(x, c_0) ∈ D_T : p(x_j)}| / |{(x, c_0) ∈ D_T}|

3. for expectations of functions of attributes, like E[f(X_j) | T], the estimate is simply the average value of the function applied to the attribute for the data-points in D_T:

    E[f(X_j) | T] =ᵉ ( Σ_{(x,c) ∈ D_T} f(x_j) ) / |D_T|

where f(x) is the function whose expectation is being estimated

4. for expectations of the form E[f(X_j) | C = c_0, T], the estimate is:

    E[f(X_j) | C = c_0, T] =ᵉ ( Σ_{(x,c_0) ∈ D_T} f(x_j) ) / |{(x, c_0) ∈ D_T}|

Note that the estimates for all these sufficient statistics can be computed in a single pass over the data. Gehrke et al. (1998) explain how these sufficient statistics can be efficiently computed using limited memory and secondary storage.
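As a small illustration, not from the thesis, the following sketch accumulates in one pass over the records of a node the counts from which all of the empirical estimates above can be formed; the function name and the dictionary layout are assumptions made for illustration.

    # Hypothetical single-pass accumulation of sufficient statistics at a node:
    # per (attribute value, class) counts for discrete attributes and per-class
    # sums / sums of squares for continuous attributes.
    from collections import defaultdict

    def sufficient_statistics(records, discrete, continuous, class_attr):
        stats = {"n": 0,
                 "class_counts": defaultdict(int),
                 "value_class_counts": {a: defaultdict(int) for a in discrete},
                 "moments": {a: defaultdict(lambda: [0, 0.0, 0.0]) for a in continuous}}
        for row in records:                       # one pass, any order
            c = row[class_attr]
            stats["n"] += 1
            stats["class_counts"][c] += 1
            for a in discrete:
                stats["value_class_counts"][a][(row[a], c)] += 1
            for a in continuous:
                m = stats["moments"][a][c]        # [count, sum, sum of squares]
                m[0] += 1
                m[1] += row[a]
                m[2] += row[a] ** 2
        return stats

    # e.g. the per-class mean and variance of Driver Age follow from the moments.
    rows = [{"Car Type": "sedan", "Driver Age": 23, "Lives in Suburb?": "yes"},
            {"Car Type": "sports", "Driver Age": 31, "Lives in Suburb?": "no"}]
    s = sufficient_statistics(rows, ["Car Type"], ["Driver Age"], "Lives in Suburb?")
    print(s["class_counts"])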

2.3.2 Pruning Phase

In this thesis we use exclusively Quinlan's re-substitution error pruning (Quinlan, 1993a). A comprehensive overview of other pruning techniques can be found in (Murthy, 1997).

Re-substitution error pruning consists in eliminating subtrees in order to obtain the tree with the smallest error on the pruning set, a separate part of the data used only for pruning. To achieve this, every node estimates its contribution to the error on the pruning data when the majority class is used as the prediction. Then, starting from the leaves and going upward, every node compares its contribution to the error when using the local prediction with the smallest possible contribution to the error of its children (if a node is not a leaf in the final tree, it has no contribution to the error; only leaves contribute), and prunes the tree if the local error contribution is smaller, which results in the node becoming a leaf. Since, after any node is visited, the subtree rooted at it is optimally pruned (this is the invariant maintained), when the overall process finishes the whole tree is optimally pruned.
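A minimal sketch, not from the thesis, of bottom-up re-substitution error pruning: each node's error on the pruning data when predicting its majority class is compared with the sum of its children's pruned errors, and the subtree is cut when the local error is no larger. The node fields are the illustrative ones used in the earlier sketches.

    # Hypothetical sketch of re-substitution error pruning: returns the number of
    # pruning-set errors of the optimally pruned subtree rooted at `node`, turning
    # the node into a leaf whenever predicting its majority class is at least as good.
    from collections import Counter

    def prune(node, pruning_data):
        labels = [c for _, c in pruning_data]
        majority, majority_count = Counter(labels).most_common(1)[0] if labels else (None, 0)
        local_error = len(labels) - majority_count            # errors if node becomes a leaf
        if not node.branches:                                 # already a leaf
            return sum(1 for _, c in pruning_data if c != node.label)
        children_error = 0
        for predicate, child in node.branches:                # route pruning data downward
            subset = [(x, c) for x, c in pruning_data if predicate(x[node.split_attribute])]
            children_error += prune(child, subset)
        if labels and local_error <= children_error:          # cut the subtree
            node.branches, node.label = [], majority
            return local_error
        return children_error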

2.4 Regression Trees

We start with the formal definition of the regression problem and we present regression trees, a particular type of regressors. We have the random variables X_1, ..., X_m as in the previous section, to which we add the random variable Y, with the real line as its domain, that we call the predicted attribute or output. A regressor R is a function R : Dom(X_1) × ... × Dom(X_m) → Dom(Y). Now, if we let the set of events be Ω = Dom(X_1) × ... × Dom(X_m) × Dom(Y), we can define probability measures P over Ω. Using such a probability measure and some loss function L (e.g., the square loss function L(a, x) = (a − x)²), we can define the regressor error as

    R_P(R) = E_P[L(Y, R(X_1, ..., X_m))]

where E_P is the expectation with respect to probability measure P. In this thesis we use only the square loss function. With this we have:

Regressor Construction Problem: Given a training dataset D of N independent identically distributed samples from Ω, sampled according to probability distribution P, find a function R that minimizes the functional R_P(R).

Regression trees, the particular type of regressors we are interested in, are the natural generalization of classification trees for regression problems. Instead of associating a class label with every node, a real value or a functional dependency on some of the inputs is used. Regression trees were introduced by Breiman et al. (1984) and implemented in their CART system. Regression trees in CART are binary trees, have a constant numerical value in the leaves and use the variance as the measure of impurity. Thus the split selection measure is:

    Err(T) = Σ_{i=1}^{N_T} (y_i − ȳ_T)²                                              (2.11)

    ΔErr(T) = Err(T) − Err(T_1) − Err(T_2)                                           (2.12)

where ȳ_T is the average of the predicted attribute over the examples that correspond to node T. The reason for using the variance as the impurity measure is the fact that the best constant predictor in a node is the average of the value of the predicted variable on the examples that correspond to the node; the variance is thus the mean square error of the average used as a predictor.
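As a small illustration, not from the thesis, the following sketch scores a candidate regression split by the reduction in the sum of squared deviations from the node averages (Equations 2.11 and 2.12); the function names are illustrative.

    # Hypothetical sketch of the CART regression split measure: Err(T) is the sum of
    # squared deviations of y from the node mean, and a split is scored by
    # Delta Err = Err(T) - Err(T_1) - Err(T_2).
    def sse(ys):
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    def variance_reduction(ys, left_mask):
        left = [y for y, m in zip(ys, left_mask) if m]
        right = [y for y, m in zip(ys, left_mask) if not m]
        if not left or not right:
            return 0.0
        return sse(ys) - sse(left) - sse(right)

    # Score the split Children <= 0 when predicting Driver Age from Table 1.1.
    ages = [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 25, 45, 23, 45]
    children = [0, 1, 1, 2, 0, 0, 0, 1, 2, 1, 0, 1, 2, 0]
    print(variance_reduction(ages, [c <= 0 for c in children]))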

An alternative split criterion proposed by Breiman et al. (1984), and also used in (Torgo, 1997a), is based on the sample variance as the impurity measure:

    Err_S(T) = Var(Y | T) =ᵉ (1/N_T) Err(T)

    ΔErr_S(T) = Err_S(T) − P[T_1 | T] Err_S(T_1) − P[T_2 | T] Err_S(T_2)

Interestingly, if the maximum likelihood estimate is used for all the probabilities and expectations, as is usually done in practice, we have the following connection between the variance and sample variance criteria:

    ΔErr_S(T) =ᵉ Err(T)/N_T − (N_{T_1}/N_T)(Err(T_1)/N_{T_1}) − (N_{T_2}/N_T)(Err(T_2)/N_{T_2}) = ΔErr(T)/N_T

Due to this connection, if there are no missing values, optimizing one of the criteria also results in optimizing the other.

For a categorical attribute variable X, the split that maximizes ΔErr_S(T) can be found very efficiently, since the objective function in Theorem 1 with

    Φ(x) = −x²,   q_i = P[X = x_i | T],   r_i = E[Y | X = x_i, T]

is exactly this criterion up to additive and multiplicative constants that do not influence the solution (Breiman et al., 1984). This means that we can simply order the elements in Dom(X) in increasing order of E[Y | X = x_i, T] and consider splits only in this order. If the empirical estimates are used for q_i = P[X = x_i | T] and r_i = E[Y | X = x_i, T], the criterion ΔErr(T) is optimized as well.

As in the case of classification trees, prediction is made by navigating the tree


Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

Review of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations

Review of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations Review of Lecture 1 This course is about finding novel actionable patterns in data. We can divide data mining algorithms (and the patterns they find) into five groups Across records Classification, Clustering,

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

Basing Decisions on Sentences in Decision Diagrams

Basing Decisions on Sentences in Decision Diagrams Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Basing Decisions on Sentences in Decision Diagrams Yexiang Xue Department of Computer Science Cornell University yexiang@cs.cornell.edu

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4] 1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Definitions Data. Consider a set A = {A 1,...,A n } of attributes, and an additional

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

Information Theory & Decision Trees

Information Theory & Decision Trees Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr Decision tree classifiers Decision tree representation for modeling dependencies among input variables using

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Day 3: Classification, logistic regression

Day 3: Classification, logistic regression Day 3: Classification, logistic regression Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 20 June 2018 Topics so far Supervised

More information

UVA CS 4501: Machine Learning

UVA CS 4501: Machine Learning UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course

More information

Learning Methods for Linear Detectors

Learning Methods for Linear Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han Copyright 05 Richard Han All rights reserved. CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Decision Tree And Random Forest

Decision Tree And Random Forest Decision Tree And Random Forest Dr. Ammar Mohammed Associate Professor of Computer Science ISSR, Cairo University PhD of CS ( Uni. Koblenz-Landau, Germany) Spring 2019 Contact: mailto: Ammar@cu.edu.eg

More information

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

More information

Machine Learning 2nd Edi7on

Machine Learning 2nd Edi7on Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edi7on CHAPTER 9: Decision Trees ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr h1p://www.cmpe.boun.edu.tr/~ethem/i2ml2e

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Data classification (II)

Data classification (II) Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

More information

CSCI 5622 Machine Learning

CSCI 5622 Machine Learning CSCI 5622 Machine Learning DATE READ DUE Mon, Aug 31 1, 2 & 3 Wed, Sept 2 3 & 5 Wed, Sept 9 TBA Prelim Proposal www.rodneynielsen.com/teaching/csci5622f09/ Instructor: Rodney Nielsen Assistant Professor

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Lecture VII: Classification I. Dr. Ouiem Bchir

Lecture VII: Classification I. Dr. Ouiem Bchir Lecture VII: Classification I Dr. Ouiem Bchir 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University 2018 CS420, Machine Learning, Lecture 5 Tree Models Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html ML Task: Function Approximation Problem setting

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Predictive Modeling: Classification. KSE 521 Topic 6 Mun Yi

Predictive Modeling: Classification. KSE 521 Topic 6 Mun Yi Predictive Modeling: Classification Topic 6 Mun Yi Agenda Models and Induction Entropy and Information Gain Tree-Based Classifier Probability Estimation 2 Introduction Key concept of BI: Predictive modeling

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Supervised Learning via Decision Trees

Supervised Learning via Decision Trees Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions

More information

Machine Learning 3. week

Machine Learning 3. week Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

Classification and Prediction

Classification and Prediction Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect

More information

Classification and regression trees

Classification and regression trees Classification and regression trees Pierre Geurts p.geurts@ulg.ac.be Last update: 23/09/2015 1 Outline Supervised learning Decision tree representation Decision tree learning Extensions Regression trees

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Dyadic Classification Trees via Structural Risk Minimization

Dyadic Classification Trees via Structural Risk Minimization Dyadic Classification Trees via Structural Risk Minimization Clayton Scott and Robert Nowak Department of Electrical and Computer Engineering Rice University Houston, TX 77005 cscott,nowak @rice.edu Abstract

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Decision Trees Part 1. Rao Vemuri University of California, Davis

Decision Trees Part 1. Rao Vemuri University of California, Davis Decision Trees Part 1 Rao Vemuri University of California, Davis Overview What is a Decision Tree Sample Decision Trees How to Construct a Decision Tree Problems with Decision Trees Classification Vs Regression

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Tutorial 6. By:Aashmeet Kalra

Tutorial 6. By:Aashmeet Kalra Tutorial 6 By:Aashmeet Kalra AGENDA Candidate Elimination Algorithm Example Demo of Candidate Elimination Algorithm Decision Trees Example Demo of Decision Trees Concept and Concept Learning A Concept

More information