Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms


Marco Sandri and Paola Zuccolotto
University of Brescia, Department of Quantitative Methods, C.da Santa Chiara 50, Brescia, Italy

Abstract. Variable selection is one of the main problems faced by data mining and machine learning techniques. For the most part, these techniques are more or less explicitly based on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, like Random Forests and Gradient Boosting Machines. In spite of their wide use, some measures of this class are known to be biased and some correction strategies have been proposed. The aim of this paper is twofold. First, to investigate the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits. Second, a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, is extended to the entire class of TDNI measures and its performance is investigated in the regression framework using simulated and real data.

Corresponding author: Paola Zuccolotto, Quantitative Methods Department, University of Brescia, C.da Santa Chiara 50, Brescia, Italy. zuk@eco.unibs.it

1 Introduction

In the last decades, with the proliferation of large datasets, the problem of variable selection has gained increasing attention in the field of data analysis. Given a large set of observed covariates that describe the phenomenon under study, the researcher often needs to identify the subset of informative (predictive) variables and to set uninformative (noisy) variables apart. A preliminary variable selection is a fundamental and crucial step in model building, whatever the approach to model the phenomenon might be. Many of the variable selection methods proposed in the literature are directly or indirectly based on the assumption that, given the set $X = \{X_1, \ldots, X_p\}$ of potential predictors for a response variable $Y$, an importance or relevance $\mu_i$ can be defined for each covariate $X_i$ in terms of prediction/explanation of $Y$, and that this measure can be evaluated from data using some variable importance estimator $VI_i$. The notion of importance has been widely investigated in the philosophical, AI, machine learning and statistical literature, and several attempts to formalize and quantify it have appeared. See [Bell and Wang, 2000] for a brief overview of the current lines of research and [van der Laan, 2006] for a novel approach. In the present work, following [Pearl, 1988], we start by identifying unimportance with conditional independence of random variables and importance with the negation of unimportance.

This paper focuses on measures of variable importance developed in the area of tree-based ensemble methods. Decision trees model data by partitioning the feature space into a set of disjoint rectangles and then fitting a simple model (e.g. a constant) to each one. Classification and Regression Trees (CART) were introduced by Breiman, Friedman, Olshen and Stone in 1984 and are a milestone in this field. Since then, a great number of developments have been proposed in several disciplines [Murthy, 2004]. One of the main problems of tree predictors is model instability, defined as the existence of many different models, distant in terms of form and interpretation, that have about the same training or test set error [Breiman, 1996]. Ensemble learning is a class of methods developed for reducing model instability and improving the accuracy of a predictor through the aggregation of several similar predictors. Each ensemble member is constructed from a different function of the input covariates, and the ensemble prediction is obtained by a linear combination of the predictions of the ensemble members. Ensembles can be built using different prediction methods, i.e. using different base learners as ensemble members. An interesting proposal uses CART as base learners. Typically, such aggregation neutralizes the effects of tree instability and arrives at greater accuracy by reducing either the bias or the variance of a single classifier [Bühlmann and Yu, 2002]. Popular examples of tree-based ensembles are Random Forests (RF, [Breiman, 2001]) and the Gradient Boosting Machine (GBM, [Friedman, 2001]).

The first approach to variable importance (VI, henceforth) measurement in tree-based predictors dates back to the book of [Breiman et al., 1984], where an interesting and effective notion of variable importance was proposed. The importance $\mu_i$ of a covariate $X_i$ is defined as the total decrease of heterogeneity of the response variable $Y$ given by the knowledge of $X = \{X_1, \ldots, X_p\}$ when the feature space is partitioned recursively. The VI measure originated by this notion is obtained by summing up all the decreases of the heterogeneity index in the nodes of the tree. This class of measures is called Total Decrease in Node Impurity (TDNI henceforth) and, with minor changes, is used in many tree-based ensemble methods (see e.g. [Breiman, 2002], [Friedman, 2001]). It is also available in many data mining software packages, like the randomForest package in R [Breiman et al., 2006], the gbm package in R [Ridgeway, 2007], the boost Stata command [Schonlau, 2005] and the MART package in R [Friedman, 2002].

In spite of their wide use, some TDNI measures are known to be biased. [Breiman et al., 1984] first noted that they tend to favor covariates having more values (i.e. fewer missing values, more categories or more distinct numerical values) and thus offering more splits. [White and Liu, 1994], [Kononenko, 1995] and [Dobra and Gehrke, 2001] investigated in greater detail the nature of bias and elucidated the relation between bias and the number of values of covariates. [Strobl, 2005], [Strobl et al., 2007b] and [Sandri and Zuccolotto, 2008] focused attention on the bias of the Gini variable importance measure (hereafter Gini TDNI), a measure frequently used in classification trees and based on the adoption of the Gini gain as the splitting criterion for tree nodes. Several methods have been proposed in the last decade for eliminating bias from the Gini TDNI. [Loh and Shih, 1997] and [Kim and Loh, 2001] proposed to modify the algorithm for the construction of classification trees in order to avoid selection bias; the authors showed that bias can be eliminated by separating, at each node, variable selection from split point selection. In the work of [Strobl, 2005], an unbiased estimation of the Gini TDNI was found in Conditional Random Forests, a new class of RF developed by [Hothorn et al., 2006]. [Strobl et al., 2007b] derived the exact distribution of the maximally selected Gini gain by means of a combinatorial approach, and the resulting p-value is suggested as an unbiased split selection criterion in recursive partitioning algorithms. The heuristic correction strategy proposed by [Sandri and Zuccolotto, 2008] is based on the introduction of a set of random pseudocovariates in the X matrix; the authors showed that the algorithm can efficiently remove bias from the Gini TDNI in RF and GBM.

The aim of this paper is twofold. First, to investigate the source and the characteristics of bias in TDNI measures by introducing the notions of informative and uninformative splits, showing its connections with the level of covariate measurement and with the number of uninformative splits. Second, to generalize and extend the domain of applicability of the correction algorithm of [Sandri and Zuccolotto, 2008] to the class of TDNI measures, evaluating its performance on simulated and real data in regression problems when the residual sum of squares is used as splitting criterion for the tree nodes.

The paper is organized as follows. In Section 2 we define the class of TDNI measures and the corresponding estimators for single trees and tree-based ensembles. In Section 3 we define the crucial notion of informative and uninformative splits and show that uninformative splits are the main source of bias for TDNI measures. Section 4 investigates the relationship between the bias of TDNI measures and the level of covariate measurement, from a theoretical point of view and by means of some simulation experiments. In Section 5 the bias-correction strategy for classification problems proposed by [Sandri and Zuccolotto, 2008] is recalled and extended to the class of TDNI measures. The performances of the method are tested on simulated (Section 5) and real data (Section 6). Section 7 concludes.

2 Characterization of TDNI measures

Let $(Y, X) : \Omega \to (D_Y \times D_{X_1} \times \cdots \times D_{X_p}) \equiv D$ be a vector random variable defined on a probability space $(\Omega, \mathcal{F}, P)$, where $X = \{X_1, \ldots, X_p\}$ is a set of covariates and $Y$ a response variable. A tree-structured binary recursive partitioning algorithm yields a hierarchical partition of the domain $D$ into $J$ disjoint (hyper-)rectangles $R_j \subseteq D$, $j = 1, 2, \ldots, J$. Each rectangle is generated in $D$ by splitting a parent rectangle in two parts through a binary split of the domain of a covariate $X_i$. Therefore, a rectangle can be described by the set of covariates and splits used to generate it. If $Y$, $X_1$ and $X_2$ are three numerical real-valued random variables, a rectangle could for example be $R_j = \{(y, x) \in \mathbb{R}^3 : x_1 > a,\ b \le x_2 \le c\}$, where $a \in D_{X_1} \subseteq \mathbb{R}$ and $b, c \in D_{X_2} \subseteq \mathbb{R}$, $b < c$.

Consider a value $s \in D_{X_i|R_j}$, where $D_{X_i|R_j}$ is the domain of $X_i$ restricted to the rectangle $R_j$. The impurity reduction generated in the rectangle $R_j$ by $X_i$ at the cutpoint $s$ is given by:

$$ d^s_{ij} = \Delta H_Y(X_i, R_j) = p_j \left\{ H_Y - \left( p_{jL}\, H_{Y|X_i \le s} + p_{jR}\, H_{Y|X_i > s} \right) \right\}, \qquad (1) $$

where $p_j = P(R_j)$, $p_{jL} = P(X_i \le s \mid R_j)$ and $p_{jR} = P(X_i > s \mid R_j)$. $H_Y$, $H_{Y|X_i \le s}$ and $H_{Y|X_i > s}$ are the heterogeneity indexes of $Y$ in the $j$th rectangle and in the left and right splits of $R_j$, respectively. Let $d_{ij}$ be the maximum heterogeneity reduction allowed by covariate $X_i$ in the $j$th rectangle, over all the possible cutpoints $s \in D_{X_i|R_j}$:

$$ d_{ij} = \max_{s \in D_{X_i|R_j}} d^s_{ij}. \qquad (2) $$

The goal of partitioning algorithms is to maximally reduce the heterogeneity of $Y$ within the rectangles. Therefore, for each $R_j$, the splitting variable $X_i$ and the cutpoint $s$ are those that maximize the impurity reduction in that subset. In other words, the partitioning variable $X_i$ satisfies in $R_j$ the condition $d_{ij} > d_{hj}$ for $h = 1, 2, \ldots, p$, $h \ne i$.
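As a concrete illustration of the split-selection step, the following R sketch computes the empirical counterpart of the impurity reduction (1) for a numeric covariate within a single rectangle, using the variance of $Y$ as heterogeneity index, and searches for the best cutpoint as in (2). The function names and the convention $p_j = 1$ (root node) are illustrative choices, not taken from any package.

```r
# Illustrative sketch: empirical impurity reduction d_ij^s of equation (1)
# for a numeric covariate x inside one rectangle, with the variance of y as
# heterogeneity index, and the cutpoint search of equation (2).
# Probabilities are replaced by sample proportions.
var0 <- function(v) mean((v - mean(v))^2)      # variance with denominator n

impurity_reduction <- function(y, x, s, p_j = 1) {
  left <- x <= s
  p_jl <- mean(left)                           # estimate of P(X_i <= s | R_j)
  p_jr <- 1 - p_jl
  p_j * (var0(y) - (p_jl * var0(y[left]) + p_jr * var0(y[!left])))
}

best_split <- function(y, x, p_j = 1) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]                  # candidate cutpoints s
  d <- vapply(cuts, function(s) impurity_reduction(y, x, s, p_j), numeric(1))
  list(d_ij = max(d), cutpoint = cuts[which.max(d)])
}
```

Within a node, the splitting covariate is then the one whose best-split value $d_{ij}$ is largest, as required by the condition $d_{ij} > d_{hj}$ above.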

In this context, TDNI measures of variable importance are based on the following notion of importance $\mu_i$ of a covariate $X_i$: $\mu_i$ is the total decrease of the heterogeneity index $H_Y$ attributable to $X_i$. In other words, $\mu_i$ is computed by summing up all the decreases of heterogeneity $d_{ij}$ obtained in the rectangles generated using $X_i$ as splitting variable:

$$ \mu_i = \sum_{j \in J} d_{ij} I_{ij}, \qquad (3) $$

where $I_{ij}$ is the indicator function which equals 1 if the $i$th variable is used to split $R_j$ and 0 otherwise. Several impurity/heterogeneity indexes $H$ have been proposed for the case of a categorical $Y$: the Pearson chi-squared statistic, the Gini criterion, the entropy criterion, the families of splitting criteria of [Shih, 1999], etc. When $Y$ is numerical, the most popular measure $H$ is the variance. In the following example we show how, according to equation (3), the importance $\mu_i$ can be calculated using the joint probability distribution of $(Y, X)$. The data generating process described in this example will be used to produce the dataset of Example 2 in Section 3 and of Simulations (4) and (5) in Section 4.

Example 1. (Calculation of variable importance) Consider the variable $(Y, X) = \{Y, X_1, X_2, X_3\}$, where $X_1$ and $X_2$ are two binary 0/1 independent covariates and $Y$ is a continuous standard normal response variable generated by the following data generating process (see Fig. 1(a)):

$P(X_1 = 0) = P(X_2 = 0) = 1/2$,
$(Y \mid X_1 = 0) \sim N(-1/2, 3/4)$, $(Y \mid X_1 = 1) \sim N(1/2, 3/4)$,
$(Y \mid X_2 = 0) \sim N(-1/3, 8/9)$, $(Y \mid X_2 = 1) \sim N(1/3, 8/9)$,
$(Y \mid X_1 = 0, X_2 = 0) = (Y \mid X_1 = 0, X_2 = 1) \sim N(-1/2, 3/4)$,
$(Y \mid X_1 = 1, X_2 = 0) \sim N(0, 1/2)$, $(Y \mid X_1 = 1, X_2 = 1) \sim N(1, 1/2)$.

Consider the following three cases for the uninformative variable $X_3$:

Case A: a binary 0/1 covariate independent of $X_1$ and $X_2$;
Case B: a continuous standard normal covariate independent of $X_1$ and $X_2$;
Case C: a continuous covariate, normally distributed conditionally on $X_1$: $(X_3 \mid X_1 = 0) \sim N(0, 1)$ and $(X_3 \mid X_1 = 1) \sim N(1, 1)$.

Variable importances can be calculated in the three cases A, B and C. Since $Y$ is continuous, we can adopt the variance as heterogeneity index and (1) can be expressed as:

$$ d^s_{ij} = \Delta\sigma^2_Y(X_i, R_j) = p_j \left\{ \sigma^2_Y - \left( p_{jL}\, \sigma^2_{Y|X_i \le s} + p_{jR}\, \sigma^2_{Y|X_i > s} \right) \right\}, \qquad (4) $$

where $\sigma^2_Y$, $\sigma^2_{Y|X_i \le s}$ and $\sigma^2_{Y|X_i > s}$ are the variances of $Y$ in the $j$th rectangle and in the left and right splits, respectively.

Case A and B. The calculation of variable importance is the same in the two cases, because $Y$ is stochastically independent of $X_3$; the different levels of measurement of $X_3$ in the two cases do not influence variable importance. When the whole sample space is considered (i.e. $R_1 = D$), $X_1$ is the most effective variable in reducing the heterogeneity of $Y$ by means of a binary split, because $d_{11} = p_1 \cdot 1/4 = 1/4 > d_{21} = p_1 \cdot 1/9 = 1/9 > d_{31} = 0$. The sample space is then partitioned according to $X_1$. The sample space conditioned to $X_1 = 1$ ($R_3$) can be further partitioned by $X_2$, since $d_{23} = p_3 \cdot 1/4 = P(X_1 = 1) \cdot 1/4 = 1/8 > d_{33} = 0$. The sample space conditioned to $(X_1 = 1) \cap (X_2 = 0)$ ($R_4$) cannot be further partitioned because $d_{34} = 0$; the same is true in the sample space conditioned to $(X_1 = 1) \cap (X_2 = 1)$ ($R_5$). Similarly, no further partitioning is possible in the sample space conditioned to $X_1 = 0$ ($R_2$) because $d_{22} = d_{32} = 0$. Hence, the VIs of the three covariates are $\mu_1 = d_{11} = 1/4$, $\mu_2 = d_{23} = 1/8$, $\mu_3 = 0$.

Case C. In this case $X_3$ is no longer independent of $Y$, due to the relationship existing between $X_3$ and $X_1$; here $X_3$ is independent of $Y$ conditionally on $X_1$. We now show that the VIs of the three covariates are the same as in cases A and B, because in $R_1 = D$ the covariate $X_1$ remains the most effective variable in reducing the heterogeneity of $Y$ by means of a binary split. This can be proved as follows. Let $s$ be a cutpoint for $X_3$ in $R_1$ and let $P(X_1 = 0 \mid X_3 \le s) = p$ and $P(X_1 = 0 \mid X_3 > s) = q$. The distributions of $Y$, conditionally on $X_3$ lower and greater than $s$, are mixtures of normal variables, that is $f(y \mid X_3 \le s) = p f(y \mid X_1 = 0) + (1 - p) f(y \mid X_1 = 1)$ and $f(y \mid X_3 > s) = q f(y \mid X_1 = 0) + (1 - q) f(y \mid X_1 = 1)$, where $f(y \mid X \in A)$ is the density function of $y$ given that $X \in A$. It is easy to show that

$$ \sigma^2_{Y|X_3 \le s} = 3/4 + p(1 - p) > \sigma^2_{Y|X_1 = 0} = \sigma^2_{Y|X_1 = 1} = 3/4, \qquad \sigma^2_{Y|X_3 > s} = 3/4 + q(1 - q) > \sigma^2_{Y|X_1 = 0} = \sigma^2_{Y|X_1 = 1} = 3/4. $$

It follows that, for all $s$,

$$ \sigma^2_Y - \left( P(X_1 = 0)\, \sigma^2_{Y|X_1 = 0} + P(X_1 = 1)\, \sigma^2_{Y|X_1 = 1} \right) > \sigma^2_Y - \left( P(X_3 \le s)\, \sigma^2_{Y|X_3 \le s} + P(X_3 > s)\, \sigma^2_{Y|X_3 > s} \right), $$

thus $d_{11} > d_{31}$. When the sample space is partitioned according to $X_1$, $X_3$ is independent of $Y$ in the two generated rectangles, so the domain partition is the same as in cases A and B, as well as the resulting VI measures $\mu_1$, $\mu_2$ and $\mu_3$.

[Figures (1) approximately here]

We consider now the case of a tree $t$ built using a sample of size $N$. The impurity reduction $\hat d_{ij}$ at node $j$ attributable to covariate $X_i$ with cutpoint $s$ can be estimated by:

$$ \hat d^s_{ij} = \widehat{\Delta H}_Y(X_i) = \frac{n_j}{N} \left\{ \hat H_Y - \left( \frac{n_{jL}}{n_j} \hat H_{Y|X_i \le s} + \frac{n_{jR}}{n_j} \hat H_{Y|X_i > s} \right) \right\}, \qquad (5) $$

where $\hat H_Y$, $\hat H_{Y|X_i \le s}$ and $\hat H_{Y|X_i > s}$ are the estimated heterogeneities of $Y$ in the $j$th rectangle and in the left and right splits, respectively, and $n_j$, $n_{jL}$, $n_{jR}$ are the sample sizes in node $j$ and in the left and right splits. Similarly, $d_{ij}$ is estimated by:

$$ \hat d_{ij} = \max_{s \in S_{ij}} \hat d^s_{ij}, \qquad (6) $$

where $S_{ij}$ is the set of available cutpoints of variable $X_i$ at node $j$. The covariate $X_i$ is selected at node $j$ as splitting variable if $\hat d_{ij} > \hat d_{hj}$ for all $h = 1, \ldots, p$, $h \ne i$. The estimate of the TDNI importance $\mu_i$ using a tree $t$ is given by the sum of the estimated impurity reductions attributable to covariate $X_i$ over the set $J$ of nonterminal nodes of the tree [Breiman et al., 1984], that is:

$$ VI_i(t) = \sum_{j \in J} \hat d_{ij} I_{ij}. \qquad (7) $$

In the regression case we can use the sample variance $\hat\sigma^2$ as an estimator of the heterogeneity of nodes and splits. Hence, (5) becomes:

$$ \hat d^s_{ij} = \widehat{\Delta\sigma^2_Y}(X_i) = \frac{n_j}{N} \left\{ \hat\sigma^2_Y - \left( \frac{n_{jL}}{n_j} \hat\sigma^2_{Y|X_i \le s} + \frac{n_{jR}}{n_j} \hat\sigma^2_{Y|X_i > s} \right) \right\} \qquad (8) $$

$$ = \frac{n_j}{N} \left\{ \frac{DEV_{total}(j)}{n_j} - \frac{DEV_{within}(j_L, j_R)}{n_j} \right\} = \frac{1}{N}\, DEV_{between}(j_L, j_R), \qquad (9) $$

where $DEV_{within}$ and $DEV_{between}$ are the within-node and the between-node deviance, respectively. From (9) one can derive that, in the regression case, the TDNI measure (7) of a covariate $X_i$ is equal to the total amount of $DEV_{between}$ imputable to that covariate in the tree. For tree-based ensembles the VI measure is given by the average of $VI_i(t)$ over the set of $T$ trees:

$$ VI_i = \frac{1}{T} \sum_{t=1}^{T} VI_i(t). \qquad (10) $$

The VI measure (10) has been proposed by [Breiman, 2002] in Random Forests and is called Measure 4 (M4), because it is the last of a set of four importance measures. With minor modifications, [Friedman, 2001] proposed an influence measure of input variables for GBM, with $\hat d^2_{ij}$ in place of $\hat d_{ij}$ and $\widehat{VI}_i$ rescaled by assigning a value of 100 to the most influential covariate.
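In practice the ensemble measure (10) does not have to be assembled by hand: for regression forests, for example, the randomForest package in R already reports the total decrease in node impurity (residual sum of squares) averaged over the trees, which is the sample counterpart of (7)-(10) up to the $1/N$ factor in (9). A minimal sketch on simulated data follows; the dataset and the settings are illustrative only.

```r
# Minimal sketch: reading a TDNI / M4-type measure from a fitted regression
# Random Forest. The simulated data are purely illustrative.
library(randomForest)

set.seed(1)
n <- 400
X <- data.frame(x1 = rnorm(n),
                x2 = rnorm(n),
                x3 = factor(sample(letters[1:4], n, replace = TRUE)))
y <- X$x1 + rnorm(n)

rf <- randomForest(x = X, y = y, ntree = 1000, mtry = 2, nodesize = 5)

# Type 2 importance is the total decrease in node impurity from splits on
# each variable, averaged over the ntree trees (column "IncNodePurity").
importance(rf, type = 2)
```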

3 Bias in TDNI measures

A crucial point for the analysis that follows is the notion of informative and uninformative splits. Suppose that $D$ has been recursively partitioned into $J$ rectangles $\{R_j\}_{j=1,2,\ldots,J}$. If $X_i$ and $Y$ are stochastically independent, they continue to be independent in each $R_j$. On the contrary, if some association between $X_i$ and $Y$ exists, $X_i$ and $Y$ could be dependent or conditionally independent in a given $R_j$. In other words, when predicting $Y$, uninformative covariates (i.e. those stochastically independent of $Y$) always remain uninformative, in each subset of the sample space. Informative covariates (i.e. those somehow associated with $Y$) can continue to be informative or can become uninformative in $R_j$.

Suppose now to grow a tree using a sample of $N$ units and suppose that, within a given node, there is at least one covariate having some association with $Y$. The node will be split by using the best covariate, that is the covariate that maximizes the heterogeneity reduction $\hat d_{ij}$. Hence, because the heterogeneity reductions of informative covariates will typically be greater than the heterogeneity reductions of the uninformative ones, an informative covariate will be chosen as splitting variable. We define this circumstance as an informative split. When within a node there are no informative covariates, only uninformative covariates and/or informative covariates which became uninformative can be chosen as splitting covariate. This is the case of an uninformative split. We can formalize the following definition.

Definition 1: (Informative and uninformative splits) Given a tree $t$ grown from a sample, the split of a node $j$ made by covariate $X_i$ (i.e. $\hat d_{ij} > \hat d_{hj}$, $h = 1, 2, \ldots, p$, $h \ne i$) is called uninformative if $d_{ij} = 0$ and informative otherwise. An uninformatively-split node is a node where an uninformative split occurs.

In informative splits, the heterogeneity reduction $\hat d_{ij}$ of the splitting covariate is a direct consequence of its importance. Differently, in an uninformative split, $\hat d_{ij}$ is a product of chance. Therefore, when calculating the TDNI measure $VI_i(t)$ of covariate $X_i$ by (7), it is of fundamental importance to distinguish between impurity reductions attributable to informative splits and impurity reductions generated by uninformative splits. In other words, $VI_i(t)$ can be expressed as the sum of two components:

$$ VI_i(t) = \sum_{j \in J_I} \hat d_{ij} I_{ij} + \sum_{j \in J_U} \hat d_{ij} I_{ij} = \hat\mu_i(t) + \varepsilon_i(t), \qquad (11) $$

where $J_I$ and $J_U$, $J_I \cup J_U = J$, are the sets of nodes characterized by informative and uninformative splits, respectively. $\hat\mu_i(t)$ is the part of the VI measure attributable to informative splits and directly related to the true importance of $X_i$. On the contrary, the term $\varepsilon_i(t)$ is a noisy component associated with the selection of $X_i$ within uninformative splits and is a source of bias for $VI_i(t)$.

[Table 1: Sample data of Example 2 approximately here]

Example 2. (Informative and uninformative splits) Consider the sample of 16 units given in Table 1, generated by the process of Example 1, Case A. Suppose to fit a fully-grown regression tree to these data. Applying (5), in the root node $X_1$ yields the largest impurity reduction ($\hat d_{11} = 0.266$, against $\hat d_{21} = 0.214$ for $X_2$), so $X_1$ is the best splitting variable at the root node and the split is informative because $d_{11} \ne 0$ (and $d_{11} > d_{21} > d_{31}$). In node 2 ($X_1 = 0$), $X_2$ is used as splitting variable, since $\hat d_{22} > \hat d_{32}$; this is an uninformative split because $d_{22} = 0$ (and $d_{32} = 0$). In node 3 ($X_1 = 1$), the splitting variable is again $X_2$ and the split is informative, since $d_{23} \ne 0$ (and $d_{23} > d_{33}$). In nodes 4, 5, 6 and 7 the splits are made by $X_3$ with small impurity reductions (e.g. $\hat d_{35} = 0.012$ and $\hat d_{36} = 0.006$) and are all uninformative. Nodes 8 through 15 are leaf nodes. The resulting regression tree is shown in Fig. 1(b), where uninformative splits have been marked by thick lines. The estimated VIs of the three covariates are $VI_1 = \hat\mu_1 = 0.266$, $VI_2 = \hat\mu_2 + \varepsilon_2$ and $VI_3 = \varepsilon_3$.

Two remarks are worth pointing out. First, the sample data used to grow the tree are also used to calculate TDNI measures. In other words, TDNI measures are a class of in-sample measures of variable importance. This is the crucial difference between TDNI measures and the mean decrease in prediction accuracy, a popular permutation-based VI measure.

The latter is defined in RF as follows: for each tree, the algorithm randomly rearranges the values of the $i$th variable for the out-of-bag set (i.e. the subset of the bootstrap sample not used in the construction of the tree), puts this permuted set down the tree, and gets new predictions from the forest. The importance of the $i$th variable is defined as the difference between the original out-of-bag error rate and the out-of-bag error rate for the randomly permuted $i$th covariate. Hence, the mean decrease in accuracy is fundamentally an out-of-sample VI measure, in contrast to the in-sample character of TDNI measures.

Second, the notion of informative and uninformative splits is intimately related to the notion of overfitting. It is well known that an overfitted model shows a high in-sample accuracy but does not validate, that is, does not provide accurate predictions for out-of-sample observations. Base learners of RF are fully-grown trees, which carry a serious risk of overfitting [Berk, 2006]. When building a CART, the model starts learning the underlying structure of the data, and typically the first splits of the tree are informative splits. Subsequently, after an adequate number of informative splits, informative variables become uninformative, uninformative splits take place and the model learns the fine structure of the data that is generated by noise. In other words, overfitting and uninformative splits in tree-based models are synonymous.

The in-sample character of TDNI measures and the effects of overfitting can lead to a seeming paradox. Consider the case where only one covariate $X$ is available, $X$ and $Y$ are continuous, a sample of $N$ units has been observed and $\hat d_{ij}$ as defined in (8) is used. In a fully-grown tree the importance of $X$ is equal to the variance of $Y$, no matter what relationship between $X$ and $Y$ exists. Consider the two limiting cases: the deterministic case $Y = f(X)$ and the null case (i.e. $Y$ and $X$ are stochastically independent). The importance (7) of $X$ is the same in the two cases, but in the first case the (maximal) tree contains only informative splits and $VI = \hat\mu$, while in the second case only uninformative splits are present and $VI = \varepsilon$. This problem vanishes if the mean decrease in prediction accuracy is used instead of TDNI measures, or if one avoids uninformative splits when building the tree.

Pruning techniques are effective methods for controlling overfitting in CART. The aim of pruning is to remove uninformative splits: in well-pruned trees the number of uninformative splits is minimized and, hence, the bias of TDNI measures is minimized too. In the context of RF, unpruned trees are typically used as base learners because the RF prediction is obtained by averaging the predictions of the single trees, and this neutralizes the problem of overfitting. Pruning seems to be unnecessary, but we know that this is only partially true: there is no need for pruning to improve the accuracy of prediction, but when we use RF for variable selection we cannot forget that fully-grown unpruned trees can generate a substantial amount of bias in TDNI measures.
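For comparison, the out-of-bag permutation measure described earlier in this section is available from the same kind of fitted forest when the permutation importance is requested at training time. A short sketch, reusing the illustrative data of the previous snippet (the numbers themselves carry no substantive meaning):

```r
# Sketch: out-of-sample permutation importance (mean decrease in accuracy,
# reported as %IncMSE for regression) versus the in-sample TDNI measure.
rf2 <- randomForest(x = X, y = y, ntree = 1000, mtry = 2,
                    nodesize = 5, importance = TRUE)
importance(rf2, type = 1)   # out-of-bag, permutation-based
importance(rf2, type = 2)   # in-sample TDNI (IncNodePurity)
```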

4 Level of bias and level of covariate measurement

In a tree grown from a sample of $N$ units, each covariate $X_i$ can be used as splitting variable only in a finite number of nodes. At node $j$, $X_i$ is characterized by a finite number $nps_{ij}$ of possible binary splits, which depends on the level of measurement of the covariate. A nominal covariate with $k$ categories within a given node has $nps_{ij} = 2^{k-1} - 1$ possible splits, while an ordinal covariate with $k$ categories has $nps_{ij} = k - 1$ possible splits. A numerical (continuous) covariate with $n_j$ distinct values within node $j$ can be viewed as the limiting case of an ordinal covariate with as many categories as the number of sample units in the node; thus, it has $nps_{ij} = n_j - 1$. Of course, $nps_{ij} \le nps_{ik}$ for all parent nodes $k$ of a node $j$, because each child node contains only a subset of the original sample and each covariate in a node has a number of distinct values (or categories) lower than or equal to the number of its distinct values (or categories) in the parent nodes.

Consider a node where all the covariates $X_i$ are conditionally independent of $Y$. By definition, only an uninformative split can take place in this node. All the binary partitions of all the covariates have the same probability of being the best one, and the selection of the splitting variable is only a product of chance. Therefore, the covariates with the highest number $nps_{ij}$ of possible splits are more likely to be chosen as splitting variables. Recalling the decomposition given in (11), the above considerations imply that the expected values of the noisy components $\varepsilon_i$ of the estimated VIs are not equal, but depend on the level of measurement of the covariates.

Let $J_U$ be the set of all the possible uninformatively-split nodes, that is the set of all the nodes where an uninformative split occurs, for all the trees grown on all the possible $N$-size samples. We have:

$$ E(\varepsilon_i) = E\Big( \sum_{j \in J_U} \hat d_{ij} I_{ij} \Big) = \sum_{j \in J_U} E\big( (\hat d_{ij} I_{ij})\, I_j \big), $$

where $I_j$ is the indicator function which equals 1 if the uninformatively-split node $j$ occurs in the tree $t$ and 0 otherwise. Since the occurrence of a given node $j$ depends only on former splits, $I_j$ and $\hat d_{ij} I_{ij}$ are independent and we can write:

$$ E(\varepsilon_i) = \sum_{j \in J_U} E(\hat d_{ij} I_{ij})\, E(I_j) = \sum_{j \in J_U} E(\hat d_{ij} I_{ij})\, q_j, $$

where $q_j$ is the probability of occurrence of node $j$ in a tree. Finally, applying the law of iterated expectation, we obtain:

$$ E(\varepsilon_i) = \sum_{j \in J_U} E(\hat d_{ij} \mid I_{ij} = 1)\, p_{ij}\, q_j, \qquad (12) $$

where $p_{ij}$ is the probability of selecting covariate $X_i$ at node $j$. By definition, in uninformatively-split nodes all the covariates are independent of the response variable and $E(\hat d_{ij} \mid I_{ij} = 1) = d_j$ for all $i = 1, 2, \ldots, p$.

On the contrary, $p_{ij}$ depends on the number $nps_{ij}$ of possible splits of $X_i$ at node $j$:

$$ p_{ij} = \frac{nps_{ij}}{\sum_{h=1}^{p} nps_{hj}}. \qquad (13) $$

Equation (12) proves two fundamental facts: (a) covariate importances estimated by TDNI measures can show different levels of bias according to the different levels of measurement of the covariates, and (b) the source of bias is very closely connected to the selection mechanism of the splitting variable in uninformatively-split nodes. In the following two subsections, using some simulation experiments, we investigate these characteristics of bias in further detail: its dependence on the number of uninformative splits and on the covariates' level of measurement.

4.1 Simulation studies with a single regression tree

The main goal of the present study is to investigate the problem of bias in TDNI VI measures estimated by ensemble methods with fully-grown trees as base learners, like RF. Equations (10) and (11) show that bias in tree-based ensembles can be expressed as an average of the bias originated in the base learners. Hence, it is convenient to start our investigation of the source and the characteristics of bias by considering a single unpruned regression tree. Two simulation studies are performed. We consider a data generating process with 9 independent covariates: 1 binary variable (B), 4 ordinal variables (O4, O8, O16 and O32) with 4, 8, 16 and 32 categories and 4 nominal variables (N4, N8, N16 and N32) with 4, 8, 16 and 32 categories. The continuous outcome variable $Y$ is independent of these covariates (null case). The null case guarantees that only uninformative splits take place and estimated VIs are entirely dominated by bias.

Simulation (1). In the first numerical experiment we investigate the relationship between bias and the number of uninformative splits. We generate 100 random samples of size N = 3000. For each sample, VIs are estimated by a single unpruned tree, varying the nodesize parameter (the minimum size of terminal nodes) in the set {5, 6, 7, 8, 9, 10, 13, 15, 20, 25, 30, 60}. The number of leaf nodes of the trees, together with the estimated VIs of the 9 covariates, are collected in a dataset. For each covariate, we estimate a quadratic regression model between the number of leaf nodes and the VI. Fig. 2(a) shows the relationship existing between the mean level of bias predicted by the quadratic models and the number of uninformative splits ($R^2$ values range up to 0.949; the lowest value has been observed for the binary variable and the highest for N32). These results substantially agree with the information contained in equation (12): bias levels appear to be a non-decreasing function of the number of uninformative splits. The estimated functions show a statistically significant concavity. This fact can be explained considering equations (5) and (7): growing trees with increasing levels of ramification generates nodes with a decreasing size $n_j$, and $\hat d_{ij}$ is directly proportional to the node sample size $n_j$. Hence, a progressive increase in the number of leaf nodes produces in (7) the addition of terms that do not grow at the same speed, and increases of VI that are less than proportional.
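As a small numerical illustration of (13) (the covariates and the node size used here are hypothetical), the selection probabilities in an uninformatively-split node can be computed directly from the numbers of possible splits:

```r
# Illustrative computation of the selection probabilities p_ij of equation
# (13) in an uninformatively-split node with, say, n_j = 100 distinct values
# of a continuous covariate: nominal covariates with k categories allow
# 2^(k-1) - 1 splits, ordinal covariates k - 1, a continuous covariate n_j - 1.
n_j <- 100
nps <- c(B  = 2^(2 - 1) - 1,   # binary: 1 possible split
         O8 = 8 - 1,           # ordinal with 8 categories: 7 splits
         N8 = 2^(8 - 1) - 1,   # nominal with 8 categories: 127 splits
         C  = n_j - 1)         # continuous with n_j distinct values: 99 splits
round(nps / sum(nps), 3)       # p_ij: N8 and C dominate the selection
```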

Simulation (2). In this simulation we study the relationship between bias and the number of categories in ordinal and nominal covariates. We generate a set of 100 random samples of size N = 3000. For each sample, VIs are estimated by a single unpruned tree with the nodesize parameter fixed to 5. The estimated VIs of the 9 covariates are collected in a dataset and the relationship between bias and number of categories is analyzed by estimating two quadratic regression models: one for the set of nominal covariates and one for the ordinal covariates. The mean levels of bias predicted by the quadratic models are plotted in Fig. 2(b) against the logarithm of the number of covariate categories (the $R^2$ value of the two models is 0.997). According to equations (12) and (13), these curves show that bias levels grow monotonically with the number of categories. In addition, nominal variables have higher levels of bias than ordinal variables because at each node they allow a greater number of possible splits $nps_{ij}$ than ordinal variables with the same number of categories.

[Figures (2) approximately here]

4.2 Simulation studies with a tree-based learning ensemble

In this subsection we continue our investigation of the characteristics of bias of TDNI measures, taking into account tree-based ensembles. We present four numerical experiments where VIs are estimated using RF. We devote special attention to the effect of bias on the ranking of covariates sorted by decreasing levels of estimated importance.

Simulation (3). A set of r = 100 samples with N = 250 observations is generated using the null-case mechanism described in the previous subsection: 9 independent covariates with different levels of measurement and a continuous outcome variable independent of these covariates.

Simulation (4). In this experiment the data generating process of Example 1 (Case B) is used, with the addition of 6 random independent variables having different levels of measurement: a binary covariate (B), two ordinal (O6 and O11) and two nominal (N6 and N11) covariates and 1 continuous covariate (C2). The covariates are all mutually independent and only $X_1$ and $X_2$ are informative. We consider a set of r = 100 samples with N = 400 observations.

Simulation (5). Here we consider the case of correlated predictor variables. This simulation is similar to Simulation (4); the only difference relates to the two continuous covariates. $X_3$ and C2 are normally distributed conditionally on $X_1$ and B, respectively: $(X_3 \mid X_1 = 0) \sim N(0, 1)$, $(X_3 \mid X_1 = 1) \sim N(1, 1)$ and $(C2 \mid B = 0) \sim N(0, 1)$, $(C2 \mid B = 1) \sim N(1, 1)$.

Simulation (6). In the last simulation experiment we consider the case of a regression problem with $p > N$. The sample size is N = 50 and the number of covariates is p = 200. The covariates have been divided into 20 groups, each consisting of 10 ordinal covariates with the following numbers of categories: 2, 2, 3, 3, 4, 4, 6, 6, 8, 8. The first group $\mathbf{X}_1 = (X_1, X_2, \ldots, X_{10})$ contains mutually correlated covariates, and $X_1$, $X_3$, $X_5$, $X_7$ and $X_9$ are correlated with the outcome. In the second group $\mathbf{X}_2$, covariates are mutually independent and $X_{11}$, $X_{13}$, $X_{15}$, $X_{17}$ and $X_{19}$ are correlated with the outcome. The remaining 18 groups contain 180 covariates that are mutually independent and independent of the outcome. More specifically, r = 1000 repetitions have been generated by the following data generating process.

(1) N observations $(y, x^c_1, \ldots, x^c_p)$ are randomly drawn from a multivariate $(p + 1)$-dimensional Gaussian distribution with mean vector $\mu = (0, \ldots, 0)$ and covariance matrix

$$ \Sigma = \begin{pmatrix} 1 & \rho'_{Y\mathbf{X}_1} & \rho'_{Y\mathbf{X}_2} & \cdots & \rho'_{Y\mathbf{X}_{20}} \\ \rho_{Y\mathbf{X}_1} & \Sigma_{\mathbf{X}_1} & 0 & \cdots & 0 \\ \rho_{Y\mathbf{X}_2} & 0 & I & & 0 \\ \vdots & \vdots & & \ddots & \\ \rho_{Y\mathbf{X}_{20}} & 0 & 0 & & I \end{pmatrix}, $$

where $\rho_{Y\mathbf{X}_i}$ is the vector of the 10 correlations between the covariates of the $i$th group and the outcome $Y$, $\rho_{Y\mathbf{X}_1} = \rho_{Y\mathbf{X}_2} = (0.31, 0, 0.28, 0, 0.26, 0, 0.26, 0, 0.25, 0)$ and $\rho_{Y\mathbf{X}_3} = \cdots = \rho_{Y\mathbf{X}_{20}} = (0, \ldots, 0)$. The symbol 0 denotes a 10-dimensional square matrix of zeros, $I$ is the 10-dimensional identity matrix, and the generic element of $\Sigma_{\mathbf{X}_1}$ is given by $s_{ij} = 0.4^{|i-j|}$.

(2) Each covariate generated at step (1) is transformed into a categorical variable $X_i$, with a number $k_i$ of categories as given above. Categories are defined according to quantiles of the normal distribution: $X_i = k$ if $x^c_i \in (q_{(k-1)/k_i}, q_{k/k_i}]$, $k = 1, 2, \ldots, k_i$, where $q_h$ is the $h$-quantile of a standard normal distribution.
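A sketch of steps (1) and (2), under the specifications reported above (the correlation vector and the AR-type block $\Sigma_{\mathbf{X}_1}$ are taken from the text; object names are arbitrary):

```r
# Sketch of the Simulation (6) data generating process: multivariate normal
# draw with the block covariance matrix Sigma, then quantile-based
# categorization of each covariate. Object names are illustrative.
library(MASS)   # for mvrnorm

p <- 200; N <- 50
k_cat <- rep(c(2, 2, 3, 3, 4, 4, 6, 6, 8, 8), 20)        # categories per covariate
rho1  <- c(0.31, 0, 0.28, 0, 0.26, 0, 0.26, 0, 0.25, 0)  # rho_{Y,X_1} = rho_{Y,X_2}

Sigma <- diag(p + 1)
Sigma[1, 2:11]  <- Sigma[2:11, 1]  <- rho1               # group 1 vs Y
Sigma[1, 12:21] <- Sigma[12:21, 1] <- rho1               # group 2 vs Y
Sigma[2:11, 2:11] <- outer(1:10, 1:10,                   # Sigma_X1: s_ij = 0.4^|i-j|
                           function(i, j) 0.4^abs(i - j))

sim <- mvrnorm(N, mu = rep(0, p + 1), Sigma = Sigma)
y6  <- sim[, 1]
Xc  <- sim[, -1]

# Step (2): cut each continuous covariate at standard normal quantiles
X6 <- as.data.frame(lapply(seq_len(p), function(i) {
  cut(Xc[, i], breaks = qnorm(seq(0, 1, length.out = k_cat[i] + 1)),
      labels = FALSE, include.lowest = TRUE)
}))
names(X6) <- paste0("X", seq_len(p))
```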

The resulting data generating process has the following features: the outcome $Y$ is associated with covariates $X_1, X_3, X_5, \ldots, X_{19}$ with constant correlation ratios $\eta^2_{Y|X_i} = 0.05$; covariates $X_1, X_2, \ldots, X_{10}$ are mutually associated, with Cramér's $\nu$ indexes approximately given by $\nu_{i,j} \approx 0.4^{\,j-i-1}\, \nu_{i,i+1}$ for $j > i + 1$, where $\{\nu_{i,i+1}\}_{i=1,2,\ldots,9} \approx \{0.26, 0.29, 0.23, 0.24, 0.20, 0.21, 0.17, 0.18, 0.15\}$; covariates $X_{11}, X_{12}, \ldots, X_{200}$ are mutually independent and independent of covariates $X_1, X_2, \ldots, X_{10}$.

[Figures 3, 4, 5 and 6 approximately here]

The gray boxes of Figures 3(a), 4(a) and 5(a) show the distribution of the uncorrected TDNI measures estimated by means of RF for the three experiments (3), (4) and (5), respectively. Random Forests are implemented in the randomForest package [Liaw and Wiener (2002)] of the R language [R Development Core Team (2008)]. In the simulation experiments of this Section we have trained RFs with ntree=1000 regression trees, with mtry=5 variables randomly sampled as candidates at each split and with a minimum size of terminal nodes nodesize=5. The results of Simulation (3) substantially confirm what was already observed in Fig. 2(b): higher numbers of covariate categories are generally associated with higher levels of bias. A by-product of this relationship is an artificial ranking of covariates according to the number of categories instead of variable importance. In this simulation the covariates are equally important, because all are uninformative, but bias generates an erroneous ranking where the highest positions are achieved by the variables with the highest number of categories. The negative effects of VI bias on ranking are more evident in Simulations (4) and (5), where only $X_1$ and $X_2$ are informative. In these experiments the ranking of covariates using uncorrected VIs is clearly wrong: the uninformative covariates $X_3$, C2 and N11 are erroneously more important than the truly informative variable $X_2$. The average TDNI measures of Simulation (6) estimated using RF are visualized in Figure 6(a). A great amount of bias hides the informative covariates: their uncorrected importances are lower than the importances of many uninformative variables, and bias strongly distorts the ranking of variables. All these results undoubtedly show that an effective method for bias correction is necessary when using TDNI measures of variable importance.

5 A bias-correction strategy

This Section starts with a brief recall of the bias-correction strategy recently proposed by [Sandri and Zuccolotto, 2008]. Let $\mathbf{X}$ be the $(N \times p)$ matrix containing the N observed values of the p covariates $\{X_1, \ldots, X_p\}$. A set of matrices $\{\mathbf{Z}_s\}_{s=1}^{S}$ is generated by randomly permuting S times the N rows of $\mathbf{X}$.

We call pseudocovariates the columns of $\mathbf{Z}_s$. Row permutation destroys the association existing between the response variable $Y$ and each pseudocovariate $Z_i$; on the contrary, the association between two pseudocovariates $Z_i$ and $Z_j$ is preserved, i.e. it is equal to the association existing between $X_i$ and $X_j$. Horizontally concatenating the matrices $\mathbf{X}$ and $\mathbf{Z}_s$ generates a set of $(N \times 2p)$ matrices $\mathbf{X}_s \equiv [\mathbf{X}, \mathbf{Z}_s]$, $s = 1, 2, \ldots, S$. The augmented matrices $\mathbf{X}_s$ are repeatedly used to predict $Y$ using an ensemble predictor, and S importance measures are then computed for each covariate $X_i$ and for the corresponding pseudocovariate $Z_i$. Let $VI^s_{X_i}$ and $VI^s_{Z_i}$ be the $s$th importance measures of $X_i$ and $Z_i$, respectively. The adjusted variable importance measure $\widetilde{VI}_i$ is computed considering the average of the differences $VI^s_{X_i} - VI^s_{Z_i}$, that is:

$$ \widetilde{VI}_i = \frac{1}{S} \sum_{s=1}^{S} \left( VI^s_{X_i} - VI^s_{Z_i} \right). \qquad (14) $$

For more details about the algorithm and the principles that support and guide the procedure, see [Sandri and Zuccolotto, 2008]. This method was originally developed for classification tree-based ensembles when the Gini gain is used as splitting criterion, but it can be extended to the entire class of TDNI measures. In Section 3 we have shown that for this class of measures the decomposition of equation (11) holds. The proposed bias-correction method is crucially based on this decomposition: in the set $J_U$ of uninformatively-split nodes, each pseudocovariate $Z_i$ shares approximately the same properties (number of possible splits $nps_{ij}$, independence of $Y$) as the corresponding covariate $X_i$ and consequently has approximately the same probability $p_{ij}$ of being selected as splitting variable. Hence, the average importance of $Z_i$ calculated over the S random permutations is an approximation of the sum given in (12): $\mathrm{av}_s[VI^s_{Z_i}] \approx E[\varepsilon_{X_i}]$. In other words, $\frac{1}{S}\sum_{s=1}^{S} VI^s_{Z_i}$ can be used as an approximation of the bias affecting the importance of $X_i$. A direct consequence of the generalization of the bias-correction strategy to TDNI measures is the extension of its domain of applicability to regression problems.
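A minimal sketch of the correction (14), using a regression Random Forest as the ensemble predictor; the function name and the naming of pseudocovariates are illustrative, and the defaults simply mirror the settings used in the experiments reported below.

```r
# Minimal sketch of the pseudocovariate bias correction of equation (14),
# with a regression Random Forest as the ensemble predictor.
library(randomForest)

corrected_vi <- function(X, y, S = 25, ntree = 1000, mtry = 3, nodesize = 5) {
  p <- ncol(X)
  diffs <- replicate(S, {
    Z <- X[sample(nrow(X)), , drop = FALSE]   # row permutation -> pseudocovariates Z_s
    rownames(Z) <- NULL
    colnames(Z) <- paste0("Z", seq_len(p))    # hypothetical names for the copies
    rf <- randomForest(x = cbind(X, Z), y = y,
                       ntree = ntree, mtry = mtry, nodesize = nodesize)
    vi <- importance(rf, type = 2)[, 1]       # TDNI importances of [X, Z_s]
    vi[seq_len(p)] - vi[p + seq_len(p)]       # VI^s_{X_i} - VI^s_{Z_i}
  })
  rowMeans(diffs)                             # average over the S permutations
}

# Usage on the illustrative data of the earlier snippets:
# vi_adj <- corrected_vi(X, y, S = 25)
# vi_adj[vi_adj < 0] <- 0   # negative corrected values can be set to zero
```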

We investigate the effectiveness of the above bias-correction algorithm in the 4 regression problems related to the data generating processes described in Section 4.2. The white boxes of Figures 3(a), 4(a), 5(a) and 6(a) visualize the distribution of the corrected VIs (ntree=1000, mtry=3, nodesize=5 and number of random permutations S = 25). These experiments clearly show that bias correction using pseudocovariates has the power to reduce bias, even when a small number S of random permutations is used. The technique generates correct rankings of covariates according to their importance, and informative variables can be correctly discriminated from uninformative ones. The method also shows good performance when estimating the VIs of a large number of (mutually correlated, mutually independent, informative and uninformative) variables using a very limited number of sample units: all the uninformative covariates correctly have a null mean importance and all the mean VIs of the informative variables are greater than zero.

The optimal number of variables randomly selected in each node, i.e. the value of the parameter mtry that minimizes the out-of-bag prediction error rate of the RF, is typically a function of the number of covariates of the dataset. Extensive simulations (not reported here) show that, when adding the p pseudocovariates, it is not necessary to increase mtry and the optimal value chosen for the original dataset can still be used. This fact can be explained by considering that pseudocovariates are not additional potentially predictive variables, but only competitors of the original covariates. For each covariate, our method generates a twin uninformative pseudocovariate that, only in uninformative splits, participates in the competition for the best split with approximately the same selection probability as $X_i$.

For the sake of comparison, Figures 3(b), 4(b) and 5(b) show the distribution of the (unbiased) permutation measures computed by Conditional Random Forests (CRF). This choice is motivated by two considerations: (a) the unified framework for recursive partitioning used by CRF to grow single trees strongly reduces the number of uninformative splits, and (b) permutation measures are not affected by the kind of bias described in this paper. We have estimated CRF using the cforest command of the party package for R, with the following settings: 1000 trees, 3 randomly picked input variables in each node, minsplit = 5, a quadratic test statistic and Bonferroni-adjusted P-values (see [Hothorn et al., 2006] and [Strobl et al., 2007a]). The results clearly show that these measures and the proposed bias-corrected TDNI measures are qualitatively very similar. Hence, the two methods can be used interchangeably when one needs to calculate VIs. In Simulation (5) the procedure is not able to completely remove the bias due to the association between $X_3$ and $X_1$: a bias in favor of $X_3$ is still present. This effect can also be observed when using the permutation measure calculated by CRF (see Fig. 5(b)). A new conditional permutation scheme for the computation of VI in the case of correlated predictor variables has been recently proposed by [Strobl et al., 2008].
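A sketch of this benchmark with the party package, using the cforest_unbiased defaults for brevity rather than the exact control settings listed above; the data objects are the illustrative ones from the earlier snippets.

```r
# Sketch: conditional inference forest and its permutation importance,
# used here only as a comparison benchmark.
library(party)

dat <- data.frame(y = y, X)
cf  <- cforest(y ~ ., data = dat,
               controls = cforest_unbiased(ntree = 1000, mtry = 3))
varimp(cf)                        # permutation importance
# varimp(cf, conditional = TRUE)  # conditional scheme of Strobl et al. (2008)
```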

6 Case study

The Plasma-Retinol dataset is available at the StatLib Datasets Archive and contains 315 observations of 14 variables, aiming at investigating the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene and other carotenoids. The identification of determinants of low plasma concentrations of retinol and beta-carotene is important because observational studies have suggested that this circumstance might be associated with an increased risk of developing certain types of cancer.

Table 2: Variables in the Plasma-Retinol dataset

Response variables:
Betaplasma: Plasma beta-carotene (ng/ml)
Retplasma: Plasma retinol (ng/ml)

Covariates:
Age: Age (years)
Sex: Sex (1 = Male, 2 = Female)
Smokstat: Smoking status (1 = Never, 2 = Former, 3 = Current Smoker)
Quetelet: Quetelet index (weight/height^2)
Vituse: Vitamin use (1 = Yes, fairly often; 2 = Yes, not often; 3 = No)
Calories: Number of calories consumed per day
Fat: Grams of fat consumed per day
Fiber: Grams of fiber consumed per day
Alcohol: Number of alcoholic drinks consumed per week
Cholesterol: Cholesterol consumed (mg per day)
Betadiet: Dietary beta-carotene consumed (mcg per day)
Retdiet: Dietary retinol consumed (mcg per day)

Previous studies showed that plasma retinol levels tend to vary by age and sex, while the only dietary predictor seems to be alcohol consumption. For plasma beta-carotene, dietary intake, regular use of vitamins and fiber intake are associated with higher plasma concentrations, while the Quetelet index and cholesterol intake are associated with lower plasma levels (Nierenberg et al., 1989). We used the RF algorithm to predict Betaplasma and Retplasma as a function of the 12 covariates described in Table 2. The hyperparameters of the tree-based ensembles are ntree=3000, mtry=4, nodesize=5 and a number of random permutations S = 300. We computed the biased measures $VI_{X_i}$ (Fig. 7(a) and 8(a)) and the bias-corrected importance measures $\widetilde{VI}_{X_i}$ (Fig. 7(b) and 8(b)) for Betaplasma and Retplasma, respectively. We set negative values of $\widetilde{VI}$ to zero. For Betaplasma the corrected measures allow us to identify 5 mainly predictive covariates (Fiber, Betadiet, Vituse, Quetelet, Alcohol). Similarly, the most important predictors of Retplasma seem to be Alcohol, Sex and Age. Our results largely confirm the findings of the preceding analyses, except for the importance of Cholesterol and Alcohol in predicting Betaplasma.

The influence of bias correction on the ranking by importance is apparent. On one side, it makes it possible to discard predictors whose importance is artificially amplified by bias: on the basis of the biased VIs, Fat and Calories seem to be informative covariates for Betaplasma, but after bias removal their importances vanish. On the other side, bias correction can reveal informative predictors hidden by the bias in the VI of other predictors: in Fig. 8(a), Sex seems scarcely influential on Retplasma, but the correction procedure shows that this variable is one of the three most important covariates. Fig. 7(c) and 8(c) show the permutation-based VIs calculated using CRF; these estimates are very similar to the VIs obtained by the proposed algorithm.

[Figures (7) and (8) approximately here]

7 Concluding remarks

It is well known that the Gini VI measure, computed by means of tree-based learning ensembles, is affected by different kinds of bias ([Strobl, 2005]). The existence of bias in VI is potentially dangerous, especially in a variable selection perspective, since it can dramatically alter the ranking of predictors. The main source of bias for the class of TDNI variable importance measures is intimately connected to the tree-construction mechanism: covariates $X_i$ with the same importance in a node $j$ can have different probabilities $p_{ij}$ of being selected as splitting variables. Using theoretical considerations and simulation experiments, the present paper shows that the levels of this bias depend on the characteristics of the covariates, i.e. on their measurement levels. In addition, the analysis indicates that this kind of bias is generated by uninformative splits. These splits are binary partitions of sample units that are completely driven by chance, where the association between response variable and covariates has been entirely captured by earlier splits (the informative splits).

The heuristic bias-correction strategy of [Sandri and Zuccolotto, 2008] is based on the introduction of a set of pseudocovariates in the original dataset, in the spirit of [Wu et al., 2007]. Pseudocovariates are noise variables that are independent of the response variable and have the same correlation structure as the original covariates. Working on simulated and real-life regression problems, the paper shows that pseudocovariates have the capacity to approximate the component of the estimated VI attributable to bias. Hence, the proposed permutation method can reduce the bias due to different measurement levels of the covariates and can yield correct rankings of variables according to their importance. Of course, the drawback of repeatedly generating pseudocovariates is an additional computational burden. This is a problem common to all permutation-based procedures and it cannot be ignored when the method is applied to large-scale datasets.

However, it is fundamental to take into account that the use of the bias-correction method is necessary only when covariates with different levels of measurement are present. For example, genome-wide association studies typically collect data on thousands to hundreds of thousands of single nucleotide polymorphisms that all have the same characteristics (e.g. they are all real-valued variables). In these cases bias in TDNI measures does not affect the covariate ranking and therefore one can avoid applying any correction. Anyway, simulation experiments show that the number S of sets $\{\mathbf{Z}_s\}$ of pseudocovariates required for a satisfactory bias correction is often small and the extra computational effort is generally moderate even in medium-size problems.

References

[Bell and Wang, 2000] Bell, D. and Wang, H. (2000): A formalism for relevance and its application in feature subset selection. Machine Learning, 4(2).
[Berk, 2006] Berk, R.A. (2006): An Introduction to Ensemble Methods for Data Analysis. Sociological Methods & Research, 34(3).
[Breiman et al., 1984] Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984): Classification and Regression Trees. Chapman & Hall, London.
[Breiman, 1996] Breiman, L. (1996a): The heuristic of instability in model selection. Annals of Statistics, 24.
[Breiman, 2001] Breiman, L. (2001): Random Forests. Machine Learning, 45.
[Breiman, 2001b] Breiman, L. (2001b): Statistical modeling: the two cultures. Statistical Science, 16.
[Breiman, 2002] Breiman, L. (2002): Manual on setting up, using, and understanding Random Forests v3.1. Technical Report.
[Breiman et al., 2006] Breiman, L., Cutler, A., Liaw, A. and Wiener, M. (2006): Breiman and Cutler's Random Forests for Classification and Regression. R package.
[Bühlmann and Yu, 2002] Bühlmann, P. and Yu, B. (2002): Analyzing bagging. Annals of Statistics, 30(4).
[Dobra and Gehrke, 2001] Dobra, A. and Gehrke, J. (2001): Bias Correction in Classification Tree Construction. In: Brodley, C.E. and Danyluk, A.P. (eds.), Proceedings of the Seventeenth International Conference on Machine Learning, Williams College, Williamstown, MA, USA.


More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

BAGGING PREDICTORS AND RANDOM FOREST

BAGGING PREDICTORS AND RANDOM FOREST BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGIGNG PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS

More information

Gradient Boosting (Continued)

Gradient Boosting (Continued) Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks Tree-based methods Patrick Breheny December 4 Patrick Breheny STA 621: Nonparametric Statistics 1/36 Introduction Trees Algorithm We ve seen that local methods and splines both operate locally either by

More information

Statistical Consulting Topics Classification and Regression Trees (CART)

Statistical Consulting Topics Classification and Regression Trees (CART) Statistical Consulting Topics Classification and Regression Trees (CART) Suppose the main goal in a data analysis is the prediction of a categorical variable outcome. Such as in the examples below. Given

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

More information

measure in classification trees

measure in classification trees A bias correction algorithm for the Gini variable importance measure in classification trees Marco Sandri and Paola Zuccolotto University of Brescia - Department of Quantitative Methods C.da Santa Chiara

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes. Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

More information

Variable importance in RF. 1 Start. p < Conditional variable importance in RF. 2 n = 15 y = (0.4, 0.6) Other variable importance measures

Variable importance in RF. 1 Start. p < Conditional variable importance in RF. 2 n = 15 y = (0.4, 0.6) Other variable importance measures n = y = (.,.) n = 8 y = (.,.89) n = 8 > 8 n = y = (.88,.8) > > n = 9 y = (.8,.) n = > > > > n = n = 9 y = (.,.) y = (.,.889) > 8 > 8 n = y = (.,.8) n = n = 8 y = (.889,.) > 8 n = y = (.88,.8) n = y = (.8,.)

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA

JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 1 SEPARATING SIGNAL FROM BACKGROUND USING ENSEMBLES OF RULES JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 E-mail: jhf@stanford.edu

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Gradient Boosting, Continued

Gradient Boosting, Continued Gradient Boosting, Continued David Rosenberg New York University December 26, 2016 David Rosenberg (New York University) DS-GA 1003 December 26, 2016 1 / 16 Review: Gradient Boosting Review: Gradient Boosting

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1 Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

Supplementary material for Intervention in prediction measure: a new approach to assessing variable importance for random forests

Supplementary material for Intervention in prediction measure: a new approach to assessing variable importance for random forests Supplementary material for Intervention in prediction measure: a new approach to assessing variable importance for random forests Irene Epifanio Dept. Matemàtiques and IMAC Universitat Jaume I Castelló,

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

More information

SF2930 Regression Analysis

SF2930 Regression Analysis SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression

More information

Roman Hornung. Ordinal Forests. Technical Report Number 212, 2017 Department of Statistics University of Munich.

Roman Hornung. Ordinal Forests. Technical Report Number 212, 2017 Department of Statistics University of Munich. Roman Hornung Ordinal Forests Technical Report Number 212, 2017 Department of Statistics University of Munich http://www.stat.uni-muenchen.de Ordinal Forests Roman Hornung 1 October 23, 2017 1 Institute

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014 Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008)

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement

More information

Ensemble Methods: Jay Hyer

Ensemble Methods: Jay Hyer Ensemble Methods: committee-based learning Jay Hyer linkedin.com/in/jayhyer @adatahead Overview Why Ensemble Learning? What is learning? How is ensemble learning different? Boosting Weak and Strong Learners

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Conditional variable importance in R package extendedforest

Conditional variable importance in R package extendedforest Conditional variable importance in R package extendedforest Stephen J. Smith, Nick Ellis, C. Roland Pitcher February 10, 2011 Contents 1 Introduction 1 2 Methods 2 2.1 Conditional permutation................................

More information

Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+TITLE:

Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+TITLE: #+TITLE: Data splitting INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+AUTHOR: Thomas Alexander Gerds #+INSTITUTE: Department of Biostatistics, University of Copenhagen

More information

Decision Trees (Cont.)

Decision Trees (Cont.) Decision Trees (Cont.) R&N Chapter 18.2,18.3 Side example with discrete (categorical) attributes: Predicting age (3 values: less than 30, 30-45, more than 45 yrs old) from census data. Attributes (split

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Harrison B. Prosper. Bari Lectures

Harrison B. Prosper. Bari Lectures Harrison B. Prosper Florida State University Bari Lectures 30, 31 May, 1 June 2016 Lectures on Multivariate Methods Harrison B. Prosper Bari, 2016 1 h Lecture 1 h Introduction h Classification h Grid Searches

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

Bias-Variance in Machine Learning

Bias-Variance in Machine Learning Bias-Variance in Machine Learning Bias-Variance: Outline Underfitting/overfitting: Why are complex hypotheses bad? Simple example of bias/variance Error as bias+variance for regression brief comments on

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Information Theory & Decision Trees

Information Theory & Decision Trees Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr Decision tree classifiers Decision tree representation for modeling dependencies among input variables using

More information

Rule Generation using Decision Trees

Rule Generation using Decision Trees Rule Generation using Decision Trees Dr. Rajni Jain 1. Introduction A DT is a classification scheme which generates a tree and a set of rules, representing the model of different classes, from a given

More information

Lecture 7 Decision Tree Classifier

Lecture 7 Decision Tree Classifier Machine Learning Dr.Ammar Mohammed Lecture 7 Decision Tree Classifier Decision Tree A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification

More information

Oliver Dürr. Statistisches Data Mining (StDM) Woche 11. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften

Oliver Dürr. Statistisches Data Mining (StDM) Woche 11. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften Statistisches Data Mining (StDM) Woche 11 Oliver Dürr Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften oliver.duerr@zhaw.ch Winterthur, 29 November 2016 1 Multitasking

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Adaptive Crowdsourcing via EM with Prior

Adaptive Crowdsourcing via EM with Prior Adaptive Crowdsourcing via EM with Prior Peter Maginnis and Tanmay Gupta May, 205 In this work, we make two primary contributions: derivation of the EM update for the shifted and rescaled beta prior and

More information

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS University of Colorado, Boulder CU Scholar Computer Science Technical Reports Computer Science Spring 5-1-23 Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information