Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms


Marco Sandri and Paola Zuccolotto
University of Brescia, Department of Quantitative Methods, C.da Santa Chiara 50, Brescia, Italy

Abstract. Variable selection is one of the main problems faced by data mining and machine learning techniques. For the most part, these techniques are more or less explicitly based on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, like Random Forests and Gradient Boosting Machines. In spite of their wide use, some measures of this class are known to be biased and some correction strategies have been proposed. The aim of this paper is twofold. First, to investigate the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits. Second, a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, is extended to the entire class of TDNI measures and its performance is investigated in the regression framework using simulated and real data.

Corresponding author: Paola Zuccolotto, Quantitative Methods Department, University of Brescia, C.da Santa Chiara 50, Brescia, Italy. zuk@eco.unibs.it

1 Introduction

In the last decades, with the proliferation of large datasets, the problem of variable selection has gained increasing attention in the field of data analysis. Given a large set of observed covariates that describe the phenomenon under study, the researcher often needs to identify the subset of informative (predictive) variables and to set uninformative (noisy) variables apart. A preliminary variable selection is a fundamental and crucial step in model building, whatever the approach to model the phenomenon might be. Many of the variable selection methods proposed in the literature are directly or indirectly based on the assumption that, given the set $X = \{X_1, \ldots, X_p\}$ of potential predictors for a response variable $Y$, an importance or relevance $\mu_i$ can be defined for each covariate $X_i$ in terms of prediction/explanation of $Y$, and that this measure can be evaluated from data using some variable importance estimator $VI_i$. The notion of importance has been widely investigated in the philosophical, AI, machine learning and statistical literature, and several attempts to formalize and quantify it have appeared. See [Bell and Wang, 2000] for a brief overview of the current lines of research and [van der Laan, 2006] for a novel approach. In the present work, following [Pearl, 1988], we start by identifying unimportance with conditional independence of random variables and importance with the negation of unimportance.

This paper focuses on measures of variable importance developed in the area of tree-based ensemble methods. Decision trees model data by partitioning the feature space into a set of disjoint rectangles and then fitting a simple model (e.g. a constant) to each one. Classification and Regression Trees (CART) were introduced by Breiman, Friedman, Olshen and Stone in 1984 and are a milestone in this field. Since then, a great number of developments have been proposed in several disciplines [Murthy, 2004]. One of the main problems of tree predictors is model instability, defined as the existence of many different models, distant in terms of form and interpretation, that have about the same training or test set error [Breiman, 1996]. Ensemble learning is a class of methods developed for reducing model instability and improving the accuracy of a predictor through the aggregation of several similar predictors. Each ensemble member is constructed from a different function of the input covariates, and the ensemble prediction is obtained by a linear combination of the predictions of the ensemble members. Ensembles can be built using different prediction methods, i.e. using different base learners as ensemble members. An interesting proposal uses CART as base learners. Typically, such aggregation neutralizes the effects of tree instability and arrives at greater accuracy by reducing either the bias or the variance of a single classifier [Bühlmann and Yu, 2002]. Popular examples of tree-based ensembles are Random Forests (RF, [Breiman, 2001]) and the Gradient Boosting Machine (GBM, [Friedman, 2001]).

The first approach to variable importance (VI, henceforth) measurement in tree-based predictors dates back to the book of [Breiman et al., 1984], where an interesting and effective notion of variable importance was proposed. The importance $\mu_i$ of a covariate $X_i$ is defined as the total decrease of heterogeneity of the response variable $Y$ given by the knowledge of $X = \{X_1, \ldots, X_p\}$ when the feature space is partitioned recursively. The VI measure originated by this notion is obtained by summing up all the decreases of the heterogeneity index in the nodes of the tree. This class of measures is called Total Decrease in Node Impurity (TDNI henceforth) and, with minor changes, is used in many tree-based ensemble methods (see e.g. [Breiman, 2002], [Friedman, 2001]). It is also available in many data mining software packages, like the randomForest package in R [Breiman et al., 2006], the gbm package in R [Ridgeway, 2007], the boost Stata command [Schonlau, 2005] and the MART package in R [Friedman, 2002].

In spite of their wide use, some TDNI measures are known to be biased. [Breiman et al., 1984] first noted that they tend to favor covariates having more values (i.e. fewer missing values, more categories or more distinct numerical values) and thus offering more splits. [White and Liu, 1994], [Kononenko, 1995] and [Dobra and Gehrke, 2001] investigated in greater detail the nature of bias and elucidated the relation between bias and the number of values of covariates. [Strobl, 2005], [Strobl et al., 2007b] and [Sandri and Zuccolotto, 2008] focused attention on the bias of the Gini variable importance measure (hereafter Gini TDNI), a measure frequently used in classification trees and based on the adoption of the Gini gain as the splitting criterion for tree nodes. Several methods have been proposed in the last decade for eliminating bias from the Gini TDNI. [Loh and Shih, 1997] and [Kim and Loh, 2001] proposed to modify the algorithm for the construction of classification trees in order to avoid selection bias; the authors showed that bias can be eliminated by separating, at each node, variable selection from split point selection. In the work of [Strobl, 2005], an unbiased estimation of the Gini TDNI was found in Conditional Random Forests, a new class of RF developed by [Hothorn et al., 2006]. [Strobl et al., 2007b] derived the exact distribution of the maximally selected Gini gain by means of a combinatorial approach, and the resulting p-value is suggested as an unbiased split selection criterion in recursive partitioning algorithms. The heuristic correction strategy proposed by [Sandri and Zuccolotto, 2008] is based on the introduction of a set of random pseudocovariates in the X matrix; the authors showed that the algorithm can efficiently remove bias from the Gini TDNI in RF and GBM.

The aim of this paper is twofold. First, to investigate the source and the characteristics of bias in TDNI measures by introducing the notions of informative and uninformative splits, showing its connections with the level of covariate measurement and with the number of uninformative splits. Second, to generalize and extend the domain of applicability of the correction algorithm of [Sandri and Zuccolotto, 2008] to the class of TDNI measures, evaluating its performance on simulated and real data in regression problems when the residual sum of squares is used as splitting criterion for the tree nodes.

The paper is organized as follows. In Section 2 we define the class of TDNI measures and the corresponding estimators for single trees and tree-based ensembles. In Section 3 we define the crucial notion of informative and uninformative splits and show that uninformative splits are the main source of bias for TDNI measures. Section 4 investigates the relationship between the bias of TDNI measures and the level of covariate measurement, from a theoretical point of view and by means of some simulation experiments. In Section 5 the bias-correction strategy for classification problems proposed by [Sandri and Zuccolotto, 2008] is recalled and extended to the class of TDNI measures. The performances of the method are tested on simulated (Section 5) and real data (Section 6). Section 7 concludes.

2 Characterization of TDNI measures

Let $(Y, X) : \Omega \to (D_Y \times D_{X_1} \times \cdots \times D_{X_p}) \equiv D$ be a vector random variable defined on a probability space $(\Omega, \mathcal{F}, P)$, where $X = \{X_1, \ldots, X_p\}$ is a set of covariates and $Y$ a response variable. A tree-structured binary recursive partitioning algorithm yields a hierarchical partition of the domain $D$ into $J$ disjoint (hyper-)rectangles $R_j \subseteq D$, $j = 1, 2, \ldots, J$. Each rectangle is generated in $D$ by splitting a parent rectangle in two parts through a binary split of the domain of a covariate $X_i$. Therefore, a rectangle can be described by the set of covariates and splits used to generate it. If $Y$, $X_1$ and $X_2$ are three numerical real-valued random variables, a rectangle could for example be $R_j = \{(y, x) \in \mathbb{R}^3 : x_1 > a,\ b \le x_2 \le c\}$, where $a \in D_{X_1} \subseteq \mathbb{R}$ and $b, c \in D_{X_2} \subseteq \mathbb{R}$, $b < c$.

Consider a value $s \in D_{X_i|R_j}$, where $D_{X_i|R_j}$ is the domain of $X_i$ restricted to the rectangle $R_j$. The impurity reduction generated in the rectangle $R_j$ by $X_i$ at the cutpoint $s$ is given by:

$$ d^s_{ij} = \Delta H_Y(X_i, R_j) = p_j \left\{ H_Y - \left( p_{jL}\, H_{Y|X_i \le s} + p_{jR}\, H_{Y|X_i > s} \right) \right\}, \qquad (1) $$

where $p_j = P(R_j)$, $p_{jL} = P(X_i \le s \mid R_j)$ and $p_{jR} = P(X_i > s \mid R_j)$. $H_Y$, $H_{Y|X_i \le s}$ and $H_{Y|X_i > s}$ are the heterogeneity indexes of $Y$ in the $j$th rectangle and in the left and right splits of $R_j$, respectively. Let $d_{ij}$ be the maximum heterogeneity reduction allowed by covariate $X_i$ in the $j$th rectangle, over all the possible cutpoints $s \in D_{X_i|R_j}$:

$$ d_{ij} = \max_{s \in D_{X_i|R_j}} d^s_{ij}. \qquad (2) $$

The goal of partitioning algorithms is to maximally reduce the heterogeneity of $Y$ within the rectangles. Therefore, for each $R_j$, the splitting variable $X_i$ and the cutpoint $s$ are those that maximize the impurity reduction in that subset. In other words, the partitioning variable $X_i$ satisfies in $R_j$ the condition $d_{ij} > d_{hj}$ for $h = 1, 2, \ldots, p$, $h \ne i$.
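As a concrete illustration of the split-selection step, the following R sketch computes the empirical counterpart of the impurity reduction (1) for a numeric covariate within a single rectangle, using the variance of $Y$ as heterogeneity index, and searches for the best cutpoint as in (2). The function names and the convention $p_j = 1$ (root node) are illustrative choices, not taken from any package.

```r
# Illustrative sketch: empirical impurity reduction d_ij^s of equation (1)
# for a numeric covariate x inside one rectangle, with the variance of y as
# heterogeneity index, and the cutpoint search of equation (2).
# Probabilities are replaced by sample proportions.
var0 <- function(v) mean((v - mean(v))^2)      # variance with denominator n

impurity_reduction <- function(y, x, s, p_j = 1) {
  left <- x <= s
  p_jl <- mean(left)                           # estimate of P(X_i <= s | R_j)
  p_jr <- 1 - p_jl
  p_j * (var0(y) - (p_jl * var0(y[left]) + p_jr * var0(y[!left])))
}

best_split <- function(y, x, p_j = 1) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]                  # candidate cutpoints s
  d <- vapply(cuts, function(s) impurity_reduction(y, x, s, p_j), numeric(1))
  list(d_ij = max(d), cutpoint = cuts[which.max(d)])
}
```

Within a node, the splitting covariate is then the one whose best-split value $d_{ij}$ is largest, as required by the condition $d_{ij} > d_{hj}$ above.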

In this context, TDNI measures of variable importance are based on the following notion of importance $\mu_i$ of a covariate $X_i$: $\mu_i$ is the total decrease of the heterogeneity index $H_Y$ attributable to $X_i$. In other words, $\mu_i$ is computed by summing up all the decreases of heterogeneity $d_{ij}$ obtained in the rectangles generated using $X_i$ as splitting variable:

$$ \mu_i = \sum_{j \in J} d_{ij} I_{ij}, \qquad (3) $$

where $I_{ij}$ is the indicator function which equals 1 if the $i$th variable is used to split $R_j$ and 0 otherwise. Several impurity/heterogeneity indexes $H$ have been proposed for the case of a categorical $Y$: the Pearson chi-squared statistic, the Gini criterion, the entropy criterion, the families of splitting criteria of [Shih, 1999], etc. When $Y$ is numerical, the most popular measure $H$ is the variance. In the following example we show how, according to equation (3), the importance $\mu_i$ can be calculated using the joint probability distribution of $(Y, X)$. The data generating process described in this example will be used to produce the dataset of Example 2 in Section 3 and of Simulations (4) and (5) in Section 4.

Example 1. (Calculation of variable importance) Consider the variable $(Y, X) = \{Y, X_1, X_2, X_3\}$, where $X_1$ and $X_2$ are two binary 0/1 independent covariates and $Y$ is a continuous standard normal response variable generated by the following data generating process (see Fig. 1(a)):

$P(X_1 = 0) = P(X_2 = 0) = 1/2$,
$(Y \mid X_1 = 0) \sim N(-1/2, 3/4)$, $(Y \mid X_1 = 1) \sim N(1/2, 3/4)$,
$(Y \mid X_2 = 0) \sim N(-1/3, 8/9)$, $(Y \mid X_2 = 1) \sim N(1/3, 8/9)$,
$(Y \mid X_1 = 0, X_2 = 0) = (Y \mid X_1 = 0, X_2 = 1) \sim N(-1/2, 3/4)$,
$(Y \mid X_1 = 1, X_2 = 0) \sim N(0, 1/2)$, $(Y \mid X_1 = 1, X_2 = 1) \sim N(1, 1/2)$.

Consider the following three cases for the uninformative variable $X_3$:

Case A: a binary 0/1 covariate independent of $X_1$ and $X_2$;
Case B: a continuous standard normal covariate independent of $X_1$ and $X_2$;
Case C: a continuous covariate, normally distributed conditionally on $X_1$: $(X_3 \mid X_1 = 0) \sim N(0, 1)$ and $(X_3 \mid X_1 = 1) \sim N(1, 1)$.

Variable importances can be calculated in the three cases A, B and C. Since $Y$ is continuous, we can adopt the variance as heterogeneity index and (1) can be expressed as:

$$ d^s_{ij} = \Delta\sigma^2_Y(X_i, R_j) = p_j \left\{ \sigma^2_Y - \left( p_{jL}\, \sigma^2_{Y|X_i \le s} + p_{jR}\, \sigma^2_{Y|X_i > s} \right) \right\}, \qquad (4) $$

where $\sigma^2_Y$, $\sigma^2_{Y|X_i \le s}$ and $\sigma^2_{Y|X_i > s}$ are the variances of $Y$ in the $j$th rectangle and in the left and right splits, respectively.

Case A and B. The calculation of variable importance is the same in the two cases, because $Y$ is stochastically independent of $X_3$; the different levels of measurement of $X_3$ in the two cases do not influence variable importance. When the whole sample space is considered (i.e. $R_1 = D$), $X_1$ is the most effective variable in reducing the heterogeneity of $Y$ by means of a binary split, because $d_{11} = p_1 \cdot 1/4 = 1/4 > d_{21} = p_1 \cdot 1/9 = 1/9 > d_{31} = 0$. The sample space is then partitioned according to $X_1$. The sample space conditioned to $X_1 = 1$ ($R_3$) can be further partitioned by $X_2$, since $d_{23} = p_3 \cdot 1/4 = P(X_1 = 1) \cdot 1/4 = 1/8 > d_{33} = 0$. The sample space conditioned to $(X_1 = 1) \cap (X_2 = 0)$ ($R_4$) cannot be further partitioned because $d_{34} = 0$; the same is true in the sample space conditioned to $(X_1 = 1) \cap (X_2 = 1)$ ($R_5$). Similarly, no further partitioning is possible in the sample space conditioned to $X_1 = 0$ ($R_2$) because $d_{22} = d_{32} = 0$. Hence, the VIs of the three covariates are $\mu_1 = d_{11} = 1/4$, $\mu_2 = d_{23} = 1/8$, $\mu_3 = 0$.

Case C. In this case $X_3$ is no longer independent of $Y$, due to the relationship existing between $X_3$ and $X_1$; here $X_3$ is independent of $Y$ conditionally on $X_1$. We now show that the VIs of the three covariates are the same as in cases A and B, because in $R_1 = D$ the covariate $X_1$ remains the most effective variable in reducing the heterogeneity of $Y$ by means of a binary split. This can be proved as follows. Let $s$ be a cutpoint for $X_3$ in $R_1$ and let $P(X_1 = 0 \mid X_3 \le s) = p$ and $P(X_1 = 0 \mid X_3 > s) = q$. The distributions of $Y$, conditionally on $X_3$ lower and greater than $s$, are mixtures of normal variables, that is $f(y \mid X_3 \le s) = p f(y \mid X_1 = 0) + (1 - p) f(y \mid X_1 = 1)$ and $f(y \mid X_3 > s) = q f(y \mid X_1 = 0) + (1 - q) f(y \mid X_1 = 1)$, where $f(y \mid X \in A)$ is the density function of $y$ given that $X \in A$. It is easy to show that

$$ \sigma^2_{Y|X_3 \le s} = 3/4 + p(1 - p) > \sigma^2_{Y|X_1 = 0} = \sigma^2_{Y|X_1 = 1} = 3/4, \qquad \sigma^2_{Y|X_3 > s} = 3/4 + q(1 - q) > \sigma^2_{Y|X_1 = 0} = \sigma^2_{Y|X_1 = 1} = 3/4. $$

It follows that, for all $s$,

$$ \sigma^2_Y - \left( P(X_1 = 0)\, \sigma^2_{Y|X_1 = 0} + P(X_1 = 1)\, \sigma^2_{Y|X_1 = 1} \right) > \sigma^2_Y - \left( P(X_3 \le s)\, \sigma^2_{Y|X_3 \le s} + P(X_3 > s)\, \sigma^2_{Y|X_3 > s} \right), $$

thus $d_{11} > d_{31}$. When the sample space is partitioned according to $X_1$, $X_3$ is independent of $Y$ in the two generated rectangles, so the domain partition is the same as in cases A and B, as well as the resulting VI measures $\mu_1$, $\mu_2$ and $\mu_3$.

[Figures (1) approximately here]

We consider now the case of a tree $t$ built using a sample of size $N$. The impurity reduction $\hat d_{ij}$ at node $j$ attributable to covariate $X_i$ with cutpoint $s$ can be estimated by:

$$ \hat d^s_{ij} = \widehat{\Delta H}_Y(X_i) = \frac{n_j}{N} \left\{ \hat H_Y - \left( \frac{n_{jL}}{n_j} \hat H_{Y|X_i \le s} + \frac{n_{jR}}{n_j} \hat H_{Y|X_i > s} \right) \right\}, \qquad (5) $$

where $\hat H_Y$, $\hat H_{Y|X_i \le s}$ and $\hat H_{Y|X_i > s}$ are the estimated heterogeneities of $Y$ in the $j$th rectangle and in the left and right splits, respectively, and $n_j$, $n_{jL}$, $n_{jR}$ are the sample sizes in node $j$ and in the left and right splits. Similarly, $d_{ij}$ is estimated by:

$$ \hat d_{ij} = \max_{s \in S_{ij}} \hat d^s_{ij}, \qquad (6) $$

where $S_{ij}$ is the set of available cutpoints of variable $X_i$ at node $j$. The covariate $X_i$ is selected at node $j$ as splitting variable if $\hat d_{ij} > \hat d_{hj}$ for all $h = 1, \ldots, p$, $h \ne i$. The estimate of the TDNI importance $\mu_i$ using a tree $t$ is given by the sum of the estimated impurity reductions attributable to covariate $X_i$ over the set $J$ of nonterminal nodes of the tree [Breiman et al., 1984], that is:

$$ VI_i(t) = \sum_{j \in J} \hat d_{ij} I_{ij}. \qquad (7) $$

In the regression case we can use the sample variance $\hat\sigma^2$ as an estimator of the heterogeneity of nodes and splits. Hence, (5) becomes:

$$ \hat d^s_{ij} = \widehat{\Delta\sigma^2_Y}(X_i) = \frac{n_j}{N} \left\{ \hat\sigma^2_Y - \left( \frac{n_{jL}}{n_j} \hat\sigma^2_{Y|X_i \le s} + \frac{n_{jR}}{n_j} \hat\sigma^2_{Y|X_i > s} \right) \right\} \qquad (8) $$

$$ = \frac{n_j}{N} \left\{ \frac{DEV_{total}(j)}{n_j} - \frac{DEV_{within}(j_L, j_R)}{n_j} \right\} = \frac{1}{N}\, DEV_{between}(j_L, j_R), \qquad (9) $$

where $DEV_{within}$ and $DEV_{between}$ are the within-node and the between-node deviance, respectively. From (9) one can derive that, in the regression case, the TDNI measure (7) of a covariate $X_i$ is equal to the total amount of $DEV_{between}$ imputable to that covariate in the tree. For tree-based ensembles the VI measure is given by the average of $VI_i(t)$ over the set of $T$ trees:

$$ VI_i = \frac{1}{T} \sum_{t=1}^{T} VI_i(t). \qquad (10) $$

The VI measure (10) has been proposed by [Breiman, 2002] in Random Forests and is called Measure 4 (M4), because it is the last of a set of four importance measures. With minor modifications, [Friedman, 2001] proposed an influence measure of input variables for GBM, with $\hat d^2_{ij}$ in place of $\hat d_{ij}$ and $\widehat{VI}_i$ rescaled by assigning a value of 100 to the most influential covariate.
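In practice the ensemble measure (10) does not have to be assembled by hand: for regression forests, for example, the randomForest package in R already reports the total decrease in node impurity (residual sum of squares) averaged over the trees, which is the sample counterpart of (7)-(10) up to the $1/N$ factor in (9). A minimal sketch on simulated data follows; the dataset and the settings are illustrative only.

```r
# Minimal sketch: reading a TDNI / M4-type measure from a fitted regression
# Random Forest. The simulated data are purely illustrative.
library(randomForest)

set.seed(1)
n <- 400
X <- data.frame(x1 = rnorm(n),
                x2 = rnorm(n),
                x3 = factor(sample(letters[1:4], n, replace = TRUE)))
y <- X$x1 + rnorm(n)

rf <- randomForest(x = X, y = y, ntree = 1000, mtry = 2, nodesize = 5)

# Type 2 importance is the total decrease in node impurity from splits on
# each variable, averaged over the ntree trees (column "IncNodePurity").
importance(rf, type = 2)
```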

3 Bias in TDNI measures

A crucial point for the analysis that follows is the notion of informative and uninformative splits. Suppose that $D$ has been recursively partitioned into $J$ rectangles $\{R_j\}_{j=1,2,\ldots,J}$. If $X_i$ and $Y$ are stochastically independent, they continue to be independent in each $R_j$. On the contrary, if some association between $X_i$ and $Y$ exists, $X_i$ and $Y$ could be dependent or conditionally independent in a given $R_j$. In other words, when predicting $Y$, uninformative covariates (i.e. those stochastically independent of $Y$) always remain uninformative, in each subset of the sample space. Informative covariates (i.e. those somehow associated with $Y$) can continue to be informative or can become uninformative in $R_j$.

Suppose now to grow a tree using a sample of $N$ units and suppose that, within a given node, there is at least one covariate having some association with $Y$. The node will be split by using the best covariate, that is the covariate that maximizes the heterogeneity reduction $\hat d_{ij}$. Hence, because the heterogeneity reductions of informative covariates will typically be greater than the heterogeneity reductions of the uninformative ones, an informative covariate will be chosen as splitting variable. We define this circumstance as an informative split. When within a node there are no informative covariates, only uninformative covariates and/or informative covariates which became uninformative can be chosen as splitting covariate. This is the case of an uninformative split. We can formalize the following definition.

Definition 1: (Informative and uninformative splits) Given a tree $t$ grown from a sample, the split of a node $j$ made by covariate $X_i$ (i.e. $\hat d_{ij} > \hat d_{hj}$, $h = 1, 2, \ldots, p$, $h \ne i$) is called uninformative if $d_{ij} = 0$ and informative otherwise. An uninformatively-split node is a node where an uninformative split occurs.

In informative splits, the heterogeneity reduction $\hat d_{ij}$ of the splitting covariate is a direct consequence of its importance. Differently, in an uninformative split, $\hat d_{ij}$ is a product of chance. Therefore, when calculating the TDNI measure $VI_i(t)$ of covariate $X_i$ by (7), it is of fundamental importance to distinguish between impurity reductions attributable to informative splits and impurity reductions generated by uninformative splits. In other words, $VI_i(t)$ can be expressed as the sum of two components:

$$ VI_i(t) = \sum_{j \in J_I} \hat d_{ij} I_{ij} + \sum_{j \in J_U} \hat d_{ij} I_{ij} = \hat\mu_i(t) + \varepsilon_i(t), \qquad (11) $$

where $J_I$ and $J_U$, $J_I \cup J_U = J$, are the sets of nodes characterized by informative and uninformative splits, respectively. $\hat\mu_i(t)$ is the part of the VI measure attributable to informative splits and directly related to the true importance of $X_i$. On the contrary, the term $\varepsilon_i(t)$ is a noisy component associated with the selection of $X_i$ within uninformative splits and is a source of bias for $VI_i(t)$.

[Table 1: Sample data of Example 2 approximately here]

Example 2. (Informative and uninformative splits) Consider the sample of 16 units given in Table 1, generated by the process of Example 1, Case A. Suppose to fit a fully-grown regression tree to these data. Applying (5), in the root node $X_1$ yields the largest impurity reduction ($\hat d_{11} = 0.266$, against $\hat d_{21} = 0.214$ for $X_2$), so $X_1$ is the best splitting variable at the root node and the split is informative because $d_{11} \ne 0$ (and $d_{11} > d_{21} > d_{31}$). In node 2 ($X_1 = 0$), $X_2$ is used as splitting variable, since $\hat d_{22} > \hat d_{32}$; this is an uninformative split because $d_{22} = 0$ (and $d_{32} = 0$). In node 3 ($X_1 = 1$), the splitting variable is again $X_2$ and the split is informative, since $d_{23} \ne 0$ (and $d_{23} > d_{33}$). In nodes 4, 5, 6 and 7 the splits are made by $X_3$ with small impurity reductions (e.g. $\hat d_{35} = 0.012$ and $\hat d_{36} = 0.006$) and are all uninformative. Nodes 8 through 15 are leaf nodes. The resulting regression tree is shown in Fig. 1(b), where uninformative splits have been marked by thick lines. The estimated VIs of the three covariates are $VI_1 = \hat\mu_1 = 0.266$, $VI_2 = \hat\mu_2 + \varepsilon_2$ and $VI_3 = \varepsilon_3$.

Two remarks are worth pointing out. First, the sample data used to grow the tree are also used to calculate TDNI measures. In other words, TDNI measures are a class of in-sample measures of variable importance. This is the crucial difference between TDNI measures and the mean decrease in prediction accuracy, a popular permutation-based VI measure.

The latter is defined in RF as follows: for each tree, the algorithm randomly rearranges the values of the $i$th variable for the out-of-bag set (i.e. the subset of the bootstrap sample not used in the construction of the tree), puts this permuted set down the tree, and gets new predictions from the forest. The importance of the $i$th variable is defined as the difference between the original out-of-bag error rate and the out-of-bag error rate for the randomly permuted $i$th covariate. Hence, the mean decrease in accuracy is fundamentally an out-of-sample VI measure, in contrast to the in-sample character of TDNI measures.

Second, the notion of informative and uninformative splits is intimately related to the notion of overfitting. It is well known that an overfitted model shows a high in-sample accuracy but does not validate, that is, does not provide accurate predictions for out-of-sample observations. Base learners of RF are fully-grown trees, which carry a serious risk of overfitting [Berk, 2006]. When building a CART, the model starts learning the underlying structure of the data, and typically the first splits of the tree are informative splits. Subsequently, after an adequate number of informative splits, informative variables become uninformative, uninformative splits take place and the model learns the fine structure of the data that is generated by noise. In other words, overfitting and uninformative splits in tree-based models are synonymous.

The in-sample character of TDNI measures and the effects of overfitting can lead to a seeming paradox. Consider the case where only one covariate $X$ is available, $X$ and $Y$ are continuous, a sample of $N$ units has been observed and $\hat d_{ij}$ as defined in (8) is used. In a fully-grown tree the importance of $X$ is equal to the variance of $Y$, no matter what relationship between $X$ and $Y$ exists. Consider the two limiting cases: the deterministic case $Y = f(X)$ and the null case (i.e. $Y$ and $X$ are stochastically independent). The importance (7) of $X$ is the same in the two cases, but in the first case the (maximal) tree contains only informative splits and $VI = \hat\mu$, while in the second case only uninformative splits are present and $VI = \varepsilon$. This problem vanishes if the mean decrease in prediction accuracy is used instead of TDNI measures, or if one avoids uninformative splits when building the tree.

Pruning techniques are effective methods for controlling overfitting in CART. The aim of pruning is to remove uninformative splits: in well-pruned trees the number of uninformative splits is minimized and, hence, the bias of TDNI measures is minimized too. In the context of RF, unpruned trees are typically used as base learners because the RF prediction is obtained by averaging the predictions of the single trees, and this neutralizes the problem of overfitting. Pruning seems to be unnecessary, but we know that this is only partially true: there is no need for pruning to improve the accuracy of prediction, but when we use RF for variable selection we cannot forget that fully-grown unpruned trees can generate a substantial amount of bias in TDNI measures.
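For comparison, the out-of-bag permutation measure described earlier in this section is available from the same kind of fitted forest when the permutation importance is requested at training time. A short sketch, reusing the illustrative data of the previous snippet (the numbers themselves carry no substantive meaning):

```r
# Sketch: out-of-sample permutation importance (mean decrease in accuracy,
# reported as %IncMSE for regression) versus the in-sample TDNI measure.
rf2 <- randomForest(x = X, y = y, ntree = 1000, mtry = 2,
                    nodesize = 5, importance = TRUE)
importance(rf2, type = 1)   # out-of-bag, permutation-based
importance(rf2, type = 2)   # in-sample TDNI (IncNodePurity)
```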

4 Level of bias and level of covariate measurement

In a tree grown from a sample of $N$ units, each covariate $X_i$ can be used as splitting variable only in a finite number of nodes. At node $j$, $X_i$ is characterized by a finite number $nps_{ij}$ of possible binary splits, which depends on the level of measurement of the covariate. A nominal covariate with $k$ categories within a given node has $nps_{ij} = 2^{k-1} - 1$ possible splits, while an ordinal covariate with $k$ categories has $nps_{ij} = k - 1$ possible splits. A numerical (continuous) covariate with $n_j$ distinct values within node $j$ can be viewed as the limiting case of an ordinal covariate with as many categories as the number of sample units in the node; thus, it has $nps_{ij} = n_j - 1$. Of course, $nps_{ij} \le nps_{ik}$ for all parent nodes $k$ of a node $j$, because each child node contains only a subset of the original sample and each covariate in a node has a number of distinct values (or categories) lower than or equal to the number of its distinct values (or categories) in the parent nodes.

Consider a node where all the covariates $X_i$ are conditionally independent of $Y$. By definition, only an uninformative split can take place in this node. All the binary partitions of all the covariates have the same probability of being the best one, and the selection of the splitting variable is only a product of chance. Therefore, the covariates with the highest number $nps_{ij}$ of possible splits are more likely to be chosen as splitting variables. Recalling the decomposition given in (11), the above considerations imply that the expected values of the noisy components $\varepsilon_i$ of the estimated VIs are not equal, but depend on the level of measurement of the covariates.

Let $J_U$ be the set of all the possible uninformatively-split nodes, that is the set of all the nodes where an uninformative split occurs, for all the trees grown on all the possible $N$-size samples. We have:

$$ E(\varepsilon_i) = E\Big( \sum_{j \in J_U} \hat d_{ij} I_{ij} \Big) = \sum_{j \in J_U} E\big( (\hat d_{ij} I_{ij})\, I_j \big), $$

where $I_j$ is the indicator function which equals 1 if the uninformatively-split node $j$ occurs in the tree $t$ and 0 otherwise. Since the occurrence of a given node $j$ depends only on former splits, $I_j$ and $\hat d_{ij} I_{ij}$ are independent and we can write:

$$ E(\varepsilon_i) = \sum_{j \in J_U} E(\hat d_{ij} I_{ij})\, E(I_j) = \sum_{j \in J_U} E(\hat d_{ij} I_{ij})\, q_j, $$

where $q_j$ is the probability of occurrence of node $j$ in a tree. Finally, applying the law of iterated expectation, we obtain:

$$ E(\varepsilon_i) = \sum_{j \in J_U} E(\hat d_{ij} \mid I_{ij} = 1)\, p_{ij}\, q_j, \qquad (12) $$

where $p_{ij}$ is the probability of selecting covariate $X_i$ at node $j$. By definition, in uninformatively-split nodes all the covariates are independent of the response variable and $E(\hat d_{ij} \mid I_{ij} = 1) = d_j$ for all $i = 1, 2, \ldots, p$.

On the contrary, $p_{ij}$ depends on the number $nps_{ij}$ of possible splits of $X_i$ at node $j$:

$$ p_{ij} = \frac{nps_{ij}}{\sum_{h=1}^{p} nps_{hj}}. \qquad (13) $$

Equation (12) proves two fundamental facts: (a) covariate importances estimated by TDNI measures can show different levels of bias according to the different levels of measurement of the covariates, and (b) the source of bias is very closely connected to the selection mechanism of the splitting variable in uninformatively-split nodes. In the following two subsections, using some simulation experiments, we investigate these characteristics of bias in further detail: its dependence on the number of uninformative splits and on the covariates' level of measurement.

4.1 Simulation studies with a single regression tree

The main goal of the present study is to investigate the problem of bias in TDNI VI measures estimated by ensemble methods with fully-grown trees as base learners, like RF. Equations (10) and (11) show that bias in tree-based ensembles can be expressed as an average of the bias originated in the base learners. Hence, it is convenient to start our investigation of the source and the characteristics of bias by considering a single unpruned regression tree. Two simulation studies are performed. We consider a data generating process with 9 independent covariates: 1 binary variable (B), 4 ordinal variables (O4, O8, O16 and O32) with 4, 8, 16 and 32 categories and 4 nominal variables (N4, N8, N16 and N32) with 4, 8, 16 and 32 categories. The continuous outcome variable $Y$ is independent of these covariates (null case). The null case guarantees that only uninformative splits take place and estimated VIs are entirely dominated by bias.

Simulation (1). In the first numerical experiment we investigate the relationship between bias and the number of uninformative splits. We generate 100 random samples of size N = 3000. For each sample, VIs are estimated by a single unpruned tree, varying the nodesize parameter (the minimum size of terminal nodes) in the set {5, 6, 7, 8, 9, 10, 13, 15, 20, 25, 30, 60}. The number of leaf nodes of the trees, together with the estimated VIs of the 9 covariates, are collected in a dataset. For each covariate, we estimate a quadratic regression model between the number of leaf nodes and the VI. Fig. 2(a) shows the relationship existing between the mean level of bias predicted by the quadratic models and the number of uninformative splits ($R^2$ values range up to 0.949; the lowest value has been observed for the binary variable and the highest for N32). These results substantially agree with the information contained in equation (12): bias levels appear to be a non-decreasing function of the number of uninformative splits. The estimated functions show a statistically significant concavity. This fact can be explained considering equations (5) and (7): growing trees with increasing levels of ramification generates nodes with a decreasing size $n_j$, and $\hat d_{ij}$ is directly proportional to the node sample size $n_j$. Hence, a progressive increase in the number of leaf nodes produces in (7) the addition of terms that do not grow at the same speed, and increases of VI that are less than proportional.
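As a small numerical illustration of (13) (the covariates and the node size used here are hypothetical), the selection probabilities in an uninformatively-split node can be computed directly from the numbers of possible splits:

```r
# Illustrative computation of the selection probabilities p_ij of equation
# (13) in an uninformatively-split node with, say, n_j = 100 distinct values
# of a continuous covariate: nominal covariates with k categories allow
# 2^(k-1) - 1 splits, ordinal covariates k - 1, a continuous covariate n_j - 1.
n_j <- 100
nps <- c(B  = 2^(2 - 1) - 1,   # binary: 1 possible split
         O8 = 8 - 1,           # ordinal with 8 categories: 7 splits
         N8 = 2^(8 - 1) - 1,   # nominal with 8 categories: 127 splits
         C  = n_j - 1)         # continuous with n_j distinct values: 99 splits
round(nps / sum(nps), 3)       # p_ij: N8 and C dominate the selection
```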

Simulation (2). In this simulation we study the relationship between bias and the number of categories in ordinal and nominal covariates. We generate a set of 100 random samples of size N = 3000. For each sample, VIs are estimated by a single unpruned tree with the nodesize parameter fixed to 5. The estimated VIs of the 9 covariates are collected in a dataset and the relationship between bias and number of categories is analyzed by estimating two quadratic regression models: one for the set of nominal covariates and one for the ordinal covariates. The mean levels of bias predicted by the quadratic models are plotted in Fig. 2(b) against the logarithm of the number of covariate categories (the $R^2$ value of the two models is 0.997). According to equations (12) and (13), these curves show that bias levels grow monotonically with the number of categories. In addition, nominal variables have higher levels of bias than ordinal variables because at each node they allow a greater number of possible splits $nps_{ij}$ than ordinal variables with the same number of categories.

[Figures (2) approximately here]

4.2 Simulation studies with a tree-based learning ensemble

In this subsection we continue our investigation of the characteristics of bias of TDNI measures, taking into account tree-based ensembles. We present four numerical experiments where VIs are estimated using RF. We devote special attention to the effect of bias on the ranking of covariates sorted by decreasing levels of estimated importance.

Simulation (3). A set of r = 100 samples with N = 250 observations is generated using the null-case mechanism described in the previous subsection: 9 independent covariates with different levels of measurement and a continuous outcome variable independent of these covariates.

Simulation (4). In this experiment the data generating process of Example 1 (Case B) is used, with the addition of 6 random independent variables having different levels of measurement: a binary covariate (B), two ordinal (O6 and O11) and two nominal (N6 and N11) covariates and 1 continuous covariate (C2). The covariates are all mutually independent and only $X_1$ and $X_2$ are informative. We consider a set of r = 100 samples with N = 400 observations.

Simulation (5). Here we consider the case of correlated predictor variables. This simulation is similar to Simulation (4); the only difference relates to the two continuous covariates. $X_3$ and C2 are normally distributed conditionally on $X_1$ and B, respectively: $(X_3 \mid X_1 = 0) \sim N(0, 1)$, $(X_3 \mid X_1 = 1) \sim N(1, 1)$ and $(C2 \mid B = 0) \sim N(0, 1)$, $(C2 \mid B = 1) \sim N(1, 1)$.

Simulation (6). In the last simulation experiment we consider the case of a regression problem with $p > N$. The sample size is N = 50 and the number of covariates is p = 200. The covariates have been divided into 20 groups, each consisting of 10 ordinal covariates with the following numbers of categories: 2, 2, 3, 3, 4, 4, 6, 6, 8, 8. The first group $\mathbf{X}_1 = (X_1, X_2, \ldots, X_{10})$ contains mutually correlated covariates, and $X_1$, $X_3$, $X_5$, $X_7$ and $X_9$ are correlated with the outcome. In the second group $\mathbf{X}_2$, covariates are mutually independent and $X_{11}$, $X_{13}$, $X_{15}$, $X_{17}$ and $X_{19}$ are correlated with the outcome. The remaining 18 groups contain 180 covariates that are mutually independent and independent of the outcome. More specifically, r = 1000 repetitions have been generated by the following data generating process.

(1) N observations $(y, x^c_1, \ldots, x^c_p)$ are randomly drawn from a multivariate $(p + 1)$-dimensional Gaussian distribution with mean vector $\mu = (0, \ldots, 0)$ and covariance matrix

$$ \Sigma = \begin{pmatrix} 1 & \rho'_{Y\mathbf{X}_1} & \rho'_{Y\mathbf{X}_2} & \cdots & \rho'_{Y\mathbf{X}_{20}} \\ \rho_{Y\mathbf{X}_1} & \Sigma_{\mathbf{X}_1} & 0 & \cdots & 0 \\ \rho_{Y\mathbf{X}_2} & 0 & I & & 0 \\ \vdots & \vdots & & \ddots & \\ \rho_{Y\mathbf{X}_{20}} & 0 & 0 & & I \end{pmatrix}, $$

where $\rho_{Y\mathbf{X}_i}$ is the vector of the 10 correlations between the covariates of the $i$th group and the outcome $Y$, $\rho_{Y\mathbf{X}_1} = \rho_{Y\mathbf{X}_2} = (0.31, 0, 0.28, 0, 0.26, 0, 0.26, 0, 0.25, 0)$ and $\rho_{Y\mathbf{X}_3} = \cdots = \rho_{Y\mathbf{X}_{20}} = (0, \ldots, 0)$. The symbol 0 denotes a 10-dimensional square matrix of zeros, $I$ is the 10-dimensional identity matrix, and the generic element of $\Sigma_{\mathbf{X}_1}$ is given by $s_{ij} = 0.4^{|i-j|}$.

(2) Each covariate generated at step (1) is transformed into a categorical variable $X_i$, with a number $k_i$ of categories as given above. Categories are defined according to quantiles of the normal distribution: $X_i = k$ if $x^c_i \in (q_{(k-1)/k_i}, q_{k/k_i}]$, $k = 1, 2, \ldots, k_i$, where $q_h$ is the $h$-quantile of a standard normal distribution.
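A sketch of steps (1) and (2), under the specifications reported above (the correlation vector and the AR-type block $\Sigma_{\mathbf{X}_1}$ are taken from the text; object names are arbitrary):

```r
# Sketch of the Simulation (6) data generating process: multivariate normal
# draw with the block covariance matrix Sigma, then quantile-based
# categorization of each covariate. Object names are illustrative.
library(MASS)   # for mvrnorm

p <- 200; N <- 50
k_cat <- rep(c(2, 2, 3, 3, 4, 4, 6, 6, 8, 8), 20)        # categories per covariate
rho1  <- c(0.31, 0, 0.28, 0, 0.26, 0, 0.26, 0, 0.25, 0)  # rho_{Y,X_1} = rho_{Y,X_2}

Sigma <- diag(p + 1)
Sigma[1, 2:11]  <- Sigma[2:11, 1]  <- rho1               # group 1 vs Y
Sigma[1, 12:21] <- Sigma[12:21, 1] <- rho1               # group 2 vs Y
Sigma[2:11, 2:11] <- outer(1:10, 1:10,                   # Sigma_X1: s_ij = 0.4^|i-j|
                           function(i, j) 0.4^abs(i - j))

sim <- mvrnorm(N, mu = rep(0, p + 1), Sigma = Sigma)
y6  <- sim[, 1]
Xc  <- sim[, -1]

# Step (2): cut each continuous covariate at standard normal quantiles
X6 <- as.data.frame(lapply(seq_len(p), function(i) {
  cut(Xc[, i], breaks = qnorm(seq(0, 1, length.out = k_cat[i] + 1)),
      labels = FALSE, include.lowest = TRUE)
}))
names(X6) <- paste0("X", seq_len(p))
```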

The resulting data generating process has the following features: the outcome $Y$ is associated with covariates $X_1, X_3, X_5, \ldots, X_{19}$ with constant correlation ratios $\eta^2_{Y|X_i} = 0.05$; covariates $X_1, X_2, \ldots, X_{10}$ are mutually associated, with Cramér's $\nu$ indexes approximately given by $\nu_{i,j} \approx 0.4^{\,j-i-1}\, \nu_{i,i+1}$ for $j > i + 1$, where $\{\nu_{i,i+1}\}_{i=1,2,\ldots,9} \approx \{0.26, 0.29, 0.23, 0.24, 0.20, 0.21, 0.17, 0.18, 0.15\}$; covariates $X_{11}, X_{12}, \ldots, X_{200}$ are mutually independent and independent of covariates $X_1, X_2, \ldots, X_{10}$.

[Figures 3, 4, 5 and 6 approximately here]

The gray boxes of Figures 3(a), 4(a) and 5(a) show the distribution of the uncorrected TDNI measures estimated by means of RF for the three experiments (3), (4) and (5), respectively. Random Forests are implemented in the randomForest package [Liaw and Wiener (2002)] of the R language [R Development Core Team (2008)]. In the simulation experiments of this Section we have trained RFs with ntree=1000 regression trees, with mtry=5 variables randomly sampled as candidates at each split and with a minimum size of terminal nodes nodesize=5. The results of Simulation (3) substantially confirm what was already observed in Fig. 2(b): higher numbers of covariate categories are generally associated with higher levels of bias. A by-product of this relationship is an artificial ranking of covariates according to the number of categories instead of variable importance. In this simulation the covariates are equally important, because all are uninformative, but bias generates an erroneous ranking where the highest positions are achieved by the variables with the highest number of categories. The negative effects of VI bias on ranking are more evident in Simulations (4) and (5), where only $X_1$ and $X_2$ are informative. In these experiments the ranking of covariates using uncorrected VIs is clearly wrong: the uninformative covariates $X_3$, C2 and N11 are erroneously more important than the truly informative variable $X_2$. The average TDNI measures of Simulation (6) estimated using RF are visualized in Figure 6(a). A great amount of bias hides the informative covariates: their uncorrected importances are lower than the importances of many uninformative variables, and bias strongly distorts the ranking of variables. All these results undoubtedly show that an effective method for bias correction is necessary when using TDNI measures of variable importance.

5 A bias-correction strategy

This Section starts with a brief recall of the bias-correction strategy recently proposed by [Sandri and Zuccolotto, 2008]. Let $\mathbf{X}$ be the $(N \times p)$ matrix containing the N observed values of the p covariates $\{X_1, \ldots, X_p\}$. A set of matrices $\{\mathbf{Z}_s\}_{s=1}^{S}$ is generated by randomly permuting S times the N rows of $\mathbf{X}$.

We call pseudocovariates the columns of $\mathbf{Z}_s$. Row permutation destroys the association existing between the response variable $Y$ and each pseudocovariate $Z_i$; on the contrary, the association between two pseudocovariates $Z_i$ and $Z_j$ is preserved, i.e. it is equal to the association existing between $X_i$ and $X_j$. Horizontally concatenating the matrices $\mathbf{X}$ and $\mathbf{Z}_s$ generates a set of $(N \times 2p)$ matrices $\mathbf{X}_s \equiv [\mathbf{X}, \mathbf{Z}_s]$, $s = 1, 2, \ldots, S$. The augmented matrices $\mathbf{X}_s$ are repeatedly used to predict $Y$ using an ensemble predictor, and S importance measures are then computed for each covariate $X_i$ and for the corresponding pseudocovariate $Z_i$. Let $VI^s_{X_i}$ and $VI^s_{Z_i}$ be the $s$th importance measures of $X_i$ and $Z_i$, respectively. The adjusted variable importance measure $\widetilde{VI}_i$ is computed considering the average of the differences $VI^s_{X_i} - VI^s_{Z_i}$, that is:

$$ \widetilde{VI}_i = \frac{1}{S} \sum_{s=1}^{S} \left( VI^s_{X_i} - VI^s_{Z_i} \right). \qquad (14) $$

For more details about the algorithm and the principles that support and guide the procedure, see [Sandri and Zuccolotto, 2008]. This method was originally developed for classification tree-based ensembles when the Gini gain is used as splitting criterion, but it can be extended to the entire class of TDNI measures. In Section 3 we have shown that for this class of measures the decomposition of equation (11) holds. The proposed bias-correction method is crucially based on this decomposition: in the set $J_U$ of uninformatively-split nodes, each pseudocovariate $Z_i$ shares approximately the same properties (number of possible splits $nps_{ij}$, independence of $Y$) as the corresponding covariate $X_i$ and consequently has approximately the same probability $p_{ij}$ of being selected as splitting variable. Hence, the average importance of $Z_i$ calculated over the S random permutations is an approximation of the sum given in (12): $\mathrm{av}_s[VI^s_{Z_i}] \approx E[\varepsilon_{X_i}]$. In other words, $\frac{1}{S}\sum_{s=1}^{S} VI^s_{Z_i}$ can be used as an approximation of the bias affecting the importance of $X_i$. A direct consequence of the generalization of the bias-correction strategy to TDNI measures is the extension of its domain of applicability to regression problems.
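A minimal sketch of the correction (14), using a regression Random Forest as the ensemble predictor; the function name and the naming of pseudocovariates are illustrative, and the defaults simply mirror the settings used in the experiments reported below.

```r
# Minimal sketch of the pseudocovariate bias correction of equation (14),
# with a regression Random Forest as the ensemble predictor.
library(randomForest)

corrected_vi <- function(X, y, S = 25, ntree = 1000, mtry = 3, nodesize = 5) {
  p <- ncol(X)
  diffs <- replicate(S, {
    Z <- X[sample(nrow(X)), , drop = FALSE]   # row permutation -> pseudocovariates Z_s
    rownames(Z) <- NULL
    colnames(Z) <- paste0("Z", seq_len(p))    # hypothetical names for the copies
    rf <- randomForest(x = cbind(X, Z), y = y,
                       ntree = ntree, mtry = mtry, nodesize = nodesize)
    vi <- importance(rf, type = 2)[, 1]       # TDNI importances of [X, Z_s]
    vi[seq_len(p)] - vi[p + seq_len(p)]       # VI^s_{X_i} - VI^s_{Z_i}
  })
  rowMeans(diffs)                             # average over the S permutations
}

# Usage on the illustrative data of the earlier snippets:
# vi_adj <- corrected_vi(X, y, S = 25)
# vi_adj[vi_adj < 0] <- 0   # negative corrected values can be set to zero
```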

We investigate the effectiveness of the above bias-correction algorithm in the 4 regression problems related to the data generating processes described in Section 4.2. The white boxes of Figures 3(a), 4(a), 5(a) and 6(a) visualize the distribution of the corrected VIs (ntree=1000, mtry=3, nodesize=5 and number of random permutations S = 25). These experiments clearly show that bias correction using pseudocovariates has the power to reduce bias, even when a small number S of random permutations is used. The technique generates correct rankings of covariates according to their importance, and informative variables can be correctly discriminated from uninformative ones. The method also shows good performance when estimating the VIs of a large number of (mutually correlated, mutually independent, informative and uninformative) variables using a very limited number of sample units: all the uninformative covariates correctly have a null mean importance and all the mean VIs of the informative variables are greater than zero.

The optimal number of variables randomly selected in each node, i.e. the value of the parameter mtry that minimizes the out-of-bag prediction error rate of the RF, is typically a function of the number of covariates of the dataset. Extensive simulations (not reported here) show that, when adding the p pseudocovariates, it is not necessary to increase mtry and the optimal value chosen for the original dataset can still be used. This fact can be explained by considering that pseudocovariates are not additional potentially predictive variables, but only competitors of the original covariates. For each covariate, our method generates a twin uninformative pseudocovariate that, only in uninformative splits, participates in the competition for the best split with approximately the same selection probability as $X_i$.

For the sake of comparison, Figures 3(b), 4(b) and 5(b) show the distribution of the (unbiased) permutation measures computed by Conditional Random Forests (CRF). This choice is motivated by two considerations: (a) the unified framework for recursive partitioning used by CRF to grow single trees strongly reduces the number of uninformative splits, and (b) permutation measures are not affected by the kind of bias described in this paper. We have estimated CRF using the cforest command of the party package for R, with the following settings: 1000 trees, 3 randomly picked input variables in each node, minsplit = 5, a quadratic test statistic and Bonferroni-adjusted P-values (see [Hothorn et al., 2006] and [Strobl et al., 2007a]). The results clearly show that these measures and the proposed bias-corrected TDNI measures are qualitatively very similar. Hence, the two methods can be used interchangeably when one needs to calculate VIs. In Simulation (5) the procedure is not able to completely remove the bias due to the association between $X_3$ and $X_1$: a bias in favor of $X_3$ is still present. This effect can also be observed when using the permutation measure calculated by CRF (see Fig. 5(b)). A new conditional permutation scheme for the computation of VI in the case of correlated predictor variables has been recently proposed by [Strobl et al., 2008].
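A sketch of this benchmark with the party package, using the cforest_unbiased defaults for brevity rather than the exact control settings listed above; the data objects are the illustrative ones from the earlier snippets.

```r
# Sketch: conditional inference forest and its permutation importance,
# used here only as a comparison benchmark.
library(party)

dat <- data.frame(y = y, X)
cf  <- cforest(y ~ ., data = dat,
               controls = cforest_unbiased(ntree = 1000, mtry = 3))
varimp(cf)                        # permutation importance
# varimp(cf, conditional = TRUE)  # conditional scheme of Strobl et al. (2008)
```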

6 Case study

The Plasma-Retinol dataset is available at the StatLib Datasets Archive and contains 315 observations of 14 variables, aiming at investigating the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene and other carotenoids. The identification of determinants of low plasma concentrations of retinol and beta-carotene is important because observational studies have suggested that this circumstance might be associated with an increased risk of developing certain types of cancer.

Table 2: Variables in the Plasma-Retinol dataset

Response variables:
Betaplasma: Plasma beta-carotene (ng/ml)
Retplasma: Plasma retinol (ng/ml)

Covariates:
Age: Age (years)
Sex: Sex (1 = Male, 2 = Female)
Smokstat: Smoking status (1 = Never, 2 = Former, 3 = Current Smoker)
Quetelet: Quetelet index (weight/height^2)
Vituse: Vitamin use (1 = Yes, fairly often; 2 = Yes, not often; 3 = No)
Calories: Number of calories consumed per day
Fat: Grams of fat consumed per day
Fiber: Grams of fiber consumed per day
Alcohol: Number of alcoholic drinks consumed per week
Cholesterol: Cholesterol consumed (mg per day)
Betadiet: Dietary beta-carotene consumed (mcg per day)
Retdiet: Dietary retinol consumed (mcg per day)

Previous studies showed that plasma retinol levels tend to vary by age and sex, while the only dietary predictor seems to be alcohol consumption. For plasma beta-carotene, dietary intake, regular use of vitamins and fiber intake are associated with higher plasma concentrations, while the Quetelet index and cholesterol intake are associated with lower plasma levels (Nierenberg et al., 1989). We used the RF algorithm to predict Betaplasma and Retplasma as a function of the 12 covariates described in Table 2. The hyperparameters of the tree-based ensembles are ntree=3000, mtry=4, nodesize=5 and a number of random permutations S = 300. We computed the biased measures $VI_{X_i}$ (Fig. 7(a) and 8(a)) and the bias-corrected importance measures $\widetilde{VI}_{X_i}$ (Fig. 7(b) and 8(b)) for Betaplasma and Retplasma, respectively. We set negative values of $\widetilde{VI}$ to zero. For Betaplasma the corrected measures allow us to identify 5 mainly predictive covariates (Fiber, Betadiet, Vituse, Quetelet, Alcohol). Similarly, the most important predictors of Retplasma seem to be Alcohol, Sex and Age. Our results largely confirm the findings of the preceding analyses, except for the importance of Cholesterol and Alcohol in predicting Betaplasma.

The influence of bias correction on the ranking by importance is apparent. On one side, it makes it possible to discard predictors whose importance is artificially amplified by bias: on the basis of the biased VIs, Fat and Calories seem to be informative covariates for Betaplasma, but after bias removal their importances vanish. On the other side, bias correction can reveal informative predictors hidden by the bias in the VI of other predictors: in Fig. 8(a), Sex seems scarcely influential on Retplasma, but the correction procedure shows that this variable is one of the three most important covariates. Fig. 7(c) and 8(c) show the permutation-based VIs calculated using CRF; these estimates are very similar to the VIs obtained by the proposed algorithm.

[Figures (7) and (8) approximately here]

7 Concluding remarks

It is well known that the Gini VI measure, computed by means of tree-based learning ensembles, is affected by different kinds of bias ([Strobl, 2005]). The existence of bias in VI is potentially dangerous, especially in a variable selection perspective, since it can dramatically alter the ranking of predictors. The main source of bias for the class of TDNI variable importance measures is intimately connected to the tree-construction mechanism: covariates $X_i$ with the same importance in a node $j$ can have different probabilities $p_{ij}$ of being selected as splitting variables. Using theoretical considerations and simulation experiments, the present paper shows that the levels of this bias depend on the characteristics of the covariates, i.e. on their measurement levels. In addition, the analysis indicates that this kind of bias is generated by uninformative splits. These splits are binary partitions of sample units that are completely driven by chance, where the association between response variable and covariates has been entirely captured by earlier splits (the informative splits).

The heuristic bias-correction strategy of [Sandri and Zuccolotto, 2008] is based on the introduction of a set of pseudocovariates in the original dataset, in the spirit of [Wu et al., 2007]. Pseudocovariates are noise variables that are independent of the response variable and have the same correlation structure as the original covariates. Working on simulated and real-life regression problems, the paper shows that pseudocovariates have the capacity to approximate the component of the estimated VI attributable to bias. Hence, the proposed permutation method can reduce the bias due to different measurement levels of the covariates and can yield correct rankings of variables according to their importance. Of course, the drawback of repeatedly generating pseudocovariates is an additional computational burden. This is a problem common to all permutation-based procedures and it cannot be ignored when the method is applied to large-scale datasets.

However, it is fundamental to take into account that the use of the bias-correction method is necessary only when covariates with different levels of measurement are present. For example, genome-wide association studies typically collect data on thousands to hundreds of thousands of single nucleotide polymorphisms that all have the same characteristics (e.g. they are all real-valued variables). In these cases bias in TDNI measures does not affect the covariate ranking and therefore one can avoid applying any correction. Anyway, simulation experiments show that the number S of sets $\{\mathbf{Z}_s\}$ of pseudocovariates required for a satisfactory bias correction is often small and the extra computational effort is generally moderate even in medium-size problems.

References

[Bell and Wang, 2000] Bell, D. and Wang, H. (2000): A formalism for relevance and its application in feature subset selection. Machine Learning, 4(2).
[Berk, 2006] Berk, R.A. (2006): An Introduction to Ensemble Methods for Data Analysis. Sociological Methods & Research, 34(3).
[Breiman et al., 1984] Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984): Classification and Regression Trees. Chapman & Hall, London.
[Breiman, 1996] Breiman, L. (1996a): The heuristic of instability in model selection. Annals of Statistics, 24.
[Breiman, 2001] Breiman, L. (2001): Random Forests. Machine Learning, 45.
[Breiman, 2001b] Breiman, L. (2001b): Statistical modeling: the two cultures. Statistical Science, 16.
[Breiman, 2002] Breiman, L. (2002): Manual on setting up, using, and understanding Random Forests v3.1. Technical Report.
[Breiman et al., 2006] Breiman, L., Cutler, A., Liaw, A. and Wiener, M. (2006): Breiman and Cutler's Random Forests for Classification and Regression. R package.
[Bühlmann and Yu, 2002] Bühlmann, P. and Yu, B. (2002): Analyzing bagging. Annals of Statistics, 30(4).
[Dobra and Gehrke, 2001] Dobra, A. and Gehrke, J. (2001): Bias Correction in Classification Tree Construction. In: Brodley, C.E. and Danyluk, A.P. (eds.), Proceedings of the Seventeenth International Conference on Machine Learning, Williams College, Williamstown, MA, USA.


More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

BAGGING PREDICTORS AND RANDOM FOREST

BAGGING PREDICTORS AND RANDOM FOREST BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGIGNG PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS

More information

Gradient Boosting (Continued)

Gradient Boosting (Continued) Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks Tree-based methods Patrick Breheny December 4 Patrick Breheny STA 621: Nonparametric Statistics 1/36 Introduction Trees Algorithm We ve seen that local methods and splines both operate locally either by

More information

Statistical Consulting Topics Classification and Regression Trees (CART)

Statistical Consulting Topics Classification and Regression Trees (CART) Statistical Consulting Topics Classification and Regression Trees (CART) Suppose the main goal in a data analysis is the prediction of a categorical variable outcome. Such as in the examples below. Given

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

More information

measure in classification trees

measure in classification trees A bias correction algorithm for the Gini variable importance measure in classification trees Marco Sandri and Paola Zuccolotto University of Brescia - Department of Quantitative Methods C.da Santa Chiara

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes. Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

More information

Variable importance in RF. 1 Start. p < Conditional variable importance in RF. 2 n = 15 y = (0.4, 0.6) Other variable importance measures

Variable importance in RF. 1 Start. p < Conditional variable importance in RF. 2 n = 15 y = (0.4, 0.6) Other variable importance measures n = y = (.,.) n = 8 y = (.,.89) n = 8 > 8 n = y = (.88,.8) > > n = 9 y = (.8,.) n = > > > > n = n = 9 y = (.,.) y = (.,.889) > 8 > 8 n = y = (.,.8) n = n = 8 y = (.889,.) > 8 n = y = (.88,.8) n = y = (.8,.)

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA

JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 1 SEPARATING SIGNAL FROM BACKGROUND USING ENSEMBLES OF RULES JEROME H. FRIEDMAN Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 E-mail: jhf@stanford.edu

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Gradient Boosting, Continued

Gradient Boosting, Continued Gradient Boosting, Continued David Rosenberg New York University December 26, 2016 David Rosenberg (New York University) DS-GA 1003 December 26, 2016 1 / 16 Review: Gradient Boosting Review: Gradient Boosting

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1 Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

Supplementary material for Intervention in prediction measure: a new approach to assessing variable importance for random forests

Supplementary material for Intervention in prediction measure: a new approach to assessing variable importance for random forests Supplementary material for Intervention in prediction measure: a new approach to assessing variable importance for random forests Irene Epifanio Dept. Matemàtiques and IMAC Universitat Jaume I Castelló,

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

More information

SF2930 Regression Analysis

SF2930 Regression Analysis SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression

More information

Roman Hornung. Ordinal Forests. Technical Report Number 212, 2017 Department of Statistics University of Munich.

Roman Hornung. Ordinal Forests. Technical Report Number 212, 2017 Department of Statistics University of Munich. Roman Hornung Ordinal Forests Technical Report Number 212, 2017 Department of Statistics University of Munich http://www.stat.uni-muenchen.de Ordinal Forests Roman Hornung 1 October 23, 2017 1 Institute

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014 Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008)

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement

More information

Ensemble Methods: Jay Hyer

Ensemble Methods: Jay Hyer Ensemble Methods: committee-based learning Jay Hyer linkedin.com/in/jayhyer @adatahead Overview Why Ensemble Learning? What is learning? How is ensemble learning different? Boosting Weak and Strong Learners

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Conditional variable importance in R package extendedforest

Conditional variable importance in R package extendedforest Conditional variable importance in R package extendedforest Stephen J. Smith, Nick Ellis, C. Roland Pitcher February 10, 2011 Contents 1 Introduction 1 2 Methods 2 2.1 Conditional permutation................................

More information

Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+TITLE:

Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+TITLE: #+TITLE: Data splitting INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+AUTHOR: Thomas Alexander Gerds #+INSTITUTE: Department of Biostatistics, University of Copenhagen

More information

Decision Trees (Cont.)

Decision Trees (Cont.) Decision Trees (Cont.) R&N Chapter 18.2,18.3 Side example with discrete (categorical) attributes: Predicting age (3 values: less than 30, 30-45, more than 45 yrs old) from census data. Attributes (split

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Harrison B. Prosper. Bari Lectures

Harrison B. Prosper. Bari Lectures Harrison B. Prosper Florida State University Bari Lectures 30, 31 May, 1 June 2016 Lectures on Multivariate Methods Harrison B. Prosper Bari, 2016 1 h Lecture 1 h Introduction h Classification h Grid Searches

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

Bias-Variance in Machine Learning

Bias-Variance in Machine Learning Bias-Variance in Machine Learning Bias-Variance: Outline Underfitting/overfitting: Why are complex hypotheses bad? Simple example of bias/variance Error as bias+variance for regression brief comments on

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Information Theory & Decision Trees

Information Theory & Decision Trees Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr Decision tree classifiers Decision tree representation for modeling dependencies among input variables using

More information

Rule Generation using Decision Trees

Rule Generation using Decision Trees Rule Generation using Decision Trees Dr. Rajni Jain 1. Introduction A DT is a classification scheme which generates a tree and a set of rules, representing the model of different classes, from a given

More information

Lecture 7 Decision Tree Classifier

Lecture 7 Decision Tree Classifier Machine Learning Dr.Ammar Mohammed Lecture 7 Decision Tree Classifier Decision Tree A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification

More information

Oliver Dürr. Statistisches Data Mining (StDM) Woche 11. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften

Oliver Dürr. Statistisches Data Mining (StDM) Woche 11. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften Statistisches Data Mining (StDM) Woche 11 Oliver Dürr Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften oliver.duerr@zhaw.ch Winterthur, 29 November 2016 1 Multitasking

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Adaptive Crowdsourcing via EM with Prior

Adaptive Crowdsourcing via EM with Prior Adaptive Crowdsourcing via EM with Prior Peter Maginnis and Tanmay Gupta May, 205 In this work, we make two primary contributions: derivation of the EM update for the shifted and rescaled beta prior and

More information

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS University of Colorado, Boulder CU Scholar Computer Science Technical Reports Computer Science Spring 5-1-23 Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information