Cost-Weighted Boosting with Jittering and Over/Under-Sampling: JOUS-Boost


David Mease, Statistics Department, Wharton School, University of Pennsylvania, Philadelphia, PA
Abraham J. Wyner, Statistics Department, Wharton School, University of Pennsylvania, Philadelphia, PA
Andreas Buja, Statistics Department, Wharton School, University of Pennsylvania, Philadelphia, PA

Abstract

Most binary classifiers such as AdaBoost assume equal costs of misclassifying the two classes or, equivalently, classify at the 1/2 quantile of the conditional class probabilities. We approach the problem of unequal misclassification costs or, equivalently, classification at quantiles other than 1/2, with the well-known practice of over/under-sampling of the two classes. We present an algorithm that combines jittering of the data with over/under-sampling, called JOUS-Boost. It is simple yet successful, and it preserves advantages such as relative protection against overfitting at arbitrary quantile boundaries and hence for arbitrary misclassification costs. We also use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities, and we compare the estimated class probabilities to those obtained by applying a link function to the boosting score.

1 INTRODUCTION

Most binary classifiers such as AdaBoost assume equal costs of misclassifying the two classes. Often, however, one is interested in classification with unequal costs, as in medical problems where a false negative is often much more serious than a false positive. Alternatively, classification problems are often formulated not in terms of differential misclassification costs but in terms of quantile thresholds, as in marketing when households with an estimated take-rate higher than, say, 0.80 are slated for a campaign. It is not difficult to see that the problem of binary classification with unequal costs is equivalent to the problem of thresholding conditional class probabilities at arbitrary quantiles. As a special case one obtains the equivalence of an equal-cost classifier and a median classifier, which thresholds the conditional class probability at 1/2.

A related problem is that of mapping a classifier to a base rate of the two classes other than the one for which the classifier was trained. For example, the training sample may have contained 10% positives, but one is interested in a classifier that performs well when there are 50% positives. A simple calculation shows that a change of base rate is equivalent to a change in the cost ratio and also to a change of the quantile on the class probabilities. These connections are lucidly explained by Elkan (2001), and we recount them in Section 4 in a form suitable for our purposes.

The problem of classifying with arbitrary cost ratios and/or imbalanced base rates has been addressed by a sizable literature; see a website dedicated to this topic. The problem of imbalance has been of particular interest in areas such as text classification, where there often exist very few positives but a vast supply of negatives, and one then tries to correct for this imbalance.

One general class of approaches to the equivalent problems of unequal costs, arbitrary quantile thresholds, and imbalanced base rates rests on the idea of over- and under-sampling (Chan and Stolfo, 1998; Elkan, 2001; Estabrooks, Jo and Japkowicz, 2004). In these schemes one typically over-samples the minority class by sampling with replacement, and/or under-samples the majority class by sampling without replacement. Sampling with replacement is necessary when the class size is increased, whereas sampling without replacement seems more natural when the class size is decreased. Note that sampling with replacement is bound to produce ties in the sample, the more so the higher the sampling rate. An interesting variation of the idea is the Synthetic Minority Over-Sampling Technique (SMOTE) of Chawla et al. (2002, 2003), which avoids ties in the over-sampled minority class by moving the sampled predictor points toward near neighbors in the minority class.

The necessity to break ties is an issue that we will address in detail in Section 5. We will give evidence that ties can be a problem because classification methods may be driven more by the set of distinct data points than by the multiplicities of the tied data points. Changing multiplicities, which is what sampling with replacement does, will therefore have a smaller than expected effect. Thus changing multiplicities should be accompanied by breaking the ties, as in SMOTE. We give evidence, however, that tie-breaking does not need to be sophisticated: random perturbations, or jittering, of the data points in predictor space work surprisingly well. We demonstrate this point with the combination of jittering and over/under-sampling with AdaBoost, here called JOUS-Boost. Jittering has additional beneficial effects: it is essentially a smoothing method that produces samples of the convolution of the distribution in predictor space with the distribution of the jitters. As a result, classifiers produced with JOUS are weakly continuous as a function of the training data and should hence have more stable sampling properties.

Another general solution to the problem of classification at arbitrary quantile thresholds is direct estimation of the conditional class probability function, or CCPF for short. This is the traditional approach in statistics, primarily in the form of logistic regression. Given an estimate of the CCPF as a function on predictor space, one can trivially achieve classification at arbitrary quantiles by thresholding the estimated CCPF at that quantile. If estimation of a CCPF solves the problem of classification for arbitrary costs, quantiles and imbalances in one fell swoop, it is natural to wonder why one bothers to solve the problem by any other means. Why would there still be an interest in, for example, classifiers trained on over/under-sampled training data? The answer has to do with the fact that finding a good classifier is a robust endeavor, whereas estimation of a CCPF is a subtle business. The latter has to simultaneously contain the information for good classification at all quantiles. By comparison, a classifier is usually described by thresholding some score function at zero, but this function is quite arbitrary except for the requirement to be on the proper side of zero often enough to produce few errors on test samples. This robustness of the classification problem (often observed in the literature) is due to the crude nature of the criterion, namely, misclassification loss.

We illustrate the difficulty of estimating the CCPF with the example of AdaBoost. Recent conceptual interpretations of AdaBoost by Friedman, Hastie and Tibshirani (2000) would suggest that a simple link function maps AdaBoost's scores to conditional class probabilities. We will give evidence that this is not so in practice: by the time AdaBoost produces a successful classifier, its scores have typically diverged to such a degree that putting them through the link function produces values that cluster near 0 and 1 and are hence worthless as CCPF estimates. Thus AdaBoost is successful at estimating classification boundaries but not conditional class probabilities. Because of the relative difficulty of CCPF estimation compared to classification, it will always remain of interest to approach the latter problem directly.
In fact, we take this argument as a license for traveling in the opposite direction: we construct estimators of the CCPF from collections of classifiers computed on a grid of quantiles. The precision of such estimators depends on the denseness of the quantile grid, but in view of the natural sampling variability of CCPF estimators, the granularity of the grid may not need to be very fine to compare with the noise level. Furthermore, if the primary use of the CCPF estimator is thresholding for classification, then the proposed estimator will be satisfactory for practical purposes if the grid contains all quantiles of actual interest.

In the remainder of this article we first give evidence that, for example, AdaBoost does not produce useful estimates of the CCPF (Section 2). We then discuss classification with unequal costs (Section 3) and the use of over/under-sampling (Section 4). Next we indicate how ties defeat over/under-sampling and how the problem can be remedied with simple data perturbations or jittering; we illustrate the result with JOUS-Boost, the combination of AdaBoost with jittering and over/under-sampling (Section 5). We then describe the construction of CCPF estimators from collections of classifiers (Section 6) and document their performance in an empirical study (Section 7).

2 ADABOOST DOES NOT ESTIMATE CONDITIONAL CLASS PROBABILITIES

We use AdaBoost to illustrate the phenomenon that an estimation method can produce useless CCPFs but useful classifiers, although conceptually it should be estimating the former as well. We trace the reason to overfitting in terms of the criterion for CCPF estimation, here exponential loss, while at the same time performing well in terms of the classification criterion, namely, misclassification loss.

We introduce customary notation. We are given training data x_1, ..., x_n and y_1, ..., y_n, where each x_i is a d-dimensional vector of predictors and y_i ∈ {−1, +1} is the associated observed class label. To justify generalization, it is usually assumed that the training data as well as any test data are i.i.d. samples from some population of (x, y) pairs.

The most common implementation of boosting is (discrete) AdaBoost (Freund and Schapire, 1996). The algorithm is as follows.

1. Initialize the weights w_i = 1/n, i = 1, ..., n, and F(x_i) = 0 for all x_i.
2. Fit the classifier f ∈ {−1, 1} to the training data using the weights w_i.
3. Compute the weighted error rate ε = Σ_{i=1}^n w_i I[y_i ≠ f(x_i)] and its log-odds α = log((1 − ε)/ε).
4. Replace F by F + α f.
5. Replace the weights w_i with w_i ← w_i e^{α I[y_i ≠ f(x_i)]} and then renormalize: w_i ← w_i / Σ_j w_j.
6. With the new weights, go back to step 2.

The AdaBoost algorithm repeats this updating process m times and then stops. If we let F_m(x) = F(x) denote the additive score function at the time of stopping, then the final classifier is I{F_m(x) > 0}. It has been observed repeatedly that the performance of the procedure is quite insensitive to the choice of m as long as m is large enough: m = 200 seems to work very well in many real classification problems when the weak or base learner is a stump (a tree with just two leaves). For less weak learners such as trees with a larger number of leaves (we use 8 below), smaller values of m may suffice.

Motivated by analogies between AdaBoost and additive logistic regression, Friedman, Hastie and Tibshirani (2000) proposed a connection between the score function F_m(x) and a logistic regression model. They suggest that an estimate p_m(x) of the CCPF p(x) can be obtained from F_m through an (essentially) logistic link function:

\[ p_m(x) = \frac{e^{F_m(x)}}{e^{F_m(x)} + e^{-F_m(x)}} . \tag{1} \]

This is clearly consistent for median classification, as the symmetry of the link implies that thresholding p_m at 1/2 amounts to thresholding F_m at 0. What is not clear is whether this view of boosting provides a useful tool for estimating the class probabilities p(x) in general.

We examine this question visually through a simple two-dimensional non-linear example. In our example, we restrict the domain of X to the square [0, 50]^2 and construct the level curves of p(x) to be concentric circles with center (25, 25). We take

\[ p(x) = \begin{cases} 1 & r < 8 \\ (28 - r)/20 & 8 \le r \le 28 \\ 0 & r > 28 \end{cases} \]

where r is the distance of x from the point (25, 25) in R^2. Figure 1 shows the level curves of p(x), color-coded such that white is p(x) ≤ .1, green (light grey) is .1 < p(x) ≤ .5, red (dark grey) is .5 < p(x) ≤ .9, and black is p(x) > .9.

Figure 1: Three Level Curves for p(x)

We now simulate n = 400 observations (x, y), where x is chosen uniformly from the square in Figure 1 and y ∈ {−1, 1} is randomly selected according to p(x) above. The goal of the simulation is to determine whether boosting is able to recover p(x) through the logistic link function. Using this as the training set, discrete AdaBoost as described above was implemented using as base classifiers trees of at most eight nodes, fit by the Rpart package in the R language. The algorithm was run for up to m = 3000 iterations.

Figure 2: Estimates of p(x) Quantiles at 10, 200 and 3000 Iterations

The plots in Figure 2 display the resulting estimates of the contours of p(x) (i.e., the quantiles), using the same color-coding as before, for a hold-out sample consisting of a grid of 400 points evenly spread over the region. The three plots correspond to m equal to 10, 200 and 3000 iterations.
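The experiments in this paper were run in R with rpart as the base learner. The following is a minimal Python sketch, not the authors' code, of the same ingredients: the circle model for p(x), discrete AdaBoost as in steps 1-6 above with small trees as base learners (scikit-learn's DecisionTreeClassifier is an assumption of this sketch), and the link of Equation (1).

```python
# Illustrative sketch (not the authors' code): discrete AdaBoost with small
# tree base learners, plus the logistic link of Equation (1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simulate_circle(n, rng):
    """Draw (x, y) from the concentric-circle model on [0, 50]^2."""
    x = rng.uniform(0, 50, size=(n, 2))
    r = np.sqrt(((x - 25.0) ** 2).sum(axis=1))
    p = np.clip((28.0 - r) / 20.0, 0.0, 1.0)       # p(x) = 1, (28-r)/20, 0
    y = np.where(rng.uniform(size=n) < p, 1, -1)
    return x, y, p

def discrete_adaboost(x, y, m=200, max_leaf_nodes=8):
    """Return the list of (alpha, tree) pairs that defines F_m(x)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(m):
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        tree.fit(x, y, sample_weight=w)
        miss = (tree.predict(x) != y)               # I[y_i != f(x_i)]
        eps = np.clip(np.dot(w, miss), 1e-12, 1 - 1e-12)
        alpha = np.log((1.0 - eps) / eps)
        ensemble.append((alpha, tree))
        w = w * np.exp(alpha * miss)                # up-weight misclassified points
        w = w / w.sum()
    return ensemble

def score(ensemble, x):
    """The additive score F_m(x)."""
    return sum(alpha * tree.predict(x) for alpha, tree in ensemble)

def link_probability(f):
    """Equation (1): exp(F) / (exp(F) + exp(-F)) = 1 / (1 + exp(-2F))."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.clip(f, -500, 500)))

rng = np.random.default_rng(0)
x_train, y_train, _ = simulate_circle(400, rng)
x_test, _, p_test = simulate_circle(400, rng)      # a hold-out draw
fit = discrete_adaboost(x_train, y_train, m=200)
p_hat = link_probability(score(fit, x_test))
```

Running this for increasing m and histogramming link_probability(score(fit, x_test)) should exhibit the qualitative behavior discussed next: the classification boundary stays reasonable while the probability estimates drift toward 0 and 1.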
It can be seen in the first plot of Figure 2 that at 10 iterations the estimated quantiles are a reasonable estimate of the true quantiles displayed in Figure 1. However, the second plot shows that by m = 200 iterations overfitting has begun, in the sense that the estimates for the quantiles q = .1 and q = .9 have moved very close to the estimate for the quantile q = .5. The problem worsens as the algorithm is run longer, so that by m = 3000 iterations the estimates for q = .1 and q = .9 are virtually indistinguishable from the estimate for q = .5. Note that for this problem the training error first becomes zero at m = 209 iterations.

The reason why this happens is straightforward. As the number of boosting iterations grows, the scores F(x_i) for which y_i = +1 tend toward +∞, and the scores for which y_i = −1 tend toward −∞. This occurs because the exponential loss is minimized at these limits (assuming the fitted class of functions F(x) is highly flexible and able to interpolate any finite sample, as is the case for most weighted sums of weak learners). But on hold-out samples we observe that when y = +1, boosting usually returns a very large value of F, thus resulting in a correct classification when the threshold for F is zero. The analogue is true for observations in hold-out samples for which y = −1. The effectiveness of boosting for classification problems is not in spite of this phenomenon, but rather due to it. The fact that the absolute values of F_m grow without bound as m increases does not necessarily lead to an overfit with regard to misclassification rate, because the absolute value is irrelevant to a median classifier; only the sign matters. However, if boosting is to be used as a quantile classifier by thresholding at a value other than zero (as prescribed by a link such as that of Equation (1)), then it is clear that the absolute value of F_m does matter. In fact, the tendency of boosting to increase F_m without bound guarantees that boosting will eventually overfit p(x) completely. That is, boosting will eventually estimate p(x) by either 0 or 1 for every future x, regardless of the threshold value specified by the link. Furthermore, by this same reasoning, the specific form of any monotone link function is in fact irrelevant, because the absolute value of F_m is largely a function of m rather than a reflection of the true class probability.

We illustrate this graphically in two ways. First we trace the values of F_m for the points in the hold-out sample for which the true CCPF is between .5 and .6. Using the link function in Equation (1) we can compute that the fitted values of F_m for these points should ideally lie between 0 and .203. However, Figure 3 plots the median of F_m over the points in the hold-out sample for which the true p is between .5 and .6. This plot shows a general linear trend, and by m = 3000 iterations the median is near 1.5, which corresponds to p(x) = .95. This suggests that we can obtain any estimate greater than 1/2 for the CCPF at these points by simply choosing the appropriate stopping time m.

Figure 3: Median of F_m(x) over the points with .5 ≤ p(x) ≤ .6, as a function of the number of iterations

Another way to see this trend is by making histograms of the estimated probabilities. The last three plots in Figure 4 show the resulting estimates of p(x) obtained from boosting through the link at 10, 200 and 3000 iterations, respectively, for the 400 points in the hold-out sample. Comparison with the histogram of the true class probabilities (for the hold-out sample), shown in the first plot in Figure 4, illustrates the divergence of the probability estimates toward {0, 1}. The problem is quite severe even for m = 200 iterations.
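For reference, the arithmetic behind the two numbers quoted above follows directly from Equation (1) and its inverse F = (1/2) log(p/(1 − p)):

\[ p = 0.6 \;\Rightarrow\; F = \tfrac{1}{2}\log\tfrac{0.6}{0.4} \approx 0.203, \qquad F = 1.5 \;\Rightarrow\; p = \frac{e^{3}}{1 + e^{3}} \approx 0.95 . \]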
Figure 4: True Probabilities and Estimates Using the Link Function at 10, 200 and 3000 Iterations

3 CLASSIFICATION WITH UNEQUAL COSTS

The standard classification problem uses simple misclassification rate based on 0-1 loss; that is, misclassification cost is symmetric in the class labels +1 and −1: the cost of misclassifying a +1 is the same as that of misclassifying a −1. An optimal classifier minimizes the out-of-sample misclassification rate. It is known that boosting performs quite well for such problems across many applications and is resistant to overfitting to a surprising degree. Symmetric 0-1 loss is historically motivated by a supposition (common in AI, not in statistics) that there always exists a true class label, that is, that the class probability is either 0 or 1. In such situations the goal is always perfect classification, and 0-1 loss is a reasonable measure. In practice, however, it is common to find situations where unequal loss is appropriate and cost-weighted Bayes risk should be the target.

It is an elementary fact that minimizing loss with unequal costs is equivalent to estimating the boundaries of the conditional class-1 probability at thresholds other than 1/2. We outline the textbook derivation, drawing intuition from the Pima Indians Diabetes Data of the UCI ML Database. Let p(x) = P[Y = +1 | x] and 1 − p(x) = P[Y = −1 | x] be the conditional probabilities of diabetes (+1) and non-diabetes (−1), respectively, and assume the cost of misclassifying a diabetic is 1 − c = 0.90 and the cost of misclassifying a non-diabetic is c = 0.10. (The benefit of scaling the costs to sum to 1 will be obvious shortly.) The following consideration is conditional on an arbitrary but fixed predictor vector x, hence we abbreviate p = p(x). The risk (= expected loss) of classifying as a diabetic is (1 − p)c, and the risk of classifying as a non-diabetic is p(1 − c). The cost-weighted Bayes rule is then obtained by minimizing risk: classify as diabetic when (1 − p)c < p(1 − c) or, equivalently, c < p, here 0.1 < p, which is what we wanted to show. It is a handy convention to scale costs so they add up to 1, and hence the cost c of misclassifying a negative (−1) equals the optimal threshold q on the probability p(x) of a positive (+1).

The previous discussion implies the equivalence between the problem of classification with unequal costs and what we call the problem of quantile classification, that is, estimation of the region p(x) > q in which observations are classified as +1. Most classifiers (AdaBoost included) assume equal costs, c = 1 − c = 0.5, which implies that they are median classifiers: p(x) > q = 1/2.

4 OVER/UNDER-SAMPLING

Boosting algorithms are very good median classifiers but deficient quantile classifiers for q other than 1/2. We leave the boosting algorithm as is and instead tilt the data to force the q-quantile to become the median. Simple over/under-sampling, also called stratification, can convert a median classifier into a q-classifier as follows (a code sketch of the procedure is given below):

Over/Under-sampled Classification:

1. Let N_{+1} and N_{−1} be the marginal counts of labels +1 and −1. Choose k_{+1}, k_{−1} > 0 for which k_{+1}/k_{−1} = (N_{+1}/N_{−1}) / (q/(1 − q)).
2. Select k_{+1} observations from the training set for which y_i = +1, such that each observation has the same chance of being selected.
3. Select k_{−1} observations for which y_i = −1, such that each observation has the same chance of being selected.
4. Obtain a median classifier from the combined sample of k_{+1} + k_{−1} points. Assume its output is a score function F_m(x).
5. Estimate x as having p(x) > q if F_m(x) > 0.

Note here that for k_{±1} < N_{±1} (i.e., under-sampling) the selection can be done by random sampling with or without replacement, and for k_{±1} > N_{±1} (i.e., over-sampling) the selection can be done either by sampling with replacement or by simply augmenting the data with replicate observations. For example, if N_{+1} = 268 and N_{−1} = 500 as in the Pima Indians diabetes data (UCI ML Database), and if the upper two deciles with q = .8 (q/(1 − q) = 4) are to be estimated, then the ratio of the classes needs to be biased in favor of non-diabetics (class −1) by a factor of 4. This comes down to a ratio of k_{+1}/k_{−1} = (268/500)/4 = 0.134. Instead of 268 diabetics per 500 non-diabetics, we now need only 67 diabetics, which could be obtained by sampling from the 268. In order to use the data most efficiently, we will instead use replication of one or both classes (i.e., over-sampling) in our examples. Specifically, since we will be using values of q equal to t/10 for t = 1, ..., 9, we will use k_{+1} = (10 − t) N_{+1} and k_{−1} = t N_{−1}, except in the case q = .5, for which we simply leave the data unmodified, i.e., k_{+1} = N_{+1} and k_{−1} = N_{−1}. In principle, over/under-sampling could be replaced with a weighting procedure if the median classifier permits weighted observations, but this is not usually the case for the algorithms we have in mind.
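The following minimal Python sketch (not the authors' code) illustrates steps 1-4 above under the replication scheme just described, i.e., k_{+1} = (10 − t) N_{+1} and k_{−1} = t N_{−1} for q = t/10; any median classifier that returns a score function can be plugged in.

```python
# Illustrative sketch (not the authors' code) of over/under-sampled
# classification: tilt the class ratio so that the desired quantile q
# becomes the median, then run any median classifier on the tilted data.
import numpy as np

def tilt_for_quantile(x, y, q):
    """Replicate classes so that a median classifier acts as a q-classifier.

    Implements k_{+1}/k_{-1} = (N_{+1}/N_{-1}) / (q/(1-q)) via the choice
    k_{+1} = (10-t) N_{+1}, k_{-1} = t N_{-1} for q = t/10, t = 1, ..., 9
    (pure over-sampling by replication); q = .5 leaves the data unchanged.
    """
    t = int(round(10 * q))
    if t == 5:
        return x, y
    pos = np.where(y == +1)[0]
    neg = np.where(y == -1)[0]
    idx = np.concatenate([np.tile(pos, 10 - t), np.tile(neg, t)])
    return x[idx], y[idx]

def quantile_classify(fit_median_classifier, x, y, x_new, q):
    """Estimate I{p(x_new) > q}: tilt, fit a median classifier, threshold at 0."""
    x_t, y_t = tilt_for_quantile(x, y, q)
    score_fn = fit_median_classifier(x_t, y_t)   # returns a function x -> F_m(x)
    return score_fn(x_new) > 0
```

Here fit_median_classifier is any routine that returns a score function, for example a wrapper around the discrete_adaboost sketch from Section 2, combined with the jittering step sketched in Section 5.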
There have been attempts to modify the re-weighting scheme implicit in AdaBoost to adapt to different cost ratios and quantiles other than 1/2, but the results seem inconclusive (Fan et al., 1999; Ting, 2000).

For the reader's convenience, we briefly give the justification for this re-weighting/re-sampling scheme; it can be inferred from Elkan (2001). We work conveniently in terms of populations/distributions as opposed to finite samples. Let X be the predictor random vector and Y the binary response random variable with values in {+1, −1}. Let further

\[ p(x) = P[Y = +1 \mid x], \quad \pi = P[Y = +1], \quad f_{+1}(x) = P[x \mid Y = +1], \quad f_{-1}(x) = P[x \mid Y = -1], \]

which are, respectively, the conditional probability of Y = +1 given X = x, the unconditional probability of Y = +1, and the conditional densities of X = x given Y = +1 and Y = −1. The densities f_{±1}(x) describe the distributions of the two classes in predictor space. Now the joint distribution of (X, Y) is given by

\[ p(Y = +1, x) = f_{+1}(x)\,\pi, \qquad p(Y = -1, x) = f_{-1}(x)\,(1 - \pi), \]

and hence the conditional probability of Y = +1 given X = x is

\[ p(x) = \frac{f_{+1}(x)\,\pi}{f_{+1}(x)\,\pi + f_{-1}(x)\,(1 - \pi)}, \]

which is just Bayes' theorem. Equivalently we have

\[ \frac{p(x)}{1 - p(x)} = \frac{f_{+1}(x)\,\pi}{f_{-1}(x)\,(1 - \pi)}, \]

or

\[ \frac{p(x)}{1 - p(x)} \Big/ \frac{\pi}{1 - \pi} = \frac{f_{+1}(x)}{f_{-1}(x)}. \tag{2} \]

Now compare this situation with another one that differs only in the mix of class labels, call them π′ and 1 − π′, as opposed to π and 1 − π. Denote by p′(x) the conditional probability of Y = +1 given X = x under this new mix. From Equation (2) it is obvious that p(x) and p′(x) are functions of each other, best expressed in terms of odds:

\[ \frac{p'(x)}{1 - p'(x)} = \frac{p(x)}{1 - p(x)} \cdot \frac{1 - \pi}{\pi} \cdot \frac{\pi'}{1 - \pi'}. \]

Obviously, thresholds q on p(x) and q′ on p′(x) transform the same way. If π/(1 − π) is the original unconditional ratio of class labels, we ask what ratio π′/(1 − π′) we need in order to map the desired q to q′ = 1/2 or, equivalently, q/(1 − q) to q′/(1 − q′) = 1. The answer is given by the condition

\[ 1 = \frac{q'}{1 - q'} = \frac{q}{1 - q} \cdot \frac{1 - \pi}{\pi} \cdot \frac{\pi'}{1 - \pi'}. \]

Solving for π′/(1 − π′), the desired marginal ratio of class labels is

\[ \frac{\pi'}{1 - \pi'} = \frac{\pi}{1 - \pi} \Big/ \frac{q}{1 - q}, \]

which justifies the algorithm above.

5 FIXING THE PROBLEM OF TIES: JITTERS FOR JOUS-BOOST

The over-sampled boosting algorithm is an effective q-classifier after a small number of iterations. However, just as with the link function, our simulations showed that as the number of iterations increases it reverts to median classification (p(x) > 1/2), and the estimates for all quantiles converge to the estimate for the median. The reason for this asymptotic behavior has to do with the fact that the augmented training set includes many replicates of the same (x, y) pair, which is a major problem with over-sampling in general. As boosting re-weights the training samples, the tied points all change their weights jointly; asymptotically this effectively undoes the over/under-sampling.

One solution to this problem is to add a small amount of random error to the predictor x so that the replicates are no longer exact duplicates. While this jittering does add undesirable noise to the estimates of the level curves of the function p(x), the amount of noise needed is not large. Specifically, we add i.i.d. jitters in d dimensions to any points that occur more than once in the augmented dataset. We refer to the algorithm that combines the over/under-sampled classification described above with this jittering as JOUS-Boost.

To illustrate this procedure, we return to the concentric-circle dataset from before. The algorithm was applied to the same training data, with the jitter noise simulated from the bivariate independent uniform distribution on (−3.5, 3.5)^2. Estimates of the p(x) quantiles at .9, .5 and .1 were again obtained. The top three plots in Figure 5 show these estimates obtained by running the algorithm for m = 10 iterations. Comparison with Figures 1 and 2 suggests that, despite the addition of the jitter, the algorithm is just as accurate early on as the boosting algorithm using the link function for estimating these quantiles. More importantly, due to the addition of the jitter, the JOUS-Boost algorithm does not suffer from overfitting as the classifiers based on the link function do. This is illustrated by the bottom three plots in Figure 5, which show the results when the algorithm is run for 3000 iterations, long after the training error is zero even for the augmented datasets. Comparison with Figure 1 shows clearly that the estimates for q = .9, .5 and .1 on the hold-out sample remain quite stable even for extremely large m.
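A minimal Python sketch (not the authors' code) of the jittering step follows: rows created by over-sampling are perturbed by independent uniform noise, while one copy of each original point is left unjittered (a bookkeeping choice of this sketch, consistent with the recycled-jitter description in Section 6). The uniform half-widths are those used in the examples of this paper: 3.5 in each coordinate for the circle data and σ_j/4 per predictor for the sonar data.

```python
# Illustrative sketch (not the authors' code) of the jittering in JOUS-Boost:
# break ties among replicated predictor rows with small uniform noise.
import numpy as np

def jitter_replicates(x, half_width, rng):
    """Jitter rows of x that occur more than once, keeping one clean copy.

    half_width: scalar or length-d array of per-coordinate jitter half-widths,
    e.g. 3.5 for both coordinates in the circle example, or sigma_j / 4 for
    each predictor in the sonar example (x.std(axis=0) / 4).
    """
    x = np.array(x, dtype=float, copy=True)
    _, first_idx, inverse, counts = np.unique(
        x, axis=0, return_index=True, return_inverse=True, return_counts=True)
    tied = counts[inverse] > 1                   # rows that have duplicates
    keep_clean = np.zeros(len(x), dtype=bool)
    keep_clean[first_idx] = True                 # one unjittered copy per point
    jitter_me = tied & ~keep_clean
    noise = rng.uniform(-half_width, half_width, size=x.shape)
    x[jitter_me] += noise[jitter_me]
    return x
```

Combining this with the earlier sketches, one run of JOUS-Boost for a given q would be roughly: tilt with tilt_for_quantile, jitter the result with jitter_replicates (half_width = np.array([3.5, 3.5]) for the circle data), then fit discrete_adaboost on the jittered sample and threshold its score at zero.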
Figure 5: Estimates of p(x) Quantiles at 10 Iterations (top row) and 3000 Iterations (bottom row)

6 ESTIMATING THE CCPF USING QUANTILE CLASSIFIERS

We have shown how a median finder like AdaBoost can be converted to a q-classifier by JOUS-ting the data. In this section we consider a sensible, if ad hoc, algorithm to produce actual estimates of the CCPF p(x). Our procedure is straightforward. We first set a quantization level δ > 2. Then we JOUS the data for a range of quantiles q = 1/δ, ..., 1 − 1/δ. By definition, for each q and every x we estimate D_q(x) = I{p(x) > q} with D̂_q(x) ∈ {0, 1}. While it is clear that D_q(x) is monotonically decreasing in q, the estimates D̂_q(x) will not necessarily be monotone, especially when p(x) is close to q. Thus it is not clear how best to produce an estimate of p(x) from the sequence of quantile-based estimates D̂_q(x).

One solution is to begin with the median: first consider the estimate produced by AdaBoost without any over-sampling, under-sampling or jitter, and denote this estimate by D̂_{.5}(x). The algorithm p-JOUS-Boost is then as follows:

If D̂_{.5}(x) = 1, let p̂(x) = min{q > .5 : D̂_q(x) = 0} − 1/(2δ); if no such q is found, take p̂(x) = 1 − 1/(2δ).
If D̂_{.5}(x) = 0, let p̂(x) = max{q < .5 : D̂_q(x) = 1} + 1/(2δ); if no such q is found, take p̂(x) = 1/(2δ).

We will use this method with δ = 10. Thus our estimate of p(x) for any x will be one of {.05, .15, ..., .95}. The method of selecting the values of k_{+1} and k_{−1} for the nine runs of the JOUS-Boost algorithm corresponding to the nine values of q is the same as described earlier and as we have been using for quantile estimation thus far. The only new feature in the implementation of the JOUS-Boost algorithm is that, in order to avoid introducing unnecessary randomness when applying the algorithm simultaneously for the different values of q, we recycle the jitter across the different values of q. By this we mean that, for example, the eight sets of replicates of the x values with y = −1 for q = .8 have the same jitter values added to them as eight of the nine sets for q = .9. Similarly, since for q = .5 no noise is added, the augmented datasets for the other values of q all contain exactly one copy of the original dataset with no noise added.

Figure 6 shows estimates of p(x) for the hold-out sample using this algorithm applied to the same training set as in the circle example. The estimates are color-coded in our usual scheme: white is p̂ ≤ .1, green (light grey) is .1 < p̂ ≤ .5, red (dark grey) is .5 < p̂ ≤ .9, and black is p̂ > .9. It can be observed that this algorithm inherits the property of the JOUS-Boost algorithm that it does not tend toward a median classifier even for large m.

Figure 6: JOUS Estimates of p(x) at 10, 200 and 3000 Iterations

7 QUANTITATIVE PERFORMANCE COMPARISONS

In this section we quantitatively compare the probabilities produced by the algorithm of the previous section to those of a collection of other algorithms. We consider standard statistical straw men such as logistic regression, as well as decision trees with 8 nodes and nearest neighbor classifiers. We also consider the probability estimates produced by boosting through the link. We conduct our tests on two different data sets: the simulated circle dataset and one real dataset.

There is no single established method for measuring the performance of class probability estimators (see Zadrozny and Elkan, 2001, for a discussion). We choose to use squared loss, calculated as

\[ \frac{1}{b} \sum_{i=1}^{b} \left( \hat{p}(x_i) - p(x_i) \right)^2, \]

where the values x_1, ..., x_b are points in a hold-out sample not used to train the classifier and p(x) is the true CCPF.
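Before turning to the comparisons, the following minimal Python sketch (not the authors' code) illustrates the p-JOUS-Boost combination rule of Section 6 at a single point x whose grid of quantile decisions has already been computed; the dictionary keys t = 1, ..., δ − 1 index the quantiles q = t/δ, and an even δ is assumed so that t = δ/2 corresponds to the median.

```python
# Illustrative sketch (not the authors' code) of the p-JOUS-Boost rule that
# maps a grid of quantile decisions at one point x to a probability estimate.
def pjous_probability(decisions, delta=10):
    """decisions[t] is the 0/1 call of the q = t/delta classifier at x
    (1 means "p(x) > q"); keys run over t = 1, ..., delta - 1, and
    t = delta // 2 is the plain median classifier (no JOUS)."""
    half = 1.0 / (2 * delta)
    mid = delta // 2
    if decisions[mid] == 1:
        higher = [t for t in range(mid + 1, delta) if decisions[t] == 0]
        return min(higher) / delta - half if higher else 1.0 - half
    lower = [t for t in range(1, mid) if decisions[t] == 1]
    return max(lower) / delta + half if lower else half

# Example: the median classifier says p(x) > .5 and the first "not above q"
# call occurs at q = .7, so p_hat(x) = .7 - .05 = .65.
print(round(pjous_probability({1: 1, 2: 1, 3: 1, 4: 1, 5: 1,
                               6: 1, 7: 0, 8: 0, 9: 0}), 2))
```

Note that the rule scans outward from the median and stops at the first disagreeing quantile, so non-monotone decisions beyond that point are simply ignored, as in the text.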
While this can be computed for simulated examples such as our circle example, for a real dataset in which p(x) is unknown the probabilities p(x_i) are replaced by Z_i = (y_i + 1)/2, where the y_i are the class labels in the hold-out sample. Note that for a fixed x this is still minimized in expectation when p̂(x) = p(x), since E[(p̂(x) − Z)^2 | x] = (p̂(x) − p(x))^2 + p(x)(1 − p(x)); squared loss is a proper scoring rule (Buja, Stuetzle and Shen, 2004; Savage, 1971; Schervish, 1989; Shuford, Albert and Massengill, 1966).

7.1 SIMULATED CIRCLE DATASET

The lower of the two curves in the first plot in Figure 7 displays the squared loss for the circle dataset from the previous sections as a function of the number of iterations, calculated using the true probabilities on the hold-out sample of size 400. A minimum squared loss of .010 is achieved early on (at 13 iterations), and the curve appears to asymptote to .030 by 3000 iterations. If stopped when the training error first reaches zero (m = 209), the squared loss is .018. By comparison, the squared loss on the same hold-out sample for the same training data is much larger (as expected) when the probabilities are obtained using the link. This is shown by the top curve in the first plot in Figure 7, which achieves a minimum of .024 and is as large as .120 after 3000 iterations. It is clear that the p-JOUS-Boost algorithm is superior to obtaining probabilities through the link for this example.

Figure 7: Comparison of squared loss for p-JOUS-Boost and the Link on the Simulated Circle Dataset and the Sonar Dataset, as a function of the number of iterations

The p-JOUS-Boost algorithm also displays superior performance in comparison to logistic regression, which yields a squared loss of .103. However, with x_1^2 and x_2^2 included, logistic regression achieved an extremely low squared loss of .003, which is not surprising since these quadratic terms give it the same structural form as the true model. The proposed algorithm also performs well relative to the k-nearest-neighbor algorithm for this example. The first plot in Figure 8 displays the squared loss based on the probability estimates from the k-out-of-n nearest-neighbor algorithm on the hold-out sample as a function of k ranging from 1 to 100. These achieve a minimum of .009 at k = 24, similar to that of p-JOUS-Boost, but are worse than the apparent p-JOUS-Boost asymptotic value of .030 if k < 5 or if k is large.

Finally, p-JOUS-Boost clearly outperformed the trees fit using Rpart with respect to squared loss. The default tree in Rpart, which uses some amount of cross-validation, gave a squared loss of .038. The second plot in Figure 8 gives the squared loss for non-cross-validated trees from Rpart as a function of the number of nodes to which they were limited. The smallest squared loss among these is that of the 4-node tree, which gave a squared loss of .031. The 3-node tree in the plot (which was the base learner for the p-JOUS-Boost algorithm implemented here) had a larger squared loss.

Figure 8: Squared loss using Nearest Neighbor and Rpart Trees for the Circle Example

7.2 SONAR DATASET

We now consider the performance of the p-JOUS-Boost algorithm on the famous sonar dataset. This dataset consists of d = 60 predictors for each of n = 208 observations; it is described in detail by Gorman and Sejnowski (1988). Recall that in the circle example we added noise simulated from the bivariate independent uniform distribution on (−3.5, 3.5)^2. For this example we analogously use uniform(−σ_j/4, σ_j/4) variables for the noise on each of the d predictors, where σ_j is the standard deviation of the jth predictor in the dataset. We advocate this as a rule since it has worked well for the examples we have considered; in principle, however, the amount of noise is a tuning parameter that could be chosen to optimize performance.

Unlike the circle example, in which the true CCPF was known, for this dataset we need to replace the p(x_i) by Z_i = (y_i + 1)/2 as described earlier. The squared loss was averaged over a ten-fold cross-validation with m = 500 iterations for both p-JOUS-Boost and boosting using the link. The results are displayed in the second plot in Figure 7. While both curves generally decrease as a function of the number of iterations, the curve for p-JOUS-Boost is substantially lower throughout, indicating superior performance. Further, this general trend was observed throughout most of the ten individual folds, as well as on other ten-fold cross-validations (not shown).

Comparison to logistic regression is difficult for this dataset, since the two classes are separable by a hyperplane, a situation which commonly occurs for large d. A simple solution to this problem was to perform the logistic regression on the first k principal components, where k < d. The first plot in Figure 9 shows the squared loss averaged over the 10 folds as a function of k. A minimum value of .166 was achieved at k = 13. This is substantially larger than the squared loss for p-JOUS-Boost, which was .078 after m = 500 iterations.

Squared loss averaged over the same ten folds for k-nearest neighbor (on standardized data) and for the non-cross-validated trees is shown in the second and third plots in Figure 9. Comparison with the second plot in Figure 7 shows that the performance of p-JOUS-Boost was better than both of these methods when averaged over the ten folds.

Figure 9: Squared loss using Logistic Regression, Nearest Neighbor and Rpart Trees for the Sonar Dataset
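The evaluation used throughout this section can be summarized in a few lines; the sketch below (not the authors' code) computes the average squared loss against the true p(x_i) when it is known (the circle simulation) and against Z_i = (y_i + 1)/2 otherwise (the sonar data).

```python
# Illustrative sketch (not the authors' code) of the squared-loss criterion
# of Section 7 on a hold-out set.
import numpy as np

def squared_loss(p_hat, p_true=None, y=None):
    """Return (1/b) * sum_i (p_hat(x_i) - target_i)^2 over the hold-out set."""
    p_hat = np.asarray(p_hat, dtype=float)
    if p_true is not None:
        target = np.asarray(p_true, dtype=float)       # simulated data: true p(x_i)
    else:
        target = (np.asarray(y, dtype=float) + 1) / 2  # real data: Z_i from labels in {-1, +1}
    return float(np.mean((p_hat - target) ** 2))
```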

8 CONCLUDING REMARKS

In conclusion, we have presented an algorithm for making boosting cost-sensitive through a combination of over-sampling, under-sampling, and jittering. We have shown that this algorithm produces reasonable estimates of the quantiles that are not sensitive to the number of iterations for which boosting is run. We have also presented a simple method for obtaining conditional class probability estimates from these quantiles. The probability estimates in turn are also not sensitive to the number of iterations and perform well relative to some common methods. In contrast, probability estimates obtained by applying a link to the score function from boosting move toward zero and one as boosting is run, and thus are not meaningful.

Acknowledgements

David Mease's work was supported by an NSF-DMS post-doctoral fellowship.

References

Buja, A., Stuetzle, W. and Shen, Y. (2004), A Study of Loss Functions for Classification and Class Probability Estimation, preprint.

Chan, P. and Stolfo, S. (1998), Towards Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection, in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.

Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002), SMOTE: Synthetic Minority Over-Sampling Technique, Journal of Artificial Intelligence Research, Volume 16.

Chawla, N.V., Lazarevic, A., Hall, L.O. and Bowyer, K.W. (2003), SMOTEBoost: Improving Prediction of the Minority Class in Boosting, in 7th European Conference on Principles and Practice of Knowledge Discovery in Databases.

Elkan, C. (2001), The Foundations of Cost-Sensitive Learning, in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI).

Estabrooks, A., Jo, T. and Japkowicz, N. (2004), A Multiple Resampling Method for Learning from Imbalanced Data Sets, Computational Intelligence, Volume 20, Number 1.

Fan, W., Stolfo, S.J., Zhang, J. and Chan, P.K. (1999), AdaCost: Misclassification Cost-Sensitive Boosting, in Proceedings of the 16th International Conference on Machine Learning.

Friedman, J., Hastie, T. and Tibshirani, R. (2000), Additive Logistic Regression: a Statistical View of Boosting, Annals of Statistics, Volume 28.

Gorman, R.P. and Sejnowski, T.J. (1988), Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets, Neural Networks, Volume 1.

Savage, L.J. (1971), Elicitation of Personal Probabilities and Expectations, Journal of the American Statistical Association, Volume 66, Number 336.

Schervish, M.J. (1989), A General Method for Comparing Probability Assessors, The Annals of Statistics, Volume 17, Number 4.

Shuford, E.H., Albert, A. and Massengill, H.E. (1966), Admissible Probability Measurement Procedures, Psychometrika, Volume 31.

Ting, K.M. (2000), A Comparative Study of Cost-Sensitive Boosting Algorithms, in Proceedings of the 17th International Conference on Machine Learning.

Zadrozny, B. and Elkan, C. (2001), Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers, in Proceedings of the Eighteenth International Conference on Machine Learning.


More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

Logistic Regression and Boosting for Labeled Bags of Instances

Logistic Regression and Boosting for Labeled Bags of Instances Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

Ordinal Classification with Decision Rules

Ordinal Classification with Decision Rules Ordinal Classification with Decision Rules Krzysztof Dembczyński 1, Wojciech Kotłowski 1, and Roman Słowiński 1,2 1 Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

SPECIAL INVITED PAPER

SPECIAL INVITED PAPER The Annals of Statistics 2000, Vol. 28, No. 2, 337 407 SPECIAL INVITED PAPER ADDITIVE LOGISTIC REGRESSION: A STATISTICAL VIEW OF BOOSTING By Jerome Friedman, 1 Trevor Hastie 2 3 and Robert Tibshirani 2

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.

CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber. CS570 Data Mining Anomaly Detection Li Xiong Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber April 3, 2011 1 Anomaly Detection Anomaly is a pattern in the data that does not conform

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

1/sqrt(B) convergence 1/B convergence B

1/sqrt(B) convergence 1/B convergence B The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money How to evaluate credit scorecards - and why using the Gini coefficient has cost you money David J. Hand Imperial College London Quantitative Financial Risk Management Centre August 2009 QFRMC - Imperial

More information

Stochastic Gradient Descent

Stochastic Gradient Descent Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information