Cost-Weighted Boosting with Jittering and Over/Under-Sampling: JOUS-Boost


David Mease, Statistics Department, Wharton School, University of Pennsylvania, Philadelphia, PA
Abraham J. Wyner, Statistics Department, Wharton School, University of Pennsylvania, Philadelphia, PA
Andreas Buja, Statistics Department, Wharton School, University of Pennsylvania, Philadelphia, PA

Abstract

Most binary classifiers such as AdaBoost assume equal costs of misclassifying the two classes or, equivalently, classify at the 1/2 quantile of the conditional class probabilities. We approach the problem of unequal misclassification costs or, equivalently, classification at quantiles other than 1/2, with the well-known practice of over/under-sampling of the two classes. We present an algorithm that combines jittering of the data with over/under-sampling, called JOUS-Boost. It is simple yet successful, and it preserves advantages such as relative protection against overfitting at arbitrary quantile boundaries and hence for arbitrary misclassification costs. We also use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities, and we compare the estimated class probabilities to those obtained by applying a link function to the boosting score.

1 INTRODUCTION

Most binary classifiers such as AdaBoost assume equal costs of misclassifying the two classes. Often, however, one is interested in classification with unequal costs, as in medical problems where a false negative is often much more serious than a false positive. Alternatively, classification problems are often formulated not in terms of differential misclassification costs but in terms of quantile thresholds, as in marketing when households with an estimated take-rate higher than, say, 0.80 are slated for a campaign. It is not difficult to see that the problem of binary classification with unequal costs is equivalent to the problem of thresholding conditional class probabilities at arbitrary quantiles. As a special case one obtains the equivalence of an equal-cost classifier and a median classifier, which thresholds the conditional class probability at 1/2.

A related problem is that of mapping a classifier to a base rate of the two classes other than the one for which the classifier was trained. For example, the training sample may have contained 10% positives, but one is interested in a classifier that performs well when there are 50% positives. A simple calculation shows that a change of base rate is equivalent to a change in the cost ratio and also to a change of the quantile on the class probabilities. These connections are lucidly explained by Elkan (2001), and we recount them in Section 4 in a form suitable for our purposes.

The problem of classifying with arbitrary cost ratios and/or imbalanced base rates has been addressed by a sizable literature; see a website dedicated to this topic. The problem of imbalance has been of particular interest in areas such as text classification, where there often exist very few positives but a vast supply of negatives, and one then tries to correct for this imbalance.

One general class of approaches to the equivalent problems of unequal costs, arbitrary quantile thresholds, and imbalanced base rates rests on the idea of over- and under-sampling (Chan and Stolfo, 1998; Elkan, 2001; Estabrooks, Jo and Japkowicz, 2004). In these schemes one typically over-samples the minority class by sampling with replacement, and/or under-samples the majority class by sampling without replacement. Sampling with replacement is necessary when the class size is increased, whereas sampling without replacement seems more natural when the class size is decreased. Note that sampling with replacement is bound to produce ties in the sample, the more so the higher the sampling rate. An interesting variation of the idea is the Synthetic Minority Over-Sampling Technique (SMOTE) of Chawla et al. (2002, 2003), which avoids ties in the over-sampled minority class by moving the sampled predictor points toward near neighbors in the minority class.

The necessity to break ties is an issue that we will address in detail in Section 5. We will give evidence that ties can be a problem because classification methods may be driven more by the set of distinct data points than by the multiplicities of the tied data points. Changing multiplicities, which is what sampling with replacement does, will therefore have a smaller than expected effect. Thus changing multiplicities should be accompanied by breaking the ties, as in SMOTE. We give evidence, however, that tie-breaking does not need to be sophisticated: random perturbations, or jittering, of the data points in predictor space work surprisingly well. We demonstrate this point with the combination of jittering and over/under-sampling with AdaBoost, here called JOUS-Boost. Jittering has additional beneficial effects: it is essentially a smoothing method that produces samples of the convolution of the distribution in predictor space with the distribution of the jitters. As a result, classifiers produced with JOUS are weakly continuous as a function of the training data and should hence have more stable sampling properties.

Another general solution to the problem of classification at arbitrary quantile thresholds is direct estimation of the conditional class probability function, or CCPF for short. This is the traditional approach in statistics, primarily in the form of logistic regression. Given an estimate of the CCPF as a function on predictor space, one can trivially achieve classification at arbitrary quantiles by thresholding the estimated CCPF at that quantile. If estimation of a CCPF solves the problem of classification for arbitrary costs, quantiles and imbalances in one fell swoop, it is natural to wonder why one bothers to solve the problem by any other means. Why would there still be an interest in, for example, classifiers trained on over/under-sampled training data? The answer has to do with the fact that finding a good classifier is a robust endeavor, whereas estimation of a CCPF is a subtle business. The latter has to simultaneously contain the information for good classification at all quantiles. By comparison, a classifier is usually described by thresholding some score function at zero, but this function is quite arbitrary except for the requirement to be on the proper side of zero often enough to produce few errors on test samples. This robustness of the classification problem (often observed in the literature) is due to the crude nature of the criterion, namely, misclassification loss.

We illustrate the difficulty of estimating the CCPF with the example of AdaBoost. Recent conceptual interpretations of AdaBoost by Friedman, Hastie and Tibshirani (2000) would suggest that a simple link function maps AdaBoost's scores to conditional class probabilities. We will give evidence that this is not so in practice: by the time AdaBoost produces a successful classifier, its scores have typically diverged to such a degree that putting them through the link function produces values that cluster near 0 and 1 and are hence worthless as CCPF estimates. Thus AdaBoost is successful at estimating classification boundaries but not conditional class probabilities. Because of the relative difficulty of CCPF estimation compared to classification, it will always remain of interest to approach the latter problem directly.
In fact, we take this argument as a license for traveling in the opposite direction: we construct estimators of the CCPF from collections of classifiers computed on a grid of quantiles. The precision of such estimators depends on the denseness of the quantile grid, but in view of the natural sampling variability of CCPF estimators, the granularity of the grid may not need to be very fine to compare with the noise level. Furthermore, if the primary use of the CCPF estimator is thresholding for classification, then the proposed estimator will be satisfactory for practical purposes if the grid contains all quantiles of actual interest.

In the remainder of this article we first give evidence that, for example, AdaBoost does not produce useful estimates of the CCPF (Section 2). We then discuss classification with unequal costs (Section 3) and the use of over/under-sampling (Section 4). Next we indicate how ties defeat over/under-sampling and how the problem can be remedied with simple data perturbations or jittering; we illustrate the result with JOUS-Boost, the combination of AdaBoost with jittering and over/under-sampling (Section 5). We then describe the construction of CCPF estimators from collections of classifiers (Section 6) and document their performance in an empirical study (Section 7).

2 ADABOOST DOES NOT ESTIMATE CONDITIONAL CLASS PROBABILITIES

We use AdaBoost to illustrate the phenomenon that an estimation method can produce useless CCPFs but useful classifiers, although conceptually it should be estimating the former as well. We trace the reason to overfitting in terms of the criterion for CCPF estimation, here exponential loss, while at the same time performing well in terms of the classification criterion, namely, misclassification loss.

We introduce customary notation. We are given training data x_1, ..., x_n and y_1, ..., y_n, where each x_i is a d-dimensional vector of predictors and y_i ∈ {−1, +1} is the associated observed class label. To justify generalization, it is usually assumed that the training data as well as any test data are i.i.d. samples from some population of (x, y) pairs.

The most common implementation of boosting is (discrete) AdaBoost (Freund and Schapire, 1996). The algorithm is as follows.

1. Initialize the weights w_i = 1/n, i = 1, ..., n, and F(x_i) = 0 for all x_i.
2. Fit the classifier f ∈ {−1, 1} to the training data using the weights w_i.
3. Compute the weighted error rate ε = Σ_{i=1}^n w_i I[y_i ≠ f(x_i)] and its log-odds α = log((1 − ε)/ε).
4. Replace F by F + α f.
5. Replace the weights w_i with w_i ← w_i e^{α I[y_i ≠ f(x_i)]} and then renormalize: w_i ← w_i / Σ_j w_j.
6. With the new weights, go back to step 2.

The AdaBoost algorithm repeats this updating process m times and then stops. If we let F_m(x) = F(x) denote the additive score function at the time of stopping, then the final classifier is I{F_m(x) > 0}. It has been observed repeatedly that the performance of the procedure is quite insensitive to the choice of m as long as m is large enough: m = 200 seems to work very well in many real classification problems when the weak or base learner is a stump (a tree with just two leaves). For less weak learners such as trees with a larger number of leaves (we use 8 below), smaller values of m may suffice.

Motivated by analogies between AdaBoost and additive logistic regression, Friedman, Hastie and Tibshirani (2000) proposed a connection between the score function F_m(x) and a logistic regression model. They suggest that an estimate p_m(x) of the CCPF p(x) can be obtained from F_m through an (essentially) logistic link function:

\[ p_m(x) = \frac{e^{F_m(x)}}{e^{F_m(x)} + e^{-F_m(x)}} . \tag{1} \]

This is clearly consistent for median classification, as the symmetry of the link implies that thresholding p_m at 1/2 amounts to thresholding F_m at 0. What is not clear is whether this view of boosting provides a useful tool for estimating the class probabilities p(x) in general.

We examine this question visually through a simple two-dimensional non-linear example. In our example, we restrict the domain of X to the square [0, 50]^2 and construct the level curves of p(x) to be concentric circles with center (25, 25). We take

\[ p(x) = \begin{cases} 1 & r < 8 \\ (28 - r)/20 & 8 \le r \le 28 \\ 0 & r > 28 \end{cases} \]

where r is the distance of x from the point (25, 25) in R^2. Figure 1 shows the level curves of p(x), color-coded such that white is p(x) ≤ .1, green (light grey) is .1 < p(x) ≤ .5, red (dark grey) is .5 < p(x) ≤ .9, and black is p(x) > .9.

Figure 1: Three Level Curves for p(x)

We now simulate n = 400 observations (x, y), where x is chosen uniformly from the square in Figure 1 and y ∈ {−1, 1} is randomly selected according to p(x) above. The goal of the simulation is to determine whether boosting is able to recover p(x) through the logistic link function. Using this as the training set, discrete AdaBoost as described above was implemented using as base classifiers trees of at most eight nodes, fit by the Rpart package in the R language. The algorithm was run for up to m = 3000 iterations.

Figure 2: Estimates of p(x) Quantiles at 10, 200 and 3000 Iterations

The plots in Figure 2 display the resulting estimates of the contours of p(x) (i.e., the quantiles), using the same color-coding as before, for a hold-out sample consisting of a grid of 400 points evenly spread over the region. The three plots correspond to m equal to 10, 200 and 3000 iterations.
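The experiments in this paper were run in R with rpart as the base learner. The following is a minimal Python sketch, not the authors' code, of the same ingredients: the circle model for p(x), discrete AdaBoost as in steps 1-6 above with small trees as base learners (scikit-learn's DecisionTreeClassifier is an assumption of this sketch), and the link of Equation (1).

```python
# Illustrative sketch (not the authors' code): discrete AdaBoost with small
# tree base learners, plus the logistic link of Equation (1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simulate_circle(n, rng):
    """Draw (x, y) from the concentric-circle model on [0, 50]^2."""
    x = rng.uniform(0, 50, size=(n, 2))
    r = np.sqrt(((x - 25.0) ** 2).sum(axis=1))
    p = np.clip((28.0 - r) / 20.0, 0.0, 1.0)       # p(x) = 1, (28-r)/20, 0
    y = np.where(rng.uniform(size=n) < p, 1, -1)
    return x, y, p

def discrete_adaboost(x, y, m=200, max_leaf_nodes=8):
    """Return the list of (alpha, tree) pairs that defines F_m(x)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(m):
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        tree.fit(x, y, sample_weight=w)
        miss = (tree.predict(x) != y)               # I[y_i != f(x_i)]
        eps = np.clip(np.dot(w, miss), 1e-12, 1 - 1e-12)
        alpha = np.log((1.0 - eps) / eps)
        ensemble.append((alpha, tree))
        w = w * np.exp(alpha * miss)                # up-weight misclassified points
        w = w / w.sum()
    return ensemble

def score(ensemble, x):
    """The additive score F_m(x)."""
    return sum(alpha * tree.predict(x) for alpha, tree in ensemble)

def link_probability(f):
    """Equation (1): exp(F) / (exp(F) + exp(-F)) = 1 / (1 + exp(-2F))."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.clip(f, -500, 500)))

rng = np.random.default_rng(0)
x_train, y_train, _ = simulate_circle(400, rng)
x_test, _, p_test = simulate_circle(400, rng)      # a hold-out draw
fit = discrete_adaboost(x_train, y_train, m=200)
p_hat = link_probability(score(fit, x_test))
```

Running this for increasing m and histogramming link_probability(score(fit, x_test)) should exhibit the qualitative behavior discussed next: the classification boundary stays reasonable while the probability estimates drift toward 0 and 1.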
It can be seen in the first plot of Figure 2 that at 10 iterations the estimated quantiles are a reasonable estimate of the true quantiles displayed in Figure 1. However, the second plot shows that by m = 200 iterations overfitting has begun, in the sense that the estimates for the quantiles q = .1 and q = .9 have moved very close to the estimate for the quantile q = .5. The problem worsens as the algorithm is run longer, so that by m = 3000 iterations the estimates for q = .1 and q = .9 are virtually indistinguishable from the estimate for q = .5. Note that for this problem the training error first becomes zero at m = 209 iterations.

The reason why this happens is straightforward. As the number of boosting iterations grows, the scores F(x_i) for which y_i = +1 tend toward +∞, and the scores for which y_i = −1 tend toward −∞. This occurs because the exponential loss is minimized at these limits (assuming the fitted class of functions F(x) is highly flexible and able to interpolate any finite sample, as is the case for most weighted sums of weak learners). But on hold-out samples we observe that when y = +1, boosting usually returns a very large value of F, thus resulting in a correct classification when the threshold for F is zero. The analogue is true for observations in hold-out samples for which y = −1. The effectiveness of boosting for classification problems is not in spite of this phenomenon, but rather due to it. The fact that the absolute values of F_m grow without bound as m increases does not necessarily lead to an overfit with regard to misclassification rate, because the absolute value is irrelevant to a median classifier; only the sign matters. However, if boosting is to be used as a quantile classifier by thresholding at a value other than zero (as prescribed by a link such as that of Equation (1)), then it is clear that the absolute value of F_m does matter. In fact, the tendency of boosting to increase F_m without bound guarantees that boosting will eventually overfit p(x) completely. That is, boosting will eventually estimate p(x) by either 0 or 1 for every future x, regardless of the threshold value specified by the link. Furthermore, by this same reasoning, the specific form of any monotone link function is in fact irrelevant, because the absolute value of F_m is largely a function of m rather than a reflection of the true class probability.

We illustrate this graphically in two ways. First we trace the values of F_m for the points in the hold-out sample for which the true CCPF is between .5 and .6. Using the link function in Equation (1) we can compute that the fitted values of F_m for these points should ideally lie between 0 and .203. However, Figure 3 plots the median of F_m over the points in the hold-out sample for which the true p is between .5 and .6. This plot shows a general linear trend, and by m = 3000 iterations the median is near 1.5, which corresponds to p(x) = .95. This suggests that we can obtain any estimate greater than 1/2 for the CCPF at these points by simply choosing the appropriate stopping time m.

Figure 3: Median of F_m(x) over the points with .5 ≤ p(x) ≤ .6, as a function of the number of iterations

Another way to see this trend is by making histograms of the estimated probabilities. The last three plots in Figure 4 show the resulting estimates of p(x) obtained from boosting through the link at 10, 200 and 3000 iterations, respectively, for the 400 points in the hold-out sample. Comparison with the histogram of the true class probabilities (for the hold-out sample), shown in the first plot in Figure 4, illustrates the divergence of the probability estimates toward {0, 1}. The problem is quite severe even for m = 200 iterations.
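For reference, the arithmetic behind the two numbers quoted above follows directly from Equation (1) and its inverse F = (1/2) log(p/(1 − p)):

\[ p = 0.6 \;\Rightarrow\; F = \tfrac{1}{2}\log\tfrac{0.6}{0.4} \approx 0.203, \qquad F = 1.5 \;\Rightarrow\; p = \frac{e^{3}}{1 + e^{3}} \approx 0.95 . \]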
Figure 4: True Probabilities and Estimates Using the Link Function at 10, 200 and 3000 Iterations

3 CLASSIFICATION WITH UNEQUAL COSTS

The standard classification problem uses simple misclassification rate based on 0-1 loss; that is, misclassification cost is symmetric in the class labels +1 and −1: the cost of misclassifying a +1 is the same as that of misclassifying a −1. An optimal classifier minimizes the out-of-sample misclassification rate. It is known that boosting performs quite well for such problems across many applications and is resistant to overfitting to a surprising degree. Symmetric 0-1 loss is historically motivated by a supposition (common in AI, not in statistics) that there always exists a true class label, that is, that the class probability is either 0 or 1. In such situations the goal is always perfect classification, and 0-1 loss is a reasonable measure. In practice, however, it is common to find situations where unequal loss is appropriate and cost-weighted Bayes risk should be the target.

It is an elementary fact that minimizing loss with unequal costs is equivalent to estimating the boundaries of the conditional class-1 probability at thresholds other than 1/2. We outline the textbook derivation, drawing intuition from the Pima Indians Diabetes Data of the UCI ML Database. Let p(x) = P[Y = +1 | x] and 1 − p(x) = P[Y = −1 | x] be the conditional probabilities of diabetes (+1) and non-diabetes (−1), respectively, and assume the cost of misclassifying a diabetic is 1 − c = 0.90 and the cost of misclassifying a non-diabetic is c = 0.10. (The benefit of scaling the costs to sum to 1 will be obvious shortly.) The following consideration is conditional on an arbitrary but fixed predictor vector x, hence we abbreviate p = p(x). The risk (= expected loss) of classifying as a diabetic is (1 − p)c, and the risk of classifying as a non-diabetic is p(1 − c). The cost-weighted Bayes rule is then obtained by minimizing risk: classify as diabetic when (1 − p)c < p(1 − c) or, equivalently, c < p, here 0.1 < p, which is what we wanted to show. It is a handy convention to scale costs so they add up to 1, and hence the cost c of misclassifying a negative (−1) equals the optimal threshold q on the probability p(x) of a positive (+1).

The previous discussion implies the equivalence between the problem of classification with unequal costs and what we call the problem of quantile classification, that is, estimation of the region p(x) > q in which observations are classified as +1. Most classifiers (AdaBoost included) assume equal costs, c = 1 − c = 0.5, which implies that they are median classifiers: p(x) > q = 1/2.

4 OVER/UNDER-SAMPLING

Boosting algorithms are very good median classifiers but deficient quantile classifiers for q other than 1/2. We leave the boosting algorithm as is and instead tilt the data to force the q-quantile to become the median. Simple over/under-sampling, also called stratification, can convert a median classifier into a q-classifier as follows (a code sketch of the procedure is given below):

Over/Under-sampled Classification:

1. Let N_{+1} and N_{−1} be the marginal counts of labels +1 and −1. Choose k_{+1}, k_{−1} > 0 for which k_{+1}/k_{−1} = (N_{+1}/N_{−1}) / (q/(1 − q)).
2. Select k_{+1} observations from the training set for which y_i = +1, such that each observation has the same chance of being selected.
3. Select k_{−1} observations for which y_i = −1, such that each observation has the same chance of being selected.
4. Obtain a median classifier from the combined sample of k_{+1} + k_{−1} points. Assume its output is a score function F_m(x).
5. Estimate x as having p(x) > q if F_m(x) > 0.

Note here that for k_{±1} < N_{±1} (i.e., under-sampling) the selection can be done by random sampling with or without replacement, and for k_{±1} > N_{±1} (i.e., over-sampling) the selection can be done either by sampling with replacement or by simply augmenting the data with replicate observations. For example, if N_{+1} = 268 and N_{−1} = 500 as in the Pima Indians diabetes data (UCI ML Database), and if the upper two deciles with q = .8 (q/(1 − q) = 4) are to be estimated, then the ratio of the classes needs to be biased in favor of non-diabetics (class −1) by a factor of 4. This comes down to a ratio of k_{+1}/k_{−1} = (268/500)/4 = 0.134. Instead of 268 diabetics per 500 non-diabetics, we now need only 67 diabetics, which could be obtained by sampling from the 268. In order to use the data most efficiently, we will instead use replication of one or both classes (i.e., over-sampling) in our examples. Specifically, since we will be using values of q equal to t/10 for t = 1, ..., 9, we will use k_{+1} = (10 − t) N_{+1} and k_{−1} = t N_{−1}, except in the case q = .5, for which we simply leave the data unmodified, i.e., k_{+1} = N_{+1} and k_{−1} = N_{−1}. In principle, over/under-sampling could be replaced with a weighting procedure if the median classifier permits weighted observations, but this is not usually the case for the algorithms we have in mind.
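The following minimal Python sketch (not the authors' code) illustrates steps 1-4 above under the replication scheme just described, i.e., k_{+1} = (10 − t) N_{+1} and k_{−1} = t N_{−1} for q = t/10; any median classifier that returns a score function can be plugged in.

```python
# Illustrative sketch (not the authors' code) of over/under-sampled
# classification: tilt the class ratio so that the desired quantile q
# becomes the median, then run any median classifier on the tilted data.
import numpy as np

def tilt_for_quantile(x, y, q):
    """Replicate classes so that a median classifier acts as a q-classifier.

    Implements k_{+1}/k_{-1} = (N_{+1}/N_{-1}) / (q/(1-q)) via the choice
    k_{+1} = (10-t) N_{+1}, k_{-1} = t N_{-1} for q = t/10, t = 1, ..., 9
    (pure over-sampling by replication); q = .5 leaves the data unchanged.
    """
    t = int(round(10 * q))
    if t == 5:
        return x, y
    pos = np.where(y == +1)[0]
    neg = np.where(y == -1)[0]
    idx = np.concatenate([np.tile(pos, 10 - t), np.tile(neg, t)])
    return x[idx], y[idx]

def quantile_classify(fit_median_classifier, x, y, x_new, q):
    """Estimate I{p(x_new) > q}: tilt, fit a median classifier, threshold at 0."""
    x_t, y_t = tilt_for_quantile(x, y, q)
    score_fn = fit_median_classifier(x_t, y_t)   # returns a function x -> F_m(x)
    return score_fn(x_new) > 0
```

Here fit_median_classifier is any routine that returns a score function, for example a wrapper around the discrete_adaboost sketch from Section 2, combined with the jittering step sketched in Section 5.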
There have been attempts to modify the re-weighting scheme implicit in AdaBoost to adapt to different cost ratios and quantiles other than 1/2, but the results seem inconclusive (Fan et al., 1999; Ting, 2000).

For the reader's convenience, we briefly give the justification for this re-weighting/re-sampling scheme; it can be inferred from Elkan (2001). We work conveniently in terms of populations/distributions as opposed to finite samples. Let X be the predictor random vector and Y the binary response random variable with values in {+1, −1}. Let further

\[ p(x) = P[Y = +1 \mid x], \quad \pi = P[Y = +1], \quad f_{+1}(x) = P[x \mid Y = +1], \quad f_{-1}(x) = P[x \mid Y = -1], \]

which are, respectively, the conditional probability of Y = +1 given X = x, the unconditional probability of Y = +1, and the conditional densities of X = x given Y = +1 and Y = −1. The densities f_{±1}(x) describe the distributions of the two classes in predictor space. Now the joint distribution of (X, Y) is given by

\[ p(Y = +1, x) = f_{+1}(x)\,\pi, \qquad p(Y = -1, x) = f_{-1}(x)\,(1 - \pi), \]

and hence the conditional probability of Y = +1 given X = x is

\[ p(x) = \frac{f_{+1}(x)\,\pi}{f_{+1}(x)\,\pi + f_{-1}(x)\,(1 - \pi)}, \]

which is just Bayes' theorem. Equivalently we have

\[ \frac{p(x)}{1 - p(x)} = \frac{f_{+1}(x)\,\pi}{f_{-1}(x)\,(1 - \pi)}, \]

or

\[ \frac{p(x)}{1 - p(x)} \Big/ \frac{\pi}{1 - \pi} = \frac{f_{+1}(x)}{f_{-1}(x)}. \tag{2} \]

Now compare this situation with another one that differs only in the mix of class labels, call them π′ and 1 − π′, as opposed to π and 1 − π. Denote by p′(x) the conditional probability of Y = +1 given X = x under this new mix. From Equation (2) it is obvious that p(x) and p′(x) are functions of each other, best expressed in terms of odds:

\[ \frac{p'(x)}{1 - p'(x)} = \frac{p(x)}{1 - p(x)} \cdot \frac{1 - \pi}{\pi} \cdot \frac{\pi'}{1 - \pi'}. \]

Obviously, thresholds q on p(x) and q′ on p′(x) transform the same way. If π/(1 − π) is the original unconditional ratio of class labels, we ask what ratio π′/(1 − π′) we need in order to map the desired q to q′ = 1/2 or, equivalently, q/(1 − q) to q′/(1 − q′) = 1. The answer is given by the condition

\[ 1 = \frac{q'}{1 - q'} = \frac{q}{1 - q} \cdot \frac{1 - \pi}{\pi} \cdot \frac{\pi'}{1 - \pi'}. \]

Solving for π′/(1 − π′), the desired marginal ratio of class labels is

\[ \frac{\pi'}{1 - \pi'} = \frac{\pi}{1 - \pi} \Big/ \frac{q}{1 - q}, \]

which justifies the algorithm above.

5 FIXING THE PROBLEM OF TIES: JITTERS FOR JOUS-BOOST

The over-sampled boosting algorithm is an effective q-classifier after a small number of iterations. However, just as with the link function, our simulations showed that as the number of iterations increases it reverts to median classification (p(x) > 1/2), and the estimates for all quantiles converge to the estimate for the median. The reason for this asymptotic behavior has to do with the fact that the augmented training set includes many replicates of the same (x, y) pair, which is a major problem with over-sampling in general. As boosting re-weights the training samples, the tied points all change their weights jointly; asymptotically this effectively undoes the over/under-sampling.

One solution to this problem is to add a small amount of random error to the predictor x so that the replicates are no longer exact duplicates. While this jittering does add undesirable noise to the estimates of the level curves of the function p(x), the amount of noise needed is not large. Specifically, we add i.i.d. jitters in d dimensions to any points that occur more than once in the augmented dataset. We refer to the algorithm that combines the over/under-sampled classification described above with this jittering as JOUS-Boost.

To illustrate this procedure, we return to the concentric-circle dataset from before. The algorithm was applied to the same training data, with the jitter noise simulated from the bivariate independent uniform distribution on (−3.5, 3.5)^2. Estimates of the p(x) quantiles at .9, .5 and .1 were again obtained. The top three plots in Figure 5 show these estimates obtained by running the algorithm for m = 10 iterations. Comparison with Figures 1 and 2 suggests that, despite the addition of the jitter, the algorithm is just as accurate early on as the boosting algorithm using the link function for estimating these quantiles. More importantly, due to the addition of the jitter, the JOUS-Boost algorithm does not suffer from overfitting as the classifiers based on the link function do. This is illustrated by the bottom three plots in Figure 5, which show the results when the algorithm is run for 3000 iterations, long after the training error is zero even for the augmented datasets. Comparison with Figure 1 shows clearly that the estimates for q = .9, .5 and .1 on the hold-out sample remain quite stable even for extremely large m.
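A minimal Python sketch (not the authors' code) of the jittering step follows: rows created by over-sampling are perturbed by independent uniform noise, while one copy of each original point is left unjittered (a bookkeeping choice of this sketch, consistent with the recycled-jitter description in Section 6). The uniform half-widths are those used in the examples of this paper: 3.5 in each coordinate for the circle data and σ_j/4 per predictor for the sonar data.

```python
# Illustrative sketch (not the authors' code) of the jittering in JOUS-Boost:
# break ties among replicated predictor rows with small uniform noise.
import numpy as np

def jitter_replicates(x, half_width, rng):
    """Jitter rows of x that occur more than once, keeping one clean copy.

    half_width: scalar or length-d array of per-coordinate jitter half-widths,
    e.g. 3.5 for both coordinates in the circle example, or sigma_j / 4 for
    each predictor in the sonar example (x.std(axis=0) / 4).
    """
    x = np.array(x, dtype=float, copy=True)
    _, first_idx, inverse, counts = np.unique(
        x, axis=0, return_index=True, return_inverse=True, return_counts=True)
    tied = counts[inverse] > 1                   # rows that have duplicates
    keep_clean = np.zeros(len(x), dtype=bool)
    keep_clean[first_idx] = True                 # one unjittered copy per point
    jitter_me = tied & ~keep_clean
    noise = rng.uniform(-half_width, half_width, size=x.shape)
    x[jitter_me] += noise[jitter_me]
    return x
```

Combining this with the earlier sketches, one run of JOUS-Boost for a given q would be roughly: tilt with tilt_for_quantile, jitter the result with jitter_replicates (half_width = np.array([3.5, 3.5]) for the circle data), then fit discrete_adaboost on the jittered sample and threshold its score at zero.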
Figure 5: Estimates of p(x) Quantiles at 10 Iterations (top row) and 3000 Iterations (bottom row)

6 ESTIMATING THE CCPF USING QUANTILE CLASSIFIERS

We have shown how a median finder like AdaBoost can be converted to a q-classifier by JOUS-ting the data. In this section we consider a sensible, if ad hoc, algorithm to produce actual estimates of the CCPF p(x). Our procedure is straightforward. We first set a quantization level δ > 2. Then we JOUS the data for a range of quantiles q = 1/δ, ..., 1 − 1/δ. By definition, for each q and every x we estimate D_q(x) = I{p(x) > q} with D̂_q(x) ∈ {0, 1}. While it is clear that D_q(x) is monotonically decreasing in q, the estimates D̂_q(x) will not necessarily be monotone, especially when p(x) is close to q. Thus it is not clear how best to produce an estimate of p(x) from the sequence of quantile-based estimates D̂_q(x).

One solution is to begin with the median: first consider the estimate produced by AdaBoost without any over-sampling, under-sampling or jitter, and denote this estimate by D̂_{.5}(x). The algorithm p-JOUS-Boost is then as follows:

If D̂_{.5}(x) = 1, let p̂(x) = min{q > .5 : D̂_q(x) = 0} − 1/(2δ); if no such q is found, take p̂(x) = 1 − 1/(2δ).
If D̂_{.5}(x) = 0, let p̂(x) = max{q < .5 : D̂_q(x) = 1} + 1/(2δ); if no such q is found, take p̂(x) = 1/(2δ).

We will use this method with δ = 10. Thus our estimate of p(x) for any x will be one of {.05, .15, ..., .95}. The method of selecting the values of k_{+1} and k_{−1} for the nine runs of the JOUS-Boost algorithm corresponding to the nine values of q is the same as described earlier and as we have been using for quantile estimation thus far. The only new feature in the implementation of the JOUS-Boost algorithm is that, in order to avoid introducing unnecessary randomness when applying the algorithm simultaneously for the different values of q, we recycle the jitter across the different values of q. By this we mean that, for example, the eight sets of replicates of the x values with y = −1 for q = .8 have the same jitter values added to them as eight of the nine sets for q = .9. Similarly, since for q = .5 no noise is added, the augmented datasets for the other values of q all contain exactly one copy of the original dataset with no noise added.

Figure 6 shows estimates of p(x) for the hold-out sample using this algorithm applied to the same training set as in the circle example. The estimates are color-coded in our usual scheme: white is p̂ ≤ .1, green (light grey) is .1 < p̂ ≤ .5, red (dark grey) is .5 < p̂ ≤ .9, and black is p̂ > .9. It can be observed that this algorithm inherits the property of the JOUS-Boost algorithm that it does not tend toward a median classifier even for large m.

Figure 6: JOUS Estimates of p(x) at 10, 200 and 3000 Iterations

7 QUANTITATIVE PERFORMANCE COMPARISONS

In this section we quantitatively compare the probabilities produced by the algorithm of the previous section to those of a collection of other algorithms. We consider standard statistical straw men such as logistic regression, as well as decision trees with 8 nodes and nearest neighbor classifiers. We also consider the probability estimates produced by boosting through the link. We conduct our tests on two different data sets: the simulated circle dataset and one real dataset.

There is no single established method for measuring the performance of class probability estimators (see Zadrozny and Elkan, 2001, for a discussion). We choose to use squared loss, calculated as

\[ \frac{1}{b} \sum_{i=1}^{b} \left( \hat{p}(x_i) - p(x_i) \right)^2, \]

where the values x_1, ..., x_b are points in a hold-out sample not used to train the classifier and p(x) is the true CCPF.
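Before turning to the comparisons, the following minimal Python sketch (not the authors' code) illustrates the p-JOUS-Boost combination rule of Section 6 at a single point x whose grid of quantile decisions has already been computed; the dictionary keys t = 1, ..., δ − 1 index the quantiles q = t/δ, and an even δ is assumed so that t = δ/2 corresponds to the median.

```python
# Illustrative sketch (not the authors' code) of the p-JOUS-Boost rule that
# maps a grid of quantile decisions at one point x to a probability estimate.
def pjous_probability(decisions, delta=10):
    """decisions[t] is the 0/1 call of the q = t/delta classifier at x
    (1 means "p(x) > q"); keys run over t = 1, ..., delta - 1, and
    t = delta // 2 is the plain median classifier (no JOUS)."""
    half = 1.0 / (2 * delta)
    mid = delta // 2
    if decisions[mid] == 1:
        higher = [t for t in range(mid + 1, delta) if decisions[t] == 0]
        return min(higher) / delta - half if higher else 1.0 - half
    lower = [t for t in range(1, mid) if decisions[t] == 1]
    return max(lower) / delta + half if lower else half

# Example: the median classifier says p(x) > .5 and the first "not above q"
# call occurs at q = .7, so p_hat(x) = .7 - .05 = .65.
print(round(pjous_probability({1: 1, 2: 1, 3: 1, 4: 1, 5: 1,
                               6: 1, 7: 0, 8: 0, 9: 0}), 2))
```

Note that the rule scans outward from the median and stops at the first disagreeing quantile, so non-monotone decisions beyond that point are simply ignored, as in the text.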
While this can be computed for simulated examples such as our circle example, for a real dataset in which p(x) is unknown the probabilities p(x_i) are replaced by Z_i = (y_i + 1)/2, where the y_i are the class labels in the hold-out sample. Note that for a fixed x this is still minimized in expectation when p̂(x) = p(x), since E[(p̂(x) − Z)^2 | x] = (p̂(x) − p(x))^2 + p(x)(1 − p(x)); squared loss is a proper scoring rule (Buja, Stuetzle and Shen, 2004; Savage, 1971; Schervish, 1989; Shuford, Albert and Massengill, 1966).

7.1 SIMULATED CIRCLE DATASET

The lower of the two curves in the first plot in Figure 7 displays the squared loss for the circle dataset from the previous sections as a function of the number of iterations, calculated using the true probabilities on the hold-out sample of size 400. A minimum squared loss of .010 is achieved early on (at 13 iterations), and the curve appears to asymptote to .030 by 3000 iterations. If stopped when the training error first reaches zero (m = 209), the squared loss is .018. By comparison, the squared loss on the same hold-out sample for the same training data is much larger (as expected) when the probabilities are obtained using the link. This is shown by the top curve in the first plot in Figure 7, which achieves a minimum of .024 and is as large as .120 after 3000 iterations. It is clear that the p-JOUS-Boost algorithm is superior to obtaining probabilities through the link for this example.

Figure 7: Comparison of squared loss for p-JOUS-Boost and the Link on the Simulated Circle Dataset and the Sonar Dataset, as a function of the number of iterations

The p-JOUS-Boost algorithm also displays superior performance in comparison to logistic regression, which yields a squared loss of .103. However, with x_1^2 and x_2^2 included, logistic regression achieved an extremely low squared loss of .003, which is not surprising since these quadratic terms give it the same structural form as the true model. The proposed algorithm also performs well relative to the k-nearest-neighbor algorithm for this example. The first plot in Figure 8 displays the squared loss based on the probability estimates from the k-out-of-n nearest-neighbor algorithm on the hold-out sample as a function of k ranging from 1 to 100. These achieve a minimum of .009 at k = 24, similar to that of p-JOUS-Boost, but are worse than the apparent p-JOUS-Boost asymptotic value of .030 if k < 5 or if k is large.

Finally, p-JOUS-Boost clearly outperformed the trees fit using Rpart with respect to squared loss. The default tree in Rpart, which uses some amount of cross-validation, gave a squared loss of .038. The second plot in Figure 8 gives the squared loss for non-cross-validated trees from Rpart as a function of the number of nodes to which they were limited. The smallest squared loss among these is that of the 4-node tree, which gave a squared loss of .031. The 3-node tree in the plot (which was the base learner for the p-JOUS-Boost algorithm implemented here) had a larger squared loss.

Figure 8: Squared loss using Nearest Neighbor and Rpart Trees for the Circle Example

7.2 SONAR DATASET

We now consider the performance of the p-JOUS-Boost algorithm on the famous sonar dataset. This dataset consists of d = 60 predictors for each of n = 208 observations; it is described in detail by Gorman and Sejnowski (1988). Recall that in the circle example we added noise simulated from the bivariate independent uniform distribution on (−3.5, 3.5)^2. For this example we analogously use uniform(−σ_j/4, σ_j/4) variables for the noise on each of the d predictors, where σ_j is the standard deviation of the jth predictor in the dataset. We advocate this as a rule since it has worked well for the examples we have considered; in principle, however, the amount of noise is a tuning parameter that could be chosen to optimize performance.

Unlike the circle example, in which the true CCPF was known, for this dataset we need to replace the p(x_i) by Z_i = (y_i + 1)/2 as described earlier. The squared loss was averaged over a ten-fold cross-validation with m = 500 iterations for both p-JOUS-Boost and boosting using the link. The results are displayed in the second plot in Figure 7. While both curves generally decrease as a function of the number of iterations, the curve for p-JOUS-Boost is substantially lower throughout, indicating superior performance. Further, this general trend was observed throughout most of the ten individual folds, as well as on other ten-fold cross-validations (not shown).

Comparison to logistic regression is difficult for this dataset, since the two classes are separable by a hyperplane, a situation which commonly occurs for large d. A simple solution to this problem was to perform the logistic regression on the first k principal components, where k < d. The first plot in Figure 9 shows the squared loss averaged over the 10 folds as a function of k. A minimum value of .166 was achieved at k = 13. This is substantially larger than the squared loss for p-JOUS-Boost, which was .078 after m = 500 iterations.

Squared loss averaged over the same ten folds for k-nearest neighbor (on standardized data) and for the non-cross-validated trees is shown in the second and third plots in Figure 9. Comparison with the second plot in Figure 7 shows that the performance of p-JOUS-Boost was better than both of these methods when averaged over the ten folds.

Figure 9: Squared loss using Logistic Regression, Nearest Neighbor and Rpart Trees for the Sonar Dataset
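The evaluation used throughout this section can be summarized in a few lines; the sketch below (not the authors' code) computes the average squared loss against the true p(x_i) when it is known (the circle simulation) and against Z_i = (y_i + 1)/2 otherwise (the sonar data).

```python
# Illustrative sketch (not the authors' code) of the squared-loss criterion
# of Section 7 on a hold-out set.
import numpy as np

def squared_loss(p_hat, p_true=None, y=None):
    """Return (1/b) * sum_i (p_hat(x_i) - target_i)^2 over the hold-out set."""
    p_hat = np.asarray(p_hat, dtype=float)
    if p_true is not None:
        target = np.asarray(p_true, dtype=float)       # simulated data: true p(x_i)
    else:
        target = (np.asarray(y, dtype=float) + 1) / 2  # real data: Z_i from labels in {-1, +1}
    return float(np.mean((p_hat - target) ** 2))
```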

8 CONCLUDING REMARKS

In conclusion, we have presented an algorithm for making boosting cost-sensitive through a combination of over-sampling, under-sampling, and jittering. We have shown that this algorithm produces reasonable estimates of the quantiles that are not sensitive to the number of iterations for which boosting is run. We have also presented a simple method for obtaining conditional class probability estimates from these quantiles. The probability estimates in turn are also not sensitive to the number of iterations and perform well relative to some common methods. In contrast, probability estimates obtained by applying a link to the score function from boosting move toward zero and one as boosting is run, and thus are not meaningful.

Acknowledgements

David Mease's work was supported by an NSF-DMS post-doctoral fellowship.

References

Buja, A., Stuetzle, W. and Shen, Y. (2004), A Study of Loss Functions for Classification and Class Probability Estimation, preprint.

Chan, P. and Stolfo, S. (1998), Towards Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection, in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.

Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002), SMOTE: Synthetic Minority Over-Sampling Technique, Journal of Artificial Intelligence Research, Volume 16.

Chawla, N.V., Lazarevic, A., Hall, L.O. and Bowyer, K.W. (2003), SMOTEBoost: Improving Prediction of the Minority Class in Boosting, in 7th European Conference on Principles and Practice of Knowledge Discovery in Databases.

Elkan, C. (2001), The Foundations of Cost-Sensitive Learning, in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI).

Estabrooks, A., Jo, T. and Japkowicz, N. (2004), A Multiple Resampling Method for Learning from Imbalanced Data Sets, Computational Intelligence, Volume 20, Number 1.

Fan, W., Stolfo, S.J., Zhang, J. and Chan, P.K. (1999), AdaCost: Misclassification Cost-Sensitive Boosting, in Proceedings of the 16th International Conference on Machine Learning.

Friedman, J., Hastie, T. and Tibshirani, R. (2000), Additive Logistic Regression: a Statistical View of Boosting, Annals of Statistics, Volume 28.

Gorman, R.P. and Sejnowski, T.J. (1988), Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets, Neural Networks, Volume 1.

Savage, L.J. (1971), Elicitation of Personal Probabilities and Expectations, Journal of the American Statistical Association, Volume 66, Number 336.

Schervish, M.J. (1989), A General Method for Comparing Probability Assessors, The Annals of Statistics, Volume 17, Number 4.

Shuford, E.H., Albert, A. and Massengill, H.E. (1966), Admissible Probability Measurement Procedures, Psychometrika, Volume 31.

Ting, K.M. (2000), A Comparative Study of Cost-Sensitive Boosting Algorithms, in Proceedings of the 17th International Conference on Machine Learning.

Zadrozny, B. and Elkan, C. (2001), Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers, in Proceedings of the Eighteenth International Conference on Machine Learning.


More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

Logistic Regression and Boosting for Labeled Bags of Instances

Logistic Regression and Boosting for Labeled Bags of Instances Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

Ordinal Classification with Decision Rules

Ordinal Classification with Decision Rules Ordinal Classification with Decision Rules Krzysztof Dembczyński 1, Wojciech Kotłowski 1, and Roman Słowiński 1,2 1 Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

SPECIAL INVITED PAPER

SPECIAL INVITED PAPER The Annals of Statistics 2000, Vol. 28, No. 2, 337 407 SPECIAL INVITED PAPER ADDITIVE LOGISTIC REGRESSION: A STATISTICAL VIEW OF BOOSTING By Jerome Friedman, 1 Trevor Hastie 2 3 and Robert Tibshirani 2

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.

CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber. CS570 Data Mining Anomaly Detection Li Xiong Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber April 3, 2011 1 Anomaly Detection Anomaly is a pattern in the data that does not conform

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

1/sqrt(B) convergence 1/B convergence B

1/sqrt(B) convergence 1/B convergence B The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money How to evaluate credit scorecards - and why using the Gini coefficient has cost you money David J. Hand Imperial College London Quantitative Financial Risk Management Centre August 2009 QFRMC - Imperial

More information

Stochastic Gradient Descent

Stochastic Gradient Descent Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information