Error Limiting Reductions Between Classification Tasks

Alina Beygelzimer, IBM T. J. Watson Research Center, Hawthorne, NY 10532
Varsha Dani, Department of Computer Science, University of Chicago, Chicago, IL 60637
Tom Hayes, Computer Science Division, University of California, Berkeley, CA 94720
John Langford, TTI-Chicago, Chicago, IL 60637
Bianca Zadrozny, IBM T. J. Watson Research Center, Yorktown Heights, NY

Abstract

We introduce a reduction-based model for analyzing supervised learning tasks. We use this model to devise a new reduction from multi-class cost-sensitive classification to binary classification with the following guarantee: if the learned binary classifier has error rate at most ε, then the cost-sensitive classifier has cost at most 2ε times the expected sum of costs of all possible labels. Since cost-sensitive classification can embed any bounded-loss finite-choice supervised learning task, this result shows that any such task can be solved using a binary classification oracle. Finally, we present experimental results showing that our new reduction outperforms existing algorithms for multi-class cost-sensitive learning.

1 Introduction

Supervised learning involves learning a function from supplied examples. There are many natural supervised learning tasks; classification alone comes in a number of different forms (e.g., binary, multi-class, cost-sensitive, importance weighted). Generally, there are two ways to approach a new task. The first is to solve it independently of other tasks. The second is to reduce it to a task that has already been thoroughly analyzed, automatically transferring existing theory and algorithms from well-understood tasks to new tasks. We investigate (and advocate) the latter approach.

(Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).)

Motivation. Informally, a reduction is a learning machine that solves one task using oracle (i.e., black-box) access to a solver for another task. Reductions are a powerful tool with a number of desirable properties:

1. If A can be reduced to B, then techniques for solving B can be used to solve A, transferring results from one learning task to another. This saves both research and development effort.

2. Reductions allow one to study an entire class of tasks by focusing on a single primitive (they also help to identify such primitives). Any progress on the primitive immediately implies progress for every task in the class.

3. Viewed as negative statements, reductions imply relative lower bounds stating that progress on one task cannot be achieved without making progress on some primitive.

There is an essential discrepancy between the theory and practice of supervised learning. On one hand, it is impossible to prove, without making any assumptions, that we can learn to predict well. (For example, in PAC learning we might assume that all examples are sampled independently and that the problem is predicting an unknown decision list.) On the other hand, learning algorithms are often successfully used in practice. Reductions give us a way to cope with this discrepancy by making relative statements of the form "good performance of the oracle on subproblems generated by the reduction translates into good performance on the original problem." Such statements can be made without any assumptions that cannot be verified or that do not generally hold in practice. Thus reductions give us a tool for showing relative ability to learn (relative to basic, better understood primitives).
We propose an analysis framework allowing us to prove such error-transformation guarantees. Attempts to understand this framework in terms of sample complexity based models of learning (Vapnik & Chervonenkis, 1971; Valiant, 1984) will fail; it is important to understand that this is a different model of analysis. To test the model, we consider learning machines which have oracle access to a binary classification learner, and ask what can be done with such a machine.

Our main technical contribution is Section 3, which presents a new reduction from cost-sensitive classification to binary classification and quantifies the way in which errors by the binary classification oracle induce errors in the original problem. Since cost-sensitive classification can (in some precise sense) express any bounded-loss finite-choice supervised learning task, the reduction shows that any such task can be solved using a binary classification oracle. To give further evidence that the proposed framework allows for a tractable analysis, we show how to analyze several existing algorithms in this model (Section 6).

In addition to the above theoretical motivations, there is significant empirical evidence (see Zadrozny et al., 2003; Freund & Schapire, 1997; Dietterich & Bakiri, 1995) that this style of analysis often produces learning algorithms that perform well in practice. Section 3.2 presents experimental results showing that our new reduction outperforms existing algorithms for multi-class cost-sensitive learning.

2 Basic Definitions

This section defines the basic concepts.

Definition 1. A supervised learning task is a tuple (K, Y, l), where K is the space of supervised advice available at training time, Y is a prediction space, and l : K × Y → [0, ∞) is a loss function.

We assume that l is nonconstant in both arguments, which is satisfied by any reasonable loss function. To test the definition, we instantiate it for common classification tasks. For binary classification, K = {0, 1} (the space of labels given for training examples), Y = {0, 1}, and l(k, y) = I(k ≠ y), where I(·) is the indicator function which is 1 when its argument is true and 0 otherwise. A simple generalization to more than two labels gives multi-class classification with K = Y = {1, ..., r} and l(k, y) = I(k ≠ y).

In general, the space K of advice provided at training time need not equal the prediction space Y. We sometimes need this additional flexibility to provide some information to the learning algorithm about the loss of different choices. For example, in importance weighted classification, we want to specify that predicting some examples correctly is more important than predicting others, which is done by letting K = {1, ..., r} × [0, ∞) and l(⟨k_1, k_2⟩, y) = k_2 · I(k_1 ≠ y). In cost-sensitive classification, K is used to associate a cost to each prediction y ∈ Y by setting K = [0, ∞)^r and l(⟨k_1, ..., k_r⟩, y) = k_y. In both examples above, the prediction space Y is {1, ..., r}. Cost-sensitive classification is often useful in practice since each prediction may generally have a different associated loss, and the goal is to minimize the expected loss (instead of the number of misclassifications).
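To make these task definitions concrete, here is a minimal Python sketch (ours, not part of the paper) writing out the four loss functions just described. The function names and the 1-based label convention are our own illustrative choices.

# Illustrative sketch of the loss functions l(k, y) for the tasks defined above.

def binary_loss(k, y):
    """Binary classification: K = Y = {0, 1}, l(k, y) = I(k != y)."""
    return float(k != y)

def multiclass_loss(k, y):
    """Multi-class classification: K = Y = {1, ..., r}, l(k, y) = I(k != y)."""
    return float(k != y)

def importance_weighted_loss(k, y):
    """Importance weighted classification: k = (label, importance),
    l((k1, k2), y) = k2 * I(k1 != y)."""
    k1, k2 = k
    return k2 * float(k1 != y)

def cost_sensitive_loss(k, y):
    """Cost-sensitive classification: k = (k_1, ..., k_r) is a vector of costs,
    l(k, y) = k_y, the cost of the predicted label y in {1, ..., r}."""
    return float(k[y - 1])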
A task is (implicitly) a set of learning problems, each of which requires more information to be fully specified.

Definition 2. A supervised learning problem is a tuple (D, X, T), where T = (K, Y, l) is a supervised learning task, X is an arbitrary feature space, and D is a distribution over X × K.

The goal in solving a supervised learning problem is to find a hypothesis h : X → Y minimizing the expected loss E_{(x,k)~D} l(k, h(x)). Finding h is difficult because D is generally not known at training time. For any particular hypothesis h, the loss rate h_D = E_{(x,k)~D} l(k, h(x)) is a fundamental quantity of interest.

Definition 3. A supervised learning algorithm for task (K, Y, l) is a procedure mapping any finite set of examples in (X × K)* to a hypothesis h : X → Y.

We use these definitions in Section 4 to define an error-limiting reductions model. Ultimately, the real test of a learning model is whether it allows for tractable analysis leading to useful learning algorithms. To motivate the formulation of the model, we first show an example of what can be done with it. In particular, we present a new reduction from cost-sensitive classification to binary classification and prove its error-transformation bounds. To give further evidence, Section 6 shows that the model captures several existing machine learning algorithms.

3 Reduction from Cost-Sensitive Classification to Binary Classification

We present a new reduction, Weighted All-Pairs (WAP), from r-class cost-sensitive classification to importance weighted binary classification. It can be composed with the Costing reduction (Zadrozny et al., 2003) to yield a reduction to binary classification. Since cost-sensitive classification can express any supervised learning task we consider (using the mapping k ↦ ⟨l(k, y)⟩_{y ∈ Y} for the task (K, Y, l)), this implies that any such task can be solved using binary classification.

3.1 The Algorithm

WAP is a weighted version of the All-Pairs reduction (Hastie & Tibshirani, 1997) from multi-class to binary classification. The original reduction works by learning a classifier trained to predict, given a pair of classes (i, j) as a part of the feature space, "Is label i more probable than label j?" To make a multiclass prediction, the All-Pairs reduction runs the classifier for every pair of labels and outputs the label i maximizing the number of "yes" answers, with ties broken arbitrarily. To solve cost-sensitive classification, WAP needs to account for the costs. The algorithms specifying the reduction are given below.

Algorithm 1: WAP-Train (a set S of r-class cost-sensitive examples, an importance weighted binary classifier learning algorithm B)
  Set S′ = ∅.
  for all examples (x, k_1, ..., k_r) in S do
    for all pairs (i, j) with 1 ≤ i < j ≤ r do
      Add an importance weighted example (⟨x, i, j⟩, I(k_i < k_j), |v_j − v_i|) to S′.
    end for
  end for
  Return h = B(S′).

The values v_i used by the reduction are defined as follows. For a given cost-sensitive example (x, k_1, ..., k_r), let L(t) be the function L(t) = |{j : k_j ≤ t}| for t ∈ [0, ∞). By shifting, we may assume that the minimum cost is 0, so that t ≥ 0 implies L(t) ≥ 1. Optimizing the loss of the shifted problem is equivalent to optimizing the loss of the original problem. The values v_i are defined as

  v_i = ∫_0^{k_i} 1/L(t) dt.

Note that the values are order preserving: k_i < k_j iff v_i < v_j, for all i and j.

We say that label i beats label j for input x if either i < j and h(⟨x, i, j⟩) = 1, or i > j and h(⟨x, j, i⟩) = 0. Note that if h makes no errors and k_i ≠ k_j, then label i beats label j exactly when k_i < k_j. WAP-Test outputs the label which beats the maximum number of other labels, with ties broken arbitrarily.

Algorithm 2: WAP-Test (classifier h, example x)
  for all pairs (i, j) with 1 ≤ i < j ≤ r do
    Evaluate h(⟨x, i, j⟩)
  end for
  Output argmax_i |{j : i beats j}|.
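The following Python sketch is our own illustration (not the authors' code) of the two procedures above: WAP-Train's generation of importance weighted pairwise examples from a single cost-sensitive example, and WAP-Test's tournament. The oracle learner is abstracted away as a plain function h, and the final assertion numerically checks the identity, used in the proof below, that the total importance assigned to one example equals the sum of its shifted costs.

import itertools

def wap_values(costs):
    """v_i = integral_0^{k_i} 1/L(t) dt, where L(t) = |{j : k_j <= t}|.
    Costs are assumed shifted so that min(costs) == 0."""
    ks = sorted(costs)                        # k_(1) <= ... <= k_(r), with k_(1) == 0
    def v(k):
        total = 0.0
        for idx in range(len(ks)):            # on [ks[idx], ks[idx+1]), L(t) = idx + 1
            lo = ks[idx]
            hi = ks[idx + 1] if idx + 1 < len(ks) else float("inf")
            if min(hi, k) > lo:
                total += (min(hi, k) - lo) / (idx + 1)
            if hi >= k:
                break
        return total
    return [v(k) for k in costs]

def wap_train_examples(x, costs):
    """WAP-Train for one example: emit (<x,i,j>, I(k_i < k_j), |v_j - v_i|) for all i < j."""
    v = wap_values(costs)
    r = len(costs)
    examples = []
    for i, j in itertools.combinations(range(1, r + 1), 2):
        label = int(costs[i - 1] < costs[j - 1])
        importance = abs(v[j - 1] - v[i - 1])
        examples.append(((x, i, j), label, importance))
    return examples

def wap_test(h, x, r):
    """WAP-Test: label i beats j (for i < j) iff h(<x,i,j>) == 1; output the label
    that beats the most others (ties broken arbitrarily)."""
    wins = {label: 0 for label in range(1, r + 1)}
    for i, j in itertools.combinations(range(1, r + 1), 2):
        wins[i if h((x, i, j)) == 1 else j] += 1
    return max(wins, key=wins.get)

# Numeric check of the identity sum_{i<j} |v_j - v_i| = sum_i k_i for shifted costs.
costs = [0.0, 0.3, 1.0, 2.2]
total_importance = sum(w for _, _, w in wap_train_examples("x0", costs))
assert abs(total_importance - sum(costs)) < 1e-9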
Before stating the error transform, we must first define the distribution WAP-Train(D) induced by the reduction on the importance weighted binary classifier (to which the reduction makes a single call). To draw a sample from this distribution, we first draw a cost-sensitive sample (x, k_1, ..., k_r) from the input distribution D and then apply WAP-Train to the singleton set {(x, k_1, ..., k_r)} to get a sequence of r(r − 1)/2 examples for the binary classifier. Now we just sample uniformly from this set. Recall that the loss rate of classifier h on distribution D is denoted by h_D.

Theorem 1 (WAP error efficiency). For all importance weighted binary learning algorithms B and cost-sensitive datasets S, let h = WAP-Train(S, B). Then for all cost-sensitive test distributions D,

  WAP-Test(h)_D ≤ 2 h_{WAP-Train(D)}.

This theorem states that the cost-sensitive loss is bounded by twice the importance weighted loss on the induced importance weighted learning problem. Theorem 1 and the Costing reduction (Theorem 4 in Section 6.1) imply the following efficiency guarantee:

Theorem 2. For all cost-sensitive test distributions D, let D′ be the binary classification problem induced by WAP-Train and Costing, and let h be the learned binary classifier. Then

  WAP-Test(h)_D ≤ 2 h_{D′} E_{(x, k_1, ..., k_r)~D} Σ_{i=1}^r k_i.

For notational simplicity, assume that k_1 ≤ ... ≤ k_r. Note that no generality is lost since the algorithm does not distinguish between the labels. The following insight is the key to the proof.

Lemma 1. Suppose label i is the winner. Then, for every j ∈ {1, ..., i − 1}, there must be at least ⌈j/2⌉ pairs (a, b) with a ≤ j < b such that b beats a.

Proof. Consider the restricted tournament on {1, ..., j}.

Case 1: Suppose that some w beats at least ⌈j/2⌉ of the others. If no label b > j beats any label a ≤ j, then w would beat at least ⌈j/2⌉ + 1 more labels than any b > j; in particular, w would beat at least ⌈j/2⌉ + 1 more labels than i. Thus, in order to have label i beat as many labels as w, at least ⌈j/2⌉ edges of the form (w, b) with b > j, or (a, i) with a ≤ j, must be reversed.

Case 2: There is no label w ∈ {1, ..., j} beating ⌈j/2⌉ of the rest of {1, ..., j}. This can only happen if j is odd and there is a j-way tie with (j − 1)/2 losses per label in {1, ..., j}. In this case, although every label in {1, ..., j} beats (j + 1)/2 more labels than any b > j (in particular, i), it is still necessary to reverse at least (j + 1)/2 ≥ ⌈j/2⌉ edges in order to ensure that i > j beats as many labels as each of {1, ..., j}.

Proof (Theorem 1). Suppose that our algorithm chooses the wrong label i for a particular example (x, k_1, ..., k_r). We show that this requires the adversary to incur a comparable loss. Lemma 1 and the definition of v_i imply that the penalty incurred to make label i win is at least

  ∫_0^{k_i} ⌈L(t)/2⌉ / L(t) dt ≥ ∫_0^{k_i} 1/2 dt = k_i / 2.

On the other hand, the total importance assigned to queries for this instance equals

  Σ_{i<j} (v_j − v_i) = Σ_{i<j} ∫_{k_i}^{k_j} 1/L(t) dt = ∫_0^{k_r} L(t) R(t) / L(t) dt = ∫_0^{k_r} R(t) dt = Σ_{i=1}^r k_i,

where R(t) = r − L(t) is the number of labels whose cost is greater than t. The second equality follows from switching the order of summation and integration and counting, for each t, the number of pairs (i, j) with k_i ≤ t < k_j, which is L(t) R(t). The last equality follows by writing R(t) as a sum of the r indicator functions for the events {k_j > t} and then switching the order of summation and integration.

Consequently, for every example (x, k_1, ..., k_r), the total importance assigned to queries for x equals Σ_i k_i, and the cost incurred by our algorithm on instance x is at most twice the importance of the errors made by the binary classifier on instance x. Averaging over the choice of x shows that the cost-sensitive loss is at most twice the importance weighted loss.

This method of assigning importances is provably near-optimal.

Theorem 3 (Lower bound). For any other assignment of importances w_{i,j} to the examples ⟨x, i, j⟩ in the above algorithm, there exists a distribution and a binary classifier with importance weighted loss ε whose induced cost-sensitive loss is at least ε/2.

Proof. Consider examples (x, ⟨0, 1/(r−1), ..., 1/(r−1)⟩). Suppose that we run our algorithm using some w_{i,j} as the importance for the query ⟨x, i, j⟩. Any classifier which errs on ⟨x, 1, i⟩ and ⟨x, 1, j⟩, where i ≠ j, causes our algorithm to choose label 2 as the winner, thereby giving a cost of 1/(r − 1), out of the total cost of 1. The importance of these two errors is w_{1,i} + w_{1,j}, out of the total importance of Σ_{i,j} w_{i,j}. Choosing i and j so that w_{1,i} + w_{1,j} is minimal, the adversary's penalty is at most 2 Σ_{i=2}^r w_{1,i} / (r − 1), and hence at most 2/(r − 1) times the total importance for x. This shows that the cost of the reduction cannot be reduced below 1/2 merely by improving the choice of weights.

3.2 Experiments

We conducted experiments comparing WAP to the original all-pairs reduction (Hastie & Tibshirani, 1997) and to state-of-the-art multi-class cost-sensitive learning algorithms. Using the C4.5 decision tree learner (Quinlan, 1993) as the oracle classifier learner, we compared our reduction to the cost-sensitive learning algorithm MetaCost (Domingos, 1999). Using Boosted Naive Bayes (Elkan, 1997), we compared our reduction to the minimum expected cost approach after calibration of the probability estimates produced by the learner, as done by Zadrozny and Elkan (Zadrozny & Elkan, 2002). We applied the methods to seven UCI multiclass datasets. Since these datasets do not have costs associated with them, we generated artificial costs in the same manner as the MetaCost paper (except that we fixed the cost of classifying correctly at zero).
We repeated the experiments for 20 different settings of the costs. The average and standard error of the test set costs obtained with each method are shown in Tables 1 and 2.

Table 1. Experimental results with Boosted Naive Bayes: average test-set cost (with standard error) of All-Pairs, the minimum expected cost approach (MinExpCost), WAP v.1, and WAP v.2 on the datasets anneal, ecoli, glass, satellite, splice, solar, and yeast.

Table 2. Experimental results with C4.5: average test-set cost (with standard error) of All-Pairs, MetaCost, WAP v.1, and WAP v.2 on the same datasets.

We present results for two versions of this reduction. One is the exact version used in the proof above. In the other version, we learn one classifier per pair, as is done in the original all-pairs reduction. More specifically, for each i and j (with i < j) we learn a classifier using the following mapping to generate training examples: (x, ⟨k_1, ..., k_r⟩) ↦ (x, I(k_i < k_j), |v_i − v_j|). In the results, we see that the first version appears to work well for some problems and badly for others, while the second version is more stable. Perhaps this is because the training examples given to the learner in the first version are not drawn i.i.d. from a fixed distribution, as learning algorithms often expect.

4 Error-Limiting Reductions

Informally, a reduction R from task T = (K, Y, l) to another task T′ = (K′, Y′, l′) is a procedure that uses oracle access to a learning algorithm for T′ to solve T, and guarantees that good performance of the oracle on subproblems generated by the reduction translates into good performance on the original problem. To formalize this statement, we need to define a mapping from a measure D over prediction problems in (X × K)* to a measure D′ over generated subproblems in (X′ × K′)*. In order to call the oracle, the reduction must take as input some training set S from (X × K)* and create some S′ from (X′ × K′)*. We can think of this process as inducing a mapping from D to D′ according to

  D′(S′) = E_{S~D} Pr_R(S′),

where the probability of generating S′ is over the random bits used by the reduction on input S. This rule defines D′ for a single invocation of the oracle. If the reduction makes several sequential calls, we need to define D′ for each invocation. Suppose that the first invocation produces h′. We replace this invocation with the oracle that always returns h′, and use the above definition to find D′ for the next invocation.

Now, suppose that we have a test example (x, k) drawn from some arbitrary distribution D over X × K. (We can think of D as being a distribution over (X × K)* having all probability mass on one-element sequences of examples.) The reduction maps this to a new measure D′ using the argument above. We can now state the formal definition.

Definition 4. An error-limiting reduction R(S, A) from task (K, Y, l) to task (K′, Y′, l′) takes as input a finite set of examples S ∈ (X × K)* and a learning algorithm A for task (K′, Y′, l′), and outputs a hypothesis h : X → Y that uses a collection of sub-hypotheses returned by A. For every such sub-hypothesis h′ and every measure D over X × K, the reduction induces a measure D′ over the inputs to h′, as described above. For all X, K, and D, the reduction must satisfy

  h_D ≤ g(max_{h′, D′} h′_{D′})

for some continuous monotone nondecreasing function g : [0, ∞) → [0, ∞) with g(0) = 0. We call g the error-limitation function of the reduction.

For this definition, we use the convention that max ∅ = 0, where max ∅ is the maximum over the empty set. Note again that none of the above definitions require that examples be drawn i.i.d. Analysis in this model is therefore directly applicable to all real-world problems which satisfy the type constraint. Notice that we do not stipulate that the training set and the test example (x, k) come from the same distribution. In fact, the definition depends on the training distribution only through the set of training examples S used to construct examples for the oracle.

Perspective. It is useful to consider an analogy with Turing reductions in complexity theory. In both cases, a reduction from task A to task B is a procedure that solves A using a solver for B as a black box. There are several differences, the most important being that complexity-theoretic reductions are often viewed as negative statements providing evidence that a problem is intractable. Here we view reductions as positive statements allowing us to solve new problems.

Metrics of success. We consider several ways to measure reductions. The metric essential to the definition of an error-limiting reduction is the error efficiency of the reduction for a given example, defined as the maximum ratio of the loss of the hypothesis output by the reduction on this example to the total loss of the oracle hypotheses on the examples generated by the reduction. We also want the transformation to be computationally efficient, assuming that all oracle calls take unit time. Calls can be either adaptive (i.e., depend on the answers to previous queries) or parallel (i.e., all queries can be asked before any of them are answered). It is easy to see that there is a general transformation turning parallel calls into a single call to the oracle learning algorithm. To make one call, we can augment the feature space with the name of the call and then learn a classifier on the union of all training data. We used this trick in the WAP reduction in Section 3.

5 The Structure of Error-Limiting Reductions

Non-triviality of the definition. The proposition below states that every reduction must make a nontrivial use of its oracle.

Proposition 5.1. For every error-limiting reduction R, every training set S, and every test example (x, k), there exists an oracle A such that

  E_R[l(k, h(x))] = min_{y ∈ Y} l(k, y) = 0,

where h is the hypothesis output by R(S, A), and the expectation is over the randomization used by R in forming h.

Proof. Recall that the error limitation property holds for all D, in particular for the deterministic D with all probability mass on (x, k). Thus, if R simply ignored its oracle, we would have l(k, h(x)) ≤ g(max ∅) = g(0) = 0. The fact that l(k, y) is nonconstant (in both arguments) implies that there exists (x, k) such that any such h must satisfy l(k, h(x)) > 0, a contradiction. Consequently, at least one call to the oracle must be made. Next, note that for all R, S, and (x, k), there exists an h′ (and thus an oracle A outputting h′) consistent with all examples induced by R on (x, k). If R is randomized, there exists a deterministic reduction R′ taking one additional argument (a randomly chosen string) with the same output behavior as R. Consequently, there exists an oracle A, dependent on this random string, which achieves error rate 0. Since g(0) = 0, this implies that h(x) = arg min_{y ∈ Y} l(k, y), with loss 0.

Note, however, that while the notion of error-limiting reductions is non-trivial, it does not prevent a reduction from creating hard problems for the oracle.

Composition. Since our motivation for studying reductions is to reduce the number of distinct problems which must be considered, it is natural to want to compose two or more reductions to obtain new ones. It is easy to see that reductions as defined above compose.

Proposition 5.2. If R_12 is a reduction from task T_1 to T_2 with error limitation g_12 and R_23 is a reduction from task T_2 to T_3 with error limitation g_23, then the composition R_12 ∘ R_23 is a reduction from T_1 to T_3 with error limitation g_12 ∘ g_23.

Proof. Given oracle access to any learning algorithm A for T_3, reduction R_23 yields a learning algorithm for T_2 such that for any X_2 and D_2, it produces a hypothesis with loss rate h_{D_2} ≤ g_23(h′_{D_3}), where h′_{D_3} is the maximum loss rate of a hypothesis h′ output by A (over all invocations). We also know that R_12 yields a learning algorithm for T_1 such that for all X_1 and D_1 we have h_{D_1} ≤ g_12(h_{D_2}) ≤ g_12(g_23(h′_{D_3})), where h_{D_2} is the maximum loss rate of a hypothesis output by R_23(A). Finally, notice that g_12 ∘ g_23 is continuous, monotone nondecreasing, and g_12(g_23(0)) = 0.
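As a small illustration of Proposition 5.2 (ours, not from the paper), error-limitation guarantees can be composed as plain functions. The sketch below uses Theorem 1's guarantee for WAP together with a simplified, constant-factor reading of Costing's guarantee, in which the expected total cost is supplied as a known constant (an assumption of this illustration only), and composes them to recover the form of the bound in Theorem 2.

def g_wap(importance_weighted_loss):
    """Theorem 1: cost-sensitive loss <= 2 * importance weighted loss."""
    return 2.0 * importance_weighted_loss

def g_costing(expected_total_cost):
    """Theorem 4, read as a function of the binary error rate, treating the
    expectation E[sum_i k_i] as a fixed constant of the distribution."""
    return lambda binary_error_rate: binary_error_rate * expected_total_cost

def compose(g12, g23):
    """Proposition 5.2: the composed reduction is error limited by g12 o g23."""
    return lambda eps: g12(g23(eps))

# Example: a binary error rate of 0.1 and expected total cost 3.5 give the bound
# 2 * 0.1 * 3.5 = 0.7 on the cost-sensitive loss, matching the form of Theorem 2.
g_combined = compose(g_wap, g_costing(3.5))
print(g_combined(0.1))   # 0.7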
6 Examples of Reductions

Here, we show that the notion of an error-limiting reduction is both tractable for analysis and describes several standard machine learning algorithms.

6.1 Reduction from Importance Weighted Classification to Classification

In importance weighted classification, every example comes with an extra importance value which measures how (relatively) important a decision might be. This extra flexibility turns out to be useful in several applications, and as an intermediate step in several reductions. The Costing algorithm (Zadrozny et al., 2003) is a (very simple) reduction from importance weighted classification to classification based on rejection sampling. Given an importance weighted dataset S = {(x, k_1, k_2)}, an unweighted dataset is produced using rejection sampling:

  {(x, k_1) : (x, k_1, k_2) ∈ S and t ≤ k_2},

where t ~ U(0, w) is an independent uniform random draw from the interval [0, w] for each example and w is greater than k_2 for all examples. The classifier learner is applied to produce a classifier c for this new dataset. (The actual Costing algorithm repeats this process and predicts according to the majority vote of the learned classifiers c.)
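Here is a minimal Python sketch of the rejection sampling step just described (our illustration, not the authors' implementation). The function name and interface are our own, and the base learner and the repetition/voting of the full Costing algorithm are left out.

import random

def costing_rejection_sample(weighted_data, w=None, seed=None):
    """One rejection-sampling pass: keep (x, k1) with probability k2 / w, where
    w upper-bounds every importance k2. weighted_data holds (x, k1, k2) triples."""
    rng = random.Random(seed)
    if w is None:
        w = max(k2 for _, _, k2 in weighted_data)
    return [(x, k1) for (x, k1, k2) in weighted_data if rng.uniform(0, w) <= k2]

# The full Costing algorithm repeats this pass several times, trains one binary
# classifier per resampled dataset, and predicts with the majority vote.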

Although this algorithm is remarkably simple, composing Costing with decision trees has been shown to yield superior performance on benchmark prediction problems (Zadrozny et al., 2003).

Theorem 4 (Costing error efficiency). For all importance weighted problems (D, X, T), if the classifiers have error rate ε, Costing has loss rate at most ε E_{(x,k_1,k_2)~D} k_2.

Proof. Trivial. See (Zadrozny et al., 2003).

6.2 Reductions from Multiclass Classification to Binary Classification

In this section we analyze several well known reductions for efficiency. Before we start, it is important to note that there are boosting algorithms for solving multiclass classification given weak binary importance weighted classification (Schapire, 1997; Schapire & Singer, 1999). Some of these methods require extra-powerful classification algorithms, but others are genuine reductions, as defined here. A reductionist approach for creating a multiclass boosting algorithm is to simply compose any one of the following reductions with binary boosting.

The Tree Reduction. In the tree reduction (which is well known; see chapter 15 of (Fox, 1997)), we construct r − 1 binary classifiers distinguishing between the labels using a binary tree. The tree has depth ⌈log_2 r⌉, and each internal node is associated with the set of labels that can reach the node, assuming that each classifier above it predicts correctly. The set of labels at that node is split in half, and a classifier is trained to distinguish between the two sets of labels. The root node starts with the set of all labels {1, ..., r}, which are all indistinguishable at that point. Predictions are made by following a chain of classifications from the root down to the leaves, each of which is associated with a unique label. Note that instead of r − 1 parallel queries, one could use ⌈log_2 r⌉ adaptive queries.

Theorem 5 (Tree error efficiency). For all multiclass problems (D, X, T), if the binary classifiers have loss rate ε, the tree reduction has loss rate at most ε ⌈log_2 r⌉.

Proof. A multiclass error occurs only if some binary classifier on the path from the root to the leaf errs.

The Error Correcting Output Codes (ECOC) Reduction. Let C ⊆ {0, 1}^n be an error-correcting code with r codewords a minimum distance of d apart. Let C_{i,j} denote the i-th bit of the j-th codeword of C. The ECOC reduction (Dietterich & Bakiri, 1995) corresponding to C learns n classifiers: the i-th classifier predicts C_{i,y}, where y is the correct label given input x. The loss rate for this reduction is at most 2nε/d, as we shall prove in Theorem 6 below.

We mention three codes of particular interest. The first is when the codewords form a subset of the rows of a Hadamard matrix (an n × n binary matrix with any two rows differing in exactly n/2 places). Such matrices exist and are easy to construct when n is a power of 2. Thus, for Hadamard codes, the number of classifiers needed is less than 2r (since a power of 2 exists between r and 2r), and the loss rate is at most 4ε. When the codewords correspond to binary representations of 0, 1, ..., r − 1, the ECOC reduction is similar to (although not the same as, due to decoding differences) the tree reduction above. Finally, if the codewords form the r × r identity matrix, the ECOC reduction corresponds to the one-against-all reduction.
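Below is a short Python sketch of the ECOC reduction just described (ours, not from the paper): a Sylvester-style Hadamard code in 0/1 form, the per-classifier training targets, and Hamming decoding at prediction time. The helper names are our own, and the base binary learner itself is abstracted away.

def hadamard_code(r):
    """First r rows of a Sylvester-type Hadamard matrix over {0, 1}; n is the
    smallest power of two >= r, and any two rows differ in exactly n/2 bits."""
    n = 1
    while n < r:
        n *= 2
    rows = [[0]]
    while len(rows) < n:
        rows = [row + row for row in rows] + \
               [row + [1 - b for b in row] for row in rows]
    return [rows[i] for i in range(r)], n

def ecoc_targets(code, y):
    """Training targets for the n binary classifiers on an example with label y:
    classifier i is trained to predict bit C[y][i]."""
    return list(code[y - 1])                  # labels are 1-based in the text

def ecoc_predict(code, predicted_bits):
    """Hamming decoding: output the label whose codeword is closest to the n
    binary predictions (ties broken arbitrarily)."""
    def dist(codeword):
        return sum(b != p for b, p in zip(codeword, predicted_bits))
    return 1 + min(range(len(code)), key=lambda i: dist(code[i]))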
Theorem 6 (ECOC error efficiency). For all multiclass problems (D, X, T), if the binary classifiers have loss rate ε, the ECOC reduction has loss rate less than 2nε/d.

This theorem is identical to Lemma 2 of (Guruswami & Sahai, 1999) (which was generalized in (Allwein et al., 2001)), except that we generalize the statement to any distribution D generating (test) examples, rather than the loss rate on a training set.

Proof. When the loss rate is zero, the correct label is consistent with all n classifiers, and wrong labels are consistent with at most n − d classifiers, since C has minimum distance d. In order to wrongly predict the multiclass label, at least d/2 binary classifiers must err simultaneously. Since every multiclass error implies that at least a fraction d/2n of the binary classifiers erred, a binary loss rate of ε can yield at most a multiclass loss rate of 2nε/d.

Sometimes ECOC is analyzed assuming the errors are independent. This gives much better results, but seems less justifiable in practice. For example, in any setting with label noise, errors are highly dependent. We note that all reductions in this section preserve independence. Many common learning algorithms are built under the assumption that examples are drawn independently from some distribution, and it is a desirable property for reductions to preserve this independence when constructing datasets for the oracle.

7 Prior Work

There has been much work on reductions in learning. One of our goals is to give a unifying framework, allowing a better understanding of the strengths and weaknesses of different forms of reductions.

Boosting algorithms (Freund & Schapire, 1997; Kalai & Servedio, 2003) can be thought of as reductions from binary (or sometimes multiclass) classification with a small error rate to importance weighted classification with a loss rate of nearly 1/2.

Given this, it is important to ask: can't a learning machine apply boosting to solve any classification problem perfectly? The answer is no, since it is impossible to always predict correctly whenever the correct prediction given the features is not deterministic. As a result, the loss rate is no longer bounded away from 1/2 after some number of rounds of boosting.

Bagging (Breiman, 1996) can be seen as a self-reduction of classification to classification, which turns learning algorithms with high variance (i.e., dependence on the exact examples seen) into a classifier with lower variance. Bagging performance sometimes suffers because the examples fed into the classifier learning algorithm are not necessarily independently distributed according to the original distribution.

Pitt and Warmuth (Pitt & Warmuth, 1990) also introduced a form of a reduction between classification problems called prediction-preserving reducibility. There are several important differences between this notion and the notion presented here: (1) prediction-preserving reductions are typically used to show negative, non-predictability results, while we show positive results transferring learning algorithms between tasks; (2) the reductions presented here are representation independent; (3) prediction-preserving reducibility is a computational notion, while the reductions here are informational.

It should be noted that not all reductions in the learning literature are black box. For example, Allwein et al. (2001) give a reduction from multiclass to binary classification that uses a margin, which is a heuristic form of prediction confidence that can be extracted from many (but not all) classifiers.

8 Conclusion

We introduced the analysis model of error-limiting reductions in learning. This model has several nice properties (often not shared with other methods of learning analysis). First, the analysis in this model does not depend upon any unprovable (or unobservable) assumptions such as independence. Error-limiting reductions often have tractable analysis and capture several common machine learning algorithms. They also often work well in practice. To test the model, we constructed, proved, and empirically tested an error-limiting reduction from multiclass cost-sensitive classification to binary classification. Since cost-sensitive classification can express any supervised learning task considered here, this reduction can be used to reduce any such task to binary classification.

References

Allwein, E., Schapire, R., & Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24.

Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2.

Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD).

Elkan, C. (1997). Boosting and naive Bayesian learning (Technical Report CS97-557). UC San Diego.

Fox, J. (1997). Applied regression analysis, linear models, and related methods. Sage Publications.

Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55.

Guruswami, V., & Sahai, A. (1999). Multiclass learning, boosting, and error-correcting codes. Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT).
Hastie, T., & Tibshirani, R. (1997). Classification by pairwise coupling. Advances in Neural Information Processing Systems 10 (NIPS).

Kalai, A., & Servedio, R. (2003). Boosting in the presence of noise. Proceedings of the 35th Annual ACM Symposium on Theory of Computing (STOC).

Pitt, L., & Warmuth, M. (1990). Prediction-preserving reducibility. Journal of Computer and System Sciences, 41.

Quinlan, J. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Schapire, R. (1997). Using output codes to boost multiclass learning problems. Proceedings of the 14th International Conference on Machine Learning (ICML).

Schapire, R., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37.

Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27.

Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16.

Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD).

Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM).


CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

A Pseudo-Boolean Set Covering Machine

A Pseudo-Boolean Set Covering Machine A Pseudo-Boolean Set Covering Machine Pascal Germain, Sébastien Giguère, Jean-Francis Roy, Brice Zirakiza, François Laviolette, and Claude-Guy Quimper Département d informatique et de génie logiciel, Université

More information

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification ABC-Boost: Adaptive Base Class Boost for Multi-class Classification Ping Li Department of Statistical Science, Cornell University, Ithaca, NY 14853 USA pingli@cornell.edu Abstract We propose -boost (adaptive

More information

Computational Learning Theory. CS534 - Machine Learning

Computational Learning Theory. CS534 - Machine Learning Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows

More information

Multiclass Boosting with Repartitioning

Multiclass Boosting with Repartitioning Multiclass Boosting with Repartitioning Ling Li Learning Systems Group, Caltech ICML 2006 Binary and Multiclass Problems Binary classification problems Y = { 1, 1} Multiclass classification problems Y

More information

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7 Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning

More information

Random Classification Noise Defeats All Convex Potential Boosters

Random Classification Noise Defeats All Convex Potential Boosters Philip M. Long Google Rocco A. Servedio Computer Science Department, Columbia University Abstract A broad class of boosting algorithms can be interpreted as performing coordinate-wise gradient descent

More information

Bias-Variance in Machine Learning

Bias-Variance in Machine Learning Bias-Variance in Machine Learning Bias-Variance: Outline Underfitting/overfitting: Why are complex hypotheses bad? Simple example of bias/variance Error as bias+variance for regression brief comments on

More information

Computational Learning Theory

Computational Learning Theory 09s1: COMP9417 Machine Learning and Data Mining Computational Learning Theory May 20, 2009 Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997

More information

PAC Generalization Bounds for Co-training

PAC Generalization Bounds for Co-training PAC Generalization Bounds for Co-training Sanjoy Dasgupta AT&T Labs Research dasgupta@research.att.com Michael L. Littman AT&T Labs Research mlittman@research.att.com David McAllester AT&T Labs Research

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Learning Sparse Perceptrons

Learning Sparse Perceptrons Learning Sparse Perceptrons Jeffrey C. Jackson Mathematics & Computer Science Dept. Duquesne University 600 Forbes Ave Pittsburgh, PA 15282 jackson@mathcs.duq.edu Mark W. Craven Computer Sciences Dept.

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

CS446: Machine Learning Fall Final Exam. December 6 th, 2016

CS446: Machine Learning Fall Final Exam. December 6 th, 2016 CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

1 A Lower Bound on Sample Complexity

1 A Lower Bound on Sample Complexity COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #7 Scribe: Chee Wei Tan February 25, 2008 1 A Lower Bound on Sample Complexity In the last lecture, we stopped at the lower bound on

More information

Voting (Ensemble Methods)

Voting (Ensemble Methods) 1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

More information

A Unified Bias-Variance Decomposition

A Unified Bias-Variance Decomposition A Unified Bias-Variance Decomposition Pedro Domingos Department of Computer Science and Engineering University of Washington Box 352350 Seattle, WA 98185-2350, U.S.A. pedrod@cs.washington.edu Tel.: 206-543-4229

More information

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c15 2013/9/9 page 331 le-tex 331 15 Ensemble Learning The expression ensemble learning refers to a broad class

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69 R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual

More information