Reliable Agnostic Learning

Adam Tauman Kalai (Microsoft Research, Cambridge, MA), Varun Kanade (Georgia Tech, Atlanta, GA), Yishay Mansour (Google Research and Tel Aviv University)

Abstract

It is well known that in many applications erroneous predictions of one type or another must be avoided. In some applications, like spam detection, false positive errors are serious problems. In other applications, like medical diagnosis, abstaining from making a prediction may be more desirable than making an incorrect prediction. In this paper we consider different types of reliable classifiers suited for such situations. We formalize and study properties of reliable classifiers in the spirit of agnostic learning (Haussler, 1992; Kearns, Schapire, and Sellie, 1994), a PAC-like model where no assumption is made on the function being learned. We then give two algorithms for reliable agnostic learning under natural distributions. The first reliably learns DNF formulas with no false positives, using membership queries. The second reliably learns halfspaces from random examples with no false positives or false negatives, but the classifier sometimes abstains from making predictions.

1 Introduction

In many machine learning applications, a crucial requirement is that mistakes of one type or another should be avoided at all cost. As a motivating example, consider spam detection, where classifying a correct email as spam (a false positive) can have dire, or even fatal, consequences. On the other hand, if a spam message is not tagged correctly (a false negative), it is comparatively a smaller problem. In other situations, abstaining from making a prediction might be better than making wrong predictions; in a medical test it is preferable to have inconclusive predictions rather than wrong ones, so that the remaining predictions can be relied upon. A reliable classifier would have a pre-specified error bound for false positives or false negatives or both. We present formal models for different types of reliable classifiers and give efficient algorithms for some of the learning problems that arise.

(Footnotes: Part of this research was done while the author was at Georgia Institute of Technology, supported in part by NSF SES-7347, an NSF CAREER award, and a Sloan Fellowship. The author was supported in part by an NSF CCF grant. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, by a grant from the Israel Science Foundation and by a grant from the United States-Israel Binational Science Foundation (BSF). This publication reflects the authors' views only.)

Figure 1: Three different learning settings. Depending on the application and data, one may be more appropriate than the others. (a) The best (most accurate) classifier (from the class of halfspaces) for a typical agnostic learning problem. (b) The best positive-reliable halfspace classifier for a problem like spam prediction, in which false positives are to be avoided. (c) The best fully reliable halfspace sandwich classifier for a problem in which all errors are to be avoided. It predicts -, +, or ?.

In agnostic learning [Hau92, KSS94], one would like provably efficient learning algorithms that make no assumption about the function to be learned. The learner's goal is to nearly match (within ε) the accuracy of the best classifier from a specified class of functions (see Figure 1a). Agnostic learning can be viewed as PAC learning [Val84] with arbitrary noise. In reliable agnostic learning, the learner's goal is to output a nearly reliable classifier whose accuracy nearly matches the accuracy of the best reliable classifier from a specified class of functions (see Figure 1b-c). Since agnostic learning is extremely computationally demanding, it is interesting that we can find efficient algorithms for reliably agnostically learning interesting classes of functions over natural distributions.

Our algorithms build on recent results in agnostic learning, one for learning decision trees [GKK08] and one for learning halfspaces [KKMS05]. Throughout the paper, our focus is on computationally efficient algorithms for learning problems. The contributions of this paper are the following:

- We introduce a model of reliable agnostic learning. Following prior work on agnostic learning, we consider both distribution-specific and distribution-free learning, as well as learning both from membership queries and from random examples. We show that reliable agnostic learning is no harder than agnostic learning and no easier than PAC learning.
- We give an algorithm for reliably learning DNF (with almost no false positives) over the uniform distribution, using membership queries. More generally, we show that if a concept class C is agnostically learnable, then the class of disjunctions of concepts from C is reliably learnable (with almost no false positives).
- We give an algorithm for reliably learning halfspace sandwich classifiers over the unit ball under the uniform distribution. This algorithm is fully reliable in the sense that it almost never makes mistakes of any type. We also extend this algorithm to have tolerant reliability, in which case a permissible rate of false positives and false negatives is specified, and the goal of the algorithm is to achieve maximal accuracy subject to these constraints.

Positive Reliability: A positive-reliable classifier is one that never produces false positives. (One can similarly define a negative-reliable classifier as one that never produces false negatives.) In the spam example, a positive-reliable DNF could be (Nigeria ∧ bank transaction ∧ million) ∨ (Viagra ∧ ¬COLT) ∨ .... An email should be almost certainly spam if it fits into any of a number of categories, where each category is specified by a set of words that it must contain and must not contain. Our goal is to output a classifier that is (almost) positive-reliable and has a false negative rate (almost) as low as that of the best positive-reliable classifier from a fixed set of classifiers, such as size-s DNFs. Consider a positive-reliable classifier that has the least rate of false negatives. We require our algorithm to output a classifier whose rate of false negatives is (within ε) as low as this best one, and require that the false positive rate of our classifier be less than ε.

Our first algorithm efficiently learns the class of DNF formulas in the positive reliable agnostic model over the uniform distribution, using membership queries. Our model is conceptually akin to a one-sided-error agnostic noise model; hence this may be a step towards agnostically learning DNF. Note that our algorithm also gives an alternative way of learning DNFs (without noise) that is somewhat different from Jackson's celebrated harmonic sieve [Jac97]. More generally, we show that if a class of functions is efficiently agnostically learnable, then polynomial-sized disjunctions of concepts from that class are efficiently learnable in the positive reliable agnostic model. Similarly, it can be shown that agnostically learning a class of functions implies learning polynomial-size conjunctions of concepts from that class in the negative reliable agnostic model. A consequence of this is that a polynomial-time algorithm for agnostically learning DNFs over the uniform distribution would give an algorithm for the challenging problem of uniform-distribution PAC learning of depth-3 circuits (since PAC learning is easier than reliable learning).

Full Reliability: We consider the notion of full reliability, which means simultaneous positive and negative reliability. In order to achieve this, we need to consider partial classifiers, ones that may predict positive, negative, or ?, where ? means no prediction. A partial classifier is fully reliable if it never produces false positives or false negatives. Given a concept class of partial classifiers, the goal is to find a (nearly) fully reliable classifier that is almost as accurate as the best fully reliable classifier from the concept class. We show that reliable agnostic learning is easier (or at least no harder) than agnostic learning; if a concept class is efficiently learnable in the agnostic setting, it is also efficiently learnable in the positive-reliable, negative-reliable, and fully reliable models.

Tolerant Reliability: We also consider the notion of tolerant reliability, where we are willing to tolerate given rates τ+, τ− of false positives and false negatives. In this most general version, we consider the class of halfspace sandwich partial classifiers (halfspace sandwiches for short), in which the examples that are in one halfspace are positive, the examples that are in another are negative, and the rest of the examples are classified as ?. We extend the agnostic halfspace algorithm and analysis of Kalai, Klivans, Mansour, and Servedio [KKMS05] to the case of halfspace sandwiches. In particular, we show that given arbitrary rates τ+, τ−, we can learn the class of halfspace sandwiches to within any constant ε over the uniform distribution in polynomial time using random examples. Our algorithm outputs a hypothesis h such that h has false positive and false negative rates close (within ε) to τ+ and τ− respectively, and has accuracy within ε of the best halfspace sandwich with false positive and false negative rates bounded by τ+ and τ−.

Related Work: A classical approach to the problem of reliable classification is to define a loss function that has different penalties for different types of errors. For example, by having an infinite loss on false positive errors and a loss of one on false negative errors, we essentially define a positive-reliable classifier as one that minimizes the loss. The main issue that arises is computational, since there is no efficient way to compute a halfspace that would minimize a general loss function. The contribution of this paper is to show that one can define interesting sub-classes for which the computational tasks are feasible. Conceptually, our work is similar to the study of cautious classifiers, i.e. classifiers which may output "unknown" as a label [FHO04]. The models we introduce, in particular the fully reliable learning model, are very similar to the bounded-improvement model of [Pie07]. The methods we use to prove the reduction between agnostic learning and reliable learning are related to work on delegating classifiers [FFHO04] and cost-sensitive learning [Elk01, ZLA03]. The cost-sensitive classification model is more general than reliable learning. In particular, to simulate positive reliability, one would impose a very large cost on false positives. However, we do not see how to extend our DNF algorithm to this setting. Even when the mistake costs are equal, the problem of agnostically learning DNF remains an open problem [GKK08]. The requirement that there be almost no false positives seems to make the problem significantly simpler. Extending our halfspace algorithm to the cost-sensitive classification setting seems more straightforward.

The motivation of our work is closely related to the Neyman-Pearson criterion, an application of which is the following: given a joint distribution over points and labels, minimize the rate of false negatives subject to the condition that the rate of false positives is bounded by a given input parameter. The Neyman-Pearson criterion classifies points based on the ratio between the likelihoods of the point being labeled positive and negative. Neyman and Pearson [NP33] proved that the optimal classification strategy is to choose a threshold on the ratio of the likelihoods. This method was applied to statistical learning in [CHHS02, SN05] to solve constrained versions of empirical risk minimization or structural risk minimization. Unfortunately, the optimization problems that arise for most classes of interest are computationally intractable. In contrast, our work focuses on deriving computationally efficient algorithms for some interesting concept classes.

The main focus of our work is to develop formal models for reliable learning and to give algorithms with theoretical guarantees on their performance. Our models are similar in spirit to Valiant's PAC learning model [Val84] and the agnostic learning models of Kearns, Schapire and Sellie [KSS94]. In our algorithms we do not change the underlying distribution on the instance space, but we sometimes flip labels of instances. This allows us to convert distribution-specific agnostic learning algorithms into algorithms for reliable learning.

Organization: For readability, the paper is divided into three parts. In Section 2, we focus only on positive reliability. Section 3 contains a reduction from agnostic learning to fully reliable learning. Finally, in Section 4, we consider reliable learning with pre-specified permissible error rates.

2 Positive reliability

The main result we prove in this section is the following: if a concept class C is efficiently agnostically learnable, then the composite class of size-s disjunctions of concepts from C is efficiently positive-reliably learnable. An application of this result is an algorithm for positive reliably learning size-s DNF formulas, based on the recent result by Gopalan, Kalai and Klivans [GKK08] for agnostically learning decision trees. (Recall that a monomial has a small decision tree, and therefore the class of DNF is included in the class of disjunctions of decision trees.)

2.1 Preliminaries

Throughout the paper, we consider ⟨X_n⟩_{n≥1} as the instance space (for example, the boolean cube X_n = {0,1}^n or the unit ball X_n = B_n). For each n, the target is an arbitrary function f_n : X_n → [0,1], which we interpret as f_n(x) = Pr[y = 1 | x]. A distribution D_n over X_n, together with f_n, induces a joint distribution (D_n, f_n) over examples X_n × {0,1} as follows: to draw a random example (x, y), pick x ∈ X_n according to D_n, set y = 1 with probability f_n(x), and otherwise set y = 0. A false positive is a prediction of 1 when the label is y = 0; similarly, a false negative is a prediction of 0 when the label is y = 1. The rates of false positives and false negatives of a classifier c : X_n → {0,1} are defined below. When D_n and f_n are clear from context, we omit them.

false+(c) = false+(c, D_n, f_n) := Pr_{(x,y)∼(D_n,f_n)}[c(x) = 1 ∧ y = 0] = E_{x∼D_n}[c(x)(1 - f_n(x))]
false−(c) = false−(c, D_n, f_n) := Pr_{(x,y)∼(D_n,f_n)}[c(x) = 0 ∧ y = 1] = E_{x∼D_n}[(1 - c(x)) f_n(x)]

A classifier c is said to be positive-reliable if false+(c) = 0; in other words, it never makes false positive predictions. Although we focus on positive reliability, an entirely similar definition can be made in terms of negative reliability. The error of a classifier c is

err(c) = err(c, D_n, f_n) := Pr_{(x,y)∼(D_n,f_n)}[c(x) ≠ y] = false+(c) + false−(c).

To keep notation simple, we drop the subscript n except in definitions.

Oracles: As is standard in computational learning theory, we consider two types of oracles: membership query (MQ) and example (EX). Given a target function f : X → [0,1], the two types of oracles behave as follows:
- Membership Query (MQ) oracle: for a query x ∈ X, the oracle returns y = 1 with probability f(x), and y = 0 otherwise, independently each time it is invoked.
- Example (EX) oracle: when invoked, the oracle draws x ∈ X according to the distribution D, sets y = 1 with probability f(x) and y = 0 otherwise, and returns (x, y).
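To make these quantities concrete, the following minimal Python sketch simulates the two oracles and estimates false+ and false− from a sample. It is illustrative only; the function names, the example target f, and the conjunction c are assumptions for the demonstration and do not come from the paper.

import random

def ex_oracle(f, sample_x):
    """EX oracle sketch: draw x ~ D (via sample_x), label it 1 w.p. f(x), else 0."""
    x = sample_x()
    y = 1 if random.random() < f(x) else 0
    return x, y

def mq_oracle(f, x):
    """MQ oracle sketch: label a queried x with 1 w.p. f(x), independently per call."""
    return 1 if random.random() < f(x) else 0

def empirical_rates(c, examples):
    """Estimate false+(c) and false-(c) from labeled examples (x, y)."""
    m = len(examples)
    fp = sum(1 for x, y in examples if c(x) == 1 and y == 0) / m
    fn = sum(1 for x, y in examples if c(x) == 0 and y == 1) / m
    return fp, fn

# Example: target f on the boolean cube {0,1}^n, uniform D, and a small conjunction c.
n = 10
sample_x = lambda: tuple(random.randint(0, 1) for _ in range(n))
f = lambda x: 1.0 if (x[0] == 1 and x[1] == 1) else 0.05   # noisy target
c = lambda x: 1 if (x[0] == 1 and x[1] == 1 and x[2] == 1) else 0
sample = [ex_oracle(f, sample_x) for _ in range(20000)]
print(empirical_rates(c, sample))   # c is positive-reliable for this f: it predicts 1 only where f = 1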

Almost all our results hold for both types of oracles, and none of our algorithms change the underlying distribution D over X. We use O(f) to denote an oracle, which may be of either of these types. Note that neither type of oracle we consider is persistent: either may return both (x, 0) and (x, 1) for the same x. In this sense, our model is similar to the p-concept distribution model for agnostic learning introduced in Kearns, Schapire and Sellie [KSS94]. Whenever an algorithm A has access to oracle O(f), we denote it by A^{O(f)}.

Agnostic Learning. Algorithm A efficiently agnostically learns a sequence of concept classes ⟨C_n⟩_{n≥1}, under distributions ⟨D_n⟩_{n≥1}, if there exists a polynomial p(n, 1/ε, 1/δ) such that, for every n, every ε, δ > 0 and every f_n : X_n → [0,1], with probability at least 1 - δ, A^{O(f_n)}(ε, δ) outputs a hypothesis h that satisfies

err(h, D_n, f_n) ≤ min_{c∈C_n} err(c, D_n, f_n) + ε.

The time complexity of both A and h is bounded by p(n, 1/ε, 1/δ).

Below we define positive-reliable learning, and postpone the definitions of full reliability and tolerant reliability to Sections 3 and 4 respectively. Define the subset of positive-reliable classifiers (relative to a fixed distribution and target function) to be

C+ = {c ∈ C | false+(c) = 0}.

Note that if we assume that C contains the classifier which predicts 0 on all examples, then C+ is non-empty.

Positive Reliable Learning. Algorithm A efficiently positive reliably learns a sequence of concept classes ⟨C_n⟩_{n≥1}, under distributions ⟨D_n⟩_{n≥1}, if there exists a polynomial p(n, 1/ε, 1/δ) such that, for every n, every ε, δ > 0 and every f_n : X_n → [0,1], with probability at least 1 - δ, A^{O(f_n)}(ε, δ) outputs a hypothesis h that satisfies

false+(h, D_n, f_n) ≤ ε   and   false−(h, D_n, f_n) ≤ min_{c∈C_n+} false−(c, D_n, f_n) + ε.

The time complexity of both A and h is bounded by p(n, 1/ε, 1/δ). Each oracle call is assumed to take unit time. Hence, an upper bound on the running time is also an upper bound on the sample complexity, i.e., the number of oracle calls.

2.2 Positive reliably learning DNF

Our main theorem for this section is the following:

Theorem 1. Let A be an algorithm that (using oracle O(f)) efficiently agnostically learns concept class C under distribution D. Then there exists an algorithm A' that (using oracle O(f) and black-box access to A) efficiently positive reliably learns the class of size-s disjunctions of concepts from C.

Using the result on learning decision trees [GKK08], we immediately get Corollary 1. The remainder of this section is devoted to proving Theorem 1 and Corollary 1.

Corollary 1. There is an MQ algorithm B such that for any n, s, B positive reliably learns the class of size-s DNF formulas on n variables in time poly(n, s, 1/ε, 1/δ) with respect to the uniform distribution.

We show that if we have access to oracle O(f), we can simulate oracle O(f'), where f'(x) = q(x)f(x) + (1 - q(x))r(x), q, r : X → [0,1] are arbitrary functions and we only assume black-box access to q and r.

Lemma 1. Given access to oracle O(f) and black-box access to the functions q and r, we can simulate oracle O(f') for f' = qf + (1 - q)r.

Proof. We show this assuming that O(f) is an example oracle; the case when O(f) is a membership query oracle is simpler. Let f' = qf + (1 - q)r. To simulate O(f'), we do the following. First draw (x, y) from O(f); with probability q(x), return (x, y). With probability 1 - q(x), do the following: set y' = 1 with probability r(x), y' = 0 otherwise, and return (x, y'). It is easy to see that Pr[y = 1 | x] = f'(x); thus this simulates the oracle O(f').
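The simulation in Lemma 1 is straightforward to implement. The sketch below is illustrative (the function names are assumptions, not from the paper) and represents an EX oracle as a zero-argument function returning a labeled example.

import random

def simulate_mixture_oracle(draw_from_O_f, q, r):
    """Sketch of Lemma 1: given an EX oracle for f and black-box q, r : X -> [0,1],
    return an EX oracle for f' = q*f + (1-q)*r."""
    def draw():
        x, y = draw_from_O_f()                          # (x, y) with Pr[y=1|x] = f(x)
        if random.random() < q(x):
            return x, y                                 # keep the original label w.p. q(x)
        y_prime = 1 if random.random() < r(x) else 0    # relabel via r w.p. 1 - q(x)
        return x, y_prime
    return draw

# Sanity check at a single point: Pr[y=1] should equal q*f + (1-q)*r = 0.5*0.8 + 0.5*0.3 = 0.55.
O_f = lambda: ("x0", 1 if random.random() < 0.8 else 0)
O_fprime = simulate_mixture_oracle(O_f, q=lambda x: 0.5, r=lambda x: 0.3)
draws = [O_fprime()[1] for _ in range(50000)]
print(sum(draws) / len(draws))   # approximately 0.55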
When C is a concept class that is agnostically learnable, we give an algorithm that uses the agnostic learner as a black box and outputs a hypothesis with a low rate of false positives. This hypothesis will have a rate of false negatives close to optimal with respect to the positive-reliable concepts in C. Our construction modifies the target function f, leaving the distribution D unchanged, in such a way that false positives (with respect to the original function) are penalized much more than false negatives. By choosing parameters correctly, the output of the black-box agnostic learner is close to the best positive-reliable classifier.

Lemma 2. Assume that algorithm A^{O(f)}(ε, δ) efficiently agnostically learns concept class C under distribution D in time T(ε, δ). Then A^{O(f')}(ε²/2, δ), where f' = (1/2 + ε/4)·f, positive reliably learns C in time T(ε²/2, δ).

Proof. Let f' = (1/2 + ε/4)·f. Let g : X → {0,1} be an arbitrary function and let p_1 = E_{x∼D}[f(x)]. We relate the quantities false+ and false− of g with respect to the functions f and f'; simple calculations show that

false+(g, D, f') = false+(g, D, f) + (1/2 - ε/4)·(p_1 - false−(g, D, f))   (1)
false−(g, D, f') = (1/2 + ε/4)·false−(g, D, f)   (2)
err(g, D, f') = false+(g, D, f) + (1/2 - ε/4)·p_1 + (ε/2)·false−(g, D, f)   (3)

Let c ∈ C be such that false+(c, D, f) = 0 and false−(c, D, f) = opt+ = min_{c'∈C+} false−(c', D, f). If h is the output of A^{O(f')}(ε²/2, δ), then with probability at least 1 - δ,

err(h, D, f') ≤ err(c, D, f') + ε²/2.   (4)

Substituting identity (3) into (4), once for c and once for h, we get

false+(h, D, f) ≤ (ε/2)·(opt+ - false−(h, D, f)) + ε²/2.   (5)

Since false−(h, D, f) ≥ 0 and opt+ ≤ 1, inequality (5) gives false+(h, D, f) ≤ ε/2 + ε²/2 ≤ ε. Dropping the non-negative term false+(h, D, f) from (5) and rearranging, we get

false−(h, D, f) ≤ opt+ + ε.   (6)

Hence h satisfies the requirements of positive reliable learning with accuracy ε.
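The reduction of Lemma 2 is easy to express in code on top of the oracle simulation of Lemma 1. The sketch below is illustrative (names are assumptions, not from the paper): it wraps an EX oracle with the transformation f' = (1/2 + ε/4)·f, which is the mixture q ≡ 1/2 + ε/4, r ≡ 0, and hands it to a black-box agnostic learner with accuracy ε²/2.

import random

def rescaled_oracle(draw_from_O_f, eps):
    """Simulate O(f') for f' = (1/2 + eps/4) * f, the special case q = 1/2 + eps/4, r = 0."""
    q = 0.5 + eps / 4.0
    def draw():
        x, y = draw_from_O_f()
        if random.random() < q:
            return x, y          # keep the true label w.p. q
        return x, 0              # otherwise force a 0 label (r = 0)
    return draw

def positive_reliable_learn(agnostic_learner, draw_from_O_f, eps, delta):
    """Run the black-box agnostic learner on the transformed oracle with accuracy eps^2/2,
    as in Lemma 2.  `agnostic_learner(oracle, eps, delta)` is an assumed interface."""
    return agnostic_learner(rescaled_oracle(draw_from_O_f, eps), eps**2 / 2, delta)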

Figure 2: Algorithm Disjunction Learner

input: ε, δ, M, O(f), CPRL
set H_0 := 0, ε' := ε/(4M), δ' := δ/(2M);
for i = 1 to M {
    set m := ⌈(2/ε²) log(2M/δ)⌉;
    draw Z = (x_1, y_1), ..., (x_m, y_m) from the distribution (D, f);
    set p̂_i := (1/m) Σ_{j=1}^m (1 - H_{i-1}(x_j));
    if p̂_i ≥ ε/2 {
        h_i := CPRL(ε', δ', (2/3)·p̂_i, O(f), 1 - H_{i-1});
    } else {
        h_i := 0;
    }
    H_i := H_{i-1} ∨ h_i;
}
output H_M

For learning disjunctions of concepts, we first need to learn the concept class on a subset of the instance space. We show that in the agnostic setting, learning on a subset of the instance space is only as hard as learning over the entire instance space. The simple reduction we present here does not seem feasible in the noiseless case. Let S ⊆ X be a subset; a distribution D over X induces a conditional distribution D_S over S: for any T ⊆ X, Pr_{D_S}[T] = Pr_D[T ∩ S] / Pr_D[S]. We assume access to the indicator function of the set S, i.e., I_S : X → {0,1} with I_S(x) = 1 if x ∈ S and I_S(x) = 0 otherwise. If Pr_D[S] is not negligible, it is possible to agnostically learn under the conditional distribution D_S using a black-box agnostic learner.

Lemma 3. Suppose algorithm A^{O(f)}(ε, δ) efficiently agnostically learns C under distribution D in time T(ε, δ). Let S ⊆ X be such that Pr_D[S] ≥ γ (= 1/poly). Then algorithm A^{O(f')}(εγ, δ), where f' = f·I_S + (1 - I_S)/2, efficiently agnostically learns C under the conditional distribution D_S in time T(εγ, δ).

Proof. We first relate the errors with respect to f and f' = f·I_S + (1 - I_S)/2. In words, f' equals f on S and is random outside S. Therefore, the error of any function g is 1/2 outside of S and equals its error under D_S inside S. More formally, for any function g : X → {0,1} we have

err(g, D, f') = E_{x∼D}[g(x)(1 - f'(x)) + (1 - g(x))f'(x)]
             = E_{x∼D}[(g(x)(1 - f(x)) + (1 - g(x))f(x))·I_S(x)] + E_{x∼D}[(1 - I_S(x))/2]
             = Pr_D[S]·err(g, D_S, f) + (1 - Pr_D[S])/2.   (7)

Let c ∈ C be such that err(c, D_S, f) = min_{c'∈C} err(c', D_S, f) = opt. Using (7), we conclude that c is also optimal under distribution D with respect to the function f'. Algorithm A^{O(f')}(εγ, δ) outputs a hypothesis h which, with probability at least 1 - δ, satisfies err(h, D, f') ≤ err(c, D, f') + εγ. Using (7), and since Pr_D[S] ≥ γ, we immediately get err(h, D_S, f) ≤ opt + ε.

We define algorithm CPRL (conditional positive reliable learner) as follows. The inputs to the algorithm are: ε, the accuracy parameter; δ, the confidence parameter; γ, a lower bound on the probability of S; O(f), oracle access to the target function; and I_S, the indicator function of the subset S. Algorithm CPRL first defines f' = (1/2 + ε/4)·(f·I_S + (1 - I_S)/2). By Lemma 1 we can sample from O(f') given access to O(f) and I_S. Algorithm CPRL runs A^{O(f')}(ε²γ²/2, δ) and returns its output. We state the properties of algorithm CPRL as Lemma 4.

Lemma 4. Suppose algorithm A^{O(f)}(ε, δ) agnostically learns C under distribution D in time T(ε, δ). For a subset S ⊆ X with Pr_D[S] ≥ γ (= 1/poly), algorithm CPRL (which uses A as a black box), with probability at least 1 - δ, returns a hypothesis h such that

false+(h, D_S, f) ≤ ε   and   false−(h, D_S, f) ≤ min_{c∈C+} false−(c, D_S, f) + ε.

Algorithm CPRL has access to oracle O(f) and black-box access to I_S. The running time of algorithm CPRL is T(ε²γ²/2, δ).

Let OR_s(C) = {c_1 ∨ ··· ∨ c_s | c_j ∈ C, 1 ≤ j ≤ s} be the composite class of size-s disjunctions of concepts from C. Theorem 2 states that if C is agnostically learnable under distribution D, then OR_s(C) is positive reliably learnable under distribution D.
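Because Figure 2 is stated abstractly in terms of the black box CPRL, a compact executable rendering may help. The sketch below is illustrative only: its constants (ε' = ε/(4M), δ' = δ/(2M), the sample size m) follow the reconstruction above, and `cprl` and `draw_example` are assumed interfaces rather than functions defined in the paper.

import math, random

def disjunction_learner(cprl, draw_example, eps, delta, s):
    """Sketch of Figure 2 (Disjunction Learner).  `cprl(eps, delta, gamma, draw_example,
    indicator)` is assumed to be a conditional positive-reliable learner that returns a
    {0,1}-valued hypothesis."""
    M = max(1, math.ceil(s * math.log(2.0 / eps)))
    eps0, delta0 = eps / (4 * M), delta / (2 * M)
    hypotheses = []                                              # h_1, ..., h_i learned so far
    H = lambda x: max((h(x) for h in hypotheses), default=0)     # H_i = h_1 OR ... OR h_i
    for i in range(M):
        m = math.ceil((2.0 / eps**2) * math.log(2.0 * M / delta))
        Z = [draw_example() for _ in range(m)]
        p_hat = sum(1 - H(x) for x, _ in Z) / m                  # mass not yet covered by H
        if p_hat >= eps / 2:
            # indicator of S = {x : H(x) = 0}; evaluated during this call, i.e. against H_{i-1}
            indicator = lambda x: 1 - H(x)
            h_i = cprl(eps0, delta0, (2.0 / 3.0) * p_hat, draw_example, indicator)
            hypotheses.append(h_i)
    return H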
We need two simple lemmas to prove Theorem 2; their proofs are elementary and we omit them.

Lemma 5. Let D be a distribution over X, let f : X → [0,1] be an arbitrary function, and let c, h be such that false+(c) = 0 and false−(h) ≤ false−(c) + ε. Then Pr_D[h(x) = 1] ≥ Pr_D[c(x) = 1] - ε.

Lemma 6. Let D be a distribution over X, let f : X → [0,1] be an arbitrary function, and let c, h be such that false+(c) = 0, false+(h) ≤ ε_1 and Pr_D[h(x) = 1] ≥ Pr_D[c(x) = 1] - ε_2. Then false−(h) ≤ false−(c) + ε_1 + ε_2.

Lemma 5 states that if a hypothesis h has a rate of false negatives close to that of a concept c that makes no false positive predictions, then h must predict 1 almost as often as c. Lemma 6 states that if h is a hypothesis with a low rate of false positives and it predicts 1 almost as often as a concept c which never predicts false positives, then the rate of false negatives of h must be close to that of c.

Theorem 2. Let C be a concept class that is agnostically learnable (in polynomial time) under D, and let f : X → [0,1] be an arbitrary function. Algorithm Disjunction Learner (see Figure 2), run with parameters ε, δ, M = s·log(2/ε), access to oracle O(f) and black-box access to CPRL, with probability at least 1 - δ outputs a hypothesis H_M such that

false+(H_M, D, f) ≤ ε   and   false−(H_M, D, f) ≤ min_{ψ∈OR_s(C)+} false−(ψ, D, f) + ε.

The running time is polynomial in s, 1/ε and 1/δ.

Proof. For the definitions of H_i and p̂_i, refer to Figure 2; recall that ε' = ε/(4M) and δ' = δ/(2M). Let ϕ ∈ OR_s(C) satisfy false+(ϕ, D, f) = 0 and false−(ϕ, D, f) = opt+ = min_{ψ∈OR_s(C)+} false−(ψ, D, f). Suppose ϕ = c_1 ∨ ··· ∨ c_s, where c_j ∈ C. Let S_i = {x ∈ X | H_{i-1}(x) = 0}, so that 1 - H_{i-1} is an indicator function for S_i. Let p_i = Pr_D[S_i] and p+ = Pr_D[ϕ(x) = 1]. Our goal is to show that Pr_D[H_M(x) = 1] ≥ p+ - 3ε/4 and false+(H_M, D, f) ≤ ε/4; Lemma 6 then completes the proof.

In the i-th iteration we compute, using a sample, the estimate p̂_i of p_i. Using Hoeffding's bound, with probability at least 1 - δ/(2M), |p̂_i - p_i| ≤ ε/2. Suppose this holds for all M iterations, allowing our algorithm a failure probability of δ/2 so far. In this scenario, if p_i ≥ ε then p̂_i ≥ ε/2 and p̂_i ≤ (3/2)·p_i, so (2/3)·p̂_i is a valid lower bound on Pr_D[S_i]. Note that if p_i < ε, then false−(H_{i-1}, D, f) < ε and Pr_D[H_{i-1}(x) = 1] > 1 - ε, and hence we are done.

Let D_{S_i} denote the conditional distribution given S_i. In the i-th iteration, the call to CPRL returns a hypothesis h_i such that false+(h_i, D_{S_i}, f) ≤ ε' and false−(h_i, D_{S_i}, f) ≤ opt_i + ε', where opt_i = min_{c∈C+} false−(c, D_{S_i}, f). Note that for j = 1, ..., s we have false+(c_j, D_{S_i}, f) = 0, and hence false−(c_j, D_{S_i}, f) ≥ opt_i for all i. Thus false−(h_i, D_{S_i}, f) ≤ false−(c_j, D_{S_i}, f) + ε', and hence, using Lemma 5,

Pr_{D_{S_i}}[h_i(x) = 1] ≥ Pr_{D_{S_i}}[c_j(x) = 1] - ε'   for every j,   so   Pr_{D_{S_i}}[h_i(x) = 1] ≥ (1/s)·Pr_{D_{S_i}}[ϕ(x) = 1] - ε'.

Let q_i = 1 - p_{i+1} = Pr_D[H_i(x) = 1]. We define d_i = p+ - q_i and show by induction that

d_i ≤ p+·(1 - 1/s)^i + ε'·Σ_{j=0}^{i-1} (1 - 1/s)^j.

For the base step, d_1 = p+ - q_1 = p+ - Pr_D[h_1(x) = 1] ≤ p+ - p+/s + ε' = p+·(1 - 1/s) + ε'. For the induction step we have

d_{i+1} = p+ - Pr_D[H_{i+1}(x) = 1]
        = p+ - Pr_D[H_i(x) = 1] - Pr_D[H_i(x) = 0 ∧ h_{i+1}(x) = 1]
        = d_i - p_{i+1}·Pr_{D_{S_{i+1}}}[h_{i+1}(x) = 1]
        ≤ d_i - (p_{i+1}/s)·Pr_{D_{S_{i+1}}}[ϕ(x) = 1] + ε'
        ≤ d_i - (1/s)·(Pr_D[ϕ(x) = 1] - Pr_D[H_i(x) = 1]) + ε'
        = d_i·(1 - 1/s) + ε'
        ≤ p+·(1 - 1/s)^{i+1} + ε'·Σ_{j=0}^{i} (1 - 1/s)^j.

When M = s·log(2/ε), we get d_M ≤ p+·e^{-M/s} + ε'·s ≤ ε/2 + ε/4 = 3ε/4. Moreover, since each h_i has false positive rate at most ε' on D_{S_i}, false+(H_M, D, f) ≤ Σ_{i=1}^M p_i·false+(h_i, D_{S_i}, f) ≤ M·ε' = ε/4 ≤ ε. Hence, using Lemma 6 (with ε_1 = ε/4 and ε_2 = 3ε/4), false−(H_M, D, f) ≤ opt+ + ε. The probability that some call to CPRL fails is at most M·δ' = δ/2, which combined with the probability of failure caused by incorrect estimation of some p̂_i gives total failure probability at most δ.

Theorem 2 is a more precise restatement of Theorem 1. Combined with Lemma 7 below (which is Theorem 1 of [GKK08]), it proves Corollary 1.

Lemma 7 ([GKK08]). The class of polynomial-size decision trees can be agnostically learned (using queries) to accuracy ε in time poly(n, 1/ε, 1/δ) with respect to the uniform distribution.

3 Full reliability

In this section we deal with the case of full reliability. We are interested in obtaining a hypothesis that has a low error rate in terms of both false positives and false negatives. In the noisy (agnostic) setting this is not possible unless we allow our hypothesis to refrain from making a prediction. For this reason we use the notion of partial classifiers, which predict a value from the set {0, 1, ?}, where a prediction of ? is treated as uncertainty of the classifier. Recall that our instance space is ⟨X_n⟩_{n≥1} and that for each n the target is an unknown function f_n : X_n → [0,1]. Formally, a partial classifier is a function c : X_n → {0, 1, ?}. Let I(E) denote the indicator function of the event E. The error (and similarly false+, false−) of a partial classifier is defined as

err(c, D_n, f_n) = E_{x∼D_n}[I(c(x) = 0)·f_n(x) + I(c(x) = 1)·(1 - f_n(x))].

We define the uncertainty of a partial classifier c to be

?(c, D_n, f_n) = E_{x∼D_n}[I(c(x) = ?)].

Finally, we define the accuracy of a partial classifier c to be

acc(c, D_n, f_n) = E_{x∼D_n}[I(c(x) = 0)·(1 - f_n(x)) + I(c(x) = 1)·f_n(x)].
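For concreteness, the three quantities for a partial classifier can be estimated from a sample as in the short sketch below (illustrative, not from the paper); it uses the string '?' to represent abstention.

def partial_metrics(c, examples):
    """Empirical err, uncertainty (?), and acc of a partial classifier c : X -> {0, 1, '?'},
    estimated from labeled examples (x, y)."""
    m = len(examples)
    err = sum(1 for x, y in examples if c(x) in (0, 1) and c(x) != y) / m
    unc = sum(1 for x, y in examples if c(x) == '?') / m
    acc = sum(1 for x, y in examples if c(x) == y) / m
    return err, unc, acc      # err + unc + acc == 1 up to floating point error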

Note that for a partial classifier c we have err(c) + ?(c) + acc(c) = 1. Suppose that ⟨C_n⟩ is a sequence of concept classes; recall that we defined C_n+ = {c ∈ C_n | false+(c) = 0}. Similarly we define C_n− = {c ∈ C_n | false−(c) = 0}. We can now define the class of partial classifiers derived from C_n as PC(C_n) = {(c+, c−) | c+ ∈ C_n+, c− ∈ C_n−}. The definition ensures that for a partial classifier c = (c+, c−) ∈ PC(C_n), with probability 1 a sample x ∼ D_n satisfies c+(x) ≤ c−(x). For any x, c(x) = 1 when both c+(x) = 1 and c−(x) = 1; c(x) = 0 when both c+(x) = 0 and c−(x) = 0; and c(x) = ? otherwise. A fully reliable classifier is a partial classifier which makes no errors. We define fully reliable learning as follows.

Fully Reliable Learning. Algorithm A efficiently fully reliably learns a sequence of concept classes ⟨C_n⟩_{n≥1}, under distributions ⟨D_n⟩_{n≥1}, if there exists a polynomial p(n, 1/ε, 1/δ) such that, for every n, every ε, δ > 0 and every f_n : X_n → [0,1], algorithm A^{O(f_n)}(ε, δ), with probability at least 1 - δ, outputs a hypothesis h such that

err(h, D_n, f_n) ≤ ε   and   acc(h, D_n, f_n) ≥ max_{c∈PC(C_n)} acc(c, D_n, f_n) - ε.

The time complexity of both A and h is bounded by p(n, 1/ε, 1/δ).

In Section 2 we gave an algorithm for positive reliable learning using a black-box agnostic learner for C_n (using Lemma 2). We define this as algorithm PRL (positive reliable learner). The inputs to the algorithm are ε, the accuracy parameter; δ, the confidence parameter; and O(f), oracle access to the target function. The output of the algorithm is a hypothesis h such that false+(h) ≤ ε and false−(h) ≤ min_{c∈C+} false−(c) + ε. The algorithm runs in time polynomial in n, 1/ε and 1/δ. A symmetric algorithm NRL (negative reliable learner) has identical properties with the false positive and false negative error bounds interchanged. The main result of this section shows that agnostic learning implies fully reliable learning.

Theorem 3. If C is efficiently agnostically learnable under distribution D, then it is efficiently fully reliably learnable under distribution D.

Proof. A simple algorithm to fully reliably learn C is the following. Let h+ = PRL^{O(f)}(ε/4, δ/2) and h− = NRL^{O(f)}(ε/4, δ/2). Define h : X → {0, 1, ?} as

h(x) = 1 if h+(x) = h−(x) = 1;   h(x) = 0 if h+(x) = h−(x) = 0;   h(x) = ? otherwise.

We claim that the hypothesis h is close to the best fully reliable partial classifier. Let c = (c+, c−) ∈ PC(C) be such that err(c, D, f) = 0 and acc(c, D, f) = max_{c'∈PC(C)} acc(c', D, f). We know that the following hold:

false+(h+, D, f) ≤ ε/4,   false−(h+, D, f) ≤ false−(c+, D, f) + ε/4.

Similarly,

false−(h−, D, f) ≤ ε/4,   false+(h−, D, f) ≤ false+(c−, D, f) + ε/4.

Then err(h, D, f) ≤ false+(h+, D, f) + false−(h−, D, f) ≤ ε/2. Note that when h(x) = ?, then h+(x) ≠ h−(x), and hence one of them makes an error. Therefore we also have 1 - acc(h, D, f) ≤ err(h+, D, f) + err(h−, D, f). We know that

err(h+, D, f) = false+(h+, D, f) + false−(h+, D, f) ≤ false−(c+, D, f) + ε/2.

Similarly err(h−, D, f) ≤ false+(c−, D, f) + ε/2. Finally, we check that false−(c+, D, f) + false+(c−, D, f) = 1 - acc(c, D, f); hence acc(h, D, f) ≥ acc(c, D, f) - ε.
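The combination step in this proof is mechanical; the following sketch (illustrative names, not from the paper) shows it for hypotheses represented as Python functions mapping a point to 0 or 1.

def fully_reliable_from_prl_nrl(h_plus, h_minus):
    """Combination used in the proof of Theorem 3: h_plus is the output of the
    positive-reliable learner, h_minus of the negative-reliable learner.  The combined
    partial classifier predicts a label only when the two agree and abstains otherwise."""
    def h(x):
        if h_plus(x) == 1 and h_minus(x) == 1:
            return 1
        if h_plus(x) == 0 and h_minus(x) == 0:
            return 0
        return '?'
    return h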
4 Reliable learning with tolerance

In this section we consider our final generalization, where we allow tolerance rates τ+ and τ− for false positives and false negatives respectively. As in the case of fully reliable learning, we need to consider the class of partial classifiers. Given a sequence of concept classes ⟨C_n⟩, for each C_n we define the class of sandwich classifiers as SC(C_n) = {(c+, c−) | c+, c− ∈ C_n, c+ ≤ c−}. For a sandwich classifier c = (c+, c−), given x, c(x) = 1 when both c+(x) = 1 and c−(x) = 1; c(x) = 0 when both c+(x) = 0 and c−(x) = 0; and c(x) = ? otherwise.

Sandwich classifiers are a generalization of the class of partial classifiers PC(C_n) that we defined in the previous section. In particular, the class PC(C_n) is the set of all sandwich classifiers with zero error. We call a sandwich classifier c ∈ SC(C_n) (τ+, τ−)-tolerant reliable if false+(c, D_n, f_n) ≤ τ+ and false−(c, D_n, f_n) ≤ τ−. Given acceptable tolerance rates τ+, τ−, define opt as

opt(τ+, τ−) = max_{c∈SC(C_n)} acc(c)   subject to   false+(c) ≤ τ+ and false−(c) ≤ τ−.

We show that algorithm Sandwich Learner (Figure 3) efficiently learns the class of halfspace sandwiches over the unit ball B_n under the uniform distribution. With minor modifications, this algorithm can also learn under log-concave distributions. A halfspace sandwich over B_n can be visualized as two slices, one labeled 1, the other labeled 0, and the remaining ball labeled ?. Our algorithm is based on the results of Kalai, Klivans, Mansour and Servedio [KKMS05], and the analysis is largely similar. In each iteration, algorithm Sandwich Learner draws a sample of size m and sets up an ℓ1 regression problem, (8)-(11), defined below.

The ℓ1 polynomial regression problem is

min over polynomials p, q with deg(p) ≤ d, deg(q) ≤ d:
    (1/m)·[ Σ_{i: y_i=1} |1 - p(x_i)| + Σ_{i: y_i=0} |q(x_i)| ]   (8)
subject to
    (1/m)·Σ_{i: y_i=0} |p(x_i)| ≤ τ+ + ε/2   (9)
    (1/m)·Σ_{i: y_i=1} |1 - q(x_i)| ≤ τ− + ε/2   (10)
    (1/m)·Σ_{i=1}^{m} max(p(x_i) - q(x_i), 0) ≤ ε/8   (11)

Figure 3: Algorithm Sandwich Learner

inputs: ε, m, T, d
1. Draw m labeled examples Z_t = (x_1^t, y_1^t), ..., (x_m^t, y_m^t) ∈ (B_n × {0,1})^m and set up the ℓ1 polynomial regression problem (8)-(11).
2. Solve the regression problem, obtaining polynomials p^t(x), q^t(x), and record z_t, the value of the objective function.
3. Repeat the above two steps T times and take p = p^s, q = q^s to be the polynomials for which z_s was smallest.
4. Output the randomized partial hypothesis h : B_n × {0, 1, ?} → [0,1]:
   h(x, 1) = crop(min(p(x), q(x)))
   h(x, 0) = crop(min(1 - p(x), 1 - q(x)))
   h(x, ?) = 1 - h(x, 1) - h(x, 0)

This problem can be solved efficiently by setting it up as a linear program (see the appendix of [KKMS05]). Thus, the running time of the algorithm is polynomial in m, T and n^d. The function crop : R → [0,1] used in the algorithm is defined as crop(z) = z for z ∈ [0,1], crop(z) = 0 for z < 0 and crop(z) = 1 for z > 1. The output of the algorithm is a randomized partial classifier. We define a randomized partial classifier as a function c : X × {0, 1, ?} → [0,1] such that c(x, 1) + c(x, 0) + c(x, ?) = 1, where c(x, y) is interpreted as the probability that the output on x is y. We define the error and empirical error of such classifiers as

err(c, D) = E_{(x,y)∼D}[c(x, 1 - y)],   êrr(c, Z) = (1/m)·Σ_{i=1}^m c(x_i, 1 - y_i).

Other quantities are defined similarly. Theorem 4 below shows that for any constant ε, the class of halfspace sandwiches over the unit ball B_n is efficiently (τ+, τ−)-tolerant reliably learnable. The running time of the algorithm is exponential in 1/ε.

Theorem 4. Let U be the uniform distribution on the unit ball B_n and let f : B_n → [0,1] be an arbitrary unknown function. There exists a polynomial P such that, for any ε, δ > 0, n, τ+, τ−, algorithm Sandwich Learner with parameters m = P(n^d/(εδ)), T = log(2/δ), d = O(1/ε²), with probability at least 1 - δ, returns a randomized hypothesis h which satisfies false+(h) ≤ τ+ + ε, false−(h) ≤ τ− + ε and acc(h) ≥ opt(τ+, τ−) - ε, where opt is with respect to the class of halfspace sandwich classifiers.

The proof of Theorem 4 is given in the appendix.
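As an illustration of steps 1 and 2 of Figure 3, the sketch below sets up the regression (8)-(11) with off-the-shelf convex-programming tools. This is an assumption for brevity: the paper itself reduces the problem to an explicit linear program as in the appendix of [KKMS05], and the constraint thresholds here simply mirror the reconstruction of (9)-(11) above.

import itertools
import numpy as np
import cvxpy as cp

def monomial_features(X, d):
    """Expand each point into all monomials of degree <= d (the n^d-dimensional lift)."""
    n = X.shape[1]
    cols = [np.ones(len(X))]
    for deg in range(1, d + 1):
        for idx in itertools.combinations_with_replacement(range(n), deg):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

def sandwich_regression(X, y, tau_plus, tau_minus, eps, d):
    """Solve (8)-(11); X is an (m, n) array, y an (m,) 0/1 array.
    Returns coefficient vectors for p and q in the monomial basis."""
    Phi = monomial_features(X, d)
    m, k = Phi.shape
    a, b = cp.Variable(k), cp.Variable(k)            # coefficients of p and q
    Phi_pos, Phi_neg = Phi[y == 1], Phi[y == 0]      # rows with label 1 / label 0
    objective = (cp.sum(cp.abs(1 - Phi_pos @ a)) + cp.sum(cp.abs(Phi_neg @ b))) / m
    constraints = [
        cp.sum(cp.abs(Phi_neg @ a)) / m <= tau_plus + eps / 2,       # (9)
        cp.sum(cp.abs(1 - Phi_pos @ b)) / m <= tau_minus + eps / 2,  # (10)
        cp.sum(cp.pos(Phi @ a - Phi @ b)) / m <= eps / 8,            # (11)
    ]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return a.value, b.value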
References

[CHHS02] A. Cannon, J. Howse, D. Hush, and C. Scovel. Learning with the Neyman-Pearson and min-max criteria. Technical Report LA-UR-02-2951, Los Alamos National Laboratory, 2002.

[Elk01] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 2001.

[FFHO04] C. Ferri, P. Flach, and J. Hernández-Orallo. Delegating classifiers. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, pages 37-44, New York, NY, USA, 2004. ACM.

[FHO04] C. Ferri and J. Hernández-Orallo. Cautious classifiers. In Proceedings of the First International Workshop on ROC Analysis in Artificial Intelligence, pages 27-36, 2004.

[GKK08] P. Gopalan, A. T. Kalai, and A. R. Klivans. Agnostically learning decision trees. In STOC '08: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2008. ACM.

[Hau92] D. Haussler. Decision-theoretic generalizations of the PAC model for neural networks and other applications. Information and Computation, 100:78-150, 1992.

[Jac97] J. C. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414-440, 1997.

[KKMS05] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. In FOCS '05: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 11-20, Washington, DC, USA, 2005. IEEE Computer Society.

[KSS94] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2):115-141, 1994.

[NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character, 231:289-337, 1933.

[Pie07] T. Pietraszek. On the use of ROC analysis for the optimization of abstaining classifiers. Machine Learning, 68(2):137-169, 2007.

[SN05] C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806-3819, 2005.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

[ZLA03] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, page 435, Washington, DC, USA, 2003. IEEE Computer Society.

A Proof of Theorem 4

To prove Theorem 4, we need Lemma 8 as an intermediate step. Here B_n is the unit ball.

Lemma 8. Let D be a distribution over B_n and let f : B_n → [0,1] be an arbitrary function. Let τ+, τ− be the required tolerance parameters. Suppose c = (c+, c−) is a halfspace sandwich classifier such that false+(c) ≤ τ+, false−(c) ≤ τ− and acc(c) = opt(τ+, τ−). If p+, p− are degree-d polynomials such that E[|c+(x) - p+(x)|] ≤ ε²/128 and E[|c−(x) - p−(x)|] ≤ ε²/128, and if m is large enough (a fixed polynomial in 1/ε suffices for the Hoeffding bounds below), then with probability at least 1/2, p+, p− are feasible solutions to the ℓ1 polynomial regression problem (8)-(11), and the value of the objective function at (p+, p−) is at most false−(c+) + false+(c−) + ε/2.

Proof. Using Hoeffding's bound, for m large enough,

Pr[ f̂alse+(c+) ≥ false+(c+) + ε/8 ] ≤ 1/16,   (12)

where f̂alse+(c+) is the empirical estimate of false+(c+) using a sample of size m. Also, by Markov's inequality,

Pr[ (1/m)·Σ_i |c+(x_i) - p+(x_i)| ≥ ε/8 ] ≤ (8/ε)·E[|c+(x) - p+(x)|] ≤ 1/16.   (13)

And hence, with probability at least 7/8,

(1/m)·Σ_{i: y_i=0} |p+(x_i)| ≤ (1/m)·Σ_{i: y_i=0} ( c+(x_i) + |c+(x_i) - p+(x_i)| ) ≤ f̂alse+(c+) + ε/8 ≤ false+(c+) + ε/4 ≤ τ+ + ε/4,   (14)

and hence p+ satisfies constraint (9). Similarly, one can show that with probability at least 7/8, p− satisfies constraint (10). Observe that

E[max(p+(x) - p−(x), 0)] ≤ E[|p+(x) - p−(x) + c−(x) - c+(x)|] ≤ E[|p+(x) - c+(x)|] + E[|p−(x) - c−(x)|] ≤ ε²/64.

Here we use the fact that c−(x) ≥ c+(x) for all x (according to our definition of SC(C)). Hence, by Markov's inequality,

Pr[ (1/m)·Σ_i max(p+(x_i) - p−(x_i), 0) ≥ ε/8 ] ≤ 1/8.   (15)

By a union bound, with probability at least 5/8, p+, p− satisfy all the constraints of the regression problem. Let us assume we are in the event where all of (12)-(15) and the corresponding statements for false negatives hold. Using Hoeffding's bound, Pr[ f̂alse−(c+) ≥ false−(c+) + ε/8 ] ≤ 1/16 and Pr[ f̂alse+(c−) ≥ false+(c−) + ε/8 ] ≤ 1/16. We allow ourselves a further 1/8 loss in probability so that these two events do not occur either. Thus, with probability at least 1/2, p+, p− are feasible solutions to the regression problem and the value of the objective is

(1/m)·Σ_{i: y_i=1} |1 - p+(x_i)| + (1/m)·Σ_{i: y_i=0} |p−(x_i)|
≤ (1/m)·Σ_{i: y_i=1} ( (1 - c+(x_i)) + |c+(x_i) - p+(x_i)| ) + (1/m)·Σ_{i: y_i=0} ( c−(x_i) + |c−(x_i) - p−(x_i)| )
≤ f̂alse−(c+) + f̂alse+(c−) + ε/4 ≤ false−(c+) + false+(c−) + ε/2.

Proof of Theorem 4. We use threshold functions θ_t in our proof, where for any t ∈ [0,1], θ_t : R → {0,1} is defined by θ_t(z) = 1 if z ≥ t and θ_t(z) = 0 otherwise. Although the threshold functions θ_t do not occur in the algorithm, they significantly simplify the proof. To get a decision from the randomized hypothesis that our algorithm outputs, a simple technique is to choose a threshold uniformly in [0,1] and use it to output a value in {0, 1, ?}. Thresholds over polynomials are halfspaces in a higher (n^d)-dimensional space, and we can use standard results from VC theory to bound the difference between the empirical and true error rates. We will frequently use the following observation: for any z ∈ R, crop(z) = ∫_0^1 θ_t(z) dt.

For a degree-d polynomial p : R^n → R, the function θ_t ∘ p can be viewed as a halfspace in dimension n^d, where we expand the terms of the polynomial into a linear function. Using classical VC theory, for any distribution D over B_n there exists a polynomial Q such that for m = Q(n^d, 1/ε, 1/δ) and a sample S of m examples drawn from (D, f), with probability at least 1 - δ, |êrr(θ_t ∘ p) - err(θ_t ∘ p)| ≤ ε, where êrr(g) = (1/m)·Σ_{(x,y)∈S} I(g(x) ≠ y). Similar bounds hold for false+ and false−, where f̂alse+(g, S) = (1/m)·Σ_{(x,y)∈S} g(x)(1 - y) and f̂alse−(g, S) = (1/m)·Σ_{(x,y)∈S} (1 - g(x))·y.
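The crop identity and the uniform-threshold decision rule mentioned above are easy to check numerically; the sketch below is illustrative only, and the cumulative sampling rule in decide() is just one way to realize the stated marginals h(x,1), h(x,0), h(x,?), not necessarily the authors' exact scheme.

import random

def crop(z):
    """crop : R -> [0,1] as defined for the algorithm."""
    return min(1.0, max(0.0, z))

def theta(t, z):
    """Threshold function theta_t(z) = 1 if z >= t else 0."""
    return 1.0 if z >= t else 0.0

# Numerical check of crop(z) = integral over t in [0,1] of theta_t(z) dt.
ts = [i / 10000.0 for i in range(1, 10001)]
for z in (-0.3, 0.0, 0.4, 1.0, 1.7):
    approx = sum(theta(t, z) for t in ts) / len(ts)
    print(z, crop(z), round(approx, 3))

def decide(p_x, q_x):
    """Turn the randomized partial hypothesis at a point x (given p(x), q(x)) into a
    decision in {1, 0, '?'} by drawing a single uniform threshold t."""
    t = random.random()
    h1 = crop(min(p_x, q_x))                  # h(x, 1)
    h0 = crop(min(1.0 - p_x, 1.0 - q_x))      # h(x, 0)
    if t < h1:
        return 1
    if t < h1 + h0:
        return 0
    return '?'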

Let c = (c+, c−) be the best halfspace sandwich classifier with respect to the tolerance rates τ+, τ−. Using the results of Kalai, Klivans, Mansour and Servedio [KKMS05], for d = O(1/ε²) there exist polynomials p+, p− such that E[|p+(x) - c+(x)|] ≤ ε²/128 and E[|p−(x) - c−(x)|] ≤ ε²/128. Thus, when algorithm Sandwich Learner is run with T = log(2/δ), by Lemma 8, with probability at least 1 - δ/2, at least one of the iterations has objective value smaller than false−(c+) + false+(c−) + ε/2. It can be checked that false−(c+) + false+(c−) = 1 - acc(c) = err(c) + ?(c). We assume we are in the case when the least objective value is smaller than 1 - acc(c) + ε/2, allowing our algorithm to fail with probability δ/2 so far. Let p, q be the polynomials which solve the regression problem with the least objective value.

We now analyze the quantities false+, false− and acc of the randomized hypothesis h output by algorithm Sandwich Learner. We present the analysis of false+; the analysis of false− is similar. We assume that the number of examples m is large enough that all the bounds from VC theory that we require in the proof hold simultaneously with probability at least 1 - δ/2; this can be achieved by a union bound using only a polynomial number of examples, say P(n^d/(εδ)). (All our requirements compare the error of a halfspace to its observed error on the sample: we need that, for the sample S = {(x_1, y_1), ..., (x_m, y_m)}, no halfspace has a difference larger than the stated slack.)

For the first part, consider halfspaces of the form θ_t ∘ p and θ_t ∘ (1 - q) for t ∈ [0,1]. By the VC bound we have f̂alse+(θ_t ∘ p, S) ≥ false+(θ_t ∘ p, U, f) - ε/2 and f̂alse−(θ_t ∘ (1 - q), S) ≥ false−(θ_t ∘ (1 - q), U, f) - ε/2. The analysis then proceeds as follows:

false+(h, U, f) = E_{x∼U}[h(x, 1)·(1 - f(x))]
= E_{x∼U}[ ∫_0^1 θ_t(min(p(x), q(x))) dt · (1 - f(x)) ]
≤ ∫_0^1 E_{x∼U}[θ_t(p(x))·(1 - f(x))] dt
= ∫_0^1 false+(θ_t ∘ p, U, f) dt
≤ ∫_0^1 f̂alse+(θ_t ∘ p, S) dt + ε/2
= (1/m)·Σ_{i: y_i=0} ∫_0^1 θ_t(p(x_i)) dt + ε/2
= (1/m)·Σ_{i: y_i=0} crop(p(x_i)) + ε/2
≤ τ+ + ε,

where the last step uses constraint (9) of the regression.

Next we analyze the quantity 1 - acc(h) = err(h) + ?(h). Let S̄ = {(x_1, 1 - y_1), ..., (x_m, 1 - y_m)}, namely S with the labels flipped, and let S_0 = {(x_1, 0), ..., (x_m, 0)}, namely the sample S with all labels 0. Here we assume that the following bounds hold: f̂alse+(θ_t ∘ (1 - p), S̄) ≥ false+(θ_t ∘ (1 - p), U, 1 - f) - ε/8, f̂alse+(θ_t ∘ q, S) ≥ false+(θ_t ∘ q, U, f) - ε/8 and êrr(θ_t ∘ (p - q), S_0) ≥ err(θ_t ∘ (p - q), U, 0) - ε/8. Step (16) below holds because 1 - crop(a) = crop(1 - a). Step (17) uses the facts that min(a, b) = a - (a - b)·I(a > b) and max(a, b) = b + (a - b)·I(a > b). Finally, step (18) uses crop(a + b) ≤ crop(a) + crop(b) for b ≥ 0. All these facts are easily checked.

1 - acc(h) = 1 - E[h(x, 1)·f(x) + h(x, 0)·(1 - f(x))]
= E[(1 - h(x, 1))·f(x) + (1 - h(x, 0))·(1 - f(x))]
= E[(1 - crop(min(p(x), q(x))))·f(x) + (1 - crop(min(1 - p(x), 1 - q(x))))·(1 - f(x))]
= E[crop(1 - min(p(x), q(x)))·f(x) + crop(max(p(x), q(x)))·(1 - f(x))]   (16)
= E[crop(1 - p(x) + (p(x) - q(x))·I(p(x) > q(x)))·f(x) + crop(q(x) + (p(x) - q(x))·I(p(x) > q(x)))·(1 - f(x))]   (17)
≤ E[crop(1 - p(x))·f(x) + crop(q(x))·(1 - f(x)) + crop(p(x) - q(x))]   (18)
= E[ ∫_0^1 θ_t(1 - p(x))·f(x) dt + ∫_0^1 θ_t(q(x))·(1 - f(x)) dt + ∫_0^1 θ_t(p(x) - q(x)) dt ]
= ∫_0^1 ( false+(θ_t ∘ (1 - p), U, 1 - f) + false+(θ_t ∘ q, U, f) + err(θ_t ∘ (p - q), U, 0) ) dt
≤ ∫_0^1 ( f̂alse+(θ_t ∘ (1 - p), S̄) + f̂alse+(θ_t ∘ q, S) + êrr(θ_t ∘ (p - q), S_0) ) dt + 3ε/8
= (1/m)·Σ_{i: y_i=1} ∫_0^1 θ_t(1 - p(x_i)) dt + (1/m)·Σ_{i: y_i=0} ∫_0^1 θ_t(q(x_i)) dt + (1/m)·Σ_{i=1}^m ∫_0^1 θ_t(p(x_i) - q(x_i)) dt + 3ε/8   (19)
= (1/m)·Σ_{i: y_i=1} crop(1 - p(x_i)) + (1/m)·Σ_{i: y_i=0} crop(q(x_i)) + (1/m)·Σ_{i=1}^m crop(p(x_i) - q(x_i)) + 3ε/8
≤ (1/m)·Σ_{i: y_i=1} |1 - p(x_i)| + (1/m)·Σ_{i: y_i=0} |q(x_i)| + (1/m)·Σ_{i=1}^m max(p(x_i) - q(x_i), 0) + 3ε/8
≤ 1 - acc(c, U, f) + ε,

where in the last inequality we used the fact that the first two terms are the objective function of the regression (and we assumed that their value is at most false−(c+) + false+(c−) + ε/2 = 1 - acc(c) + ε/2), and that the third term is bounded by ε/8 via constraint (11). Hence acc(h) ≥ acc(c) - ε.


This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Robust polynomial regression up to the information theoretic limit

Robust polynomial regression up to the information theoretic limit 58th Annual IEEE Syposiu on Foundations of Coputer Science Robust polynoial regression up to the inforation theoretic liit Daniel Kane Departent of Coputer Science and Engineering / Departent of Matheatics

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE Proceeding of the ASME 9 International Manufacturing Science and Engineering Conference MSEC9 October 4-7, 9, West Lafayette, Indiana, USA MSEC9-8466 MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel 1 Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel Rai Cohen, Graduate Student eber, IEEE, and Yuval Cassuto, Senior eber, IEEE arxiv:1510.05311v2 [cs.it] 24 ay 2016 Abstract In

More information

The Frequent Paucity of Trivial Strings

The Frequent Paucity of Trivial Strings The Frequent Paucity of Trivial Strings Jack H. Lutz Departent of Coputer Science Iowa State University Aes, IA 50011, USA lutz@cs.iastate.edu Abstract A 1976 theore of Chaitin can be used to show that

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Hybrid System Identification: An SDP Approach

Hybrid System Identification: An SDP Approach 49th IEEE Conference on Decision and Control Deceber 15-17, 2010 Hilton Atlanta Hotel, Atlanta, GA, USA Hybrid Syste Identification: An SDP Approach C Feng, C M Lagoa, N Ozay and M Sznaier Abstract The

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Agnostic Learning of Disjunctions on Symmetric Distributions

Agnostic Learning of Disjunctions on Symmetric Distributions Agnostic Learning of Disjunctions on Symmetric Distributions Vitaly Feldman vitaly@post.harvard.edu Pravesh Kothari kothari@cs.utexas.edu May 26, 2014 Abstract We consider the problem of approximating

More information

Discriminative Learning can Succeed where Generative Learning Fails

Discriminative Learning can Succeed where Generative Learning Fails Discriminative Learning can Succeed where Generative Learning Fails Philip M. Long, a Rocco A. Servedio, b,,1 Hans Ulrich Simon c a Google, Mountain View, CA, USA b Columbia University, New York, New York,

More information

Quantile Search: A Distance-Penalized Active Learning Algorithm for Spatial Sampling

Quantile Search: A Distance-Penalized Active Learning Algorithm for Spatial Sampling Quantile Search: A Distance-Penalized Active Learning Algorith for Spatial Sapling John Lipor 1, Laura Balzano 1, Branko Kerkez 2, and Don Scavia 3 1 Departent of Electrical and Coputer Engineering, 2

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Energy-Efficient Threshold Circuits Computing Mod Functions

Energy-Efficient Threshold Circuits Computing Mod Functions Energy-Efficient Threshold Circuits Coputing Mod Functions Akira Suzuki Kei Uchizawa Xiao Zhou Graduate School of Inforation Sciences, Tohoku University Araaki-aza Aoba 6-6-05, Aoba-ku, Sendai, 980-8579,

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information