Activized Learning with Uniform Classification Noise

Activized Learning with Uniform Classification Noise Liu Yang Steve Hanneke June 9, 2011 Abstract We prove that for any VC class, it is possible to transform any passive learning algorithm into an active learning algorithm with strong asymptotic improvements in label complexity for every nontrivial distribution satisfying a uniform classification noise condition. This generalizes a similar result proven by [Han09] for the realizable case, and is the first result establishing that such general improvement guarantees are possible in the presence of restricted types of classification noise. 1 Introduction In many machine learning applications, there is an abundance of cheap unlabeled data, while obtaining enough labels for supervised learning requires significantly more time, effort, or other costs. It is therefore important to try to reduce the total number of labels needed for supervised learning. One of the most appealing approaches to this problem is active learning, a protocol in which the learning algorithm itself selects which of the unlabeled data points should be labeled, in an interactive (sequential) fashion. There is now a well-established literature full of compelling theoretical and empirical evidence indicating that active learning can significantly reduce the number of labels required for learning, compared to learning from randomly selected points (passive learning). However, there remain a number of fundamental open questions regarding how strong the theoretical advantages of active learning over passive learning truly are, particularly when faced with the challenge of noisy labels. At present, there is already a vast literature on the design and analysis of passive learning algorithms, built up over several decades by a substantial number of researchers. In approaching the problem of designing effective active learning algorithms, we might hope to circumvent the need for a commensurate amount of effort, by directly building upon the existing tried-and-true passive learning methods. By leveraging the increased power afforded by the active learning protocol, we may hope to further reduce the number of labels required to learn with these same methods. Carnegie Mellon University, Machine Learning Department, Email: liuy@cs.cmu.edu. Department of Statistics, Carnegie Mellon University, Email: shanneke@stat.cmu.edu. 1

Toward this end, [Han09] recently proposed a framework called activized learning, in which a passive learning algorithm is provided as a subroutine to an active metaalgorithm, which constructs data sets to feed into the passive subroutine, and uses the returned classifiers to inform the active learning process. The objective is to design this meta-algorithm in such a way as to guarantee that the number of label requests required to learn to a desired accuracy will always be significantly reduced compared to the number of random labeled examples the given passive learning algorithm would require to obtain a similar accuracy; in this case, we say the active meta-algorithm activizes the given passive algorithm. This reduction-based framework captures the typical approach to the design of active learning algorithms in practice (see e.g., [TK01, BP09, Set10]), and is appealing because it may inherit the tried-and-true properties (e.g., learning bias) of the given passive learning algorithm, while further reducing the number of labels required for learning. If an active meta-algorithm activizes every passive learning algorithm, under some stated conditions, then it is called a universal activizer under those conditions. In the original analysis, [Han09] proved that such universal activizers do exist under the condition that the target concept resides in a known space of finite VC dimension and that there is no label noise (the so-called realizable case). [Han11] also proved that there exist classification noise models under which there typically do not exist universal activizers, even with the Bayes optimal classifier in a known space of finite VC dimension. Thus, there is a question of what types of noise admit the existence of universal activizers. In this work, we study the classic uniform classification noise model of [AL88]. In this model, there is a target concept residing in a known concept space of finite VC dimension, and the labels in the training data are corrupted by independent and identically distributed noise variables. The probability that a given label in the training set differs from that of the target concept is referred to as the noise rate, and is always strictly less than 1/2 so that the target concept is also the unique Bayes optimal classifier. Below, we find that there do exist universal activizers under the uniform classification noise model. This represents the first general result establishing the existence of universal activizers in the presence of classification noise. Our proof of this result builds upon the established methods of [Han11], but requires several novel technical contributions in addition, including a rather interesting technique for handling the problem of adapting to the value of the noise rate. The paper is structured as follows. In Section 2, we formalize the setting and objective. This is followed in Section 3 with a description of a helpful method and result of [Han11]. We then proceed to construct two useful subroutines in Section 4, proving a relevant guarantee for each. Finally, in Section 5, we present our metaalgorithm and prove the main result: that the proposed meta-algorithm is indeed a universal activizer under the uniform classification noise model. 2 Notation and Definition We are interested in a statistical learning setting for binary classification, in which there is some joint distribution D XY on X { 1, +1}, and we denote by D the 2

marginal distribution of D XY on X. For any classifier h : X { 1, +1}, denote by er(h) = D XY ({(x, y) : h(x) y}) the error rate of h. There is additionally a set C of classifiers, called the concept space, and we denote by d the VC dimension of C [VC71, Vap82]; throughout this work, we will suppose d <, in which case C is called a VC class. We also denote by ν(c; D XY ) = inf h C er(h) the noise rate of D XY with respect to C. We will be interested in the set of distributions satisfying the uniform classification noise assumption of [AL88], which supposes there is an element h D XY C for which the Y i values are simply the h D XY (X i ) values, except corrupted by independently flipping each Y i to equal h D XY (X i ) with some constant probability (less than 1/2). Definition 1. For a given concept space C, define the set of uniform classification noise distributions UniformNoise(C) = {D XY : h D XY C, η(d XY ) [0, 1/2) such that for (X, Y ) D XY, P(Y h D XY (X) X) = η(d XY )}. For D XY UniformNoise(C), the classifier h D XY is referred to as the target function; note that we clearly have ν(c; D XY ) = η(d XY ), and h D XY is the minimizer of er(h) (both among classifiers in C, and indeed among all possible classifiers). In the learning problem, there is a sequence Z = {(X i, Y i )} i=1 where the (X i, Y i ) are independent and D XY -distributed; we denote by Z m = {(X i, Y i )} m i=1. The {X i} i=1 sequence is referred to as the unlabeled data sequence, while the Y i values are referred to as the labels. An active learning algorithm has direct access to the X i values in Z, but must request to observe the labels Y i one at a time. In the specific active learning protocol we study here, the active learning algorithm is given as input a budget n N, and is allowed to request the values of at most n labels; based on the X i values, the algorithm selects an index i 1 N, receives the value Y i1, then selects another index i 2, receives the value Y i2, etc. This continues for up to n rounds, and then the algorithm must halt and return a classifier. Definition 2. An active learning algorithm A achieves label complexity Λ(, ) if, for any joint distribution D XY, for any ε > 0, for any n Λ(ε, D XY ), E [er(a(n))] ε. Since some D XY have no classifiers h with er(h) ε for small ε > 0, we will be interested in analyzing the quantity Λ(ν(C; D XY ) + ε, D XY ), the number of labels sufficient to achieve expected error rate within ε of the best error rate achievable by classifiers in C (which, in the case of uniform classification noise, is also within ε of the best error rate achievable by any classifier). In the present context, we define a passive learning algorithm as any function A p ( ) mapping any finite sequence of labeled examples to a classifier. Definition 3. A passive learning algorithm A p achieves label complexity Λ(, ) if, for any joint distribution D XY, for any ε > 0, for any n Λ(ε, D XY ), E[er(A p (Z n ))] ε. For any m N and sequence L (X { 1, +1}) m, we additionally define the empirical error rate of a classifier h as er L (h) = m 1 (x,y) L 1[h(x) y]; when L = Z m, we abbreviate er m (h) = er Zm (h). We also use the notation V [(x, y)] = {h V : h(x) = y} for any V C. 3

Following [Han09, Han11], we now formally define what it means to activize a passive algorithm. An active meta-algorithm is a procedure A a taking as input two arguments, namely a passive learning algorithm A p and a label budget n N, and returning a classifier ĥ = A a(a p, n), such that A a (A p, ) is an active learning algorithm. Define ( the class ) of functions Polylog(1/ε) as those g s.t. k [0, ), g(ε) = O log k (1/ε). Here, and in all contexts below, the asymptotics are always considered as ε 0 (from above) when considering a function of ε, and as n when considering a function of n; all other quantities are considered constants in these asymptotics. In particular, we write g 1 (ε) = o(g 2 (ε)) (or equivalently g 2 (ε) = ω(g 1 (ε))) to mean lim ε 0 g 1 (ε)/g 2 (ε) = 0. For a label complexity Λ, we will consider any D XY for which Λ(, D XY ) is relatively small as being trivial, indicating that we need not concern ourselves with improving the label complexity for that D XY since it is already very small; for our purposes, relatively small simply means polylogarithmic. Furthermore, in keeping with the reduction style of the framework, we will only require our active learning methods to be effective when the given passive learning method has reasonable behavior. Formally, define the set Nontrivial(Λ; C) as those D XY for which, ε > 0, Λ(ν(C; D XY ) + ε, D XY ) <, and g Polylog(1/ε), Λ(ν(C; D XY ) + ε, D XY ) = ω(g(ε)). Finally, we define an activizer under uniform classification noise as follows. Definition 4. [Han09,Han11] We say an active meta-algorithm A a activizes a passive algorithm A p for C under UniformNoise(C) if the following condition holds. For any label complexity Λ p achieved by A p, the active learning algorithm A a (A p, ) achieves a label complexity Λ a such that D XY UniformNoise(C) Nontrivial(Λ p ; C), c [1, ) s.t. (letting ν = ν(c; D XY )) Λ a (ν + cε, D XY ) = o (Λ p (ν + ε, D XY )). In this case, A a is called an activizer for A p with respect to C under UniformNoise(C), and the active learning algorithm A a (A p, ) is called the A a -activized A p. If A a activizes every passive algorithm for C under UniformNoise(C), we say A a is a universal activizer for C under UniformNoise(C). This definition says that, for all nontrivial distributions satisfying the uniform classification noise model, the activized A p algorithm has a label complexity with a strictly slower rate of growth compared to that of the original A p algorithm. For instance, if the original label complexity of A p was Θ(1/ε), then a label complexity of O(log(1/ε)) for the activized A p algorithm would suffice to satisfy this condition (as would, for instance, O(1/ε 1/2 )). The two slight twists on this interpretation are the restriction to nontrivial distributions and the factor of c loss in the ε argument. As noted by [Han11], the restriction to some notion of nontrivial D XY is necessary, since we clearly cannot hope to improve over passive in certain trivial scenarios, such as when D has support on a single point; passive learning can have O(log(1/ε)) label complexity in this case. The implication of this definition is that the activized algorithm s label complexity is superior to any nontrivial upper bound on the passive method s label complexity. It is not known whether the loss in the ε argument, by a constant c, is really necessary in general (even for the realizable case). However, this only really makes a difference for 4

rather strange passive learning methods; in most cases, Λ p (ν + ε; D XY ) = poly(1/ε), in which case we can set c = 1 by increasing the leading constant on Λ a. Our analysis below reveals we can set this c arbitrarily close to 1, or even to (1 + o(1)). 2.1 Summary of Results In this work, we construct an active meta-algorithm, referred to as Meta-Algorithm 1 below, and prove that it is a universal activizer for C under UniformNoise(C). This is true for any VC class C. The significance of this result is primarily a deeper understanding of the advantages of active learning over passive learning. This first step beyond the realizable case in activized learning is particularly interesting in light of the established negative results indicating that there exist noise models for which there do not exist universal activizers [Han11]. The proof is structured as follows. We review a technique of [Han11] for active learning based on shatterable sets, represented by the method referred to as Subroutine 1 below. For our purposes, the important property of this technique is that it produces a set of labeled examples, where each example has either its true (noisy) label, or else has the label of the target function itself (i.e., the de-noised label). It also has the desirable property that the number of examples in this set is significantly larger than the number of label requests used to produce it. These properties, originally proved by [Han11], are summarized in Lemma 1. We may then hope that if we feed this labeled sample into the given passive learning algorithm, then as long as this sample is larger than the label complexity of that algorithm, it will produce a good classifier; since we used a much smaller number of label requests compared to the size of this sample, we would therefore have the desired improvements in label complexity. Unfortunately, it is not always so simple. The fact that some of the examples are de-noised turns out to be a problem, as there are passive algorithms whose performance may be highly dependent on the uniformity of the noise, so that their performance can actually degrade from denoising select data points. So our next step is to alter this sample to appear more like a typical sample from D XY ; that is, oddly enough, we need to re-noise the de-noised examples. The difficulty in re-noising the sample is that we do not know the value of the noise rate η(d XY ). Furthermore, estimating η(d XY ) to the required precision would require an excessive number of labeled examples, too large to obtain the desired performance gains. So we devise a means of getting what we need, without actually estimating η(d XY ), by a combination of coarse estimation and a kind of brute-force search. With this approximate noise rate in hand, we simply flip each de-noised label with probability η(d XY ), so that the sample now appears to be a typical sample for D XY. This method for re-noising the sample is referred to as Subroutine 2 below, and its effectiveness is described in Lemma 2. Feeding this sample to the passive algorithm then achieves the desired result. However, before we can conclude, there is some clean-up to be done, as the above techniques generate a variety of undesirable by-products that we must sort through to find the aforementioned re-noised de-noised labeled sample. Specifically, in addition to the large partially de-noised labeled sample, Subroutine 1 also generates several spurious sets of labeled examples, with no detectable way to determine which one is the 5

Subroutine 1: Input : label budget n, confidence parameter δ Output: pairs of labeled data sets (L 1, Q 1 ), (L 2, Q 2 ),..., (L d+1, Q d+1 ) 0. Request the { first m n = n/2 labels, {Y 1,..., Y mn }, and } let t m n 1. Let V = h C : er mn (h) min er m h n (h ) 2Ûm n (δ) C 2. For k = 1, 2,..., d + 1 3. ˆ (k) ˆP ( x : ˆP ( ) ) S X k 1 : V shatters S {x} V shatters S 1/2 4. Q k {}, L k {} { ( ) } 5. For m = m n + 1,..., m n + min n/ 6 2 k ˆ (k), n 33/32 6. If ˆP ( ) S X k 1 : V shatters S {X m } V shatters S 1/2 and t < n 7. Request the label Y m of( X m, and let Q k Q k {(m, Y m )} and t t + 1 ) 8. Else, let ŷ argmax ˆP S X k 1 : V [(X m, y)] does not shatter S V shatters S y { 1,+1} 9. L k L k {(X m, ŷ)} 10. Return (L 1, Q 1 ),..., (L d+1, Q d+1 ) Figure 1: Subroutine 1 produces pairs of labeled samples by a combination of queries and inferences. sample we are interested in. Furthermore, in addition to the re-noised data set resulting from adding noise at rate η(d XY ), Subroutine 2 also generates several samples renoised with noise rates differing significantly from η(d XY ). As such, our approach is to take all of the sets produced by Subroutine 1, run them all through Subroutine 2, and then run the passive algorithm with all of the resulting samples. This results in a large collection of classifiers. We know that at least one of them has the required error rate, and our task is simply to find one of comparable quality. So we perform a kind of tournament, comparing pairs of classifiers by querying for the labels of points where they disagree, and taking as the winner the one that makes fewer mistakes, reasoning that the one with better error rate will likely make fewer mistakes in such a comparison. After several rounds of this tournament, we emerge with an overall winner, which we then return. This tournament technique is referred to as Subroutine 3 below, and the guarantee on the quality of the classifier it selects is given in Lemma 3. The sections below include the details of these methods, with rigorous analyses of their behavior. 3 Active Learning Based on Shatterable Sets This section describes an approach to active learning investigated by [Han09, Han11], along with a particular result of [Han11] useful for our purposes. Recall that we say a set of classifiers V shatters {x 1,..., x m } X m if, y 1,..., y m { 1, +1}, h V s.t. i m, h(x i ) = y i. To simplify notation, define X 0 = { }, and say V shatters iff V {}; also suppose P(X 0 ) = 1. 6

Now consider the definition of Subroutine 1 given in Figure 1, based on a similar method of [Han11]. The quantity Ûm n (δ) is defined as follows, based on an analysis by [Kol06] of excess risk bounds based on a data-dependent Rademacher complexity. Let ξ 1, ξ 2,... denote a sequence of independent Rademacher random variables (i.e., uniform in { 1, +1}), also independent from all other random variables in the algorithm (i.e., Z). Define ˆR m = sup 1 m h1,h 2 C 2m i=1 ξ i (h 1 (X i ) h 2 (X i )), ˆD m = sup 1 m h1,h 2 C 2m i=1 h 1(X i ) h 2 (X i ), Ûm(δ) = 12 ˆR ln(16/δ) m +34 ˆDm m + 752 ln(16/δ) m. The quantities ˆP( ) in Subroutine 1 are estimators for their respective analogous quantities P( ), based purely on unlabeled examples (independent from those used in the rest of the algorithm). Their specific definitions are not particularly relevant to the present discussion, and are omitted here; the interested reader is referred to the original analysis of [Han09, Han11]. The following result was essentially proved by [Han11] (more precisely, it can easily be established by a combination of various lemmas of [Han11]); we omit the details here due to space limitations. Lemma 1. [Han11] For any VC class C and D XY UniformNoise(C), there exist constants k {1,..., d+1}, c, c (1, ), and a monotone sequence φ 1 (n) = ω(n) such that, n N, with probability at least 1 c exp{ c n 1/3 }, running Subroutine 1 with label budget n/2 and confidence parameter δ n = exp { n} results in L k Q k φ 1 (n) and er Lk (h D XY ) = 0. In other words, the set L k Q k has size n, and every label is either the true label (for Q k ) or the h D XY label (for L k ). 4 An Active Meta-algorithm for Uniform Classification Noise Re-noising the Sample At first glance, Lemma 1 might seem to have almost solved the problem already, aside from the fact that we need to specify a method for selecting the appropriate value of k. For k = k, the sample L k Q k represents a partially de-noised collection of labeled examples, which might intuitively seem even better to feed into the passive algorithm than a sample with the true (noisy) labels. However, this reasoning is naïve, since we are seeking a universal activizer, applicable to any passive learning algorithm. In particular, there are many passive learning algorithms that actually use the properties of the noise to their advantage in the learning process, so that altering the noise distribution of the sample may alter the behavior of the passive algorithm to ill effects. Indeed, there exist passive learning algorithms (e.g., logistic regression) whose performance can be made worse by de-noising select examples from a given sample. Thus, in light of the above considerations, we are tasked with the somewhat unusual problem of re-noising the de-noised labels, so that the labeled sample again appears to be a typical sample one might expect for the supervised learning problem: that is, essentially similar to an iid sample with distribution roughly D XY. Of course, if we had knowledge of the noise rate η(d XY ), it would be a simple matter to re-noise 7

Subroutine 2: Input : label budget n, pair of labeled data sets (L, Q) Output : sequence of 1 + n 3/4 labeled data sets R 0, R 1,..., R n 3/4 0. Let {(X l1, ŷ l1 ),..., (X ls, ŷ ls )} denote the first s = min{ L, n} elements of L 1. Request the labels Y l1, Y l2,..., Y ls 2. Calculate ˆη = s 1 s i=1 1[Y l i ŷ li ] 3. For each j {1, 2,..., n 3/4 } 4. Let η j = ˆη n 5/16 + 2jn 17/16, L (j) = {} 5. Let {χ ij } i N be a collection of iid { 1, +1} random variables with P(χ ij = 1) = η j, (independent from {χ ij } i N,j j and Z) 6. Let L (j) = {(X li, χ ij ŷ li ) : (X li, ŷ li ) L} 7. Let R 0 = L Q, and for each j {1, 2,..., n 3/4 }, let R j = L (j) Q 8. Return the sequence R 0, R 1,..., R n 3/4 Figure 2: Subroutine 2 returns several data sets, each re-noised according to a guess η j of η(d XY ). the sample by independently corrupting each of the de-noised labels with probability η(d XY ). In the absence of such direct information, it is natural to try to estimate η(d XY ). However, it turns out it would require too many labeled examples to estimate the noise rate to the precision necessary for re-noising the sample in a way that appears close enough to the D XY distribution to work well when we feed it into the passive algorithm. So instead, we employ a combination of estimation and search, which turns out to be sufficient for our purposes. Specifically, consider Subroutine 2 in Figure 2. When constructing the samples R j, we assume the union operator merges the two data sets in a way that preserves their original order in the unlabeled sequence (supposing each X i also implicitly records its index i). We have the following lemma. Lemma 2. Suppose D XY UniformNoise(C), φ 2 (n) = ω(n), Q X {X 1,..., X m }, and L X = {X 1,..., X m } \ Q X. Further suppose that Q = {(X i, Y i ) : X i Q X }, L = {(X i, h D XY (X i )) : X i L X }, n 33/32 L φ 2 (n), and that A p is a passive learning algorithm. Then there is a function q 1 (n) = o(1) such that, if {R i } i=1 n3/4 is the sequence of labeled data sets returned by Subroutine 2 when provided n and (L, Q) as inputs, then (letting ν = ν(c; D XY )) [ E min er (A p (R j )) ν j ] { (1+q 1 (n))e [er (A p (Z m )) ν]+(1+q 1 (n)) exp n 1/4}. Proof. We consider two cases. First, if η(d XY ) = 0, then R 0 = Z m, so that the result clearly holds with q 1 (n) = 0. For the remainder of the proof, suppose η(d XY ) > 0. Much of the reasoning below holds for any sufficiently large n. We will only concern ourselves with these large n values here; the value of q 1 (n) for smaller values of n can be set appropriately so that the lemma remains valid for smaller n as well (e.g., by setting q 1 (n) exp { n 1/4} for these small values). Since L φ 2 (n) = ω(n), for sufficiently large n we must have s = n. By Hoeffding s inequality, with probability 1 exp { n 1/4}, ˆη η(d XY ) c 1 n 3/8 8

Subroutine 3: Input : label budget n, sequence of classifiers h 1, h 2,..., h N Output : classifier hĵ 0. If N = 1, return the single classifier h 1 1. For each i {1, 2,... N/2 } 2. Let T in be the next n/n (previously untouched) unlabeled examples X i in the unlabeled sequence for which h 2i 1 (X i ) h 2i (X i ) (if they exist) 3. Request the label Y t for each X t T in and let S in = {(X t, Y t ) : X t T in } 4. Let h i = argmin h {h 2i 1,h 2i} er SiN (h) 5. If N is odd, let h N/2 = h N 6. Recursively call Subroutine 3 with label budget n/2 and classifiers h 1, h 2,... h N/2 and return the classifier it returns Figure 3: Subroutine 3 selects among a set of classifiers by a tournament of pairwise comparisons. for some (universal) constant c 1 (0, ). Thus, for sufficiently large n, on this event we have η(d XY ) [ˆη n 5/16, ˆη + n 5/16]. In particular, this means j = argmin j η j η(d XY ) has η j η(d XY ) 2n 17/16. Now for any sequence of labels y 1,..., y m { 1, +1} s.t. X i Q X = y i = Y i, we have P (R j = {(X i, y i )} m i=1 {X i} m i=1, Q, j ) P (Z m = {(X i, y i )} m i=1 {X i} m i=1, Q) max 0 r L (1 + 2n 17/16 max 0 r L (1 + 2n 17/16 η(d XY ) η(d XY ) ) n 33/32 ) r (1 + 2n 17/16 1 η(d XY ) { } exp 2n 1/32 /η(d XY ). η r j (1 η j ) L r η(d XY ) r (1 η(d XY )) L r ) L r ) L = (1 + 2n 17/16 η(d XY ) This final quantity approaches 1 as n, and we therefore define q 1 (n) = exp { 2n 1/32 /η(d XY ) } 1 = o(1). Since er(h) ν(c; D XY ) [0, 1], we have { E [er (A p (R j )) ν(c; D XY )] (1 + q 1 (n))e [er (A p (Z m ))] + exp n 1/4}. A Tournament for Classifier Selection Lemmas 1 and 2 together indicate that, if we feed each (L k, Q k ) pair into Subroutine 2 (using an appropriate fraction of the overall label budget for each call), then we need only find a way to select from among the returned labeled samples R j, or equivalently from among the set of classifiers h j = A p (R j ), so that the selected ĵ has er(hĵ) ν(c; D XY ) not too much larger than er(h j ) ν(c; D XY ). Here we develop and analyze such a procedure, based on running a tournament among these classifiers by pairwise comparisons. Specifically, consider Subroutine 3 specified in Figure 3. We have the following lemma. 9

Lemma 3. Suppose D XY UniformNoise(C). Then there exists a constant c (0, ) and a function q 2 (n) = o(1) such that, for any n N and any sequence of classifiers h 1, h 2,..., h N with 1 N (d + 1)(1 + (4n) 3/4 ), with probability at least 1 exp { cn 1/12}, the classifier hĵ returned from calling Subroutine 3 with label budget n and classifiers h 1, h 2,..., h N satisfies er(hĵ) ν(c; D XY ) (1 + q 2 (n)) min j (er(h j ) ν(c; D XY )). Proof. We proceed inductively (it is clear for N = 1). Suppose some i {1,..., N/2 } has P(h j (X) h D XY (X)) > (1 + n 1/12 )P(h k (X) h D XY (X)), where j, k {2i 1, 2i}. Then E[er SiN (h k )] (1 η(d XY ))(2+n 1/12 ) 1 +η(d XY ) 1 + n 1/12 2 + n 1/12 = 1 + η(d XY )n 1/12 2 + n 1/12. Denoting by p 1 this latter quantity, and letting ε 1 = (1/2 η(d XY ))n 1/12 1+η(D XY, a Chernoff )n 1/12 bound implies { } P(er SiN (h k ) > 1/2) = P (er SiN (h k ) > (1 + ε 1 )p 1 ) exp c 1 n 1/4 p 1 ε 2 1, for an appropriate choice of constant c 1 (0, ). Simplifying this last expression, we find that it is at most exp { c 2 n 1/12} for an appropriate constant c 2 (0, ). A union bound then implies that with probability at least 1 (N/2) exp { c 1 n 1/12}, for each i {1,..., N/2 }, we have P(h i (X) h D XY (X)) (1+n 1/12 ) min j {2i 1,2i} P(h j (X) h D XY (X)). Note that, although both n and N are reduced in the recursive calls, they will still satisfy the constraint on the size of N, and the sample sizes S = n/n remain essentially constant over recursive calls, so that this result can be applied to the recursive calls as well (tweaking the constant in the exponent can compensate for the variability due to the floor function). Thus, applying this argument inductively, combined with a union bound over the O(log 2 N) recursive calls, we have that there exists a constant c 2 (0, ) such that with probability at least 1 (N log 2 N) exp { c 2 n 1/12} 1 exp { cn 1/12} (for an appropriate c > 0), the returned classifier hĵ satisfies (for some appropriate constant c 3 (0, )) P ( hĵ(x) h D XY (X) ) ( 1 + n 1/12) c 3 log n min 1 j N P ( h j (X) h D XY (X) ). Note h, er(h) ν(c; D XY ) = (1 2η(D XY ))P(h(X) h D XY (X)). Also, as (1 + n 1/12 ) c3 log n exp { c 3 n 1/12 log n } approaches 1 as n, we can define q 2 (n) = exp { c 3 n 1/12 log n } 1 = o(1) to get er(hĵ) ν(c; D XY ) (1 + q 2 (n)) min 1 j N (er(h j ) ν(c; D XY )). 5 Main Result: A Universal Activizer under Uniform Classification Noise We are now finally ready for our main result, establishing the existence of universal activizers under uniform classification noise. Specifically, consider the active metaalgorithm described in Figure 4. We have the following result. 10

Meta-Algorithm 1: Input : passive learning algorithm A p, label budget n Output : classifier ĥ 0. Execute Subroutine 1 with label budget n/2 and confidence parameter δ = exp{ n} 1. Let (L 1, Q 1 ),..., (L d+1, Q d+1 ) be the returned pairs of labeled data sets 2. For each k {1,..., d + 1}, execute Subroutine 2 with budget n/(4(d + 1)) and (L k, Q k ) 3. Let R k0, R k1,..., R km denote the sequence of returned data sets (M = n/(4(d + 1)) 3/4 ) 4. Execute Subroutine 3 with label budget n/4 and classifiers {A p (R kj ) : k {1,..., d + 1}, j {0, 1,..., M}} 5. Return the classifier ĥ selected by this execution of Subroutine 3 Figure 4: Meta-Algorithm 1 is a universal activizer for any VC class under uniform noise. Theorem 1. For any VC class C, Meta-Algorithm 1 is a universal activizer for C under UniformNoise(C). Proof. Lemma 1 implies that with probability 1 exp{ Ω(n 1/3 )}, some pair (L k, Q k ) will satisfy the conditions of Lemma 2; more precisely, on this event, and conditioned on L k Q k, the pair (L k, Q k ) will be distributionally equivalent to a pair (L, Q) that satisfies these conditions. Thus, combining Lemmas 1 and 2, the law of total expectation, combined with the fact that er(h) ν(c; D XY ) [0, 1] for D XY UniformNoise(C), imply that (letting ν = ν(c; D XY )) E [ ] min er(a p(r kj )) ν k,j (1+o(1)) m φ 1(n) sup m φ 1(n) { } E [er(a p (Z m )) ν]+exp Ω(n 1/4 ). Finally, combining this with Lemma 3 implies that [er(ĥ) ] { } E ν (1 + o(1)) sup E [er(a p (Z m )) ν] + exp Ω(n 1/12 ). (1) For n = Ω(log 12 (1/ε)), the right term in (1) is < ε. For the left term, note that if A p achieves label complexity Λ p, then in order to make sup m φ1(n) E [er(a p (Z m )) ν] ε, it suffices to take n large enough so that φ 1 (n) Λ p (ε + ν; D XY ). Thus, since φ 1 (n) = ω(n), for D XY Nontrivial(Λ p ; C), the smallest N 2ε N such that every n N 2ε has (1 + o(1)) sup m φ1(n) E [er(a p (Z m )) ν] 2ε has N 2ε = o(λ p (ε + ν, D XY )). Therefore, since any O(log 12 (1/ε)) function is also o(λ p (ε + ν, D XY )) for D XY Nontrivial(Λ p ; C), we see that applying Meta-Algorithm 1 to A p results in an active learning algorithm that achieves a label compleity Λ a with the property that, for D XY UniformNoise(C) Nontrivial(Λ p ; C), Λ a (3ε + ν, D XY ) max { N 2ε, O(log 12 (1/12)) } = o (Λ p (ε + ν, D XY )), as required. Thus, Meta-Algorithm 1 activizes A p for C under UniformNoise(C). Since A p was arbitrary, Meta-Algorithm 1 is a universal activizer for C under UniformNoise(C), as claimed. 11

6 Conclusions We established the existence of universal activizers for arbitrary VC classes in the presence of uniform classification noise. This is the first result of this generality regarding the advantages active learning over passive learning in the presence of noise. Previously, [Han09, Han11] has argued that even seemingly benign noise models typically do not permit the existence of universal activizers. Thus, in an investigation of the existence of universal activizers, the key question going forward is whether there are more general noise models, nontrivially subsuming uniform classification noise, under which universal activizers still exist. References [AL88] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343 370, 1988. [BP09] J. Baldridge and A. Palmer. How well does active learning actually work? time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009. [Han09] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009. [Han11] S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. Manuscript, 2011. [Kol06] V. Koltchinskii. Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593 2656, 2006. [Set10] B. Settles. Active learning literature survey. http://active-learning.net, 2010. [TK01] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 2001. [Vap82] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer- Verlag, New York, 1982. [VC71] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264 280, 1971. 12