The Sample Complexity of Self-Verifying Bayesian Active Learning

Similar documents
A Bound on the Label Complexity of Agnostic Active Learning

Active Learning Class 22, 03 May Claire Monteleoni MIT CSAIL

Activized Learning with Uniform Classification Noise

A Theory of Transfer Learning with Applications to Active Learning

Plug-in Approach to Active Learning

Activized Learning with Uniform Classification Noise

Asymptotic Active Learning

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

Minimax Analysis of Active Learning

Sample and Computationally Efficient Active Learning. Maria-Florina Balcan Carnegie Mellon University

Adaptive Sampling Under Low Noise Conditions 1

Efficient Semi-supervised and Active Learning of Disjunctions

Active Learning: Disagreement Coefficient

Atutorialonactivelearning

The True Sample Complexity of Active Learning

Kernel Metric Learning For Phonetic Classification

Improved Algorithms for Confidence-Rated Prediction with Error Guarantees

Understanding Generalization Error: Bounds and Decompositions

Practical Agnostic Active Learning

Active learning. Sanjoy Dasgupta. University of California, San Diego

Lecture 8. Instructor: Haipeng Luo

The True Sample Complexity of Active Learning

1 Differential Privacy and Statistical Query Learning

The sample complexity of agnostic learning with deterministic labels

FORMULATION OF THE LEARNING PROBLEM

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Machine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016

Discriminative Learning can Succeed where Generative Learning Fails

Improved Algorithms for Confidence-Rated Prediction with Error Guarantees

Foundations For Learning in the Age of Big Data. Maria-Florina Balcan

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds

Linear Classification and Selective Sampling Under Low Noise Conditions

Statistical Active Learning Algorithms

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

COMS 4771 Introduction to Machine Learning. Nakul Verma

Tsybakov noise adap/ve margin- based ac/ve learning

On the Sample Complexity of Noise-Tolerant Learning

Online Learning, Mistake Bounds, Perceptron Algorithm

Agnostic Active Learning

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity

Robust Selective Sampling from Single and Multiple Teachers

ANALYSIS OF CRITERION FOR TORSIONAL IRREGULARITY OF SEISMIC STRUCTURES

Foundations For Learning in the Age of Big Data. Maria-Florina Balcan

Learning symmetric non-monotone submodular functions

The Bayesian Group-Lasso for Analyzing Contingency Tables

arxiv: v4 [cs.lg] 7 Feb 2016

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

Generalization theory

Generalization, Overfitting, and Model Selection

Computational and Statistical Learning Theory

From Batch to Transductive Online Learning

Machine Learning

Co-Training and Expansion: Towards Bridging Theory and Practice

Buy-in-Bulk Active Learning

Learning with Rejection

Cost Complexity of Proactive Learning via a Reduction to Realizable Active Learning. November 2009 CMU-ML

10.1 The Formal Model

Impossibility Theorems for Domain Adaptation

Diameter-Based Active Learning

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Minimax Bounds for Active Learning

Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation

IFT Lecture 7 Elements of statistical learning theory

12.1 A Polynomial Bound on the Sample Size m for PAC Learning

Machine Learning

Efficient PAC Learning from the Crowd

Empirical Risk Minimization

Unlabeled Data: Now It Helps, Now It Doesn t

Does Unlabeled Data Help?

Fuzzy Limits of Functions

References for online kernel methods

Fast learning rates for plug-in classifiers under the margin condition

Selective Prediction. Binary classifications. Rong Zhou November 8, 2017

Learnability of the Superset Label Learning Problem

Analysis of perceptron-based active learning

Minimax Bounds for Active Learning

Distributed Machine Learning. Maria-Florina Balcan Carnegie Mellon University

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions

Lecture 29: Computational Learning Theory

Computational and Statistical Learning theory

Baum s Algorithm Learns Intersections of Halfspaces with respect to Log-Concave Distributions

Learning Combinatorial Functions from Pairwise Comparisons

Multilevel factor analysis modelling using Markov Chain Monte Carlo (MCMC) estimation

Multiclass Classification-1

Active Learning from Single and Multiple Annotators

PAC Generalization Bounds for Co-training

A Necessary Condition for Learning from Positive Examples

Consistency of Nearest Neighbor Methods

1 Learning Linear Separators

arxiv: v4 [cs.lg] 5 Nov 2014

On a Theory of Learning with Similarity Functions

Generalization bounds

Introduction to Machine Learning

Chapter One. The Real Number System

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

COMS 4771 Lecture Boosting 1 / 16

Solving Classification Problems By Knowledge Sets

Generalization and Overfitting

Transcription:

Liu Yang Steve Hanneke Jaime Carbonell shanneke@stat.cmu.edu Department of Statistics Carnegie Mellon Univsity liuy@cs.cmu.edu Machine Learning Department Carnegie Mellon Univsity jgc@cs.cmu.edu Language Technologies Institute Carnegie Mellon Univsity Abstract We prove that access to a prior distribution ov target functions can dramatically improve the sample complexity of self-tminating active learning algorithms, so that it is always bett than the known results for prior-dependent passive learning. In particular, this is in stark contrast to the analysis of prior-independent algorithms, whe the are simple known learning problems for which no self-tminating algorithm can provide this guarantee for all priors. 1 Introduction and Background Active learning is a powful form of supvised machine learning charactized by intaction between the learning algorithm and supvisor during the learning process. In this work, we consid a variant known as pool-based active learning, in which a learning algorithm is given access to a (typically vy large) collection of unlabeled examples, and is able to select any of those examples, request the supvisor to label it (in agreement with the target concept), then aft receiving the label, selects anoth example from the pool, etc. This sequential label-requesting process continues until some halting crition is reached, at which point the algorithm outputs a function, and the objective is for this function to closely approximate the (unknown) target concept in the future. The primary motivation behind poolbased active learning is that, often, unlabeled examples are inexpensive and available in abundance, while annotating those examples can be costly or time-consuming; as such, we often wish to select only the informative examples to be labeled, thus reducing information-redundancy to some extent, compared to the baseline of selecting the examples to be labeled uniformly at random from the pool (passive learning). Appearing in Proceedings of the14 th Intnational Confence on Artificial Intelligence and Statistics (AISTATS) 011, Fort Lauddale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 011 by the authors. The has recently been an explosion of fascinating theoretical results on the advantages of this type of active learning, compared to passive learning, in tms of the numb of labels required to obtain a prescribed accuracy (called the sample complexity): e.g., FSST97, Das04, DKM09, Das05, Han07b, BHV10, BBL09, Wan09, Kää06, Han07a, DHM07, Fri09, CN08, Now08, BBZ07, Han11, Kol10, Han09, BDL09. In particular, BHV10 show that in noise-free binary classifi learning, for any passive learning algorithm for a concept space of finite VC dimension, the exists an active learning algorithm with asymptotically much small sample complexity for any nontrivial target concept. In lat work, Han09 strengthens this result by removing a ctain strong dependence on the distribution of the data in the learning algorithm. Thus, it appears the are profound advantages to active learning compared to passive learning. Howev, the ability to rapidly convge to a good classifi using only a small numb of labels is only one desirable quality of a machine learning method, and the are oth qualities that may also be important in ctain scenarios. In particular, the ability to vify the pformance of a learning method is often a crucial part of machine learning applications, as (among oth things) it helps us detmine wheth we have enough data to achieve a desired level of accuracy with the given method. In passive learning, one common practice for this vification is to hold out a random sample of labeled examples as a validation sample to evaluate the trained classifi (e.g., to detmine when training is complete). It turns out this technique is not feasible in active learning, since in ord to be really useful as an indicator of wheth we have seen enough labels to guarantee the desired accuracy, the numb of labeled examples in the random validation sample would need to be much larg than the numb of labels requested by the active learning algorithm itself, thus (to some extent) canceling the savings obtained by pforming active rath than passive learning. Anoth common practice in passive learning is to examine the training ror rate of the returned classifi, which can sve as a reasonable indicator of pformance (aft adjusting for model complexity). Howev, again this measure of pformance is not necessarily reasonable for active 816

learning, since the set of examples the algorithm requests the labels of is typically distributed vy diffently from the test examples the classifi will be applied to aft training. This reasoning indicates that pformance vification is (at best) a far more subtle issue in active learning than in passive learning. Indeed, BHV10 note that although the numb of labels required to achieve good accuracy is significantly small than passive learning, it is often the case that the numb of labels required to vify that the accuracy is good is not significantly improved. In particular, this phenomenon can dramatically increase the sample complexity of active learning algorithms that adaptively detmine how many labels to request before tminating. In short, if we require the algorithm both to learn an accurate concept and to know that its concept is accurate, then the numb of labels required by active learning is often not significantly small than the numb required by passive learning. We should note, howev, that the above results we proven for a learning scenario in which the target concept is consided a constant, and no information about the process that genates this concept is known a priori. Altnatively, we can consid a modification of this problem, so that the target concept can be thought of as a random variable, a sample from a known distribution (called a prior) ov the space of possible concepts. Such a setting has been studied in detail in the context of passive learning for noise-free binary classification. In particular, HKS9 found that for any concept space of finite VC dimension d, for any prior and distribution ov data points, O(d/ε) random labeled examples are sufficient for the expected ror rate of the Bayes classifi produced und the postior distribution to be at most ε. Furthmore, it is easy to construct learning problems for which the is an Ω(1/ε) low bound on the numb of random labeled examples required to achieve expected ror rate at most ε, by any passive learning algorithm; for instance, the problem of learning threshold classifis on0, 1 und a uniform data distribution and uniform prior is one such scenario. In the context of active learning (again, with access to the prior), FSST97 analyze the Quy by Committee algorithm, and find that if a ctain information gain quantity for the points requested by the algorithm is lowbounded by a value g, then the algorithm requires only O((d/g) log(1/ε)) labels to achieve expected ror rate at most ε. In particular, they show that this is satisfied for constant g for linear separators und a near-uniform prior, and a near-uniform data distribution ov the unit sphe. This represents a marked improvement ov the results of HKS9 for passive learning, and since the Quy by Committee algorithm is self-vifying, this result is highly relevant to the present discussion. Howev, the condition that the information gains be low-bounded by a constant is quite restrictive, and many intesting learning problems are precluded by this requirement. Furthmore, the exist learning problems (with finite VC dimension) for which the Quy by Committee algorithm makes an expected numb of label requests exceeding Ω(1/ε). To date, the has not been a genal analysis of how the value of g can behave as a function of ε, though such an analysis would likely be quite intesting. In the present pap, we take a more genal approach to the question of active learning with access to the prior. We are intested in the broad question of wheth access to the prior bridges the gap between the sample complexity of learning and the sample complexity of learning with vification. Specifically, we ask the following question. Can a prior-dependent self-tminating active learning algorithm for a concept class of finite VC dimension always achieve expected ror rate at most ε using o(1/ε) label requests? Aft some basic definitions in Section, we begin in Section 3 with a concrete example, namely intval classifis und a uniform data density but arbitrary prior, to illustrate the genal idea, and convey some of the intuition as to why one might expect a positive answ to this question. In Section 4, we present a genal proof that the answ is always yes. As the known results for the sample complexity of passive learning with access to the prior are typically 1/ε HKS9, and this is sometimes tight, this represents an improvement ov passive learning. The proof is simple and accessible, yet represents an important step in undstanding the problem of self-tmination in active learning algorithms, and the genal issue of the complexity of vification. Also, as this is a result that does not genally hold for prior-independent algorithms (even for their avage-case behavior induced by the prior) for ctain concept spaces, this also represents a significant step toward undstanding the inhent value of having access to the prior. Definitions and Preinaries First, we introduce some notation and formal definitions. We denote byx the instance space, representing the range of the unlabeled data points, and we suppose a distribution D on X, which we will ref to as the data distribution. We also suppose the existence of a sequence X 1,X,... of i.i.d. random variables, each with distribution D, refred to as the unlabeled data sequence. Though one could potentially analyze the achievable pformance as a function of the numb of unlabeled points made available to the learning algorithm (cf. Das05), for simplicity in the present work, we will suppose this unlabeled sequence is essentially inexhaustible, corresponding to the practical fact that unlabeled data are typically available in abundance as they are often relatively inexpensive to ob- 817

Liu Yang, Steve Hanneke, Jaime Carbonell tain. Additionally, the is a setcof measurable classifis h : X { 1,+1}, refred to as the concept space. We denote by d the VC dimension of C, and in our present context we will restrict ourselves to spaces C with d <, refred to as a VC class. We also have a probability distributionπ, called the prior, ovc, and a random variable h π, called the target function; we suppose h is independent from the data sequence X 1,X,... We adopt the usual notation for conditional expectations and probabilities ADD00; for instance,ea B can be thought of as an expectation of the value A, und the conditional distribution ofagiven the value ofb (which itself is random), and thus the value of EA B is essentially detmined by the value ofb. For any measurableh : X { 1,+1}, define the ror rate(h) = D({x : h(x) h (x)}). So far, this setup is essentially identical to that of HKS9, FSST97. The protocol in active learning is the following. An active learning algorithm A is given as input the prior π, the data distribution D (though see Section 5), and a value ε (0,1. It also (implicitly) depends on the data sequence X 1,X,..., and has an indirect dependence on the target functionh via the following type of intaction. The algorithm may inspect the values X i for any initial segment of the data sequence, select an index i N to request the label of; aft selecting such an index, the algorithm receives the value h (X i ). The algorithm may then select anoth index, request the label, receive the value of h on that point, etc. This happens for a numb of rounds, N(A,h,ε,D,π), before eventually the algorithm halts and returns a) classifi ĥ. An algorithm is said to be correct ife (ĥ ε for evy(ε,d,π); that is, given direct access to the prior and the data distribution, and given a specified value ε, a correct algorithm must be guaranteed to have expected ror rate at most ε. Define the expected sample complexity ofafor(x,c,d,π) to be the function SC(ε,D,π) = EN(A,h,ε,D,π): the expected numb of label requests the algorithm makes. We will be intested in proving that ctain algorithms achieve a sample complexity SC(ε, D, π) = o(1/ε). For some (X,C,D), it is known that the are π-independent algorithms (meaning the algorithm s behavior is independent of the π argument) A such that we always have EN(A,h,ε,D,π) h = o(1/ε); for instance, threshold classifis have this propty und any D, homogeneous linear separators have this propty und a uniform D on the unit sphe in k dimensions, and intvals with positive width on X = 0,1 have this propty und D = Uniform(0,1) (see e.g., Das05). It is straightforward to show that any suchawill also havesc(ε,d,π) = o(1/ε) for evy π. In particular, the law of total expectation and the dominated convgence theorem imply ε 0 ε SC(ε,D,π) = ε 0 ε EEN(A,h,ε,D,π) h = E ε 0 ε EN(A,h,ε,D,π) h = 0. In these cases, we can think of SC as a kind of avagecase analysis of these algorithms. Howev, the are also many (X,C,D) for which no such π-independent algorithm exists, achieving o(1/ε) sample complexity for all priors. For instance, this is the case for C as the space of intval classifis (including the empty intval) on X = 0,1 undd = Uniform(0,1) (this essentially follows from a proof of BHV10). Thus, any genal result on o(1/ε) expected sample complexity for π-dependent algorithms would signify that the is a real advantage to having access to the prior. 3 An Example: Intvals In this section, we walk through a simple and intuitive example, to illustrate how access to the prior makes a diffence in the sample complexity. For simplicity, in this example (only) we will suppose the algorithm may request the label of any point in X, not just those in the sequence {X i }; the same ideas can easily be adapted to the setting whe quies are restricted to{x i }. Specifically, consid X = 0,1, D uniform on 0,1, and the concept space of intval classifis, whe C = {I ± a,b : 0 < a b < 1}, whe I ± a,b(x) = +1 if x a,b and 1 othwise. For each classifi h C, let w(h) = P(h(x) = +1) (the width of the intval h). Consid an active learning algorithm that makes label requests at the locations (in sequence) 1/,1/4,3/4,1/8,3/8,5/8,7/8,1/16,3/16,... until (case 1) it encounts an example x with h (x) = +1 or until (case ) the set of classifis V C consistent with all obsved labels so far satisfies Ew(h ) V ε (which ev comes first). In case, the algorithm simply halts and returns the constant classifi that always predicts 1: call it h ; note that (h ) = w(h ). In case 1, the algorithm ents a second phase, in which it pforms a binary search (repeatedly quying the midpoint between the closest two 1 and +1 points, taking 0 and 1 as known negative points) to the left and right of the obsved positive point, halting aft log (/ε) label requests on each side; this results in estimates of the target s endpoints up to ±ε/, so that returning any classifi among the set V C consistent with these labels results in ror rate at most ε; in particular, if h is the classifi in V returned, then E( h) V ε. Denoting this algorithm by A, and ĥ the classifi it returns, we have ) ) V E (ĥ = E E (ĥ ε, 818

so that the algorithm is definitely correct. Note that case will definitely be satisfied aft at most ε label requests, and case 1 will definitely be satisfied aft at most w(h ) label requests, so that the algorithm nev makes more than max{w(h ),ε} + log (/ε) label requests. In particular, for any h with w(h ) > 0, N(A,h,ε,D,π) = o(1/ε). Abbreviating N(h ) = N(A,h,ε,D,π), we have EN(h ) = E N(h ) w(h ) = 0 P(w(h ) = 0) +E N(h ) w(h ) > 0 P(w(h ) > 0). (1) Since w(h ) > 0 N(h ) = o(1/ε), the dominated convgence theorem implies εe N(h ) w(h ) > 0 ε 0 = E ε 0 εn(h ) w(h ) > 0 = 0, so that the second tm in (1) iso(1/ε). IfP(w(h ) = 0) = 0, this completes the proof. We focus the rest of the proof on the first tm in (1), in the case that P(w(h ) = 0) > 0: i.e. the is nonzo probability that the target h labels the space almost all negative. Letting V denote the subset of C consistent with all requested labels, note that on the event w(h ) = 0, aft n label requests (for n a pow of ) we have max h V w(h) 1/n. Thus, for any value w ε (0,1), aft at most w ε label requests, on the event that w(h ) = 0, E w(h ) V = w(h)ih Vπ(dh)/π(V) w(h)iw(h) w ε π(dh)/π(v) = Ew(h )Iw(h ) w ε /π(v) Ew(h )Iw(h ) w ε P(w(h. () ) = 0) Now note that, by the dominated convgence theorem, w(h E )Iw(h ) w w 0 = E w w 0 w(h )Iw(h ) w = 0. w Thefore, Ew(h )Iw(h ) w = o(w). If we define w ε as the largest value of w for which Ew(h )Iw(h ) w εp(w(h ) = 0) (or, say, half the supremum if the maximum is not achieved), then we have w ε = ω(ε). Combined with (), this implies E N(h ) w(h ) = 0 = o(1/ε). w ε Thus, all of the tms in (1) are o(1/ε), so that in total EN(h ) = o(1/ε). In conclusion, for this concept space C and data distributiond, we have a correct active learning algorithm achieving a sample complexity SC(ε,D,π) = o(1/ε) for all priorsπ on C. 4 Main Result In this section, we present our main result: a genal result stating that o(1/ε) expected sample complexity is always achievable by some correct active learning algorithm, for any (X,C,D,π) for which C has finite VC dimension. Since the known results for the sample complexity of passive learning with access to the prior are typically Θ(1/ε), and since the are known learning problems (X,C,D,π) for which evy passive learning algorithm requires Ω(1/ε) samples, this o(1/ε) result for active learning represents an improvement ov passive learning. Additionally, as mentioned, this type of result is often not possible for algorithms lacking access to the prior π, as the are wellknown problems(x, C, D) for which no prior-independent correct algorithm (of the self-tminating type studied he) can achieve o(1/ε) sample complexity for evy prior π BHV10; in particular, the intvals problem studied above is one such example. First, we have a small lemma. Lemma 1. For any sequence of functionsφ n : C 0, ) such that, f C, φ n (f) = o(1/n) and n N, φ n (f) c/n (for an f-independent constant c (0, )), the exists a sequence φ n in0, ) such that φ n = o(1/n) and P( φ n (h ) > φ ) n = 0. Proof. For any constantδ (0, ), we have (by Markov s inequality and the dominated convgence theorem) P(nφ n(h ) > δ) 1 δ Enφ n(h ) = 1 δ E nφ n(h ) = 0. Thefore (by induction), the exists a divging sequence n i in N such that i sup n ni P ( nφ n (h ) > i) = 0. Invting this, let i n = max{i N : n i n}, and define φ n (h ) = (1/n) in. By construction, P ( φ n (h ) > φ ) n 0. Furthmore, ni = i n, so that we have implying φ n = o(1/n). n φ n = in = 0, Theorem 1. For any VC class C, the is a correct active learning algorithm that, for evy data distribution D and prior π, achieves expected sample complexity SC for (X,C,D,π) such that SC(ε,D,π) = o(1/ε). 819

Liu Yang, Steve Hanneke, Jaime Carbonell Our approach to proving Theorem 1 is via a reduction to established results about active learning algorithms that are not self-vifying. Specifically, consid a slightly diffent type of active learning algorithm than that defined above: namely, an algorithm A a that takes as input a budget n N on the numb of label requests it is allowed to make, and that aft making at most n label requests returns as output a classifi ĥn. Let us ref to any such algorithm as a budget-based active learning algorithm. Note that budget-based active learning algorithms are prior-independent (have no direct access to the prior). The following result was proven by Han09 (see also the related earli work of BHV10). Lemma. Han09 For any VC class C, the exists a constant c (0, ), a function R(n;f,D), and a (priorindependent) budget-based active learning algorithm A a such that D, f C,R(n;f,D) c/n and R(n;f,D) = o(1/n), ) h and E (ĥn R(n;h,D) (always), whe ĥn is the classifi returned by A a. 1 That is, equivalently, for any fixed value for the target function, the expected ror rate is o(1/n), whe the random variable in the expectation is only the data sequence X 1,X,... Our task in the proof of Theorem 1 is to convt such a budget-based algorithm into one of the form defined in Section 1: that is, a self-tminating priordependent algorithm, taking ε as input. Proof of Theorem 1. Consid A a, ĥ n, R, and c as in Lemma, and define { ) } n ε = min n N : E (ĥn ε. This value is accessible based purely on access to π and D. Furthmore, ) we clearly have (by construction) E (ĥnε ε. Thus, denoting by A a the active learning algorithm, taking(d,π,ε) as input, which runsa a (n ε ) and then returnsĥn ε, we have thata a is a correct algorithm (i.e., its expected ror rate is at most ε). As for the expected sample complexity SC(ε, D, π) achieved by A a, we have SC(ε,D,π) n ε, so that it remains only to bound n ε. By Lemma 1, the is a π- dependent function R(n;π,D) such that π, π({f C : R(n;f,D) > R(n;π,D)}) 0 and R(n;π,D) = o(1/n). 1 Furthmore, it is not difficult to see that we can take this R to be measurable in theh argument. Thefore, by the law of total expectation, ) ) E (ĥn h = E E (ĥn ER(n;h,D) c π({f C : R(n;f,D)>R(n;π,D)})+R(n;π,D) n = o(1/n). If n ε = O(1), then clearly n ε = o(1/ε) as needed. Othwise, since n ε is monotonic in ε, we must have n ε as ε 0. In particular, in this latt case we have ε 0 ε n ε ε 0 ε ( { 1+max = ε max ni E ε 0 n n ε 1 ε max ne ε 0 n n ε 1 = max ne ε 0 n n ε 1 so that n ε = o(1/ε), as required. ) (ĥn n n ε 1 : E ) (ĥn /ε > 1 ) (ĥn /ε ) (ĥn = supne 5 Dependence on D in the Learning Algorithm }) > ε ) (ĥn = 0, The dependence on D in the algorithm described in the proof is fairly weak, and we can einate ) any direct dependence on D by replacing (ĥn by a 1 ε/ confidence upp bound based on m ε = Ω ( 1 ε log 1 ε) i.i.d. unlabeled examples X 1,X,...,X m ε independent from the examples used by the algorithm: for instance, set aside in a pre-processing step, whe the bound is dived based on Hoeffding s inequality and a union bound ov the values of n that we check, of which the are at most O(1/ε). Then we simply increase the value of n (starting at some constant, such as 1) until 1 m ε m ε i=1 ( P ) h (X i) ĥn(x i) {X j } j,{x j} j ε/. The expected value of the smallest value ofnfor which this occurs is o(1/ε). Note that the probability only requires access to the prior π, not the data distribution D (the budgetbased algorithm A a of Han09 has no direct dependence on D); if desired for computational efficiency, this probability may also be estimated by a 1 ε/4 confidence upp bound based on Ω ( 1 ε log ε) 1 independent samples of h values with distribution π, whe for each sample we simulate the execution of A a (n) for that (simulated) target function in ord to obtain the returned classifi. In particular, note that no actual label requests to the oracle are required during this process of estimating the appropriate label budget n ε, as all executions of A a are simulated. 80

6 Inhent Dependence on π in the Sample Complexity We have shown that for evy priorπ, the sample complexity is bounded by a function that iso(1/ε). One might wond wheth it is possible that the asymptotic dependence on ε in the sample complexity can be prior-independent, while still being o(1/ε). That is, we can ask wheth the exists a (π-independent) function s(ε) = o(1/ε) such that, for all priors π, the is a correct π-dependent algorithm achieving a sample complexity SC(ε, D, π) = O(s(ε)), possibly involving π-dependent constants. Ctainly in some cases, such as threshold classifis, this is true. Howev, it seems this is not genally the case, and in particular it fails to hold for the space of intval classifis. For instance, consid a prior π on the space C of intval classifis, constructed as follows. We are given an arbitrary monotonic g(ε) = o(1/ε); since g(ε) = o(1/ε), the must exist (nonzo) functions q 1 (i) and q (i) such that i q 1 (i) = i q (i) = 0 and i N,g(q 1 (i)/ i+1 ) q (i) i ; furthmore, letting q(i) = max{q 1 (i),q (i)}, by monotonicity of g we also have i N,g(q(i)/ i+1 ) q(i) i, and i q(i) = 0. Then define a function p(i) with i Np(i) = 1 such that p(i) q(i) for infinitely many i N; for instance, this can be done inductively as follows. Let α 0 = 1/; for each i N, if q(i) > α i 1, set p(i) = 0 and α i = α i 1 ; othwise, set p(i) = α i 1 and α i = α i 1 /. Finally, for ({ each i N, and}) each j {0,1,..., i 1}, define π I ± j i,(j+1) i = p(i)/ i. We let D be uniform on X = 0,1. Then for each i N s.t. p(i) q(i), the is ap(i) probability the target intval has width i, and given this any algorithm requires i expected numb of requests to detmine which of these i intvals is the target, failing which the ror rate is at least i. In particular, letting ε i = p(i)/ i+1, any correct algorithm has sample complexity at least p(i) i for ε = ε i. Notingp(i) i q(i) i g(q(i)/ i+1 ) g(ε i ), this implies the exist arbitrarily small values ofε > 0 for which the optimal sample complexity is at least g(ε), so that the sample complexity is not o(g(ε)). For any s(ε) = o(1/ε), the exists a monotonic g(ε) = o(1/ε) such that s(ε) = o(g(ε)). Thus, constructing π as above for this g, we have that the sample complexity is not o(g(ε)), and thefore not O(s(ε)). So at least for the space of intval classifis, the specific o(1/ε) asymptotic dependence onεis inhently π-dependent. Refences ADD00 R. B. Ash and C. A. Doléans-Dade. Probability & Measure Theory. Academic Press, 000. BBL09 BBZ07 BDL09 M.-F. Balcan, A. Beygelzim, and J. Langford. Agnostic active learning. Journal of Comput and System Sciences, 75(1):78 89, 009. M.-F. Balcan, A. Brod, and T. Zhang. Margin based active learning. In Proceedings of the 0 th Confence on Learning Theory, 007. A. Beygelzim, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the Intnational Confence on Machine Learning, 009. BHV10 M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80( 3):111 139, Septemb 010. CN08 Das04 Das05 R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):339 353, July 008. S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337 344. MIT Press, 004. S. Dasgupta. Coarse sample complexity bounds for active learning. In Proc. of Neural Information Processing Systems (NIPS), 005. DHM07 S. Dasgupta, D. Hsu, and C. Monteleoni. A genal agnostic active learning algorithm. In Advances in Neural Information Processing Systems 0, 007. DKM09 S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of pceptron-based active learning. Journal of Machine Learning Research, 10:81 99, 009. Fri09 E. Friedman. Active learning for smooth problems. In Proceedings of the nd Confence on Learning Theory, 009. FSST97 Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the quy by committee algorithm. In Machine Learning, pages 133 168, 1997. Han07a S. Hanneke. A bound on the label complexity of agnostic active learning. In Proc. of the 4th Intnational Confence on Machine Learning, 007. Han07b S. Hanneke. Teaching dimension and the complexity of active learning. In Proc. of the 0th Annual Confence on Learning Theory (COLT), 007. 81

Liu Yang, Steve Hanneke, Jaime Carbonell Han09 S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon Univsity, 009. Han11 S. Hanneke. Rates of convgence in active learning. The Annals of Statistics, 39(1):333 361, 011. HKS9 D. Haussl, M. Kearns, and R. Schapire. Bounds on the sample complexity of bayesian learning using information theory and the vc dimension. In Machine Learning, pages 61 74. Morgan Kaufmann, 199. Kää06 M. Kääriäinen. Active learning in the nonrealizable case. In Proc. of the 17th Intnational Confence on Algorithmic Learning Theory, 006. Kol10 V. Koltchinskii. Rademach complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, To Appear, 010. Now08 R. D. Nowak. Genalized binary search. In Proceedings of the 46 th Annual Allton Confence on Communication, Control, and Computing, 008. Wan09 L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems, 009. 8