Agnostic Active Learning Without Constraints


Alina Beygelzimer (IBM Research, Hawthorne, NY), John Langford (Yahoo! Research, New York, NY), Daniel Hsu (Rutgers University & University of Pennsylvania), Tong Zhang (Rutgers University, Piscataway, NJ)

Abstract

We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches, where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

1 Introduction

In active learning, a learner is given access to unlabeled data and is allowed to adaptively choose which ones to label. This learning model is motivated by applications in which the cost of labeling data is high relative to that of collecting the unlabeled data itself. Therefore, the hope is that the active learner only needs to query the labels of a small number of the unlabeled data, and otherwise performs as well as a fully supervised learner. In this work, we are interested in agnostic active learning algorithms for binary classification that are provably consistent, i.e., that converge to an optimal hypothesis in a given hypothesis class. One technique that has proved theoretically profitable is to maintain a candidate set of hypotheses (sometimes called a version space), and to query the label of a point only if there is disagreement within this set about how to label the point. The criterion for membership in this candidate set needs to be carefully defined so that an optimal hypothesis is always included, but otherwise this set can be quickly whittled down as more labels are queried.
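As a concrete, deliberately simplified illustration of the version space technique just described (and not of this paper's algorithm, which avoids version spaces), the following sketch runs selective sampling in a noise-free setting over a finite class of one-dimensional threshold rules; a label is queried only when the surviving candidates disagree. The class, data distribution, and all names are illustrative assumptions:

```python
import random

# Minimal sketch of version-space selective sampling in the noise-free
# setting (the approach this paper moves away from). Hypotheses are
# thresholds h_t(x) = +1 iff x >= t; a label is queried only when the
# current version space disagrees on the point. Entirely illustrative.

random.seed(0)
TRUE_T = 0.6                                   # target threshold (realizable case)
thresholds = [i / 100 for i in range(101)]     # finite hypothesis class
version_space = set(thresholds)                # hypotheses consistent so far

queries = 0
for _ in range(200):
    x = random.random()
    preds = {+1 if x >= t else -1 for t in version_space}
    if len(preds) > 1:                         # disagreement: query the label
        queries += 1
        y = +1 if x >= TRUE_T else -1          # noise-free oracle
        version_space = {t for t in version_space
                         if (+1 if x >= t else -1) == y}

print(queries, len(version_space))
```

In this noise-free toy case the version space shrinks rapidly, so only a small fraction of the 200 points are queried; the brittleness discussed next arises because a single inconsistent label would eliminate the optimal threshold from the candidate set permanently.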
This technique is perhaps most readily understood in the noise-free setting [1, 2], and it can be extended to noisy settings by using empirical confidence bounds [3, 4, 5, 6, 7]. The version space approach unfortunately has its share of significant drawbacks. The first is computational intractability: maintaining a version space and guaranteeing that only hypotheses from this set are returned is difficult for linear predictors, and appears intractable for interesting nonlinear predictors such as neural nets and decision trees [1]. Another drawback of the approach is its brittleness: a single mishap due to, say, modeling failures or computational approximations might cause the learner to exclude the best hypothesis from the version space forever; this is an ungraceful failure mode that is not easy to correct. A third drawback is related to sample re-usability: if labeled data is collected using a version space-based active learning algorithm, and we later decide to use a different algorithm or hypothesis class, then the earlier data may not be freely re-used because its collection process is inherently biased.

Here, we develop a new strategy addressing all of the above problems, given an oracle that returns an empirical risk minimizing (ERM) hypothesis. As this oracle matches our abstraction of many supervised learning algorithms, we believe active learning algorithms built in this way are immediately and widely applicable. Our approach instantiates the importance weighted active learning framework of [5] using a rejection threshold similar to the algorithm of [4], which only accesses hypotheses via a supervised learning oracle. However, the oracle we require is simpler and avoids strict adherence to a candidate set of hypotheses. Moreover, our algorithm creates an importance weighted sample that allows for unbiased risk estimation, even for hypotheses from a class different from the one employed by the active learner. This is in sharp contrast to many previous algorithms (e.g., [1, 3, 8, 4, 6, 7]) that create heavily biased data sets. We prove that our algorithm is always consistent and has an improved label complexity over passive learning in cases previously studied in the literature. We also describe a practical instantiation of our algorithm and report on some experimental results.

1.1 Related Work

As already mentioned, our work is closely related to the previous works of [4] and [5], both of which in turn draw heavily on the work of [1] and [3]. The algorithm from [4] extends the selective sampling method of [1] to the agnostic setting using generalization bounds, in a manner similar to that first suggested in [3]. It accesses hypotheses only through a special ERM oracle that can enforce an arbitrary number of example-based constraints; these constraints define a version space, and the algorithm only ever returns hypotheses from this space, which can be undesirable, as we previously argued. Other previous algorithms with comparable performance guarantees also require similar example-based constraints (e.g., [3, 5, 6, 7]).
Our algorithm differs from these in that (i) it never restricts its attention to a version space when selecting a hypothesis to return, and (ii) it only requires an ERM oracle that enforces at most one example-based constraint, and this constraint is only used for selective sampling. Our label complexity bounds are comparable to those proved in [5], though somewhat worse than those in [3, 4, 6, 7]. The use of importance weights to correct for sampling bias is a standard technique for many machine learning problems (e.g., [9, 10, 11]), including active learning [12, 13, 5]. Our algorithm is based on the importance weighted active learning (IWAL) framework introduced by [5]. In that work, a rejection threshold procedure called loss-weighting is rigorously analyzed and shown to yield improved label complexity bounds in certain cases. Loss-weighting is more general than our technique in that it extends beyond zero-one loss to a certain subclass of loss functions such as logistic loss. On the other hand, the loss-weighting rejection threshold requires optimizing over a restricted version space, which is computationally undesirable. Moreover, the label complexity bound given in [5] only applies to hypotheses selected from this version space, and not when selected from the entire hypothesis class as the general IWAL framework suggests. We avoid these deficiencies using a new rejection threshold procedure and a more subtle martingale analysis. Many of the previously mentioned algorithms are analyzed in the agnostic learning model, where no assumption is made about the noise distribution (see also [4]). In this setting, the label complexity of active learning algorithms cannot generally improve over supervised learners by more than a constant factor [15, 5].
However, under a parameterization of the noise distribution related to Tsybakov's low-noise condition [16], active learning algorithms have been shown to have improved label complexity bounds over what is achievable in the purely agnostic setting [17, 18, 8, 6, 7]. We also consider this parameterization to obtain a tighter label complexity analysis.

2 Preliminaries

2.1 Learning Model

Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y} = \{\pm 1\}$ are the labels. Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a pair of random variables with joint distribution $D$. An active learner receives a sequence $(X_1, Y_1), (X_2, Y_2), \ldots$ of i.i.d. copies of $(X, Y)$, with the label $Y_i$ hidden unless it is explicitly queried. We use the shorthand $a_{1:k}$ to denote a sequence $(a_1, a_2, \ldots, a_k)$ (so $k = 0$ corresponds to the empty sequence).
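To make the definitions of $\operatorname{err}(h)$ and $h^*$ concrete, here is a small synthetic instance of this learning model (purely illustrative; the distribution, the threshold class, and all names are assumptions, not from the paper): a finite class of threshold rules, labels flipped with probability 0.1, and a Monte Carlo estimate of each hypothesis's error used to locate the minimizer.

```python
import random

# Toy instance of the agnostic learning model of Section 2.1 (illustrative):
# X = [0, 1), Y = {-1, +1}, and H a small finite class of threshold rules.
# We estimate err(h) = Pr(h(X) != Y) by Monte Carlo and pick h* by enumeration.

random.seed(1)

def draw():                           # one sample (X, Y) from a noisy D
    x = random.random()
    y = +1 if x >= 0.5 else -1
    if random.random() < 0.1:         # 10% label noise: the agnostic setting
        y = -y
    return x, y

H = [lambda x, t=t: +1 if x >= t else -1 for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
sample = [draw() for _ in range(20000)]
errs = [sum(h(x) != y for x, y in sample) / len(sample) for h in H]
h_star = min(range(len(H)), key=lambda i: errs[i])
print(h_star, errs[h_star])
```

With labels flipped 10% of the time, no hypothesis can do better than $\operatorname{err}(h^*) \approx 0.1$; the active learner's goal is to approach this error with few label queries.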

Let $H$ be a set of hypotheses mapping from $\mathcal{X}$ to $\mathcal{Y}$. For simplicity, we assume $H$ is finite but does not completely agree on any single $x \in \mathcal{X}$ (i.e., for every $x \in \mathcal{X}$, there exist $h, h' \in H$ such that $h(x) \neq h'(x)$). This keeps the focus on the relevant aspects of active learning that differ from passive learning. The error of a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ is $\operatorname{err}(h) := \Pr(h(X) \neq Y)$. Let $h^* := \arg\min\{\operatorname{err}(h) : h \in H\}$ be a hypothesis of minimum error in $H$. The goal of the active learner is to return a hypothesis $h \in H$ with error $\operatorname{err}(h)$ not much more than $\operatorname{err}(h^*)$, using as few label queries as possible.

2.2 Importance Weighted Active Learning

In the importance weighted active learning (IWAL) framework of [5], an active learner looks at the unlabeled data $X_1, X_2, \ldots$ one at a time. After each new point $X_i$, the learner determines a probability $P_i \in [0, 1]$. Then a coin with bias $P_i$ is flipped, and the label $Y_i$ is queried if and only if the coin comes up heads. The query probability $P_i$ can depend on all previous unlabeled examples $X_{1:i-1}$, any previously queried labels, any past coin flips, and the current unlabeled point $X_i$. Formally, an IWAL algorithm specifies a rejection threshold function $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to [0, 1]$ for determining these query probabilities. Let $Q_i \in \{0, 1\}$ be a random variable conditionally independent of the current label $Y_i$ given $X_{1:i}$, $Y_{1:i-1}$, and $Q_{1:i-1}$, with conditional expectation
$$\mathbb{E}[Q_i \mid Z_{1:i-1}, X_i] = P_i := p(Z_{1:i-1}, X_i),$$
where $Z_j := (X_j, Y_j, Q_j)$. That is, $Q_i$ indicates whether the label $Y_i$ is queried (the outcome of the coin toss). Although the notation does not explicitly suggest this, the query probability $P_i = p(Z_{1:i-1}, X_i)$ is allowed to explicitly depend on a label $Y_j$ ($j < i$) if and only if it has been queried ($Q_j = 1$).

2.3 Importance Weighted Estimators

We first review some standard facts about the importance weighting technique. For a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, define the importance weighted estimator of $\mathbb{E}[f(X, Y)]$ from $Z_{1:n} \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^n$ to be
$$\hat{f}(Z_{1:n}) := \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot f(X_i, Y_i).$$
Note that this quantity depends on a label $Y_i$ only if it has been queried (i.e., only if $Q_i = 1$); it also depends on $X_i$ only if $Q_i = 1$. Our rejection threshold will be based on a specialization of this estimator, specifically the importance weighted empirical error of a hypothesis $h$:
$$\operatorname{err}(h, Z_{1:n}) := \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot \mathbf{1}[h(X_i) \neq Y_i]. \qquad (1)$$
In the notation of Algorithm 1, this is equivalent to
$$\operatorname{err}(h, S_n) := \frac{1}{n} \sum_{(X_i, Y_i, 1/P_i) \in S_n} \frac{1}{P_i} \cdot \mathbf{1}[h(X_i) \neq Y_i],$$
where $S_n \subseteq \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$ is the importance weighted sample collected by the algorithm. A basic property of these estimators is unbiasedness:
$$\mathbb{E}[\hat{f}(Z_{1:n})] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\mathbb{E}[(Q_i / P_i) f(X_i, Y_i) \mid X_{1:i}, Y_{1:i}, Q_{1:i-1}]\big] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[(P_i / P_i) f(X_i, Y_i)] = \mathbb{E}[f(X, Y)].$$
So, for example, the importance weighted empirical error of a hypothesis $h$ is an unbiased estimator of its true error $\operatorname{err}(h)$. This holds for any choice of the rejection threshold that guarantees $P_i > 0$.

3 A Deviation Bound for Importance Weighted Estimators

As mentioned before, the rejection threshold used by our algorithm is based on importance weighted error estimates $\operatorname{err}(h, Z_{1:n})$. Even though these estimates are unbiased, they are only reliable when

the variance is not too large. To get a handle on this, we need a deviation bound for importance weighted estimators. This is complicated by two factors that rule out straightforward applications of some standard bounds:

1. The importance weighted samples $(X_i, Y_i, 1/P_i)$ (or equivalently, the $Z_i = (X_i, Y_i, Q_i)$) are not i.i.d. This is because the query probability $P_i$ (and thus the importance weight $1/P_i$) generally depends on $Z_{1:i-1}$ and $X_i$.

2. The effective range and variance of each term in the estimator are, themselves, random variables.

To address these issues, we develop a deviation bound using a martingale technique from [19]. Let $f : \mathcal{X} \times \mathcal{Y} \to [-1, 1]$ be a bounded function. Consider any rejection threshold function $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to (0, 1]$ for which $P_n = p(Z_{1:n-1}, X_n)$ is bounded below by some positive quantity (which may depend on $n$). Equivalently, the query probabilities $P_n$ should have inverses $1/P_n$ bounded above by some deterministic quantity $r_{\max}$ (which, again, may depend on $n$). The a priori upper bound $r_{\max}$ on $1/P_n$ can be pessimistic, as the dependence on $r_{\max}$ in the final deviation bound will be very mild (it enters in as $\log \log r_{\max}$). Our goal is to prove a bound on $|\hat{f}(Z_{1:n}) - \mathbb{E}[f(X, Y)]|$ that holds with high probability over the joint distribution of $Z_{1:n}$. To start, we establish bounds on the range and variance of each term $W_i := (Q_i / P_i) f(X_i, Y_i)$ in the estimator, conditioned on $(X_{1:i-1}, Y_{1:i-1}, Q_{1:i-1})$. Let $\mathbb{E}_i[\,\cdot\,]$ denote $\mathbb{E}[\,\cdot \mid X_{1:i}, Y_{1:i}, Q_{1:i-1}]$. Note that $\mathbb{E}_i[W_i] = (\mathbb{E}_i[Q_i] / P_i) f(X_i, Y_i) = f(X_i, Y_i)$, so if $\mathbb{E}_i[W_i] = 0$, then $W_i = 0$. Therefore, the conditional range and variance are non-zero only if $\mathbb{E}_i[W_i] \neq 0$. For the range, we have $|W_i| = (Q_i / P_i) |f(X_i, Y_i)| \leq 1/P_i$; and for the variance, $\mathbb{E}_i[(W_i - \mathbb{E}_i[W_i])^2] \leq (\mathbb{E}_i[Q_i^2] / P_i^2) f(X_i, Y_i)^2 \leq 1/P_i$. These range and variance bounds indicate the form of the deviations we can expect, similar to that of other classical deviation bounds.

Theorem 1. Pick any $t \geq 0$ and $n \geq 1$. Assume $1/P_i \leq r_{\max}$ for all $1 \leq i \leq n$, and let $R_n := 1 / \min(\{P_i : 1 \leq i \leq n \wedge f(X_i, Y_i) \neq 0\} \cup \{1\})$.
With probability at least $1 - 2(3 + \log_2 r_{\max}) e^{-t/2}$,
$$\left| \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} f(X_i, Y_i) - \mathbb{E}[f(X, Y)] \right| \leq \sqrt{\frac{2 R_n t}{n}} + \sqrt{\frac{2t}{n}} + \frac{R_n t}{3n}.$$
We defer all proofs to the appendices.

4 Algorithm

First, we state a deviation bound for the importance weighted error of hypotheses in a finite hypothesis class $H$ that holds for all $n$. It is a simple consequence of Theorem 1 and union bounds; the form of the bound motivates certain algorithmic choices to be described below.

Lemma 1. Pick any $\delta \in (0, 1)$. For all $n \geq 1$, let
$$\varepsilon_n := \frac{16 \log\big(2 (3 + n \log_2 n) \, n (n+1) \, |H| / \delta\big)}{n} = O\left(\frac{\log(n |H| / \delta)}{n}\right). \qquad (3)$$
Let $Z_1, Z_2, \ldots \in \mathcal{X} \times \mathcal{Y} \times \{0,1\}$ be the sequence of random variables specified in Section 2.2, using a rejection threshold $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to [0, 1]$ that satisfies $p(z_{1:n-1}, x) \geq 1/n^n$ for all $(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-1} \times \mathcal{X}$ and all $n \geq 1$. The following holds with probability at least $1 - \delta$: for all $n \geq 1$ and all $h \in H$,
$$\left| \big(\operatorname{err}(h, Z_{1:n}) - \operatorname{err}(h^*, Z_{1:n})\big) - \big(\operatorname{err}(h) - \operatorname{err}(h^*)\big) \right| \leq \sqrt{\frac{\varepsilon_n}{P_{\min,n}(h)}} + \frac{\varepsilon_n}{P_{\min,n}(h)}, \qquad (4)$$
where $P_{\min,n}(h) := \min(\{P_i : 1 \leq i \leq n \wedge h(X_i) \neq h^*(X_i)\} \cup \{1\})$.

We let $C_0 = O(\log(|H| / \delta)) \geq 2$ be a quantity such that $\varepsilon_n$ as defined in Eq. (3) is bounded as $\varepsilon_n \leq C_0 \log(n+1)/n$. The following absolute constants are used in the description of the rejection

Algorithm 1

Notes: see Eq. (1) for the definition of $\operatorname{err}$ (importance weighted error), and Section 4 for the definitions of $C_0$, $c_1$, and $c_2$.

Initialize: $S_0 := \emptyset$.

For $k = 1, 2, \ldots, n$:

1. Obtain unlabeled data point $X_k$.

2. Let $h_k := \arg\min\{\operatorname{err}(h, S_{k-1}) : h \in H\}$ and $h'_k := \arg\min\{\operatorname{err}(h, S_{k-1}) : h \in H \wedge h(X_k) \neq h_k(X_k)\}$. Let $G_k := \operatorname{err}(h'_k, S_{k-1}) - \operatorname{err}(h_k, S_{k-1})$, and
$$P_k := \begin{cases} 1 & \text{if } G_k \leq \sqrt{C_0 \log k / (k-1)} + C_0 \log k / (k-1) \\ s & \text{otherwise,} \end{cases}$$
where $s \in (0, 1)$ is the positive solution to the equation
$$G_k = \left( \frac{c_1}{\sqrt{s}} - c_1 + 1 \right) \sqrt{\frac{C_0 \log k}{k-1}} + \left( \frac{c_2}{s} - c_2 + 1 \right) \frac{C_0 \log k}{k-1}. \qquad (2)$$

3. Toss a biased coin with $\Pr(\text{heads}) = P_k$. If heads, then query $Y_k$ and let $S_k := S_{k-1} \cup \{(X_k, Y_k, 1/P_k)\}$; else, let $S_k := S_{k-1}$.

Return: $h_{n+1} := \arg\min\{\operatorname{err}(h, S_n) : h \in H\}$.

Figure 1: Algorithm for importance weighted active learning with an error minimization oracle.

threshold and the subsequent analysis: $c_2 := 5$, $c_3 := \big((c_1 + \sqrt{2})/(c_1 - \sqrt{2})\big)^2$, $c_4 := (c_1 + \sqrt{c_3})^2$, and $c_5 := c_2 + c_3$. Our proposed algorithm is shown in Figure 1. The rejection threshold (Step 2) is based on the deviation bound from Lemma 1. First, the importance weighted error minimizing hypothesis $h_k$ and the alternative hypothesis $h'_k$ are found. Note that both optimizations are over the entire hypothesis class $H$, with $h'_k$ only being required to disagree with $h_k$ on $X_k$; this is a key aspect where our algorithm differs from previous approaches. The difference in importance weighted errors $G_k$ of the two hypotheses is then computed. If $G_k \leq \sqrt{C_0 \log k / (k-1)} + C_0 \log k / (k-1)$, then the query probability $P_k$ is set to $1$. Otherwise, $P_k$ is set to the positive solution $s$ to the quadratic equation in Eq. (2). The functional form of $P_k$ is roughly $\min\{1, (1/G_k^2 + 1/G_k) \cdot C_0 \log k / k\}$. It can be checked that $P_k \in (0, 1]$ and that $P_k$ is non-increasing with $G_k$. It is also useful to note that $\log k / (k-1)$ is monotonically decreasing with $k$ (we use the convention $\log 1 / 0 = \infty$). In order to apply Lemma 1 with our rejection threshold, we need to establish the very crude bound $P_k \geq 1/k^k$ for all $k$.

Lemma 2.
The rejection threshold of Algorithm 1 satisfies $p(z_{1:n-1}, x) \geq 1/n^n$ for all $n \geq 1$ and all $(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-1} \times \mathcal{X}$.

Note that this is a worst-case bound; our analysis shows that the probabilities $P_k$ are more like $1/\operatorname{poly}(k)$ in the typical case.

5 Analysis

5.1 Correctness

We first prove a consistency guarantee for Algorithm 1 that bounds the generalization error of the importance weighted empirical error minimizer. The proof actually establishes a lower bound on

the query probabilities, $P_i \geq 1/2$ for $X_i$ such that $h_n(X_i) \neq h^*(X_i)$. This offers an intuitive characterization of the weighting landscape induced by the importance weights $1/P_i$.

Theorem 2. The following holds with probability at least $1 - \delta$. For any $n \geq 1$,
$$0 \leq \operatorname{err}(h_n) - \operatorname{err}(h^*) \leq \operatorname{err}(h_n, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1}.$$
This implies, for all $n \geq 1$,
$$\operatorname{err}(h_n) \leq \operatorname{err}(h^*) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1}.$$
Therefore, the final hypothesis returned by Algorithm 1 after seeing $n$ unlabeled data has roughly the same error bound as a hypothesis returned by a standard passive learner with $n$ labeled data. A variant of this result under certain noise conditions is given in the appendix.

5.2 Label Complexity Analysis

We now bound the number of labels requested by Algorithm 1 after $n$ iterations. The following lemma bounds the probability of querying the label $Y_n$; this is subsequently used to establish the final bound on the expected number of labels queried. The key to the proof is in relating empirical error differences (and their deviations) to the probability of querying a label. This is mediated through the disagreement coefficient, a quantity first used by [14] for analyzing the label complexity of the $A^2$ algorithm of [3]. The disagreement coefficient $\theta := \theta(h^*, H, D)$ is defined as
$$\theta(h^*, H, D) := \sup\left\{ \frac{\Pr(X \in \operatorname{DIS}(h^*, r))}{r} : r > 0 \right\},$$
where
$$\operatorname{DIS}(h^*, r) := \{x \in \mathcal{X} : \exists h \in H \text{ such that } \Pr(h^*(X) \neq h(X)) \leq r \text{ and } h^*(x) \neq h(x)\}$$
(the disagreement region around $h^*$ at radius $r$). This quantity is bounded for many learning problems studied in the literature; see [14, 6, 20, 21] for more discussion. Note that the supremum can instead be taken over $r > \epsilon$ if the target excess error is $\epsilon$, which allows for a more detailed analysis.

Lemma 3. Assume the bounds from Eq. (4) hold for all $h \in H$ and $n \geq 1$. For any $n \geq 1$,
$$\mathbb{E}[Q_n] \leq \theta \cdot 2 \operatorname{err}(h^*) + O\left( \theta \cdot \sqrt{\frac{C_0 \log n}{n-1}} + \theta \cdot \frac{C_0 \log^2 n}{n-1} \right).$$

Theorem 3. With probability at least $1 - \delta$, the expected number of labels queried by Algorithm 1 after $n$ iterations is at most
$$1 + \theta \cdot 2 \operatorname{err}(h^*) \cdot n + O\left( \theta \cdot \sqrt{C_0 \, n \log n} + \theta \cdot C_0 \log^3 n \right).$$
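The behavior described by Theorems 2 and 3 can be watched on synthetic data with a simplified, hypothetical rendering of Algorithm 1 (not the exact algorithm): ERM over a tiny finite threshold class stands in for the supervised learning oracle, and the query probability uses the rough functional form $\min\{1, (1/G_k^2 + 1/G_k) \cdot C_0 \log k/(k-1)\}$ noted in Section 4 instead of solving Eq. (2) exactly. The class, data distribution, and parameter values are assumptions for illustration only.

```python
import math
import random

# Simplified sketch of Algorithm 1 (illustrative): thresholds on [0, 1)
# play the role of H, ERM is done by enumeration, and the rejection
# threshold uses the approximate form of P_k rather than Eq. (2).

random.seed(2)
C0, n = 1.0, 1500
H = [0.1, 0.3, 0.5, 0.7, 0.9]                  # hypothetical finite class

def predict(t, x):
    return +1 if x >= t else -1

def iw_err(t, S, k):                           # importance weighted error, Eq. (1)
    if k <= 0:
        return 0.0
    return sum(w * (predict(t, x) != y) for x, y, w in S) / k

def draw():                                    # synthetic D with 5% label noise
    x = random.random()
    y = +1 if x >= 0.5 else -1
    return (x, -y) if random.random() < 0.05 else (x, y)

S, queries = [], 0
for k in range(1, n + 1):
    x, true_y = draw()
    errs = {t: iw_err(t, S, k - 1) for t in H}
    h_k = min(H, key=lambda t: errs[t])
    alts = [t for t in H if predict(t, x) != predict(h_k, x)]
    if not alts:                               # all of H agrees on x; the paper
        continue                               # assumes this never happens
    h_alt = min(alts, key=lambda t: errs[t])
    G = errs[h_alt] - errs[h_k]
    eps = C0 * math.log(k + 1) / max(k - 1, 1)
    P = 1.0 if G <= math.sqrt(eps) + eps else min(1.0, eps / G**2 + eps / G)
    if random.random() < P:                    # biased coin: query, store weight 1/P_k
        queries += 1
        S.append((x, true_y, 1.0 / P))

h_final = min(H, key=lambda t: iw_err(t, S, n))
print(h_final, queries, n)
```

Early rounds query every label (the deviation term dominates $G_k$); once the empirical gap between $h_k$ and its forced alternative is statistically significant, the query probability decays, so the final hypothesis matches the passive ERM choice while using noticeably fewer labels.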
The bound is dominated by a linear term scaled by $\operatorname{err}(h^*)$, plus a sublinear term. The linear term $\operatorname{err}(h^*) \cdot n$ is unavoidable in the worst case, as evident from label complexity lower bounds [15, 5]. When $\operatorname{err}(h^*)$ is negligible (e.g., the data is separable) and $\theta$ is bounded (as is the case for many problems studied in the literature [14]), then the bound represents a polynomial label complexity improvement over supervised learning, similar to that achieved by the version space algorithm from [5].

5.3 Analysis under Low Noise Conditions

Some recent work on active learning has focused on improved label complexity under certain noise conditions [17, 18, 8, 6, 7]. Specifically, it is assumed that there exist constants $\kappa > 0$ and $0 < \alpha \leq 1$ such that
$$\Pr(h(X) \neq h^*(X)) \leq \kappa \cdot \big(\operatorname{err}(h) - \operatorname{err}(h^*)\big)^{\alpha} \qquad (5)$$
for all $h \in H$. This is related to Tsybakov's low noise condition [16]. Essentially, this condition requires that low error hypotheses not be too far from the optimal hypothesis $h^*$ under the disagreement metric $\Pr(h(X) \neq h^*(X))$. Under this condition, Lemma 3 can be improved, which in turn yields the following theorem.

Theorem 4. Assume that for some value of $\kappa > 0$ and $0 < \alpha \leq 1$, the condition in Eq. (5) holds for all $h \in H$. There is a constant $c_\alpha > 0$ depending only on $\alpha$ such that the following holds. With probability at least $1 - \delta$, the expected number of labels queried by Algorithm 1 after $n$ iterations is at most
$$\theta \cdot \kappa \cdot c_\alpha \cdot (C_0 \log n)^{\alpha/2} \cdot n^{1 - \alpha/2}.$$
Note that the bound is sublinear in $n$ for all $0 < \alpha \leq 1$, which implies label complexity improvements whenever $\theta$ is bounded (an improved analogue of Theorem 2 under these conditions can be established using similar techniques). The previous algorithms of [6, 7] obtain even better rates under these noise conditions using specialized data-dependent generalization bounds, but those algorithms also require optimizations over restricted version spaces, even for the bound computation.

6 Experiments

Although agnostic learning is typically intractable in the worst case, empirical risk minimization can serve as a useful abstraction for many practical supervised learning algorithms in non-worst-case scenarios. With this in mind, we conducted a preliminary experimental evaluation of Algorithm 1, implemented using a popular algorithm for learning decision trees in place of the required ERM oracle. Specifically, we use the J48 algorithm from Weka v3.6.2 (with default parameters) to select the hypothesis $h_k$ in each round $k$; to produce the alternative hypothesis $h'_k$, we just modify the decision tree $h_k$ by changing the label of the node used for predicting on $x_k$. Both of these procedures are clearly heuristic, but they are similar in spirit to the required optimizations. We set $C_0 = 8$ and $c_1 = c_2 = 1$ (these can be regarded as tuning parameters, with $C_0$ controlling the aggressiveness of the rejection threshold). We did not perform parameter tuning with active learning, although the importance weighting approach developed here could potentially be used for that.
Rather, the goal of these experiments is to assess the compatibility of Algorithm 1 with an existing, practical supervised learning procedure.

6.1 Data Sets

We constructed two binary classification tasks using the MNIST and KDDCUP99 data sets. For MNIST, we randomly chose 4000 training 3s and 5s for training (using the 3s as the positive class), and used all of the 1902 testing 3s and 5s for testing. For KDDCUP99, we randomly chose 5000 examples for training, and another 5000 for testing. In both cases, we reduced the dimension of the data to 25 using PCA. To demonstrate the versatility of our algorithm, we also conducted a multi-class classification experiment using the entire MNIST data set (all ten digits, so 60000 training data and 10000 testing data). This required modifying how $h'_k$ is selected: we force $h'_k(x_k) \neq h_k(x_k)$ by changing the label of the prediction node for $x_k$ to the next best label. We again used PCA to reduce the dimension.

6.2 Results

We examined the test error as a function of (i) the number of unlabeled data seen, and (ii) the number of labels queried. We compared the performance of the active learner described above to a passive learner (one that queries every label, so (i) and (ii) are the same) using J48 with default parameters. In all three cases, the test errors as a function of the number of unlabeled data were roughly the same for both the active and passive learners. This agrees with the consistency guarantee from Theorem 2. We note that this is a basic property not satisfied by many active learning algorithms (this issue is discussed further in [22]). In terms of test error as a function of the number of labels queried (Figure 2), the active learner had minimal improvement over the passive learner on the binary MNIST task, but a substantial improvement over the passive learner on the KDDCUP99 task, even at small numbers of label queries. For the multi-class MNIST task, the active learner had a moderate improvement over the passive learner.
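The rough agreement between the active and passive error curves reflects the unbiasedness of the importance weighted estimates from Section 2.3, a property that is easy to sanity-check numerically (a hypothetical simulation, not one of the paper's experiments):

```python
import random

# Monte Carlo sanity check (illustrative) of the unbiasedness claim from
# Section 2.3: E[(Q_i/P_i) f(X_i, Y_i)] = E[f(X, Y)] even when the query
# probability depends on the history, provided it stays strictly positive.

random.seed(3)

def one_run(n=400):
    total, queried = 0.0, 0
    for _ in range(n):
        x, y = random.random(), random.choice([-1, +1])
        f = 1.0 if (+1 if x >= 0.5 else -1) != y else 0.0   # error indicator
        p = 1.0 if queried < 10 else 0.2                     # history-dependent P_i
        q = 1 if random.random() < p else 0
        queried += q
        total += (q / p) * f
    return total / n

est = sum(one_run() for _ in range(2000)) / 2000
print(est)
```

Since the label here is a fair coin independent of the input, every classifier has true error exactly 1/2, and the averaged importance weighted estimate concentrates around that value despite the adaptive, heavily subsampled querying.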
Note that KDDCUP99 is far less noisy (more separable) than the MNIST 3s vs. 5s task, so the results are in line with the label complexity behavior suggested by Theorem 3, which states that the label complexity improvement may scale with the error of the optimal hypothesis. Also,

Figure 2: Test errors as a function of the number of labels queried (panels: MNIST 3s vs. 5s; KDDCUP99; KDDCUP99 close-up; MNIST multi-class close-up).

the results from the MNIST tasks suggest that the active learner may require an initial random sampling phase during which it is equivalent to the passive learner, with the advantage manifesting itself after this phase. This again is consistent with the analysis (also see [14]), as the disagreement coefficient can be large at initial scales, yet much smaller as the number of unlabeled data increases and the scale becomes finer.

7 Conclusion

This paper provides a new active learning algorithm based on error minimization oracles, a departure from the version space approach adopted by previous works. The algorithm we introduce here motivates computationally tractable and effective methods for active learning with many classifier training algorithms. The overall algorithmic template applies to any training algorithm that (i) operates by approximate error minimization and (ii) for which the cost of switching a class prediction (as measured by example errors) can be estimated. Furthermore, although these properties might only hold in an approximate or heuristic sense, the created active learning algorithm will be safe, in the sense that it will eventually converge to the same solution as a passive supervised learning algorithm. Consequently, we believe this approach can be widely used to reduce the cost of labeling in situations where labeling is expensive. Recent theoretical work on active learning has focused on improving rates of convergence. However, in some applications, it may be desirable to improve performance at much smaller sample sizes, perhaps even at the cost of improved rates (as long as consistency is ensured).
Importance sampling and weighting techniques like those analyzed in this work may be useful for developing more aggressive strategies with such properties.

Acknowledgments

This work was completed while DH was at Yahoo! Research and UC San Diego.

References

[1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[2] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.
[3] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Twenty-Third International Conference on Machine Learning, 2006.
[4] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Twenty-Sixth International Conference on Machine Learning, 2009.
[6] S. Hanneke. Adaptive rates of convergence in active learning. In Twenty-Second Annual Conference on Learning Theory, 2009.
[7] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Manuscript, 2009.
[8] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Twentieth Annual Conference on Learning Theory, 2007.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[11] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 2007.
[12] M. Sugiyama. Active learning for misspecified models. In Advances in Neural Information Processing Systems 18, 2005.
[13] F. Bach. Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems 19, 2006.
[14] S. Hanneke. A bound on the label complexity of agnostic active learning. In Twenty-Fourth International Conference on Machine Learning, 2007.
[15] M. Kääriäinen. Active learning in the non-realizable case. In Seventeenth International Conference on Algorithmic Learning Theory, 2006.
[16] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135–166, 2004.
[17] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.
[18] R. Castro and R. Nowak. Minimax bounds for active learning. In Twentieth Annual Conference on Learning Theory, 2007.
[19] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Eighteenth Annual Conference on Learning Theory, 2005.
[20] E. Friedman. Active learning for smooth problems. In Twenty-Second Annual Conference on Learning Theory, 2009.
[21] L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22, 2009.
[22] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Twenty-Fifth International Conference on Machine Learning, 2008.

A Proof of Deviation Bound for Importance Weighted Estimators

The techniques here are mostly developed in [19]; for completeness, we detail the proofs for our particular application. The first two lemmas establish a basic bound in terms of conditional moment generating functions.

Lemma 4. For all $n \geq 1$ and all functionals $\Xi_i := \xi_i(Z_{1:i})$,
$$\mathbb{E}\left[ \exp\left( \sum_{i=1}^{n} \big( \Xi_i - \ln \mathbb{E}_i[\exp \Xi_i] \big) \right) \right] = 1.$$

Proof. A straightforward induction on $n$.

Lemma 5. For all $t \geq 0$, $\lambda \in \mathbb{R}$, $n \geq 1$, and functionals $\Xi_i := \xi_i(Z_{1:i})$,
$$\Pr\left( \lambda \sum_{i=1}^{n} \Xi_i - \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda \Xi_i)] \geq t \right) \leq e^{-t}.$$

Proof. The claim follows by Markov's inequality and Lemma 4 (replacing $\Xi_i$ with $\lambda \Xi_i$).

In order to specialize Lemma 5 for our purposes, we first analyze the conditional moment generating function of $W_i - \mathbb{E}_i[W_i]$.

Lemma 6. If $0 < \lambda < 3 P_i$, then
$$\ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] \leq \frac{1}{P_i} \cdot \frac{\lambda^2 / 2}{1 - \lambda / (3 P_i)}.$$
If $\mathbb{E}_i[W_i] = 0$, then $\ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] = 0$.

Proof. Let $g(x) := (\exp(x) - x - 1)/x^2$ for $x \neq 0$, so $\exp(x) = 1 + x + x^2 g(x)$. Note that $g(x)$ is non-decreasing. Thus,
$$\mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] = \mathbb{E}_i\big[ 1 + \lambda (W_i - \mathbb{E}_i[W_i]) + \lambda^2 (W_i - \mathbb{E}_i[W_i])^2 g(\lambda (W_i - \mathbb{E}_i[W_i])) \big]$$
$$= 1 + \lambda^2 \mathbb{E}_i\big[ (W_i - \mathbb{E}_i[W_i])^2 g(\lambda (W_i - \mathbb{E}_i[W_i])) \big] \leq 1 + \lambda^2 \mathbb{E}_i\big[ (W_i - \mathbb{E}_i[W_i])^2 g(\lambda / P_i) \big]$$
$$= 1 + \lambda^2 \mathbb{E}_i\big[ (W_i - \mathbb{E}_i[W_i])^2 \big] g(\lambda / P_i) \leq 1 + (\lambda^2 / P_i) g(\lambda / P_i),$$
where the first inequality follows from the range bound $W_i \leq 1/P_i$, and the second follows from the variance bound $\mathbb{E}_i[(W_i - \mathbb{E}_i[W_i])^2] \leq 1/P_i$. Now the first claim follows from the definition of $g(x)$ and the facts $\exp(x) - x - 1 \leq (x^2/2)/(1 - x/3)$ for $0 \leq x < 3$ and $\ln(1 + x) \leq x$. The second claim is immediate from the definition of $W_i$ and the fact $\mathbb{E}_i[W_i] = f(X_i, Y_i)$.

We now combine Lemma 6 and Lemma 5 to bound the deviation of the importance weighted estimator $\hat{f}(Z_{1:n})$ from $(1/n) \sum_{i=1}^{n} \mathbb{E}_i[W_i]$.

Lemma 7. Pick any $t \geq 0$, $n \geq 1$, and $p_{\min} > 0$, and let $E$ be the joint event
$$\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2t}{n \, p_{\min}}} + \frac{t}{3 n \, p_{\min}} \quad \text{and} \quad \min(\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \cup \{1\}) \geq p_{\min}.$$
Then $\Pr(E) \leq e^{-t}$.

Proof. With foresight, let
$$\lambda := \frac{3 p_{\min}}{1 + 3 \sqrt{n \, p_{\min} / (2t)}}.$$
Note that $0 < \lambda < 3 p_{\min}$. By Lemma 6 and the choice of $\lambda$, we have that if $\min\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \geq p_{\min}$, then
$$\frac{1}{n \lambda} \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] \leq \frac{\lambda}{p_{\min}} \cdot \frac{1}{2 (1 - \lambda / (3 p_{\min}))} = \sqrt{\frac{t}{2 n \, p_{\min}}} \qquad (6)$$
and
$$\frac{t}{n \lambda} = \sqrt{\frac{t}{2 n \, p_{\min}}} + \frac{t}{3 n \, p_{\min}}. \qquad (7)$$
Let $E'$ be the event that
$$\lambda \sum_{i=1}^{n} (W_i - \mathbb{E}_i[W_i]) - \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] \geq t,$$
and let $E''$ be the event $\min(\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \cup \{1\}) \geq p_{\min}$. Together, Eq. (6) and Eq. (7) imply $E \cap E'' \subseteq E' \cap E''$, and of course $E \subseteq E''$; so $\Pr(E) \leq \Pr(E' \cap E'') \leq \Pr(E') \leq e^{-t}$ by Lemma 5.

To do away with the joint event in Lemma 7, we use the standard trick of taking a union bound over a geometric sequence of possible values for $p_{\min}$.

Lemma 8. Pick any $t \geq 0$ and $n \geq 1$. Assume $1/P_i \leq r_{\max}$ for all $1 \leq i \leq n$, and let $R_n := 1 / \min(\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \cup \{1\})$. We have
$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \right| \geq \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \right) \leq 2 (2 + \log_2 r_{\max}) e^{-t/2}.$$

Proof. The assumption on the $P_i$ implies $R_n \leq r_{\max}$. Let $r_j := 2^j$ for $-1 \leq j \leq m := \lceil \log_2 r_{\max} \rceil$ (note $R_n \geq 1 > r_{-1}$). Then
$$\Pr\left( \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \right)$$
$$= \sum_{j=0}^{m} \Pr\left( \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \;\wedge\; r_{j-1} < R_n \leq r_j \right)$$
$$\leq \sum_{j=0}^{m} \Pr\left( \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2 r_j (t/2)}{n}} + \frac{r_j (t/2)}{3n} \;\wedge\; R_n \leq r_j \right) \leq (m + 1) e^{-t/2} \leq (2 + \log_2 r_{\max}) e^{-t/2},$$
where the middle inequality uses $R_n > r_{j-1} = r_j / 2$ (on the $j$th event), and the last inequality follows from Lemma 7 with $p_{\min} = 1/r_j$. Replacing $W_i$ with $-W_i$ bounds the probability of deviations in the other direction in exactly the same way. The claim then follows by the union bound.

Proof of Theorem 1. By Hoeffding's inequality and the fact $|f(X_i, Y_i)| \leq 1$, we have
$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} f(X_i, Y_i) - \mathbb{E}[f(X, Y)] \right| \geq \sqrt{\frac{2t}{n}} \right) \leq 2 e^{-t/2}.$$
Since $\mathbb{E}_i[W_i] = f(X_i, Y_i)$, the claim follows by combining this and Lemma 8 with the triangle inequality and the union bound.
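The deviation bound just proved can be exercised numerically. The simulation below is an illustration under assumed parameters (it is not part of the proof): it draws adaptively subsampled importance weighted means and checks the observed deviation against the bound $\sqrt{2 R_n t / n} + \sqrt{2 t / n} + R_n t / (3n)$ at a value of $t$ for which failures should be vanishingly rare.

```python
import math
import random

# Illustrative numerical check of Theorem 1: with query probabilities
# bounded below by 1/r_max, the importance weighted mean should exceed
# the stated deviation bound only with probability on the order of
# 2 (3 + log2(r_max)) e^{-t/2}. All parameters here are assumptions.

random.seed(4)
n, r_max, t = 2000, 5.0, 20.0
true_mean = 0.3                                 # E[f(X, Y)] by construction

worst_gap = float("-inf")
for _ in range(50):
    s, r_n, prev_q = 0.0, 1.0, 1
    for i in range(n):
        f = 1.0 if random.random() < true_mean else 0.0
        p = 0.5 if prev_q else 1.0              # history-dependent, p >= 1/r_max
        q = 1 if random.random() < p else 0
        if f != 0.0:
            r_n = max(r_n, 1.0 / p)             # R_n as defined in the theorem
        s += (q / p) * f
        prev_q = q
    dev = abs(s / n - true_mean)
    bound = (math.sqrt(2 * r_n * t / n) + math.sqrt(2 * t / n)
             + r_n * t / (3 * n))
    worst_gap = max(worst_gap, dev - bound)

print(worst_gap)
```

With these parameters the bound evaluates to roughly 0.35, while typical deviations of the estimator are an order of magnitude smaller, so the gap stays negative across the runs; the bound is loose here because it must also cover adversarially adaptive query probabilities.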

B Remaining Proofs

In this section, we use the notation $\varepsilon_k := C_0 \log(k+1)/k$.

B.1 Proof of Lemma 2

By induction on $n$. The claim is trivial for $n = 1$, since $p(\text{empty sequence}, x) = 1$ for all $x \in \mathcal{X}$; so now fix any $n \geq 2$ and assume as the inductive hypothesis that $p_{n-1} = p(z_{1:n-2}, x) \geq 1/(n-1)^{n-1}$ for all $(z_{1:n-2}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-2} \times \mathcal{X}$. Fix any $(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-1} \times \mathcal{X}$, and consider the error difference $g_n := \operatorname{err}(h'_n, z_{1:n-1}) - \operatorname{err}(h_n, z_{1:n-1})$ used to determine $p_n := p(z_{1:n-1}, x)$. We only have to consider the case $g_n > \sqrt{\varepsilon_{n-1}} + \varepsilon_{n-1}$. By the inductive hypothesis and the triangle inequality, we have $g_n \leq 2 (n-1)^{n-1}$. Solving the quadratic in Eq. (2) implies
$$\sqrt{p_n} = \frac{c_1 \sqrt{\varepsilon_{n-1}} + \sqrt{c_1^2 \varepsilon_{n-1} + 4 \big(g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_2 - 1) \varepsilon_{n-1}\big) \, c_2 \varepsilon_{n-1}}}{2 \big(g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_2 - 1) \varepsilon_{n-1}\big)} \geq \sqrt{\frac{c_2 \varepsilon_{n-1}}{g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_2 - 1) \varepsilon_{n-1}}} \quad \text{(dropping terms)},$$
so
$$p_n \geq \frac{c_2 \varepsilon_{n-1}}{g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_1 - 1) \varepsilon_{n-1}} \quad \text{(since $c_2 \leq c_1$)} \quad \geq \frac{c_2 \varepsilon_{n-1}}{c_1 g_n} \quad \text{(since $g_n > \sqrt{\varepsilon_{n-1}} + \varepsilon_{n-1}$)}$$
$$\geq \frac{c_2 C_0 \log n}{2 c_1 (n-1)^n} \quad \text{(inductive hypothesis)} \quad > \frac{1}{e (n-1)^n} \quad \text{(since $C_0 \geq 2$, $n \geq 2$, and $c_2 C_0 \log 2 / (2 c_1) > 1/e$)} \quad \geq \frac{1}{n^n} \quad \text{(since $(n/(n-1))^n \geq e$)},$$
as required.

B.2 Proof of Theorem 2

We condition on the ($1 - \delta$ probability) event that the deviation bounds from Lemma 1 hold (also using Lemma 2). The proof now proceeds by induction on $n$. The claim is trivially true for $n = 1$. Now pick any $n \geq 2$ and assume as the (strong) inductive hypothesis that
$$0 \leq \operatorname{err}(h_k) - \operatorname{err}(h^*) \leq \operatorname{err}(h_k, Z_{1:k-1}) - \operatorname{err}(h^*, Z_{1:k-1}) + \sqrt{2 \varepsilon_{k-1}} + 2 \varepsilon_{k-1} \qquad (8)$$
for all $1 \leq k \leq n-1$. We need to show Eq. (8) holds for $k = n$. Let $P_{\min} := \min(\{P_i : 1 \leq i \leq n-1 \wedge h_n(X_i) \neq h^*(X_i)\} \cup \{1\})$. If $P_{\min} \geq 1/2$, then Eq. (4) implies that Eq. (8) holds for $k = n$, as needed. So assume for the sake of contradiction that $P_{\min} < 1/2$, and let $n_0 := \max\{i \leq n-1 : P_i = P_{\min} \wedge h_n(X_i) \neq h^*(X_i)\}$. By the definition of $P_{n_0}$, we have
$$\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) = \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 + 1 \right) \varepsilon_{n_0-1}.$$

Using this fact together with the inductive hypothesis, we have
$$ \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) = \left[\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h_{n_0},Z_{1:n_0-1})\right] + \left[\mathrm{err}(h_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})\right] $$
$$ \ge \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1+1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2+1\right)\varepsilon_{n_0-1} - 2\sqrt{\varepsilon_{n_0-1}} - 2\varepsilon_{n_0-1} $$
$$ = \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2-1\right)\varepsilon_{n_0-1}. \qquad (9) $$
We use the assumption $P_{\min} < 1/2$ to lower bound the right-hand side to get the inequality
$$ \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) > \left((\sqrt{2}-1)c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + (c_2-1)\varepsilon_{n_0-1} > 0, $$
which implies $\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) > \mathrm{err}(h^*,Z_{1:n_0-1})$. Since $h'_{n_0}$ minimizes $\mathrm{err}(h,Z_{1:n_0-1})$ among hypotheses $h \in H$ that disagree with $h_{n_0}$ on $X_{n_0}$, it must be that $h^*$ agrees with $h_{n_0}$ on $X_{n_0}$. By transitivity and the definition of $n_0$, we conclude that $h_n(X_{n_0}) \neq h_{n_0}(X_{n_0})$; so $\mathrm{err}(h_n,Z_{1:n_0-1}) \ge \mathrm{err}(h'_{n_0},Z_{1:n_0-1})$. Then
$$ \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \ge \mathrm{err}(h_n) - \mathrm{err}(h^*) - \sqrt{\frac{\varepsilon_{n-1}}{P_{\min}}} - \frac{\varepsilon_{n-1}}{P_{\min}} $$
$$ \ge \mathrm{err}(h_n,Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) - 2\sqrt{\frac{\varepsilon_{n_0-1}}{P_{\min}}} - \frac{2\varepsilon_{n_0-1}}{P_{\min}} $$
$$ \ge \left(\frac{c_1-2}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2-2}{P_{\min}}-c_2-1\right)\varepsilon_{n_0-1} $$
$$ > \left(\sqrt{2}(c_1-2)-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + (c_2-5)\varepsilon_{n_0-1} $$
where Eq. (4) is used in the first two inequalities, Eq. (9) and the fact $\mathrm{err}(h_n,Z_{1:n_0-1}) \ge \mathrm{err}(h'_{n_0},Z_{1:n_0-1})$ are used in the third inequality, and the fact $P_{\min} < 1/2$ is used in the last inequality. This final quantity is non-negative, so we have the contradiction $\mathrm{err}(h_n,Z_{1:n-1}) > \mathrm{err}(h^*,Z_{1:n-1})$.

B.3 Proof of Lemma 3

First, we establish a property of the query probabilities that relates error deviations (via $P_{\min}$) to empirical error differences (via $P_n$). Both quantities play essential roles in bounding the label complexity through the disagreement metric structure around $h^*$.

Lemma 9. Assume the bounds from Eq. (4) hold for all $h \in H$ and $n \ge 1$. For any $n \ge 1$, we have $P_n \le c_3 \cdot P_{\min}$, where $P_{\min} := \min(\{P_i : 1\le i\le n-1 \wedge \tilde h(X_i)\neq h^*(X_i)\}\cup\{1\})$ and
$$ \tilde h := \begin{cases} h_n & \text{if } h_n \text{ disagrees with } h^* \text{ on } X_n \\ h'_n & \text{if } h'_n \text{ disagrees with } h^* \text{ on } X_n. \end{cases} \qquad (10) $$

Proof. We can assume $P_{\min} < 1/c_3$, since otherwise the claim is trivial. Pick any $n_0 \le n-1$ such that $\tilde h(X_{n_0}) \neq h^*(X_{n_0})$ and $P_{n_0} = P_{\min}$ (such an $n_0$ is guaranteed to exist given the above assumption). We now proceed as in the proof of Theorem 2. We first show a lower bound on $\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})$.
Note that
$$ \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) = \left[\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h_{n_0},Z_{1:n_0-1})\right] + \left[\mathrm{err}(h_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})\right] $$
$$ \ge \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1+1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2+1\right)\varepsilon_{n_0-1} - 2\sqrt{\varepsilon_{n_0-1}} - 2\varepsilon_{n_0-1} $$
$$ = \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2-1\right)\varepsilon_{n_0-1} \qquad (11) $$

where the inequality follows from Theorem 2. The right-hand side is positive, so $h^*$ must disagree with $h'_{n_0}$ on $X_{n_0}$. By transitivity (recalling that $\tilde h(X_{n_0}) \neq h^*(X_{n_0})$), $\tilde h$ must agree with $h'_{n_0}$ on $X_{n_0}$. Therefore $\mathrm{err}(\tilde h,Z_{1:n_0-1}) - \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) \ge 0$, so the inequality in Eq. (11) holds with $\tilde h$ in place of $h'_{n_0}$ on the left-hand side. Now $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1})$ is related to $\mathrm{err}(\tilde h,Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})$ through $\mathrm{err}(\tilde h) - \mathrm{err}(h^*)$, using the deviation bound from Eq. (4) as well as the fact $\varepsilon_{n_0-1} \ge \varepsilon_{n-1}$:
$$ \mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \ge \mathrm{err}(\tilde h,Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) - 2\sqrt{\frac{\varepsilon_{n_0-1}}{P_{\min}}} - \frac{2\varepsilon_{n_0-1}}{P_{\min}} $$
$$ \ge \left(\frac{c_1-2}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2-2}{P_{\min}}-c_2-1\right)\varepsilon_{n-1} > 0. \qquad (12) $$
If $\tilde h = h_n$, then $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) = \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \le 0$ by the minimality of $\mathrm{err}(h_n,Z_{1:n-1})$; this contradicts Eq. (12). Therefore it must be that $\tilde h = h'_n$. In this case,
$$ \mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \le \mathrm{err}(h'_n,Z_{1:n-1}) - \mathrm{err}(h_n,Z_{1:n-1}) = \left(\frac{c_1}{\sqrt{P_n}}-c_1+1\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2}{P_n}-c_2+1\right)\varepsilon_{n-1} \qquad (13) $$
where the inequality follows from the minimality of $\mathrm{err}(h_n,Z_{1:n-1})$, and the subsequent step follows from the definition of $P_n$. Combining the lower bound in Eq. (12) and the upper bound in Eq. (13) implies that
$$ \frac{c_1}{\sqrt{P_n}}\sqrt{\varepsilon_{n-1}} + \frac{c_2}{P_n}\varepsilon_{n-1} \ge \left(\frac{c_1-2}{\sqrt{P_{\min}}}-2\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2-2}{P_{\min}}-2\right)\varepsilon_{n-1}. $$
It is easily checked that this implies $P_n \le c_3 \cdot P_{\min}$.

Proof of Lemma 3. Define $\tilde h$ as in Eq. (10). By Lemma 9, we have $\min(\{P_i : 1\le i\le n-1 \wedge \tilde h(X_i)\neq h^*(X_i)\}\cup\{1\}) \ge P_n/c_3$. We first show that
$$ \mathrm{err}(\tilde h) - \mathrm{err}(h^*) \le \mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) + \sqrt{\frac{c_3\varepsilon_{n-1}}{P_n}} + \frac{c_3\varepsilon_{n-1}}{P_n} \le c_4\sqrt{\frac{\varepsilon_{n-1}}{P_n}} + \frac{c_5\varepsilon_{n-1}}{P_n}. \qquad (14) $$
The first inequality follows from Eq. (4) and Lemma 9. For the second inequality, we consider two cases depending on $\tilde h$. If $\tilde h = h'_n$, then we bound $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1})$ from above by $\mathrm{err}(h'_n,Z_{1:n-1}) - \mathrm{err}(h_n,Z_{1:n-1})$ (by definition of $\tilde h$ and minimality of $\mathrm{err}(h_n,Z_{1:n-1})$), and then simplify
$$ \mathrm{err}(h'_n,Z_{1:n-1}) - \mathrm{err}(h_n,Z_{1:n-1}) + \sqrt{\frac{c_3\varepsilon_{n-1}}{P_n}} + \frac{c_3\varepsilon_{n-1}}{P_n} \le \left(\frac{c_1+\sqrt{c_3}}{\sqrt{P_n}}-c_1+1\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2+c_3}{P_n}-c_2+1\right)\varepsilon_{n-1} \le \frac{c_4}{\sqrt{P_n}}\sqrt{\varepsilon_{n-1}} + \frac{c_5}{P_n}\varepsilon_{n-1} $$
using the definition of $P_n$ and the facts $c_1 \ge 1$ and $c_2 \ge 1$. If instead $\tilde h = h_n$, then we use the facts $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) = \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \le 0$ and $c_3 \le \min\{c_4,c_5\}$.

If $\mathrm{err}(\tilde h) - \mathrm{err}(h^*) = \gamma > 0$, then solving the quadratic inequality in Eq. (14) for $P_n$ gives the bound
$$ P_n \le \min\left\{1,\; \frac{3}{2}\left(\frac{c_4^2}{\gamma^2} + \frac{c_5}{\gamma}\right)\varepsilon_{n-1}\right\}. $$
If $\mathrm{err}(\tilde h) - \mathrm{err}(h^*) \le \gamma$, then by the triangle inequality we have
$$ \Pr(\tilde h(X) \neq h^*(X)) \le \mathrm{err}(\tilde h) + \mathrm{err}(h^*) \le 2\,\mathrm{err}(h^*) + \gamma $$

which in turn implies $X_n \in \mathrm{DIS}(B(h^*, 2\,\mathrm{err}(h^*)+\gamma))$. Note that $\Pr(X_n \in \mathrm{DIS}(B(h^*, 2\,\mathrm{err}(h^*)+\gamma))) \le \theta\cdot(2\,\mathrm{err}(h^*)+\gamma)$ by the definition of $\theta$, so $\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*) \le \gamma) \le \theta\,(2\,\mathrm{err}(h^*)+\gamma)$. Let $f(\gamma) := \frac{\partial}{\partial\gamma}\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*) \le \gamma)$ be the probability density function of the error difference $\mathrm{err}(\tilde h)-\mathrm{err}(h^*)$; note that this error difference is a function of $(Z_{1:n-1}, X_n)$. We compute the expected value of $Q_n$ by conditioning on $\mathrm{err}(\tilde h)-\mathrm{err}(h^*)$ and integrating an upper bound on $\mathbb{E}[Q_n \mid \mathrm{err}(\tilde h)-\mathrm{err}(h^*)=\gamma]$ with respect to $f(\gamma)$. Let $\gamma_0 > 0$ be the positive solution to $\frac{3}{2}(c_4^2/\gamma^2 + c_5/\gamma)\varepsilon_{n-1} = 1$. It can be checked that $\gamma_0 \ge \sqrt{1.5c_4^2\varepsilon_{n-1}}$. We have
$$ \mathbb{E}[Q_n] = \mathbb{E}\left[\mathbb{E}[Q_n \mid Z_{1:n-1}, X_n]\right] \qquad (\text{the outer expectation is over } (Z_{1:n-1},X_n)) $$
$$ = \int_0^\infty \mathbb{E}[Q_n \mid \mathrm{err}(\tilde h)-\mathrm{err}(h^*)=\gamma]\, f(\gamma)\, d\gamma $$
$$ \le \int_0^1 \min\left\{1,\; \frac{3}{2}\left(\frac{c_4^2}{\gamma^2}+\frac{c_5}{\gamma}\right)\varepsilon_{n-1}\right\} f(\gamma)\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \int_{\gamma_0}^1 \frac{3}{2}\left(\frac{2c_4^2}{\gamma^3}+\frac{c_5}{\gamma^2}\right)\varepsilon_{n-1}\,\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma)\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\cdot2\,\mathrm{err}(h^*)\cdot\frac{3}{2}\left(\frac{c_4^2}{\gamma_0^2}+\frac{c_5}{\gamma_0}\right)\varepsilon_{n-1} + \theta\cdot\frac{3}{2}\left(\frac{2c_4^2}{\gamma_0}+c_5\ln\frac{1}{\gamma_0}\right)\varepsilon_{n-1} $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\cdot2\,\mathrm{err}(h^*) + \theta\sqrt{6c_4^2\varepsilon_{n-1}} + \theta\cdot\frac{3c_5}{4}\varepsilon_{n-1}\ln\frac{1}{1.5c_4^2\varepsilon_{n-1}} $$
where the first inequality uses the bound on $\mathbb{E}[Q_n \mid \mathrm{err}(\tilde h)-\mathrm{err}(h^*)=\gamma]$; the second inequality uses integration by parts; the third inequality uses the fact that the integrand from the previous line is $0$ for $0 \le \gamma \le \gamma_0$, as well as the bound on $\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma)$; and the fourth inequality uses the definition of $\gamma_0$.

B.4 Proof of Theorem 4

The theorem is a simple consequence of the following analogue of Lemma 3.

Lemma 10. Assume that for some value of $\kappa > 0$ and $0 < \alpha \le 1$, the condition in Eq. (5) holds for all $h \in H$. Assume the bounds from Eq. (4) hold for all $h \in H$ and $n \ge 1$. There is a constant $c_\alpha > 0$ such that the following holds. For any $n \ge 1$,
$$ \mathbb{E}[Q_n] \le \theta\kappa\, c_\alpha \left(\frac{C_0\log n}{n-1}\right)^{\alpha/2}. $$

Proof. For the most part, the proof is the same as that of Lemma 3. The key difference is to use the noise condition in Eq. (5) to directly bound $\Pr(\tilde h(X)\neq h^*(X)) \le \kappa\,(\mathrm{err}(\tilde h)-\mathrm{err}(h^*))^\alpha$, which in turn implies the bound $\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma) \le \theta\kappa\gamma^\alpha$. As before, let $\gamma_0 \ge \sqrt{1.5c_4^2\varepsilon_{n-1}}$ be the positive solution to $\frac{3}{2}(c_4^2/\gamma^2+c_5/\gamma)\varepsilon_{n-1} = 1$. First consider the case $\alpha < 1$.
Then, the expectation of $Q_n$ can be bounded as
$$ \mathbb{E}[Q_n] \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \int_{\gamma_0}^1 \frac{3}{2}\left(\frac{2c_4^2}{\gamma^3}+\frac{c_5}{\gamma^2}\right)\varepsilon_{n-1}\,\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma)\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\kappa\int_{\gamma_0}^1 \frac{3}{2}\left(\frac{2c_4^2}{\gamma^3}+\frac{c_5}{\gamma^2}\right)\varepsilon_{n-1}\,\gamma^\alpha\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\kappa\cdot\frac{3}{2}\left(\frac{2c_4^2}{2-\alpha}\,\gamma_0^{-(2-\alpha)} + \frac{c_5}{1-\alpha}\,\gamma_0^{-(1-\alpha)}\right)\varepsilon_{n-1}. $$
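The last step above evaluates a tail integral in closed form. As a numerical sanity check of that antiderivative (an illustrative sketch; the constants, limits, and function names below are arbitrary choices, not values from the paper), closed form and direct quadrature can be compared:

```python
def tail_integral_closed(g0, c4, c5, alpha):
    """Closed form of the tail integral of (2*c4^2/g^3 + c5/g^2) * g^alpha
    from g0 to infinity, valid for 0 < alpha < 1."""
    return (2.0 * c4**2 * g0**(alpha - 2.0) / (2.0 - alpha)
            + c5 * g0**(alpha - 1.0) / (1.0 - alpha))

def tail_integral_numeric(g0, g1, c4, c5, alpha, steps=200000):
    """Midpoint-rule quadrature of the same integrand on [g0, g1]."""
    h = (g1 - g0) / steps
    total = 0.0
    for i in range(steps):
        g = g0 + (i + 0.5) * h
        total += (2.0 * c4**2 / g**3 + c5 / g**2) * g**alpha * h
    return total

# Quadrature on [0.1, 50] should match the difference of the closed form
# evaluated at the two endpoints.
numeric = tail_integral_numeric(0.1, 50.0, 2.0, 3.0, 0.5)
exact = (tail_integral_closed(0.1, 2.0, 3.0, 0.5)
         - tail_integral_closed(50.0, 2.0, 3.0, 0.5))
assert abs(numeric - exact) / exact < 1e-3
```

Differentiating the closed form reproduces the integrand term by term, which is exactly the integration-by-parts step used in the proofs of Lemmas 3 and 10.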

The case $\alpha = 1$ is handled similarly.

B.5 Analogue of Theorem 2 under Low Noise Conditions

We first state a variant of Lemma 1 that takes into account the probability of disagreement between a hypothesis $h$ and the optimal hypothesis $h^*$.

Lemma 11. There exists an absolute constant $c > 0$ such that the following holds. Pick any $\delta \in (0,1)$. For all $n \ge 1$, let
$$ \varepsilon_n := \frac{c\log((n+1)|H|/\delta)}{n}. $$
Let $Z_1, Z_2, \ldots \in \mathcal{X}\times\mathcal{Y}\times\{0,1\}$ be the sequence of random variables specified in Section 2.2 using a rejection threshold $p : (\mathcal{X}\times\mathcal{Y}\times\{0,1\})^*\times\mathcal{X} \to [0,1]$ that satisfies $p(z_{1:n-1},x) \ge 1/n^n$ for all $(z_{1:n-1},x) \in (\mathcal{X}\times\mathcal{Y}\times\{0,1\})^{n-1}\times\mathcal{X}$ and all $n \ge 1$. The following holds with probability at least $1-\delta$. For all $n \ge 1$ and all $h \in H$,
$$ \left|\mathrm{err}(h,Z_{1:n}) - \mathrm{err}(h^*,Z_{1:n}) - \left(\mathrm{err}(h)-\mathrm{err}(h^*)\right)\right| \le \sqrt{\frac{\Pr(h(X)\neq h^*(X))\,\varepsilon_n}{P_{\min,n}(h)}} + \frac{\varepsilon_n}{P_{\min,n}(h)} $$
where $P_{\min,n}(h) := \min(\{P_i : 1\le i\le n \wedge h(X_i)\neq h^*(X_i)\}\cup\{1\})$.

Proof sketch. The proof of this lemma follows along the same lines as that of Lemma 1. A key difference comes in Lemma 7: the joint event is modified to also conjoin with the event
$$ \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[\mathbb{E}_i[f(X_i,Y_i)] \neq 0\right] \le a $$
for some fixed $a > 0$. In the proof, the parameter $\lambda$ should be chosen as
$$ \lambda := \frac{3p_{\min}\sqrt{2p_{\min}at/n}}{3p_{\min}a+\sqrt{2p_{\min}at/n}}. $$
Lemma 8 is modified to also take a union bound over a sequence of possible values for $a$ (in fact, only $n+1$ different values need to be considered). Finally, instead of combining with Hoeffding's inequality, we use Bernstein's inequality (or a multiplicative form of Chernoff's bound), so the resulting bound (an analogue of Theorem 1) involves an empirical average inside the square-root term: with probability at least $1 - O(n\log_2 r_{\max})e^{-t/2}$,
$$ \left|\frac{1}{n}\sum_{i=1}^{n}\frac{Q_i f(X_i,Y_i)}{P_i} - \mathbb{E}[f(X,Y)]\right| \le O\left(\sqrt{\frac{R_nA_nt}{n}} + \frac{R_nt}{3n}\right) $$
where
$$ A_n := \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[f(X_i,Y_i) \neq 0\right]. $$
Finally, we apply this deviation bound to obtain uniform error bounds over all hypotheses $h \in H$ (a few extra steps are required to replace the empirical quantity $A_n$ in the bound with a distributional quantity).

Using the previous lemma, a modified version of Theorem 2 follows from essentially the same proof. We note that the quantity $C_1 := O(\log(|H|/\delta))$ used here may differ from $C_0$ by constant factors.

Lemma 12.
The following holds with probability at least $1-\delta$. For any $n \ge 1$,
$$ 0 \le \mathrm{err}(h_n) - \mathrm{err}(h^*) \le \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) + \sqrt{\frac{2\Pr(h_n(X)\neq h^*(X))\,C_1\log n}{n-1}} + \frac{2C_1\log n}{n-1}. $$
This implies, for all $n \ge 1$,
$$ \mathrm{err}(h_n) - \mathrm{err}(h^*) \le \sqrt{\frac{2\Pr(h_n(X)\neq h^*(X))\,C_1\log n}{n-1}} + \frac{2C_1\log n}{n-1}. \qquad (16) $$

Finally, using the noise condition to bound $\Pr(h_n(X)\neq h^*(X)) \le \kappa\,(\mathrm{err}(h_n)-\mathrm{err}(h^*))^\alpha$, we obtain the final error bound.

Theorem 5. The following holds with probability at least $1-\delta$. For any $n \ge 1$,
$$ \mathrm{err}(h_n) - \mathrm{err}(h^*) \le c_\kappa\left(\frac{C_1\log n}{n-1}\right)^{1/(2-\alpha)} $$
where $c_\kappa$ is a constant that depends only on $\kappa$.
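For concreteness, the rejection threshold whose properties drive these proofs can be computed in closed form by solving a quadratic in $\sqrt{P_n}$, as in the proof of Lemma 2. The sketch below assumes the threshold equation has the form $G_n = (c_1/\sqrt{s}-c_1+1)\sqrt{\varepsilon} + (c_2/s-c_2+1)\varepsilon$ with illustrative constants $c_1 = 2$ and $c_2 = 5$ (the actual constants in the algorithm may differ):

```python
import math

def rejection_threshold(gap, eps, c1=2.0, c2=5.0):
    """Solve gap = (c1/sqrt(s) - c1 + 1)*sqrt(eps) + (c2/s - c2 + 1)*eps
    for s in (0, 1], taking the positive root of the quadratic in sqrt(s).
    Returns 1.0 when the empirical error gap is within the deviation bound."""
    if gap <= math.sqrt(eps) + eps:
        return 1.0
    # Multiply through by s and substitute u = sqrt(s):  a*u^2 - b*u - c = 0.
    a = gap + (c1 - 1.0) * math.sqrt(eps) + (c2 - 1.0) * eps
    b = c1 * math.sqrt(eps)
    c = c2 * eps
    u = (b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)  # positive root
    return u * u

s = rejection_threshold(gap=0.5, eps=0.01)
# The returned value should satisfy the defining equation.
residual = (2.0 / math.sqrt(s) - 1.0) * 0.1 + (5.0 / s - 4.0) * 0.01 - 0.5
assert abs(residual) < 1e-9
```

Because the right-hand side of the threshold equation decreases in $s$, a larger error gap forces a smaller query probability, which is the mechanism the label complexity bounds above exploit.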


More information

Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Microchoice Bounds and Self Bounding Learning Algorithms

Microchoice Bounds and Self Bounding Learning Algorithms Microchoice Bounds and Self Bounding Learning Algorithms John Langford Computer Science Department Carnegie Mellong University Pittsburgh, PA 527 jcl+@cs.cmu.edu Avrim Blum Computer Science Department

More information

Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation

Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation Avrim Blum School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 avrim+@cscmuedu Adam Kalai School of Computer

More information

CS446: Machine Learning Spring Problem Set 4

CS446: Machine Learning Spring Problem Set 4 CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Multiple Identifications in Multi-Armed Bandits

Multiple Identifications in Multi-Armed Bandits Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Efficient PAC Learning from the Crowd

Efficient PAC Learning from the Crowd Proceedings of Machine Learning Research vol 65:1 24, 2017 Efficient PAC Learning from the Crowd Pranjal Awasthi Rutgers University Avrim Blum Nika Haghtalab Carnegie Mellon University Yishay Mansour Tel-Aviv

More information

Activized Learning with Uniform Classification Noise

Activized Learning with Uniform Classification Noise Activized Learning with Uniform Classification Noise Liu Yang Steve Hanneke June 9, 2011 Abstract We prove that for any VC class, it is possible to transform any passive learning algorithm into an active

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Computational Learning Theory

Computational Learning Theory 1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number

More information

On the Sample Complexity of Noise-Tolerant Learning

On the Sample Complexity of Noise-Tolerant Learning On the Sample Complexity of Noise-Tolerant Learning Javed A. Aslam Department of Computer Science Dartmouth College Hanover, NH 03755 Scott E. Decatur Laboratory for Computer Science Massachusetts Institute

More information

Multiclass Learnability and the ERM Principle

Multiclass Learnability and the ERM Principle Journal of Machine Learning Research 6 205 2377-2404 Submitted 8/3; Revised /5; Published 2/5 Multiclass Learnability and the ERM Principle Amit Daniely amitdaniely@mailhujiacil Dept of Mathematics, The

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory

More information

Analysis of perceptron-based active learning

Analysis of perceptron-based active learning Analysis of perceptron-based active learning Sanjoy Dasgupta 1, Adam Tauman Kalai 2, and Claire Monteleoni 3 1 UCSD CSE 9500 Gilman Drive #0114, La Jolla, CA 92093 dasgupta@csucsdedu 2 TTI-Chicago 1427

More information

Experience-Efficient Learning in Associative Bandit Problems

Experience-Efficient Learning in Associative Bandit Problems Alexander L. Strehl strehl@cs.rutgers.edu Chris Mesterharm mesterha@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu Haym Hirsh hirsh@cs.rutgers.edu Department of Computer Science, Rutgers University,

More information