The Sample Complexity of Self-Verifying Bayesian Active Learning


Liu Yang, Steve Hanneke, Jaime Carbonell
Department of Statistics, Machine Learning Department, and Language Technologies Institute, Carnegie Mellon University

Abstract

We prove that access to a prior distribution over target functions can dramatically improve the sample complexity of self-terminating active learning algorithms, so that it is always better than the known results for prior-dependent passive learning. In particular, this is in stark contrast to the analysis of prior-independent algorithms, where there are simple known learning problems for which no self-terminating algorithm can provide this guarantee for all priors.

1 Introduction and Background

Active learning is a powerful form of supervised machine learning characterized by interaction between the learning algorithm and supervisor during the learning process. In this work, we consider a variant known as pool-based active learning, in which a learning algorithm is given access to a (typically very large) collection of unlabeled examples, and is able to select any of those examples, request the supervisor to label it (in agreement with the target concept), then after receiving the label, select another example from the pool, etc. This sequential label-requesting process continues until some halting criterion is reached, at which point the algorithm outputs a function, and the objective is for this function to closely approximate the (unknown) target concept in the future. The primary motivation behind pool-based active learning is that, often, unlabeled examples are inexpensive and available in abundance, while annotating those examples can be costly or time-consuming; as such, we often wish to select only the informative examples to be labeled, thus reducing information-redundancy to some extent, compared to the baseline of selecting the examples to be labeled uniformly at random from the pool (passive learning).

[Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.]

There has recently been an explosion of fascinating theoretical results on the advantages of this type of active learning, compared to passive learning, in terms of the number of labels required to obtain a prescribed accuracy (called the sample complexity): e.g., [FSST97, Das04, DKM09, Das05, Han07b, BHV10, BBL09, Wan09, Kää06, Han07a, DHM07, Fri09, CN08, Now08, BBZ07, Han11, Kol10, Han09, BDL09]. In particular, [BHV10] show that in noise-free binary classifier learning, for any passive learning algorithm for a concept space of finite VC dimension, there exists an active learning algorithm with asymptotically much smaller sample complexity for any nontrivial target concept. In later work, [Han09] strengthens this result by removing a certain strong dependence on the distribution of the data in the learning algorithm. Thus, it appears there are profound advantages to active learning compared to passive learning. However, the ability to rapidly converge to a good classifier using only a small number of labels is only one desirable quality of a machine learning method, and there are other qualities that may also be important in certain scenarios. In particular, the ability to verify the performance of a learning method is often a crucial part of machine learning applications, as (among other things) it helps us determine whether we have enough data to achieve a desired level of accuracy with the given method.
In passive learning, one common practice for this verification is to hold out a random sample of labeled examples as a validation sample to evaluate the trained classifier (e.g., to determine when training is complete). It turns out this technique is not feasible in active learning, since in order to be really useful as an indicator of whether we have seen enough labels to guarantee the desired accuracy, the number of labeled examples in the random validation sample would need to be much larger than the number of labels requested by the active learning algorithm itself, thus (to some extent) canceling the savings obtained by performing active rather than passive learning. Another common practice in passive learning is to examine the training error rate of the returned classifier, which can serve as a reasonable indicator of performance (after adjusting for model complexity). However, again this measure of performance is not necessarily reasonable for active learning, since the set of examples the algorithm requests the labels of is typically distributed very differently from the test examples the classifier will be applied to after training.

This reasoning indicates that performance verification is (at best) a far more subtle issue in active learning than in passive learning. Indeed, [BHV10] note that although the number of labels required to achieve good accuracy is significantly smaller than in passive learning, it is often the case that the number of labels required to verify that the accuracy is good is not significantly improved. In particular, this phenomenon can dramatically increase the sample complexity of active learning algorithms that adaptively determine how many labels to request before terminating. In short, if we require the algorithm both to learn an accurate concept and to know that its concept is accurate, then the number of labels required by active learning is often not significantly smaller than the number required by passive learning.

We should note, however, that the above results were proven for a learning scenario in which the target concept is considered a constant, and no information about the process that generates this concept is known a priori. Alternatively, we can consider a modification of this problem, so that the target concept can be thought of as a random variable, a sample from a known distribution (called a prior) over the space of possible concepts. Such a setting has been studied in detail in the context of passive learning for noise-free binary classification. In particular, [HKS92] found that for any concept space of finite VC dimension d, for any prior and distribution over data points, O(d/ε) random labeled examples are sufficient for the expected error rate of the Bayes classifier produced under the posterior distribution to be at most ε. Furthermore, it is easy to construct learning problems for which there is an Ω(1/ε) lower bound on the number of random labeled examples required to achieve expected error rate at most ε, by any passive learning algorithm; for instance, the problem of learning threshold classifiers on [0,1] under a uniform data distribution and uniform prior is one such scenario.

In the context of active learning (again, with access to the prior), [FSST97] analyze the Query by Committee algorithm, and find that if a certain information gain quantity for the points requested by the algorithm is lower-bounded by a value g, then the algorithm requires only O((d/g) log(1/ε)) labels to achieve expected error rate at most ε. In particular, they show that this is satisfied for constant g for linear separators under a near-uniform prior, and a near-uniform data distribution over the unit sphere. This represents a marked improvement over the results of [HKS92] for passive learning, and since the Query by Committee algorithm is self-verifying, this result is highly relevant to the present discussion. However, the condition that the information gains be lower-bounded by a constant is quite restrictive, and many interesting learning problems are precluded by this requirement. Furthermore, there exist learning problems (with finite VC dimension) for which the Query by Committee algorithm makes an expected number of label requests exceeding Ω(1/ε). To date, there has not been a general analysis of how the value of g can behave as a function of ε, though such an analysis would likely be quite interesting.

In the present paper, we take a more general approach to the question of active learning with access to the prior.
We are interested in the broad question of whether access to the prior bridges the gap between the sample complexity of learning and the sample complexity of learning with verification. Specifically, we ask the following question.

Can a prior-dependent self-terminating active learning algorithm for a concept class of finite VC dimension always achieve expected error rate at most ε using o(1/ε) label requests?

After some basic definitions in Section 2, we begin in Section 3 with a concrete example, namely interval classifiers under a uniform data density but arbitrary prior, to illustrate the general idea, and convey some of the intuition as to why one might expect a positive answer to this question. In Section 4, we present a general proof that the answer is always yes. As the known results for the sample complexity of passive learning with access to the prior are typically Θ(1/ε) [HKS92], and this is sometimes tight, this represents an improvement over passive learning. The proof is simple and accessible, yet represents an important step in understanding the problem of self-termination in active learning algorithms, and the general issue of the complexity of verification. Also, as this is a result that does not generally hold for prior-independent algorithms (even for their average-case behavior induced by the prior) for certain concept spaces, this also represents a significant step toward understanding the inherent value of having access to the prior.

2 Definitions and Preliminaries

First, we introduce some notation and formal definitions. We denote by X the instance space, representing the range of the unlabeled data points, and we suppose a distribution D on X, which we will refer to as the data distribution. We also suppose the existence of a sequence X_1, X_2, ... of i.i.d. random variables, each with distribution D, referred to as the unlabeled data sequence. Though one could potentially analyze the achievable performance as a function of the number of unlabeled points made available to the learning algorithm (cf. [Das05]), for simplicity in the present work, we will suppose this unlabeled sequence is essentially inexhaustible, corresponding to the practical fact that unlabeled data are typically available in abundance as they are often relatively inexpensive to obtain.

Additionally, there is a set C of measurable classifiers h : X → {−1,+1}, referred to as the concept space. We denote by d the VC dimension of C, and in our present context we will restrict ourselves to spaces C with d < ∞, referred to as a VC class. We also have a probability distribution π, called the prior, over C, and a random variable h* ~ π, called the target function; we suppose h* is independent from the data sequence X_1, X_2, .... We adopt the usual notation for conditional expectations and probabilities [ADD00]; for instance, E[A | B] can be thought of as an expectation of the value A, under the conditional distribution of A given the value of B (which itself is random), and thus the value of E[A | B] is essentially determined by the value of B. For any measurable h : X → {−1,+1}, define the error rate er(h) = D({x : h(x) ≠ h*(x)}). So far, this setup is essentially identical to that of [HKS92, FSST97].

The protocol in active learning is the following. An active learning algorithm A is given as input the prior π, the data distribution D (though see Section 5), and a value ε ∈ (0,1]. It also (implicitly) depends on the data sequence X_1, X_2, ..., and has an indirect dependence on the target function h* via the following type of interaction. The algorithm may inspect the values X_i for any initial segment of the data sequence, then select an index i ∈ N to request the label of; after selecting such an index, the algorithm receives the value h*(X_i). The algorithm may then select another index, request the label, receive the value of h* on that point, etc. This happens for a number of rounds, N(A, h*, ε, D, π), before eventually the algorithm halts and returns a classifier ĥ. An algorithm A is said to be correct if E[er(ĥ)] ≤ ε for every (ε, D, π); that is, given direct access to the prior and the data distribution, and given a specified value ε, a correct algorithm must be guaranteed to have expected error rate at most ε. Define the expected sample complexity of A for (X, C, D, π) to be the function SC(ε, D, π) = E[N(A, h*, ε, D, π)]: the expected number of label requests the algorithm makes.

We will be interested in proving that certain algorithms achieve a sample complexity SC(ε, D, π) = o(1/ε). For some (X, C, D), it is known that there are π-independent algorithms (meaning the algorithm's behavior is independent of the π argument) A such that we always have E[N(A, h*, ε, D, π) | h*] = o(1/ε); for instance, threshold classifiers have this property under any D, homogeneous linear separators have this property under a uniform D on the unit sphere in k dimensions, and intervals with positive width on X = [0,1] have this property under D = Uniform([0,1]) (see e.g., [Das05]). It is straightforward to show that any such A will also have SC(ε, D, π) = o(1/ε) for every π. In particular, the law of total expectation and the dominated convergence theorem imply

lim_{ε→0} ε · SC(ε, D, π) = lim_{ε→0} ε · E[ E[N(A, h*, ε, D, π) | h*] ] = E[ lim_{ε→0} ε · E[N(A, h*, ε, D, π) | h*] ] = 0.

In these cases, we can think of SC as a kind of average-case analysis of these algorithms. However, there are also many (X, C, D) for which no such π-independent algorithm exists achieving o(1/ε) sample complexity for all priors. For instance, this is the case for C the space of interval classifiers (including the empty interval) on X = [0,1] under D = Uniform([0,1]) (this essentially follows from a proof of [BHV10]). Thus, any general result on o(1/ε) expected sample complexity for π-dependent algorithms would signify that there is a real advantage to having access to the prior.
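To make this interaction protocol concrete, the following minimal Python sketch shows the kind of label-request interface an active learner works against; the class and method names are illustrative assumptions rather than anything defined in the paper. The counter num_requests plays the role of N(A, h*, ε, D, π), whose expectation over h* ~ π and the data defines SC(ε, D, π).

```python
import random

class LabelingOracle:
    """Hypothetical wrapper around the unlabeled sequence X_1, X_2, ... and the
    hidden target h*; it reveals unlabeled points freely but counts label requests."""

    def __init__(self, sample_x, target, seed=0):
        self.rng = random.Random(seed)
        self.sample_x = sample_x      # callable drawing one point from D
        self.target = target          # the hidden target function h*
        self.points = []              # unlabeled points revealed so far
        self.num_requests = 0         # plays the role of N(A, h*, eps, D, pi)

    def unlabeled(self, i):
        """Return X_i, extending the (effectively inexhaustible) unlabeled sequence."""
        while len(self.points) < i:
            self.points.append(self.sample_x(self.rng))
        return self.points[i - 1]

    def request_label(self, i):
        """Request the label h*(X_i); every call counts toward the sample complexity."""
        self.num_requests += 1
        return self.target(self.unlabeled(i))
```

A prior-dependent algorithm in the paper's sense additionally receives π (and, except in Section 5, D) as input before interacting with such an oracle, and must return a classifier ĥ with E[er(ĥ)] ≤ ε.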
3 An Example: Intervals

In this section, we walk through a simple and intuitive example, to illustrate how access to the prior makes a difference in the sample complexity. For simplicity, in this example (only) we will suppose the algorithm may request the label of any point in X, not just those in the sequence {X_i}; the same ideas can easily be adapted to the setting where queries are restricted to {X_i}. Specifically, consider X = [0,1], D uniform on [0,1], and the concept space of interval classifiers, C = {I±_{a,b} : 0 < a ≤ b < 1}, where I±_{a,b}(x) = +1 if x ∈ [a,b] and −1 otherwise. For each classifier h ∈ C, let w(h) = P(h(X) = +1) (the width of the interval h).

Consider an active learning algorithm that makes label requests at the locations (in sequence) 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, 3/16, ... until (case 1) it encounters an example x with h*(x) = +1 or until (case 2) the set of classifiers V ⊆ C consistent with all observed labels so far satisfies E[w(h*) | V] ≤ ε (whichever comes first). In case 2, the algorithm simply halts and returns the constant classifier that always predicts −1: call it h−; note that er(h−) = w(h*). In case 1, the algorithm enters a second phase, in which it performs a binary search (repeatedly querying the midpoint between the closest two −1 and +1 points, taking 0 and 1 as known negative points) to the left and right of the observed positive point, halting after log_2(2/ε) label requests on each side; this results in estimates of the target's endpoints up to ±ε/2, so that returning any classifier among the set V ⊆ C consistent with these labels results in error rate at most ε; in particular, if h̃ is the classifier in V returned, then E[er(h̃) | V] ≤ ε. Denoting this algorithm by A, and by ĥ the classifier it returns, we have

E[er(ĥ)] = E[ E[er(ĥ) | V] ] ≤ ε,

so that the algorithm is definitely correct.
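The two-phase procedure just described can be sketched in Python as follows. This is an illustrative reconstruction rather than the authors' code: the prior is represented as a finite list of (interval, probability) pairs so that E[w(h*) | V] can be computed exactly, and query(x) stands for a label request at the point x.

```python
import math

def dyadic_points():
    """Yield 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, ... (the phase-1 query order)."""
    k = 1
    while True:
        for j in range(1, 2 ** k, 2):
            yield j / 2 ** k
        k += 1

def interval_active_learner(query, prior, eps):
    """query(x) returns h*(x) in {-1, +1}; prior is a list of ((a, b), p) pairs
    (assumed to include the true target).  Returns an interval (a, b), or None
    for the always-negative classifier."""
    negatives, positive = {0.0, 1.0}, None     # 0 and 1 are taken as known negatives

    def consistent(a, b):                      # interval avoids every observed negative
        return all(not (a <= x <= b) for x in negatives)

    def expected_width_given_V():              # E[w(h*) | V] under the prior
        widths = [(b - a, p) for (a, b), p in prior if consistent(a, b)]
        total = sum(p for _, p in widths)
        return sum(w * p for w, p in widths) / total

    for x in dyadic_points():                  # phase 1
        if query(x) == +1:
            positive = x                       # case 1: found a positive point
            break
        negatives.add(x)
        if expected_width_given_V() <= eps:
            return None                        # case 2: predict -1 everywhere

    # phase 2 (case 1): binary search for each endpoint, log2(2/eps) queries per side
    budget = math.ceil(math.log2(2 / eps))
    lo_neg = max(x for x in negatives if x < positive)
    hi_neg = min(x for x in negatives if x > positive)
    lo_pos = hi_pos = positive
    for _ in range(budget):                    # left endpoint
        mid = (lo_neg + lo_pos) / 2
        if query(mid) == +1:
            lo_pos = mid
        else:
            lo_neg = mid
    for _ in range(budget):                    # right endpoint
        mid = (hi_pos + hi_neg) / 2
        if query(mid) == +1:
            hi_pos = mid
        else:
            hi_neg = mid
    return (lo_pos, hi_pos)                    # consistent with all requested labels
```

The first phase either drives the posterior expected width below ε (case 2) or lands a query inside the target interval (case 1); in case 1 the two binary searches then pin each endpoint to within ε/2, matching the analysis below.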

Note that case 2 will definitely be satisfied after at most 2/ε label requests, and case 1 will definitely be satisfied after at most 2/w(h*) label requests, so that the algorithm never makes more than 2/max{w(h*), ε} + 2 log_2(2/ε) label requests. In particular, for any h* with w(h*) > 0, N(A, h*, ε, D, π) = o(1/ε). Abbreviating N(h*) = N(A, h*, ε, D, π), we have

E[N(h*)] = E[N(h*) | w(h*) = 0] P(w(h*) = 0) + E[N(h*) | w(h*) > 0] P(w(h*) > 0).   (1)

Since w(h*) > 0 implies N(h*) = o(1/ε), the dominated convergence theorem implies

lim_{ε→0} ε · E[N(h*) | w(h*) > 0] = E[ lim_{ε→0} ε · N(h*) | w(h*) > 0 ] = 0,

so that the second term in (1) is o(1/ε). If P(w(h*) = 0) = 0, this completes the proof. We focus the rest of the proof on the first term in (1), in the case that P(w(h*) = 0) > 0: i.e., there is nonzero probability that the target h* labels the space almost all negative. Letting V denote the subset of C consistent with all requested labels, note that on the event w(h*) = 0, after n label requests (for n a power of 2) we have max_{h ∈ V} w(h) ≤ 1/n. Thus, for any value w_ε ∈ (0,1), after at most 2/w_ε label requests, on the event that w(h*) = 0,

E[w(h*) | V] = ∫ w(h) 1[h ∈ V] π(dh) / π(V) ≤ ∫ w(h) 1[w(h) ≤ w_ε] π(dh) / π(V) = E[w(h*) 1[w(h*) ≤ w_ε]] / π(V) ≤ E[w(h*) 1[w(h*) ≤ w_ε]] / P(w(h*) = 0).   (2)

Now note that, by the dominated convergence theorem,

lim_{w→0} E[w(h*) 1[w(h*) ≤ w]] / w = E[ lim_{w→0} w(h*) 1[w(h*) ≤ w] / w ] = 0.

Therefore, E[w(h*) 1[w(h*) ≤ w]] = o(w). If we define w_ε as the largest value of w for which E[w(h*) 1[w(h*) ≤ w]] ≤ ε P(w(h*) = 0) (or, say, half the supremum if the maximum is not achieved), then we have w_ε = ω(ε). Combined with (2), this implies E[N(h*) | w(h*) = 0] ≤ 2/w_ε = o(1/ε). Thus, all of the terms in (1) are o(1/ε), so that in total E[N(h*)] = o(1/ε). In conclusion, for this concept space C and data distribution D, we have a correct active learning algorithm achieving a sample complexity SC(ε, D, π) = o(1/ε) for all priors π on C.

4 Main Result

In this section, we present our main result: a general result stating that o(1/ε) expected sample complexity is always achievable by some correct active learning algorithm, for any (X, C, D, π) for which C has finite VC dimension. Since the known results for the sample complexity of passive learning with access to the prior are typically Θ(1/ε), and since there are known learning problems (X, C, D, π) for which every passive learning algorithm requires Ω(1/ε) samples, this o(1/ε) result for active learning represents an improvement over passive learning. Additionally, as mentioned, this type of result is often not possible for algorithms lacking access to the prior π, as there are well-known problems (X, C, D) for which no prior-independent correct algorithm (of the self-terminating type studied here) can achieve o(1/ε) sample complexity for every prior π [BHV10]; in particular, the intervals problem studied above is one such example.

First, we have a small lemma.

Lemma 1. For any sequence of functions φ_n : C → [0,∞) such that, for all f ∈ C, φ_n(f) = o(1/n) and, for all n ∈ N, φ_n(f) ≤ c/n (for an f-independent constant c ∈ (0,∞)), there exists a sequence φ̄_n in [0,∞) such that φ̄_n = o(1/n) and lim_{n→∞} P(φ_n(h*) > φ̄_n) = 0.

Proof. For any constant δ ∈ (0,∞), we have (by Markov's inequality and the dominated convergence theorem)

lim_{n→∞} P(n φ_n(h*) > δ) ≤ lim_{n→∞} (1/δ) E[n φ_n(h*)] = (1/δ) E[ lim_{n→∞} n φ_n(h*) ] = 0.

Therefore (by induction), there exists a diverging sequence n_i in N such that lim_{i→∞} sup_{n ≥ n_i} P(n φ_n(h*) > 2^{−i}) = 0. Inverting this, let i_n = max{i ∈ N : n_i ≤ n}, and define φ̄_n = (1/n) 2^{−i_n}. By construction, P(φ_n(h*) > φ̄_n) → 0. Furthermore, lim_{n→∞} i_n = ∞, so that we have

lim_{n→∞} n φ̄_n = lim_{n→∞} 2^{−i_n} = 0,

implying φ̄_n = o(1/n).
Theorem 1. For any VC class C, there is a correct active learning algorithm that, for every data distribution D and prior π, achieves expected sample complexity SC for (X, C, D, π) such that SC(ε, D, π) = o(1/ε).

Our approach to proving Theorem 1 is via a reduction to established results about active learning algorithms that are not self-verifying. Specifically, consider a slightly different type of active learning algorithm than that defined above: namely, an algorithm A_a that takes as input a budget n ∈ N on the number of label requests it is allowed to make, and that after making at most n label requests returns as output a classifier ĥ_n. Let us refer to any such algorithm as a budget-based active learning algorithm. Note that budget-based active learning algorithms are prior-independent (have no direct access to the prior). The following result was proven by [Han09] (see also the related earlier work of [BHV10]).

Lemma 2. [Han09] For any VC class C, there exists a constant c ∈ (0,∞), a function R(n; f, D), and a (prior-independent) budget-based active learning algorithm A_a such that, for all D and all f ∈ C, R(n; f, D) ≤ c/n and R(n; f, D) = o(1/n), and E[er(ĥ_n) | h*] ≤ R(n; h*, D) (always), where ĥ_n is the classifier returned by A_a. (Furthermore, it is not difficult to see that we can take this R to be measurable in the h* argument.)

That is, equivalently, for any fixed value for the target function, the expected error rate is o(1/n), where the random variable in the expectation is only the data sequence X_1, X_2, .... Our task in the proof of Theorem 1 is to convert such a budget-based algorithm into one of the form defined in Section 1: that is, a self-terminating prior-dependent algorithm, taking ε as input.

Proof of Theorem 1. Consider A_a, ĥ_n, R, and c as in Lemma 2, and define

n_ε = min{ n ∈ N : E[er(ĥ_n)] ≤ ε }.

This value is accessible based purely on access to π and D. Furthermore, we clearly have (by construction) E[er(ĥ_{n_ε})] ≤ ε. Thus, denoting by A'_a the active learning algorithm, taking (D, π, ε) as input, which runs A_a(n_ε) and then returns ĥ_{n_ε}, we have that A'_a is a correct algorithm (i.e., its expected error rate is at most ε).

As for the expected sample complexity SC(ε, D, π) achieved by A'_a, we have SC(ε, D, π) ≤ n_ε, so that it remains only to bound n_ε. By Lemma 1, there is a π-dependent function R̄(n; π, D) such that, for every π,

π({f ∈ C : R(n; f, D) > R̄(n; π, D)}) → 0 (as n → ∞) and R̄(n; π, D) = o(1/n).

Therefore, by the law of total expectation,

E[er(ĥ_n)] = E[ E[er(ĥ_n) | h*] ] ≤ E[R(n; h*, D)] ≤ (c/n) π({f ∈ C : R(n; f, D) > R̄(n; π, D)}) + R̄(n; π, D) = o(1/n).

If n_ε = O(1), then clearly n_ε = o(1/ε) as needed. Otherwise, since n_ε is monotonic in ε, we must have n_ε → ∞ as ε → 0. In particular, in this latter case we have

lim sup_{ε→0} ε n_ε ≤ lim sup_{ε→0} ε (1 + max{n ≥ n_ε − 1 : E[er(ĥ_n)]/ε > 1}) ≤ lim sup_{ε→0} (ε + sup_{n ≥ n_ε − 1} n E[er(ĥ_n)]) = 0,

so that n_ε = o(1/ε), as required.

5 Dependence on D in the Learning Algorithm

The dependence on D in the algorithm described in the proof is fairly weak, and we can eliminate any direct dependence on D by replacing E[er(ĥ_n)] by a 1 − ε/2 confidence upper bound based on m_ε = Ω((1/ε) log(1/ε)) i.i.d. unlabeled examples X'_1, X'_2, ..., X'_{m_ε} independent from the examples used by the algorithm: for instance, set aside in a pre-processing step, where the bound is derived based on Hoeffding's inequality and a union bound over the values of n that we check, of which there are at most O(1/ε). Then we simply increase the value of n (starting at some constant, such as 1) until

(1/m_ε) Σ_{i=1}^{m_ε} P(h*(X'_i) ≠ ĥ_n(X'_i) | {X_j}_j, {X'_j}_j) ≤ ε/2.

The expected value of the smallest value of n for which this occurs is o(1/ε).
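The reduction in the proof of Theorem 1, combined with the D-free variant just described, amounts to a simple wrapper around the budget-based learner: increase the budget n until an estimate (or exact computation, when π and D are available) of E[er(ĥ_n)] falls below the target, then run A_a with that budget. A minimal sketch, with hypothetical callables standing in for pieces the paper leaves abstract:

```python
def self_terminating_learner(run_budget_learner, estimated_expected_error, eps):
    """Sketch of the reduction behind Theorem 1 (names are illustrative).

    estimated_expected_error(n): the exact value of E[er(h_n)] when pi and D are
        available, or a high-confidence upper bound on it as in Section 5.
    run_budget_learner(n): runs the budget-based algorithm A_a with label budget n
        and returns its classifier h_n.
    """
    n = 1
    # n_eps = min{ n : E[er(h_n)] <= eps }; the proof shows n_eps = o(1/eps),
    # and the values of n checked here number at most O(1/eps).
    while estimated_expected_error(n) > eps:
        n += 1
    return run_budget_learner(n)   # expected error at most eps, so the wrapper is correct
```

In the D-free variant, estimated_expected_error(n) would be the empirical bound of the preceding display, compared against ε/2 rather than ε.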
Note that the probability only requires access to the prior π, not the data distribution D (the budget-based algorithm A_a of [Han09] has no direct dependence on D); if desired for computational efficiency, this probability may also be estimated by a 1 − ε/4 confidence upper bound based on Ω((1/ε²) log(1/ε)) independent samples of h* values with distribution π, where for each sample we simulate the execution of A_a(n) for that (simulated) target function in order to obtain the returned classifier. In particular, note that no actual label requests to the oracle are required during this process of estimating the appropriate label budget n_ε, as all executions of A_a are simulated.
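The simulation-based estimate described above can be sketched as follows: draw target functions from the prior, simulate the budget-based learner against each simulated target (so no real label requests are spent), and average the disagreement with that target over held-out unlabeled points. All names are illustrative assumptions, and the sample sizes stand in for the Ω((1/ε) log(1/ε)) and Ω((1/ε²) log(1/ε)) quantities above.

```python
import random

def estimate_expected_error(n, sample_prior, simulate_budget_learner,
                            holdout_points, num_prior_samples, rng=None):
    """Monte-Carlo estimate of E[er(h_n)] for label budget n, using only the prior
    and unlabeled data (an illustrative sketch, not the paper's exact procedure).

    sample_prior(rng)             -> a classifier h drawn from pi
    simulate_budget_learner(n, h) -> h_n, the output of A_a(n) when the simulated
                                     target is h (no real oracle calls are made)
    holdout_points                -> unlabeled points X'_1, ..., X'_m drawn from D
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_prior_samples):
        h = sample_prior(rng)                     # simulated target h ~ pi
        h_n = simulate_budget_learner(n, h)       # simulate A_a(n) on this target
        disagree = sum(1 for x in holdout_points if h_n(x) != h(x))
        total += disagree / len(holdout_points)   # empirical error of h_n against h
    return total / num_prior_samples              # estimate of E[er(h_n)]
```

A confidence term (e.g., from Hoeffding's inequality) would be added to this estimate before comparing it with ε/2, as in the preceding section.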

6 Inherent Dependence on π in the Sample Complexity

We have shown that for every prior π, the sample complexity is bounded by a function that is o(1/ε). One might wonder whether it is possible that the asymptotic dependence on ε in the sample complexity can be prior-independent, while still being o(1/ε). That is, we can ask whether there exists a (π-independent) function s(ε) = o(1/ε) such that, for all priors π, there is a correct π-dependent algorithm achieving a sample complexity SC(ε, D, π) = O(s(ε)), possibly involving π-dependent constants. Certainly in some cases, such as threshold classifiers, this is true. However, it seems this is not generally the case, and in particular it fails to hold for the space of interval classifiers.

For instance, consider a prior π on the space C of interval classifiers, constructed as follows. We are given an arbitrary monotonic g(ε) = o(1/ε); since g(ε) = o(1/ε), there must exist (nonzero) functions q_1(i) and q_2(i) such that lim_{i→∞} q_1(i) = lim_{i→∞} q_2(i) = 0 and, for all i ∈ N, g(q_1(i)/2^{i+1}) ≤ q_2(i) 2^i; furthermore, letting q(i) = max{q_1(i), q_2(i)}, by monotonicity of g we also have, for all i ∈ N, g(q(i)/2^{i+1}) ≤ q(i) 2^i, and lim_{i→∞} q(i) = 0. Then define a function p(i) with Σ_{i∈N} p(i) = 1 such that p(i) ≥ q(i) for infinitely many i ∈ N; for instance, this can be done inductively as follows. Let α_0 = 1/2; for each i ∈ N, if q(i) > α_{i−1}, set p(i) = 0 and α_i = α_{i−1}; otherwise, set p(i) = α_{i−1} and α_i = α_{i−1}/2. Finally, for each i ∈ N, and each j ∈ {0, 1, ..., 2^i − 1}, define π({I±_{j 2^{−i}, (j+1) 2^{−i}}}) = p(i)/2^i.

We let D be uniform on X = [0,1]. Then for each i ∈ N such that p(i) ≥ q(i), there is a p(i) probability that the target interval has width 2^{−i}, and given this, any algorithm requires an expected 2^i label requests to determine which of these 2^i intervals is the target, failing which the error rate is at least 2^{−i}. In particular, letting ε_i = p(i)/2^{i+1}, any correct algorithm has sample complexity at least p(i) 2^i for ε = ε_i. Noting p(i) 2^i ≥ q(i) 2^i ≥ g(q(i)/2^{i+1}) ≥ g(ε_i), this implies there exist arbitrarily small values of ε > 0 for which the optimal sample complexity is at least g(ε), so that the sample complexity is not o(g(ε)).

For any s(ε) = o(1/ε), there exists a monotonic g(ε) = o(1/ε) such that s(ε) = o(g(ε)). Thus, constructing π as above for this g, we have that the sample complexity is not o(g(ε)), and therefore not O(s(ε)). So at least for the space of interval classifiers, the specific o(1/ε) asymptotic dependence on ε is inherently π-dependent.
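The inductive definition of p(i) from q(i) is itself a small procedure, sketched below; q is assumed to be any function decreasing to 0, and exact rational arithmetic is used only to keep the halving of α transparent.

```python
from fractions import Fraction

def construct_p(q, num_terms):
    """Inductive construction from Section 6: p(i) = 0 whenever q(i) > alpha_{i-1};
    otherwise p(i) = alpha_{i-1} (so p(i) >= q(i)) and alpha is halved.
    Returns [p(1), ..., p(num_terms)]."""
    alpha = Fraction(1, 2)                 # alpha_0 = 1/2
    p = []
    for i in range(1, num_terms + 1):
        if q(i) > alpha:
            p.append(Fraction(0))
        else:
            p.append(alpha)                # p(i) = alpha_{i-1} >= q(i)
            alpha = alpha / 2
    return p

# Example: q(i) = 1/i decreases to 0, so p(i) >= q(i) at infinitely many i,
# and the nonzero values 1/2, 1/4, 1/8, ... sum to (at most) 1.
print(construct_p(lambda i: Fraction(1, i), 10))
```

The prior π then puts mass p(i)/2^i on each of the 2^i dyadic intervals of width 2^{−i}, which is exactly the assignment used in the lower-bound argument above.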
References

[ADD00] R. B. Ash and C. A. Doléans-Dade. Probability & Measure Theory. Academic Press, 2000.

[BBL09] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78-89, 2009.

[BBZ07] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Proceedings of the 20th Conference on Learning Theory, 2007.

[BDL09] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning, 2009.

[BHV10] M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3), September 2010.

[CN08] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5), July 2008.

[Das04] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems. MIT Press, 2004.

[Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In Proc. of Neural Information Processing Systems (NIPS), 2005.

[DHM07] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.

[DKM09] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281-299, 2009.

[Fri09] E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory, 2009.

[FSST97] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 1997.

[Han07a] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proc. of the 24th International Conference on Machine Learning, 2007.

[Han07b] S. Hanneke. Teaching dimension and the complexity of active learning. In Proc. of the 20th Annual Conference on Learning Theory (COLT), 2007.

[Han09] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.

[Han11] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1), 2011.

[HKS92] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Machine Learning. Morgan Kaufmann, 1992.

[Kää06] M. Kääriäinen. Active learning in the non-realizable case. In Proc. of the 17th International Conference on Algorithmic Learning Theory, 2006.

[Kol10] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, to appear, 2010.

[Now08] R. D. Nowak. Generalized binary search. In Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing, 2008.

[Wan09] L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems, 2009.
