Agnostic Active Learning Without Constraints


Alina Beygelzimer (IBM Research, Hawthorne, NY), John Langford (Yahoo! Research, New York, NY), Daniel Hsu (Rutgers University & University of Pennsylvania), Tong Zhang (Rutgers University, Piscataway, NJ)

Abstract

We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches, where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

1 Introduction

In active learning, a learner is given access to unlabeled data and is allowed to adaptively choose which ones to label. This learning model is motivated by applications in which the cost of labeling data is high relative to that of collecting the unlabeled data itself. Therefore, the hope is that the active learner only needs to query the labels of a small number of the unlabeled data, and otherwise performs as well as a fully supervised learner. In this work, we are interested in agnostic active learning algorithms for binary classification that are provably consistent, i.e., that converge to an optimal hypothesis in a given hypothesis class. One technique that has proved theoretically profitable is to maintain a candidate set of hypotheses (sometimes called a version space), and to query the label of a point only if there is disagreement within this set about how to label the point. The criterion for membership in this candidate set needs to be carefully defined so that an optimal hypothesis is always included, but otherwise this set can be quickly whittled down as more labels are queried.
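As a concrete, deliberately simplified illustration of the version space technique just described (and not of this paper's algorithm, which avoids version spaces), the following sketch runs selective sampling in a noise-free setting over a finite class of one-dimensional threshold rules; a label is queried only when the surviving candidates disagree. The class, data distribution, and all names are illustrative assumptions:

```python
import random

# Minimal sketch of version-space selective sampling in the noise-free
# setting (the approach this paper moves away from). Hypotheses are
# thresholds h_t(x) = +1 iff x >= t; a label is queried only when the
# current version space disagrees on the point. Entirely illustrative.

random.seed(0)
TRUE_T = 0.6                                   # target threshold (realizable case)
thresholds = [i / 100 for i in range(101)]     # finite hypothesis class
version_space = set(thresholds)                # hypotheses consistent so far

queries = 0
for _ in range(200):
    x = random.random()
    preds = {+1 if x >= t else -1 for t in version_space}
    if len(preds) > 1:                         # disagreement: query the label
        queries += 1
        y = +1 if x >= TRUE_T else -1          # noise-free oracle
        version_space = {t for t in version_space
                         if (+1 if x >= t else -1) == y}

print(queries, len(version_space))
```

In this noise-free toy case the version space shrinks rapidly, so only a small fraction of the 200 points are queried; the brittleness discussed next arises because a single inconsistent label would eliminate the optimal threshold from the candidate set permanently.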
This technique is perhaps most readily understood in the noise-free setting [1, 2], and it can be extended to noisy settings by using empirical confidence bounds [3, 4, 5, 6, 7]. The version space approach unfortunately has its share of significant drawbacks. The first is computational intractability: maintaining a version space and guaranteeing that only hypotheses from this set are returned is difficult for linear predictors, and appears intractable for interesting nonlinear predictors such as neural nets and decision trees [1]. Another drawback of the approach is its brittleness: a single mishap due to, say, modeling failures or computational approximations might cause the learner to exclude the best hypothesis from the version space forever; this is an ungraceful failure mode that is not easy to correct. A third drawback is related to sample re-usability: if labeled data is collected using a version space-based active learning algorithm, and we later decide to use a different algorithm or hypothesis class, then the earlier data may not be freely re-used because its collection process is inherently biased.

Here, we develop a new strategy addressing all of the above problems, given an oracle that returns an empirical risk minimizing (ERM) hypothesis. As this oracle matches our abstraction of many supervised learning algorithms, we believe active learning algorithms built in this way are immediately and widely applicable. Our approach instantiates the importance weighted active learning framework of [5] using a rejection threshold similar to the algorithm of [4], which only accesses hypotheses via a supervised learning oracle. However, the oracle we require is simpler and avoids strict adherence to a candidate set of hypotheses. Moreover, our algorithm creates an importance weighted sample that allows for unbiased risk estimation, even for hypotheses from a class different from the one employed by the active learner. This is in sharp contrast to many previous algorithms (e.g., [1, 3, 8, 4, 6, 7]) that create heavily biased data sets. We prove that our algorithm is always consistent and has an improved label complexity over passive learning in cases previously studied in the literature. We also describe a practical instantiation of our algorithm and report on some experimental results.

1.1 Related Work

As already mentioned, our work is closely related to the previous works of [4] and [5], both of which in turn draw heavily on the work of [1] and [3]. The algorithm from [4] extends the selective sampling method of [1] to the agnostic setting using generalization bounds, in a manner similar to that first suggested in [3]. It accesses hypotheses only through a special ERM oracle that can enforce an arbitrary number of example-based constraints; these constraints define a version space, and the algorithm only ever returns hypotheses from this space, which can be undesirable, as we previously argued. Other previous algorithms with comparable performance guarantees also require similar example-based constraints (e.g., [3, 5, 6, 7]).
Our algorithm differs from these in that (i) it never restricts its attention to a version space when selecting a hypothesis to return, and (ii) it only requires an ERM oracle that enforces at most one example-based constraint, and this constraint is only used for selective sampling. Our label complexity bounds are comparable to those proved in [5], though somewhat worse than those in [3, 4, 6, 7]. The use of importance weights to correct for sampling bias is a standard technique for many machine learning problems (e.g., [9, 10, 11]), including active learning [12, 13, 5]. Our algorithm is based on the importance weighted active learning (IWAL) framework introduced by [5]. In that work, a rejection threshold procedure called loss-weighting is rigorously analyzed and shown to yield improved label complexity bounds in certain cases. Loss-weighting is more general than our technique in that it extends beyond zero-one loss to a certain subclass of loss functions such as logistic loss. On the other hand, the loss-weighting rejection threshold requires optimizing over a restricted version space, which is computationally undesirable. Moreover, the label complexity bound given in [5] only applies to hypotheses selected from this version space, and not when selected from the entire hypothesis class as the general IWAL framework suggests. We avoid these deficiencies using a new rejection threshold procedure and a more subtle martingale analysis. Many of the previously mentioned algorithms are analyzed in the agnostic learning model, where no assumption is made about the noise distribution (see also [4]). In this setting, the label complexity of active learning algorithms cannot generally improve over supervised learners by more than a constant factor [15, 5].
However, under a parameterization of the noise distribution related to Tsybakov's low-noise condition [16], active learning algorithms have been shown to have improved label complexity bounds over what is achievable in the purely agnostic setting [17, 18, 8, 6, 7]. We also consider this parameterization to obtain a tighter label complexity analysis.

2 Preliminaries

2.1 Learning Model

Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y} = \{\pm 1\}$ are the labels. Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a pair of random variables with joint distribution $D$. An active learner receives a sequence $(X_1, Y_1), (X_2, Y_2), \ldots$ of i.i.d. copies of $(X, Y)$, with the label $Y_i$ hidden unless it is explicitly queried. We use the shorthand $a_{1:k}$ to denote a sequence $(a_1, a_2, \ldots, a_k)$ (so $k = 0$ corresponds to the empty sequence).
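To make the definitions of $\operatorname{err}(h)$ and $h^*$ concrete, here is a small synthetic instance of this learning model (purely illustrative; the distribution, the threshold class, and all names are assumptions, not from the paper): a finite class of threshold rules, labels flipped with probability 0.1, and a Monte Carlo estimate of each hypothesis's error used to locate the minimizer.

```python
import random

# Toy instance of the agnostic learning model of Section 2.1 (illustrative):
# X = [0, 1), Y = {-1, +1}, and H a small finite class of threshold rules.
# We estimate err(h) = Pr(h(X) != Y) by Monte Carlo and pick h* by enumeration.

random.seed(1)

def draw():                           # one sample (X, Y) from a noisy D
    x = random.random()
    y = +1 if x >= 0.5 else -1
    if random.random() < 0.1:         # 10% label noise: the agnostic setting
        y = -y
    return x, y

H = [lambda x, t=t: +1 if x >= t else -1 for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
sample = [draw() for _ in range(20000)]
errs = [sum(h(x) != y for x, y in sample) / len(sample) for h in H]
h_star = min(range(len(H)), key=lambda i: errs[i])
print(h_star, errs[h_star])
```

With labels flipped 10% of the time, no hypothesis can do better than $\operatorname{err}(h^*) \approx 0.1$; the active learner's goal is to approach this error with few label queries.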

Let $H$ be a set of hypotheses mapping from $\mathcal{X}$ to $\mathcal{Y}$. For simplicity, we assume $H$ is finite but does not completely agree on any single $x \in \mathcal{X}$ (i.e., for every $x \in \mathcal{X}$, there exist $h, h' \in H$ such that $h(x) \neq h'(x)$). This keeps the focus on the relevant aspects of active learning that differ from passive learning. The error of a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ is $\operatorname{err}(h) := \Pr(h(X) \neq Y)$. Let $h^* := \arg\min\{\operatorname{err}(h) : h \in H\}$ be a hypothesis of minimum error in $H$. The goal of the active learner is to return a hypothesis $h \in H$ with error $\operatorname{err}(h)$ not much more than $\operatorname{err}(h^*)$, using as few label queries as possible.

2.2 Importance Weighted Active Learning

In the importance weighted active learning (IWAL) framework of [5], an active learner looks at the unlabeled data $X_1, X_2, \ldots$ one at a time. After each new point $X_i$, the learner determines a probability $P_i \in [0, 1]$. Then a coin with bias $P_i$ is flipped, and the label $Y_i$ is queried if and only if the coin comes up heads. The query probability $P_i$ can depend on all previous unlabeled examples $X_{1:i-1}$, any previously queried labels, any past coin flips, and the current unlabeled point $X_i$. Formally, an IWAL algorithm specifies a rejection threshold function $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to [0, 1]$ for determining these query probabilities. Let $Q_i \in \{0, 1\}$ be a random variable conditionally independent of the current label $Y_i$ given $X_{1:i}$, $Y_{1:i-1}$, and $Q_{1:i-1}$, with conditional expectation
$$\mathbb{E}[Q_i \mid Z_{1:i-1}, X_i] = P_i := p(Z_{1:i-1}, X_i),$$
where $Z_j := (X_j, Y_j, Q_j)$. That is, $Q_i$ indicates whether the label $Y_i$ is queried (the outcome of the coin toss). Although the notation does not explicitly suggest this, the query probability $P_i = p(Z_{1:i-1}, X_i)$ is allowed to explicitly depend on a label $Y_j$ ($j < i$) if and only if it has been queried ($Q_j = 1$).

2.3 Importance Weighted Estimators

We first review some standard facts about the importance weighting technique. For a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, define the importance weighted estimator of $\mathbb{E}[f(X, Y)]$ from $Z_{1:n} \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^n$ to be
$$\hat{f}(Z_{1:n}) := \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot f(X_i, Y_i).$$
Note that this quantity depends on a label $Y_i$ only if it has been queried (i.e., only if $Q_i = 1$); it also depends on $X_i$ only if $Q_i = 1$. Our rejection threshold will be based on a specialization of this estimator, specifically the importance weighted empirical error of a hypothesis $h$:
$$\operatorname{err}(h, Z_{1:n}) := \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot \mathbf{1}[h(X_i) \neq Y_i]. \qquad (1)$$
In the notation of Algorithm 1, this is equivalent to
$$\operatorname{err}(h, S_n) := \frac{1}{n} \sum_{(X_i, Y_i, 1/P_i) \in S_n} \frac{1}{P_i} \cdot \mathbf{1}[h(X_i) \neq Y_i],$$
where $S_n \subseteq \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$ is the importance weighted sample collected by the algorithm. A basic property of these estimators is unbiasedness:
$$\mathbb{E}[\hat{f}(Z_{1:n})] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\mathbb{E}[(Q_i / P_i) f(X_i, Y_i) \mid X_{1:i}, Y_{1:i}, Q_{1:i-1}]\big] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[(P_i / P_i) f(X_i, Y_i)] = \mathbb{E}[f(X, Y)].$$
So, for example, the importance weighted empirical error of a hypothesis $h$ is an unbiased estimator of its true error $\operatorname{err}(h)$. This holds for any choice of the rejection threshold that guarantees $P_i > 0$.

3 A Deviation Bound for Importance Weighted Estimators

As mentioned before, the rejection threshold used by our algorithm is based on importance weighted error estimates $\operatorname{err}(h, Z_{1:n})$. Even though these estimates are unbiased, they are only reliable when

the variance is not too large. To get a handle on this, we need a deviation bound for importance weighted estimators. This is complicated by two factors that rule out straightforward applications of some standard bounds:

1. The importance weighted samples $(X_i, Y_i, 1/P_i)$ (or equivalently, the $Z_i = (X_i, Y_i, Q_i)$) are not i.i.d. This is because the query probability $P_i$ (and thus the importance weight $1/P_i$) generally depends on $Z_{1:i-1}$ and $X_i$.

2. The effective range and variance of each term in the estimator are, themselves, random variables.

To address these issues, we develop a deviation bound using a martingale technique from [19]. Let $f : \mathcal{X} \times \mathcal{Y} \to [-1, 1]$ be a bounded function. Consider any rejection threshold function $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to (0, 1]$ for which $P_n = p(Z_{1:n-1}, X_n)$ is bounded below by some positive quantity (which may depend on $n$). Equivalently, the query probabilities $P_n$ should have inverses $1/P_n$ bounded above by some deterministic quantity $r_{\max}$ (which, again, may depend on $n$). The a priori upper bound $r_{\max}$ on $1/P_n$ can be pessimistic, as the dependence on $r_{\max}$ in the final deviation bound will be very mild (it enters in as $\log \log r_{\max}$). Our goal is to prove a bound on $|\hat{f}(Z_{1:n}) - \mathbb{E}[f(X, Y)]|$ that holds with high probability over the joint distribution of $Z_{1:n}$. To start, we establish bounds on the range and variance of each term $W_i := (Q_i / P_i) f(X_i, Y_i)$ in the estimator, conditioned on $(X_{1:i-1}, Y_{1:i-1}, Q_{1:i-1})$. Let $\mathbb{E}_i[\,\cdot\,]$ denote $\mathbb{E}[\,\cdot \mid X_{1:i}, Y_{1:i}, Q_{1:i-1}]$. Note that $\mathbb{E}_i[W_i] = (\mathbb{E}_i[Q_i] / P_i) f(X_i, Y_i) = f(X_i, Y_i)$, so if $\mathbb{E}_i[W_i] = 0$, then $W_i = 0$. Therefore, the conditional range and variance are non-zero only if $\mathbb{E}_i[W_i] \neq 0$. For the range, we have $|W_i| = (Q_i / P_i) |f(X_i, Y_i)| \leq 1/P_i$; and for the variance, $\mathbb{E}_i[(W_i - \mathbb{E}_i[W_i])^2] \leq (\mathbb{E}_i[Q_i^2] / P_i^2) f(X_i, Y_i)^2 \leq 1/P_i$. These range and variance bounds indicate the form of the deviations we can expect, similar to that of other classical deviation bounds.

Theorem 1. Pick any $t \geq 0$ and $n \geq 1$. Assume $1/P_i \leq r_{\max}$ for all $1 \leq i \leq n$, and let $R_n := 1 / \min(\{P_i : 1 \leq i \leq n \wedge f(X_i, Y_i) \neq 0\} \cup \{1\})$.
With probability at least $1 - 2(3 + \log_2 r_{\max}) e^{-t/2}$,
$$\left| \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} f(X_i, Y_i) - \mathbb{E}[f(X, Y)] \right| \leq \sqrt{\frac{2 R_n t}{n}} + \sqrt{\frac{2t}{n}} + \frac{R_n t}{3n}.$$
We defer all proofs to the appendices.

4 Algorithm

First, we state a deviation bound for the importance weighted error of hypotheses in a finite hypothesis class $H$ that holds for all $n$. It is a simple consequence of Theorem 1 and union bounds; the form of the bound motivates certain algorithmic choices to be described below.

Lemma 1. Pick any $\delta \in (0, 1)$. For all $n \geq 1$, let
$$\varepsilon_n := \frac{16 \log\big(2 (3 + n \log_2 n) \, n (n+1) \, |H| / \delta\big)}{n} = O\left(\frac{\log(n |H| / \delta)}{n}\right). \qquad (3)$$
Let $Z_1, Z_2, \ldots \in \mathcal{X} \times \mathcal{Y} \times \{0,1\}$ be the sequence of random variables specified in Section 2.2, using a rejection threshold $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to [0, 1]$ that satisfies $p(z_{1:n-1}, x) \geq 1/n^n$ for all $(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-1} \times \mathcal{X}$ and all $n \geq 1$. The following holds with probability at least $1 - \delta$: for all $n \geq 1$ and all $h \in H$,
$$\left| \big(\operatorname{err}(h, Z_{1:n}) - \operatorname{err}(h^*, Z_{1:n})\big) - \big(\operatorname{err}(h) - \operatorname{err}(h^*)\big) \right| \leq \sqrt{\frac{\varepsilon_n}{P_{\min,n}(h)}} + \frac{\varepsilon_n}{P_{\min,n}(h)}, \qquad (4)$$
where $P_{\min,n}(h) := \min(\{P_i : 1 \leq i \leq n \wedge h(X_i) \neq h^*(X_i)\} \cup \{1\})$.

We let $C_0 = O(\log(|H| / \delta)) \geq 2$ be a quantity such that $\varepsilon_n$ as defined in Eq. (3) is bounded as $\varepsilon_n \leq C_0 \log(n+1)/n$. The following absolute constants are used in the description of the rejection

Algorithm 1

Notes: see Eq. (1) for the definition of $\operatorname{err}$ (importance weighted error), and Section 4 for the definitions of $C_0$, $c_1$, and $c_2$.

Initialize: $S_0 := \emptyset$.

For $k = 1, 2, \ldots, n$:

1. Obtain unlabeled data point $X_k$.

2. Let $h_k := \arg\min\{\operatorname{err}(h, S_{k-1}) : h \in H\}$ and $h'_k := \arg\min\{\operatorname{err}(h, S_{k-1}) : h \in H \wedge h(X_k) \neq h_k(X_k)\}$. Let $G_k := \operatorname{err}(h'_k, S_{k-1}) - \operatorname{err}(h_k, S_{k-1})$, and
$$P_k := \begin{cases} 1 & \text{if } G_k \leq \sqrt{C_0 \log k / (k-1)} + C_0 \log k / (k-1) \\ s & \text{otherwise,} \end{cases}$$
where $s \in (0, 1)$ is the positive solution to the equation
$$G_k = \left( \frac{c_1}{\sqrt{s}} - c_1 + 1 \right) \sqrt{\frac{C_0 \log k}{k-1}} + \left( \frac{c_2}{s} - c_2 + 1 \right) \frac{C_0 \log k}{k-1}. \qquad (2)$$

3. Toss a biased coin with $\Pr(\text{heads}) = P_k$. If heads, then query $Y_k$ and let $S_k := S_{k-1} \cup \{(X_k, Y_k, 1/P_k)\}$; else, let $S_k := S_{k-1}$.

Return: $h_{n+1} := \arg\min\{\operatorname{err}(h, S_n) : h \in H\}$.

Figure 1: Algorithm for importance weighted active learning with an error minimization oracle.

threshold and the subsequent analysis: $c_2 := 5$, $c_3 := \big((c_1 + \sqrt{2})/(c_1 - \sqrt{2})\big)^2$, $c_4 := (c_1 + \sqrt{c_3})^2$, and $c_5 := c_2 + c_3$. Our proposed algorithm is shown in Figure 1. The rejection threshold (Step 2) is based on the deviation bound from Lemma 1. First, the importance weighted error minimizing hypothesis $h_k$ and the alternative hypothesis $h'_k$ are found. Note that both optimizations are over the entire hypothesis class $H$, with $h'_k$ only being required to disagree with $h_k$ on $X_k$; this is a key aspect where our algorithm differs from previous approaches. The difference in importance weighted errors $G_k$ of the two hypotheses is then computed. If $G_k \leq \sqrt{C_0 \log k / (k-1)} + C_0 \log k / (k-1)$, then the query probability $P_k$ is set to $1$. Otherwise, $P_k$ is set to the positive solution $s$ to the quadratic equation in Eq. (2). The functional form of $P_k$ is roughly $\min\{1, (1/G_k^2 + 1/G_k) \cdot C_0 \log k / k\}$. It can be checked that $P_k \in (0, 1]$ and that $P_k$ is non-increasing with $G_k$. It is also useful to note that $\log k / (k-1)$ is monotonically decreasing with $k$ (we use the convention $\log 1 / 0 = \infty$). In order to apply Lemma 1 with our rejection threshold, we need to establish the very crude bound $P_k \geq 1/k^k$ for all $k$.

Lemma 2.
The rejection threshold of Algorithm 1 satisfies $p(z_{1:n-1}, x) \geq 1/n^n$ for all $n \geq 1$ and all $(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-1} \times \mathcal{X}$.

Note that this is a worst-case bound; our analysis shows that the probabilities $P_k$ are more like $1/\operatorname{poly}(k)$ in the typical case.

5 Analysis

5.1 Correctness

We first prove a consistency guarantee for Algorithm 1 that bounds the generalization error of the importance weighted empirical error minimizer. The proof actually establishes a lower bound on

the query probabilities, $P_i \geq 1/2$ for $X_i$ such that $h_n(X_i) \neq h^*(X_i)$. This offers an intuitive characterization of the weighting landscape induced by the importance weights $1/P_i$.

Theorem 2. The following holds with probability at least $1 - \delta$. For any $n \geq 1$,
$$0 \leq \operatorname{err}(h_n) - \operatorname{err}(h^*) \leq \operatorname{err}(h_n, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1}.$$
This implies, for all $n \geq 1$,
$$\operatorname{err}(h_n) \leq \operatorname{err}(h^*) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1}.$$
Therefore, the final hypothesis returned by Algorithm 1 after seeing $n$ unlabeled data has roughly the same error bound as a hypothesis returned by a standard passive learner with $n$ labeled data. A variant of this result under certain noise conditions is given in the appendix.

5.2 Label Complexity Analysis

We now bound the number of labels requested by Algorithm 1 after $n$ iterations. The following lemma bounds the probability of querying the label $Y_n$; this is subsequently used to establish the final bound on the expected number of labels queried. The key to the proof is in relating empirical error differences (and their deviations) to the probability of querying a label. This is mediated through the disagreement coefficient, a quantity first used by [14] for analyzing the label complexity of the $A^2$ algorithm of [3]. The disagreement coefficient $\theta := \theta(h^*, H, D)$ is defined as
$$\theta(h^*, H, D) := \sup\left\{ \frac{\Pr(X \in \operatorname{DIS}(h^*, r))}{r} : r > 0 \right\},$$
where
$$\operatorname{DIS}(h^*, r) := \{x \in \mathcal{X} : \exists h \in H \text{ such that } \Pr(h^*(X) \neq h(X)) \leq r \text{ and } h^*(x) \neq h(x)\}$$
(the disagreement region around $h^*$ at radius $r$). This quantity is bounded for many learning problems studied in the literature; see [14, 6, 20, 21] for more discussion. Note that the supremum can instead be taken over $r > \epsilon$ if the target excess error is $\epsilon$, which allows for a more detailed analysis.

Lemma 3. Assume the bounds from Eq. (4) hold for all $h \in H$ and $n \geq 1$. For any $n \geq 1$,
$$\mathbb{E}[Q_n] \leq \theta \cdot 2 \operatorname{err}(h^*) + O\left( \theta \cdot \sqrt{\frac{C_0 \log n}{n-1}} + \theta \cdot \frac{C_0 \log^2 n}{n-1} \right).$$

Theorem 3. With probability at least $1 - \delta$, the expected number of labels queried by Algorithm 1 after $n$ iterations is at most
$$1 + \theta \cdot 2 \operatorname{err}(h^*) \cdot n + O\left( \theta \cdot \sqrt{C_0 \, n \log n} + \theta \cdot C_0 \log^3 n \right).$$
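The behavior described by Theorems 2 and 3 can be watched on synthetic data with a simplified, hypothetical rendering of Algorithm 1 (not the exact algorithm): ERM over a tiny finite threshold class stands in for the supervised learning oracle, and the query probability uses the rough functional form $\min\{1, (1/G_k^2 + 1/G_k) \cdot C_0 \log k/(k-1)\}$ noted in Section 4 instead of solving Eq. (2) exactly. The class, data distribution, and parameter values are assumptions for illustration only.

```python
import math
import random

# Simplified sketch of Algorithm 1 (illustrative): thresholds on [0, 1)
# play the role of H, ERM is done by enumeration, and the rejection
# threshold uses the approximate form of P_k rather than Eq. (2).

random.seed(2)
C0, n = 1.0, 1500
H = [0.1, 0.3, 0.5, 0.7, 0.9]                  # hypothetical finite class

def predict(t, x):
    return +1 if x >= t else -1

def iw_err(t, S, k):                           # importance weighted error, Eq. (1)
    if k <= 0:
        return 0.0
    return sum(w * (predict(t, x) != y) for x, y, w in S) / k

def draw():                                    # synthetic D with 5% label noise
    x = random.random()
    y = +1 if x >= 0.5 else -1
    return (x, -y) if random.random() < 0.05 else (x, y)

S, queries = [], 0
for k in range(1, n + 1):
    x, true_y = draw()
    errs = {t: iw_err(t, S, k - 1) for t in H}
    h_k = min(H, key=lambda t: errs[t])
    alts = [t for t in H if predict(t, x) != predict(h_k, x)]
    if not alts:                               # all of H agrees on x; the paper
        continue                               # assumes this never happens
    h_alt = min(alts, key=lambda t: errs[t])
    G = errs[h_alt] - errs[h_k]
    eps = C0 * math.log(k + 1) / max(k - 1, 1)
    P = 1.0 if G <= math.sqrt(eps) + eps else min(1.0, eps / G**2 + eps / G)
    if random.random() < P:                    # biased coin: query, store weight 1/P_k
        queries += 1
        S.append((x, true_y, 1.0 / P))

h_final = min(H, key=lambda t: iw_err(t, S, n))
print(h_final, queries, n)
```

Early rounds query every label (the deviation term dominates $G_k$); once the empirical gap between $h_k$ and its forced alternative is statistically significant, the query probability decays, so the final hypothesis matches the passive ERM choice while using noticeably fewer labels.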
The bound is dominated by a linear term scaled by $\operatorname{err}(h^*)$, plus a sublinear term. The linear term $\operatorname{err}(h^*) \cdot n$ is unavoidable in the worst case, as evident from label complexity lower bounds [15, 5]. When $\operatorname{err}(h^*)$ is negligible (e.g., the data is separable) and $\theta$ is bounded (as is the case for many problems studied in the literature [14]), then the bound represents a polynomial label complexity improvement over supervised learning, similar to that achieved by the version space algorithm from [5].

5.3 Analysis under Low Noise Conditions

Some recent work on active learning has focused on improved label complexity under certain noise conditions [17, 18, 8, 6, 7]. Specifically, it is assumed that there exist constants $\kappa > 0$ and $0 < \alpha \leq 1$ such that
$$\Pr(h(X) \neq h^*(X)) \leq \kappa \cdot \big(\operatorname{err}(h) - \operatorname{err}(h^*)\big)^{\alpha} \qquad (5)$$
for all $h \in H$. This is related to Tsybakov's low noise condition [16]. Essentially, this condition requires that low error hypotheses not be too far from the optimal hypothesis $h^*$ under the disagreement metric $\Pr(h(X) \neq h^*(X))$. Under this condition, Lemma 3 can be improved, which in turn yields the following theorem.

Theorem 4. Assume that for some value of $\kappa > 0$ and $0 < \alpha \leq 1$, the condition in Eq. (5) holds for all $h \in H$. There is a constant $c_\alpha > 0$ depending only on $\alpha$ such that the following holds. With probability at least $1 - \delta$, the expected number of labels queried by Algorithm 1 after $n$ iterations is at most
$$\theta \cdot \kappa \cdot c_\alpha \cdot (C_0 \log n)^{\alpha/2} \cdot n^{1 - \alpha/2}.$$
Note that the bound is sublinear in $n$ for all $0 < \alpha \leq 1$, which implies label complexity improvements whenever $\theta$ is bounded (an improved analogue of Theorem 2 under these conditions can be established using similar techniques). The previous algorithms of [6, 7] obtain even better rates under these noise conditions using specialized data-dependent generalization bounds, but those algorithms also require optimizations over restricted version spaces, even for the bound computation.

6 Experiments

Although agnostic learning is typically intractable in the worst case, empirical risk minimization can serve as a useful abstraction for many practical supervised learning algorithms in non-worst-case scenarios. With this in mind, we conducted a preliminary experimental evaluation of Algorithm 1, implemented using a popular algorithm for learning decision trees in place of the required ERM oracle. Specifically, we use the J48 algorithm from Weka v3.6.2 (with default parameters) to select the hypothesis $h_k$ in each round $k$; to produce the alternative hypothesis $h'_k$, we just modify the decision tree $h_k$ by changing the label of the node used for predicting on $x_k$. Both of these procedures are clearly heuristic, but they are similar in spirit to the required optimizations. We set $C_0 = 8$ and $c_1 = c_2 = 1$ (these can be regarded as tuning parameters, with $C_0$ controlling the aggressiveness of the rejection threshold). We did not perform parameter tuning with active learning, although the importance weighting approach developed here could potentially be used for that.
Rather, the goal of these experiments is to assess the compatibility of Algorithm 1 with an existing, practical supervised learning procedure.

6.1 Data Sets

We constructed two binary classification tasks using the MNIST and KDDCUP99 data sets. For MNIST, we randomly chose 4000 training 3s and 5s for training (using the 3s as the positive class), and used all of the 1902 testing 3s and 5s for testing. For KDDCUP99, we randomly chose 5000 examples for training, and another 5000 for testing. In both cases, we reduced the dimension of the data to 25 using PCA. To demonstrate the versatility of our algorithm, we also conducted a multi-class classification experiment using the entire MNIST data set (all ten digits, so 60000 training data and 10000 testing data). This required modifying how $h'_k$ is selected: we force $h'_k(x_k) \neq h_k(x_k)$ by changing the label of the prediction node for $x_k$ to the next best label. We again used PCA to reduce the dimension.

6.2 Results

We examined the test error as a function of (i) the number of unlabeled data seen, and (ii) the number of labels queried. We compared the performance of the active learner described above to a passive learner (one that queries every label, so (i) and (ii) are the same) using J48 with default parameters. In all three cases, the test errors as a function of the number of unlabeled data were roughly the same for both the active and passive learners. This agrees with the consistency guarantee from Theorem 2. We note that this is a basic property not satisfied by many active learning algorithms (this issue is discussed further in [22]). In terms of test error as a function of the number of labels queried (Figure 2), the active learner had minimal improvement over the passive learner on the binary MNIST task, but a substantial improvement over the passive learner on the KDDCUP99 task, even at small numbers of label queries. For the multi-class MNIST task, the active learner had a moderate improvement over the passive learner.
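The rough agreement between the active and passive error curves reflects the unbiasedness of the importance weighted estimates from Section 2.3, a property that is easy to sanity-check numerically (a hypothetical simulation, not one of the paper's experiments):

```python
import random

# Monte Carlo sanity check (illustrative) of the unbiasedness claim from
# Section 2.3: E[(Q_i/P_i) f(X_i, Y_i)] = E[f(X, Y)] even when the query
# probability depends on the history, provided it stays strictly positive.

random.seed(3)

def one_run(n=400):
    total, queried = 0.0, 0
    for _ in range(n):
        x, y = random.random(), random.choice([-1, +1])
        f = 1.0 if (+1 if x >= 0.5 else -1) != y else 0.0   # error indicator
        p = 1.0 if queried < 10 else 0.2                     # history-dependent P_i
        q = 1 if random.random() < p else 0
        queried += q
        total += (q / p) * f
    return total / n

est = sum(one_run() for _ in range(2000)) / 2000
print(est)
```

Since the label here is a fair coin independent of the input, every classifier has true error exactly 1/2, and the averaged importance weighted estimate concentrates around that value despite the adaptive, heavily subsampled querying.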
Note that KDDCUP99 is far less noisy (more separable) than the MNIST 3s vs. 5s task, so the results are in line with the label complexity behavior suggested by Theorem 3, which states that the label complexity improvement may scale with the error of the optimal hypothesis. Also,

Figure 2: Test errors as a function of the number of labels queried (panels: MNIST 3s vs. 5s; KDDCUP99; KDDCUP99 close-up; MNIST multi-class close-up).

the results from the MNIST tasks suggest that the active learner may require an initial random sampling phase during which it is equivalent to the passive learner, with the advantage manifesting itself after this phase. This again is consistent with the analysis (also see [14]), as the disagreement coefficient can be large at initial scales, yet much smaller as the number of unlabeled data increases and the scale becomes finer.

7 Conclusion

This paper provides a new active learning algorithm based on error minimization oracles, a departure from the version space approach adopted by previous works. The algorithm we introduce here motivates computationally tractable and effective methods for active learning with many classifier training algorithms. The overall algorithmic template applies to any training algorithm that (i) operates by approximate error minimization and (ii) for which the cost of switching a class prediction (as measured by example errors) can be estimated. Furthermore, although these properties might only hold in an approximate or heuristic sense, the created active learning algorithm will be safe, in the sense that it will eventually converge to the same solution as a passive supervised learning algorithm. Consequently, we believe this approach can be widely used to reduce the cost of labeling in situations where labeling is expensive. Recent theoretical work on active learning has focused on improving rates of convergence. However, in some applications, it may be desirable to improve performance at much smaller sample sizes, perhaps even at the cost of improved rates (as long as consistency is ensured).
Importance sampling and weighting techniques like those analyzed in this work may be useful for developing more aggressive strategies with such properties.

Acknowledgments

This work was completed while DH was at Yahoo! Research and UC San Diego.

References

[1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[2] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.
[3] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Twenty-Third International Conference on Machine Learning, 2006.
[4] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Twenty-Sixth International Conference on Machine Learning, 2009.
[6] S. Hanneke. Adaptive rates of convergence in active learning. In Twenty-Second Annual Conference on Learning Theory, 2009.
[7] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Manuscript, 2009.
[8] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Twentieth Annual Conference on Learning Theory, 2007.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[11] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 2007.
[12] M. Sugiyama. Active learning for misspecified models. In Advances in Neural Information Processing Systems 18, 2005.
[13] F. Bach. Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems 19, 2006.
[14] S. Hanneke. A bound on the label complexity of agnostic active learning. In Twenty-Fourth International Conference on Machine Learning, 2007.
[15] M. Kääriäinen. Active learning in the non-realizable case. In Seventeenth International Conference on Algorithmic Learning Theory, 2006.
[16] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135–166, 2004.
[17] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.
[18] R. Castro and R. Nowak. Minimax bounds for active learning. In Twentieth Annual Conference on Learning Theory, 2007.
[19] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Eighteenth Annual Conference on Learning Theory, 2005.
[20] E. Friedman. Active learning for smooth problems. In Twenty-Second Annual Conference on Learning Theory, 2009.
[21] L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22, 2009.
[22] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Twenty-Fifth International Conference on Machine Learning, 2008.

A Proof of Deviation Bound for Importance Weighted Estimators

The techniques here are mostly developed in [19]; for completeness, we detail the proofs for our particular application. The first two lemmas establish a basic bound in terms of conditional moment generating functions.

Lemma 4. For all $n \geq 1$ and all functionals $\Xi_i := \xi_i(Z_{1:i})$,
$$\mathbb{E}\left[ \exp\left( \sum_{i=1}^{n} \big( \Xi_i - \ln \mathbb{E}_i[\exp \Xi_i] \big) \right) \right] = 1.$$

Proof. A straightforward induction on $n$.

Lemma 5. For all $t \geq 0$, $\lambda \in \mathbb{R}$, $n \geq 1$, and functionals $\Xi_i := \xi_i(Z_{1:i})$,
$$\Pr\left( \lambda \sum_{i=1}^{n} \Xi_i - \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda \Xi_i)] \geq t \right) \leq e^{-t}.$$

Proof. The claim follows by Markov's inequality and Lemma 4 (replacing $\Xi_i$ with $\lambda \Xi_i$).

In order to specialize Lemma 5 for our purposes, we first analyze the conditional moment generating function of $W_i - \mathbb{E}_i[W_i]$.

Lemma 6. If $0 < \lambda < 3 P_i$, then
$$\ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] \leq \frac{1}{P_i} \cdot \frac{\lambda^2 / 2}{1 - \lambda / (3 P_i)}.$$
If $\mathbb{E}_i[W_i] = 0$, then $\ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] = 0$.

Proof. Let $g(x) := (\exp(x) - x - 1)/x^2$ for $x \neq 0$, so $\exp(x) = 1 + x + x^2 g(x)$. Note that $g(x)$ is non-decreasing. Thus,
$$\mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] = \mathbb{E}_i\big[ 1 + \lambda (W_i - \mathbb{E}_i[W_i]) + \lambda^2 (W_i - \mathbb{E}_i[W_i])^2 g(\lambda (W_i - \mathbb{E}_i[W_i])) \big]$$
$$= 1 + \lambda^2 \mathbb{E}_i\big[ (W_i - \mathbb{E}_i[W_i])^2 g(\lambda (W_i - \mathbb{E}_i[W_i])) \big] \leq 1 + \lambda^2 \mathbb{E}_i\big[ (W_i - \mathbb{E}_i[W_i])^2 g(\lambda / P_i) \big]$$
$$= 1 + \lambda^2 \mathbb{E}_i\big[ (W_i - \mathbb{E}_i[W_i])^2 \big] g(\lambda / P_i) \leq 1 + (\lambda^2 / P_i) g(\lambda / P_i),$$
where the first inequality follows from the range bound $W_i \leq 1/P_i$, and the second follows from the variance bound $\mathbb{E}_i[(W_i - \mathbb{E}_i[W_i])^2] \leq 1/P_i$. Now the first claim follows from the definition of $g(x)$ and the facts $\exp(x) - x - 1 \leq (x^2/2)/(1 - x/3)$ for $0 \leq x < 3$ and $\ln(1 + x) \leq x$. The second claim is immediate from the definition of $W_i$ and the fact $\mathbb{E}_i[W_i] = f(X_i, Y_i)$.

We now combine Lemma 6 and Lemma 5 to bound the deviation of the importance weighted estimator $\hat{f}(Z_{1:n})$ from $(1/n) \sum_{i=1}^{n} \mathbb{E}_i[W_i]$.

Lemma 7. Pick any $t \geq 0$, $n \geq 1$, and $p_{\min} > 0$, and let $E$ be the joint event
$$\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2t}{n \, p_{\min}}} + \frac{t}{3 n \, p_{\min}} \quad \text{and} \quad \min(\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \cup \{1\}) \geq p_{\min}.$$
Then $\Pr(E) \leq e^{-t}$.

Proof. With foresight, let
$$\lambda := \frac{3 p_{\min}}{1 + 3 \sqrt{n \, p_{\min} / (2t)}}.$$
Note that $0 < \lambda < 3 p_{\min}$. By Lemma 6 and the choice of $\lambda$, we have that if $\min\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \geq p_{\min}$, then
$$\frac{1}{n \lambda} \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] \leq \frac{\lambda}{p_{\min}} \cdot \frac{1}{2 (1 - \lambda / (3 p_{\min}))} = \sqrt{\frac{t}{2 n \, p_{\min}}} \qquad (6)$$
and
$$\frac{t}{n \lambda} = \sqrt{\frac{t}{2 n \, p_{\min}}} + \frac{t}{3 n \, p_{\min}}. \qquad (7)$$
Let $E'$ be the event that
$$\lambda \sum_{i=1}^{n} (W_i - \mathbb{E}_i[W_i]) - \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda (W_i - \mathbb{E}_i[W_i]))] \geq t,$$
and let $E''$ be the event $\min(\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \cup \{1\}) \geq p_{\min}$. Together, Eq. (6) and Eq. (7) imply $E \cap E'' \subseteq E' \cap E''$, and of course $E \subseteq E''$; so $\Pr(E) \leq \Pr(E' \cap E'') \leq \Pr(E') \leq e^{-t}$ by Lemma 5.

To do away with the joint event in Lemma 7, we use the standard trick of taking a union bound over a geometric sequence of possible values for $p_{\min}$.

Lemma 8. Pick any $t \geq 0$ and $n \geq 1$. Assume $1/P_i \leq r_{\max}$ for all $1 \leq i \leq n$, and let $R_n := 1 / \min(\{P_i : 1 \leq i \leq n \wedge \mathbb{E}_i[W_i] \neq 0\} \cup \{1\})$. We have
$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \right| \geq \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \right) \leq 2 (2 + \log_2 r_{\max}) e^{-t/2}.$$

Proof. The assumption on the $P_i$ implies $R_n \leq r_{\max}$. Let $r_j := 2^j$ for $-1 \leq j \leq m := \lceil \log_2 r_{\max} \rceil$ (note $R_n \geq 1 > r_{-1}$). Then
$$\Pr\left( \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \right)$$
$$= \sum_{j=0}^{m} \Pr\left( \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \;\wedge\; r_{j-1} < R_n \leq r_j \right)$$
$$\leq \sum_{j=0}^{m} \Pr\left( \frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \geq \sqrt{\frac{2 r_j (t/2)}{n}} + \frac{r_j (t/2)}{3n} \;\wedge\; R_n \leq r_j \right) \leq (m + 1) e^{-t/2} \leq (2 + \log_2 r_{\max}) e^{-t/2},$$
where the middle inequality uses $R_n > r_{j-1} = r_j / 2$ (on the $j$th event), and the last inequality follows from Lemma 7 with $p_{\min} = 1/r_j$. Replacing $W_i$ with $-W_i$ bounds the probability of deviations in the other direction in exactly the same way. The claim then follows by the union bound.

Proof of Theorem 1. By Hoeffding's inequality and the fact $|f(X_i, Y_i)| \leq 1$, we have
$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} f(X_i, Y_i) - \mathbb{E}[f(X, Y)] \right| \geq \sqrt{\frac{2t}{n}} \right) \leq 2 e^{-t/2}.$$
Since $\mathbb{E}_i[W_i] = f(X_i, Y_i)$, the claim follows by combining this and Lemma 8 with the triangle inequality and the union bound.
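The deviation bound just proved can be exercised numerically. The simulation below is an illustration under assumed parameters (it is not part of the proof): it draws adaptively subsampled importance weighted means and checks the observed deviation against the bound $\sqrt{2 R_n t / n} + \sqrt{2 t / n} + R_n t / (3n)$ at a value of $t$ for which failures should be vanishingly rare.

```python
import math
import random

# Illustrative numerical check of Theorem 1: with query probabilities
# bounded below by 1/r_max, the importance weighted mean should exceed
# the stated deviation bound only with probability on the order of
# 2 (3 + log2(r_max)) e^{-t/2}. All parameters here are assumptions.

random.seed(4)
n, r_max, t = 2000, 5.0, 20.0
true_mean = 0.3                                 # E[f(X, Y)] by construction

worst_gap = float("-inf")
for _ in range(50):
    s, r_n, prev_q = 0.0, 1.0, 1
    for i in range(n):
        f = 1.0 if random.random() < true_mean else 0.0
        p = 0.5 if prev_q else 1.0              # history-dependent, p >= 1/r_max
        q = 1 if random.random() < p else 0
        if f != 0.0:
            r_n = max(r_n, 1.0 / p)             # R_n as defined in the theorem
        s += (q / p) * f
        prev_q = q
    dev = abs(s / n - true_mean)
    bound = (math.sqrt(2 * r_n * t / n) + math.sqrt(2 * t / n)
             + r_n * t / (3 * n))
    worst_gap = max(worst_gap, dev - bound)

print(worst_gap)
```

With these parameters the bound evaluates to roughly 0.35, while typical deviations of the estimator are an order of magnitude smaller, so the gap stays negative across the runs; the bound is loose here because it must also cover adversarially adaptive query probabilities.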

B Remaining Proofs

In this section, we use the notation $\varepsilon_k := C_0 \log(k+1)/k$.

B.1 Proof of Lemma 2

By induction on $n$. The claim is trivial for $n = 1$, since $p(\text{empty sequence}, x) = 1$ for all $x \in \mathcal{X}$; so now fix any $n \geq 2$ and assume as the inductive hypothesis that $p_{n-1} = p(z_{1:n-2}, x) \geq 1/(n-1)^{n-1}$ for all $(z_{1:n-2}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-2} \times \mathcal{X}$. Fix any $(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^{n-1} \times \mathcal{X}$, and consider the error difference $g_n := \operatorname{err}(h'_n, z_{1:n-1}) - \operatorname{err}(h_n, z_{1:n-1})$ used to determine $p_n := p(z_{1:n-1}, x)$. We only have to consider the case $g_n > \sqrt{\varepsilon_{n-1}} + \varepsilon_{n-1}$. By the inductive hypothesis and the triangle inequality, we have $g_n \leq 2 (n-1)^{n-1}$. Solving the quadratic in Eq. (2) implies
$$\sqrt{p_n} = \frac{c_1 \sqrt{\varepsilon_{n-1}} + \sqrt{c_1^2 \varepsilon_{n-1} + 4 \big(g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_2 - 1) \varepsilon_{n-1}\big) \, c_2 \varepsilon_{n-1}}}{2 \big(g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_2 - 1) \varepsilon_{n-1}\big)} \geq \sqrt{\frac{c_2 \varepsilon_{n-1}}{g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_2 - 1) \varepsilon_{n-1}}} \quad \text{(dropping terms)},$$
so
$$p_n \geq \frac{c_2 \varepsilon_{n-1}}{g_n + (c_1 - 1) \sqrt{\varepsilon_{n-1}} + (c_1 - 1) \varepsilon_{n-1}} \quad \text{(since $c_2 \leq c_1$)} \quad \geq \frac{c_2 \varepsilon_{n-1}}{c_1 g_n} \quad \text{(since $g_n > \sqrt{\varepsilon_{n-1}} + \varepsilon_{n-1}$)}$$
$$\geq \frac{c_2 C_0 \log n}{2 c_1 (n-1)^n} \quad \text{(inductive hypothesis)} \quad > \frac{1}{e (n-1)^n} \quad \text{(since $C_0 \geq 2$, $n \geq 2$, and $c_2 C_0 \log 2 / (2 c_1) > 1/e$)} \quad \geq \frac{1}{n^n} \quad \text{(since $(n/(n-1))^n \geq e$)},$$
as required.

B.2 Proof of Theorem 2

We condition on the ($1 - \delta$ probability) event that the deviation bounds from Lemma 1 hold (also using Lemma 2). The proof now proceeds by induction on $n$. The claim is trivially true for $n = 1$. Now pick any $n \geq 2$ and assume as the (strong) inductive hypothesis that
$$0 \leq \operatorname{err}(h_k) - \operatorname{err}(h^*) \leq \operatorname{err}(h_k, Z_{1:k-1}) - \operatorname{err}(h^*, Z_{1:k-1}) + \sqrt{2 \varepsilon_{k-1}} + 2 \varepsilon_{k-1} \qquad (8)$$
for all $1 \leq k \leq n-1$. We need to show Eq. (8) holds for $k = n$. Let $P_{\min} := \min(\{P_i : 1 \leq i \leq n-1 \wedge h_n(X_i) \neq h^*(X_i)\} \cup \{1\})$. If $P_{\min} \geq 1/2$, then Eq. (4) implies that Eq. (8) holds for $k = n$, as needed. So assume for the sake of contradiction that $P_{\min} < 1/2$, and let $n_0 := \max\{i \leq n-1 : P_i = P_{\min} \wedge h_n(X_i) \neq h^*(X_i)\}$. By the definition of $P_{n_0}$, we have
$$\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) = \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 + 1 \right) \varepsilon_{n_0-1}.$$

Using this fact together with the inductive hypothesis, we have
$$ \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) = \left[\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h_{n_0},Z_{1:n_0-1})\right] + \left[\mathrm{err}(h_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})\right] $$
$$ \ge \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1+1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2+1\right)\varepsilon_{n_0-1} - 2\sqrt{\varepsilon_{n_0-1}} - 2\varepsilon_{n_0-1} $$
$$ = \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2-1\right)\varepsilon_{n_0-1}. \qquad (9) $$
We use the assumption $P_{\min} < 1/2$ to lower bound the right-hand side to get the inequality
$$ \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) > \left((\sqrt{2}-1)c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + (c_2-1)\varepsilon_{n_0-1} > 0, $$
which implies $\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) > \mathrm{err}(h^*,Z_{1:n_0-1})$. Since $h'_{n_0}$ minimizes $\mathrm{err}(h,Z_{1:n_0-1})$ among hypotheses $h \in H$ that disagree with $h_{n_0}$ on $X_{n_0}$, it must be that $h^*$ agrees with $h_{n_0}$ on $X_{n_0}$. By transitivity and the definition of $n_0$, we conclude that $h_n(X_{n_0}) \neq h_{n_0}(X_{n_0})$; so $\mathrm{err}(h_n,Z_{1:n_0-1}) \ge \mathrm{err}(h'_{n_0},Z_{1:n_0-1})$. Then
$$ \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \ge \mathrm{err}(h_n) - \mathrm{err}(h^*) - \sqrt{\frac{\varepsilon_{n-1}}{P_{\min}}} - \frac{\varepsilon_{n-1}}{P_{\min}} $$
$$ \ge \mathrm{err}(h_n,Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) - 2\sqrt{\frac{\varepsilon_{n_0-1}}{P_{\min}}} - \frac{2\varepsilon_{n_0-1}}{P_{\min}} $$
$$ \ge \left(\frac{c_1-2}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2-2}{P_{\min}}-c_2-1\right)\varepsilon_{n_0-1} $$
$$ > \left(\sqrt{2}(c_1-2)-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + (c_2-5)\varepsilon_{n_0-1} $$
where Eq. (4) is used in the first two inequalities, Eq. (9) and the fact $\mathrm{err}(h_n,Z_{1:n_0-1}) \ge \mathrm{err}(h'_{n_0},Z_{1:n_0-1})$ are used in the third inequality, and the fact $P_{\min} < 1/2$ is used in the last inequality. This final quantity is non-negative, so we have the contradiction $\mathrm{err}(h_n,Z_{1:n-1}) > \mathrm{err}(h^*,Z_{1:n-1})$.

B.3 Proof of Lemma 3

First, we establish a property of the query probabilities that relates error deviations (via $P_{\min}$) to empirical error differences (via $P_n$). Both quantities play essential roles in bounding the label complexity through the disagreement metric structure around $h^*$.

Lemma 9. Assume the bounds from Eq. (4) hold for all $h \in H$ and $n \ge 1$. For any $n \ge 1$, we have $P_n \le c_3 \cdot P_{\min}$, where $P_{\min} := \min(\{P_i : 1\le i\le n-1 \wedge \tilde h(X_i)\neq h^*(X_i)\}\cup\{1\})$ and
$$ \tilde h := \begin{cases} h_n & \text{if } h_n \text{ disagrees with } h^* \text{ on } X_n \\ h'_n & \text{if } h'_n \text{ disagrees with } h^* \text{ on } X_n. \end{cases} \qquad (10) $$

Proof. We can assume $P_{\min} < 1/c_3$, since otherwise the claim is trivial. Pick any $n_0 \le n-1$ such that $\tilde h(X_{n_0}) \neq h^*(X_{n_0})$ and $P_{n_0} = P_{\min}$ (such an $n_0$ is guaranteed to exist given the above assumption). We now proceed as in the proof of Theorem 2. We first show a lower bound on $\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})$.
Note that
$$ \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) = \left[\mathrm{err}(h'_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h_{n_0},Z_{1:n_0-1})\right] + \left[\mathrm{err}(h_{n_0},Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})\right] $$
$$ \ge \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1+1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2+1\right)\varepsilon_{n_0-1} - 2\sqrt{\varepsilon_{n_0-1}} - 2\varepsilon_{n_0-1} $$
$$ = \left(\frac{c_1}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n_0-1}} + \left(\frac{c_2}{P_{\min}}-c_2-1\right)\varepsilon_{n_0-1} \qquad (11) $$

where the inequality follows from Theorem 2. The right-hand side is positive, so $h^*$ must disagree with $h'_{n_0}$ on $X_{n_0}$. By transitivity (recalling that $\tilde h(X_{n_0}) \neq h^*(X_{n_0})$), $\tilde h$ must agree with $h'_{n_0}$ on $X_{n_0}$. Therefore $\mathrm{err}(\tilde h,Z_{1:n_0-1}) - \mathrm{err}(h'_{n_0},Z_{1:n_0-1}) \ge 0$, so the inequality in Eq. (11) holds with $\tilde h$ in place of $h'_{n_0}$ on the left-hand side. Now $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1})$ is related to $\mathrm{err}(\tilde h,Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1})$ through $\mathrm{err}(\tilde h) - \mathrm{err}(h^*)$, using the deviation bound from Eq. (4) as well as the fact $\varepsilon_{n_0-1} \ge \varepsilon_{n-1}$:
$$ \mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \ge \mathrm{err}(\tilde h,Z_{1:n_0-1}) - \mathrm{err}(h^*,Z_{1:n_0-1}) - 2\sqrt{\frac{\varepsilon_{n_0-1}}{P_{\min}}} - \frac{2\varepsilon_{n_0-1}}{P_{\min}} $$
$$ \ge \left(\frac{c_1-2}{\sqrt{P_{\min}}}-c_1-1\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2-2}{P_{\min}}-c_2-1\right)\varepsilon_{n-1} > 0. \qquad (12) $$
If $\tilde h = h_n$, then $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) = \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \le 0$ by the minimality of $\mathrm{err}(h_n,Z_{1:n-1})$; this contradicts Eq. (12). Therefore it must be that $\tilde h = h'_n$. In this case,
$$ \mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \le \mathrm{err}(h'_n,Z_{1:n-1}) - \mathrm{err}(h_n,Z_{1:n-1}) = \left(\frac{c_1}{\sqrt{P_n}}-c_1+1\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2}{P_n}-c_2+1\right)\varepsilon_{n-1} \qquad (13) $$
where the inequality follows from the minimality of $\mathrm{err}(h_n,Z_{1:n-1})$, and the subsequent step follows from the definition of $P_n$. Combining the lower bound in Eq. (12) and the upper bound in Eq. (13) implies that
$$ \frac{c_1}{\sqrt{P_n}}\sqrt{\varepsilon_{n-1}} + \frac{c_2}{P_n}\varepsilon_{n-1} \ge \left(\frac{c_1-2}{\sqrt{P_{\min}}}-2\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2-2}{P_{\min}}-2\right)\varepsilon_{n-1}. $$
It is easily checked that this implies $P_n \le c_3 \cdot P_{\min}$.

Proof of Lemma 3. Define $\tilde h$ as in Eq. (10). By Lemma 9, we have $\min(\{P_i : 1\le i\le n-1 \wedge \tilde h(X_i)\neq h^*(X_i)\}\cup\{1\}) \ge P_n/c_3$. We first show that
$$ \mathrm{err}(\tilde h) - \mathrm{err}(h^*) \le \mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) + \sqrt{\frac{c_3\varepsilon_{n-1}}{P_n}} + \frac{c_3\varepsilon_{n-1}}{P_n} \le c_4\sqrt{\frac{\varepsilon_{n-1}}{P_n}} + \frac{c_5\varepsilon_{n-1}}{P_n}. \qquad (14) $$
The first inequality follows from Eq. (4) and Lemma 9. For the second inequality, we consider two cases depending on $\tilde h$. If $\tilde h = h'_n$, then we bound $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1})$ from above by $\mathrm{err}(h'_n,Z_{1:n-1}) - \mathrm{err}(h_n,Z_{1:n-1})$ (by definition of $\tilde h$ and minimality of $\mathrm{err}(h_n,Z_{1:n-1})$), and then simplify
$$ \mathrm{err}(h'_n,Z_{1:n-1}) - \mathrm{err}(h_n,Z_{1:n-1}) + \sqrt{\frac{c_3\varepsilon_{n-1}}{P_n}} + \frac{c_3\varepsilon_{n-1}}{P_n} \le \left(\frac{c_1+\sqrt{c_3}}{\sqrt{P_n}}-c_1+1\right)\sqrt{\varepsilon_{n-1}} + \left(\frac{c_2+c_3}{P_n}-c_2+1\right)\varepsilon_{n-1} \le \frac{c_4}{\sqrt{P_n}}\sqrt{\varepsilon_{n-1}} + \frac{c_5}{P_n}\varepsilon_{n-1} $$
using the definition of $P_n$ and the facts $c_1 \ge 1$ and $c_2 \ge 1$. If instead $\tilde h = h_n$, then we use the facts $\mathrm{err}(\tilde h,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) = \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) \le 0$ and $c_3 \le \min\{c_4,c_5\}$.

If $\mathrm{err}(\tilde h) - \mathrm{err}(h^*) = \gamma > 0$, then solving the quadratic inequality in Eq. (14) for $P_n$ gives the bound
$$ P_n \le \min\left\{1,\; \frac{3}{2}\left(\frac{c_4^2}{\gamma^2} + \frac{c_5}{\gamma}\right)\varepsilon_{n-1}\right\}. $$
If $\mathrm{err}(\tilde h) - \mathrm{err}(h^*) \le \gamma$, then by the triangle inequality we have
$$ \Pr(\tilde h(X) \neq h^*(X)) \le \mathrm{err}(\tilde h) + \mathrm{err}(h^*) \le 2\,\mathrm{err}(h^*) + \gamma $$

which in turn implies $X_n \in \mathrm{DIS}(B(h^*, 2\,\mathrm{err}(h^*)+\gamma))$. Note that $\Pr(X_n \in \mathrm{DIS}(B(h^*, 2\,\mathrm{err}(h^*)+\gamma))) \le \theta\cdot(2\,\mathrm{err}(h^*)+\gamma)$ by the definition of $\theta$, so $\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*) \le \gamma) \le \theta\,(2\,\mathrm{err}(h^*)+\gamma)$. Let $f(\gamma) := \frac{\partial}{\partial\gamma}\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*) \le \gamma)$ be the probability density function of the error difference $\mathrm{err}(\tilde h)-\mathrm{err}(h^*)$; note that this error difference is a function of $(Z_{1:n-1}, X_n)$. We compute the expected value of $Q_n$ by conditioning on $\mathrm{err}(\tilde h)-\mathrm{err}(h^*)$ and integrating an upper bound on $\mathbb{E}[Q_n \mid \mathrm{err}(\tilde h)-\mathrm{err}(h^*)=\gamma]$ with respect to $f(\gamma)$. Let $\gamma_0 > 0$ be the positive solution to $\frac{3}{2}(c_4^2/\gamma^2 + c_5/\gamma)\varepsilon_{n-1} = 1$. It can be checked that $\gamma_0 \ge \sqrt{1.5c_4^2\varepsilon_{n-1}}$. We have
$$ \mathbb{E}[Q_n] = \mathbb{E}\left[\mathbb{E}[Q_n \mid Z_{1:n-1}, X_n]\right] \qquad (\text{the outer expectation is over } (Z_{1:n-1},X_n)) $$
$$ = \int_0^\infty \mathbb{E}[Q_n \mid \mathrm{err}(\tilde h)-\mathrm{err}(h^*)=\gamma]\, f(\gamma)\, d\gamma $$
$$ \le \int_0^1 \min\left\{1,\; \frac{3}{2}\left(\frac{c_4^2}{\gamma^2}+\frac{c_5}{\gamma}\right)\varepsilon_{n-1}\right\} f(\gamma)\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \int_{\gamma_0}^1 \frac{3}{2}\left(\frac{2c_4^2}{\gamma^3}+\frac{c_5}{\gamma^2}\right)\varepsilon_{n-1}\,\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma)\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\cdot2\,\mathrm{err}(h^*)\cdot\frac{3}{2}\left(\frac{c_4^2}{\gamma_0^2}+\frac{c_5}{\gamma_0}\right)\varepsilon_{n-1} + \theta\cdot\frac{3}{2}\left(\frac{2c_4^2}{\gamma_0}+c_5\ln\frac{1}{\gamma_0}\right)\varepsilon_{n-1} $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\cdot2\,\mathrm{err}(h^*) + \theta\sqrt{6c_4^2\varepsilon_{n-1}} + \theta\cdot\frac{3c_5}{4}\varepsilon_{n-1}\ln\frac{1}{1.5c_4^2\varepsilon_{n-1}} $$
where the first inequality uses the bound on $\mathbb{E}[Q_n \mid \mathrm{err}(\tilde h)-\mathrm{err}(h^*)=\gamma]$; the second inequality uses integration by parts; the third inequality uses the fact that the integrand from the previous line is $0$ for $0 \le \gamma \le \gamma_0$, as well as the bound on $\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma)$; and the fourth inequality uses the definition of $\gamma_0$.

B.4 Proof of Theorem 4

The theorem is a simple consequence of the following analogue of Lemma 3.

Lemma 10. Assume that for some value of $\kappa > 0$ and $0 < \alpha \le 1$, the condition in Eq. (5) holds for all $h \in H$. Assume the bounds from Eq. (4) hold for all $h \in H$ and $n \ge 1$. There is a constant $c_\alpha > 0$ such that the following holds. For any $n \ge 1$,
$$ \mathbb{E}[Q_n] \le \theta\kappa\, c_\alpha \left(\frac{C_0\log n}{n-1}\right)^{\alpha/2}. $$

Proof. For the most part, the proof is the same as that of Lemma 3. The key difference is to use the noise condition in Eq. (5) to directly bound $\Pr(\tilde h(X)\neq h^*(X)) \le \kappa\,(\mathrm{err}(\tilde h)-\mathrm{err}(h^*))^\alpha$, which in turn implies the bound $\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma) \le \theta\kappa\gamma^\alpha$. As before, let $\gamma_0 \ge \sqrt{1.5c_4^2\varepsilon_{n-1}}$ be the positive solution to $\frac{3}{2}(c_4^2/\gamma^2+c_5/\gamma)\varepsilon_{n-1} = 1$. First consider the case $\alpha < 1$.
Then, the expectation of $Q_n$ can be bounded as
$$ \mathbb{E}[Q_n] \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \int_{\gamma_0}^1 \frac{3}{2}\left(\frac{2c_4^2}{\gamma^3}+\frac{c_5}{\gamma^2}\right)\varepsilon_{n-1}\,\Pr(\mathrm{err}(\tilde h)-\mathrm{err}(h^*)\le\gamma)\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\kappa\int_{\gamma_0}^1 \frac{3}{2}\left(\frac{2c_4^2}{\gamma^3}+\frac{c_5}{\gamma^2}\right)\varepsilon_{n-1}\,\gamma^\alpha\, d\gamma $$
$$ \le \frac{3}{2}(c_4^2+c_5)\varepsilon_{n-1} + \theta\kappa\cdot\frac{3}{2}\left(\frac{2c_4^2}{2-\alpha}\,\gamma_0^{-(2-\alpha)} + \frac{c_5}{1-\alpha}\,\gamma_0^{-(1-\alpha)}\right)\varepsilon_{n-1}. $$
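The last step above evaluates a tail integral in closed form. As a numerical sanity check of that antiderivative (an illustrative sketch; the constants, limits, and function names below are arbitrary choices, not values from the paper), closed form and direct quadrature can be compared:

```python
def tail_integral_closed(g0, c4, c5, alpha):
    """Closed form of the tail integral of (2*c4^2/g^3 + c5/g^2) * g^alpha
    from g0 to infinity, valid for 0 < alpha < 1."""
    return (2.0 * c4**2 * g0**(alpha - 2.0) / (2.0 - alpha)
            + c5 * g0**(alpha - 1.0) / (1.0 - alpha))

def tail_integral_numeric(g0, g1, c4, c5, alpha, steps=200000):
    """Midpoint-rule quadrature of the same integrand on [g0, g1]."""
    h = (g1 - g0) / steps
    total = 0.0
    for i in range(steps):
        g = g0 + (i + 0.5) * h
        total += (2.0 * c4**2 / g**3 + c5 / g**2) * g**alpha * h
    return total

# Quadrature on [0.1, 50] should match the difference of the closed form
# evaluated at the two endpoints.
numeric = tail_integral_numeric(0.1, 50.0, 2.0, 3.0, 0.5)
exact = (tail_integral_closed(0.1, 2.0, 3.0, 0.5)
         - tail_integral_closed(50.0, 2.0, 3.0, 0.5))
assert abs(numeric - exact) / exact < 1e-3
```

Differentiating the closed form reproduces the integrand term by term, which is exactly the integration-by-parts step used in the proofs of Lemmas 3 and 10.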

The case $\alpha = 1$ is handled similarly.

B.5 Analogue of Theorem 2 under Low Noise Conditions

We first state a variant of Lemma 1 that takes into account the probability of disagreement between a hypothesis $h$ and the optimal hypothesis $h^*$.

Lemma 11. There exists an absolute constant $c > 0$ such that the following holds. Pick any $\delta \in (0,1)$. For all $n \ge 1$, let
$$ \varepsilon_n := \frac{c\log((n+1)|H|/\delta)}{n}. $$
Let $Z_1, Z_2, \ldots \in \mathcal{X}\times\mathcal{Y}\times\{0,1\}$ be the sequence of random variables specified in Section 2.2 using a rejection threshold $p : (\mathcal{X}\times\mathcal{Y}\times\{0,1\})^*\times\mathcal{X} \to [0,1]$ that satisfies $p(z_{1:n-1},x) \ge 1/n^n$ for all $(z_{1:n-1},x) \in (\mathcal{X}\times\mathcal{Y}\times\{0,1\})^{n-1}\times\mathcal{X}$ and all $n \ge 1$. The following holds with probability at least $1-\delta$. For all $n \ge 1$ and all $h \in H$,
$$ \left|\mathrm{err}(h,Z_{1:n}) - \mathrm{err}(h^*,Z_{1:n}) - \left(\mathrm{err}(h)-\mathrm{err}(h^*)\right)\right| \le \sqrt{\frac{\Pr(h(X)\neq h^*(X))\,\varepsilon_n}{P_{\min,n}(h)}} + \frac{\varepsilon_n}{P_{\min,n}(h)} $$
where $P_{\min,n}(h) := \min(\{P_i : 1\le i\le n \wedge h(X_i)\neq h^*(X_i)\}\cup\{1\})$.

Proof sketch. The proof of this lemma follows along the same lines as that of Lemma 1. A key difference comes in Lemma 7: the joint event is modified to also conjoin with the event
$$ \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[\mathbb{E}_i[f(X_i,Y_i)] \neq 0\right] \le a $$
for some fixed $a > 0$. In the proof, the parameter $\lambda$ should be chosen as
$$ \lambda := \frac{3p_{\min}\sqrt{2p_{\min}at/n}}{3p_{\min}a+\sqrt{2p_{\min}at/n}}. $$
Lemma 8 is modified to also take a union bound over a sequence of possible values for $a$ (in fact, only $n+1$ different values need to be considered). Finally, instead of combining with Hoeffding's inequality, we use Bernstein's inequality (or a multiplicative form of Chernoff's bound), so the resulting bound (an analogue of Theorem 1) involves an empirical average inside the square-root term: with probability at least $1 - O(n\log_2 r_{\max})e^{-t/2}$,
$$ \left|\frac{1}{n}\sum_{i=1}^{n}\frac{Q_i f(X_i,Y_i)}{P_i} - \mathbb{E}[f(X,Y)]\right| \le O\left(\sqrt{\frac{R_nA_nt}{n}} + \frac{R_nt}{3n}\right) $$
where
$$ A_n := \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[f(X_i,Y_i) \neq 0\right]. $$
Finally, we apply this deviation bound to obtain uniform error bounds over all hypotheses $h \in H$ (a few extra steps are required to replace the empirical quantity $A_n$ in the bound with a distributional quantity).

Using the previous lemma, a modified version of Theorem 2 follows from essentially the same proof. We note that the quantity $C_1 := O(\log(|H|/\delta))$ used here may differ from $C_0$ by constant factors.

Lemma 12.
The following holds with probability at least $1-\delta$. For any $n \ge 1$,
$$ 0 \le \mathrm{err}(h_n) - \mathrm{err}(h^*) \le \mathrm{err}(h_n,Z_{1:n-1}) - \mathrm{err}(h^*,Z_{1:n-1}) + \sqrt{\frac{2\Pr(h_n(X)\neq h^*(X))\,C_1\log n}{n-1}} + \frac{2C_1\log n}{n-1}. $$
This implies, for all $n \ge 1$,
$$ \mathrm{err}(h_n) - \mathrm{err}(h^*) \le \sqrt{\frac{2\Pr(h_n(X)\neq h^*(X))\,C_1\log n}{n-1}} + \frac{2C_1\log n}{n-1}. \qquad (16) $$

Finally, using the noise condition to bound $\Pr(h_n(X)\neq h^*(X)) \le \kappa\,(\mathrm{err}(h_n)-\mathrm{err}(h^*))^\alpha$, we obtain the final error bound.

Theorem 5. The following holds with probability at least $1-\delta$. For any $n \ge 1$,
$$ \mathrm{err}(h_n) - \mathrm{err}(h^*) \le c_\kappa\left(\frac{C_1\log n}{n-1}\right)^{1/(2-\alpha)} $$
where $c_\kappa$ is a constant that depends only on $\kappa$.
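For concreteness, the rejection threshold whose properties drive these proofs can be computed in closed form by solving a quadratic in $\sqrt{P_n}$, as in the proof of Lemma 2. The sketch below assumes the threshold equation has the form $G_n = (c_1/\sqrt{s}-c_1+1)\sqrt{\varepsilon} + (c_2/s-c_2+1)\varepsilon$ with illustrative constants $c_1 = 2$ and $c_2 = 5$ (the actual constants in the algorithm may differ):

```python
import math

def rejection_threshold(gap, eps, c1=2.0, c2=5.0):
    """Solve gap = (c1/sqrt(s) - c1 + 1)*sqrt(eps) + (c2/s - c2 + 1)*eps
    for s in (0, 1], taking the positive root of the quadratic in sqrt(s).
    Returns 1.0 when the empirical error gap is within the deviation bound."""
    if gap <= math.sqrt(eps) + eps:
        return 1.0
    # Multiply through by s and substitute u = sqrt(s):  a*u^2 - b*u - c = 0.
    a = gap + (c1 - 1.0) * math.sqrt(eps) + (c2 - 1.0) * eps
    b = c1 * math.sqrt(eps)
    c = c2 * eps
    u = (b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)  # positive root
    return u * u

s = rejection_threshold(gap=0.5, eps=0.01)
# The returned value should satisfy the defining equation.
residual = (2.0 / math.sqrt(s) - 1.0) * 0.1 + (5.0 / s - 4.0) * 0.01 - 0.5
assert abs(residual) < 1e-9
```

Because the right-hand side of the threshold equation decreases in $s$, a larger error gap forces a smaller query probability, which is the mechanism the label complexity bounds above exploit.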


More information

Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Microchoice Bounds and Self Bounding Learning Algorithms

Microchoice Bounds and Self Bounding Learning Algorithms Microchoice Bounds and Self Bounding Learning Algorithms John Langford Computer Science Department Carnegie Mellong University Pittsburgh, PA 527 jcl+@cs.cmu.edu Avrim Blum Computer Science Department

More information

Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation

Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation Avrim Blum School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 avrim+@cscmuedu Adam Kalai School of Computer

More information

CS446: Machine Learning Spring Problem Set 4

CS446: Machine Learning Spring Problem Set 4 CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Multiple Identifications in Multi-Armed Bandits

Multiple Identifications in Multi-Armed Bandits Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Efficient PAC Learning from the Crowd

Efficient PAC Learning from the Crowd Proceedings of Machine Learning Research vol 65:1 24, 2017 Efficient PAC Learning from the Crowd Pranjal Awasthi Rutgers University Avrim Blum Nika Haghtalab Carnegie Mellon University Yishay Mansour Tel-Aviv

More information

Activized Learning with Uniform Classification Noise

Activized Learning with Uniform Classification Noise Activized Learning with Uniform Classification Noise Liu Yang Steve Hanneke June 9, 2011 Abstract We prove that for any VC class, it is possible to transform any passive learning algorithm into an active

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Computational Learning Theory

Computational Learning Theory 1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number

More information

On the Sample Complexity of Noise-Tolerant Learning

On the Sample Complexity of Noise-Tolerant Learning On the Sample Complexity of Noise-Tolerant Learning Javed A. Aslam Department of Computer Science Dartmouth College Hanover, NH 03755 Scott E. Decatur Laboratory for Computer Science Massachusetts Institute

More information

Multiclass Learnability and the ERM Principle

Multiclass Learnability and the ERM Principle Journal of Machine Learning Research 6 205 2377-2404 Submitted 8/3; Revised /5; Published 2/5 Multiclass Learnability and the ERM Principle Amit Daniely amitdaniely@mailhujiacil Dept of Mathematics, The

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory

More information

Analysis of perceptron-based active learning

Analysis of perceptron-based active learning Analysis of perceptron-based active learning Sanjoy Dasgupta 1, Adam Tauman Kalai 2, and Claire Monteleoni 3 1 UCSD CSE 9500 Gilman Drive #0114, La Jolla, CA 92093 dasgupta@csucsdedu 2 TTI-Chicago 1427

More information

Experience-Efficient Learning in Associative Bandit Problems

Experience-Efficient Learning in Associative Bandit Problems Alexander L. Strehl strehl@cs.rutgers.edu Chris Mesterharm mesterha@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu Haym Hirsh hirsh@cs.rutgers.edu Department of Computer Science, Rutgers University,

More information