Active Learning: Disagreement Coefficient

Advanced Course in Machine Learning (Spring 2010)
Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz

In previous lectures we saw examples in which active learning gives an exponential improvement in the number of labels required for learning. In this lecture we describe the disagreement coefficient, a measure of the complexity of an active learning problem proposed by Steve Hanneke in 2007. We derive an algorithm for the realizable case and analyze it using the disagreement coefficient. In particular, we show that if the disagreement coefficient is constant then it is possible to obtain an exponential improvement over passive learning. We also describe a variant of the Agnostic Active (A^2) algorithm (due to Balcan, Beygelzimer, and Langford) and show how an exponential improvement can be obtained even in the agnostic case, as long as the target accuracy is proportional to the best achievable error rate.

1 Motivation

Recall that for the task of learning thresholds on the line we established a logarithmic improvement for active learning using binary search. Let us now consider a different algorithm that achieves the same label complexity bound but is less specific. For simplicity, assume that D_x is uniform over [0, 1]. We start with V_1 = [0, 1], which represents the initial version space. We sample a random example (x_1, y_1) and update V_2 = {a ∈ V_1 : sign(x_1 - a) = y_1}. That is, if y_1 = 1 then V_2 = [0, x_1], and otherwise V_2 = [x_1, 1]. Next, we sample instances until we obtain some x_2 ∈ V_2. We query its label y_2 and again update V_3 = {a ∈ V_2 : sign(x_2 - a) = y_2}. We continue this process until V_i = [a_i, b_i] satisfies b_i - a_i ≤ ε.

Let us analyze this algorithm. First, it is easy to verify that if we stop then every hypothesis in V_i has error at most ε (recall that we assume the data is realizable, therefore the true threshold a* is in V_i at all times). Second, whenever we query a label, x_i is selected uniformly at random from V_i = [a_i, b_i]. Denote Δ(V_i) = b_i - a_i. Whenever we query x_i ∈ [a_i, b_i], if x_i ∈ [a_i + Δ(V_i)/4, b_i - Δ(V_i)/4] then Δ(V_{i+1}) ≤ (3/4) Δ(V_i). The probability that x_i ∈ [a_i + Δ(V_i)/4, b_i - Δ(V_i)/4] is 1/2, so the probability that this fails in k consecutive trials is 2^{-k}. For n = kr, we can rewrite Δ(V_{n+1}) as

  Δ(V_{n+1}) = Δ(V_1) · (Δ(V_2)/Δ(V_1) · · · Δ(V_{k+1})/Δ(V_k)) · · · (Δ(V_{n-k+2})/Δ(V_{n-k+1}) · · · Δ(V_{n+1})/Δ(V_n)).

All the factors above are at most 1. Additionally, within each parenthesis of k factors, with probability at least 1 - 2^{-k} at least one factor is at most 3/4. Therefore, with probability at least 1 - n 2^{-k}, the expression within each parenthesis is at most 3/4. It follows that with probability at least 1 - n 2^{-k} we have Δ(V_{n+1}) ≤ (3/4)^{n/k}. To make this at most ε it suffices that n = Ω(k log(1/ε)), and to make the probability of failure at most δ we can choose k = log_2(n/δ). In summary, we have shown that with probability at least 1 - δ, the label complexity needed to achieve accuracy ε is n = O(log(n/δ) log(1/ε)).

The above analysis relied on the property that it is relatively easy to significantly decrease Δ(V_i) by sampling from V_i. In the next section we describe the disagreement coefficient, which characterizes when such an approach works.
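To make the procedure above concrete, here is a minimal Python sketch of it (illustrative only, not part of the original handout; the true threshold a_star and the uniform D_x over [0, 1] are assumptions of the example):

import random

def active_threshold_learning(a_star, eps, rng=random.Random(0)):
    """Shrink the version space V = [lo, hi] for thresholds x -> sign(x - a),
    querying labels only for points that fall inside the current version space."""
    lo, hi = 0.0, 1.0          # V_1 = [0, 1]
    labels_queried = 0
    while hi - lo > eps:
        x = rng.random()       # draw an unlabeled instance from D_x
        if not (lo <= x <= hi):
            continue           # points outside the version-space interval cost no label
        y = 1 if x >= a_star else -1   # query the (noiseless) label
        labels_queried += 1
        if y == 1:
            hi = x             # V_{i+1} = [lo, x]
        else:
            lo = x             # V_{i+1} = [x, hi]
    return (lo + hi) / 2, labels_queried

# Example: roughly O(log(1/eps)) labels are queried on average.
h_hat, n_labels = active_threshold_learning(a_star=0.37, eps=1e-3)
print(h_hat, n_labels)

Since each queried point is uniform in the current interval, the interval shrinks by a constant factor in expectation per query, which is exactly the behaviour the analysis above quantifies.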

2 The Disagreement Coefficient

Let H be a hypothesis class. We define a pseudo-metric on H, based on the marginal distribution D_x over instances, by

  d(h, h') = P_{x ~ D_x}[h(x) ≠ h'(x)].

We define the corresponding ball of radius r around h to be B(h, r) = {h' ∈ H : d(h, h') ≤ r}. The disagreement region of a subset of hypotheses V ⊆ H is

  DIS(V) = {x : ∃ h, h' ∈ V, h(x) ≠ h'(x)},

and its mass is

  Δ(V) = D_x(DIS(V)) = P[x ∈ DIS(V)].

That is, Δ(V) measures the probability of drawing an instance x on which two hypotheses in V disagree. Intuitively, for small r we expect Δ(B(h, r)) to become smaller and smaller. The disagreement coefficient measures how quickly Δ(B(h, r)) grows as a function of r.

Definition 1 (Disagreement Coefficient) Let H be a hypothesis class, D be a distribution over X × {0, 1}, and D_x be the marginal distribution over X. Let h* be a minimizer of err_D(h). The disagreement coefficient is

  θ = sup_{r ∈ (0,1)} Δ(B(h*, r)) / r.

The disagreement coefficient allows us to bound the mass of the disagreement region of a ball by its radius, since for all r ∈ (0, 1), Δ(B(h*, r)) ≤ θ r.

2.1 Examples

Thresholds on the line: Let H be the class of threshold functions sign(x - a) for a ∈ R, and X = R. Take two hypotheses h, h' in B(h*, r). Denote by a, a', a* the thresholds of h, h', h*, respectively. Let a_0 be the minimal threshold such that D_x([a_0, a*]) ≤ r, and similarly let a_1 be the maximal threshold such that D_x([a*, a_1]) ≤ r. Clearly, both a and a' are in [a_0, a_1]. It follows that Δ(B(h*, r)) ≤ 2r and therefore θ ≤ 2.

Halfspaces: TBA

Finite classes: Let H = {h_1, ..., h_k} be a finite class. We show that θ ≤ |H| and that there are classes for which equality holds. The upper bound is derived as follows. First, for each class of the form {h_i, h*} we have θ = 1. Second, it is easy to verify that the disagreement coefficient of H_1 ∪ H_2 with respect to h* ∈ H_1 ∩ H_2 is at most the sum of the two disagreement coefficients. From this it follows that θ ≤ |H|. Finally, an example in which θ = |H| is a class in which, for each i with h_i ≠ h*, the hypothesis h_i disagrees with h* only on a single distinct instance x_i, i.e. h_i(x_i) ≠ h*(x_i), while on all other instances h_i agrees with h* (say, with D_x uniform over {x_1, ..., x_k}). In such a case, setting r = 1/k gives the matching lower bound.
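As a sanity check on the thresholds example, the disagreement coefficient can be estimated numerically. The Python sketch below is illustrative (the choice of a_star, the uniform D_x over [0, 1], and the grid of radii are assumptions); for thresholds it should return a value close to the bound θ ≤ 2 derived above.

import numpy as np

def disagreement_mass_thresholds(a_star, r, n_samples=100_000, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of Delta(B(h*, r)) for thresholds under D_x = U[0, 1].

    For thresholds, d(h_a, h_{a*}) = |a - a*|, so B(h*, r) = {a : |a - a*| <= r},
    and x lies in DIS(B(h*, r)) iff two thresholds in the ball label x differently,
    i.e. iff a* - r < x < a* + r."""
    x = rng.uniform(0.0, 1.0, size=n_samples)
    in_dis = (x > a_star - r) & (x < a_star + r)
    return in_dis.mean()

def disagreement_coefficient_estimate(a_star, radii):
    """theta = sup_r Delta(B(h*, r)) / r, estimated over a grid of radii."""
    return max(disagreement_mass_thresholds(a_star, r) / r for r in radii)

print(disagreement_coefficient_estimate(0.5, radii=np.logspace(-3, -0.5, 20)))
# prints a value close to 2, matching the bound theta <= 2 derived above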

3 An algorithm for a finite class in the realizable case

To demonstrate how the disagreement coefficient affects the performance of active learning, we start with a simple algorithm for the case of a finite H, assuming that there exists h* ∈ H with err_D(h*) = 0.

Algorithm 1 ActiveLearning
  Parameters: k, ε
  Initialization: V = H
  While Δ(V) > ε:
    Keep sampling instances until k of them fall in DIS(V)
    Query their labels; let (x_1, y_1), ..., (x_k, y_k) be the resulting examples
    Update: V = {h ∈ V : ∀ j ∈ [k], h(x_j) = y_j}
  Output: any hypothesis from V

We now analyze the label complexity of the above algorithm.

Theorem 1 Let H be a finite hypothesis class, D be a distribution, and assume that the disagreement coefficient θ is finite. For any δ, ε ∈ (0, 1), if Algorithm 1 is run with ε and with k = 2θ log(n|H|/δ), where n = log_2(1/ε), then the algorithm outputs a hypothesis with err_D(h) ≤ ε, and with probability at least 1 - δ it stops after at most n rounds. The label complexity is therefore O(θ log(1/ε) (log(|H|/δ) + log log(1/ε))).

Proof Let V_i be the version space at round i, and write Δ_i = Δ(V_i). First, if we stop then Δ_i ≤ ε, and this guarantees that the error of every hypothesis in V_i is at most ε. To bound the label complexity of the algorithm we show that Δ(V_{i+1}) ≤ (1/2) Δ(V_i) with high probability. Let V_i^θ = {h ∈ V_i : d(h, h*) > Δ_i/(2θ)} be the set of hypotheses in V_i with a large error. We will show that V_{i+1} ⊆ V_i \ V_i^θ, and since V_i \ V_i^θ ⊆ B(h*, Δ_i/(2θ)) we shall then have

  Δ(V_{i+1}) ≤ Δ(B(h*, Δ_i/(2θ))) ≤ θ Δ_i/(2θ) = Δ_i/2,

where the last inequality uses the definition of the disagreement coefficient. Indeed, let D_i be the conditional distribution of D given that x ∈ DIS(V_i), and note that in the realizable case every point on which h ∈ V_i errs lies in DIS(V_i), so err_{D_i}(h) = err_D(h)/Δ_i = d(h, h*)/Δ_i. It follows that h ∈ V_i^θ iff d(h, h*) > Δ_i/(2θ) iff err_{D_i}(h) > 1/(2θ). Therefore, for h ∈ V_i^θ, the probability that a point x drawn from DIS(V_i) satisfies h(x) ≠ h*(x) is at least 1/(2θ). Since we sample k points from DIS(V_i), with probability at least 1 - (1 - 1/(2θ))^k at least one of the k points satisfies h(x) ≠ h*(x), which means that h will not be in V_{i+1}. Applying the union bound over all hypotheses in V_i^θ, and noting that |V_i^θ| ≤ |H|, we obtain

  P[∃ h ∈ V_{i+1} ∩ V_i^θ] ≤ |H| (1 - 1/(2θ))^k ≤ |H| exp(-k/(2θ)).

Choosing k as in the theorem statement, with probability at least 1 - δ/n we have V_{i+1} ⊆ V_i \ V_i^θ, and as shown above this implies Δ(V_{i+1}) ≤ (1/2) Δ(V_i). Applying a union bound over the first n rounds of the algorithm we obtain that with probability at least 1 - δ,

  Δ(V_n) ≤ Δ(V_1) 2^{-n} ≤ 2^{-n},

which is at most ε when n = log_2(1/ε).
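A minimal Python sketch of Algorithm 1 for a small finite class is given below (illustrative only; the helper names, the Monte-Carlo estimate of Δ(V), and the threshold-grid example are assumptions, not part of the handout):

import numpy as np

rng = np.random.default_rng(0)

def in_disagreement_region(V, x):
    """x is in DIS(V) iff two hypotheses in V label it differently."""
    return len({h(x) for h in V}) > 1

def disagreement_mass(V, sample_x, n=10_000):
    """Monte-Carlo estimate of Delta(V) = P[x in DIS(V)]."""
    return np.mean([in_disagreement_region(V, sample_x()) for _ in range(n)])

def active_learning_finite(H, h_star, sample_x, k, eps):
    """Algorithm 1: query labels only inside DIS(V), keep consistent hypotheses."""
    V = list(H)
    labels_used = 0
    while disagreement_mass(V, sample_x) > eps:
        # collect k instances from the disagreement region
        batch = []
        while len(batch) < k:
            x = sample_x()
            if in_disagreement_region(V, x):
                batch.append((x, h_star(x)))   # query the label (realizable case)
                labels_used += 1
        # keep only hypotheses consistent with all queried examples
        V = [h for h in V if all(h(x) == y for x, y in batch)]
    return V[0], labels_used

# Example: a grid of thresholds, true threshold 0.37, D_x uniform on [0, 1];
# k is chosen small purely for illustration.
H = [lambda x, a=a: 1 if x >= a else -1 for a in np.linspace(0, 1, 101)]
h_star = lambda x: 1 if x >= 0.37 else -1
best, used = active_learning_finite(H, h_star, sample_x=lambda: rng.uniform(0, 1), k=10, eps=0.02)
print(used)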

3.1 An extension to a low VC class

It is easy to handle a hypothesis class with VC dimension d based on the above. The idea is to first sample m unlabeled examples. By the Sauer lemma, the restriction of H to these m examples has size at most O(m^d). Now apply the analysis from the previous section for finite classes with respect to the uniform distribution over the sample. This yields an almost-ERM for the m examples. The result then follows from generalization bounds for almost-ERMs.

3.2 Query by Committee revisited

Recall that the QBC algorithm queries the label if two random hypotheses drawn from V_i give different answers. This is similar in spirit to sampling from the disagreement region. We leave it as an exercise to analyze QBC using the disagreement coefficient.

4 The A^2 algorithm

In this section we describe and analyze the Agnostic Active (A^2) algorithm, originally proposed by Balcan, Beygelzimer, and Langford in 2006. We give a variant of the algorithm which is easier to analyze; a more complicated variant is described and analyzed in Hanneke (2007). The algorithm is similar to Algorithm 1: we sample examples and query their labels only if they fall in the disagreement region. The major difference is that while Algorithm 1 throws away a hypothesis even if it makes a single error, here we throw away a hypothesis only if we are certain that it is worse than h*. The algorithm relies on a function UB(S, h), which provides an upper bound on err_D(h) (one that holds with high probability), and a function LB(S, h), which provides a lower bound on err_D(h). Such bounds can be obtained from VC theory. For simplicity, we focus on a finite hypothesis class. In that case we have:

Lemma 1 Let H be a finite hypothesis class, let D be a distribution, and let S ~ D^m be a training set of m examples sampled i.i.d. from D. Then, with probability at least 1 - δ,

  ∀ h ∈ H,  |err_D(h) - err_S(h)| ≤ sqrt(log(2|H|/δ) / (2m)).

Based on the above lemma, we define

  UB(S, h) = min{ err_S(h) + sqrt(log(2|H|/δ)/(2m)), 1 },
  LB(S, h) = max{ err_S(h) - sqrt(log(2|H|/δ)/(2m)), 0 },

and we clearly have that with probability at least 1 - δ,

  ∀ h ∈ H,  LB(S, h) ≤ err_D(h) ≤ UB(S, h).

Algorithm 2 Agnostic Active (A^2)
  Parameters: ν, θ, and δ (used in UB and LB)
  Initialization: V_1 = H; k = 32 θ^2 log(2|H|/δ)
  While Δ(V_i) > 8θν:
    Let D_i be the conditional distribution of D given that x ∈ DIS(V_i)
    Sample k i.i.d. labeled examples from D_i; denote this set by S_i
    Update: V_{i+1} = {h ∈ V_i : LB(S_i, h) ≤ min_{h' ∈ V_i} UB(S_i, h')}
  Output: sample S ~ D_i^{8k} and return argmin_{h ∈ V_i} err_S(h)
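Below is a minimal Python sketch of this variant of A^2 (illustrative only; sample_x, query_label, and the Monte-Carlo estimate of Δ(V_i) are assumptions, the deviation term of Lemma 1 is used for UB and LB, and the minimum of the upper bounds is taken over the current version space):

import numpy as np

def in_dis(V, x):
    """x is in DIS(V) iff two hypotheses in V label it differently."""
    return len({h(x) for h in V}) > 1

def dis_mass(V, sample_x, n=20_000):
    """Monte-Carlo estimate of Delta(V) = P[x in DIS(V)]."""
    return np.mean([in_dis(V, sample_x()) for _ in range(n)])

def width(m, n_hyp, delta):
    """Deviation term of Lemma 1: sqrt(log(2|H|/delta) / (2m))."""
    return np.sqrt(np.log(2 * n_hyp / delta) / (2 * m))

def emp_err(h, S):
    return np.mean([float(h(x) != y) for x, y in S])

def a_squared(H, sample_x, query_label, nu, theta, delta):
    """Variant of A^2: query labels only inside DIS(V_i); discard a hypothesis
    only when its lower bound exceeds the smallest upper bound."""
    V, n_hyp = list(H), len(H)
    k = int(np.ceil(32 * theta**2 * np.log(2 * n_hyp / delta)))

    def conditional_sample(size):
        # sample from D conditioned on x in DIS(V), querying one label per kept point
        S = []
        while len(S) < size:
            x = sample_x()
            if in_dis(V, x):
                S.append((x, query_label(x)))
        return S

    while dis_mass(V, sample_x) > 8 * theta * nu:
        S = conditional_sample(k)
        w = width(k, n_hyp, delta)
        ub = [min(emp_err(h, S) + w, 1.0) for h in V]   # UB(S_i, h)
        lb = [max(emp_err(h, S) - w, 0.0) for h in V]   # LB(S_i, h)
        best_ub = min(ub)
        V = [h for h, l in zip(V, lb) if l <= best_ub]
    if len(V) == 1 or dis_mass(V, sample_x) == 0.0:
        return V[0]                                      # no disagreement left to sample
    S = conditional_sample(8 * k)                        # final ERM step
    return min(V, key=lambda h: emp_err(h, S))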

The following theorem shows that we can achieve a constant-factor approximation of the best possible error very quickly.

Theorem 2 Let H be a finite hypothesis class, D be a distribution, and assume that the disagreement coefficient θ is finite. Assume that ν ≥ err_D(h*) for the optimal h* ∈ H. Then, if Algorithm 2 is run with ν, θ, δ, with probability at least 1 - δ log(1/(8θν)) the algorithm outputs a hypothesis with err_D(h) ≤ 2ν and queries at most O(θ^2 log(1/(θν)) log(|H|/δ)) labels.

Proof First, suppose that Δ_i ≤ 8θν and that h* ∈ V_i. Then, since for all h ∈ V_i we have

  err_D(h) - err_D(h*) = Δ_i (err_{D_i}(h) - err_{D_i}(h*)) ≤ 8θν (err_{D_i}(h) - err_{D_i}(h*)),

it follows that to find h ∈ V_i with err_D(h) - err_D(h*) ≤ ν, and hence err_D(h) ≤ err_D(h*) + ν ≤ 2ν, it suffices to find h ∈ V_i with err_{D_i}(h) - err_{D_i}(h*) ≤ 1/(8θ). Standard generalization bounds tell us that an i.i.d. sample from D_i of size 4k guarantees that the ERM rule satisfies err_{D_i}(h) - err_{D_i}(h*) ≤ 1/(8θ) with probability at least 1 - δ. Hence, it remains to analyze how many rounds are required until Δ_i ≤ 8θν, and to verify that h* ∈ V_i at all times.

Let n = log(1/(8θν)) + 1 and let δ' = nδ. We will show that with probability at least 1 - δ', the hypothesis h* is never removed from V_i and Δ(V_{i+1}) ≤ Δ_i/2 on all rounds from 1 to n. This will imply that Δ(V_n) ≤ 8θν, and therefore the algorithm stops after at most n rounds.

Based on Lemma 1, for all rounds from 1 to n and all h, with probability at least 1 - δ' we have

  |err_{D_i}(h) - err_{S_i}(h)| ≤ sqrt(log(2|H|/δ) / (2k)).

In particular, this means we never remove h* from V_i: since LB(S_i, h*) ≤ err_{D_i}(h*) ≤ err_{D_i}(h) ≤ UB(S_i, h) for every h ∈ V_i (h* minimizes err_{D_i} over V_i, because err_D(h) - err_D(h*) = Δ_i (err_{D_i}(h) - err_{D_i}(h*)) ≥ 0), the lower bound of h* never exceeds the smallest upper bound.

Similarly to the proof of Theorem 1, let V_i^θ = {h ∈ V_i : d(h, h*) > Δ_i/(2θ)} be the set of hypotheses in V_i with a large disagreement with h*. We will show that V_{i+1} ⊆ V_i \ V_i^θ, and since V_i \ V_i^θ ⊆ B(h*, Δ_i/(2θ)) we shall have

  Δ(V_{i+1}) ≤ Δ(B(h*, Δ_i/(2θ))) ≤ θ Δ_i/(2θ) = Δ_i/2.

Note that for each h ∈ V_i, if h(x) ≠ h*(x) then x ∈ DIS(V_i). Therefore,

  d(h, h*) = Δ_i P_{x ~ D_i}[h(x) ≠ h*(x)] ≤ Δ_i (err_{D_i}(h) + err_{D_i}(h*)) ≤ Δ_i err_{D_i}(h) + ν,

where the last inequality holds because ν ≥ err_D(h*) ≥ Δ_i err_{D_i}(h*). Therefore, if d(h, h*) > Δ_i/(2θ) then

  Δ_i/(2θ) < d(h, h*) ≤ Δ_i err_{D_i}(h) + ν,   which implies   err_{D_i}(h) > 1/(2θ) - ν/Δ_i.

Using again ν/Δ_i ≥ err_{D_i}(h*) we get

  err_{D_i}(h) > 1/(2θ) - 2ν/Δ_i + err_{D_i}(h*),   i.e.,   err_{D_i}(h) - 1/(8θ) > err_{D_i}(h*) + 3/(8θ) - 2ν/Δ_i.

Since we now assume that Δ_i > 8θν, we have 2ν/Δ_i < 1/(4θ), and the above implies

  err_{D_i}(h) - 1/(8θ) > err_{D_i}(h*) + 1/(8θ).

Since k = 32 θ^2 log(2|H|/δ), all hypotheses in V_i^θ will be removed. Thus, Δ(V_{i+1}) ≤ Δ_i/2.

4.1 An extension to a low VC class

It is easy to handle a hypothesis class with VC dimension d using the same technique as in the previous section. Hanneke gave a direct analysis, without using the Sauer lemma trick.
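The reduction used in this subsection (and in Section 3.1) can be illustrated with a short sketch: restrict H to a pool of m unlabeled points and keep one representative per distinct labeling, which by the Sauer lemma leaves at most O(m^d) hypotheses. The snippet below is illustrative only; the threshold class and the uniform pool are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def restrict_to_pool(H, pool):
    """Keep one hypothesis per distinct labeling of the pool.

    By the Sauer lemma the number of distinct labelings is at most O(m^d),
    so the finite-class analysis can be applied to the restricted class."""
    seen, restricted = set(), []
    for h in H:
        pattern = tuple(h(x) for x in pool)
        if pattern not in seen:
            seen.add(pattern)
            restricted.append(h)
    return restricted

# Thresholds (VC dimension 1): a pool of m points admits at most m + 1 labelings.
H = [lambda x, a=a: 1 if x >= a else -1 for a in np.linspace(0, 1, 10_001)]
pool = rng.uniform(0, 1, size=200)
print(len(restrict_to_pool(H, pool)))   # at most 201, despite |H| = 10001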