Practical Agnostic Active Learning


Practical Agnostic Active Learning. Alina Beygelzimer, Yahoo Research. Based on joint work with Sanjoy Dasgupta, Daniel Hsu, John Langford, Francesco Orabona, Chicheng Zhang, and Tong Zhang. (* introductory slide credit)

Active learning. Labels are often much more expensive than inputs: documents, images, audio, video, drug compounds, ... Can interaction help us learn more effectively? Goal: learn an accurate classifier while requesting as few labels as possible.

Why should we hope for success? Threshold functions on the real line: the target is a threshold w, with − to its left and + to its right.
Supervised: need 1/ɛ labeled points; with high probability, any consistent threshold has error at most ɛ.
Active learning: start with 1/ɛ unlabeled points. Binary search: need just log(1/ɛ) labels. An exponential improvement in label complexity!
Nonseparable data? Other hypothesis classes?
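The binary-search argument is easy to see in code. A minimal sketch (the pool size, target threshold w = 0.7, and label oracle are our own toy choices):

```python
import math

def active_learn_threshold(xs, label):
    """Binary-search a sorted pool of unlabeled points for the first positive,
    querying O(log n) labels instead of labeling all n points."""
    lo, hi, queries = 0, len(xs) - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(xs[mid]) < 0:   # mid is still on the negative side
            lo = mid + 1
        else:
            hi = mid
    return xs[lo], queries

eps = 0.001
n = int(1 / eps)                         # pool of 1/eps unlabeled points
xs = sorted(i / n for i in range(n))
w = 0.7                                  # hypothetical target threshold
boundary, queries = active_learn_threshold(xs, lambda x: 1 if x >= w else -1)
print(queries, math.ceil(math.log2(n)))  # ~log2(1/eps) labels, not 1/eps
```

Here 10 label queries suffice for a pool of 1000 points, matching the log(1/ɛ) claim.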

Typical heuristics for active learning:
Start with a pool of unlabeled data. Pick a few points at random and get their labels. Repeat: fit a classifier to the labels seen so far; query the unlabeled point that is closest to the boundary (or most uncertain, ...).
Biased sampling: the labeled points are not representative of the underlying distribution!

Sampling bias
Start with a pool of unlabeled data. Pick a few points at random and get their labels. Repeat: fit a classifier to the labels seen so far; query the unlabeled point that is closest to the boundary (or most uncertain, or most likely to decrease overall uncertainty, ...).
Example: data falls in four clusters with masses 45%, 5%, 5%, 45%. Even with infinitely many labels, the heuristic converges to a classifier with 5% error instead of the best achievable, 2.5%. Not consistent! The missed cluster effect (Schütze et al., 2006).
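The bias is easy to demonstrate numerically. The following is our own toy construction (not the slide's 45/5/5/45 example): a learner that queries mostly near its current boundary gets a labeled set whose naive error estimate is badly biased, while the importance-weighted estimate previewed in the next slides is not.

```python
import random
random.seed(0)

def true_label(x): return 1 if x >= 0.5 else -1     # true threshold 0.5
def h(x): return 1 if x >= 0.6 else -1              # candidate, true error 0.1

n, naive_num, naive_den, iw_sum = 200_000, 0.0, 0, 0.0
for _ in range(n):
    x = random.random()
    p = 1.0 if abs(x - 0.6) < 0.05 else 0.1         # query hard near boundary
    if random.random() < p:                          # query with probability p
        mistake = 1.0 if h(x) != true_label(x) else 0.0
        naive_num += mistake; naive_den += 1         # unweighted average
        iw_sum += mistake / p                        # importance-weighted sum
naive, iw = naive_num / naive_den, iw_sum / n
print(round(naive, 2), round(iw, 2))  # naive is far above the true 0.1 error
```

The unweighted estimate lands near 0.29 because mistakes cluster in the heavily-queried region; the 1/p-weighted estimate recovers the true 0.1.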

Setting: Hypothesis class H. For h ∈ H, err(h) = Pr_{(x,y)}[h(x) ≠ y]. Minimum error rate: ν = min_{h∈H} err(h). Given ɛ, find h ∈ H with err(h) ≤ ν + ɛ.
Desiderata:
General H.
Consistent: always converges.
Agnostic: deals with arbitrary noise, ν ≥ 0.
Efficient: statistically and computationally.
Is this achievable? Yes (BBL-2006, DHM-2007, BDL-2009, BHLZ-2010, BHLZ-2016).

Importance Weighted Active Learning (IWAL)
S_0 = ∅. For t = 1, 2, ..., n:
1. Receive unlabeled example x_t and set S_t = S_{t−1}.
2. Choose a probability of labeling p_t.
3. Flip a coin Q_t ∈ {0, 1} with E[Q_t] = p_t. If Q_t = 1, request y_t and add (x_t, y_t, 1/p_t) to S_t.
4. Let h_{t+1} = LEARN(S_t).
Empirical importance-weighted error: err_n(h) = (1/n) Σ_{t=1}^n (Q_t/p_t) 1[h(x_t) ≠ y_t].
Minimizer: LEARN(S_t) = arg min_{h∈H} err_t(h).
Consistency: the algorithm is consistent as long as the p_t are bounded away from 0.
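The loop above can be sketched end-to-end for a finite class of thresholds. This is a toy illustration only: the grid of hypotheses, the data distribution, and the simplified rejection rule (p_t ∈ {1, 0.1} against a √(log t / t) threshold) are our own stand-ins, not the slides' exact algorithm.

```python
import math, random
random.seed(1)

H = [i / 20 for i in range(21)]                  # finite class of thresholds
def predict(h, x): return 1 if x >= h else -1
def true_label(x): return 1 if x >= 0.5 else -1

S = []                                           # (x, y, 1/p) triples
def iw_err(h): return sum(w for x, y, w in S if predict(h, x) != y)

labels = 0
for t in range(1, 401):
    x = random.random()
    best = min(H, key=iw_err)
    # Delta_t: extra weighted error if forced to flip the prediction on x.
    rivals = [h for h in H if predict(h, x) != predict(best, x)]
    delta = (min(iw_err(h) for h in rivals) - iw_err(best)) / t
    p = 1.0 if delta <= math.sqrt(math.log(t + 1) / t) else 0.1
    if random.random() < p:                      # query with probability p
        S.append((x, true_label(x), 1.0 / p))
        labels += 1
h_final = min(H, key=iw_err)
print(h_final, labels)   # threshold near 0.5, with fewer than 400 labels
```

Points far from the boundary get large Δ_t and are rarely queried, yet the importance weights keep the final ERM consistent.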

How should p_t be chosen? Let Δ_t = the increase in empirical importance-weighted error rate if the learner is forced to change its prediction on x_t. Set p_t = 1 if Δ_t ≤ O(√(log t / t)); otherwise, p_t = O(log t / (t Δ_t²)).
h_t = arg min{err_{t−1}(h) : h ∈ H}
h'_t = arg min{err_{t−1}(h) : h ∈ H and h(x_t) ≠ h_t(x_t)}
Δ_t = err_{t−1}(h'_t) − err_{t−1}(h_t)
How can we compute Δ_t in constant time? Find the smallest importance weight i_t such that h_{t+1} ≠ h_t after an update with (x_t, ȳ_t, i_t), where ȳ_t is the label opposite to h_t(x_t). We have (t−1) err_{t−1}(h'_t) ≤ (t−1) err_{t−1}(h_t) + i_t. Thus Δ_t ≤ i_t/(t−1).

Practical Considerations
Importance-weight-aware SGD updates [Karampatziakis and Langford]: solve for i_t directly; e.g., for logistic loss, i_t = 2 w_t^T x_t / (η_t sign(w_t^T x_t) x_t^T x_t).
[Plots: test error vs. fraction of labels queried on astrophysics and rcv1, comparing the invariant, implicit, and gradient-multiplication updates against passive learning.]

Guarantees
IWAL achieves error similar to that of supervised learning on n points.
Accuracy Theorem: For all n ≥ 1, with high probability, err(h_n) ≤ err(h*) + C √(log n / (n−1)).
Label Efficiency Theorem: With high probability, the expected number of labels queried after n interactions is at most θ · err(h*) · n (the minimum due to noise) + O(θ √(n log n)), where θ is the disagreement coefficient.

The crucial ratio: the Disagreement Coefficient (Hanneke, 2007)
The r-ball around a minimum-error hypothesis h*: B(h*, r) = {h ∈ H : Pr[h(x) ≠ h*(x)] ≤ r}.
The disagreement region of B(h*, r): DIS(B(h*, r)) = {x ∈ X : ∃ h, h' ∈ B(h*, r) with h(x) ≠ h'(x)}.
The disagreement coefficient measures the rate of growth of Pr[DIS(B(h*, r))]: θ = sup_{r>0} Pr[DIS(B(h*, r))] / r.
Example: thresholds in R, under any data distribution: θ = 2.
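The θ = 2 value for thresholds can be checked directly. For thresholds on [0, 1] under the uniform distribution (our concrete instance of the slide's example, with h* = 0.5), B(h*, r) contains exactly the thresholds within distance r of h*, so its disagreement region is the interval (0.5 − r, 0.5 + r), of mass 2r:

```python
def pr_dis(r, hstar=0.5):
    """Uniform mass of the disagreement region of B(h*, r) for thresholds."""
    lo, hi = max(0.0, hstar - r), min(1.0, hstar + r)
    return hi - lo

ratios = [pr_dis(r) / r for r in (0.01, 0.05, 0.1, 0.25)]
print(ratios)   # each ratio equals 2 (up to float rounding), so theta = 2
```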

Benefits
1. Always consistent.
2. Efficient.
2.1 Label efficient, unlabeled-data efficient, computationally efficient.
3. Compatible.
3.1 With online algorithms
3.2 With any optimization-style classification algorithm
3.3 With any loss function
3.4 With supervised learning
3.5 With switching learning algorithms (!)
4. The collected labeled set is reusable with a different algorithm or hypothesis class.
5. It works, empirically.

Applications
News article categorization
Image classification
Hate-speech detection / comment sentiment
NLP sentiment classification (satire, newsiness, gravitas)

Active learning in Vowpal Wabbit
Simulating active learning (knob c > 0): vw --active_simulation --active_mellowness c
Deploying active learning: vw --active_learning --active_mellowness c --daemon
vw interacts with an active interactor (ai):
vw receives labeled and unlabeled training examples from ai over the network;
for each unlabeled data point, vw sends back a query decision (and an importance weight if the label is requested);
ai sends labeled importance-weighted examples as requested;
vw trains using the labeled importance-weighted examples.

Protocol between active_interactor (ai) and vw:
ai sends x_1; vw replies (query, 1/p_1); ai sends (x_1, y_1, 1/p_1); ai sends x_2; vw performs a gradient update and replies (no query); ...
active_interactor.cc (in the git repository) demonstrates how to implement this protocol.
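The message flow can be sketched in-process (no sockets, and the query rule and names below are our own illustration, not VW's wire format):

```python
import random
random.seed(2)

def vw_query_decision(x, t):
    """Stand-in for vw's side: decide whether to ask for this label, and at
    what importance weight (hypothetical decreasing query rate)."""
    p = max(0.1, 1.0 / (1.0 + 0.01 * t))
    return (random.random() < p), 1.0 / p

labeled = []
for t in range(1000):
    x = random.random()                       # ai -> vw: unlabeled x_t
    ask, weight = vw_query_decision(x, t)     # vw -> ai: query decision, 1/p_t
    if ask:
        y = 1 if x >= 0.5 else -1             # ai obtains the label
        labeled.append((x, y, weight))        # ai -> vw: (x_t, y_t, 1/p_t)
    # otherwise vw just performs its no-query update and moves on
print(len(labeled))   # only a fraction of the 1000 examples get labeled
```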

Needle in a haystack: rare classes
Learning interval functions h_{a,b}(x) = 1[a ≤ x ≤ b], for 0 ≤ a ≤ b ≤ 1.
Supervised learning: needs O(1/ɛ) labeled examples.
Active learning: needs O(1/W + log(1/ɛ)) labels, where W is the width of the target interval. Since W can be as small as ɛ, this is no improvement over passive learning in the worst case.
... but given any example of the rare class, the label complexity drops to O(log(1/ɛ)).
Dasgupta 2005; Attenberg & Provost 2010: searching for and inserting labeled rare-class examples helps.
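The drop to O(log 1/ɛ) is just binary search again. A sketch in a noiseless, hypothetical setting (interval [a, b], seed positive x0, and ɛ chosen by us): with one known positive inside the interval, each endpoint is found by bisecting between a positive and a negative point, independently of the width W.

```python
def find_boundary(x_pos, x_neg, label, eps):
    """Binary-search between a positive and a negative point; returns the
    positive-side estimate and the number of labels queried."""
    q = 0
    while abs(x_pos - x_neg) > eps:
        mid = (x_pos + x_neg) / 2
        q += 1
        if label(mid) == 1:
            x_pos = mid
        else:
            x_neg = mid
    return x_pos, q

a, b, eps = 0.30, 0.34, 1e-4               # narrow target interval, W = 0.04
label = lambda x: 1 if a <= x <= b else 0
x0 = 0.32                                   # one known rare-class example
right, q1 = find_boundary(x0, 1.0, label, eps)   # search right endpoint
left, q2 = find_boundary(x0, 0.0, label, eps)    # search left endpoint
print(q1 + q2)   # about 2 * log2(1/eps) labels, regardless of W
```

Without the seed positive, random labeling needs on the order of 1/W queries just to hit the interval once.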

predicted label: Entertainment
How can editors feed observed mistakes into an active learning algorithm?

Is this fixable? Beygelzimer-Hsu-Langford-Zhang (NIPS-16): define a Search oracle. The active learner interactively restricts the searchable space, guiding Search to where it's most effective.

Search Oracle
Oracle Search
Require: working set of candidate models V.
Ensure: a labeled example (x, y) such that h(x) ≠ y for all h ∈ V (a systematic mistake), or ⊥ if there is no such example.
How can a counterexample to a version space be used? Maintain a nested sequence of model classes of increasing complexity: H_1 ⊆ H_2 ⊆ ... ⊆ H_k ⊆ ... Advance to more complex classes as the simpler ones are proved inadequate.

Search + Label
Search + Label can provide exponentially large problem-dependent improvements over Label alone, with a general agnostic algorithm.
Union of k intervals example: Õ(k + log(1/ɛ)) Search queries and Õ((polylog(1/ɛ) + log k)(1 + ν²/ɛ²)) Label queries.
How can we make it practical? It has the same drawbacks as any version-space approach. Can we reformulate it as a reduction to supervised learning?

Interactive Learning
Interactive settings: active learning, contextual bandit learning, reinforcement learning.
Bias is a pervasive issue: the learner creates the data it learns from and is evaluated on; the state of the world depends on the learner's decisions.
How can we use supervised learning technology in these new interactive settings?

Optimal Multiclass Bandit Learning (ICML-2017)
For t = 1, ..., T:
1. Observe x_t.
2. Predict label ŷ_t ∈ {1, ..., K}.
3. Pay and observe 1[ŷ_t ≠ y_t] (ad not clicked).
No stochastic assumptions on the input sequence. Compete with multiclass linear predictors {W ∈ R^{K×d}}, where W(x) = arg max_{k∈[K]} (W x)_k.

Mistake Bounds
Banditron [Kakade-Shalev-Shwartz-Tewari, ICML-08]; SOBA [Beygelzimer-Orabona-Zhang, ICML-17].
Perceptron: L + √T. Banditron: L + T^{2/3}. SOBA: L + √T.
where L is the competitor's total hinge loss. Per-round hinge loss of W: ℓ_t(W) = max_{r ≠ y_t} [1 − (W x_t)_{y_t} + (W x_t)_r]_+ · 1[y_t ≠ ŷ_t].
Resolves a COLT open problem (Abernethy and Rakhlin, 2009).

The Multiclass Perceptron
A linear multiclass predictor is defined by a matrix W ∈ R^{k×d}.
For t = 1, ..., T:
Receive x_t ∈ R^d.
Predict ŷ_t = arg max_r (W_t x_t)_r.
Receive y_t.
Update W_{t+1} = W_t + U_t, where U_t = 1[ŷ_t ≠ y_t](e_{y_t} − e_{ŷ_t}) x_t^T.
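The update above transcribes directly into code. A pure-Python sketch on a toy separable stream (the three cluster centers and noise level are our own illustration):

```python
import random
random.seed(3)

K, d = 3, 2
W = [[0.0] * d for _ in range(K)]               # W in R^{K x d}

def predict(x):
    """y_hat = arg max_r (W x)_r"""
    return max(range(K), key=lambda r: sum(W[r][j] * x[j] for j in range(d)))

centers = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
mistakes = 0
for t in range(2000):
    y = random.randrange(K)
    x = tuple(c + random.uniform(-0.2, 0.2) for c in centers[y])
    y_hat = predict(x)
    if y_hat != y:                               # U_t = (e_y - e_{y_hat}) x^T
        mistakes += 1
        for j in range(d):
            W[y][j] += x[j]
            W[y_hat][j] -= x[j]
print(mistakes)   # bounded: updates stop once the stream is separated
```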

Bandit Setting
If ŷ_t ≠ y_t, we are blind to the value of y_t. Solution: randomization!
SOBA: a second-order perceptron with a novel unbiased estimator for both the perceptron update and the second-order update. Passive-aggressive update (sometimes updating when there is no mistake but the margin is small).
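The randomization idea can be sketched with the simpler Banditron-style estimator [Kakade-Shalev-Shwartz-Tewari] that SOBA builds on: explore with probability γ, then scale the observed one-bit feedback by 1/p so the update is unbiased for the full-information perceptron update. The data stream and constants here are our own toy choices; SOBA's second-order machinery is omitted.

```python
import random
random.seed(4)

K, d, gamma = 3, 2, 0.2
W = [[0.0] * d for _ in range(K)]

def argmax_label(x):
    return max(range(K), key=lambda r: sum(W[r][j] * x[j] for j in range(d)))

centers = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
greedy_mistakes, T = 0, 4000
for t in range(T):
    y = random.randrange(K)
    x = tuple(c + random.uniform(-0.2, 0.2) for c in centers[y])
    y_hat = argmax_label(x)
    if y_hat != y:
        greedy_mistakes += 1
    # Play y_hat with prob 1 - gamma, otherwise a uniform exploration label.
    probs = [gamma / K + (1 - gamma) * (r == y_hat) for r in range(K)]
    y_tilde = random.choices(range(K), weights=probs)[0]
    seen_correct = (y_tilde == y)                # the only feedback observed
    for j in range(d):
        if seen_correct:                         # + e_{y_tilde} x / p_{y_tilde}
            W[y_tilde][j] += x[j] / probs[y_tilde]
        W[y_hat][j] -= x[j]                      # - e_{y_hat} x, every round
print(greedy_mistakes / T)   # well below the 2/3 rate of random guessing
```

In expectation over ỹ, the update equals the full-information perceptron update (e_y − e_ŷ) x^T, even though the true label y is never revealed directly.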