Sample and Computationally Efficient Active Learning. Maria-Florina Balcan Carnegie Mellon University


Machine Learning is Shaping the World. A highly successful discipline with lots of applications: computational biology, computer vision, speech and language processing. Mostly based on passive supervised learning.

Interactive Machine Learning. Modern applications generate massive amounts of raw data (protein sequences, billions of webpages, images), but only a tiny fraction can be annotated by human experts. Active learning: utilize the data, minimize expert intervention.

Interactive Machine Learning. Lots of exciting activity in recent years on understanding the power of active learning, mostly on label efficiency. This talk: margin-based active learning, and the power of aggressive localization for label-efficient and poly-time active learning. Techniques of interest beyond active learning: an adaptive sequence of convex optimization problems yields better noise tolerance than previously known for classic passive learning via poly-time algorithms.

Structure of the Talk Standard supervised learning. Active learning. Margin based active learning.

Supervised Machine Learning

Supervised Learning. E.g., which emails are spam and which are important (spam vs. not spam). E.g., classify objects as chairs vs. non-chairs.

Statistical / PAC learning model. Data source: a distribution D on X, with an expert/oracle labeling according to a target c*: X → {0,1}. The learning algorithm sees labeled examples (x_1, c*(x_1)), …, (x_m, c*(x_m)), with the x_i drawn i.i.d. from D, does optimization over this sample S, and outputs a hypothesis h: X → {0,1} in C. Goal: h has small error, err(h) = Pr_{x ~ D}[h(x) ≠ c*(x)]. If c* is in C, this is the realizable case; otherwise, the agnostic case.

Two Main Aspects in Classic Machine Learning. Algorithm design: how to optimize? Automatically generate rules that do well on observed data, with running time poly(d, 1/ε, 1/δ). Generalization guarantees / sample complexity: confidence that the rule is effective on future data, using O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))) examples. For C = linear separators in R^d: O((1/ε)(d log(1/ε) + log(1/δ))) examples.
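As a rough, assumed illustration (not from the slides), the sketch below plugs concrete numbers into the passive bound above and compares it with the O(d log 1/ε) active-learning label bound discussed later in the talk; the hidden constants are simply set to 1.

```python
import math

def passive_sample_bound(d, eps, delta, c=1.0):
    """Passive bound for linear separators in R^d:
    O((1/eps) * (d*log(1/eps) + log(1/delta))). The constant c is a placeholder."""
    return c * (1.0 / eps) * (d * math.log(1.0 / eps) + math.log(1.0 / delta))

def active_label_bound(d, eps, c=1.0):
    """Label bound O(d * log(1/eps)) achieved by margin-based active learning
    under log-concave distributions (discussed later in this talk)."""
    return c * d * math.log(1.0 / eps)

if __name__ == "__main__":
    d, eps, delta = 100, 0.01, 0.05
    print("passive labels ~", round(passive_sample_bound(d, eps, delta)))
    print("active labels  ~", round(active_label_bound(d, eps)))
```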

Modern ML: New Learning Approaches. Modern applications generate massive amounts of raw data (protein sequences, billions of webpages, images), but only a tiny fraction can be annotated by human experts.

Active Learning. The learner selects raw data points to send to an expert labeler (e.g., face vs. not face images) and trains a classifier on the returned labels.

Active Learning in Practice. Text classification: active SVM (Tong & Koller, ICML 2000), e.g., request the label of the example closest to the current separator. Video segmentation (Fathi-Balcan-Ren-Rehg, BMVC 11).

Can adaptive querying help? [CAL 92, Dasgupta 04] Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}. Active algorithm: get N = O(1/ε) unlabeled examples. How can we recover their correct labels with far fewer than N queries? Do binary search! Just O(log N) labels are needed; output a classifier consistent with the N inferred labels. Passive supervised learning needs Ω(1/ε) labels to find an ε-accurate threshold; active learning needs only O(log 1/ε) labels. An exponential improvement.
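A minimal sketch of the binary-search idea for thresholds, assuming noiseless (realizable) labels and a label oracle; the names and the unit-interval setup are illustrative, not from the slides.

```python
import random

def active_learn_threshold(xs, label_oracle):
    """Binary search over the sorted unlabeled pool to locate the threshold.
    Assumes realizable data: labels are 0 to the left of some w and 1 to the right.
    Uses O(log N) label queries instead of labeling all N points."""
    xs = sorted(xs)
    lo, hi = 0, len(xs) - 1
    if label_oracle(xs[lo]) == 1:
        return xs[lo]                # everything in the pool is positive
    if label_oracle(xs[hi]) == 0:
        return float("inf")          # everything in the pool is negative
    while hi - lo > 1:               # invariant: xs[lo] labeled 0, xs[hi] labeled 1
        mid = (lo + hi) // 2
        if label_oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    return xs[hi]                    # predict 1 on x >= returned threshold

if __name__ == "__main__":
    true_w = 0.37
    pool = [random.random() for _ in range(1000)]    # N = O(1/eps) unlabeled draws
    queries = []
    def oracle(x):
        queries.append(x)
        return 1 if x >= true_w else 0
    w_hat = active_learn_threshold(pool, oracle)
    print("estimated threshold:", w_hat, "labels used:", len(queries))
```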

Active Learning, Provable Guarantees. Lots of exciting results on sample complexity, e.g., Dasgupta-Kalai-Monteleoni 05, Castro-Nowak 07, Cavallanti-Cesa-Bianchi-Gentile 10, Yan-Chaudhuri-Javidi 16, Dasgupta-Hsu 08, Urner-Wulff-Ben-David 13. Disagreement-based algorithms: pick a few points at random from the current region of disagreement (uncertainty), query their labels, and throw out hypotheses once you are statistically confident they are suboptimal. [Balcan-Beygelzimer-Langford 06, Hanneke 07, Dasgupta-Hsu-Monteleoni 07, Wang 09, Fridman 09, Koltchinskii 10, BHW 08, Beygelzimer-Hsu-Langford-Zhang 10, Hsu 10, Ailon 12, and others]

Disagreement-Based Active Learning: the A² Agnostic Active Learner [Balcan-Beygelzimer-Langford, ICML 2006]. Let H_1 = H (the current version space). For t = 1, 2, …: pick a few points at random from the current region of disagreement DIS(H_t) and query their labels; throw out hypotheses once statistically confident they are suboptimal.
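A minimal, realizable-case sketch of a disagreement-based loop (closer to CAL than to the full agnostic A² procedure), over a finite hypothesis pool; the pool, toy data, and helper names are illustrative assumptions, not the algorithm from the paper.

```python
import random

def disagreement_based_learn(hypotheses, unlabeled, label_oracle,
                             queries_per_round=10, rounds=20):
    """Keep a finite version space; query only points on which surviving
    hypotheses disagree, then discard hypotheses that get a queried label wrong.
    (Realizable-case simplification; the agnostic A^2 algorithm instead removes
    hypotheses only with statistical confidence.)"""
    version_space = list(hypotheses)
    for _ in range(rounds):
        # region of disagreement: points where surviving hypotheses don't all agree
        dis = [x for x in unlabeled if len({h(x) for h in version_space}) > 1]
        if not dis:
            break
        for x in random.sample(dis, min(queries_per_round, len(dis))):
            y = label_oracle(x)
            version_space = [h for h in version_space if h(x) == y]
    return version_space[0] if version_space else None

if __name__ == "__main__":
    # toy example: threshold classifiers on [0, 1]
    true_w = 0.42
    pool = [i / 200.0 for i in range(201)]
    hyps = [(lambda x, w=w: int(x >= w)) for w in pool]
    h = disagreement_based_learn(hyps, pool, lambda x: int(x >= true_w))
    print("prediction at 0.5:", h(0.5), "prediction at 0.3:", h(0.3))
```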

Disagreement-Based Active Learning: the A² Agnostic Active Learner [Balcan-Beygelzimer-Langford, ICML 2006]. Pick a few points at random from the current region of disagreement (uncertainty), query their labels, throw out hypotheses once statistically confident they are suboptimal. Guarantees for A² [Hanneke 07] are stated via the disagreement coefficient θ_{c*} = sup_{r > η + ε} P(DIS(B(c*, r))) / r. Realizable: m = O(θ_{c*} VCdim(C) log(1/ε)) labels. Agnostic: m = O((η²/ε²) θ_{c*}² VCdim(C) log(1/ε)) labels. Linear separators under the uniform distribution: θ_{c*} = Θ(√d).
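As a quick, assumed illustration of the disagreement coefficient (not from the talk): for threshold classifiers under the uniform distribution on [0, 1], θ equals 2, and a short Monte-Carlo check recovers it.

```python
import random

def estimate_theta(w_star=0.5, r=0.05, n=200_000):
    """Monte-Carlo estimate of P(DIS(B(c*, r))) / r for thresholds on [0, 1] under
    the uniform distribution. B(c*, r) = {thresholds w : |w - w_star| <= r}, so the
    region of disagreement is the interval (w_star - r, w_star + r) and theta = 2."""
    hits = sum(1 for _ in range(n) if abs(random.random() - w_star) < r)
    return (hits / n) / r

if __name__ == "__main__":
    for r in (0.01, 0.05, 0.1):
        print(f"r={r}: P(DIS)/r ~ {estimate_theta(r=r):.2f}")
```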

Disagreement-Based Active Learning. Pick a few points at random from the current region of disagreement (uncertainty), query their labels, throw out hypotheses once you are statistically confident they are suboptimal. [Balcan-Beygelzimer-Langford 06, Hanneke 07, Dasgupta-Hsu-Monteleoni 07, Wang 09, Fridman 09, Koltchinskii 10, BHW 08, Beygelzimer-Hsu-Langford-Zhang 10, Hsu 10, Ailon 12, and others] Pros: generic (works for any class), handles adversarial label noise. Cons: suboptimal in label complexity and computationally prohibitive.

Poly Time, Noise Tolerant/Agnostic, Label Optimal AL Algos.

Margin-Based Active Learning: a margin-based algorithm for learning linear separators. [Balcan-Long, COLT 2013] [Awasthi-Balcan-Long, STOC 2014] [Awasthi-Balcan-Haghtalab-Urner, COLT 15] [Awasthi-Balcan-Haghtalab-Zhang, COLT 16] [Awasthi-Balcan-Long, JACM 2017] Realizable case: exponential improvement, only O(d log 1/ε) labels to find w with error ε, when D is log-concave. Agnostic case: a poly-time active learning algorithm outputs w with err(w) = O(η + ε), where η = err(best linear separator), using O(d log 1/ε) labels when D is log-concave. Improves on the noise tolerance of the previous best passive algorithms [KKMS 05], [KLS 09] too!

Margin-Based Active Learning, Realizable Case. Draw m_1 unlabeled examples, label them, add them to W(1). Iterate k = 2, …, s: find a hypothesis w_{k-1} consistent with W(k-1); set W(k) = W(k-1); sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}; label them and add them to W(k).
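A minimal sketch of this loop for homogeneous linear separators, assuming an isotropic Gaussian (hence log-concave) marginal, a noiseless label oracle, and a simple perceptron pass as the "find a consistent hypothesis" step; the band widths, sample sizes, and round count are illustrative choices, not the constants from the analysis.

```python
import numpy as np

def perceptron_consistent(X, y, epochs=200):
    """Find a (near-)consistent homogeneous linear separator by running the
    perceptron over the labeled working set (realizable data, so it converges)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * np.dot(w, x) <= 0:
                w += label * x
                mistakes += 1
        if mistakes == 0:
            break
    return w / np.linalg.norm(w)

def margin_based_active_learning(d, oracle, rounds=6, m_per_round=200, rng=None):
    """Margin-based active learning, realizable case (illustrative constants):
    each round, label only points inside the band |w_{k-1} . x| <= gamma_{k-1},
    with gamma_k shrinking geometrically (aggressive localization)."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((m_per_round, d))          # initial unlabeled draw, all labeled
    W_X, W_y = list(X), [oracle(x) for x in X]
    w = perceptron_consistent(np.array(W_X), np.array(W_y))
    gamma = 1.0
    for _ in range(rounds):
        gamma /= 2.0
        band = []
        while len(band) < m_per_round:                 # rejection-sample the band
            x = rng.standard_normal(d)
            if abs(np.dot(w, x)) <= gamma:
                band.append(x)
        W_X.extend(band)
        W_y.extend(oracle(x) for x in band)
        w = perceptron_consistent(np.array(W_X), np.array(W_y))
    return w

if __name__ == "__main__":
    d = 5
    w_star = np.ones(d) / np.sqrt(d)
    w_hat = margin_based_active_learning(d, lambda x: 1 if np.dot(w_star, x) >= 0 else -1)
    print("angle to target (radians):", np.arccos(np.clip(np.dot(w_hat, w_star), -1, 1)))
```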

Margin-Based Active Learning, Realizable Case. Log-concave distributions: the log of the density function is concave, i.e., f(λ x_1 + (1 − λ) x_2) ≥ f(x_1)^λ f(x_2)^{1−λ}. A wide class: the uniform distribution over any convex set, Gaussians, etc. Theorem: let D be log-concave in R^d. If γ_k = O(1/2^k), then err(w_s) ≤ ε after s = log(1/ε) rounds, using O(d) labels per round. [Figure: active-learning label requests concentrated in the band, vs. passive-learning label requests spread over all unlabeled examples.]

Analysis: Aggressive Localization. Induction: all w consistent with W(k) have err(w) ≤ 1/2^k. [Figure: a suboptimal candidate w, the current guess w_{k-1}, the target w*, and the band of width γ_{k-1} around w_{k-1}.]

Analysis: Aggressive Localization. Induction: all w consistent with W(k) have err(w) ≤ 1/2^k. Decompose the error of w over the band around w_{k-1}:
err(w) = Pr[w errs on x, |w_{k-1} · x| > γ_{k-1}] + Pr[w errs on x | |w_{k-1} · x| ≤ γ_{k-1}] · Pr[|w_{k-1} · x| ≤ γ_{k-1}].
The first (outside-the-band) term can be made at most 1/2^{k+1}. So it is enough to ensure Pr[w errs on x | |w_{k-1} · x| ≤ γ_{k-1}] ≤ C for a suitable constant C, and for that only m_k = O(d) labels are needed in round k. Key point: localize aggressively, while maintaining correctness.

The Agnostic Case. No linear separator can separate the + and − examples perfectly; the best linear separator has error η. The algorithm is still margin based: draw m_1 unlabeled examples, label them, add them to W. Iterate k = 2, …, s: find w_k in B(w_{k-1}, r_{k-1}) of small τ_{k-1}-hinge loss wrt W; clear the working set W; sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}; label them and add them to W. End iterate.


The Agnostic Case. Key idea 1: replace 0/1-error minimization with τ_{k-1}-hinge loss minimization. [Figure: 0/1 penalty vs. the τ_k-hinge penalty as a function of the signed margin y(w · x).]
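A small sketch of the surrogate loss, written here as a τ-scaled hinge of the assumed form max(0, 1 − y(w · x)/τ) rather than the exact definition from the papers, compared against the 0/1 loss.

```python
import numpy as np

def zero_one_loss(w, x, y):
    """0/1 loss: 1 if the sign of w . x disagrees with the label y in {-1, +1}."""
    return float(y * np.dot(w, x) <= 0)

def tau_hinge_loss(w, x, y, tau):
    """tau-scaled hinge loss: max(0, 1 - y (w . x) / tau).
    It upper-bounds the 0/1 loss and is convex in w, so it can be minimized
    efficiently inside each band of the margin-based algorithm."""
    return max(0.0, 1.0 - y * np.dot(w, x) / tau)

if __name__ == "__main__":
    w = np.array([1.0, 0.0])
    for margin in (-0.5, 0.0, 0.05, 0.2):
        x, y = np.array([margin, 1.0]), 1
        print(margin, zero_one_loss(w, x, y), tau_hinge_loss(w, x, y, tau=0.1))
```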

The Agnostic Case. Key idea 2: stay close to the current guess, i.e., keep w_k in a small ball B(w_{k-1}, r_{k-1}) around w_{k-1}.

Margin-Based Active Learning, Agnostic Case. Draw m_1 unlabeled examples, label them, add them to W. Iterate k = 2, …, s: find w_k in B(w_{k-1}, r_{k-1}) of small τ_{k-1}-hinge loss wrt W (localization in concept space); clear the working set W; sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1} (localization in instance space); label them and add them to W. End iterate. Analysis, key idea: pick τ_k comparable to γ_k; localization and a variance analysis control the gap between the hinge loss and the 0/1 loss (only a constant factor).
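A minimal sketch of one round of the agnostic variant, assuming the labeled band sample has already been collected: minimize the τ-hinge loss over the ball B(w_{k-1}, r_{k-1}) by projected subgradient descent. The step sizes, projection trick, noise rate, and all constants are illustrative choices, not the exact procedure from the papers.

```python
import numpy as np

def project_to_ball(w, center, radius):
    """Project w onto the Euclidean ball of the given radius around center."""
    diff = w - center
    norm = np.linalg.norm(diff)
    return w if norm <= radius else center + diff * (radius / norm)

def hinge_round(X_band, y_band, w_prev, r_prev, tau, steps=500, lr=0.05):
    """One agnostic round: minimize the average tau-hinge loss on the band sample,
    constrained to stay in B(w_prev, r_prev), via projected subgradient descent."""
    w = w_prev.copy()
    n = len(y_band)
    for _ in range(steps):
        margins = y_band * (X_band @ w) / tau
        active = margins < 1.0                       # examples with nonzero hinge loss
        grad = -(y_band[active, None] * X_band[active]).sum(axis=0) / (tau * n)
        w = project_to_ball(w - lr * grad, w_prev, r_prev)
    return w / np.linalg.norm(w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, w_star = 5, np.ones(5) / np.sqrt(5)
    w_prev = w_star + 0.3 * rng.standard_normal(d)
    w_prev /= np.linalg.norm(w_prev)
    X = rng.standard_normal((2000, d))
    X_band = X[np.abs(X @ w_prev) <= 0.2]            # points inside the current band
    y_band = np.sign(X_band @ w_star)
    y_band[rng.random(len(y_band)) < 0.05] *= -1     # a little label noise
    w_new = hinge_round(X_band, y_band, w_prev, r_prev=0.5, tau=0.1)
    print("dot with target before/after:", float(w_prev @ w_star), float(w_new @ w_star))
```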

Improves over Passive Learning too!
Passive learning, malicious noise: prior work err(w) = O(η^{2/3} log^{2/3}(d/η)) [KLS 09]; our work err(w) = O(η), information-theoretically optimal.
Passive learning, agnostic noise: prior work err(w) = O(η^{1/3} log^{1/3}(1/η)) [KLS 09]; our work err(w) = O(η).
Passive learning, bounded noise (the gap between P(Y = 1 | x) and P(Y = −1 | x) is bounded away from 0): prior work NA; our work err(w) = η + ε.
Active learning [agnostic/malicious/bounded]: prior work NA; our work achieves the same guarantees as above, information-theoretically optimal.


Localization: both an algorithmic tool and an analysis tool! Useful for active and passive learning. Well known for analyzing sample complexity [Bousquet-Boucheron-Lugosi 04], [BBL 06], [Hanneke 07], among others. We show it is also useful for noise-tolerant poly-time algorithms; previously observed in practice, e.g., for SVMs [Bottou 07].

Further Margin-Based Refinements. Efficient algorithms: [Hanneke-Kanade-Yang, ALT 15] O(d log 1/ε) label complexity for our poly-time agnostic algorithm. [Daniely, COLT 15] achieves error C·η for any constant C in poly time in the agnostic setting (trading off running time with accuracy). [Awasthi-Balcan-Haghtalab-Urner, COLT 15], [Awasthi-Balcan-Haghtalab-Zhang, COLT 16]: bounded noise, error η + ε in time poly(d, 1/ε). [Yan-Zhang, NIPS 17]: a modified Perceptron enjoys similar guarantees. Active and differentially private: [Balcan-Feldman, NIPS 13]. General concept spaces: [Zhang-Chaudhuri, NIPS 14] compute the region to localize in each round by using unlabeled data and solving an LP.

Discussion, Open Directions. Active learning: an important paradigm in the age of Big Data, with lots of exciting developments. Disagreement-based AL: general sample complexity bounds. Margin-based AL: label-efficient, noise-tolerant, poly-time algorithms for learning linear separators, with better noise tolerance than known for passive learning via poly-time algorithms [KKMS 05], [KLS 09]. Solve an adaptive sequence of convex optimization problems on smaller and smaller bands around the current guess for the target.

Discussion, Open Directions. Open directions: more general distributions, other concept spaces; exploit localization insights in other settings (e.g., online convex optimization with adversarial noise).

Important direction: richer interactions with the expert, aiming for better accuracy, fewer queries, and more natural interaction.