A tutorial on active learning


A tutorial on active learning. Sanjoy Dasgupta (UC San Diego) and John Langford (Yahoo Labs).

Exploiting unlabeled data. A lot of unlabeled data is plentiful and cheap, e.g. documents off the web, speech samples, images and video. But labeling can be expensive. [Figure: unlabeled points; supervised learning; semisupervised and active learning.]

Typical heuristics for active learning. Start with a pool of unlabeled data. Pick a few points at random and get their labels. Repeat: fit a classifier to the labels seen so far; query the unlabeled point that is closest to the boundary (or most uncertain, or most likely to decrease overall uncertainty, ...). Biased sampling: the labeled points are not representative of the underlying distribution! (A sketch of this loop appears below.)
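To make the loop concrete, here is a minimal sketch of pool-based uncertainty sampling. The base classifier (scikit-learn's LogisticRegression), the synthetic 2-D Gaussian pool, the seed size, and the label budget are illustrative assumptions, not part of the tutorial.

```python
# Minimal sketch of the uncertainty-sampling heuristic described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 2))                  # unlabeled pool (assumed synthetic data)
true_labels = (X_pool[:, 0] > 0).astype(int)         # hidden labeling oracle

while True:                                          # seed: a few points at random,
    labeled = list(rng.choice(len(X_pool), size=10, replace=False))
    if len(set(true_labels[labeled])) == 2:          # re-drawn until both classes appear
        break
queried = set(labeled)

clf = LogisticRegression()
for _ in range(50):                                  # label budget (assumption)
    clf.fit(X_pool[labeled], true_labels[labeled])   # fit a classifier to the labels seen so far
    p = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(p - 0.5)                   # highest for points closest to the boundary
    uncertainty[list(queried)] = -np.inf             # never re-query a point
    i = int(np.argmax(uncertainty))                  # most uncertain unlabeled point
    queried.add(i)
    labeled.append(i)                                # "query" its label from the oracle
```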

Sampling bias. Run the heuristic above on the example in the figure: data in four groups with probability mass 45%, 5%, 5%, 45%. Even with infinitely many labels, it converges to a classifier with 5% error instead of the best achievable, 2.5%. Not consistent! This manifests in practice, e.g. Schütze et al. '03.

Case II: Efficient search through hypothesis space. Ideal case: each query cuts the version space in two; then perhaps we need just log |H| labels to get a perfect hypothesis! Challenges: (1) Do there always exist queries that will cut off a good portion of the version space? (2) If so, how can these queries be found? (3) What happens in the nonseparable case?

Exploiting cluster structure in data [DH '08]. Basic primitive: find a clustering of the data; sample a few randomly-chosen points in each cluster; assign each cluster its majority label; now use this fully labeled data set to build a classifier. (A sketch of this primitive follows below.)

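A minimal sketch of the cluster-then-label primitive just described. The choice of k-means as the clustering step, the query_label() oracle, and the fixed per-cluster query budget are assumptions for illustration; the actual [DH '08] algorithm is more refined (it works with a hierarchical clustering and samples adaptively).

```python
# Sketch of the cluster-then-label primitive above (k-means and the query_label(i)
# oracle are stand-in assumptions, not the [DH '08] algorithm itself).
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_label(X, query_label, n_clusters=10, queries_per_cluster=5, seed=0):
    rng = np.random.default_rng(seed)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    y = np.empty(len(X), dtype=int)
    for c in range(n_clusters):
        members = np.flatnonzero(clusters == c)
        picks = rng.choice(members, size=min(queries_per_cluster, len(members)), replace=False)
        votes = [query_label(i) for i in picks]        # query a few randomly-chosen points
        y[members] = max(set(votes), key=votes.count)  # assign the cluster its majority label
    return y                                           # fully labeled data set, ready for a classifier
```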

Efficient search through hypothesis space. Threshold functions on the real line: H = {h_w : w ∈ R}, where h_w(x) = 1(x ≥ w). Supervised: for misclassification error ϵ, need 1/ϵ labeled points. Active learning: instead, start with 1/ϵ unlabeled points. Binary search: need just log 1/ϵ labels, from which the rest can be inferred. Exponential improvement in label complexity! (A sketch of the binary search is below.) Challenges: Nonseparable data? Other hypothesis classes?
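A minimal sketch of the binary-search argument, assuming separable threshold data: sort the unlabeled pool and binary-search for the 0/1 boundary, so only about log of the pool size labels are needed. The pool size and the hidden target threshold below are illustrative choices.

```python
# Binary search over a sorted unlabeled pool of separable threshold data
# (the pool size 1024 and the target threshold 0.37 are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
w_star = 0.37                                    # hidden target threshold
X = np.sort(rng.uniform(size=1024))              # ~1/epsilon unlabeled points
oracle = lambda x: int(x >= w_star)              # label oracle: 1(x >= w*)

queries, lo, hi = 0, 0, len(X)                   # invariant: first 1-labeled point lies in X[lo:hi]
while lo < hi:
    mid = (lo + hi) // 2
    queries += 1
    if oracle(X[mid]) == 1:                      # X[mid] labeled 1: boundary is at or before mid
        hi = mid
    else:                                        # X[mid] labeled 0: boundary is after mid
        lo = mid + 1

print(f"boundary at index {lo}: {queries} labels instead of {len(X)}")   # about log2(1024) queries
```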

Some results of active learning theory.
Aggressive, separable data: Query by committee (Freund, Seung, Shamir, Tishby, '97); Splitting index (D, '05).
Mellow, separable data: Generic active learner (Cohn, Atlas, Ladner, '91); Disagreement coefficient (Hanneke, '07).
Mellow, general (nonseparable) data: A² algorithm (Balcan, Beygelzimer, L, '06); Reduction to supervised (D, Hsu, Monteleoni, '07); Importance-weighted approach (Beygelzimer, D, L, '09).
Issues: Computational tractability. Are labels being used as efficiently as possible?

A generic mellow learner [CAL '91]. For separable data that is streaming in. H_1 = hypothesis class. Repeat for t = 1, 2, ...: receive unlabeled point x_t and decide whether a label is needed: if there is any disagreement within H_t about x_t's label, query label y_t and set H_{t+1} = {h ∈ H_t : h(x_t) = y_t}; else H_{t+1} = H_t. Here H_t is the current set of candidate hypotheses, and the points on which they disagree form the region of uncertainty. Problems: (1) intractable to maintain H_t; (2) nonseparable data. (A sketch with an explicitly maintained version space follows below.)
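Here is a minimal sketch of this loop with the version space maintained explicitly, for a finite class of threshold functions. The grid of thresholds, the stream, and the hidden target are illustrative assumptions; for rich hypothesis classes this explicit bookkeeping is exactly what becomes intractable.

```python
# Explicit CAL loop for a finite class of thresholds h_w(x) = 1(x >= w)
# (the grid, the stream, and the hidden target w* are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(1)
H = np.linspace(0.0, 1.0, 201)                     # H_1: a finite grid of thresholds w
w_star = 0.37                                      # hidden target
oracle = lambda x: int(x >= w_star)

labels_queried = 0
for x_t in rng.uniform(size=500):                  # stream of unlabeled points
    preds = (x_t >= H).astype(int)                 # every candidate hypothesis votes on x_t
    if preds.min() != preds.max():                 # disagreement within H_t: a label is needed
        y_t = oracle(x_t)
        labels_queried += 1
        H = H[preds == y_t]                        # H_{t+1} = {h in H_t : h(x_t) = y_t}
    # else: all of H_t agrees, the label is inferred and H_{t+1} = H_t

print(f"{labels_queried} labels queried; {H.size} candidate thresholds remain")
```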

Maintaining H_t. Explicitly maintaining H_t is intractable. Do it implicitly, by reduction to supervised learning.
Explicit version: H_1 = hypothesis class. For t = 1, 2, ...: receive unlabeled point x_t; if there is disagreement in H_t about x_t's label, query label y_t of x_t and set H_{t+1} = {h ∈ H_t : h(x_t) = y_t}; else H_{t+1} = H_t.
Implicit version: S = {} (points seen so far). For t = 1, 2, ...: receive unlabeled point x_t; if learn(S ∪ {(x_t, 1)}) and learn(S ∪ {(x_t, 0)}) both return an answer, query label y_t; else set y_t to whichever label succeeded. Then S = S ∪ {(x_t, y_t)}.
This scheme is no worse than straight supervised learning. But can one bound the number of labels needed? (A sketch of the implicit version is below.)
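A minimal sketch of the implicit version. The learn() routine below is a stand-in consistency oracle for thresholds on the line; in general it would be whatever supervised learner is available, returning a consistent hypothesis or reporting failure. The stream and the hidden target are illustrative assumptions.

```python
# Implicit CAL via a consistency oracle (here: thresholds h_w(x) = 1(x >= w);
# the stream and the hidden target are illustrative assumptions).
import numpy as np

def learn(S):
    """Return a threshold consistent with the labeled set S, or None if none exists."""
    lo = max((x for x, y in S if y == 0), default=None)   # every 0-labeled point needs x < w
    hi = min((x for x, y in S if y == 1), default=None)   # every 1-labeled point needs x >= w
    if lo is not None and hi is not None and lo >= hi:
        return None                                       # no consistent threshold exists
    if hi is not None:
        return hi
    return lo + 1.0 if lo is not None else 0.0

rng = np.random.default_rng(2)
oracle = lambda x: int(x >= 0.37)                         # hidden target threshold
S, labels_queried = [], 0
for x_t in rng.uniform(size=500):                         # stream of unlabeled points
    ok_1 = learn(S + [(x_t, 1)]) is not None              # could x_t be labeled 1?
    ok_0 = learn(S + [(x_t, 0)]) is not None              # could x_t be labeled 0?
    if ok_1 and ok_0:                                     # both labels consistent: must query
        y_t = oracle(x_t)
        labels_queried += 1
    else:                                                 # only one label succeeded: infer it
        y_t = 1 if ok_1 else 0
    S.append((x_t, y_t))

print(f"{labels_queried} labels queried out of {len(S)} points")
```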

Label complexity [Hanneke]. The label complexity of CAL (mellow, separable) active learning can be captured by the VC dimension d of the hypothesis class and by a parameter θ called the disagreement coefficient.
Regular supervised learning, separable case: suppose data are sampled iid from an underlying distribution. To get a hypothesis whose misclassification rate (on the underlying distribution) is at most ϵ with probability 0.9, it suffices to have d/ϵ labeled examples.
CAL active learner, separable case: label complexity is θ d log(1/ϵ).
There is a version of CAL for nonseparable data (more to come!). If the best achievable error rate is ν, it suffices to have θ (d log²(1/ϵ) + d ν²/ϵ²) labels. The usual supervised requirement is d/ϵ².

Disagreement coefficient [Hanneke]. Let P be the underlying probability distribution on input space X. It induces a (pseudo-)metric on hypotheses, d(h, h′) = P[h(X) ≠ h′(X)], with corresponding balls B(h, r) = {h′ ∈ H : d(h, h′) < r}. Disagreement region of any set of candidate hypotheses V ⊆ H: DIS(V) = {x : ∃ h, h′ ∈ V such that h(x) ≠ h′(x)}. Disagreement coefficient for target hypothesis h* ∈ H: θ = sup_r P[DIS(B(h*, r))] / r. [Figure: d(h, h*) = P[shaded region]; some elements of B(h*, r); DIS(B(h*, r)).]

Disagreement coefficient: separable case. Let P be the underlying probability distribution on input space X. Let H_ϵ be all hypotheses in H with error at most ϵ. Disagreement region: DIS(H_ϵ) = {x : ∃ h, h′ ∈ H_ϵ such that h(x) ≠ h′(x)}. Then the disagreement coefficient is θ = sup_ϵ P[DIS(H_ϵ)] / ϵ. Example: H = {thresholds in R}, any data distribution. The thresholds with error at most ϵ sit within probability mass ϵ on either side of the target, so DIS(H_ϵ) has mass at most 2ϵ; therefore θ = 2. (A quick numerical check follows below.)
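A quick Monte Carlo sanity check of the θ = 2 claim for thresholds, using the uniform distribution on [0, 1] as an illustrative choice of P; the grid of candidate thresholds, the target, and the sample size are also assumptions.

```python
# Monte Carlo check that P[DIS(B(h*, r))] / r is about 2 for thresholds under
# the uniform distribution on [0, 1] (illustrative choice of P and of the target).
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(size=200_000)                      # samples from P
w_star = 0.5                                       # target threshold h*
ws = np.linspace(0.0, 1.0, 2001)                   # grid of candidate thresholds h_w

for r in [0.2, 0.1, 0.05, 0.01]:
    ball = ws[np.abs(ws - w_star) < r]             # B(h*, r): d(h_w, h*) = |w - w*| under uniform P
    # x is in DIS(B(h*, r)) iff two thresholds in the ball disagree on it,
    # i.e. x falls between the smallest and largest threshold in the ball.
    in_dis = (X >= ball.min()) & (X < ball.max())
    print(f"r = {r:.2f}: P[DIS(B(h*, r))] / r = {in_dis.mean() / r:.2f}")   # roughly 2
```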

Disagreement coefficient: examples [H '07, F '09].
Thresholds in R, any data distribution: θ = 2. Label complexity O(log 1/ϵ).
Linear separators through the origin in R^d, uniform data distribution: θ ≤ √d. Label complexity O(d^{3/2} log 1/ϵ).
Linear separators in R^d, smooth data density bounded away from zero: θ ≤ c(h*) d, where c(h*) is a constant depending on the target h*. Label complexity O(c(h*) d² log 1/ϵ).
Disagreement coefficient: examples [H 07, F 09] Thresholds in R, any data distribution. Label complexity O(log 1/ϵ). θ =2. Linear separators through the origin in R d,uniformdatadistribution. Label complexity O(d 3/2 log 1/ϵ). θ d. Linear separators in R d, smooth data density bounded away from zero. θ c(h )d where c(h )isaconstantdependingonthetargeth. Label complexity O(c(h )d 2 log 1/ϵ).