Does Unlabeled Data Help?


Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pál; COLT 2008. Presentation by Ashish Rastogi, Courant Machine Learning Seminar.

Outline
- Introduction & Motivation
- SSL in Practice (SSL assumptions)
- Preliminaries
- Conjecture about Sample Complexity
- Proof of the Conjecture for a Simple Concept Class in the Realizable Setting
- Proof of the Conjecture for another Simple Concept Class in the Agnostic Setting
- Conclusion

Introduction: Settings
Each setting specifies the input to the algorithm (involving a distribution D), the algorithm's output, how performance is measured, and examples.
- Supervised learning: input = labeled examples; output = a function that maps points to labels; performance = expected error on an unseen point. Examples: support vector machines, AdaBoost.
- Unsupervised learning: input = unlabeled examples; output = a function that maps points to labels; performance = expected error on an unseen point. Example: clustering.
- Semi-supervised learning (SSL): input = labeled & unlabeled examples; output = a function that maps points to labels; performance = expected error on an unseen point.
- Transductive learning: input = labeled & unlabeled examples; output = labels of the unlabeled examples; performance = average error on the unlabeled examples. Examples: transductive SVMs, graph regularization.

Motivation
In real-world problems there is much more unlabeled data than labeled data, and labeling is costly in both time and money. Examples: web document classification, speech recognition, protein sequences, ...
Can unlabeled data help prediction accuracy? Intuitively, if the labels vary smoothly with the input, we expect better performance. But a precise theoretical analysis of the merits of SSL has been missing.

Introduction
Question: can SSL help without any prior assumptions on the distribution of labels? We measure "help" in terms of reduced labeled sample complexity. (Sample complexity: how much training data is needed to learn effectively.)
Answer: for some simple hypothesis spaces, not significantly: at most a factor-of-2 reduction in the sample complexity.

SSL in Practice
Binary classification problem; access to labeled training data. [Figure: a binary classification problem with a small labeled training sample.]

SSL in Practice
Binary classification problem; access to labeled training data. Unlabeled data gives us more confidence in the decision boundary, provided the labels vary smoothly over the input space. But this requires an assumption on the distribution of the labels of the unlabeled points. [Figure: the same problem with unlabeled points added.]

SSL in Practice
Common assumptions when using semi-supervised learning:
- The decision boundary should lie in a low-density region.
- Points in the same cluster are likely to be of the same class.
- If two points in a high-density region are close, then so should be the corresponding outputs.
Difficulty: these assumptions are hard to verify and not formalizable, and there is no analysis of the precise conditions under which SSL is better. There are bounds for classification and regression under SSL, but none has been shown to be provably better than the supervised setting. [Vapnik, 98; El-Yaniv & Pechyony, 06; Cortes et al., 08]

SSL in Practice
Low-density separators are not always good. Consider two overlapping Gaussians with mixture density (1/2) N(-1, 1) + (1/2) N(1, 1): the optimal separator lies in a very high-density region. [Figure: the mixture density with the optimal separator marked at its center.]

SSL in Practice
Low-density separators can be quite bad! Two Gaussians with different variances, mixture density (1/2) N(-2, 1) + (1/2) N(2, 2): the minimum-density separator has error ≈ 0.21, while the optimal separator has error ≈ 0.17. [Figure: the mixture density with the min-density and optimal separators marked.]
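The slide's comparison can be reproduced numerically. A sketch, assuming the second parameter of N(2, 2) denotes the standard deviation (the exact error values depend on that reading, so they may not match the slide's 0.21 and 0.17, but the ordering does not):

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Mixture from the slide: (1/2) N(-2, 1) + (1/2) N(2, SIGMA^2).
# Reading "2" as the standard deviation is an assumption.
SIGMA = 2.0

def density(s):
    """Mixture density at s."""
    return 0.5 * phi(s + 2) + 0.5 * phi((s - 2) / SIGMA) / SIGMA

def error(s):
    """Misclassification error of the threshold separator at s:
    left-Gaussian mass to the right of s plus right-Gaussian mass to its left."""
    return 0.5 * (1 - Phi(s + 2)) + 0.5 * Phi((s - 2) / SIGMA)

# Grid-search both separators between the two means.
grid = [-2 + 4 * i / 4000 for i in range(4001)]
s_min_density = min(grid, key=density)
s_optimal = min(grid, key=error)
print(error(s_min_density), error(s_optimal))  # min-density separator is strictly worse
```

By construction the optimal threshold minimizes `error`, so the min-density separator can only tie or lose; for these parameters it loses strictly.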

Preliminaries
Let X be a domain set. We focus on classification, so the labels are Y = {0, 1}. Hypotheses are functions h : X → Y, collected in a hypothesis set H. A target is a probability distribution over X × Y. For this talk, assume Y is a deterministic function of X, i.e. Pr[y = 1 | x], Pr[y = 0 | x] ∈ {0, 1}.
Realizable setting: the target function f ∈ H, so the minimum error is 0 [also called the consistent case].
Agnostic setting: the target function f ∉ H, so the minimum error is not 0 [also called the inconsistent case].

Preliminaries
In the SSL setting considered, the learning algorithm receives:
- a labeled training sample S = {(x_i, y_i)}_{i=1}^m, and
- the entire unlabeled distribution D over X.
[Figure: the unknown distribution D and a labeled sample drawn from it.]

Preliminaries
A supervised learning algorithm A(S): receives as input m labeled examples (x_i, y_i)_{i=1}^m and produces a hypothesis h : X → Y.
A semi-supervised learning algorithm A(S, D): receives as input m labeled examples (x_i, y_i)_{i=1}^m and a probability distribution D over X, and produces a hypothesis h : X → Y.
Risk (test error): Err_D(h) = Pr_{x ~ D}[h(x) ≠ y(x)].
Training error (empirical error): Err_S(h) = (1/m) Σ_{i=1}^m 1_{h(x_i) ≠ y_i}.
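The two error measures are easy to state in code. A minimal sketch, where the threshold hypothesis, the target t* = 0.3, and the uniform marginal are all hypothetical illustration choices:

```python
import random

def empirical_error(h, sample):
    """Err_S(h) = (1/m) * sum of 1[h(x_i) != y_i] over the labeled sample."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

# Hypothetical setup: hypothesis h = 1_{(-inf, 0.5]}, deterministic target 1_{(-inf, 0.3]}.
h = lambda x: 1 if x <= 0.5 else 0
target = lambda x: 1 if x <= 0.3 else 0

random.seed(0)
xs = [random.random() for _ in range(10000)]   # marginal D = Uniform(0, 1)
S = [(x, target(x)) for x in xs]

# With a large sample, Err_S(h) approximates the risk Err_D(h),
# which here is the mass of (0.3, 0.5], i.e. 0.2.
print(empirical_error(h, S))
```

The target itself has zero empirical error on S, matching the realizable-case statement that the minimum error is 0.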

Preliminaries
In the SSL setting considered, the learning algorithm receives a labeled training sample S = {(x_i, y_i)}_{i=1}^m and the entire unlabeled distribution D, which is otherwise unknown. Really, this is the transductive setting (infinitely many unlabeled points): any lower bound carries over to the case of fewer unlabeled points, while the upper bounds only apply for a large number of unlabeled points.

Sample Complexity
For a class H of hypotheses, the sample complexity of an SSL algorithm A(S, D) at confidence 1 − δ and accuracy ε is
m(A(·, D), H, ε, δ) = min { m ∈ N : Pr_{S ~ D^m} [ Err_D(A(S, D)) − inf_{h ∈ H} Err_D(h) > ε ] < δ }.
In words, the number of samples needed for the error of the learning algorithm to be within ε of the optimal hypothesis with probability at least 1 − δ. In the realizable case, inf_{h ∈ H} Err_D(h) = 0.
Symmetric difference of h_1, h_2 ∈ H: h_1 Δ h_2 = {x ∈ X : h_1(x) ≠ h_2(x)}.

Conjecture
For any hypothesis class H, there exist a constant c ≥ 1 and a single supervised algorithm A such that, for any distribution D and any SSL algorithm B, the sample complexity of the supervised algorithm is larger by at most a factor of c:
sup_{f ∈ H} m(A(·), H, ε, δ) ≤ c · sup_{f ∈ H} m(B(·, D), H, ε, δ).

Two Hypothesis Classes
Thresholds on the real line (points up to t labeled positive, the rest negative):
H = { 1_{(−∞, t]} : t ∈ R }.
Unions of at most d intervals:
UI_d = { 1_{[a_1, a_2) ∪ … ∪ [a_{2l−1}, a_{2l})} : l ≤ d }.
FYI, recall that VC(thresholds) = 1 and VC(UI_d) = 2d.
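The VC facts can be probed numerically. A sketch: a finite grid of thresholds stands in for the infinite class (for the fixed test points below, the grid realizes every labeling the full class does), and a helper checks which point sets are shattered:

```python
def shatters(hypotheses, points):
    """True iff the (finite sub)class realizes every labeling of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Thresholds 1_{(-inf, t]} on a grid of t values in [0, 1] (illustrative stand-in
# for the infinite class).
thresholds = [lambda x, t=t: 1 if x <= t else 0 for t in [i / 100 for i in range(101)]]

print(shatters(thresholds, [0.4]))       # any single point is shattered, so VC >= 1
print(shatters(thresholds, [0.3, 0.6]))  # labeling (0, 1) is impossible, so VC < 2
```

The two checks together exhibit VC(thresholds) = 1: one point is shattered, but no pair can be, since a threshold can never label a left point 0 and a right point 1.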

Learning Thresholds
Focus on the realizable case. We will show that the sample complexity of SSL for learning thresholds is half the sample complexity of supervised learning.
Easy part: the sample complexity of supervised learning.
Theorem: let H be the thresholds hypothesis space and L the learning algorithm that returns the left-most hypothesis consistent with the training set. Then for any distribution D, any ε, δ > 0, and any target h ∈ H,
m(L(·), H, D, ε, δ) ≤ ln(1/δ) / ε.
[Figure: the density of D with a shaded region of mass ε next to the target.]

Learning Thresholds
Theorem: let H be the thresholds hypothesis space and L the learning algorithm that returns the left-most hypothesis consistent with the training set. Then for any distribution D, any ε, δ > 0, and any target h ∈ H,
m(L(·), H, D, ε, δ) ≤ ln(1/δ) / ε.
Proof: the algorithm fails only if no training point lies in the shaded region of mass ε just to the left of the target, so the failure probability is at most
(1 − ε)^m ≤ exp(−εm) ≤ δ for m ≥ ln(1/δ)/ε.
[Figure: the left-most consistent hypothesis, the target h, and the shaded region of mass ε between them.]
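The failure-probability argument is easy to simulate. A sketch with D = Uniform(0, 1), target t* = 0.6, ε = 0.1 and m = 50, all chosen for illustration; the observed failure rate should sit near (1 − ε)^m, below the exp(−εm) bound up to sampling noise:

```python
import math
import random

def leftmost_consistent(sample):
    """Supervised learner L: the smallest consistent threshold, i.e. the
    rightmost positively labeled point (0.0 if there are no positives)."""
    return max((x for x, y in sample if y == 1), default=0.0)

# Empirical check of Pr[Err_D(L(S)) > eps] <= exp(-eps * m) for
# D = Uniform(0, 1) and target t* = 0.6 (hypothetical parameters).
random.seed(0)
target, eps, m, trials = 0.6, 0.1, 50, 2000
failures = 0
for _ in range(trials):
    S = [(x, 1 if x <= target else 0) for x in (random.random() for _ in range(m))]
    t_hat = leftmost_consistent(S)
    if target - t_hat > eps:  # the error mass between t_hat and t* exceeds eps
        failures += 1
print(failures / trials, math.exp(-eps * m))
```

With these parameters the theoretical failure probability is (1 − 0.1)^50 ≈ 0.005, comfortably below exp(−5) ≈ 0.007.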

Learning Thresholds
An SSL algorithm for learning thresholds [Kääriäinen, 05]: given the density of D, output the threshold that splits the probability mass of the interval of consistent thresholds in half. [Figure: the density of D, the target h, and the chosen point splitting the area in half.]
Theorem: let H be the thresholds hypothesis space and B the above SSL algorithm. For any continuous distribution D, any ε, δ > 0, and any target h ∈ H,
m(B(·, D), H, ε, δ) ≤ ln(1/δ)/(2ε) + ln 2/(2ε).
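A sketch of this learner, under illustrative assumptions: D = Uniform(0, 1) (so its CDF and inverse CDF are identities), target t* = 0.3, and version-space endpoints defaulting to the support boundaries when a label class is missing:

```python
import random

def ssl_threshold(sample, cdf, inv_cdf):
    """Kääriäinen-style SSL learner for thresholds 1_{(-inf, t]} (sketch).
    Every threshold between the rightmost positive and the leftmost negative
    point is consistent; return the one splitting that interval's D-mass
    in half, using the known distribution D through cdf / inv_cdf."""
    left = max((x for x, y in sample if y == 1), default=0.0)   # assumes support [0, 1]
    right = min((x for x, y in sample if y == 0), default=1.0)
    return inv_cdf((cdf(left) + cdf(right)) / 2)

# Hypothetical demo: D = Uniform(0, 1), target threshold t* = 0.3.
random.seed(1)
target = 0.3
S = [(x, 1 if x <= target else 0) for x in (random.random() for _ in range(200))]
t_hat = ssl_threshold(S, cdf=lambda x: x, inv_cdf=lambda p: p)
print(abs(t_hat - target))  # under the uniform marginal this is exactly Err_D
```

Splitting the mass in half is what halves the sample complexity: the learner errs by more than ε only if the whole version-space interval has mass above 2ε, versus ε for the left-most rule.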

Learning Thresholds
Theorem: let H be the thresholds hypothesis space and B the above SSL algorithm. For any continuous distribution D, any ε, δ > 0, and any target h ∈ H,
m(B(·, D), H, ε, δ) ≤ ln(1/δ)/(2ε) + ln 2/(2ε).
Proof: let I be the open interval containing the support of D.
Step 1: map I to (0, 1) via a function F(·), transforming the training sample S into S′ via F.
Step 2: m(B(·, D), H, ε, δ) ≤ m(A(·, U_{(0,1)}), H, ε, δ), where U is the uniform distribution on (0, 1).
Step 3: Pr_{S ~ U^m} [ Err_U(A(S, U)) ≥ ε ] ≤ 2(1 − 2ε)^m.

Learning Thresholds
Proof (continued): let I be the open interval containing the support of D.
Step 1: map I to (0, 1) via a function F(·), transforming the training sample S into S′ via F. [Figure: the mapping from I onto (0, 1).]

Learning Thresholds
The mapping naturally stretches D out to the uniform distribution on (0, 1). Define the corresponding SSL algorithm on (0, 1). [Figure: the mapped distribution on (0, 1).]

Learning Thresholds
Proof (continued), Step 3: Pr_{S ~ U^m} [ Err_U(A(S, U)) ≥ ε ] ≤ 2(1 − 2ε)^m.
The proof of this step is technical and (unnecessarily?) complicated, based on a case analysis of bad events; it can possibly be made simpler.

Learning Thresholds
Theorem (lower bound): for any (randomized) SSL algorithm A, any small ε, any δ, and any continuous distribution D, there exists a target h ∈ H such that
m(A(·, D), ε, δ) ≥ ln(1/δ)/(2.01ε) − ln 2/(2.01ε).
Proof:
Step 1: assume D is uniform over an open interval (similar trick as before).
Step 2: fix A, ε and δ. Pick t ~ U(0, 1) and consider the target h = 1_{[0, t]}; h is now a random variable. Show that
E_h [ Pr_{S ~ U^m} [ Err_U^h(A(S, U)) ≥ ε ] ] ≥ (1/2)(1 − 2ε)^m.
The theorem then follows by the probabilistic method.

Learning Thresholds
Proof (sketch):
Step 1: assume D is uniform (similar trick as before).
Step 2: fix A, ε and δ. Pick t ~ U(0, 1) and consider h = 1_{[0, t]}; h is a random variable. Show that
E_h [ Pr_{S ~ U^m} [ Err_U^h(A(S, U)) ≥ ε ] ] ≥ (1/2)(1 − 2ε)^m.
Writing "Bad Event" for { Err_U^h(A(S, U)) ≥ ε }:
E_h Pr_{S ~ U^m}[Bad Event] = E_h E_{S ~ U^m} 1_{Bad Event} = E_{S ~ U^m} E_h 1_{Bad Event} = E_{S ~ U^m} Pr_h[Bad Event] ≥ E_{S ~ U^m} Σ_i max(x_{i+1} − x_i − 2ε, 0),
where x_1 ≤ … ≤ x_m are the sorted sample points.
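The gap quantity in the final inequality has a clean closed form for uniform samples: counting the endpoints 0 and 1 among the gap boundaries (an assumption on my part about the intended convention), E_S Σ_i max(x_{i+1} − x_i − 2ε, 0) = (1 − 2ε)^{m+1}, which is at least (1/2)(1 − 2ε)^m whenever ε ≤ 1/4. A quick Monte Carlo check:

```python
import random

# Monte Carlo check: for m uniform points on (0, 1), with gaps taken between
# consecutive points including the endpoints 0 and 1,
#   E_S [ sum_i max(x_{i+1} - x_i - 2*eps, 0) ] = (1 - 2*eps)^(m+1),
# which exceeds (1/2)(1 - 2*eps)^m for eps <= 1/4.
random.seed(0)
m, eps, trials = 20, 0.01, 20000
total = 0.0
for _ in range(trials):
    xs = sorted([0.0] + [random.random() for _ in range(m)] + [1.0])
    total += sum(max(xs[i + 1] - xs[i] - 2 * eps, 0.0) for i in range(m + 1))
estimate = total / trials
exact = (1 - 2 * eps) ** (m + 1)
print(estimate, exact)
```

The closed form follows from each spacing having tail probability P(gap > u) = (1 − u)^m, so E max(gap − 2ε, 0) = (1 − 2ε)^{m+1}/(m + 1), summed over the m + 1 spacings.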

Agnostic Case
Notion of b-shatterable distributions: b clusters that can be shattered by the concept class. Targets are now probabilistic.
Sample complexity of learning thresholds on the real line: Θ( ln(1/δ) / ε² ). Union of d intervals: Θ( (2d + ln(1/δ)) / ε² ).
Lemma [Anthony, Bartlett]: suppose P is uniformly distributed over two Bernoulli distributions {P_1, P_2} with probabilities of heads P_1(H) = 1/2 − γ and P_2(H) = 1/2 + γ. Further, suppose ξ_1, …, ξ_m are {H, T}-valued random variables with Pr[ξ_i = H] = P(H) for all i. Then for any decision function f : {H, T}^m → {P_1, P_2},
E_P [ Pr_{ξ ~ P^m} [ f(ξ) ≠ P ] ] > (1/4) ( 1 − sqrt( 1 − exp( −4mγ² / (1 − 4γ²) ) ) ).
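The lemma can be sanity-checked numerically: for an odd number of iid flips, the Bayes-optimal decision rule under the uniform prior is majority vote, and even its error exceeds the stated bound. A sketch with hypothetical parameters m = 15 and γ = 0.1:

```python
from math import comb, exp, sqrt

def majority_error(m, gamma):
    """Error probability of the majority-vote rule (the optimal test, m odd)
    for telling P1 (heads prob 1/2 - gamma) from P2 (heads prob 1/2 + gamma):
    under P1 the rule errs exactly when heads form a majority."""
    p = 0.5 - gamma
    return sum(comb(m, k) * p ** k * (1 - p) ** (m - k)
               for k in range(m // 2 + 1, m + 1))

def lemma_bound(m, gamma):
    """Anthony-Bartlett lower bound on the expected error of ANY decision rule."""
    return 0.25 * (1 - sqrt(1 - exp(-4 * m * gamma ** 2 / (1 - 4 * gamma ** 2))))

m, gamma = 15, 0.1
print(majority_error(m, gamma), lemma_bound(m, gamma))
```

By symmetry the majority-vote error is the same under P_1 and P_2, so it equals the expectation over P, and the bound applies to it directly.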

Agnostic Case
Lemma: fix any X, H, D and m > 0. Suppose there are h, g ∈ H such that D(h Δ g) > 0. Define probability distributions P_h, P_g over X × Y such that
P_h((x, h(x)) | x) = P_g((x, g(x)) | x) = 1/2 + γ.
Note that P_h and P_g have the same marginal distribution over X; they are identical where h and g agree and close where h and g disagree.

Agnostic Case
Lemma (continued): let A_D : ((h Δ g) × Y)^m → H be any function. Then for any x_1, …, x_m ∈ h Δ g, there exists P ∈ {P_h, P_g} such that, for y_i ~ P(· | x_i),
Pr_{y_i} [ Err_P(A_D((x_i, y_i)_{i=1}^m)) − OPT_P > γ D(h Δ g) ] > (1/4) ( 1 − sqrt( 1 − exp( −4mγ² / (1 − 4γ²) ) ) ).

Agnostic Case
Use the previous lemma to construct a probabilistic target that achieves the desired sample complexity lower bound.

Conclusion
- Formal analysis of the sample complexity of SSL, compared with the sample complexity of supervised learning.
- No assumptions on the relationship between the distribution of labels and the distribution of unlabeled data.
- Only a limited advantage of SSL over supervised learning for basic concept classes over the real line.
- Open question: extend the results to any concept class of VC dimension d.

Thank You.