Introduction to Statistical Learning Theory

Definition

Reminder: We are given $m$ samples $S = \{(x_i, y_i)\}_{i=1}^m \sim D^m$ and a hypothesis space $H$, and we wish to return $h \in H$ minimizing $L_D(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h(x), y)]$.

Problem 1: It is unrealistic to hope to find the exact minimizer after seeing only a sample of the data (or even if we had perfect knowledge of $D$). We can only reasonably hope for an approximate solution: $L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon$.

Problem 2: We depend on a random sample. There is always a chance we get a bad sample that doesn't represent $D$, so our algorithm can only be probably correct: there is always some probability $\delta$ that we are completely wrong.

We therefore wish to find a probably approximately correct (PAC) hypothesis.
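
To make these objects concrete, here is a minimal sketch in Python (the threshold class, the toy data, and all function names are illustrative assumptions, not from the lecture) of the empirical risk $L_S(h)$ and of ERM over a finite hypothesis set:

import numpy as np

def zero_one_loss(pred, y):
    # 0-1 loss: 1 if the prediction is wrong, 0 otherwise
    return (pred != y).astype(float)

def empirical_risk(h, S_x, S_y):
    # L_S(h): average loss of hypothesis h on the sample S
    return zero_one_loss(h(S_x), S_y).mean()

def erm(hypotheses, S_x, S_y):
    # Return the hypothesis in H with the smallest empirical risk L_S(h)
    risks = [empirical_risk(h, S_x, S_y) for h in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Illustrative finite H: threshold classifiers h_a(x) = sign(x - a)
H = [lambda x, a=a: np.sign(x - a) for a in np.linspace(-1, 1, 21)]

# A toy sample from some distribution D: true threshold at 0.3 plus 5% label noise
rng = np.random.default_rng(0)
S_x = rng.uniform(-1, 1, size=200)
S_y = np.where(S_x > 0.3, 1.0, -1.0)
S_y[rng.random(200) < 0.05] *= -1

h_erm = erm(H, S_x, S_y)
print("empirical risk of the ERM hypothesis:", empirical_risk(h_erm, S_x, S_y))

ERM here simply scans the finite class; the rest of the lecture asks when its output is also close to the best hypothesis with respect to $L_D$.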

Definition

Definition (PAC learnable): A hypothesis class $H$ is PAC learnable if there exists a learning algorithm $A$ such that for any $\epsilon > 0$ and $\delta \in (0, 1)$ there exists $M(\epsilon, \delta) = \mathrm{poly}(\frac{1}{\epsilon}, \frac{1}{\delta})$ such that, for i.i.d. samples $S_m = \{(x_i, y_i)\}_{i=1}^m$ drawn from any distribution $D$ with $m \ge M(\epsilon, \delta)$, the algorithm returns a hypothesis $A(S_m) \in H$ satisfying
$$P_{S_m \sim D^m}\left(L_D(A(S_m)) \le \min_{h \in H} L_D(h) + \epsilon\right) > 1 - \delta.$$

Next we will show that if $L_S(h) \approx L_D(h)$ for all $h \in H$, then ERM is a PAC learning algorithm.

Uniform convergence

Definition (Uniform convergence): A hypothesis class $H$ has the uniform convergence property if for any $\epsilon > 0$ and $\delta \in (0, 1)$ there exists $M(\epsilon, \delta) = \mathrm{poly}(\frac{1}{\epsilon}, \frac{1}{\delta})$ such that for any distribution $D$ and $m \ge M(\epsilon, \delta)$ i.i.d. samples $S_m = \{(x_i, y_i)\}_{i=1}^m \sim D^m$, with probability at least $1 - \delta$, $|L_S(h) - L_D(h)| < \epsilon$ for all $h \in H$.

It is easy to bound $|L_S(h) - L_D(h)|$ for a single $h$ using the Hoeffding inequality (for a bounded loss function). The difficulty is to bound it for all $h \in H$ uniformly.
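
A quick simulation (a sketch under assumed toy settings, not part of the lecture) illustrates the gap: for one fixed hypothesis the deviation $|L_S(h) - L_D(h)|$ behaves as Hoeffding predicts, while the probability that some hypothesis in the class deviates is visibly larger.

import numpy as np

rng = np.random.default_rng(1)
m, eps, trials = 100, 0.1, 2000

# Toy setting: hypotheses h_a(x) = sign(x - a), x ~ Uniform(-1, 1), labels y = sign(x),
# so the true risk is L_D(h_a) = |a| / 2 (the probability mass strictly between 0 and a).
thresholds = np.linspace(-1, 1, 201)
idx = int(np.argmin(np.abs(thresholds - 0.5)))   # the single fixed hypothesis h_{0.5}

def deviations(x, y):
    preds = np.sign(x[None, :] - thresholds[:, None])   # one row of predictions per hypothesis
    emp = (preds != y[None, :]).mean(axis=1)             # empirical risks L_S(h_a)
    true = np.abs(thresholds) / 2                        # true risks L_D(h_a)
    return np.abs(emp - true)

single = uniform = 0
for _ in range(trials):
    x = rng.uniform(-1, 1, m)
    y = np.sign(x)
    dev = deviations(x, y)
    single += int(dev[idx] > eps)     # deviation of the one fixed hypothesis
    uniform += int(dev.max() > eps)   # worst-case deviation over the whole class

print("P(|L_S - L_D| > eps) for a single h :", single / trials)
print("P(sup_h |L_S - L_D| > eps)          :", uniform / trials)
print("Hoeffding bound for a single h      :", 2 * np.exp(-2 * eps**2 * m))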

Uniform convergence

Theorem (PAC by uniform convergence): If $H$ has the uniform convergence property with $M(\epsilon, \delta)$, then $H$ is PAC learnable by the ERM algorithm with $M(\frac{\epsilon}{2}, \delta)$ samples.

Proof. By uniform convergence, with probability at least $1 - \delta$, for all $h \in H$, $|L_S(h) - L_D(h)| \le \frac{\epsilon}{2}$. Define $h_{ERM} = \arg\min_{h \in H} L_S(h)$ and $h^* = \arg\min_{h \in H} L_D(h)$. Then
$$L_D(h_{ERM}) \le L_S(h_{ERM}) + \frac{\epsilon}{2} \le L_S(h^*) + \frac{\epsilon}{2} \le L_D(h^*) + \epsilon.$$

Finite hypothesis space

A first simple example of PAC learnable spaces: finite hypothesis spaces.

Theorem (uniform convergence for finite H): Let $H$ be a finite hypothesis space and $\ell : Y \times Y \to [0, 1]$ a bounded loss function. Then $H$ has the uniform convergence property with $M(\epsilon, \delta) = \frac{\ln(2|H|/\delta)}{2\epsilon^2}$ and is therefore PAC learnable by the ERM algorithm.

Proof. For any $h \in H$, $\ell(h(x_1), y_1), \ldots, \ell(h(x_m), y_m)$ are i.i.d. random variables with expected value $L_D(h)$. According to the Hoeffding inequality,
$$P(|L_S(h) - L_D(h)| > \epsilon) \le 2e^{-2\epsilon^2 m} \le 2e^{-2\epsilon^2 M(\epsilon, \delta)} = \frac{\delta}{|H|} \qquad (1)$$

Finite hypothesis space

Proof (Cont.) We can now use the union bound: for all events $A_1, \ldots, A_n$,
$$P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i) \qquad (2)$$
For each $h \in H$ define $A_h$ as the event that $|L_S(h) - L_D(h)| > \epsilon$. By equation (1) we know that $P(A_h) \le \frac{\delta}{|H|}$. With equation (2) we can conclude
$$P(\exists h \in H : |L_S(h) - L_D(h)| > \epsilon) = P\left(\bigcup_{h \in H} A_h\right) \le \sum_{h \in H} P(A_h) \le \sum_{h \in H} \frac{\delta}{|H|} = \delta.$$
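
As a quick sanity check of the sample complexity in the theorem just proved, here is a short sketch (Python; the specific numbers are purely illustrative) that evaluates $M(\epsilon, \delta) = \frac{\ln(2|H|/\delta)}{2\epsilon^2}$:

import math

def finite_class_sample_complexity(H_size, eps, delta):
    # M(eps, delta) = ln(2|H| / delta) / (2 eps^2), as in the theorem above
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# Illustrative values: one million hypotheses, eps = 0.05, delta = 0.01
print(finite_class_sample_complexity(10 ** 6, 0.05, 0.01))   # prints 3823

Note that the dependence on $|H|$ is only logarithmic, which is why even very large finite classes remain learnable with modest sample sizes.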

We have seen that finite hypothesis classes can be learned, but what about infinite ones, like linear predictors? We could discretize (after all, we are working on finite-precision machines), but this is not a great solution. The main problem is with the use of the union bound, since similar hypotheses fail on similar samples. The solution is to count how many effective hypotheses there are on a sample of size $m$. We will restrict ourselves (for the time being) to binary classification with the 0-1 loss.

Definition

Definition: Let $H$ be a set of functions from $X$ to $\{\pm 1\}$ and let $C \subseteq X$ be a subset of the input space. We denote by $H_C$ all the functions that can be derived by restricting functions in $H$ to $C$:
$$H_C = \{\, h|_C : C \to \{\pm 1\} \;\mid\; h \in H \,\}$$

Definition (Growth function): The growth function of $H$, $\Pi_H(m)$, is the size of the largest restriction of $H$ to a set of size $m$:
$$\Pi_H(m) = \max\{\, |H_C| : C \subseteq X, |C| = m \,\}$$

Definition

Notice that $\Pi_H(m) \le 2^m$.

Example 1: For $H = 2^X$ with infinite $X$, $\Pi_H(m) = 2^m$.
Example 2: For finite $H$, $\Pi_H(m) \le |H|$.
Example 3: For $H = \{h_a(x) = \mathrm{sign}(x - a) : a \in \mathbb{R}\}$, $\Pi_H(m) = m + 1$.
Example 4: For $H = \{h^{\pm}_a(x) = \mathrm{sign}(\pm x - a) : a \in \mathbb{R}\}$, $\Pi_H(m) = 2m$.

As we can see, even for an infinite hypothesis set it is possible that $\Pi_H(m) \ll 2^m$.
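
Here is a small sketch (Python, illustrative and not from the lecture) that enumerates the distinct labelings the threshold class of Example 3 induces on a random sample, matching $\Pi_H(m) = m + 1$:

import numpy as np

def num_restrictions_thresholds(points):
    # Count the distinct labelings of `points` induced by H = { sign(x - a) : a in R }.
    # The labeling only changes when a crosses a sample point, so it suffices to test
    # one threshold below all points, one above, and one between each consecutive pair.
    xs = np.sort(points)
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    labelings = {tuple(np.sign(points - a)) for a in candidates}
    return len(labelings)

rng = np.random.default_rng(0)
for m in (1, 2, 5, 10, 50):
    pts = rng.uniform(-1, 1, size=m)
    print(m, num_restrictions_thresholds(pts))   # prints m followed by m + 1

Sweeping $m$ shows linear rather than exponential growth, which is exactly what will make this infinite class learnable.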

Uniform convergence upper bound

We can now state the main theorem, which shows the importance of the growth function.

Theorem (Uniform convergence bound): Let $H$ be a hypothesis set of $\{\pm 1\}$-valued functions and $\ell$ the 0-1 loss. Then for any distribution $D$ on $X \times \{\pm 1\}$, any $\epsilon > 0$ and positive integer $m$, we have
$$P_{S \sim D^m}(\exists h \in H : |L_S(h) - L_D(h)| \ge \epsilon) \le 4\Pi_H(2m) \exp\left(-\frac{\epsilon^2 m}{8}\right)$$

Immediate corollary: if $\Pi_H(m)$ grows sub-exponentially, then $H$ is PAC learnable.
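
To see the corollary in action, here is a sketch (Python; illustrative, using the threshold class of Example 3, for which $\Pi_H(2m) = 2m + 1$) that finds the smallest $m$ at which the bound $4\Pi_H(2m)e^{-\epsilon^2 m/8}$ drops below a target $\delta$:

import math

def uc_bound(m, eps, growth):
    # The uniform convergence upper bound 4 * Pi_H(2m) * exp(-eps^2 m / 8)
    return 4 * growth(2 * m) * math.exp(-eps ** 2 * m / 8)

def sample_size_for(eps, delta, growth, m_max=10 ** 9):
    # Smallest power-of-two m at which the bound drops below delta (crude doubling search)
    m = 1
    while uc_bound(m, eps, growth) > delta:
        m *= 2
        if m > m_max:
            raise ValueError("bound did not drop below delta in the searched range")
    return m

thresholds_growth = lambda k: k + 1     # Pi_H(k) = k + 1 for the 1-D threshold class (Example 3)
print(sample_size_for(0.1, 0.05, thresholds_growth))   # prints 16384

The doubling search is crude, but it makes the point of the corollary: a polynomially growing $\Pi_H$ turns the bound into a finite sample complexity, while an exponential $\Pi_H(m) = 2^m$ would keep the bound trivial for every $m$.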

Uniform convergence upper bound

This is not a simple proof, so we will go over the main steps first. We wish to reduce the problem to a finite one, so we start by showing that we can replace $L_D$ by $L_{S_2}$, the error on another $m$ independent test samples. The next step is to show that we can fix the samples and look only at the probability over permutations between the train and test sets. The last part is to apply the union bound and Hoeffding's inequality to this reduced case.

Symmetrization

We define $Z = X \times \{\pm 1\}$.

Lemma (1): Let $Q = \{S \in Z^m : \exists h \in H \text{ s.t. } |L_S(h) - L_D(h)| \ge \epsilon\}$ and $R = \{(S_1, S_2) \in Z^{2m} : \exists h \in H \text{ s.t. } |L_{S_1}(h) - L_{S_2}(h)| \ge \frac{\epsilon}{2}\}$. For $m \ge \frac{4}{\epsilon^2}$, $P_{S \sim D^m}(Q) \le 2 P_{S_1 S_2 \sim D^{2m}}(R)$.

Proof. Let $S_1 \in Q$ and pick $h \in H$ such that $|L_{S_1}(h) - L_D(h)| \ge \epsilon$. By the Hoeffding inequality we know that $P_{S_2}(|L_{S_2}(h) - L_D(h)| \le \frac{\epsilon}{2}) \ge \frac{1}{2}$. This means that
$$P\left(\exists h \in H : |L_D(h) - L_{S_1}(h)| \ge \epsilon \;\wedge\; |L_D(h) - L_{S_2}(h)| \le \frac{\epsilon}{2}\right) \ge \frac{P(Q)}{2}$$

Symmetrization

Proof (Cont.)
$$P\left(\exists h \in H : |L_D(h) - L_{S_1}(h)| \ge \epsilon \;\wedge\; |L_D(h) - L_{S_2}(h)| \le \frac{\epsilon}{2}\right) \ge \frac{P(Q)}{2}$$
We now notice that if $|L_D(h) - L_{S_1}(h)| \ge \epsilon$ and $|L_D(h) - L_{S_2}(h)| \le \frac{\epsilon}{2}$, then by the triangle inequality $|L_{S_2}(h) - L_{S_1}(h)| \ge \frac{\epsilon}{2}$. This means that the probability above is less than or equal to $P(R)$, concluding our proof.

Permutations

The next step is to bound the probability of $R$ using permutations between the training and test sets. Define $\Gamma_m$ as the set of permutations on $\{1, \ldots, 2m\}$ that swap between $i$ and $i + m$, i.e. for $\sigma \in \Gamma_m$ and $1 \le i \le m$, either $\sigma(i) = i$ or $\sigma(i) = i + m$ (with $\sigma(i + m)$ swapped accordingly).

Lemma (2): Let $R$ be any subset of $Z^{2m}$ and $D$ any distribution on $Z$. Then
$$P_{S \sim D^{2m}}(R) = \mathbb{E}_S[P_\sigma(\sigma S \in R)] \le \max_{S \in Z^{2m}} P_\sigma(\sigma S \in R)$$
where $\sigma$ is chosen uniformly from $\Gamma_m$.

Permutations

Proof. As $S$ is a set of $2m$ i.i.d. samples, the probability of any event is invariant to permutation, i.e. $\forall \sigma \in \Gamma_m$, $P_{S \sim D^{2m}}(S \in R) = P_{S \sim D^{2m}}(\sigma S \in R)$. Based on this we can deduce:
$$P_{S \sim D^{2m}}(R) = \mathbb{E}_S[\mathbb{1}_R(S)] = \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} \mathbb{E}_S[\mathbb{1}_R(\sigma S)]$$
From the linearity of expectation we get
$$P_{S \sim D^{2m}}(R) = \mathbb{E}_S\left[\frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} \mathbb{1}_R(\sigma S)\right] = \mathbb{E}_S[P_\sigma(\sigma S \in R)]$$
This proves the first equality; the fact that $\mathbb{E}_S[P_\sigma(\sigma S \in R)] \le \max_{S \in Z^{2m}} P_\sigma(\sigma S \in R)$ is trivial.
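
To make the swap permutations $\Gamma_m$ concrete, here is a small simulation sketch (Python; the fixed losses are random toy values, and everything here is illustrative rather than from the lecture). It draws $\sigma \in \Gamma_m$ uniformly by flipping a fair coin for each index pair $(i, i + m)$ and estimates $P_\sigma(\sigma S \in R)$ for a single fixed hypothesis, the quantity the next slides bound via Hoeffding:

import numpy as np

rng = np.random.default_rng(2)
m, eps = 200, 0.4

# Fixed 0-1 losses l(h(x_i), y_i) of a single hypothesis h on a fixed double sample of size 2m
losses = rng.integers(0, 2, size=2 * m).astype(float)

def apply_swap(v, swaps):
    # A sigma in Gamma_m: for each i with swaps[i] True, exchange positions i and i + m
    out = v.copy()
    idx = np.where(swaps)[0]
    out[idx], out[idx + m] = v[idx + m], v[idx]
    return out

def in_R(v):
    # The event of R for this single h: |L_{S_1}(h) - L_{S_2}(h)| >= eps / 2 after the swap
    return abs(v[:m].mean() - v[m:].mean()) >= eps / 2

trials = 20000
hits = sum(int(in_R(apply_swap(losses, rng.random(m) < 0.5))) for _ in range(trials))
print("estimated P_sigma(sigma S in R):", hits / trials)
print("per-hypothesis Hoeffding bound :", 2 * np.exp(-eps ** 2 * m / 8))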

Finite case

We have shown that we just need to bound the probability, over random permutations, that a fixed sample is mapped into $R$.

Lemma (3): For the set $R = \{(S_1, S_2) \in Z^{2m} : \exists h \in H \text{ s.t. } |L_{S_1}(h) - L_{S_2}(h)| \ge \frac{\epsilon}{2}\}$ as in Lemma 1, and a permutation $\sigma$ chosen uniformly from $\Gamma_m$,
$$\max_{S \in Z^{2m}} P_\sigma(\sigma S \in R) \le 2\Pi_H(2m) e^{-\frac{\epsilon^2 m}{8}}$$

Proof. Let $S = ((x_1, y_1), \ldots, (x_{2m}, y_{2m}))$ be the maximizing $S$, and let $C = \{x_1, \ldots, x_{2m}\}$. By definition $H_C = \{h_1, \ldots, h_t\}$ for some $t \le \Pi_H(2m)$.

Finite case

Proof (Cont.) We have $\sigma S \in R$ if and only if for some $h \in H$,
$$\left|\frac{1}{m}\sum_{i=1}^{m} \ell(h(x_{\sigma(i)}), y_{\sigma(i)}) - \frac{1}{m}\sum_{i=m+1}^{2m} \ell(h(x_{\sigma(i)}), y_{\sigma(i)})\right| \ge \frac{\epsilon}{2}$$
As the restriction of $h$ to $C$ equals some $h_j \in H_C$ with $1 \le j \le t$, it is enough to look at $h_1, \ldots, h_t$. We define
$$u^j_i = \ell(h_j(x_i), y_i) = \begin{cases} 1 & \text{if } h_j(x_i) \ne y_i \\ 0 & \text{otherwise} \end{cases}$$
So $\sigma S \in R$ if and only if for some $1 \le j \le t$,
$$\left|\frac{1}{m}\sum_{i=1}^{m} u^j_{\sigma(i)} - \frac{1}{m}\sum_{i=m+1}^{2m} u^j_{\sigma(i)}\right| \ge \frac{\epsilon}{2} \qquad (3)$$

Finite case

Proof (Cont.) Notice that $u^j_{\sigma(i)} - u^j_{\sigma(m+i)} = \pm(u^j_i - u^j_{m+i})$ with both possibilities equally likely, so
$$P_\sigma\left(\left|\frac{1}{m}\sum_{i=1}^{m} \left(u^j_{\sigma(i)} - u^j_{\sigma(i+m)}\right)\right| \ge \frac{\epsilon}{2}\right) = P\left(\left|\frac{1}{m}\sum_{i=1}^{m} \beta_i \left(u^j_i - u^j_{m+i}\right)\right| \ge \frac{\epsilon}{2}\right)$$
where the $\beta_i \in \{\pm 1\}$ are chosen uniformly and independently. By the Hoeffding inequality, applied to the independent mean-zero variables $\beta_i(u^j_i - u^j_{m+i}) \in [-1, 1]$, this is at most $2\exp\left(-\frac{\epsilon^2 m}{8}\right)$, and using the union bound over all $h \in H_C$ we can bound it by $2\Pi_H(2m) e^{-\frac{\epsilon^2 m}{8}}$.

Finite case

Summary:

Theorem (Uniform convergence bound): Let $H$ be a hypothesis set of $\{\pm 1\}$-valued functions and $\ell$ the 0-1 loss. Then for any distribution $D$ on $X \times \{\pm 1\}$, any $\epsilon > 0$ and positive integer $m$, we have
$$P_{S \sim D^m}(\exists h \in H : |L_S(h) - L_D(h)| \ge \epsilon) \le 4\Pi_H(2m) \exp\left(-\frac{\epsilon^2 m}{8}\right)$$

The proof is just the combination of Lemmas 1-3. Note: in Lemma 1 we required that $m \ge \frac{4}{\epsilon^2}$; this is not a problem, because the bound in this theorem is trivial for $m < \frac{4}{\epsilon^2}$.