Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh


Outline: introduction to generalization bounds, followed by examples: VC bounds, covering number bounds, Rademacher bounds, stability bounds.

Learning Scenario

Goal: given feature vectors (x_1, ..., x_m) and labels (y_1, ..., y_m) drawn i.i.d. from an unknown distribution D, learn a function (hypothesis) h ∈ H that minimizes the expected loss (risk) over the entire distribution,

R(h) = E_{(x,y)~D}[ l(h(x), y) ]

where l : R × R → R is the loss function.

Learning Scenario

We cannot measure the risk directly; approximate it using the sample S = ((x_1, y_1), ..., (x_m, y_m)):

R_S(h) = (1/m) Σ_{i=1}^m l(h(x_i), y_i)

How well does R_S approximate R? In other words, how well does the algorithm generalize?

Generalization Bound

Generalization (risk) bounds show how well R_S(h) bounds R(h) as a function of m = |S|:

R(h) ≤ R_S(h) + complexity(h)

Uniform convergence: the bound holds simultaneously over all h ∈ H. Complexity (capacity) is a measure of the richness of the hypothesis class. The bound explicitly shows the tradeoff between complexity and generalization ability.

Outline

We will cover four main examples.

Hypothesis complexity bounds: VC-dimension, covering number, Rademacher complexity.
Algorithm complexity bounds: algorithmic stability.

Concentration Bounds

Hoeffding's Inequality ['63] - Let (X_1, ..., X_m) be independent bounded (a_i ≤ X_i ≤ b_i) random variables. Then the following is true:

Pr( |Y − E[Y]| ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^m (b_i − a_i)² )

where Y = X_1 + ... + X_m.
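As a quick sanity check (not part of the original slides), the sketch below compares the empirical tail probability of a sum of bounded random variables against Hoeffding's bound; the choice of Uniform[0, 1] variables and the parameter values are purely illustrative:

```python
import math
import random

def hoeffding_bound(m, eps):
    # Hoeffding's bound for the sum Y of m independent variables in [0, 1]:
    # Pr(|Y - E[Y]| >= eps) <= 2 * exp(-2 * eps^2 / m)
    return 2 * math.exp(-2 * eps ** 2 / m)

def empirical_tail(m, eps, trials=20000, seed=0):
    # Fraction of trials in which the sum of m Uniform[0, 1] draws
    # deviates from its mean m/2 by at least eps.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = sum(rng.random() for _ in range(m))
        if abs(y - m / 2) >= eps:
            hits += 1
    return hits / trials

m, eps = 100, 10.0
print(empirical_tail(m, eps), "<=", hoeffding_bound(m, eps))
```

The empirical tail is usually far below the bound; Hoeffding is worst-case over all bounded distributions.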

Concentration Bounds

McDiarmid's Inequality ['89] - Let f be a real-valued function such that for all i,

sup | f(x_1, ..., x_i, ..., x_m) − f(x_1, ..., x_i', ..., x_m) | ≤ c_i

Then the following is true:

Pr( |f(S) − E[f]| ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² )

with respect to an i.i.d. sample S = (x_1, ..., x_m).

Warmup

For a finite hypothesis set H and a bounded loss function l(·,·) ≤ M, the following is true for all h ∈ H simultaneously:

Pr( |R(h) − R_S(h)| ≥ ε ) ≤ 2|H| exp( −2mε² / M² )

Setting the rhs equal to δ, we can say with probability at least (1 − δ) the following is true:

R(h) ≤ R_S(h) + M sqrt( (ln|H| + ln(2/δ)) / (2m) )
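To get a feel for the rate, the warmup bound can be evaluated numerically; the class size, sample size, and empirical risk below are made-up illustrative values:

```python
import math

def finite_class_bound(emp_risk, h_size, m, delta, M=1.0):
    # With probability >= 1 - delta, simultaneously for all h in a
    # finite class H:
    #   R(h) <= R_S(h) + M * sqrt((ln|H| + ln(2/delta)) / (2m))
    return emp_risk + M * math.sqrt(
        (math.log(h_size) + math.log(2 / delta)) / (2 * m))

# Illustrative numbers: 1000 hypotheses, 10,000 samples, 5% empirical error.
print(finite_class_bound(0.05, 1000, 10000, delta=0.05))
```

Note the complexity term grows only logarithmically in |H| but shrinks as 1/sqrt(m).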

Warmup

Proof: for any fixed hypothesis h, by Hoeffding's inequality we have

Pr( |R(h) − R_S(h)| ≥ ε ) ≤ 2 exp( −2mε² / M² )

where in this case (b_i − a_i) ≤ M/m. Taking the union bound over all hypotheses in H completes the proof.

Hypothesis Complexity

The bound can be loose for large values of |H|. In general the hypothesis class is not even finite! We must measure complexity in a more careful way.

Symmetrization

Main insight: define the quantity of interest purely over a finite sample.

Pr[ sup_h |R(h) − R_S(h)| ≥ ε ] ≤ 2 Pr[ sup_h |R_{S'}(h) − R_S(h)| ≥ ε/2 ]

where S' is a second i.i.d. sample of size m. We can now consider the complexity of a hypothesis class over only a finite sample.

VC-dimension

Consider binary classification. Denote the set of dichotomies realized by H over a sample S as

Π_H(S) = { (h(x_1), ..., h(x_m)) : h ∈ H }

The sample S is shattered by H if |Π_H(S)| = 2^m. The shattering coefficient over m points is defined as

Π_H(m) = max_S { |Π_H(S)| : |S| = m }
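The dichotomy count can be enumerated directly for a toy class; the 1-D threshold functions below are an illustrative example of my own, not from the original slides:

```python
def dichotomies_thresholds(xs):
    # Dichotomies realized over points xs by the toy class of 1-D
    # threshold functions h_t(x) = 1 if x >= t else 0.
    xs = sorted(xs)
    labelings = set()
    # Sweeping the threshold t across the points yields every labeling.
    thresholds = [xs[0] - 1] + list(xs) + [xs[-1] + 1]
    for t in thresholds:
        labelings.add(tuple(1 if x >= t else 0 for x in xs))
    return labelings

points = [0.5, 1.3, 2.7, 4.0]
print(len(dichotomies_thresholds(points)))  # 5 labelings = m + 1, far fewer than 2^4 = 16
```

Since only m + 1 of the 2^m labelings are realizable on m distinct points, thresholds shatter one point but no two, i.e. their VC-dimension is 1.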

VC-dimension

VC-dimension: the size of the largest set that is shattered by H,

d := max{ m : ∃S, |S| = m, |Π_H(S)| = 2^m }

Example: halfspaces in R^n. Can always shatter a set of (non-degenerate) n points; no configuration of (n + 1) points can be shattered (figure with two cases omitted).

SVM Margin Bound

For the linear hypothesis class h(x) = <w, x> with canonical margin, the VC-dimension can be bounded independently of the input dimension:

d ≤ Λ² R²

where the data is held in a ball of radius R centered at the origin and ‖w‖ ≤ Λ. (Figure: separating hyperplane <w, x> = 0 with margin 2/‖w‖.)

VC-bound

Sauer's Lemma - if the VC-dimension d is finite then Π_H(m) = O(m^d).

VC-bound ['71] - for hypothesis class H with 0/1 loss function and finite VC-dimension d, the following holds with probability at least 1 − δ:

R(h) ≤ R_S(h) + O( sqrt( (d ln(m) + ln(1/δ)) / m ) )

Shown to be optimal to within a log(m) factor.

Covering Number

Def. The ε-covering number of a function class F with respect to a sample S, denoted N_p(F, ε, S), is the minimum number of ε-balls needed to cover all h(S) in H(S). (Figure: the set H(S) covered by ε-balls.)

Intuitively, the number of functions needed to ε-approximate (in p-norm) any hypothesis in H. Also denote N_p(F, ε, m) = sup_{S : |S| = m} N_p(F, ε, S).
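A covering number over a finite sample can be upper-bounded by a greedy construction; the toy evaluation vectors below are my own illustration (not from the slides), covering five hypotheses in the averaged 1-norm:

```python
def greedy_cover(vectors, eps):
    # Greedy 1-norm epsilon-cover of a finite set of evaluation vectors
    # h(S) = (h(x_1), ..., h(x_m)): repeatedly take an uncovered vector as
    # a center until every vector is within eps of some center. The result
    # is a valid cover, so its size upper-bounds N_1(F, eps, S).
    m = len(vectors[0])
    def dist(u, v):
        return sum(abs(a - b) for a, b in zip(u, v)) / m
    centers = []
    for v in vectors:
        if all(dist(v, c) > eps for c in centers):
            centers.append(v)
    return centers

# Evaluations of five hypotheses on a sample of size 4 (toy numbers).
hs = [(0.0, 0.0, 0.0, 0.0),
      (0.1, 0.0, 0.1, 0.0),
      (1.0, 1.0, 1.0, 1.0),
      (0.9, 1.0, 1.0, 0.9),
      (0.0, 1.0, 0.0, 1.0)]
print(len(greedy_cover(hs, eps=0.25)))  # 3 centers suffice for these five
```

Nearby hypotheses collapse into one ball, which is exactly how covering numbers tame an infinite class.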

Covering Number

A natural complexity measure for continuous function classes. It can be applied during the union bound step: replace each function in l∘H with an ε/8 approximation. If the loss function is L-Lipschitz, then N_p(l∘F, ε, m) ≤ N_p(F, ε/L, m).

Covering Number Bound

Thm. [Pollard '84] For a Lipschitz loss function bounded by M,

Pr( sup_{h∈H} |R(h) − R_S(h)| ≥ ε ) ≤ O( N_1(H, ε, m) e^{−mε²/M²} )

Rademacher Complexity

How well does the hypothesis class fit random noise?

Empirical Rademacher complexity:

R_S(H) = E_σ[ sup_{h∈H} (2/m) Σ_{i=1}^m σ_i h(x_i) ]

where each σ_i is chosen uniformly from {−1, +1}. This quantity can be computed from data (assuming we can minimize empirical error).
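For a finite class the expectation over the random signs can be estimated by Monte Carlo; the three-hypothesis class below is an illustrative toy example of my own:

```python
import random

def empirical_rademacher(h_values, trials=2000, seed=0):
    # Monte Carlo estimate of the empirical Rademacher complexity
    #   R_S(H) = E_sigma[ sup_h (2/m) * sum_i sigma_i * h(x_i) ]
    # for a finite class given as a list of evaluation vectors h(S).
    rng = random.Random(seed)
    m = len(h_values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(2 / m * sum(s * h for s, h in zip(sigma, hv))
                     for hv in h_values)
    return total / trials

# Toy class of three binary hypotheses evaluated on m = 8 points.
H = [(1,) * 8, (0,) * 8, (1, 0, 1, 0, 1, 0, 1, 0)]
print(empirical_rademacher(H))
```

A richer class can better correlate with the random labels σ, so its estimate is larger; this is the sense in which the quantity measures capacity.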

Rademacher Complexity

Rademacher complexity: R_m(H) = E_S[ R_S(H) : |S| = m ]

McDiarmid's inequality exponentially bounds the difference between R_S and R_m: with probability at least 1 − δ,

R_m(H) ≤ R_S(H) + sqrt( ln(1/δ) / (2m) )

Rademacher Bound

Rademacher bound: consider binary classification with the 0/1 loss function. Then the following is true:

Pr( sup_{h∈H} R(h) − R_S(h) ≥ ε + R_m(H) ) ≤ e^{−2mε²}

Again, with probability at least 1 − δ,

R(h) ≤ R_S(h) + R_m(H) + sqrt( ln(1/δ) / (2m) )

Rademacher Bound

Proof (outline): apply McDiarmid's inequality to

Φ(S) = sup_h ( R(h) − R_S(h) )

which satisfies the bounded-difference condition with c_i = 1/m. Main insight: use a symmetrization-type argument to show E[Φ(S)] ≤ R_m(H).

Rademacher Complexity

Important qualities: we can compute R_S(H) from data, and the Rademacher complexity lower bounds the VC complexity term!

R(H) ≤ 2M sqrt( 2d ln(em/d) / m )

Algorithmic Stability

A complexity measure of a specific algorithm (rather than of a hypothesis class). The usual union-bound term is not needed. Intuition: how does the algorithm explore the hypothesis space?

Algorithmic Stability

An algorithm is called β-stable if, for any two samples S and S' that differ in one point, the following holds:

β ≥ sup_{x,y} | l(h_S(x), y) − l(h_{S'}(x), y) |

where h_S and h_{S'} are the hypotheses produced by the algorithm for samples S and S' respectively.

Stability Bound

Thm. Let h_S be the hypothesis returned by a β-stable algorithm with access to a sample S. Then the following bound holds with respect to a nonnegative loss function with upper bound M:

Pr_S( R(h_S) − R_S(h_S) ≥ ε + β ) ≤ exp( −2mε² / (2mβ + M)² )

Thus, with probability at least (1 − δ) over the sample S,

R(h_S) ≤ R_S(h_S) + β + (2mβ + M) sqrt( ln(1/δ) / (2m) )
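The β-stability definition can be probed empirically. The sketch below is my own toy example (not from the slides): it fits one-dimensional ridge regression on two samples that differ in a single point and measures the worst loss difference over fresh test points.

```python
import random

def ridge_fit(xs, ys, lam):
    # 1-D ridge regression: the w minimizing (1/m)*sum (w*x - y)^2 + lam*w^2,
    # which has closed form w = mean(x*y) / (mean(x^2) + lam).
    m = len(xs)
    num = sum(x * y for x, y in zip(xs, ys)) / m
    den = sum(x * x for x in xs) / m + lam
    return num / den

def stability_gap(m, lam, seed=0):
    # Empirical stability: train on S and on S' (S with one point replaced)
    # and take the largest squared-loss difference over 200 test points.
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(m)]
    ys = [0.5 * x + rng.uniform(-0.1, 0.1) for x in xs]
    w1 = ridge_fit(xs, ys, lam)
    xs2, ys2 = list(xs), list(ys)
    xs2[0], ys2[0] = rng.uniform(-1, 1), rng.uniform(-1, 1)
    w2 = ridge_fit(xs2, ys2, lam)
    tests = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
    return max(abs((w1 * x - y) ** 2 - (w2 * x - y) ** 2) for x, y in tests)

# The gap typically shrinks on the order of 1/m as the sample grows.
print(stability_gap(100, lam=1.0), stability_gap(1000, lam=1.0))
```

This 1/m behavior is exactly what the theorem for regularized algorithms (next slides) predicts.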

Stability Bound

Proof: apply McDiarmid's inequality to the random variable

Φ(S) = R(h_S) − R_S(h_S)

Stability allows us to bound the mean and verify the Lipschitz (bounded-difference) property.

Stability Bound

Let S and S^i differ in the ith coordinate. To show the Lipschitz behavior:

|Φ(S) − Φ(S^i)| = | R(h_S) − R_S(h_S) − R(h_{S^i}) + R_{S^i}(h_{S^i}) |
≤ | R_{S^i}(h_{S^i}) − R_S(h_S) |   (1)
+ | R(h_S) − R(h_{S^i}) |   (2)

We bound the two terms separately.

Stability Bound

Part (1):

| R_{S^i}(h_{S^i}) − R_S(h_S) | = | (1/m) Σ_{j≠i} [ l(h_{S^i}, x_j) − l(h_S, x_j) ] + (1/m) [ l(h_{S^i}, x_i') − l(h_S, x_i) ] | ≤ β + M/m

Stability Bound

Part (2): | R(h_S) − R(h_{S^i}) | = | E_x[ l(h_S, x) − l(h_{S^i}, x) ] | ≤ β.

So we have

c_i = |Φ(S) − Φ(S^i)| ≤ 2β + M/m

Stability Bound

Bounding the mean:

E_S[Φ(S)] = E_{S,x}[ l(h_S, x) ] − E_S[ (1/m) Σ_i l(h_S, x_i) ]
= E_{S,x}[ l(h_S, x) ] − E_{S,x}[ (1/m) Σ_i l(h_{S^i}, x) ]
= E_{S,x}[ (1/m) Σ_i ( l(h_S, x) − l(h_{S^i}, x) ) ] ≤ β

The second equality holds by the i.i.d. assumption.

Stable Algorithms

The algorithm must be 1/m-stable for the bound to converge.

Kernel regularized algorithms: min_h R_S(h) + λ ‖h‖²_K

Thm. Kernel regularized algorithms with a κ-bounded kernel and a σ-Lipschitz convex loss function are stable with coefficient

β ≤ σ²κ² / (λm)
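Plugging the theorem's β into the slack of the stability bound (using the bounded-difference constant 2β + M/m from the proof) shows why 1/m-stability is exactly what is needed; the parameter values below are illustrative:

```python
import math

def kernel_stability_beta(sigma, kappa, lam, m):
    # The theorem's stability coefficient: beta <= sigma^2 * kappa^2 / (lam * m)
    return sigma ** 2 * kappa ** 2 / (lam * m)

def stability_slack(sigma, kappa, lam, m, M, delta):
    # Slack of the stability generalization bound with this beta:
    #   beta + (2*m*beta + M) * sqrt(ln(1/delta) / (2*m))
    # Since m*beta is constant in m here, the slack decays like 1/sqrt(m).
    beta = kernel_stability_beta(sigma, kappa, lam, m)
    return beta + (2 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

for m in (100, 1000, 10000):
    print(m, stability_slack(sigma=1.0, kappa=1.0, lam=1.0, m=m, M=1.0, delta=0.05))
```

If β decayed slower than 1/m, the term m·β would grow and the bound would not converge.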

Stable Algorithms

Proof outline: define the Bregman divergence with respect to a convex function F on the RKHS,

B_F(f ‖ g) = F(f) − F(g) − <f − g, ∇F(g)>

First show

B_{‖·‖²_K}(h_S ‖ h_{S'}) + B_{‖·‖²_K}(h_{S'} ‖ h_S) ≤ (2σ/(mλ)) sup_{x∈S} |Δh(x)|

where Δh = h_S − h_{S'}.

Stable Algorithms

Proof outline: notice that

B_{‖·‖²_K}(h_S ‖ h_{S'}) = ‖h_S − h_{S'}‖²_K

Using the kernel reproducing property, we see

2 ‖Δh‖²_K ≤ (2σ/(mλ)) sup_{x∈S} |Δh(x)| = (2σ/(mλ)) sup_{x∈S} |<K(x,·), Δh>_K| ≤ (2σκ/(mλ)) ‖Δh‖_K

Stable Algorithms

Proof outline: thus ‖Δh‖_K ≤ σκ/(mλ), and finally we see that

| l(h_S(x), y) − l(h_{S'}(x), y) | ≤ σ |Δh(x)| ≤ κσ ‖Δh‖_K ≤ σ²κ² / (mλ)

Conclusion

Generalization bounds show that the empirical error closely estimates the true error.

Complexity: several different notions are used to measure the size of the hypothesis class: VC-dimension (combinatorial), stability (algorithm specific), covering number (continuous hypotheses), Rademacher complexity (data specific).

Additional Material


VC-dimension

Sauer's Lemma ['72]: the shattering coefficient can be bounded by the VC-dimension,

Π_H(m) ≤ Σ_{i=0}^{d} C(m, i)

Furthermore, if d is finite, then for m > d,

Σ_{i=0}^{d} C(m, i) ≤ (em/d)^d = O(m^d)
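Both forms of Sauer's bound can be checked numerically; a small sketch:

```python
import math

def shatter_upper(m, d):
    # Sauer's lemma: Pi_H(m) <= sum_{i=0}^{d} C(m, i)
    return sum(math.comb(m, i) for i in range(d + 1))

def sauer_exp_bound(m, d):
    # For m > d the binomial sum is at most (e*m/d)^d.
    return (math.e * m / d) ** d

for m, d in [(10, 3), (100, 5), (1000, 10)]:
    print(m, d, shatter_upper(m, d), sauer_exp_bound(m, d))
```

The polynomial growth of the binomial sum (versus 2^m for a shattered class) is what makes the VC-bound nontrivial.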

VC-bound

VC-bound [VC '71] - for hypothesis class H with 0/1 loss function, the following is true:

Pr[ sup_{h∈H} |R(h) − R_S(h)| ≥ ε ] ≤ 8 Π_H(m) e^{−mε²/32}

Using Sauer's Lemma, the following holds with probability at least 1 − δ:

R(h) ≤ R_S(h) + 4 sqrt( (2d ln(em/d) + 2 ln(8/δ)) / m )
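The explicit VC-bound can be evaluated numerically; the VC-dimension, sample size, and empirical risk below are made-up illustrative values:

```python
import math

def vc_bound(emp_risk, d, m, delta):
    # R(h) <= R_S(h) + 4 * sqrt((2d*ln(e*m/d) + 2*ln(8/delta)) / m)
    return emp_risk + 4 * math.sqrt(
        (2 * d * math.log(math.e * m / d) + 2 * math.log(8 / delta)) / m)

print(vc_bound(0.05, d=10, m=100000, delta=0.05))
```

The constants are loose, so even with m = 100,000 samples the guarantee is far weaker than typical test-set performance; the value of the bound is its form, not its constants.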

VC-bound (proof outline)

Symmetrization:

Pr[ sup_h |R(h) − R_S(h)| ≥ ε ] ≤ 2 Pr[ sup_h |R_{S'}(h) − R_S(h)| ≥ ε/2 ]

where S' is another i.i.d. sample of size m.

Random signs: for Pr(σ_i = 1) = Pr(σ_i = −1) = 1/2,

2 Pr[ sup_h (1/m) | Σ_i ( 1_{h(x_i') ≠ y_i'} − 1_{h(x_i) ≠ y_i} ) | ≥ ε/2 ] ≤ 4 Pr[ sup_h (1/m) | Σ_i σ_i 1_{h(x_i) ≠ y_i} | ≥ ε/4 ]

VC-bound (proof outline)

Conditioning:

4 Pr[ sup_h (1/m) | Σ_i σ_i 1_{h(x_i) ≠ y_i} | ≥ ε/4 | x_1, ..., x_m ] ≤ 4 Π_H(m) sup_h Pr[ (1/m) | Σ_i σ_i 1_{h(x_i) ≠ y_i} | ≥ ε/4 | x_1, ..., x_m ]

Concentration bound (Hoeffding):

sup_h Pr[ (1/m) | Σ_i σ_i 1_{h(x_i) ≠ y_i} | ≥ ε/4 | x_1, ..., x_m ] ≤ 2 e^{−mε²/32}

Taking the expectation of both sides proves the result.


SVM Margin Bound

Proof: choose k points that the hypothesis class shatters. Then upper- and lower-bound E[ ‖Σ_{i=1}^k y_i x_i‖² ] with respect to uniform ±1 labels chosen for y.

Upper bound: by independence of the y's,

E[ ‖Σ_i y_i x_i‖² ] = Σ_{i,j} E[ <y_i x_i, y_j x_j> ] = Σ_i E[ ‖y_i x_i‖² ]

Also ‖y_i x_i‖ = ‖x_i‖ ≤ R, thus E[ ‖Σ_{i=1}^k y_i x_i‖² ] ≤ kR².

SVM Margin Bound

Lower bound: since the hypothesis (with canonical margin) shatters the sample, for each labeling there is a w with y_i <w, x_i> ≥ 1 for all i, thus

k ≤ Σ_i y_i <w, x_i> = <w, Σ_i y_i x_i> ≤ ‖w‖ ‖Σ_i y_i x_i‖ ≤ Λ ‖Σ_i y_i x_i‖

Combining the bounds shows

k²/Λ² ≤ E[ ‖Σ_{i=1}^k y_i x_i‖² ] ≤ kR²

hence k ≤ Λ²R².
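The key identity in the upper bound, E[‖Σ_i y_i x_i‖²] = Σ_i ‖x_i‖², can be verified by Monte Carlo; the 2-D points below are a made-up example with R = 1:

```python
import random

def mean_sq_norm(xs, trials=5000, seed=0):
    # Monte Carlo estimate of E[ || sum_i y_i x_i ||^2 ] under uniform
    # random +/-1 labels y_i, for a list of 2-D points xs.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sx = sy = 0.0
        for (a, b) in xs:
            y = rng.choice((-1, 1))
            sx += y * a
            sy += y * b
        total += sx * sx + sy * sy
    return total / trials

xs = [(1.0, 0.0), (0.6, 0.8), (0.0, -1.0)]   # all unit norm, so R = 1
exact = sum(a * a + b * b for a, b in xs)    # independence gives exactly 3
print(mean_sq_norm(xs), "approx.", exact)
```

The cross terms E[y_i y_j] vanish for i ≠ j, which is why only the squared norms survive.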

Covering Number

Def. The ε-covering number of a function class F with respect to a sample S = (x_1, ..., x_m), denoted N_p(F, ε, S), is the minimum number of points v_j needed such that

∀f ∈ F, ∃v_j : [ (1/m) Σ_{i=1}^m |f(x_i) − v_j^i|^p ]^{1/p} ≤ ε

Intuitively, the number of ε-balls needed to cover the hypothesis space. Also denote N_p(F, ε, m) = sup_{S : |S| = m} N_p(F, ε, S).

Covering Number

A natural complexity measure for continuous function classes. It can be applied during the conditioning step: replace each function in l∘H with an ε/8 approximation. If the loss function is L-Lipschitz, then N_p(l∘F, ε, m) ≤ N_p(F, ε/L, m).

Covering Number Bound

Thm. [Pollard '84] For an L-Lipschitz loss function bounded by M, the following is true:

Pr( sup_{h∈H} |R(h) − R_S(h)| ≥ ε ) ≤ 8 N_1(H, ε/(8L), m) exp( −mε² / (128 M²) )

Covering Number Bound

To bound the covering number, notice that by Jensen's inequality, for p ≤ q, N_p ≤ N_q. A result of Zhang ['02]: for linear hypothesis classes (h(x) = <w, x>), if ‖x‖ ≤ b and ‖w‖ ≤ a, then

log_2 N_2(H, ε, m) ≤ (a²b²/ε²) log_2(2m + 1)

Rademacher Complexity

Empirical Rademacher complexity:

R_S(H) = E_σ[ sup_{h∈H} (2/m) Σ_{i=1}^m σ_i h(x_i) ]

where again each σ_i is selected uniformly from {−1, +1}. This quantity can be computed from data.

Rademacher complexity: R_m(H) = E_S[ R_S(H) : |S| = m ]. McDiarmid's inequality provides an exponential bound on the difference between R_S(H) and R_m(H).


Rademacher Complexity

Improvements over the VC-bound: we can compute R_S(H) from data, and the Rademacher complexity lower bounds the VC complexity term!

R(H) ≤ 2M sqrt( 2d ln(em/d) / m )

Rademacher Complexity

Proof outline: use the Chernoff bounding method. Need to show

exp( t (m/2) R(H) ) ≤ Π_H(m) exp( t² M² m / 2 )

Then by Sauer's Lemma,

R(H) ≤ 2 d ln(em/d) / (tm) + t M²

and choosing the best t gives

R(H) ≤ 2M sqrt( 2d ln(em/d) / m )