Learnability with Rademacher Complexities


Daniel Khashabi
Fall 2013. Last Update: September 26, 2016

1 Introduction

Our goal in the study of passive supervised learning is to find a hypothesis $h$, based on a set of examples, that has small error with respect to some target function. One can improve generalization by controlling the complexity of the concept class $\mathcal{H}$ from which we are choosing a hypothesis. One way to achieve this is via the ideas in VC dimension (see http://web.engr.illinois.edu/~khashab2/learn/vc.pdf). Here we introduce Rademacher complexity as another way of handling hypothesis-space complexity and, as a result, deriving generalization bounds. Our results will differ in two major ways from those in the discussion of VC dimension:

- The VC dimension is independent of the data distribution. In other words, its guarantees hold for any data distribution; on the other hand, the bound it gives might not be tight for certain data distributions.
- The VC dimension bounds apply to discrete problems (such as classification) and say nothing about problems like regression.

2 Rademacher Averages/Complexities

Here we define the Rademacher complexity, which will be used in bounding risk functions.

Definition 2.1 (Rademacher Average). Let $\mathcal{H} \subseteq \mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ be the class of functions we are exploring, defined on a domain $\mathcal{X}$, and let $S = \{x_i\}_{i=1}^{n}$ be a set of samples generated by some unknown distribution $D_{\mathcal{X}}$ on the same domain. Define $\sigma_i$ to be uniform random variables on $\{\pm 1\}$, for each $i$. The "empirical" Rademacher average or complexity is defined as

$$\hat{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\, \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \,\middle|\, \{x_i\} \right],$$

and the expectation of the above measure with respect to the random samples $\{x_i\}$ is called the Rademacher average or complexity:

$$R_n(\mathcal{H}) = \mathbb{E}_S\, \hat{R}_S(\mathcal{H}).$$

(Implicit assumption: the supremum over the function class $\mathcal{H}$ is measurable.)
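Since the empirical Rademacher average is just an expectation over random sign vectors, it can be approximated for any finite (or finitely discretized) class by direct Monte Carlo sampling. The following Python sketch is our own illustration, not part of the original notes; the helper name `empirical_rademacher` and the toy constant class are assumptions for the demo, which represents a class by its value vectors $f_S$:

```python
import numpy as np

def empirical_rademacher(function_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of R_S-hat: E_sigma sup_f |(1/n) sum_i sigma_i f(x_i)|.

    function_values: array of shape (num_functions, n), whose rows are the
    vectors f_S = (f(x_1), ..., f(x_n)) for each f in a finite class H.
    """
    rng = np.random.default_rng(seed)
    _, n = function_values.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        corr = function_values @ sigma / n        # (1/n) <f_S, sigma> for each f
        total += np.abs(corr).max()               # sup over the class
    return total / num_draws

# Toy sanity check: a single constant function cannot correlate with noise,
# so its empirical complexity decays like 1/sqrt(n).
n = 400
print(empirical_rademacher(np.ones((1, n))))      # ~ sqrt(2/(pi*n)) ~ 0.04
```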

There is a similar definition without the absolute values, which has similar properties to the above:

$$\hat{R}^a_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\, \sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \,\middle|\, \{x_i\} \right] \quad \text{and} \quad R^a_n(\mathcal{H}) = \mathbb{E}_S\, \hat{R}^a_S(\mathcal{H}).$$

Another way of writing the Rademacher complexity is the following:

$$\hat{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\, \sup_{f \in \mathcal{H}} \frac{1}{n} \left| f_S \cdot \sigma \right| \,\middle|\, \{x_i\} \right],$$

where $f_S = (f(x_1), \ldots, f(x_n))$ and $\sigma = (\sigma_1, \ldots, \sigma_n)$. The dot product $f_S \cdot \sigma$ measures the correlation between the function values and the random noise vector. Overall, the Rademacher complexity measures how well the function class $\mathcal{H}$ can correlate with random noise: the richer the hypothesis class is, the better it will correlate with the random noise.

Here are some useful properties of Rademacher averages.

Lemma 2.1. For any $\{x_i\}_{i=1}^{n}$ and any function classes $\mathcal{F}$ and $\mathcal{H}$ that map $\mathcal{X} \to \mathbb{R}$:

1. If $\mathcal{H} \subseteq \mathcal{F}$ then $\hat{R}_S(\mathcal{H}) \le \hat{R}_S(\mathcal{F})$.

2. For any function $h : \mathcal{X} \to \mathbb{R}$, $\hat{R}^a_S(\mathcal{F} + h) = \hat{R}^a_S(\mathcal{F})$.

3. If $\mathrm{cvx}(\mathcal{F}) = \{x \mapsto \mathbb{E}_{f \sim \pi} f(x) \,:\, \pi \in \Delta(\mathcal{F})\}$ then $\hat{R}^a_S(\mathcal{F}) = \hat{R}^a_S(\mathrm{cvx}(\mathcal{F}))$.

4. $\hat{R}^a_S(\mathcal{F} + \mathcal{H}) = \hat{R}^a_S(\mathcal{F}) + \hat{R}^a_S(\mathcal{H})$.

Proof. We prove each proposition:

1.
$$\hat{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \le \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| = \hat{R}_S(\mathcal{F}).$$

2.
$$\hat{R}^a_S(\mathcal{F} + h) = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big(f(x_i) + h(x_i)\big) = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) + \mathbb{E}_{\sigma} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) = \hat{R}^a_S(\mathcal{F}) + 0 = \hat{R}^a_S(\mathcal{F}).$$

3.
$$\hat{R}^a_S(\mathrm{cvx}(\mathcal{F})) = \mathbb{E}_{\sigma} \sup_{\pi \in \Delta(\mathcal{F})} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, \mathbb{E}_{f \sim \pi} f(x_i) = \mathbb{E}_{\sigma} \sup_{\pi \in \Delta(\mathcal{F})} \mathbb{E}_{f \sim \pi} \left[ \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right] = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) = \hat{R}^a_S(\mathcal{F}),$$
where the third equality holds because a linear functional attains its supremum over a convex set at the corners (extreme points) of the set.

4.
$$\hat{R}^a_S(\mathcal{F} + \mathcal{H}) = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F},\, h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big(f(x_i) + h(x_i)\big) = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) + \mathbb{E}_{\sigma} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) = \hat{R}^a_S(\mathcal{F}) + \hat{R}^a_S(\mathcal{H}).$$

Lemma 2.2. Given a real-valued CDF $F(x)$, let $\mathcal{F}$ be the class of indicator functions on half-intervals, which define the empirical CDF

$$\hat{F}_S(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X_i \le x\},$$

with $S = (X_1, \ldots, X_n)$. Then

$$\mathbb{E}_S \sup_{x \in \mathbb{R}} \left| \hat{F}_S(x) - F(x) \right| \le 2 R_n(\mathcal{F}).$$

Proof. The trick commonly used here is to convert the expectation into an empirical mean by introducing fake/ghost samples $S' = (X'_1, \ldots, X'_n)$ and symmetrization, noting that $F(x) = \mathbb{E}_{S'} \hat{F}_{S'}(x)$:

$$\mathbb{E}_S \sup_{x \in \mathbb{R}} \left| \hat{F}_S(x) - F(x) \right| = \mathbb{E}_S \sup_{x \in \mathbb{R}} \left| \hat{F}_S(x) - \mathbb{E}_{S'} \hat{F}_{S'}(x) \right| \le \mathbb{E}_{S, S'} \sup_{x \in \mathbb{R}} \left| \hat{F}_S(x) - \hat{F}_{S'}(x) \right| = \mathbb{E}_{S, S'} \sup_{x \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^{n} \big( \mathbb{1}\{X_i \le x\} - \mathbb{1}\{X'_i \le x\} \big) \right|$$

$$\stackrel{d}{=} \mathbb{E}_{S, S', \sigma} \sup_{x \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big( \mathbb{1}\{X_i \le x\} - \mathbb{1}\{X'_i \le x\} \big) \right| \le 2\, \mathbb{E}_{S, \sigma} \sup_{x \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \mathbb{1}\{X_i \le x\} \right| = 2 R_n(\mathcal{F}), \quad \text{for } \mathcal{F} = \text{half-intervals}.$$
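As a quick numerical sanity check of Lemma 2.2 (our own sketch, not part of the original notes): for Uniform(0,1) samples the supremum over $x$ is attained at the order statistics, and since the half-interval indicators restricted to a sample of size $n$ take at most $n+1$ distinct value vectors (at most $2(n+1)$ once negations are included for the absolute value), the finite class lemma proved later (Lemma 4.1) bounds $R_n(\mathcal{F})$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 300

# Estimate E_S sup_x |F_hat_S(x) - F(x)| for X ~ Uniform(0,1), where F(x) = x.
# The supremum is attained at the order statistics of the sample.
devs = []
for _ in range(trials):
    xs = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    devs.append(max((i / n - xs).max(), (xs - (i - 1) / n).max()))

# Restricted to the sample (and including negations for the absolute value),
# the class has at most 2(n+1) sign patterns, so by the finite class lemma
# R_n(F) <= sqrt(2 log(2(n+1)) / n).
bound = 2 * np.sqrt(2 * np.log(2 * (n + 1)) / n)
print(np.mean(devs), bound)    # the simulated deviation sits well below 2 R_n
```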

It turns out that this observation is not specific to the CDF: the same bounding technique generalizes to any loss function.

Lemma 2.3. Given a class of functions $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R}\}$ defined on a domain $\mathcal{X}$, we have the following general bound via the Rademacher average:

$$\mathbb{E}_S \sup_{f \in \mathcal{H}} \left| \mathbb{E} f - \hat{\mathbb{E}}_S f \right| \le 2 R_n(\mathcal{H}),$$

with $S = (x_1, \ldots, x_n)$ and $\hat{\mathbb{E}}_S f = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$.

Proof. The steps of the previous proof carry over, with some minor changes. Again, we convert the expectation into an empirical mean by introducing fake/ghost samples $S' = (x'_1, \ldots, x'_n)$ and symmetrization:

$$\mathbb{E}_S \sup_{f \in \mathcal{H}} \left| \mathbb{E} f - \hat{\mathbb{E}}_S f \right| = \mathbb{E}_S \sup_{f \in \mathcal{H}} \left| \mathbb{E}_{S'} \hat{\mathbb{E}}_{S'} f - \hat{\mathbb{E}}_S f \right| \le \mathbb{E}_{S, S'} \sup_{f \in \mathcal{H}} \left| \hat{\mathbb{E}}_{S'} f - \hat{\mathbb{E}}_S f \right| = \mathbb{E}_{S, S', \sigma} \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big( f(x'_i) - f(x_i) \big) \right| \le 2\, \mathbb{E}_{S, \sigma} \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| = 2 R_n(\mathcal{H}).$$

With the following lemma we show how Rademacher averages behave under Lipschitz maps.

Lemma 2.4 (Ledoux-Talagrand contraction). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex and increasing function. Also let $\phi_i : \mathbb{R} \to \mathbb{R}$ satisfy $\phi_i(0) = 0$ with Lipschitz constant $L$ (for any $x, y \in \mathbb{R}$, $|\phi_i(x) - \phi_i(y)| \le L |x - y|$). Then for any $T \subseteq \mathbb{R}^n$,

$$\mathbb{E}_{\sigma}\, f\!\left( \frac{1}{2} \sup_{t \in T} \left| \sum_{i=1}^{n} \sigma_i \phi_i(t_i) \right| \right) \le \mathbb{E}_{\sigma}\, f\!\left( L \cdot \sup_{t \in T} \left| \sum_{i=1}^{n} \sigma_i t_i \right| \right).$$

Proof. Follows from the definition of the Rademacher average and properties of convex functions.

The above lemma yields the following bound:

Corollary 2.5. Let $\mathcal{F}$ be a class of functions with domain $\mathcal{X}$, and let $\phi(\cdot)$ be an $L$-Lipschitz map from $\mathbb{R}$ to $\mathbb{R}$ with $\phi(0) = 0$. Define the composition of the map with the functions as $\phi \circ \mathcal{F} = \{\phi \circ f \mid f \in \mathcal{F}\}$. Then

$$R_n(\phi \circ \mathcal{F}) \le 2 L\, R_n(\mathcal{F}).$$

Proof. In the previous lemma, take the convex increasing function to be the identity.
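A small simulation can sanity-check the contraction bound of Corollary 2.5. This is our own sketch under assumed inputs: an arbitrary random finite class represented by value vectors, and the clipping map $\phi(t) = \min(|t|, 1)$, which is 1-Lipschitz with $\phi(0) = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, num_f, draws = 100, 50, 4000

F = rng.normal(size=(num_f, n))              # rows: (f(x_1), ..., f(x_n))
phi = lambda t: np.minimum(np.abs(t), 1.0)   # 1-Lipschitz map with phi(0) = 0

def rad(values):
    # Monte Carlo estimate of E_sigma sup_f |(1/n) sum_i sigma_i f(x_i)|
    sig = rng.choice([-1.0, 1.0], size=(draws, n))
    return np.abs(values @ sig.T / n).max(axis=0).mean()

print(rad(phi(F)), 2 * rad(F))   # contraction: left side <= right side
```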

2.1 Rademacher complexity of linear classes

Here we analyze the Rademacher complexity of the following linear classes. These results will come in handy when analyzing the generalization bounds of many forthcoming problems that involve linear models. Define the classes

$$\mathcal{H}_1 = \{x \mapsto \langle x, w \rangle : \|w\|_1 \le 1\}, \qquad \mathcal{H}_2 = \{x \mapsto \langle x, w \rangle : \|w\|_2 \le 1\}.$$

Lemma 2.6. Let $S = (x_1, \ldots, x_n)$. Then

$$R_n(\mathcal{H}_2 \mid S) \le \frac{\max_i \|x_i\|_2}{\sqrt{n}}.$$

Proof. By linearity, the Cauchy-Schwarz inequality, and then Jensen's inequality:

$$R^a_n(\mathcal{H}_2) = \mathbb{E}_{\sigma} \sup_{w : \|w\|_2 \le 1} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \langle x_i, w \rangle = \frac{1}{n}\, \mathbb{E}_{\sigma} \sup_{w : \|w\|_2 \le 1} \left\langle \sum_{i=1}^{n} \sigma_i x_i,\, w \right\rangle \le \frac{1}{n}\, \mathbb{E}_{\sigma} \left\| \sum_{i=1}^{n} \sigma_i x_i \right\|_2 \le \frac{1}{n} \sqrt{ \mathbb{E}_{\sigma} \left\| \sum_{i=1}^{n} \sigma_i x_i \right\|_2^2 }.$$

Since the Rademacher random variables are independent of each other, we have

$$\mathbb{E}_{\sigma} \left\| \sum_{i=1}^{n} \sigma_i x_i \right\|_2^2 = \sum_{i,j} \langle x_i, x_j \rangle\, \mathbb{E}_{\sigma} \sigma_i \sigma_j = \sum_{i,j:\, i \ne j} \langle x_i, x_j \rangle\, \mathbb{E}_{\sigma} \sigma_i \sigma_j + \sum_{i} \langle x_i, x_i \rangle\, \mathbb{E}_{\sigma} \sigma_i^2 = \sum_{i} \|x_i\|_2^2 \le n \max_i \|x_i\|_2^2,$$

which gives $R^a_n(\mathcal{H}_2) \le \max_i \|x_i\|_2 / \sqrt{n}$.
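To see Lemma 2.6 in action, one can estimate $\mathbb{E}_{\sigma} \|\frac{1}{n}\sum_i \sigma_i x_i\|_2$ by Monte Carlo, using the fact from the proof that the supremum of $\langle v, w \rangle$ over the unit $\ell_2$ ball equals $\|v\|_2$. A sketch of ours, with arbitrary Gaussian data standing in for the sample:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, draws = 200, 5, 4000
X = rng.normal(size=(n, d))                  # rows are the sample points x_i

# R_n(H_2 | S) = E_sigma || (1/n) sum_i sigma_i x_i ||_2, since the supremum
# of <v, w> over the unit l2 ball is attained at w = v / ||v||_2.
sig = rng.choice([-1.0, 1.0], size=(draws, n))
rad = np.linalg.norm(sig @ X / n, axis=1).mean()

bound = np.linalg.norm(X, axis=1).max() / np.sqrt(n)
print(rad, bound)   # Monte Carlo value sits below max_i ||x_i||_2 / sqrt(n)
```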

Lemma 2.7. Let $S = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}^d$. Then

$$R^a_n(\mathcal{H}_1 \mid S) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log 2d}{n}}.$$

Proof.

$$R^a_n(\mathcal{H}_1) = \mathbb{E}_{\sigma} \sup_{w : \|w\|_1 \le 1} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \langle x_i, w \rangle = \frac{1}{n}\, \mathbb{E}_{\sigma} \sup_{w : \|w\|_1 \le 1} \left\langle \sum_{i=1}^{n} \sigma_i x_i,\, w \right\rangle \le \frac{1}{n}\, \mathbb{E}_{\sigma} \left\| \sum_{i=1}^{n} \sigma_i x_i \right\|_\infty.$$

The last step is completed via the finite class lemma (see Lemma 4.1), applied to the $2d$ vertices $\{\pm e_j\}$ of the $\ell_1$ ball.

3 Generalization bounds

Here is the main theorem, which contains the generalization bounds via Rademacher complexity:

Theorem 3.1. Let $\mathcal{F}$ be a class of functions defined on a domain $\mathcal{X}$ and mapping to $[0, 1]$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for any $f \in \mathcal{F}$:

$$\mathbb{E} f(X) \le \hat{\mathbb{E}} f(X) + 2 R_n(\mathcal{F}) + \sqrt{\frac{\log 1/\delta}{2n}}.$$

Also, with probability at least $1 - \delta$, for any $f \in \mathcal{F}$:

$$\mathbb{E} f(X) \le \frac{1}{n} \sum_{i=1}^{n} f(x_i) + 2 \hat{R}_S(\mathcal{F}) + 5 \sqrt{\frac{\log 2/\delta}{2n}}.$$

Similar results hold with the slightly different definition of the Rademacher average:

Theorem 3.2. Let $\mathcal{F}$ be a class of functions defined on a domain $\mathcal{X}$ and mapping to $[0, 1]$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for any $f \in \mathcal{F}$:

$$\mathbb{E} f(X) \le \hat{\mathbb{E}} f(X) + 2 R^a_n(\mathcal{F}) + \sqrt{\frac{\log 1/\delta}{2n}}.$$

Also, with probability at least $1 - \delta$, for any $f \in \mathcal{F}$:

$$\mathbb{E} f(X) \le \frac{1}{n} \sum_{i=1}^{n} f(x_i) + 2 \hat{R}^a_S(\mathcal{F}) + 3 \sqrt{\frac{\log 2/\delta}{2n}}.$$

A side note before jumping into the proof: usually in practice the set $\mathcal{F}$ is a composition of the input space $\mathcal{X}$, the hypothesis functions $\mathcal{H}$, and the loss family $\ell$ which measures the quality of the learning: $\mathcal{F} = \ell \circ \mathcal{H} \circ S$. For example, for SVM, $\mathcal{H}$ is the space of linear classifiers and $\ell$ is a margin-based (hard/soft) loss. Another issue worth pointing out is that here we assumed the range of the functions in $\mathcal{F}$ is bounded inside $[0, 1]$. If instead the functions range over $[0, c]$, a coefficient $c$ would appear before $\sqrt{\frac{\log 2/\delta}{2n}}$ (easy to verify through the proof).
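Before the proof, note how the second, data-dependent inequality of Theorem 3.1 is used in practice: every term on the right-hand side is computable from the sample once the empirical Rademacher average is estimated. A hypothetical calculator of ours (the numeric inputs below are made up for illustration):

```python
import numpy as np

def rademacher_risk_bound(emp_risk, emp_rad, n, delta=0.05):
    """Data-dependent bound of Theorem 3.1:
    Ef <= emp_risk + 2 * R_S-hat(F) + 5 * sqrt(log(2/delta) / (2n))."""
    return emp_risk + 2 * emp_rad + 5 * np.sqrt(np.log(2 / delta) / (2 * n))

# e.g. empirical risk 0.08 and estimated empirical Rademacher average 0.03:
print(rademacher_risk_bound(0.08, 0.03, n=5000))   # ~ 0.24
```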

Proof. For a sample set $S = (x_1, x_2, \ldots, x_n)$, define the function

$$\Phi_S(\mathcal{F}) = \sup_{f \in \mathcal{F}} \left( \mathbb{E} f - \frac{1}{n} \sum_{i=1}^{n} f(x_i) \right).$$

The proof uses McDiarmid's bound on the function $\Phi_S(\mathcal{F})$. Define the sample set $S'$ to be exactly the same as $S$, except for one differing sample $x'_j$:

$$\Phi_S(\mathcal{F}) - \Phi_{S'}(\mathcal{F}) \le \sup_{f \in \mathcal{F}} \left( \frac{1}{n} \sum_{x_i \in S'} f(x_i) - \frac{1}{n} \sum_{x_i \in S} f(x_i) \right) = \sup_{f \in \mathcal{F}} \frac{1}{n} \big( f(x'_j) - f(x_j) \big) \le \frac{1}{n}.$$

Here we used the fact that the supremum of a difference is bigger than the difference of suprema, and we implicitly used the assumption that the functions are bounded between 0 and 1. By symmetry, we have proved $|\Phi_S(\mathcal{F}) - \Phi_{S'}(\mathcal{F})| \le \frac{1}{n}$.

Using this bounded-differences property of $\Phi(\cdot)$ and McDiarmid's inequality, we have

$$\Phi_S(\mathcal{F}) \le \mathbb{E}_S \Phi_S(\mathcal{F}) + \sqrt{\frac{\log 2/\delta}{2n}}, \quad \text{with probability at least } 1 - \delta/2.$$

Note that by Lemma 2.3 we know $\mathbb{E}_S \Phi_S(\mathcal{F}) \le 2 R_n(\mathcal{F})$, which gives us the first inequality (with $\delta/2$ replaced by $\delta$). To get the second inequality, we apply the McDiarmid bound to the definition of the Rademacher average:

$$R_n(\mathcal{F}) \le \hat{R}_S(\mathcal{F}) + \sqrt{\frac{\log 2/\delta}{2n}}, \quad \text{with probability at least } 1 - \delta/2.$$

Combining this with the previous result gives the second inequality in the statement of the theorem.

3.1 Concentration bounds for binary classification

We start with a few examples and then move to more general theorems.

Example 3.3. Let $f : \mathcal{X} \to \{0, 1\}$, and let $(X_i, Y_i) \in \mathcal{X} \times \{0, 1\}$ be i.i.d. samples from the joint distribution $P_{XY}$. Consider the empirical risk defined as $L_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{f(X_i) \ne Y_i\}$.

1. Prove that for any $f \in \mathcal{F}$, with probability at least $1 - \delta$,

$$L(f) \le L_n(f) + \sqrt{\frac{2 L(f) \log(1/\delta)}{n}} + \frac{2 \log(1/\delta)}{3n}. \tag{1}$$

Hint: Use Bernstein's inequality.

2. Use the result of the previous part to show that, for any $f \in \mathcal{F}$, with probability at least $1 - \delta$,

$$L(f) \le L_n(f) + \sqrt{\frac{2 L_n(f) \log(1/\delta)}{n}} + \frac{4 \log(1/\delta)}{n}.$$

Use this to prove that if the ERM solution predicts every training point correctly, i.e., if $L_n(\hat{f}) = 0$, then with probability at least $1 - \delta$,

$$L(\hat{f}) \le \frac{4 \log(|\mathcal{F}|/\delta)}{n}.$$

This bound also holds when the relationship between $X$ and $Y$ is deterministic. Hint: use the fact that, for any $a, b, c \ge 0$, if $a \le b + c\sqrt{a}$, then $a \le b + c^2 + c\sqrt{b}$.
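The point of the $\sqrt{L(f)}$ factor in Equation (1) is that the deviation term shrinks when the true risk is small; Bernstein's inequality (Lemma 3.4, stated next) is what supplies it. A quick numeric comparison of this deviation term against a Hoeffding-style $\sqrt{\log(1/\delta)/(2n)}$ term, as our own illustration with made-up risk levels:

```python
import numpy as np

n, delta = 10_000, 0.05
L = np.array([0.5, 0.1, 0.01, 0.001])     # hypothetical true risk levels

hoeffding = np.sqrt(np.log(1 / delta) / (2 * n)) * np.ones_like(L)
bernstein = (np.sqrt(2 * L * np.log(1 / delta) / n)
             + 2 * np.log(1 / delta) / (3 * n))
print(np.c_[L, hoeffding, bernstein])     # Bernstein term wins once L is small
```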

Lemma 3.4 (Bernstein's inequality). If $U_1, \ldots, U_n$ are i.i.d. Bernoulli random variables with parameter $p$, then

$$P\left( \frac{1}{n} \sum_{i=1}^{n} U_i < p - \epsilon \right) \le \exp\left( - \frac{n \epsilon^2}{2p + 2\epsilon/3} \right). \tag{2}$$

3.2 Generalization bound for hard SVM using Rademacher complexity

Here we prove a generalization bound for hard SVM (details on the basic formulations here: http://web.engr.illinois.edu/~khashab2/learn/svm.pdf). We resort to Theorem 3.2, which contains the generalization bounds based on the $R^a$ definition of the Rademacher average. For SVM, the hypothesis space is a class of linear predictors,

$$\mathcal{H} = \{\langle w, \cdot \rangle : w \in \mathbb{R}^d\},$$

with the hinge loss $\ell(x, y; w) = \max\{0,\, 1 - y \langle w, x \rangle\}$ as the loss function. Define $\mathcal{F} = \ell \circ \mathcal{H} \circ S = \{\ell(x_1, y_1; w), \ell(x_2, y_2; w), \ldots, \ell(x_n, y_n; w)\}$. Since the hinge loss is 1-Lipschitz, and assuming that $\|x\| \le R$ and $\|w\| \le B$, using Corollary 2.5 we have

$$R_n(\mathcal{F}) \le BR/\sqrt{n}.$$

In general, for any $\rho$-Lipschitz loss, $R_n(\mathcal{F}) \le \rho BR/\sqrt{n}$. Plugging this into Theorem 3.2 we get the following risk bound for SVM:

$$\mathbb{E} f(X) \le \hat{\mathbb{E}} f(X) + 2BR/\sqrt{n} + \sqrt{\frac{\log 1/\delta}{2n}}, \quad \text{with probability at least } 1 - \delta.$$

So how should we interpret this? Suppose we make the assumption that we know the minimizer of the empirical risk, which we denote by $w^*$ and which has zero empirical risk. Also set $B = \|w^*\|$; $\mathcal{H}$ can simply be the set of linear classifiers with norm smaller than $B$. Then the risk bound can be refined to

$$\mathbb{E} f(X) \le \hat{L} + \frac{2 R \|w^*\|}{\sqrt{n}} + \big( 1 + R \|w^*\| \big) \sqrt{\frac{\log 1/\delta}{2n}}, \quad \text{with probability at least } 1 - \delta,$$

where the extra factor appears because $\mathcal{F}$ is now bounded in an interval of width $1 + R\|w^*\|$. With this risk bound, one can show that the sample complexity of hard-SVM scales as $\frac{R^2 \|w^*\|^2}{\epsilon^2}$.

In practice $\|w^*\|$ is not known. One way to fix this is to use the doubling trick on the bound $B$ on the weight-vector norm: let $B_i = 2^i$, let $\mathcal{H}_i$ be all the linear models with weight norm less than $B_i$, and let $\delta_i = \delta/2^i$. For each $i$ we can write an inequality for the risk; a union bound over all of the inequalities then gives a unified bound which holds for all $w$.
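To make the hard-SVM bound concrete, here is a small calculator of ours; the inputs $R = 1$, $\|w^*\| = 10$, and $\epsilon = 0.1$ are arbitrary choices for illustration:

```python
import numpy as np

def hard_svm_risk_bound(emp_hinge, B, R, n, delta=0.05):
    # Risk bound from Section 3.2: empirical hinge risk + 2BR/sqrt(n) + deviation
    return emp_hinge + 2 * B * R / np.sqrt(n) + np.sqrt(np.log(1 / delta) / (2 * n))

# Separable data with margin: zero empirical hinge loss, ||x|| <= 1, ||w*|| = 10.
for n in (10**3, 10**4, 10**5):
    print(n, hard_svm_risk_bound(0.0, B=10.0, R=1.0, n=n))

# Sample-complexity reading of the same bound: n ~ R^2 ||w*||^2 / eps^2,
# ignoring constants and log factors.
eps = 0.1
print(int(np.ceil((1.0 * 10.0 / eps) ** 2)))   # ~ 10^4 samples for eps = 0.1
```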

4 Glivenko-Cantelli Theorem

The Glivenko-Cantelli theorem guarantees uniform convergence of the empirical CDF of a distribution. Our treatment of it here is based on the Rademacher average and the Finite Class lemma, though this is not the only way to derive these results. First we introduce the finite class lemma, which is a tool for bounding Rademacher averages.

Lemma 4.1 (Finite Class Lemma (Massart)). Let $A$ be some finite subset of $\mathbb{R}^m$, let $\{\sigma_i\}_{i=1}^{m}$ be independent Rademacher random variables, and let $L = \max_{a \in A} \|a\|_2$. Then

$$R(A) = \mathbb{E} \left[ \max_{a \in A} \frac{1}{m} \sum_{i=1}^{m} \sigma_i a_i \right] \le \frac{L \sqrt{2 \log |A|}}{m}.$$

Proof. Define

$$\mu = \mathbb{E} \left[ \max_{a \in A} \sum_{i=1}^{m} \sigma_i a_i \right],$$

so that $R(A) = \mu/m$. For any $\lambda > 0$, by Jensen's inequality and the independence of the $\sigma_i$,

$$e^{\lambda \mu} \le \mathbb{E} \exp\left( \lambda \max_{a \in A} \sum_{i=1}^{m} \sigma_i a_i \right) = \mathbb{E} \max_{a \in A} \exp\left( \lambda \sum_{i=1}^{m} \sigma_i a_i \right) \le \sum_{a \in A} \mathbb{E} \exp\left( \lambda \sum_{i=1}^{m} \sigma_i a_i \right) = \sum_{a \in A} \prod_{i=1}^{m} \mathbb{E} \exp(\lambda \sigma_i a_i)$$

$$= \sum_{a \in A} \prod_{i=1}^{m} \frac{e^{\lambda a_i} + e^{-\lambda a_i}}{2} \le \sum_{a \in A} \prod_{i=1}^{m} \exp(\lambda^2 a_i^2 / 2) \le \sum_{a \in A} \exp(\lambda^2 L^2 / 2) = |A| \exp(\lambda^2 L^2 / 2).$$

Taking logarithms,

$$\mu \le \frac{\log |A|}{\lambda} + \frac{\lambda L^2}{2}.$$

Set $\lambda = \frac{\sqrt{2 \log |A|}}{L}$, and we will have $\mu \le L \sqrt{2 \log |A|}$, which gives the claim. More details: TBW.

The finite class lemma can be generalized to classes of binary-valued functions. Define $\mathcal{F}$ to be a class of binary-valued functions, $\mathcal{F} = \{f : \mathcal{Z} \to \{0, 1\}\}$, and, given random samples $\{Z_i\}_{i=1}^{n}$, define $\mathcal{F}(Z^n) \triangleq \{(f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F}\}$. We generalize the bound using the Rademacher bound for this class of functions:

Lemma 4.2 (Rademacher bound for binary-valued functions). For a class of binary-valued functions $\mathcal{F}$,

$$R_n(\mathcal{F}(Z^n)) \le \sqrt{\frac{2 \log |\mathcal{F}(Z^n)|}{n}}.$$

Proof. Proof in Section 7.
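Lemma 4.1 above is easy to check numerically: draw a finite set $A$ of vectors, estimate $\mathbb{E} \max_{a \in A} \frac{1}{m}\sum_i \sigma_i a_i$ by Monte Carlo, and compare against $L\sqrt{2 \log |A|}/m$. A sketch of ours, with an arbitrary Gaussian set $A$ as the assumed input:

```python
import numpy as np

rng = np.random.default_rng(3)
m, N, draws = 50, 20, 20_000
A = rng.normal(size=(N, m))                 # finite set of N vectors in R^m
L = np.linalg.norm(A, axis=1).max()

sig = rng.choice([-1.0, 1.0], size=(draws, m))
lhs = (A @ sig.T).max(axis=0).mean() / m    # E max_a (1/m) sum_i sigma_i a_i
rhs = L * np.sqrt(2 * np.log(N)) / m
print(lhs, rhs)                             # Massart: lhs <= rhs
```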

Theorem 4.3 (Glivenko-Cantelli). Let

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X_i \le x\}.$$

Then

$$\sup_{x} \left| F_n(x) - F(x) \right| \xrightarrow{\;a.s.\;} 0.$$

Proof. The proof consists of two main parts: first, using the Rademacher average to bound the risk, and second, using the Finite Class lemma to bound the Rademacher average. More details for later.

5 Bibliographical notes

The first use of Rademacher complexity for risk bounds is probably due to [1, 2].

References

[1] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

[2] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1-50, 2002.

6 Appendix: Union bound for risk

Let's assume we have proven the following bound for any $f \in \mathcal{F}$:

$$p\big( L(f) \ge L_n(f) + a(\delta) \big) \le \delta, \quad \text{for any } f \in \mathcal{F},$$

which is equivalent to

$$L(f) \le L_n(f) + b(\delta) \quad \text{with probability at least } 1 - \delta \tag{3}$$

for some values $a, b$ (functions of the parameters). Then

$$p\big( \exists f \in \mathcal{F} :\ L_n(f) = 0 \wedge L(f) \ge a(\delta) \big) \le |\mathcal{F}|\, \delta,$$

or, equivalently,

$$L(f) \le L_n(f) + b(\delta/|\mathcal{F}|) \quad \text{with probability at least } 1 - \delta.$$

Proof. By the union bound,

$$p\big( \exists f \in \mathcal{F} :\ L_n(f) = 0 \wedge L(f) \ge a(\delta) \big) \le \sum_{f \in \mathcal{F}} p\big( L_n(f) = 0 \wedge L(f) \ge a(\delta) \big) \le |\mathcal{F}|\, \delta.$$

Now define $\delta' = \frac{\delta}{|\mathcal{F}|}$; then, using (3), we have

$$L(f) \le L_n(f) + b(\delta') = L_n(f) + b(\delta/|\mathcal{F}|) \quad \text{with probability at least } 1 - \delta,$$

which proves our desired statement.
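A simulation can make the correction visible. This is our own sketch, using a one-sided Hoeffding bound as the per-function inequality (3) and a made-up finite class of biased coins: the radius calibrated at $\delta$ fails uniformly over the class far more often than $\delta$, while the union-bound-corrected radius at $\delta/|\mathcal{F}|$ does not:

```python
import numpy as np

rng = np.random.default_rng(5)
n, num_f, trials, delta = 500, 200, 2000, 0.05
risks = rng.uniform(0.1, 0.4, size=num_f)    # true risks L(f) of a finite class

# One-sided Hoeffding radii at delta, and at the corrected level delta/|F|
b_single = np.sqrt(np.log(1 / delta) / (2 * n))
b_union = np.sqrt(np.log(num_f / delta) / (2 * n))

fail_single = fail_union = 0
for _ in range(trials):
    emp = rng.binomial(n, risks) / n         # empirical risks over n samples
    dev = (risks - emp).max()                # worst one-sided deviation in class
    fail_single += dev > b_single
    fail_union += dev > b_union
print(fail_single / trials, fail_union / trials)
# only the corrected radius keeps the uniform failure rate below delta
```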

7 Proofs

7.1 Proof of Lemma 4.2

Proof. Since each $f$ is a binary-valued function, $\mathcal{F}(Z^n) \subseteq \{0, 1\}^n$. For any set of samples $\{Z_i\}_{i=1}^{n}$ and any function $f \in \mathcal{F}$, we know

$$\sqrt{\sum_{i=1}^{n} f(Z_i)^2} \le \sqrt{n}.$$

For a fixed set of random samples $\{Z_i\}_{i=1}^{n}$, the set $\mathcal{F}(Z^n) \triangleq \{(f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F}\}$ plays the role of the set $A$ in Lemma 4.1, with $N = |\mathcal{F}(Z^n)| \le 2^n$ and $L = \sqrt{n}$. As such,

$$R_n(\mathcal{F}(Z^n)) \le \sqrt{\frac{2 \log |\mathcal{F}(Z^n)|}{n}}.$$

8 Answers

Here answers to some of the questions are included. The answers are mostly by the authors, and might be buggy. Therefore, read cautiously!

8.1 Answer to Example 3.3

8.1.1 First part:

We first use Bernstein's inequality and simplify it. Consider Equation (2) and take $\delta = \exp\big( - \frac{n \epsilon^2}{2p + 2\epsilon/3} \big)$. Then

$$n \epsilon^2 - \frac{2}{3} \log\frac{1}{\delta} \cdot \epsilon - 2p \log\frac{1}{\delta} = 0 \quad \Longrightarrow \quad \epsilon = \frac{ \frac{2}{3} \log\frac{1}{\delta} \pm \sqrt{ \left( \frac{2}{3} \log\frac{1}{\delta} \right)^2 + 8 n p \log\frac{1}{\delta} } }{2n}.$$

Based on the assumptions of the inequality, $\epsilon \ge 0$, so we choose the root with the $+$ sign in the above equation. Using this simplification, we can rewrite the Bernstein inequality in the following equivalent form: with probability at least $1 - \delta$,

$$\mathbb{E} U \le \frac{1}{n} \sum_{i=1}^{n} U_i + \frac{ \frac{2}{3} \log\frac{1}{\delta} + \sqrt{ \left( \frac{2}{3} \log\frac{1}{\delta} \right)^2 + 8 n p \log\frac{1}{\delta} } }{2n}.$$

Now, for a specific $f \in \mathcal{F}$, we can consider $U_i = \mathbb{1}\{Y_i \ne f(X_i)\}$ as a Bernoulli random variable, with the probability of success defined by $p = \mathbb{E} U = L(f)$. The empirical estimate of this Bernoulli distribution is $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{Y_i \ne f(X_i)\} = L_n(f)$. Thus we can rewrite the bound as: with probability at least $1 - \delta$,

$$L(f) \le L_n(f) + \frac{\log\frac{1}{\delta}}{3n} + \frac{1}{2n} \sqrt{ \left( \frac{2}{3} \log\frac{1}{\delta} \right)^2 + 8 n L(f) \log\frac{1}{\delta} }.$$

Now we use the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$: with probability at least $1 - \delta$,

$$L(f) \le L_n(f) + \frac{\log\frac{1}{\delta}}{3n} + \frac{1}{2n} \left( \frac{2}{3} \log\frac{1}{\delta} + \sqrt{ 8 n L(f) \log\frac{1}{\delta} } \right) = L_n(f) + \frac{2 \log\frac{1}{\delta}}{3n} + \sqrt{ \frac{2 L(f) \log\frac{1}{\delta}}{n} },$$

which proves the desired result.
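The quadratic-formula inversion above is easy to verify numerically. A check of ours, with made-up values of $n$, $p$, and $\delta$: solving for $\epsilon$ with the $+$ root and substituting back into Bernstein's exponent should recover $\delta$ exactly:

```python
import numpy as np

n, p, delta = 1000, 0.1, 0.05
log1d = np.log(1 / delta)

# Solve n*eps^2 - (2/3)*log(1/delta)*eps - 2*p*log(1/delta) = 0 for eps >= 0
eps = ((2 / 3) * log1d
       + np.sqrt((2 / 3 * log1d) ** 2 + 8 * n * p * log1d)) / (2 * n)

# Plugging back into the Bernstein exponent recovers delta:
print(np.exp(-n * eps**2 / (2 * p + 2 * eps / 3)), delta)
```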

8.1.2 Second part:

We use the hint on the bound found in the previous part, Equation (1), with the following definitions:

$$a = L(f), \qquad b = L_n(f) + \frac{2 \log(1/\delta)}{3n}, \qquad c = \sqrt{\frac{2 \log(1/\delta)}{n}}.$$

The hint ($a \le b + c\sqrt{a} \Rightarrow a \le b + c^2 + c\sqrt{b}$) implies the following inequality:

$$L(f) \le L_n(f) + \frac{2 \log(1/\delta)}{3n} + \frac{2 \log(1/\delta)}{n} + \sqrt{\frac{2 \log(1/\delta)}{n}} \cdot \sqrt{ L_n(f) + \frac{2 \log(1/\delta)}{3n} }.$$

We use the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$:

$$L(f) \le L_n(f) + \frac{8 \log(1/\delta)}{3n} + \sqrt{ \frac{2 L_n(f) \log(1/\delta)}{n} } + \sqrt{ \frac{2 \log(1/\delta)}{n} \cdot \frac{2 \log(1/\delta)}{3n} } = L_n(f) + \left( \frac{8}{3} + \frac{2}{\sqrt{3}} \right) \frac{\log(1/\delta)}{n} + \sqrt{ \frac{2 L_n(f) \log(1/\delta)}{n} }$$

$$\approx L_n(f) + 3.83\, \frac{\log(1/\delta)}{n} + \sqrt{ \frac{2 L_n(f) \log(1/\delta)}{n} } \le L_n(f) + \frac{4 \log(1/\delta)}{n} + \sqrt{ \frac{2 L_n(f) \log(1/\delta)}{n} },$$

which proves the desired result.

Now, using this bound, we prove the last part of the question, via the union bound for risk stated in the Appendix. Since this bound holds for any $f \in \mathcal{F}$, it also holds for $\hat{f} \in \mathcal{F}$. Based on the assumption of the question, the empirical risk for this function is zero: for a fixed $\hat{f} \in \mathcal{F}$, if $L_n(\hat{f}) = 0$, then

$$L(\hat{f}) \le \frac{4 \log(1/\delta)}{n}.$$

Since $\hat{f}$ is not known a priori and can be any function in the class $\mathcal{F}$, we need to use the union bound, as in Equation (3):

$$L(\hat{f}) \le \frac{4 \log(|\mathcal{F}|/\delta)}{n} \quad \text{with probability at least } 1 - \delta.$$
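As a closing illustration (ours; the class size and sample size below are made up), the final realizable-case bound is a one-liner:

```python
import numpy as np

def realizable_risk_bound(num_functions, n, delta=0.05):
    # Final bound of Example 3.3: L(f_hat) <= 4 log(|F|/delta) / n,
    # valid whenever the ERM solution has zero empirical risk.
    return 4 * np.log(num_functions / delta) / n

# e.g. a class of one million functions and 50,000 samples:
print(realizable_risk_bound(num_functions=10**6, n=50_000))   # ~ 0.0013
```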