Rademacher Complexity

EECS 598: Statistical Learning Theory, Winter 2014. Topic 10: Rademacher Complexity. Lecturer: Clayton Scott. Scribes: Yan Deng, Kevin Moon.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Introduction

Rademacher complexity is a measure of the richness of a class of real-valued functions. In this sense, it is similar to the VC dimension. In fact, we will establish a uniform deviation bound in terms of Rademacher complexity, and then use this result to prove the VC inequality. Unlike VC dimension, however, Rademacher complexity is not restricted to binary functions, and will also prove useful later in the analysis of other learning algorithms such as kernel-based algorithms.

2 Rademacher Complexity

Let $G \subseteq [a, b]^{\mathcal{Z}}$ be a set of functions $\mathcal{Z} \to [a, b]$, where $a, b \in \mathbb{R}$, $a < b$. Let $Z_1, \ldots, Z_n$ be i.i.d. random variables on $\mathcal{Z}$ following some distribution $P$. Denote the sample $S = (Z_1, \ldots, Z_n)$. The empirical Rademacher complexity of $G$ with respect to the sample $S$ is

$$\hat{R}_S(G) := E_\sigma\left[\sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i g(Z_i)\right],$$

where $\sigma = (\sigma_1, \ldots, \sigma_n)^T$ with $\sigma_i$ i.i.d. $\mathrm{unif}\{-1, 1\}$. Here $\sigma_1, \ldots, \sigma_n$ are known as Rademacher random variables. The complexity $\hat{R}_S(G)$ is random because of the randomness of $S$. The Rademacher complexity of $G$ is

$$R_n(G) = E_S\left[\hat{R}_S(G)\right].$$

Rademacher complexity is sometimes called Rademacher average. An interpretation in the context of binary classification is that $G$ is rich, equivalently, $\hat{R}_S(G)$ or $R_n(G)$ is high, if we can choose functions $g$ to accurately match different random sign combinations reflected by $\sigma$. Note that the complexity is bounded, since elements of $G$ are bounded within the interval $[a, b]$.

Theorem 1 (One-sided Rademacher complexity bound). Let $Z, Z_1, \ldots, Z_n$ be i.i.d. random variables taking values in a set $\mathcal{Z}$. Consider a set of functions $G \subseteq [a, b]^{\mathcal{Z}}$. For all $\delta > 0$, with probability at least $1 - \delta$ with respect to the draw of the sample $S$,

$$\forall g \in G, \quad E[g(Z)] \le \frac{1}{n} \sum_{i=1}^n g(Z_i) + 2 R_n(G) + (b - a) \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1}$$

In addition, for all $\delta > 0$, with probability at least $1 - \delta$ with respect to the draw of $S$,

$$\forall g \in G, \quad E[g(Z)] \le \frac{1}{n} \sum_{i=1}^n g(Z_i) + 2 \hat{R}_S(G) + 3 (b - a) \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{2}$$
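To make the definition of the empirical Rademacher complexity concrete, here is a minimal Python sketch for a finite class $G$ evaluated on a fixed sample. The function names and the `values[k][i]` layout (the value of the k-th function at $Z_i$) are assumptions made for this illustration and are not part of the notes. The exact version enumerates all $2^n$ sign vectors, so it is only practical for small $n$; the Monte Carlo version approximates the same expectation.

```python
import itertools
import random


def empirical_rademacher_exact(values):
    """Exact E_sigma[ max_k (1/n) sum_i sigma_i * values[k][i] ].

    values[k][i] holds g_k(Z_i) for a finite class {g_k} and a fixed sample.
    Enumerates all 2^n sign vectors, so use only for small n.
    """
    n = len(values[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / n
    return total / 2 ** n


def empirical_rademacher_mc(values, num_draws=20000, seed=0):
    """Monte Carlo estimate of the same expectation using random sign vectors."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / n
    return total / num_draws


if __name__ == "__main__":
    # All 8 sign patterns on 3 points: the class can match any sigma.
    rich = [list(p) for p in itertools.product((-1, 1), repeat=3)]
    print(empirical_rademacher_exact(rich))
    # A single function cannot correlate with random signs on average.
    print(empirical_rademacher_exact([[1, 1, 1]]))
```

A class containing every sign pattern on the sample attains the maximal value, while a single function gives zero, which matches the interpretation of richness given above.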

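The final term in both (1) and (2) is typically much smaller than the Rademacher complexity. Note that (1) and (2) are one-sided uniform deviation bounds, and that (2) is a data-dependent bound. Before proving the theorem, we first review the following useful facts.

Fact 1: For any real-valued functions $f_1, f_2 : X \to \mathbb{R}$,

$$\sup_x f_1(x) - \sup_x f_2(x) \le \sup_x \left(f_1(x) - f_2(x)\right).$$

To see this, let $\epsilon > 0$ and let $x'$ be such that $f_1(x') \ge \sup_x f_1(x) - \epsilon$. Then

$$\sup_x f_1(x) - \sup_x f_2(x) \le f_1(x') - f_2(x') + \epsilon \le \sup_x \left(f_1(x) - f_2(x)\right) + \epsilon.$$

Since $\epsilon > 0$ was arbitrary, the result follows.

Fact 2: For any real-valued functions $f_1, f_2 : X \to \mathbb{R}$,

$$\sup_x \left(f_1(x) + f_2(x)\right) \le \sup_x f_1(x) + \sup_x f_2(x).$$

Fact 3: $\sup$ is a convex function, i.e., if $(x_\lambda)_{\lambda \in \Lambda}$ and $(x'_\lambda)_{\lambda \in \Lambda}$ are two sequences, where $\Lambda$ is possibly uncountable, then for all $\alpha \in [0, 1]$,

$$\sup_{\lambda \in \Lambda} \left[\alpha x_\lambda + (1 - \alpha) x'_\lambda\right] \le \alpha \sup_{\lambda \in \Lambda} x_\lambda + (1 - \alpha) \sup_{\lambda \in \Lambda} x'_\lambda.$$

This is an immediate consequence of Fact 2.

Fact 4: Jensen's inequality, i.e., if $f$ is convex, then $f(E[U]) \le E[f(U)]$.

Now we are ready to prove Theorem 1.

Proof. For notational brevity, denote $E[g] = E[g(Z)]$ and $\hat{E}_S[g] = \frac{1}{n} \sum_{i=1}^n g(Z_i)$. The idea is to apply the bounded difference inequality (BDI) to

$$\phi(S) = \sup_{g \in G} \left(E[g] - \hat{E}_S[g]\right).$$

First, we verify the bounded difference assumption. Denoting $S'_i = (Z_1, \ldots, Z_{i-1}, Z'_i, Z_{i+1}, \ldots, Z_n)$, we have

$$\begin{aligned}
\phi(S) - \phi(S'_i) &= \sup_{g \in G} \left(E[g] - \hat{E}_S[g]\right) - \sup_{g \in G} \left(E[g] - \hat{E}_{S'_i}[g]\right) \\
&\le \sup_{g \in G} \left(\hat{E}_{S'_i}[g] - \hat{E}_S[g]\right) \quad \text{(by Fact 1)} \\
&= \sup_{g \in G} \frac{g(Z'_i) - g(Z_i)}{n} \\
&\le \frac{b - a}{n}.
\end{aligned}$$

Similarly, we can prove $\phi(S'_i) - \phi(S) \le (b - a)/n$, and therefore $|\phi(S) - \phi(S'_i)| \le (b - a)/n$. By the BDI, we have that with probability at least $1 - \delta$,

$$\phi(S) - E_S[\phi(S)] \le (b - a) \sqrt{\frac{\log(1/\delta)}{2n}}.$$

To establish (1), it remains to show that $E_S[\phi(S)] \le 2 R_n(G)$. Thus let us introduce another random sample, called a ghost sample, $S' = (Z'_1, \ldots, Z'_n)$ with $Z'_i \overset{\text{iid}}{\sim} P$, independent of $S$. Then

$$\begin{aligned}
E_S[\phi(S)] &= E_S\left[\sup_{g \in G} \left(E[g] - \hat{E}_S[g]\right)\right] \\
&= E_S\left[\sup_{g \in G} E_{S'}\left[\hat{E}_{S'}[g] - \hat{E}_S[g]\right]\right] \quad \text{(by } E[g] = E_{S'} \hat{E}_{S'}[g] \text{)}
\end{aligned}$$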

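$$\begin{aligned}
&\le E_{S, S'}\left[\sup_{g \in G} \left(\hat{E}_{S'}[g] - \hat{E}_S[g]\right)\right] \quad \text{(by Facts 3 and 4)} \\
&= E_{S, S'}\left[\sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \left(g(Z'_i) - g(Z_i)\right)\right] \\
&= E_{S, S', \sigma}\left[\sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i \left(g(Z'_i) - g(Z_i)\right)\right] \\
&\le E_{S', \sigma}\left[\sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i g(Z'_i)\right] + E_{S, \sigma}\left[\sup_{g \in G} \frac{1}{n} \sum_{i=1}^n (-\sigma_i) g(Z_i)\right] \quad \text{(by Fact 2)} \\
&= 2 R_n(G) \quad \text{(} \sigma_i \text{ is symmetric).}
\end{aligned}$$

The equality that introduces $\sigma$ holds because (i) $Z_i$ and $Z'_i$ are i.i.d., hence $g(Z'_i) - g(Z_i)$ and $g(Z_i) - g(Z'_i)$ have the same distribution, and (ii) $\sigma_i$ is symmetric.

To establish (2), we apply the BDI again to $\phi(S) = \hat{R}_S(G)$. Observe that

$$\begin{aligned}
\phi(S) - \phi(S'_i) &= \hat{R}_S(G) - \hat{R}_{S'_i}(G) \\
&= E_\sigma\left[\sup_{g \in G} \frac{1}{n} \sum_{j=1}^n \sigma_j g(Z_j)\right] - E_\sigma\left[\sup_{g \in G} \frac{1}{n} \left(\sigma_i g(Z'_i) + \sum_{j \ne i} \sigma_j g(Z_j)\right)\right] \\
&\le E_\sigma\left[\sup_{g \in G} \frac{1}{n} \sigma_i \left(g(Z_i) - g(Z'_i)\right)\right] \quad \text{(by Fact 1)} \\
&\le \frac{b - a}{n}.
\end{aligned}$$

Similarly, we can prove $\phi(S'_i) - \phi(S) \le (b - a)/n$, and thus $|\phi(S) - \phi(S'_i)| \le (b - a)/n$. Applying the BDI, we have that with probability at least $1 - \delta/2$,

$$R_n(G) \le \hat{R}_S(G) + (b - a) \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{3}$$

Combining (1), with $\delta$ replaced by $\delta/2$, and the inequality above, we then establish (2), because

$$\Pr(\text{violating (2)}) \le \Pr(\text{violating (1)}) + \Pr(\text{violating (3)}) \le \delta/2 + \delta/2 = \delta.$$

The following two-sided bound also holds.

Theorem 2 (Two-sided Rademacher complexity bound). Consider a set of functions $G \subseteq [a, b]^{\mathcal{Z}}$. For all $\delta > 0$, with probability at least $1 - \delta$ with respect to the draw of the sample $S$,

$$\sup_{g \in G} \left| E[g(Z)] - \frac{1}{n} \sum_{i=1}^n g(Z_i) \right| \le 2 R_n(G) + (b - a) \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{4}$$

In addition, for all $\delta > 0$, with probability at least $1 - \delta$ with respect to the draw of $S$,

$$\sup_{g \in G} \left| E[g(Z)] - \frac{1}{n} \sum_{i=1}^n g(Z_i) \right| \le 2 \hat{R}_S(G) + 3 (b - a) \sqrt{\frac{\log(4/\delta)}{2n}}. \tag{5}$$

The proof is left as an exercise.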
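As a rough numerical illustration of how the confidence terms in (1), (2), (4), and (5) scale, the sketch below evaluates them for given $n$ and $\delta$. The function names are hypothetical and chosen only for this example; the formulas are the square-root terms from the two theorems, with the Rademacher complexity itself left out.

```python
import math


def bdi_deviation(n, delta, a=0.0, b=1.0):
    """(b - a) * sqrt(log(1/delta) / (2n)): the bounded-difference term."""
    return (b - a) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def one_sided_slack(n, delta, a=0.0, b=1.0):
    """Confidence terms of (1) and (2), returned as (population, empirical)."""
    population = bdi_deviation(n, delta, a, b)
    # log(2/delta) is log(1/(delta/2)), hence the delta/2 argument.
    empirical = 3.0 * bdi_deviation(n, delta / 2.0, a, b)
    return population, empirical


def two_sided_slack(n, delta, a=0.0, b=1.0):
    """Confidence terms of (4) and (5), returned as (population, empirical)."""
    population = bdi_deviation(n, delta / 2.0, a, b)
    empirical = 3.0 * bdi_deviation(n, delta / 4.0, a, b)
    return population, empirical


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, one_sided_slack(n, 0.05), two_sided_slack(n, 0.05))
```

Quadrupling $n$ halves every term, and the data-dependent versions pay roughly a factor of three for replacing $R_n(G)$ by $\hat{R}_S(G)$.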

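3 Bounds for Binary Classification

Consider a set of binary classifiers $H \subseteq \{-1, 1\}^{X}$. Let $\mathcal{Z} = X \times \{-1, 1\}$. Define another set $G$ based on $H$ as

$$G = \left\{ (x, y) \mapsto \mathbf{1}_{\{h(x) \ne y\}} : h \in H \right\}.$$

Let $S = \{Z_1, \ldots, Z_n\} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, and also let $T = \{X_1, \ldots, X_n\}$, which is the projection of $S$ onto the domain $X$. The empirical Rademacher complexity of $H$ should be written $\hat{R}_T(H)$; however, we will follow convention and write it as $\hat{R}_S(H)$. There should be no confusion, since the domain of elements of $H$ is $X$, so only the $X_i$'s in the sample can be used when evaluating the empirical Rademacher complexity. Thus we have

Lemma 1. $\hat{R}_S(G) = \frac{1}{2} \hat{R}_S(H)$.

Proof. From the definitions, we have

$$\begin{aligned}
\hat{R}_S(G) &= E_\sigma\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{\{h(X_i) \ne Y_i\}}\right] \\
&= E_\sigma\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i \frac{1 - Y_i h(X_i)}{2}\right] \\
&= \frac{1}{2} E_\sigma\left[\frac{1}{n} \sum_{i=1}^n \sigma_i + \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n (-\sigma_i Y_i) h(X_i)\right] \\
&= \frac{1}{2} E_\sigma\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i)\right] \\
&= \frac{1}{2} \hat{R}_S(H),
\end{aligned}$$

where the second to last step follows from the facts that $E[\sigma_i] = 0$ and that $\sigma_i$ and $-\sigma_i Y_i$ have the same distribution.

Now observe that $E[g] = E[\mathbf{1}_{\{h(X) \ne Y\}}] = R(h)$ when $g \in G$ is defined in terms of $h \in H$. Note also that $\frac{1}{n} \sum_{i=1}^n g(Z_i) = \hat{R}_n(h)$, the empirical risk of $h$. This gives the following corollary:

Corollary 1. For all $\delta > 0$, with probability at least $1 - \delta$,

$$\sup_{h \in H} \left(R(h) - \hat{R}_n(h)\right) \le R_n(H) + \sqrt{\frac{\ln(1/\delta)}{2n}},$$

and with probability at least $1 - \delta$,

$$\sup_{h \in H} \left(R(h) - \hat{R}_n(h)\right) \le \hat{R}_S(H) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Remark. A two-sided version of this corollary also holds, with $\delta \to \delta/2$.

Example. Let $\Pi = \{A_1, \ldots, A_k\}$ be a fixed partition of $X$, such as a regular partition or a recursive dyadic partition. Let $H = \{\text{classifiers that are constant on cells in } \Pi\}$. Then $|H| = 2^k$. We'll obtain a bound on the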

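empirical Rademacher complexity of $H$. Let $\ell(A)$ denote the label assigned to $A \in \Pi$. Then

$$\begin{aligned}
\hat{R}_S(H) &= E_\sigma\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i)\right] \\
&= \frac{1}{n} E_\sigma\left[\sup_{\ell} \sum_{A \in \Pi} \ell(A) \sum_{i : X_i \in A} \sigma_i\right] \\
&= \frac{1}{n} \sum_{A \in \Pi} E_\sigma\left[\sup_{\ell(A) \in \{-1, 1\}} \ell(A) \sum_{i : X_i \in A} \sigma_i\right].
\end{aligned}$$

Manipulating the terms inside the expectation gives

$$\begin{aligned}
E_\sigma\left[\sup_{\ell(A) \in \{-1, 1\}} \ell(A) \sum_{i : X_i \in A} \sigma_i\right] &= E_\sigma\left[\left|\sum_{i : X_i \in A} \sigma_i\right|\right] = E_\sigma\left[\sqrt{\Big(\sum_{i : X_i \in A} \sigma_i\Big)^2}\right] \\
&\le \sqrt{E_\sigma\left[\Big(\sum_{i : X_i \in A} \sigma_i\Big)^2\right]} \quad \text{(Jensen's inequality)} \\
&= \sqrt{\#\{i : X_i \in A\}},
\end{aligned}$$

where the last line follows because

$$E[\sigma_i \sigma_j] = \begin{cases} 0, & i \ne j, \\ 1, & i = j. \end{cases}$$

If $n_j = \#\{i : X_i \in A_j\}$, then

$$\hat{R}_S(H) \le \frac{1}{n} \sum_{j=1}^k \sqrt{n_j} = \frac{1}{\sqrt{n}} \sum_{j=1}^k \sqrt{\hat{P}(A_j)},$$

where $\hat{P}(A_j) = n_j / n$. The only inequality in the above derivation was Jensen's inequality, and by the Khintchine-Kahane inequality the reverse inequality holds if we include a multiplicative factor of $\sqrt{2}$, so the calculation is tight up to this factor. The Rademacher complexity in this example can actually be computed exactly in terms of binomial probabilities. This is left as an exercise.

4 Proof of VC Inequality

To prove the VC inequality, we will focus on bounding $R_n(H)$ in terms of the shatter coefficient.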
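Returning briefly to the fixed-partition example of Section 3, the following Python sketch compares the Jensen upper bound $\frac{1}{n} \sum_j \sqrt{n_j}$ with a direct numerical evaluation of $\frac{1}{n} \sum_j E_\sigma |\sum_{i : X_i \in A_j} \sigma_i|$. It relies on the elementary fact that a sum of $m$ independent Rademacher variables equals $2K - m$ with $K \sim \mathrm{Binomial}(m, 1/2)$. The function names and the `cell_counts` input (the list of $n_j$) are assumptions made for this illustration.

```python
import math


def expected_abs_rademacher_sum(m):
    """E|sigma_1 + ... + sigma_m| for i.i.d. uniform signs.

    Uses the fact that the sum equals 2K - m with K ~ Binomial(m, 1/2).
    """
    return sum(math.comb(m, k) * abs(2 * k - m) for k in range(m + 1)) / 2 ** m


def partition_complexity(cell_counts):
    """Return (exact value, Jensen upper bound) for the partition class.

    cell_counts[j] is n_j, the number of sample points falling in cell A_j.
    """
    n = sum(cell_counts)
    exact = sum(expected_abs_rademacher_sum(m) for m in cell_counts) / n
    upper = sum(math.sqrt(m) for m in cell_counts) / n
    return exact, upper


if __name__ == "__main__":
    for counts in ([1, 1, 1, 1], [2, 2], [10, 30, 60]):
        exact, upper = partition_complexity(counts)
        print(counts, exact, upper, upper / math.sqrt(2))
```

In the printed rows the exact value should fall between `upper / sqrt(2)` and `upper`, which is the two-sided comparison described in the example.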

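Theorem 3 (Massart's Lemma). Let $A \subseteq \mathbb{R}^n$, $|A| < \infty$. Set $r = \max_{u \in A} \|u\|_2$. Then

$$E_\sigma\left[\frac{1}{n} \sup_{u \in A} \sum_{i=1}^n \sigma_i u_i\right] \le \frac{r \sqrt{2 \ln |A|}}{n},$$

where $u = (u_1, \ldots, u_n)^T$.

Proof. For all $t \ge 0$, we have that

$$\begin{aligned}
\exp\left(t \, E_\sigma\left[\sup_{u \in A} \sum_{i=1}^n \sigma_i u_i\right]\right) &\le E_\sigma\left[\exp\left(t \sup_{u \in A} \sum_{i=1}^n \sigma_i u_i\right)\right] \quad \text{(Jensen's inequality)} \\
&= E_\sigma\left[\sup_{u \in A} \exp\left(t \sum_{i=1}^n \sigma_i u_i\right)\right] \quad \text{(exponential is strictly increasing)} \\
&\le \sum_{u \in A} E_\sigma\left[\exp\left(t \sum_{i=1}^n \sigma_i u_i\right)\right].
\end{aligned}$$

The summand is an MGF. Due to independence,

$$E_\sigma\left[\exp\left(t \sum_{i=1}^n \sigma_i u_i\right)\right] = \prod_{i=1}^n E_{\sigma_i}\left[\exp(t \sigma_i u_i)\right] \le \prod_{i=1}^n \exp\left(t^2 (2 u_i)^2 / 8\right),$$

where the bound comes from the following lemma:

Lemma 2. Let $V$ be a random variable on $\mathbb{R}$ with $E[V] = 0$ and $V \in [a, b]$ with probability one. Then for all $t > 0$,

$$E\left[e^{tV}\right] \le e^{t^2 (b - a)^2 / 8}.$$

This lemma was given and proved as Lemma 1 in the notes on Hoeffding's inequality. It was used to prove Hoeffding's inequality. In our case, we used $V = \sigma_i u_i$, $a = -|u_i|$, and $b = |u_i|$.

Continuing with the proof of Massart's lemma,

$$\prod_{i=1}^n \exp\left(t^2 (2 u_i)^2 / 8\right) = \exp\left(\frac{t^2}{2} \sum_{i=1}^n u_i^2\right) = \exp\left(\frac{t^2 \|u\|_2^2}{2}\right) \le \exp\left(\frac{t^2 r^2}{2}\right),$$

so that

$$\exp\left(t \, E_\sigma\left[\sup_{u \in A} \sum_{i=1}^n \sigma_i u_i\right]\right) \le \sum_{u \in A} \exp\left(\frac{t^2 r^2}{2}\right) = |A| \exp\left(\frac{t^2 r^2}{2}\right).$$

Taking the log of both sides and dividing by $t$ gives

$$E_\sigma\left[\sup_{u \in A} \sum_{i=1}^n \sigma_i u_i\right] \le \frac{\ln |A|}{t} + \frac{t r^2}{2} = r \sqrt{2 \ln |A|},$$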

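where the last step follows from choosing $t = \sqrt{2 \ln |A|} / r$. Dividing both sides by $n$ completes the proof.

This theorem is the key result that bridges the gap between VC theory and Rademacher complexity. We first state and prove a one-sided version of the VC inequality.

Theorem 4 (One-sided VC Inequality). For $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\sup_{h \in H} \left(R(h) - \hat{R}_n(h)\right) \le \sqrt{\frac{8 \left(\ln S_H(n) + \ln(1/\delta)\right)}{n}}.$$

Equivalently, for any $\epsilon > 0$,

$$\Pr\left(\sup_{h \in H} \left(R(h) - \hat{R}_n(h)\right) \ge \epsilon\right) \le S_H(n) \, e^{-n \epsilon^2 / 8}. \tag{6}$$

Proof. Let $H \subseteq \{-1, 1\}^{X}$ and $S = (X_1, \ldots, X_n) \in X^n$. Denote $H_S = \{(h(X_1), \ldots, h(X_n)) : h \in H\}$. If $u \in H_S$, then $\|u\|_2 = \sqrt{n}$. By Massart's lemma,

$$\begin{aligned}
R_n(H) = E_S E_\sigma\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i)\right] &\le E_S\left[\sqrt{\frac{2 \ln |H_S|}{n}}\right] \\
&\le \sqrt{\frac{2 \ln E_S |H_S|}{n}} \quad \text{(Jensen's inequality)} \qquad (7) \\
&\le \sqrt{\frac{2 \ln S_H(n)}{n}}, \qquad (8)
\end{aligned}$$

where the last step follows from the fact that $|H_S| \le S_H(n)$. From Corollary 1 we deduce that with probability at least $1 - \delta$,

$$\sup_{h \in H} \left(R(h) - \hat{R}_n(h)\right) \le \sqrt{\frac{2 \ln S_H(n)}{n}} + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

We now observe that for $a, b \ge 0$, $\sqrt{a} + \sqrt{b} \le \sqrt{a + b} + \sqrt{a + b} = 2 \sqrt{a + b}$. Therefore, for $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\begin{aligned}
\sup_{h \in H} \left(R(h) - \hat{R}_n(h)\right) &\le \sqrt{\frac{8 \left(\ln S_H(n) + \ln(1/\delta)/4\right)}{n}} \qquad (9) \\
&\le \sqrt{\frac{8 \left(\ln S_H(n) + \ln(1/\delta)\right)}{n}}.
\end{aligned}$$

This establishes the first part of the theorem. To establish the second part, set the right-hand side equal to $\epsilon$ and solve for $\delta$ (if no such $\delta$ exists, the bound holds trivially).

Remark. Note that step (7) is not necessary. We could have gone directly to (8) using the definition of the shatter coefficient. However, the intermediate result gives a uniform deviation bound in terms of the expected cardinality of $H_S$, which we used to study monotone layers and convex sets in the lecture on VC Theory.

Finally, we state the standard two-sided VC inequality, whose proof is left as an exercise.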

Theorem 5 (Two-sided VC Inequality). For $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\sup_{h \in H} \left|R(h) - \hat{R}_n(h)\right| \le \sqrt{\frac{8 \left(\ln S_H(n) + \ln(2/\delta)\right)}{n}}.$$

Equivalently, for any $\epsilon > 0$,

$$\Pr\left(\sup_{h \in H} \left|R(h) - \hat{R}_n(h)\right| \ge \epsilon\right) \le 2 S_H(n) \, e^{-n \epsilon^2 / 8}.$$

Exercises

1. Can you improve the constants in the empirical Rademacher complexity bound (2) through a single, direct application of the bounded difference inequality?

2. Determine an exact formula for the empirical Rademacher complexity of the set of classifiers based on a fixed partition (see the example above).

3. Let $G, G_1, G_2$ denote arbitrary classes of functions $\mathcal{Z} \to [a, b]$, and let $c, d$ be arbitrary real numbers. Show:
(a) $\hat{R}_S(cG + d) = |c| \, \hat{R}_S(G)$, where $cG + d := \{g'(z) = c \, g(z) + d \mid g \in G\}$.
(b) $\hat{R}_S(\mathrm{conv}\, G) = \hat{R}_S(G)$, where $\mathrm{conv}\, G := \{\sum_{i=1}^N \alpha_i g_i \mid N \in \mathbb{N}, \alpha_i \ge 0, \sum_i \alpha_i = 1, g_i \in G\}$.
(c) $\hat{R}_S(G_1 + G_2) = \hat{R}_S(G_1) + \hat{R}_S(G_2)$, where $G_1 + G_2 := \{g(z) = g_1(z) + g_2(z) \mid g_1 \in G_1, g_2 \in G_2\}$.

4. Two-sided uniform deviation bounds.
(a) Prove Theorem 2. Hint: Apply the one-sided Rademacher bound again to $-G$.
(b) Prove Theorem 5. Hint: Observe that $R(-h) = 1 - R(h)$, and similarly for the empirical risk.
(c) Show that if $G = -G$, then the two-sided Rademacher bound holds with the same constants as the one-sided version. In particular, the substitution $\delta \to \delta/2$ is unnecessary.
(d) Show that if $H = -H$, then the two-sided VC inequality holds with the same constants as the one-sided version. In particular, the substitution $\delta \to \delta/2$ is unnecessary.

5. Use inequality (9) to improve the constant in the exponent of (6) at the expense of a larger term in front of the exponential.
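As a supplementary sanity check on Massart's lemma (Theorem 3), the sketch below computes the left-hand side exactly by enumerating all sign vectors for a small finite set $A \subset \mathbb{R}^n$ and compares it with $r \sqrt{2 \ln |A|} / n$. The function names are hypothetical, and the check is only feasible for small $n$ because it visits $2^n$ sign vectors.

```python
import itertools
import math


def massart_lhs(A):
    """Exact E_sigma[ (1/n) max_{u in A} sum_i sigma_i * u_i ] by enumeration."""
    n = len(A[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * u_i for s, u_i in zip(sigma, u)) for u in A)
    return total / (2 ** n * n)


def massart_rhs(A):
    """r * sqrt(2 ln|A|) / n with r the largest Euclidean norm in A."""
    n = len(A[0])
    r = max(math.sqrt(sum(u_i * u_i for u_i in u)) for u in A)
    return r * math.sqrt(2.0 * math.log(len(A))) / n


if __name__ == "__main__":
    # All sign vectors in {-1, 1}^4, the worst case for binary classifiers.
    cube = [list(p) for p in itertools.product((-1, 1), repeat=4)]
    print(massart_lhs(cube), massart_rhs(cube))
    # A small set of real-valued vectors.
    A = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-1.0, 1.0, 1.0]]
    print(massart_lhs(A), massart_rhs(A))
```

For the full cube the left-hand side equals 1, and the bound evaluates to $\sqrt{2 \ln 16} / 2$, which is a little below 1.18.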