EECS 598: Statistical Learning Theory, Winter 2014
Topic 10: Rademacher Complexity
Lecturer: Clayton Scott
Scribe: Yan Deng, Kevin Moon

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Introduction

Rademacher complexity is a measure of the richness of a class of real-valued functions. In this sense, it is similar to the VC dimension. In fact, we will establish a uniform deviation bound in terms of Rademacher complexity, and then use this result to prove the VC inequality. Unlike VC dimension, however, Rademacher complexity is not restricted to binary functions, and will also prove useful later in the analysis of other learning algorithms such as kernel-based algorithms.

2 Rademacher Complexity

Let $G \subseteq [a,b]^{\mathcal{Z}}$ be a set of functions $\mathcal{Z} \to [a,b]$, where $a, b \in \mathbb{R}$, $a < b$. Let $Z_1, \ldots, Z_n$ be i.i.d. random variables on $\mathcal{Z}$ following some distribution $P$. Denote the sample $S = (Z_1, \ldots, Z_n)$. The empirical Rademacher complexity of $G$ with respect to the sample $S$ is

\[
\hat{R}_S(G) := \mathbb{E}_\sigma \left[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i g(Z_i) \right],
\]

where $\sigma = (\sigma_1, \ldots, \sigma_n)^T$ with $\sigma_i \overset{\text{iid}}{\sim} \mathrm{unif}\{-1, 1\}$. Here $\sigma_1, \ldots, \sigma_n$ are known as Rademacher random variables. The complexity $\hat{R}_S(G)$ is random because of the randomness of $S$. The Rademacher complexity of $G$ is

\[
R_n(G) = \mathbb{E}_S \hat{R}_S(G).
\]

Rademacher complexity is sometimes called Rademacher average. An interpretation in the context of binary classification is that $G$ is rich, or equivalently $\hat{R}_S(G)$ or $R_n(G)$ is high, if we can choose functions $g$ to accurately match different random sign combinations reflected by $\sigma$. Note that the complexity is bounded, since elements of $G$ are bounded within the interval $[a,b]$.

Theorem 1 (One-sided Rademacher complexity bound). Let $Z, Z_1, \ldots, Z_n$ be i.i.d. random variables taking values in a set $\mathcal{Z}$. Consider a set of functions $G \subseteq [a,b]^{\mathcal{Z}}$. For all $\delta > 0$, with probability at least $1-\delta$ with respect to the draw of the sample $S$, we have

\[
\forall g \in G, \quad \mathbb{E}[g(Z)] \le \frac{1}{n} \sum_{i=1}^n g(Z_i) + 2 R_n(G) + (b-a) \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1}
\]

In addition, for all $\delta > 0$, with probability at least $1-\delta$ with respect to the draw of $S$, we have

\[
\forall g \in G, \quad \mathbb{E}[g(Z)] \le \frac{1}{n} \sum_{i=1}^n g(Z_i) + 2 \hat{R}_S(G) + 3(b-a) \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{2}
\]
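The empirical Rademacher complexity $\hat{R}_S(G)$ defined above can be computed by brute force when $G$ is finite and $n$ is small, and estimated by Monte Carlo otherwise. The following is a minimal sketch, not part of the original notes. It assumes the class is given as a table `values[k][i]` $= g_k(Z_i)$; the function names `empirical_rademacher_exact` and `empirical_rademacher_mc` are hypothetical and chosen only for illustration.

```python
import itertools
import random


def _sup_correlation(sigma, values):
    """sup over the class of (1/n) * sum_i sigma_i * g(Z_i)."""
    n = len(sigma)
    return max(sum(s * v for s, v in zip(sigma, row)) for row in values) / n


def empirical_rademacher_exact(values):
    """Exact empirical Rademacher complexity of a finite class.

    values[k][i] holds g_k(Z_i). All 2^n sign vectors are enumerated,
    so this is only practical for small n.
    """
    n = len(values[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += _sup_correlation(sigma, values)
    return total / 2 ** n


def empirical_rademacher_mc(values, num_draws=2000, seed=0):
    """Monte Carlo estimate: average the supremum over random sign vectors."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += _sup_correlation(sigma, values)
    return total / num_draws


if __name__ == "__main__":
    # A single function cannot follow random signs, so its complexity is 0.
    print(empirical_rademacher_exact([[0.3, -0.2, 0.5]]))
    # The class of all sign patterns on 3 points matches every sigma exactly.
    rich = [list(s) for s in itertools.product((-1, 1), repeat=3)]
    print(empirical_rademacher_exact(rich))
```

The two printed cases correspond to the interpretation given above: a class that cannot match random signs has low complexity, while a class that can reproduce every sign pattern attains the largest possible value for $\{-1,1\}$-valued functions.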
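The final term in both (1) and (2) is typically much smaller than the Rademacher complexity. Note that (1) and (2) are one-sided uniform deviation bounds, and that (2) is a data-dependent bound. Before proving the theorem, we first review the following useful facts.

Fact 1: For any real-valued functions $f_1, f_2 : \mathcal{X} \to \mathbb{R}$,

\[
\sup_x f_1(x) - \sup_x f_2(x) \le \sup_x \big( f_1(x) - f_2(x) \big).
\]

To see this, let $\epsilon > 0$ and let $x^*$ be such that $f_1(x^*) \ge \sup_x f_1(x) - \epsilon$. Then

\begin{align*}
\sup_x f_1(x) - \sup_x f_2(x) &\le f_1(x^*) - \sup_x f_2(x) + \epsilon \\
&\le f_1(x^*) - f_2(x^*) + \epsilon \\
&\le \sup_x \big( f_1(x) - f_2(x) \big) + \epsilon.
\end{align*}

Since $\epsilon > 0$ was arbitrary, the result follows.

Fact 2: For any real-valued functions $f_1, f_2 : \mathcal{X} \to \mathbb{R}$,

\[
\sup_x \big( f_1(x) + f_2(x) \big) \le \sup_x f_1(x) + \sup_x f_2(x).
\]

Fact 3: $\sup$ is a convex function, i.e., if $(x_\lambda)_{\lambda \in \Lambda}$ and $(x'_\lambda)_{\lambda \in \Lambda}$ are two sequences, where $\Lambda$ is possibly uncountable, then for all $\alpha \in [0,1]$,

\[
\sup_{\lambda \in \Lambda} \big( \alpha x_\lambda + (1-\alpha) x'_\lambda \big) \le \alpha \sup_{\lambda \in \Lambda} x_\lambda + (1-\alpha) \sup_{\lambda \in \Lambda} x'_\lambda.
\]

This is an immediate consequence of Fact 2.

Fact 4: Jensen's inequality, i.e., if $f$ is convex, then $f(\mathbb{E}[U]) \le \mathbb{E}[f(U)]$.

Now we are ready to prove Theorem 1.

Proof. For notational brevity, denote $\mathbb{E}[g] = \mathbb{E}[g(Z)]$ and $\hat{\mathbb{E}}_S[g] = \frac{1}{n} \sum_{i=1}^n g(Z_i)$. The idea is to apply the bounded difference inequality (BDI) to

\[
\phi(S) = \sup_{g \in G} \big( \mathbb{E}[g] - \hat{\mathbb{E}}_S[g] \big).
\]

First, we verify the bounded difference assumption. Denoting $S'_i = (Z_1, \ldots, Z_{i-1}, Z'_i, Z_{i+1}, \ldots, Z_n)$, we have

\begin{align*}
\phi(S) - \phi(S'_i) &= \sup_{g \in G} \big( \mathbb{E}[g] - \hat{\mathbb{E}}_S[g] \big) - \sup_{g \in G} \big( \mathbb{E}[g] - \hat{\mathbb{E}}_{S'_i}[g] \big) \\
&\le \sup_{g \in G} \big( \hat{\mathbb{E}}_{S'_i}[g] - \hat{\mathbb{E}}_S[g] \big) && \text{(by Fact 1)} \\
&= \sup_{g \in G} \frac{g(Z'_i) - g(Z_i)}{n} \\
&\le \frac{b-a}{n}.
\end{align*}

Similarly, we can prove $\phi(S'_i) - \phi(S) \le (b-a)/n$, and therefore $|\phi(S) - \phi(S'_i)| \le (b-a)/n$. By the BDI, we have that with probability at least $1-\delta$,

\[
\phi(S) - \mathbb{E}_S[\phi(S)] \le (b-a) \sqrt{\frac{\log(1/\delta)}{2n}}.
\]

To establish (1), it remains to show that $\mathbb{E}_S[\phi(S)] \le 2 R_n(G)$. To this end, let us introduce another random sample, called a ghost sample, $S' = (Z'_1, \ldots, Z'_n)$ with $Z'_i \overset{\text{iid}}{\sim} P$, independent of $S$. Then

\begin{align*}
\mathbb{E}_S[\phi(S)] &= \mathbb{E}_S \sup_{g \in G} \big( \mathbb{E}[g] - \hat{\mathbb{E}}_S[g] \big) \\
&= \mathbb{E}_S \sup_{g \in G} \mathbb{E}_{S'} \big[ \hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] \big] && \text{(by $\mathbb{E}[g] = \mathbb{E}_{S'} \hat{\mathbb{E}}_{S'}[g]$).}
\end{align*}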
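The right-hand side is bounded as follows:

\begin{align*}
\mathbb{E}_S \sup_{g \in G} \mathbb{E}_{S'} \big[ \hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] \big]
&\le \mathbb{E}_{S,S'} \sup_{g \in G} \big( \hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] \big) && \text{(by Facts 3 and 4)} \\
&= \mathbb{E}_{S,S'} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \big( g(Z'_i) - g(Z_i) \big) \\
&= \mathbb{E}_{S,S',\sigma} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i \big( g(Z'_i) - g(Z_i) \big) \\
&\le \mathbb{E}_{S',\sigma} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i g(Z'_i) + \mathbb{E}_{S,\sigma} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n (-\sigma_i) g(Z_i) && \text{(by Fact 2)} \\
&= 2 R_n(G) && \text{($\sigma_i$ is symmetric).}
\end{align*}

The equality that introduces $\sigma$ holds because (i) $Z_i$ and $Z'_i$ are i.i.d., hence $g(Z'_i) - g(Z_i)$ and $g(Z_i) - g(Z'_i)$ have the same distribution, and (ii) $\sigma_i$ is symmetric. This establishes (1).

To establish (2), we apply the BDI again, this time to $\phi(S) = \hat{R}_S(G)$. Observe that

\begin{align*}
\phi(S) - \phi(S'_i) &= \hat{R}_S(G) - \hat{R}_{S'_i}(G) \\
&= \mathbb{E}_\sigma \left[ \sup_{g \in G} \frac{1}{n} \sum_{j=1}^n \sigma_j g(Z_j) - \sup_{g \in G} \frac{1}{n} \Big( \sigma_i g(Z'_i) + \sum_{j \ne i} \sigma_j g(Z_j) \Big) \right] \\
&\le \mathbb{E}_\sigma \sup_{g \in G} \frac{1}{n} \sigma_i \big( g(Z_i) - g(Z'_i) \big) && \text{(by Fact 1)} \\
&\le \frac{b-a}{n}.
\end{align*}

Similarly, we can prove $\phi(S'_i) - \phi(S) \le (b-a)/n$, and thus $|\phi(S) - \phi(S'_i)| \le (b-a)/n$. Applying the BDI, we have that with probability at least $1 - \delta/2$,

\[
R_n(G) \le \hat{R}_S(G) + (b-a) \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{3}
\]

Combining (1), with $\delta$ replaced by $\delta/2$, and the inequality above, we establish (2), because

\[
\Pr(\text{violating (2)}) \le \Pr(\text{violating (1)}) + \Pr(\text{violating (3)}) \le \delta/2 + \delta/2 = \delta.
\]

The following two-sided bound also holds.

Theorem 2 (Two-sided Rademacher complexity bound). Consider a set of functions $G \subseteq [a,b]^{\mathcal{Z}}$. For all $\delta > 0$, with probability at least $1-\delta$ with respect to the draw of the sample $S$, we have

\[
\forall g \in G, \quad \left| \mathbb{E}[g(Z)] - \frac{1}{n} \sum_{i=1}^n g(Z_i) \right| \le 2 R_n(G) + (b-a) \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{4}
\]

In addition, for all $\delta > 0$, with probability at least $1-\delta$ with respect to the draw of $S$, we have

\[
\forall g \in G, \quad \left| \mathbb{E}[g(Z)] - \frac{1}{n} \sum_{i=1}^n g(Z_i) \right| \le 2 \hat{R}_S(G) + 3(b-a) \sqrt{\frac{\log(4/\delta)}{2n}}. \tag{5}
\]

The proof is left as an exercise.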
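To get a feel for the sizes of the terms in (1), (2), (4), and (5), the deviation allowed by each bound can be evaluated numerically. The sketch below is illustrative only and not part of the original notes: the helper names `confidence_term`, `width_expected`, and `width_empirical` are hypothetical, and the Rademacher complexity value passed in is assumed to be known or estimated separately.

```python
import math


def confidence_term(a, b, n, delta):
    """(b - a) * sqrt(log(1/delta) / (2n)), the final term of bound (1)."""
    return (b - a) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def width_expected(rad, a, b, n, delta, two_sided=False):
    """Deviation allowed by (1), or by (4) when two_sided is True.

    rad is the Rademacher complexity R_n(G), supplied by the caller.
    """
    d = delta / 2.0 if two_sided else delta
    return 2.0 * rad + confidence_term(a, b, n, d)


def width_empirical(emp_rad, a, b, n, delta, two_sided=False):
    """Deviation allowed by (2), or by (5) when two_sided is True.

    emp_rad is the empirical Rademacher complexity of G on the sample.
    """
    d = delta / 2.0 if two_sided else delta
    # log(2/d) inside the square root, with a factor 3 in front.
    return 2.0 * emp_rad + 3.0 * confidence_term(a, b, n, d / 2.0)


if __name__ == "__main__":
    # The confidence term decays like 1/sqrt(n) for fixed delta.
    for n in (100, 1000, 10000):
        print(n, confidence_term(0.0, 1.0, n, 0.05))
```

Passing `two_sided=True` only replaces $\delta$ by $\delta/2$, which matches the relationship between Theorems 1 and 2.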
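3 Bounds for Binary Classification

Consider a set of binary classifiers $H \subseteq \{-1, 1\}^{\mathcal{X}}$. Let $\mathcal{Z} = \mathcal{X} \times \{-1, 1\}$. Define another set $G$ based on $H$ as

\[
G = \{ (x, y) \mapsto \mathbf{1}_{\{h(x) \ne y\}} : h \in H \}.
\]

Let $S = \{Z_1, \ldots, Z_n\} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, and also let $T = \{X_1, \ldots, X_n\}$, which is the projection of $S$ onto the domain $\mathcal{X}$. The empirical Rademacher complexity of $H$ should be written $\hat{R}_T(H)$; however, we will follow convention and write it as $\hat{R}_S(H)$. There should be no confusion, since the domain of elements of $H$ is $\mathcal{X}$, so only the $X_i$'s in the sample can be used when evaluating the empirical Rademacher complexity. Thus we have

Lemma 1. $\hat{R}_S(G) = \frac{1}{2} \hat{R}_S(H)$.

Proof. From the definitions, we have

\begin{align*}
\hat{R}_S(G) &= \mathbb{E}_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{\{h(X_i) \ne Y_i\}} \\
&= \mathbb{E}_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i \, \frac{1 - Y_i h(X_i)}{2} \\
&= \mathbb{E}_\sigma \left[ \frac{1}{2n} \sum_{i=1}^n \sigma_i + \sup_{h \in H} \frac{1}{2n} \sum_{i=1}^n (-\sigma_i Y_i) h(X_i) \right] \\
&= \frac{1}{2} \mathbb{E}_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i) \\
&= \frac{1}{2} \hat{R}_S(H),
\end{align*}

where the second to last step follows from the facts that $\mathbb{E}[\sigma_i] = 0$ and that $\sigma_i$ and $-\sigma_i Y_i$ have the same distribution.

Now observe that $\mathbb{E}[g] = \mathbb{E}[\mathbf{1}_{\{h(X) \ne Y\}}] = R(h)$ when $g \in G$ is defined in terms of $h \in H$. Note also that $\frac{1}{n} \sum_{i=1}^n g(Z_i) = \hat{R}_n(h)$. This gives the following corollary:

Corollary 1. For all $\delta > 0$, with probability at least $1-\delta$,

\[
\forall h \in H, \quad R(h) \le \hat{R}_n(h) + R_n(H) + \sqrt{\frac{\ln(1/\delta)}{2n}},
\]

and with probability at least $1-\delta$,

\[
\forall h \in H, \quad R(h) \le \hat{R}_n(h) + \hat{R}_S(H) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.
\]

Remark. A two-sided version of this corollary also holds, with $\delta \to \delta/2$.

Example. Let $\Pi = \{A_1, \ldots, A_k\}$ be a fixed partition of $\mathcal{X}$, such as a regular partition or a recursive dyadic partition. Let $H = \{\text{classifiers that are constant on cells in } \Pi\}$. Then $|H| = 2^k$. We'll obtain a bound on the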
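empirical Rademacher complexity of $H$. Let $l(A)$ denote the label assigned to $A \in \Pi$. Then

\begin{align*}
\hat{R}_S(H) &= \mathbb{E}_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i) \\
&= \frac{1}{n} \mathbb{E}_\sigma \sup_{l} \sum_{j=1}^k l(A_j) \sum_{i : X_i \in A_j} \sigma_i \\
&= \frac{1}{n} \sum_{A \in \Pi} \mathbb{E}_\sigma \sup_{l(A) \in \{-1, 1\}} l(A) \sum_{i : X_i \in A} \sigma_i \\
&= \frac{1}{n} \sum_{A \in \Pi} \mathbb{E}_\sigma \left| \sum_{i : X_i \in A} \sigma_i \right|.
\end{align*}

Manipulating the terms inside the expectation gives

\begin{align*}
\mathbb{E}_\sigma \left| \sum_{i : X_i \in A} \sigma_i \right| &= \mathbb{E}_\sigma \sqrt{ \Big( \sum_{i : X_i \in A} \sigma_i \Big)^2 } \\
&\le \sqrt{ \mathbb{E}_\sigma \Big( \sum_{i : X_i \in A} \sigma_i \Big)^2 } && \text{(Jensen's inequality)} \\
&= \sqrt{ \sum_{i, j : X_i, X_j \in A} \mathbb{E}[\sigma_i \sigma_j] } \\
&= \sqrt{ \#\{i : X_i \in A\} },
\end{align*}

where the last line follows because

\[
\mathbb{E}[\sigma_i \sigma_j] = \begin{cases} 0, & i \ne j, \\ 1, & i = j. \end{cases}
\]

If $n_j = \#\{i : X_i \in A_j\}$, then

\[
\hat{R}_S(H) \le \frac{1}{n} \sum_{j=1}^k \sqrt{n_j} = \frac{1}{\sqrt{n}} \sum_{j=1}^k \sqrt{\hat{P}(A_j)},
\]

where $\hat{P}(A_j) = n_j / n$. The only inequality in the above derivation was Jensen's inequality, and by the Khintchine-Kahane inequality the reverse inequality holds if we include a multiplicative factor of $\sqrt{2}$, so the calculation is tight up to this factor. The Rademacher complexity in this example can actually be computed exactly in terms of binomial probabilities. This is left as an exercise.

4 Proof of VC Inequality

To prove the VC inequality, we will focus on bounding $R_n(H)$ in terms of the shatter coefficient.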
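Theorem 3 (Massart's Lemma). Let $A \subseteq \mathbb{R}^n$, $|A| < \infty$. Set $r = \max_{u \in A} \|u\|_2$. Then

\[
\mathbb{E}_\sigma \left[ \frac{1}{n} \sup_{u \in A} \sum_{i=1}^n \sigma_i u_i \right] \le \frac{r \sqrt{2 \ln |A|}}{n},
\]

where $u = (u_1, \ldots, u_n)^T$.

Proof. For all $t > 0$, we have that

\begin{align*}
\exp\left( t \, \mathbb{E}_\sigma \sup_{u \in A} \sum_{i=1}^n \sigma_i u_i \right)
&\le \mathbb{E}_\sigma \exp\left( t \sup_{u \in A} \sum_{i=1}^n \sigma_i u_i \right) && \text{(Jensen's inequality)} \\
&= \mathbb{E}_\sigma \sup_{u \in A} \exp\left( t \sum_{i=1}^n \sigma_i u_i \right) && \text{(the exponential is strictly increasing)} \\
&\le \sum_{u \in A} \mathbb{E}_\sigma \exp\left( t \sum_{i=1}^n \sigma_i u_i \right).
\end{align*}

The summand is a moment generating function (MGF). Due to independence,

\[
\mathbb{E}_\sigma \exp\left( t \sum_{i=1}^n \sigma_i u_i \right) = \prod_{i=1}^n \mathbb{E}_{\sigma_i} \exp( t \sigma_i u_i ) \le \prod_{i=1}^n \exp\left( t^2 (2 u_i)^2 / 8 \right),
\]

where the bound comes from the following lemma:

Lemma 2. Let $V$ be a random variable on $\mathbb{R}$ with $\mathbb{E}[V] = 0$ and $V \in [a, b]$ with probability one. Then for all $t > 0$,

\[
\mathbb{E}\left[ e^{tV} \right] \le e^{t^2 (b-a)^2 / 8}.
\]

This lemma was given and proved in the notes on Hoeffding's inequality, where it was used to prove Hoeffding's inequality. In our case, we used $V = \sigma_i u_i$, $a = -|u_i|$, and $b = |u_i|$.

Continuing with the proof of Massart's lemma,

\begin{align*}
\sum_{u \in A} \prod_{i=1}^n \exp\left( t^2 (2 u_i)^2 / 8 \right)
&= \sum_{u \in A} \exp\left( \frac{t^2}{2} \sum_{i=1}^n u_i^2 \right) \\
&= \sum_{u \in A} \exp\left( \frac{t^2 \|u\|_2^2}{2} \right) \\
&\le \sum_{u \in A} \exp\left( \frac{t^2 r^2}{2} \right) \\
&= |A| \exp\left( \frac{t^2 r^2}{2} \right).
\end{align*}

Taking the log of both sides and dividing by $t$ gives

\[
\mathbb{E}_\sigma \sup_{u \in A} \sum_{i=1}^n \sigma_i u_i \le \frac{\ln |A|}{t} + \frac{t r^2}{2} = r \sqrt{2 \ln |A|},
\]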
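where the last step follows from choosing $t = \sqrt{2 \ln |A|} / r$. Dividing both sides by $n$ completes the proof.

This theorem is the key result that bridges the gap between VC theory and Rademacher complexity. We first state and prove a one-sided version of the VC inequality.

Theorem 4 (One-sided VC Inequality). For $0 < \delta < 1$, with probability at least $1-\delta$,

\[
\forall h \in H, \quad R(h) \le \hat{R}_n(h) + \sqrt{ \frac{ 8 \big( \ln S_H(n) + \ln(1/\delta) \big) }{n} }.
\]

Equivalently, for any $\epsilon > 0$,

\[
\Pr\left( \sup_{h \in H} \big( R(h) - \hat{R}_n(h) \big) \ge \epsilon \right) \le S_H(n) \, e^{-n \epsilon^2 / 8}. \tag{6}
\]

Proof. Let $H \subseteq \{-1, 1\}^{\mathcal{X}}$ and $S = (X_1, \ldots, X_n) \in \mathcal{X}^n$. Denote $H_S = \{ (h(X_1), \ldots, h(X_n)) : h \in H \}$. If $u \in H_S$, then $\|u\|_2 = \sqrt{n}$. By Massart's lemma,

\begin{align*}
R_n(H) &= \mathbb{E}_S \, \mathbb{E}_\sigma \left[ \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i) \,\Big|\, S \right] \\
&\le \mathbb{E}_S \sqrt{ \frac{2 \ln |H_S|}{n} } \\
&\le \sqrt{ \frac{2 \ln \mathbb{E}_S |H_S|}{n} } && \text{(Jensen's inequality)} \tag{7} \\
&\le \sqrt{ \frac{2 \ln S_H(n)}{n} }, \tag{8}
\end{align*}

where the last step follows from the fact that $|H_S| \le S_H(n)$. From Corollary 1 we deduce that with probability at least $1-\delta$,

\[
\forall h \in H, \quad R(h) \le \hat{R}_n(h) + \sqrt{ \frac{2 \ln S_H(n)}{n} } + \sqrt{ \frac{\ln(1/\delta)}{2n} }.
\]

We now observe that for $a, b \ge 0$, $\sqrt{a} + \sqrt{b} \le \sqrt{a+b} + \sqrt{a+b} = 2 \sqrt{a+b}$. Therefore, for $0 < \delta < 1$, with probability at least $1-\delta$,

\begin{align*}
\forall h \in H, \quad R(h) &\le \hat{R}_n(h) + \sqrt{ \frac{ 8 \big( \ln S_H(n) + \ln(1/\delta)/4 \big) }{n} } \tag{9} \\
&\le \hat{R}_n(h) + \sqrt{ \frac{ 8 \big( \ln S_H(n) + \ln(1/\delta) \big) }{n} }.
\end{align*}

This establishes the first part of the theorem. To establish the second part, set the right-hand side equal to $\epsilon$ and solve for $\delta$ (if no such $\delta$ exists, then the bound holds trivially).

Remark. Note that step (7) is not necessary. We could have gone directly to (8) using the definition of the shatter coefficient. However, the intermediate result gives a uniform deviation bound in terms of the expected cardinality of $H_S$, which we used to study monotone layers and convex sets in the lecture on VC Theory.

Finally, we state the standard two-sided VC inequality, whose proof is left as an exercise.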
Theorem 5 (Two-sided VC Inequality). For $0 < \delta < 1$, with probability at least $1-\delta$,

\[
\forall h \in H, \quad \big| R(h) - \hat{R}_n(h) \big| \le \sqrt{ \frac{ 8 \big( \ln S_H(n) + \ln(2/\delta) \big) }{n} }.
\]

Equivalently, for any $\epsilon > 0$,

\[
\Pr\left( \sup_{h \in H} \big| R(h) - \hat{R}_n(h) \big| \ge \epsilon \right) \le 2 S_H(n) \, e^{-n \epsilon^2 / 8}.
\]

Exercises

1. Can you improve the constants in the empirical Rademacher complexity bound (2) through a single, direct application of the bounded difference inequality?

2. Determine an exact formula for the empirical Rademacher complexity of the set of classifiers based on a fixed partition (see the example above).

3. Let $G, G_1, G_2$ denote arbitrary classes of functions $\mathcal{Z} \to [a, b]$, and let $c, d$ be arbitrary real numbers. Show
(a) $\hat{R}_S(cG + d) = |c| \hat{R}_S(G)$, where $cG + d := \{ g'(z) = c g(z) + d \mid g \in G \}$.
(b) $\hat{R}_S(\mathrm{conv}\, G) = \hat{R}_S(G)$, where $\mathrm{conv}\, G := \{ \sum_{i=1}^N \alpha_i g_i \mid N \in \mathbb{N}, \alpha_i \ge 0, \sum_i \alpha_i = 1, g_i \in G \}$.
(c) $\hat{R}_S(G_1 + G_2) = \hat{R}_S(G_1) + \hat{R}_S(G_2)$, where $G_1 + G_2 := \{ g(z) = g_1(z) + g_2(z) \mid g_1 \in G_1, g_2 \in G_2 \}$.

4. Two-sided uniform deviation bounds.
(a) Prove Theorem 2. Hint: Apply the one-sided Rademacher bound again to $-G$.
(b) Prove Theorem 5. Hint: Observe that $R(-h) = 1 - R(h)$, and similarly for the empirical risk.
(c) Show that if $G = -G$, then the two-sided Rademacher bound holds with the same constants as the one-sided version. In particular, the substitution $\delta \to \delta/2$ is unnecessary.
(d) Show that if $H = -H$, then the two-sided VC inequality holds with the same constants as the one-sided version. In particular, the substitution $\delta \to \delta/2$ is unnecessary.

5. Use inequality (9) to improve the constant in the exponent of (6) at the expense of a larger term in front of the exponential.
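As a numerical companion to the VC inequalities (Theorems 4 and 5), the sketch below evaluates both equivalent forms: the deviation width for a given $\delta$, and the tail probability bound for a given $\epsilon$. It is an illustration and not part of the original notes. The function names `vc_width` and `vc_tail` are hypothetical, and the caller is assumed to supply $\ln S_H(n)$ or an upper bound on it; for the partition class of Section 3, $|H| = 2^k$ gives the upper bound $k \ln 2$.

```python
import math


def vc_width(log_shatter, n, delta, two_sided=True):
    """sqrt(8 * (ln S_H(n) + ln(c / delta)) / n).

    c = 2 gives the two-sided bound, c = 1 the one-sided bound.
    log_shatter is ln S_H(n), or any upper bound on it.
    """
    c = 2.0 if two_sided else 1.0
    return math.sqrt(8.0 * (log_shatter + math.log(c / delta)) / n)


def vc_tail(log_shatter, n, eps, two_sided=True):
    """c * S_H(n) * exp(-n * eps^2 / 8), capped at 1.

    The cap is harmless because the quantity bounds a probability.
    """
    c = 2.0 if two_sided else 1.0
    return min(1.0, c * math.exp(log_shatter - n * eps ** 2 / 8.0))


if __name__ == "__main__":
    k, n, delta = 10, 5000, 0.05
    log_shatter = k * math.log(2.0)  # upper bound from |H| = 2^k
    eps = vc_width(log_shatter, n, delta)
    # Plugging the width back into the tail bound recovers delta,
    # which is the equivalence stated in the theorems.
    print(eps, vc_tail(log_shatter, n, eps))
```

Working with $\ln S_H(n)$ rather than $S_H(n)$ itself avoids overflow when the shatter coefficient is large.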