Learnability with Rademacher Complexities

Daniel Khashabi
Fall 2013
Last Update: September 26, 2016

1 Introduction

Our goal in the study of passive supervised learning is to find a hypothesis $h$, based on a set of examples, that has small error with respect to some target function. One can improve generalization by controlling the complexity of the concept class $\mathcal{H}$ from which we choose a hypothesis. One way to achieve this is via the ideas of VC dimension (see http://web.engr.illinois.edu/~khashab2/learn/vc.pdf). Here we introduce Rademacher complexity as another way of handling hypothesis-space complexity and, as a result, of deriving generalization bounds.

Our results differ from those in the discussion of VC dimension in two major ways. First, VC dimension is independent of the data distribution. In other words, its guarantees hold for any data distribution; on the other hand, the bound that it gives might not be tight for certain data distributions. Second, the analysis of VC dimension bounds applies to discrete problems (such as classification), and it does not state anything about problems like regression.

2 Rademacher Averages/Complexities

Here we define Rademacher complexity, which will be used in bounding risk functions.

Definition 2.1 (Rademacher Average). Let $\mathcal{H} \subseteq \mathcal{F} = \{f : X \to \mathbb{R}\}$ be the class of functions we are exploring, defined on a domain $X \subseteq \mathcal{X}$, and let $S = \{x_i\}_{i=1}^n$ be a set of samples generated by some unknown distribution $D_X$ on the same domain $X$. Define $\sigma_i$ to be a uniform random variable on $\{\pm 1\}$, for every $i$. The empirical Rademacher average (or complexity) is defined as follows, implicitly assuming that the supremum over the function class $\mathcal{H}$ is measurable:

$$\hat{R}_S(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right| \;\Big|\; \{x_i\}\right] = \frac{1}{n} E_\sigma\left[\sup_{f \in \mathcal{H}} \left|\sum_{i=1}^n \sigma_i f(x_i)\right| \;\Big|\; \{x_i\}\right]$$

The expectation of the above measure with respect to the random samples $\{x_i\}$ is called the Rademacher average (or complexity):

$$R_n(\mathcal{H}) = E\,\hat{R}_S(\mathcal{H}) = E_{S,\sigma}\left[\sup_{f \in \mathcal{H}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right|\right]$$
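There is a similar definition without the absolute values, which has similar properties:

$$\hat{R}^a_S(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i) \;\Big|\; \{x_i\}\right]$$

and

$$R^a_n(\mathcal{H}) = E\,\hat{R}^a_S(\mathcal{H}) = E_{S,\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right]$$

Another way of writing the Rademacher complexity is the following:

$$\hat{R}_S(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \left|\frac{1}{n}\, f_S \cdot \sigma\right| \;\Big|\; \{x_i\}\right]$$

where $f_S = (f(x_1), \ldots, f(x_n))$ and $\sigma = (\sigma_1, \ldots, \sigma_n)$. The dot product $f_S \cdot \sigma$ measures the correlation between the function values and the random noise vector. Overall, the Rademacher complexity measures how well the function class $\mathcal{H}$ can correlate with random noise. The richer the hypothesis class is, the better it can correlate with random noise.

Here are some useful properties of the Rademacher averages.

Lemma 2.1. For any $\{x_i\}$ and for any function classes $\mathcal{F}$ and $\mathcal{H}$ that map $X$ to $\mathbb{R}$:

1. If $\mathcal{H} \subseteq \mathcal{F}$ then $\hat{R}_S(\mathcal{H}) \le \hat{R}_S(\mathcal{F})$.
2. For any function $h : X \to \mathbb{R}$, $\hat{R}^a_S(\mathcal{F} + h) = \hat{R}^a_S(\mathcal{F})$.
3. If $\mathrm{cvx}(\mathcal{F}) = \{x \mapsto E_{f \sim \pi} f(x) : \pi \in \Delta(\mathcal{F})\}$, where $\Delta(\mathcal{F})$ denotes the probability distributions over $\mathcal{F}$, then $\hat{R}^a_S(\mathcal{F}) = \hat{R}^a_S(\mathrm{cvx}(\mathcal{F}))$.
4. $\hat{R}^a_S(\mathcal{F} + \mathcal{H}) = \hat{R}^a_S(\mathcal{F}) + \hat{R}^a_S(\mathcal{H})$.

Proof. We prove each proposition in turn.

1. Enlarging the class can only increase the supremum:

$$\hat{R}_S(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right| \;\Big|\; \{x_i\}\right] \le E_\sigma\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right| \;\Big|\; \{x_i\}\right] = \hat{R}_S(\mathcal{F})$$

2. The term involving $h$ does not depend on $f$ and has zero mean:

$$\hat{R}^a_S(\mathcal{F} + h) = E_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i \big(f(x_i) + h(x_i)\big) \;\Big|\; \{x_i\}\right] = E_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i) \;\Big|\; \{x_i\}\right] + E_\sigma\left[\frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i) \;\Big|\; \{x_i\}\right] = \hat{R}^a_S(\mathcal{F}) + 0 = \hat{R}^a_S(\mathcal{F})$$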
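3. The supremum of a linear functional over the convex set is attained at the corners of the set (the point masses on a single $f$), which gives the third equality below:

$$\hat{R}^a_S(\mathrm{cvx}(\mathcal{F})) = E_\sigma\left[\sup_{\pi \in \Delta(\mathcal{F})} \frac{1}{n}\sum_{i=1}^n \sigma_i\, E_{f \sim \pi} f(x_i) \;\Big|\; \{x_i\}\right] = E_\sigma\left[\sup_{\pi \in \Delta(\mathcal{F})} E_{f \sim \pi} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i) \;\Big|\; \{x_i\}\right] = E_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i) \;\Big|\; \{x_i\}\right] = \hat{R}^a_S(\mathcal{F})$$

4. The supremum over pairs $(f, h)$ separates into two independent suprema:

$$\hat{R}^a_S(\mathcal{F} + \mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{F},\, h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i \big(f(x_i) + h(x_i)\big) \;\Big|\; \{x_i\}\right] = E_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i) \;\Big|\; \{x_i\}\right] + E_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i) \;\Big|\; \{x_i\}\right] = \hat{R}^a_S(\mathcal{F}) + \hat{R}^a_S(\mathcal{H})$$

Lemma 2.2. Let $F(x)$ be a real-valued CDF, and let $\mathcal{F}$ be the class of indicator functions of half-intervals, which define the empirical CDF

$$\hat{F}_S(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$$

with $S = (X_1 = x_1, \ldots, X_n = x_n)$. Then

$$E_S \sup_{x \in \mathbb{R}} \left|\hat{F}_S(x) - F(x)\right| \le 2 R_n(\mathcal{F})$$

Proof. The trick commonly used here is to convert the expectation into an empirical mean by introducing fake (ghost) samples $S' = (X'_1, \ldots, X'_n)$, followed by symmetrization:

$$E_S \sup_{x \in \mathbb{R}} \left|\hat{F}_S(x) - F(x)\right| = E_S \sup_{x \in \mathbb{R}} \left|\hat{F}_S(x) - E_{S'} \hat{F}_{S'}(x)\right| \le E_{S,S'} \sup_{x \in \mathbb{R}} \left|\hat{F}_S(x) - \hat{F}_{S'}(x)\right|$$

$$= E_{S,S'} \sup_{x \in \mathbb{R}} \left|\frac{1}{n}\sum_{i=1}^n \big(\mathbf{1}\{X_i \le x\} - \mathbf{1}\{X'_i \le x\}\big)\right| \stackrel{d}{=} E_{S,S',\sigma} \sup_{x \in \mathbb{R}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i \big(\mathbf{1}\{X_i \le x\} - \mathbf{1}\{X'_i \le x\}\big)\right|$$

$$\le 2 E_{S,\sigma} \sup_{x \in \mathbb{R}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i \mathbf{1}\{X_i \le x\}\right| = 2 R_n(\mathcal{F}), \quad \text{for } \mathcal{F} = \text{half-intervals}$$

It turns out that this observation is general: the same bounding technique extends to any loss function, as the following lemma shows.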
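Lemma 2.3. Given a class of functions $\mathcal{H} = \{f : X \to \mathbb{R}\}$ defined on a domain $X \subseteq \mathcal{X}$, we have the following general bound in terms of the Rademacher average:

$$E_S \sup_{f \in \mathcal{H}} \left|E f - \hat{E}_S f\right| \le 2 R_n(\mathcal{H})$$

with $S = (x_1, \ldots, x_n)$ and $\hat{E}_S f = \frac{1}{n}\sum_{i=1}^n f(x_i)$.

Proof. The steps of the previous proof carry over with minor changes. Again, we convert the expectation into an empirical mean by introducing fake (ghost) samples $S' = (x'_1, \ldots, x'_n)$, followed by symmetrization:

$$E_S \sup_{f \in \mathcal{H}} \left|E f - \hat{E}_S f\right| = E_S \sup_{f \in \mathcal{H}} \left|E_{S'} \hat{E}_{S'} f - \hat{E}_S f\right| \le E_{S,S'} \sup_{f \in \mathcal{H}} \left|\hat{E}_{S'} f - \hat{E}_S f\right|$$

$$= E_{S,S',\sigma} \sup_{f \in \mathcal{H}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i \big(f(x'_i) - f(x_i)\big)\right| \le 2 E_{S,\sigma} \sup_{f \in \mathcal{H}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right| = 2 R_n(\mathcal{H})$$

With the following lemma we show how to generalize Rademacher averages using Lipschitz maps.

Lemma 2.4 (Ledoux-Talagrand contraction). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex and increasing function. Also let $\phi_i : \mathbb{R} \to \mathbb{R}$ satisfy $\phi_i(0) = 0$ and be Lipschitz with constant $L$ (for any $x, y \in \mathbb{R}$, $|\phi_i(x) - \phi_i(y)| \le L |x - y|$). For any $T \subseteq \mathbb{R}^n$,

$$E_\sigma f\left(\frac{1}{2} \sup_{t \in T} \left|\sum_{i=1}^n \sigma_i \phi_i(t_i)\right|\right) \le E_\sigma f\left(L \cdot \sup_{t \in T} \left|\sum_{i=1}^n \sigma_i t_i\right|\right)$$

Proof. The proof uses the definition of the Rademacher average and properties of convex functions.

The above lemma results in the following bound:

Corollary 2.5. Let $\mathcal{F}$ be a class of functions with domain $X$ and let $\phi(\cdot)$ be an $L$-Lipschitz map from $\mathbb{R}$ to $\mathbb{R}$ with $\phi(0) = 0$. The composition of the map with the functions is defined as $\phi \circ \mathcal{F} = \{\phi \circ f \mid f \in \mathcal{F}\}$. Then

$$R_n(\phi \circ \mathcal{F}) \le 2 L R_n(\mathcal{F})$$

Proof. In the previous lemma, take the convex increasing function to be the identity function.

2.1 Rademacher complexity of linear class

Here we analyze the Rademacher complexity of the following linear classes. These results will come in handy when analyzing the generalization bounds of many forthcoming problems that involve linear models. Define the following classes:

$$\mathcal{H}_1 = \{x \mapsto \langle x, w \rangle : \|w\|_1 \le 1\}, \qquad \mathcal{H}_2 = \{x \mapsto \langle x, w \rangle : \|w\|_2 \le 1\}$$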
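For the two linear classes just defined, the supremum over the unit ball has a closed form by norm duality: over $\|w\|_2 \le 1$ it equals the $\ell_2$ norm of $\sum_i \sigma_i x_i$, and over $\|w\|_1 \le 1$ it equals the $\ell_\infty$ norm. The sketch below is an illustration rather than part of the original notes. It computes the empirical Rademacher average (without absolute values) exactly by enumerating all $2^n$ sign vectors, which is only feasible for small $n$. The function names and the toy sample are hypothetical.

```python
import itertools
import math


def signed_sum(xs, signs):
    """Return sum_i signs[i] * xs[i] as a list of coordinates."""
    dim = len(xs[0])
    return [sum(s * x[k] for s, x in zip(signs, xs)) for k in range(dim)]


def exact_rademacher_linear(xs, ball="l2"):
    """Exact empirical Rademacher average (no absolute value) of a unit-ball linear class.

    For ball="l2" the supremum over ||w||_2 <= 1 equals the l2 norm of the
    signed sum; for ball="l1" it equals the l-infinity norm (dual norms).
    The expectation over sigma is computed by enumerating all 2^n sign
    vectors, so this is only practical for small n.
    """
    n = len(xs)
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        v = signed_sum(xs, signs)
        if ball == "l2":
            total += math.sqrt(sum(c * c for c in v))
        elif ball == "l1":
            total += max(abs(c) for c in v)
        else:
            raise ValueError("ball must be 'l1' or 'l2'")
    return total / (2 ** n) / n


# Hypothetical toy sample: n = 4 points in the plane.
sample = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 2.0)]
r_l2 = exact_rademacher_linear(sample, "l2")
r_l1 = exact_rademacher_linear(sample, "l1")
max_l2_norm = max(math.sqrt(a * a + b * b) for a, b in sample)

# Compare the exact l2 value with max_i ||x_i||_2 / sqrt(n).
print(r_l2, "vs", max_l2_norm / math.sqrt(len(sample)))
print(r_l1, "vs", r_l2)
```

The first printed pair compares the exact value for $\mathcal{H}_2$ with the quantity $\max_i \|x_i\|_2 / \sqrt{n}$ that appears in Lemma 2.6. The second pair illustrates that the $\ell_1$-ball class is never richer than the $\ell_2$-ball class on the same sample, since $\|v\|_\infty \le \|v\|_2$.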
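Lemma 2.6. Let $S = (x_1, \ldots, x_n)$. Then

$$R_n(\mathcal{H}_2 \circ S) \le \frac{\max_i \|x_i\|_2}{\sqrt{n}}$$

Proof. The class $\mathcal{H}_2$ is symmetric, so the averages with and without absolute values coincide. By the Cauchy-Schwarz inequality and then Jensen's inequality:

$$R^a_n(\mathcal{H}_2 \circ S) = E_\sigma \sup_{w : \|w\|_2 \le 1} \frac{1}{n}\sum_{i=1}^n \sigma_i \langle x_i, w \rangle = E_\sigma \sup_{w : \|w\|_2 \le 1} \left\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \right\rangle \le \frac{1}{n} E_\sigma \left\|\sum_{i=1}^n \sigma_i x_i\right\|_2 \le \frac{1}{n} \sqrt{E_\sigma \left\|\sum_{i=1}^n \sigma_i x_i\right\|_2^2}$$

Since the Rademacher random variables are independent of each other, we have:

$$E_\sigma \left\|\sum_{i=1}^n \sigma_i x_i\right\|_2^2 = E_\sigma \sum_{i,j} \sigma_i \sigma_j \langle x_i, x_j \rangle = \sum_{i \ne j} \langle x_i, x_j \rangle E_\sigma[\sigma_i \sigma_j] + \sum_i \langle x_i, x_i \rangle E_\sigma[\sigma_i^2] = \sum_i \|x_i\|_2^2 \le n \max_i \|x_i\|_2^2$$

Substituting this into the previous display gives $R^a_n(\mathcal{H}_2 \circ S) \le \frac{1}{n}\sqrt{n \max_i \|x_i\|_2^2} = \max_i \|x_i\|_2 / \sqrt{n}$.

Lemma 2.7. Let $S = (x_1, \ldots, x_n)$, where each $x_i$ has dimension $d$. Then

$$R^a_n(\mathcal{H}_1 \circ S) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}$$

Proof. By Hölder's inequality:

$$R^a_n(\mathcal{H}_1 \circ S) = E_\sigma \sup_{w : \|w\|_1 \le 1} \frac{1}{n}\sum_{i=1}^n \sigma_i \langle x_i, w \rangle = E_\sigma \sup_{w : \|w\|_1 \le 1} \left\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \right\rangle \le \frac{1}{n} E_\sigma \left\|\sum_{i=1}^n \sigma_i x_i\right\|_\infty$$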
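The last step, bounding $E_\sigma \|\sum_i \sigma_i x_i\|_\infty$, is done via the finite class lemma (see Lemma 4.1).

3 Generalization bounds

Here is the main theorem, which contains the generalization bounds via Rademacher complexity. Throughout, $\hat{E}_n f(X) = \frac{1}{n}\sum_{i=1}^n f(x_i)$ denotes the empirical mean.

Theorem 3.1. Let $\mathcal{F}$ be a class of functions defined on domain $X$ and mapping to $[0, 1]$. For any $\delta \in (0, 1)$ and for any $f \in \mathcal{F}$:

$$E f(X) \le \hat{E}_n f(X) + 2 R_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta$$

Also for any $f \in \mathcal{F}$:

$$E f(X) \le \frac{1}{n}\sum_{i=1}^n f(x_i) + 2 \hat{R}_S(\mathcal{F}) + 5 \sqrt{\frac{\log(2/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta$$

Similar results can be found with the slightly different definition of the Rademacher average:

Theorem 3.2. Let $\mathcal{F}$ be a class of functions defined on domain $X$ and mapping to $[0, 1]$. For any $\delta \in (0, 1)$ and for any $f \in \mathcal{F}$:

$$E f(X) \le \hat{E}_n f(X) + 2 R^a_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta$$

Also for any $f \in \mathcal{F}$:

$$E f(X) \le \frac{1}{n}\sum_{i=1}^n f(x_i) + 2 \hat{R}^a_S(\mathcal{F}) + 3 \sqrt{\frac{\log(2/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta$$

A side note before jumping into the proof: in practice the set $\mathcal{F}$ is usually a composition of the input space $X$, the hypothesis functions $\mathcal{H}$, and the loss family $\ell$ which measures the quality of the learning:

$$\mathcal{F} = \ell \circ \mathcal{H} \circ S$$

For example, for SVM, $\mathcal{H}$ is the space of linear classifiers and $\ell$ is a margin-based (hard/soft) loss. Another issue worth pointing out is that here we assumed that the range of the functions in $\mathcal{F}$ is bounded inside $[0, 1]$. If instead the functions range in $[0, c]$, a coefficient $c$ appears before $\sqrt{\log(2/\delta)/(2n)}$ (easy to verify through the proof).

Proof. For a sample set $S = (x_1, x_2, \ldots, x_n)$, define the following function:

$$\Phi_S(\mathcal{F}) = \sup_{f \in \mathcal{F}} \left( E f - \frac{1}{n}\sum_{i=1}^n f(x_i) \right)$$

The proof applies McDiarmid's bound to the function $\Phi_S(\mathcal{F})$. Define the sample set $S'$ to be exactly the same as $S$, except for one differing sample ($x'_j$ in place of $x_j$). Then

$$\Phi_S(\mathcal{F}) - \Phi_{S'}(\mathcal{F}) \le \sup_{f \in \mathcal{F}} \frac{1}{n}\left( \sum_{x_i \in S'} f(x_i) - \sum_{x_i \in S} f(x_i) \right) = \sup_{f \in \mathcal{F}} \frac{1}{n}\big( f(x'_j) - f(x_j) \big)$$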
We used the fact that the supremum of a difference is at least the difference of the supremums. We also implicitly assumed that the function is bounded between 0 and 1. Hence we proved that

$$\left|\Phi_S(\mathcal{F}) - \Phi_{S'}(\mathcal{F})\right| \le \frac{1}{n}$$

Using this bounded-differences property of $\Phi$ and McDiarmid's inequality we have:

$$\Phi_S(\mathcal{F}) \le E_S \Phi_S(\mathcal{F}) + \sqrt{\frac{\log(2/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta/2$$

Note that by Lemma 2.3 we know

$$E_S \Phi_S(\mathcal{F}) \le 2 R_n(\mathcal{F})$$

which gives the first inequality (with $\delta/2$ replaced by $\delta$). To get the second inequality, we apply the McDiarmid bound to the Rademacher definition:

$$R_n(\mathcal{F}) \le \hat{R}_S(\mathcal{F}) + \sqrt{\frac{\log(2/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta/2$$

Combining this with the previous result through a union bound gives the second inequality in the statement of the theorem.

3.1 Concentration bounds for binary classification

We start with a few examples, and then move to more general theorems.

Example 3.3. Let $f : X \to \{0, 1\}$, and let $(X_i, Y_i) \in X \times \{0, 1\}$ be i.i.d. random samples from the joint distribution $P_{XY}$. Consider the empirical risk defined as

$$L_n(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{f(X_i) \ne Y_i\}$$

1. Prove that for any $f \in \mathcal{F}$,

$$L(f) \le L_n(f) + \sqrt{\frac{2 L(f) \log(1/\delta)}{n}} + \frac{2 \log(1/\delta)}{3n} \qquad (1)$$

with probability at least $1 - \delta$. Hint: Use Bernstein's inequality.

2. Use the result of the previous part to show that, for any $f \in \mathcal{F}$,

$$L(f) \le L_n(f) + \sqrt{\frac{2 L_n(f) \log(1/\delta)}{n}} + \frac{4 \log(1/\delta)}{n}$$

with probability at least $1 - \delta$. Use this to prove that if the ERM solution predicts every training example correctly, i.e., if $L_n(\hat{f}_n) = 0$, then

$$L(\hat{f}_n) \le \frac{4 \log(|\mathcal{F}|/\delta)}{n}$$

with probability at least $1 - \delta$. This bound also holds when the relationship between $X$ and $Y$ is deterministic. Hint: Use the fact that, for any $a, b, c \in \mathbb{R}_+$ with $a \le b + c\sqrt{a}$, we have $a \le b + c^2 + c\sqrt{b}$.
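Inequality (1) can be sanity-checked by simulation before proving it. The sketch below is an illustration, not part of the original exercise. For a fixed $f$ the error indicators are i.i.d. Bernoulli with parameter $p = L(f)$, so the code draws many samples of size $n$ and counts how often $L(f) - L_n(f)$ exceeds the right-hand side of (1). The parameter values and function names are hypothetical.

```python
import math
import random


def bernstein_radius(p, n, delta):
    """Right-hand side of inequality (1) minus L_n(f), with L(f) = p."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * p * log_term / n) + 2.0 * log_term / (3.0 * n)


def violation_rate(p, n, delta, trials, seed=0):
    """Fraction of simulated samples in which L(f) - L_n(f) exceeds the radius."""
    rng = random.Random(seed)
    radius = bernstein_radius(p, n, delta)
    violations = 0
    for _ in range(trials):
        # Each error indicator 1{f(X_i) != Y_i} is Bernoulli(p) for a fixed f.
        empirical_risk = sum(rng.random() < p for _ in range(n)) / n
        if p - empirical_risk > radius:
            violations += 1
    return violations / trials


rate = violation_rate(p=0.2, n=200, delta=0.1, trials=2000)
print("violation rate:", rate, "target delta:", 0.1)
```

If (1) holds, the printed violation rate should not exceed $\delta$ beyond simulation noise. In practice it tends to be much smaller, because Bernstein's inequality is conservative.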
Lemma 3.4 (Bernstein's inequality). If $U_1, \ldots, U_n$ are i.i.d. Bernoulli random variables with parameter $p$, then for any $\epsilon > 0$,

$$P\left( \frac{1}{n}\sum_{i=1}^n U_i < p - \epsilon \right) \le \exp\left( -\frac{n \epsilon^2}{2p + 2\epsilon/3} \right) \qquad (2)$$

3.2 Generalization bound for hard SVM using Rademacher complexity

Here we prove a generalization bound for hard SVM. We resort to Theorem 3.2, which contains the generalization bounds based on the definition of the Rademacher average without absolute values. For SVM, the hypothesis space is a class of linear predictors:

$$\mathcal{H} = \{\langle w, x \rangle : w \in \mathbb{R}^d\}$$

with the hinge loss $\ell(x, y; w) = \max\{0, 1 - y \langle w, x \rangle\}$ as the loss function. Define

$$\mathcal{F} = \ell \circ \mathcal{H} \circ S = \{(\ell(x_1, y_1; w), \ell(x_2, y_2; w), \ldots, \ell(x_n, y_n; w))\}$$

Since the hinge loss is 1-Lipschitz, and assuming that $\|x\| \le R$ and $\|w\| \le B$, using Corollary 2.5 together with Lemma 2.6 we have:

$$R_n(\mathcal{F}) \le \frac{BR}{\sqrt{n}}$$

In general, for any $\rho$-Lipschitz function, $R_n(\mathcal{F}) \le \rho B R / \sqrt{n}$. Plugging this into Theorem 3.2 we get the following risk bound for SVM:

$$E f(X) \le \hat{E}_n f(X) + \frac{2BR}{\sqrt{n}} + \sqrt{\frac{\log(1/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta$$

So how should we interpret this? Suppose we know the minimizer of the empirical risk, which we denote by $w^*$ and which has zero empirical risk. Taking $B = \|w^*\|$, $\mathcal{H}$ can simply be the set of linear classifiers with norm at most $B$. Then the risk bound can be refined to

$$E f(X) \le \hat{L} + \frac{2 R \|w^*\|}{\sqrt{n}} + \big(1 + R \|w^*\|\big) \sqrt{\frac{\log(1/\delta)}{2n}}, \quad \text{with probability at least } 1 - \delta$$

And note that $\mathcal{F}$ is $(B\,\|w^*\|)$-Lipschitz. With this risk bound, one can show that the sample complexity of hard SVM is on the order of $R^2 \|w^*\|^2 / \epsilon^2$.

In practice $w^*$ is not known. One way to fix this is to use the doubling trick on the weight-vector size bound $B$. Suppose $B_i = 2^i$, let $\mathcal{H}_i$ be all the linear models with weight norm less than $B_i$, and assign each level $i$ a confidence parameter $\delta_i$ such that the $\delta_i$ sum to at most $\delta$. For each $i$ we can write an inequality for the risk. A union bound over all of these inequalities gives a unified bound which holds for all $w$.

4 Glivenko-Cantelli Theorem

The Glivenko-Cantelli theorem guarantees uniform convergence bounds on the empirical risk of the distributions. Our characterization of GC is based on Rademacher averages and the finite class lemma, though this is not the only way to derive these results. First we introduce the finite class lemma, which is a tool for bounding Rademacher averages.
Footnote to Section 3.2: details on the basic formulations of hard SVM are here: http://web.engr.illinois.edu/~khashab2/learn/svm.pdf
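Lemma 4.1 (Finite Class Lemma (Massart)). Let $A$ be some finite subset of $\mathbb{R}^m$, let $\{\sigma_i\}_{i=1}^m$ be independent Rademacher random variables, and let $L = \sup_{a \in A} \|a\|_2$. Writing $a = (x_1, \ldots, x_m)$ for the coordinates of an element of $A$ (the dimension $m$ plays the role of the sample size),

$$R_n(A) = E \sup_{a \in A} \frac{1}{m}\sum_{i=1}^m \sigma_i x_i \le \frac{2 L \sqrt{\log |A|}}{m}$$

Proof. Define

$$\mu = E \sup_{a \in A} \sum_{i=1}^m \sigma_i x_i = m R_n(A)$$

For any $\lambda > 0$, Jensen's inequality and the monotonicity of the exponential give

$$e^{\lambda \mu} \le E \exp\left( \lambda \sup_{a \in A} \sum_{i=1}^m \sigma_i x_i \right) = E \sup_{a \in A} \exp\left( \lambda \sum_{i=1}^m \sigma_i x_i \right) \le \sum_{a \in A} E \exp\left( \lambda \sum_{i=1}^m \sigma_i x_i \right)$$

$$= \sum_{a \in A} \prod_{i=1}^m E \exp(\lambda \sigma_i x_i) = \sum_{a \in A} \prod_{i=1}^m \frac{\exp(-\lambda x_i) + \exp(\lambda x_i)}{2} \le \sum_{a \in A} \prod_{i=1}^m \exp(\lambda^2 x_i^2 / 2) \le \sum_{a \in A} \exp(\lambda^2 L^2 / 2) = |A| \exp(\lambda^2 L^2 / 2)$$

Taking logarithms,

$$\mu \le \frac{\ln |A|}{\lambda} + \frac{\lambda L^2}{2}$$

Set $\lambda = \sqrt{2 \ln |A| / L^2}$, and we will have

$$\mu \le L \sqrt{2 \ln |A|}$$

Dividing by $m$ gives $R_n(A) \le L \sqrt{2 \ln |A|} / m \le 2 L \sqrt{\ln |A|} / m$.

More details: TBW

The finite class lemma can be specialized to classes of binary-valued functions. Let $\mathcal{F}$ be a class of binary-valued functions, $\mathcal{F} = \{f : Z \to \{0, 1\}\}$. Given random samples $\{Z_i\}_{i=1}^n$, define $\mathcal{F}(Z^n) \triangleq \{(f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F}\}$. We obtain the following Rademacher bound for this class of functions.

Lemma 4.2 (Rademacher bound for binary-valued functions). For a class of binary-valued functions $\mathcal{F}$,

$$R_n(\mathcal{F}(Z^n)) \le 2 \sqrt{\frac{\log |\mathcal{F}(Z^n)|}{n}}$$

Proof. The proof is in Section 7.

Theorem 4.3 (Glivenko-Cantelli). Let $X_1, \ldots, X_n$ be i.i.d. samples with CDF $F$, and let

$$F_n(x) \triangleq \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$$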
Then, as $n$ grows,

$$\sup_x \left|F_n(x) - F(x)\right| \xrightarrow{\text{a.s.}} 0$$

Proof. The proof consists of two main parts: first, using the Rademacher average to bound the risk, and second, using the finite class lemma to bound the Rademacher average. More details for later.

5 Bibliographical notes

The first use of Rademacher complexity for risk bounds is probably due to [1, 2].

References

[1] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

[2] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1-50, 2002.
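6 Appendix: Union bound for risk

Let's assume we have proven the following bound for any fixed $f \in \mathcal{F}$:

$$p\big( L(f) - L_n(f) \ge a(\delta) \big) \le \delta$$

which is equivalent to

$$L_n(f) \ge L(f) - b(\delta), \quad \text{with probability at least } 1 - \delta \qquad (3)$$

for some values $a, b$ (functions of the parameters). Then

$$p\big( \exists f \in \mathcal{F} : L_n(f) = 0,\ L(f) \ge a \big) \le |\mathcal{F}|\, \delta$$

or, equivalently, simultaneously for all $f \in \mathcal{F}$,

$$L_n(f) \ge L(f) - b(\delta / |\mathcal{F}|), \quad \text{with probability at least } 1 - \delta$$

Proof. By the union bound,

$$p\big( \exists f \in \mathcal{F} : L_n(f) = 0,\ L(f) \ge a \big) \le p\Big( \bigcup_{f \in \mathcal{F}} \big( L_n(f) = 0,\ L(f) \ge a \big) \Big) \le \sum_{f \in \mathcal{F}} p\big( L_n(f) = 0,\ L(f) \ge a \big) \le |\mathcal{F}|\, \delta$$

Now define $\delta' = \delta / |\mathcal{F}|$; then, applying (3) with $\delta'$ in place of $\delta$, we have

$$L_n(f) \ge L(f) - b(\delta') = L(f) - b(\delta / |\mathcal{F}|), \quad \text{with probability at least } 1 - \delta$$

which proves our desired statement.

7 Proofs

7.1 Proof of Lemma 4.2

Proof. Since each $f$ is a binary-valued function, $\mathcal{F}(Z^n) \subseteq \{0, 1\}^n$. For any set of samples $\{Z_i\}_{i=1}^n$ and any function $f \in \mathcal{F}$, we know

$$\sqrt{\sum_{i=1}^n f(Z_i)^2} \le \sqrt{n}$$

For a fixed set of random samples $\{Z_i\}_{i=1}^n$, the set $\mathcal{F}(Z^n) \triangleq \{(f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F}\}$ plays the role of the set $A$ in Lemma 4.1, with $N = |\mathcal{F}(Z^n)| \le 2^n$ and $L = \sqrt{n}$. As such,

$$R_n(\mathcal{F}(Z^n)) \le 2 \sqrt{\frac{\log |\mathcal{F}(Z^n)|}{n}}$$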
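The finite class lemma used in this proof can be checked numerically on a small example. The sketch below is an illustration, not part of the original notes. It computes $E_\sigma \sup_{a \in A} \frac{1}{m}\sum_i \sigma_i a_i$ exactly by enumerating all sign vectors and compares it with $L\sqrt{2 \ln |A|}/m$, the constant obtained at the end of the proof of Lemma 4.1. The set `A` is a hypothetical stand-in for $\mathcal{F}(Z^n)$, and the function names are illustrative.

```python
import itertools
import math
import random


def exact_sup_average(vectors):
    """E_sigma[ sup_{a in A} (1/m) sum_i sigma_i a_i ], by enumerating all signs."""
    m = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=m):
        total += max(sum(s * c for s, c in zip(signs, a)) for a in vectors)
    return total / (2 ** m) / m


def massart_bound(vectors):
    """L * sqrt(2 ln|A|) / m with L = max_a ||a||_2, as derived in the proof."""
    m = len(vectors[0])
    radius = max(math.sqrt(sum(c * c for c in a)) for a in vectors)
    return radius * math.sqrt(2.0 * math.log(len(vectors))) / m


rng = random.Random(0)
m = 8
# Hypothetical finite class: 5 binary vectors of length m (duplicates are harmless,
# they only make the bound slightly looser).
A = [tuple(rng.randint(0, 1) for _ in range(m)) for _ in range(5)]

value = exact_sup_average(A)
bound = massart_bound(A)
print("exact average:", value, "bound:", bound)
```

Because the vectors are binary, $L \le \sqrt{m}$, so the bound is at most $\sqrt{2 \ln |A| / m}$, which matches the shape of the rate in Lemma 4.2.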
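8 Answers

Here answers to some of the questions are included. The answers are mostly by the authors and might be buggy. Therefore, read cautiously!

8.1 Answer to Example 3.3

8.1.1 First part

We first use Bernstein's inequality and simplify it. Consider Equation 2 and take $\delta = \exp\left( -\frac{n \epsilon^2}{2p + 2\epsilon/3} \right)$. Then,

$$n \epsilon^2 - \frac{2}{3} \ln\frac{1}{\delta}\, \epsilon - 2p \ln\frac{1}{\delta} = 0 \quad \Rightarrow \quad \epsilon = \frac{\frac{2}{3} \ln\frac{1}{\delta} \pm \sqrt{\left( \frac{2}{3} \ln\frac{1}{\delta} \right)^2 + 8 n p \ln\frac{1}{\delta}}}{2n}$$

By the assumption of the inequality, $\epsilon \ge 0$, so we choose the root with the $+$ sign in the above equation. Using this simplification, we can rewrite Bernstein's inequality in the following equivalent form:

$$E U - \frac{1}{n}\sum_{i=1}^n U_i \le \frac{\frac{2}{3} \ln\frac{1}{\delta} + \sqrt{\left( \frac{2}{3} \ln\frac{1}{\delta} \right)^2 + 8 n p \ln\frac{1}{\delta}}}{2n}, \quad \text{with probability at least } 1 - \delta$$

Now, for a specific $f \in \mathcal{F}$, we can consider $U_i = \mathbf{1}\{Y_i \ne f(X_i)\}$ as a Bernoulli random variable with probability of success $p = E U = L(f)$. The empirical estimate of the Bernoulli parameter is $\frac{1}{n}\sum_i U_i = \frac{1}{n}\sum_i \mathbf{1}\{Y_i \ne f(X_i)\} = L_n(f)$. Thus we can rewrite the bound as:

$$L(f) \le L_n(f) + \frac{\ln\frac{1}{\delta}}{3n} + \frac{\sqrt{\left( \frac{2}{3} \ln\frac{1}{\delta} \right)^2 + 8 n L(f) \ln\frac{1}{\delta}}}{2n}$$

Now we use the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$:

$$L(f) \le L_n(f) + \frac{\ln\frac{1}{\delta}}{3n} + \frac{\frac{2}{3} \ln\frac{1}{\delta} + \sqrt{8 n L(f) \ln\frac{1}{\delta}}}{2n} = L_n(f) + \frac{2 \ln\frac{1}{\delta}}{3n} + \sqrt{\frac{2 L(f) \ln\frac{1}{\delta}}{n}}, \quad \text{with probability at least } 1 - \delta$$

which proves the desired result.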
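8.1.2 Second part

We apply the hint to the bound found in the previous part, Equation 1, with the following definitions:

$$a = L(f), \qquad b = L_n(f) + \frac{2 \log(1/\delta)}{3n}, \qquad c = \sqrt{\frac{2 \log(1/\delta)}{n}}$$

With these definitions, Equation 1 reads $a \le b + c\sqrt{a}$, so the hint gives $a \le b + c^2 + c\sqrt{b}$, that is:

$$L(f) \le L_n(f) + \frac{2 \log(1/\delta)}{3n} + \frac{2 \log(1/\delta)}{n} + \sqrt{\frac{2 \log(1/\delta)}{n}} \sqrt{L_n(f) + \frac{2 \log(1/\delta)}{3n}}$$

$$= L_n(f) + \frac{8 \log(1/\delta)}{3n} + \sqrt{\frac{2 \log(1/\delta)}{n}} \sqrt{L_n(f) + \frac{2 \log(1/\delta)}{3n}}$$

We use the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$:

$$L(f) \le L_n(f) + \frac{8 \log(1/\delta)}{3n} + \sqrt{\frac{2 L_n(f) \log(1/\delta)}{n}} + \sqrt{\frac{4}{3} \left( \frac{\log(1/\delta)}{n} \right)^2}$$

$$= L_n(f) + \left( \frac{2}{\sqrt{3}} + \frac{8}{3} \right) \frac{\log(1/\delta)}{n} + \sqrt{\frac{2 L_n(f) \log(1/\delta)}{n}}$$

$$\le L_n(f) + \frac{3.83 \log(1/\delta)}{n} + \sqrt{\frac{2 L_n(f) \log(1/\delta)}{n}} \le L_n(f) + \sqrt{\frac{2 L_n(f) \log(1/\delta)}{n}} + \frac{4 \log(1/\delta)}{n}$$

which proves the desired result.

Now, using this bound, we prove the last part of the question, relying on the union bound for risk stated in the appendix. Since this bound holds for any $f \in \mathcal{F}$, it also holds for $\hat{f} \in \mathcal{F}$. Based on the assumption of the question, the empirical risk of this function is zero. For a fixed $\hat{f} \in \mathcal{F}$ with $L_n(\hat{f}) = 0$, we have

$$L(\hat{f}) \le \frac{4 \log(1/\delta)}{n}$$

Since $\hat{f}$ is not known a priori and can be any function in the class $\mathcal{F}$, we need to use the union bound, as in Equation 3:

$$L(\hat{f}) \le \frac{4 \log(|\mathcal{F}|/\delta)}{n}$$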
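The final bound translates directly into a sample-size requirement: $L(\hat{f}) \le \epsilon$ is guaranteed once $n \ge 4 \log(|\mathcal{F}|/\delta)/\epsilon$. The sketch below is an illustration, not part of the original answer. It evaluates both quantities for a finite class. The function names and the example numbers are hypothetical, and the logarithm is taken to be natural.

```python
import math


def realizable_risk_bound(class_size, n, delta):
    """Bound 4 * log(|F| / delta) / n on L(f_hat) when L_n(f_hat) = 0."""
    return 4.0 * math.log(class_size / delta) / n


def samples_needed(class_size, epsilon, delta):
    """Smallest n for which the bound above is at most epsilon."""
    return math.ceil(4.0 * math.log(class_size / delta) / epsilon)


# Hypothetical setting: 1000 hypotheses, target risk 0.05, confidence 0.99.
n_required = samples_needed(class_size=1000, epsilon=0.05, delta=0.01)
print("samples needed:", n_required)
print("bound at that n:", realizable_risk_bound(1000, n_required, 0.01))
```

Note that the dependence on the class size is only logarithmic, while the dependence on the target risk is $1/\epsilon$ rather than the $1/\epsilon^2$ rate of the general (non-realizable) bounds.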