Rademacher Complexity. Examples

Algorithmic Foundations of Learning    Lecture 3: Rademacher Complexity. Examples
Lecturer: Patrick Rebeschini    Version: October 16th, 2018

3.1 Introduction

In the last lecture we introduced the notion of Rademacher complexity and showed that it yields an upper bound on the expected value of the uniform (over the choice of actions/rules) deviation between the expected risk $r$ and the empirical risk $R$, namely,

$$\mathbf{E} \sup_{a \in \mathcal{A}} \{ r(a) - R(a) \} \le 2\,\mathbf{E}\,\mathrm{Rad}(\mathcal{L} \circ \{Z_1, \ldots, Z_n\}),$$

where we recall the notation

$$\mathcal{L} := \{ z \in \mathcal{Z} \to \ell(a, z) \in \mathbf{R} : a \in \mathcal{A} \}.$$

In this lecture we establish bounds for $\mathrm{Rad}(\mathcal{L} \circ \{z_1, \ldots, z_n\})$ for any $z_1, \ldots, z_n \in \mathcal{Z}$ in the setting of regression. In supervised learning, the observed examples correspond to pairs of points, i.e., $Z = (X, Y) \in \mathcal{X} \times \mathcal{Y}$. The point $X$ is called feature or covariate, and the point $Y$ is its corresponding label. The set of admissible decisions is a subset of the set of functions from $\mathcal{X}$ to $\mathcal{Y}$, i.e., $\mathcal{A} \subseteq \mathcal{B} := \{ a : \mathcal{X} \to \mathcal{Y} \}$, and the loss function is of the form $\ell(a, (x, y)) = \varphi(a(x), y)$, for a function $\varphi : \mathcal{Y} \times \mathcal{Y} \to \mathbf{R}_+$. The regression setting is represented by the choice $\mathcal{X} = \mathbf{R}^d$ for a given dimension $d$, and $\mathcal{Y} = \mathbf{R}$. We have $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, and $s = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ represents a realization of the training sample. Let us recall the following notation:

$$\mathcal{A} \circ \{x_1, \ldots, x_n\} := \{ (a(x_1), \ldots, a(x_n)) \in \mathcal{Y}^n : a \in \mathcal{A} \}.$$

Proposition 3.1 Let the function $\hat{y} \to \varphi(\hat{y}, y)$ be $\gamma$-Lipschitz for any $y \in \mathcal{Y}$. Then, for any $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$,

$$\mathrm{Rad}(\mathcal{L} \circ \{(x_1, y_1), \ldots, (x_n, y_n)\}) \le \gamma\,\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\}).$$

Proof: By the contraction property of Rademacher complexity, Lemma 2.10, we get

$$\mathrm{Rad}(\mathcal{L} \circ s) = \mathbf{E} \sup_{a \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^n \Omega_i\, \varphi(a(x_i), y_i) = \mathrm{Rad}\big( (\varphi(\cdot, y_1), \ldots, \varphi(\cdot, y_n)) \circ \mathcal{A} \circ \{x_1, \ldots, x_n\} \big) \le \gamma\,\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\}).$$

Below we show how to control the quantity $\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\})$ for some function classes $\mathcal{A}$ of interest.
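The propositions below evaluate the supremum in the definition of $\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\})$ in closed form. For a finite class the quantity can also be approximated directly by Monte Carlo over the Rademacher signs, which is a convenient sanity check for the bounds that follow. The Python sketch below is not part of the notes; the function name and the toy data are illustrative only.

```python
import numpy as np

def empirical_rademacher(predictions, num_draws=2000, rng=None):
    """Monte Carlo estimate of Rad(A o {x_1, ..., x_n}) for a finite class.

    predictions: array of shape (|A|, n); row a holds (a(x_1), ..., a(x_n)),
    i.e. one element of A o {x_1, ..., x_n} per row.
    """
    rng = np.random.default_rng(rng)
    _, n = predictions.shape
    total = 0.0
    for _ in range(num_draws):
        omega = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs Omega_1, ..., Omega_n
        total += np.max(predictions @ omega) / n  # sup over the (finite) class
    return total / num_draws

# Toy usage: three fixed predictors evaluated at n = 5 points.
preds = np.array([[1.0, -1.0, 1.0, 1.0, -1.0],
                  [0.5,  0.5, 0.5, 0.5,  0.5],
                  [0.0,  0.0, 0.0, 0.0,  0.0]])
print(empirical_rademacher(preds, rng=0))
```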

3.2 Linear predictors: $\ell_2/\ell_2$ constraints

In the case of $\ell_2/\ell_2$ constraints, the Rademacher complexity of linear predictors does not depend explicitly on the dimension $d$ (the dependence on $d$ is implicit, via the term $\max_i \|x_i\|_2$).

Proposition 3.2 Let $\mathcal{A}_2 := \{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_2 \le 1 \}$. Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}_2 \circ \{x_1, \ldots, x_n\}) \le \frac{\max_i \|x_i\|_2}{\sqrt{n}}.$$

Proof: We have

$$\mathrm{Rad}(\mathcal{A}_2 \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_2 \le 1} \frac{1}{n} \sum_{i=1}^n \Omega_i\, w^\top x_i = \frac{1}{n}\, \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_2 \le 1} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big)$$
$$\le \frac{1}{n}\, \mathbf{E} \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_2 \qquad \text{by Cauchy-Schwarz's inequality, } x^\top y \le \|x\|_2 \|y\|_2$$
$$\le \frac{1}{n} \sqrt{ \mathbf{E} \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_2^2 } \qquad \text{by Jensen's inequality, as } x \to \sqrt{x} \text{ is concave}$$
$$= \frac{1}{n} \sqrt{ \sum_{j=1}^d \mathbf{E} \Big( \sum_{i=1}^n \Omega_i x_{i,j} \Big)^2 } = \frac{1}{n} \sqrt{ \sum_{j=1}^d \sum_{i=1}^n x_{i,j}^2 } \qquad \text{as the } \Omega_i\text{'s are independent, } \mathbf{E}\,\Omega_i = 0 \text{ and } \Omega_i^2 = 1$$
$$= \frac{1}{n} \sqrt{ \sum_{i=1}^n \|x_i\|_2^2 } \le \frac{\max_i \|x_i\|_2}{\sqrt{n}}.$$

Remark 3.3 Note that as the predictors that we are considering are linear, i.e., $x \in \mathbf{R}^d \to w^\top x$, the constraint $\|w\|_2 \le 1$ in the definition of $\mathcal{A}_2$ in Proposition 3.2 is without loss of generality. In fact, if $\|w\|_2 \le c$ for a given constant $c \ge 0$, then we can rescale $w^\top x = \big(\tfrac{w}{\|w\|_2}\big)^\top (\|w\|_2\, x)$ and we have the equivalence

$$\{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_2 \le c \} = \{ x \in \mathbf{R}^d \to w^\top (c\,x) : w \in \mathbf{R}^d, \|w\|_2 \le 1 \}.$$

Proposition 3.2 still applies, with a constant $c$ on the right-hand side of the bound.
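As the proof shows, for the $\ell_2$ ball the inner supremum is available in closed form, $\sup_{\|w\|_2 \le 1} w^\top v = \|v\|_2$, so $\mathrm{Rad}(\mathcal{A}_2 \circ \{x_1, \ldots, x_n\})$ can be estimated by averaging $\frac{1}{n}\|\sum_i \Omega_i x_i\|_2$ over random sign vectors and compared with the bound of Proposition 3.2. A minimal sketch (not from the notes, names and data illustrative):

```python
import numpy as np

def rad_l2_ball(X, num_draws=5000, rng=None):
    """Monte Carlo estimate of Rad(A_2 o {x_1, ..., x_n}) for the class
    {x -> w.x : ||w||_2 <= 1}.  By Cauchy-Schwarz the inner supremum equals
    ||sum_i Omega_i x_i||_2.  X has shape (n, d), one sample per row."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X) / n
            for _ in range(num_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                            # n = 200 points in R^50
estimate = rad_l2_ball(X, rng=1)
bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(len(X))
print(estimate, bound)   # up to Monte Carlo error, estimate <= bound (Proposition 3.2)
```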

3.3 Linear predictors: $\ell_1/\ell_\infty$ constraints ($\ell_1$ Boosting)

In the case of $\ell_1/\ell_\infty$ constraints, the Rademacher complexity of linear predictors only depends logarithmically on the dimension $d$.

Proposition 3.4 Let $\mathcal{A}_1 := \{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_1 \le 1 \}$. Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

Proof: We have

$$\mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_1 \le 1} \frac{1}{n} \sum_{i=1}^n \Omega_i\, w^\top x_i = \frac{1}{n}\, \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_1 \le 1} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big)$$
$$\le \frac{1}{n}\, \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_1 \le 1} \|w\|_1 \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_\infty = \frac{1}{n}\, \mathbf{E} \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_\infty \qquad \text{by Hölder's inequality, } x^\top y \le \|x\|_1 \|y\|_\infty.$$

Let $t_j := (x_{1,j}, \ldots, x_{n,j}) \in \mathbf{R}^n$ for any $j \in \{1, \ldots, d\}$, and let $T = \{t_1, \ldots, t_d\}$. Then,

$$\Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_\infty = \max_{j \in \{1, \ldots, d\}} \Big| \sum_{i=1}^n \Omega_i x_{i,j} \Big| = \max_{j \in \{1, \ldots, d\}} |\Omega^\top t_j| = \max_{t \in T} |\Omega^\top t|,$$

whose expectation looks like a Rademacher complexity apart from the absolute value (and the normalization by $1/n$). To remove the absolute value, note that for any $\omega_1, \ldots, \omega_n \in \{-1, 1\}$ we have $\max_{t \in T} |\omega^\top t| = \max_{t \in T \cup (-T)} \omega^\top t$, where we have defined $-T = \{-t_1, \ldots, -t_d\}$, with $-t_j = (-x_{1,j}, \ldots, -x_{n,j})$. Hence, we have

$$\mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) \le \mathrm{Rad}(T \cup (-T)),$$

and the proof follows by Massart's lemma as

$$\mathrm{Rad}(T \cup (-T)) \le \max_{t \in T \cup (-T)} \|t\|_2\, \frac{\sqrt{2 \log |T \cup (-T)|}}{n} \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

Remark 3.5 Note that as the predictors that we are considering are linear, i.e., $x \in \mathbf{R}^d \to w^\top x$, the constraint $\|w\|_1 \le 1$ in the definition of $\mathcal{A}_1$ is without loss of generality. In fact, if $\|w\|_1 \le c$ for a given constant $c \ge 0$, then we can rescale $w^\top x = \big(\tfrac{w}{\|w\|_1}\big)^\top (\|w\|_1\, x)$ and we have the equivalence

$$\{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_1 \le c \} = \{ x \in \mathbf{R}^d \to w^\top (c\,x) : w \in \mathbf{R}^d, \|w\|_1 \le 1 \}.$$

Proposition 3.4 still applies, with a constant $c$ on the right-hand side of the bound.
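The same Monte Carlo check works here, since by Hölder's inequality the inner supremum over the $\ell_1$ ball is $\|\sum_i \Omega_i x_i\|_\infty$. The sketch below (illustrative, not from the notes) makes the logarithmic dependence on $d$ visible by taking $d$ much larger than $n$.

```python
import numpy as np

def rad_l1_ball(X, num_draws=5000, rng=None):
    """Monte Carlo estimate of Rad(A_1 o {x_1, ..., x_n}) for the class
    {x -> w.x : ||w||_1 <= 1}.  By Holder's inequality the inner supremum
    equals ||sum_i Omega_i x_i||_inf."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    vals = [np.max(np.abs(rng.choice([-1.0, 1.0], size=n) @ X)) / n
            for _ in range(num_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
n, d = 200, 10_000                        # d >> n: the bound only grows like sqrt(log d)
X = rng.normal(size=(n, d))
estimate = rad_l1_ball(X, rng=1)
bound = np.max(np.abs(X)) * np.sqrt(2 * np.log(2 * d) / n)
print(estimate, bound)   # up to Monte Carlo error, estimate <= bound (Proposition 3.4)
```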

3.4 Linear predictors: simplex/$\ell_\infty$ constraints (Boosting)

Proposition 3.6 Let $\Delta_d := \{ w \in \mathbf{R}^d : \|w\|_1 = 1,\ w_1, \ldots, w_d \ge 0 \}$ and let $\mathcal{A}_\Delta := \{ x \in \mathbf{R}^d \to w^\top x : w \in \Delta_d \}$. Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}_\Delta \circ \{x_1, \ldots, x_n\}) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log d}{n}}.$$

Proof: We have

$$\mathrm{Rad}(\mathcal{A}_\Delta \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{w \in \Delta_d} \frac{1}{n} \sum_{i=1}^n \Omega_i\, w^\top x_i = \frac{1}{n}\, \mathbf{E} \sup_{w \in \Delta_d} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big).$$

Note that for any vector $v = (v_1, \ldots, v_d) \in \mathbf{R}^d$ we have

$$\sup_{w \in \Delta_d} w^\top v = \max_{j \in \{1, \ldots, d\}} v_j.$$

Then,

$$\frac{1}{n}\, \mathbf{E} \sup_{w \in \Delta_d} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big) = \frac{1}{n}\, \mathbf{E} \max_{j \in \{1, \ldots, d\}} \sum_{i=1}^n \Omega_i x_{i,j} = \mathrm{Rad}(T),$$

with $T = \{t_1, \ldots, t_d\}$, where $t_j = (x_{1,j}, \ldots, x_{n,j})$ for any $j \in \{1, \ldots, d\}$. The proof follows by Massart's lemma as

$$\mathrm{Rad}(T) \le \max_{t \in T} \|t\|_2\, \frac{\sqrt{2 \log |T|}}{n} \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log d}{n}}.$$
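For the simplex the inner supremum is the largest coordinate of $\sum_i \Omega_i x_i$, without the absolute value, so a numerical check differs from the $\ell_1$ case in only one line. A brief sketch (illustrative, not from the notes):

```python
import numpy as np

def rad_simplex(X, num_draws=5000, rng=None):
    """Monte Carlo estimate of Rad(A_Delta o {x_1, ..., x_n}) for predictors
    x -> w.x with w in the probability simplex: the inner supremum is the
    largest coordinate of sum_i Omega_i x_i (no absolute value)."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    vals = [np.max(rng.choice([-1.0, 1.0], size=n) @ X) / n
            for _ in range(num_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
n, d = 200, 1_000
X = rng.normal(size=(n, d))
print(rad_simplex(X, rng=1),
      np.max(np.abs(X)) * np.sqrt(2 * np.log(d) / n))   # Proposition 3.6 bound
```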

3.5 Feed-forward neural networks

Let us define a feed-forward neural network with activation functions applied element-wise to its units. A layer $l^{(k)} : \mathbf{R}^{d_{k-1}} \to \mathbf{R}^{d_k}$ consists of a coordinate-wise composition of an activation function $\sigma^{(k)} : \mathbf{R} \to \mathbf{R}$ and an affine map, namely,

$$l^{(k)}(x) := \sigma^{(k)}(W^{(k)} x + b^{(k)}),$$

for a given interaction matrix $W^{(k)}$ and bias vector $b^{(k)}$. A neural network with depth $p$ (and $p-1$ hidden layers) is given by the function $f^{(p)} : \mathbf{R}^d \to \mathbf{R}$ defined as

$$f^{(p)}(x) := l^{(p)} \circ \cdots \circ l^{(1)}(x) = l^{(p)}\big( \cdots l^{(2)}(l^{(1)}(x)) \big),$$

with $d_0 = d$, $d_p = 1$, $\sigma^{(r)} = \sigma$ for a given function $\sigma$ for all $r < p$, and $\sigma^{(p)}(x) = x$ (i.e., the last layer is simply an affine map). The activation function $\sigma$ is known to the designer, while the interaction matrices and the bias vectors are treated as parameters to tune. For instance, a class of neural networks with depth $p$ is given by

$$\mathcal{A}^{(p)} := \{ x \in \mathbf{R}^d \to f^{(p)}(x) : \|W^{(k)}\|_\infty \le \omega,\ \|b^{(k)}\|_\infty \le \beta\ \ \forall k \}, \qquad (3.1)$$

where for a given matrix $M$, the $\ell_\infty$ norm is defined as $\|M\|_\infty := \max_l \sum_j |M_{l,j}|$.

The Rademacher complexity of a feed-forward neural network can be bounded recursively by considering each layer at a time. A bound that can be used for the recursion is given by the following proposition, which expresses the Rademacher complexity of the outputs of one layer in terms of the outputs of the previous layer.

Proposition 3.7 Let $\mathcal{L}$ be a class of functions from $\mathbf{R}^d$ to $\mathbf{R}$ that includes the zero function. Let $\sigma : \mathbf{R} \to \mathbf{R}$ be $\alpha$-Lipschitz and define

$$\mathcal{L}' := \Big\{ x \in \mathbf{R}^d \to \sigma\Big( \sum_{j=1}^m w_j l_j(x) + b \Big) \in \mathbf{R} : |b| \le \beta,\ \|w\|_1 \le \omega,\ l_1, \ldots, l_m \in \mathcal{L} \Big\}.$$

Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{L}' \circ \{x_1, \ldots, x_n\}) \le \alpha \Big( \frac{\beta}{\sqrt{n}} + 2\omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}) \Big). \qquad (3.2)$$

If $\mathcal{L} = -\mathcal{L}$, then

$$\mathrm{Rad}(\mathcal{L}' \circ \{x_1, \ldots, x_n\}) \le \alpha \Big( \frac{\beta}{\sqrt{n}} + \omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}) \Big). \qquad (3.3)$$

Proof: We give a proof that makes use of many of the properties of the Rademacher complexity described in the previous lecture. Let

$$\mathcal{F} := \Big\{ x \in \mathbf{R}^d \to \sum_{j=1}^m w_j l_j(x) \in \mathbf{R} : \|w\|_1 \le \omega,\ l_1, \ldots, l_m \in \mathcal{L} \Big\}, \qquad \mathcal{G} := \{ x \in \mathbf{R}^d \to b \in \mathbf{R} : |b| \le \beta \}.$$

By the contraction property and the summation property of Rademacher complexities, we have

$$\mathrm{Rad}(\mathcal{L}' \circ \{x_1, \ldots, x_n\}) \le \alpha \big( \mathrm{Rad}(\mathcal{F} \circ \{x_1, \ldots, x_n\}) + \mathrm{Rad}(\mathcal{G} \circ \{x_1, \ldots, x_n\}) \big).$$

On the one hand, as $\mathcal{L}$ contains the zero function we have $\mathcal{F} \circ \{x_1, \ldots, x_n\} \subseteq \omega\, \mathrm{conv}(\mathcal{L} - \mathcal{L}) \circ \{x_1, \ldots, x_n\}$, where $\mathcal{L} - \mathcal{L} = \{ l - l' : l \in \mathcal{L},\ l' \in \mathcal{L} \}$. In fact, first of all note that

$$\mathrm{Rad}(\mathcal{F} \circ \{x_1, \ldots, x_n\}) = \mathrm{Rad}(\bar{\mathcal{F}} \circ \{x_1, \ldots, x_n\}), \quad \text{where} \quad \bar{\mathcal{F}} := \Big\{ x \in \mathbf{R}^d \to \sum_{j=1}^m w_j l_j(x) \in \mathbf{R} : \|w\|_1 = \omega,\ l_1, \ldots, l_m \in \mathcal{L} \Big\}$$

(this is because the supremum over $\|w\|_1 \le \omega$ is achieved at values satisfying $\|w\|_1 = \omega$). Then, note that for any $w \in \mathbf{R}^m$ such that $\|w\|_1 = 1$ we have

$$\sum_{j=1}^m w_j l_j = \sum_{j : w_j \ge 0} |w_j| (l_j - 0) + \sum_{j : w_j < 0} |w_j| (0 - l_j),$$

where $0$ represents the zero function. The right-hand side is a convex combination of elements of $\mathcal{L} - \mathcal{L}$. Hence, by the convex hull property of Rademacher complexity we find

$$\mathrm{Rad}(\mathcal{F} \circ \{x_1, \ldots, x_n\}) \le \omega\, \mathrm{Rad}(\mathrm{conv}(\mathcal{L} - \mathcal{L}) \circ \{x_1, \ldots, x_n\}) = \omega\, \mathrm{Rad}((\mathcal{L} - \mathcal{L}) \circ \{x_1, \ldots, x_n\})$$
$$= \omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}) + \omega\, \mathrm{Rad}(-\mathcal{L} \circ \{x_1, \ldots, x_n\}) = 2\omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}),$$

where the factor $2$ is not necessary if $\mathcal{L} = -\mathcal{L}$: in that case $\sum_j w_j l_j = \sum_j |w_j|\, (\mathrm{sign}(w_j)\, l_j)$ is already a convex combination of elements of $\mathcal{L}$ itself. On the other hand,

$$\mathrm{Rad}(\mathcal{G} \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{b : |b| \le \beta} \frac{1}{n} \sum_{i=1}^n \Omega_i\, b \le \frac{\beta}{n}\, \mathbf{E} \Big| \sum_{i=1}^n \Omega_i \Big| \le \frac{\beta}{\sqrt{n}},$$

where the last inequality follows by Jensen's inequality, as $\mathbf{E} |\sum_{i} \Omega_i| \le \sqrt{\mathbf{E} (\sum_{i} \Omega_i)^2} = \sqrt{n}$, using the independence of the $\Omega_i$'s and that $\Omega_i^2 = 1$.

We are now ready to give a bound for the full neural network. We can use Proposition 3.7 to run the recursion, noticing that the last layer involves a linear function (which is 1-Lipschitz). The first layer requires a different treatment, and we can use Proposition 3.4. Using Proposition 3.7 we can establish the following bound for the Rademacher complexity of a layered neural network.
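Proposition 3.7 can also be applied numerically, layer by layer, to propagate a complexity bound through a network. The sketch below is not from the notes; the parameter values and the starting first-layer bound are assumptions for illustration. It iterates (3.2), or (3.3) when the previous class is symmetric.

```python
import numpy as np

def layer_bound(rad_prev, alpha, omega, beta, n, symmetric=False):
    """One application of Proposition 3.7: bounds Rad(L' o {x_1,...,x_n}) in
    terms of Rad(L o {x_1,...,x_n}).  Uses (3.3) when the previous class
    satisfies L = -L (symmetric=True), otherwise the general bound (3.2)."""
    factor = 1.0 if symmetric else 2.0
    return alpha * (beta / np.sqrt(n) + factor * omega * rad_prev)

# Toy usage: iterate the recursion over a depth-p network, starting from an
# assumed bound rad_first on the complexity of the first-layer outputs
# (for instance one derived from Proposition 3.4).
n, omega, beta, lam, p = 200, 0.5, 0.1, 1.0, 4
rad = rad_first = 0.05                         # assumed first-layer bound
for _ in range(p - 2):                         # hidden layers 2, ..., p-1
    rad = layer_bound(rad, lam, omega, beta, n)
rad = layer_bound(rad, 1.0, omega, beta, n)    # output layer: affine, alpha = 1
print(rad)
```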

Proposition 3.8 Let $\sigma$ be $\lambda$-Lipschitz. Let $\mathcal{A}^{(p)}$ be defined as in (3.1). Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}^{(p)} \circ \{x_1, \ldots, x_n\}) \le \frac{1}{\sqrt{n}}\, (\beta + 2\omega\beta\lambda) \sum_{k=0}^{p-3} (2\omega\lambda)^k + 2\omega\, (2\omega\lambda)^{p-2} \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

If $\lambda = 1$ and $\sigma$ is anti-symmetric, namely, $\sigma(x) = -\sigma(-x)$ (so that each class of unit outputs satisfies $\mathcal{L} = -\mathcal{L}$ and (3.3) applies), we have

$$\mathrm{Rad}(\mathcal{A}^{(p)} \circ \{x_1, \ldots, x_n\}) \le \frac{1}{\sqrt{n}}\, \beta \sum_{k=0}^{p-2} \omega^k + \omega^{p-1} \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

Proof: As the last layer of the neural network is linear, i.e., $\sigma^{(p)}(x) = x$, we can apply Proposition 3.7 with $\alpha = 1$ (as $\sigma^{(p)}$ is 1-Lipschitz) once, and then apply (3.2) in Proposition 3.7 with $\alpha = \lambda$ for $p - 2$ times. We find

$$\mathrm{Rad}(\mathcal{A}^{(p)} \circ \{x_1, \ldots, x_n\}) \le \frac{\beta}{\sqrt{n}} + 2\omega \Big( \frac{\beta\lambda}{\sqrt{n}} \sum_{k=0}^{p-3} (2\omega\lambda)^k + (2\omega\lambda)^{p-2}\, \mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) \Big).$$

The result of the first inequality follows by Proposition 3.4. The second inequality can be proved using the same strategy, using (3.3) instead of (3.2).
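The first bound of Proposition 3.8 can be evaluated directly; the sketch below (not from the notes, parameter values illustrative) does so and makes the geometric dependence on the depth $p$ through the factor $(2\omega\lambda)^{p-2}$ explicit.

```python
import numpy as np

def network_rad_bound(n, d, x_inf, omega, beta, lam, p):
    """Evaluates the first bound of Proposition 3.8 for a depth-p network with
    ||W^(k)||_inf <= omega, ||b^(k)||_inf <= beta and lambda-Lipschitz sigma;
    x_inf plays the role of max_i ||x_i||_inf."""
    geometric = sum((2 * omega * lam) ** k for k in range(p - 2))  # k = 0, ..., p-3
    massart = x_inf * np.sqrt(2 * np.log(2 * d) / n)               # Proposition 3.4
    return ((beta + 2 * omega * beta * lam) / np.sqrt(n)) * geometric \
        + 2 * omega * (2 * omega * lam) ** (p - 2) * massart

# The bound grows geometrically with the depth p unless 2*omega*lam <= 1.
for p in (3, 4, 5, 6):
    print(p, network_rad_bound(n=200, d=1_000, x_inf=1.0,
                               omega=0.4, beta=0.1, lam=1.0, p=p))
```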