Stat 928: Statistical Learning Theory
Lecture: Symmetrization and Rademacher Averages
Instructor: Sham Kakade

1 Rademacher Averages

Recall that we are interested in bounding the difference between empirical and true expectations uniformly over some function class $\mathcal{G}$. In the context of classification or regression, we are typically interested in a class $\mathcal{G}$ that is the loss class associated with some function class $\mathcal{F}$. That is, given a bounded loss function $\ell : \mathcal{D} \times \mathcal{Y} \to [0, 1]$, we consider the class

$$ \ell \circ \mathcal{F} := \{ (x, y) \mapsto \ell(f(x), y) \;:\; f \in \mathcal{F} \}. $$

Rademacher averages give us a powerful tool to obtain uniform convergence results. We begin by examining the quantity

$$ \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m} \sum_{i=1}^m g(Z_i) \right) \right], $$

where $Z, \{Z_i\}_{i=1}^m$ are i.i.d. random variables taking values in some space $\mathcal{Z}$ and $\mathcal{G} \subseteq [a, b]^{\mathcal{Z}}$ is a set of bounded functions. We will later show that the random quantity we are interested in, namely

$$ \sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m} \sum_{i=1}^m g(Z_i) \right), $$

will be close to the above expectation with high probability.

Let $\epsilon_1, \ldots, \epsilon_m$ be i.i.d. $\{\pm 1\}$-valued random variables with $\mathbb{P}(\epsilon_i = +1) = \mathbb{P}(\epsilon_i = -1) = 1/2$. These are also independent of the sample $Z_1, \ldots, Z_m$. Define the empirical Rademacher average of $\mathcal{G}$ as

$$ \hat{R}_m(\mathcal{G}) := \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m \epsilon_i g(Z_i) \;\middle|\; Z_1^m \right]. $$

The Rademacher average of $\mathcal{G}$ is defined as $R_m(\mathcal{G}) := \mathbb{E}[\hat{R}_m(\mathcal{G})]$.

Theorem 1.1. We have,

$$ \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m} \sum_{i=1}^m g(Z_i) \right) \right] \le 2 R_m(\mathcal{G}). $$

Proof. Introduce the ghost sample $Z'_1, \ldots, Z'_m$. By that we mean that the $Z'_i$'s are independent of each other and of the $Z_i$'s
and have the same distribution as the latter. Then we have,

$$
\begin{aligned}
\mathbb{E}\left[ \sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \right) \right]
&= \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \left( \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m g(Z'_i) \;\middle|\; Z_1^m \right] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \right) \right] \\
&= \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m \left( g(Z'_i) - g(Z_i) \right) \;\middle|\; Z_1^m \right] \right] \\
&\le \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m \left( g(Z'_i) - g(Z_i) \right) \right] \\
&= \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m \epsilon_i \left( g(Z'_i) - g(Z_i) \right) \right] \\
&\le \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m \epsilon_i g(Z'_i) \right] + \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m (-\epsilon_i) g(Z_i) \right] \\
&= 2 R_m(\mathcal{G}).
\end{aligned}
$$

Since $R_m(\mathcal{G}) = R_m(-\mathcal{G})$, we have the following corollary.

Corollary 1.2. We have,

$$ \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \left( \frac{1}{m}\sum_{i=1}^m g(Z_i) - \mathbb{E}[g(Z)] \right) \right] \le 2 R_m(\mathcal{G}). $$

Since $g(Z_i) \in [a, b]$, the quantity $\sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \right)$ does not change by more than $(b - a)/m$ if some $Z_i$ is changed to $Z'_i$. Applying the bounded differences inequality, we get the following corollary.

Corollary 1.3. With probability at least $1 - \delta$,

$$ \sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \right) \le 2 R_m(\mathcal{G}) + (b - a)\sqrt{\frac{\ln(1/\delta)}{2m}}. $$

Recall that we denote the empirical $\ell$-loss minimizer by $\hat{f}_\ell$. We refer to $L_\ell(\hat{f}_\ell) - \inf_{f \in \mathcal{F}} L_\ell(f)$ as the estimation error. The next theorem bounds the estimation error using Rademacher averages.
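As a numerical sanity check, the empirical Rademacher average $\hat{R}_m(\mathcal{G})$ can be estimated by Monte Carlo over draws of the sign vector. The sketch below uses a small, purely illustrative class of threshold indicator functions on $\mathcal{Z} = [0, 1]$ (not a class from these notes); any bounded class evaluated on the sample would work the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 200        # sample size (illustrative choice)
n_eps = 2000   # Monte Carlo draws of the Rademacher sign vector

# Hypothetical finite class G: g_t(z) = 1[z <= t] for a grid of thresholds t.
thresholds = np.linspace(0.1, 0.9, 9)

z = rng.uniform(0.0, 1.0, size=m)                 # the sample Z_1, ..., Z_m
# |G| x m matrix whose (j, i) entry is g_{t_j}(Z_i).
G = (z[None, :] <= thresholds[:, None]).astype(float)

# hat{R}_m(G) = E_eps[ sup_g (1/m) sum_i eps_i g(Z_i) | Z_1^m ],
# approximated by averaging the sup over n_eps sign draws.
eps = rng.choice([-1.0, 1.0], size=(n_eps, m))
rademacher_hat = np.max(eps @ G.T / m, axis=1).mean()

# Uniform deviation sup_g ( E[g(Z)] - (1/m) sum_i g(Z_i) );
# for Z uniform on [0, 1], E[g_t(Z)] = t exactly.
deviation = np.max(thresholds - G.mean(axis=1))

print(f"hat R_m(G) ~= {rademacher_hat:.4f}")
print(f"sup deviation = {deviation:.4f}")
```

On typical runs the observed uniform deviation is well below the symmetrization bound $2\hat{R}_m(\mathcal{G})$ plus the bounded-differences slack, in line with Corollary 1.3.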
2 Expected Regret

Now let us examine the expected regret of the empirical risk minimizer (e.g., analogous to the statistical risk). Let $\tau = (Z_1, \ldots, Z_m)$ denote the training set, let

$$ \hat{g} = \arg\min_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m g(Z_i) $$

be the empirical minimizer, and let

$$ g^\star = \arg\min_{g \in \mathcal{G}} \mathbb{E}[g(Z)], $$

which is the true minimizer.

Lemma 2.1. The expected regret satisfies:

$$ \mathbb{E}[\hat{g}(Z)] - \mathbb{E}[g^\star(Z)] \;\le\; 2 R_m(\mathcal{G}) + \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m g^\star(Z_i) - \mathbb{E}[g^\star(Z)] \right] \;\le\; 4 R_m(\mathcal{G}), $$

where the expectation is with respect to $\hat{g}$ (due to the randomness in the training set).

Proof. The expected regret decomposes as:

$$
\begin{aligned}
\mathbb{E}[\hat{g}(Z)] - \mathbb{E}[g^\star(Z)]
&= \mathbb{E}\left[ \mathbb{E}[\hat{g}(Z) \mid \tau] - \frac{1}{m}\sum_{i=1}^m \hat{g}(Z_i) \right] + \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m \hat{g}(Z_i) - \mathbb{E}[g^\star(Z)] \right] \\
&\le \mathbb{E}\left[ \mathbb{E}[\hat{g}(Z) \mid \tau] - \frac{1}{m}\sum_{i=1}^m \hat{g}(Z_i) \right] + \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m g^\star(Z_i) - \mathbb{E}[g^\star(Z)] \right] \\
&\le \mathbb{E}\left[ \sup_{g \in \mathcal{G}} \left( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \right) \right] + \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m g^\star(Z_i) - \mathbb{E}[g^\star(Z)] \right] \\
&\le 2 R_m(\mathcal{G}) + \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m g^\star(Z_i) - \mathbb{E}[g^\star(Z)] \right],
\end{aligned}
$$

where the first inequality uses that $\hat{g}$ minimizes the empirical average, so $\frac{1}{m}\sum_i \hat{g}(Z_i) \le \frac{1}{m}\sum_i g^\star(Z_i)$. The final claim is straightforward: by Corollary 1.2 the second term is at most $2 R_m(\mathcal{G})$ (in fact, since $g^\star$ is a fixed function, $\mathbb{E}[\frac{1}{m}\sum_i g^\star(Z_i)] = \mathbb{E}[g^\star(Z)]$, so this term is zero).

3 Growth function

Consider the case $\mathcal{Y} = \{\pm 1\}$ (classification). Let $\ell$ be the 0-1 loss function and $\mathcal{F}$ be a class of $\pm 1$-valued functions. We can relate the Rademacher average of $\ell \circ \mathcal{F}$ to that of $\mathcal{F}$ as follows.

Lemma 3.1. Suppose $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$ and let $\ell(y', y) = \mathbf{1}[y' \ne y]$ be the 0-1 loss function. Then we have,

$$ R_m(\ell \circ \mathcal{F}) = \frac{1}{2} R_m(\mathcal{F}). $$
Proof. Note that we can write $\ell(y', y)$ as $(1 - y'y)/2$. Then we have,

$$
\begin{aligned}
R_m(\ell \circ \mathcal{F})
&= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i \, \frac{1 - Y_i f(X_i)}{2} \right] \\
&= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i \, \frac{- Y_i f(X_i)}{2} \right] \qquad (1) \\
&= \frac{1}{2}\, \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i Y_i f(X_i) \right] \\
&= \frac{1}{2}\, \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i f(X_i) \right] \qquad (2) \\
&= \frac{1}{2} R_m(\mathcal{F}).
\end{aligned}
$$

Equation (1) follows because $\mathbb{E}[\epsilon_i \mid X_1^m, Y_1^m] = 0$, and the term $\frac{1}{m}\sum_i \epsilon_i / 2$ does not depend on $f$. Equation (2) follows because the $\epsilon_i Y_i$'s jointly have the same distribution as the $\epsilon_i$'s.

Note that the Rademacher average of the class $\mathcal{F}$ can also be written as

$$ R_m(\mathcal{F}) = \mathbb{E}\left[ \mathbb{E}\left[ \max_{a \in \mathcal{F}|_{X_1^m}} \frac{1}{m}\sum_{i=1}^m \epsilon_i a_i \;\middle|\; X_1^m \right] \right], $$

where $\mathcal{F}|_{X_1^m}$ is the function class $\mathcal{F}$ restricted to the set $X_1, \ldots, X_m$. That is,

$$ \mathcal{F}|_{X_1^m} := \{ (f(X_1), \ldots, f(X_m)) \;:\; f \in \mathcal{F} \}. $$

Note that $\mathcal{F}|_{X_1^m}$ is finite and $|\mathcal{F}|_{X_1^m}| \le \min\{|\mathcal{F}|, 2^m\}$. Thus we can define the growth function as

$$ \Pi_{\mathcal{F}}(m) := \max_{x_1, \ldots, x_m \in \mathcal{X}} |\mathcal{F}|_{x_1^m}|. $$

The following lemma, due to Massart, allows us to bound the Rademacher average in terms of the growth function.

Lemma 3.2 (Finite Class Lemma). Let $A$ be some finite subset of $\mathbb{R}^m$ and $\epsilon_1, \ldots, \epsilon_m$ be independent Rademacher random variables. Let $r = \max_{a \in A} \|a\|_2$. Then, we have,

$$ \mathbb{E}\left[ \max_{a \in A} \frac{1}{m}\sum_{i=1}^m \epsilon_i a_i \right] \le \frac{r \sqrt{2 \ln |A|}}{m}. $$

Proof. Let $\mu = \mathbb{E}\left[ \max_{a \in A} \sum_{i=1}^m \epsilon_i a_i \right]$.
We have, for any $\lambda > 0$,

$$
\begin{aligned}
e^{\lambda \mu} &= \exp\left( \lambda \, \mathbb{E}\left[ \max_{a \in A} \sum_{i=1}^m \epsilon_i a_i \right] \right) \\
&\le \mathbb{E}\left[ \exp\left( \lambda \max_{a \in A} \sum_{i=1}^m \epsilon_i a_i \right) \right] \qquad \text{(Jensen's inequality)} \\
&= \mathbb{E}\left[ \max_{a \in A} \exp\left( \lambda \sum_{i=1}^m \epsilon_i a_i \right) \right] \\
&\le \sum_{a \in A} \mathbb{E}\left[ \exp\left( \lambda \sum_{i=1}^m \epsilon_i a_i \right) \right] \\
&= \sum_{a \in A} \prod_{i=1}^m \mathbb{E}\left[ e^{\lambda \epsilon_i a_i} \right] \\
&\le \sum_{a \in A} \prod_{i=1}^m e^{\lambda^2 a_i^2 / 2} \qquad \text{(Hoeffding's lemma)} \\
&\le \sum_{a \in A} e^{\lambda^2 r^2 / 2} = |A| \, e^{\lambda^2 r^2 / 2}.
\end{aligned}
$$

Taking logs and dividing by $\lambda$, we get that, for any $\lambda > 0$,

$$ \mu \le \frac{\ln |A|}{\lambda} + \frac{\lambda r^2}{2}. $$

Setting $\lambda = \sqrt{2 \ln |A| / r^2}$ gives

$$ \mu \le r \sqrt{2 \ln |A|}, $$

which proves the lemma (dividing both sides by $m$ gives the stated bound).
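The finite class lemma is easy to check numerically. The sketch below (an illustrative setup, not from the notes: a random finite set $A \subset \mathbb{R}^m$ of Gaussian vectors) estimates $\mathbb{E}[\max_{a \in A} \frac{1}{m}\sum_i \epsilon_i a_i]$ by Monte Carlo and compares it with Massart's bound $r\sqrt{2\ln|A|}/m$.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n_vectors, n_eps = 50, 20, 5000

# A finite set A of n_vectors points in R^m (arbitrary illustrative data).
A = rng.normal(size=(n_vectors, m))
r = np.max(np.linalg.norm(A, axis=1))      # r = max_{a in A} ||a||_2

# Monte Carlo estimate of E[ max_{a in A} (1/m) sum_i eps_i a_i ].
eps = rng.choice([-1.0, 1.0], size=(n_eps, m))
estimate = np.max(eps @ A.T / m, axis=1).mean()

# Massart's bound: r * sqrt(2 ln |A|) / m.
bound = r * np.sqrt(2.0 * np.log(n_vectors)) / m

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Massart bound:        {bound:.4f}")
```

The estimate should land below the bound, with the gap reflecting the slack in the union-bound and Hoeffding steps of the proof.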