Stat 928: Statistical Learning Theory — Lecture 10: Empirical Process Theory and Oracle Inequalities. Instructor: Sham Kakade

1 Risk vs Risk

See Lecture 0 for a discussion on terminology.

2 The Union Bound / Bonferroni

Consider $m$ events $E_1, \dots, E_m$. We have
\[
P(E_1 \cup \dots \cup E_m) \le P(E_1) + \dots + P(E_m).
\]
In other words, with probability $1 - P(E_1) - \dots - P(E_m)$, none of the events $E_i$ ($i = 1, \dots, m$) occurs; this statement is useful when the total probability $\sum_j P(E_j)$ is small. The union bound is relatively tight when the events $E_j$ are independent, since by inclusion–exclusion
\[
P(E_1 \cup \dots \cup E_m) \ge \sum_j P(E_j) - \sum_{j < k} P(E_j \cap E_k) \ge \sum_j P(E_j) - 0.5\Big(\sum_j P(E_j)\Big)^2 .
\]
If the $E_j$ are correlated, then it is not tight. For example, when they are completely correlated, $E_1 = \dots = E_m$, we get $P(E_1 \cup \dots \cup E_m) = m^{-1} \sum_j P(E_j)$. We will come back to this when we discuss chaining.

3 Motivation of Empirical Process

Consider a learning problem with observations $Z_i = (X_i, Y_i)$, a prediction rule $f(X_i)$, and a loss function $L(f(X_i), Y_i)$. Assume further that $f$ is parameterized by $\theta \in \Theta$ as $f_\theta(X_i)$. For example, $f_\theta(x) = \theta^\top x$ may be a linear function, and $L(f_\theta(x), y) = (\theta^\top x - y)^2$ the least squares loss. In the following, we introduce the simplified notation $g_\theta(Z_i) = L(f_\theta(X_i), Y_i)$.

We are interested in estimating $\hat\theta$ from training data; that is, $\hat\theta$ depends on the $Z_i$. Since we are using the training data as a surrogate for the test (true underlying) distribution, we hope the training error is similar to the test error. In learning theory, we are interested in estimating the following tail quantities for some $\epsilon > 0$:
\[
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat\theta}(Z_i) \ge E g_{\hat\theta}(Z) + \epsilon\Big)
\quad\text{and}\quad
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat\theta}(Z_i) \le E g_{\hat\theta}(Z) - \epsilon\Big).
\]
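These tail probabilities are interesting precisely because $\hat\theta$ is chosen using the same data on which the training error is computed. A small simulation sketch (all constants here are illustrative choices, not from the notes) shows the resulting selection bias: every candidate parameter has the same test error, yet the minimized training error is systematically lower.

```python
import random

random.seed(0)

n, K, trials = 50, 20, 2000  # sample size, |Theta|, Monte Carlo repetitions
gap = 0.0
for _ in range(trials):
    # K candidate parameters, each with i.i.d. Bernoulli(1/2) losses g_theta(Z_i),
    # so every theta has the same test error E g_theta(Z) = 0.5
    emp_means = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(K)]
    gap += 0.5 - min(emp_means)  # test error minus training error of the ERM choice
gap /= trials
print(f"average E g - training error at theta-hat: {gap:.3f}")  # clearly positive
```

For any single fixed $\theta$ the corresponding average gap would be roughly zero; the persistent positive gap at the data-dependent $\hat\theta$ is exactly the event whose probability the tail bounds above control.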
Both tail probabilities can be bounded by the corresponding supremum over $\Theta$:
\[
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat\theta}(Z_i) \ge E g_{\hat\theta}(Z) + \epsilon\Big)
\le P\Big(\sup_{\theta \in \Theta}\Big[\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)\Big] \ge \epsilon\Big)
\]
and
\[
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat\theta}(Z_i) \le E g_{\hat\theta}(Z) - \epsilon\Big)
\le P\Big(\sup_{\theta \in \Theta}\Big[E g_\theta(Z) - \frac{1}{n}\sum_{i=1}^n g_\theta(Z_i)\Big] \ge \epsilon\Big).
\]

Notation: in the above setting, the collection of random variables $\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i)$ indexed by $\theta \in \Theta$ is called an empirical process. We may also call $\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)$ an empirical process.

For each fixed $\theta$, $\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z) \to 0$ in probability, by the law of large numbers. However, in empirical process theory we are interested in a uniform law of large numbers, that is, that the following supremum of the empirical process,
\[
\sup_{\theta \in \Theta}\Big[\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)\Big],
\]
converges to zero in probability. Given training data $Z_1^n = \{Z_1, \dots, Z_n\}$, we may let $\hat\theta(Z_1^n)$ achieve the supremum above. Then
\[
\sup_{\theta \in \Theta}\Big[\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)\Big]
= \frac{1}{n}\sum_{i=1}^n g_{\hat\theta(Z_1^n)}(Z_i) - E g_{\hat\theta(Z_1^n)}(Z),
\]
where $\hat\theta(Z_1^n)$ depends on the training data. This means that $\sum_i g_{\hat\theta(Z_1^n)}(Z_i)$ is not a sum of independent random variables anymore. The supremum of the empirical process is basically the worst-case deviation between the empirical mean (training error) and the true mean (test error), over parameters chosen based on the training data. Conceptually, as long as you select $\hat\theta$ based on the training data, you need to use empirical processes and the uniform law of large numbers. However, if you only consider a fixed $\theta$ independent of the training data, then you can use the standard law of large numbers, because the $g_\theta(Z_i)$ are independent random variables.

4 Oracle Inequality for empirical risk minimization

Consider the empirical risk minimization (ERM) algorithm
\[
\hat\theta = \arg\min_{\theta \in \Theta} \sum_{i=1}^n g_\theta(Z_i),
\]
and the optimal parameter that minimizes the test error (with an infinite amount of data):
\[
\theta_* = \arg\min_{\theta \in \Theta} E g_\theta(Z).
\]
We want to know how much worse the test error of $\hat\theta$ is compared to that of $\theta_*$. Results of this flavor are referred to as oracle inequalities. We can obtain a simple oracle inequality using the uniform law of large numbers for the empirical process, as follows. Assume that we have the tail bound for the empirical mean of $g_{\theta_*}(Z)$:
\[
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) \ge E g_{\theta_*}(Z) + \epsilon_1\Big) \le \delta_1(\epsilon_1).
\]
Assume further that we have the following uniform tail bound for the empirical process, for some $\gamma \in [0, 1)$:
\[
P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon_2\Big) \le \delta_2(\epsilon_2).
\]
Taking the union bound, we obtain that with probability $1 - \delta_1(\epsilon_1) - \delta_2(\epsilon_2)$,
\[
\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) < \epsilon_1,
\qquad
-\frac{1}{n}\sum_{i=1}^n g_{\hat\theta}(Z_i) + (1-\gamma) E g_{\hat\theta}(Z) + \gamma E g_{\theta_*}(Z) < \epsilon_2.
\]
Since by the definition of ERM we have
\[
\frac{1}{n}\sum_{i=1}^n g_{\hat\theta}(Z_i) \le \frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i),
\]
adding the three inequalities gives
\[
(1-\gamma) E g_{\hat\theta}(Z) + \gamma E g_{\theta_*}(Z) - E g_{\theta_*}(Z) < \epsilon_1 + \epsilon_2.
\]
That is, we have
\[
E g_{\hat\theta}(Z) < E g_{\theta_*}(Z) + (1-\gamma)^{-1}(\epsilon_1 + \epsilon_2).
\]
If $\Theta$ contains only a finite number of functions, $N = |\Theta|$, then we can simply apply the union bound:
\[
P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon\Big)
\]
\[
= P\Big(\bigcup_{\theta \in \Theta}\Big\{-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z) \ge \epsilon\Big\}\Big)
\]
\[
\le \sum_{\theta \in \Theta} P\Big(-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z) \ge \epsilon\Big).
\]

5 Recap: Oracle Inequality

Consider the empirical risk minimization algorithm
\[
\hat\theta = \arg\min_{\theta \in \Theta} \sum_{i=1}^n g_\theta(Z_i),
\]
and the optimal parameter that minimizes the test error (with an infinite amount of data):
\[
\theta_* = \arg\min_{\theta \in \Theta} E g_\theta(Z).
\]
Suppose that
\[
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) \ge E g_{\theta_*}(Z) + \epsilon_1\Big) \le \delta_1(\epsilon_1),
\]
which means that the training error of the optimal parameter isn't much larger than its test error. Assume also that we have the following uniform tail bound for the empirical process, for some $\gamma \in [0, 1)$:
\[
P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon_2\Big) \le \delta_2(\epsilon_2),
\]
which means that the training error of an arbitrary inferior parameter isn't much smaller than its test error. Then we have the oracle inequality: with probability $1 - \delta_1(\epsilon_1) - \delta_2(\epsilon_2)$,
\[
E g_{\hat\theta}(Z) < E g_{\theta_*}(Z) + (1-\gamma)^{-1}(\epsilon_1 + \epsilon_2).
\]
This means that the generalization performance of ERM isn't much worse than that of the optimal parameter.

6 Lower bracketing covering number

If $\Theta$ is infinite, then we can use the idea of covering numbers. There are different definitions. Let $G = \{g_\theta : \theta \in \Theta\}$ be the function class of the empirical process. A set $G_N = \{g_1(z), \dots, g_N(z)\}$ is an $\epsilon$-lower bracketing cover of $G$ if for all $\theta \in \Theta$ there exists $j = j(\theta)$ such that
\[
\sup_z\,[g_j(z) - g_\theta(z)] \le 0,
\qquad
E g_j(Z) \ge E g_\theta(Z) - \epsilon.
\]
The smallest cardinality $N_{LB}(G, \epsilon)$ of such a $G_N$ is called the $\epsilon$-lower bracketing covering number. Similarly, one can define the upper bracketing covering number. The logarithm of the covering number is called the entropy. We should mention that the functions $g_j(z)$ need not themselves be functions $g_\theta(z)$ for any $\theta \in \Theta$.

Let $G(\epsilon/2)$ be an $\epsilon/2$-lower bracketing cover of $G$, and pick $j = j(\theta)$. Then
\[
-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)
\]
\[
= \frac{1}{n}\sum_{i=1}^n [g_j(Z_i) - g_\theta(Z_i)]
+ \Big[-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + (1-\gamma) E g_j(Z) + \gamma E g_{\theta_*}(Z) + (1-\gamma)\,[E g_\theta(Z) - E g_j(Z)]\Big]
\]
\[
\le -\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + (1-\gamma) E g_j(Z) + \gamma E g_{\theta_*}(Z) + (1-\gamma)\epsilon/2,
\]
where the inequality uses the two bracketing properties: $g_j \le g_\theta$ pointwise, and $E g_\theta(Z) - E g_j(Z) \le \epsilon/2$. Thus,
\[
P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon\Big)
\]
\[
\le P\Big(\max_{j}\Big[-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + (1-\gamma) E g_j(Z) + \gamma E g_{\theta_*}(Z) + (1-\gamma)\epsilon/2\Big] \ge \epsilon\Big)
\]
\[
\le \sum_{j=1}^{|G(\epsilon/2)|} P\Big(-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + E g_j(Z) \ge \gamma\,(E g_j(Z) - E g_{\theta_*}(Z)) + 0.5(1+\gamma)\epsilon\Big).
\]
The summation bound with $\gamma > 0$ is a form of an idea in empirical process theory referred to as peeling, sometimes also called shell bounds. We will present a simple example below to illustrate the basic concepts.
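Before turning to the example, here is a quick numerical sanity check of the finite-class step that the covering argument reduces everything to: a per-function Chernoff/Hoeffding bound combined via the union bound. This is only a sketch; the constants n, N, and delta are illustrative, and $\gamma$ is taken to be $0$.

```python
import math
import random

random.seed(1)

n, N, delta, trials = 200, 50, 0.05, 300
# Hoeffding + union bound over N fixed [0,1]-valued functions:
# P(max_j [E g_j - (1/n) sum_i g_j(Z_i)] >= eps) <= N * exp(-2 n eps^2) = delta
eps = math.sqrt(math.log(N / delta) / (2 * n))
violations = 0
for _ in range(trials):
    worst = 0.0
    for _ in range(N):
        mu = random.random()  # true mean E g_j of a Bernoulli "loss"
        emp = sum(random.random() < mu for _ in range(n)) / n
        worst = max(worst, mu - emp)  # one-sided deviation, as in the gamma = 0 bound
    violations += worst >= eps
print(f"eps = {eps:.3f}, observed violation rate = {violations / trials:.3f}")
```

The observed violation rate comes out far below delta, since Hoeffding's bound is loose for means away from $1/2$; the point is only that the worst of the $N$ deviations stays controlled.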
7 A Simple Example

This example is meant to get you familiar with the intuitions and notation. We will consider more complex examples in future lectures, but the basic ideas resemble this example.

Consider a one-dimensional classification problem, with $x \in [-1, 1]$ and $y \in \{\pm 1\}$. Assume that conditioned on $x$, the class label is given by $y = \epsilon\,(2 I(x \ge \theta_*) - 1)$ for some unknown $\theta_*$, with independent random noise $\epsilon \in \{-1, 1\}$ and $p = P(\epsilon = 1) > 0.5$. This means that the optimal Bayes classifier is $f_*(x) = 1$ when $x \ge \theta_*$ and $f_*(x) = -1$ when $x < \theta_*$, and the Bayes error is $1 - p$.

Since we don't know the true threshold, we consider a family of classifiers $f_\theta(x) = 2 I(x \ge \theta) - 1$, with $\theta$ to be learned from the training data. Given a sample $Z = (X, Y)$, the classification error function for this classifier is $g_\theta(Z) = I(f_\theta(X) \ne Y)$. Given training data $Z_1^n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, we can learn a threshold $\hat\theta$ using empirical risk minimization, which minimizes the training error:
\[
\hat\theta = \arg\min_\theta \sum_{i=1}^n g_\theta(Z_i).
\]
We want to know the generalization performance of $\hat\theta$ compared to the Bayes error; that is, to give an upper bound on $E g_{\hat\theta}(Z) - (1 - p)$. We will examine the following issues in order to understand what is going on:

- $1/\sqrt{n}$ convergence (using the Chernoff bound) versus $1/n$ convergence (using a refined Chernoff bound or Bennett's inequality).
- The role of peeling.

7.1 Bracketing cover of the function class

Given $\epsilon$, let $\theta_j = -1 + j\epsilon$ for $j = 1, \dots, 2/\epsilon$. Let
\[
g_j(z) = \begin{cases} 0 & \text{if } x \in [\theta_j - \epsilon, \theta_j] \\ g_{\theta_j}(z) & \text{otherwise,} \end{cases}
\]
where $z = (x, y)$. It follows that for any $\theta \in [-1, 1]$, if we let $j$ be the smallest index such that $\theta_j \ge \theta$, then $g_j(z) = 0 \le g_\theta(z)$ when $x \in [\theta_j - \epsilon, \theta_j]$, and $g_j(z) = g_\theta(z)$ when $x \notin [\theta_j - \epsilon, \theta_j]$ (outside that interval, $f_{\theta_j}$ and $f_\theta$ agree). Moreover,
\[
E g_\theta(Z) - E g_j(Z) = E\,[\,I(x \in [\theta_j - \epsilon, \theta_j])\, g_\theta(Z)\,] \le P(x \in [\theta_j - \epsilon, \theta_j]) \le \epsilon,
\]
the last step holding, e.g., when $x$ is uniform on $[-1,1]$. Hence $\{g_1, \dots, g_{2/\epsilon}\}$ is an $\epsilon$-lower bracketing cover, so $N_{LB}(G, \epsilon) \le 2/\epsilon$.

Note that since only the analysis depends on the covering number, in general we can design a cover that depends on the truth, and may cover the space non-uniformly. This is not considered here.
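The two bracketing properties can be checked mechanically. The sketch below implements $g_\theta$ and the cover functions $g_j$ exactly as defined above and verifies the pointwise lower-bracket inequality on random points; the grid width eps = 0.25 is an illustrative choice.

```python
import random

random.seed(2)

eps = 0.25
cover = [-1 + j * eps for j in range(1, int(2 / eps) + 1)]  # theta_j = -1 + j*eps

def g(theta, x, y):
    # classification error of f_theta(x) = 2*I(x >= theta) - 1 on z = (x, y)
    return int((1 if x >= theta else -1) != y)

def g_bracket(j, x, y):
    # g_j(z) = 0 on [theta_j - eps, theta_j], and g_{theta_j}(z) otherwise
    return 0 if cover[j] - eps <= x <= cover[j] else g(cover[j], x, y)

for _ in range(10000):
    theta = random.uniform(-1, 1)
    j = min(k for k, t in enumerate(cover) if t >= theta)  # smallest theta_j >= theta
    x, y = random.uniform(-1, 1), random.choice([-1, 1])
    assert g_bracket(j, x, y) <= g(theta, x, y)  # pointwise lower bracket
print("lower-bracket property verified on 10000 random points")
```

The expectation property $E g_\theta(Z) - E g_j(Z) \le \epsilon$ then follows because the two functions differ only on an interval of length $\epsilon$.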
7.2 Using the Standard Chernoff bound without peeling

At $\theta_*$, we have from the Chernoff bound:
\[
P\Big(\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) \ge E g_{\theta_*}(Z) + \epsilon\Big) \le \exp(-2 n \epsilon^2).
\]
Alternatively, we say that with probability $1 - \delta_1$,
\[
\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) < \epsilon_1 = \sqrt{\ln(1/\delta_1)/(2n)}.
\]
Now we want to evaluate, using the lower bracketing cover $G(\epsilon/2)$ (and taking $\gamma = 0$):
\[
P\Big(\sup_{\theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + E g_\theta(Z)\Big] \ge \epsilon\Big)
\le \sum_{j=1}^{|G(\epsilon/2)|} P\Big(-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + E g_j(Z) \ge 0.5\epsilon\Big)
\le \frac{4}{\epsilon}\, e^{-n\epsilon^2/2}.
\]
We used $|G(\epsilon/2)| \le 4/\epsilon$. Alternatively, we say that with probability $1 - \delta_2$ (and note that $\epsilon_2 \ge \sqrt{2/n}$):
\[
\sup_\theta\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + E g_\theta(Z)\Big] < \epsilon_2 = \sqrt{2(\ln(4/\epsilon_2) - \ln \delta_2)/n}.
\]
Let $\delta = 2\delta_1 = 2\delta_2$. We have with probability at least $1 - \delta$:
\[
E g_{\hat\theta}(Z) - (1 - p) < \sqrt{\ln(2/\delta)/(2n)} + \sqrt{2\big(\ln(4\sqrt{n/2}) + \ln(2/\delta)\big)/n}
\le \sqrt{2 \ln(4\sqrt{n/2})/n} + 3\sqrt{\ln(2/\delta)/(2n)}.
\]
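As a closing illustration, here is a hedged Monte Carlo sketch of the whole example: it runs ERM for the threshold class on simulated data and compares the average excess risk $E g_{\hat\theta}(Z) - (1-p)$ against the leading $\sqrt{2\ln(4\sqrt{n/2})/n}$ term of the bound above. The specific values theta_star = 0.2 and p = 0.8 are illustrative, not from the notes.

```python
import math
import random

random.seed(3)

def erm_test_error(n, theta_star=0.2, p=0.8):
    # draw data: x ~ Uniform[-1,1]; label is Bayes-correct with probability p
    pts = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = (1 if x >= theta_star else -1) * (1 if random.random() < p else -1)
        pts.append((x, y))
    pts.sort()
    # threshold sweep: after processing point k, pts[0..k] are labelled -1;
    # maintain the training-error count incrementally
    err = sum(y != 1 for _, y in pts)  # threshold below all points: everything +1
    best_err, best_theta = err, -1.0
    for x, y in pts:
        err += 1 if y == 1 else -1  # this point flips from predicted +1 to -1
        if err < best_err:
            best_err, best_theta = err, x + 1e-9
    # exact test error of the learned threshold (x uniform on [-1,1])
    return (1 - p) + (2 * p - 1) * abs(best_theta - theta_star) / 2

n, trials = 400, 200
avg = sum(erm_test_error(n) for _ in range(trials)) / trials
excess = avg - 0.2  # subtract the Bayes error 1 - p
bound = math.sqrt(2 * math.log(4 * math.sqrt(n / 2)) / n)  # leading bound term
print(excess, bound)
```

In runs of this sketch the observed excess risk sits well below the $1/\sqrt{n}$-type bound, because the margin $2p - 1$ is bounded away from zero; this gap is the motivation for the refined $1/n$-rate analysis and the peeling idea mentioned above.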