Agnostic Learning and Concentration Inequalities

ECE901 Spring 2004 Statistical Regularization and Learning Theory    Lecture: 7
Agnostic Learning and Concentration Inequalities
Lecturer: Rob Nowak    Scribe: Aravind Kailas

1 Introduction

1.1 Motivation

In the last lecture we considered a learning problem in which the optimal function belonged to a finite class of functions. Specifically, for some collection of functions $\mathcal{F}$ with finite cardinality $|\mathcal{F}|$, we had
$$\min_{f \in \mathcal{F}} R(f) = 0.$$
This is almost always not the situation in real-world learning problems. Let us again suppose we have a finite collection of candidate functions $\mathcal{F}$. Furthermore, we do not assume that the optimal function $f^*$, which satisfies
$$R(f^*) = \inf_f R(f),$$
where the infimum is taken over all measurable functions, is a member of $\mathcal{F}$. That is, we make few, if any, assumptions about $f^*$. This situation is sometimes termed Agnostic Learning. The root of the word agnostic literally means "not known." The term agnostic learning is used to emphasize the fact that often, perhaps usually, we have no prior knowledge about $f^*$. The question then arises: how can we reasonably select an $f \in \mathcal{F}$ in this setting?

1.2 The Problem

The PAC-style bounds discussed in the previous lecture offer some help. Since we are selecting a function based on the empirical risk, the question is how close $\hat{R}_n(f)$ is to $R(f)$ for every $f \in \mathcal{F}$. In other words, we would like the empirical risk to be a good indicator of the true risk for every function in $\mathcal{F}$. If this is the case, then the function that minimizes the empirical risk,
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f),$$
should also yield a small true risk; that is, $R(\hat{f}_n)$ should be close to $\min_{f \in \mathcal{F}} R(f)$. Finally, we can state our desired situation as
$$P\big(|\hat{R}_n(f) - R(f)| > \epsilon\big) < \delta, \quad \forall f \in \mathcal{F}.$$
In other words, for every $f \in \mathcal{F}$, with probability at least $1 - \delta$, $|\hat{R}_n(f) - R(f)| \le \epsilon$. In this lecture we will start to develop bounds of this form. First we will focus on bounding $P(|\hat{R}_n(f) - R(f)| > \epsilon)$ for one fixed $f \in \mathcal{F}$.
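To make the selection rule concrete, here is a minimal sketch of empirical risk minimization over a finite class with the 0-1 loss. The candidate class (simple threshold classifiers) and the synthetic data are hypothetical choices made purely for illustration; the lecture does not prescribe any particular class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (hypothetical example, not from the lecture).
n = 200
X = rng.uniform(0, 1, size=n)
Y = (X > 0.35).astype(int) ^ (rng.uniform(size=n) < 0.1)  # noisy labels

# A finite class F of threshold classifiers f_t(x) = 1{x > t}.
thresholds = np.linspace(0, 1, 21)

def empirical_risk(t, X, Y):
    """Empirical 0-1 risk: (1/n) sum_i 1{f_t(X_i) != Y_i}."""
    return ((X > t).astype(int) != Y).mean()

# Empirical risk minimization: pick the candidate with the smallest empirical risk.
risks = np.array([empirical_risk(t, X, Y) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"selected threshold {t_hat:.2f} with empirical risk {risks.min():.3f}")
```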

2 Developing Initial Bounds

To begin, let us recall the definition of the empirical risk. Let $\{X_i, Y_i\}_{i=1}^n$ be a collection of training data. Then the empirical risk is defined as
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i).$$
Note that since the training pairs $\{X_i, Y_i\}$ are assumed to be i.i.d., each term in the sum is an i.i.d. random variable. Let
$$L_i = \ell(f(X_i), Y_i).$$
The collection of losses $\{L_i\}$ is i.i.d. according to some unknown distribution (depending on the unknown joint distribution of $(X, Y)$ and the loss function). The expectation of $L_i$ is $E[\ell(f(X_i), Y_i)] = E[\ell(f(X), Y)] = R(f)$, the true risk of $f$. For now, let us assume that $f$ is fixed. Then
$$E[\hat{R}_n(f)] = \frac{1}{n} \sum_{i=1}^n E[\ell(f(X_i), Y_i)] = \frac{1}{n} \sum_{i=1}^n E[L_i] = R(f).$$
We know from the strong law of large numbers that the average (or empirical mean) $\hat{R}_n(f)$ converges almost surely to the true mean $R(f)$. That is, $\hat{R}_n(f) \to R(f)$ almost surely as $n \to \infty$. The question is how fast.

3 Concentration of Measure Inequalities

Concentration inequalities are upper bounds on how fast empirical means converge to their ensemble counterparts, in probability. The area of the shaded tail regions in Figure 1 is $P(|\hat{R}_n(f) - R(f)| > \epsilon)$, and we are interested in how fast this probability tends to zero as $n \to \infty$. At this stage, we recall Markov's inequality. Let $Z$ be a nonnegative random variable with density $p(z)$. Then for any $t > 0$,
$$E[Z] = \int_0^t z\, p(z)\, dz + \int_t^\infty z\, p(z)\, dz \;\ge\; \int_t^\infty z\, p(z)\, dz \;\ge\; t \int_t^\infty p(z)\, dz = t\, P(Z \ge t),$$
so that
$$P(Z \ge t) \le \frac{E[Z]}{t}, \qquad \text{and likewise} \qquad P(Z^2 \ge t^2) \le \frac{E[Z^2]}{t^2}.$$
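As a quick sanity check on Markov's inequality, the following minimal sketch compares the exact tail probability to the bound $E[Z]/t$; the exponential distribution is just a convenient nonnegative example, not something used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Z ~ Exponential(1), so E[Z] = 1 and P(Z >= t) = exp(-t).
samples = rng.exponential(scale=1.0, size=1_000_000)

for t in [1.0, 2.0, 4.0]:
    empirical_tail = (samples >= t).mean()
    markov_bound = samples.mean() / t          # E[Z]/t, estimated from the same samples
    print(f"t={t}: P(Z>=t) ~ {empirical_tail:.4f}  <=  Markov bound {markov_bound:.4f}")
```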

Figure 1: Distribution of $\hat{R}_n(f)$ (the shaded tails have total area $P(|\hat{R}_n(f) - R(f)| > \epsilon)$).

Taking $Z = |\hat{R}_n(f) - R(f)|$ and $t = \epsilon$ in the second form of the inequality gives
$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) \;\le\; \frac{E\big[|\hat{R}_n(f) - R(f)|^2\big]}{\epsilon^2} \;=\; \frac{\mathrm{var}(\hat{R}_n(f))}{\epsilon^2} \;=\; \frac{\mathrm{var}\big(\frac{1}{n}\sum_i L_i\big)}{\epsilon^2} \;=\; \frac{\mathrm{var}(\ell(f(X), Y))}{n \epsilon^2} \;=\; \frac{\sigma_L^2}{n \epsilon^2}.$$
So the probability goes to zero at a rate of at least $1/n$. However, it turns out that this is an extremely loose bound. According to the Central Limit Theorem,
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n L_i \;\approx\; \mathcal{N}\Big(R(f), \frac{\sigma_L^2}{n}\Big) \quad \text{as } n \to \infty, \text{ in distribution.}$$
This suggests that for large values of $n$,
$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) \approx O\big(e^{-n\epsilon^2 / 2\sigma_L^2}\big).$$
That is, the Gaussian tail probability tends to zero exponentially fast in $n$.
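The gap between the $1/n$ Chebyshev-type rate and the exponential rate suggested by the CLT can be seen numerically. The following is a minimal sketch, assuming (for illustration only) 0-1 losses that are Bernoulli with $R(f) = 0.3$; it estimates the tail probability by Monte Carlo and compares it to both quantities.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 0.3          # assumed true risk R(f) for an illustrative 0-1 loss
eps = 0.05       # deviation epsilon
trials = 20000   # Monte Carlo repetitions

for n in [100, 400, 1600]:
    # Each row of losses is one draw of {L_1, ..., L_n}; row means are empirical risks.
    losses = rng.binomial(1, p, size=(trials, n))
    emp_risk = losses.mean(axis=1)
    tail = np.mean(np.abs(emp_risk - p) >= eps)

    sigma2 = p * (1 - p)                                # var(L_i) for Bernoulli(p)
    chebyshev = sigma2 / (n * eps**2)                   # Markov/Chebyshev bound: sigma_L^2/(n eps^2)
    gaussian = 2 * np.exp(-n * eps**2 / (2 * sigma2))   # CLT-style tail approximation

    print(f"n={n:5d}  tail~{tail:.4f}  Chebyshev<= {chebyshev:.4f}  Gaussian~ {gaussian:.4f}")
```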

4 A Dichotomy

Obviously, the bound based on Markov's inequality is extremely loose for large $n$. Tighter concentration inequalities can be derived using more sophisticated techniques. There is an important dichotomy at this point between the class of bounded loss functions (leading to bounded random variables $L_i$) and unbounded loss functions (leading to unbounded random variables $L_i$).

Example 1 (Bounded Loss Functions) By this we mean any loss function mapping into a bounded set, for example
$$\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1].$$
So here, for the 0-1 loss, $L_i = 0$ or $1$ and $R(f) = E[\mathbf{1}_{\{f(X) \ne Y\}}] = P(f(X) \ne Y)$.

Example 2 (Unbounded Loss Functions) Any loss function mapping into an unbounded set, for example the squared error, with $R(f) = E[(f(X) - Y)^2]$.

The case of bounded losses is simpler, since we can exploit the boundedness in a key way. Therefore, we will concentrate on bounded loss functions and classification problems first, and later we will look at unbounded losses and estimation problems.

5 Bounded Loss Functions and Chernoff's Bound

Note that for any random variable $Z$ and $t > 0$,
$$P(Z \ge t) = P(e^{sZ} \ge e^{st}) \le \frac{E[e^{sZ}]}{e^{st}}, \quad \forall s > 0,$$
by Markov's inequality applied to the nonnegative random variable $e^{sZ}$. Chernoff's bound is based on finding the value of $s$ that minimizes this upper bound. Suppose $Z$ is a sum of independent random variables; for example, say
$$Z = \sum_{i=1}^n \big(\ell(f(X_i), Y_i) - R(f)\big) = n\big(\hat{R}_n(f) - R(f)\big).$$
Then the bound becomes
$$P\Big(\sum_{i=1}^n (L_i - E[L_i]) \ge t\Big) \;\le\; e^{-st}\, E\big[e^{s \sum_i (L_i - E[L_i])}\big] \;=\; e^{-st} \prod_{i=1}^n E\big[e^{s(L_i - E[L_i])}\big],$$
where the last equality follows from independence. Thus, the problem of finding a tight bound boils down to finding a good bound for $E[e^{s(L_i - E[L_i])}]$. Chernoff ('52) first studied this situation for binary random variables. Then Hoeffding ('63) derived a more general result for arbitrary bounded random variables.

Theorem 1 (Hoeffding's Inequality) Let $Z_1, Z_2, \ldots, Z_n$ be independent bounded random variables such that $Z_i \in [a_i, b_i]$ with probability $1$. Let $S_n = \sum_{i=1}^n Z_i$. Then for any $t > 0$, we have
$$P\big(|S_n - E[S_n]| \ge t\big) \;\le\; 2\, e^{-2t^2 / \sum_{i=1}^n (b_i - a_i)^2}.$$
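As an illustration of Theorem 1, the following minimal sketch checks the bound by Monte Carlo for one assumed choice of bounded variables (uniform on $[0,1]$, so $a_i = 0$, $b_i = 1$ and $\sum_i (b_i - a_i)^2 = n$); the numbers are arbitrary and only meant to show that the bound holds.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assumption: Z_i ~ Uniform[0, 1], so a_i = 0, b_i = 1.
n = 200
t = 10.0
trials = 20000

Z = rng.uniform(0.0, 1.0, size=(trials, n))
S = Z.sum(axis=1)
deviation = np.mean(np.abs(S - n * 0.5) >= t)   # E[S_n] = n/2

hoeffding = 2 * np.exp(-2 * t**2 / n)           # Theorem 1 bound with sum_i (b_i - a_i)^2 = n
print(f"P(|S_n - E[S_n]| >= {t}) ~ {deviation:.4f}   Hoeffding bound: {hoeffding:.4f}")
```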

Application: Let $Z_i = \mathbf{1}_{\{f(X_i) \ne Y_i\}}$, as in the classification problem, so that $Z_i \in [0, 1]$ and $E[Z_i] = R(f)$. Then for a fixed $f$, it follows from Hoeffding's inequality (i.e., Chernoff's bound in this special case) that
$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) = P\Big(\frac{1}{n}\big|S_n - E[S_n]\big| \ge \epsilon\Big) = P\big(|S_n - E[S_n]| \ge n\epsilon\big) \le 2\, e^{-2(n\epsilon)^2 / n} = 2\, e^{-2 n \epsilon^2}.$$

Proof: The key to proving Hoeffding's inequality is the following upper bound: if $Z$ is a random variable with $E[Z] = 0$ and $a \le Z \le b$, then
$$E[e^{sZ}] \le e^{s^2 (b - a)^2 / 8}.$$
This upper bound is derived as follows. By the convexity of the exponential function,
$$e^{sz} \le \frac{z - a}{b - a}\, e^{sb} + \frac{b - z}{b - a}\, e^{sa}, \quad \text{for } a \le z \le b.$$

Figure 2: Convexity of the exponential function.

Thus,
$$E[e^{sZ}] \;\le\; E\Big[\frac{Z - a}{b - a}\Big] e^{sb} + E\Big[\frac{b - Z}{b - a}\Big] e^{sa} \;=\; \frac{b}{b - a}\, e^{sa} - \frac{a}{b - a}\, e^{sb}, \quad \text{since } E[Z] = 0,$$
$$=\; \big(1 - \theta + \theta e^{s(b - a)}\big) e^{-\theta s (b - a)}, \quad \text{where } \theta = \frac{-a}{b - a}.$$

Now let $u = s(b - a)$ and define
$$\phi(u) = -\theta u + \log\big(1 - \theta + \theta e^{u}\big).$$
Then we have
$$E[e^{sZ}] \le \big(1 - \theta + \theta e^{s(b - a)}\big) e^{-\theta s (b - a)} = e^{\phi(u)}.$$
To bound the right-hand side, let us express $\phi(u)$ in a Taylor series with remainder:
$$\phi(u) = \phi(0) + u\, \phi'(0) + \frac{u^2}{2}\, \phi''(v) \quad \text{for some } v \in [0, u].$$
We have
$$\phi'(u) = -\theta + \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}, \quad \text{so } \phi(0) = 0 \text{ and } \phi'(0) = 0,$$
and
$$\phi''(u) = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}} - \frac{\theta^2 e^{2u}}{(1 - \theta + \theta e^{u})^2} = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}} \Big(1 - \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}\Big) = \rho(1 - \rho),$$
where $\rho = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}$. Now, $\rho(1 - \rho)$ is maximized at $\rho = \frac{1}{2}$, so
$$\phi''(u) \le \frac{1}{4}, \qquad \phi(u) \le \frac{u^2}{8} = \frac{s^2 (b - a)^2}{8}, \qquad \text{and hence} \qquad E[e^{sZ}] \le e^{s^2 (b - a)^2 / 8}.$$
Now we can apply this upper bound to derive Hoeffding's inequality:
$$P\big(S_n - E[S_n] \ge t\big) \;\le\; e^{-st} \prod_{i=1}^n E\big[e^{s(Z_i - E[Z_i])}\big] \;\le\; e^{-st} \prod_{i=1}^n e^{s^2 (b_i - a_i)^2 / 8} \;=\; e^{-st}\, e^{s^2 \sum_i (b_i - a_i)^2 / 8} \;=\; e^{-2t^2 / \sum_i (b_i - a_i)^2},$$
by choosing $s = \frac{4t}{\sum_i (b_i - a_i)^2}$. Similarly, $P\big(E[S_n] - S_n \ge t\big) \le e^{-2t^2 / \sum_i (b_i - a_i)^2}$. Combining the two deviations completes the proof of Hoeffding's theorem.
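As a quick numerical check of the key lemma $E[e^{sZ}] \le e^{s^2(b-a)^2/8}$, here is a minimal sketch; the centered Bernoulli variable used below is an arbitrary illustrative choice, not part of the lecture.

```python
import numpy as np

# Check E[e^{sZ}] <= e^{s^2 (b-a)^2 / 8} for a centered bounded variable.
# Illustrative assumption: Z = B - p with B ~ Bernoulli(p), so E[Z] = 0 and
# Z is supported on [a, b] = [-p, 1-p], giving b - a = 1.
p = 0.3
a, b = -p, 1.0 - p

for s in [0.5, 1.0, 2.0, 4.0]:
    mgf = p * np.exp(s * (1 - p)) + (1 - p) * np.exp(s * (-p))   # exact E[e^{sZ}]
    bound = np.exp(s**2 * (b - a)**2 / 8)
    print(f"s={s}: E[e^(sZ)] = {mgf:.4f}  <=  bound {bound:.4f}")
```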

Now, we want a bound like this to hold simultaneously for all $f \in \mathcal{F}$. Let us enumerate the functions in $\mathcal{F}$ as $f_1, f_2, \ldots, f_{|\mathcal{F}|}$, where $|\mathcal{F}|$ denotes the cardinality of $\mathcal{F}$. We would like to bound the probability that $|\hat{R}_n(f) - R(f)| \ge \epsilon$ for any $f \in \mathcal{F}$. This probability is
$$P\Big(\big|\hat{R}_n(f_1) - R(f_1)\big| \ge \epsilon \ \text{ or } \ \cdots \ \text{ or } \ \big|\hat{R}_n(f_{|\mathcal{F}|}) - R(f_{|\mathcal{F}|})\big| \ge \epsilon\Big) = P\Big(\bigcup_{f \in \mathcal{F}} \big\{|\hat{R}_n(f) - R(f)| \ge \epsilon\big\}\Big)$$
$$\le \sum_{f \in \mathcal{F}} P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big), \quad \text{by the union of events bound,}$$
$$\le 2\, |\mathcal{F}|\, e^{-2 n \epsilon^2}, \quad \text{by Hoeffding's inequality.}$$
Thus, we have shown that for every $f \in \mathcal{F}$, with probability at least $1 - 2|\mathcal{F}|\, e^{-2 n \epsilon^2}$, $|\hat{R}_n(f) - R(f)| < \epsilon$. Accordingly, we can be reasonably confident in selecting $f$ from $\mathcal{F}$ based on the empirical risk function $\hat{R}_n$.
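The uniform bound also tells us how much data is needed: setting $2|\mathcal{F}|\, e^{-2n\epsilon^2} \le \delta$ and solving for $n$ gives $n \ge \log(2|\mathcal{F}|/\delta) / (2\epsilon^2)$. A minimal sketch of this calculation follows; the particular values of $|\mathcal{F}|$, $\epsilon$, and $\delta$ are arbitrary.

```python
import math

def sample_size(card_F, eps, delta):
    """Smallest n with 2*|F|*exp(-2*n*eps^2) <= delta, from the union bound plus Hoeffding."""
    return math.ceil(math.log(2 * card_F / delta) / (2 * eps**2))

# Arbitrary illustrative values, not from the lecture.
for card_F in [10, 1000, 10**6]:
    n = sample_size(card_F, eps=0.05, delta=0.01)
    print(f"|F| = {card_F:>7}: need n >= {n} samples for eps=0.05, delta=0.01")
```

Note that the required sample size grows only logarithmically in $|\mathcal{F}|$, which is why the union bound remains useful even for fairly large finite classes.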