Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Tolstikhin Ilya

Abstract. In this lecture we will prove the VC-bound, which provides a high-probability excess risk bound for the ERM algorithm when performing binary classification over classes of finite VC dimension. This result generalizes the agnostic bound for finite classes, discussed in the previous lecture. Most of the material follows the exposition of Bousquet et al. (2004). I also invite the interested students to think about questions marked with blue. You won't get extra points for them, but you will certainly get a better understanding of the material.

Let's recall the setting and some basic facts. We have an input space X and an output space Y := {0, 1}. There is an unknown probability distribution P over X × Y. We receive a training sample S := (X_i, Y_i)_{i=1}^n of i.i.d. input-output pairs from P. We fix a set of classifiers H. We denote the expected risk for any h ∈ H as

    L(h) := P_{(X,Y)∼P}{ h(X) ≠ Y }

and the empirical risk as

    L_n(h) := (1/n) Σ_{i=1}^n 1[h(X_i) ≠ Y_i].

We introduce the Empirical Risk Minimization (ERM) algorithm ĥ := ĥ(S, H):

    L_n(ĥ) = inf_{g∈H} L_n(g).

We will require the following concentration inequality, introduced in the second lecture:

Theorem 1 (Hoeffding's inequality). Let ξ_1, ..., ξ_n be independent random variables such that ξ_i ∈ [a_i, b_i], a_i, b_i ∈ R, for i = 1, ..., n, with probability one. Denote Z_n := Σ_{i=1}^n ξ_i. Then for any ε > 0 it holds that:

    P{ Z_n − E[Z_n] ≥ ε } ≤ exp( −2ε² / Σ_{i=1}^n (b_i − a_i)² ).

The same inequality holds for P{ E[Z_n] − Z_n ≥ ε }. Moreover,

    P{ |Z_n − E[Z_n]| ≥ ε } ≤ 2 exp( −2ε² / Σ_{i=1}^n (b_i − a_i)² ).

Show that the third inequality of Theorem 1 follows simply from the first two ones. The union bound is our favourite trick!
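
To make the definitions concrete, here is a small illustrative Python sketch (everything in it, the distribution P, the finite class of threshold classifiers h_t(x) = 1[x > t], and all numeric values, is made up for the example): it draws a training sample S, evaluates the empirical risk L_n(h_t) for each classifier in the class, and returns the empirical risk minimizer ĥ.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    # A made-up distribution P over X x Y: X uniform on [0, 1],
    # Y = 1[X > 0.3] with the label flipped with probability 0.1.
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (X > 0.3).astype(int)
    flip = rng.uniform(size=n) < 0.1
    Y = np.where(flip, 1 - Y, Y)

    # A finite class H = {h_1, ..., h_N} of thresholds h_t(x) = 1[x > t].
    thresholds = np.linspace(0.0, 1.0, 21)            # N = 21 classifiers

    def empirical_risk(t, X, Y):
        # L_n(h_t) = (1/n) * sum_i 1[h_t(X_i) != Y_i]
        predictions = (X > t).astype(int)
        return np.mean(predictions != Y)

    risks = np.array([empirical_risk(t, X, Y) for t in thresholds])
    t_erm = thresholds[np.argmin(risks)]              # ERM: minimize L_n over H
    print("ERM threshold:", t_erm, "empirical risk:", risks.min())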

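The statement of Theorem 1 can likewise be checked by simulation (again a purely illustrative sketch with made-up numbers: ξ_i uniform on [0, 1], so Σ_i (b_i − a_i)² = n); the Monte Carlo frequency of the deviation should stay below the Hoeffding bound.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials, eps = 50, 50_000, 5.0

    # xi_i independent and bounded in [0, 1]; here E[Z_n] = n / 2.
    xi = rng.uniform(0.0, 1.0, size=(trials, n))
    Z = xi.sum(axis=1)

    empirical = np.mean(Z - n * 0.5 >= eps)
    bound = np.exp(-2 * eps ** 2 / n)     # exp(-2 eps^2 / sum_i (b_i - a_i)^2)
    print("P{Z_n - E[Z_n] >= eps} ~", empirical, "  Hoeffding bound:", bound)
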
2 Agnostic bound for finite classes

Let's shortly recall the agnostic excess risk bound for finite classes, introduced in the second lecture. We will provide a slightly modified proof leading to minor changes in the constant factors:

Theorem 2. Assume H = {h_1, ..., h_N}. Then for any δ > 0, with probability larger than 1 − δ the following holds:

    L(ĥ) ≤ min_{i=1,...,N} L(h_i) + √( 2 (log N + log(2/δ)) / n ).    (Th2)

Proof. For our further discussion it will be useful to recall the idea behind a proof. Assuming h* is the minimizer of the expected risk over H we may write:

    L(ĥ) − L(h*) = L(ĥ) − L_n(ĥ) + L_n(ĥ) − L_n(h*) + L_n(h*) − L(h*)
                 ≤ L(ĥ) − L_n(ĥ) + L_n(h*) − L(h*)                                  (*)
                 ≤ sup_{h∈H} ( L(h) − L_n(h) ) + sup_{h∈H} ( L_n(h) − L(h) )
                 ≤ 2 sup_{h∈H} | L(h) − L_n(h) |.                                    (1)

Next we write:

    P{ sup_{h∈H} |L(h) − L_n(h)| ≥ ε } = P{ ∪_{i=1,...,N} { |L(h_i) − L_n(h_i)| ≥ ε } }    (2)
                                        ≤ Σ_{i=1}^N P{ |L(h_i) − L_n(h_i)| ≥ ε },         (3)

where we used the union bound in the last line. We may now apply Hoeffding's inequality of Theorem 1 and get:

    P{ sup_{h∈H} |L(h) − L_n(h)| ≥ ε } ≤ N · 2 e^{−2nε²} = 2N e^{−2nε²}.

We want the rhs of the previous inequality to be smaller than δ. In other words, we want to find ε such that:

    δ = 2N e^{−2nε²}.

Solving the equation for ε we get:

    ε = √( log(2N/δ) / (2n) ).

Note that for this choice of ε we have P{ sup_{h∈H} |L(h) − L_n(h)| ≥ ε } ≤ δ, or equivalently P{ sup_{h∈H} |L(h) − L_n(h)| < ε } ≥ 1 − δ. In other words, with probability larger than 1 − δ we have

    sup_{h∈H} |L(h) − L_n(h)| ≤ √( log(2N/δ) / (2n) ).

Inserting this bound back to (1) we conclude the proof.
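
To get a feeling for the size of the excess-risk term in (Th2), one can simply evaluate it numerically (an illustration with made-up values of N, n and δ, using the constants exactly as they appear in (Th2)):

    import numpy as np

    def finite_class_excess_term(N, n, delta):
        # sqrt(2 * (log N + log(2 / delta)) / n), the last term of (Th2)
        return np.sqrt(2.0 * (np.log(N) + np.log(2.0 / delta)) / n)

    for n in [100, 1_000, 10_000, 100_000]:
        print(n, finite_class_excess_term(N=1000, n=n, delta=0.05))

The printed values shrink like 1/√n: the excess-risk term vanishes as the sample size grows, which is exactly the behaviour discussed again at the end of the lecture.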

Try to slightly improve this result. You may replace √( 2 (log N + log(2/δ)) / n ) in the upper bound with √( (log N + log(2/δ)) / (2n) ) + √( log(2/δ) / (2n) ). For this, get back to (*) and do something smarter. Notice that h* does not depend on S, so why upper bound the last two terms with a supremum?

3 One step further: infinite classes H, VC-bound

The main goal of this lecture is to drop the assumption of Theorem 2 that the class H is finite. Now we assume that H may be infinite. Actually, there can be uncountably many classifiers in H (just think about linear classifiers in R^d, or simply about thresholds in one dimension).

3.1 Spoiler

Before even introducing all the necessary definitions, let us start with the statement of the theorem which we are going to prove.

Theorem 3 (VC-bound). For any δ > 0, with probability larger than 1 − δ it holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (log S_H(2n) + log(4/δ)) / n ).

Compare this bound to (Th2). It looks almost the same, but N is replaced with S_H(2n), a quantity known as the growth function, which will be introduced later in the proof. For now it is instructive to note the similarity between these two results: perhaps it means that we can proceed with the same (or almost the same) proof, where, magically, the N events appearing on lines (2)–(3) will be eventually replaced with S_H(2n) events? It turns out that this is indeed the case! In the following we present the proof of Theorem 3.

3.2 Debugging the proof

Can we still repeat the proof of Theorem 2? Let's assume for now that there is h* ∈ H such that L(h*) = inf_{g∈H} L(g). (Show that generally this is not true.) It turns out that we can still repeat the first steps, but we can no longer apply the union bound. Indeed, the union bound P(∪_i A_i) ≤ Σ_i P(A_i) holds at most for a countable set of events A_i. In our case, as we already mentioned, we may end up with uncountably many events. In summary, we can not apply step (2)–(3) any more.

Let's try to find a workaround. What is actually causing the problem? Note that L_n(h) appearing in lines (2) and (3) still takes only finitely many values as h runs through H (prove this yourself!). If we had only L_n(h) appearing inside of the probability sign in (2), we could still enumerate all the different values of L_n(h), get back to finitely many events, and proceed with all the previous steps. The real problem is the L(h) term, which also appears in the events of (2). In principle, L(h) can take any value between 0 and 1 for h ∈ H (prove this yourself!). This is the reason we may end up with uncountably many events. Fortunately, the following nontrivial inequality helps us to get rid of the adversarial L(h) term:

Lemma 4 (Symmetrization inequality). Assume S' := (X'_i, Y'_i)_{i=1}^n is an independent copy of S, that is, S ∪ S' forms a sequence of 2n i.i.d. input-output pairs distributed according to P. Denote

    L'_n(h) := (1/n) Σ_{i=1}^n 1[h(X'_i) ≠ Y'_i].

Then for any ε > 0 such that nε² ≥ 2, it holds that:

    P_S{ sup_{h∈H} ( L(h) − L_n(h) ) ≥ ε } ≤ 2 P_{S,S'}{ sup_{h∈H} ( L'_n(h) − L_n(h) ) ≥ ε/2 }.
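
The "finitely many values" observation can be made tangible with a small sketch for the (hypothetical, chosen only for illustration) class of one-dimensional thresholds h_t(x) = 1[x > t]: although there are uncountably many thresholds t, on a fixed sample of size n they produce at most n + 1 distinct loss vectors (1[h_t(X_1) ≠ Y_1], ..., 1[h_t(X_n) ≠ Y_n]), so L_n(h_t) takes only a handful of distinct values.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10
    X = rng.uniform(size=n)
    Y = rng.integers(0, 2, size=n)

    # All thresholds lying between two consecutive sorted points behave identically,
    # so it suffices to try one threshold per gap (plus one below all points).
    candidate_ts = np.concatenate(([-np.inf], np.sort(X)))

    loss_vectors = set()
    for t in candidate_ts:
        predictions = (X > t).astype(int)
        loss_vectors.add(tuple((predictions != Y).astype(int)))

    print("sample size n:", n)
    print("distinct loss vectors:", len(loss_vectors))          # at most n + 1
    print("distinct values of L_n:", sorted({sum(v) / n for v in loss_vectors}))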

The same inequality also holds for sup_{h∈H} ( L_n(h) − L(h) ).

3.3 Modifying the proof: getting rid of L(h)

Now, let us return to the beginning and try to apply this result:

    L(ĥ) − L(h*) = L(ĥ) − L_n(ĥ) + L_n(ĥ) − L_n(h*) + L_n(h*) − L(h*)
                 ≤ L(ĥ) − L_n(ĥ) + L_n(h*) − L(h*)
                 ≤ sup_{h∈H} ( L(h) − L_n(h) ) + sup_{h∈H} ( L_n(h) − L(h) ).

As we already know, if for two events A and B it holds that A ⊆ B, then necessarily P(A) ≤ P(B). This gives us

    P{ L(ĥ) − L(h*) ≥ ε } ≤ P{ sup_{h∈H} ( L(h) − L_n(h) ) + sup_{h∈H} ( L_n(h) − L(h) ) ≥ ε }.    (4)

Also note that, by the same reason, for any random variables a and b we have

    P{ a + b ≥ ε } ≤ P{ {a ≥ ε/2} ∪ {b ≥ ε/2} } ≤ P{ a ≥ ε/2 } + P{ b ≥ ε/2 }.

Applying this to (4) and using Lemma 4 we get:

    P{ L(ĥ) − L(h*) ≥ ε } ≤ P{ sup_{h∈H} ( L(h) − L_n(h) ) ≥ ε/2 } + P{ sup_{h∈H} ( L_n(h) − L(h) ) ≥ ε/2 }
                          ≤ 4 P_{S,S'}{ sup_{h∈H} ( L'_n(h) − L_n(h) ) ≥ ε/2 }.    (5)

At this point, note that no matter what h is, L'_n(h) − L_n(h) can take only finitely many values (prove this yourself!). The value of L'_n(h) − L_n(h) depends only on the projection of H on the double sample S ∪ S', where for any sample S_m := (X_j, Y_j)_{j=1}^m we define a projection in the following way:

    H_{S_m} := { ( 1[h(X_1) ≠ Y_1], 1[h(X_2) ≠ Y_2], ..., 1[h(X_m) ≠ Y_m] ) : h ∈ H } ⊆ {0, 1}^m.

Note that H_{S∪S'} is a subset of {0, 1}^{2n}, and thus its cardinality card(H_{S∪S'}) is upper bounded by 2^{2n}. We may write

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 P_{S,S'}{ max_{v ∈ H_{S∪S'}} ( L'_n(v) − L_n(v) ) ≥ ε/2 },

where we have overloaded the notations L_n(v) and L'_n(v) in a natural way. All in all, it seems like we may now proceed with the original steps (2)–(3) to bound the rhs of the previous inequality, since the supremum is now over a finite set. This is indeed what we did during the lecture, but the thing is, this step is not quite correct. Notice that the union bound assumes that the events A_i are fixed. In our case, there are finitely many events A_v := { L'_n(v) − L_n(v) ≥ ε/2 } indexed by v, but they all depend on the random samples S and S', so the union bound (at least in its usual form) can not be applied.
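
For intuition about how loose the trivial bound card(H_{S∪S'}) ≤ 2^{2n} can be, here is a quick count on a double sample for the same illustrative threshold class as before (all data made up):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 12
    XX = rng.uniform(size=2 * n)             # features of the double sample S u S'
    YY = rng.integers(0, 2, size=2 * n)      # labels of the double sample

    candidate_ts = np.concatenate(([-np.inf], np.sort(XX)))
    projection = {tuple(((XX > t).astype(int) != YY).astype(int)) for t in candidate_ts}

    print("card(H_{S u S'}) =", len(projection))    # at most 2n + 1 = 25 for thresholds
    print("trivial bound 2^(2n) =", 2 ** (2 * n))   # 16,777,216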

3.4 Another neat trick: Rademacher symmetrization

Instead, we will proceed with a trick commonly known as Rademacher symmetrization. The next lines are taken from Section 12.4 of Devroye et al. (1996). Introduce random variables σ_1, ..., σ_n which are all independent (also independent from S and S') and take values −1 and +1 with probabilities 0.5. Rewrite (5) in the following way:

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 P_{S,S'}{ sup_{h∈H} (1/n) Σ_{i=1}^n ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 }

and notice that the distribution of

    sup_{h∈H} (1/n) Σ_{i=1}^n ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] )

is the same as the distribution of

    sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] )

(prove this yourself!). We may thus write

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 P_{S,S'}{ sup_{h∈H} (1/n) Σ_{i=1}^n ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 }
                          = 4 P_{σ,S,S'}{ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 }.

Next we use the tower rule of expectation, which can be written for any event A and any random variable Z as P(A) = E_Z[ P(A | Z) ]. This gives us

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 E_{S,S'}[ P_σ{ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 | S ∪ S' } ].

It is left to bound the conditional probability appearing inside of the expected value. Using our definition of the projection we may rewrite

    P_σ{ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 | S ∪ S' }
        = P_σ{ max_{v ∈ H_{S∪S'}} (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' },

where we once again (perhaps confusingly) used v_i and v'_i to denote the indicators 1[h_v(X_i) ≠ Y_i] and 1[h_v(X'_i) ≠ Y'_i], where h_v ∈ H is any classifier with projection equal to v. Notice that, because we conditioned on S and S', these sets are now fixed, and thus the projection H_{S∪S'} is not random any more, but instead just some fixed subset of {0, 1}^{2n}. We may now safely use our initial (2)–(3) trick (the union bound) and write

    P_σ{ max_{v ∈ H_{S∪S'}} (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' } ≤ Σ_{v ∈ H_{S∪S'}} P_σ{ (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' }.

The individual probabilities may be again bounded using Hoeffding's inequality (prove it yourself!):

    P_σ{ (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' } ≤ exp( −2 (nε/2)² / (4n) ) = e^{−nε²/8}.
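
The Hoeffding step above is easy to check by simulation for one fixed pair of vectors v, v' (an illustrative sketch; the two vectors and all numbers are arbitrary and made up). Conditionally on S and S' only the signs σ_i are random, and the Monte Carlo frequency should stay below e^{−nε²/8}.

    import numpy as np

    rng = np.random.default_rng(4)
    n, eps, trials = 100, 0.4, 50_000

    # Two fixed 0/1 vectors playing the role of the projected loss vectors v and v'.
    v = rng.integers(0, 2, size=n)
    v_prime = rng.integers(0, 2, size=n)
    diff = v_prime - v                        # entries in {-1, 0, +1}

    # Only the Rademacher signs are random now.
    sigma = rng.choice([-1, 1], size=(trials, n))
    stat = (sigma * diff).mean(axis=1)        # (1/n) sum_i sigma_i (v'_i - v_i)

    print("Monte Carlo estimate of P_sigma{stat >= eps/2}:", np.mean(stat >= eps / 2))
    print("Hoeffding bound e^{-n eps^2 / 8}:", np.exp(-n * eps ** 2 / 8))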

3.5 VC combinatorics

Putting all the bits together we finally get:

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 e^{−nε²/8} E_{S,S'}[ card(H_{S∪S'}) ].

Again, making the upper bound equal to δ and solving for ε, we get that for any δ > 0, with probability larger than 1 − δ it holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (log E_H(2n) + log(4/δ)) / n ),

where we denoted E_H(n) := E_S[ card(H_S) ]. The quantity E_H(n) is known as the VC entropy. Obviously, the VC entropy can be upper bounded in the following (perhaps extremely crude) way:

    E_H(n) ≤ S_H(n) := sup_{S : card(S) = n} card(H_S).

All we did is replace the average (expectation) with the maximum value. The quantity S_H(n) is commonly known as the growth function. We showed that with probability larger than 1 − δ it also holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (log S_H(2n) + log(4/δ)) / n ).

This concludes the proof of Theorem 3.

But are we satisfied with this result? The good thing about Theorem 2 is that, as the sample size n grows to infinity, the last term on the rhs of (Th2) decreases to zero, showing that the performance of ERM achieves the best possible one. Does Theorem 3 have the same behaviour? Of course, the answer depends on the growth function S_H(2n), which is defined purely by the geometry of H. As we already mentioned, the trivial upper bound gives S_H(2n) ≤ 2^{2n}. However, if we insert it in the VC-bound we end up with 2 √( 2 (2n log 2 + log(4/δ)) / n ), which does not tend to zero. An important question is: how should H look like so that log S_H(2n)/n → 0 as n → ∞? The answer to this question is hidden in the following definition:

Definition 5 (VC dimension). The VC dimension of the class H is the largest n such that S_H(n) = 2^n. If there is no such n, we say that H has infinite VC dimension.
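
Definition 5 can be tried out by brute force on simple classes. The sketch below (illustrative only; the threshold class and the sampling scheme are made up) estimates S_H(n) for one-dimensional thresholds h_t(x) = 1[x > t] by maximizing the number of distinct labelings over random samples; for this class any sample of n distinct points yields exactly n + 1 labelings, so the largest n with S_H(n) = 2^n is n = 1 and the VC dimension equals 1.

    import numpy as np

    def growth_function_thresholds(n, trials=50):
        # Estimate S_H(n) for h_t(x) = 1[x > t] by maximizing the number of
        # distinct labelings over random samples of size n.
        rng = np.random.default_rng(5)
        best = 0
        for _ in range(trials):
            X = rng.uniform(size=n)
            ts = np.concatenate(([-np.inf], np.sort(X)))
            labelings = {tuple((X > t).astype(int)) for t in ts}
            best = max(best, len(labelings))
        return best

    for n in range(1, 6):
        print(n, growth_function_thresholds(n), 2 ** n)   # S_H(n) = n + 1 vs 2^n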

The following fact establishes the polynomial growth of S_H(n) for classes H of finite VC dimension. (There is a curious history behind this lemma: it was apparently proved simultaneously by several groups around the late 60's and early 70's, including Vapnik and Chervonenkis, Sauer, and Shelah and Perles. A wonderful overview of this fact can be found in Léon Bottou's slides, available online at http://leon.bottou.org/.)

Lemma 6 (Vapnik, Chervonenkis, Sauer, Shelah). Let H be a class of VC dimension d < ∞. Then for all n it holds that

    S_H(n) ≤ Σ_{i=0}^d (n choose i),

and for all n ≥ d it holds that:

    S_H(n) ≤ (en/d)^d.

We may finally state the following bound, which behaves exactly like the one of the original Theorem 2:

Theorem 7 (VC-bound). Assume H has a VC dimension d < ∞. For any δ > 0, with probability larger than 1 − δ it holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (d log(2en/d) + log(4/δ)) / n ).

References

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 2004.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.