COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire          Lecture #10
Scribe: José Simões Ferreira    March 6, 2013

In the last lecture the concept of Rademacher complexity was introduced, with the goal of showing that for all f in a family of functions F we have Ê_S[f] ≈ E[f]. Let us summarize the definitions of interest:

    F = family of functions f : Z → [0, 1]
    S = ⟨z_1, ..., z_m⟩, a sample of size m
    R̂_S(F) = E_σ[ sup_{f ∈ F} (1/m) Σ_{i=1}^m σ_i f(z_i) ]
    R_m(F) = E_S[ R̂_S(F) ]

We also began proving the following theorem:

Theorem. With probability at least 1 − δ, for all f ∈ F:

    E[f] ≤ Ê_S[f] + 2 R_m(F) + O(√(ln(1/δ) / m))
    E[f] ≤ Ê_S[f] + 2 R̂_S(F) + O(√(ln(1/δ) / m))

which we now prove in full.

Proof. Let us define:

    Φ(S) = sup_{f ∈ F} ( E[f] − Ê_S[f] ),

where

    E[f] = E_{z∼D}[f(z)]    and    Ê_S[f] = (1/m) Σ_{i=1}^m f(z_i).

Step 1. With probability at least 1 − δ,

    Φ(S) ≤ E_S[Φ(S)] + O(√(ln(1/δ) / m)).

This was proven last lecture and follows from McDiarmid's inequality.

Step 2.

    E_S[Φ(S)] ≤ E_{S,S'}[ sup_{f ∈ F} ( Ê_{S'}[f] − Ê_S[f] ) ].

This was also shown last lecture. We also considered generating new samples T, T' by flipping a coin, i.e. running through i = 1, ..., m we flip a coin, swapping z_i with z'_i if heads, and doing nothing otherwise. We then claimed that the distributions thus generated are distributed the same as S and S', and we noted

    Ê_{S'}[f] − Ê_S[f] = (1/m) Σ_{i=1}^m ( f(z'_i) − f(z_i) )
which means we can write

    Ê_{T'}[f] − Ê_T[f] = (1/m) Σ_{i=1}^m σ_i ( f(z'_i) − f(z_i) ),

which is written in terms of Rademacher random variables σ_i. We now proceed with the proof.

Step 3. We first claim

    E_{S,S'}[ sup_{f ∈ F} ( Ê_{S'}[f] − Ê_S[f] ) ] = E_{S,S',σ}[ sup_{f ∈ F} (1/m) Σ_{i=1}^m σ_i ( f(z'_i) − f(z_i) ) ].

To see this, note that the right hand side is effectively the same expectation as the left hand side, but with respect to T and T', which are identically distributed to S and S'. Now we can write

    E_{S,S',σ}[ sup_f (1/m) Σ_i σ_i ( f(z'_i) − f(z_i) ) ]
        ≤ E_{S,S',σ}[ sup_f (1/m) Σ_i σ_i f(z'_i) ] + E_{S,S',σ}[ sup_f (1/m) Σ_i (−σ_i) f(z_i) ],

where we are just maximizing over the sums separately. We now note two points:

1. The random variable −σ_i has the same distribution as σ_i;
2. The expectation over S is irrelevant in the first term, since the term inside the expectation does not depend on S. Similarly, the expectation over S' is irrelevant in the second term.

Therefore

    E_{S,S',σ}[ sup_f (1/m) Σ_i σ_i f(z'_i) ] = E_{S',σ}[ sup_f (1/m) Σ_i σ_i f(z'_i) ]
        = E_{S'} E_σ[ sup_f (1/m) Σ_i σ_i f(z'_i) ]
        = R_m(F)

and, similarly,

    E_{S,S',σ}[ sup_f (1/m) Σ_i (−σ_i) f(z_i) ] = R_m(F).
Step 4. We have thus shown

    E_{S,S'}[ sup_{f ∈ F} ( Ê_{S'}[f] − Ê_S[f] ) ] ≤ 2 R_m(F).

Chaining our results together, we obtain, with probability at least 1 − δ,

    Φ(S) = sup_{f ∈ F} ( E[f] − Ê_S[f] ) ≤ 2 R_m(F) + O(√(ln(1/δ) / m)).

We conclude that with probability at least 1 − δ, for all f ∈ F,

    E[f] − Ê_S[f] ≤ Φ(S) ≤ 2 R_m(F) + O(√(ln(1/δ) / m)).

Therefore, with probability at least 1 − δ, for all f ∈ F,

    E[f] ≤ Ê_S[f] + 2 R_m(F) + O(√(ln(1/δ) / m)).

This is one of the results we were seeking. Proving the result for R̂_S(F) is just a matter of applying McDiarmid's inequality to obtain, with probability at least 1 − δ,

    R_m(F) ≤ R̂_S(F) + O(√(ln(1/δ) / m)).

1 Motivation

The original motivation behind the theorem above was to obtain a relationship between generalization error and training error. We want to be able to say that, with probability at least 1 − δ, for all h ∈ H:

    err(h) ≤ êrr(h) + small term.

We note that err(h) is evocative of E[f] and êrr(h) is evocative of Ê_S[f], which appear in our theorem. Let us write

    err(h) = Pr_{(x,y)∼D}[ h(x) ≠ y ] = E_{(x,y)∼D}[ 1{h(x) ≠ y} ]
    êrr(h) = (1/m) Σ_{i=1}^m 1{h(x_i) ≠ y_i} = Ê_S[ 1{h(x) ≠ y} ]

as per our definitions. We see that, to fit our definition, we must work with functions f which are indicator functions. Let us define Z = X × {−1, +1} and, for h ∈ H:

    f_h(x, y) = 1{h(x) ≠ y}.
Now we can write:

    E_{(x,y)∼D}[ 1{h(x) ≠ y} ] = E[f_h]
    Ê_S[ 1{h(x) ≠ y} ] = Ê_S[f_h]
    F_H = { f_h : h ∈ H }.

This allows us to use our theorem to state that, with probability at least 1 − δ, for all h ∈ H:

    err(h) ≤ êrr(h) + 2 R_m(F_H) + O(√(ln(1/δ) / m))
    err(h) ≤ êrr(h) + 2 R̂_S(F_H) + O(√(ln(1/δ) / m)).

We want to write the above in terms of the Rademacher complexity of H, which we can do by looking at the definition of Rademacher complexity. We have

    R̂_S(F_H) = E_σ[ sup_{f_h ∈ F_H} (1/m) Σ_{i=1}^m σ_i f_h(x_i, y_i) ].

Now, our functions f_h are just indicator functions and can be written

    f_h(x_i, y_i) = (1 − y_i h(x_i)) / 2.

Further, we are indexing each function by a function h ∈ H. Therefore, we can just index the supremum with h ∈ H instead of f_h ∈ F_H. Writing this out gives

    R̂_S(F_H) = E_σ[ sup_{h ∈ H} (1/m) Σ_{i=1}^m σ_i (1 − y_i h(x_i)) / 2 ]
              = (1/2) E_σ[ (1/m) Σ_{i=1}^m σ_i + sup_{h ∈ H} (1/m) Σ_{i=1}^m (−y_i σ_i) h(x_i) ].

Because σ_i is a Rademacher random variable, its expectation is just 0. For the second term, we note that because the sample S is fixed, the y_i's are fixed, and therefore the term −y_i σ_i is distributed the same as σ_i. Hence, we conclude

    R̂_S(F_H) = 0 + (1/2) E_σ[ sup_{h ∈ H} (1/m) Σ_{i=1}^m σ_i h(x_i) ] = (1/2) R̂_S(H).

We have therefore shown

    err(h) ≤ êrr(h) + R_m(H) + O(√(ln(1/δ) / m))
    err(h) ≤ êrr(h) + R̂_S(H) + O(√(ln(1/δ) / m)).
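The identity R̂_S(F_H) = (1/2) R̂_S(H) can be sanity-checked numerically by sampling the σ_i directly. The sketch below is my own illustration, not part of the notes: it uses the maximally rich class of all ±1 labelings of m = 4 points, chosen because both quantities are easy to predict for it (for every σ, the supremum over H is exactly 1, so the two Monte Carlo estimates should come out near 1 and 1/2).

```python
import itertools
import random

def rademacher_hat(vals_per_f, m, n_trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f_i ],
    where vals_per_f[k][i] is the k-th function's value on the i-th point."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(sigma, vs)) / m
                     for vs in vals_per_f)
    return total / n_trials

m = 4
y = [+1, -1, +1, -1]                       # arbitrary fixed labels for S
# H restricted to S: every {-1,+1} labeling of the m points.
H = list(itertools.product((-1, 1), repeat=m))
# F_H on S: the 0/1 loss values f_h(x_i, y_i) = (1 - y_i h(x_i)) / 2.
F_H = [tuple((1 - yi * hi) // 2 for yi, hi in zip(y, h)) for h in H]

r_H = rademacher_hat(H, m)                 # sup equals 1 for every sigma here
r_FH = rademacher_hat(F_H, m)
print(r_H, r_FH)                           # r_FH should be near r_H / 2
```

Note that the first term (1/m)Σ σ_i in the derivation vanishes only in expectation, so the two estimates agree up to Monte Carlo noise rather than trial by trial.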
2 Obtaining other bounds

It was alluded to in class that obtaining the above bounds in terms of Rademacher complexity subsumes other bounds previously shown, which can be demonstrated with an example. We first state a simple theorem (a slightly weaker version of this theorem will be proved in a later homework assignment).

Theorem. For |H| < ∞:

    R̂_S(H) ≤ √( 2 ln |H| / m ).

Now consider again the definition of empirical Rademacher complexity:

    R̂_S(H) = E_σ[ sup_{h ∈ H} (1/m) Σ_{i=1}^m σ_i h(x_i) ].

We see that it only depends on how the hypotheses behave on the fixed set S. We therefore have a finite set of behaviors on the set. Define H' ⊆ H, where H' is composed of one representative from H for each possible labeling of the sample set S by H. Therefore

    |H'| = |Π_H(S)| ≤ Π_H(m).

Since the complexity only depends on the behaviors on S, we claim

    R̂_S(H) = E_σ[ sup_{h ∈ H'} (1/m) Σ_{i=1}^m σ_i h(x_i) ] = R̂_S(H').

We can now use the theorem stated above to write

    R̂_S(H) = R̂_S(H') ≤ √( 2 ln |Π_H(S)| / m ).

Finally, we recall that after proving Sauer's lemma, we showed Π_H(m) ≤ (em/d)^d for m ≥ d, where d is the VC-dimension. Therefore

    R̂_S(H) ≤ √( 2d ln(em/d) / m ).

We have thus used the Rademacher complexity results to get an upper bound for the case of infinite |H| in terms of VC-dimension.

3 Boosting

Up until this point, the PAC learning model we have been considering requires that we be able to learn to arbitrary accuracy. Thus, the problem we have been dealing with is:
Strong learning. C is strongly PAC-learnable if there exists an algorithm A such that for all distributions D, for all c ∈ C, for all ε > 0 and δ > 0: A, given m = poly(1/ε, 1/δ, ...) examples, computes h such that

    Pr[ err(h) ≤ ε ] ≥ 1 − δ.

But what if we can only find an algorithm that gives slightly better than an even chance of error (e.g. 40%)? Could we use it to develop a better algorithm, iteratively improving our solution to arbitrary accuracy? We want to consider the following problem:

Weak learning. C is weakly PAC-learnable if there exists γ > 0 and an algorithm A such that for all distributions D, for all c ∈ C, for all δ > 0: A, given m = poly(1/δ, ...) examples, computes h such that

    Pr[ err(h) ≤ 1/2 − γ ] ≥ 1 − δ.

We note that in this problem we no longer require arbitrary accuracy, but only that the hypothesis picked by the algorithm be able to do slightly better than random guessing, with high probability. The natural question that arises is whether weak learning is equivalent to strong learning.

Consider first the simpler case of a fixed distribution D. In this case, the answer to our question is no, which we can illustrate through a simple example.

Example: For fixed D, define X = {0, 1}^n ∪ {z}. D picks z with probability 1/4 and with probability 3/4 picks uniformly from {0, 1}^n. Let C = { all concepts over X }. In a training sample, we expect to see z with high probability, and therefore z will be correctly learned by the algorithm. However, the remaining points are exponential in n, so that with only poly(1/ε, 1/δ, ...) examples, we are unlikely to do much better than even chance on the rest of the domain. We therefore expect the error to be given roughly by

    err(h) ≈ (1/2) · (3/4) + 0 · (1/4) = 3/8,

in which case C is weakly learnable, but not strongly learnable.

We wish to prove that in the general case of an arbitrary distribution the following theorem holds:

Theorem. Strong and weak learning are equivalent under the PAC learning model.

The way we will reach this result is by developing a boosting algorithm which constructs a strong learning algorithm from a weak learning algorithm.
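The rough calculation err(h) ≈ 3/8 in the example can be checked by simulation. The sketch below is my own, not from the notes; the memorizing learner, the pseudorandom target concept, and the sizes n = 20, m = 100 are all illustrative choices.

```python
import random

rng = random.Random(0)
n, m, trials = 20, 100, 20000

def concept(x):
    # A fixed pseudorandom target labeling; stable within a single run.
    return 1 if hash((x, "target")) % 2 else -1

Z = "z"                                   # the special heavy point

def draw():
    # D: probability 1/4 on z, otherwise uniform over {0,1}^n (as ints).
    return Z if rng.random() < 0.25 else rng.randrange(2 ** n)

# Learner: memorize the training sample, guess fairly at random elsewhere.
memo = {x: concept(x) for x in (draw() for _ in range(m))}

errors = 0
for _ in range(trials):
    x = draw()
    pred = memo.get(x)
    if pred is None:
        pred = rng.choice((-1, 1))
    errors += (pred != concept(x))

print(errors / trials)                    # should land near 3/8 = 0.375
```

With m = 100 draws, z is in the sample except with probability (3/4)^100, so essentially all the error comes from coin-flipping on the unseen uniform points, matching the (1/2)(3/4) term.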
3.1 The boosting problem

The challenge faced by the boosting algorithm can be defined by the following problem.

Boosting problem. Given:

    (x_1, y_1), ..., (x_m, y_m) with y_i ∈ {−1, +1}
    access to a weak learner A: for all distributions D, given examples from D, A computes h such that
        Pr[ err_D(h) ≤ 1/2 − γ ] ≥ 1 − δ.

Goal: find H such that, with high probability, err_D(H) ≤ ε for any fixed ε.

Figure 1: Schematic representation of boosting algorithm.

The main idea behind the boosting algorithm is to produce a number of different distributions D_t from D, using the sample provided. This is necessary because running A on the same sample alone will not, in general, be enough to produce an arbitrarily accurate hypothesis (certainly so if A is deterministic). A boosting algorithm will therefore run as follows:

Boosting algorithm.

    for t = 1, ..., T:
        run A on D_t to get weak hypothesis h_t : X → {−1, +1}
        ε_t = err_{D_t}(h_t) = 1/2 − γ_t, where γ_t ≥ γ
    end
    output H, where H is a combination of the weak hypotheses h_1, ..., h_T

In the above, the distributions D_t are distributions on the indices 1, ..., m, and may vary from round to round. It is by adjusting these distributions that the boosting algorithm will be able to achieve high accuracy. Intuitively, we want to pick the distributions D_t such that, on each round, they provide us with more information about the points in the sample
that are hard to learn. The boosting algorithm can be seen schematically in Figure 1. Let us define:

    D_t(i) = D_t(x_i, y_i).

We pick the distributions as follows:

    for all i:  D_1(i) = 1/m
    D_{t+1}(i) = (D_t(i) / Z_t) · e^{α_t}    if h_t(x_i) ≠ y_i
    D_{t+1}(i) = (D_t(i) / Z_t) · e^{−α_t}   if h_t(x_i) = y_i

where α_t > 0 and Z_t is a normalization factor. Intuitively, all our examples are considered equally in the first round of boosting. Going forward, if an example is misclassified, its weight in the next round will increase, while the weights of the correctly classified examples will decrease, so that the classifier will focus on the examples which have proven harder to classify correctly.
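The reweighting rule above is straightforward to implement. The following is a minimal sketch of my own, not from the notes; the notes leave α_t unspecified beyond α_t > 0, so the value α = 0.5 below is an arbitrary positive choice, and the hypothetical round-1 predictions are made up for illustration.

```python
import math

def boost_update(D, preds, labels, alpha):
    # D_{t+1}(i) = D_t(i)/Z_t * exp(+alpha) if h_t(x_i) != y_i,
    #              D_t(i)/Z_t * exp(-alpha) if h_t(x_i) == y_i.
    unnorm = [d * math.exp(alpha if p != y else -alpha)
              for d, p, y in zip(D, preds, labels)]
    Z = sum(unnorm)          # Z_t: normalizer keeping D_{t+1} a distribution
    return [w / Z for w in unnorm]

m = 4
D1 = [1 / m] * m             # round 1: uniform over the sample
labels = [+1, +1, -1, -1]
preds = [+1, -1, -1, -1]     # hypothetical h_1 misclassifies example 2 only
D2 = boost_update(D1, preds, labels, alpha=0.5)
print(D2)                    # the misclassified example's weight increases
```

Dividing by Z_t is what makes the multiplicative reweighting a valid probability distribution for the next round, so misclassified points necessarily gain weight relative to correctly classified ones.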