Lecture 9: Boosting
Akshay Krishnamurthy
October 3, 2017


1 Recap

Last week we discussed some algorithmic aspects of machine learning. We saw one very powerful family of learning algorithms, namely nonparametric methods, which make very weak assumptions on the data-generating distribution but consequently have poor generalization error/convergence rates. These methods tend to have low approximation errors but extremely high estimation errors. Then we saw some principled ways to manage the approximation/estimation tradeoff. Today we will see another powerful class of learning algorithms that tends to have small approximation error in practice.

2 Weak Learning

Boosting refers to the technique of aggregating many predictors together to form one (hopefully better) predictor. There are many boosting algorithms for many different settings, and the theory here is quite rich. Boosting arose to positively answer a question about weak versus strong learning from the computational learning theory literature:

Definition (Weak PAC learning). A hypothesis class $H$ is weakly PAC learnable with edge $\gamma$ if there exists a function $n_H^{\mathrm{weak}} : (0,1) \to \mathbb{N}$ and an algorithm such that for every $\delta \in (0,1)$ and for every realizable data distribution, if the algorithm is run on $n_H^{\mathrm{weak}}(\delta)$ samples from the distribution, then with probability at least $1 - \delta$ it produces a predictor $\hat{h}$ with $R(\hat{h}) \le 1/2 - \gamma$.

At face value, this is much weaker than the previous definition of (realizable) PAC learning that we saw before. In particular, with the association $\epsilon = 1/2 - \gamma$, the guarantee applies only at this single accuracy level and $n_H^{\mathrm{weak}}(\delta)$ carries no dependence on $1/\epsilon$, so it is not a polynomial sample complexity bound for (strong) PAC learning. On the other hand, we saw that for classification, a hypothesis class is PAC learnable if and only if its VC dimension is finite. This applies equally for weak learning, again using the association $\epsilon = 1/2 - \gamma$ in the sample complexity lower bound.

The main advantage of weak learning is computational. When we are learning a class $H$ for which we might not know how to find the ERM efficiently, the idea is to instead use a different class $B$ for which (a) any ERM in $B$ satisfies the weak learning property for the original problem, and (b) we can find an ERM for $B$ in polynomial time. Thinking about it this way, the algorithm may not be able to find a hypothesis with error rate $\epsilon$, but it can always find a hypothesis with error rate 0.45, which is better than random.

Example (Decision stumps). Decision stumps are similar to the decision lists that we saw on the homework. The decision stump class is
$$B_{DS} = \{x \mapsto \mathrm{sign}(\theta - x_i) \cdot b \mid \theta \in \mathbb{R},\ i \in [d],\ b \in \{-1, +1\}\}.$$
These are functions that pick an axis, split the data on that axis, and assign the positive label to one side and the negative label to the other. These are essentially axis-aligned thresholds. It is easy to find the ERM for decision stumps in polynomial time: for each coordinate, sort the data according to that coordinate's value and compute the risk of the threshold between each consecutive pair of points, then take the minimum over all coordinates and thresholds.
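The sort-and-scan procedure just described can be written out directly. Below is a minimal sketch (not part of the original notes); the names fit_stump, stump_predict, and the weight argument w are my own, and the weighted version is included because it is what a boosting loop will need later.

```python
import numpy as np

def fit_stump(X, y, w=None):
    """ERM over decision stumps h(x) = sign(theta - x_i) * b via sort-and-scan.

    X: (n, d) features; y: (n,) labels in {-1, +1};
    w: (n,) nonnegative weights summing to 1 (uniform if None).
    Returns (coordinate i, threshold theta, sign b, weighted error).
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n) if w is None else w
    best = (0, 0.0, 1, np.inf)
    for i in range(d):
        order = np.argsort(X[:, i])
        xs, ys, ws = X[order, i], y[order], w[order]
        # With b = +1, points below theta are predicted +1 and points above -1.
        # Start with theta below every point, so everything is predicted -1.
        err_plus = ws[ys == 1].sum()
        for j in range(n + 1):
            if j == 0:
                theta = xs[0] - 1.0
            elif j == n:
                theta = xs[-1] + 1.0
            else:
                theta = 0.5 * (xs[j - 1] + xs[j])  # ties between duplicates ignored
            for b, err in ((1, err_plus), (-1, 1.0 - err_plus)):
                if err < best[3]:
                    best = (i, theta, b, err)
            if j < n:
                # Raising theta past xs[j] flips point j's prediction to +1 (for b = +1).
                err_plus += ws[j] if ys[j] == -1 else -ws[j]
    return best

def stump_predict(stump, X):
    """Evaluate a fitted stump on the rows of X."""
    i, theta, b, _ = stump
    return np.where(theta - X[:, i] > 0, b, -b)
```

Each coordinate costs $O(n\log n)$ for the sort plus $O(n)$ for the scan, so the whole ERM runs in $O(dn\log n)$ time.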

A natural question is: suppose a class $H$ can be $\gamma$-weakly learned in a computationally efficient manner. Is there then a way to strongly PAC-learn this class in a computationally efficient manner? Boosting provides a positive resolution to this question.

Theorem. Let $H$ be any concept class and let $B$ be another hypothesis class. If $H$ is efficiently weakly PAC learnable using $B$, then $H$ is also efficiently strongly PAC learnable using a hypothesis class derived from $B$. This hypothesis class consists of ternary majority trees with leaves from $B$.

We will not prove this theorem (it is quite complicated). At a high level, the idea is to combine weak learners, trained on different data distributions, in a clever way that boosts the accuracy of the predictions. This idea has been turned into a much more practical algorithm called AdaBoost, which itself has many interesting theoretical properties.

3 Boosting

Let us assume for now that we have a weak learning algorithm with edge $\gamma$. We would like to use this algorithm iteratively to compute a globally good predictor. A natural idea is to first train a weak predictor on the empirical distribution over the samples, then see where the weak predictor is making mistakes and train a predictor that focuses on just this part. Intuitively, if we aggregate these predictors together, we should do well on the entire domain. This leads to the AdaBoost algorithm, which formalizes this intuition.

We are given samples $\{(x_i, y_i)\}_{i=1}^n$ where $y_i \in \{-1, +1\}$ and $x_i \in \mathcal{X}$, some abstract instance space. We also have a weak learning algorithm Alg. The algorithm will operate on distributions over, or re-weightings of, the training set, specified by $D_t \in \mathbb{R}^n_+$ with $\sum_{i=1}^n D_t(i) = 1$. With this re-weighting, the empirical error of hypothesis $h_t$ is
$$\epsilon_t = R_{D_t}(h_t) = \sum_{i=1}^n D_t(i)\,\mathbf{1}\{h_t(x_i) \ne y_i\}. \tag{1}$$

The algorithm operates iteratively. We start with $D_1(i) = 1/n$ for all $i$. For each round $t = 1, \ldots, T$:

1. Let $h_t$ be the weak learner trained on $D_t$.
2. Let $\epsilon_t$ be as in Eq. (1). Set $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
3. Update
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} \;=\; \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t},$$
where $Z_t$ normalizes the distribution.

After $T$ iterations we output the weighted majority hypothesis $H(x) = \mathrm{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.

Let us try to understand the algorithm intuitively first. If the weak learning assumption holds, then $\epsilon_t < 1/2$, and this means that $\alpha_t > 0$. With this in mind, the re-weighting scheme clearly downweights points that $h_t$ labels correctly and upweights points that $h_t$ gets wrong. Based on this, if the weak classifier at round $t+1$ is to also satisfy the weak learning assumption, then it must focus its attention on the examples that $h_t$ got wrong, which will help us when we do the aggregation. In particular, it is not too hard to check that the error rate of $h_t$ on the distribution $D_{t+1}$ is exactly $1/2$ (we verify this below, after the proof of Theorem 3), and this means that the weak learner, if it is to maintain its edge, must choose a different hypothesis on the next round.

There are many theoretical properties of boosting that are worthy of exploration; indeed, there is a whole textbook dedicated to it! Today we will discuss just a few of them.
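The loop above is short enough to state in code. The following is a minimal sketch (not from the notes); weak_learner is a placeholder for any routine that returns a classifier with small weighted error on the current distribution $D_t$, and n_rounds plays the role of $T$.

```python
import numpy as np

def adaboost(X, y, weak_learner, n_rounds):
    """Run AdaBoost for n_rounds rounds.

    X: (n, d) features; y: (n,) labels in {-1, +1}.
    weak_learner(X, y, D) must return a callable h with h(X) in {-1, +1}^n.
    Returns the weighted majority vote H(x) = sign(sum_t alpha_t h_t(x)).
    """
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    hypotheses, alphas = [], []
    for t in range(n_rounds):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = np.sum(D * (pred != y))            # Eq. (1): weighted training error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against eps in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)        # down-weight correct, up-weight mistakes
        D /= D.sum()                             # divide by Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def H(X_new):
        F = sum(a * h(X_new) for a, h in zip(alphas, hypotheses))
        return np.where(F >= 0, 1, -1)
    return H
```

With the stump ERM sketched earlier as the weak learner, one would pass a wrapper such as weak_learner = lambda X_, y_, D: (lambda Z: stump_predict(fit_stump(X_, y_, D), Z)); by Theorem 3 below, the training error of the returned predictor drops to zero once $T$ exceeds roughly $\ln(n)/(2\gamma^2)$.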

Theorem 3 (Training error). Let $\gamma_t = 1/2 - \epsilon_t$ be the edge at round $t$. Then
$$\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{H(x_i) \ne y_i\} \;\le\; \prod_{t=1}^T \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\left(-2\sum_{t=1}^T \gamma_t^2\right).$$

Remark 4. Under a slightly weaker notion than the weak learning assumption (we just need it on the empirical distribution), we are guaranteed that $\gamma_t \ge \gamma$ for all rounds $t$. Since a training error that is strictly less than $1/n$ must be exactly zero, this means that after $\frac{\ln n}{2\gamma^2}$ rounds, AdaBoost finds a predictor with zero training error.

Remark 5. Of course, this says nothing about the generalization performance of AdaBoost, which is what we are ultimately interested in. We will cover this shortly.

Proof. The intuition for the proof is that any example that is predicted incorrectly after $T$ rounds has an exponentially large weight in the distribution. But since it is a distribution, it must sum to one, so you cannot be wrong on many points.

Define $F(x) = \sum_{t=1}^T \alpha_t h_t(x)$ and let us unravel the recurrence in the definition of $D_{T+1}$:
$$D_{T+1}(i) = D_1(i)\,\frac{\exp(-\alpha_1 y_i h_1(x_i))}{Z_1}\cdots\frac{\exp(-\alpha_T y_i h_T(x_i))}{Z_T} = D_1(i)\,\frac{\exp\!\left(-y_i\sum_{t=1}^T \alpha_t h_t(x_i)\right)}{\prod_{t=1}^T Z_t} = D_1(i)\,\frac{\exp(-y_i F(x_i))}{\prod_{t=1}^T Z_t}.$$

The next step is to observe that $\exp(-yF(x))$ is an upper bound on the 0/1 loss. This follows since $H(x) = \mathrm{sign}(F(x))$, and if $H(x) \ne y$ then $yF(x) \le 0$, so that $\exp(-yF(x)) \ge 1 = \mathbf{1}\{H(x) \ne y\}$. Thus we can express the empirical error as
$$\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{H(x_i) \ne y_i\} \le \frac{1}{n}\sum_{i=1}^n \exp(-y_i F(x_i)) = \sum_{i=1}^n D_{T+1}(i)\prod_{t=1}^T Z_t = \prod_{t=1}^T Z_t.$$
The last equality follows since $D_{T+1}$ is a distribution. The last step in the proof is to show that $Z_t = \sqrt{1 - 4\gamma_t^2}$, which requires analyzing the update scheme:
$$Z_t = \sum_{i=1}^n D_t(i)\exp(-\alpha_t y_i h_t(x_i)) = \sum_{i:\,y_i = h_t(x_i)} D_t(i)\,e^{-\alpha_t} + \sum_{i:\,y_i \ne h_t(x_i)} D_t(i)\,e^{\alpha_t}$$
$$= e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t = e^{-\alpha_t}(1/2 + \gamma_t) + e^{\alpha_t}(1/2 - \gamma_t) = \sqrt{1 - 4\gamma_t^2}.$$
In the last step we use the definition of $\alpha_t$, which gives $e^{\alpha_t} = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \sqrt{\frac{1/2 + \gamma_t}{1/2 - \gamma_t}}$. Finally, we use the approximation $1 + x \le e^x$ to obtain $\prod_{t=1}^T \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_{t=1}^T \gamma_t^2\right)$.
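As an aside, the earlier claim that $h_t$ has error exactly $1/2$ under the updated distribution $D_{t+1}$ follows from the same calculation of $Z_t$; here is a short verification (this step is not spelled out in the notes):
$$\sum_{i:\,h_t(x_i) \ne y_i} D_{t+1}(i) = \frac{1}{Z_t}\sum_{i:\,h_t(x_i) \ne y_i} D_t(i)\,e^{\alpha_t} = \frac{\epsilon_t\,e^{\alpha_t}}{Z_t} = \frac{\epsilon_t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}}{2\sqrt{\epsilon_t(1-\epsilon_t)}} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{2\sqrt{\epsilon_t(1-\epsilon_t)}} = \frac{1}{2},$$
using $Z_t = e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$, which is the same quantity $\sqrt{1-4\gamma_t^2}$ computed in the proof above.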

There are two natural questions you might be asking at this point. First, how do we ensure that the weak learning assumption is satisfied at every iteration, so that we can drive the training error to zero? Second, what about generalization? On our way to thinking about generalization, let us first take a detour to describe another perspective on boosting.

4 Boosting and Loss Minimization

Another deep connection for boosting is that it can be seen as an algorithm for computing the empirical risk minimizer with a particular hypothesis space and loss function. Revisiting the proof of the training error bound, we used the inequality $\mathbf{1}\{y_i \ne H(x_i)\} \le \exp(-y_i F(x_i))$; in fact, every other step in the proof was an equality! The term on the right-hand side is the empirical exponential loss of the function $F$. Rather than derive AdaBoost as an algorithm, we can think of the ERM problem where the hypothesis class consists of nonnegative combinations of base learners, i.e.,
$$\mathcal{F} = \Big\{x \mapsto \sum_{j=1}^N \alpha_j h_j(x) \;\Big|\; N \in \mathbb{N},\ \alpha \in \mathbb{R}^N_+,\ h_1, \ldots, h_N \in H\Big\},$$
and the loss function is the exponential loss:
$$\underset{f \in \mathcal{F}}{\text{minimize}}\ \ \frac{1}{n}\sum_{i=1}^n \exp(-y_i f(x_i)).$$
We have seen many problems of this form before, although not using the exponential loss and not with this strange function class. But AdaBoost can be viewed as a coordinate descent algorithm for solving this optimization problem. Specifically, consider the following procedure, which initializes $F_0 = 0$ and repeats for $T$ iterations:

1. Choose $h_t \in H$ and $\alpha_t \in \mathbb{R}$ to minimize $\sum_{i=1}^n \exp\big(-y_i(F_{t-1}(x_i) + \alpha_t h_t(x_i))\big)$.
2. Update $F_t \leftarrow F_{t-1} + \alpha_t h_t$.

If you think of $|H| = N$, this procedure is just coordinate descent on the $\alpha$ parameters, which lie in $\mathbb{R}^N_+$. At each iteration you find the coordinate (hypothesis) that helps minimize the loss function as much as possible, and then you take a step in the $\alpha$ space by adding it in to the current predictor. In fact, this is exactly what AdaBoost is doing. (A code sketch of this view appears at the end of the section.)

Theorem 6. The coordinate descent algorithm is equivalent to AdaBoost if the weak learner always minimizes the training error on the weighted distribution.

Proof. We already saw in AdaBoost's training error bound that
$$\frac{1}{n}\exp(-y_i F_{t-1}(x_i)) = D_t(i)\prod_{\tau=1}^{t-1} Z_\tau.$$
This means that
$$\frac{1}{n}\sum_{i=1}^n \exp\big(-y_i(F_{t-1}(x_i) + \alpha_t h_t(x_i))\big) = \Big(\prod_{\tau=1}^{t-1} Z_\tau\Big)\sum_{i=1}^n D_t(i)\exp(-\alpha_t y_i h_t(x_i)) = \Big(\prod_{\tau=1}^{t-1} Z_\tau\Big) Z_t.$$
Thus the optimization in the coordinate descent procedure is equivalent to minimizing the normalization factor in AdaBoost. We now show that the choices of $h_t$ and $\alpha_t$ in AdaBoost in fact minimize this. First, for a fixed $h$ with weighted error $\epsilon$, the choice $\alpha = \frac{1}{2}\ln\left(\frac{1-\epsilon}{\epsilon}\right)$ minimizes the normalization factor, since the problem amounts to minimizing $Z_t(\alpha) = e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon$. Taking the derivative and setting it to zero gives $e^{2\alpha} = \frac{1-\epsilon}{\epsilon}$, which verifies the above choice of $\alpha$; moreover, the minimum value is $2\sqrt{\epsilon(1-\epsilon)}$, which is monotonically increasing in $\epsilon$ for $\epsilon \le 1/2$. Thus we should choose the $h_t$ with minimum error on $D_t$.

This connects AdaBoost with convex optimization and loss minimization, as AdaBoost can be seen as computing an approximate ERM. This can also help us obtain generalization bounds, by thinking of AdaBoost as an ERM algorithm over the function space $\mathcal{F}$.
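To make the coordinate descent picture concrete, here is a minimal sketch (not from the notes) over a finite, explicitly enumerated pool of base hypotheses; the names coordinate_descent_boost, preds, and n_rounds are introduced here for illustration. Per Theorem 6, when the pool plays the role of the weak learner's search space and the minimum-error hypothesis is picked each round, this produces the same $(\alpha_t, h_t)$ sequence as AdaBoost.

```python
import numpy as np

def coordinate_descent_boost(preds, y, n_rounds):
    """Greedy coordinate descent on the empirical exponential loss.

    preds: (N, n) array whose row j holds h_j(x_i) in {-1, +1} for a finite
           pool of base hypotheses h_1, ..., h_N (assumed enumerable here).
    y: (n,) labels in {-1, +1}.
    Returns alpha in R^N_+ so that F = sum_j alpha[j] * h_j.
    """
    N, n = preds.shape
    alpha = np.zeros(N)
    F = np.zeros(n)                                # F_0 = 0
    for t in range(n_rounds):
        w = np.exp(-y * F)                         # unnormalized exponential-loss weights
        D = w / w.sum()                            # this is exactly D_t from AdaBoost
        # Weighted error of every base hypothesis under D_t.
        eps = (D[None, :] * (preds != y[None, :])).sum(axis=1)
        j = np.argmin(eps)                         # best coordinate, as in Theorem 6
        e = np.clip(eps[j], 1e-12, 1 - 1e-12)
        step = 0.5 * np.log((1 - e) / e)           # closed-form line search in alpha_j
        alpha[j] += step
        F += step * preds[j]                       # F_t = F_{t-1} + alpha_t h_t
    return alpha
```

Picking the minimum weighted error matches the theorem's assumption about the weak learner; a fully general coordinate descent would instead pick the hypothesis whose error is farthest from $1/2$, since the post-step loss is proportional to $2\sqrt{\epsilon(1-\epsilon)}$.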

5 Boosting and Generalization

Thinking about AdaBoost as optimizing the exponential loss over $\mathcal{F}$ gives us an immediate VC-based bound.

Theorem 7 (VC-based generalization bound). Suppose AdaBoost is run for $T$ iterations on $n$ random examples, using base classifiers from $H$ with $\mathrm{VCdim}(H) = d$, and assume $n \ge \max\{d, T\}$. With probability at least $1 - \delta$,
$$R(H) \le \hat{R}(H) + O\left(\sqrt{\frac{T\log(en/T) + dT\log(en/d) + \log(1/\delta)}{n}}\right).$$
Moreover, if $H$ has $\hat{R}(H) = 0$, then
$$R(H) \le O\left(\frac{T\log(en/T) + dT\log(en/d) + \log(1/\delta)}{n}\right).$$

The terms here should be somewhat familiar from the proof of the VC theorem. Basically, what we are going to show is that if you run AdaBoost for $T$ iterations, then you are searching over hypotheses of the form $\mathrm{sign}\left(\sum_{t=1}^T \alpha_t h_t(\cdot)\right)$, and the VC dimension of this linear combination class scales linearly with the number of coefficients $T$. This more or less proves the theorem. But if you remember the Rademacher complexity problem on the homework, this should be unsatisfying: here we are taking a convex combination of base learners and somehow paying for it, while there we did not pay for the number of hidden units. We will see next time how to get a much better generalization bound that is independent of the number of iterations.

Proof. Let $S$ be a training sample of size $n$ and let $H_S$ be the restriction of $H$ to $S$, as we have seen before. We know by the Sauer-Shelah lemma that $|H_S| \le (en/d)^d$ since $\mathrm{VCdim}(H) = d$ (using the fact that $n \ge d$ for the approximation to kick in). Let us choose a fixed sequence of $T$ of these restrictions, which we call $h_1, \ldots, h_T$. From these, create a modified sample $S'$ where $x'_i = (h_1(x_i), \ldots, h_T(x_i))$ is a $T$-dimensional binary vector. Once we commit to these $T$ hypotheses, the aggregation step is equivalent to applying a linear separator to this augmented dataset. Since linear separators in $T$ dimensions have VC dimension $T$, the number of labelings of the augmented dataset is at most $(en/T)^T$. Thus the total number of labelings is at most
$$\tau(n) \le \left(\frac{en}{T}\right)^T\left(\frac{en}{d}\right)^{dT}.$$
Plugging this in for the growth function in the usual proof of the VC theorem gives the result.

Proposition 8. In $T$ dimensions, the set of linear thresholds $h(x) = \mathrm{sign}(\langle w, x\rangle)$ has VC dimension $T$.

Note that these are thresholds through the origin (there is no offset term), which is why we are not getting $T + 1$ as we predicted earlier.

Proof. That there exist $T$ points that can be shattered is obvious if we use the standard basis vectors as the points. The other direction is less obvious. If we have $T + 1$ points, then they must be linearly dependent, so in particular we can write $x_{T+1} = \sum_{t=1}^T \beta_t x_t$. If this set can be shattered, then we can find a predictor $w$ that labels each $x_t$ with the sign of $\beta_t$ and predicts $h(x_{T+1}) = -1$, i.e., $\langle w, x_{T+1}\rangle < 0$. This yields a contradiction:
$$0 = \Big\langle w, \sum_{t=1}^T \beta_t x_t - x_{T+1}\Big\rangle = \sum_{t=1}^T \beta_t\langle w, x_t\rangle - \langle w, x_{T+1}\rangle > 0,$$
since each term $\beta_t\langle w, x_t\rangle$ is nonnegative and $-\langle w, x_{T+1}\rangle > 0$.
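To connect the counting argument to the stated bound, here is a sketch of the final plug-in step, with constants suppressed since they depend on the exact form of the VC theorem used earlier in the course (an assumption on my part). The growth-function version of the VC theorem states that, with probability at least $1-\delta$, every $f$ in the class satisfies
$$R(f) \le \hat{R}(f) + O\left(\sqrt{\frac{\log\tau(2n) + \log(1/\delta)}{n}}\right),$$
and taking logarithms in the bound on $\tau(n)$ gives
$$\log\tau(n) \le T\log\frac{en}{T} + dT\log\frac{en}{d},$$
which yields the first inequality of Theorem 7. The second inequality follows in the same way from the realizable (zero empirical risk) version of the VC bound, whose deviation term scales as $\frac{\log\tau(2n) + \log(1/\delta)}{n}$ without the square root.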
