6.867 Machine learning, lecture 13 (Jaakkola)
Lecture topics: boosting, margin, and gradient descent; complexity of classifiers, generalization

Boosting

Last time we arrived at a boosting algorithm for sequentially creating an ensemble of base classifiers. Our base classifiers were decision stumps, i.e., simple linear classifiers relying on a single component of the input vector. The stump classifiers can be written as h(x; θ) = sign( s (x_k − θ_0) ), where θ = {s, k, θ_0} and s ∈ {−1, 1} specifies the label to predict on the positive side of x_k = θ_0. Figure 1 below shows a possible decision boundary from a stump when the input vectors are only two dimensional.

Figure 1: A possible decision boundary from a trained decision stump. The stump in the figure depends only on the vertical x_2-axis.

The boosting algorithm combines the stumps (as base learners) into an ensemble that, after m rounds of boosting, takes the following form

  h_m(x) = Σ_{j=1}^m α̂_j h(x; θ̂_j)    (1)

where α̂_j ≥ 0 but they do not necessarily sum to one. We can always normalize the ensemble by Σ_{j=1}^m α̂_j after the fact. The simple AdaBoost algorithm can be written in the following modular form:
(0) Set W_0(t) = 1/n for t = 1, ..., n.

(1) At stage m, find a base learner h(x; θ̂_m) that approximately minimizes

  − Σ_{t=1}^n W_{m−1}(t) y_t h(x_t; θ) = 2ε_m − 1    (2)

where ε_m is the weighted classification error (zero-one loss) on the training examples, weighted by the normalized weights W_{m−1}(t).

(2) Given θ̂_m, set

  α̂_m = 0.5 log( (1 − ε̂_m) / ε̂_m )    (3)

where ε̂_m is the weighted error corresponding to θ̂_m chosen in step (1). For binary {−1, 1} base learners, α̂_m exactly minimizes the weighted training loss (loss of the ensemble):

  J(α_m, θ̂_m) = Σ_{t=1}^n W_{m−1}(t) exp( −y_t α_m h(x_t; θ̂_m) )    (4)

In cases where the base learners are not binary (e.g., return values in the interval [−1, 1]), we would have to minimize Eq.(4) directly.

(3) Update the weights on the training examples based on the new base learner:

  W_m(t) = (1/c_m) W_{m−1}(t) exp( −y_t α̂_m h(x_t; θ̂_m) )    (5)

where c_m is the normalization constant that ensures the weights W_m(t) sum to one after the update. The new weights can again be interpreted as normalized losses for the new ensemble h_m(x_t) = h_{m−1}(x_t) + α̂_m h(x_t; θ̂_m).

Let's try to understand the boosting algorithm from several different perspectives. First of all, there are several different types of errors (errors here refer to zero-one classification losses, not the surrogate exponential losses). We can talk about the weighted error of the base learner, introduced at the m-th boosting iteration, relative to the weights W_{m−1}(t) on the training examples. This is the weighted training error ε̂_m in the algorithm. We can also measure the weighted error of the same base classifier relative to the updated weights,
i.e., relative to W_m(t). In other words, we can measure how well the current base learner would do at the next iteration. Finally, in terms of the ensemble, we have the unweighted training error and the corresponding generalization (test) error, as a function of boosting iterations. We will discuss each of these in turn.

Weighted error. The weighted error ε̂_m achieved by a new base learner h(x; θ̂_m) relative to W_{m−1}(t) tends to increase with m, i.e., with each boosting iteration (though not monotonically). Figure 2 below shows this weighted error ε̂_m as a function of boosting iterations. The reason for this is that, since the weights concentrate on examples that are difficult to classify correctly, subsequent base learners face harder classification tasks.

Figure 2: Weighted error ε̂_m as a function of m.

Weighted error relative to updated weights. We claim that the weighted error of the base learner h(x; θ̂_m) relative to the updated weights W_m(t) is exactly 0.5. This means that the base learner introduced at the m-th boosting iteration will be useless (at chance level) for the next boosting iteration. So the boosting algorithm would never introduce the same base learner twice in a row. The same learner might, however, reappear later on (relative to a different set of weights). One reason for this is that we don't go back and update the α̂_j's for base learners already introduced into the ensemble. So the only way to change the previously set coefficients is to reintroduce the base learners.

Let's now see that the claim is indeed true. We can equivalently show that the weighted agreement relative to W_m(t) is exactly zero:

  Σ_{t=1}^n W_m(t) y_t h(x_t; θ̂_m) = 0    (6)
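Before verifying this analytically, both the modular algorithm above and this claim can be checked numerically. The following is a minimal NumPy sketch, not code from the lecture; the helper names (stump_predict, fit_stump, adaboost) are invented for illustration, and the stump search simply tries thresholds just below each observed coordinate value.

```python
import numpy as np

def stump_predict(X, s, k, theta0):
    """Evaluate h(x; theta) = sign(s (x_k - theta0)) on each row of X."""
    p = np.sign(s * (X[:, k] - theta0))
    p[p == 0] = 1.0                      # break ties toward +1
    return p

def fit_stump(X, y, W):
    """Step (1): pick (s, k, theta0) approximately minimizing the weighted error."""
    best = None
    for k in range(X.shape[1]):
        for theta0 in np.unique(X[:, k]) - 1e-9:
            for s in (-1.0, 1.0):
                eps = np.sum(W[stump_predict(X, s, k, theta0) != y])
                if best is None or eps < best[0]:
                    best = (eps, s, k, theta0)
    return best

def adaboost(X, y, rounds):
    """The modular AdaBoost loop; assumes each round finds 0 < eps < 1/2."""
    n = len(y)
    W = np.full(n, 1.0 / n)              # step (0): uniform weights
    ensemble = []
    for _ in range(rounds):
        eps, s, k, theta0 = fit_stump(X, y, W)
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # step (2), Eq.(3)
        h = stump_predict(X, s, k, theta0)
        W = W * np.exp(-y * alpha * h)            # step (3), Eq.(5), unnormalized
        W = W / W.sum()                           # divide by c_m
        # Eq.(6): the just-added learner is at chance under the new weights.
        assert abs(np.sum(W * y * h)) < 1e-9
        ensemble.append((alpha, s, k, theta0))
    return ensemble

def ensemble_predict(ensemble, X):
    f = sum(a * stump_predict(X, s, k, t) for a, s, k, t in ensemble)
    return np.sign(f)
```

For instance, on the one-dimensional points 0, 1, 2 with labels +1, −1, +1, no single stump is consistent with all three labels, yet three rounds of boosting already reach zero training error, and the in-loop assertion confirms Eq.(6) at every round.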
Consider the optimization problem for α_m after we have already found θ̂_m:

  J(α_m, θ̂_m) = Σ_{t=1}^n W_{m−1}(t) exp( −y_t α_m h(x_t; θ̂_m) )    (7)

The derivative of J(α_m, θ̂_m) with respect to α_m has to be zero at the optimal value α̂_m, so that

  d J(α_m, θ̂_m) / d α_m |_{α_m = α̂_m} = − Σ_{t=1}^n W_{m−1}(t) exp( −y_t α̂_m h(x_t; θ̂_m) ) y_t h(x_t; θ̂_m)    (8)
  = − c_m Σ_{t=1}^n W_m(t) y_t h(x_t; θ̂_m) = 0    (9)

where we have used Eq.(5) to move from W_{m−1}(t) to W_m(t). So the result is an optimality condition for α_m.

Ensemble training error. The training error of the ensemble does not necessarily decrease monotonically with each boosting iteration. The exponential loss of the ensemble does, however, decrease monotonically. This should be evident since it is exactly the loss we are sequentially minimizing by adding the base learners. We can also quantify, based on the weighted error achieved by each base learner, how much the exponential loss decreases after each iteration. We will need this to relate the training loss to the classification error. In fact, the factor by which the training loss decreases after iteration m is exactly c_m, the normalization constant for the updated weights (we have to normalize the weights precisely because the exponential loss over the training examples decreases). Note also that c_m is exactly J(α̂_m, θ̂_m). Now,

  J(α̂_m, θ̂_m) = Σ_{t=1}^n W_{m−1}(t) exp( −y_t α̂_m h(x_t; θ̂_m) )    (10)
  = Σ_{t: y_t = h(x_t; θ̂_m)} W_{m−1}(t) exp( −α̂_m ) + Σ_{t: y_t ≠ h(x_t; θ̂_m)} W_{m−1}(t) exp( α̂_m )    (11)
  = (1 − ε̂_m) exp( −α̂_m ) + ε̂_m exp( α̂_m )    (12)
  = (1 − ε̂_m) √( ε̂_m / (1 − ε̂_m) ) + ε̂_m √( (1 − ε̂_m) / ε̂_m )    (13)
  = 2 √( ε̂_m (1 − ε̂_m) )    (14)

Note that this is always less than one for any ε̂_m < 1/2. The training loss of the ensemble after m boosting iterations is exactly the product of these terms (renormalizations). In
other words,

  (1/n) Σ_{t=1}^n exp( −y_t h_m(x_t) ) = Π_{k=1}^m 2 √( ε̂_k (1 − ε̂_k) )    (15)

This, and the observation that

  step(−z) ≤ exp(−z)    (16)

for all z, where the step function step(z) = 1 if z > 0 and zero otherwise, suffices for our purposes. A simple upper bound on the training error of the ensemble, err(h_m), follows from

  err(h_m) = (1/n) Σ_{t=1}^n step( −y_t h_m(x_t) )    (17)
  ≤ (1/n) Σ_{t=1}^n exp( −y_t h_m(x_t) )    (18)
  = Π_{k=1}^m 2 √( ε̂_k (1 − ε̂_k) )    (19)

Thus the exponential loss over the training examples is an upper bound on the training error, and this upper bound goes down monotonically with m provided that the base learners are learning something at each iteration (their weighted errors are less than half). Figure 3 shows the training error as well as the upper bound as a function of the boosting iterations.

Ensemble test error. We have so far discussed only training errors. The goal, of course, is to generalize well. What can we say about the generalization error of an ensemble generated by the boosting algorithm? We have repeatedly tied the generalization error to some notion of margin. The same is true here. Consider Figure 5 below. It shows a typical plot of the ensemble training error and the corresponding generalization error as a function of boosting iterations. Two things are notable in the plot. First, the generalization error seems to decrease (slightly) even after the ensemble has reached zero training error. Why should this be? The second surprising thing seems to be the fact that the generalization error does not increase even after a large number of boosting iterations. In other words, the boosting algorithm appears to be somewhat resistant to overfitting. Let's try to explain these two (related) observations.

The votes {α̂_j} generated by the boosting algorithm won't sum to one. We will therefore renormalize the ensemble

  h_m(x) = ( α̂_1 h(x; θ̂_1) + ... + α̂_m h(x; θ̂_m) ) / ( α̂_1 + ... + α̂_m )    (20)
so that h_m(x) ∈ [−1, 1]. As a result, we can define a voting margin for training examples as margin(t) = y_t h_m(x_t). The margin is positive if the example is classified correctly by the ensemble. It represents the degree to which the base classifiers agree with the correct classification decision (a negative value indicates disagreement). Note that margin(t) ∈ [−1, 1]. It is a very different type of margin (voting margin) than the geometric margin we have discussed in the context of linear classifiers.

Figure 3: The training error of the ensemble as well as the corresponding exponential loss (upper bound) as a function of the boosting iterations.

Now, in addition to the training error err(h_m), we can define a margin error err(h_m; ρ) that is the fraction of example margins that are at or below the threshold ρ. Clearly, err(h_m) = err(h_m; 0). We now claim that the boosting algorithm, even after err(h_m; 0) = 0, will decrease err(h_m; ρ) for larger values of ρ > 0. Figures 4a-b provide an empirical illustration that this is indeed happening. This is perhaps easy to understand as a consequence of the fact that the exponential loss, exp(−margin(t)), decreases as a function of the margin, even after the margin is positive.

The second issue to explain is the apparent resistance to overfitting. One reason is that the complexity of the ensemble does not increase very quickly as a function of the number of base learners. We will make this statement more precise later on. Moreover, the boosting iterations modify the ensemble in sensible ways (increasing the margin) even after the training error is zero. We can also relate the margin, or the margin error err(h_m; ρ), directly to the generalization error. Another reason for the resistance to overfitting is that the sequential procedure for optimizing the exponential loss is not very effective. We would overfit much more quickly if we reoptimized the {α_j}'s jointly rather than through the sequential procedure (see the discussion of boosting as gradient descent below).

Boosting as gradient descent. We can also view the boosting algorithm as a simple
gradient descent procedure (with line search) in the space of discriminant functions.

Figure 4: The margin errors err(h_m; ρ) as a function of ρ when a) m = 10 and b) m = 50.

To understand this we can view each base learner h(x; θ) as a vector based on evaluating it on each of the n training examples:

  h(θ) = [ h(x_1; θ), ..., h(x_n; θ) ]^T    (21)

The ensemble vector h_m, obtained by evaluating h_m(x) at each of the training examples, is a positive combination of the base learner vectors:

  h_m = Σ_{j=1}^m α̂_j h(θ̂_j)    (22)

The exponential loss objective we are trying to minimize is now a function of the ensemble vector h_m and the training labels. Suppose we have h_{m−1}. To minimize the objective, we can select a useful direction, h(θ̂_m), along which the objective seems to decrease. This is exactly how we derived the base learners. We can then find the minimum of the objective by moving in this direction, i.e., evaluating vectors of the form h_{m−1} + α_m h(θ̂_m). This is a line search operation. The minimum is attained at α̂_m, we obtain h_m = h_{m−1} + α̂_m h(θ̂_m), and the procedure can be repeated. Viewing the boosting algorithm as a simple gradient descent procedure also helps us understand why it can overfit if we continue with the boosting iterations.
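The line-search step can be checked numerically. The sketch below uses made-up weights and base-learner outputs: minimizing J(α) = Σ_t W(t) exp(−y_t α h(x_t; θ̂)) over a grid of α values recovers the closed-form step size α̂ = 0.5 log((1 − ε̂)/ε̂) of Eq.(3).

```python
import numpy as np

# Fixed, illustrative quantities: normalized weights W(t) and the products
# y_t * h(x_t; theta_hat), which are +1 on correct examples, -1 on mistakes.
W = np.array([0.1, 0.2, 0.3, 0.4])
yh = np.array([1.0, 1.0, -1.0, 1.0])
eps = np.sum(W[yh < 0])                    # weighted error = 0.3

# Line search: evaluate J(alpha) on a fine grid and take the minimizer.
alphas = np.linspace(0.0, 2.0, 20001)
J = np.array([np.sum(W * np.exp(-a * yh)) for a in alphas])
alpha_search = alphas[np.argmin(J)]

# Closed form from Eq.(3).
alpha_closed = 0.5 * np.log((1.0 - eps) / eps)
print(alpha_search, alpha_closed)          # agree to grid resolution
```

Since J(α) is convex in α for binary base learners, the grid minimizer and the analytic stationary point coincide up to the grid spacing.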
Figure 5: The training error of the ensemble as well as the corresponding generalization error as a function of boosting iterations.

Complexity and generalization

We have approached classification problems using linear classifiers, probabilistic classifiers, as well as ensemble methods. Our goal is to understand what type of performance guarantees we can give for such methods based on finite training data. This is a core theoretical question in machine learning. For this purpose we will need to understand in detail what complexity means in terms of classifiers. A single classifier is never complex or simple; complexity is a property of the set of classifiers or the model. Each model selection criterion we have encountered provided a slightly different definition of model complexity. Our focus here is on performance guarantees that will eventually relate the margin we can attain to the generalization error, especially for linear classifiers (geometric margin) and ensembles (voting margin).

Let's start by motivating the complexity measure we need for this purpose with an example. Consider a simple decision stump classifier restricted to coordinate x_1 of 2-dimensional input vectors x = [x_1, x_2]^T. In other words, we consider stumps of the form

  h(x; θ) = sign( s (x_1 − θ_0) )    (23)

where s ∈ {−1, 1} and θ_0 ∈ R, and call this set of classifiers F_1. Example decision boundaries are displayed in Figure 6.

Suppose we are getting the data points in a sequence and we are interested in seeing when our predictions for future points become constrained by the labeled points we have already
seen. Such constraints pertain to both the data and the set of classifiers F_1.

Figure 6: Possible decision boundaries corresponding to decision stumps that rely only on coordinate x_1. The arrow (normal) to the boundary specifies the positive side.

See Figure 6e. Having seen the labels for the first two points, − and +, all classifiers h ∈ F_1 that are consistent with these two labeled points have to predict + for the next point. Since the labels we have seen force us to classify the new point in only one way, we can claim to have learned something. We can also understand this as a limitation of (the complexity of) our set of classifiers. Figure 6f illustrates an alternative scenario where we can find two classifiers h_1 ∈ F_1 and h_2 ∈ F_1, both consistent with the first two labels in the figure, but making different predictions for the new point. We could therefore classify the new point either way. Recall that this freedom is not available for all label assignments to the first two points. So, the stumps in F_1 can classify any two points (in general position) in all possible ways (Figures 6a-d) but are already partially constrained in how they assign labels to three points (Figure 6e). In more technical terms we say that F_1 shatters (can generate all possible labelings over) two points but not three.

Similarly, for linear classifiers in 2 dimensions, all the eight possible labelings of three points can be obtained with linear classifiers (Figure 7a). Thus linear classifiers in two dimensions shatter three points. However, there are labelings over four points that no linear classifier can produce (Figure 7b).

VC-dimension. As we increase the number of data points, the set of classifiers we are considering may no longer be able to label the points in all possible ways. Such emerging constraints are critical to be able to predict labels for new points. This motivates a key
measure of complexity of the set of classifiers, the Vapnik-Chervonenkis dimension. The VC-dimension is defined as the maximum number of points that a set of classifiers can shatter. So, the VC-dimension of F_1 is two, denoted as d_VC(F_1) = 2, and the VC-dimension of linear classifiers on the plane is three. Note that the definition involves a maximum over the possible points. In other words, we may find particular sets of fewer than d_VC points that the set of classifiers cannot shatter (e.g., linear classifiers with points exactly on a line in 2-d), but there cannot be any set of more than d_VC points that the classifiers can shatter.

Figure 7: Linear classifiers on the plane can shatter three points a) but not four b).

The VC-dimension of the set of linear classifiers in d dimensions is d + 1, i.e., the number of parameters. This is not a useful result for understanding how kernel methods work. For example, the VC-dimension of linear classifiers using the radial basis kernel is infinite. We can incorporate the notion of margin in the VC-dimension, however. This is known as the V_γ dimension. The V_γ dimension of the set of linear classifiers that attain geometric margin γ, when the examples lie within an enclosing sphere of radius R, is bounded by R²/γ². In other words, there are not that many points we can label in all possible ways if any valid labeling has to be achieved with margin γ. This result is independent of the dimension of the classifier, and is exactly the mistake bound for the perceptron algorithm!
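The shattering claims about F_1 can be verified by brute force. The sketch below (function names invented for illustration) enumerates the labelings realizable by single-coordinate stumps and compares them against all ±1 labelings of the given points:

```python
import itertools
import numpy as np

def stump_labels(points, s, theta0):
    """Labels assigned by h(x; theta) = sign(s (x - theta0)) to 1-d points."""
    lab = np.sign(s * (np.asarray(points) - theta0))
    lab[lab == 0] = 1.0          # break ties toward +1
    return tuple(lab)

def shatters(points):
    """True if some single-coordinate stump realizes every +/-1 labeling."""
    # Candidate thresholds: below, between, and above the sorted points;
    # any other threshold induces one of the same labelings.
    pts = sorted(points)
    cands = ([pts[0] - 1.0]
             + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
             + [pts[-1] + 1.0])
    realized = {stump_labels(points, s, t)
                for s in (-1.0, 1.0) for t in cands}
    return all(lab in realized
               for lab in itertools.product((-1.0, 1.0), repeat=len(points)))

print(shatters([0.0, 1.0]))       # all 4 labelings of two points achievable
print(shatters([0.0, 1.0, 2.0]))  # the labeling (+, -, +) is impossible
```

This confirms d_VC(F_1) = 2 for distinct 1-d points: every labeling of two points is realizable, while the alternating labeling of three points is not.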
More informationCS 70 Second Midterm 7 April NAME (1 pt): SID (1 pt): TA (1 pt): Name of Neighbor to your left (1 pt): Name of Neighbor to your right (1 pt):
CS 70 Secod Midter 7 April 2011 NAME (1 pt): SID (1 pt): TA (1 pt): Nae of Neighbor to your left (1 pt): Nae of Neighbor to your right (1 pt): Istructios: This is a closed book, closed calculator, closed
More informationMath 312 Lecture Notes One Dimensional Maps
Math 312 Lecture Notes Oe Dimesioal Maps Warre Weckesser Departmet of Mathematics Colgate Uiversity 21-23 February 25 A Example We begi with the simplest model of populatio growth. Suppose, for example,
More informationSequences and Series of Functions
Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges
More informationLecture 2: Monte Carlo Simulation
STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?
More informationDisjoint set (Union-Find)
CS124 Lecture 7 Fall 2018 Disjoit set (Uio-Fid) For Kruskal s algorithm for the miimum spaig tree problem, we foud that we eeded a data structure for maitaiig a collectio of disjoit sets. That is, we eed
More informationChapter 6 Infinite Series
Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat
More informationMa 530 Infinite Series I
Ma 50 Ifiite Series I Please ote that i additio to the material below this lecture icorporated material from the Visual Calculus web site. The material o sequeces is at Visual Sequeces. (To use this li
More informationMassachusetts Institute of Technology
Massachusetts Istitute of Techology 6.867 Machie Learig, Fall 6 Problem Set : Solutios. (a) (5 poits) From the lecture otes (Eq 4, Lecture 5), the optimal parameter values for liear regressio give the
More information4.3 Growth Rates of Solutions to Recurrences
4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machie Learig (Fall 2014) Drs. Sha & Liu {feisha,yaliu.cs}@usc.edu October 14, 2014 Drs. Sha & Liu ({feisha,yaliu.cs}@usc.edu) CSCI567 Machie Learig (Fall 2014) October 14, 2014 1 / 49 Outlie Admiistratio
More informationInverse Matrix. A meaning that matrix B is an inverse of matrix A.
Iverse Matrix Two square matrices A ad B of dimesios are called iverses to oe aother if the followig holds, AB BA I (11) The otio is dual but we ofte write 1 B A meaig that matrix B is a iverse of matrix
More informationMixture models (cont d)
6.867 Machie learig, lecture 5 (Jaakkola) Lecture topics: Differet types of ixture odels (cot d) Estiatig ixtures: the EM algorith Mixture odels (cot d) Basic ixture odel Mixture odels try to capture ad
More informationOrthogonal Functions
Royal Holloway Uiversity of odo Departet of Physics Orthogoal Fuctios Motivatio Aalogy with vectors You are probably failiar with the cocept of orthogoality fro vectors; two vectors are orthogoal whe they
More informationOptimal Estimator for a Sample Set with Response Error. Ed Stanek
Optial Estiator for a Saple Set wit Respose Error Ed Staek Itroductio We develop a optial estiator siilar to te FP estiator wit respose error tat was cosidered i c08ed63doc Te first 6 pages of tis docuet
More informationOn Modeling On Minimum Description Length Modeling. M-closed
O Modelig O Miiu Descriptio Legth Modelig M M-closed M-ope Do you believe that the data geeratig echais really is i your odel class M? 7 73 Miiu Descriptio Legth Priciple o-m-closed predictive iferece
More informationThe Binomial Multi- Section Transformer
4/4/26 The Bioial Multisectio Matchig Trasforer /2 The Bioial Multi- Sectio Trasforer Recall that a ulti-sectio atchig etwork ca be described usig the theory of sall reflectios as: where: ( ω ) = + e +
More informationPattern recognition systems Lab 10 Linear Classifiers and the Perceptron Algorithm
Patter recogitio systems Lab 10 Liear Classifiers ad the Perceptro Algorithm 1. Objectives his lab sessio presets the perceptro learig algorithm for the liear classifier. We will apply gradiet descet ad
More informationU8L1: Sec Equations of Lines in R 2
MCVU U8L: Sec. 8.9. Equatios of Lies i R Review of Equatios of a Straight Lie (-D) Cosider the lie passig through A (-,) with slope, as show i the diagram below. I poit slope form, the equatio of the lie
More informationIt is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.
MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied
More informationMa 530 Introduction to Power Series
Ma 530 Itroductio to Power Series Please ote that there is material o power series at Visual Calculus. Some of this material was used as part of the presetatio of the topics that follow. What is a Power
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 7
Statistical Machie Learig II Sprig 2017, Learig Theory, Lecture 7 1 Itroductio Jea Hoorio jhoorio@purdue.edu So far we have see some techiques for provig geeralizatio for coutably fiite hypothesis classes
More information11 KERNEL METHODS From Feature Combinations to Kernels. φ(x) = 1, 2x 1, 2x 2, 2x 3,..., 2x D, Learning Objectives:
11 KERNEL METHODS May who have had a opportuity of kowig ay ore about atheatics cofuse it with arithetic, ad cosider it a arid sciece. I reality, however, it is a sciece which requires a great aout of
More informationHypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance
Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?
More informationBertrand s postulate Chapter 2
Bertrad s postulate Chapter We have see that the sequece of prie ubers, 3, 5, 7,... is ifiite. To see that the size of its gaps is ot bouded, let N := 3 5 p deote the product of all prie ubers that are
More informationINEQUALITIES BJORN POONEN
INEQUALITIES BJORN POONEN 1 The AM-GM iequality The most basic arithmetic mea-geometric mea (AM-GM) iequality states simply that if x ad y are oegative real umbers, the (x + y)/2 xy, with equality if ad
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More informationThe Random Walk For Dummies
The Radom Walk For Dummies Richard A Mote Abstract We look at the priciples goverig the oe-dimesioal discrete radom walk First we review five basic cocepts of probability theory The we cosider the Beroulli
More information19.1 The dictionary problem
CS125 Lecture 19 Fall 2016 19.1 The dictioary proble Cosider the followig data structural proble, usually called the dictioary proble. We have a set of ites. Each ite is a (key, value pair. Keys are i
More informationMachine Learning. Ilya Narsky, Caltech
Machie Learig Ilya Narsky, Caltech Lecture 4 Multi-class problems. Multi-class versios of Neural Networks, Decisio Trees, Support Vector Machies ad AdaBoost. Reductio of a multi-class problem to a set
More informationRandom Models. Tusheng Zhang. February 14, 2013
Radom Models Tusheg Zhag February 14, 013 1 Radom Walks Let me describe the model. Radom walks are used to describe the motio of a movig particle (object). Suppose that a particle (object) moves alog the
More informationAdmin REGULARIZATION. Schedule. Midterm 9/29/16. Assignment 5. Midterm next week, due Friday (more on this in 1 min)
Admi Assigmet 5! Starter REGULARIZATION David Kauchak CS 158 Fall 2016 Schedule Midterm ext week, due Friday (more o this i 1 mi Assigmet 6 due Friday before fall break Midterm Dowload from course web
More information