Ada Boost, Risk Bounds, Concentration Inequalities



CS281B/Stat241B (Spring 2008) Statistical Learning Theory
Lecture: Ada Boost, Risk Bounds, Concentration Inequalities
Lecturer: Peter Bartlett
Scribe: Subhransu Maji

1 AdaBoost and Estimates of Conditional Probabilities

We continue our discussion of AdaBoost and derive risk bounds for the classifier. Recall that for a function $f$, the excess risk and the excess $\phi$-risk for a loss function $\phi$ are related by

$$\Psi\big(R(f) - R^*\big) \le R_\phi(f) - R_\phi^*,$$

where $R^*$ is the optimal Bayes risk, $R_\phi^* = \inf_f R_\phi(f)$ is the optimal $\phi$-risk, $H(\eta)$ is the function

$$H(\eta) = \inf_{\alpha} \big[\eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha)\big],$$

and, for convex classification-calibrated $\phi$, $\Psi(\theta)$ is the function

$$\Psi(\theta) = \phi(0) - H\Big(\frac{1+\theta}{2}\Big) = \phi(0) - \inf_{\alpha} \Big[\frac{1+\theta}{2}\,\phi(\alpha) + \frac{1-\theta}{2}\,\phi(-\alpha)\Big].$$

In the context of AdaBoost, the loss function $\phi(\alpha) = e^{-\alpha}$ is convex and classification calibrated. Thus,

$$H(\eta) = \inf_{\alpha} \big[\eta\, e^{-\alpha} + (1-\eta)\, e^{\alpha}\big].$$

Differentiating with respect to $\alpha$ and setting the derivative to zero gives the optimal

$$\alpha^*(\eta) = \frac{1}{2} \ln \frac{\eta}{1-\eta}.$$

This suggests that if we could choose $f(x)$ separately for each $x$, it would be a monotonically transformed version of the conditional probability (see the next section). Plugging $\alpha^*$ into $H$ yields

$$H(\eta) = 2\sqrt{\eta(1-\eta)},$$

which is concave and symmetric around $1/2$. Then $\Psi(\theta)$ simplifies to

$$\Psi(\theta) = 1 - \sqrt{(1+\theta)(1-\theta)} = 1 - \sqrt{1-\theta^2}.$$

Finally, plugging this into the original inequality yields

$$1 - \sqrt{1 - \big(R(f) - R^*\big)^2} \le R_\phi(f) - R_\phi^*.$$

Examining the Taylor series of the left side about 0 (where $1 - \sqrt{1-\theta^2} \approx \theta^2/2$) shows that this is equivalent, for some constant $c$, to

$$R(f) - R^* \le c\,\sqrt{R_\phi(f) - R_\phi^*}$$

when the excess $\phi$-risk is sufficiently small. Thus, driving the excess $\phi$-risk to zero will drive the excess discrete loss to zero as well, which justifies AdaBoost's use of this particular convex loss function.
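The closed forms for the exponential loss, $\alpha^*(\eta) = \frac{1}{2}\ln\frac{\eta}{1-\eta}$, $H(\eta) = 2\sqrt{\eta(1-\eta)}$ and $\Psi(\theta) = 1 - \sqrt{1-\theta^2}$, can be sanity-checked numerically. The following is a minimal sketch, not part of the original notes: the function names and the brute-force grid search over $\alpha$ are illustrative assumptions.

```python
import math

def exp_loss(alpha):
    """Exponential loss phi(alpha) = exp(-alpha)."""
    return math.exp(-alpha)

def conditional_phi_risk(eta, alpha):
    """eta*phi(alpha) + (1 - eta)*phi(-alpha), the quantity minimized inside H."""
    return eta * exp_loss(alpha) + (1.0 - eta) * exp_loss(-alpha)

def alpha_star(eta):
    """Closed-form minimizer for the exponential loss, valid for 0 < eta < 1."""
    return 0.5 * math.log(eta / (1.0 - eta))

def H_closed_form(eta):
    """H(eta) = 2*sqrt(eta*(1 - eta)) for the exponential loss."""
    return 2.0 * math.sqrt(eta * (1.0 - eta))

def H_grid(eta, lo=-10.0, hi=10.0, steps=20001):
    """Brute-force approximation of the infimum over alpha on a uniform grid
    (hypothetical helper, used only to cross-check the closed form)."""
    width = (hi - lo) / (steps - 1)
    return min(conditional_phi_risk(eta, lo + i * width) for i in range(steps))

def psi(theta):
    """Psi(theta) = phi(0) - H((1 + theta)/2), with phi(0) = 1."""
    return 1.0 - H_closed_form((1.0 + theta) / 2.0)

# Compare the closed form with the grid search for a few values of eta.
for eta in (0.1, 0.3, 0.5, 0.8):
    print(eta, alpha_star(eta), H_closed_form(eta), H_grid(eta))
```

The grid search is deliberately naive; it only confirms that the analytic minimizer and the analytic value of $H$ agree with direct minimization.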

2 Relationship to logistic regression

It turns out that we can interpret the value of $F(x)$ (where $F$ is the boosted classifier returned by AdaBoost) as a transformed estimate of $\Pr(Y = 1 \mid X = x)$. Consider a logistic model where

$$\Pr(Y = 1 \mid X = x) = \frac{1}{1 + e^{-2f(x)}} = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}},$$

a rescaled version of the logistic function. In this model, the log loss (negative log likelihood) takes the form

$$-\sum_{i=1}^{n} \ln \Pr(Y = y_i \mid X = x_i) = \sum_{i:\, y_i = 1} \ln\big(1 + e^{-2f(x_i)}\big) + \sum_{i:\, y_i = -1} \ln\big(1 + e^{2f(x_i)}\big) = \sum_{i=1}^{n} \ln\big(1 + e^{-2 y_i f(x_i)}\big).$$

Thus, the maximum likelihood logistic regression solution attempts to minimize the sample average of $\phi(\alpha) = \ln(1 + e^{-2\alpha})$. This is closely related to AdaBoost, which minimizes the sample average of $\phi(\alpha) = e^{-\alpha}$. To see the connection, note that the first few terms of the Taylor expansion of $\ln(1 + e^{-2\alpha}) - \ln 2 + 1$ about 0,

$$1 - \alpha + \frac{\alpha^2}{2} - \dots,$$

are identical to those of $e^{-\alpha}$.

While the two functions are very similar near zero, their asymptotic behavior is very different. In general we have that $\ln(1 + e^{-2\alpha}) \le e^{-\alpha}$; furthermore, the former grows linearly as $\alpha$ approaches $-\infty$, whereas the latter grows exponentially. Thus, we can view AdaBoost as approximating the maximum likelihood logistic regression solution, except with (sometimes exponentially) larger penalties for mistakes. A further similarity between the methods is that the $\alpha^*(\eta)$ for $\phi(\alpha) = \ln(1 + e^{-2\alpha})$ is the same as for AdaBoost.

3 Risk Bounds and Uniform Convergence

So far, we have looked at algorithms (including AdaBoost) that optimize over a set of training samples:

$$\min_{f \in \mathcal{F}} \hat{R}(f) = \hat{E}\,\ell(y, f(x)) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)).$$

If the empirical minimizer is $\hat{f}$, we are interested in bounding the true loss $R(\hat{f}) = E\,\ell(y, \hat{f}(x))$ under this function. In particular, we hope that $\hat{R}(\hat{f})$ will converge to $\inf_{f \in \mathcal{F}} R(f)$ as $n \to \infty$.

For the (trivial) case where our function class $\mathcal{F}$ contains only a single function, we can simply appeal to the law of large numbers. For example, in the case of discrete loss, the Chernoff bound gives an upper bound on $\Pr(|\hat{R}(f) - R(f)| > \epsilon)$ that shrinks exponentially in $n$ for any given $\epsilon$. This argument, however, fails when $\mathcal{F}$ is not a singleton.
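As a numerical aside on Section 2, the relationship between the exponential loss $e^{-\alpha}$ and the logistic loss $\ln(1 + e^{-2\alpha})$ can be checked directly. The sketch below is illustrative only: the function names and the evaluation points are assumptions, not part of the original notes. It compares the two losses near zero (after the shift by $-\ln 2 + 1$) and for a large negative margin.

```python
import math

def exp_loss(alpha):
    """AdaBoost's loss phi(alpha) = exp(-alpha)."""
    return math.exp(-alpha)

def logistic_loss(alpha):
    """Log loss of the rescaled logistic model, ln(1 + exp(-2*alpha))."""
    return math.log1p(math.exp(-2.0 * alpha))

def shifted_logistic_loss(alpha):
    """ln(1 + exp(-2*alpha)) - ln 2 + 1, which agrees with exp(-alpha)
    up to second order around alpha = 0."""
    return logistic_loss(alpha) - math.log(2.0) + 1.0

# Near zero the shifted logistic loss and the exponential loss nearly coincide.
for alpha in (-0.1, -0.01, 0.0, 0.01, 0.1):
    print(alpha, exp_loss(alpha), shifted_logistic_loss(alpha))

# For large negative margins the logistic loss grows roughly linearly,
# while the exponential loss grows exponentially.
for alpha in (-2.0, -5.0, -10.0):
    print(alpha, logistic_loss(alpha), exp_loss(alpha))
```

The second loop makes the "larger penalties for mistakes" remark concrete: a badly misclassified point contributes far more to the exponential objective than to the logistic one.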
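We cannot simply apply the law of large numbers to each $f \in \mathcal{F}$ and then argue that the desired property holds when minimizing over all of $\mathcal{F}$. The problem is that we are considering $R(\arg\min_{f \in \mathcal{F}} \hat{R}(f))$, where the inner part depends on the data. In particular, if $\mathcal{F}$ is such that for any $n$ and data set there are functions $f \in \mathcal{F}$ with small $\hat{R}(f)$ but large $R(f)$, then choosing an $f$ that minimizes $\hat{R}(f)$ may not tend to minimize $R(f)$.

Example. Let $\mathcal{F} = \mathcal{F}_+ \cup \mathcal{F}_-$ with

$$\mathcal{F}_+ = \{x \mapsto f(x) : |\{x : f(x) = 1\}| < \infty\},$$
$$\mathcal{F}_- = \{x \mapsto f(x) : |\{x : f(x) = -1\}| < \infty\}.$$

Note that for any finite sequence, we can choose $f$ from either $\mathcal{F}_+$ or $\mathcal{F}_-$ to explain it. Now, suppose we have a distribution $P$ such that $P(Y = 1 \mid X) = 0.95$ almost surely, and for all $x$, $P(X = x) = 0$. Then, we have

$$\forall f \in \mathcal{F}_-,\ R(f) = 0.05 = R^*,$$
$$\forall f \in \mathcal{F}_+,\ R(f) = 0.95 > R^*,$$

where $R^*$ is the Bayes risk. However, for any finite sample there is an $f \in \mathcal{F}_+$ with $\hat{R}(f) = 0$ but $R(f) - R^* = 0.9$. So, choosing a function from a class via empirical risk minimization does not guarantee risk minimization with such a rich class. Restated,

$$R\big(\arg\min_{f \in \mathcal{F}} \hat{R}(f)\big) \not\to \inf_{f \in \mathcal{F}} R(f).$$

Example. If the set of functions $\mathcal{F} \subseteq \{-1, 1\}^{\mathcal{X}}$ is finite, then we can say something about the true risk given that the empirical risk is zero. The following theorem makes this explicit.

Theorem 3.1.

$$\Pr\big(\exists f \in \mathcal{F} : \hat{R}(f) = 0 \ \&\ R(f) \ge \epsilon\big) \le |\mathcal{F}|\, e^{-\epsilon n},$$

i.e., with probability at least $1 - \delta$, if $\hat{R}(f) = 0$, then

$$R(f) \le \frac{\log|\mathcal{F}| + \log(1/\delta)}{n}.$$

Proof. To show this we use properties of the exponential function and the union bound. For any $f \in \mathcal{F}$ with $R(f) \ge \epsilon$, we have

$$\Pr\big(\hat{R}(f) = 0\big) \le (1 - \epsilon)^n = \exp\big(n \log(1 - \epsilon)\big) \le \exp(-\epsilon n). \qquad (1)$$

Using the union bound (Boole's inequality: the probability of a union of events is no more than the sum of their probabilities), we have

$$\Pr\big(\exists f \in \mathcal{F} : R(f) \ge \epsilon \ \&\ \hat{R}(f) = 0\big) \le \sum_{f \in \mathcal{F}} \Pr\big(R(f) \ge \epsilon \ \&\ \hat{R}(f) = 0\big) \le |\mathcal{F}|\, e^{-\epsilon n}. \qquad (2)$$

Setting the right-hand side equal to $\delta$ and solving for $\epsilon$ gives the second form of the statement.

Example (Decision Trees). Consider the class of decision trees with a finite number of nodes $N$ over $x \in \{-1, 1\}^d$. Then $|\mathcal{F}| \le (d + 2)^N$, because we can specify the tree by listing, in breadth-first order, the $N$ nodes of the tree, and each can be either one of the $d$ covariates or one of the outputs $\{-1, 1\}$. Thus, if $\hat{R}(f) = 0$, then with probability $1 - \delta$,

$$R(f) \le \frac{N \log(d + 2) + \log(1/\delta)}{n}.$$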
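To get a feel for the finite-class bound $R(f) \le (\log|\mathcal{F}| + \log(1/\delta))/n$, it helps to evaluate it for concrete numbers. The sketch below does this for the decision tree class, using $\log|\mathcal{F}| \le N \log(d+2)$. The function names and the values of $N$, $d$, $\delta$ and $n$ are hypothetical choices for illustration and are not taken from the notes.

```python
import math

def zero_error_risk_bound(log_class_size, n, delta):
    """Bound on R(f) for any f in a finite class with zero empirical risk:
    (log|F| + log(1/delta)) / n, holding with probability at least 1 - delta."""
    return (log_class_size + math.log(1.0 / delta)) / n

def decision_tree_log_class_size(num_nodes, d):
    """log of the upper bound (d + 2)^N on the number of trees with N nodes."""
    return num_nodes * math.log(d + 2)

def samples_for_target(log_class_size, eps, delta):
    """Smallest n for which the bound above is at most eps."""
    return math.ceil((log_class_size + math.log(1.0 / delta)) / eps)

# Hypothetical setting: trees with 15 nodes over 20 binary covariates.
N, d, delta = 15, 20, 0.05
log_size = decision_tree_log_class_size(N, d)

for n in (100, 1000, 10000):
    print(n, zero_error_risk_bound(log_size, n, delta))

print(samples_for_target(log_size, 0.05, delta))
```

The bound decays like $1/n$ and depends on the class only through $\log|\mathcal{F}|$, so doubling the number of nodes roughly doubles the sample size needed for the same guarantee.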

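Example. $\mathcal{F}$ is parameterized using $N$ bits, i.e. $\mathcal{F} = \{x \mapsto \phi(x, b),\ b \in \{0, 1\}^N\}$ with $f_b(x) = \phi(x, b)$. Then $|\mathcal{F}| \le 2^N$, and thus if $\hat{R}(f) = 0$, then with probability $1 - \delta$,

$$R(f) \le \frac{N \log 2 + \log(1/\delta)}{n} \le \frac{N + \log(1/\delta)}{n}.$$

Typically, when we learn classifiers on training data the empirical risk is small but not zero, and the above theorem cannot be applied directly. In the next few sections we will develop tools to show properties relating the empirical risk minimizer and the minimal risk.

4 Concentration Inequalities

We will be interested not only in whether $R(\arg\min_{f \in \mathcal{F}} \hat{R}(f)) \to \inf_{f \in \mathcal{F}} R(f)$, but in how fast this convergence happens, called the rate of convergence.

4.1 Classic bounds

For this, several classic inequalities are useful that impose upper bounds on the total probability mass contained within the tail of a distribution.

Theorem 4.1 (Markov's Inequality). If $X \ge 0$ a.s. and $t > 0$, then

$$\Pr(X \ge t) \le \frac{E X}{t}.$$

Proof.

$$E X = E\big[X\,1[X \ge t]\big] + E\big[X\,1[X < t]\big] \ge t \Pr(X \ge t) + 0 \cdot \Pr(X < t) = t \Pr(X \ge t).$$

Theorem 4.2 (Chebyshev's Inequality). If $t > 0$, then

$$\Pr(|X - E X| \ge t) \le \frac{\mathrm{Var}(X)}{t^2}.$$

Proof. Apply Markov's inequality to $(X - E X)^2$.

These upper bounds are not necessarily tight, as seen in the following example.

Example. Let $Z_i \in \{0, 1\}$ be i.i.d. with $\Pr(Z_i = 1) = p$. Denote $S_n = \sum_{i=1}^{n} Z_i$. Then, using Chebyshev's inequality on the variable $S_n / n$, we have

$$\Pr\Big(\Big|\frac{S_n}{n} - E\frac{S_n}{n}\Big| > t\Big) \le \frac{\mathrm{Var}(S_n / n)}{t^2} = \frac{p(1 - p)}{n t^2}.$$

On the other hand, the central limit theorem says

$$\frac{\sqrt{n}}{\sigma}\Big(\frac{S_n}{n} - p\Big) \to N(0, 1), \qquad \sigma^2 = p(1 - p).$$

Thus, for $t > 0$,

$$\lim_{n \to \infty} \Pr\Big(\frac{\sqrt{n}}{\sigma}\Big(\frac{S_n}{n} - p\Big) \ge t\Big) = 1 - \Phi(t) \le \frac{c}{t} \exp\Big(-\frac{t^2}{2}\Big),$$

where $\Phi(t)$ is the cumulative distribution function of $N(0, 1)$ and $c$ is a constant (one may take $c = 1/\sqrt{2\pi}$). So, $\Pr(S_n / n - p \ge \epsilon)$ should decrease as $\exp\big(-\frac{n \epsilon^2}{2 \sigma^2}\big)$, which is much faster than the rate implied by Chebyshev's inequality.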
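The gap between Chebyshev's bound and the true tail in the Bernoulli example can be seen by computing the binomial tail exactly. The following sketch is an illustration under assumed values ($n = 200$, $p = 0.5$, $t = 0.1$) and with hypothetical function names; it prints the exact two-sided tail, the Chebyshev bound $p(1-p)/(nt^2)$, and the Gaussian-type rate $\exp(-nt^2/(2\sigma^2))$ suggested by the central limit theorem.

```python
import math

def exact_two_sided_tail(n, p, t):
    """Exact Pr(|S_n/n - p| > t) for S_n ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > t:
            total += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
    return total

def chebyshev_bound(n, p, t):
    """Chebyshev's bound Var(S_n/n) / t^2 = p(1 - p) / (n t^2)."""
    return p * (1.0 - p) / (n * t * t)

def gaussian_rate(n, p, eps):
    """exp(-n eps^2 / (2 sigma^2)) with sigma^2 = p(1 - p); a rate, not a bound."""
    return math.exp(-n * eps ** 2 / (2.0 * p * (1.0 - p)))

# Hypothetical values chosen for illustration.
n, p, t = 200, 0.5, 0.1
print("exact tail     :", exact_two_sided_tail(n, p, t))
print("Chebyshev bound:", chebyshev_bound(n, p, t))
print("Gaussian rate  :", gaussian_rate(n, p, t))
```

Note that the Gaussian rate is only the decay suggested by the central limit theorem, not a finite-sample guarantee; the inequalities in the next section supply guarantees with a comparable exponential form.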

4.2 Hoeffding's Inequality

We can show concentration inequalities for sums of independent random variables more generally. Note that the following bounds leverage independence, but do not require identical distributions among the variables involved.

Theorem 4.3 (Hoeffding's Inequality). Consider independent $X_i \in [a_i, b_i]$ and their sum, $S_n = \sum_{i=1}^{n} X_i$. Then,

$$\Pr(S_n - E S_n \ge t) \le \exp\Big(-\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big).$$

Proof. A monotonic transformation and exponentiation using $s > 0$ gives us a positive random variable. Applying Markov's inequality we get

$$\Pr(S_n - E S_n \ge t) = \Pr\big(e^{s(S_n - E S_n)} \ge e^{st}\big) \le e^{-st}\, E\big[e^{s(S_n - E S_n)}\big] \quad \text{(Markov's Inequality)}$$
$$= e^{-st}\, E\Big[\exp\Big(s \sum_{i=1}^{n} (X_i - E X_i)\Big)\Big] = e^{-st} \prod_{i=1}^{n} E\big[e^{s(X_i - E X_i)}\big], \qquad (3)$$

where the last equality uses the independence of the variables. We will see in the next lecture a bound on the last expression, which will complete the proof.

Example. Let $X_i \in \{0, 1\}$ denote the loss on the $i$-th example. Then $S_n = n \hat{R}(f)$ and $E S_n = n R(f)$. Applying Hoeffding's inequality with $t = n\epsilon$ and $b_i - a_i = 1$, we get

$$P\big(\hat{R}(f) - R(f) > \epsilon\big) \le \exp\Big(-\frac{2 (n \epsilon)^2}{n}\Big) = \exp(-2 n \epsilon^2).$$

Note that though the rate is right and this bound is tighter than Markov's, there is still a factor of $\sigma^2$ missing compared to the bounds one would expect from the central limit theorem.
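Hoeffding's bound for 0/1 losses can be compared against the exact binomial tail. The sketch below is a minimal illustration with assumed values ($n = 100$, $\epsilon = 0.1$, several values of the true risk $p$) and hypothetical function names. It evaluates the general form $\exp(-2t^2/\sum_i (b_i - a_i)^2)$, the specialized form $\exp(-2n\epsilon^2)$, and the exact one-sided tail.

```python
import math

def hoeffding_bound(t, ranges):
    """exp(-2 t^2 / sum_i (b_i - a_i)^2) for independent X_i in [a_i, b_i]."""
    denom = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2.0 * t * t / denom)

def hoeffding_risk_bound(n, eps):
    """Specialization to 0/1 losses: Pr(R_hat(f) - R(f) > eps) <= exp(-2 n eps^2)."""
    return math.exp(-2.0 * n * eps ** 2)

def exact_upper_tail(n, p, eps):
    """Exact Pr(S_n/n - p > eps) for S_n ~ Binomial(n, p), where p plays the
    role of the true risk R(f) of a fixed classifier f."""
    total = 0.0
    for k in range(n + 1):
        if k / n - p > eps:
            total += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
    return total

# Hypothetical values chosen for illustration.
n, eps = 100, 0.1
print("Hoeffding bound:", hoeffding_risk_bound(n, eps))
for p in (0.1, 0.3, 0.5):
    print("p =", p, "exact tail:", exact_upper_tail(n, p, eps))
```

The Hoeffding bound is the same for every $p$, whereas the exact tail shrinks as $p$ moves away from $1/2$; this is the missing variance factor mentioned above.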

Agnostic Learning and Concentration Inequalities

ECE901 Spring 2004 Statistical Regularization and Learning Theory
Lecture: 7 Agnostic Learning and Concentration Inequalities
Lecturer: Rob Nowak
Scribe: Aravind Kailas

1 Introduction

1.1 Motivation

In the last lecture