Generalized Linear Methods (Boosting Techniques, Ensemble Methods, ...)

1 Introduction

In ensemble methods the general idea is that, by combining several weak learners, one can build a better learner. More formally, assume that we have a set of weak learners {g_t(x)}_{t=1}^T, i.e., we have trained these learners on the training data {(x_i, y_i)}_{i=1}^n, with corresponding weights {w_i}_{i=1}^n. We can create a strong model as a linear combination of all these weak learners. We define the weighted output as

    G_T(x) = sign( \sum_{t=1}^T \alpha_t g_t(x) ),

where sign(.) is the sign function. One of the most famous methods for such boosting is called AdaBoost. Based on AdaBoost one can develop methods for gradient boosting, especially for trees. Here we first explain AdaBoost.

2 AdaBoost

Mainly introduced in [1], AdaBoost is considered one of the most important steps in statistical learning. One can show that AdaBoost minimizes an upper bound on the empirical error. To be precise, the empirical error (for classification) is defined as

    \hat{P}(Y G_T(X) \le 0) = (1/n) |{ i : G(x_i) \ne y_i }|,

which can be upper-bounded by the product of the normalization constants:

    \hat{P}(Y G_T(X) \le 0) \le \prod_{t=1}^T Z_t.
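As a concrete toy illustration of this weighted vote, here is a minimal sketch in Python; the stumps and the weights \alpha_t below are made-up placeholders, not outputs of any trained model:

```python
# Toy illustration of G_T(x) = sign(sum_t alpha_t * g_t(x)).
# The weak learners and their weights are hypothetical, chosen only for the demo.
weak_learners = [
    lambda x: 1 if x > 0.2 else -1,   # g_1: a threshold stump
    lambda x: 1 if x > 0.5 else -1,   # g_2: another stump
    lambda x: -1 if x > 0.8 else 1,   # g_3: a stump with flipped sign
]
alphas = [0.8, 0.5, 0.3]              # illustrative alpha_t weights

def G(x):
    """Strong learner: sign of the weighted sum of weak-learner outputs."""
    s = sum(a * g(x) for a, g in zip(alphas, weak_learners))
    return 1 if s >= 0 else -1

print(G(0.3), G(0.1))  # at x=0.3, g_1 and g_3 outvote g_2; at x=0.1 the negative votes dominate
```

Note that the vote is weighted: a single high-weight learner can overrule several low-weight ones.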
Algorithm 1: AdaBoost.
    Input: the set of weak learners {g_t(x)}_{t=1}^T
    Output: the weights of the combined learner {\alpha_t}_{t=1}^T
    Initialize the weights w_i^{(1)} = 1/n, i = 1, ..., n.
    for t = 1 to T do
        Fit the learner g_t(x) to the training data {(x_i, y_i)}_{i=1}^n, weighted by {w_i^{(t)}}_{i=1}^n.
        Choose \alpha_t \in R.
        Set w_i^{(t+1)} = w_i^{(t)} exp[-\alpha_t y_i g_t(x_i)] / Z_t, for i = 1, ..., n, where Z_t is a normalization factor.
    Return the output G(x).

One question left unanswered so far is: how do we choose \alpha_t? One option is to choose it in a greedy fashion, to minimize Z_t(\alpha) at each step. Since Z_t(\alpha) is a convex function, it has a unique minimum. Given that g_t(x) \in {±1}, the greedy choice of \alpha_t gives the following answer:

    \epsilon_t \triangleq \sum_i w_i^{(t)} I(y_i \ne g_t(x_i))          (1)
    \alpha_t = (1/2) log( (1 - \epsilon_t) / \epsilon_t )               (2)

With this choice we can find a guarantee on the empirical error of this algorithm, as stated by the next theorem.

Theorem 1. The procedure in Algorithm 1, with the choice of \alpha_t in Equation (2), and with \epsilon_t \le 1/2 - \gamma for all t, results in the following bound:

    \hat{P}(Y G_T(X) \le 0) \le \delta                                  (3)

for any arbitrary \delta > 0, if T is big enough; more precisely, if T \ge ln(1/\delta) / (2\gamma^2).

We provide the proof of Theorem 1 in the next section. If you don't feel like reading a relatively boring proof, you can skip ahead to the section after it! We chunk the proof into smaller pieces. First we prove the following lemma.
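A minimal sketch of Algorithm 1 in Python, using threshold stumps on 1-D data as the weak learners; the stump fitter and the toy dataset are illustrative assumptions, not part of the notes. The sketch also records the normalization factors Z_t, so the bound \hat{P}(Y G_T(X) \le 0) \le \prod_t Z_t can be checked numerically:

```python
import numpy as np

def adaboost_stumps(X, y, T=20):
    """Minimal AdaBoost (Algorithm 1) with threshold stumps on 1-D data.
    Returns the strong learner G and the normalization factors Z_t."""
    n = len(X)
    w = np.full(n, 1.0 / n)                       # w_i^{(1)} = 1/n
    stumps, alphas, Zs = [], [], []
    for _ in range(T):
        # Weak-learner fit: pick the stump (threshold, sign) with lowest weighted error.
        eps, thr, s = min(((w[np.where(X > t0, s0, -s0) != y].sum(), t0, s0)
                           for t0 in X for s0 in (1, -1)), key=lambda c: c[0])
        eps = min(max(eps, 1e-12), 1 - 1e-12)     # guard the log against eps in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)     # Equation (2)
        pred = np.where(X > thr, s, -s)
        u = w * np.exp(-alpha * y * pred)         # unnormalized w^{(t+1)}
        Zs.append(u.sum())                        # Z_t is exactly the normalizer
        w = u / Zs[-1]
        stumps.append((thr, s))
        alphas.append(alpha)
    def G(x):
        total = sum(a * (s if x > thr else -s) for a, (thr, s) in zip(alphas, stumps))
        return 1 if total >= 0 else -1
    return G, np.array(Zs)

# Toy data: positive labels in the middle of the interval (not stump-separable).
X = np.array([0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9])
y = np.array([-1, -1, 1, 1, 1, -1, -1])
G, Zs = adaboost_stumps(X, y)
err = np.mean([G(xi) != yi for xi, yi in zip(X, y)])
print(err, Zs.prod())   # the empirical error never exceeds the product of the Z_t
```

No single stump can classify this data, but the boosted combination drives the bound \prod_t Z_t (and with it the training error) down.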
Lemma 1. The procedure in Algorithm 1, with the choice of \alpha_t in Equation (2), results in the following bound:

    \hat{P}(Y G_T(X) \le 0) = (1/n) |{ i : G(x_i) \ne y_i }| \le \prod_{t=1}^T Z_t.      (4)

Proof. The event Y G_T(X) \le 0 implies exp(-Y G_T(X)) \ge 1. It is then easy to see that:

    \hat{P}(Y G_T(X) \le 0) = (1/n) \sum_{i=1}^n I{ G(x_i) \ne y_i }                     (5)
        \le (1/n) \sum_{i=1}^n exp( -y_i G(x_i) )                                        (6)
        = (1/n) \sum_{i=1}^n exp( -y_i \sum_{t=1}^T \alpha_t g_t(x_i) )                  (7)
        = (1/n) \sum_{i=1}^n \prod_{t=1}^T exp( -y_i \alpha_t g_t(x_i) ).                (8)

Using the weight updates of Algorithm 1, we have exp( -y_i \alpha_t g_t(x_i) ) = Z_t w_i^{(t+1)} / w_i^{(t)}, which gives:

    \hat{P}(Y G_T(X) \le 0) \le (1/n) \sum_{i=1}^n \prod_{t=1}^T Z_t w_i^{(t+1)} / w_i^{(t)}    (9)
        = ( \prod_{t=1}^T Z_t ) (1/n) \sum_{i=1}^n w_i^{(T+1)} / w_i^{(1)}.                     (10)

Since we chose w_i^{(1)} = 1/n, and w^{(T+1)} is a proper probability distribution,

    \hat{P}(Y G_T(X) \le 0) \le ( \prod_{t=1}^T Z_t ) \sum_{i=1}^n w_i^{(T+1)}           (11)
        = \prod_{t=1}^T Z_t.                                                             (12)
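The heart of this proof is that the weight updates telescope: unrolling w_i^{(t+1)} = w_i^{(t)} exp(-\alpha_t y_i g_t(x_i)) / Z_t from t = 1 to T gives w_i^{(T+1)} = exp( -y_i \sum_t \alpha_t g_t(x_i) ) / ( n \prod_t Z_t ), which is exactly what steps (9)-(12) exploit. A quick numerical check of this identity; the weak-learner outputs and the \alpha_t values are random placeholders (the identity holds for any choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 6, 3
y = rng.choice([-1, 1], size=n)           # labels
g = rng.choice([-1, 1], size=(T, n))      # placeholder weak-learner outputs g_t(x_i)
alpha = np.array([0.7, 0.4, 0.9])         # arbitrary alpha_t

w = np.full(n, 1.0 / n)                   # w^{(1)} = 1/n
Z = np.empty(T)
for t in range(T):
    u = w * np.exp(-alpha[t] * y * g[t])  # unnormalized update
    Z[t] = u.sum()                        # normalization factor Z_t
    w = u / Z[t]                          # w^{(t+1)}

# Telescoped form: w_i^{(T+1)} = exp(-y_i * sum_t alpha_t g_t(x_i)) / (n * prod_t Z_t)
rhs = np.exp(-y * (alpha[:, None] * g).sum(axis=0)) / (n * Z.prod())
print(np.allclose(w, rhs))   # True
```

Summing this identity over i (and using that the w_i^{(T+1)} sum to one) yields the bound (4) directly.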
Lemma 2. In the procedure of Algorithm 1, the greedy choice of \alpha_t to minimize Z_t results in

    \alpha_t = (1/2) log( (1 - \epsilon_t) / \epsilon_t )

and

    Z_t = 2 \sqrt{ \epsilon_t (1 - \epsilon_t) },

where

    \epsilon_t \triangleq \sum_i w_i^{(t)} I( y_i \ne g_t(x_i) ).

Proof. The proof is easy. Let's write the definition of the normalization constant explicitly:

    Z_t = \sum_i w_i^{(t)} e^{ -\alpha_t y_i g_t(x_i) }
        = \sum_{i : y_i = g_t(x_i)} w_i^{(t)} e^{-\alpha_t} + \sum_{i : y_i \ne g_t(x_i)} w_i^{(t)} e^{\alpha_t}
        = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t}.

Z_t is a convex function of \alpha_t and has a unique minimum. By taking the derivative and setting it to zero we find the minimizer:

    \alpha_t = (1/2) log( (1 - \epsilon_t) / \epsilon_t ).

Plugging this into the definition of Z_t, we find its minimum value:

    Z_t = 2 \sqrt{ \epsilon_t (1 - \epsilon_t) }.

Lemma 3. Given the results of Lemma 1 and Lemma 2, and given that \epsilon_t \le 1/2 - \gamma for all t (where \gamma \in (0, 0.5)), the empirical error is at most ( 1 - 4\gamma^2 )^{T/2}.

Proof. By Lemma 1 and Lemma 2, we have proven that:

    \hat{P}(Y G_T(X) \le 0) \le \prod_{t=1}^T 2 \sqrt{ \epsilon_t (1 - \epsilon_t) }.

Since \epsilon_t \le 1/2 - \gamma with \gamma \in (0, 0.5), we have \epsilon_t (1 - \epsilon_t) \le (1/2 - \gamma)(1/2 + \gamma) = 1/4 - \gamma^2, so each factor is at most \sqrt{ 1 - 4\gamma^2 }, and therefore

    \hat{P}(Y G_T(X) \le 0) \le ( 1 - 4\gamma^2 )^{T/2}.
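Lemma 2's closed form is easy to sanity-check numerically: minimize Z(\alpha) = (1 - \epsilon) e^{-\alpha} + \epsilon e^{\alpha} by brute force on a grid and compare against the closed-form minimizer and minimum value. The value \epsilon = 0.3 below is an arbitrary illustrative choice:

```python
import numpy as np

eps = 0.3                                        # an arbitrary weighted error eps_t

def Z(a):
    """Normalization constant as a function of alpha (Lemma 2)."""
    return (1 - eps) * np.exp(-a) + eps * np.exp(a)

alpha_star = 0.5 * np.log((1 - eps) / eps)       # closed-form minimizer
grid = np.linspace(-3.0, 3.0, 200001)            # brute-force grid search
alpha_grid = grid[np.argmin(Z(grid))]

print(alpha_star, alpha_grid)                    # the two minimizers agree closely
print(Z(alpha_star), 2 * np.sqrt(eps * (1 - eps)))  # minimum value matches the formula
```

Since Z is strictly convex, the grid minimizer converges to the closed-form one as the grid is refined.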
Now we have everything we need for the proof of Theorem 1.

Proof of Theorem 1. Given the result of Lemma 3, we have

    \hat{P}(Y G_T(X) \le 0) \le ( 1 - 4\gamma^2 )^{T/2}.

Requiring

    ( 1 - 4\gamma^2 )^{T/2} \le \delta,

and using 1 - x \le e^{-x}, this is guaranteed whenever

    T \ge ln(1/\delta) / (2\gamma^2).

2.1 AdaBoost as minimizing a global objective function

An alternative view of the above is minimizing a global objective function with coordinate-descent (greedy) updates. One can show that the global objective is

    L = (1/n) \sum_i e^{ -y_i \sum_t \alpha_t f_t(x_i) },

when optimizing locally with respect to \alpha_t.

2.2 More general AdaBoost

As shown previously, the standard form of AdaBoost can be interpreted as minimizing an exponential loss function. One can derive a general form of AdaBoost for an arbitrary loss function, but due to their computational cost these general forms are not very popular.

3 Gradient Boosting

Building on boosting, new methods have been proposed for gradient boosting. In fact, gradient boosting methods use both boosting and gradient methods, especially gradient descent (at the functional level). Using only gradient methods has downsides, e.g., a negative effect on the generalization error and convergence to local optima. However, gbm (a package in the R language) takes the approach of selecting a class of functions that uses the covariate information to approximate the gradient, usually a regression tree. The
algorithm is shown in Algorithm 2. At each iteration the algorithm determines the gradient, the direction in which it needs to improve the approximation to fit the data, by selecting from a class of functions. In other words, it selects the function which has the most agreement with the current approximation error.

To recall what the problem is formally, we want to find a function G such that:

    G^*(x) = arg min_{G} E_{x,y}[ L(y, G(x)) ] = arg min_{G} \int L(y, G(x)) P(x, y) dx dy.

But in practice the true distribution P(x, y) is not known; instead we have samples from it, in the form of D = {(x_i, y_i)}_{i=1}^n. The set of functions we can choose from is also limited, which we represent by \mathcal{G}. Thus the problem is approximated in the following form:

    G^*(x) \approx arg min_{G \in \mathcal{G}} \sum_{(x_i, y_i) \in D} L(y_i, G(x_i)).

Suppose a good approximation can be written as a linear combination of some coarse approximations:

    G(x) = \sum_{t=1}^T \alpha_t g_t(x).

Suppose the following is our initial approximation, a constant function:

    G_0(x) = arg min_{\alpha} \sum_{(x_i, y_i) \in D} L(y_i, \alpha),

followed by the incremental approximations:

    G_m(x) = G_{m-1}(x) + arg min_{g \in \mathcal{G}} \sum_{(x_i, y_i) \in D} L(y_i, G_{m-1}(x_i) + g(x_i)).

Since the minimization in the previous equation is done over functions (functional minimization), it is relatively hard to solve. Instead we can approximate it with greedy (functional-)gradient-based updates. The negative functional gradient of the loss function,

    -\nabla_g L(y, G(x) + g(x)),
is the direction in which the loss function decreases the most. Thus the following updates, with an appropriate choice of step size \rho_m, will reduce the loss:

    G_m(x) = G_{m-1}(x) - \rho_m \nabla_G L(y_i, G_{m-1}(x_i)),   for x_i \in D.

One possible way to find the step size is via line search:

    \rho_m = arg min_{\rho} \sum_{x_i \in D} L( y_i, G_{m-1}(x_i) - \rho \nabla_G L(y_i, G_{m-1}(x_i)) ).

Algorithm 2: Gradient boosting algorithm.
    Input: the training data D = {(x_i, y_i)}_{i=1}^n, a loss function L, a class of functions for fitting the gradient
    Output: the approximation G_M(x)
    Initialize the approximation with a constant, G_0(x) = arg min_{\rho} \sum_i L(y_i, \rho).
    for m = 1 to M do
        Compute the negative gradient r_m = (r_{1m}, r_{2m}, ..., r_{nm}), such that
            r_{im} = - [ \partial L(y_i, G(x_i)) / \partial G(x_i) ]_{G(x) = G_{m-1}(x)}.
        Fit a function h_m(x) to the gradient residuals {(x_i, r_{im})}_{i=1}^n.
        Find the scaling parameter \rho_m such that the following objective is minimized:
            \rho_m = arg min_{\rho} \sum_{x_i \in D} L( y_i, G_{m-1}(x_i) + \rho h_m(x_i) ).
        Update G_m(x) = G_{m-1}(x) + \rho_m h_m(x).
    Return the output G_M(x).

Suppose we want to generalize Algorithm 2 to trees. The only difference is that the approximating function h_m(x) is made of a tree, which we can represent as \sum_j b_j I(x \in R_j), where I(.) is an indicator function which shows whether the input x belongs to a specific region R_j or not, and b_j is the prediction for the values in this region. In the following step, a value of \rho_m is estimated via line search. In [2] it is suggested to use a separate value for each region. In other words, change

    \rho_m = arg min_{\rho} \sum_{x_i \in D} L( y_i, G_{m-1}(x_i) + \rho h_m(x_i) )
to

    \rho_{jm} = arg min_{\rho} \sum_{x_i \in R_{jm}} L( y_i, G_{m-1}(x_i) + \rho b_j ).

Note that, in this line search, the values of {b_j}_{j=1}^{J_m} do not have any effect on the final result; the only things that matter are the regions {R_j}_{j=1}^{J_m}. Thus we can simplify it, and write it as the following:

    \rho_{jm} = arg min_{\rho} \sum_{x_i \in R_{jm}} L( y_i, G_{m-1}(x_i) + \rho ).

The overall algorithm is shown in Algorithm 3.

Algorithm 3: Gradient boosting for trees.
    Input: the training data D = {(x_i, y_i)}_{i=1}^n, a loss function L
    Output: the approximation G_M(x)
    Initialize a single-node tree, G_0(x) = arg min_{\rho} \sum_i L(y_i, \rho).
    for m = 1 to M do
        Compute the negative gradient r_m = (r_{1m}, r_{2m}, ..., r_{nm}), such that
            r_{im} = - [ \partial L(y_i, G(x_i)) / \partial G(x_i) ]_{G(x) = G_{m-1}(x)}.
        Fit a regression tree to the pseudo-responses (residuals) {(x_i, r_{im})}_{i=1}^n, which results in terminal regions R_{jm}, j = 1, ..., J_m.
        for j = 1, 2, ..., J_m do
            \rho_{jm} = arg min_{\rho} \sum_{x_i \in R_{jm}} L( y_i, G_{m-1}(x_i) + \rho ).
        Update G_m(x) = G_{m-1}(x) + \sum_j \rho_{jm} I(x \in R_{jm}).
    Return the output G_M(x).

In the gbm library of R, given the scenario shown in Algorithm 3, the shrinkage (or learning rate) parameter scales the gradient updates. So the main effort in model selection is in choosing the n.trees and shrinkage parameters.

3.0.1 Regularization

Since gradient boosting for trees has a large number of degrees of freedom, it is highly prone to overfitting. One possible way to reduce the amount of overfitting is
shrinkage of the coefficients. Suppose we have a parameter \lambda \in (0, 1), such that:

    G_m(x) = G_{m-1}(x) + \lambda \rho_m h_m(x).

4 Final notes

Some intuition is from David Forsyth's class at UIUC. Peter Bartlett's class notes provided a very good summary of the main points.

References

[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, pages 23-37. Springer, 1995.

[2] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, 2008.
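The tree-based procedure (Algorithm 3), combined with this shrinkage, can be sketched for the squared loss L(y, G) = (y - G)^2 / 2: there the negative gradient is just the residual y - G, and the per-region line search gives the mean residual of each region. The depth-1 stump fitter and the toy regression data below are illustrative assumptions, not part of the notes:

```python
import numpy as np

def fit_stump(X, r):
    """Depth-1 regression tree fit to residuals r. Its two leaf values are the
    region means, which for squared loss already solve the per-region line search."""
    best = None
    for thr in X:
        left, right = r[X <= thr], r[X > thr]
        if left.size == 0 or right.size == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, b_left, b_right = best
    return lambda x: np.where(x <= thr, b_left, b_right)

def gradient_boost_trees(X, y, M=200, lam=0.1):
    """Gradient boosting for trees (Algorithm 3) with squared loss and shrinkage lam."""
    G0 = y.mean()                      # single-node initial tree
    G = np.full_like(y, G0)
    trees = []
    for _ in range(M):
        r = y - G                      # negative gradient (pseudo-residuals)
        h = fit_stump(X, r)            # terminal regions + per-region rho in one step
        G = G + lam * h(X)             # shrunken update G_m = G_{m-1} + lam * h_m
        trees.append(h)
    return lambda x: G0 + lam * sum(h(x) for h in trees)

# Toy regression problem: fit a sine wave with boosted stumps.
X = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * X)
model = gradient_boost_trees(X, y)
mse = np.mean((model(X) - y) ** 2)
print(mse)   # training MSE shrinks toward zero as M grows
```

Lowering lam slows each step but typically improves generalization, at the cost of needing more trees, which is exactly the n.trees/shrinkage trade-off mentioned above.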