Boosting with log-loss


Marco Cusumano-Towner

September 2, 2012

1 The problem

Suppose we have data examples $\{(x_i, y_i)\}$, $i = 1, \ldots, m$, for a two-class problem with $y_i \in \{-1, +1\}$. Let $F(x)$ be the predictor function, with the sign of $F(x_i)$ giving the prediction on example $i$. We assume that $F$ is a sum of basis functions $f_j \in \mathcal{F}$, where $\mathcal{F}$ is the set of possible basis functions:
$$F(x) = \sum_{j=1}^{T} f_j(x)$$
For now, we focus on weak learners of the following form, although similar algorithms apply to other forms of weak learner:
$$f_{A,c}(x) = \begin{cases} c & x \in A \\ 0 & x \notin A \end{cases}$$
where we have some set $\mathcal{A}$ of possible partitions $A$ and a score $c \in \mathbb{R}$, so that $\mathcal{F} = \mathcal{A} \times \mathbb{R}$.

An obvious objective to minimize is the average training error:
$$J_{\mathrm{err}} = \frac{1}{m} \sum_i \mathbb{1}\left[y_i \neq \mathrm{sign}(F(x_i))\right]$$
This is a very challenging optimization problem because the objective is not even continuous. Machine learning algorithms use various other loss functions that approximate this objective but are easier to optimize.

2 Adaboost as greedy coordinate descent

Adaboost attempts to minimize the exponential loss:
$$J_{\exp} = \sum_i \exp\left(-y_i \sum_{j=1}^{T} f_j(x_i)\right)$$
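To make the setup concrete, the following is a minimal sketch in Python/NumPy (the names `membership`, `weak_learners`, etc. are illustrative and not from the original note's code) of evaluating the additive predictor built from the indicator weak learners $f_{A,c}$, together with the exponential loss and the training error it upper-bounds.

```python
import numpy as np

def predict(weak_learners, membership):
    """Evaluate F(x_i) = sum_j f_j(x_i) for every training example.

    weak_learners: list of (a, c) pairs, where a indexes a column of `membership`.
    membership: dense (m, n_A) 0/1 array; membership[i, a] = 1 iff x_i is in partition A_a.
    """
    F = np.zeros(membership.shape[0])
    for a, c in weak_learners:
        F += c * membership[:, a]          # f_{A,c}(x_i) = c if x_i in A, else 0
    return F

def exponential_loss(F, y):
    """J_exp = sum_i exp(-y_i F(x_i)), an upper bound on the number of training errors."""
    return np.sum(np.exp(-y * F))

def training_error(F, y):
    """J_err: average 0-1 loss."""
    return np.mean(np.sign(F) != y)
```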

which is an upper bound on the average training error. As discussed in [3] and [2], Adaboost minimizes this loss by doing greedy coordinate descent on the functions $f_j$. That is, each $f_k$ is chosen such that
$$f_k = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i \exp\left(-y_i \left(\sum_{j=1}^{k-1} f_j(x_i) + f(x_i)\right)\right)$$
where the previous $f_j$ for $j = 1, \ldots, k-1$ are fixed. Let $F_{k-1}(x_i) = \sum_{j=1}^{k-1} f_j(x_i)$ be the aggregated contribution from the existing terms for each data point $i$. Then
$$J_{\exp} = \sum_i \exp\left(-y_i (F_{k-1}(x_i) + f_k(x_i))\right) = \sum_i w_i \exp(-y_i f_k(x_i))$$
where the (un-normalized) weights $w_i$ are defined to be
$$w_i := \exp(-y_i F_{k-1}(x_i))$$
Each iteration is then interpreted as training a new weak learner $f_k$ on the weighted training set with respect to the exponential loss. This coordinate-wise procedure can also be seen as a procedure for feature selection.

We now derive the Adaboost updates for our choice of basis functions. At each iteration:
$$J_{\exp}(A, c) = \sum_i w_i \exp(-y_i f_{A,c}(x_i)) = \sum_{i:\, x_i \in A,\, y_i = +1} w_i \exp(-c) + \sum_{i:\, x_i \in A,\, y_i = -1} w_i \exp(c) + \sum_{i:\, x_i \notin A} w_i$$
$$= \exp(-c)\, W_+(A) + \exp(c)\, W_-(A) + W_0(A)$$
where $W_+(A)$, $W_-(A)$, and $W_0(A)$ are the sums of weights for positive examples with $x_i \in A$, negative examples with $x_i \in A$, and examples with $x_i \notin A$, respectively. We first find the optimum value of $c$ for each choice of $A$. Differentiating with respect to $c$ and setting to zero gives:
$$c^*(A) = \frac{1}{2} \log\left(\frac{W_+(A)}{W_-(A)}\right)$$
Plugging this expression into $J_{\exp}(A, c)$ gives the minimum achievable loss using a basis function with partition set $A$:
$$J^*_{\exp}(A) = W_+(A)\left(\frac{W_-(A)}{W_+(A)}\right)^{1/2} + W_-(A)\left(\frac{W_+(A)}{W_-(A)}\right)^{1/2} + W_0(A) = 2\sqrt{W_+(A)\, W_-(A)} + W_0(A)$$
This expression is very convenient because it is expressed in terms of sums of weights, which can be computed efficiently for all $A$ with sparse matrix multiplication when $\mathcal{A}$ is large but discrete.
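Under the same illustrative representation (a 0/1 partition-indicator matrix, which may be dense or scipy.sparse), a single greedy Adaboost step can compute $W_+(A)$, $W_-(A)$, and $W_0(A)$ for every $A$ at once with one set of matrix multiplications. This is only a sketch of the update described above, not the actual MEDUSA implementation.

```python
import numpy as np

def adaboost_step(membership, y, F_prev, eps=1e-12):
    """One greedy Adaboost step over indicator weak learners f_{A,c}.

    membership: (m, n_A) 0/1 matrix (dense ndarray or scipy.sparse); column a indicates A_a.
    y: (m,) labels in {-1, +1}.  F_prev: (m,) aggregated predictions F_{k-1}(x_i).
    Returns (index of best partition, its optimal score c*(A)).
    """
    w = np.exp(-y * F_prev)                      # Adaboost weights w_i = exp(-y_i F_{k-1}(x_i))
    W_plus = membership.T @ (w * (y == +1))      # W_+(A) for all A at once
    W_minus = membership.T @ (w * (y == -1))     # W_-(A) for all A at once
    W_out = w.sum() - (W_plus + W_minus)         # W_0(A): total weight of examples outside A
    J_star = 2.0 * np.sqrt(W_plus * W_minus) + W_out   # minimum achievable loss per A
    best = int(np.argmin(J_star))
    c_star = 0.5 * np.log((W_plus[best] + eps) / (W_minus[best] + eps))
    return best, c_star
```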

3 Greedy coordinate descent with the log-loss

We can derive an algorithm similar to Adaboost that will perform feature selection, but with the log-loss instead of the exponential loss. The log-loss does not increase exponentially as the margin gets more and more negative (see Figure 1), and is therefore more robust to outliers and noisy data. The log-loss is:
$$J_{\log} = \sum_i \log\left[1 + \exp\left(-y_i F_{k-1}(x_i) - y_i f_k(x_i)\right)\right]$$

Figure 1: The log-loss as a function of the new $c$ for a given set of weights $w$ and a given $A$, with the scores chosen by the various algorithms above.

Following what we did for Adaboost, we seek to greedily minimize the following function at each step with respect to the new function $f_k$, keeping the previous functions fixed:
$$f_k = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i \log\left[1 + \exp\left(-y_i F_{k-1}(x_i) - y_i f(x_i)\right)\right]$$
Unfortunately, this cannot be expressed in terms of an optimization on a weighted training set, because each log-loss term does not factorize into the contribution of the previous functions $F_{k-1}$ and the contribution from the new function $f_k$, as was the case for the exponential loss.

Algorithm 1 (LogitBoost). For our choice of basis functions $\mathcal{F}$, to solve this problem exactly we would have to perform a numerical optimization for each $A$ to find the best $c^*(A)$, and then plug this $c^*(A)$ into the loss function to get $J^*_{\log}(A)$ and choose the $A$ with the minimum value. The objective for the optimization of $c$, $J_{\log}(c; A)$, is convex in $c$, as we now show. Let $F^i_{k-1} = F_{k-1}(x_i)$ for more compact notation.

The first and second derivatives are:
$$\frac{d}{dc} J_{\log}(c; A) = \sum_{i:\, x_i \in A} \frac{-y_i \exp\left(-y_i(F^i_{k-1} + c)\right)}{1 + \exp\left(-y_i(F^i_{k-1} + c)\right)}$$
$$\frac{d^2}{dc^2} J_{\log}(c; A) = \sum_{i:\, x_i \in A} \left[ \frac{y_i^2 \exp\left(-y_i(F^i_{k-1} + c)\right)}{1 + \exp\left(-y_i(F^i_{k-1} + c)\right)} - \frac{y_i^2 \exp\left(-2y_i(F^i_{k-1} + c)\right)}{\left(1 + \exp\left(-y_i(F^i_{k-1} + c)\right)\right)^2} \right]$$
Each single term is non-negative if:
$$\frac{\exp\left(-y_i(F^i_{k-1}+c)\right)}{1+\exp\left(-y_i(F^i_{k-1}+c)\right)} \ge \frac{\exp\left(-2y_i(F^i_{k-1}+c)\right)}{\left(1+\exp\left(-y_i(F^i_{k-1}+c)\right)\right)^2}$$
$$\iff \left(1+\exp\left(-y_i(F^i_{k-1}+c)\right)\right)\exp\left(-y_i(F^i_{k-1}+c)\right) \ge \exp\left(-2y_i(F^i_{k-1}+c)\right) \iff \exp\left(-y_i(F^i_{k-1}+c)\right) \ge 0$$
which always holds. Therefore the optimization over $c$ is convex, and can be solved with Newton's method. The LogitBoost algorithm of Friedman, Hastie and Tibshirani ([2]) executes one Newton step, and this appears to be sufficient to get very close to the true minimum, as seen in Figure 2. However, Friedman, Hastie and Tibshirani also suggest intentionally using a single Newton step instead of exact minimization for the Adaboost objective, which they call Gentle Adaboost.

The optimal $c$ for the log-loss single-Newton-step algorithm (LogitBoost) is just the Newton step itself, since the starting point is always $c = 0$:
$$c^*(A) = -\frac{\left.\frac{d}{dc} J_{\log}(c; A)\right|_{c=0}}{\left.\frac{d^2}{dc^2} J_{\log}(c; A)\right|_{c=0}} = \frac{\sum_{i:\, x_i \in A} \frac{y_i \exp(-y_i F^i_{k-1})}{1 + \exp(-y_i F^i_{k-1})}}{\sum_{i:\, x_i \in A} \left[ \frac{y_i^2 \exp(-y_i F^i_{k-1})}{1 + \exp(-y_i F^i_{k-1})} - \frac{y_i^2 \exp(-2 y_i F^i_{k-1})}{\left(1 + \exp(-y_i F^i_{k-1})\right)^2} \right]} = \frac{\sum_{i:\, x_i \in A,\, y_i = +1} w_i - \sum_{i:\, x_i \in A,\, y_i = -1} w_i}{\sum_{i:\, x_i \in A} w_i (1 - w_i)} = \frac{W_+(A) - W_-(A)}{\tilde{W}(A)}$$
where $w_i := \frac{1}{1 + \exp(y_i F_{k-1}(x_i))}$ are interpreted as (un-normalized) weights, $W_+(A)$ and $W_-(A)$ are sums of weights as before, and $\tilde{W}(A) := \sum_{i:\, x_i \in A} w_i (1 - w_i)$.
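A sketch of this variant under the same illustrative representation: the Newton-step scores $c^*(A)$ for all partitions use only weight sums (matrix multiplications), but evaluating the exact log-loss of a candidate $(A, c)$ still requires a pass over the training examples, which is the extra cost noted in the comparison in Section 3.1.

```python
import numpy as np

def logitboost_newton_scores(membership, y, F_prev, eps=1e-12):
    """Newton-step scores c*(A) = (W_+(A) - W_-(A)) / sum_{i in A} w_i (1 - w_i), for all A at once.

    membership: (m, n_A) 0/1 matrix (dense or scipy.sparse); y in {-1, +1}; F_prev = F_{k-1}(x_i).
    """
    w = 1.0 / (1.0 + np.exp(y * F_prev))           # logistic weights w_i
    W_plus = membership.T @ (w * (y == +1))
    W_minus = membership.T @ (w * (y == -1))
    W_curv = membership.T @ (w * (1.0 - w))        # sum_{i in A} w_i (1 - w_i)
    c_star = (W_plus - W_minus) / np.maximum(W_curv, eps)
    return c_star, (W_plus, W_minus, W_curv)

def exact_logloss(a_mask, c, y, F_prev):
    """Exact J_log(A, c): unlike the exponential loss, this needs a loop over the examples."""
    F_new = F_prev + c * a_mask                    # a_mask: dense 0/1 indicator of A
    return np.sum(np.logaddexp(0.0, -y * F_new))   # sum_i log(1 + exp(-y_i F_new_i)), computed stably
```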

Plugging in this value of $c$ for one Newton step gives an estimate of the best loss achievable for a choice of $A$:
$$J^*_{\log}(A) \approx \sum_{i:\, x_i \in A} \log\left(1 + \exp\left(-y_i F^i_{k-1} - y_i \frac{W_+(A) - W_-(A)}{\tilde{W}(A)}\right)\right) + \sum_{i:\, x_i \notin A} \log\left(1 + \exp\left(-y_i F^i_{k-1}\right)\right)$$
We refer to this process of choosing $c^*(A)$ by one Newton step, and then finding the $A$ with minimum resulting log-loss, as LogitBoost. Unfortunately, this does not decompose into a direct function of sums over weights, and so we cannot use sparse matrix multiplication to choose the best $A$ as we did with the exponential loss. Instead, for each $A$, we must loop over the training set to compute the true log-loss.

Algorithm 2 (QuadBoost). If we approximate the minimum loss for each $A$ at $c^*(A)$ by the minimum of the second-order expansion of the loss (which is what the Newton step actually minimizes), then we get a quantity that is expressed in terms of sums. This is given by:
$$J^*(A) \approx J_{\mathrm{prev}} - \frac{1}{2} \frac{\left( \left.\frac{d}{dc} J_{\log}(c; A)\right|_{c=0} \right)^2}{\left.\frac{d^2}{dc^2} J_{\log}(c; A)\right|_{c=0}} = J_{\mathrm{prev}} - \frac{\left( \sum_{i:\, x_i \in A} y_i w_i \right)^2}{2 \sum_{i:\, x_i \in A} w_i (1 - w_i)} = J_{\mathrm{prev}} - \frac{\left( \sum_{i:\, x_i \in A,\, y_i = +1} w_i - \sum_{i:\, x_i \in A,\, y_i = -1} w_i \right)^2}{2 \sum_{i:\, x_i \in A} w_i (1 - w_i)}$$
This expression only uses sums of weights, which we can compute efficiently for all $A$ at once using matrix multiplication. We refer to this process of choosing $c^*(A)$ by a single Newton step, and then using the second-order approximation criterion above to find the best $A$, as QuadBoost. Note that for MEDUSA this update only requires one set of matrix multiplications, instead of the two required by the existing code.

Algorithm 3 (GradientBoost). A different approximation also allows the search over our set $\mathcal{A}$ to be done using sparse matrix multiplication. Instead of using the $A$ that gives the optimal loss when $J_{\log}(c; A)$ is optimized, we use the $A$ that gives the highest value of $-\frac{d}{dc} J_{\log}(c; A)\big|_{c=0}$. The intuition is that this is the local direction of steepest descent in $A$ space. Then we perform a line search for the optimal $c$ for our choice of $A$. For the log-loss, the criterion for choosing $A$ is:
$$-\left.\frac{d}{dc} J_{\log}(c; A)\right|_{c=0} = \sum_{i:\, x_i \in A} \frac{y_i \exp(-y_i F_{k-1}(x_i))}{1 + \exp(-y_i F_{k-1}(x_i))} = W_+(A) - W_-(A)$$
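Given the weight sums from the LogitBoost sketch above, the QuadBoost and GradientBoost selection criteria are just elementwise expressions in those sums; a minimal sketch (helper names are illustrative):

```python
import numpy as np

def quadboost_gain(W_plus, W_minus, W_curv, eps=1e-12):
    """Second-order estimate of the log-loss decrease for each A:
    (W_+(A) - W_-(A))^2 / (2 * sum_{i in A} w_i (1 - w_i))."""
    return (W_plus - W_minus) ** 2 / (2.0 * np.maximum(W_curv, eps))

def gradientboost_criterion(W_plus, W_minus):
    """-dJ_log/dc at c = 0 for each A, i.e. W_+(A) - W_-(A)."""
    return W_plus - W_minus

# Example selection, with (W_plus, W_minus, W_curv) as returned by logitboost_newton_scores:
# best_quad = int(np.argmax(quadboost_gain(W_plus, W_minus, W_curv)))
# best_grad = int(np.argmax(gradientboost_criterion(W_plus, W_minus)))
```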

where the un-normalized weights $w_i$ are as in LogitBoost. The $A \in \mathcal{A}$ that maximizes this criterion can be found efficiently, since the $W$'s are found for all $A$ with sparse matrix multiplication. Then we can find the optimal choice of $c^*(A)$ by a single Newton step. A similar idea is presented in [?].

Algorithm 4 (CollinsBoost). Collins, Schapire, and Singer ([1]) present a sequential update algorithm for both the exponential and the log-loss in their Figure 2, using the language of Bregman divergences to make the connection. When applied to the exponential loss, the algorithm is basically identical to Adaboost. When applied to the log-loss, the algorithm is distinct from any of the above. We derive a version of the algorithm here in a simple way, without discussing Bregman distances.

Consider the following upper bound on the log-loss for a given training example $i$, as a function of the choice of new basis function $f_k$:
$$J^i_{\log}(f_k) = \log\left(1 + \exp\left(-y_i (F^i_{k-1} + f_k(x_i))\right)\right)$$
$$= \log\left(1 + \exp\left(-y_i F^i_{k-1} - y_i f_k(x_i)\right)\right) - \log\left(1 + \exp(-y_i F^i_{k-1})\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$= \log\left(\frac{1 + \exp\left(-y_i F^i_{k-1} - y_i f_k(x_i)\right)}{1 + \exp(-y_i F^i_{k-1})}\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$= \log\left(\frac{\exp(y_i F^i_{k-1}) + \exp(-y_i f_k(x_i))}{1 + \exp(y_i F^i_{k-1})}\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$= \log\left(1 - \frac{1}{1 + \exp(y_i F^i_{k-1})} + \frac{\exp(-y_i f_k(x_i))}{1 + \exp(y_i F^i_{k-1})}\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$\le \frac{1}{1 + \exp(y_i F^i_{k-1})}\left(\exp(-y_i f_k(x_i)) - 1\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
where the last line comes from $\log(1 + x) \le x$. Note that this bound is exact when $f_k(x_i) = 0$. An upper bound on the total log-loss is then:
$$J_{\log}(f_k) \le \sum_i \frac{1}{1 + \exp(y_i F^i_{k-1})}\left(\exp(-y_i f_k(x_i)) - 1\right) + \sum_i \log\left(1 + \exp(-y_i F^i_{k-1})\right) = \sum_i w_i \exp(-y_i f_k(x_i)) + \sum_i \left(\log\left(1 + \exp(-y_i F^i_{k-1})\right) - w_i\right)$$
where $w_i$ is again as in Algorithms 1 and 2. The terms on the right are constant, so minimizing this upper bound at each iteration is done by finding
$$f_k = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i w_i \exp(-y_i f(x_i))$$
Note that this is identical to the update rule for Adaboost, except for the redefinition of the weights $w_i$; this was noted briefly in [3]. This means that the update rules for finding the optimal $A$ and $c$ are also identical, except for the choice of weighting function.
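Because CollinsBoost is the Adaboost update with the weights redefined as $w_i = 1/(1+\exp(y_i F_{k-1}(x_i)))$, a sketch only needs to change the weight line of the earlier Adaboost step (same illustrative representation as above):

```python
import numpy as np

def collinsboost_step(membership, y, F_prev, eps=1e-12):
    """Identical to adaboost_step above, except that the weights are logistic:
    w_i = 1 / (1 + exp(y_i F_{k-1}(x_i))) instead of w_i = exp(-y_i F_{k-1}(x_i))."""
    w = 1.0 / (1.0 + np.exp(y * F_prev))          # the only change relative to Adaboost
    W_plus = membership.T @ (w * (y == +1))
    W_minus = membership.T @ (w * (y == -1))
    W_out = w.sum() - (W_plus + W_minus)
    surrogate = 2.0 * np.sqrt(W_plus * W_minus) + W_out   # minimized upper-bound surrogate
    best = int(np.argmin(surrogate))
    c_star = 0.5 * np.log((W_plus[best] + eps) / (W_minus[best] + eps))
    return best, c_star
```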

Note that each step is not training a log-loss weak learner on the weighted training set, but an exponential-loss weak learner. The fact that the upper bound is exact for $c = 0$ suggests that this algorithm will be most accurate in the later stages of training. This algorithm appears to perform significantly worse than Algorithm 1.

Algorithm 5 (MedusaBoost). The current logit-boost MEDUSA code minimizes the following function (exactly) at each step with respect to $f_{A,c}$:
$$J_{\mathrm{med}}(A, c) = \sum_i \frac{1}{1 + \exp(y_i F^i_{k-1})} \log\left(1 + \exp(-y_i f_{A,c}(x_i))\right) = \sum_i w_i \log\left(1 + \exp(-y_i f_{A,c}(x_i))\right)$$
The intuition is that we are training a log-loss weak learner on a re-weighted version of the training set at each iteration. This makes sense conceptually, but it is not any of the algorithms in the literature discussed above. For our set of basis functions, this function is:
$$J_{\mathrm{med}}(A, c) = \sum_{i:\, x_i \in A,\, y_i = +1} w_i \log\left(1 + \exp(-c)\right) + \sum_{i:\, x_i \in A,\, y_i = -1} w_i \log\left(1 + \exp(c)\right) + \sum_{i:\, x_i \notin A} w_i \log 2$$
Finding the optimal $c$ for a given $A$ by differentiating:
$$c^*(A) = \log\left(\frac{W_+(A)}{W_-(A)}\right)$$
Note that this is twice the prediction of Algorithm 3. Plugging this into the objective gives the expression for the optimal loss for a given $A$:
$$J^*_{\mathrm{med}}(A) = W_{\mathrm{total}} \log 2 + W_+(A) \log\left(0.5 + 0.5\,\frac{W_-(A)}{W_+(A)}\right) + W_-(A) \log\left(0.5 + 0.5\,\frac{W_+(A)}{W_-(A)}\right)$$
which is the criterion used to choose a rule $A$ in the code. If interpreted as an algorithm for approximately greedily minimizing the total log-loss $J_{\log}$, this algorithm finds the optimal $c$ by using the following approximation of the derivative of the log-loss:
$$\frac{d}{dc} \log\left(1 + \exp\left(-y_i(F^i_{k-1} + c)\right)\right) = \frac{-y_i}{1 + \exp\left(y_i(F^i_{k-1} + c)\right)} \approx \frac{-y_i}{\left(1 + \exp(y_i F^i_{k-1})\right)\left(1 + \exp(y_i c)\right)}$$
This approximation is best when the margin $y_i F^i_{k-1}$ is near zero.
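A sketch of the MedusaBoost closed-form update, assuming the logistic-weight sums $W_+(A)$ and $W_-(A)$ are available as arrays (one entry per rule $A$) and $W_{\mathrm{total}}$ is the scalar sum of all weights:

```python
import numpy as np

def medusaboost_scores(W_plus, W_minus, W_total, eps=1e-12):
    """Per-rule MedusaBoost update: c*(A) = log(W_+(A) / W_-(A)) and the selection criterion
    J*_med(A) = W_total*log(2) + W_+*log(0.5 + 0.5*W_-/W_+) + W_-*log(0.5 + 0.5*W_+/W_-)."""
    ratio = (W_plus + eps) / (W_minus + eps)
    c_star = np.log(ratio)
    J_star = (W_total * np.log(2.0)
              + W_plus * np.log(0.5 + 0.5 / ratio)
              + W_minus * np.log(0.5 + 0.5 * ratio))
    return c_star, J_star
```

The rule $A$ with the lowest $J^*_{\mathrm{med}}(A)$ would then be selected, matching the criterion described above.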

3.1 Comparison of algorithms

I evaluated how well each of the above algorithms greedily minimizes the log-loss at each iteration, as well as the train and test accuracy, for a particular data setting: I used a set of 3D chip-features (exonarray_cons_vs_uw_fdr0.0_pks_proxk_v), with no paired regulator features, in addition to the NOT of all these chip features, and a bias feature (for a total of 30 + 30 + 1 base features). The training log-loss (the quantity being minimized) is plotted in Figure 2 for each of the algorithms. I also evaluated the train and test accuracy of each of these algorithms, shown in Figure 3.

Figure 2: LogitBoost, MedusaBoost, and QuadBoost all perform very similarly, with QuadBoost becoming less optimal in later iterations. LogitBoost is a much more computationally heavy algorithm, since it requires an additional loop over the training examples that the other two do not. AdaBoost is shown even though it is not attempting to minimize this loss function.

Figure 3: All algorithms have test accuracy within 1 or 2% on this problem. CollinsBoost appears to perform the worst of the log-loss algorithms. The training accuracies of LogitBoost, QuadBoost and MedusaBoost are basically identical.

3.2 Extension to real-valued features

Although it was not presented this way, the Alternating Decision Tree algorithm uses a predictor function of the following form:
$$F(x) = \sum_{t=1}^{T} c_t \prod_{j=1}^{n_t} f_{t_j}(x)$$
where the $f_{t_j}(x)$ are base features that are boolean conditions on $x$, and the product of these functions forms an AND. The loss function (the exponential loss was used in the ADT algorithm) is then minimized greedily with respect to adding a new term at each iteration. The tree part of the algorithm derives from the restriction that new terms can only involve a combination of base features that existed in a previous term, plus one additional base feature.

We can attempt to extend this model to incorporate general real-valued base features $f_{t_j}(x)$. Let $c$ be the score of the new feature, and let $f_k(x)$ be the new feature. The exponential loss used by Adaboost and the original ADT is then:
$$J_{\exp}(c, f_k) = \sum_i w_i \exp(-y_i c f_k(x_i))$$
whereas previously $f_k(x_i)$ was either zero or one. In our new case there is no longer a closed-form expression for the optimal $c$ given a choice of $f_k$; we would have to resort to a numerical optimization. The same holds if we want to use the log-loss.

Interestingly, if we use the QuadBoost or GradientBoost procedure as above, we can still use matrix multiplications to approximate which $f_k$ should be chosen. The same derivations as above apply to this case. For GradientBoost, the criterion for choosing a feature $f_k$ becomes:
$$-\left.\frac{d J_{\log}(c; f_k)}{dc}\right|_{c=0} = \sum_i \frac{y_i f_k(x_i) \exp(-y_i F_{k-1}(x_i))}{1 + \exp(-y_i F_{k-1}(x_i))} = \sum_i y_i f_k(x_i)\, w_i = \sum_{i:\, y_i = +1} w_i f_k(x_i) - \sum_{i:\, y_i = -1} w_i f_k(x_i)$$
where $w_i := \frac{1}{1 + \exp(y_i F_{k-1}(x_i))}$ as before. If the $f_k(x_i)$ are expressed in a matrix (or a pair of matrices, for the decomposable features used in MEDUSA), then these sums can be performed with a matrix multiplication.

For LogitBoost, having chosen an $f_k$ using the above GradientBoost method, we would choose the optimal $c$ using a single Newton step, which is now:
$$c^*(f_k) = \frac{\sum_{i:\, y_i = +1} w_i f_k(x_i) - \sum_{i:\, y_i = -1} w_i f_k(x_i)}{\sum_i f_k(x_i)^2\, w_i (1 - w_i)} = \frac{\sum_i w_i y_i f_k(x_i)}{\sum_i f_k(x_i)^2\, w_i (1 - w_i)}$$
The QuadBoost criterion for choosing an $f_k$ is then the estimated decrease in loss,
$$\frac{\left(\sum_i w_i y_i f_k(x_i)\right)^2}{2 \sum_i f_k(x_i)^2\, w_i (1 - w_i)}$$
and these sums can be computed by matrix multiplication. For MEDUSA with decomposable features, the sum over $i$ becomes a sum over $g$ and $e$, and the code does not have to change.

3.3 Binary response variable (logistic regression)

The loss as a function of the new feature choice $k$ and new coefficient $c$ is:
$$J(k, c) = \sum_i \log\left(1 + \exp\left(-y_i (F_i + c A_{ki})\right)\right)$$
The derivative with respect to $c$ is:
$$\frac{dJ}{dc} = \sum_i \frac{-y_i A_{ki} \exp\left(-y_i (F_i + c A_{ki})\right)}{1 + \exp\left(-y_i (F_i + c A_{ki})\right)} = \sum_i \frac{-y_i A_{ki}}{1 + \exp\left(y_i (F_i + c A_{ki})\right)}$$
Setting the derivative equal to zero does not give an analytic solution (even if we restrict ourselves to binary features). Instead, we again resort to a single Newton step.
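A minimal sketch of this real-valued extension, assuming the candidate features are stacked in a dense matrix `A` with `A[k, i]` $= f_k(x_i)$ (this layout and the helper name are illustrative, following the $A_{ki}$ notation above, and are not the MEDUSA code):

```python
import numpy as np

def select_real_valued_feature(A, y, F_prev, eps=1e-12):
    """GradientBoost feature selection followed by a single Newton step for the score c.

    A: (n_features, m) matrix with A[k, i] = f_k(x_i); y in {-1, +1}; F_prev = F_{k-1}(x_i).
    """
    w = 1.0 / (1.0 + np.exp(y * F_prev))
    grad_crit = A @ (w * y)                   # sum_i w_i y_i f_k(x_i), for every feature k at once
    k = int(np.argmax(grad_crit))             # steepest-descent choice of feature
    curv = (A[k] ** 2) @ (w * (1.0 - w))      # sum_i f_k(x_i)^2 w_i (1 - w_i)
    c = grad_crit[k] / max(curv, eps)         # single Newton step for c
    return k, c
```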

3.4 Poisson response variable

If the response variable is distributed according to a Poisson distribution where $\log(\lambda)$ is a weighted sum of features, then the loss as a function of the new choice of feature $k$ and the new score $c$ (derived from the negative log-likelihood) is:
$$J(k, c) = \sum_i \left[-y_i (F_i + c A_{ki}) + \exp(F_i + c A_{ki})\right] = -c \sum_i y_i A_{ki} - \sum_i y_i F_i + \sum_i \exp(c A_{ki}) \exp(F_i)$$
The derivative with respect to $c$ is
$$\frac{dJ}{dc} = -\sum_i y_i A_{ki} + \sum_i A_{ki} \exp(c A_{ki}) \exp(F_i)$$
Setting the derivative to zero does not give an analytic solution for the optimal $c$ in the general case, and we will resort to using one Newton step as before. However, in the case of boolean features ($A_{ki} \in \{0, 1\}$) we can find an analytic solution:
$$\exp(c) \sum_i A_{ki} \exp(F_i) = \sum_i y_i A_{ki} \quad\Rightarrow\quad c^*_k = \log\left(\frac{\sum_i A_{ki} y_i}{\sum_i A_{ki} \exp(F_i)}\right)$$
Substituting this expression into $J(k, c)$ gives the optimal loss achievable for each choice of feature $k$ to add:
$$J^*_k = J_{\mathrm{prev}} - \sum_i A_{ki} \exp(F_i) + \sum_i A_{ki} y_i \left(1 - \log\left(\frac{\sum_i A_{ki} y_i}{\sum_i A_{ki} \exp(F_i)}\right)\right)$$
The required sums are $\sum_i A_{ki} y_i$ and $\sum_i A_{ki} \exp(F_i)$, which can each be computed for all $k$ with matrix multiplications.

If we don't restrict ourselves to boolean features, then we resort to a Newton step. The first and second derivatives with respect to $c$, evaluated at $c = 0$, are:
$$\left.\frac{dJ}{dc}\right|_{c=0} = -\sum_i y_i A_{ki} + \sum_i A_{ki} \exp(F_i), \qquad \left.\frac{d^2J}{dc^2}\right|_{c=0} = \sum_i A_{ki}^2 \exp(F_i)$$
The Newton step is then:
$$c_k = -\frac{\left.\frac{dJ}{dc}\right|_{c=0}}{\left.\frac{d^2J}{dc^2}\right|_{c=0}} = \frac{\sum_i A_{ki}\left(y_i - \exp(F_i)\right)}{\sum_i A_{ki}^2 \exp(F_i)}$$
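For the boolean-feature Poisson case, the closed-form score and optimal loss for every candidate feature reduce to two matrix-vector products; a sketch under the same illustrative `A[k, i]` layout as above:

```python
import numpy as np

def poisson_boolean_scores(A, y, F_prev, eps=1e-12):
    """Closed-form update for boolean features A[k, i] in {0, 1} under the Poisson loss:
    c*_k = log(S_y / S_e) and J*_k = J_prev - S_e + S_y * (1 - c*_k),
    where S_y = sum_i A_ki y_i and S_e = sum_i A_ki exp(F_i)."""
    lam = np.exp(F_prev)                     # current rates lambda_i = exp(F_i)
    S_y = A @ y                              # sum_i A_ki y_i, for every k at once
    S_e = A @ lam                            # sum_i A_ki exp(F_i), for every k at once
    c_star = np.log((S_y + eps) / (S_e + eps))
    J_prev = np.sum(lam - y * F_prev)        # negative log-likelihood, up to the log(y_i!) term
    J_star = J_prev - S_e + S_y * (1.0 - c_star)
    return c_star, J_star
```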

3.5 Gaussian response variable

If the response variable is normally distributed, then the loss as a function of the new choice of feature $k$ and the new score $c$ is:
$$J(k, c) = \sum_i \left(y_i - F_i - c A_{ki}\right)^2$$
Minimizing with respect to $c$, we have the optimal $c$ for each $k$:
$$c^*_k = \frac{\sum_i A_{ki}(y_i - F_i)}{\sum_i A_{ki}^2}$$
Substituting this expression for $c^*_k$ into $J(k, c)$ gives the optimal loss achievable for each choice $k$ of feature to add:
$$J^*_k = \sum_i \left(y_i - F_i - A_{ki} \frac{\sum_j A_{kj}(y_j - F_j)}{\sum_j A_{kj}^2}\right)^2 = \sum_i (y_i - F_i)^2 - 2 \sum_i (y_i - F_i) A_{ki} \frac{\sum_j A_{kj}(y_j - F_j)}{\sum_j A_{kj}^2} + \sum_i A_{ki}^2 \left(\frac{\sum_j A_{kj}(y_j - F_j)}{\sum_j A_{kj}^2}\right)^2 = J_{\mathrm{prev}} - \frac{\left(\sum_i A_{ki}(y_i - F_i)\right)^2}{\sum_i A_{ki}^2}$$
The numerator and denominator can be calculated efficiently for all $k$ at once using matrix multiplication. If the data consist of two dimensions $g$ and $e$, and the features are the product of two base features $m$ and $r$ of the form $A^{mr}_{ge} = B^m_{ge} C^r_e$, then the optimal loss for a choice of $m$ and $r$ is:
$$J^*_{mr} = J_{\mathrm{prev}} - \frac{\left(\sum_{g=1}^{G} \sum_{e=1}^{E} B^m_{ge}\, C^r_e\, (y_{ge} - F_{ge})\right)^2}{\sum_{g=1}^{G} \sum_{e=1}^{E} (B^m_{ge})^2 (C^r_e)^2}$$
Now, the numerator involves a specialized matrix multiplication $B(y - F)C$ (of the same complexity as a normal matrix multiplication).

References

[1] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. In Machine Learning, pages 158-169, 2000.

[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

[3] Robert E. Schapire. The boosting approach to machine learning: An overview, 2002.