Statistical Machine Learning Notes 9                                Instructor: Justin Domke

Boosting

1 Additive Models

The basic idea of boosting is to greedily build additive models. Let b_m(x) be some predictor (a tree, a neural network, whatever), which we will call a base learner. In boosting, we will build a model that is the sum of base learners,

    f(x) = Σ_{m=1}^{M} b_m(x).

The obvious way to fit an additive model is greedily. Namely, we start with the simple function f_0(x) = 0, then iteratively add base learners to minimize the risk of f_{m-1}(x) + b(x).

Forward stagewise additive modeling
    f_0(x) ← 0
    For m = 1, ..., M
        b_m ← arg min_b Σ_{(x̂,ŷ)} L(f_{m-1}(x̂) + b(x̂), ŷ)
        f_m(x) ← f_{m-1}(x) + v b_m(x)

Notice here that once we have fit a particular base learner, it is frozen in, and is not further changed. One can also create procedures for additive modeling that re-fit the base learners several times.

2 Additive Regression Trees

It is quite easy to do forward stagewise additive modeling when we are interested in minimizing the least-squares regression loss.
    arg min_b Σ_{(x̂,ŷ)} L(f_{m-1}(x̂) + b(x̂), ŷ)
        = arg min_b Σ_{(x̂,ŷ)} (f_{m-1}(x̂) + b(x̂) − ŷ)²
        = arg min_b Σ_{(x̂,ŷ)} (b(x̂) − r(x̂, ŷ))²,                          (2.1)

where r(x̂, ŷ) = ŷ − f_{m-1}(x̂). However, minimizing a problem like Eq. 2.1 is exactly what all our previous regression methods do. Thus, we can apply any method for least-squares regression essentially unchanged. The only difference is that the target values are given by ŷ − f_{m-1}(x̂), the difference between ŷ and the additive model so far.

Here are some examples applying this to two datasets. Here (as in all the examples in these notes) our base learner is a regression tree with two levels of splits.
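The reduction in Eq. 2.1 is easy to turn into code. Below is a minimal sketch (not part of the original notes) of forward stagewise additive modeling for the least-squares loss, assuming scikit-learn's DecisionTreeRegressor with max_depth=2 as a stand-in for the two-level regression tree base learner; all function names are just illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_stagewise_trees(X, y, M=100):
        """Forward stagewise additive modeling for the least-squares loss.
        Each base learner b_m is a regression tree with two levels of splits."""
        trees = []
        f = np.zeros(len(y))              # f_0(x) = 0 on the training data
        for m in range(M):
            r = y - f                     # residuals: the new target values
            b = DecisionTreeRegressor(max_depth=2).fit(X, r)
            trees.append(b)
            f += b.predict(X)             # f_m = f_{m-1} + b_m
        return trees

    def predict_stagewise(trees, X):
        """Evaluate the additive model f(x) = sum_m b_m(x)."""
        return sum(b.predict(X) for b in trees)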
[Figure: the first dataset, and the fitted additive tree model after increasing numbers of iterations.]
[Figure: the second dataset, and the fitted additive tree model after increasing numbers of iterations.]

3 Boosting

Suppose we want to minimize some other loss function. Unfortunately, if we aren't using the least-squares loss, the optimization
    arg min_b Σ_{(x̂,ŷ)} L(f_{m-1}(x̂) + b(x̂), ŷ)                          (3.1)

will not reduce to a least-squares problem. One alternative would be to use a different base learner that fits a different loss. With boosting, we don't want to change our base learner. This could be because it is simply inconvenient to create a new algorithm for a different loss function. The general idea of boosting is to fit a new base learner b to, if not minimize the risk for f_m, at least decrease it. As we will see, we can do this for a large range of loss functions, all based on a least-squares base learner.

4 Gradient Boosting

Gradient boosting is a very clever algorithm. In each iteration, we set as target values the negative gradient of the loss with respect to f. So, if the base learner closely matches the target values, when we add some multiple v of the base learner to our additive model, it should decrease the loss.

Gradient Boosting
    f_0(x) ← 0
    For m = 1, ..., M
        For all (x̂, ŷ):  r(x̂, ŷ) ← − d/df L(f_{m-1}(x̂), ŷ)
        b_m ← arg min_b Σ_{(x̂,ŷ)} (b(x̂) − r(x̂, ŷ))²
        f_m(x) ← f_{m-1}(x) + v b_m(x)

Consider changing f(x) to f(x) + v b(x) for some small v. We want to minimize the loss, i.e.

    Σ_{(x̂,ŷ)} L(f_{m-1}(x̂) + b(x̂), ŷ)  ≈  Σ_{(x̂,ŷ)} [ L(f_{m-1}(x̂), ŷ) + (d/df) L(f_{m-1}(x̂), ŷ) b(x̂) ].     (4.1)

However, it doesn't make much sense to simply minimize the right hand side of Eq. 4.1, since we can always make the decrease bigger by multiplying b by a larger constant. Instead, we choose to minimize
    Σ_{(x̂,ŷ)} [ (d/df) L(f_{m-1}(x̂), ŷ) b(x̂) + ½ b(x̂)² ],

which prevents b from becoming too large. It is easy to see that

    arg min_b Σ_{(x̂,ŷ)} [ (d/df) L(f_{m-1}(x̂), ŷ) b(x̂) + ½ b(x̂)² ]
        = arg min_b Σ_{(x̂,ŷ)} ( b(x̂) + (d/df) L(f_{m-1}(x̂), ŷ) )²
        = arg min_b Σ_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )².

Some example loss functions, and their derivatives:

    least-squares:             L = (f − ŷ)²             dL/df = 2(f − ŷ)
    least-absolute-deviation:  L = |f − ŷ|              dL/df = sign(f − ŷ)
    logistic regression:       L = log(1 + exp(−ŷ f))   dL/df = −ŷ / (1 + exp(f ŷ))
    exponential loss:          L = exp(−ŷ f)            dL/df = −ŷ exp(−ŷ f)

Fun fact: gradient boosting with a step size of v = 1/2 on the least-squares loss gives exactly the forward stagewise modeling algorithm we devised above.

Here are some examples of gradient boosting on classification losses (logistic and exponential). The figures show the function f(x). As usual, we would get a predicted class by taking the sign of f(x).
Gradient Boosting, Logistic Loss, step size of v = 1.

[Figure: f(x) after increasing numbers of iterations.]
Gradient Boosting, Exponential Loss, step size of v = 1.

[Figure: f(x) after increasing numbers of iterations.]
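As a concrete illustration (not from the original notes), here is a minimal sketch of gradient boosting on top of a least-squares tree base learner. The negative-gradient helpers follow the logistic and exponential derivatives listed above, labels ŷ are assumed to be ±1, and all names are just illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Negative gradients -dL/df of two of the losses above (labels y in {-1, +1}).
    def neg_grad_logistic(f, y):       # L = log(1 + exp(-y f))
        return y / (1.0 + np.exp(y * f))

    def neg_grad_exponential(f, y):    # L = exp(-y f)
        return y * np.exp(-y * f)

    def fit_gradient_boosting(X, y, neg_grad, M=100, v=1.0):
        """Gradient boosting: each iteration fits a least-squares tree to the
        negative gradient of the loss, then takes a step of size v."""
        trees = []
        f = np.zeros(len(y))
        for m in range(M):
            r = neg_grad(f, y)                              # target values
            b = DecisionTreeRegressor(max_depth=2).fit(X, r)
            trees.append(b)
            f += v * b.predict(X)                           # f_m = f_{m-1} + v b_m
        return trees

    def predict_boosting(trees, X, v=1.0):
        """Evaluate f(x); take the sign for a predicted class."""
        return v * sum(b.predict(X) for b in trees)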
5 Newton Boosting

The basic idea of Newton boosting is, instead of just taking the gradient, to build a local quadratic model, and then to minimize it. We can again do this using any least-squares regressor as a base learner. (Note that "Newton boosting" is not generally accepted terminology. The general idea of fitting to a quadratic approximation of the loss is used in several algorithms for specific loss functions (e.g. LogitBoost), but doesn't seem to have been given a name. The class voted to use "Newton boosting" in preference to "quadratic boosting".)

Newton Boosting
    f_0(x) ← 0
    For m = 1, ..., M
        For all (x̂, ŷ):
            h(x̂, ŷ) ← d²/df² L(f_{m-1}(x̂), ŷ)
            g(x̂, ŷ) ← d/df L(f_{m-1}(x̂), ŷ)
            r(x̂, ŷ) ← − g(x̂, ŷ) / h(x̂, ŷ)
        b_m ← arg min_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
        f_m(x) ← f_{m-1}(x) + v b_m(x)

To understand this, consider searching for the b that minimizes the empirical risk for f_m. To start with, we make a local second-order approximation. (Here, g(x̂, ŷ) = d/df L(f_{m-1}(x̂), ŷ), and h(x̂, ŷ) = d²/df² L(f_{m-1}(x̂), ŷ).)

    arg min_b Σ_{(x̂,ŷ)} L(f_{m-1}(x̂) + b(x̂), ŷ)
        ≈ arg min_b Σ_{(x̂,ŷ)} [ L(f_{m-1}(x̂), ŷ) + b(x̂) g(x̂, ŷ) + ½ b(x̂)² h(x̂, ŷ) ]
        = arg min_b ½ Σ_{(x̂,ŷ)} h(x̂, ŷ) [ 2 b(x̂) g(x̂, ŷ) / h(x̂, ŷ) + b(x̂)² ]
        = arg min_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) + g(x̂, ŷ) / h(x̂, ŷ) )²
        = arg min_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
This is a weighted least-squares regression, with weights h(x̂, ŷ) and targets −g/h.

Here, we collect in a table all of the loss functions and derivatives we might use.

Loss Functions
    Name                       L(f, ŷ)                 dL/df                   d²L/df²
    least-squares              (f − ŷ)²                2(f − ŷ)                2
    least-absolute-deviation   |f − ŷ|                 sign(f − ŷ)             —
    logistic                   log(1 + exp(−ŷ f))      −ŷ / (1 + exp(f ŷ))     ŷ² / (2 + exp(f ŷ) + exp(−f ŷ))
    exponential                exp(−ŷ f)               −ŷ exp(−ŷ f)            ŷ² exp(−f ŷ)

Here are some experiments using Newton boosting on the same dataset as above. Generally, we see that Newton boosting is more effective than gradient boosting in quickly reducing the training error. However, this also means it begins to overfit after a smaller number of iterations. Thus, if running time is an issue (either at training or test time), Newton boosting may be preferable, since it more quickly reduces training error.
Newton Boosting, Logistic Loss, step size of v = 1.

[Figure: f(x) after increasing numbers of iterations.]
Newton Boosting, Logistic Loss, step size of v = 1.

[Figure: f(x) after increasing numbers of iterations.]
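To make the weighted least-squares step concrete, here is a minimal sketch (again, not from the notes) of Newton boosting with the logistic loss. It uses the gradient and Hessian from the table above and passes the Hessian values as sample weights to a scikit-learn regression tree; the names are just illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def grad_hess_logistic(f, y):
        """Gradient and Hessian of the logistic loss L = log(1 + exp(-y f)),
        with labels y assumed to be +1 or -1 (see the table above)."""
        g = -y / (1.0 + np.exp(y * f))
        h = y ** 2 / (2.0 + np.exp(y * f) + np.exp(-y * f))
        return g, h

    def fit_newton_boosting(X, y, grad_hess, M=100, v=1.0):
        """Newton boosting: weighted least-squares fit to targets -g/h, weights h."""
        trees = []
        f = np.zeros(len(y))
        for m in range(M):
            g, h = grad_hess(f, y)
            r = -g / h                                       # Newton target values
            b = DecisionTreeRegressor(max_depth=2).fit(X, r, sample_weight=h)
            trees.append(b)
            f += v * b.predict(X)                            # f_m = f_{m-1} + v b_m
        return trees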
6 Shrinkage and Early Stopping

Now, we have not talked about regularization. What is suggested in practice is to regularize by early stopping. Namely, repeatedly boost, but hold out a validation set. Select the number of base learners M in the model by the number that leads to the minimum error on the validation set. (It would be best to select this M by cross validation, then re-fit using all the data, though this is often not done in practice to reduce running times.)

Notice that we have another parameter controlling complexity, however: the step size v. For a small step size, we will need more models M to achieve the same complexity. Experimentally, better results are almost always given by choosing a small step size v and a larger number of base learners M. The following experiments suggest that this is because small steps yield smoother models. Note also from the following experiments that Newton boosting with a given step size performs similarly to gradient boosting with a smaller step size.
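Before turning to those experiments, here is a minimal sketch (not from the notes) of the early-stopping rule just described, written for gradient boosting with a small step size; the loss and negative-gradient helpers and all names are illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_with_early_stopping(X_tr, y_tr, X_val, y_val, neg_grad, loss, M=500, v=0.1):
        """Boost for up to M iterations, then keep the number of base learners
        that minimizes the average loss on a held-out validation set."""
        trees = []
        f_tr, f_val = np.zeros(len(y_tr)), np.zeros(len(y_val))
        val_loss = []
        for m in range(M):
            b = DecisionTreeRegressor(max_depth=2).fit(X_tr, neg_grad(f_tr, y_tr))
            trees.append(b)
            f_tr += v * b.predict(X_tr)
            f_val += v * b.predict(X_val)
            val_loss.append(np.mean(loss(f_val, y_val)))
        best_M = int(np.argmin(val_loss)) + 1     # number of base learners to keep
        return trees[:best_M]

    # Example usage with the logistic loss (labels in {-1, +1}):
    def logistic_loss(f, y):
        return np.log1p(np.exp(-y * f))

    def logistic_neg_grad(f, y):
        return y / (1.0 + np.exp(y * f))

    # kept_trees = boost_with_early_stopping(X_tr, y_tr, X_val, y_val,
    #                                        logistic_neg_grad, logistic_loss)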
[Figure: Gradient boosting, least-squares loss, shown for several step sizes v.]
In the following figures, the left column shows f(x) after 100 iterations, while the middle column shows sign(f(x)).

[Figure: Newton boosting, logistic loss, shown for several step sizes v.]
[Figure: Gradient boosting, logistic loss, shown for several step sizes v.]

7 Discussion

Note that while our experiments here have used trees, any least-squares regressor could be used as the base learner. If so inclined, we could even use a boosted regression tree as our base learner.

Most of these loss functions can be extended to multi-class classification. This works by, in each iteration, fitting not a single tree, but C trees, where there are C classes. For Newton boosting, this is made tractable by taking a diagonal approximation of the Hessian.
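As an illustration of the multi-class idea (not from the original notes), here is a minimal sketch of multi-class gradient boosting under a softmax (multinomial logistic) loss, where each iteration fits one tree per class to that class's negative gradient. The choice of softmax loss and all names are assumptions made for the example.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_multiclass_boosting(X, y, C, M=100, v=0.1):
        """Multi-class gradient boosting under a softmax loss.
        y is assumed to hold integer labels in {0, ..., C-1}."""
        n = len(y)
        Y = np.eye(C)[y]                  # one-hot encoding of the labels, shape (n, C)
        F = np.zeros((n, C))              # current scores f_c(x) on the training data
        rounds = []
        for m in range(M):
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)           # softmax probabilities
            trees_m = []
            for c in range(C):
                r = Y[:, c] - P[:, c]                   # negative gradient for class c
                b = DecisionTreeRegressor(max_depth=2).fit(X, r)
                F[:, c] += v * b.predict(X)
                trees_m.append(b)
            rounds.append(trees_m)
        return rounds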
Our presentation here has been quite ahistorical. The first boosting algorithms were presented from a very different, and more theoretical, perspective.