Outline of the lecture
STK-IN4300 Statistical Learning Methods in Data Science
Riccardo De Bin
debin@math.uio.no

AdaBoost
- introduction
- algorithm

Statistical Boosting
- boosting as a forward stagewise additive modelling
- why exponential loss?
- gradient boosting

AdaBoost: introduction

L. Breiman: "[Boosting is] the best off-the-shelf classifier in the world."

- originally developed for classification, as a pure machine-learning black box;
- translated into the statistical world (Friedman et al., 2000);
- extended to virtually every statistical problem (Mayr et al., 2014):
  - regression;
  - survival analysis;
  - ...
- interpretable models, thanks to the statistical view;
- extended to work in high-dimensional settings (Bühlmann, 2006).

AdaBoost: introduction

Starting challenge: "Can [a committee of blockheads] somehow arrive at highly reasoned decisions, despite the weak judgement of the individual members?" (Schapire & Freund, 2014)

Goal: create a good classifier by combining several weak classifiers;
- in classification, a weak classifier is a classifier able to produce results only slightly better than a random guess.

Idea: apply a weak classifier repeatedly (iteratively) to modifications of the data;
- at each iteration, give more weight to the misclassified observations.
AdaBoost: algorithm

Consider a two-class classification problem, $y_i \in \{-1, 1\}$, $x_i \in \mathbb{R}^p$.

AdaBoost algorithm (a code sketch follows the toy example below):
1. initialize the weights, $w^{[0]} = (1/N, \dots, 1/N)$;
2. for $m$ from 1 to $m_{\text{stop}}$:
   (a) fit the weak estimator $G(\cdot)$ to the weighted data;
   (b) compute the weighted in-sample misclassification rate,
       $\text{err}^{[m]} = \sum_{i=1}^{N} w_i^{[m-1]} \mathbb{1}(y_i \neq \hat{G}^{[m]}(x_i))$;
   (c) compute the voting weight, $\alpha^{[m]} = \log\big((1 - \text{err}^{[m]})/\text{err}^{[m]}\big)$;
   (d) update the weights, $w_i = w_i^{[m-1]} \exp\{\alpha^{[m]} \mathbb{1}(y_i \neq \hat{G}^{[m]}(x_i))\}$,
       and renormalize, $w_i^{[m]} = w_i / \sum_{i=1}^{N} w_i$;
3. compute the final result, $\hat{G}_{\text{AdaBoost}}(x) = \text{sign}\big(\sum_{m=1}^{m_{\text{stop}}} \alpha^{[m]} \hat{G}^{[m]}(x)\big)$.

AdaBoost: toy example

First iteration: apply the classifier $G(\cdot)$ to the observations with weights

$w_i$: 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10

- observations 1, 2 and 3 are misclassified $\Rightarrow \text{err}^{[1]} = 0.3$;
- compute $\alpha^{[1]} = 0.5 \log\big((1 - \text{err}^{[1]})/\text{err}^{[1]}\big) \approx 0.42$ (the toy example uses the variant with the factor 0.5, in which mistakes are up-weighted by $e^{\alpha}$ and correctly classified observations are explicitly down-weighted by $e^{-\alpha}$);
- set $w_i = w_i \exp\{-\alpha^{[1]} y_i \hat{G}^{[1]}(x_i)\}$:

$w_i$: 0.15 0.15 0.15 0.07 0.07 0.07 0.07 0.07 0.07 0.07
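To make the algorithm concrete, here is a minimal sketch in Python. Taking a depth-one decision tree (a stump, via scikit-learn) as the weak estimator $G(\cdot)$ is an assumption; the slides leave $G(\cdot)$ generic.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, m_stop=50):
        """AdaBoost sketch; y must be coded in {-1, +1}, as on the slides."""
        n = len(y)
        w = np.full(n, 1.0 / n)                      # step 1: w^[0] = (1/N, ..., 1/N)
        learners, alphas = [], []
        for m in range(m_stop):
            G = DecisionTreeClassifier(max_depth=1)  # (a) weak estimator G(.)
            G.fit(X, y, sample_weight=w)
            miss = G.predict(X) != y                 # 1(y_i != G^[m](x_i))
            err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)  # (b) weighted error
            alpha = np.log((1 - err) / err)          # (c) voting weight
            w = w * np.exp(alpha * miss)             # (d) up-weight the mistakes
            w = w / w.sum()                          #     and renormalize
            learners.append(G)
            alphas.append(alpha)

        def predict(X_new):                          # step 3: weighted majority vote
            votes = sum(a * g.predict(X_new) for a, g in zip(alphas, learners))
            return np.sign(votes)

        return predict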
Second iteration: apply the classifier $G(\cdot)$ to the re-weighted observations ($w_i / \sum_i w_i$):

$w_i$: 0.17 0.17 0.17 0.07 0.07 0.07 0.07 0.07 0.07 0.07

- observations 6, 7 and 9 are misclassified $\Rightarrow \text{err}^{[2]} \approx 0.21$;
- compute $\alpha^{[2]} = 0.5 \log\big((1 - \text{err}^{[2]})/\text{err}^{[2]}\big) \approx 0.65$;
- set $w_i = w_i \exp\{-\alpha^{[2]} y_i \hat{G}^{[2]}(x_i)\}$:

$w_i$: 0.09 0.09 0.09 0.04 0.04 0.14 0.14 0.04 0.14 0.04

Third iteration: apply the classifier $G(\cdot)$ to the re-weighted observations ($w_i / \sum_i w_i$):

$w_i$: 0.11 0.11 0.11 0.05 0.05 0.17 0.17 0.05 0.17 0.05

- observations 4, 5 and 8 are misclassified $\Rightarrow \text{err}^{[3]} \approx 0.14$;
- compute $\alpha^{[3]} = 0.5 \log\big((1 - \text{err}^{[3]})/\text{err}^{[3]}\big) \approx 0.92$;
- set $w_i = w_i \exp\{-\alpha^{[3]} y_i \hat{G}^{[3]}(x_i)\}$:

$w_i$: 0.04 0.04 0.04 0.11 0.11 0.07 0.07 0.11 0.07 0.02
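The weight bookkeeping of the three iterations can be reproduced numerically. The index sets of the misclassified observations are read off the toy example above (the underlying data and classifier are not shown; only the error patterns matter here).

    import numpy as np

    w = np.full(10, 0.10)                               # initial weights
    for miss_idx in ([0, 1, 2], [5, 6, 8], [3, 4, 7]):  # iterations 1, 2, 3
        w = w / w.sum()                                 # re-weighted observations
        miss = np.zeros(10, dtype=bool)
        miss[miss_idx] = True
        err = w[miss].sum()                             # weighted error rate
        alpha = 0.5 * np.log((1 - err) / err)           # voting weight (0.5 factor)
        w = w * np.exp(np.where(miss, alpha, -alpha))   # up-/down-weighting
        print(round(err, 2), round(alpha, 2), np.round(w, 2))
    # prints err = 0.3, 0.21, 0.14 and alpha = 0.42, 0.65, 0.92, with the
    # weight vectors matching the tables above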
Statistical Boosting: boosting as forward stagewise additive modelling

The statistical view of boosting is based on the concept of forward stagewise additive modelling (a sketch in code follows below):
- minimize a loss function $L(y_i, f(x_i))$;
- using an additive model, $f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)$ (see notes);
  - $b(x; \gamma_m)$ is the basis function, or weak learner;
- at each step,
  $(\beta_m, \gamma_m) = \operatorname{argmin}_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$;
- the estimate is updated as $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$;
- e.g., in AdaBoost, $\beta_m = \alpha_m / 2$ and $b(x; \gamma_m) = G(x)$.
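A minimal sketch of forward stagewise additive modelling in Python, assuming squared-error loss and a regression stump as $b(x; \gamma)$ (the slides keep both the loss and the basis generic). With squared-error loss the joint argmin over $(\beta, \gamma)$ reduces to fitting the weak learner to the current residuals, with $\beta_m$ absorbed into the fitted leaf values.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def forward_stagewise(X, y, M=100):
        """Forward stagewise sketch with squared-error loss and stumps."""
        f = np.zeros(len(y))                        # start from f_0(x) = 0
        basis = []
        for m in range(M):
            b = DecisionTreeRegressor(max_depth=1)  # b(x; gamma_m), a stump
            b.fit(X, y - f)                         # fit the current residuals
            f = f + b.predict(X)                    # f_m = f_{m-1} + beta_m b
            basis.append(b)
        return basis, f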
Statistical Boosting: why exponential loss?

The statistical view of boosting:
- allows one to interpret the results;
- by studying the properties of the exponential loss.

It is easy to show that
$$f^*(x) = \operatorname{argmin}_{f(x)} E_{Y|X=x}\big[e^{-Y f(x)}\big] = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)},$$
i.e.
$$\Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2 f^*(x)}};$$
therefore AdaBoost estimates one half of the log-odds of $\Pr(Y = 1 \mid x)$ (a numerical check of this population minimizer is sketched below).

Note:
- the exponential loss is not the only possible loss function;
- deviance (cross-entropy): the binomial negative log-likelihood, with
  $$\ell(\pi_x) = y' \log(\pi_x) + (1 - y') \log(1 - \pi_x),$$
  where:
  - $y' = (y + 1)/2$, i.e., $y' \in \{0, 1\}$;
  - $\pi_x = \Pr(Y = 1 \mid X = x) = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}} = \frac{1}{1 + e^{-2 f(x)}}$;
  - equivalently, $-\ell(\pi_x) = \log\big(1 + e^{-2 y f(x)}\big)$;
- $E[-\ell(\pi_x)]$ and $E[e^{-Y f(x)}]$ have the same population minimizer.

Statistical Boosting: gradient boosting

We saw that AdaBoost iteratively minimizes a loss function. In general, consider:
- $L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))$;
- $\hat{f} = \operatorname{argmin}_f L(f)$;
- the minimization problem can be solved by considering
  $$f_{m_{\text{stop}}} = \sum_{m=0}^{m_{\text{stop}}} h_m,$$
  where:
  - $h_0 = f_0$ is the initial guess;
  - each $f_m$ improves on the previous $f_{m-1}$ through $h_m$;
  - $h_m$ is called the step.
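A quick numerical check of the population minimizer above (scipy is assumed): for a fixed $p = \Pr(Y = 1 \mid x)$, the conditional risk is $E[e^{-Y f}] = p\, e^{-f} + (1 - p)\, e^{f}$, and its minimizer should coincide with the half log-odds.

    import numpy as np
    from scipy.optimize import minimize_scalar

    for p in (0.2, 0.5, 0.9):
        risk = lambda f: p * np.exp(-f) + (1 - p) * np.exp(f)
        f_star = minimize_scalar(risk).x            # numerical argmin over f
        half_log_odds = 0.5 * np.log(p / (1 - p))   # theoretical minimizer
        print(p, round(f_star, 4), round(half_log_odds, 4))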
Statistical Boosting: steepest descent

The steepest descent method chooses $h_m = -\rho_m g_m$, where:
- $g_m \in \mathbb{R}^N$ is the gradient of $L(f)$ evaluated at $f = f_{m-1}$; it gives the direction for the minimization,
  $$g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)};$$
- $\rho_m$ is a scalar that tells how far to move along that direction,
  $$\rho_m = \operatorname{argmin}_\rho L(f_{m-1} - \rho\, g_m).$$

Statistical Boosting: example

Consider the linear Gaussian regression case:
- $L(y, f(X)) = \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2$;
- $f(X) = X^T \beta$;
- initial guess: $\beta = 0$.

Therefore:
- $g = \frac{\partial}{\partial f(x_i)} \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2 = -(y - X^T \beta)$;
- $g_m = -(y - X^T \beta)\big|_{\beta = 0} = -y$;
- $\rho_m = \operatorname{argmin}_\rho \frac{1}{2} (y - \rho y)^2$, so the first step already returns $X(X^T X)^{-1} X^T y$, the full least-squares fit.

Note: overfitting!

Statistical Boosting: shrinkage

To regularize the procedure, a shrinkage factor $\nu$, $0 < \nu < 1$, is introduced:
$$f_m(x) = f_{m-1}(x) + \nu h_m.$$

Moreover, $h_m$ can be a general weak learner (base learner):
- stump;
- spline;
- ...

The idea is to fit the base learner to the negative gradient, to iteratively minimize the loss function.

Statistical Boosting: gradient boosting

Gradient boosting algorithm (a sketch in code follows below):
1. initialize the estimate, e.g., $f_0(x) = 0$;
2. for $m = 1, \dots, m_{\text{stop}}$:
   2.1 compute the negative gradient vector,
       $u_m = -\left[\frac{\partial L(y, f(x))}{\partial f(x)}\right]_{f(x) = \hat{f}_{m-1}(x)}$;
   2.2 fit the base learner to the negative gradient vector, $h_m(u_m, x)$;
   2.3 update the estimate, $f_m(x) = f_{m-1}(x) + \nu h_m(u_m, x)$;
3. final estimate, $\hat{f}_{m_{\text{stop}}}(x) = \sum_{m=1}^{m_{\text{stop}}} \nu h_m(u_m, x)$.

Note:
- $u_m = -g_m$;
- $\hat{f}_{m_{\text{stop}}}(x)$ is a GAM.
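A sketch of the gradient boosting algorithm above, assuming squared-error loss (so the negative gradient $u_m$ is simply the residual vector $y - \hat{f}_{m-1}(x)$) and a stump base learner; the algorithm itself is agnostic to both choices.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boosting(X, y, m_stop=100, nu=0.1):
        """Gradient boosting sketch with squared-error loss and stumps."""
        f = np.zeros(len(y))                        # 1. initialize f_0(x) = 0
        steps = []
        for m in range(m_stop):
            u = y - f                               # 2.1 negative gradient vector
            h = DecisionTreeRegressor(max_depth=1)  # 2.2 fit base learner to u
            h.fit(X, u)
            f = f + nu * h.predict(X)               # 2.3 shrunken update
            steps.append(h)
        return steps, f                             # 3. f is the sum of nu * h_m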
Statistical Boosting: example

Consider again the linear Gaussian regression case:
- $L(y, f(X, \beta)) = \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$, with $f(x_i, \beta) = x_i^T \beta$;
- base learner: $h(y, X) = X(X^T X)^{-1} X^T y$.

Therefore:
- initialize the estimate, e.g., $\hat{f}_0(X, \beta) = 0$;
- $m = 1$:
  - $u_1 = -\left[\frac{\partial L(y, f(X,\beta))}{\partial f(X,\beta)}\right]_{f(X,\beta) = \hat{f}_0(X,\beta)} = (y - 0) = y$;
  - $h_1(u_1, X) = X(X^T X)^{-1} X^T y$;
  - $\hat{f}_1(X) = 0 + \nu X(X^T X)^{-1} X^T y$;
- $m = 2$:
  - $u_2 = -\left[\frac{\partial L(y, f(X,\beta))}{\partial f(X,\beta)}\right]_{f(X,\beta) = \hat{f}_1(X,\beta)} = \big(y - X^T(\nu \hat{\beta})\big)$;
  - $h_2(u_2, X) = X(X^T X)^{-1} X^T \big(y - X^T(\nu \hat{\beta})\big)$;
  - update the estimate,
    $\hat{f}_2(X, \beta) = \nu X(X^T X)^{-1} X^T y + \nu X(X^T X)^{-1} X^T \big(y - X^T(\nu \hat{\beta})\big)$.

Statistical Boosting: remarks

Note that:
- we do not need linear effects: $h(y, X)$ can be, e.g., a spline;
- using $f(X, \beta) = X^T \beta$, it makes more sense to work directly with $\beta$:
  1. initialize the estimate, e.g., $\hat{\beta}_0 = 0$;
  2. for $m = 1, \dots, m_{\text{stop}}$:
     2.1 compute the negative gradient vector,
         $u_m = -\left[\frac{\partial L(y, f(X,\beta))}{\partial f(X,\beta)}\right]_{\beta = \hat{\beta}_{m-1}}$;
     2.2 fit the base learner to the negative gradient vector,
         $b_m(u_m, X) = (X^T X)^{-1} X^T u_m$;
     2.3 update the estimate, $\hat{\beta}_m = \hat{\beta}_{m-1} + \nu b_m(u_m, X)$;
  3. final estimate, $\hat{f}_{m_{\text{stop}}}(X) = X^T \hat{\beta}_{m_{\text{stop}}}$.

Statistical Boosting: remarks

Further remarks:
- for $m_{\text{stop}} \to \infty$, $\hat{\beta}_{m_{\text{stop}}} \to \hat{\beta}_{\text{OLS}}$;
- the shrinkage is controlled by both $m_{\text{stop}}$ and $\nu$:
  - usually $\nu$ is fixed, e.g., $\nu = 0.1$;
  - $m_{\text{stop}}$ is computed by cross-validation;
- $m_{\text{stop}}$ controls the model complexity, so an early stop is needed to avoid overfitting:
  - if it is too small, too much bias;
  - if it is too large, too much variance;
- the predictors must be centred, $E[X_j] = 0$.

References

Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34, 559–583.

Friedman, J., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28, 337–407.

Mayr, A., Binder, H., Gefeller, O. & Schmid, M. (2014). Extending statistical boosting. Methods of Information in Medicine 53, 428–435.

Schapire, R. E. & Freund, Y. (2014). Boosting: Foundations and Algorithms. MIT Press, Cambridge.