Boosting Jiahui Shen October 27th, 2017 1 / 44

Target of Boosting Figure: Weak learners Figure: Combined learner 2 / 44

Boosting introduction and notation Boosting: combines weak learners into a strong one iteratively. Weak learners (base learners, dictionary): error rate is only slightly better than random guessing. Notation: samples x and responses y; weak learners f (or T for trees); combined (strong) learner F; iteration k; loss function L; weights for the learners α, β; ⟨a, b⟩ denotes the inner product between a and b. 3 / 44

Example: Boosting based on trees Each tree is a weak learner (T). Boosting combines different trees (T_k) into a strong learner F = Σ_k α_k T_k through linear combination. Figure: Combined learner can be regarded as a linear combination of trees 4 / 44

Discrete AdaBoost Initialize with the same weight on each sample. The base learner with the lowest weighted error is selected. Add this base learner with a coefficient based on its accuracy. After adding the selected base learner, the weights of misclassified samples are increased. Repeat the above three steps. Figure: Discrete AdaBoost (each stump is a simple weak learner) 5 / 44
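A minimal sketch of Discrete AdaBoost with decision stumps as the weak learners, assuming labels in {-1, +1} and scikit-learn stumps; the function names are illustrative, not from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_rounds=50):
    """Minimal Discrete AdaBoost sketch; y must be in {-1, +1}."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                     # equal initial weight on each sample
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # base learner with lowest weighted error
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # coefficient based on accuracy
        w *= np.exp(-alpha * y * pred)          # increase weights of misclassified samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    F = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(F)
```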

Discrete AdaBoost Another interpretation: AdaBoost fits an additive model F(x) = Σ_{k=1}^K α_k f(x; w_k) by minimizing a loss function L(y, F(x)), where α_k, k = 1, 2, ..., K are the coefficients and f(x; w_k) is the k-th weak learner characterized by the parameter w_k. In each iteration, solve (α_k, w_k) = argmin_{α,w} Σ_{i=1}^N L(y_i, F_{k-1}(x_i) + α f(x_i; w)) and update the model as F_k(x) = F_{k-1}(x) + α_k f_k(x; w_k). By fitting an additive model of different simple functions, it expands the class of functions that can be approximated. 6 / 44

Characteristics of Boosting Start with an empty model and gradually add new weak learners. Select (or build) base learners in a greedy way. Add the selected base learner with a coefficient (or coefficients). 7 / 44

Greedy algorithms Boosting algorithms are all greedy algorithms. Greedy: find a local optimum in each iteration; the result may not be optimal from a global view. Figure: Achieving the maximum number: greedy algorithm result vs. globally optimal result 8 / 44

AdaBoost variants Figure: A (possibly incomplete) timeline of AdaBoost variants 9 / 44

Key ingredients in Boosting Choice of the loss function: exponential loss, logistic loss, savage loss. Base learner selection: neural networks, wavelets, trees. Selection criterion: the criterion for choosing a new classifier. Iterative format: how a new estimator is defined from the selected classifiers. Termination rule: when to terminate the learning process. 10 / 44

Choice of the loss function Convex loss functions (lead to unbounded loss): Exponential loss: L(y, F(x)) = e^{-yF(x)}. Quadratic loss: L(y, F(x)) = (y - F(x))^2. Logistic loss: L(y, F(x)) = log(1 + e^{-yF(x)}). Non-convex loss function (to deal with noise): Savage loss: L(y, F(x)) = 1 / (1 + e^{2yF(x)})^2. 11 / 44

Base learner selection Single hidden layer neural network: f(x; w) = 1 / (1 + e^{-w^T x}), where w parameterizes a linear combination of the input variables. Tree: f_k(x) = Σ_{j=1}^{J_k} b_j I(x ∈ R_j), where J_k is the number of leaves of decision tree f_k at iteration k and b_j is the value predicted in the region R_j. Figure: Decision tree Some other base learners: wavelets, radial basis functions, etc. 12 / 44

Selection criterion In the k-th iteration, select the weak learner that has the largest inner product with the residual r = y - F_{k-1}. Generalize the idea of residual to gradient: select the weak learner with the largest inner product with r = -[∂L(y, F(x)) / ∂F(x)]_{F(x)=F_{k-1}(x)} (if using quadratic loss, this criterion is the same as the first one). Similar to the above two criteria, but choose multiple weak learners at a time. Select an arbitrary base learner whose inner product with the residual is larger than a predefined threshold σ (being less greedy) (proposed by Xu et al., 2017). Some other criteria in a greedy sense. 13 / 44
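A small sketch of the first, second, and fourth criteria above, assuming the weak learners' outputs on the training samples are collected as the columns of a matrix; the helper names are hypothetical:

```python
import numpy as np

def select_greedy(residual, dictionary):
    """Return the index of the weak learner (column) with the largest
    absolute inner product with the residual / negative gradient."""
    scores = np.abs(dictionary.T @ residual)
    return int(np.argmax(scores))

def select_above_threshold(residual, dictionary, sigma):
    """Less greedy variant: any learner whose inner product exceeds sigma qualifies."""
    scores = np.abs(dictionary.T @ residual)
    return np.flatnonzero(scores >= sigma)
```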

Iterative format Add the weak learner with a fixed step-length factor ν. Add the weak learner with a fixed infinitesimal coefficient ɛ. Commonly used iterative schemes in the high-dimensional regression setting (many more weak learners than samples): pure greedy algorithm (PGA), orthogonal greedy algorithm (OGA), relaxed greedy algorithm (RGA). Design backward steps: delete useless weak learners added in previous iterations. 14 / 44

Iterative format (PGA, OGA, RGA) All three algorithms find the weak learner f_k that has the largest inner product with the residual r = y - F_{k-1}. Pure greedy algorithm: add the weak learner with weight ⟨r, f_k⟩ / ‖f_k‖^2. Orthogonal greedy algorithm: do a fully corrective step each time; add the weak learner with the calculated weight and modify the weights of all previous weak learners (project y onto the span of {f_1, ..., f_k}). Relaxed greedy algorithm: add the weak learner with some weight and rescale the previous weights by the same amount, i.e. F_k = α_k F_{k-1} + β_k f_k, where (α_k, β_k, f_k) = argmin_{α,β,f} ‖y - αF_{k-1} - βf‖. PGA does not work well (it even has a lower bound on its approximation error); OGA and RGA achieve similar performance; OGA is computationally more expensive. 15 / 44
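A sketch of the three schemes on a fixed dictionary D whose columns are the weak learners' outputs, assuming quadratic loss; `greedy_regression` and its arguments are illustrative names:

```python
import numpy as np

def greedy_regression(y, D, n_iter=50, variant="oga"):
    """PGA / OGA / RGA sketch. D: (n, p) matrix of weak-learner outputs."""
    n, p = D.shape
    F, coef, support = np.zeros(n), np.zeros(p), []
    for _ in range(n_iter):
        r = y - F
        k = int(np.argmax(np.abs(D.T @ r)))          # largest inner product with residual
        if variant == "pga":                          # pure greedy (matching pursuit)
            step = (D[:, k] @ r) / (D[:, k] @ D[:, k])
            coef[k] += step
            F += step * D[:, k]
        elif variant == "oga":                        # fully corrective step
            if k not in support:
                support.append(k)
            Ds = D[:, support]
            beta, *_ = np.linalg.lstsq(Ds, y, rcond=None)   # project y on span{f_1,...,f_k}
            coef[:] = 0.0
            coef[support] = beta
            F = Ds @ beta
        else:                                         # relaxed greedy
            A = np.column_stack([F, D[:, k]])         # rescale old model, add new learner
            (alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
            coef *= alpha
            coef[k] += beta
            F = alpha * F + beta * D[:, k]
    return coef
```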

Representative Boosting algorithms with more details Boosting for classification: AdaBoost, LogitBoost, SavageBoost Boosting for regression: L2 Boosting, Incremental forward stagewise regression (FS ɛ ) A more general Boosting algorithm: Gradient (Tree) Boosting 16 / 44

LogitBoost Less sensitive to outliers compared to AdaBoost. Choice of the loss function: logistic loss. Selection criterion: regard the strong learner as the critical point F(x) that minimizes the loss function log(1 + exp(-yF(x))); based on Newton's method, fit a weak learner f_k to -H^{-1} s, where s(x) = [∂L(y, F(x) + f(x)) / ∂f(x)]_{f(x)=0} and H(x) = [∂^2 L(y, F(x) + f(x)) / ∂f(x)^2]_{f(x)=0}, so that F_k = F_{k-1} + f_k approximates the Newton update. Iterative format: add the weak learner directly (F_k = F_{k-1} + f_k). 17 / 44
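A sketch of the two-class LogitBoost of Friedman, Hastie & Tibshirani (2000), which realizes this Newton step by a weighted least-squares fit of a regression tree to the working response; it assumes labels in {0, 1} and scikit-learn trees, and the clipping constants are standard numerical safeguards rather than part of the algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y01, n_rounds=50, max_depth=2):
    """Two-class LogitBoost sketch; y01 must be in {0, 1}."""
    n = X.shape[0]
    F = np.zeros(n)
    p = np.full(n, 0.5)
    learners = []
    for _ in range(n_rounds):
        w = np.clip(p * (1 - p), 1e-6, None)   # Hessian-based weights H(x)
        z = np.clip((y01 - p) / w, -4, 4)      # working response, approx. -H^{-1} s
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, z, sample_weight=w)        # weighted least-squares fit of f_k
        F += 0.5 * tree.predict(X)             # F_k = F_{k-1} + (1/2) f_k
        p = 1.0 / (1.0 + np.exp(-2.0 * F))
        learners.append(tree)
    return learners
```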

SavageBoost Handles outliers by using a non-convex loss function. Boosting with the savage loss was proposed by Masnadi-Shirazi, H., & Vasconcelos, N. (2009). Figure: Experimental results on the Liver-Disorder set (a binary UCI data set) 18 / 44

L2 Boosting In the regression setting, y can take continuous values. Choice of the loss function: quadratic loss. Selection criterion: fit a weak learner f_k to the residual r = y - F_{k-1}(x) in each iteration. Iterative format: add the weak learner directly with a fixed step-length factor ν (F_k = F_{k-1} + νf_k). A common choice for ν is 1 or a smaller value (e.g., 0.3). 19 / 44
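A sketch of L2 Boosting with small regression trees as the weak learners, assuming scikit-learn; names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost(X, y, n_rounds=100, nu=0.3, max_depth=2):
    """L2 Boosting sketch: fit each weak learner to the current residual
    and add it with a fixed step-length factor nu."""
    F = np.zeros(len(y))
    learners = []
    for _ in range(n_rounds):
        r = y - F                              # residual under quadratic loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)
        F += nu * tree.predict(X)              # F_k = F_{k-1} + nu * f_k
        learners.append(tree)
    return learners, F
```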

Incremental forward stagewise regression (FS_ε) Recall: Boosting fits an additive model F(x) = Σ_{k=1}^K α_k T_k(x) by minimizing the loss function L(y, F(x)). When there are many more weak learners than samples, finding an optimal model can be considered a high-dimensional problem. Incremental forward stagewise regression approximates the effect of the LASSO with a greedy algorithm. Choice of the loss function: quadratic loss. Selection criterion: select the base learner that best fits the residuals. Iterative format: change the coefficient by an infinitesimal amount. 20 / 44

Incremental forward stagewise regression (FS_ε) Initialize all coefficients α_k to zero. Calculate the residual r = y - Σ_{k=1}^K α_k T_k(x). Select the base learner that best fits the current residual: (α̃_k, k) = argmin_{α,k} ‖r - α T_k(x)‖^2. Update α_k = α_k + ɛ · sign(α̃_k), and repeat. 21 / 44
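A sketch of FS_ε, assuming the base learners' outputs are collected as the standardized columns of a matrix T; names are illustrative:

```python
import numpy as np

def forward_stagewise(y, T, eps=0.01, n_iter=5000):
    """FS_eps sketch. T: (n, K) matrix whose columns are the base learners' outputs
    (assumed standardized, so the best one-dimensional fit is the largest correlation)."""
    alpha = np.zeros(T.shape[1])               # all coefficients start at zero
    for _ in range(n_iter):
        r = y - T @ alpha                      # current residual
        corr = T.T @ r
        k = int(np.argmax(np.abs(corr)))       # base learner that best fits the residual
        alpha[k] += eps * np.sign(corr[k])     # move its coefficient by a tiny amount
    return alpha
```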

Comparison between FS_ε and LASSO LASSO solves the problem from an optimization view: min_{α ∈ R^p} (1/N) ‖y - Tα‖_2^2 + λ‖α‖_1, but due to the very large number of base learners T_k, directly solving a LASSO problem is not feasible. FS_ε solves the problem from a greedy view. With K < ∞ iterations, many of the coefficients remain zero, while the others tend to have absolute values smaller than their corresponding least-squares solution values. Therefore this K-iteration solution qualitatively resembles the LASSO, with K inversely related to λ. 22 / 44

Comparison between FS_ε and LASSO FS_ε compared with LASSO: Figure: Solution paths 23 / 44

Comparison between PGA, L2 Boosting, and FS_ε All three algorithms find the weak learner f_k that has the largest inner product with the residual r = y - F_{k-1}. The pure greedy algorithm (PGA) adds f_k with weight ⟨r, f_k⟩ / ‖f_k‖^2. L2 Boosting adds f_k with weight ν, which is normally set to 1 or a smaller value (e.g., 0.3). FS_ε adds f_k with an infinitesimal weight ɛ. PGA can be too greedy, and L2 Boosting also cannot achieve good performance; for both algorithms, the weights from previous iterations are never modified in later iterations. FS_ε changes weights by an extremely small amount (ɛ) each time, which has an L_1 regularization effect and gives a result similar to the LASSO. 24 / 44

Gradient (Tree) Boosting Choice of the loss function: can be applied to any differentiable loss function in general. Selection criterion: fit a base learner f(x) to the negative gradient r = -[∂L(y, F(x)) / ∂F(x)]_{F(x)=F_{k-1}(x)}. Calculate the weight in front of the base learner (line search): γ_k = argmin_γ Σ_{i=1}^n L(y_i, F_{k-1}(x_i) + γ f_k(x_i)). Iterative format: F_k = F_{k-1} + γ_k f_k. In the case of using decision trees as weak learners, choose a separate optimal value γ_{jk} for each of the tree's regions R_j, instead of a single γ_k for the whole tree. Demonstration 25 / 44
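A sketch of generic gradient boosting with a one-dimensional line search, assuming scikit-learn trees and SciPy's scalar minimizer; the loss is passed in as two callables, and the quadratic loss at the end is only an example:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, loss_grad, n_rounds=100, max_depth=2):
    """Gradient boosting sketch. loss(y, F) -> scalar, loss_grad(y, F) -> dL/dF."""
    F = np.zeros(len(y))
    model = []
    for _ in range(n_rounds):
        r = -loss_grad(y, F)                           # negative gradient (pseudo-residual)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)
        f = tree.predict(X)
        gamma = minimize_scalar(lambda g: loss(y, F + g * f)).x   # line search for gamma_k
        F += gamma * f                                 # F_k = F_{k-1} + gamma_k f_k
        model.append((gamma, tree))
    return model, F

def sq_loss(y, F):
    return np.sum((y - F) ** 2)

def sq_grad(y, F):
    return -2.0 * (y - F)
```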

Applications of Boosting algorithms Regression and classification problems: Gradient Boosting decision tree (GBDT), XGBoost, LightGBM. Compressive sensing problems: compressive sampling matching pursuit (CoSaMP). Efficient sparse learning and feature selection methods: forward-backward greedy algorithm (FoBa), gradient forward-backward greedy algorithm (FoBa-gdt). 26 / 44

Gradient Boosting decision tree (GBDT) Gradient Boosting based on decision trees, used to solve classification and regression problems. If weak learners are not given, build them in the process of learning. Given a tree structure, determine the quality of the tree based on a score function. Exact greedy algorithm: find the best split by calculating the scores over all possible splits (based on the current structure). 27 / 44
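As one concrete example of such a score, a sketch of exact greedy split search on a single feature using the regularized gain of Chen & Guestrin (2016); the function name, and the use of per-sample gradients g and Hessians h, follow that paper rather than the slide itself:

```python
import numpy as np

def best_split_exact_greedy(x_feature, g, h, lam=1.0, gamma=0.0):
    """Scan all splits on one feature; g, h are per-sample gradients and Hessians."""
    order = np.argsort(x_feature)
    gs, hs = g[order], h[order]
    G, H = gs.sum(), hs.sum()
    GL = HL = 0.0
    best_gain, best_thresh = 0.0, None
    for i in range(len(gs) - 1):
        GL += gs[i]
        HL += hs[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma      # regularized split gain
        if gain > best_gain:
            best_gain = gain
            best_thresh = 0.5 * (x_feature[order[i]] + x_feature[order[i + 1]])
    return best_gain, best_thresh
```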

Gradient Tree Boosting (XGBoost) A very popular package for applying gradient tree boosting. Based on GBDT; finds splits within a subset of all possible splits in an appropriate way. Handles sparsity patterns in a unified way. Figure: Tree structure with default directions 28 / 44

Gradient Tree Boosting (LightGBM) A newer package that runs even faster than XGBoost in some situations. Similar to XGBoost, but uses histograms to find splits when building trees. Grows trees leaf-wise, unlike XGBoost (level-wise). Comparison between XGBoost and LightGBM 29 / 44
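A hedged usage sketch, assuming the xgboost and lightgbm packages are installed; both expose a scikit-learn-style estimator interface, and the hyperparameter values below are arbitrary:

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=1000)

xgb_model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb_model.fit(X, y)

lgb_model = lgb.LGBMRegressor(n_estimators=200, num_leaves=31, learning_rate=0.1)
lgb_model.fit(X, y)

print(xgb_model.predict(X[:5]))
print(lgb_model.predict(X[:5]))
```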

Some remarks on Gradient Boosting Randomization: re-sample features and samples each time to build more effective weak learners. A regularization term is often added to penalize the complexity of the trees. 30 / 44

Compressive sampling matching pursuit (CoSaMP) A commonly used greedy algorithm for the compressive sensing problem. A more advanced algorithm compared to PGA, OGA, RGA. Choice of the loss function: quadratic loss. Selection criterion: choose multiple weak learners at a time, those with the largest inner products with the residual. Iterative format: same as OGA (projection). 31 / 44

Compressive sampling matching pursuit (CoSaMP) Initialize the support set Ω to be empty. Calculate the residual r = y - F(x) and g_k = ⟨r, f_k(x)⟩ for each weak learner f_k. Find the largest 2s components of g = (g_1, ..., g_K) and include the corresponding 2s learners f_k in Ω. Let F_k be the orthogonal projection of y onto the span of f_Ω. Select the s base learners with the largest coefficients in F_k to calculate the new residual; repeat. Output the s base learners with the largest coefficients from the last iteration. 32 / 44
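A sketch of CoSaMP on a finite dictionary D, assuming quadratic loss; the function name and the fixed iteration count are illustrative (the original algorithm stops via a halting criterion on the residual):

```python
import numpy as np

def cosamp(D, y, s, n_iter=20):
    """CoSaMP sketch (Needell & Tropp, 2009). D: (n, p) dictionary, s: target sparsity."""
    n, p = D.shape
    a = np.zeros(p)                                  # current s-sparse approximation
    r = y.copy()
    for _ in range(n_iter):
        g = D.T @ r                                  # inner products with the residual
        omega = np.argsort(np.abs(g))[-2 * s:]       # 2s largest components
        T = np.union1d(omega, np.flatnonzero(a))     # merge with the current support
        sol, *_ = np.linalg.lstsq(D[:, T], y, rcond=None)   # project y onto span{f_T}
        b = np.zeros(p)
        b[T] = sol
        keep = np.argsort(np.abs(b))[-s:]            # prune to the s largest entries
        a = np.zeros(p)
        a[keep] = b[keep]
        r = y - D @ a                                # new residual
    return a
```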

Restricted Isometry Property The restricted isometry property (RIP) characterizes matrices that are nearly orthonormal, at least when operating on sparse vectors. Let T be an n × p matrix and let 1 ≤ s ≤ p be an integer. Suppose that there exists a constant δ_s ∈ (0, 1) such that, for every n × s submatrix T_s of T and for every s-sparse vector α, (1 - δ_s)‖α‖_2^2 ≤ ‖T_s α‖_2^2 ≤ (1 + δ_s)‖α‖_2^2. Then the matrix T is said to satisfy the s-restricted isometry property with restricted isometry constant δ_s. Intuition: any small number (on the order of the desired sparsity level) of features are not highly correlated. Suitable random matrices (e.g., with independent Gaussian entries) satisfy the RIP with high probability. 33 / 44
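A small numerical illustration (not a certificate: the RIP requires the bound for every support, not just sampled ones) that a scaled Gaussian matrix acts as a near-isometry on random sparse vectors; all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 1000, 10
T = rng.normal(size=(n, p)) / np.sqrt(n)        # scaled Gaussian sensing matrix

ratios = []
for _ in range(1000):
    alpha = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)  # random s-sparse support
    alpha[idx] = rng.normal(size=s)
    ratios.append(np.sum((T @ alpha) ** 2) / np.sum(alpha ** 2))

print(min(ratios), max(ratios))                 # both close to 1 for RIP-like behavior
```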

Some theoretical analysis of CoSaMP Suppose the sampling matrix T has restricted isometry constant δ_s ≤ C. Let y = Tα + ɛ be the measurement of an arbitrary signal with noise ɛ. CoSaMP produces a 2s-sparse approximation β that satisfies ‖α - β‖ ≤ C max{η, (1/√s)‖α - α_s‖_1 + ‖ɛ‖}, where α_s is the best s-sparse approximation to α. CoSaMP provides rigorous bounds on the runtime and can deal with contaminated samples. 34 / 44

Forward algorithm A forward algorithm is not prone to overfitting. However, a forward algorithm can never correct mistakes made in earlier steps. Figure: Failure of the forward greedy algorithm 35 / 44

Forward-backward greedy algorithm A backward algorithm starts with a full model and greedily removes features. However, a backward algorithm needs to start with a sparse/non-overfitted model. Forward-backward greedy algorithm: an efficient sparse learning and feature selection method that combines forward and backward steps, proposed by Zhang, T. (2011). Choice of the loss function: quadratic loss. Selection criterion: the weak learner satisfies f_k = argmin_f min_α L(y, F_k + αf). Iterative format: delete useless weak learners from the combined learner during the learning process. 36 / 44

Forward-backward greedy algorithm

Algorithm 1 FoBa
1. Initialize Ω_0 = ∅, F_0 = 0, k = 0, ν = 0.5
2. While (TRUE)
     i = argmin_i min_α L(F_k + α f_i)
     Ω_{k+1} = {i} ∪ Ω_k
     F_{k+1} = argmin_F L(F | Ω_{k+1})
     σ_k = L(F_k) - L(F_{k+1})
     if σ_k < η then BREAK
     k = k + 1
     While (TRUE)
       j = argmin_{j ∈ Ω_k} L(F_k - f_j)
       d^- = L(F_k - f_j) - L(F_k)
       d^+ = σ_{k-1}
       if d^- > ν d^+ then BREAK
       k = k - 1
       Ω_k = Ω_{k+1} \ {j}
       F_k = argmin_F L(F | Ω_k)
     end
   end
37 / 44
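A sketch of FoBa with quadratic loss, following the structure above; the gain bookkeeping is simplified (the backward test compares against the most recent forward gain), so treat it as illustrative rather than a faithful reimplementation of Zhang (2011):

```python
import numpy as np

def foba(D, y, eta, nu=0.5, max_iter=100):
    """FoBa sketch with quadratic loss. D: (n, p) matrix of weak-learner outputs."""
    support, gains = [], []                        # selected features and forward gains
    coef = np.zeros(D.shape[1])

    def refit(S):
        beta = np.zeros(D.shape[1])
        if S:
            sol, *_ = np.linalg.lstsq(D[:, S], y, rcond=None)
            beta[S] = sol
        return beta

    def loss(beta):
        return np.sum((y - D @ beta) ** 2)

    for _ in range(max_iter):
        # forward step: add the feature with the largest one-dimensional loss decrease
        r = y - D @ coef
        scores = (D.T @ r) ** 2 / np.sum(D ** 2, axis=0)
        scores[support] = -np.inf
        i = int(np.argmax(scores))
        new_support = support + [i]
        new_coef = refit(new_support)
        sigma = loss(coef) - loss(new_coef)
        if sigma < eta:                            # stop when the forward gain is too small
            break
        support, coef = new_support, new_coef
        gains.append(sigma)

        # backward steps: delete features whose removal barely increases the loss
        while len(support) > 1:
            losses = [loss(refit([t for t in support if t != j])) for j in support]
            jstar = int(np.argmin(losses))
            if losses[jstar] - loss(coef) > nu * gains[-1]:
                break                              # deletion would hurt too much
            del support[jstar]
            gains.pop()
            coef = refit(support)
    return coef, support
```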

Some theoretical analysis A sufficient condition for feature selection consistency of OGA (and also of the LASSO) is the irrepresentable condition (an L_∞-type condition). For FoBa, feature selection consistency is guaranteed under the restricted isometry property (an L_2-type condition). If the data matrix is a Gaussian random matrix, the L_∞-type condition requires the sample size n to be of the order O(s^2 log n), while the L_2-type condition requires only O(s log n), to achieve a consistent solution. FoBa terminates after at most 2L(0)/η forward iterations. 38 / 44

Simulation results on FoBa Artificial data experiment: p = 500, n = 100, noise ɛ = 0.1, moderately correlated design matrix. Exactly sparse weights with s = 5, nonzero weights uniform in [0, 10]. 50 random runs, results for the top five features.

                       FoBa            Forward-greedy    LASSO
Least squares          0.093 ± 0.02    0.16 ± 0.089      0.25 ± 0.14
Parameter estimation   0.057 ± 0.2     0.52 ± 0.82       1.1 ± 1
Feature selection      0.76 ± 0.98     1.8 ± 1.1         3.2 ± 0.77

Table: Error ranges 39 / 44

Gradient forward-backward greedy algorithm (FoBa-gdt) FoBa-gdt is a more general FoBa, proposed by Liu, J., Ye, J., & Fujimaki, R. (2014). FoBa-gdt: change the measure of goodness of a feature from L(F_k) - min_α L(F_k + αf_j) to a form based on the gradient ∇L(F_k). Choice of the loss function: any differentiable loss function. FoBa directly evaluates a feature by the decrease in the objective function (computationally expensive, since it requires solving a large number of one-dimensional optimization problems), but FoBa-gdt does not suffer from this problem. 40 / 44

Some remarks and other research directions Boosting is not just about selecting learners, but more about how to build an appropriate model in a greedy way. The loss function can even change over iterations; RobustBoost, proposed by Freund, Y. (2009), implements this idea. Boosting induces bias when repeatedly using the same training data for sampling; Dorogush, Anna Veronika, et al. (2017) discuss this issue and propose dynamic Boosting. Boosting can be applied in deep learning by combining shallow networks into a complex one. Boosting can also be helpful for functional matrix factorization. Selecting an appropriate momentum term (learning rate) is another research area. 41 / 44

Reference
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2), 337-407.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, pp. 241-249). New York: Springer Series in Statistics.
Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification (pp. 149-171). Springer New York.
Mannor, S., Meir, R., & Zhang, T. (2003). Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4(Oct), 713-742.
Buehlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 559-583.
Barron, A. R., Cohen, A., Dahmen, W., & DeVore, R. A. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics, 64-94.
Needell, D., & Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3), 301-321.
Zhang, T. (2009). Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems (pp. 1921-1928).
42 / 44

Reference
Freund, Y. (2009). A more robust boosting algorithm. arXiv preprint arXiv:0905.2138.
Ferreira, A. J., & Figueiredo, M. A. (2012). Boosting algorithms: A review of methods, theory, and applications. In Ensemble Machine Learning (pp. 35-85). Springer US.
Freund, R. M., Grigas, P., & Mazumder, R. (2013). AdaBoost and forward stagewise regression are first-order convex optimization methods. arXiv preprint arXiv:1307.1192.
Liu, J., Ye, J., & Fujimaki, R. (2014, January). Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In International Conference on Machine Learning (pp. 503-511).
Cortes, C., Mohri, M., & Syed, U. (2014). Deep boosting. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1179-1187).
Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). ACM.
Miao, Q., Cao, Y., Xia, G., Gong, M., Liu, J., & Song, J. (2016). RBoost: Label noise-robust boosting algorithm based on a nonconvex loss function and the numerically stable base learners. IEEE Transactions on Neural Networks and Learning Systems, 27(11), 2216-2228.
Yuan, X., Li, P., & Zhang, T. (2016). Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems (pp. 3558-3566).
43 / 44

Reference
Sancetta, A. (2016). Greedy algorithms for prediction. Bernoulli, 22(2), 1227-1277.
Locatello, F., Khanna, R., Tschannen, M., & Jaggi, M. (2017). A unified optimization view on generalized matching pursuit and Frank-Wolfe. arXiv preprint arXiv:1702.06457.
Huang, F., Ash, J., Langford, J., & Schapire, R. (2017). Learning deep ResNet blocks sequentially using boosting theory. arXiv preprint arXiv:1706.04964.
Xu, L., Lin, S., Zeng, J., Liu, X., Fang, Y., & Xu, Z. (2017). Greedy criterion in orthogonal greedy learning. IEEE Transactions on Cybernetics.
Dorogush, A. V., Gulin, A., Gusev, G., Kazeev, N., Prokhorenkova, L. O., & Vorobev, A. (2017). Fighting biases with dynamic boosting. arXiv preprint arXiv:1706.09516.
44 / 44