Statistical Machine Learning Notes 9
Instructor: Justin Domke

Boosting

1 Additive Models

The basic idea of boosting is to greedily build additive models. Let b_m(x) be some predictor (a tree, a neural network, whatever), which we will call a base learner. In boosting, we build a model that is the sum of base learners,

    f(x) = Σ_{m=1}^{M} b_m(x).

The obvious way to fit an additive model is greedily. Namely, we start with the simple function f_0(x) = 0, then iteratively add base learners to minimize the risk of f_{m−1}(x) + b(x).

Forward stagewise additive modeling
    f_0(x) ← 0
    For m = 1, ..., M
        b_m ← arg min_b Σ_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
        f_m(x) ← f_{m−1}(x) + v b_m(x)

Notice here that once we have fit a particular base learner, it is frozen in and is not further changed. One can also create procedures for additive modeling that re-fit the base learners several times.

2 Additive Regression Trees

It is quite easy to do forward stagewise additive modeling when we are interested in minimizing the least-squares regression loss.

    arg min_b Σ_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
        = arg min_b Σ_{(x̂,ŷ)} (f_{m−1}(x̂) + b(x̂) − ŷ)²
        = arg min_b Σ_{(x̂,ŷ)} (b(x̂) − r(x̂, ŷ))²,                                  (2.1)

where r(x̂, ŷ) = ŷ − f_{m−1}(x̂). However, minimizing a problem like Eq. 2.1 is exactly what all our previous regression methods do. Thus, we can apply any method for least-squares regression essentially unchanged. The only difference is that the target values are given by ŷ − f_{m−1}(x̂), the difference between ŷ and the additive model so far.

Here are some examples applying this to two datasets. Here (as in all the examples in these notes) our base learner is a regression tree with two levels of splits.
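Before turning to the figures, here is a minimal sketch of this residual-fitting loop in Python. It assumes numpy and scikit-learn, with DecisionTreeRegressor(max_depth=2) standing in for the two-level base learner; the function names are illustrative, not part of the notes.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_trees(X, y, M=50, v=1.0, max_depth=2):
    """Fit M regression trees whose scaled sum approximates y."""
    f = np.zeros(len(y))              # f_0(x) = 0
    trees = []
    for _ in range(M):
        r = y - f                     # residual targets r = y - f_{m-1}(x)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += v * tree.predict(X)      # f_m = f_{m-1} + v * b_m
        trees.append(tree)
    return trees

def predict_additive(trees, X, v=1.0):
    # v must match the value used during fitting
    return v * sum(tree.predict(X) for tree in trees)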

[Figure: additive regression trees on the first dataset; panels show the data and the fitted function f(x) after increasing numbers of boosting iterations.]

[Figure: additive regression trees on the second dataset; panels show the data and the fitted function f(x) after increasing numbers of boosting iterations.]

3 Boosting

Suppose we want to minimize some other loss function. Unfortunately, if we aren't using the least-squares loss, the optimization

    arg min_b Σ_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)                                  (3.1)

will not reduce to a least-squares problem. One alternative would be to use a different base learner that fits a different loss. With boosting, we don't want to change our base learner. This could be because it is simply inconvenient to create a new algorithm for a different loss function. The general idea of boosting is to fit a new base learner that, if it does not minimize the risk for f_m, at least decreases it. As we will see, we can do this for a large range of loss functions, all based on a least-squares base learner.

4 Gradient Boosting

Gradient boosting is a very clever algorithm. In each iteration, we set as target values the negative gradient of the loss with respect to f. So, if the base learner closely matches the target values, then when we add some multiple v of the base learner to our additive model, it should decrease the loss.

Gradient Boosting
    f_0(x) ← 0
    For m = 1, ..., M
        For all (x̂, ŷ): r(x̂, ŷ) ← − d/df L(f_{m−1}(x̂), ŷ)
        b_m ← arg min_b Σ_{(x̂,ŷ)} (b(x̂) − r(x̂, ŷ))²
        f_m(x) ← f_{m−1}(x) + v b_m(x)

Consider changing f(x) to f(x) + v b(x) for some small v. We want to minimize the loss, i.e.

    L(f_{m−1}(x̂) + b(x̂), ŷ) ≈ L(f_{m−1}(x̂), ŷ) + (dL(f_{m−1}(x̂), ŷ)/df) b(x̂).    (4.1)

However, it doesn't make much sense to simply minimize the right-hand side of Eq. 4.1, since we can always make the decrease bigger by multiplying b by a larger constant. Instead, we choose to minimize

    Σ_{(x̂,ŷ)} [ (dL(f_{m−1}(x̂), ŷ)/df) b(x̂) + ½ b(x̂)² ],

which prevents b from becoming too large. It is easy to see that

    arg min_b Σ_{(x̂,ŷ)} [ (dL(f_{m−1}(x̂), ŷ)/df) b(x̂) + ½ b(x̂)² ]
        = arg min_b Σ_{(x̂,ŷ)} ( b(x̂) + dL(f_{m−1}(x̂), ŷ)/df )²
        = arg min_b Σ_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )².

Some example loss functions and their derivatives:

    least-squares:            L = (f − ŷ)²           dL/df = 2(f − ŷ)
    least-absolute-deviation: L = |f − ŷ|            dL/df = sign(f − ŷ)
    logistic:                 L = log(1 + exp(−ŷf))  dL/df = −ŷ / (1 + exp(fŷ))
    exponential:              L = exp(−ŷf)           dL/df = −ŷ exp(−ŷf)

Fun fact: gradient boosting with a step size of v = 1/2 on the least-squares loss gives exactly the forward stagewise modeling algorithm we devised above.

Here are some examples of gradient boosting on classification losses (logistic and exponential). The figures show the function f(x). As usual, we would get a predicted class by taking the sign of f(x).

[Figure: Gradient Boosting, logistic loss, step size v = 1; panels show the data and f(x) after increasing numbers of iterations.]

[Figure: Gradient Boosting, exponential loss, step size v = 1; panels show the data and f(x) after increasing numbers of iterations.]
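A minimal sketch of the gradient boosting loop above, assuming ±1 labels and the logistic loss from the list of derivatives; scikit-learn trees again stand in for the least-squares base learner, and the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logistic_negative_gradient(f, y):
    # -dL/df for L = log(1 + exp(-y f))
    return y / (1.0 + np.exp(y * f))

def gradient_boost(X, y, neg_grad=logistic_negative_gradient,
                   M=100, v=0.1, max_depth=2):
    f = np.zeros(len(y))                       # f_0(x) = 0
    trees = []
    for _ in range(M):
        r = neg_grad(f, y)                     # targets = negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += v * tree.predict(X)               # f_m = f_{m-1} + v * b_m
        trees.append(tree)
    return trees

def predict_class(trees, X, v=0.1):
    # predicted class is the sign of f(x); v must match the fitting value
    return np.sign(v * sum(tree.predict(X) for tree in trees))

Switching to another differentiable loss only requires swapping in a different neg_grad function; the tree-fitting step is unchanged.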

5 Newton Boosting

The basic idea of Newton boosting is, instead of just taking the gradient, to build a local quadratic model and then minimize it. We can again do this using any least-squares regressor as a base learner. (Note that "Newton boosting" is not generally accepted terminology. The general idea of fitting to a quadratic approximation is used in several algorithms for specific loss functions (e.g. LogitBoost), but doesn't seem to have been given a name. The class voted to use "Newton Boosting" in preference to "Quadratic Boosting".)

Newton Boosting
    f_0(x) ← 0
    For m = 1, ..., M
        For all (x̂, ŷ):
            h(x̂, ŷ) ← d²/df² L(f_{m−1}(x̂), ŷ)
            g(x̂, ŷ) ← d/df L(f_{m−1}(x̂), ŷ)
            r(x̂, ŷ) ← − g(x̂, ŷ) / h(x̂, ŷ)
        b_m ← arg min_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
        f_m(x) ← f_{m−1}(x) + v b_m(x)

To understand this, consider searching for the b that minimizes the empirical risk for f_m. To start with, we make a local second-order approximation. (Here, g(x̂, ŷ) = dL(f_{m−1}(x̂), ŷ)/df and h(x̂, ŷ) = d²L(f_{m−1}(x̂), ŷ)/df².)

    arg min_b Σ_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
        ≈ arg min_b Σ_{(x̂,ŷ)} [ L(f_{m−1}(x̂), ŷ) + b(x̂) g(x̂, ŷ) + ½ b(x̂)² h(x̂, ŷ) ]
        = arg min_b ½ Σ_{(x̂,ŷ)} h(x̂, ŷ) [ (2 g(x̂, ŷ)/h(x̂, ŷ)) b(x̂) + b(x̂)² ]
        = arg min_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) + g(x̂, ŷ)/h(x̂, ŷ) )²
        = arg min_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²

This is a weighted least-squares regression, with weights h(x̂, ŷ) and targets −g/h. Here we collect in a table all of the loss functions and derivatives we might use.

Loss Functions

    Name                      L(f, ŷ)              dL/df                 d²L/df²
    least-squares             (f − ŷ)²             2(f − ŷ)              2
    least-absolute-deviation  |f − ŷ|              sign(f − ŷ)           0
    logistic                  log(1 + exp(−ŷf))    −ŷ / (1 + exp(fŷ))    ŷ² / (2 + exp(fŷ) + exp(−fŷ))
    exponential               exp(−ŷf)             −ŷ exp(−ŷf)           ŷ² exp(−fŷ)

Here are some experiments using Newton boosting on the same datasets as above. Generally, we see that Newton boosting is more effective than gradient boosting in quickly reducing the training error. However, this also means it begins to overfit after a smaller number of iterations. Thus, if running time is an issue (either at training or test time), Newton boosting may be preferable, since it more quickly reduces training error.

[Figure: Newton Boosting, logistic loss, step size v = 1; panels show the data and f(x) after increasing numbers of iterations.]

[Figure: Newton Boosting, logistic loss, step size v = 1, on the second dataset; panels show the data and f(x) after increasing numbers of iterations.]
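Since the Newton update is just a weighted least-squares fit, the loop changes only slightly from gradient boosting. Here is a minimal sketch for the logistic loss, again assuming ±1 labels and scikit-learn trees; the small eps guarding against a vanishing Hessian is an added safeguard, not something from the notes.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logistic_g_h(f, y):
    p = 1.0 / (1.0 + np.exp(-y * f))
    g = -y * (1.0 - p)        # dL/df   = -y / (1 + exp(y f))
    h = p * (1.0 - p)         # d2L/df2 = 1 / (2 + exp(y f) + exp(-y f))
    return g, h

def newton_boost(X, y, M=100, v=1.0, max_depth=2, eps=1e-12):
    f = np.zeros(len(y))
    trees = []
    for _ in range(M):
        g, h = logistic_g_h(f, y)
        r = -g / (h + eps)                        # Newton targets -g/h
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r, sample_weight=h)           # weighted least squares
        f += v * tree.predict(X)
        trees.append(tree)
    return trees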

6 Shrinkage and Early Stopping

So far, we have not talked about regularization. What is suggested in practice is to regularize by early stopping. Namely, repeatedly boost, but hold out a validation set. Select the number of base learners M in the model as the number that leads to the minimum error on the validation set. (It would be best to select this M by cross-validation and then re-fit on all of the data, though this is often not done in practice to reduce running times.)

Notice that we have another parameter controlling complexity, however: the step size v. For a small step size, we will need more models M to achieve the same complexity. Experimentally, better results are almost always given by choosing a small step size v and a larger number of base learners M. The following experiments suggest that this is a product of the fact that it yields smoother models. Note also from the following experiments that Newton boosting with a given step size performs similarly to gradient boosting with a smaller step size.
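Before the experiments, here is a minimal sketch of early stopping as described above for least-squares boosting: boost on a training split, track error on a held-out validation split, and keep the number of base learners with the lowest validation error. The squared-error criterion and the function names are illustrative choices.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              M_max=500, v=0.1, max_depth=2):
    f_tr = np.zeros(len(y_tr))
    f_val = np.zeros(len(y_val))
    trees, val_errors = [], []
    for _ in range(M_max):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X_tr, y_tr - f_tr)
        f_tr += v * tree.predict(X_tr)
        f_val += v * tree.predict(X_val)
        trees.append(tree)
        val_errors.append(np.mean((y_val - f_val) ** 2))
    M_best = int(np.argmin(val_errors)) + 1   # number of base learners to keep
    return trees[:M_best], val_errors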

[Figure: gradient boosting, least-squares loss, with three different step sizes v; each panel shows the data and the fitted function.]

In the following figures, the left column shows f(x) after 100 iterations, while the middle column shows sign(f(x)).

[Figure: Newton boosting, logistic loss, with three different step sizes v.]

[Figure: gradient boosting, logistic loss, with three different step sizes v.]

7 Discussion

Note that while our experiments here have used trees, any least-squares regressor could be used as the base learner. If so inclined, we could even use a boosted regression tree as our base learner.

Most of these loss functions can be extended to multi-class classification. This works by, in each iteration, fitting not a single tree but C trees, where there are C classes. For Newton boosting, this is made tractable by taking a diagonal approximation of the Hessian.

Our presentation here has been quite ahistorical. The first boosting algorithms were presented from a very different, and more theoretical, perspective.