Gradient Boosting (Continued)


David Rosenberg (New York University)
DS-GA 1003, April 4, 2016

Boosting Fits an Additive Model

Adaptive Basis Function Model

AdaBoost produces a classification score function of the form

    \sum_{m=1}^{M} \alpha_m G_m(x),

where each G_m is a base classifier. The G_m's are like basis functions, but they are learned from the data. Let's move beyond classification models...

Adaptive Basis Function Model

Base hypothesis space: H. An adaptive basis function expansion over H is

    f(x) = \sum_{m=1}^{M} \nu_m h_m(x),

where each h_m \in H is chosen in a learning process ("adaptive") and the \nu_m \in R are expansion coefficients.
Note: we are taking a linear combination of the outputs of the h_m(x), so functions in H must produce values in R (or a vector space).

How to Fit an Adaptive Basis Function Model?

Loss function: \ell(y, \hat{y}).
Base hypothesis space: H, consisting of real-valued functions.
The combined hypothesis space is then

    F_M = { f(x) = \sum_{m=1}^{M} \nu_m h_m(x) : \nu_m \in R, h_m \in H }.

Find f \in F_M that minimizes some objective function J(f), e.g. penalized ERM:

    \hat{f} = \arg\min_{f \in F_M} [ (1/n) \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \Omega(f) ].

To fit this, we'll proceed in stages, adding a new h_m in every stage.

Forward Stagewise Additive Modeling (FSAM)

Start with f_0 \equiv 0. After m-1 stages, we have

    f_{m-1} = \sum_{i=1}^{m-1} \nu_i h_i,

where h_1, ..., h_{m-1} \in H. We want to find a step direction h_m \in H and a step size \nu_m > 0 so that

    f_m = f_{m-1} + \nu_m h_m

improves the objective function value by as much as possible.

Forward Stagewise Additive Modeling for ERM

1. Initialize f_0(x) = 0.
2. For m = 1 to M:
   (a) Compute
       (\nu_m, h_m) = \arg\min_{\nu \in R, h \in H} \sum_{i=1}^{n} \ell(y_i, f_{m-1}(x_i) + \nu h(x_i)),
       where \nu h(x_i) is the "new piece".
   (b) Set f_m = f_{m-1} + \nu_m h_m.
3. Return f_M.
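To make the recipe concrete, here is a minimal sketch of FSAM for square loss, assuming an illustrative finite base hypothesis space of decision stumps on a one-dimensional feature. The threshold grid, function names, and the closed-form step size are my own choices, not part of the slides.

```python
import numpy as np

def stump(threshold, direction):
    """h(x) = +1 if direction * (x - threshold) > 0 else -1."""
    return lambda x: np.where(direction * (x - threshold) > 0, 1.0, -1.0)

def fsam_square_loss(x, y, thresholds, M=50):
    """FSAM with square loss; H = stumps over a fixed threshold grid."""
    ensemble = []                       # list of (nu_m, h_m)
    f = np.zeros_like(y)                # current predictions f_{m-1}(x_i)
    H = [stump(t, d) for t in thresholds for d in (+1, -1)]
    for _ in range(M):
        best = None
        for h in H:
            hx = h(x)
            # For square loss and a fixed h, the optimal step size is
            # nu = <y - f, h(x)> / <h(x), h(x)>.
            nu = np.dot(y - f, hx) / np.dot(hx, hx)
            loss = np.sum((y - (f + nu * hx)) ** 2)
            if best is None or loss < best[0]:
                best = (loss, nu, h)
        _, nu_m, h_m = best
        ensemble.append((nu_m, h_m))
        f = f + nu_m * h_m(x)           # f_m = f_{m-1} + nu_m * h_m
    return ensemble

def predict(ensemble, x):
    return sum(nu * h(x) for nu, h in ensemble)

# Toy usage: fit a noisy sine curve on [0, 5].
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)
model = fsam_square_loss(x, y, thresholds=np.linspace(0, 5, 20), M=50)
print(np.mean((predict(model, x) - y) ** 2))    # training MSE
```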

Exponential Loss and AdaBoost

Take the loss function to be \ell(y, f(x)) = \exp(-y f(x)), and let H be a base hypothesis space of classifiers h : X -> {-1, 1}. Then Forward Stagewise Additive Modeling (FSAM) reduces to AdaBoost! Proof on Homework #6 (and see HTF Section 10.4).
Only difference: AdaBoost takes whichever G_m the base learner returns from H, with no guarantee that it is the best in H. FSAM explicitly requires the best in H:

    G_m = \arg\min_{G \in H} \sum_{i=1}^{N} w_i^{(m)} 1(y_i \neq G(x_i)).

Gradient Boosting / Anyboost

FSAM Looks Like Iterative Optimization

The FSAM step is

    (\nu_m, h_m) = \arg\min_{\nu \in R, h \in H} \sum_{i=1}^{n} \ell(y_i, f_{m-1}(x_i) + \nu h(x_i)).

Hard part: finding the best step direction h. What if we looked for the locally best step direction, like in gradient descent?

Functional Gradient Descent

We want to minimize

    J(f) = \sum_{i=1}^{n} \ell(y_i, f(x_i)).

This only depends on f at the n training points. Define

    \mathbf{f} = (f(x_1), ..., f(x_n))^T

and write the objective function as

    J(\mathbf{f}) = \sum_{i=1}^{n} \ell(y_i, f_i).

Functional Gradient Descent: Unconstrained Step Direction

Consider gradient descent on

    J(\mathbf{f}) = \sum_{i=1}^{n} \ell(y_i, f_i).

The negative gradient step direction at \mathbf{f} is

    -g = -\nabla_{\mathbf{f}} J(\mathbf{f}),

which we can easily calculate. Eventually we will need more than just \mathbf{f}, which only gives predictions on the training points; we'll need a full function f : X -> R.
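As a small illustration (my own, not from the slides), here is the negative gradient -g evaluated at the training points for two common losses; for square loss it is just the residual vector.

```python
import numpy as np

def neg_gradient_square_loss(y, f):
    # l(y, f) = 0.5 * (y - f)^2  =>  -dl/df = y - f  (the residuals)
    return y - f

def neg_gradient_logistic_loss(y, f):
    # y in {-1, +1}, l(y, f) = log(1 + exp(-y*f))  =>  -dl/df = y / (1 + exp(y*f))
    return y / (1.0 + np.exp(y * f))

y = np.array([1.0, -1.0, 1.0])
f = np.zeros(3)                               # predictions of the current model f_{m-1}
print(neg_gradient_square_loss(y, f))         # [ 1. -1.  1.]
print(neg_gradient_logistic_loss(y, f))       # [ 0.5 -0.5  0.5]
```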

Unconstrained Functional Gradient Stepping

(Figure from Seni and Elder's Ensemble Methods in Data Mining, Fig. B.1.) R(\mathbf{f}) is the empirical risk, where \mathbf{f} = (f(x_1), f(x_2)) are the predictions on the training set. Issue: \hat{f}_M is only defined at the training points.

Functional Gradient Descent: Projection Step

The unconstrained step direction is -g = -\nabla_{\mathbf{f}} J(\mathbf{f}). Suppose H is our base hypothesis space (of real-valued functions). Find the h \in H that is closest to -g at the training points, in the \ell^2 sense:

    \min_{h \in H} \sum_{i=1}^{n} (-g_i - h(x_i))^2.

This is a least squares regression problem. The h that best approximates -g is our step direction.
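When H is a space of small regression trees, this projection step amounts to a single call to a tree learner with -g as the regression target. A brief sketch using sklearn (my choice of library; the arrays below are stand-ins):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 2)            # training inputs x_1, ..., x_n
neg_gradient = np.random.randn(100)   # stand-in for -g evaluated at the training points

# Least squares fit of h in H (depth-limited regression trees) to -g.
h_m = DecisionTreeRegressor(max_depth=3).fit(X, neg_gradient)
step_direction = h_m.predict(X)       # h_m(x_i): the projected step direction
```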

Projected Functional Gradient Stepping

(Figure from Seni and Elder's Ensemble Methods in Data Mining, Fig. B.2.) T(x; p) \in H is our actual step direction, like the projection of -g = -\nabla R(\mathbf{f}) onto H.

Functional Gradient Descent: Step Size

Finally, we choose a step size.

Option 1 (line search; not sure this is actually used in practice):

    \nu_m = \arg\min_{\nu > 0} \sum_{i=1}^{n} \ell(y_i, f_{m-1}(x_i) + \nu h_m(x_i)).

Option 2 (shrinkage parameter): we consider \nu = 1 to be the full gradient step. Choose a fixed \nu \in (0, 1), called a shrinkage parameter. A value of \nu = 0.1 is typical; optimize it as a hyperparameter.

The Gradient Boosting Machine: Ingredients (Recap)

- Take any [sub]differentiable loss function.
- Choose a base hypothesis space for regression.
- Choose the number of steps (or a stopping criterion).
- Choose a step size methodology.
Then you're good to go! (A minimal from-scratch sketch follows below.)
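Putting the ingredients together, here is a minimal sketch of a gradient boosting machine for square loss, with depth-limited sklearn regression trees as the base hypothesis space and a fixed shrinkage step size. The class and parameter names are my own, illustrative choices, not a canonical implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBM:
    """Gradient boosting sketch for square loss l(y, f) = 0.5 * (y - f)^2."""

    def __init__(self, n_steps=100, shrinkage=0.1, max_depth=3):
        self.n_steps = n_steps        # number of boosting stages M
        self.shrinkage = shrinkage    # fixed step size nu in (0, 1]
        self.max_depth = max_depth    # controls the base hypothesis space H
        self.trees = []

    def fit(self, X, y):
        f = np.zeros(len(y))                        # f_0 = 0, as in the slides
        for _ in range(self.n_steps):
            neg_grad = y - f                        # -g for square loss (residuals)
            h = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, neg_grad)
            f = f + self.shrinkage * h.predict(X)   # f_m = f_{m-1} + nu * h_m
            self.trees.append(h)
        return self

    def predict(self, X):
        return self.shrinkage * sum(t.predict(X) for t in self.trees)

# Toy usage on a noisy 1-d regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
gbm = SimpleGBM(n_steps=200, shrinkage=0.1, max_depth=2).fit(X, y)
print(np.mean((gbm.predict(X) - y) ** 2))           # training MSE
```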

Gradient Tree Boosting

Gradient Tree Boosting

A common form of the gradient boosting machine takes

    H = {regression trees of size J},

where J is the number of terminal nodes. J = 2 gives decision stumps. HTF recommends 4 <= J <= 8 (but some recent results use much larger trees).

Software packages (a short usage sketch follows below):
- Gradient tree boosting is implemented by the gbm package for R.
- It is GradientBoostingClassifier and GradientBoostingRegressor in sklearn.
- xgboost and lightgbm are state of the art for speed and performance.
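A minimal usage sketch with the sklearn estimator named above; the parameter values are illustrative, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators is the number of boosting stages M, learning_rate is the shrinkage
# step size nu, and max_depth limits tree size (related to, but not identical to,
# the number of terminal nodes J).
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # R^2 on held-out data
```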

GBM Regression with Stumps

Sinc Function: Our Dataset

(Figure from Natekin and Knoll's "Gradient boosting machines, a tutorial.")

Minimizing Square Loss with an Ensemble of Decision Stumps

Decision stumps with 1, 10, 50, and 100 steps, step size \lambda = 1. (Figure 3 from Natekin and Knoll's "Gradient boosting machines, a tutorial.")
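A sketch of how one might run an experiment in this spirit; the dataset construction and settings are my own approximation of the figure, not Natekin and Knoll's exact setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy sinc data, roughly in the spirit of the tutorial's example.
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(500, 1))
y = np.sinc(X[:, 0] / np.pi) + 0.1 * rng.standard_normal(500)

# Stumps (max_depth=1), full step size (learning_rate=1.0), varying number of steps.
# The default loss for GradientBoostingRegressor is squared error.
for n_steps in (1, 10, 50, 100):
    model = GradientBoostingRegressor(
        n_estimators=n_steps, learning_rate=1.0, max_depth=1
    ).fit(X, y)
    print(n_steps, np.mean((model.predict(X) - y) ** 2))  # training MSE, typically shrinking
```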

Step Size as Regularization

Performance vs. rounds of boosting and step size (left: training set; right: validation set). (Figure 5 from Natekin and Knoll's "Gradient boosting machines, a tutorial.")

Rule of Thumb

The smaller the step size, the more steps you'll need. But a smaller step size never seems to make results worse, and often makes them better. So make your step size as small as you have patience for.

Variations on Gradient Boosting

Stochastic Gradient Boosting

For each stage, choose a random subset of the data for computing the projected gradient step. Typically about 50% of the dataset size; it can be much smaller for a large training set. This fraction is called the bag fraction.
Why do this?
- The subsample percentage is an additional regularization parameter; it may help with overfitting.
- It's faster.
We can view this as a minibatch method: we're estimating the true step direction (the projected gradient) using a subset of the data. Introduced by Friedman (1999) in "Stochastic Gradient Boosting."
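In sklearn's gradient boosting implementation, the bag fraction is exposed as the subsample parameter; a brief sketch (the value 0.5 simply mirrors the 50% figure above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)

# subsample < 1.0 turns ordinary gradient boosting into stochastic gradient
# boosting: each stage fits its tree on a random 50% of the training rows.
sgb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                max_depth=3, subsample=0.5).fit(X, y)
```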

Some Comments on Bag Size

Justification for 50% of the dataset size: in bagging, sampling 50% without replacement gives very similar results to a full bootstrap sample (see Buja and Stuetzle's "Observations on Bagging"). So if we're subsampling because we're inspired by bagging, this choice makes sense. But if we think of stochastic gradient boosting as a minibatch method, then it makes little sense to choose the batch size as a fixed percentage of the dataset size, especially for large datasets.

Bag as Minibatch

Just as we argued for minibatch SGD, the sample size needed for a good estimate of the step direction is independent of the training set size. The minibatch size should instead depend on
- the complexity of the base hypothesis space, and
- the complexity of the target function (the Bayes decision function).
Seems like an interesting area for both practical and theoretical pursuit.

Column / Feature Subsampling for Regularization

Similar to random forests, randomly choose a subset of features for each round. The XGBoost paper says: "According to user feedback, using column sub-sampling prevents overfitting even more so than the traditional row sub-sampling." Zhao Xing (a top Kaggle competitor) finds the optimal percentage to be 20%-100%.
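In xgboost's scikit-learn interface, per-tree column subsampling is controlled by colsample_bytree (and row subsampling by subsample). A brief sketch, assuming the xgboost package is installed; the 0.5 values are arbitrary.

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=50, noise=5.0, random_state=0)

# colsample_bytree: fraction of features sampled for each tree (column subsampling);
# subsample: fraction of rows sampled for each tree (the bag fraction).
model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                         colsample_bytree=0.5, subsample=0.5).fit(X, y)
```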

Newton Step Direction

For GBM, we find the h \in H closest to the negative gradient -g = -\nabla_{\mathbf{f}} J(\mathbf{f}). This is a first-order method.
Newton's method is a second-order method:
- Find a second-order (quadratic) approximation to J at \mathbf{f}.
- This requires computing the gradient and Hessian of J.
- The Newton step direction points towards the minimizer of the quadratic, and the minimizer of the quadratic is easy to find in closed form.
Boosting methods with a projected Newton step direction:
- LogitBoost (logistic loss function)
- XGBoost (any loss; uses regression trees for the base learner)
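As a small illustration of the second-order ingredients (my own sketch, not from the slides): for the logistic loss \ell(y, f) = \log(1 + e^{-yf}) with y \in {-1, +1}, the per-example gradient and (diagonal) Hessian entries are easy to compute, and the Newton step for a constant base function has a closed form.

```python
import numpy as np

def logistic_grad_hess(y, f):
    """First and second derivatives of l(y, f) = log(1 + exp(-y*f)) w.r.t. the score f."""
    s = 1.0 / (1.0 + np.exp(y * f))   # sigma(-y*f)
    grad = -y * s                     # dl/df
    hess = s * (1.0 - s)              # d^2 l / df^2, always positive
    return grad, hess

y = np.array([1.0, -1.0, 1.0])
f = np.zeros(3)                       # current scores f_{m-1}(x_i)
g, h = logistic_grad_hess(y, f)
# Newton step for a constant base function h(x) = c: minimize the quadratic
# approximation sum_i [g_i*c + 0.5*h_i*c^2], giving c = -sum(g)/sum(h).
print(g, h, -g.sum() / h.sum())
```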

XGBoost

XGBoost adds an explicit penalty on tree complexity to the empirical risk:

    \Omega(h) = \gamma T + (1/2) \lambda \sum_{j=1}^{T} w_j^2,

where h \in H is a regression tree from our base hypothesis space, T is the number of leaf nodes, and w_j is the prediction in the j-th leaf.
Objective function at step m:

    J(h) = (2nd-order approximation to the empirical risk of h) + \Omega(h).

In XGBoost, this objective is also used to decide on tree splits. See the XGBoost Introduction for a nice introduction.
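In the xgboost interface, \gamma and \lambda above map to the gamma and reg_lambda parameters; a brief sketch with arbitrary values (assumes the xgboost package is installed):

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)

# gamma      ~ the per-leaf penalty gamma*T: a split must reduce the objective by
#              at least gamma to be kept;
# reg_lambda ~ the L2 penalty (lambda/2) * sum_j w_j^2 on the leaf predictions.
model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                         gamma=1.0, reg_lambda=1.0).fit(X, y)
```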