Gradient Boosting, Continued

David Rosenberg (New York University)
DS-GA 1003
December 26, 2016

Review: Gradient Boosting

The Gradient Boosting Machine

1. Initialize f_0(x) = 0.
2. For m = 1, 2, ... (until a stopping condition is met):
   1. Compute the unconstrained gradient:
      $$g_m = \left(\frac{\partial}{\partial f(x_i)} \sum_{j=1}^{n} \ell\big(y_j, f(x_j)\big)\right)_{i=1,\dots,n} \Bigg|_{f(x_i) = f_{m-1}(x_i)}$$
   2. Fit a regression model to -g_m:
      $$h_m = \arg\min_{h \in \mathcal{F}} \sum_{i=1}^{n} \big((-g_m)_i - h(x_i)\big)^2$$
   3. Choose a fixed step size \nu_m = \nu \in (0,1] (\nu = 0.1 is typical), or take
      $$\nu_m = \arg\min_{\nu > 0} \sum_{i=1}^{n} \ell\big(y_i, f_{m-1}(x_i) + \nu h_m(x_i)\big)$$
   4. Take the step:
      $$f_m(x) = f_{m-1}(x) + \nu_m h_m(x)$$
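To make the loop above concrete, here is a minimal sketch (not from the slides) assuming squared loss, so the negative gradient -g_m is just the residual vector, with scikit-learn decision stumps as the base hypothesis space and a fixed step size ν:

```python
# A minimal sketch of the gradient boosting machine above, assuming:
#   - squared loss l(y, f) = (y - f)^2 / 2, so -g_m is the residual y - f_{m-1}(x)
#   - decision stumps (depth-1 trees) from scikit-learn as the base hypothesis space F
#   - a fixed step size nu rather than a line search
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_steps=100, nu=0.1):
    """Return the list of fitted stumps h_1, ..., h_M (with f_0 = 0)."""
    f = np.zeros_like(y, dtype=float)    # f_0(x_i) = 0 on the training points
    stumps = []
    for m in range(n_steps):
        neg_grad = y - f                 # -g_m for squared loss (the residuals)
        h = DecisionTreeRegressor(max_depth=1).fit(X, neg_grad)  # project -g_m onto F
        f += nu * h.predict(X)           # f_m = f_{m-1} + nu * h_m on the training points
        stumps.append(h)
    return stumps

def gbm_predict(stumps, X, nu=0.1):
    return nu * sum(h.predict(X) for h in stumps)
```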

Unconstrained Functional Gradient Stepping
Here R(f) is the empirical risk.
Issue: the unconstrained iterate \hat{f}_M is only defined at the training points.
[Figure from Seni and Elder's Ensemble Methods in Data Mining, Fig. B.1.]

Projected Functional Gradient Stepping
T(x; p) ∈ F is our actual step direction: the projection of -g = -∇R onto F.
[Figure from Seni and Elder's Ensemble Methods in Data Mining, Fig. B.2.]

The Gradient Boosting Machine: Recap
- Take any [sub]differentiable loss function.
- Choose a base hypothesis space for regression.
- Choose the number of steps (or a stopping criterion).
- Choose a step size methodology.
Then you're good to go!
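To illustrate the "any [sub]differentiable loss" point, here is a hedged variation of the earlier sketch: for absolute loss ℓ(y, f) = |y - f|, the negative subgradient at each training point is sign(y_i - f(x_i)), so only the line computing the negative gradient changes.

```python
# Swapping the loss in the earlier sketch: for absolute loss l(y, f) = |y - f|,
# the negative subgradient at each training point is sign(y_i - f(x_i)).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit_l1(X, y, n_steps=100, nu=0.1):
    f = np.zeros_like(y, dtype=float)
    stumps = []
    for m in range(n_steps):
        neg_grad = np.sign(y - f)        # only this line differs from the squared-loss version
        h = DecisionTreeRegressor(max_depth=1).fit(X, neg_grad)
        f += nu * h.predict(X)
        stumps.append(h)
    return stumps
```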

Gradient Tree Boosting

Gradient Tree Boosting
A common form of the gradient boosting machine takes
F = {regression trees of size J},
where J is the number of terminal nodes.
- J = 2 gives decision stumps.
- HTF recommends 4 ≤ J ≤ 8.
Software packages:
- Gradient tree boosting is implemented by the gbm package for R.
- In sklearn, it is GradientBoostingClassifier and GradientBoostingRegressor.
For trees, there are other tweaks on the algorithm one can do; see HTF 10.9-10.12.
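As a rough sketch of how these choices map onto the sklearn estimator named above (the parameter values are illustrative, not from the slides), tree size can be limited via max_leaf_nodes, which plays the role of J:

```python
# Sketch: the slide's choices mapped onto sklearn's gradient tree boosting.
# Parameter values here are illustrative, not from the slides.
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(
    max_leaf_nodes=6,     # roughly J, the number of terminal nodes (HTF suggests 4 <= J <= 8)
    n_estimators=500,     # number of boosting steps M
    learning_rate=0.1,    # fixed step size nu
)
# reg.fit(X_train, y_train); reg.predict(X_test)
```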

GBM Regression with Stumps

Sinc Function: Our Dataset
[Figure from Natekin and Knoll's "Gradient boosting machines, a tutorial".]

Fitting with an Ensemble of Decision Stumps
Decision stumps with 1, 10, 50, and 100 steps; step size λ = 1.
[Figure from Natekin and Knoll's "Gradient boosting machines, a tutorial".]
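A sketch of this kind of experiment (the sample size and noise level are assumptions, not read off the figure): fit stump-based boosting with step size 1 and inspect the fit after 1, 10, 50, and 100 steps via staged_predict.

```python
# Sketch of a stump-boosting experiment on a sinc-like target.
# The data-generating details (sample size, noise level) are assumptions, not from the slides.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(200, 1))
y = np.sinc(X[:, 0] / np.pi) + rng.normal(scale=0.1, size=200)   # noisy sinc(x) = sin(x)/x

gbm = GradientBoostingRegressor(max_depth=1, n_estimators=100, learning_rate=1.0)
gbm.fit(X, y)

# staged_predict yields the prediction of f_m for m = 1, ..., 100,
# so we can inspect the fit after 1, 10, 50, and 100 steps as in the figure.
X_grid = np.linspace(-10, 10, 500).reshape(-1, 1)
stages = list(gbm.staged_predict(X_grid))
fits = {m: stages[m - 1] for m in (1, 10, 50, 100)}
```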

Step Size as Regularization
Performance vs. rounds of boosting and step size.
[Figure from Natekin and Knoll's "Gradient boosting machines, a tutorial".]

Variations on Gradient Boosting

Stochastic Gradient Boosting
At each stage, choose a random subset of the data for computing the projected gradient step.
- Typically about 50% of the dataset; this fraction is often called the bag fraction.
Why?
- It is faster.
- The subsample fraction is an additional regularization parameter.
How small a fraction can we take?
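In sklearn's gradient boosting, the bag fraction corresponds to the subsample parameter; the sketch below uses illustrative values.

```python
# Stochastic gradient boosting: each boosting round fits its tree on a random
# subsample of the training data. In sklearn the bag fraction is `subsample`.
from sklearn.ensemble import GradientBoostingRegressor

sgbm = GradientBoostingRegressor(
    subsample=0.5,        # bag fraction: use ~50% of the rows per round
    max_leaf_nodes=6,
    n_estimators=500,
    learning_rate=0.1,
)
```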

Column / Feature Subsampling for Regularization
Similar to random forests: randomly choose a subset of features for each round.
The XGBoost paper says: "According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling."
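Both forms of subsampling appear as XGBoost hyperparameters; a sketch with illustrative values (not recommendations from the slides): subsample is the row (bag) fraction and colsample_bytree is the fraction of features drawn per tree.

```python
# Sketch: row and column subsampling as XGBoost hyperparameters.
# Values are illustrative, not recommendations from the slides.
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.1,
    subsample=0.5,          # row (bag) fraction per boosting round
    colsample_bytree=0.5,   # fraction of features sampled for each tree
)
# model.fit(X_train, y_train)
```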

Newton Step Direction
For GBM, we find the closest h ∈ F to the negative gradient -g = -∇_f J(f).
This is a first-order method.
Newton's method is a second-order method:
- Find the 2nd-order (quadratic) approximation to J at f.
- This requires computing the gradient and Hessian of J.
- The Newton step direction points toward the minimizer of the quadratic.
- The minimizer of the quadratic is easy to find in closed form.
Boosting methods with a projected Newton step direction:
- LogitBoost (logistic loss function)
- XGBoost (any loss; uses regression trees for the base classifier)
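As a hedged illustration (my own sketch, not code from LogitBoost or XGBoost): for logistic loss with labels y ∈ {0, 1} and f the log-odds, the per-example gradient is g_i = p_i - y_i and the Hessian is h_i = p_i(1 - p_i). Projecting the Newton direction onto regression trees amounts to a weighted least-squares fit of -g_i/h_i with weights h_i.

```python
# Sketch of one projected Newton step for logistic loss (not taken from any library's source).
# Labels y_i are in {0, 1}; f holds the current log-odds f_{m-1}(x_i) on the training points.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost_step(X, y, f, max_leaf_nodes=6):
    p = 1.0 / (1.0 + np.exp(-f))        # predicted probabilities
    g = p - y                           # per-example gradient of logistic loss
    h = np.clip(p * (1.0 - p), 1e-6, None)   # per-example Hessian, guarded against underflow
    # Projecting the Newton direction -g/h onto regression trees is a
    # weighted least-squares fit toward -g/h with weights h.
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
    tree.fit(X, -g / h, sample_weight=h)
    return tree                         # f_m(x) = f_{m-1}(x) + nu * tree.predict(x)
```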