Recitation 9: Gradient Boosting
Brett Bernstein, CDS at NYU
March 30, 2017

Intro Question

Question. Suppose 10 different meteorologists have produced functions $f_1, \ldots, f_{10} : \mathbb{R}^d \to \mathbb{R}$ that forecast tomorrow's noon-time temperature using the same $d$ features. Given a validation set containing 1000 data points $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ of similar forecast situations, describe a method to forecast tomorrow's noon-time temperature. Would you use boosting, bagging, or neither?

Intro Solution

Let $\hat{x}_i = (x_i, f_1(x_i), \ldots, f_{10}(x_i)) \in \mathbb{R}^{d+10}$. Then use any fitting method you like to produce an aggregate decision function $f : \mathbb{R}^{d+10} \to \mathbb{R}$. This method is sometimes called stacking.

1. This isn't bagging: we didn't generate bootstrap samples and learn a decision function on each of them.
2. This isn't boosting: boosting learns decision functions on varying datasets to produce an aggregate classifier.
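As an aside, here is a minimal sketch of the stacking idea in Python. It assumes the ten forecasters are available as vectorized callables and that the validation data sit in arrays X_val and y_val; all names (stack_features, forecasters, x_new, the choice of Ridge) are illustrative, not from the slides.

    import numpy as np
    from sklearn.linear_model import Ridge

    def stack_features(X, forecasters):
        # Append each forecaster's prediction to the original d features.
        preds = np.column_stack([f(X) for f in forecasters])  # shape (n, 10)
        return np.hstack([X, preds])                          # shape (n, d + 10)

    # forecasters = [f1, ..., f10]                 # the meteorologists' functions
    # X_hat = stack_features(X_val, forecasters)   # the \hat{x}_i from the slide
    # stacker = Ridge().fit(X_hat, y_val)          # any fitting method works here
    # forecast = stacker.predict(stack_features(x_new.reshape(1, -1), forecasters))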

Different Ensembles

1. Parallel ensemble: each base model is fit independently of the other models. Examples are bagging and stacking.
2. Sequential ensemble: each base model is fit in stages depending on the previous fits. Examples are AdaBoost and gradient boosting.

AdaBoost Review

1. Recall that a learner, or learning algorithm, takes a dataset as input and produces a decision function in some hypothesis space.

Question. Suppose we had a learner that, given a dataset and a weighting (importance) scheme on that dataset, would produce a classifier $h$ that has lower than $.5$ loss using the weighted 0-1 loss:
$$\frac{1}{n} \sum_{i=1}^n w_i 1(y_i \neq h(x_i)) \leq \gamma < .5.$$
Can we use this learner to create an ensemble that makes accurate predictions?

We saw that AdaBoost solves this problem. We can get around weighted loss functions using a sampling trick.
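For concreteness, the weighted 0-1 loss above can be computed in a few lines of numpy (a sketch; the arguments are assumed to hold the labels, the weak learner's predictions, and the weights):

    import numpy as np

    def weighted_zero_one_error(y, y_pred, w):
        # (1/n) * sum_i w_i * 1(y_i != h(x_i))
        y, y_pred, w = map(np.asarray, (y, y_pred, w))
        return np.mean(w * (y != y_pred))

    # A weak learner only needs to keep this quantity below .5 on the weighted data:
    # weighted_zero_one_error([1, -1, 1, -1], [1, 1, 1, -1], [1.0, 1.0, 1.0, 1.0])  # 0.25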


Additive Models

1. Additive models over a base hypothesis space $\mathcal{H}$ take the form
$$\mathcal{F} = \left\{ f(x) = \sum_{m=1}^M \nu_m h_m(x) \;\middle|\; h_m \in \mathcal{H},\ \nu_m \in \mathbb{R} \right\}.$$
2. Since we are taking linear combinations, we assume the $h_m$ functions take values in $\mathbb{R}$ or some other vector space.
3. Empirical risk minimization over $\mathcal{F}$ tries to find
$$\arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)).$$
4. This is in general a difficult task, as the number of base hypotheses $M$ is unknown, and each base hypothesis $h_m$ ranges over all of $\mathcal{H}$.
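A tiny sketch of how such an additive model can be represented and evaluated in code; the stump-style base hypotheses in the commented example are illustrative, not from the slides.

    import numpy as np

    def additive_predict(x, stages):
        # Evaluate f(x) = sum_m nu_m * h_m(x) for stages = [(nu_1, h_1), ..., (nu_M, h_M)].
        return sum(nu * h(x) for nu, h in stages)

    # stages = [(0.5, lambda x: np.where(x <= 5, 7.0, -2.0)),
    #           (0.1, lambda x: np.where(x <= 2, 1.0, -1.0))]
    # additive_predict(np.array([1.0, 6.0]), stages)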

Forward Stagewise Additive Modeling (FSAM)

The FSAM method fits additive models using the following (greedy) algorithmic structure:

1. Initialize $f_0 \equiv 0$.
2. For stage $m = 1, \ldots, M$:
   1. Choose $h_m \in \mathcal{H}$ and $\nu_m \in \mathbb{R}$ so that $f_m = f_{m-1} + \nu_m h_m$ has the minimum empirical risk.
   2. The function $f_m$ has the form $f_m = \nu_1 h_1 + \cdots + \nu_m h_m$.

When choosing $h_m, \nu_m$ during stage $m$, we must solve the minimization
$$(\nu_m, h_m) = \arg\min_{\nu \in \mathbb{R},\, h \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f_{m-1}(x_i) + \nu h(x_i)).$$
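Under strong simplifying assumptions, this stagewise minimization can be done by brute force. The sketch below assumes $\mathcal{H}$ is a small finite set of vectorized callables and replaces the exact minimization over $\nu$ with a grid search; all names (fsam, nu_grid, X_train, ...) are illustrative.

    import numpy as np

    def fsam(X, y, H, loss, M=10, nu_grid=np.linspace(-2, 2, 81)):
        stages = []                                # list of (nu_m, h_m)
        f = np.zeros(len(y))                       # f_0 = 0 on the training points
        for m in range(M):
            best = None
            for h in H:
                hx = h(X)
                for nu in nu_grid:
                    risk = np.mean(loss(y, f + nu * hx))
                    if best is None or risk < best[0]:
                        best = (risk, nu, h, hx)
            _, nu_m, h_m, hx_m = best
            stages.append((nu_m, h_m))
            f = f + nu_m * hx_m                    # f_m = f_{m-1} + nu_m * h_m
        return stages

    # Example with square loss and three decision stumps as the base hypothesis space:
    # square_loss = lambda y, a: (y - a) ** 2 / 2
    # H = [lambda X, s=s: np.where(X <= s, 1.0, -1.0) for s in (2.0, 5.0, 8.0)]
    # stages = fsam(X_train, y_train, H, square_loss, M=5)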


Gradient Boosting

1. Can we simplify the following minimization problem:
$$(\nu_m, h_m) = \arg\min_{\nu \in \mathbb{R},\, h \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f_{m-1}(x_i) + \nu h(x_i))?$$
2. What if we linearize the problem and take a step along the steepest descent direction?
3. Good idea, but how do we handle the constraint that $h$ is a function that lies in $\mathcal{H}$, the base hypothesis space?
4. First idea: since we are doing empirical risk minimization, we only care about the values $h$ takes on the training set. Thus we can think of $h$ as a vector $(h(x_1), \ldots, h(x_n))$.
5. Second idea: first compute the unconstrained steepest descent direction, and then constrain (project) onto the possible choices from $\mathcal{H}$.


Gradient Boosting Machine

1. Initialize $f_0 \equiv 0$.
2. For stage $m = 1, \ldots, M$:
   1. Compute the steepest descent direction (also called the pseudoresiduals):
      $$r_m = -\left( \partial_2 \ell(y_1, f_{m-1}(x_1)), \ldots, \partial_2 \ell(y_n, f_{m-1}(x_n)) \right).$$
   2. Find the closest base hypothesis (using Euclidean distance):
      $$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \left( (r_m)_i - h(x_i) \right)^2.$$
   3. Choose a fixed step size $\nu_m \in (0, 1]$ or use a line search:
      $$\nu_m = \arg\min_{\nu > 0} \sum_{i=1}^n \ell(y_i, f_{m-1}(x_i) + \nu h_m(x_i)).$$
   4. Take the step:
      $$f_m(x) = f_{m-1}(x) + \nu_m h_m(x).$$
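Here is a minimal sketch of these steps in Python, using small sklearn regression trees as the base learner and a fixed step size. A vectorized negative_gradient(y, f) callable (returning $-\partial_2 \ell$ at the training points) is assumed to be supplied, and all other names are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, negative_gradient, M=100, nu=0.1, max_leaf_nodes=8):
        f = np.zeros(len(y))                        # f_0 = 0 on the training points
        trees = []
        for m in range(M):
            r = negative_gradient(y, f)             # pseudoresiduals r_m
            h = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, r)
            f = f + nu * h.predict(X)               # take the step with fixed nu in (0, 1]
            trees.append(h)
        def predict(X_new):
            return nu * sum(t.predict(X_new) for t in trees)
        return predict

    # With square loss l(y, a) = (y - a)^2 / 2 the negative gradient is y - f:
    # predict = gradient_boost(X_train, y_train, lambda y, f: y - f, M=200, nu=0.1)
    # y_hat = predict(X_test)

The tree fit to the pseudoresiduals plays the role of the least squares projection onto the base hypothesis space in step 2.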

Gradient Boosting Machine

1. At each stage we need to solve the following step:
$$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \left( (r_m)_i - h(x_i) \right)^2.$$
How do we do this?
2. This is a standard least squares regression task on the mock dataset
$$D^{(m)} = \{ (x_1, (r_m)_1), \ldots, (x_n, (r_m)_n) \}.$$
3. We assume that we have a learner that (approximately) solves least squares regression over $\mathcal{H}$.


Comments

1. The algorithm above is sometimes called AnyBoost or Functional Gradient Descent.
2. The most commonly used base hypothesis space is small regression trees (HTF recommends between 4 and 8 leaves).

Practice With Different Loss Functions

Question. Explain how to perform gradient boosting with the following loss functions:

1. Square loss: $\ell(y, a) = (y - a)^2 / 2$.
2. Absolute loss: $\ell(y, a) = |y - a|$.
3. Exponential margin loss: $\ell(y, a) = e^{-ya}$.
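As a check on the first step of each answer, here are the negative gradients (pseudoresiduals) $r_i = -\partial_2 \ell(y_i, f_{m-1}(x_i))$ for the three losses, written out as short Python functions (a sketch; the function names are illustrative):

    import numpy as np

    def neg_grad_square(y, f):       # l(y, a) = (y - a)^2 / 2  =>  r = y - f
        return y - f

    def neg_grad_absolute(y, f):     # l(y, a) = |y - a|        =>  r = sign(y - f)
        return np.sign(y - f)

    def neg_grad_exponential(y, f):  # l(y, a) = exp(-y * a)    =>  r = y * exp(-y * f)
        return y * np.exp(-y * f)

Any of these can be plugged into the gradient boosting sketch above as the negative_gradient callable.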

Demonstration Using Decision Stumps

Below is an example of a decision stump for functions $h : \mathbb{R} \to \mathbb{R}$.

[Figure: a decision stump that splits at $x = 5$, taking the value 7 when $x \leq 5$ and $-2$ when $x > 5$, shown both as a two-leaf tree and as a step-function plot of $h$.]
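The stump in the figure can be written as a one-line function (a sketch, with the values read off the figure):

    import numpy as np

    def stump(x):
        # h(x) = 7 if x <= 5, and -2 otherwise
        return np.where(np.asarray(x) <= 5, 7.0, -2.0)

    # stump([3, 5, 6])  ->  array([ 7.,  7., -2.])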

Demonstration Using Decision Stumps

Below is the dataset we will use.

[Figure: scatter plot of the training dataset, with $x$ on the horizontal axis and $y$ on the vertical axis.]