Recitation 9
Brett Bernstein (CDS at NYU)
March 30, 2017
Intro Question

Question: Suppose 10 different meteorologists have produced functions $f_1, \ldots, f_{10} : \mathbb{R}^d \to \mathbb{R}$ that forecast tomorrow's noon-time temperature using the same $d$ features. Given a validation set containing 1000 data points $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ of similar forecast situations, describe a method to forecast tomorrow's noon-time temperature. Would you use boosting, bagging, or neither?
Intro Solution

Solution: Let $\hat{x}_i = (x_i, f_1(x_i), \ldots, f_{10}(x_i)) \in \mathbb{R}^{d+10}$. Then use any fitting method you like to produce an aggregate decision function $f : \mathbb{R}^{d+10} \to \mathbb{R}$. This method is sometimes called stacking (see the sketch below).
1. This isn't bagging: we didn't generate bootstrap samples and learn a decision function on each of them.
2. This isn't boosting: boosting learns decision functions on varying datasets to produce an aggregate classifier.
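A minimal sketch of this stacking approach in Python, assuming the 10 forecasters are available as callables (here stubbed out with random linear models) and the aggregate model is fit by ordinary least squares; the names and the synthetic data are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000

# Hypothetical stand-ins for the meteorologists' forecasters f_1, ..., f_10.
base_models = [(lambda x, w=rng.normal(size=d): x @ w) for _ in range(10)]

# Validation set (x_i, y_i); synthetic here for illustration.
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Augment each x_i with the 10 base forecasts to form xhat_i in R^{d+10}.
base_preds = np.column_stack([np.apply_along_axis(f, 1, X) for f in base_models])
X_aug = np.hstack([X, base_preds])

# Fit any regression method on the augmented features; least squares here.
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

def aggregate_forecast(x):
    """Stacked forecast for a single feature vector x in R^d."""
    xhat = np.concatenate([x, [f(x) for f in base_models]])
    return xhat @ coef
```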
Different Ensembles

1. Parallel ensemble: each base model is fit independently of the other models. Examples are bagging and stacking.
2. Sequential ensemble: each base model is fit in stages depending on the previous fits. Examples are AdaBoost and gradient boosting.
AdaBoost Review

1. Recall that a learner, or learning algorithm, takes a dataset as input and produces a decision function in some hypothesis space.

Question: Suppose we had a learner that, given a dataset and a weighting (importance) scheme on that dataset, produces a classifier $h$ whose weighted 0-1 loss is strictly below $.5$:
$$\frac{1}{n} \sum_{i=1}^n w_i \mathbb{1}(y_i \neq h(x_i)) \leq \gamma < .5.$$
Can we use this learner to create an ensemble that makes accurate predictions?

We saw that AdaBoost solves this problem. We can get around weighted loss functions using a sampling trick (sketched below).
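A minimal sketch of the sampling trick, assuming an unweighted binary classification learner `fit_unweighted(X, y)` (an illustrative name, not from the slides): instead of minimizing a weighted loss, resample the training points with probability proportional to their weights and train on the resampled, unweighted dataset.

```python
import numpy as np

def fit_on_weighted_data(X, y, weights, fit_unweighted, rng=np.random.default_rng(0)):
    """Approximate a weighted-loss fit by resampling points in proportion to their weights."""
    n = len(y)
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    # Draw a bootstrap-style sample of size n in which heavily weighted points appear more often.
    idx = rng.choice(n, size=n, replace=True, p=probs)
    return fit_unweighted(X[idx], y[idx])
```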
Additive Models

1. Additive models over a base hypothesis space $\mathcal{H}$ take the form
$$\mathcal{F} = \left\{ f(x) = \sum_{m=1}^M \nu_m h_m(x) \;\middle|\; h_m \in \mathcal{H},\ \nu_m \in \mathbb{R} \right\}.$$
2. Since we are taking linear combinations, we assume the $h_m$ functions take values in $\mathbb{R}$ or some other vector space.
3. Empirical risk minimization over $\mathcal{F}$ tries to find
$$\arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)).$$
4. This is in general a difficult task, as the number of base hypotheses $M$ is unknown, and each base hypothesis $h_m$ ranges over all of $\mathcal{H}$.
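As a concrete illustration (not from the slides), here is a tiny additive model in Python whose base hypotheses are decision stumps on $\mathbb{R}$:

```python
# Each base hypothesis is a decision stump: value a if x <= split, else value b.
def stump(split, a, b):
    return lambda x: a if x <= split else b

# An additive model f(x) = sum_m nu_m * h_m(x) with M = 2 stages.
coeffs = [0.5, 1.5]                       # the nu_m
hypotheses = [stump(3.0, 1.0, -1.0),      # h_1
              stump(7.0, -2.0, 4.0)]      # h_2

def f(x):
    return sum(nu * h(x) for nu, h in zip(coeffs, hypotheses))

print(f(2.0))   # 0.5*1.0 + 1.5*(-2.0) = -2.5
print(f(8.0))   # 0.5*(-1.0) + 1.5*4.0 = 5.5
```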
Forward Stagewise Additive Modeling (FSAM)

The FSAM method fits additive models using the following (greedy) algorithmic structure:
1. Initialize $f_0 \equiv 0$.
2. For stage $m = 1, \ldots, M$:
   1. Choose $h_m \in \mathcal{H}$ and $\nu_m \in \mathbb{R}$ so that
   $$f_m = f_{m-1} + \nu_m h_m$$
   has the minimum empirical risk.
   2. The function $f_m$ has the form
   $$f_m = \nu_1 h_1 + \cdots + \nu_m h_m.$$

When choosing $h_m, \nu_m$ during stage $m$, we must solve the minimization
$$(\nu_m, h_m) = \arg\min_{\nu \in \mathbb{R},\, h \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f_{m-1}(x_i) + \nu h(x_i)).$$
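A minimal sketch of one FSAM stage (an illustration under assumptions, not from the slides): when $\mathcal{H}$ is a finite pool of candidate base hypotheses, the joint minimization over $(\nu, h)$ can be approximated by scanning every candidate $h$ and a grid of step sizes $\nu$.

```python
import numpy as np

def fsam_stage(x, y, f_prev, candidates, loss, nus=np.linspace(-3, 3, 121)):
    """One greedy FSAM stage: pick (nu_m, h_m) minimizing sum_i loss(y_i, f_prev(x_i) + nu*h(x_i)).

    loss(y, preds) must be vectorized, e.g. square loss: lambda y, a: 0.5 * (y - a) ** 2.
    """
    prev_vals = np.array([f_prev(xi) for xi in x])
    best = (None, None, np.inf)
    for h in candidates:
        h_vals = np.array([h(xi) for xi in x])
        for nu in nus:
            risk = loss(y, prev_vals + nu * h_vals).sum()
            if risk < best[2]:
                best = (nu, h, risk)
    nu_m, h_m, _ = best
    # Return f_m = f_{m-1} + nu_m * h_m.
    return lambda z: f_prev(z) + nu_m * h_m(z)
```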
Gradient Boosting

1. Can we simplify the following minimization problem?
$$(\nu_m, h_m) = \arg\min_{\nu \in \mathbb{R},\, h \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f_{m-1}(x_i) + \nu h(x_i)).$$
2. What if we linearize the problem and take a step along the steepest descent direction?
3. Good idea, but how do we handle the constraint that $h$ is a function that lies in $\mathcal{H}$, the base hypothesis space?
4. First idea: since we are doing empirical risk minimization, we only care about the values $h$ takes on the training set. Thus we can think of $h$ as a vector $(h(x_1), \ldots, h(x_n))$.
5. Second idea: first compute the unconstrained steepest descent direction, and then constrain (project) onto the possible choices from $\mathcal{H}$.
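To make step 2 concrete, here is the linearization written out (a standard first-order Taylor expansion, not reproduced verbatim from the slides):

```latex
% First-order expansion of the stage-m objective in \nu around \nu = 0:
\sum_{i=1}^n \ell\bigl(y_i, f_{m-1}(x_i) + \nu h(x_i)\bigr)
  \approx \sum_{i=1}^n \ell\bigl(y_i, f_{m-1}(x_i)\bigr)
  + \nu \sum_{i=1}^n h(x_i)\, \partial_2 \ell\bigl(y_i, f_{m-1}(x_i)\bigr).
```

Viewing $h$ as the vector $(h(x_1), \ldots, h(x_n))$, the linearized objective decreases fastest when this vector points along the negative gradient $-\bigl(\partial_2 \ell(y_1, f_{m-1}(x_1)), \ldots, \partial_2 \ell(y_n, f_{m-1}(x_n))\bigr)$, which motivates the pseudoresiduals on the next slide.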
Gradient Boosting Machine

1. Initialize $f_0 \equiv 0$.
2. For stage $m = 1, \ldots, M$:
   1. Compute the steepest descent direction (also called the pseudoresiduals):
   $$r_m = -\bigl(\partial_2 \ell(y_1, f_{m-1}(x_1)), \ldots, \partial_2 \ell(y_n, f_{m-1}(x_n))\bigr).$$
   2. Find the closest base hypothesis (using Euclidean distance):
   $$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \bigl((r_m)_i - h(x_i)\bigr)^2.$$
   3. Choose a fixed step size $\nu_m \in (0, 1]$ or use a line search:
   $$\nu_m = \arg\min_{\nu \geq 0} \sum_{i=1}^n \ell(y_i, f_{m-1}(x_i) + \nu h_m(x_i)).$$
   4. Take the step:
   $$f_m(x) = f_{m-1}(x) + \nu_m h_m(x).$$
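A minimal sketch of this algorithm in Python with a fixed step size (an illustration under assumptions, not the recitation's reference implementation), with a pluggable loss gradient and a pluggable least squares base learner:

```python
import numpy as np

def gradient_boost(X, y, loss_grad, fit_base_learner, M=100, nu=0.1):
    """Gradient boosting / functional gradient descent with a fixed step size nu.

    loss_grad(y, preds)    -> derivative of the loss in its second argument, per training point
    fit_base_learner(X, r) -> a function h with h(X) approximating r in least squares
    """
    models = []
    preds = np.zeros(len(y))          # f_0 = 0 evaluated on the training set
    for _ in range(M):
        r = -loss_grad(y, preds)      # pseudoresiduals: negative gradient
        h = fit_base_learner(X, r)    # least squares projection onto the base class
        models.append(h)
        preds = preds + nu * h(X)     # take the step f_m = f_{m-1} + nu * h_m

    def f(X_new):
        return nu * sum(h(X_new) for h in models)
    return f
```

With the square loss $\ell(y, a) = (y - a)^2 / 2$, for example, `loss_grad` is `lambda y, a: a - y`, so the pseudoresiduals reduce to the ordinary residuals $y_i - f_{m-1}(x_i)$.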
Gradient Boosting Machine

1. At each stage we need to solve the following step:
$$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \bigl((r_m)_i - h(x_i)\bigr)^2.$$
How do we do this?
2. This is a standard least squares regression task on the mock dataset
$$\mathcal{D}^{(m)} = \{(x_1, (r_m)_1), \ldots, (x_n, (r_m)_n)\}.$$
3. We assume that we have a learner that (approximately) solves least squares regression over $\mathcal{H}$.
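For instance, a shallow scikit-learn regression tree can serve as the least squares learner over $\mathcal{H}$ (an illustrative choice; the slides do not prescribe a specific library):

```python
from sklearn.tree import DecisionTreeRegressor

def fit_tree_base_learner(X, r):
    """Fit a small regression tree to the mock dataset {(x_i, (r_m)_i)} by least squares."""
    tree = DecisionTreeRegressor(max_leaf_nodes=8)   # shallow tree with few leaves
    tree.fit(X, r)
    return tree.predict
```

This plugs directly into the `fit_base_learner` slot of the sketch above.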
Comments

1. The algorithm above is sometimes called AnyBoost or functional gradient descent.
2. The most commonly used base hypothesis space is small regression trees (HTF recommends between 4 and 8 leaves).
Practice With Different Loss Functions

Question: Explain how to perform gradient boosting with the following loss functions:
1. Square loss: $\ell(y, a) = (y - a)^2 / 2$.
2. Absolute loss: $\ell(y, a) = |y - a|$.
3. Exponential margin loss: $\ell(y, a) = e^{-ya}$.
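As a sketch of the answer (standard calculations, not copied from the slides), the pseudoresiduals $-\partial_2 \ell(y_i, f_{m-1}(x_i))$ for each loss are:

```latex
\begin{aligned}
\text{Square loss:} \quad & -\partial_2 \ell(y, a) = y - a
  & &\Rightarrow\ (r_m)_i = y_i - f_{m-1}(x_i), \\
\text{Absolute loss:} \quad & -\partial_2 \ell(y, a) = \operatorname{sign}(y - a)
  & &\Rightarrow\ (r_m)_i = \operatorname{sign}\bigl(y_i - f_{m-1}(x_i)\bigr), \\
\text{Exponential loss:} \quad & -\partial_2 \ell(y, a) = y\, e^{-ya}
  & &\Rightarrow\ (r_m)_i = y_i\, e^{-y_i f_{m-1}(x_i)}.
\end{aligned}
```

In each case, stage $m$ fits a least squares regressor to these pseudoresiduals and then chooses a step size, exactly as on the Gradient Boosting Machine slide (for the absolute loss this uses a subgradient, since $|y - a|$ is not differentiable at $a = y$).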
Demonstration Using Decision Stumps

Below is an example of a decision stump for functions $h : \mathbb{R} \to \mathbb{R}$: $h(x) = 7$ if $x \leq 5$ and $h(x) = -2$ if $x > 5$.
[Figure: step plot of this stump, jumping from 7 down to -2 at x = 5.]
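A minimal sketch (illustrative, not from the recitation) of fitting a decision stump to one-dimensional data by least squares, which is exactly the base learner step $h_m = \arg\min_{h} \sum_i ((r_m)_i - h(x_i))^2$ when $\mathcal{H}$ is the set of stumps:

```python
import numpy as np

def fit_stump(x, r):
    """Least squares decision stump for 1-d inputs: h(z) = a if z <= s else b."""
    order = np.argsort(x)
    x_sorted, r_sorted = x[order], r[order]
    best = (np.inf, None)
    # Try every split point between consecutive sorted x values.
    for k in range(1, len(x_sorted)):
        s = 0.5 * (x_sorted[k - 1] + x_sorted[k])
        a = r_sorted[:k].mean()         # best constant value left of the split
        b = r_sorted[k:].mean()         # best constant value right of the split
        sse = ((r_sorted[:k] - a) ** 2).sum() + ((r_sorted[k:] - b) ** 2).sum()
        if sse < best[0]:
            best = (sse, (s, a, b))
    s, a, b = best[1]
    return lambda z: np.where(z <= s, a, b)
```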
Demonstration Using Decision Stumps

Below is the dataset we will use.
[Figure: scatter plot of the training data, y against x.]