Descent methods: min_x f(x)


Gradient Descent

Descent methods
min_x f(x)
(Figure: a sequence of iterates x^k, x^{k+1}, ... descending toward a minimizer x* with ∇f(x*) = 0.)

Gradient methods
Unconstrained optimization: min_{x ∈ R^n} f(x).
Suppose we have a vector x ∈ R^n for which ∇f(x) ≠ 0.
Consider the ray x(α) = x − α∇f(x) for α ≥ 0.
Make a first-order Taylor expansion around x:
f(x(α)) = f(x) + ⟨∇f(x), x(α) − x⟩ + o(‖x(α) − x‖_2)
        = f(x) − α‖∇f(x)‖_2^2 + o(α‖∇f(x)‖_2)
        = f(x) − α‖∇f(x)‖_2^2 + o(α)
For α near 0, the term α‖∇f(x)‖_2^2 dominates o(α), so for positive, sufficiently small α, f(x(α)) is smaller than f(x).

Descent methods
Carrying the idea further, consider x(α) = x + αd, where the direction d ∈ R^n is obtuse to ∇f(x), i.e., ⟨∇f(x), d⟩ < 0.
Again we have the Taylor expansion
f(x(α)) = f(x) + α⟨∇f(x), d⟩ + o(α),
where α⟨∇f(x), d⟩ dominates o(α) for sufficiently small α.
Since d is obtuse to ∇f(x), this implies f(x(α)) < f(x).


Gradient descent
Consider unconstrained, smooth convex optimization min f(x), i.e., f is convex and differentiable with dom(f) = R^n. Denote the optimal criterion value by f* = min f(x), and a solution by x*.
Gradient descent: choose an initial point x^(0) ∈ R^n and repeat
x^(k) = x^(k−1) − t_k ∇f(x^(k−1)), k = 1, 2, 3, ...
Stop at some point.
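A minimal sketch of this iteration in NumPy; the fixed step size t, the gradient-norm stopping rule, and the quadratic test function below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, max_iter=1000, tol=1e-8):
    """Iterate x^(k) = x^(k-1) - t * grad f(x^(k-1)) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # "stop at some point"
            break
        x = x - t * g
    return x

# Example: f(x) = (10 x1^2 + x2^2) / 2, the function used on the next few slides
grad_f = lambda x: np.array([10.0 * x[0], x[1]])
print(gradient_descent(grad_f, x0=[10.0, 10.0], t=0.1))
```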


Gradient descent interpretation
At each iteration, consider the expansion
f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖_2^2
This is a quadratic approximation, replacing the usual Hessian ∇²f(x) by (1/t) I:
f(x) + ∇f(x)^T (y − x) is the linear approximation to f;
(1/(2t)) ‖y − x‖_2^2 is a proximity term to x, with weight 1/(2t).
Choose the next point y = x^+ to minimize the quadratic approximation:
x^+ = x − t ∇f(x)

x^+ = argmin_y  f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖_2^2
(Figure: the blue point is x, the red point is x^+.)

Outline:
How to choose step sizes
Convergence analysis
Forward stagewise regression, boosting

Fixed step size
Simply take t_k = t for all k = 1, 2, 3, ...; this can diverge if t is too big. Consider f(x) = (10 x_1^2 + x_2^2)/2; gradient descent after 8 steps:
(Figure: iterates oscillating away from the optimum.)

It can be slow if t is too small. Same example, gradient descent after 100 steps:
(Figure: iterates making very slow progress toward the optimum.)

Same example, gradient descent after 40 appropriately sized steps:
(Figure: iterates reaching the optimum.)
Clearly there is a tradeoff; the convergence analysis later will give us a better idea.
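A sketch that reproduces this tradeoff numerically; here L = 10 for this quadratic, and the particular step sizes and starting point are guesses for illustration (the slides do not state the values used in the figures):

```python
import numpy as np

f      = lambda x: (10 * x[0] ** 2 + x[1] ** 2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

def run(t, n_steps, x0=(10.0, 10.0)):
    x = np.array(x0)
    for _ in range(n_steps):
        x = x - t * grad_f(x)
    return f(x)

print(run(t=0.25, n_steps=8))    # too big (t > 2/L): criterion blows up
print(run(t=0.01, n_steps=100))  # too small: still far from f* = 0
print(run(t=0.10, n_steps=40))   # appropriately sized: essentially at f* = 0
```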

Backtracking line search
One way to adaptively choose the step size is to use backtracking line search:
First fix parameters 0 < β < 1 and 0 < α ≤ 1/2.
At each iteration, start with t = 1, and while
f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖_2^2,
shrink t = βt. Else perform the gradient descent update x^+ = x − t∇f(x).
Simple and tends to work well in practice (further simplification: just take α = 1/2).
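A sketch of gradient descent with this backtracking rule; the stopping tolerance and iteration cap are illustrative additions:

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, alpha=0.5, beta=0.5, max_iter=500, tol=1e-8):
    """Gradient descent with the backtracking line search described above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        t = 1.0                                           # start each iteration at t = 1
        while f(x - t * g) > f(x) - alpha * t * (g @ g):  # sufficient-decrease check
            t *= beta                                     # shrink t = beta * t
        x = x - t * g                                     # gradient descent update
    return x
```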

Backtracking picks up roughly the right step size (12 outer steps, 40 steps total):
(Figure: iterates reaching the optimum; here α = β = 0.5.)

Exact line search
We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:
t = argmin_{s ≥ 0} f(x − s∇f(x))
Usually it is not possible to do this minimization exactly. Approximations to exact line search are often not much more efficient than backtracking, and it's usually not worth it.

Convergence analysis
Assume that f is convex and differentiable with dom(f) = R^n, and additionally
‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2 for any x, y,
i.e., ∇f is Lipschitz continuous with constant L > 0.
Theorem: Gradient descent with fixed step size t ≤ 1/L satisfies
f(x^(k)) − f* ≤ ‖x^(0) − x*‖_2^2 / (2tk)
We say gradient descent has convergence rate O(1/k), i.e., to get f(x^(k)) − f* ≤ ε, we need O(1/ε) iterations.
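A small numerical sanity check of this bound on a random least-squares instance (the data, dimensions, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 10)), rng.standard_normal(50)

f      = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
grad_f = lambda b: -X.T @ (y - X @ b)

L = np.linalg.eigvalsh(X.T @ X).max()          # Lipschitz constant of grad f
t = 1.0 / L
b_star = np.linalg.lstsq(X, y, rcond=None)[0]
f_star = f(b_star)

b = np.zeros(10)                               # x^(0) = 0
for k in range(1, 201):
    b = b - t * grad_f(b)
    bound = np.sum(b_star ** 2) / (2 * t * k)  # ||x^(0) - x*||_2^2 / (2 t k)
    assert f(b) - f_star <= bound + 1e-9
```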

Convergence analysis for backtracking
Same assumptions: f is convex and differentiable, dom(f) = R^n, and ∇f is Lipschitz continuous with constant L > 0. We get the same rate for a step size chosen by backtracking search.
Theorem: Gradient descent with backtracking line search satisfies
f(x^(k)) − f* ≤ ‖x^(0) − x*‖_2^2 / (2 t_min k),
where t_min = min{1, β/L}.
If β is not too small, then we don't lose much compared to a fixed step size (β/L vs. 1/L).

Convergence analysis under strong convexity
Reminder: strong convexity of f means that f(x) − (m/2)‖x‖_2^2 is convex for some m > 0. If f is twice differentiable, this implies ∇²f(x) ⪰ mI for any x.
It gives a sharper lower bound than the one from usual convexity:
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖_2^2 for all x, y
Under the Lipschitz assumption as before, and also strong convexity:
Theorem: Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search satisfies
f(x^(k)) − f* ≤ c^k (L/2) ‖x^(0) − x*‖_2^2,
where 0 < c < 1.

I.e., the rate with strong convexity is O(c^k), exponentially fast! I.e., to get f(x^(k)) − f* ≤ ε, we need O(log(1/ε)) iterations.
This is called linear convergence, because it looks linear on a semi-log plot:
(Figure: f(x^(k)) − f* versus k on a semi-log scale, for exact and backtracking line search.)
The constant c depends adversely on the condition number L/m (a higher condition number means a slower rate).

Convex function
f is convex if dom f is a convex set and Jensen's inequality holds:
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) for all x, y ∈ dom f, θ ∈ [0, 1]
First-order condition: for (continuously) differentiable f, Jensen's inequality can be replaced with
f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom f
Second-order condition: for twice differentiable f, Jensen's inequality can be replaced with
∇²f(x) ⪰ 0 for all x ∈ dom f

Strictly convex function
f is strictly convex if dom f is a convex set and
f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y) for all x, y ∈ dom f, x ≠ y, θ ∈ (0, 1)
First-order condition (for differentiable f): dom f is convex and
f(y) > f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom f, x ≠ y
Hence the minimizer of f is unique (if a minimizer exists).
Second-order condition: note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x^4).

Strongly convex function
f is strongly convex with parameter m > 0 if f(x) − (m/2)‖x‖_2^2 is convex.
Jensen's inequality: f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖_2^2
First-order condition: f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖_2^2 for all x, y ∈ dom f
Second-order condition: ∇²f(x) ⪰ mI for all x ∈ dom f

Quadratic lower bound (from the first-order condition)
If f is strongly convex with parameter m, then
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖_2^2 for all x, y ∈ dom f
(Figure: f(y) lies above the quadratic lower bound through (x, f(x)).)
If dom f = R^n, this implies that f has a unique minimizer x* and
(m/2)‖x − x*‖_2^2 ≤ f(x) − f(x*) ≤ (1/(2m))‖∇f(x)‖_2^2

Monotonicity of gradient
A differentiable f is convex if and only if dom f is convex and
(∇f(x) − ∇f(y))^T (x − y) ≥ 0 for all x, y ∈ dom f,
i.e., ∇f : R^n → R^n is a monotone mapping.
The gradient of a strictly convex function is strictly monotone:
(∇f(x) − ∇f(y))^T (x − y) > 0 for all x, y ∈ dom f, x ≠ y
The gradient of a strongly convex function is strongly monotone (coercive):
(∇f(x) − ∇f(y))^T (x − y) ≥ m‖x − y‖_2^2 for all x, y ∈ dom f

Proof of the monotonicity property
If f is differentiable and convex, then
f(y) ≥ f(x) + ∇f(x)^T (y − x),  f(x) ≥ f(y) + ∇f(y)^T (x − y);
combining the inequalities gives (∇f(x) − ∇f(y))^T (x − y) ≥ 0.
If ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0, where
g(t) = f(x + t(y − x)),  g′(t) = ∇f(x + t(y − x))^T (y − x).
Hence
f(y) = g(1) = g(0) + ∫_0^1 g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T (y − x)

Functions with Lipschitz continuous gradients
The gradient of f is Lipschitz continuous with parameter L > 0 if
‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2 for all x, y ∈ dom f
From the Cauchy-Schwarz inequality, this implies
(∇f(x) − ∇f(y))^T (x − y) ≤ L‖x − y‖_2^2,
i.e., Lx − ∇f(x) is monotone; in other words, (L/2)‖x‖_2^2 − f(x) is convex.
For twice differentiable f, this is equivalent to ∇²f(x) ⪯ LI for all x ∈ dom f.

Quadratic upper bound
If ∇f is Lipschitz continuous with parameter L, then
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖_2^2 for all x, y ∈ dom f
(Figure: f(y) lies below the quadratic upper bound through (x, f(x)).)
If dom f = R^n and f has a minimizer x*, this implies
(1/(2L))‖∇f(x)‖_2^2 ≤ f(x) − f(x*) ≤ (L/2)‖x − x*‖_2^2

Co-coercivity of gradient
If f is convex and ∇f is Lipschitz continuous with parameter L, then
(∇f(x) − ∇f(y))^T (x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖_2^2 for all x, y ∈ dom f
This property is known as co-coercivity of ∇f (with parameter 1/L).
Proof: the convex functions
g(z) = f(z) − ∇f(x)^T z,  h(z) = f(z) − ∇f(y)^T z
have Lipschitz-continuous gradients, and minimizers z = x and z = y, respectively. Therefore, from the quadratic upper bound slide above,
g(x) ≤ g(y) − (1/(2L))‖∇g(y)‖_2^2,  h(y) ≤ h(x) − (1/(2L))‖∇h(x)‖_2^2.
Combining these two inequalities shows co-coercivity.

Extension: if in addition f is strongly convex with parameter m, then
g(x) = f(x) − (m/2)‖x‖_2^2
is convex and ∇g is Lipschitz continuous with parameter L − m. The co-coercivity property for g gives
(∇f(x) − ∇f(y))^T (x − y) ≥ (mL/(m + L))‖x − y‖_2^2 + (1/(m + L))‖∇f(x) − ∇f(y)‖_2^2
for all x, y ∈ dom f.

Analysis of gradient method
x^(k) = x^(k−1) − t_k ∇f(x^(k−1)), k = 1, 2, ...,
with fixed step size or backtracking line search.
Assumptions:
1. f is convex and differentiable with dom f = R^n
2. ∇f(x) is Lipschitz continuous with parameter L > 0
3. the optimal value f* = inf_x f(x) is finite and attained at x*

Analysis for constant step size
From the quadratic upper bound with y = x − t∇f(x):
f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2)‖∇f(x)‖_2^2
Therefore, if x^+ = x − t∇f(x) and 0 < t ≤ 1/L,
f(x^+) ≤ f(x) − (t/2)‖∇f(x)‖_2^2
      ≤ f* + ∇f(x)^T (x − x*) − (t/2)‖∇f(x)‖_2^2
      = f* + (1/(2t)) (‖x − x*‖_2^2 − ‖x − x* − t∇f(x)‖_2^2)
      = f* + (1/(2t)) (‖x − x*‖_2^2 − ‖x^+ − x*‖_2^2)

Take x = x^(i−1), x^+ = x^(i), t_i = t, and add the bounds for i = 1, ..., k:
Σ_{i=1}^k (f(x^(i)) − f*) ≤ (1/(2t)) Σ_{i=1}^k (‖x^(i−1) − x*‖_2^2 − ‖x^(i) − x*‖_2^2)
                        = (1/(2t)) (‖x^(0) − x*‖_2^2 − ‖x^(k) − x*‖_2^2)
                        ≤ (1/(2t)) ‖x^(0) − x*‖_2^2
Since f(x^(i)) is non-increasing,
f(x^(k)) − f* ≤ (1/k) Σ_{i=1}^k (f(x^(i)) − f*) ≤ (1/(2kt)) ‖x^(0) − x*‖_2^2
Conclusion: the number of iterations to reach f(x^(k)) − f* ≤ ε is O(1/ε).

Backtracking line search
Initialize t_k at t̂ > 0 (for example, t̂ = 1); take t_k := βt_k until
f(x − t_k ∇f(x)) < f(x) − αt_k ‖∇f(x)‖_2^2
(Figure: f(x − t∇f(x)) and the line f(x) − αt‖∇f(x)‖_2^2 as functions of t; accepted step sizes are those where the first curve lies below the line.)
Here 0 < β < 1; we will take α = 1/2 (mostly to simplify proofs).

Analysis for backtracking line search
Line search with α = 1/2: if ∇f is Lipschitz continuous, then for t ≤ 1/L
f(x − t∇f(x)) ≤ f(x) − t(1 − tL/2)‖∇f(x)‖_2^2 ≤ f(x) − (t/2)‖∇f(x)‖_2^2,
so the backtracking condition is satisfied once t ≤ 1/L.
(Figure: the two bounds as functions of t, meeting at t = 1/L.)
Hence the selected step size satisfies t_k ≥ t_min = min{t̂, β/L}.

Convergence analysis
From the constant step size analysis,
f(x^(i)) ≤ f* + (1/(2t_i)) (‖x^(i−1) − x*‖_2^2 − ‖x^(i) − x*‖_2^2)
        ≤ f* + (1/(2t_min)) (‖x^(i−1) − x*‖_2^2 − ‖x^(i) − x*‖_2^2)
Add the upper bounds to get
f(x^(k)) − f* ≤ (1/k) Σ_{i=1}^k (f(x^(i)) − f*) ≤ (1/(2k t_min)) ‖x^(0) − x*‖_2^2
Conclusion: the same 1/k bound as with constant step size.

Analysis for strongly convex functions
Better results exist if we add strong convexity to the assumptions above.
Analysis for constant step size: if x^+ = x − t∇f(x) and 0 < t ≤ 2/(m + L):
‖x^+ − x*‖_2^2 = ‖x − t∇f(x) − x*‖_2^2
              = ‖x − x*‖_2^2 − 2t∇f(x)^T (x − x*) + t^2 ‖∇f(x)‖_2^2
              ≤ (1 − t·2mL/(m + L)) ‖x − x*‖_2^2 + t(t − 2/(m + L)) ‖∇f(x)‖_2^2
              ≤ (1 − t·2mL/(m + L)) ‖x − x*‖_2^2
(The third step follows from the co-coercivity extension for strongly convex f.)

Distance to optimum
‖x^(k) − x*‖_2^2 ≤ c^k ‖x^(0) − x*‖_2^2,  c = 1 − t·2mL/(m + L),
which implies (linear) convergence. For t = 2/(m + L), we get c = ((κ − 1)/(κ + 1))^2 with κ = L/m.
Bound on the function value (from the quadratic upper bound):
f(x^(k)) − f* ≤ (L/2)‖x^(k) − x*‖_2^2 ≤ c^k (L/2) ‖x^(0) − x*‖_2^2
Conclusion: the number of iterations to reach f(x^(k)) − f* ≤ ε is O(log(1/ε)).
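A minimal numerical check of this contraction factor on the quadratic f(x) = (10x_1^2 + x_2^2)/2 used earlier, for which m = 1 and L = 10 (the starting point is an arbitrary choice):

```python
import numpy as np

m, L = 1.0, 10.0                              # Hessian of f is diag(10, 1)
grad_f = lambda x: np.array([10.0, 1.0]) * x

t = 2 / (m + L)
c = ((L / m - 1) / (L / m + 1)) ** 2          # predicted contraction factor per step

x_star = np.zeros(2)
x = np.array([10.0, 10.0])
d0 = np.sum((x - x_star) ** 2)
for k in range(1, 51):
    x = x - t * grad_f(x)
    assert np.sum((x - x_star) ** 2) <= c ** k * d0 + 1e-12
```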

A look at the conditions
A look at the conditions for a simple problem, f(β) = (1/2)‖y − Xβ‖_2^2:
Lipschitz continuity of ∇f: this means ∇²f(x) ⪯ LI. As ∇²f(β) = X^T X, we have L = σ_max^2(X).
Strong convexity of f: this means ∇²f(x) ⪰ mI. As ∇²f(β) = X^T X, we have m = σ_min^2(X).
If X is wide, i.e., X is n × p with p > n, then σ_min(X) = 0, and f can't be strongly convex.
Even if σ_min(X) > 0, we can have a very large condition number L/m = σ_max^2(X)/σ_min^2(X).
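A quick NumPy sketch of these quantities on random tall and wide design matrices (the dimensions and data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall X (n > p) can give strong convexity; wide X (p > n) has sigma_min(X) = 0, so m = 0.
for n, p in [(100, 20), (20, 100)]:
    X = rng.standard_normal((n, p))
    sig = np.linalg.svd(X, compute_uv=False)   # singular values of X
    L = sig.max() ** 2                         # Lipschitz constant of grad f
    m = sig.min() ** 2 if n >= p else 0.0      # strong convexity parameter of f
    cond = np.inf if m == 0 else L / m         # condition number L/m
    print(n, p, L, m, cond)
```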

Condition number
Condition number: Cond(A) = ‖A‖ ‖A^{−1}‖. For the ℓ2-norm we have Cond_2(A) = ‖A‖_2 ‖A^{−1}‖_2. If A is symmetric, we have Cond_2(A) = |λ_max(A)| / |λ_min(A)|.
Remark 1: Cond(A) ≥ 1.
Remark 2: If det(A) ≈ 0 (A is nearly singular), Cond(A) ≫ 1.

Explanation of condition number
Original problem: Ax = b. Perturbed problem: A(x + δx) = b + δb.
Subtracting the two equations, we have δx = A^{−1} δb.
Taking norms, we have ‖δx‖ ≤ ‖A^{−1}‖ ‖δb‖ = ‖A^{−1}‖ ‖Ax‖ (‖δb‖/‖b‖).
With the condition ‖Ax‖ ≤ ‖A‖ ‖x‖, we have
‖δx‖/‖x‖ ≤ Cond(A) ‖δb‖/‖b‖.
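A small NumPy check of this perturbation bound; the deliberately ill-conditioned matrix and the perturbation size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 1e-4]) @ rng.standard_normal((2, 2))   # deliberately ill-conditioned
x = rng.standard_normal(2)
b = A @ x

db = 1e-8 * rng.standard_normal(2)            # small perturbation of the right-hand side
dx = np.linalg.solve(A, b + db) - x           # resulting change in the solution

lhs = np.linalg.norm(dx) / np.linalg.norm(x)
rhs = np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs, rhs)   # lhs <= rhs; lhs can approach rhs when A is badly conditioned
```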

Condition number
The condition number characterizes stability:
If Cond(A) ≫ 1, stability is very bad; if Cond(A) ≈ 1, stability is good.
Lesson we should learn: we should avoid working with badly conditioned problems in applications!

Stability
Loosely speaking, stability indicates how sensitive the solution of a problem is to small relative changes in the input data. It is often quantified by the condition number of a problem or an algorithm.
Stability of the original problem ("well-posedness"): a linear system with a high condition number is a typical ill-posed example, which is called an unstable problem.
Stability of the numerical algorithm: this means the conditioning of the algorithm. Example: Gaussian elimination without pivoting is unstable for some linear systems.

A function f having a Lipschitz gradient and being strongly convex satisfies
mI ⪯ ∇²f(x) ⪯ LI for all x ∈ R^n,
for constants L > m > 0. Think of f being sandwiched between two quadratics.
This may seem like a strong condition to hold globally (for all x ∈ R^n). But a careful look at the proofs shows that we only need Lipschitz gradients/strong convexity over the sublevel set
S = {x : f(x) ≤ f(x^(0))}
This is less restrictive.

Practicalities
Stopping rule: stop when ‖∇f(x)‖_2 is small. Recall that ∇f(x*) = 0 at a solution x*. If f is strongly convex with parameter m, then
‖∇f(x)‖_2 ≤ √(2mε)  ⟹  f(x) − f* ≤ ε
Pros and cons of gradient descent:
Pro: simple idea, and each iteration is cheap
Pro: very fast for well-conditioned, strongly convex problems
Con: often slow, because interesting problems aren't strongly convex or well-conditioned
Con: can't handle nondifferentiable functions

Forward stagewise regression
Let's stick with f(β) = (1/2)‖y − Xβ‖_2^2, the linear regression setting; X is n × p, and its columns X_1, ..., X_p are predictor variables.
Forward stagewise regression: start with β^(0) = 0 and repeat:
Find the variable i such that |X_i^T r| is largest, where r = y − Xβ^(k−1) (largest absolute correlation with the residual).
Update β_i^(k) = β_i^(k−1) + γ · sign(X_i^T r).
Here γ > 0 is small and fixed, called the learning rate. This looks kind of like gradient descent.
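A direct NumPy sketch of these updates (the learning rate and number of steps are placeholder values):

```python
import numpy as np

def forward_stagewise(X, y, gamma=0.01, n_steps=1000):
    """Forward stagewise regression with a small fixed learning rate gamma."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        r = y - X @ beta                          # current residual
        i = np.argmax(np.abs(X.T @ r))            # variable most correlated with residual
        beta[i] += gamma * np.sign(X[:, i] @ r)   # nudge only coordinate i
    return beta
```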

Steepest descent
A close cousin to gradient descent; just change the choice of norm. Let p, q be complementary (dual): 1/p + 1/q = 1.
Steepest descent updates are x^+ = x + t·Δx, where
Δx = ‖∇f(x)‖_q · u,  u = argmin_{‖v‖_p ≤ 1} ∇f(x)^T v
If p = 2, then Δx = −∇f(x): gradient descent.
If p = 1, then Δx = −(∂f(x)/∂x_i) e_i, where
|∂f(x)/∂x_i| = max_{j=1,...,n} |∂f(x)/∂x_j| = ‖∇f(x)‖_∞
Normalized steepest descent just takes Δx = u (unit q-norm).

An interesting equivalence
Normalized steepest descent with respect to the ℓ1 norm: updates are
x_i^+ = x_i − t · sign(∂f(x)/∂x_i),
where i is the largest component of ∇f(x) in absolute value.
Compare forward stagewise: updates are β_i^+ = β_i + γ · sign(X_i^T r), r = y − Xβ.
But here f(β) = (1/2)‖y − Xβ‖_2^2, so ∇f(β) = −X^T(y − Xβ) and ∂f(β)/∂β_i = −X_i^T(y − Xβ).
Hence forward stagewise regression is normalized steepest descent under the ℓ1 norm (with fixed step size t = γ).
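A small check of this equivalence on random data (the data, step count, and learning rate are illustrative; both iterates start at zero):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((30, 5)), rng.standard_normal(30)
grad_f = lambda b: -X.T @ (y - X @ b)

gamma = 0.01
beta_sw = np.zeros(5)   # forward stagewise iterate
beta_sd = np.zeros(5)   # normalized l1 steepest descent iterate, step size t = gamma
for _ in range(200):
    r = y - X @ beta_sw
    i = np.argmax(np.abs(X.T @ r))
    beta_sw[i] += gamma * np.sign(X[:, i] @ r)

    g = grad_f(beta_sd)
    j = np.argmax(np.abs(g))
    beta_sd[j] -= gamma * np.sign(g[j])

assert np.allclose(beta_sw, beta_sd)   # the two update rules coincide
```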

Early stopping and sparse approximation
If we run forward stagewise to completion, then we will minimize f(β) = (1/2)‖y − Xβ‖_2^2, i.e., we will produce a least squares solution. What happens if we stop early?
It may seem strange from an optimization perspective (we are "under-optimizing")... but it is interesting from a statistical perspective, because stopping early gives us a sparse approximation to the least squares solution.
Compare the well-known sparse regression estimator, the lasso:
min_{β ∈ R^p} (1/2)‖y − Xβ‖_2^2 subject to ‖β‖_1 ≤ s
How do lasso solutions and forward stagewise estimates compare?

(Figure: coefficient paths for forward stagewise, plotted against ‖β^(k)‖_1, and for the lasso, plotted against ‖β̂(s)‖_1.)
For some problems (some y, X), they are exactly the same as the learning rate γ → 0!

Gradient boosting
Given observations y = (y_1, ..., y_n) ∈ R^n and predictor measurements x_i ∈ R^p, i = 1, ..., n.
We want to construct a flexible (nonlinear) model for the outcome based on the predictors. Weighted sum of trees:
θ_i = Σ_{j=1}^m β_j T_j(x_i),  i = 1, ..., n
Each tree T_j inputs the predictor measurements x_i and outputs a prediction. The trees are typically grown pretty short...

Pick a loss function L that reflects the setting; e.g., for continuous y, we could take L(y_i, θ_i) = (y_i − θ_i)^2.
We want to solve
min_{β ∈ R^M} Σ_{i=1}^n L(y_i, Σ_{j=1}^M β_j T_j(x_i))
Here j indexes all trees of a fixed size (e.g., depth = 5), so M is huge; the space is simply too big to optimize.
Gradient boosting: basically a version of gradient descent that is forced to work with trees. First think of the optimization as min_θ f(θ), over predicted values θ (subject to θ coming from trees).

Start with an initial model, e.g., fit a single tree θ^(0) = T_0. Repeat:
Evaluate the gradient g at the latest prediction θ^(k−1),
g_i = [∂L(y_i, θ_i)/∂θ_i] evaluated at θ_i = θ_i^(k−1),  i = 1, ..., n
Find a tree T_k that is close to −g, i.e., T_k solves
min_{trees T} Σ_{i=1}^n (−g_i − T(x_i))^2
This is not hard to (approximately) solve for a single tree.
Update our prediction: θ^(k) = θ^(k−1) + α_k T_k.
Note: the predictions are weighted sums of trees, as desired.
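A compact sketch of this loop for squared-error loss, where the negative gradient is simply the residual; it assumes scikit-learn's DecisionTreeRegressor for the tree fits, uses the mean as the initial model instead of a fitted tree, and fixes α_k to a constant learning rate (all simplifications for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1, depth=2):
    """Gradient boosting for L(y, theta) = (y - theta)^2 / 2."""
    theta = np.full(len(y), y.mean())            # simple initial prediction
    trees = []
    for _ in range(n_trees):
        neg_grad = y - theta                     # -g_i for squared-error loss
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad)
        theta = theta + learning_rate * tree.predict(X)   # theta^(k) = theta^(k-1) + alpha_k T_k
        trees.append(tree)
    return trees
```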

Can we do better?
Recall the O(1/ε) rate for gradient descent over the problem class of convex, differentiable functions with Lipschitz continuous gradients.
First-order method: an iterative method that updates x^(k) in
x^(0) + span{∇f(x^(0)), ∇f(x^(1)), ..., ∇f(x^(k−1))}
Theorem (Nesterov): For any k ≤ (n − 1)/2 and any starting point x^(0), there is a function f in the problem class such that any first-order method satisfies
f(x^(k)) − f* ≥ 3L‖x^(0) − x*‖_2^2 / (32(k + 1)^2)
Can we attain the rate O(1/k^2), i.e., O(1/√ε)? Answer: yes (and more)!