Lecture 5: Gradient Descent

10-725/36-725: Convex Optimization, Spring 2015
Lecturer: Ryan Tibshirani    Scribes: Loc Do

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor. Some of the content in these notes is borrowed from the book by Boyd & Vandenberghe.

This lecture is about Gradient Descent, the first algorithm in a series of first-order methods for solving optimization problems.

5.1 Unconstrained minimization problems and Gradient Descent

Consider unconstrained, smooth convex optimization problems of the form

    \min_x f(x),    (5.1)

where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice differentiable, with $\mathrm{dom}(f) = \mathbb{R}^n$ (no constraints). Assuming the problem is solvable, we denote the optimal value by $f^\star = \min_x f(x)$ and the optimal solution by $x^\star = \arg\min_x f(x)$.

To find $x^\star$, one approach is to solve the system of equations $\nabla f(x^\star) = 0$, which is often hard to do analytically. Hence an iterative scheme is preferred: compute a minimizing sequence of points $x^{(0)}, x^{(1)}, \ldots$ such that $f(x^{(k)}) \to f(x^\star)$ as $k \to \infty$. The algorithm stops when the residual between the current value and the optimal value is within an acceptable tolerance, i.e. $f(x^{(k)}) - f^\star \le \epsilon$.

Gradient Descent is an iterative algorithm producing such a minimizing sequence $x^{(k)}$ by repeating

    x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots,    (5.2)

where $t_k$ is the step size (or step length) at iteration $k$, and the initial point $x^{(0)} \in \mathbb{R}^n$ is usually given. We can see that, for suitably small $t_k$, $f(x^{(k)}) < f(x^{(k-1)})$ whenever $\nabla f(x^{(k-1)}) \neq 0$, by applying a first-order approximation to the left-hand side:

    f(x^{(k)}) = f\big(x^{(k-1)} - t_k \nabla f(x^{(k-1)})\big) \approx f(x^{(k-1)}) - t_k \|\nabla f(x^{(k-1)})\|_2^2 < f(x^{(k-1)}).    (5.3)

Basically, by following the points generated by Gradient Descent, we are guaranteed to move closer to the optimal value of the function. However, whether we reach the global optimum or only a local optimum depends on whether the function is convex or non-convex (illustrated in Figure 5.1).
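
To make the update concrete, here is a minimal sketch of iteration (5.2) in Python/NumPy with a fixed step size; the function names (gradient_descent, grad_f) and the default values of t, max_iter, and tol are illustrative assumptions, not part of the lecture.

    import numpy as np

    def gradient_descent(grad_f, x0, t=0.1, max_iter=1000, tol=1e-8):
        # Repeat x^(k) = x^(k-1) - t * grad f(x^(k-1)) (Equation 5.2),
        # here with a fixed step size t for simplicity.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= tol:  # gradient near zero: stop
                break
            x = x - t * g
        return x

For instance, gradient_descent(lambda x: 2 * x, np.array([5.0])) drives the iterates of $f(x) = x^2$ toward the minimizer 0.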

Figure 5.1: (a) Convex function. (b) Non-convex function. Gradient Descent paths with different starting points are shown in different colours. For a strictly convex function (panel a), paths starting from any point all lead to the global optimum. Conversely, for a non-convex function (panel b), different paths may end up at different local optima.

Interpretation of Gradient Descent. Now we consider a more formal interpretation of Gradient Descent. At each iteration, we want to move from our current point $x$ to a point $y$ with a smaller objective value. Since $f$ is twice differentiable, a quadratic (Taylor) approximation of $f$ gives

    f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f\big(\theta x + (1 - \theta) y\big)(y - x)    (5.4)

for some $\theta \in [0, 1]$. Replacing $\nabla^2 f(\theta x + (1 - \theta) y)$ by $\frac{1}{t} I$, we approximate $f(y)$ by

    f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2 =: g(y).    (5.5)

The first additive term is called the linear approximation, and the second is the proximity term. Basically, the proximity term tells us that we should not move too far from $x$; doing so makes $g(y)$ large. To find the optimal $y$, we solve $\nabla g(y) = 0$:

    \nabla f(x) + \frac{1}{t}(y - x) = 0 \iff y = x - t \nabla f(x).

Figure 5.2 illustrates this optimal movement.

5.2 How to choose step sizes

Recall that the update rule of Gradient Descent requires a step size $t_k$ controlling how far we move along the negative gradient at each iteration. A naive strategy is to set a constant $t_k = t$ for all iterations. This poses two problems: a $t$ that is too big can lead to divergence, with the iterates oscillating away from the optimal point, while a $t$ that is too small makes convergence very slow. A good choice of $t$ makes the algorithm converge quickly, as illustrated in Figure 5.3. Hence, we need good strategies for selecting appropriate step sizes. Two such approaches are backtracking line search and exact line search.

Backtracking line search

The main idea of this strategy is to pick step sizes that reduce $f$ approximately. It works as follows: first fix two parameters $0 < \beta < 1$ and $0 < \alpha \le 1/2$. At each iteration, start with $t = 1$, and while

    f(x - t \nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2,    (5.6)

shrink $t = \beta t$; then perform the Gradient Descent update $x^+ = x - t \nabla f(x)$.
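
As a minimal sketch (the function names are assumptions for illustration), one Gradient Descent step with this backtracking rule looks as follows in Python/NumPy:

    import numpy as np

    def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.8):
        # Shrink t while the backtracking condition (5.6) holds,
        # then take the update x+ = x - t * grad f(x).
        g = grad_f(x)
        t = 1.0
        while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
            t *= beta
        return x - t * g, t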

Figure 5.2: The solid curve depicts $f$; the dashed curve depicts $g$, the quadratic approximation of $f$ at $x$. The proximity term keeps $y$ from straying far from $x$ in search of lower values.

Figure 5.3: Gradient Descent on the function $f(x) = (10 x_1^2 + x_2^2)/2$ with a fixed step size: (a) $t$ is too large (after 8 steps); (b) $t$ is too small (after 100 steps); (c) $t$ is good (after 40 steps).

This approach will eventually terminate: for small enough $t$ we have

    f(x - t \nabla f(x)) \approx f(x) - t \|\nabla f(x)\|_2^2 < f(x) - \alpha t \|\nabla f(x)\|_2^2,    (5.7)

since $\alpha < 1$, so condition (5.6) eventually fails and the loop exits. The parameter $\alpha$ controls the decrease in $f$ we accept relative to that predicted by the linear approximation, while $\beta$ controls how fine-grained the search for $t$ is. As suggested in the Boyd & Vandenberghe book, $\alpha \in [0.01, 0.3]$ and $\beta \in [0.1, 0.8]$ are reasonable choices.
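
To illustrate, the sketch below runs the backtracking_step routine from above on the quadratic of Figure 5.3, with $\alpha = \beta = 0.5$ as in Figure 5.4 (the starting point and iteration count are arbitrary choices):

    import numpy as np

    # f(x) = (10*x1^2 + x2^2) / 2, the poorly conditioned quadratic of Figure 5.3.
    f = lambda x: (10 * x[0] ** 2 + x[1] ** 2) / 2
    grad_f = lambda x: np.array([10 * x[0], x[1]])

    x = np.array([1.0, 1.0])
    for _ in range(20):
        x, t = backtracking_step(f, grad_f, x, alpha=0.5, beta=0.5)
    print(x)  # approaches the minimizer (0, 0)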

Figure 5.4: Backtracking line search with $\alpha = \beta = 0.5$. The step size is adjusted appropriately at every iteration: large steps in early iterations, decreasing as the iterates approach the optimum.

Exact line search

Backtracking line search is often called inexact line search, since it selects a step size that reduces $f$ only approximately. Alternatively, we can choose the step size that does the best we can along the direction of the negative gradient:

    t = \arg\min_{s \ge 0} f(x - s \nabla f(x)).    (5.8)

This strategy is called exact line search. Usually it is not possible to solve Equation 5.8 exactly, and approximations to its solution are no more efficient than backtracking.
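
One notable special case where Equation 5.8 is solvable: for a quadratic $f(x) = \frac{1}{2} x^T Q x - b^T x$, setting the derivative of $s \mapsto f(x - s g)$ (with $g = \nabla f(x)$) to zero gives the closed form $s = \|g\|_2^2 / (g^T Q g)$. A small sketch with made-up $Q$ and $b$:

    import numpy as np

    # Exact line search on f(x) = x^T Q x / 2 - b^T x:
    # minimizing s -> f(x - s*g) gives s = (g^T g) / (g^T Q g).
    Q = np.array([[10.0, 0.0], [0.0, 1.0]])
    b = np.zeros(2)

    x = np.array([1.0, 1.0])
    for _ in range(10):
        g = Q @ x - b                # gradient of the quadratic
        s = (g @ g) / (g @ Q @ g)    # exact line search step size
        x = x - s * g
    print(x)  # converges toward the minimizer, here (0, 0)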

5.3 Convergence analysis

Assume that $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and that $\nabla f$ is Lipschitz continuous with constant $L > 0$. We have the following theorem.

Theorem 5.1 Gradient Descent with fixed step size $t \le 1/L$ satisfies

    f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}.    (5.9)

Proof: Since $\nabla f$ is Lipschitz continuous with constant $L > 0$, we have

    f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|_2^2 \quad \forall x, y.    (5.10)

(Note: the proof of the inequality in Equation 5.10 is part of Homework 1.) Suppose $y$ is computed from $x$ by the Gradient Descent update rule:

    y = x^+ = x - t \nabla f(x) \implies \nabla f(x) = \frac{x - x^+}{t}.    (5.11)

Substituting Equation 5.11 into Equation 5.10, we have

    f(x^+) \le f(x) + \nabla f(x)^T (-t \nabla f(x)) + \frac{L}{2} t^2 \|\nabla f(x)\|_2^2
            = f(x) - t \|\nabla f(x)\|_2^2 + \frac{L t^2}{2} \|\nabla f(x)\|_2^2
            = f(x) - \Big(1 - \frac{Lt}{2}\Big) t \|\nabla f(x)\|_2^2.    (5.12)

Noting that $(1 - Lt/2)\, t \ge t/2$ for $t \in (0, 1/L]$, we can bound Equation 5.12 by

    f(x^+) \le f(x) - \frac{t}{2} \|\nabla f(x)\|_2^2 = f(x) - \frac{t}{2} \Big\|\frac{x - x^+}{t}\Big\|_2^2 = f(x) - \frac{1}{2t} \|x - x^+\|_2^2.    (5.13)

By the convexity of $f$, we can write

    f(x^\star) \ge f(x) + \nabla f(x)^T (x^\star - x) \iff f(x) \le f(x^\star) + \nabla f(x)^T (x - x^\star).    (5.14)

Hence, substituting Equation 5.14 into Equation 5.13,

    f(x^+) \le f(x^\star) + \nabla f(x)^T (x - x^\star) - \frac{1}{2t} \|x - x^+\|_2^2
            = f(x^\star) + \frac{1}{t} (x - x^+)^T (x - x^\star) - \frac{1}{2t} \|x - x^+\|_2^2
            = f(x^\star) + \frac{1}{2t} \big( 2 (x - x^+)^T (x - x^\star) - \|x - x^+\|_2^2 \big)
            = f(x^\star) + \frac{1}{2t} \big( 2 (x - x^+)^T (x - x^\star) - \|x - x^+\|_2^2 - \|x - x^\star\|_2^2 + \|x - x^\star\|_2^2 \big)
            = f(x^\star) + \frac{1}{2t} \big( -\big( \|x - x^\star\|_2^2 - 2 (x - x^+)^T (x - x^\star) + \|x - x^+\|_2^2 \big) + \|x - x^\star\|_2^2 \big)
            = f(x^\star) + \frac{1}{2t} \big( -\|x^+ - x^\star\|_2^2 + \|x - x^\star\|_2^2 \big).    (5.15)

Moving $f(x^\star)$ to the left-hand side, we have

    f(x^+) - f(x^\star) \le \frac{1}{2t} \big( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 \big).    (5.16)

Summing Equation 5.16 over iterations $i = 1, \ldots, k$, the right-hand side telescopes:

    \sum_{i=1}^k \big( f(x^{(i)}) - f(x^\star) \big) \le \frac{1}{2t} \sum_{i=1}^k \big( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 \big)
        = \frac{1}{2t} \big( \|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2 \big)
        \le \frac{1}{2t} \|x^{(0)} - x^\star\|_2^2.    (5.17)

Since $f(x^{(k)}) \le \cdots \le f(x^{(1)})$, we have

    k \big( f(x^{(k)}) - f(x^\star) \big) \le \sum_{i=1}^k \big( f(x^{(i)}) - f(x^\star) \big) \le \frac{1}{2t} \|x^{(0)} - x^\star\|_2^2.    (5.18)

Dividing both sides by $k$ yields Theorem 5.1.
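
The bound of Theorem 5.1 can be checked numerically. The sketch below runs fixed-step Gradient Descent with $t = 1/L$ on a least-squares problem (the random data is an assumption for illustration) and verifies $f(x^{(k)}) - f^\star \le \|x^{(0)} - x^\star\|_2^2 / (2tk)$ at every iteration:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    b = rng.standard_normal(50)

    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)  # f(x) = ||Ax - b||^2 / 2
    L = np.linalg.eigvalsh(A.T @ A).max()         # Lipschitz constant of grad f
    t = 1.0 / L

    x_star = np.linalg.lstsq(A, b, rcond=None)[0] # optimal solution
    f_star = f(x_star)

    x = np.zeros(10)                              # x^(0) = 0
    for k in range(1, 201):
        x = x - t * (A.T @ (A @ x - b))           # gradient descent step
        bound = np.sum(x_star ** 2) / (2 * t * k) # ||x^(0) - x*||^2 / (2tk)
        assert f(x) - f_star <= bound + 1e-9      # Theorem 5.1 holds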

A couple of things to note from this convergence analysis:

- We say Gradient Descent has convergence rate $O(1/k)$. In other words, to get $f(x^{(k)}) - f^\star \le \epsilon$, we need $O(1/\epsilon)$ iterations.
- With $t = 1/L$, Equation 5.12 recovers the stopping condition of backtracking line search with $\alpha = 1/2$. Hence, backtracking line search with $\alpha = 1/2$, together with the Lipschitz-gradient condition, guarantees the same $O(1/k)$ convergence rate.

5.3.1 Convergence analysis for backtracking line search

Under the same assumptions as above, namely $f$ convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$ and Lipschitz gradient with constant $L > 0$, we have the following theorem.

Theorem 5.2 Gradient Descent with backtracking line search satisfies

    f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2 t_{\min} k},    (5.19)

where $t_{\min} = \min\{1, \beta/L\}$.

Backtracking line search thus achieves essentially the same rate as a fixed step size; the only difference is that the constant is now controlled by $t_{\min}$, which in turn depends on $\beta$. Taking $\beta$ close to 1 recovers the fixed-step rate, and any $\beta$ that is not too small gives a comparable guarantee.

5.3.2 Convergence analysis under strong convexity

Recall that a function $f$ is strongly convex with parameter $m > 0$ if $f(x) - \frac{m}{2}\|x\|_2^2$ is convex. Under the same assumptions as above (Lipschitz gradient), plus strong convexity, we have the following theorem.

Theorem 5.3 Gradient Descent with fixed step size $t \le 2/(m + L)$, or with backtracking line search, satisfies

    f(x^{(k)}) - f^\star \le c^k \frac{L}{2} \|x^{(0)} - x^\star\|_2^2,    (5.20)

where $0 < c < 1$.

(Note: the proof of this theorem is part of Homework 2.)

Theorem 5.3 gives an exponentially fast convergence rate, $O(c^k)$. In other words, to get $f(x^{(k)}) - f^\star \le \epsilon$, we only need $O(\log(1/\epsilon))$ iterations; hence this rate is also known as linear convergence. The constant $c$ depends adversely on the condition number $L/m$: a higher condition number leads to a slower rate.
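
To see the linear rate in action, the sketch below runs Gradient Descent with $t = 2/(m + L)$ on a strongly convex diagonal quadratic (made-up data) and checks that the suboptimality gap shrinks by at least a constant factor per iteration:

    import numpy as np

    rng = np.random.default_rng(1)
    q = rng.uniform(1.0, 10.0, size=5)      # eigenvalues of the diagonal Hessian
    m, L = q.min(), q.max()                 # strong convexity / Lipschitz constants

    f = lambda x: 0.5 * np.sum(q * x ** 2)  # f* = 0 at x* = 0
    x = rng.standard_normal(5)
    t = 2.0 / (m + L)

    gaps = []
    for _ in range(50):
        gaps.append(f(x))
        x = x - t * (q * x)                 # gradient step: grad f(x) = q * x
    ratios = np.array(gaps[1:]) / np.array(gaps[:-1])
    print(ratios.max())                     # stays below some c < 1: linear convergence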

5.3.3 Example of the conditions

Consider $f(\beta) = \frac{1}{2} \|y - X\beta\|_2^2$.

Lipschitz continuity of $\nabla f$: this requires $\nabla^2 f(\beta) \preceq LI$. As $\nabla^2 f(\beta) = X^T X$, we have $L = \sigma_{\max}^2(X)$ (i.e. the largest eigenvalue of $X^T X$).

Strong convexity of $f$: this requires $\nabla^2 f(\beta) \succeq mI$. As $\nabla^2 f(\beta) = X^T X$, we have $m = \sigma_{\min}^2(X)$. If $X$ is wide, i.e. $X$ is $n \times p$ with $p > n$ (more columns than rows, i.e. more features than observations), then $\sigma_{\min}^2(X) = 0$ ($X^T X$ is rank deficient) and $f$ cannot be strongly convex. Even if $\sigma_{\min}(X) > 0$, the condition number $L/m = \sigma_{\max}^2(X)/\sigma_{\min}^2(X)$ can be very large.

5.3.4 Practicalities

We now know that Gradient Descent converges under some assumptions, after a sufficiently large number of iterations. A good stopping condition in practice is that $\|\nabla f(x)\|_2$ is small (recall that $\nabla f(x^\star) = 0$ at the solution $x^\star$). If $f$ is strongly convex with parameter $m$, then we can stop when

    \|\nabla f(x)\|_2 \le \sqrt{2 m \epsilon} \implies f(x) - f^\star \le \epsilon.    (5.21)

Pros and cons of Gradient Descent:

- Pro: simple idea (iteratively move in the direction of steepest descent), and each iteration is cheap (it only requires computing the gradient).
- Pro: very fast for well-conditioned, strongly convex problems.
- Con: often slow, because interesting problems tend not to be strongly convex or well-conditioned.
- Con: cannot handle non-differentiable functions.

5.4 Summary

Gradient Descent belongs to the family of first-order methods: algorithms that update $x^{(k)}$ within the linear space

    x^{(0)} + \mathrm{span}\{\nabla f(x^{(0)}), \ldots, \nabla f(x^{(k-1)})\}.    (5.22)

The algorithm needs $O(1/\epsilon)$ iterations over the problem class of convex, differentiable functions with Lipschitz continuous gradients.

Theorem 5.4 For any iteration number $k \le (n - 1)/2$ and any starting point $x^{(0)}$, there is a function $f$ in the problem class such that any first-order method satisfies

    f(x^{(k)}) - f^\star \ge \frac{3 L \|x^{(0)} - x^\star\|_2^2}{32 (k + 1)^2}.    (5.23)

This theorem sets a lower bound on the rate: no first-order method can converge faster than $O(1/k^2)$. Plain Gradient Descent, at $O(1/k)$, does not attain this bound, but it can be tweaked to achieve the better $O(1/k^2)$ rate, i.e. $O(1/\sqrt{\epsilon})$ iterations (e.g. using acceleration techniques). Besides, there are other algorithms that allow us to solve more general problems, e.g. those involving non-differentiable functions.
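
Acceleration is the subject of a later lecture; purely as a preview, here is a minimal sketch of one standard accelerated scheme (Nesterov-style momentum with weight $(k-1)/(k+2)$ and fixed step $1/L$; the function names are assumptions), which attains the $O(1/k^2)$ rate matching the lower bound of Theorem 5.4:

    import numpy as np

    def accelerated_gd(grad_f, x0, L, max_iter=100):
        # Take the gradient step from a momentum-extrapolated point v
        # rather than from the current iterate x.
        x = np.asarray(x0, dtype=float)
        v = x.copy()
        t = 1.0 / L
        for k in range(1, max_iter + 1):
            x_next = v - t * grad_f(v)
            v = x_next + (k - 1) / (k + 2) * (x_next - x)  # momentum extrapolation
            x = x_next
        return x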