Unconstrained minimization of smooth functions

We want to solve
$$\min_{x \in \mathbb{R}^N} f(x),$$
where $f$ is convex. In this section, we will assume that $f$ is differentiable (so its gradient exists at every point) and smooth (we will consider a few different definitions of smooth; qualitatively, this just means that the gradient changes in a controlled manner as we move from point to point). While many problems are smooth, methods for nonsmooth $f(x)$ are also of great interest, and will (hopefully) be covered later in the course. Nonsmooth methods are not much more involved algorithmically, but they are slightly harder to analyze.

Since $f$ is convex, a necessary and sufficient condition for $x^\star$ to be a minimizer is that the gradient vanishes:
$$\nabla f(x^\star) = 0.$$
It is not a given that such an $x^\star$ exists; it is possible that $f(x)$ is unbounded below. In this section, we will assume that $f$ does have (at least one) minimizer, and we will denote the optimal value as $p^\star = f(x^\star)$.
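As a concrete illustration, consider the least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, which is smooth and convex; setting $\nabla f(x) = A^T(Ax - b)$ to zero gives the normal equations. A minimal numerical sketch (the particular $A$, $b$, and the helper name `grad_f` are arbitrary choices for this example):

```python
import numpy as np

# Example smooth convex objective: f(x) = 0.5 * ||Ax - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(x):
    # Gradient of the least-squares objective: A^T (Ax - b).
    return A.T @ (A @ x - b)

# The minimizer solves the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient vanishes (up to round-off) at the minimizer.
print(np.linalg.norm(grad_f(x_star)))
```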

Every general-purpose optimization algorithm we will look at in this course is iterative; they will all have the basic form:

    x^(0) = initial guess
    k = 0
    do
        k = k + 1
        calculate a direction to move d^(k)
        calculate a step size t_k >= 0
        x^(k) = x^(k-1) + t_k d^(k)
        check convergence criteria
    until converged

In the coming lectures, we will focus primarily on two methods for computing the direction $d^{(k)}$.

1. Gradient descent: we take $d^{(k)} = -\nabla f(x^{(k-1)})$. This is the direction of steepest descent (where "steepest" is defined relative to the Euclidean norm). Gradient descent iterations are cheap, but typically many iterations are required for convergence.

2. Newton's method: we build a quadratic model around the current point $x^{(k-1)}$, then compute the exact minimizer of this quadratic by solving a system of equations. This corresponds to taking $d^{(k)} = -\left(\nabla^2 f(x^{(k-1)})\right)^{-1} \nabla f(x^{(k-1)})$, that is, the inverse of the Hessian evaluated at $x^{(k-1)}$ applied to the gradient evaluated at the same point. Newton iterations tend to be expensive (as they require a system solve), but they typically converge in far fewer iterations than gradient descent.
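As a quick sketch of how these two directions might be computed numerically (the quadratic test objective and variable names below are arbitrary choices for illustration), note that the Newton direction is obtained by solving the linear system $\nabla^2 f(x)\, d = -\nabla f(x)$ rather than explicitly forming the inverse Hessian:

```python
import numpy as np

# Illustrative smooth convex objective: f(x) = 0.5 x^T Q x - c^T x, Q symmetric positive definite.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
c = np.array([1.0, -1.0])

grad = lambda x: Q @ x - c     # gradient
hess = lambda x: Q             # Hessian (constant for a quadratic)

x = np.array([5.0, -3.0])

d_gd = -grad(x)                                 # gradient descent direction
d_newton = np.linalg.solve(hess(x), -grad(x))   # Newton direction: solve H d = -g

print(d_gd, d_newton)
```

For this quadratic, $x + d_{\mathrm{newton}}$ is already the exact minimizer, which is why Newton's method converges in a single step on quadratics.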

Whichever direction we choose, it should be a descent direction; $d^{(k)}$ should satisfy
$$\langle d^{(k)}, \nabla f(x^{(k-1)}) \rangle < 0.$$
Since $f$ is convex, it is always true that
$$f(x + td) \geq f(x) + t\,\langle d, \nabla f(x) \rangle,$$
and so to decrease the value of the functional while moving in direction $d$, it is necessary that the inner product above be negative.

Line search

After the step direction is chosen, we need to compute how far to move. There are many methods for doing this; here are three:

Exact: Solve the 1D optimization program
$$\min_{s \geq 0} f(x^{(k-1)} + s\,d^{(k)}).$$
This is typically not worth the trouble, but there are instances (e.g. unconstrained convex quadratic programs) when it can be solved analytically.

Backtracking: Start with a step size of $t = 1$, then decrease it by a factor of $\beta$ until we are below a certain line.

Fix $\alpha \in (0, 0.5)$ and $\beta \in (0, 1)$. Given a starting point $x$ and direction $d$:

    t = 1
    repeat
        if f(x + t d) < f(x) + α t ⟨d, ∇f(x)⟩
            converged
        else
            t = β t
        endif
    until converged

The backtracking line search tends to be cheap, and works very well in practice. Typically, we take $\alpha$ small (to encourage a large step) and $\beta \in [0.3, 0.8]$. A runnable sketch of this procedure is given after the list of step-size rules.

Fixed: Finally, we can just use a constant step size $t_k = t$. This will actually work if the step size is small enough, but usually this results in way too many iterations.
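Putting the pieces together, here is a rough sketch of gradient descent with backtracking in Python; the test function, tolerance, and parameter values are arbitrary choices for the example rather than recommendations:

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha=0.1, beta=0.5):
    """Shrink t by beta until the sufficient-decrease condition holds."""
    t = 1.0
    g = grad_f(x)
    while f(x + t * d) >= f(x) + alpha * t * np.dot(d, g):
        t *= beta
    return t

def gradient_descent(f, grad_f, x0, tol=1e-8, max_iter=1000):
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:      # stop when the gradient is small
            break
        d = -g                            # steepest descent direction
        t = backtracking(f, grad_f, x, d)
        x = x + t * d
    return x

# Example: a simple smooth convex quadratic with minimizer at the origin.
Q = np.diag([10.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
print(gradient_descent(f, grad_f, np.array([1.0, 1.0])))
```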

Convergence of gradient descent

How quickly does gradient descent converge? I'm glad you asked! We will look at two different smoothness assumptions on $f$, and translate them into convergence rates.

In the first case, we assume that $f$ is twice differentiable (so that its Hessian exists everywhere), and that
$$mI \preceq \nabla^2 f(x) \preceq MI.$$
That is, the eigenvalues of the Hessian (which, for a convex function, is always symmetric positive semidefinite) are bounded between $m > 0$ and $M < \infty$. We call this assumption strong convexity (note that the lower bound by itself means that $f$ is strictly convex).

Recall that the main consequence of convexity is that we have a way to compute a linear global lower bound at every point:
$$f(y) \geq f(x) + \langle y - x, \nabla f(x) \rangle, \quad \text{for all } x, y.$$
The main consequence of strong convexity is that we have in addition a quadratic lower and upper bound. By Taylor's theorem, we know that for any $x, y$,
$$f(y) = f(x) + \langle y - x, \nabla f(x) \rangle + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x)$$
for some point $z$ on the line segment between $x$ and $y$. Thus we have the bounds
$$f(y) \geq f(x) + \langle y - x, \nabla f(x) \rangle + \frac{m}{2}\|y - x\|_2^2, \quad (1)$$

and
$$f(y) \leq f(x) + \langle y - x, \nabla f(x) \rangle + \frac{M}{2}\|y - x\|_2^2. \quad (2)$$

An immediate consequence of the stronger lower bound in (1) is that we can tell at any point $x$ how close to optimal we are. The smallest value the right hand side can take in that inequality is achieved at $\tilde{y} = x - \frac{1}{m}\nabla f(x)$; plugging that value into the right hand side yields
$$f(y) \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2,$$
a bound which holds for every $f(y)$. In particular, it holds for the optimal value $p^\star$, and so
$$p^\star \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2. \quad (3)$$
Thus we get a bound on $f(x) - p^\star$ just by calculating the norm of the gradient. It also re-iterates that if $\|\nabla f(x)\|_2$ is small, we are close to the solution.

We can also bound how close $x$ is to the minimizer $x^\star$. Using (1) with $y = x^\star$, and then the Cauchy-Schwarz inequality,
$$p^\star = f(x^\star) \geq f(x) + \langle x^\star - x, \nabla f(x) \rangle + \frac{m}{2}\|x^\star - x\|_2^2 \geq f(x) - \|x^\star - x\|_2 \|\nabla f(x)\|_2 + \frac{m}{2}\|x^\star - x\|_2^2.$$
Then since $p^\star \leq f(x)$,
$$-\|x^\star - x\|_2 \|\nabla f(x)\|_2 + \frac{m}{2}\|x^\star - x\|_2^2 \leq 0,$$

and so
$$\|x - x^\star\|_2 \leq \frac{2}{m}\|\nabla f(x)\|_2.$$
These facts are very useful for defining stopping criteria.

Suppose we have a strongly convex function that we minimize using gradient descent with an exact line search. We can show that at each iteration, the gap $f(x^{(k)}) - p^\star$ gets cut down by a fixed factor. We consider a single iteration; we will use $x$ to denote the current point, and $x^+ = x - t_{\mathrm{exact}}\nabla f(x)$ to denote the result of the gradient step. We choose $t_{\mathrm{exact}}$ by minimizing the following function:
$$\tilde{f}(t) = f(x - t\nabla f(x)).$$
By strong convexity (the upper bound (2)), we know that
$$\tilde{f}(t) \leq f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2.$$
By definition of $t_{\mathrm{exact}}$, we know
$$f(x^+) = \tilde{f}(t_{\mathrm{exact}}) \leq \tilde{f}(1/M) \leq f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
From (3), we also know that $\|\nabla f(x)\|_2^2 \geq 2m(f(x) - p^\star)$, and so
$$f(x^+) - p^\star \leq f(x) - p^\star - \frac{m}{M}(f(x) - p^\star),$$
which means
$$f(x^+) - p^\star \leq \left(1 - \frac{m}{M}\right)(f(x) - p^\star).$$
That is, the gap between the current functional evaluation and the optimal value has been cut down by a factor of $1 - m/M < 1$.
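To see the contraction numerically, here is a small sketch on a strongly convex quadratic (the matrix, starting point, and iteration count are arbitrary choices for the example). For $f(x) = \frac{1}{2}x^T Q x$, the exact line search step has the closed form $t_{\mathrm{exact}} = \|\nabla f(x)\|_2^2 / (\nabla f(x)^T Q\, \nabla f(x))$, and the observed per-iteration ratio of the gaps should be at most $1 - m/M$:

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5 x^T Q x, with m and M the extreme eigenvalues of Q.
Q = np.diag([1.0, 10.0])           # m = 1, M = 10, so the guaranteed factor is 1 - m/M = 0.9
m, M = 1.0, 10.0
f = lambda x: 0.5 * x @ Q @ x      # p* = 0, attained at x* = 0
grad = lambda x: Q @ x

x = np.array([1.0, 1.0])
gap = f(x)
for k in range(10):
    g = grad(x)
    if np.linalg.norm(g) < 1e-15:
        break
    t_exact = (g @ g) / (g @ Q @ g)            # exact line search step for a quadratic
    x = x - t_exact * g
    new_gap = f(x)
    print(k, new_gap / gap, "<=", 1 - m / M)   # observed ratio vs. guaranteed bound
    gap = new_gap
```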

Applying this relationship recursively, we see that after $k$ iterations of gradient descent, we have
$$f(x^{(k)}) - p^\star \leq \left(1 - \frac{m}{M}\right)^k \left(f(x^{(0)}) - p^\star\right).$$
Another way to say this is that we can achieve accuracy $f(x^{(k)}) - p^\star \leq \epsilon$ by taking
$$k \geq \frac{\log(E_0/\epsilon)}{-\log(1 - m/M)}, \quad E_0 = f(x^{(0)}) - p^\star,$$
steps.

There are similar results for gradient descent on strongly convex functions using backtracking. They are similar in nature; they have the same linear convergence, but with constants that depend on $\alpha$ and $\beta$ along with $m$ and $M$. See [?, p. 468].

Lipschitz gradient condition

We can also get (much weaker) convergence results when $f$ is not strongly convex (or even necessarily twice differentiable), but rather has a Lipschitz gradient. This means that there exists an $L > 0$ such that
$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L\|x - y\|_2. \quad (4)$$

The upshot here is that we still have a quadratic upper bound for $f$ at every point, but not the lower bound. To be precise, if $f$ obeys (4), then
$$f(y) \leq f(x) + \langle y - x, \nabla f(x) \rangle + \frac{L}{2}\|y - x\|_2^2.$$
This follows immediately from the fundamental theorem of calculus¹:
$$f(y) - f(x) = \int_0^1 \langle y - x, \nabla f((1-t)x + ty) \rangle \, dt,$$
and so
$$f(y) - f(x) - \langle y - x, \nabla f(x) \rangle = \int_0^1 \langle y - x, \nabla f((1-t)x + ty) - \nabla f(x) \rangle \, dt$$
$$\leq \int_0^1 \|y - x\|_2 \, \|\nabla f((1-t)x + ty) - \nabla f(x)\|_2 \, dt \leq L\|y - x\|_2^2 \int_0^1 t \, dt = \frac{L}{2}\|y - x\|_2^2.$$

¹ The 1D function $h(t) = f((1-t)x + ty)$ has derivative $h'(t) = \langle y - x, \nabla f((1-t)x + ty) \rangle$. This is just a simple application of the chain rule.

Now, let's consider running gradient descent on such a function with a fixed step size $t \leq 1/L$. As before, we denote the current iterate as $x$ and the next iterate as $x^+ = x - t\nabla f(x)$. We have
$$f(x^+) \leq f(x) - \left(1 - \frac{Lt}{2}\right) t \|\nabla f(x)\|_2^2 \leq f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2,$$

for the range of $t$ we are considering. By convexity of $f$,
$$f(x) \leq f(x^\star) + \langle x - x^\star, \nabla f(x) \rangle,$$
where $x^\star$ is a minimizer of $f$, and so
$$f(x^+) \leq f(x^\star) + \langle x - x^\star, \nabla f(x) \rangle - \frac{t}{2}\|\nabla f(x)\|_2^2,$$
and then substituting $\nabla f(x) = (x - x^+)/t$ yields
$$f(x^+) - f(x^\star) \leq \frac{1}{t}\langle x - x^\star, x - x^+ \rangle - \frac{1}{2t}\|x - x^+\|_2^2 = \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\right).$$
Summing this difference over $k$ iterations yields
$$\sum_{i=1}^{k} \left(f(x^{(i)}) - f(x^\star)\right) \leq \frac{1}{2t}\sum_{i=1}^{k}\left(\|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2\right) = \frac{1}{2t}\left(\|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\right) \leq \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2.$$
Since $f(x^{(i)}) - f(x^\star)$ is monotonically decreasing in $i$, the $k$th term will be smaller than the average:
$$f(x^{(k)}) - f(x^\star) \leq \frac{1}{k}\sum_{i=1}^{k}\left(f(x^{(i)}) - f(x^\star)\right) \leq \frac{1}{2tk}\|x^{(0)} - x^\star\|_2^2.$$

Note that this convergence guarantee is much slower; it is $O(1/k)$ in place of $O(c^k)$ for some $c < 1$. This is the price we pay for allowing $f$ to not be as smooth.
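As a closing sketch, the $O(1/k)$ guarantee can be checked numerically. The test function below, $f(x) = \sum_i \log\cosh(x_i)$, is an arbitrary choice of a convex function with a $1$-Lipschitz gradient (so $L = 1$) that is not strongly convex; running fixed-step gradient descent with $t = 1/L$, the gap $f(x^{(k)}) - p^\star$ stays below $\|x^{(0)} - x^\star\|_2^2/(2tk)$:

```python
import numpy as np

# Convex, L-smooth (L = 1), but not strongly convex: f(x) = sum_i log(cosh(x_i)),
# whose gradient tanh(x) satisfies 0 < f''(x) = sech^2(x) <= 1.
f = lambda x: np.sum(np.log(np.cosh(x)))
grad = lambda x: np.tanh(x)

L = 1.0
t = 1.0 / L                          # fixed step size t = 1/L
x0 = np.array([3.0, -2.0])
x_star, p_star = np.zeros(2), 0.0    # minimizer and optimal value for this particular f

x = x0.copy()
for k in range(1, 51):
    x = x - t * grad(x)
    bound = np.linalg.norm(x0 - x_star) ** 2 / (2 * t * k)
    if k % 10 == 0:
        print(k, f(x) - p_star, "<=", bound)
```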