Optimization Tutorial 1: Basic Gradient Descent


E0 270 Machine Learning (Jan 16, 2015)
Optimization Tutorial 1: Basic Gradient Descent
Lecture by Harikrishna Narasimhan

Note: This tutorial assumes background in elementary calculus and linear algebra.

In many real-world applications in computer science, engineering, and economics, one encounters numerical optimization problems of the following form, where the goal is to find the minimum of a real-valued function $f : \mathbb{R}^d \to \mathbb{R}$ over a (closed) constraint set $\mathcal{C} \subseteq \mathbb{R}^d$:

$$\min_{x \in \mathcal{C}} f(x).$$

In this three-part tutorial, we shall discuss theory and algorithms for solving such problems. We begin in this part with the basic gradient descent method for solving unconstrained optimization problems, where $\mathcal{C} = \mathbb{R}^d$; in the next two parts of the tutorial, we shall focus on constrained optimization problems. (While we focus on minimization problems in this tutorial, the theory and methods discussed apply to maximization problems as well.)

One of the simplest examples of an unconstrained optimization problem is the following quadratic optimization problem, which we shall use as a running example in this document:

$$\min_{x \in \mathbb{R}^d} \; f_{\mathrm{quad}}(x) = \frac{1}{2} x^\top A x - b^\top x, \tag{OP1}$$

where $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix and $b \in \mathbb{R}^d$. Before we proceed to discuss methods for solving problems of the above form, we will find it useful to first provide some preliminaries on vector calculus and linear algebra.

1 Preliminaries

Throughout this document, we shall assume that the given function $f : \mathbb{R}^d \to \mathbb{R}$ that we wish to optimize is bounded below, i.e., there exists $B \in \mathbb{R}$ such that $f(x) \geq B$ for all $x \in \mathbb{R}^d$.

Gradient and Hessian. The gradient of a multi-dimensional function $f : \mathbb{R}^d \to \mathbb{R}$ is a generalization of the first derivative of a one-dimensional function, and is given by the vector of first-order partial derivatives of $f$ (if they exist):

$$\nabla f(x) = \left[ \frac{\partial f(x)}{\partial x_1}, \; \frac{\partial f(x)}{\partial x_2}, \; \ldots, \; \frac{\partial f(x)}{\partial x_d} \right]^\top.$$

The Hessian of $f$ is a generalization of the second derivative in one dimension, and is the matrix of second-order partial derivatives of $f$ (if they exist):

$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f(x)}{\partial x_d \partial x_1} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_d^2} \end{bmatrix}.$$
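To make these definitions concrete, here is a minimal Python sketch (the dimension and the particular symmetric $A$ and $b$ are arbitrary choices for illustration) that evaluates $f_{\mathrm{quad}}$ from (OP1) and checks its closed-form gradient $Ax - b$, derived in the next section, against a finite-difference approximation built from the partial-derivative definition above:

```python
import numpy as np

# Check the gradient definition on f_quad from (OP1).
# A is symmetrized so that grad f_quad(x) = A x - b, as noted in the text.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # arbitrary symmetric 4 x 4 matrix
b = rng.standard_normal(4)

def f_quad(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f_quad(x):
    return A @ x - b             # closed form (see next section)

# Central-difference approximation of each partial derivative at x0.
x0 = rng.standard_normal(4)
eps = 1e-6
fd_grad = np.array([(f_quad(x0 + eps * e) - f_quad(x0 - eps * e)) / (2 * eps)
                    for e in np.eye(4)])

print(np.allclose(fd_grad, grad_f_quad(x0), atol=1e-5))  # True
```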

[Figure 1: Plot of $f(x) = e^{-x}$ and its first-order and second-order Taylor approximations at $x_0 = 0$.]

It can be verified that for the quadratic function in (OP1), $\nabla f(x) = Ax - b$ and $\nabla^2 f(x) = A$. Also, let $C^k$ denote the set of all real-valued functions whose $k$th-order partial derivatives exist and are continuous.

Taylor series. We will often need to approximate a function in terms of its gradient and Hessian at a particular point $x_0 \in \mathbb{R}^d$. For this, we shall resort to the Taylor expansion of the function at $x_0$. More specifically, the first-order Taylor expansion of a function $f : \mathbb{R}^d \to \mathbb{R}$ in $C^1$ is given by

$$f(x) = f(x_0) + \nabla f(x_0)^\top (x - x_0) + o(\|x - x_0\|_2), \tag{1}$$

while the second-order Taylor expansion of a function $f$ in $C^2$ is given by

$$f(x) = f(x_0) + \nabla f(x_0)^\top (x - x_0) + \frac{1}{2} (x - x_0)^\top \nabla^2 f(x_0) (x - x_0) + o(\|x - x_0\|_2^2),$$

where $u(t) = o(v(t))$ denotes $\lim_{t \to 0} u(t)/v(t) = 0$, i.e., $u(t)$ goes to 0 faster than $v(t)$ as $t \to 0$; this tells us that when $\|x - x_0\|_2^2$ is sufficiently small, the last term in each of the above expansions can be effectively ignored. For the one-dimensional function $f(x) = e^{-x}$, the first-order Taylor approximation at 0 is $1 - x$, while the second-order Taylor approximation at 0 is $1 - x + \frac{1}{2} x^2$; see Figure 1 for an illustration.

Positive (semi-)definiteness. Finally, a matrix $A \in \mathbb{R}^{d \times d}$ is called positive semi-definite (p.s.d.) if $x^\top A x \geq 0$ for all $x \in \mathbb{R}^d$, $x \neq 0$, and positive definite (p.d.) if this inequality is strict; it can be verified that a p.s.d. (p.d.) matrix has non-negative (positive) eigenvalues.

2 Necessary and Sufficient Conditions for Optimality

As a first step towards solving an unconstrained optimization problem, we provide necessary and sufficient conditions for a minimizer of a function in $C^2$. While our ideal goal is to find the global minimizer of $f$, i.e., a point $x^* \in \mathbb{R}^d$ such that $f(x^*) \leq f(x)$ for all $x \in \mathbb{R}^d$, due to practical considerations one often needs to settle for a solution that is only locally optimal:

Definition 1. A point $x^* \in \mathbb{R}^d$ is said to be a local minimizer of a function $f : \mathbb{R}^d \to \mathbb{R}$ if there exists $\epsilon > 0$ such that $f(x^*) \leq f(x)$ for all $x \in \mathbb{R}^d$ within an $\epsilon$-distance of $x^*$, i.e., for all $x$ with $\|x - x^*\|_2 \leq \epsilon$.

The following characterization of a local minimizer of $f$ is well known; see for example [3].

Proposition 1 (Necessary and sufficient conditions for local optimality). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a function in $C^2$. If $x^* \in \mathbb{R}^d$ is a local minimizer of $f$, then
1. $\nabla f(x^*) = 0$, and
2. $\nabla^2 f(x^*)$ is p.s.d.
Moreover, if for a given $x^*$, $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is p.d., then $x^*$ is a local minimizer of $f$.

Let us apply the above characterization to the optimization problem in (OP1). Equating the gradient of $f_{\mathrm{quad}}$ to zero, we have $\nabla f_{\mathrm{quad}}(x^*) = Ax^* - b = 0$. If $A$ is invertible, then $x^* = A^{-1} b$; further, if $\nabla^2 f_{\mathrm{quad}}(x^*) = A$ is p.d., we have that $x^*$ is a local minimizer of $f_{\mathrm{quad}}$.

Clearly, in order to minimize a function $f$ over $\mathbb{R}^d$, we will have to find a point with zero gradient; the Hessian at this point can then be used to determine whether this point is indeed a (local) minimizer of $f$. In the next section, we describe a technique for finding a point in $\mathbb{R}^d$ where the gradient of $f$ is zero.
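The optimality conditions for (OP1) are easy to verify numerically. The sketch below (using an arbitrary positive definite $A$, constructed as $MM^\top + I$ so that it is symmetric p.d. by construction) computes the stationary point $x^* = A^{-1}b$ and confirms both parts of Proposition 1:

```python
import numpy as np

# Verify the optimality conditions of Proposition 1 for (OP1).
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)        # symmetric positive definite by construction
b = rng.standard_normal(4)

x_star = np.linalg.solve(A, b)          # x* = A^{-1} b

grad_at_star = A @ x_star - b           # first condition: gradient is zero
hessian_eigs = np.linalg.eigvalsh(A)    # Hessian of f_quad is A everywhere

print(np.allclose(grad_at_star, 0))     # True
print(np.all(hessian_eigs > 0))         # True: A is p.d., so x* is a minimizer
```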

[Figure 2: Plot of a one-dimensional function $f$. The minimizer of the function is indicated by a red mark. Note that the slope/gradient of $f$ at a point $x_1$ to the left of the minimizer is negative, while that at a point $x_2$ to the right of the minimizer is positive. The arrows indicate directions that yield lower values of $f$ compared to the current point.]

3 Basic Gradient Descent

We now describe an iterative method for optimizing a given function $f$; this method starts with an initial point and generates a sequence of points whose gradients converge to zero. We begin by building some intuition using the example one-dimensional function shown in Figure 2. Observe that at a point $x_1$ lying to the left of the minimizer of this function (indicated with a red mark), the slope/gradient of the function is negative; this suggests that one can decrease the value of the function by moving to the right of $x_1$. Similarly, at a point $x_2$ to the right of the minimizer, the slope/gradient of the function is positive, suggesting that one can decrease the function value by moving to the left of $x_2$. In both cases, moving in the direction opposite in sign to the slope at a point produces a decrease in the function value.

In fact, we shall now see that for any function $f : \mathbb{R}^d \to \mathbb{R}$ in $C^1$, one can decrease the value of $f$ at a point by moving in the direction opposite to the gradient of $f$ at that point. In particular, let $x_0 \in \mathbb{R}^d$, and let $y \in \mathbb{R}^d$ be obtained by subtracting from $x_0$ the gradient of $f$ at $x_0$ scaled by a factor $\eta > 0$, i.e., $y = x_0 - \eta \nabla f(x_0)$. From the first-order Taylor expansion of $f$ at $x_0$ (see Eq. (1)), we have

$$f(y) = f(x_0) + \nabla f(x_0)^\top (y - x_0) + o(\|y - x_0\|_2) = f(x_0) - \eta \|\nabla f(x_0)\|_2^2 + o(\eta \|\nabla f(x_0)\|_2).$$

Clearly, if $\eta$ is sufficiently small (and $\nabla f(x_0) \neq 0$), the third term in the above expression becomes negligible compared to the first two terms, giving us $f(y) < f(x_0)$. Based on this observation, we now describe an iterative method for optimizing $f$, where each iteration generates a new point by moving in the direction of the negative gradient at the current point; this method is popularly known as the gradient descent method.

Gradient Descent Method
    Input: $f : \mathbb{R}^d \to \mathbb{R}$
    Initialize: $x_0 \in \mathbb{R}^d$
    Parameter: number of iterations $T$
    for $t = 0$ to $T$:
        Select step-size $\eta_t > 0$
        $x_{t+1} = x_t - \eta_t \nabla f(x_t)$
    Output: $x_{T+1}$

As seen earlier, by choosing an appropriate step-size $\eta_t$ in each iteration $t$, we will have $f(x_{t+1}) < f(x_t)$. It is, however, not immediately clear how one should choose $\eta_t$ in each iteration: while a small value of $\eta_t$ will result in slow convergence, with a large value we will not be able to guarantee a decrease in $f$ after every iteration. Below, we describe two schemes for selecting $\eta_t$, both of which come with convergence guarantees.

Exact line search. A natural first-cut approach for step-size selection would be to choose a value that produces the maximum decrease in function value at the given iteration:

$$\eta_t \in \mathop{\mathrm{argmin}}_{\eta > 0} \; f(x_t - \eta \nabla f(x_t)). \tag{2}$$
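As a concrete illustration, here is a minimal Python sketch of the gradient descent loop on the running quadratic example ($A$, $b$, the starting point, and the iteration budget are arbitrary choices); the step-size is selected by exact line search, using the closed-form step for (OP1) derived in the next section:

```python
import numpy as np

# Gradient descent (pseudocode above) on f_quad from (OP1),
# with eta_t chosen by exact line search.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric p.d. (example choice)
b = np.array([1.0, -1.0])

x = np.zeros(2)                          # x_0
for t in range(50):                      # iteration budget T = 50
    g = A @ x - b                        # grad f_quad(x_t)
    if np.allclose(g, 0):                # stop once the gradient vanishes
        break
    eta = (g @ g) / (g @ A @ g)          # exact line search step for (OP1)
    x = x - eta * g                      # x_{t+1} = x_t - eta_t grad f(x_t)

print(np.allclose(x, np.linalg.solve(A, b)))  # True: converges to A^{-1} b
```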

[Figure 3: Illustration of backtracking line search for an example function $f : \mathbb{R}^d \to \mathbb{R}$. The solid black curve plots $f(x_{t+1}) = f(x_t - \eta_t \nabla f(x_t))$ as a function of the step-size $\eta_t$. The dotted blue curves plot the linear function $f(x_t) - \alpha \eta_t \|\nabla f(x_t)\|_2^2$ as a function of $\eta_t$ for $\alpha = 0$, $\alpha = 0.25$, and $\alpha = 1$. When $\alpha = 1$, this gives a line that lies below the solid curve; when $\alpha < 1$, we get a line that has portions above the solid curve, with $\alpha = 0$ corresponding to a horizontal line. Given a value of $\alpha \in (0, 1)$, the backtracking line search scheme aims to find a value of $\eta_t$ for which $f(x_t - \eta_t \nabla f(x_t)) \leq f(x_t) - \alpha \eta_t \|\nabla f(x_t)\|_2^2$; as seen from the plots, for any value of $\alpha \in (0, 1)$, such an $\eta_t$ always exists.]

Note that the line search sub-problem in Eq. (2) is itself an optimization problem (involving a single variable). In some cases, this sub-problem can be solved analytically, resulting in a simple closed-form solution for $\eta_t$. For example, in the case of the quadratic problem in (OP1), the exact line search scheme gives us

$$\eta_t = \frac{\|\nabla f(x_t)\|_2^2}{\nabla f(x_t)^\top A \, \nabla f(x_t)}$$

(to see this, note that $\eta \mapsto f_{\mathrm{quad}}(x_t - \eta \nabla f(x_t))$ is a one-dimensional quadratic whose derivative, $-\|\nabla f(x_t)\|_2^2 + \eta \, \nabla f(x_t)^\top A \, \nabla f(x_t)$, vanishes exactly at this value). However, in many optimization problems of practical importance, it is often quite expensive to solve this inner line search problem exactly. In such cases, one resorts to an approximate approach for solving Eq. (2); we next describe one such scheme.

Inexact line search. An alternative to the above exact line search approach is a simple heuristic where one starts with a large value of $\eta$ and multiplicatively decreases it until $f(x_t - \eta \nabla f(x_t)) < f(x_t)$. However, this scheme may not yield sufficient progress in each iteration to provide convergence guarantees for the gradient descent method. Instead, consider the following more sophisticated version of this heuristic, which does guarantee a sufficient decrease in the function value after each iteration [1]:

Inexact (Backtracking) Line Search
    Input: $f : \mathbb{R}^d \to \mathbb{R}$, current iterate $x_t$
    Parameters: $\alpha \in (0, 1)$, $\beta \in (0, 1)$
    $\eta_t = 1$
    while $f(x_t - \eta_t \nabla f(x_t)) > f(x_t) - \alpha \eta_t \|\nabla f(x_t)\|_2^2$:
        $\eta_t = \beta \eta_t$
    Output: $\eta_t$

Recall that the first-order Taylor approximation of $f(x_t - \eta_t \nabla f(x_t))$ is $f(x_t) - \eta_t \|\nabla f(x_t)\|_2^2$. The above scheme aims to find a step-size $\eta_t$ for which the new function value is at most a scaled-down version of this linear approximation, namely $f(x_t) - \alpha \eta_t \|\nabla f(x_t)\|_2^2$. As seen from Figure 3, such a value of $\eta_t$ always exists, ensuring that the above procedure terminates. Moreover, at termination we have

$$f(x_{t+1}) = f(x_t - \eta_t \nabla f(x_t)) \leq f(x_t) - \alpha \eta_t \|\nabla f(x_t)\|_2^2 < f(x_t),$$

as desired. Note that the parameter $\alpha$ gives us a handle on the amount of decrease in function value after each iteration, and will determine the speed of convergence of the gradient descent method.

Under (a slight variant of) the above backtracking line search scheme, and under a certain smoothness assumption on the function being optimized, the gradient descent method can be shown to generate a sequence of iterates whose gradients converge to zero; see for example [2] for a formal guarantee. Moreover, when the function being optimized satisfies certain additional properties, one can give explicit rates of convergence for the gradient descent method (under both the exact and backtracking line search schemes), i.e., bound the number of iterations required by the method to reach a solution whose function value is within a certain factor of the optimal value [1]. We shall discuss these guarantees in the second part of this tutorial.
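The backtracking scheme is equally simple to implement. Below is a short Python sketch (the parameter values $\alpha = 0.25$, $\beta = 0.5$ and the test problem are illustrative choices) that plugs the backtracking line search into the gradient descent loop:

```python
import numpy as np

def backtracking(f, grad_f, x, alpha=0.25, beta=0.5):
    """Shrink eta until the sufficient-decrease condition holds."""
    g = grad_f(x)
    eta = 1.0
    while f(x - eta * g) > f(x) - alpha * eta * (g @ g):
        eta = beta * eta
    return eta

def gradient_descent_bt(f, grad_f, x0, T=500):
    x = x0
    for _ in range(T):
        x = x - backtracking(f, grad_f, x) * grad_f(x)
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric p.d. test problem
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

x_hat = gradient_descent_bt(f, grad_f, np.zeros(2))
print(np.allclose(x_hat, np.linalg.solve(A, b)))  # True
```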

4 Next Lecture

In the next lecture, we shall discuss Newton's method for unconstrained optimization, which has better convergence guarantees than the gradient descent method described in the previous section. Following this, we shall start with constrained optimization and look at the Karush-Kuhn-Tucker (KKT) conditions for optimality of a solution to a constrained optimization problem.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, 2nd edition, 1987.
[3] D. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, 3rd edition, 2008.

Practice Exercise Questions

1. Second-order conditions for optimality. You have been asked to find the minimizer of a (bounded, continuous, twice differentiable) function $f : \mathbb{R}^2 \to \mathbb{R}$, and you run the gradient descent method on this function with three different initial points. Suppose the three runs of the method give you three different final points $x_1, x_2, x_3 \in \mathbb{R}^2$ with Hessian matrices

$$\nabla^2 f(x_1) = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}; \qquad \nabla^2 f(x_2) = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}; \qquad \nabla^2 f(x_3) = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}.$$

Which among the above points would you declare as a (local) minimizer of $f$?

2. Gradient descent with constant step size. As part of a course assignment, you are required to minimize the following quadratic function $g : \mathbb{R}^2 \to \mathbb{R}$ using the gradient descent method:

$$g(x_1, x_2) = 4x_1^2 - 2x_1 x_2 + 4x_2^2 - 5x_1 + 3x_2 - 1.$$

You choose to implement the exact line search strategy to select the step size in each iteration of the method, and obtain the minimizer of $g$. Surprisingly, your friend tells you that he obtained the same solution by running the steepest descent method on $g$ with a constant step size of 0.1. Can you give an explanation for why your friend's approach would work for $g$?

Hint: Observe that the step size used by your friend is the reciprocal of the largest eigenvalue of the Hessian of $g$ at any point.
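For readers who wish to experiment with Exercise 2 numerically, here is a small companion sketch (the starting point and iteration count are arbitrary choices) that runs gradient descent on $g$ with the friend's constant step size and compares the result with the exact minimizer:

```python
import numpy as np

# Numerical companion to Exercise 2: constant step size 0.1 on g.
H = np.array([[8.0, -2.0], [-2.0, 8.0]])    # Hessian of g (constant in x)
c = np.array([-5.0, 3.0])                    # linear-term coefficients of g

grad_g = lambda x: H @ x + c                 # grad g(x) = H x + c

x = np.zeros(2)
for _ in range(500):
    x = x - 0.1 * grad_g(x)                  # friend's constant step size

print(np.linalg.eigvalsh(H))                 # [ 6. 10.]; note 0.1 = 1/10
print(np.allclose(x, np.linalg.solve(H, -c)))  # True: same minimizer
```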