Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Instructor: Moritz Hardt
Email: hardt+ee227c@berkeley.edu
Graduate Instructor: Max Simchowitz
Email: msimchow+ee227c@berkeley.edu

October 15, 2018

6 Discovering acceleration

In this lecture, we seek to find methods that converge faster than those discussed in previous lectures. To derive this accelerated method, we start by considering the special case of optimizing quadratic functions. Our exposition loosely follows Chapter 17 in Lax's excellent text [Lax07].

6.1 Quadratics

Definition 6.1 (Quadratic function). A quadratic function $f \colon \mathbb{R}^n \to \mathbb{R}$ takes the form
$$f(x) = \frac{1}{2} x^T A x - b^T x + c,$$
where $A \in S^n$, $b \in \mathbb{R}^n$ and $c \in \mathbb{R}$.

Note that substituting $n = 1$ into the above definition recovers the familiar univariate quadratic function $f(x) = ax^2 + bx + c$ with $a, b, c \in \mathbb{R}$, as expected. There is one subtlety in this definition: we restrict $A$ to be symmetric. In fact, we could allow $A \in \mathbb{R}^{n \times n}$ and this would define the same class of functions, since for any $A \in \mathbb{R}^{n \times n}$ there is a symmetric matrix $\tilde{A} = \frac{1}{2}\left(A + A^T\right)$ for which
$$x^T A x = x^T \tilde{A} x \quad \text{for all } x \in \mathbb{R}^n.$$
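As a quick sanity check of the symmetrization claim, the following minimal sketch (not part of the notes; it assumes numpy is available) draws a random non-symmetric matrix and verifies that its quadratic form agrees with that of its symmetric part:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))      # generally non-symmetric
A_sym = 0.5 * (A + A.T)              # symmetric part of A

x = rng.standard_normal(n)
# Both quadratic forms agree, so restricting to symmetric A loses no generality.
print(x @ A @ x, x @ A_sym @ x)      # identical up to floating-point error
```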

Restricting $A \in S^n$ ensures that each quadratic function has a unique representation.

The gradient and Hessian of a general quadratic function take the form
$$\nabla f(x) = Ax - b, \qquad \nabla^2 f(x) = A.$$
Note that, provided $A$ is non-singular, the quadratic has a unique critical point at
$$x^\star = A^{-1} b.$$
When $A \succ 0$, the quadratic is strictly convex and this point is the unique global minimum.

6.2 Gradient descent on a quadratic

In this section we will consider a quadratic $f(x)$ where $A$ is positive definite, and in particular
$$\alpha I \preceq A \preceq \beta I$$
for some $0 < \alpha < \beta$. This implies that $f$ is $\alpha$-strongly convex and $\beta$-smooth. From Theorem ?? we know that under these conditions, gradient descent with the appropriate step size converges linearly at the rate $\exp\left(-t \frac{\alpha}{\beta}\right)$. Clearly the size of $\frac{\alpha}{\beta}$ can dramatically affect the convergence guarantee. In fact, in the case of a quadratic, this is related to the condition number of the matrix $A$.

Definition 6.2 (Condition number). Let $A$ be a real matrix. Its condition number (with respect to the Euclidean norm) is
$$\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},$$
the ratio of its largest and smallest singular values.

So in particular, we have $\kappa(A) \leq \frac{\beta}{\alpha}$; henceforth, we will assume that $\alpha$ and $\beta$ correspond to the minimal and maximal eigenvalues of $A$, so that $\kappa(A) = \frac{\beta}{\alpha}$. It follows from Theorem ?? that gradient descent with a constant step size $\frac{1}{\beta}$ converges as
$$\|x_{t+1} - x^\star\| \leq \exp\left(-\frac{t}{\kappa}\right) \|x_1 - x^\star\|.$$
In many cases, the function $f$ is ill-conditioned and $\kappa$ can easily take large values. In this case, convergence can be very slow, since we will need $t > \kappa$ before the error gets small. Can we do better than this?

To answer this question, it will be instructive to analyze gradient descent specifically for quadratic functions, and re-derive the convergence bound that we previously proved for any strongly convex smooth function. This exercise will show us where we are losing performance, and suggest a method that can attain better guarantees.
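To make the rate concrete, here is a small numerical sketch (my own illustration, not from the notes; it assumes numpy and a hand-picked diagonal matrix) that runs gradient descent with step size $1/\beta$ on an ill-conditioned quadratic and tracks the relative error, which should shrink roughly like $\exp(-t/\kappa)$:

```python
import numpy as np

# Quadratic f(x) = 1/2 x^T A x - b^T x with eigenvalues in [alpha, beta].
eigs = np.array([1.0, 2.0, 5.0, 10.0])           # alpha = 1, beta = 10, kappa = 10
A = np.diag(eigs)
b = np.ones(len(eigs))
x_star = np.linalg.solve(A, b)                   # unique minimizer A^{-1} b

alpha, beta = eigs.min(), eigs.max()
kappa = beta / alpha

x = np.zeros_like(b)
for t in range(1, 101):
    x = x - (1.0 / beta) * (A @ x - b)           # gradient step with eta = 1/beta
    if t % 20 == 0:
        ratio = np.linalg.norm(x - x_star) / np.linalg.norm(x_star)
        print(t, ratio, np.exp(-t / kappa))      # observed error vs. exp(-t/kappa)
```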

Theorem 6.3. Assume $f \colon \mathbb{R}^n \to \mathbb{R}$ is a quadratic whose coefficient matrix has condition number $\kappa$. Let $x^\star$ be an optimizer of $f$, and let $x_t$ be the point at step $t$ obtained by gradient descent with constant step size $\frac{1}{\beta}$, i.e. using the update rule $x_{t+1} = x_t - \frac{1}{\beta} \nabla f(x_t)$. Then:
$$\|x_{t+1} - x^\star\| \leq \exp\left(-\frac{t}{\kappa}\right) \|x_1 - x^\star\|.$$

Proof. Consider the quadratic function
$$f(x) = \frac{1}{2} x^T A x - b^T x + c,$$
where $A$ is a symmetric $n \times n$ matrix, $b \in \mathbb{R}^n$ and $c \in \mathbb{R}$. A gradient descent update with step size $\eta_t$ takes the form
$$x_{t+1} = x_t - \eta_t \nabla f(x_t) = x_t - \eta_t (A x_t - b).$$
Subtracting $x^\star$ from both sides of this equation and using the property that $A x^\star - b = \nabla f(x^\star) = 0$:
$$\begin{aligned}
x_{t+1} - x^\star &= \bigl(x_t - \eta_t (A x_t - b)\bigr) - \bigl(x^\star - \eta_t (A x^\star - b)\bigr) \\
&= (I - \eta_t A)(x_t - x^\star) \\
&= \prod_{k=1}^{t} (I - \eta_k A)\,(x_1 - x^\star).
\end{aligned}$$
Thus,
$$\|x_{t+1} - x^\star\| \leq \left\| \prod_{k=1}^{t} (I - \eta_k A) \right\| \, \|x_1 - x^\star\|.$$
Set $\eta_k = \frac{1}{\beta}$ for all $k$. Note that $\frac{\alpha}{\beta} I \preceq \frac{1}{\beta} A \preceq I$, so
$$\left\| I - \frac{1}{\beta} A \right\| = 1 - \frac{\alpha}{\beta} = 1 - \frac{1}{\kappa}.$$
It follows that
$$\|x_{t+1} - x^\star\| \leq \left(1 - \frac{1}{\kappa}\right)^t \|x_1 - x^\star\| \leq \exp\left(-\frac{t}{\kappa}\right) \|x_1 - x^\star\|. \qquad \blacksquare$$
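The proof hinges on the identity $x_{t+1} - x^\star = \prod_{k=1}^{t}(I - \eta_k A)(x_1 - x^\star)$. Here is a quick numerical check of that identity (my own sketch, assuming numpy; the matrix, starting point, and step sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag([1.0, 3.0, 6.0, 10.0]) @ Q.T      # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

x1 = rng.standard_normal(n)
etas = [0.05, 0.1, 0.02]                          # arbitrary step sizes eta_k

x = x1.copy()
P = np.eye(n)
for eta in etas:
    x = x - eta * (A @ x - b)                     # gradient descent update
    P = (np.eye(n) - eta * A) @ P                 # accumulate prod_k (I - eta_k A)

# The final error equals the matrix polynomial applied to the initial error.
print(np.allclose(x - x_star, P @ (x1 - x_star)))  # True
```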

6.3 Connection to polynomial approximation

In the previous section, we proved an upper bound on the convergence rate. In this section, we would like to improve on it. To see how, think about whether there was any point in the argument above where we were careless. One obvious candidate is that our choice of step size, $\eta_k = \frac{1}{\beta}$, was rather arbitrary. In fact, by choosing the sequence $\eta_k$ we can select any degree-$t$ polynomial of the form
$$p(A) = \prod_{k=1}^{t} (I - \eta_k A).$$
Note that
$$\|p(A)\| = \max_{x \in \lambda(A)} |p(x)|,$$
where $p(A)$ is a matrix polynomial and $p(x)$ is the corresponding scalar polynomial. In general, we may not know the set of eigenvalues $\lambda(A)$, but we do know that all eigenvalues lie in the range $[\alpha, \beta]$. Relaxing the upper bound, we get
$$\|p(A)\| \leq \max_{x \in [\alpha, \beta]} |p(x)|.$$
We can see now that we want a polynomial that takes on small values in $[\alpha, \beta]$, while satisfying the additional normalization constraint $p(0) = 1$.

6.3.1 A simple polynomial solution

A simple solution has a uniform step size $\eta_t = \frac{2}{\alpha + \beta}$. Note that
$$\max_{x \in [\alpha, \beta]} \left| 1 - \frac{2x}{\alpha + \beta} \right| = \frac{\beta - \alpha}{\beta + \alpha} \leq 1 - \frac{\alpha}{\beta} = 1 - \frac{1}{\kappa},$$
recovering the same convergence rate we proved previously. The resulting polynomial $p_t(x)$ is plotted in Figure 1 for degrees $t = 3$ and $t = 6$, with $\alpha = 1$ and $\beta = 10$. Note that doubling the degree from three to six only halves the maximum absolute value the polynomial attains in $[\alpha, \beta]$, explaining why convergence is so slow.
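The claim about Figure 1 is easy to check numerically. A minimal sketch (assuming numpy; the grid resolution is arbitrary) evaluates the naive polynomial $p_t(x) = (1 - \frac{2x}{\alpha+\beta})^t$ over $[\alpha, \beta]$ with $\alpha = 1$, $\beta = 10$:

```python
import numpy as np

alpha, beta = 1.0, 10.0
xs = np.linspace(alpha, beta, 1001)

for t in (3, 6):
    p = (1 - 2 * xs / (alpha + beta)) ** t       # uniform step size 2/(alpha+beta)
    print(t, np.abs(p).max())                    # max |p_t| over [alpha, beta]
# degree 3: ~0.55, degree 6: ~0.30 -- doubling the degree only roughly halves the max.
```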

[Figure 1: Naive polynomial, degrees 3 and 6.]

6.4 Chebyshev polynomials

Fortunately, we can do better than this by speeding up gradient descent using Chebyshev polynomials. We will use Chebyshev polynomials of the first kind, defined by the recurrence relation
$$T_0(a) = 1, \quad T_1(a) = a, \quad T_k(a) = 2a\,T_{k-1}(a) - T_{k-2}(a) \ \text{ for } k \geq 2.$$
Figure 2 plots the first few Chebyshev polynomials.

[Figure 2: Chebyshev polynomials of varying degrees.]

Why Chebyshev polynomials? Suitably rescaled, they minimize the absolute value in a desired interval $[\alpha, \beta]$ while satisfying the normalization constraint of having value 1 at the origin.

Recall that the eigenvalues of the matrix we consider are in the interval $[\alpha, \beta]$. We need to rescale the Chebyshev polynomials so that they are supported on this interval and still attain value 1 at the origin. This is accomplished by the polynomial
$$P_k(a) = \frac{T_k\left(\frac{\alpha + \beta - 2a}{\beta - \alpha}\right)}{T_k\left(\frac{\alpha + \beta}{\beta - \alpha}\right)}.$$
We see in Figure 3 that doubling the degree has a much more dramatic effect on the magnitude of the polynomial in the interval $[\alpha, \beta]$. Let us compare in Figure 4 this beautiful Chebyshev polynomial side by side with the naive polynomial we saw earlier. The Chebyshev polynomial does much better: it stays at around 0.3 for degree 3 (we needed degree 6 to achieve that with the naive polynomial), and below 0.1 for degree 6.
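To quantify the comparison in Figures 3 and 4, the sketch below (my own check, assuming numpy; `chebval` evaluates a Chebyshev series) computes the maximum of the rescaled polynomial $|P_k|$ and of the naive polynomial over $[\alpha, \beta]$, with $\alpha = 1$ and $\beta = 10$:

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

alpha, beta = 1.0, 10.0
xs = np.linspace(alpha, beta, 1001)

def T(k, z):
    """Chebyshev polynomial of the first kind, T_k, evaluated at z."""
    return chebval(z, [0] * k + [1])

for k in (3, 6):
    rescaled = T(k, (alpha + beta - 2 * xs) / (beta - alpha)) / T(k, (alpha + beta) / (beta - alpha))
    naive = (1 - 2 * xs / (alpha + beta)) ** k
    print(k, np.abs(rescaled).max(), np.abs(naive).max())
# degree 3: ~0.28 vs ~0.55; degree 6: ~0.04 vs ~0.30, consistent with the figures.
```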

6.4.1 Accelerated gradient descent

The Chebyshev polynomial leads to an accelerated version of gradient descent. Before we describe the iterative process, let us first see what error bound comes out of the Chebyshev polynomial. So, just how large is the polynomial in the interval $[\alpha, \beta]$? First, note that the maximum value is attained at $\alpha$. Plugging this into the definition of the rescaled Chebyshev polynomial, we get the upper bound, for any $a \in [\alpha, \beta]$,
$$|P_k(a)| \leq P_k(\alpha) = \frac{T_k(1)}{T_k\left(\frac{\beta + \alpha}{\beta - \alpha}\right)} = T_k\left(\frac{\beta + \alpha}{\beta - \alpha}\right)^{-1}.$$
Recalling the condition number $\kappa = \beta / \alpha$, we have
$$\frac{\beta + \alpha}{\beta - \alpha} = \frac{\kappa + 1}{\kappa - 1}.$$
Typically $\kappa$ is large, so this is $1 + \epsilon$ with $\epsilon \approx \frac{2}{\kappa}$. Therefore, we have
$$|P_k(a)| \leq T_k(1 + \epsilon)^{-1}.$$
To upper bound $|P_k|$, we need to lower bound $T_k(1 + \epsilon)$.

Fact: for $a > 1$,
$$T_k(a) = \cosh\left(k \, \mathrm{arccosh}(a)\right), \quad \text{where} \quad \cosh(a) = \frac{e^a + e^{-a}}{2}, \quad \mathrm{arccosh}(a) = \ln\left(a + \sqrt{a^2 - 1}\right).$$
Now, letting $\phi = \mathrm{arccosh}(1 + \epsilon)$,
$$e^{\phi} = 1 + \epsilon + \sqrt{2\epsilon + \epsilon^2} \geq 1 + \sqrt{\epsilon}.$$
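The hyperbolic-cosine formula for $T_k$ outside $[-1, 1]$ is easy to verify numerically. This small check (my own illustration, assuming numpy; the values of $a$, $k$, and $\epsilon$ are arbitrary) also confirms the lower bound on $e^{\phi}$:

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

a = 1.2                   # any a > 1
for k in (1, 2, 5, 10):
    via_recurrence = chebval(a, [0] * k + [1])     # T_k(a) from the Chebyshev series
    via_cosh = np.cosh(k * np.arccosh(a))          # cosh(k * arccosh(a))
    print(k, via_recurrence, via_cosh)             # the two expressions agree

eps = 0.02
phi = np.arccosh(1 + eps)
print(np.exp(phi), 1 + np.sqrt(eps))               # e^phi >= 1 + sqrt(eps)
```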

[Figure 3: Rescaled Chebyshev polynomial, degrees 3 and 6.]

So, we can lower bound $T_k(1 + \epsilon)$:
$$T_k(1 + \epsilon) = \cosh\left(k \, \mathrm{arccosh}(1 + \epsilon)\right) = \cosh(k\phi) = \frac{(e^{\phi})^k + (e^{-\phi})^k}{2} \geq \frac{(1 + \sqrt{\epsilon})^k}{2}.$$
Then, the reciprocal is what we need in order to upper bound the error of our algorithm, so we have
$$|P_k(a)| \leq T_k(1 + \epsilon)^{-1} \leq 2 (1 + \sqrt{\epsilon})^{-k}.$$
Thus, this establishes that the Chebyshev polynomial achieves the error bound
$$\|x_{t+1} - x^\star\| \leq 2 (1 + \sqrt{\epsilon})^{-t} \|x_0 - x^\star\| \leq 2 \left(1 + \sqrt{\tfrac{2}{\kappa}}\right)^{-t} \|x_0 - x^\star\| \approx 2 \exp\left(-t \sqrt{\tfrac{2}{\kappa}}\right) \|x_0 - x^\star\|,$$
where the last step is an approximation valid for large $\kappa$. This means that for large $\kappa$, we get quadratic savings in the degree we need before the error drops off exponentially. Figure 5 shows the different rates of convergence; we clearly see that the bound based on the Chebyshev polynomial decays much faster than the naive one.
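To see the quadratic savings concretely, here is a back-of-the-envelope sketch (my own illustration, not from the notes) comparing how many iterations each bound needs to drive the relative error below a target $\delta$: roughly $t \geq \kappa \ln(1/\delta)$ for the plain gradient-descent bound $\exp(-t/\kappa)$, versus $t \geq \sqrt{\kappa/2}\,\ln(2/\delta)$ for the approximate Chebyshev bound $2\exp(-t\sqrt{2/\kappa})$:

```python
import numpy as np

delta = 1e-6
for kappa in (1e2, 1e4, 1e6):
    t_gd = kappa * np.log(1 / delta)                  # from exp(-t/kappa) <= delta
    t_cheb = np.sqrt(kappa / 2) * np.log(2 / delta)   # from 2*exp(-t*sqrt(2/kappa)) <= delta
    print(int(kappa), int(t_gd), int(t_cheb))
# kappa = 1e6: about 1.4e7 iterations vs. about 1e4 -- the promised sqrt(kappa) speedup.
```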

[Figure 4: Rescaled Chebyshev vs. naive polynomial, degree 6.]

6.4.2 The Chebyshev recurrence relation

Due to the recursive definition of the Chebyshev polynomials, we directly get an iterative algorithm out of them. Transferring the recursive definition to our rescaled Chebyshev polynomial, we have
$$P_{k+1}(a) = (\eta_k a + \gamma_k) P_k(a) + \mu_k P_{k-1}(a),$$
where we can work out the coefficients $\eta_k, \gamma_k, \mu_k$ from the recurrence definition. Since $P_k(0) = 1$, we must have $\gamma_k + \mu_k = 1$. This leads to a simple update rule for our iterates:
$$\begin{aligned}
x_{k+1} &= (\eta_k A + \gamma_k I) x_k + (1 - \gamma_k) x_{k-1} - \eta_k b \\
&= (\eta_k A + (1 - \mu_k) I) x_k + \mu_k x_{k-1} - \eta_k b \\
&= x_k - \eta_k (A x_k - b) + \mu_k (x_k - x_{k-1}).
\end{aligned}$$
We see that the update rule above is very similar to plain gradient descent, except for the additional term $\mu_k (x_k - x_{k-1})$. This term can be interpreted as a momentum term, pushing the algorithm in the direction of where it was headed before. In the next lecture, we will dig deeper into momentum and see how to generalize the result for quadratics to general convex functions.
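Before moving on, here is a sketch of the update rule above on a quadratic (assuming numpy). The notes leave the coefficients implicit; one way to work them out from the rescaled recurrence gives $\eta_k = \frac{4 t_k}{(\beta - \alpha) t_{k+1}}$ and $\mu_k = \frac{t_{k-1}}{t_{k+1}}$ with $t_k = T_k\left(\frac{\alpha+\beta}{\beta-\alpha}\right)$, so treat these formulas as my own derivation rather than the canonical form from the notes:

```python
import numpy as np

def chebyshev_gd(A, b, alpha, beta, steps):
    """Accelerated iteration for f(x) = 1/2 x^T A x - b^T x with spec(A) in [alpha, beta].

    After k steps the error is P_k(A)(x_0 - x*), where P_k is the rescaled
    Chebyshev polynomial defined above.
    """
    z0 = (alpha + beta) / (beta - alpha)
    t_prev, t_curr = 1.0, z0                       # T_0(z0), T_1(z0)
    x_prev = np.zeros(len(b))
    x_curr = x_prev - (2.0 / (alpha + beta)) * (A @ x_prev - b)   # first step uses P_1
    for _ in range(steps - 1):
        t_next = 2 * z0 * t_curr - t_prev          # Chebyshev recurrence evaluated at z0
        eta = 4 * t_curr / ((beta - alpha) * t_next)   # step size eta_k
        mu = t_prev / t_next                           # momentum coefficient mu_k
        x_next = x_curr - eta * (A @ x_curr - b) + mu * (x_curr - x_prev)
        x_prev, x_curr, t_prev, t_curr = x_curr, x_next, t_curr, t_next
    return x_curr

# Example: compare against plain gradient descent on an ill-conditioned quadratic.
eigs = np.array([1.0, 5.0, 20.0, 100.0])           # kappa = 100
A, b = np.diag(eigs), np.ones(4)
x_star = np.linalg.solve(A, b)
alpha, beta = eigs.min(), eigs.max()

x_gd = np.zeros(4)
for _ in range(50):
    x_gd = x_gd - (1.0 / beta) * (A @ x_gd - b)    # plain gradient descent, eta = 1/beta

x_acc = chebyshev_gd(A, b, alpha, beta, 50)
print(np.linalg.norm(x_gd - x_star), np.linalg.norm(x_acc - x_star))  # acceleration wins
```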

[Figure 5: Convergence for naive polynomial and Chebyshev.]

References

[Lax07] Peter D. Lax. Linear Algebra and Its Applications. Wiley, 2007.