Convex Optimization
9. Unconstrained minimization

Prof. Ying Cui
Department of Electrical Engineering, Shanghai Jiao Tong University
2017 Autumn Semester

Outline

- Unconstrained minimization problems
- Descent methods
- Gradient descent method
- Steepest descent method
- Newton's method
- Self-concordance
- Implementation

Unconstrained minimization

$$\min_x\ f(x)$$

assumptions:
- $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice continuously differentiable (implying that $\mathrm{dom}\, f$ is open)
- there exists an optimal point $x^\star$ (the optimal value $p^\star = \inf_x f(x)$ is attained and finite)

a necessary and sufficient condition for optimality: $\nabla f(x^\star) = 0$
- solving the unconstrained minimization problem is the same as finding a solution of the optimality equation
- in a few special cases, it can be solved analytically
- usually, it must be solved by an iterative algorithm: produce a sequence of points $x^{(k)} \in \mathrm{dom}\, f$, $k = 0,1,\ldots$, with $f(x^{(k)}) \to p^\star$ as $k \to \infty$, terminated when $f(x^{(k)}) - p^\star \le \epsilon$ for some tolerance $\epsilon > 0$

Initial point and sublevel set

algorithms in this chapter require a starting point $x^{(0)}$ such that
- $x^{(0)} \in \mathrm{dom}\, f$
- the sublevel set $S = \{x \mid f(x) \le f(x^{(0)})\}$ is closed (hard to verify)

the 2nd condition is satisfied for all $x^{(0)} \in \mathrm{dom}\, f$ if $f$ is closed, i.e., all its sublevel sets are closed (equivalent to $\mathrm{epi}\, f$ being closed)
- true if $f$ is continuous and $\mathrm{dom}\, f = \mathbb{R}^n$
- true if $f(x) \to \infty$ as $x \to \mathrm{bd}\, \mathrm{dom}\, f$

examples of differentiable functions with closed sublevel sets:
$$f(x) = \log\Big(\sum_{i=1}^m \exp(a_i^T x + b_i)\Big), \qquad f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$$

Strong convexity and implications

$f$ is strongly convex on $S$ if there exists an $m > 0$ such that $\nabla^2 f(x) \succeq mI$ for all $x \in S$

implications:
- for $x, y \in S$,
  $$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|_2^2$$
  - $m = 0$: recovers the basic inequality characterizing convexity
  - $m > 0$: a better lower bound than follows from convexity alone
- $S$ is bounded
- $p^\star > -\infty$ and, for $x \in S$,
  $$f(x) - p^\star \le \frac{1}{2m}\|\nabla f(x)\|_2^2$$
  - if the gradient is small at a point, then the point is nearly optimal
  - a condition for suboptimality generalizing the optimality condition: $\|\nabla f(x)\|_2 \le (2m\epsilon)^{1/2} \implies f(x) - p^\star \le \epsilon$, useful as a stopping criterion if $m$ is known

upper bound on $\nabla^2 f(x)$: there exists an $M > 0$ such that $\nabla^2 f(x) \preceq MI$ for all $x \in S$
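The suboptimality bound follows by minimizing both sides of the strong convexity inequality over $y$; the right-hand side, a convex quadratic in $y$, is minimized at $y = x - \frac{1}{m}\nabla f(x)$:
$$p^\star = \inf_y f(y) \ge \inf_y \left( f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y-x\|_2^2 \right) = f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2.$$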

Condition number of a matrix and a convex set

condition number of a matrix: the ratio of its largest eigenvalue to its smallest eigenvalue

condition number of a convex set: the square of the ratio of its maximum width to its minimum width
- width of a convex set $C$ in the direction $q$, with $\|q\|_2 = 1$: $W(C,q) = \sup_{z \in C} q^T z - \inf_{z \in C} q^T z$
- minimum width and maximum width of $C$: $W_{\min} = \inf_{\|q\|_2=1} W(C,q)$ and $W_{\max} = \sup_{\|q\|_2=1} W(C,q)$
- condition number of $C$: $\mathrm{cond}(C) = W_{\max}^2 / W_{\min}^2$
- a measure of its anisotropy or eccentricity: $\mathrm{cond}(C)$ small means $C$ has approximately the same width in all directions (nearly spherical); $\mathrm{cond}(C)$ large means that $C$ is far wider in some directions than in others

Condition number of sublevel sets

assume $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$
- upper bound on the condition number of $\nabla^2 f(x)$: $\mathrm{cond}(\nabla^2 f(x)) \le M/m$
- upper bound on the condition number of the sublevel set $C_\alpha = \{x \mid f(x) \le \alpha\}$, $p^\star < \alpha \le f(x^{(0)})$: $\mathrm{cond}(C_\alpha) \le M/m$
- geometric interpretation: $\lim_{\alpha \to p^\star} \mathrm{cond}(C_\alpha) = \mathrm{cond}(\nabla^2 f(x^\star))$

the condition number of the sublevel sets of $f$ (which is bounded by $M/m$) has a strong effect on the efficiency of some common methods for unconstrained minimization

Descent methods

algorithms described in this chapter produce a minimizing sequence $x^{(k)}$, $k = 1,\ldots$, where
$$x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)} \quad \text{with } f(x^{(k+1)}) < f(x^{(k)}) \text{ and } t^{(k)} > 0$$
- other notations: $x^+ = x + t\Delta x$, $x := x + t\Delta x$
- $\Delta x$ is the step (or search direction); $t$ is the step size (or step length)
- convexity of $f$ implies $\nabla f(x^{(k)})^T \Delta x^{(k)} < 0$ (i.e., $\Delta x^{(k)}$ is a descent direction)

General descent method.
given a starting point $x \in \mathrm{dom}\, f$.
repeat
 1. Determine a descent direction $\Delta x$.
 2. Line search. Choose a step size $t > 0$.
 3. Update. $x := x + t\Delta x$.
until stopping criterion is satisfied.

Line search types

exact line search: $t = \mathrm{argmin}_{t>0} f(x + t\Delta x)$
- minimizes $f$ along the ray $\{x + t\Delta x \mid t \ge 0\}$
- used when the cost of the one-variable minimization problem is low compared to the cost of computing the search direction itself
- in some special cases the minimizer can be found analytically, and in others it can be computed efficiently

Line search types

backtracking line search (with parameters $\alpha \in (0, 1/2)$, $\beta \in (0,1)$)
- reduces $f$ enough along the ray $\{x + t\Delta x \mid t \ge 0\}$: starting at $t = 1$, repeat $t := \beta t$ until $f(x + t\Delta x) < f(x) + \alpha t\nabla f(x)^T \Delta x$
- convexity of $f$: $f(x + t\Delta x) \ge f(x) + t\nabla f(x)^T \Delta x$
- the constant $\alpha$ can be interpreted as the fraction of the decrease in $f$ predicted by linear extrapolation that we will accept
- graphical interpretation: backtrack until $t \le t_0$

[Figure 9.1: Backtracking line search. The curve shows $f$, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of $f$, and the upper dashed line has a slope a factor of $\alpha$ smaller. The backtracking condition is that $f$ lies below the upper dashed line, i.e., $0 \le t \le t_0$.]
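A minimal sketch of the backtracking rule in Python (NumPy); the function names and signature are illustrative, with the caller supplying $f$, its gradient, the current point, and a descent direction:

```python
import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.25, beta=0.5):
    """Shrink t until f(x + t*dx) < f(x) + alpha*t*grad_f(x)^T dx."""
    t = 1.0
    fx, slope = f(x), grad_f(x) @ dx   # slope < 0 for a descent direction
    while f(x + t * dx) >= fx + alpha * t * slope:
        t *= beta
    return t
```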

Gradient descent method

general descent method with $\Delta x = -\nabla f(x)$

Gradient descent method.
given a starting point $x \in \mathrm{dom}\, f$.
repeat
 1. $\Delta x := -\nabla f(x)$.
 2. Line search. Choose step size $t$ via exact or backtracking line search.
 3. Update. $x := x + t\Delta x$.
until stopping criterion is satisfied.

- stopping criterion usually of the form $\|\nabla f(x)\|_2 \le \epsilon$
- convergence result: for strongly convex $f$, $f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star)$
  - exact line search: $c = 1 - m/M < 1$
  - backtracking line search: $c = 1 - \min\{2m\alpha,\ 2\beta\alpha m/M\} < 1$
- linear convergence: the error lies below a line on a log-linear plot of error versus iteration number
- very simple, but often very slow; rarely used in practice
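Putting the pieces together, a sketch of the full loop, reusing the backtracking helper above (the tolerance and iteration cap are illustrative defaults):

```python
def gradient_descent(f, grad_f, x0, eps=1e-8, max_iter=1000):
    """Gradient descent with backtracking line search;
    stops when ||grad f(x)||_2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:      # stopping criterion
            break
        dx = -g                           # descent direction
        t = backtracking(f, grad_f, x, dx)
        x = x + t * dx
    return x
```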

Examples: a quadratic problem in $\mathbb{R}^2$

$$f(x) = \frac{1}{2}(x_1^2 + \gamma x_2^2), \quad \gamma > 0$$

with exact line search, starting at $x^{(0)} = (\gamma, 1)$: closed-form expressions for the iterates
$$x_1^{(k)} = \gamma\left(\frac{\gamma-1}{\gamma+1}\right)^k, \qquad x_2^{(k)} = \left(-\frac{\gamma-1}{\gamma+1}\right)^k, \qquad f(x^{(k)}) = \left(\frac{\gamma-1}{\gamma+1}\right)^{2k} f(x^{(0)})$$

exact solution found in one iteration if $\gamma = 1$; convergence rapid if $\gamma$ is not far from 1; convergence very slow if $\gamma \gg 1$ or $\gamma \ll 1$

[Figure 9.2: Some contour lines of the function $f(x) = (1/2)(x_1^2 + 10x_2^2)$. The condition number of the sublevel sets, which are ellipsoids, is exactly 10. The figure shows the iterates of the gradient method with exact line search, started at $x^{(0)} = (10, 1)$.]
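The closed form is easy to verify numerically; for a quadratic $f(x) = \frac{1}{2}x^T H x$ the exact line search step is $t = g^T g / (g^T H g)$ with $g = \nabla f(x)$. A small check (values chosen for illustration):

```python
import numpy as np

def quad_gd_exact(gamma, k_max):
    """Gradient descent with exact line search on f(x) = 0.5*(x1^2 + gamma*x2^2),
    started at (gamma, 1); for a quadratic, t = g^T g / (g^T H g)."""
    H = np.diag([1.0, gamma])
    x = np.array([gamma, 1.0])
    for _ in range(k_max):
        g = H @ x
        x = x - (g @ g) / (g @ H @ g) * g
    return x

gamma, k = 10.0, 5
r = (gamma - 1) / (gamma + 1)
print(quad_gd_exact(gamma, k))       # numerical iterate x^(k)
print(gamma * r**k, (-r)**k)         # closed-form x1^(k), x2^(k)
```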

Examples: a nonquadratic problem in $\mathbb{R}^2$

$$f(x_1, x_2) = e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1}$$

- backtracking line search: approximately linear convergence (sublevel sets of $f$ not too badly conditioned, $M/m$ not too large)
- exact line search: approximately linear convergence, about twice as fast as with backtracking line search

[Figure 9.3: Iterates of the gradient method with backtracking line search for this problem. The dashed curves are level curves of $f$, and the small circles are the iterates of the gradient method. The solid lines, which connect successive iterates, show the scaled steps $t^{(k)}\Delta x^{(k)}$.]
[Figure 9.5: Iterates of the gradient method with exact line search for the same problem.]
[Figure 9.4: Error $f(x^{(k)}) - p^\star$ versus iteration $k$ of the gradient method with backtracking and exact line search. The plot shows nearly linear convergence, with the error reduced approximately by the factor 0.4 in each iteration with backtracking line search, and by the factor 0.2 in each iteration with exact line search.]

Examples: a problem in $\mathbb{R}^{100}$

$$f(x) = c^T x - \sum_{i=1}^{500} \log(b_i - a_i^T x)$$

- backtracking line search: approximately linear convergence
- exact line search: approximately linear convergence, only a bit faster than with backtracking line search

[Figure 9.6: Error $f(x^{(k)}) - p^\star$ versus iteration $k$ for the gradient method with backtracking and exact line search, for a problem in $\mathbb{R}^{100}$.]

Conclusions on the gradient descent method

characteristics:
- exhibits approximately linear convergence, i.e., the error converges to zero approximately as a geometric series
- the choice of backtracking parameters $\alpha$, $\beta$ has a noticeable but not dramatic effect on the convergence
- exact line search sometimes improves the convergence, but not much (and probably not worth the trouble of implementing it)
- the convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets

main advantage and disadvantage:
- main advantage: simplicity
- main disadvantage: the convergence rate depends critically on the condition number of the Hessian or sublevel sets

Steepest descent method

normalized steepest descent direction (at $x$, for the norm $\|\cdot\|$):
$$\Delta x_{\mathrm{nsd}} = \mathrm{argmin}\{\nabla f(x)^T v \mid \|v\| = 1\}$$
- the first-order Taylor approximation of $f(x+v)$ around $x$ is $f(x+v) \approx f(x) + \nabla f(x)^T v$
- the directional derivative of $f$ at $x$ in direction $v$ is $\nabla f(x)^T v$
- the direction $\Delta x_{\mathrm{nsd}}$ is the unit-norm direction with the most negative directional derivative

(unnormalized) steepest descent direction:
$$\Delta x_{\mathrm{sd}} = \|\nabla f(x)\|_* \Delta x_{\mathrm{nsd}} \quad \text{satisfies} \quad \nabla f(x)^T \Delta x_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$$
where $\|\cdot\|_*$ denotes the dual norm

Steepest descent method

general descent method with $\Delta x = \Delta x_{\mathrm{sd}}$

Steepest descent method.
given a starting point $x \in \mathrm{dom}\, f$.
repeat
 1. Compute the steepest descent direction $\Delta x_{\mathrm{sd}}$.
 2. Line search. Choose step size $t$ via exact or backtracking line search.
 3. Update. $x := x + t\Delta x_{\mathrm{sd}}$.
until stopping criterion is satisfied.

- when exact line search is used, scale factors in the descent direction have no effect, so $\Delta x_{\mathrm{nsd}}$ or $\Delta x_{\mathrm{sd}}$ can be used
- convergence result: for strongly convex $f$, $f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star)$
  - backtracking line search: $c = 1 - 2m\alpha\tilde\gamma^2 \min\{1,\ \beta\gamma^2/M\} < 1$
  - any norm can be bounded in terms of the Euclidean norm, i.e., there exist constants $\gamma, \tilde\gamma \in (0,1]$ such that $\|x\| \ge \gamma\|x\|_2$ and $\|x\|_* \ge \tilde\gamma\|x\|_2$
- linear convergence, same as the gradient descent method

Steepest descent for different norms

- Euclidean norm: $\Delta x_{\mathrm{sd}} = -\nabla f(x)$; coincides with the gradient descent method
- quadratic norm $\|x\|_P = (x^T P x)^{1/2}$ ($P \in S_{++}^n$): $\Delta x_{\mathrm{sd}} = -P^{-1}\nabla f(x)$; can be thought of as the gradient descent method applied to the problem after the change of coordinates $\bar x = P^{1/2} x$
- $\ell_1$-norm: $\Delta x_{\mathrm{sd}} = -\frac{\partial f(x)}{\partial x_i} e_i$, where $i$ is an index for which $\left|\frac{\partial f(x)}{\partial x_i}\right| = \|\nabla f(x)\|_\infty$; a coordinate-descent algorithm (update the component with maximum absolute partial derivative value); see the sketch after the figures

[Figure 9.9: Normalized steepest descent direction for a quadratic norm. The ellipsoid shown is the unit ball of the norm, translated to the point $x$. The normalized steepest descent direction $\Delta x_{\mathrm{nsd}}$ at $x$ extends as far as possible in the direction $-\nabla f(x)$ while staying in the ellipsoid. The gradient and normalized steepest descent directions are shown.]
[Figure 9.10: Normalized steepest descent direction for the $\ell_1$-norm. The diamond is the unit ball of the $\ell_1$-norm, translated to the point $x$. The normalized steepest descent direction can always be chosen in the direction of a standard basis vector; in this example $\Delta x_{\mathrm{nsd}} = e_1$.]
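A sketch of the three direction choices, assuming the gradient $g$ and, for the quadratic norm, a fixed $P \succ 0$ are supplied:

```python
import numpy as np

def sd_direction(g, norm="euclidean", P=None):
    """Unnormalized steepest descent direction for a given norm."""
    if norm == "euclidean":
        return -g                        # gradient descent direction
    if norm == "quadratic":              # ||x||_P = sqrt(x^T P x)
        return -np.linalg.solve(P, g)    # -P^{-1} g
    if norm == "l1":                     # coordinate descent
        i = np.argmax(np.abs(g))         # largest |partial derivative|
        d = np.zeros_like(g)
        d[i] = -g[i]
        return d
    raise ValueError(norm)
```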

Choice of norm for steepest descent

- the choice of norm has a strong effect on the speed of convergence of the steepest descent method (consider the quadratic $P$-norm)
- the steepest descent method with quadratic $P$-norm is the same as the gradient method after the change of coordinates $\bar x = P^{1/2} x$
- to increase the speed of convergence, choose $P$ so that the sublevel sets of $f$, transformed by $P^{1/2}$, are well conditioned
  - the ellipsoid $\{x \mid x^T P x \le 1\}$ approximates the shape of the sublevel sets
  - works well in cases where we can identify a matrix $P$ for which the transformed problem has moderate condition number
- steepest descent with backtracking line search for two quadratic norms (ellipses show $\{x \mid \|x - x^{(k)}\|_P = 1\}$)

[Figure 9.11: Steepest descent method with a quadratic norm $P_1$. The ellipses are the boundaries of the norm balls $\{x \mid \|x - x^{(k)}\|_{P_1} \le 1\}$ at $x^{(0)}$ and $x^{(1)}$.]
[Figure 9.12: Steepest descent method with quadratic norm $P_2$.]
[Figure 9.13: Error $f(x^{(k)}) - p^\star$ versus iteration $k$ for the steepest descent method with the quadratic norms $P_1$ and $P_2$. Convergence is rapid for the norm $P_1$ and very slow for $P_2$.]

Newton step

Newton step for $f$ at $x$:
$$\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1}\nabla f(x)$$
- convexity of $f$ ($\nabla^2 f(x) \succ 0$) implies $\nabla f(x)^T \Delta x_{\mathrm{nt}} < 0$ unless $\nabla f(x) = 0$: the Newton step is a descent direction unless $x$ is optimal
- affine invariance: the Newton step of $\bar f(y) = f(Ty)$ ($T$ nonsingular) at $y$ and the Newton step of $f$ at $x = Ty$ satisfy $\Delta x_{\mathrm{nt}} = T\Delta y_{\mathrm{nt}}$

Interpretations of the Newton step: minimizer of second-order approximation

$x + \Delta x_{\mathrm{nt}}$ minimizes the second-order Taylor approximation of $f$ at $x$ (a convex quadratic function of $v$)
$$\hat f(x+v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v$$
- if $f$ is quadratic, then $x + \Delta x_{\mathrm{nt}}$ is the exact minimizer of $f$
- if $f$ is nearly quadratic, then $x + \Delta x_{\mathrm{nt}}$ should be a very good estimate of the minimizer of $f$
- when $x$ is near $x^\star$ (so the quadratic model of $f$ is very accurate), $x + \Delta x_{\mathrm{nt}}$ should be a very good approximation of $x^\star$

[Figure 9.16: The function $f$ (shown solid) and its second-order approximation $\hat f$ at $x$ (dashed). The Newton step $\Delta x_{\mathrm{nt}}$ is what must be added to $x$ to give the minimizer of $\hat f$.]

Interpretations of the Newton step: solution of linearized optimality condition

$x + \Delta x_{\mathrm{nt}}$ solves the linearized optimality condition
$$\nabla f(x+v) \approx \nabla f(x) + \nabla^2 f(x)v = 0$$
- when $x$ is near $x^\star$ (so the optimality condition almost holds), $x + \Delta x_{\mathrm{nt}}$ should be a very good approximation of $x^\star$

[Figure 9.18: The solid curve is the derivative $f'$ of the function $f$ shown in Figure 9.16; $\hat f'$ is the linear approximation of $f'$ at $x$. The Newton step $\Delta x_{\mathrm{nt}}$ is the difference between the root of $\hat f'$ and the point $x$.]

Interpretations of the Newton step: steepest descent direction in Hessian norm

$\Delta x_{\mathrm{nt}}$ is the steepest descent direction at $x$ for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, i.e.,
$$\|u\|_{\nabla^2 f(x)} = (u^T \nabla^2 f(x) u)^{1/2}$$
- when $x$ is near $x^\star$, $\nabla^2 f(x) \approx \nabla^2 f(x^\star)$, and after the associated change of coordinates $\bar x = (\nabla^2 f(x))^{1/2} x$ the problem has small condition number, so steepest descent in the norm $\|\cdot\|_{\nabla^2 f(x)}$ converges very rapidly

[Figure 9.17: The dashed lines are level curves of a convex function. The ellipsoid shown (with solid line) is $\{x + v \mid v^T \nabla^2 f(x) v \le 1\}$. The arrow shows $-\nabla f(x)$, the gradient descent direction. The Newton step $\Delta x_{\mathrm{nt}}$ is the steepest descent direction in the norm $\|\cdot\|_{\nabla^2 f(x)}$. The figure also shows $\Delta x_{\mathrm{nsd}}$, the normalized steepest descent direction for the same norm.]

Newton decrement

Newton decrement at $x$ (a measure of the proximity of $x$ to $x^\star$):
$$\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1}\nabla f(x)\right)^{1/2}$$

properties:
- $\frac{1}{2}\lambda(x)^2$ is an estimate of $f(x) - p^\star$, using the quadratic approximation $\hat f$:
$$f(x) - \inf_v \hat f(x+v) = f(x) - \hat f(x + \Delta x_{\mathrm{nt}}) = \frac{1}{2}\lambda(x)^2$$
- $\lambda(x)$ is equal to the norm of the Newton step at $x$ in the quadratic Hessian norm $\|u\|_{\nabla^2 f(x)} = (u^T \nabla^2 f(x) u)^{1/2}$:
$$\lambda(x) = \|\Delta x_{\mathrm{nt}}\|_{\nabla^2 f(x)} = \left(\Delta x_{\mathrm{nt}}^T \nabla^2 f(x)\Delta x_{\mathrm{nt}}\right)^{1/2}$$
- $-\lambda(x)^2$ is the directional derivative of $f$ at $x$ in the Newton direction:
$$-\lambda(x)^2 = \nabla f(x)^T \Delta x_{\mathrm{nt}} = \frac{d}{dt} f(x + t\Delta x_{\mathrm{nt}})\Big|_{t=0}$$
- affine invariance: the Newton decrement of $\bar f(y) = f(Ty)$ ($T$ nonsingular) at $y$ is the same as the Newton decrement of $f$ at $x = Ty$

Newton's method

general descent method with $\Delta x = \Delta x_{\mathrm{nt}}$

Newton's method.
given a starting point $x \in \mathrm{dom}\, f$, tolerance $\epsilon > 0$.
repeat
 1. Compute the Newton step and decrement: $\Delta x_{\mathrm{nt}} := -\nabla^2 f(x)^{-1}\nabla f(x)$; $\lambda^2 := \nabla f(x)^T \nabla^2 f(x)^{-1}\nabla f(x)$.
 2. Stopping criterion. Quit if $\lambda^2/2 \le \epsilon$.
 3. Line search. Choose step size $t$ by backtracking line search.
 4. Update. $x := x + t\Delta x_{\mathrm{nt}}$.

Newton's method is affine invariant, due to the affine invariance of the Newton step and decrement:
- independent of linear changes of coordinates
- Newton iterates for $\bar f(y) = f(Ty)$ with starting point $y^{(0)} = T^{-1}x^{(0)}$ are $y^{(k)} = T^{-1}x^{(k)}$
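A compact sketch of the damped Newton iteration, reusing the backtracking helper above; hess_f is assumed to return the Hessian matrix:

```python
def newton_method(f, grad_f, hess_f, x0, eps=1e-10, max_iter=50):
    """Newton's method with backtracking; stop when lambda(x)^2 / 2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad_f(x), hess_f(x)
        dx = -np.linalg.solve(H, g)    # Newton step
        lam2 = -g @ dx                 # lambda(x)^2 = g^T H^{-1} g
        if lam2 / 2 <= eps:
            break
        t = backtracking(f, grad_f, x, dx)
        x = x + t * dx
    return x
```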

Classical convergence analysis

assumptions:
- $f$ strongly convex on $S$ with constant $m$, implying $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$
- $\nabla^2 f$ is Lipschitz continuous on $S$ with constant $L > 0$, i.e.,
  $$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2 \quad \text{for all } x, y \in S$$
  ($L$ measures how well $f$ can be approximated by a quadratic function)

outline: there exist constants $\eta \in (0, m^2/L)$, $\gamma > 0$ such that
- if $\|\nabla f(x^{(k)})\|_2 \ge \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$
- if $\|\nabla f(x^{(k)})\|_2 < \eta$, then
  $$\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2$$
  implying that $\|\nabla f(x^{(l)})\|_2 < \eta$ for all $l \ge k$

Classical convergence analysis

damped Newton phase ($\|\nabla f(x)\|_2 \ge \eta$):
- most iterations require backtracking steps
- the function value decreases by at least $\gamma$, i.e., $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$
- if $p^\star > -\infty$, this phase ends after at most $(f(x^{(0)}) - p^\star)/\gamma$ iterations

quadratically convergent phase ($\|\nabla f(x)\|_2 < \eta$):
- all iterations use step size $t = 1$ (no backtracking steps)
- $\|\nabla f(x)\|_2$ converges to zero quadratically: if $\|\nabla f(x^{(k)})\|_2 < \eta$, then for $l \ge k$
$$\frac{L}{2m^2}\|\nabla f(x^{(l)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^{2^{l-k}} \le \left(\frac{1}{2}\right)^{2^{l-k}}$$
and hence
$$f(x^{(l)}) - p^\star \le \frac{1}{2m}\|\nabla f(x^{(l)})\|_2^2 \le \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{l-k+1}}$$
- if $p^\star > -\infty$, this phase ends ($f(x^{(l)}) - p^\star \le \epsilon$) after at most $\log_2\log_2(\epsilon_0/\epsilon)$ iterations

Classical convergence analysis

conclusion: the number of iterations until $f(x) - p^\star \le \epsilon$ is bounded above by
$$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2\log_2(\epsilon_0/\epsilon)$$
where
$$\gamma = \frac{\alpha\beta\eta^2 m}{M^2}, \qquad \eta = \min\{1,\ 3(1-2\alpha)\}\frac{m^2}{L}, \qquad \epsilon_0 = 2m^3/L^2$$
- the second term is small (of the order of 6) and almost constant for practical purposes
- in practice, the constants $m$, $L$ (hence $\gamma$, $\epsilon_0$) are usually unknown
- the analysis provides qualitative insight into the convergence properties (i.e., it explains the two algorithm phases)

Examples: example in $\mathbb{R}^2$ (page 12)

- backtracking parameters $\alpha = 0.1$, $\beta = 0.7$
- converges in only 5 steps
- apparent quadratic convergence

[Figure 9.19: Newton's method for the problem in $\mathbb{R}^2$, with objective $f$ given in (9.20), and backtracking line search parameters $\alpha = 0.1$, $\beta = 0.7$. Also shown are the ellipsoids $\{x \mid \|x - x^{(k)}\|_{\nabla^2 f(x^{(k)})} \le 1\}$ at the first two iterates.]
[Figure 9.20: Error versus iteration $k$ of Newton's method for the problem in $\mathbb{R}^2$. Convergence to a very high accuracy is achieved in five iterations.]

Examples: example in $\mathbb{R}^{100}$ (page 13)

- backtracking parameters $\alpha = 0.01$, $\beta = 0.5$
- backtracking line search almost as fast as exact line search (and much simpler)
- clearly shows the two convergence phases (damped phase of 2 iterations)

[Figure 9.21: Error versus iteration for Newton's method for the problem in $\mathbb{R}^{100}$, with backtracking line search parameters $\alpha = 0.01$, $\beta = 0.5$. Here too convergence is extremely rapid: a very high accuracy is attained in only seven or eight iterations; Newton's method with exact line search is only one iteration faster than with backtracking line search.]
[Figure 9.22: Step size $t^{(k)}$ versus iteration for Newton's method with backtracking and exact line search, applied to the problem in $\mathbb{R}^{100}$. The backtracking line search takes one backtracking step in the first two iterations; after that it always selects $t = 1$.]

Examples: example in $\mathbb{R}^{10000}$ (with sparse $a_i$)

$$f(x) = -\sum_{i=1}^{10000} \log(1 - x_i^2) - \sum_{i=1}^{100000} \log(b_i - a_i^T x)$$

- backtracking parameters $\alpha = 0.01$, $\beta = 0.5$
- a linearly convergent phase of about 13 iterations, followed by a quadratically convergent phase of 4 or 5 iterations
- convergence performance similar to the small examples

[Figure 9.23: Error versus iteration of Newton's method, for a problem in $\mathbb{R}^{10000}$. A backtracking line search with parameters $\alpha = 0.01$, $\beta = 0.5$ is used. Even for this large-scale problem, Newton's method requires only 18 iterations to achieve very high accuracy.]

Conclusions on Newton's method

strong advantages over the gradient and steepest descent methods:
- convergence of Newton's method is rapid in general, and quadratic near $x^\star$
- Newton's method is affine invariant: insensitive to the choice of coordinates, or to the condition number of the sublevel sets of $f$
- Newton's method scales well with problem size
- good performance of Newton's method is not dependent on the choice of algorithm parameters

main disadvantage: the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations
- in many cases it is possible to exploit problem structure to substantially reduce the cost of computing the Newton step

Self-concordance

shortcomings of the classical convergence analysis:
- it depends on the unknown constants $m$, $M$, $L$, so it is only conceptually useful
- the bound is not affinely invariant ($m$, $M$, $L$ change if the coordinates change), although Newton's method is

convergence analysis via self-concordance (Nesterov and Nemirovski):
- the analysis of Newton's method for self-concordant functions does not depend on any unknown constants
- it gives an affine-invariant bound
- self-concordant functions include many logarithmic barrier functions that play an important role in interior-point methods for solving convex optimization problems

Self-concordant functions: definition

- convex $f : \mathbb{R} \to \mathbb{R}$ is self-concordant if $|f'''(x)| \le 2f''(x)^{3/2}$ for all $x \in \mathrm{dom}\, f$
- $f : \mathbb{R}^n \to \mathbb{R}$ is self-concordant if it is self-concordant along every line in its domain, i.e., $g(t) = f(x + tv)$ is self-concordant for all $x \in \mathrm{dom}\, f$, $v \in \mathbb{R}^n$

examples on $\mathbb{R}$:
- linear functions (zero second and third derivatives)
- quadratic functions (zero third derivative and nonnegative second derivative)
- negative logarithm $f(x) = -\log x$
- negative entropy plus negative logarithm: $f(x) = x\log x - \log x$
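As a quick check of the definition for the negative logarithm, on $\mathrm{dom}\, f = \mathbb{R}_{++}$:
$$f(x) = -\log x: \quad f''(x) = \frac{1}{x^2}, \quad f'''(x) = -\frac{2}{x^3}, \quad |f'''(x)| = \frac{2}{x^3} = 2\left(\frac{1}{x^2}\right)^{3/2} = 2f''(x)^{3/2},$$
so the defining inequality holds with equality.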

Self-concordant functions: remarks

- the constant 2 is chosen for convenience, in order to simplify the formulas later on; any other positive constant could be used instead: if $f : \mathbb{R} \to \mathbb{R}$ satisfies $|f'''(x)| \le k f''(x)^{3/2}$, then $\tilde f(x) = (k^2/4)f(x)$ satisfies $|\tilde f'''(x)| \le 2\tilde f''(x)^{3/2}$
- what is important is that the third derivative of the function is bounded by some multiple of the 3/2-power of its second derivative
- self-concordance is affine invariant: if $f : \mathbb{R} \to \mathbb{R}$ is s.c., then $\tilde f(y) = f(ay + b)$ is s.c.
- the self-concordance condition limits the third derivative of a function in a way that is independent of affine coordinate changes

Self-concordant calculus

- preserved under positive scaling $\alpha \ge 1$ and sum:
  - if $f$ is s.c. and $a \ge 1$, then $af$ is s.c.
  - if $f_1$ and $f_2$ are s.c., then $f_1 + f_2$ is s.c.
- preserved under composition with an affine function: if $f : \mathbb{R}^n \to \mathbb{R}$ is s.c. and $A \in \mathbb{R}^{n\times m}$, $b \in \mathbb{R}^n$, then $f(Ax + b)$ is s.c.
- preserved under composition with logarithm: if $g : \mathbb{R} \to \mathbb{R}$ is convex with $\mathrm{dom}\, g = \mathbb{R}_{++}$ and $|g'''(x)| \le 3g''(x)/x$ for all $x$, then $f(x) = -\log(-g(x)) - \log x$ is s.c. on $\{x \mid x > 0,\ g(x) < 0\}$
  - if $|g'''(x)| \le 3g''(x)/x$ holds for $g$, then it also holds for $g(x) + ax^2 + bx + c$ with $a \ge 0$

examples: the following are s.c.
- $f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$ on $\{x \mid a_i^T x < b_i,\ i = 1,\ldots,m\}$
- $f(X) = -\log\det X$ on $S_{++}^n$
- $f(x) = -\log(y^2 - x^T x)$ on $\{(x,y) \mid \|x\|_2 < y\}$

Convergence analysis for self-concordant functions

assumption: $f : \mathbb{R}^n \to \mathbb{R}$ is s.c.

summary: there exist constants $\eta \in (0, 1/4]$, $\gamma > 0$ (namely $\eta = (1-2\alpha)/4$, $\gamma = \alpha\beta\frac{\eta^2}{1+\eta}$) such that
- if $\lambda(x^{(k)}) > \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$
- if $\lambda(x^{(k)}) \le \eta$, then $2\lambda(x^{(k+1)}) \le (2\lambda(x^{(k)}))^2$, implying
$$f(x^{(l)}) - p^\star \le \lambda(x^{(l)})^2 \le \left(\frac{1}{2}\right)^{2^{l-k+1}}, \quad l \ge k$$

complexity bound: the number of Newton iterations is bounded by
$$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2\log_2(1/\epsilon) = \frac{20 - 8\alpha}{\alpha\beta(1-2\alpha)^2}\left(f(x^{(0)}) - p^\star\right) + \log_2\log_2(1/\epsilon)$$
- depends only on the line search parameters $\alpha$, $\beta$ and the final accuracy $\epsilon$
- the second term is small and can be safely replaced with 6
- example: for $\alpha = 0.1$, $\beta = 0.8$, $\epsilon = 10^{-10}$, we have the bound $375(f(x^{(0)}) - p^\star) + 6$

Numerical example

150 randomly generated instances of $a_i$ and $b_i$ for
$$\min_x\ -\sum_{i=1}^m \log(b_i - a_i^T x)$$

[Figure 9.25: Number of Newton iterations required to minimize self-concordant functions versus $f(x^{(0)}) - p^\star$. The function $f$ has the form $f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$, where the problem data $a_i$ and $b_i$ are randomly generated. The circles show problems with $m = 100$, $n = 50$; the squares show problems with $m = 1000$, $n = 500$; and the diamonds show problems with $m = 1000$, $n = 50$. Fifty instances of each are shown.]

- the number of iterations is much smaller than the bound $375(f(x^{(0)}) - p^\star) + 6$
- a bound of the form $c(f(x^{(0)}) - p^\star) + 6$ with smaller $c$ is (empirically) valid

Implementation

main effort in each iteration: evaluate the derivatives and solve the Newton system
$$H\Delta x = -g, \quad \text{where } H = \nabla^2 f(x),\ g = \nabla f(x)$$
via Cholesky factorization
$$H = LL^T, \qquad \Delta x_{\mathrm{nt}} = -L^{-T}L^{-1}g, \qquad \lambda(x) = \|L^{-1}g\|_2$$
where $L$ is a lower triangular matrix
- cost $(1/3)n^3$ flops for an unstructured system
- cost $\ll (1/3)n^3$ if $H$ is sparse or banded
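A sketch of this computation in NumPy; np.linalg.solve is used for brevity (a dedicated triangular solver, e.g. scipy.linalg.solve_triangular, would exploit the triangular structure):

```python
import numpy as np

def newton_step_cholesky(H, g):
    """Return the Newton step and decrement via H = L L^T."""
    L = np.linalg.cholesky(H)        # lower triangular factor
    w = np.linalg.solve(L, g)        # w = L^{-1} g (forward substitution)
    lam = np.linalg.norm(w)          # lambda(x) = ||L^{-1} g||_2
    dx = -np.linalg.solve(L.T, w)    # dx = -L^{-T} w (back substitution)
    return dx, lam
```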

Example of a dense Newton system with structure

$$f(x) = \sum_{i=1}^n \psi_i(x_i) + \psi_0(Ax + b), \qquad H = D + A^T H_0 A$$
- assume $A \in \mathbb{R}^{p\times n}$, dense, with $p \ll n$
- $D$ diagonal with diagonal elements $\psi_i''(x_i)$; $H_0 = \nabla^2\psi_0(Ax + b)$

method 1: form $H$, solve via dense Cholesky factorization (cost $(1/3)n^3$)

method 2 (see the sketch below): factor $H_0 = L_0L_0^T$; write the Newton system as
$$D\Delta x + A^T L_0\omega = -g, \qquad L_0^T A\Delta x - \omega = 0$$
eliminate $\Delta x$ from the first equation; compute $\omega$ and $\Delta x$ from
$$(I + L_0^T A D^{-1} A^T L_0)\omega = -L_0^T A D^{-1} g, \qquad D\Delta x = -g - A^T L_0\omega$$
cost: $2p^2 n$ flops (dominated by the computation of $L_0^T A D^{-1} A^T L_0$)
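A sketch of method 2 in NumPy, assuming $D$ is passed as its diagonal; the only dense factorization and solve are of size $p \times p$:

```python
import numpy as np

def structured_newton_step(d, A, H0, g):
    """Newton step for H = diag(d) + A.T @ H0 @ A via block elimination
    (A is p-by-n with p << n)."""
    p = A.shape[0]
    L0 = np.linalg.cholesky(H0)                 # H0 = L0 L0^T
    B = A / d                                   # B = A D^{-1}
    S = np.eye(p) + L0.T @ (B @ A.T) @ L0       # I + L0^T A D^{-1} A^T L0
    w = np.linalg.solve(S, -L0.T @ (B @ g))     # small p-by-p solve
    dx = (-g - A.T @ (L0 @ w)) / d              # D dx = -g - A^T L0 w
    return dx
```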