Newton's Method
Javier Peña
Convex Optimization 10-725/36-725

Last time: dual correspondences

Given a function $f : \mathbb{R}^n \to \mathbb{R}$, we define its conjugate $f^* : \mathbb{R}^n \to \mathbb{R}$,
$$f^*(y) = \max_x \big( y^T x - f(x) \big).$$

Properties and examples:
- The conjugate $f^*$ is always convex (regardless of convexity of $f$)
- When $f$ is a quadratic in $Q \succ 0$, $f^*$ is a quadratic in $Q^{-1}$
- When $f$ is a norm, $f^*$ is the indicator of the dual norm unit ball
- When $f$ is closed and convex, $x \in \partial f^*(y) \iff y \in \partial f(x)$

Fenchel duality:
$$\text{Primal:} \;\; \min_x \; f(x) + g(x) \qquad\qquad \text{Dual:} \;\; \max_u \; -f^*(u) - g^*(-u)$$

Newton's method

Consider the unconstrained, smooth convex optimization problem
$$\min_x \; f(x)$$
where $f$ is convex, twice differentiable, and $\mathrm{dom}(f) = \mathbb{R}^n$.

Newton's method: choose initial $x^{(0)} \in \mathbb{R}^n$, and
$$x^{(k)} = x^{(k-1)} - \big( \nabla^2 f(x^{(k-1)}) \big)^{-1} \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Here $\nabla^2 f(x^{(k-1)})$ is the Hessian matrix of $f$ at $x^{(k-1)}$.

Compare to gradient descent: choose initial $x^{(0)} \in \mathbb{R}^n$, and
$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
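As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the pure Newton iteration; the callables `f_grad` and `f_hess`, returning $\nabla f$ and $\nabla^2 f$, are hypothetical user-supplied functions:

```python
import numpy as np

def newton_method(f_grad, f_hess, x0, max_iter=50, tol=1e-10):
    """Pure Newton's method (sketch): x+ = x - (Hessian)^{-1} gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = f_grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the linear system rather than forming the Hessian inverse explicitly
        x = x - np.linalg.solve(f_hess(x), g)
    return x
```

For gradient descent, the update line would instead be `x = x - t * g` for a step size `t`.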

Newton's method interpretation

The Newton step
$$x^+ = x - \big( \nabla^2 f(x) \big)^{-1} \nabla f(x)$$
can be obtained by minimizing over $y$ the quadratic approximation
$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x) (y - x).$$

By contrast, the gradient descent step $x^+ = x - t \nabla f(x)$ can be obtained by minimizing over $y$ the quadratic approximation
$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2.$$
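To make this explicit (a short derivation, consistent with the slide): setting the gradient of the first quadratic model with respect to $y$ to zero gives
$$\nabla f(x) + \nabla^2 f(x)(y - x) = 0 \quad \Longrightarrow \quad y = x - \big( \nabla^2 f(x) \big)^{-1} \nabla f(x),$$
which is exactly the Newton step; the same calculation on the second model gives $\nabla f(x) + \tfrac{1}{t}(y - x) = 0$, i.e., $y = x - t \nabla f(x)$.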

For $f(x) = (10 x_1^2 + x_2^2)/2 + 5 \log(1 + e^{-x_1 - x_2})$, compare gradient descent (black) to Newton's method (blue), where both take steps of roughly the same length.

[Figure: iterates of both methods plotted on the contours of $f$ over roughly $[-20, 20] \times [-20, 20]$.]

(For our example we needed to consider a nonquadratic... why?)

Outline

Today:
- Interpretations and properties
- Convergence analysis
- Backtracking line search
- Equality-constrained Newton
- Quasi-Newton methods

Newton's method for root finding

Assume $F : \mathbb{R}^n \to \mathbb{R}^n$ is a differentiable vector-valued function and consider the system of equations $F(x) = 0$.

Newton's method for root finding: choose initial $x^{(0)} \in \mathbb{R}^n$, and
$$x^{(k)} = x^{(k-1)} - F'(x^{(k-1)})^{-1} F(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Here $F'(x^{(k-1)})$ is the Jacobian matrix of $F$ at $x^{(k-1)}$.

The Newton step $x^+ = x - F'(x)^{-1} F(x)$ can be obtained by solving for $y$ the linear approximation
$$F(y) \approx F(x) + F'(x)(y - x) = 0.$$

Example: let $F : \mathbb{R} \to \mathbb{R}$ be defined as $F(x) = x^2 - 2$. Newton's method starting from $x^{(0)} = 1$:

    k                0        1        2         3             4
    x^(k)            1        1.5      1.4166    1.414216      1.4142135
    |x^(k) - √2|     0.4142   0.0858   0.0024    2.1 × 10^-6   1.6 × 10^-12

Newton's method for the optimization problem $\min_x f(x)$ is the same as Newton's method for finding a root of $\nabla f(x) = 0$.

History: the work of Newton (1685) and Raphson (1690) originally focused on finding roots of polynomials. Simpson (1740) applied this idea to general nonlinear equations and minimization.
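A small Python sketch (not from the slides) that reproduces the table above for $F(x) = x^2 - 2$:

```python
def newton_root(F, dF, x0, n_iter=4):
    """Scalar Newton root finding: x+ = x - F(x)/F'(x)."""
    xs = [x0]
    for _ in range(n_iter):
        xs.append(xs[-1] - F(xs[-1]) / dF(xs[-1]))
    return xs

# F(x) = x^2 - 2 starting from x0 = 1; the iterates converge to sqrt(2)
for k, x in enumerate(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, 1.0)):
    print(k, x, abs(x - 2**0.5))
```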

Affine invariance of Newton's method

Important property of Newton's method: affine invariance. Assume $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable and $A \in \mathbb{R}^{n \times n}$ is nonsingular. Let $g(y) := f(Ay)$. The Newton step for $g$ starting from $y$ is
$$y^+ = y - \big( \nabla^2 g(y) \big)^{-1} \nabla g(y).$$
It turns out that the Newton step for $f$ starting from $x = Ay$ is $x^+ = Ay^+$. Therefore progress is independent of problem scaling. By contrast, this is not true of gradient descent.
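A quick numerical check of this property (a sketch; the function $f(x) = e^{x_1 + x_2} + x_1^2 + 2x_2^2$ and the matrix $A$ below are hypothetical choices made just for illustration):

```python
import numpy as np

def grad_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2 * x[0], e + 4 * x[1]])

def hess_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 4.0]])

A = np.array([[2.0, 1.0], [0.5, 3.0]])   # nonsingular reparametrization
y = np.array([0.3, -0.2])
x = A @ y

# Newton step for f at x = Ay
x_plus = x - np.linalg.solve(hess_f(x), grad_f(x))

# Newton step for g(y) = f(Ay): grad g = A^T grad f(Ay), hess g = A^T hess f(Ay) A
y_plus = y - np.linalg.solve(A.T @ hess_f(A @ y) @ A, A.T @ grad_f(A @ y))

print(np.allclose(x_plus, A @ y_plus))   # True: the two steps coincide
```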

Local convergence

Theorem: Assume $F : \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable and $x^* \in \mathbb{R}^n$ is a root of $F$, that is, $F(x^*) = 0$, such that $F'(x^*)$ is nonsingular. Then:

(a) There exists $\delta > 0$ such that if $\|x^{(0)} - x^*\| < \delta$ then Newton's method is well defined and
$$\lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|} = 0.$$

(b) If $F'$ is Lipschitz continuous in a neighborhood of $x^*$ then there exists $K > 0$ such that
$$\|x^{(k+1)} - x^*\| \le K \|x^{(k)} - x^*\|^2.$$

Newton decrement

Consider again Newton's method for the optimization problem $\min_x f(x)$. At a point $x$, define the Newton decrement as
$$\lambda(x) = \Big( \nabla f(x)^T \big( \nabla^2 f(x) \big)^{-1} \nabla f(x) \Big)^{1/2}.$$

This relates to the difference between $f(x)$ and the minimum of its quadratic approximation:
$$f(x) - \min_y \Big( f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x) (y - x) \Big)
= \frac{1}{2} \nabla f(x)^T \big( \nabla^2 f(x) \big)^{-1} \nabla f(x) = \frac{1}{2} \lambda(x)^2.$$

Therefore we can think of $\lambda(x)^2 / 2$ as an approximate bound on the suboptimality gap $f(x) - f^*$.
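In code, the decrement comes from the same linear solve used for the Newton direction (a minimal sketch; `grad` and `hess` are assumed to hold $\nabla f(x)$ and $\nabla^2 f(x)$ at the current point):

```python
import numpy as np

def newton_decrement(grad, hess):
    """lambda(x) = (grad^T hess^{-1} grad)^{1/2}."""
    return np.sqrt(grad @ np.linalg.solve(hess, grad))

# A common stopping rule: stop when newton_decrement(g, H)**2 / 2 <= eps,
# using lambda(x)^2 / 2 as an estimate of the suboptimality gap.
```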

Another interpretation of the Newton decrement: if the Newton direction is $v = -\big( \nabla^2 f(x) \big)^{-1} \nabla f(x)$, then
$$\lambda(x) = \big( v^T \nabla^2 f(x) v \big)^{1/2} = \|v\|_{\nabla^2 f(x)},$$
i.e., $\lambda(x)$ is the length of the Newton step in the norm defined by the Hessian $\nabla^2 f(x)$.

Note that the Newton decrement, like the Newton steps, is affine invariant; i.e., if we define $g(y) = f(Ay)$ for nonsingular $A$, then $\lambda_g(y)$ matches $\lambda_f(x)$ at $x = Ay$.

Backtracking line search

We have seen pure Newton's method, which need not converge. In practice, we instead use damped Newton's method (which is what is usually meant by "Newton's method"), which repeats
$$x^+ = x - t \big( \nabla^2 f(x) \big)^{-1} \nabla f(x).$$
Note that the pure method uses $t = 1$.

Step sizes here are typically chosen by backtracking search, with parameters $0 < \alpha \le 1/2$, $0 < \beta < 1$. At each iteration, we start with $t = 1$, and while
$$f(x + tv) > f(x) + \alpha t \nabla f(x)^T v$$
we shrink $t = \beta t$; else we perform the Newton update. Note that here $v = -\big( \nabla^2 f(x) \big)^{-1} \nabla f(x)$, so $\nabla f(x)^T v = -\lambda^2(x)$.
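A sketch of damped Newton's method with backtracking, following the recipe above (the callables `f`, `grad`, `hess` are assumed to be user-supplied; `alpha`, `beta` play the roles of $\alpha$, $\beta$):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=100):
    """Newton's method with backtracking line search (sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        v = -np.linalg.solve(hess(x), g)   # Newton direction
        lam_sq = -(g @ v)                  # lambda(x)^2, since grad^T v = -lambda^2
        if lam_sq / 2 <= tol:              # stop using the Newton decrement
            break
        t = 1.0                            # start from the pure Newton step
        while f(x + t * v) > f(x) + alpha * t * (g @ v):
            t *= beta
        x = x + t * v
    return x
```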

Example: logistic regression

Logistic regression example, with $n = 500$, $p = 100$: we compare gradient descent and Newton's method, both with backtracking.

[Figure: $f - f^*$ (log scale, roughly $10^{-13}$ to $10^3$) versus iteration $k$, for gradient descent and Newton's method.]

Newton's method has a different regime of convergence.
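For reference (not the data or code behind the plot), a sketch of the logistic regression objective, gradient, and Hessian on synthetic data of the same size; this could be fed to the `damped_newton` sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(beta):                  # negative log-likelihood
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(beta):
    return X.T @ (sigmoid(X @ beta) - y)

def hess(beta):               # X^T W X with W = diag(p_i (1 - p_i))
    w = sigmoid(X @ beta)
    return X.T @ ((w * (1.0 - w))[:, None] * X)

# beta_hat = damped_newton(f, grad, hess, np.zeros(p))
```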

Convergence analysis

Assume that $f$ is convex, twice differentiable, with $\mathrm{dom}(f) = \mathbb{R}^n$, and additionally:
- $\nabla f$ is Lipschitz with parameter $L$
- $f$ is strongly convex with parameter $m$
- $\nabla^2 f$ is Lipschitz with parameter $M$

Theorem: Newton's method with backtracking line search satisfies the following two-stage convergence bounds:
$$f(x^{(k)}) - f^* \le
\begin{cases}
\big( f(x^{(0)}) - f^* \big) - \gamma k & \text{if } k \le k_0 \\[4pt]
\dfrac{2m^3}{M^2} \left( \dfrac{1}{2} \right)^{2^{k - k_0 + 1}} & \text{if } k > k_0
\end{cases}$$
Here $\gamma = \alpha \beta^2 \eta^2 m / L^2$, $\eta = \min\{1, 3(1 - 2\alpha)\} m^2 / M$, and $k_0$ is the number of steps until $\|\nabla f(x^{(k_0+1)})\|_2 < \eta$.

In more detail, the convergence analysis reveals $\gamma > 0$ and $0 < \eta \le m^2 / M$ such that convergence follows two stages:

Damped phase: $\|\nabla f(x^{(k)})\|_2 \ge \eta$, and
$$f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma.$$

Pure phase: $\|\nabla f(x^{(k)})\|_2 < \eta$, backtracking selects $t = 1$, and
$$\frac{M}{2m^2} \|\nabla f(x^{(k+1)})\|_2 \le \left( \frac{M}{2m^2} \|\nabla f(x^{(k)})\|_2 \right)^2.$$

Note that once we enter the pure phase, we won't leave, because when $\eta \le m^2 / M$,
$$\frac{2m^2}{M} \left( \frac{M}{2m^2} \eta \right)^2 < \eta.$$

To reach $f(x^{(k)}) - f^* \le \epsilon$, we need at most
$$\frac{f(x^{(0)}) - f^*}{\gamma} + \log\log(\epsilon_0 / \epsilon)$$
iterations, where $\epsilon_0 = 2m^3 / M^2$.

This is called quadratic convergence. Compare this to linear convergence (which, recall, is what gradient descent achieves under strong convexity).

The above result is a local convergence rate, i.e., we are only guaranteed quadratic convergence after some number of steps $k_0$, where $k_0 \le \dfrac{f(x^{(0)}) - f^*}{\gamma}$.

Somewhat bothersome may be the fact that the above bound depends on $L$, $m$, $M$, and yet the algorithm itself does not.
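Regarding the $\log\log(\epsilon_0 / \epsilon)$ term above, a quick calculation (not from the slides) shows how mild it is: even for $\epsilon_0 / \epsilon = 10^{10}$,
$$\log\log(\epsilon_0 / \epsilon) = \log(10 \ln 10) \approx \log(23.0) \approx 3.1,$$
so once the quadratically convergent phase begins, only a handful of additional iterations are needed regardless of the target accuracy.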

Self-concordance

A scale-free analysis is possible for self-concordant functions: on an open interval of $\mathbb{R}$, a convex function $f$ is called self-concordant if
$$|f'''(t)| \le 2 f''(t)^{3/2} \quad \text{for all } t.$$
On an open convex domain of $\mathbb{R}^n$, a function is self-concordant if its restriction to every line in its domain is self-concordant.

Examples of self-concordant functions:
- $f : \mathbb{R}^n_{++} \to \mathbb{R}$ defined by $f(x) = -\sum_{j=1}^n \log(x_j)$
- $f : \mathbb{S}^n_{++} \to \mathbb{R}$ defined by $f(X) = -\log\det(X)$
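As a worked check of the definition above (not on the slide), take the one-dimensional building block of the first example, $f(t) = -\log t$ on $t > 0$:
$$f''(t) = \frac{1}{t^2}, \qquad f'''(t) = -\frac{2}{t^3}, \qquad
|f'''(t)| = \frac{2}{t^3} = 2 \left( \frac{1}{t^2} \right)^{3/2} = 2 f''(t)^{3/2},$$
so the defining inequality holds with equality. The log barrier $-\sum_j \log(x_j)$ is then self-concordant because its restriction to any line is a sum of affinely reparametrized copies of $-\log$, and self-concordance is preserved under affine composition and addition.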

Convergence analysis for self-concordant functions

Property of self-concordance: if $g$ is self-concordant and $A$, $b$ are of appropriate dimensions, then
$$f(x) := g(Ax - b)$$
is self-concordant.

Theorem (Nesterov and Nemirovskii): Newton's method with backtracking line search requires at most
$$C(\alpha, \beta) \big( f(x^{(0)}) - f^* \big) + \log\log(1/\epsilon)$$
iterations to reach $f(x^{(k)}) - f^* \le \epsilon$, where $C(\alpha, \beta)$ is a constant that depends only on $\alpha, \beta$.

Comparison to first-order methods

At a high level:
- Memory: each iteration of Newton's method requires $O(n^2)$ storage (the $n \times n$ Hessian); each gradient iteration requires $O(n)$ storage (the $n$-dimensional gradient)
- Computation: each Newton iteration requires $O(n^3)$ flops (solving a dense $n \times n$ linear system); each gradient iteration requires $O(n)$ flops (scaling/adding $n$-dimensional vectors)
- Backtracking: backtracking line search has roughly the same cost in both methods; each inner backtracking step uses $O(n)$ flops
- Conditioning: Newton's method is not affected by a problem's conditioning, but gradient descent can seriously degrade
- Fragility: Newton's method may be empirically more sensitive to bugs/numerical errors; gradient descent is more robust

Back to the logistic regression example: now the x-axis is parametrized in terms of time taken per iteration.

[Figure: $f - f^*$ (log scale) versus elapsed time in seconds, for gradient descent and Newton's method.]

Each gradient descent step is $O(p)$, but each Newton step is $O(p^3)$.

Sparse, structured problems

When the inner linear systems (in the Hessian) can be solved efficiently and reliably, Newton's method can thrive. E.g., if $\nabla^2 f(x)$ is sparse and structured for all $x$, say banded, then both memory and computation are $O(n)$ per Newton iteration (see the sketch below).

What functions admit a structured Hessian? Two examples:
- If $g(\beta) = f(X\beta)$, then $\nabla^2 g(\beta) = X^T \nabla^2 f(X\beta) X$. Hence if $X$ is a structured predictor matrix and $\nabla^2 f$ is diagonal, then $\nabla^2 g$ is structured.
- If we seek to minimize $f(\beta) + g(D\beta)$, where $\nabla^2 f$ is diagonal, $g$ is not smooth, and $D$ is a structured penalty matrix, then the Lagrange dual function is $-f^*(-D^T u) - g^*(u)$. Often $D \nabla^2 f^*(-D^T u) D^T$ can be structured.
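A sketch of the banded case mentioned above, using SciPy (the tridiagonal Hessian here is hypothetical, e.g. from a chain-structured problem, just to show that the Newton direction costs $O(n)$):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n = 100_000
# Hypothetical tridiagonal (banded) Hessian
H = sparse.diags(
    [-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
    offsets=[-1, 0, 1], format="csc",
)
g = np.random.default_rng(0).standard_normal(n)

# Newton direction: a sparse banded solve is O(n), versus O(n^3) for a dense solve
v = -spsolve(H, g)
```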

Equality-constrained Newton's method

Consider now a problem with equality constraints, as in
$$\min_x \; f(x) \quad \text{subject to} \quad Ax = b.$$

Several options:
- Eliminating equality constraints (reduced-space approach): write $x = Fy + x_0$, where $F$ spans the null space of $A$ and $Ax_0 = b$, and solve in terms of $y$.
- Equality-constrained Newton: in many cases this is the most straightforward option.
- Deriving the dual: the Fenchel dual is $-f^*(-A^T v) - b^T v$, and strong duality holds. More on this later.

Equality-constrained Newton's method

Start with $x^{(0)} \in \mathbb{R}^n$. Then we repeat the update
$$x^+ = x + tv, \quad \text{where} \quad
v = \operatorname*{argmin}_{z \,:\, A(x+z) = b} \Big( f(x) + \nabla f(x)^T z + \frac{1}{2} z^T \nabla^2 f(x) z \Big).$$

From the KKT conditions it follows that, for some $w$, $v$ satisfies
$$\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} v \\ w \end{bmatrix}
= -\begin{bmatrix} \nabla f(x) \\ Ax - b \end{bmatrix}.$$

The latter is precisely the root-finding Newton step for the KKT conditions of the original equality-constrained problem, namely
$$\begin{bmatrix} \nabla f(x) + A^T y \\ Ax - b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
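A sketch of one equality-constrained Newton direction via the KKT system above (dense `H = hess(x)`, `g = grad(x)` are assumed for simplicity; in practice one would exploit structure in the block system):

```python
import numpy as np

def eq_newton_direction(H, g, A, x, b):
    """Solve [[H, A^T], [A, 0]] [v; w] = [-g; b - A x] and return v."""
    m, n = A.shape
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, b - A @ x])
    return np.linalg.solve(K, rhs)[:n]
```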

Quasi-Newton methods

If the Hessian is too expensive (or singular), then a quasi-Newton method can be used to approximate $\nabla^2 f(x)$ with $H \succ 0$, and we update according to
$$x^+ = x - t H^{-1} \nabla f(x).$$

- The approximate Hessian $H$ is recomputed at each step. The goal is to make $H^{-1}$ cheap to apply (and possibly cheap to store too).
- Convergence is fast: superlinear, but not the same as Newton's method. Roughly $n$ steps of quasi-Newton make the same progress as one Newton step.
- There is a very wide variety of quasi-Newton methods; a common theme is to propagate the computation of $H$ across iterations.

More on this class of methods later.
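One widely used member of this family is BFGS, which the course returns to later; as a sketch, its rank-two update of the Hessian approximation $H$ looks like this:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the Hessian approximation (sketch).

    s = x_new - x_old, y = grad(x_new) - grad(x_old); assumes y @ s > 0,
    so the update preserves positive definiteness.
    """
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)
```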

References and further reading

- S. Boyd and L. Vandenberghe (2004), Convex Optimization, Chapters 9 and 10
- O. Güler (2010), Foundations of Optimization, Chapter 14
- Y. Nesterov (1998), Introductory Lectures on Convex Optimization: A Basic Course, Chapter 2
- Y. Nesterov and A. Nemirovskii (1994), Interior-Point Polynomial Algorithms in Convex Programming, Chapter 2
- J. Nocedal and S. Wright (2006), Numerical Optimization, Chapters 6 and 7
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012