Newton's Method. Ryan Tibshirani, Convex Optimization 10-725/36-725


Last time: dual correspondences

Given a function f : \mathbb{R}^n \to \mathbb{R}, we define its conjugate f^* : \mathbb{R}^n \to \mathbb{R},
    f^*(y) = \max_x \; y^T x - f(x)

Properties and examples:
- The conjugate f^* is always convex (regardless of convexity of f)
- When f is a quadratic in Q \succ 0, f^* is a quadratic in Q^{-1}
- When f is a norm, f^* is the indicator of the dual norm unit ball
- When f is closed and convex, x \in \partial f^*(y) \iff y \in \partial f(x)

Relationship to duality:
    Primal:  \min_x \; f(x) + g(x)
    Dual:    \max_u \; -f^*(u) - g^*(-u)
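As a quick check of the quadratic example above (a worked calculation added here, not on the slide): for f(x) = \frac{1}{2} x^T Q x with Q \succ 0, the maximizing x in the definition of f^*(y) solves y - Qx = 0, i.e., x = Q^{-1} y, so
    f^*(y) = y^T Q^{-1} y - \tfrac{1}{2} (Q^{-1} y)^T Q (Q^{-1} y) = \tfrac{1}{2} y^T Q^{-1} y,
a quadratic in Q^{-1}, as claimed.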

Newton's method

Given unconstrained, smooth convex optimization
    \min_x \; f(x)
where f is convex, twice differentiable, and dom(f) = \mathbb{R}^n. Recall that gradient descent chooses an initial x^{(0)} \in \mathbb{R}^n and repeats
    x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots
In comparison, Newton's method repeats
    x^{(k)} = x^{(k-1)} - \big( \nabla^2 f(x^{(k-1)}) \big)^{-1} \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots
Here \nabla^2 f(x^{(k-1)}) is the Hessian matrix of f at x^{(k-1)}.
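A minimal numpy sketch of the two updates, using the example function from the next slide; the starting point, the fixed step size t = 0.05 for gradient descent, and the iteration count are arbitrary choices for illustration, not taken from the slides:

    import numpy as np

    def f(x):
        return (10 * x[0]**2 + x[1]**2) / 2 + 5 * np.log(1 + np.exp(-x[0] - x[1]))

    def grad(x):
        s = 1 / (1 + np.exp(x[0] + x[1]))                   # = e^{-x1-x2} / (1 + e^{-x1-x2})
        return np.array([10 * x[0] - 5 * s, x[1] - 5 * s])

    def hess(x):
        s = 1 / (1 + np.exp(x[0] + x[1]))
        c = 5 * s * (1 - s)
        return np.array([[10 + c, c], [c, 1 + c]])

    x_gd = x_nt = np.array([10.0, 10.0])
    t = 0.05
    for k in range(20):
        x_gd = x_gd - t * grad(x_gd)                            # gradient descent step
        x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))   # pure Newton step
    print(f(x_gd), f(x_nt))           # Newton reaches a much smaller criterion value in the same number of steps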

Newton's method interpretation

Recall the motivation for the gradient descent step at x: we minimize the quadratic approximation
    f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2
over y, and this yields the update x^+ = x - t \nabla f(x).

Newton's method uses in a sense a better quadratic approximation
    f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x) (y - x)
and minimizes over y to yield x^+ = x - \big( \nabla^2 f(x) \big)^{-1} \nabla f(x).
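To make the minimization over y explicit (a one-line derivation added here for completeness): setting the gradient of the quadratic model to zero gives
    \nabla f(x) + \nabla^2 f(x) (y - x) = 0 \;\Longrightarrow\; y = x - \big( \nabla^2 f(x) \big)^{-1} \nabla f(x),
which is exactly the Newton update.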

For f(x) = (10 x_1^2 + x_2^2)/2 + 5 \log(1 + e^{-x_1 - x_2}), compare gradient descent (black) to Newton's method (blue), where both take steps of roughly the same length.

[Figure: contour plot of f over roughly [-20, 20] x [-20, 20], with the gradient descent and Newton iterates overlaid]

(For our example we needed to consider a nonquadratic... why?)

Outline

Today:
- Interpretations and properties
- Backtracking line search
- Convergence analysis
- Equality-constrained Newton
- Quasi-Newton methods

Linearized optimality condition

Alternative interpretation of the Newton step at x: we seek a direction v so that \nabla f(x + v) = 0. Now consider linearizing this optimality condition,
    0 = \nabla f(x + v) \approx \nabla f(x) + \nabla^2 f(x) v,
and solving for v, which again yields v = -\big( \nabla^2 f(x) \big)^{-1} \nabla f(x).

[Figure from B & V, page 486: f and its second-order approximation, with the points (x, f(x)) and (x + \Delta x_{nt}, f(x + \Delta x_{nt})) marked. For us, \Delta x_{nt} = v]

History: the work of Newton (1685) and Raphson (1690) originally focused on finding roots of polynomials. Simpson (1740) applied this idea to general nonlinear equations, and to minimization by setting the gradient to zero.
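The same linearize-and-solve idea in one dimension is the classical Newton-Raphson root finder; a tiny sketch (not from the slides, with an arbitrary example equation x^2 - 2 = 0):

    # Newton-Raphson for a scalar equation g(x) = 0: linearize g at x and solve,
    # giving the update x+ = x - g(x) / g'(x).
    g = lambda x: x**2 - 2          # root is sqrt(2)
    dg = lambda x: 2 * x

    x = 1.0
    for _ in range(6):
        x = x - g(x) / dg(x)
    print(x)                        # approximately 1.41421356, with quadratic convergence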

Affine invariance of Newton's method

Important property of Newton's method: affine invariance. Given f and nonsingular A \in \mathbb{R}^{n \times n}, let x = Ay and g(y) = f(Ay). Newton steps on g are
    y^+ = y - \big( \nabla^2 g(y) \big)^{-1} \nabla g(y)
        = y - \big( A^T \nabla^2 f(Ay) A \big)^{-1} A^T \nabla f(Ay)
        = y - A^{-1} \big( \nabla^2 f(Ay) \big)^{-1} \nabla f(Ay)
Hence
    A y^+ = Ay - \big( \nabla^2 f(Ay) \big)^{-1} \nabla f(Ay)
i.e.,
    x^+ = x - \big( \nabla^2 f(x) \big)^{-1} \nabla f(x)
So progress is independent of problem scaling; recall that this is not true of gradient descent.
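A small numerical check of this property (a sketch added here, not from the slides; the test function f(x) = \sum_i e^{x_i} + \|x\|_2^2 / 2 and the random A are assumptions):

    import numpy as np

    grad = lambda x: np.exp(x) + x                       # gradient of sum(exp(x)) + ||x||^2 / 2
    hess = lambda x: np.diag(np.exp(x)) + np.eye(len(x))

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3)) + 3 * np.eye(3)      # nonsingular for this seed
    y = rng.standard_normal(3)
    x = A @ y

    # Newton step on g(y) = f(Ay): grad g(y) = A^T grad f(Ay), hess g(y) = A^T hess f(Ay) A
    gy = A.T @ grad(A @ y)
    Hy = A.T @ hess(A @ y) @ A
    y_new = y - np.linalg.solve(Hy, gy)

    # Newton step on f directly
    x_new = x - np.linalg.solve(hess(x), grad(x))

    print(np.allclose(A @ y_new, x_new))                 # True: the two steps coincide under x = Ay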

Newton decrement

At a point x, we define the Newton decrement as
    \lambda(x) = \big( \nabla f(x)^T \big( \nabla^2 f(x) \big)^{-1} \nabla f(x) \big)^{1/2}
This relates to the difference between f(x) and the minimum of its quadratic approximation:
    f(x) - \min_y \Big( f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2} (y - x)^T \nabla^2 f(x) (y - x) \Big)
        = f(x) - \Big( f(x) - \tfrac{1}{2} \nabla f(x)^T \big( \nabla^2 f(x) \big)^{-1} \nabla f(x) \Big)
        = \tfrac{1}{2} \lambda(x)^2
Therefore we can think of \lambda(x)^2 / 2 as an approximate bound on the suboptimality gap f(x) - f^*.

Another interpretation of the Newton decrement: if the Newton direction is v = -\big( \nabla^2 f(x) \big)^{-1} \nabla f(x), then
    \lambda(x) = \big( v^T \nabla^2 f(x) v \big)^{1/2} = \|v\|_{\nabla^2 f(x)}
i.e., \lambda(x) is the length of the Newton step in the norm defined by the Hessian \nabla^2 f(x).

Note that the Newton decrement, like the Newton steps, is affine invariant; i.e., if we defined g(y) = f(Ay) for nonsingular A, then \lambda_g(y) would match \lambda_f(x) at x = Ay.
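A short numpy sketch computing the decrement both ways (added for illustration; the test function is the same assumed one as in the affine invariance check above):

    import numpy as np

    grad = lambda x: np.exp(x) + x
    hess = lambda x: np.diag(np.exp(x)) + np.eye(len(x))

    x = np.array([1.0, -0.5, 2.0])
    g, H = grad(x), hess(x)
    v = -np.linalg.solve(H, g)                      # Newton direction

    lam1 = np.sqrt(g @ np.linalg.solve(H, g))       # (grad^T (hess)^{-1} grad)^{1/2}
    lam2 = np.sqrt(v @ H @ v)                       # length of v in the Hessian norm
    print(lam1, lam2, lam1**2 / 2)                  # lam1 == lam2; lam^2 / 2 estimates the suboptimality gap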

Backtracking line search

We have seen pure Newton's method, which need not converge. In practice, we instead use damped Newton's method (i.e., Newton's method with a step size), which repeats
    x^+ = x - t \big( \nabla^2 f(x) \big)^{-1} \nabla f(x)
Note that the pure method uses t = 1.

Step sizes here are typically chosen by backtracking search, with parameters 0 < \alpha \le 1/2, 0 < \beta < 1. At each iteration, we start with t = 1, and while
    f(x + tv) > f(x) + \alpha t \nabla f(x)^T v
we shrink t = \beta t; else we perform the Newton update. Note that here v = -\big( \nabla^2 f(x) \big)^{-1} \nabla f(x), so \nabla f(x)^T v = -\lambda(x)^2.
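A minimal sketch of damped Newton with backtracking, following the recipe above; the parameter defaults, the stopping rule based on \lambda(x)^2 / 2, and the toy test function in the usage lines are assumptions for illustration:

    import numpy as np

    def newton_backtracking(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=50):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g, H = grad(x), hess(x)
            v = -np.linalg.solve(H, g)              # Newton direction
            lam_sq = -g @ v                         # lambda(x)^2 = -grad^T v
            if lam_sq / 2 <= tol:                   # decrement-based stopping rule
                return x
            t = 1.0
            while f(x + t * v) > f(x) + alpha * t * (g @ v):
                t *= beta                           # shrink until sufficient decrease holds
            x = x + t * v
        return x

    # Usage on an assumed toy function, f(x) = sum(exp(x)) + ||x||^2 / 2:
    f = lambda x: np.sum(np.exp(x)) + 0.5 * np.sum(x**2)
    grad = lambda x: np.exp(x) + x
    hess = lambda x: np.diag(np.exp(x)) + np.eye(len(x))
    print(newton_backtracking(f, grad, hess, np.zeros(3)))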

Example: logistic regression

Logistic regression example, with n = 500, p = 100: we compare gradient descent and Newton's method, both with backtracking.

[Figure: criterion gap f - f^* versus iteration k, on a log scale from about 1e+03 down to 1e-13, over roughly 70 iterations, for gradient descent and Newton's method]

Newton's method seems to have a different regime of convergence!
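A rough sketch of what the Newton iterations look like for logistic regression (not the code behind the slide's experiment; the simulated data, the plain Newton steps without line search, and the iteration count are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 500, 100
    X = rng.standard_normal((n, p))
    y = rng.integers(0, 2, n)

    beta = np.zeros(p)
    for _ in range(10):
        prob = 1 / (1 + np.exp(-X @ beta))              # fitted probabilities
        g = X.T @ (prob - y)                            # gradient of the negative log-likelihood
        W = prob * (1 - prob)
        H = X.T @ (X * W[:, None])                      # Hessian X^T diag(W) X
        beta = beta - np.linalg.solve(H, g)             # Newton step (backtracking omitted in this sketch)

Each step solves a p x p linear system, which is where the O(p^3) per-iteration cost mentioned a few slides below comes from.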

Convergence analysis

Assume that f is convex, twice differentiable, having dom(f) = \mathbb{R}^n, and additionally:
- \nabla f is Lipschitz with parameter L
- f is strongly convex with parameter m
- \nabla^2 f is Lipschitz with parameter M

Theorem: Newton's method with backtracking line search satisfies the following two-stage convergence bounds:
    f(x^{(k)}) - f^* \le
    \begin{cases}
      (f(x^{(0)}) - f^*) - \gamma k & \text{if } k \le k_0 \\
      \frac{2m^3}{M^2} \left( \frac{1}{2} \right)^{2^{k - k_0 + 1}} & \text{if } k > k_0
    \end{cases}
Here \gamma = \alpha \beta^2 \eta^2 m / L^2, \eta = \min\{1, 3(1 - 2\alpha)\} m^2 / M, and k_0 is the number of steps until \|\nabla f(x^{(k_0 + 1)})\|_2 < \eta.

In more detail, convergence analysis reveals \gamma > 0 and 0 < \eta \le m^2 / M such that convergence follows two stages:
- Damped phase: \|\nabla f(x^{(k)})\|_2 \ge \eta, and
    f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma
- Pure phase: \|\nabla f(x^{(k)})\|_2 < \eta, backtracking selects t = 1, and
    \frac{M}{2m^2} \|\nabla f(x^{(k+1)})\|_2 \le \left( \frac{M}{2m^2} \|\nabla f(x^{(k)})\|_2 \right)^2

Note that once we enter the pure phase, we won't leave, because when \eta \le m^2 / M,
    \frac{2m^2}{M} \left( \frac{M}{2m^2} \eta \right)^2 < \eta

Unraveling this result, what does it say? To reach f(x^{(k)}) - f^* \le \epsilon, we need at most
    \frac{f(x^{(0)}) - f^*}{\gamma} + \log\log(\epsilon_0 / \epsilon)
iterations, where \epsilon_0 = 2m^3 / M^2.

This is called quadratic convergence. Compare this to linear convergence (which, recall, is what gradient descent achieves under strong convexity).

The above result is a local convergence rate, i.e., we are only guaranteed quadratic convergence after some number of steps k_0, where k_0 \le \frac{f(x^{(0)}) - f^*}{\gamma}.

Somewhat bothersome may be the fact that the above bound depends on L, m, M, and yet the algorithm itself does not.
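To get a feel for the \log\log term (an illustrative calculation, not from the slides): if \epsilon_0 / \epsilon = 10^{10}, then \log\log(\epsilon_0 / \epsilon) = \log(10 \log 10) \approx \log(23.0) \approx 3.1, so the pure phase contributes only a handful of iterations for any practical target accuracy.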

Self-concordance

A scale-free analysis is possible for self-concordant functions: on \mathbb{R}, a convex function f is called self-concordant if
    |f'''(x)| \le 2 f''(x)^{3/2} \quad \text{for all } x
and on \mathbb{R}^n it is called self-concordant if its restriction to every line segment is so. E.g., f(x) = -\log(x) is self-concordant.

Theorem (Nesterov and Nemirovskii): Newton's method with backtracking line search requires at most
    C(\alpha, \beta) \big( f(x^{(0)}) - f^* \big) + \log\log(1/\epsilon)
iterations to reach f(x^{(k)}) - f^* \le \epsilon, where C(\alpha, \beta) is a constant that only depends on \alpha, \beta.
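A quick worked check of the example (added here, not on the slide): for f(x) = -\log(x) on x > 0,
    f''(x) = \frac{1}{x^2}, \qquad f'''(x) = -\frac{2}{x^3}, \qquad
    |f'''(x)| = \frac{2}{x^3} = 2 \left( \frac{1}{x^2} \right)^{3/2} = 2 f''(x)^{3/2},
so the defining inequality holds (with equality) at every x.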

Comparison to first-order methods

At a high level:
- Memory: each iteration of Newton's method requires O(n^2) storage (the n \times n Hessian); each gradient iteration requires O(n) storage (the n-dimensional gradient)
- Computation: each Newton iteration requires O(n^3) flops (solving a dense n \times n linear system); each gradient iteration requires O(n) flops (scaling/adding n-dimensional vectors)
- Backtracking: backtracking line search has roughly the same cost in both cases, using O(n) flops per inner backtracking step
- Conditioning: Newton's method is not affected by a problem's conditioning, but gradient descent can seriously degrade
- Fragility: Newton's method may be empirically more sensitive to bugs/numerical errors; gradient descent is more robust

Back to the logistic regression example: now the x-axis is parametrized in terms of time taken per iteration.

[Figure: criterion gap f - f^* versus wall-clock time in seconds (about 0 to 0.25), for gradient descent and Newton's method]

Each gradient descent step is O(p), but each Newton step is O(p^3).

Sparse, structured problems

When the inner linear systems (in the Hessian) can be solved efficiently and reliably, Newton's method can thrive.

E.g., if \nabla^2 f(x) is sparse and structured for all x, say banded, then both memory and computation are O(n) with Newton iterations.

What functions admit a structured Hessian? Two examples:
- If g(\beta) = f(X\beta), then \nabla^2 g(\beta) = X^T \nabla^2 f(X\beta) X. Hence if X is a structured predictor matrix and \nabla^2 f is diagonal, then \nabla^2 g is structured
- If we seek to minimize f(\beta) + g(D\beta), where \nabla^2 f is diagonal, g is not smooth, and D is a structured penalty matrix, then the Lagrange dual function is -f^*(-D^T u) - g^*(u). Often D \nabla^2 f^*(-D^T u) D^T can be structured
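A small scipy sketch of the banded case (added for illustration; the tridiagonal SPD matrix below is an assumed stand-in for a structured Hessian, not one computed from a real problem):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import spsolve

    n = 100_000
    main = 4.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    H = sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csc")   # tridiagonal "Hessian"
    g = np.random.default_rng(0).standard_normal(n)

    v = spsolve(H, -g)        # Newton direction from a sparse solve; memory and time scale roughly linearly in n
    print(v[:3])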

Equality-constrained Newton's method

Consider now a problem with equality constraints, as in
    \min_x \; f(x) \quad \text{subject to} \quad Ax = b
Several options:
- Eliminating equality constraints: write x = Fy + x_0, where F spans the null space of A, and Ax_0 = b. Solve in terms of y
- Deriving the dual: we can check that the Lagrange dual function is -f^*(-A^T v) - b^T v, and strong duality holds. With luck, we can express x^* in terms of v^*
- Equality-constrained Newton: in many cases, this is the most straightforward option

In equality-constrained Newton's method, we start with x^{(0)} such that Ax^{(0)} = b. Then we repeat the updates x^+ = x + tv, where
    v = \operatorname{argmin}_{Az = 0} \; \nabla f(x)^T z + \frac{1}{2} z^T \nabla^2 f(x) z
This keeps x^+ in the feasible set, since Ax^+ = Ax + tAv = b + 0 = b.

Furthermore, v is the solution to minimizing a quadratic subject to equality constraints. We know from the KKT conditions that v satisfies
    \begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
    \begin{bmatrix} v \\ w \end{bmatrix}
    =
    \begin{bmatrix} -\nabla f(x) \\ 0 \end{bmatrix}
for some w. Hence the Newton direction v is again given by solving a linear system in the Hessian (albeit a bigger one).
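A minimal sketch of solving this KKT system with a dense solver (an illustration, not necessarily the recommended block-elimination approach; it assumes \nabla^2 f(x) \succ 0 and A with full row rank, so that the KKT matrix is nonsingular):

    import numpy as np

    def eq_newton_direction(g, H, A):
        """Return the equality-constrained Newton direction v from the KKT system."""
        n, m = H.shape[0], A.shape[0]
        K = np.block([[H, A.T],
                      [A, np.zeros((m, m))]])
        rhs = np.concatenate([-g, np.zeros(m)])
        sol = np.linalg.solve(K, rhs)
        return sol[:n]                      # sol[n:] holds the multiplier w

    # Usage (hypothetical inputs): start from a feasible x with Ax = b, then iterate
    # x = x + t * eq_newton_direction(grad(x), hess(x), A), with t from backtracking.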

Quasi-Newton methods

If the Hessian is too expensive (or singular), then a quasi-Newton method can be used to approximate \nabla^2 f(x) with H \succ 0, and we update according to
    x^+ = x - t H^{-1} \nabla f(x)
The approximate Hessian H is recomputed at each step. The goal is to make H^{-1} cheap to apply (possibly, with cheap storage too).

Convergence is fast: superlinear, but not the same as Newton's. Roughly n steps of quasi-Newton make the same progress as one Newton step.

There is a very wide variety of quasi-Newton methods; a common theme is to propagate the computation of H across iterations.

Davidon-Fletcher-Powell or DFP:
- Updates H, H^{-1} via rank-2 updates from previous iterations; cost is O(n^2) for these updates
- Since it is being stored, applying H^{-1} is simply O(n^2) flops
- Can be motivated by a Taylor series expansion

Broyden-Fletcher-Goldfarb-Shanno or BFGS:
- Came after DFP, but BFGS is now much more widely used
- Again updates H, H^{-1} via rank-2 updates, but does so in a dual fashion to DFP; cost is still O(n^2)
- Also has a limited-memory version, L-BFGS: instead of letting updates propagate over all iterations, it only keeps updates from the last m iterations; storage is now O(mn) instead of O(n^2)
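A sketch of one BFGS rank-2 update of the inverse Hessian approximation, in its standard form (the slide does not spell out the formula, so treat the exact expression as supplementary):

    import numpy as np

    def bfgs_inverse_update(Hinv, s, y):
        """One BFGS update of the inverse Hessian approximation.
        s = x_new - x_old, y = grad(x_new) - grad(x_old); requires the curvature condition y^T s > 0."""
        rho = 1.0 / (y @ s)
        I = np.eye(len(s))
        return (I - rho * np.outer(s, y)) @ Hinv @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

Both performing this update and applying the stored inverse are O(n^2) operations, matching the counts above.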

References and further reading

- S. Boyd and L. Vandenberghe (2004), "Convex optimization", Chapters 9 and 10
- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2
- Y. Nesterov and A. Nemirovskii (1994), "Interior-point polynomial methods in convex programming", Chapter 2
- J. Nocedal and S. Wright (2006), "Numerical optimization", Chapters 6 and 7
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012