arxiv: v1 [math.oc] 22 May 2018

Similar documents
arxiv: v2 [math.oc] 31 Jan 2019

5 Handling Constraints

Algorithms for Constrained Optimization

Optimization Tutorial 1. Basic Gradient Descent

Lecture Notes: Geometric Considerations in Unconstrained Optimization

An extrinsic look at the Riemannian Hessian

Constrained optimization: direct methods (cont.)

The Squared Slacks Transformation in Nonlinear Programming

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

arxiv: v1 [math.oc] 10 Apr 2017

Infeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization

Chapter 3 Numerical Methods

8 Numerical methods for unconstrained problems

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Algorithms for constrained local optimization

Line Search Methods for Unconstrained Optimisation

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity

Lecture 5 : Projections

Introduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems

Max-Planck-Institut für Mathematik in den Naturwissenschaften Leipzig

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Convex Optimization. Problem set 2. Due Monday April 26th

Optimality, Duality, Complementarity for Constrained Optimization

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

Constrained Optimization Theory

11 a 12 a 21 a 11 a 22 a 12 a 21. (C.11) A = The determinant of a product of two matrices is given by AB = A B 1 1 = (C.13) and similarly.

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization

SF2822 Applied Nonlinear Optimization. Preparatory question. Lecture 9: Sequential quadratic programming. Anders Forsgren

arxiv: v1 [math.oc] 1 Jul 2016

Nonlinear Optimization for Optimal Control

The Newton Bracketing method for the minimization of convex functions subject to affine constraints

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

Maria Cameron. f(x) = 1 n

Numerical optimization

LINEAR AND NONLINEAR PROGRAMMING

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization CMU-10725

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)

Constrained Optimization

4TE3/6TE3. Algorithms for. Continuous Optimization

AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING

Primal-dual relationship between Levenberg-Marquardt and central trajectories for linearly constrained convex optimization

Manifold Optimization Over the Set of Doubly Stochastic Matrices: A Second-Order Geometry

ORIE 6326: Convex Optimization. Quasi-Newton Methods

Introduction to Nonlinear Optimization Paul J. Atzberger

6.252 NONLINEAR PROGRAMMING LECTURE 10 ALTERNATIVES TO GRADIENT PROJECTION LECTURE OUTLINE. Three Alternatives/Remedies for Gradient Projection

1 Computing with constraints

Stochastic Optimization Algorithms Beyond SG

Journal of Convex Analysis Vol. 14, No. 2, March 2007 AN EXPLICIT DESCENT METHOD FOR BILEVEL CONVEX OPTIMIZATION. Mikhail Solodov. September 12, 2005

A Trust-region-based Sequential Quadratic Programming Algorithm

On Lagrange multipliers of trust-region subproblems

MS&E 318 (CME 338) Large-Scale Numerical Optimization

On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the iterate convergence of descent methods for convex optimization

Complexity Analysis of Interior Point Algorithms for Non-Lipschitz and Nonconvex Minimization

Multidisciplinary System Design Optimization (MSDO)

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Optimization Problems with Constraints - introduction to theory, numerical Methods and applications

Sequential Quadratic Programming Method for Nonlinear Second-Order Cone Programming Problems. Hirokazu KATO

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems

ECE580 Exam 1 October 4, Please do not write on the back of the exam pages. Extra paper is available from the instructor.

Least Sparsity of p-norm based Optimization Problems with p > 1

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Unconstrained minimization of smooth functions

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Nonlinear Optimization: What s important?

A sensitivity result for quadratic semidefinite programs with an application to a sequential quadratic semidefinite programming algorithm

Visual SLAM Tutorial: Bundle Adjustment

Programming, numerics and optimization

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

Interior-Point Methods for Linear Optimization

Higher-Order Methods

minimize x subject to (x 2)(x 4) u,

An Inexact Newton Method for Nonlinear Constrained Optimization

arxiv: v1 [math.oc] 23 May 2017

Numerical Optimization

In English, this means that if we travel on a straight line between any two points in C, then we never leave C.

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem

Chapter 8 Gradient Methods

Nonlinear equations and optimization

Linear Algebra Massoud Malek

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

OPER 627: Nonlinear Optimization Lecture 9: Trust-region methods

Second Order Optimization Algorithms I

Conjugate Directions for Stochastic Gradient Descent

On Lagrange multipliers of trust region subproblems

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

Linear Algebra and Robot Modeling

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

Transcription:

On the Connection Between Sequential Quadratic Programming and Riemannian Gradient Methods Yu Bai Song Mei arxiv:1805.08756v1 [math.oc] 22 May 2018 May 23, 2018 Abstract We prove that a simple Sequential Quadratic Programming (SQP) algorithm for equality constrained optimization has local linear convergence with rate 1 1/κ R, where κ R is the condition number of the Riemannian Hessian. Our analysis builds on insights from Riemannian optimization and indicates that first-order Riemannian algorithms and simple SQP algorithms have nearly identical local behavior. Unlike Riemannian algorithms, SQP avoids calculating the projection or retraction of the points onto the manifold for each iterates. All the SQP iterates will automatically be quadratically close the the manifold near the local minimizer. 1 Introduction In this paper, we consider the equality-constrained optimization problem minimize f(x), x R n subject to x M = {x : F (x) = 0}, where we assume f : R n R and F : R n R m are C 2 smooth functions with m n. The focus of this paper is on the local convergence rate of first-order methods for (1): methods that only query f(x) at each iteration (but can do whatever they want with the constraint). An example is projected gradient descent, in which each iterate moves in the negative gradient direction followed by a projection back onto M. As each iterate only receives first-order information about f, one expects local linear convergence of such algorithms, whose rate indicates the difficulty of the problem in the same way that the condition number of 2 f(x ) tells about the difficulty of unconstrained optimization. While numerous first-order methods can solve problem (1) [Ber99, NW06], we will restrict attention to two types of methods: Riemannian first-order methods and Sequential Quadratic Programming, which we now briefly review. When M has a manifold structure near x, one could use Riemannian optimization algorithms [AMS09], whose iterates are maintained on the constraint set M. Usually, a first-order Riemannian algorithm computes the Riemannian gradient and then takes a descent step on the manifold based on this gradient. Classical Riemannian type first order algorithms date back to geodesic gradient projection [Lue72] and geodesic steepest descent algorithm [Gab82a]. These algorithms proposed to perform descent steps along the geodesics of the manifold. Beyond several simple cases, the geodesics along a manifold is in general hard to (1) Department of Statistics, Stanford University Institute for Computational and Mathematical Engineering, Stanford University 1

compute, making these algorithms hard to implement in practice. Later, Riemannian algorithms are simplified by making use of retraction, a mapping from the tangent space to the manifold that can replace the necessity of computing the exact geodesics while still maintaining the same convergence rate. Intuitively, first-order Riemannian methods can be viewed as variants of projected gradient descent that utilize the manifold structure more carefully. Analyses of many such Riemannian algorithms are given in [AMS09, Section 4]. An alternative approach for solving problem (1) is Sequential Quadratic Programming (SQP) [NW06, Section 18]. Each iteration of SQP solves a quadratic programming problem which minimizes a quadratic approximation of f on the linearized constraint set {x : F (x k ) + F (x k )(x x k ) = 0}. This makes SQP an infeasible method: each iterate may be infeasible but eventually they converge to a feasible point. When the quadratic approximation uses the Hessian of the objective function, this is equivalent to Newton method solving nonlinear equations. When the full Hessian is intractable, one can either approximate the Hessian with BFGS-type updates, or just use some raw estimate such as a big PSD matrix [BT95]. It is known that the convergence rate of many first-order algorithms is determined by the condition number κ R of the restricted Hessian of the Lagrangian (e.g. [LY08, Section 11.6]). Many algorithms are shown to achieve local linear convergence with rate (1 1/κ R ), which is called the canonical rate for problem (1) [LY08]. Riemannian algorithms that achieve the canonical rate include geodesic gradient projection [Lue72, Theorem 2], geodesic steepest descent [Gab82a, Theorem 4.4], Riemannian gradient descent [AMS09, Theorem 4.5.6]. The canonical rate is also achieved by the modified Newton method on the quadratic penalty function [LY08, Section 15.7], which resembles an SQP method in spirit. Though analyses are well-established for both the Riemannian and the SQP approach (even by the same author in [Gab82a] for the Riemannian approach and [Gab82b] for the SQP approach), the connection of these algorithms did not receive much attention until the past 10 years. The connection between SQP and Riemannian method is re-emphasized in [ATMA09]: the authors pointed out that the feasibly-projected sequential quadratic programming (FP-SQP) method in [WT04] gives the same update as the Riemannian Newton update. However, this connection is established for second-order methods between the Riemannian and the SQP approaches, and such connection for first-order methods are not we believe explicitly pointed out yet. 1.1 Outline and our contribution In this paper, we prove that the local convergence rate of a simple version of SQP is the canonical rate 1 1/κ R (Theorem 1) from a Riemannian optimization perspective. The algorithm we consider is the following: given initialization x 0 close to a local minimizer x, iterate x k+1 = arg min f(x k ) + f(x k ), x x k + 1 2η x x k 2 2 subject to F (x k ) + F (x k )(x x k ) = 0, (2) where η > 0 is the stepsize. This is a standard SQP in which the quadratic term uses the matrix I/(2η), and in practice is slightly easier to implement than BFGS-type SQP. Generic convergence theorems for SQP show that algorithm (2) has local linear convergence [BT95, Theorem 3.2]. Though the rate is not the focus there, one could extract from their proof a rate that depends on the condition number of the Hessian of the Lagrangian, which is slightly slower than the canonical rate. Meanwhile, modified Newton methods that resemble an SQP are shown to achieve the canonical rate [LY08, Section 15.7]. Though these are compelling reasons for one 2

T x? M x? {z } " x k+1 x k " 2 M Figure 1: Illustration for SQP iterates to believe that the SQP (2) converges with the canonical rate, the authors were unable to find an explicitly written proof. Unlike classical analyses of SQP which often construct a merit function based on function values, our analysis focuses on the distance to optimality in the tangent and normal directions separately. Crucially, by observing the fact that the iterates will remain quadratically closing to the manifold, we choose our merit function as P x (x k x ) 2 2 + σ P x (x k x ) 2 for some σ > 0, where P x and P x are projection matrices onto correspondingly the tangent space and normal space of the manifold M at x. To show that this quantity converges linearly with the canonical rate, we make use of a Riemannian Taylor expansion result (Lemma 3.1). This result extends existing Riemannian Taylor expansion onto points off the manifold and is stated in an Euclidean form. Our proof uses insights from Riemannian optimization and makes precise the connection between first-order Riemannian methods and SQP methods with cheap quadratic terms: locally they are about the same. More precisely, the fact that P x (x k x ) 2 is comparable with P x (x k x ) 2 2 suggests that the iterates x k are quadratically close to the manifold. Hence, performing these SQP steps quickly becomes nearly identical to performing gradient steps on the manifold, i.e. Riemannian methods. We illusrtate this geometric observation in Figure 1. We hope that this observation can shed light on the analyses of constrained optimization problems and be of further interest. The rest of this paper is organized as the following. In Section 2, we state our assumptions and give preliminaries on the Riemannian geometry on and off the manifold M. We present our main convergence result in Section 3.1 and state the Riemannian Taylor expansion Lemma in Section 3.2. We prove our main result in Section 4, give an example in Section 5, and prove the needed technical results in the Appendix. 1.2 Notation For a matrix A R m n, we denote A R n m as its Moore-Penrose inverse, A T R n m as its transpose, and A T as the transpose of its Moore-Penrose inverse. As n m, we have A = A T (AA T ) 1. For a k th order tensor T R n 1 n 2 n k, and k vectors u 1 R n 1,..., u k R n k, we denote T [u 1,..., u k ] = i 1,...,i k T i1 i k u 1,i1 u k,ik as tensor-vectors multiplication. The operator norm of tensor T is defined as T op = sup u1 2 =1,..., u k 2 =1 T [u 1,..., u k ]. 3

For a scaler-valued function f : R n R, we write its gradient at x R n as a column vector f(x) R n. We write its Hessian at x R n as a matrix 2 f(x) R n n, and its third order derivative at x as a third order tensor 3 f(x) R n n n. For a vector-valued function F : R n R m, its Jacobian matrix at x R n is an m n matrix F (x) R m n, and its Hessian matrix at x as a third order tensor 2 F (x) R m n n. 2 Preliminaries 2.1 Assumptions Let x be a local minimizer of problem (1). Throughout the rest of this paper, we make the following assumptions on problem (1). In particular, all these assumptions are local, meaning that they only depend on the properties of f and F in B(x, δ) for some δ > 0. Assumption 1 (Smoothness). Within B(x, δ), the functions f and F are C 2 with local Lipschitz constants L f, L F, Lipschitz gradients with constants β f, β F, and Lipschitz Hessians with constants ρ f, ρ F. Assumption 2 (Manifold structure and constraint qualification). The set M is a m-dimensional smooth submanifold of R n. Further, inf x B(x,δ) σ min ( F (x)) γ F for some constant γ F > 0. Smoothness and constraint qualification together implies that the constraints F (x) are wellconditioned and problem (1) is C 2 near x. In particular, we can define a matrix Hessf(x ) via the formula Hessf(x ) = P x 2 f(x )P x [ F (x ) T f(x )] i P x 2 F i (x )P x, (3) where P x = I n F (x ) F (x ). (4) We can see later that Hessf(x ) is the matrix representation of the Riemannian Hessian of function f on M at x. Assumption 3 (Eigenvalues of the Riemannian Hessian). Define λ max = sup{ u, Hessf(x )u : u 2 = 1, F (x )u = 0}, λ min = inf{ u, Hessf(x )u : u 2 = 1, F (x )u = 0}. We assume 0 < λ min λ max <. We call κ R = λ max /λ min the condition number of Hessf(x ). 2.2 Geometry on the manifold M Since we assumed that the set M is a smooth submanifold of R n, we endow M with the Riemannian geometry induced by the Euclidean space R n. At any point x M, the tangent space (viewed as a subspace of R n ) is obtained by taking the differential of the equality constraints T x M = {u R n : F (x)u = 0}. (5) Let P x be the orthogonal projection operator from R n onto T x M. For any u R n, we have P x (u) = [I n F (x) T ( F (x) F (x) T ) 1 F (x)]u. (6) 4

Let P x be the orthogonal projection operator from R n onto the complement subspace of T x M. For any u R n, we have P x (u) = F (x) T ( F (x) F (x) T ) 1 F (x)u. (7) With a little abuse of notations, we will not distinguish P x and P x with their matrix representations. That is, we also think of P x, P x R n n as two matrices. We denote f(x) and gradf(x) respectively the Euclidean gradient and the Riemannian gradient of f at x M. The Riemannian gradient of f is the projection of the Euclidean gradient onto the tangent space gradf(x) = P x ( f(x)) = [I n F (x) T ( F (x) F (x) T ) 1 F (x)] f(x). (8) Since x is a local minimizer of f on the manifold M, we have gradf(x ) = 0. At x M, let 2 f(x) and Hessf(x) be respectively the Euclidean and the Riemannian Hessian of f. The Riemannian Hessian is a symmetric operator on the tangent space and is given by projecting the directional derivative of the gradient vector field. That is, for any u, v T x M, we have (we use D to denote the directional derivative) Hessf(x)[u, v] = v, P x (Dgradf(x)[u]) = v, P x 2 f(x)u P x (DP x [u]) f =v T P x 2 f(x)p x u 2 F (x)[ F (x) T f(x), u, v]. With a little abuse of notation, we will not distinguish the Hessian operator with its matrix representation. That is Hessf(x) = P x 2 f(x)p x [ F (x) T f(x)] i P x 2 F i (x)p x. (10) 2.3 Geometry off the manifold M We can extend the definition of the matrix representations of the above Riemannian quantities outside the manifold M. For any x R, we denote P x = I n F (x) T ( F (x) F (x) T ) 1 F (x), P x = F (x) T ( F (x) F (x) T ) 1 F (x), gradf(x) = [I n F (x) T ( F (x) F (x) T ) 1 F (x)] f(x). By the constraint qualification assumption (Assumption 2), F (x) F (x) T is invertible and the quantities above are well defined in B(x, δ). We call gradf(x) the pseudo-riemannian gradient of f at x, which not only depends on function f on the manifold M, but also depends on the function outside the manifold M, and the formulation of the constraint function F. 2.4 Closed-form expression of the SQP iterate The above definitions makes it possible to have a concise closed-form expression for the SQP iterate (2). Indeed, as each iterate solves a standard QP, the expression can be obtained explicitly by writing out the optimality condition. Letting x k = x, the next iteration x k+1 = x + is given by x + = x η[i n F (x) T ( F (x) F (x) T ) 1 F (x)] f(x) F (x) T ( F (x) F (x) T ) 1 F (x) = x ηp x f(x) F (x) T ( F (x) F (x) T ) 1 F (x) = x ηgradf(x) F (x) F (x). We will frequently refer to this expression in our proof. 5 (9) (11) (12)

3 Main result 3.1 Convergence theorem Under Assumptions 1, 2 and 3, we show that the SQP algorithm (2) converges locally linearly with the canonical rate. Theorem 1 (Local linear convergence of SQP with canonical rate). There exists ε > 0 and a constant σ > 0 such that the following holds. Let x 0 B(x, ε) and x k be the iterates of Equation (2) with stepsize η = 1/λ max (Hessf(x )). Letting we have a 2 k+1 + σb k+1 a k = P x (x k x ) 2, b k = P x (x k x ) 2, ( 1 1 2κ R ) 2(a 2 k + σb k ), where κ R = λ max (Hessf(x ))/λ min (Hessf(x )) is the condition number of the Riemannian Hessian of function f on the manifold M at x. Consequently, the distance x k x 2 2 = a2 k + b2 k also converges linearly with rate 1 1/(2κ R ). Remark Theorem 1 requires choosing the stepsize η according to the maximum eigenvalue of Hessf(x ), which might not be known in advance. In practice, one could implement a line-search (for example as in [AMS09, Section 4.2] for Riemannian gradient methods) to achieve the same optimal rate. However, line-search methods for the SQP (2) will call for a different convergence analysis. 3.2 Riemannian Taylor expansion The proof of Theorem 1 relies on a particular expansion of the Riemannian gradient off the manifold, which we state as follows. Lemma 3.1 (First-order expansion of Riemannian gradient). There exist constants ε 0 > 0 and C r,1, C r,2 > 0 such that for all x B(x, ε 0 ), gradf(x) = Hessf(x )(x x ) + r(x), r(x) 2 C r,1 x x 2 2 + C r,2 P x. (13) Lemma 3.1 extends classical Riemannian Taylor expansion (e.g. [AMS09, Section 7.1]) to points off the manifold, where the remainder term contains an additional first-order error. While the error term is linear in P x, this expansion is particularly suitable when this distance is on the order of P x 2, which results in a quadratic bound on r(x) 2. In the proof of Theorem 1, gradf(x) mostly appears through its squared norm and inner product with x x. We summarize the expansion of these terms in the following corollary. Corollary 3.2. There exist constants ε 0 > 0 and C r,1, C r,2 > 0 such that the following hold. For any ε ε 0, x B(x, ε), gradf(x), x x = x x, Hessf(x )(x x ) + R 1, gradf(x) 2 2 = x x, [Hessf(x )] 2 (x x ) + R 2, where max { R 1, R 2 } ε(c r,1 P x 2 + C r,2 P x ). 6

4 Proof of Theorem 1 The following perturbation bound (see, e.g. [Sun01]) for projections is useful in the proof. Lemma 4.1. For x 1, x 2 B(x, δ), we have where β P = 2β F /γ F. P x1 P x2 op β P x 1 x 2 2, P x1 P x 2 op β P x 1 x 2 2. We now prove the main theorem. Step 1: A trivial bound on x + x 2 2. Consider one iterate of the algorithm x x +, whose closed-form expression is given in (12). We have x + = x +, where = η gradf(x) F (x) F (x). (14) We would like to relate x + x 2 with x x 2. Observe that gradf(x ) = 0, we have Observe that F (x ) = 0, we have gradf(x) 2 = gradf(x) gradf(x ) 2 β E x x 2. F (x) F (x) 2 = F (x) (F (x) F (x )) 2 F (x) op F (x) F (x ) 2 (β F /γ F ) x x 2. Accordingly, we have x + x 2 x x 2 + η gradf(x) 2 + F (x) F (x) 2 [1 + ηβ E + (β F /γ F )] x x 2. Hence, for any stepsize η, letting C d = [1 + ηβ E + (β F /γ F )] 2, for x B(x, δ), we have x + x 2 2 C d x x 2 2. (15) Step 2: Analyze normal and tangent distances. This is the key step of the proof. We look into the normal direction and the tangent direction separately. The intuition for this process is that the normal part of x x is a measure of feasibility, and as we will see, converges much more quickly. Now we look at equation x + x = x x +. Multiplying it by P x and P x gives P x (x + x ) = P x (x x ) + P x = P x (x x ) F (x) F (x), (16) P x (x + x ) = P x (x x ) + P x = P x (x x ) η gradf(x). (17) We now take squared norms on both equalities and bound the growth. Define the normal and tangent distances as a = P x, b = P x, (18) and (a +, b + ) similarly for x +. Note that the definitions of a, b use the projection at x, so they are slightly different from quantities (16) and (17). From now on, we assume that x x 2 ε 0, and ε 0 is sufficiently small such that (15) holds. The requirements on ε 0 will later be tightened when necessary. 7

The normal direction. We have P x (x + x ) 2 2 = P x 2 + 2 x x, P x + P x 2 2 = P x 2 2 x x, F (x) F (x) + F (x), ( F (x) F (x) T ) 1 F (x) = P x 2 2 F (x)(x x ), ( F (x) F (x) T ) 1 ( F (x)(x x ) + r(x)) + F (x)(x x ) + r(x), ( F (x) F (x) T ) 1 ( F (x)(x x ) + r(x)) = r(x), ( F (x) F (x) T ) 1 r(x), where r(x) = F (x) F (x)(x x ). By the smoothness of function F, we have r(x) 2 β F /2 x x 2 2. Accordingly, we get P x (x + x ) 2 2 β 2 F /(4γF 2 ) x x 4 2 = β 2 F /(4γF 2 ) [ P x 2 + P x 2] 2 =β 2 F /(4γF 2 ) (a 2 + b 2 ) 2. Applying the perturbation bound on projections (Lemma 4.1), we get b + = P x (x + x ) 2 P x (x + x ) 2 + P x P x op x + x 2 β F /(2γ F ) (a 2 + b 2 ) + β P x x 2 x + x 2 β F /(2γ F ) (a 2 + b 2 ) + β P C d x x 2 2 = [β F /(2γ F ) + β P C d ] (a 2 + b 2 ) = C }{{} b (a 2 + b 2 ). C b (19) The tangent direction. We have P x (x + x ) 2 2 = P x 2 + 2 x x, P x + P x 2 2 = P x 2 2η gradf(x), x x + η 2 gradf(x) 2 2. Applying Lemma 4.1, we get that for any vector v, P x v 2 2 P x v 2 2 = v, (P x P x )v β P x x 2 v 2 2. Applying this to vectors x + x and x x gives a 2 + a 2 2η gradf(x), x x + η 2 gradf(x) 2 2 + β P ( x x 3 2 + x x 2 x + x 2 2) a 2 2η gradf(x), x x + η 2 gradf(x) 2 2 + β P (1 + C d ) x x 3 2. (20) Applying Corollary 3.2, and note that Hessf(x ) = P x Hessf(x )P x by the property of the Riemannian Hessian, we get a 2 + a 2 2η x x, Hessf(x )(x x ) + η 2 x x, (Hessf(x )) 2 (x x ) + ( 2ηR 1 + η 2 R 2 ) + ε 0 C(a 2 + b 2 ) = P x (x x ), (I ηhessf(x )) 2 P x (x x ) + ( 2ηR 1 + η 2 R 2 ) + ε 0 C(a 2 + b 2 ), where the remainders are bounded as max { R 1, R 2 } ε 0 ( Cr,1 P x 2 + C r,2 P x ) = ε0 (C r,1 a 2 + C r,2 b). Choosing the stepsize as η = 1/λ max (Hessf(x )), we have I ηhessf(x ) (1 1/κ R )I. For this choice of η, using the above bound for R 1, R 2, we get that there exists some constant C 1, C 2 such that ( a 2 + 1 1 ) 2a 2 + ε 0 (C 1 a 2 + C 2 b). (21) κ R 8

Putting together. Let σ > 0 be a constant to be determined. Looking at the quantity a 2 + + σb +, by the bounds (19) and (21) we have ( a 2 + + σb + 1 1 ) 2a 2 + ε 0 (C 1 a 2 + C 2 b) + σc b (a 2 + b 2 ) κ R (( 1 1 ) 2 ) + ε0 C 1 + σc b a 2 + ε 0 (C 2 + σc b )b. κ R We can choose σ sufficiently small, then ε 0 sufficiently small, such that a 2 + + σb + ( 1 1 2κ R ) 2(a 2 + σb). Step 3: Connect the entire iteration path. Consider iterates x k of the first-order algorithm (2). Initializing x 0 sufficiently close to x, we can guarantee that the descent on a 2 k + σb k ensures a 2 k + b2 k ε2 0. Thus we chain this analysis on (x, x +) = (x k, x k+1 ) and get that a 2 k + σb k converges linearly with rate 1 1/(2κ R ). 5 Example We provide the eigenvalue problem as a simple example, showing that SQP algorithm is as good as power iterations. We emphasize that this is a simple and well-studied problem; our goal here is only to illustrate the connection between SQP and the Riemannian algorithms, in particular the canonical rate. Example 1 (SQP for eigenvalue problems): Consider the eigenvalue problem for a symmetric matrix A R n n : minimize 1 2 x Ax subject to x 2 2 1 = 0. This is an instance of problem (1) with f(x) = (1/2)x Ax and F (x) = x 2 2 1. Let λ 1 < λ 2 λ n be the eigenvalues of A, and v i (A) be the corresponding eigenvectors. The SQP iterate for this problem is x x + = x +, where solves the subproblem Applying (12), we obtain the explicit formula minimize Ax, + 1 2η 2 2 subject to x 2 2 + 2 x, 1 = 0. x + = x 2 2 + 1 2 x 2 x η (I n xx ) 2 x 2 Ax. (22) 2 By Theorem 1, the local convergence rate is 1 1/κ R, which we now examine. At x = v 1 (A), we have Hessf(x ) = A λ 1 I n, and the tangent space T x M = span(v 2 (A),..., v n (A)). Therefore, the condition number of the Riemannian Hessian is κ R = sup v T x M, v 2 =1 v, Hessf(x )v inf v Tx M, v 2 =1 v, Hessf(x )v = λ n λ 1 λ 2 λ 1. 9

Hence, choosing the right stepsize η, the convergence rate of the SQP method is 1 1 κ R = λ n λ 2 λ n λ 1, matching the rate of power iteration (Riemannian gradient descent). The keen reader might notice that the iterate (22) converges globally to x as long as x 0, x 0, which is another feature shared with power iteration. Whether such global convergence holds more generally would be an interesting direction for future study. Acknowledgement We thank Nicolas Boumal and John Duchi for a number of helpful discussions. YB was partially supported by John Duchi s National Science Foundation award CAREER-1553086. SM was supported by an Office of Technology Licensing Stanford Graduate Fellowship. References [AMS09] P-A Absil, Robert Mahony, and Rodolphe Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009. [ATMA09] PA Absil, Jochen Trumpf, Robert Mahony, and Ben Andrews, All roads lead to newton: Feasible second-order methods for equality-constrained optimization, Tech. report, Technical Report UCL-INMA-2009.024, UCLouvain, 2009. [Ber99] Dimitri P Bertsekas, Nonlinear programming, Athena scientific Belmont, 1999. [BT95] Paul T Boggs and Jon W Tolle, Sequential quadratic programming, Acta numerica 4 (1995), 1 51. [Gab82a] Daniel Gabay, Minimizing a differentiable function over a differential manifold, Journal of Optimization Theory and Applications 37 (1982), no. 2, 177 219. [Gab82b], Reduced quasi-newton methods with feasibility improvement for nonlinearly constrained optimization, Algorithms for Constrained Minimization of Smooth Nonlinear Functions, Springer, 1982, pp. 18 44. [Lue72] David G Luenberger, The gradient projection method along geodesics, Management Science 18 (1972), no. 11, 620 631. [LY08] David G Luenberger and Yinyu Ye, Linear and nonlinear programming, third edition, Springer, 2008. [MZ10] Lingsheng Meng and Bing Zheng, The optimal perturbation bounds of the moore penrose inverse under the frobenius norm, Linear Algebra and its Applications 432 (2010), no. 4, 956 963. [NW06] Jorge Nocedal and Stephen J Wright, Numerical optimization 2nd, Springer, 2006. [Sun01] JG Sun, Matrix perturbation analysis, Science Press, Beijing (2001). [WT04] Stephen J Wright and Matthew J Tenny, A feasible trust-region sequential quadratic programming algorithm, SIAM journal on optimization 14 (2004), no. 4, 1074 1105. 10

A Proof of technical results A.1 Some tools Lemma A.1 ([MZ10]). For x 1, x 2 B(x, δ), we have where β D = 2β F /γ 2 F. F (x 1 ) F (x 2 ) op β D x 1 x 2 2. Lemma A.2. The pseudo-riemannian gradient gradf(x) is β E -Lipschitz in B(x, δ), where β E = β P L f + β f. Proof We have gradf(x 1 ) gradf(x 2 ) 2 = P x1 f(x 1 ) P x2 f(x 2 ) 2 (P x1 P x2 ) f(x 1 ) 2 + P x2 ( f(x 1 ) f(x 2 )) 2 (β P L f + β f ) x 1 x 2 2. A.2 Proof of Lemma 3.1 For any x B(x, ε 0 ), denote x p = P x (x x ) + x. We prove the following equation first gradf(x p ) = Hessf(x )(x p x ) + r(x p ), (23) where r(x p ) C r,1 x p x 2 2. First, we show that gradf(x p ) can be well approximated by P x gradf(x p ). Denoting r 0 = P x gradf(x p ) gradf(x p ), by Lemma 4.1 and A.2, we have where r 0 2 = P x gradf(x p ) 2 = P x P xp gradf(x p ) 2 P x P xp op gradf(x p ) 2 β P x p x 2 gradf(x p ) 2 β P β E x p x 2 2. Then, for any u R n with u 2 = 1, we have Then we have u, P x gradf(x p ) = u, P x P xp { f(x ) + 2 f(x )(x p x ) + 1/2 3 f( x p )[, (x p x ) 2 ]} = u, P x P xp [ f(x ) + 2 f(x )(x p x )] + r 1, u, P x gradf(x p ) r 1 = 1/2 u, P x P xp 3 f( x p )[, (x p x ) 2 ] 1/2 ρ f x p x 2 2. = u, P x P xp [ f(x ) + 2 f(x )(x p x )] + r 1, = u, P x P xp f(x ) + u, P x 2 f(x )(x p x ) + u, P x (P xp P x ) 2 f(x )(x p x ) + r 1 = u, P x P xp f(x ) + u, P x 2 f(x )(x p x ) + r 1 + r 2, }{{} I 11

where r 2 = u, P x (P xp P x ) 2 f(x )(x p x ) β P β f x p x 2 2. Then we look at the term I. We have I = u, P x F (x p ) T F (x p ) T f(x ) = u, P x [ F (x p ) F (x )] T F (x p ) T f(x ) = u, P x { 2 F (x )[,, x p x ] + 1/2 3 F ( x p )[,, (x p x ) 2 ]} T F (x p ) T f(x ) = [ F (x p ) T f(x )] i 2 F i (x )[x p x, P x u] + r 3, where We further have r 3 =1/2 [ F (x p ) T f(x )] i 3 F i ( x p )[(x p x ) 2, P x u] I = = ρ F L f /(2γ F ) x p x 2 2. [ F (x p ) T f(x )] i 2 F i (x )[x p x, P x u] + r 3 [ F (x ) T f(x )] i 2 F i (x )[x p x, P x u] + r 3 + r 4, where r 4 = {[ F (x p ) F (x )] T f(x )} i 2 F i (x )[x p x, P x u] β D L f β F x p x 2 2. Above all, combining all the terms, we have u, gradf(x p ) Hessf(x p )(x p x ) r 0 2 + r 1 + r 2 + r 3 + r 4 (β P β E + 1/2 ρ f + β P β f + ρ F L f /(2γ F ) + β D L f β F ) x p x 2 2. Taking C r,1 = β P β E + 1/2 ρ f + β P β f + ρ F L f /(2γ F ) + β D L f β F we get Eq. (23). Then by Lemma A.2, we have This proves the lemma. A.3 Proof of Corollary 3.2 gradf(x) gradf(x p ) C r,2 x x p 2. (24) Let ε 0 be given by Lemma 3.1, ε ε 0, and x B(x, ε). We have gradf(x) = Hessf(x )(x x ) + r(x), r(x) 2 C r,1 x x 2 2 + C r,2 P x. (25) Therefore we obtain the expansion gradf(x), x x = x x, Hessf(x )(x x ) + R 1, 12

where R 1 is bounded as R 1 = r(x), x x C r,1 x x 3 2 + C r,2 x x 2 P x εc r,1 x x 2 2 + εc r,2 P x. (26) Similarly, we have the expansion gradf(x ) 2 2 = x x, (Hessf(x )) 2 (x x ) + R 2, where R 2 is bounded as (letting H = λ max (Hessf(x )) for convenience) R 2 2 Hessf(x )(x x ), r(x) + r(x) 2 2 2H ( C r,1 x x 3 2 + C r,2 x x 2 P ) x + ( Cr,1 x 2 x 4 2 + 2C r,1 C r,2 x x 2 2 P x + Cr,2 P 2 x ) 2 (2HC r,1 ε + C 2 r,1ε 2 ) x x 2 2 + (2HC r,2 ε + 2C r,1 C r,2 ε 2 ) P x + C 2 r,2ε P x ε (2HC r,1 + Cr,1ε 2 2 0 ) x x }{{} C r,1 2 + ε (2HC r,2 + 2C r,1 C r,2 ε 0 + Cr,2ε 2 0 ) }{{} C r,2 Px = ε ( Cr,1 x x 2 2 + C r,2 P x ). (27) Overloading the constants, we get max { R 1, R 2 } ε ( C r,1 x x 2 2 + C r,2 P x ) = ε ( C r,1 P x 2 + C r,1 P x 2 + C r,2 P x ) ε ( C r,1 x x 2 ) 2 + (C r,2 + C r,1 ε 0 ) Px }{{}. C r,2 13