Convergence analysis of Riemannian trust-region methods


P.-A. Absil, C. G. Baker, K. A. Gallivan

Optimization Online technical report. Compiled June 19, 2006.

Abstract. This document is an expanded version of [ABG05], with a detailed convergence analysis. A general scheme for trust-region methods on Riemannian manifolds is proposed and analyzed. Among the various approaches available to (approximately) solve the trust-region subproblems, particular attention is paid to the truncated conjugate-gradient technique. The method is illustrated on problems from numerical linear algebra.

Key words. Numerical optimization on manifolds, trust-region, truncated conjugate-gradient, Steihaug-Toint, global convergence, local convergence, superlinear convergence, symmetric eigenvalue problem.

1 Introduction

Several problems related to numerical linear algebra can be expressed as optimizing a smooth function whose domain is a differentiable manifold. Applications appear in various areas, including computer vision [MKS01], machine learning [NA05], maximum likelihood estimation [Smi05, XB05], electronic structure computation [LE00], system balancing [HM94], model reduction [YL99], and robot manipulation [HHM02].

The simplest algorithms for solving optimization problems on manifolds are arguably those based on the idea of steepest descent; see, e.g., [HM94, Udr94] and references therein. These algorithms have good global convergence properties but slow (linear) local convergence. Other methods achieve superlinear convergence by using second-order information on the cost function. Among these methods, Newton's method is conceptually the simplest. The history of Newton's method on manifolds can be traced back to Luenberger [Lue72], if not earlier.

(This work was supported by the US National Science Foundation under Grants ACI0324944 and CCR9912415, by the School of Computational Science of Florida State University through a postdoctoral fellowship, and by Microsoft Research through a Research Fellowship at Peterhouse, Cambridge. This research was initiated while the first author was a Research Fellow with the Belgian National Fund for Scientific Research (FNRS) at the University of Liège. Author affiliations: Department of Mathematical Engineering, Université catholique de Louvain, 1348 Louvain-la-Neuve, Belgium, URL: http://www.inma.ucl.ac.be/~absil/; Peterhouse, University of Cambridge, Cambridge CB2 1RD, UK; School of Computational Science, Florida State University, Tallahassee, FL 32306-4120, USA, URLs: http://www.csit.fsu.edu/{~cbaker,~gallivan}.)

Gabay [Gab82] proposed a Newton method on embedded submanifolds of R^n. Smith [Smi93, Smi94] and Udrişte [Udr94] formulated and analyzed the method on general Riemannian manifolds. Related work includes [Shu86, EAS98, OW00, Man02, MM02, ADM+02, DPM03, HT04]. A plain Newton method, on general manifolds as well as in R^n, has major drawbacks as a numerical optimization method. The computational cost is often prohibitive, as a linear system has to be solved at each iteration. Moreover, the method is locally attracted to any stationary point, be it a local minimizer, a local maximizer or a saddle point. Finally, the method may not even converge to stationary points unless some strong conditions are satisfied (such as convexity of the cost function).

In the case of cost functions on R^n, several techniques exist to improve the convergence properties of Newton's method. Most of these techniques fall into two categories: line-search methods and trust-region methods; see [NW99]. The advantages of a trust-region method over the pure Newton method are multiple. First, under mild conditions, trust-region schemes are provably convergent to a set of stationary points of the cost function for all initial conditions. Moreover, the cost function is nonincreasing at each iterate, which favors convergence to a local minimizer. Finally, the presence of a trust region gives an additional guideline to stop the inner iteration early, hence reducing the computational cost, while preserving the fast local convergence of the exact scheme. Line-search techniques have been considered on Riemannian manifolds by Udrişte [Udr94] and Yang [Yan05].

The main purpose of this paper is to provide a theoretical and algorithmic framework for trust-region methods on Riemannian manifolds, applicable to multiple problems. The Riemannian trust-region (RTR) approach we propose works along the following lines. First, much as in the work of Shub [Shu86, ADM+02], a retraction R (Definition 2.1) is chosen on the Riemannian manifold M; it defines, for any point x ∈ M, a one-to-one correspondence R_x between a neighborhood of 0_x in the tangent space T_x M and a neighborhood of x in M (see Figure 1). Using this retraction, the cost function f on M is lifted to a cost function f̂_x = f ∘ R_x on T_x M. Since T_x M is a Euclidean space, it is possible to define a quadratic model of f̂_x and adapt classical methods in R^n to compute (in general, approximately) a minimizer of the model within a trust region around 0_x ∈ T_x M. This minimizer is then retracted back from T_x M to M using the retraction R_x. This point is a candidate for the new iterate, which will be accepted or rejected depending on the quality of the agreement between the quadratic model and the function f itself.

It is this lift-solve-retract procedure that distinguishes the proposed RTR approach from the standard trust-region methods; the standard methods, since they live in R^n, only require the solve part. On a manifold, lifting the cost function makes it possible to locally fall back to a friendly Euclidean world (the tangent space T_x M), where classical techniques can be applied, and the retraction brings the result back to the manifold. A difficulty, from an analysis perspective, is that the RTR method does not deal with a unique cost function (as in the classical case), but rather with a succession of different lifted cost functions f̂_{x_k}, where x_k is the kth iterate.
A main contribution of this paper is to show that, under reasonable conditions, the nice properties of the standard trust-region schemes are preserved in their Riemannian generalizations (see Section 4). Notice that it is theoretically possible to choose once and for all the retraction as the Riemannian exponential mapping. This corresponds to a strategy used by numerous Riemannian optimization methods, which compute the exponential of a tangent update vector in order to obtain a new iterate on the manifold; see [Smi94, Udr94, EAS98, Yan05].

However, as pointed out by Manton [Man02, Section IX], the exponential map may not always be the most appropriate or computationally efficient way of performing the update. Our convergence analysis shows that the good properties of the algorithms hold for all suitably defined retractions (Definition 2.1), and not only for the exponential mapping.

The lift-solve-retract technique, as an aside, is not specific to trust-region methods, and can be applied to generalize a wide variety of classical optimization methods to manifolds. This approach, which finds its roots in the work of Shub [Shu86], seems to have received little attention in the literature until recently [ADM+02, ABG04a]. Note that coordinate-based approaches follow a similar pattern; in practice, however, lifting to the tangent space tends to lead to more streamlined and computationally efficient algorithms (see the discussion in Section 2.1).

There is clearly a link between techniques of optimization on manifolds and standard constrained optimization approaches. However, there are manifolds that are not defined as constrained sets in R^n; an important example is the Grassmann manifold (see Section 5.3). (Clearly, by Nash's embedding theorem [Nas56], every Riemannian manifold can be smoothly isometrically embedded in a Euclidean space; but this is only an existence theorem, and such an embedding may be elusive or computationally intractable.) Also, there are constrained sets that do not admit a regular manifold structure; a simple example is the unit sphere of the 1-norm, {x ∈ R^n : ‖x‖_1 = 1}. The application areas thus overlap, but are not identical. On the problems that can be tackled by both approaches, an interesting feature of manifold optimization schemes is that they are feasible: each iterate belongs to the constrained set. Feasibility is advantageous, or even essential, in several situations (see, e.g., [LT01]). For example, the cost function is sometimes not defined outside the feasible set; or the value of the cost function may have little if any relevance outside the feasible set; moreover, if the algorithm runs out of time or computing resources, it should be able to terminate and return a feasible point.

We assume throughout that it is computationally impractical to determine whether the Hessian of the cost function is positive definite; trust-region subproblems are thus solved using inner iterations, such as the truncated conjugate-gradient method, that improve on the so-called Cauchy point by using the Hessian of the model only through its application to a vector. As a consequence, convergence of the trust-region algorithm to stationary points that are not local minima (i.e., saddle points and local maxima) cannot be ruled out. However, because trust-region methods are descent methods (the value of the cost function never increases), the situation is fundamentally different from the pure Newton case: convergence to saddle points and local maxima of the cost function is numerically unstable and is thus not expected to occur in practice; and indeed, convergence to saddle points and local maxima is only observed in very specifically crafted numerical experiments.

The theory and algorithms can be adapted to exploit the properties of specific manifolds and problems in several disciplines. Numerical linear algebra offers several problems that can be analyzed and solved using this approach.
A particularly illustrative and computationally efficient application is the computation of the rightmost or leftmost eigenvalue and associated eigenvector of a symmetric/positive-definite matrix pencil (A, B). In this case, the manifold can be chosen as the projective space, and a possible choice for the cost function is the Rayleigh quotient. The resulting trust-region algorithm can be interpreted as an inexact Rayleigh quotient iteration; we refer to [ABG06] for details.
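To make the eigenvalue application concrete, here is a small illustrative sketch (ours, not code from the paper): the Rayleigh quotient of a symmetric matrix A restricted to the unit sphere (i.e., the pencil (A, I)), and its Riemannian gradient obtained by projecting the Euclidean gradient onto the tangent space. The test matrix and the eigenvector check are assumptions made up for the example.

```python
import numpy as np

def rayleigh_quotient(A, x):
    """Rayleigh quotient x^T A x / x^T x of a symmetric matrix A."""
    return float(x @ A @ x) / float(x @ x)

def riemannian_grad_sphere(A, x):
    """Riemannian gradient of f(x) = x^T A x on the unit sphere:
    the Euclidean gradient 2*A*x projected onto the tangent space
    {v : x^T v = 0} at x (assumes ||x|| = 1)."""
    egrad = 2.0 * (A @ x)
    return egrad - (x @ egrad) * x

# sanity check: at an eigenvector the Riemannian gradient vanishes
A = np.diag([1.0, 2.0, 5.0])
x = np.array([0.0, 0.0, 1.0])                        # eigenvector, eigenvalue 5
print(rayleigh_quotient(A, x))                       # 5.0
print(np.linalg.norm(riemannian_grad_sphere(A, x)))  # 0.0
```

Minimizing this cost over the sphere yields an eigenvector associated with the leftmost eigenvalue; the paper's block algorithm on the Grassmann manifold (Section 5.3) follows the same pattern.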

This paper makes use of basic notions of Riemannian geometry and numerical optimization; background can be found in [dc92] and [NW99]. The general concept of trust-region methods on Riemannian manifolds is presented in Section 2. Methods for (approximately) solving the trust-region subproblems are considered in Section 3. Convergence properties are investigated in Section 4. The theory is illustrated on practical examples in Section 5. In particular, a block algorithm for the computation of extreme eigenpairs, which evolves on the Grassmann manifold, is derived in detail in Section 5.3. Conclusions are presented in Section 6. A preliminary version of the results presented in this paper appeared in the proceedings of the 16th MTNS conference [ABG04a].

2 The general algorithm

We follow the usual conventions of matrix computations and view R^n as the set of column vectors with n real components. The basic trust-region method in R^n for a cost function f consists of adding to the current iterate x ∈ R^n the update vector η ∈ R^n solving the trust-region subproblem

\min_{\eta \in \mathbb{R}^n} m(\eta) = f(x) + \partial f(x)\,\eta + \tfrac{1}{2}\,\eta^T \partial^2 f(x)\,\eta \qquad \text{s.t. } \|\eta\| \le \Delta,    (1)

where \partial f = (\partial_1 f, \ldots, \partial_n f) is the differential of f, (\partial^2 f)_{ij} = \partial^2_{ij} f is the Hessian matrix (some convergence results allow \partial^2 f(x) in (1) to be replaced by any symmetric matrix, but we postpone this relaxation until later in the development), and \Delta is the trust-region radius. The quality of the model m is assessed by forming the quotient

\rho = \frac{f(x) - f(x + \eta)}{m(0) - m(\eta)}.    (2)

Depending on the value of ρ, the new iterate will be accepted or discarded and the trust-region radius will be updated. More details will be given later in this paper. Major references on trust-region methods include the early work of Powell [Pow74] and Moré and Sorensen [Sor82, MS84], and the textbook [CGT00]; see also [WD05] and references therein for recent work on trust-region update strategies.
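As a minimal illustration of the acceptance test built around the quotient (2), the following sketch (not from the paper) evaluates the quadratic model (1) at a candidate step and forms the ratio of actual to predicted reduction; the toy quadratic cost and the crude step choice are assumptions made only for this example.

```python
import numpy as np

def trust_region_ratio(f, grad, hess, x, eta):
    """rho from (2): actual reduction of f divided by the reduction
    predicted by the quadratic model (1) at the step eta."""
    predicted = -(grad(x) @ eta + 0.5 * eta @ (hess(x) @ eta))  # m(0) - m(eta)
    actual = f(x) - f(x + eta)
    return actual / predicted

# toy problem: f(x) = 0.5 x^T D x; for a quadratic the model is exact, so rho = 1
D = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ (D @ x)
grad = lambda x: D @ x
hess = lambda x: D
x0 = np.array([1.0, 1.0])
eta = -0.1 * grad(x0)                                # a small steepest-descent step
print(trust_region_ratio(f, grad, hess, x0, eta))    # prints 1.0
```

In a full method, rho close to 1 leads to accepting the step and possibly enlarging Δ, while a small or negative rho leads to rejecting it and shrinking Δ, exactly as formalized in Algorithm 1 below.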

With a view towards extending the concept of trust-region subproblem to manifolds, we first consider the case of an abstract Euclidean space, i.e., a vector space endowed with an inner product (that is, a symmetric, bilinear, positive-definite form). This generalization to a Euclidean space E of dimension d requires little effort, since E may be identified with R^d once a basis of E is chosen (we refer to [Boo75, Section I.2] for a discussion of the distinction between R^n and abstract Euclidean spaces). Let g(·,·) denote the inner product on E. Given a function f : E → R and a current iterate x ∈ E, one can choose a basis (e_i)_{i=1,...,d} of E (not necessarily orthonormal with respect to the inner product) and write a classical G-norm trust-region subproblem (see, e.g., [GLRT99, Section 2])

\min_{\bar\eta \in \mathbb{R}^d} \bar m(\bar\eta) := \bar f(\bar x) + \partial \bar f(\bar x)\,\bar\eta + \tfrac{1}{2}\,\bar\eta^T \partial^2 \bar f(\bar x)\,\bar\eta \qquad \text{s.t. } \bar\eta^T G \bar\eta \le \Delta^2,    (3)

where x = \sum_i \bar x_i e_i, η = \sum_i \bar\eta_i e_i, \bar f(\bar x) = f(\sum_i \bar x_i e_i) and G_{ij} = g(e_i, e_j). It can be shown that m(η) := \bar m(\bar\eta) does not depend on the choice of basis (e_i)_{i=1,...,d}; therefore (3) can be written as the coordinate-free expression

\min_{\eta \in E} m(\eta) = f(x) + Df(x)[\eta] + \tfrac{1}{2} D^2 f(x)[\eta, \eta] = f(x) + g(\operatorname{grad} f(x), \eta) + \tfrac{1}{2} g(\operatorname{Hess} f(x)[\eta], \eta) \qquad \text{s.t. } g(\eta, \eta) \le \Delta^2,    (4)

for the trust-region subproblem in the Euclidean space E.

Now let M be a manifold of dimension d. Intuitively, this means that M looks locally like R^d. Local correspondences between M and R^d are given by coordinate charts φ_α : Ω_α ⊂ M → R^d; see, e.g., [dc92] for details. How can we define a trust-region method for a cost function f on M? Given a current iterate x, it is tempting to choose a coordinate neighborhood Ω_α containing x, translate the problem to R^d through the chart φ_α, build a quadratic model m, solve the trust-region problem in R^d, and bring the solution back to M through φ_α^{-1}. The difficulty is that there are in general infinitely many α's such that x ∈ Ω_α. Each choice yields a different model function m ∘ φ_α and a different trust region {y ∈ M : ‖φ_α(y) − φ_α(x)‖ ≤ Δ}, hence a different next iterate x_+. This kind of situation is pervasive in numerics on manifolds; it is usually addressed, assuming that M is a Riemannian manifold, by working in so-called normal coordinates.

In order to explain the concept of normal coordinates, we now present a condensed overview of Riemannian geometric concepts; we refer to [dc92, O'N83] for details. In what follows, M will be a (C^∞) Riemannian manifold, i.e., M is endowed with a correspondence, called a Riemannian metric, which associates to each point x of M an inner product g_x(·,·) on the tangent space T_x M and which varies differentiably. The Riemannian metric induces a norm ‖ξ‖ = √(g_x(ξ, ξ)) on each tangent space T_x M. Also associated with a Riemannian manifold are the notions of Levi-Civita (or Riemannian) connection ∇, parallel transport, geodesic (which, intuitively, generalizes the notion of straight line), and the associated exponential map defined by Exp_x ξ = γ(1), where γ is the geodesic satisfying γ(0) = x and γ'(0) = ξ, with γ'(0) denoting the tangent vector to the curve γ at t = 0. Given a point x in M, there is a ball B_ε(0_x) in T_x M of radius ε around the origin 0_x of T_x M such that Exp_x is a diffeomorphism of B_ε(0_x) onto an open subset of M. Then Exp_x(B_ε(0_x)) = U is called a normal neighborhood of x, and Exp_x defines a diffeomorphism between B_ε(0_x) ⊂ T_x M and U. The supremum of these ε's is the injectivity radius i_x(M) at x, and i(M) := inf_{x∈M} i_x(M) is the injectivity radius of M. Finally, normal coordinates are defined in a normal neighborhood U by considering an orthonormal basis {e_i} of T_x M and taking (u_1, ..., u_d) as the coordinates of y = Exp_x(Σ_{i=1}^d u_i e_i).

An important observation is that, for the purpose of defining a trust-region method, the choice of a basis {e_i} of T_x M is immaterial, since trust-region subproblems on a Euclidean space like T_x M admit the coordinate-free expression (4). Therefore, the exponential mapping makes it possible to uniquely define trust-region subproblems on Riemannian manifolds by locally mapping the manifold to the Euclidean space T_x M. However, as pointed out in [Man02], the systematic use of the exponential mapping may not be desirable in all cases: other local mappings to T_x M may reduce the computational cost while preserving the useful convergence properties of the considered method. Therefore, in this paper, we relax the exponential to a class of mappings called retractions, a concept that we borrow from [Shu86, ADM+02] with some modifications (see also the illustration in Figure 1).

Definition 2.1 (retraction) A retraction on a manifold M is a mapping R from the tangent bundle TM into M with the following properties. Let R_x denote the restriction of R to T_x M.

1. R is continuously differentiable.
2. R_x(0_x) = x, where 0_x denotes the zero element of T_x M.
3. DR_x(0_x) = id_{T_x M}, the identity mapping on T_x M, with the canonical identification T_{0_x}(T_x M) ≃ T_x M.

It follows from the inverse function theorem (see [dc92, Ch. 0, Th. 2.10]) that R_x is a local diffeomorphism at 0_x; namely, R_x is not only C^1 but also bijective with differentiable inverse on a neighborhood V of 0_x in T_x M. In particular, the exponential mapping is a retraction (see Proposition 2.9 in [dc92, Ch. 3] and the proof thereof), and any other retraction can be thought of as a first-order approximation of the exponential mapping. Notice that no assumption is made on the second and higher derivatives of the retractions; in particular, D^2(Exp_x^{-1} ∘ R_x)(0_x) need not vanish (see also the discussion in Section 2.1). Practical examples of retractions on specific Riemannian manifolds, which may be more tractable computationally than the exponential, are given in Section 5.

We point out that the requirements in Definition 2.1 are stronger than needed to obtain the convergence results; in particular, we could allow R to be defined only on a certain subset of TM. However, weaker assumptions would make the forthcoming developments more complicated, and there is no evidence that they would be more relevant in practical applications. For the same reason, we assume throughout that the manifold M is complete, i.e., Exp_x ξ exists for all ξ ∈ TM.

Figure 1: Illustration of retractions. (Figure not reproduced: it shows a tangent vector ξ ∈ T_x M at a point x of M being mapped to the point R_x ξ ∈ M.)
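As an illustration of Definition 2.1 (our example, not one of the paper's Section 5 examples), the normalization map on the unit sphere is a standard retraction candidate. The snippet below checks properties 2 and 3 numerically; the finite-difference test of property 3 is only a sanity check under the stated assumptions, not a proof.

```python
import numpy as np

def retract_sphere(x, xi):
    """Candidate retraction on the unit sphere S^{n-1}:
    move along x + xi in the ambient space, then renormalize."""
    y = x + xi
    return y / np.linalg.norm(y)

rng = np.random.default_rng(0)
x = rng.standard_normal(4); x /= np.linalg.norm(x)
v = rng.standard_normal(4); v -= (x @ v) * x      # tangent vector: x^T v = 0

# Property 2: R_x(0_x) = x
print(np.allclose(retract_sphere(x, np.zeros_like(x)), x))        # True

# Property 3 (numerically): d/dt R_x(t v) |_{t=0} = v
t = 1e-6
deriv = (retract_sphere(x, t * v) - retract_sphere(x, -t * v)) / (2 * t)
print(np.linalg.norm(deriv - v))                                  # ~0
```

Compared with the exponential map of the sphere, which follows a great circle, this retraction agrees with it to first order at 0_x but is cheaper to evaluate; this is precisely the motivation for allowing general retractions instead of insisting on the exponential.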

We can now lay out the structure of a trust-region method on a Riemannian manifold (M, g) with retraction R. Given a cost function f : M → R and a current iterate x_k ∈ M, we use R_{x_k} to locally map the minimization problem for f on M into a minimization problem for the lifted cost function

\hat f_{x_k} : T_{x_k}M \to \mathbb{R} : \xi \mapsto f(R_{x_k}(\xi)).    (5)

The Riemannian metric g turns T_{x_k}M into a Euclidean space endowed with the inner product g_{x_k}(·,·), and, following (4), the trust-region subproblem on T_{x_k}M reads

\min_{\eta \in T_{x_k}M} m_{x_k}(\eta) = \hat f_{x_k}(0_{x_k}) + D\hat f_{x_k}(0_{x_k})[\eta] + \tfrac{1}{2} D^2 \hat f_{x_k}(0_{x_k})[\eta, \eta] = \hat f_{x_k}(0_{x_k}) + g_{x_k}(\operatorname{grad} \hat f_{x_k}(0_{x_k}), \eta) + \tfrac{1}{2} g_{x_k}(\operatorname{Hess} \hat f_{x_k}(0_{x_k})[\eta], \eta) \qquad \text{s.t. } g_{x_k}(\eta, \eta) \le \Delta_k^2.    (6)

For the global convergence theory it is only required that the second-order term in the model be some symmetric form. Therefore, instead of (6), we will consider the following more general formulation:

\min_{\eta \in T_{x_k}M} m_{x_k}(\eta) = f(x_k) + g_{x_k}(\operatorname{grad} f(x_k), \eta) + \tfrac{1}{2} g_{x_k}(H_{x_k}\eta, \eta) \qquad \text{s.t. } g_{x_k}(\eta, \eta) \le \Delta_k^2,    (7)

where H_{x_k} : T_{x_k}M → T_{x_k}M is some symmetric linear operator, i.e., g_{x_k}(H_{x_k}ξ, χ) = g_{x_k}(ξ, H_{x_k}χ) for all ξ, χ ∈ T_{x_k}M. This is called the trust-region subproblem.

Next, an (approximate) solution η_k of the Euclidean trust-region subproblem (7) is computed using any available method; if an iterative method is used, its iterations are called inner iterations of the overall algorithm (see Section 3). The candidate for the new iterate is then given by x_+ = R_{x_k}(η_k). The decision whether to accept the candidate and how to update the trust-region radius is based on the quotient

\rho_k = \frac{f(x_k) - f(R_{x_k}(\eta_k))}{m_{x_k}(0_{x_k}) - m_{x_k}(\eta_k)} = \frac{\hat f_{x_k}(0_{x_k}) - \hat f_{x_k}(\eta_k)}{m_{x_k}(0_{x_k}) - m_{x_k}(\eta_k)}.    (8)

If ρ_k is exceedingly small, then the model is very inaccurate: the step must be rejected and the trust-region radius reduced. If ρ_k is small but less dramatically so, then the step is accepted but the trust-region radius is reduced. If ρ_k is close to 1, then there is good agreement between the model and the function over the step, and the trust-region radius can be expanded. This procedure can be formalized as the following algorithm; it reduces to [NW99, Alg. 4.1] in the classical R^n case (see [CGT00, Ch. 10] for variants).

Algorithm 1 (RTR: basic Riemannian Trust-Region algorithm)
Data: complete Riemannian manifold (M, g); real function f on M; retraction R from TM to M as in Definition 2.1.
Parameters: Δ̄ > 0, Δ_0 ∈ (0, Δ̄), and ρ' ∈ [0, 1/4).
Input: initial iterate x_0 ∈ M.
Output: sequence of iterates {x_k}.
for k = 0, 1, 2, ...
    Obtain η_k by (approximately) solving (7);
    Evaluate ρ_k from (8);
    if ρ_k < 1/4
        Δ_{k+1} = (1/4) Δ_k
    else if ρ_k > 3/4 and ‖η_k‖ = Δ_k
        Δ_{k+1} = min(2 Δ_k, Δ̄)
    else
        Δ_{k+1} = Δ_k;
    if ρ_k > ρ'
        x_{k+1} = R_{x_k}(η_k)
    else
        x_{k+1} = x_k;
end (for).

In the sequel we will sometimes drop the subscript k and denote x_{k+1} by x_+.
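A compact way to see how the pieces of Algorithm 1 fit together is the following schematic loop (ours, not the authors' implementation). The callables f, grad, retract and solve_subproblem are placeholders: the last one stands for any inner solver, such as the truncated CG of Algorithm 2 below, returning a tangent vector with ‖η‖ ≤ Δ together with the model decrease m(0) − m(η) it achieved.

```python
import numpy as np

def rtr(f, grad, retract, solve_subproblem, x0,
        delta_bar=1.0, delta0=0.5, rho_prime=0.1, maxiter=100):
    """Basic Riemannian trust-region loop (Algorithm 1), schematic sketch."""
    x, delta = x0, delta0
    for _ in range(maxiter):
        eta, model_decrease = solve_subproblem(x, grad(x), delta)
        x_cand = retract(x, eta)
        rho = (f(x) - f(x_cand)) / model_decrease        # quotient (8)
        if rho < 0.25:
            delta = 0.25 * delta                          # model poor: shrink radius
        elif rho > 0.75 and np.isclose(np.linalg.norm(eta), delta):
            delta = min(2.0 * delta, delta_bar)           # good model at boundary: expand
        if rho > rho_prime:
            x = x_cand                                    # accept the candidate
    return x
```

The norm test np.linalg.norm(eta) implicitly assumes the tangent space carries the Euclidean inner product; for a general metric g_x one would use the corresponding g_x-norm.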

2.1 Discussion

The concept of retraction (Definition 2.1) is related to the notion of a smooth family of parameterizations [HT04]. To any smooth family of parameterizations {µ_x}_{x∈M}, one can associate a retraction defined by R_x := µ_x ∘ (Dµ_x(0_x))^{-1}. Conversely, given smooth vector fields e_i, i = 1, ..., d, such that {e_i(x)}_{i=1,...,d} is a basis of T_x M for all x in a neighbourhood U of x̄, to a retraction R one can associate the mappings µ_x : (u_1, ..., u_d) ↦ R_x(Σ_i u_i e_i(x)), which form a locally smooth family of parameterizations around x̄. Notice that in general this technique cannot be applied globally: while a smooth basis {e_i(x)}_{i=1,...,d} always exists locally, there are manifolds for which such a basis fails to exist globally. (An example is a sphere embedded in a space of odd dimension: a result informally known as the hairy ball theorem states that there is no nonvanishing smooth vector field on such a sphere, hence there is no globally smooth basis of vector fields.) Moreover, from a computational viewpoint, such a basis of vector fields may be difficult or even impossible to deal with. The use of retractions in the context of trust-region methods is thus particularly convenient, as it takes advantage of the fact that trust-region methods (and optimization methods in general) can be defined on general Euclidean spaces, devoid of a particular basis. We refer to Section 5 for illustrations on concrete examples.

From the first-order rigidity condition DR_x(0_x) = id_{T_x M} satisfied by retractions (Definition 2.1), it follows that grad f̂_{x_k}(0_{x_k}) = grad f(x_k), where grad f(x), the gradient of f at x, is defined by g_x(grad f(x), ξ) = df_x(ξ) for all ξ ∈ T_x M (see [dc92, Ch. 3, Ex. 8]). Since no assumption is made on the second- and higher-order derivatives of the retraction (such conditions are not necessary in the local and global convergence analyses carried out in Section 4), it follows that in general Hess f̂_x(0_x) ≠ Hess f(x), where

\operatorname{Hess} f(x) : T_x M \to T_x M : \xi \mapsto \operatorname{Hess} f(x)[\xi] := \nabla_\xi \operatorname{grad} f(x)    (9)

is the Hessian operator of f at x, as defined in [dc92, Ch. 6, Ex. 11]. In Lemma 4.11, second-order conditions are given on the retraction under which Hess f̂_x(0_x) = Hess f(x). In the case of embedded submanifolds, the Levi-Civita connection reduces to a directional derivative in the embedding space followed by a projection onto the tangent space to the manifold [O'N83, §4.3]; this facilitates the derivation of a formula for Hess f(x)[ξ]. (Notice that in the literature, the word Hessian is sometimes used for the second covariant differential D²f, defined by D²f(x) : T_x M × T_x M → R : (ξ, χ) ↦ ξ(χf) − (∇_ξ χ)f, which is related to the Hessian operator (9) by the identity D²f(ξ, χ) = g_x(Hess f(x)[ξ], χ) [O'N83].)

In general, there is no assumption on the operator H_{x_k} in (7) other than being a symmetric linear operator. Consequently, even though m_{x_k} was initially presented as a model of f ∘ R_{x_k}, the choice of the retraction R_{x_k} does not impose any constraint on m_{x_k}. In order to achieve superlinear convergence, however, H_{x_k} will be required to be an approximate Hessian (Theorem 4.13). Obtaining an appropriate approximate Hessian in practice is addressed in Section 5.1. A possible way of choosing H_{x_k} is to define m_{x_k} as the quadratic model of f ∘ R̃_{x_k}, where R̃ is a retraction not necessarily equal to R; a similar point of view was adopted in [HT04] in the framework of Newton's method.

We conclude this section by pointing out more explicitly the link between Algorithm 1 and the Riemannian Newton method. Assume that H_{x_k} in (7) is the exact Hessian of f at x_k, and assume that the exact solution η of the trust-region subproblem (7) lies in the interior of the trust region. Then η satisfies

\operatorname{grad} f(x_k) + \nabla_\eta \operatorname{grad} f = 0,

which is the Riemannian Newton equation of Smith [Smi93, Smi94] and Udrişte [Udr94, Ch. 7, §5].
Note that both authors propose to apply the update vector η using the Riemannian exponential retraction; namely, the new iterate is defined as x_+ = Exp_x η.

As shown by Smith [Smi93, Smi94], the Riemannian Newton algorithm converges locally quadratically to the nondegenerate stationary points of f. A cubic rate of convergence is even observed in frequently encountered cases where some symmetry condition holds [Smi93, Smi94, EAS98, AMS04]. We will see in Section 4 that the superlinear convergence property of Newton's method is preserved by the trust-region modification, while the global convergence properties are improved: the accumulation points are guaranteed to be stationary points regardless of the initial conditions, and among the stationary points only the local minima can be local attractors.

3 Computing a trust-region step

We have seen in Section 2 that the use of retractions yields trust-region subproblems expressed in Euclidean spaces T_x M. Therefore, all the classical methods for solving the trust-region subproblem can be applied. As mentioned in the introduction, it is assumed here that, for some reason, usually related to the large size of the problem under consideration or to the computational efficiency required to outperform alternative methods, it is impractical to check positive-definiteness of H_{x_k}; rather, H_{x_k} is only available via its application to a vector. The truncated conjugate-gradient method of Steihaug [Ste83] and Toint [Toi81] is particularly appropriate in these circumstances. The following algorithm is a straightforward adaptation of the method of [Ste83] to the trust-region subproblem (7). This algorithm is an inner iteration, as it is used within the RTR framework (Algorithm 1) to compute an approximate solution of the trust-region subproblems. Note that superscript indices denote the evolution of η within the inner iteration, while subscripts are used in the outer iteration.

Algorithm 2 (tCG: truncated CG for the trust-region subproblem)
Set η^0 = 0, r^0 = grad f(x_k), δ^0 = −r^0;
for j = 0, 1, 2, ... until a stopping criterion is satisfied, perform the iteration:
    if g_{x_k}(δ^j, H_{x_k} δ^j) ≤ 0
        Compute τ such that η = η^j + τ δ^j minimizes m_{x_k}(η) in (7) and satisfies ‖η‖_{g_{x_k}} = Δ;
        return η;
    Set α^j = g_{x_k}(r^j, r^j) / g_{x_k}(δ^j, H_{x_k} δ^j);
    Set η^{j+1} = η^j + α^j δ^j;
    if ‖η^{j+1}‖_{g_{x_k}} ≥ Δ
        Compute τ ≥ 0 such that η = η^j + τ δ^j satisfies ‖η‖_{g_{x_k}} = Δ;
        return η;
    Set r^{j+1} = r^j + α^j H_{x_k} δ^j;
    Set β^{j+1} = g_{x_k}(r^{j+1}, r^{j+1}) / g_{x_k}(r^j, r^j);
    Set δ^{j+1} = −r^{j+1} + β^{j+1} δ^j;
end (for).

The simplest stopping criterion for Algorithm 2 is to truncate after a fixed number of iterations. In order to improve the convergence rate, a possibility is to stop as soon as an iteration j is reached for which

\|r^j\| \le \|r^0\| \min\left( \|r^0\|^\theta, \kappa \right).    (10)
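For concreteness, the following is a minimal flat-space rendering of Algorithm 2 (not the authors' code): the metric is taken to be the Euclidean inner product, H is accessed only through matrix-vector products, the inexactness test follows (10) with illustrative values of θ and κ, and the boundary steplength τ is the positive root discussed in the next paragraph.

```python
import numpy as np

def truncated_cg(grad_fx, hess_vec, delta, theta=1.0, kappa=0.1, maxiter=100):
    """Steihaug-Toint tCG for min_eta g^T eta + 0.5 eta^T H eta, ||eta|| <= delta,
    where hess_vec(v) returns H v (H is never formed or factored)."""
    eta = np.zeros_like(grad_fx)
    r = grad_fx.copy()                          # r^0 = grad f(x_k)
    d = -r                                      # delta^0 = -r^0
    r0_norm = np.linalg.norm(r)

    def to_boundary(eta, d):
        # positive tau such that ||eta + tau d|| = delta
        a, b, c = d @ d, 2.0 * (eta @ d), eta @ eta - delta ** 2
        return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

    for _ in range(maxiter):
        Hd = hess_vec(d)
        dHd = d @ Hd
        if dHd <= 0:                            # nonpositive curvature: stop on the boundary
            return eta + to_boundary(eta, d) * d
        alpha = (r @ r) / dHd
        eta_next = eta + alpha * d
        if np.linalg.norm(eta_next) >= delta:   # step would leave the trust region
            return eta + to_boundary(eta, d) * d
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) <= r0_norm * min(r0_norm ** theta, kappa):
            return eta_next                     # stopping criterion (10)
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        eta, r = eta_next, r_next
    return eta
```

Used inside the outer loop sketched after Algorithm 1, this routine would also report the model decrease it achieves so that the quotient (8) can be formed.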

Concerning the computation of τ, it can be shown that, when g_{x_k}(δ^j, H_{x_k}δ^j) ≤ 0, the minimizer of m_{x_k}(η^j + τδ^j) over the trust region is attained on the boundary, at the positive root of ‖η^j + τδ^j‖_{g_{x_k}} = Δ, which is explicitly given by

\tau = \frac{-g_{x_k}(\eta^j, \delta^j) + \sqrt{g_{x_k}(\eta^j, \delta^j)^2 + \left(\Delta^2 - g_{x_k}(\eta^j, \eta^j)\right) g_{x_k}(\delta^j, \delta^j)}}{g_{x_k}(\delta^j, \delta^j)}.

Notice that the tCG algorithm is inverse-free, as it uses H_{x_k} only through the products H_{x_k}δ^j. The reader interested in the underlying principles of the Steihaug-Toint truncated CG method should refer to [Ste83], [NW99] or [CGT00]. Alternatives to tCG for approximately solving trust-region subproblems are mentioned in [CGT00, Section 7.5.4]; see also [Hag01, HP05].

4 Convergence analysis

In this section, we first study the global convergence properties of the RTR scheme (Algorithm 1), without any assumption on the way the trust-region subproblems (7) are solved, except that the approximate solution η_k must produce a decrease of the model that is at least a fixed fraction of the so-called Cauchy decrease. Under mild additional assumptions on the retraction and the cost function, it is shown that the sequences {x_k} produced by Algorithm 1 converge to the set of stationary points of the cost function. This result is well known in the R^n case; in the case of manifolds, the convergence analysis has to address the fact that a different lifted cost function f̂_{x_k} is considered at each iterate x_k. We then analyze the local convergence of Algorithm 1-2 around nondegenerate local minima. Algorithm 1-2 refers to the RTR framework where the trust-region subproblems are approximately solved using the tCG algorithm with stopping criterion (10). It is shown that the iterates of the algorithm converge to nondegenerate stationary points with an order of convergence of at least min{θ + 1, 2}.

4.1 Global convergence

The objective of this section is to show that, under appropriate assumptions, the sequence {x_k} generated by Algorithm 1 satisfies lim_{k→∞} ‖grad f(x_k)‖ = 0; this generalizes a classical convergence property of trust-region methods in R^n; see [NW99, Theorem 4.8]. In what follows, (M, g) is a complete Riemannian manifold of dimension d, and R is a retraction on M (Definition 2.1). We define

\hat f : TM \to \mathbb{R} : \xi \mapsto f(R(\xi))    (11)

and, in accordance with (5), f̂_x denotes the restriction of f̂ to T_x M. We denote by B_δ(0_x) = {ξ ∈ T_x M : ‖ξ‖ < δ} the open ball in T_x M of radius δ centered at 0_x, and B_δ(x) stands for the set {y ∈ M : dist(x, y) < δ}, where dist denotes the Riemannian distance (i.e., the distance defined in terms of the Riemannian metric; see, e.g., [O'N83, §5.15]). We denote by P_γ^{t←t_0} v the vector of T_{γ(t)}M obtained by parallel transporting the vector v ∈ T_{γ(t_0)}M along a curve γ.

As in the classical R^n case, we first show that at least one accumulation point of {x_k} is stationary. The convergence result requires that m_{x_k}(η_k) be a sufficiently good approximation of f̂_{x_k}(η_k). In [CGT00, Thm 6.4.5] this is guaranteed by the assumption that the Hessian of the cost function is bounded. It is however possible to weaken this assumption, which leads us to consider the following definition.

Definition 4.1 (radially L-C^1 function) Let f̂ : TM → R be as in (11). We say that f̂ is radially Lipschitz continuously differentiable if there exist reals β_RL > 0 and δ_RL > 0 such that, for all x ∈ M, for all ξ ∈ T_x M with ‖ξ‖ = 1, and for all t < δ_RL, it holds that

\left| \frac{d}{d\tau} \hat f_x(\tau\xi) \Big|_{\tau=t} - \frac{d}{d\tau} \hat f_x(\tau\xi) \Big|_{\tau=0} \right| \le \beta_{RL}\, t.    (12)

For the purpose of Algorithm 1, which is a descent algorithm, this condition needs only to be imposed for all x in the level set

\{x \in M : f(x) \le f(x_0)\}.    (13)

A key assumption in the classical global convergence result in R^n is that the approximate solution η_k of the trust-region subproblem (7) produces at least as much decrease in the model function as a fixed fraction of the Cauchy decrease; see [NW99, Section 4.3]. Since the trust-region subproblem (7) is expressed on a Euclidean space, the definition of the Cauchy point is adapted from R^n without difficulty, and the bound

m_{x_k}(0) - m_{x_k}(\eta_k) \ge c_1 \|\operatorname{grad} f(x_k)\| \min\left( \Delta_k, \frac{\|\operatorname{grad} f(x_k)\|}{\|H_{x_k}\|} \right),    (14)

for some constant c_1 > 0, is readily obtained from the R^n case, where

\|H_{x_k}\| := \sup\{ \|H_{x_k}\zeta\| : \zeta \in T_{x_k}M, \|\zeta\| = 1 \}.    (15)

In particular, the truncated CG method (Algorithm 2) satisfies this bound (with c_1 = 1/2; see [NW99, Lemma 4.5]), since it first computes the Cauchy point and then attempts to improve the model decrease. With these preliminaries in place, we can state and prove the first global convergence result. Note that this theorem is presented under weak assumptions; stronger but arguably easier to check assumptions are given in Proposition 4.5.

Theorem 4.2 Let {x_k} be a sequence of iterates generated by Algorithm 1 with ρ' ∈ [0, 1/4). Suppose that f is C^1 and bounded below on the level set (13), that f̂ is radially L-C^1 (Definition 4.1), and that ‖H_{x_k}‖ ≤ β for some constant β. Further suppose that all approximate solutions η_k of (7) satisfy the Cauchy decrease inequality (14) for some positive constant c_1. We then have

\liminf_{k \to \infty} \|\operatorname{grad} f(x_k)\| = 0.

Proof. First, we perform some manipulation of ρ_k from (8). Notice that

\rho_k - 1 = \frac{(f(x_k) - \hat f_{x_k}(\eta_k)) - (m_{x_k}(0) - m_{x_k}(\eta_k))}{m_{x_k}(0) - m_{x_k}(\eta_k)} = \frac{m_{x_k}(\eta_k) - \hat f_{x_k}(\eta_k)}{m_{x_k}(0) - m_{x_k}(\eta_k)}.    (16)

Direct manipulations on the function t ↦ f̂_{x_k}(t η_k/‖η_k‖) yield

\hat f_{x_k}(\eta_k) = \hat f_{x_k}(0_{x_k}) + \|\eta_k\| \frac{d}{d\tau}\hat f_{x_k}\!\left(\tau \tfrac{\eta_k}{\|\eta_k\|}\right)\Big|_{\tau=0} + \int_0^{\|\eta_k\|} \left( \frac{d}{d\tau}\hat f_{x_k}\!\left(\tau \tfrac{\eta_k}{\|\eta_k\|}\right)\Big|_{\tau=t} - \frac{d}{d\tau}\hat f_{x_k}\!\left(\tau \tfrac{\eta_k}{\|\eta_k\|}\right)\Big|_{\tau=0} \right) dt = f(x_k) + g_{x_k}(\operatorname{grad} f(x_k), \eta_k) + \epsilon',

where |ε'| ≤ \int_0^{\|\eta_k\|} \beta_{RL}\, t\, dt = \tfrac{1}{2}\beta_{RL}\|\eta_k\|^2 whenever ‖η_k‖ < δ_RL, and β_RL and δ_RL are the constants in the radially L-C^1 property (12). Therefore, it follows from the definition (7) of m_{x_k} that

\left| m_{x_k}(\eta_k) - \hat f_{x_k}(\eta_k) \right| = \left| \tfrac{1}{2} g_{x_k}(H_{x_k}\eta_k, \eta_k) - \epsilon' \right| \le \tfrac{1}{2}\beta\|\eta_k\|^2 + \tfrac{1}{2}\beta_{RL}\|\eta_k\|^2 \le \beta'\|\eta_k\|^2    (17)

whenever ‖η_k‖ < δ_RL, where β' = max(β, β_RL).

Assume, for the purpose of contradiction, that the theorem does not hold; that is, assume that there exist ε > 0 and a positive index K such that

\|\operatorname{grad} f(x_k)\| \ge \epsilon \quad \text{for all } k \ge K.    (18)

From (14), for k ≥ K, we have

m_{x_k}(0) - m_{x_k}(\eta_k) \ge c_1 \|\operatorname{grad} f(x_k)\| \min\left( \Delta_k, \frac{\|\operatorname{grad} f(x_k)\|}{\|H_{x_k}\|} \right) \ge c_1 \epsilon \min\left( \Delta_k, \frac{\epsilon}{\beta'} \right).    (19)

Substituting (17) and (19) into (16), we obtain

|\rho_k - 1| \le \frac{\beta'\|\eta_k\|^2}{c_1 \epsilon \min(\Delta_k, \epsilon/\beta')} \le \frac{\beta'\Delta_k^2}{c_1 \epsilon \min(\Delta_k, \epsilon/\beta')}    (20)

whenever ‖η_k‖ < δ_RL. We can choose a value Δ̂ that allows us to bound the right-hand side of inequality (20) whenever Δ_k ≤ Δ̂. Choose Δ̂ as follows:

\hat\Delta \le \min\left( \frac{c_1\epsilon}{2\beta'}, \frac{\epsilon}{\beta'}, \delta_{RL} \right).

Then min(Δ_k, ε/β') = Δ_k whenever Δ_k ≤ Δ̂, and (20) becomes

|\rho_k - 1| \le \frac{\beta'\Delta_k^2}{c_1\epsilon\Delta_k} = \frac{\beta'\Delta_k}{c_1\epsilon} \le \frac{\beta'\hat\Delta}{c_1\epsilon} \le \frac{1}{2}.

Therefore, ρ_k ≥ 1/2 > 1/4 whenever Δ_k ≤ Δ̂, so that, by the workings of Algorithm 1, Δ_{k+1} ≥ Δ_k whenever Δ_k ≤ Δ̂. It follows that a reduction of Δ_k (by a factor of 1/4) can occur in Algorithm 1 only when Δ_k > Δ̂. Therefore, we conclude that

\Delta_k \ge \min\left( \Delta_K, \hat\Delta/4 \right) \quad \text{for all } k \ge K.    (21)

Suppose now that there is an infinite subsequence 𝒦 such that ρ_k ≥ 1/4 > ρ' for k ∈ 𝒦. If k ∈ 𝒦 and k ≥ K, we have from (19) that

f(x_k) - f(x_{k+1}) = f(x_k) - \hat f_{x_k}(\eta_k) \ge \tfrac{1}{4}\left( m_{x_k}(0) - m_{x_k}(\eta_k) \right) \ge \tfrac{1}{4} c_1 \epsilon \min\left( \Delta_k, \epsilon/\beta' \right).

Since f is bounded below on the level set containing these iterates, it follows from this inequality that

\lim_{k \in \mathcal{K},\, k \to \infty} \Delta_k = 0,

clearly contradicting (21). Hence such an infinite subsequence 𝒦 cannot exist. It follows that we must have ρ_k < 1/4 for all k sufficiently large, so that Δ_k is reduced by a factor of 1/4 on every iteration, and thus lim_{k→∞} Δ_k = 0, which again contradicts (21). Hence our original assumption (18) must be false, which gives the desired result.

To further show that all accumulation points of {x_k} are stationary points, we need to make an additional regularity assumption on the cost function f. The global convergence result in R^n, as stated in [NW99, Theorem 4.8], requires that f be Lipschitz continuously differentiable; that is, for all x, y ∈ R^n,

\|\operatorname{grad} f(y) - \operatorname{grad} f(x)\| \le \beta_1 \|y - x\|.    (22)

A key to obtaining a Riemannian counterpart of this global convergence result is to adapt the notion of Lipschitz continuous differentiability to the Riemannian manifold (M, g). The expression ‖x − y‖ in the right-hand side of (22) naturally becomes the Riemannian distance dist(x, y). For the left-hand side of (22), observe that the operation grad f(x) − grad f(y) is not well defined in general on a Riemannian manifold, since grad f(x) and grad f(y) belong to two different tangent spaces, namely T_x M and T_y M. However, if y belongs to a normal neighborhood of x, then there is a unique geodesic α(t) = Exp_x(t Exp_x^{-1} y) such that α(0) = x and α(1) = y, and we can parallel transport grad f(y) along α to obtain the vector P_α^{0←1} grad f(y) in T_x M. This yields the following definition.

Definition 4.3 (Lipschitz continuous differentiability) Assume that (M, g) has an injectivity radius i(M) > 0. A real function f on M is Lipschitz continuously differentiable if it is differentiable and if, for all x, y in M such that dist(x, y) < i(M), it holds that

\|P_\alpha^{0\leftarrow 1} \operatorname{grad} f(y) - \operatorname{grad} f(x)\| \le \beta_1 \operatorname{dist}(y, x),    (23)

where α is the unique geodesic with α(0) = x and α(1) = y.

Note that (23) is symmetric in x and y; indeed, since the parallel transport is an isometry, it follows that

\|P_\alpha^{0\leftarrow 1} \operatorname{grad} f(y) - \operatorname{grad} f(x)\| = \|\operatorname{grad} f(y) - P_\alpha^{1\leftarrow 0} \operatorname{grad} f(x)\|.

Moreover, we place one additional requirement on the retraction R: there exist µ > 0 and δ_µ > 0 such that

\|\xi\| \ge \mu\, \operatorname{dist}(x, R_x(\xi)) \quad \text{for all } x \in M \text{ and all } \xi \in T_x M \text{ with } \|\xi\| \le \delta_\mu.    (24)

Note that for the exponential retraction, condition (24) is satisfied as an equality, with µ = 1. The bound is also satisfied when R is smooth and M is compact (Corollary 4.6). We are now ready to show that, under some additional assumptions, the gradient of the cost function converges to zero on the whole sequence of iterates. Here again we refer to Proposition 4.5 for a simpler (but slightly stronger) set of assumptions that yield the same result.

Theorem 4.4 Let {x_k} be a sequence of iterates generated by Algorithm 1. Suppose that all the assumptions of Theorem 4.2 are satisfied. Further suppose that ρ' ∈ (0, 1/4), that f is Lipschitz continuously differentiable (Definition 4.3), and that (24) is satisfied for some µ > 0, δ_µ > 0. It then follows that

\lim_{k \to \infty} \|\operatorname{grad} f(x_k)\| = 0.

Proof. Consider any index m such that grad f(x_m) ≠ 0. The Lipschitz property (23) yields

\|P_\alpha^{0\leftarrow 1} \operatorname{grad} f(x) - \operatorname{grad} f(x_m)\| \le \beta_1 \operatorname{dist}(x, x_m)

for all x with dist(x, x_m) < i(M). Define the scalars

\epsilon = \tfrac{1}{2}\|\operatorname{grad} f(x_m)\|, \qquad r = \min\left( \frac{\|\operatorname{grad} f(x_m)\|}{2\beta_1}, i(M) \right) = \min\left( \frac{\epsilon}{\beta_1}, i(M) \right),

and the ball B_r(x_m) := {x : dist(x, x_m) < r}. Then for any x ∈ B_r(x_m) we have

\|\operatorname{grad} f(x)\| = \|P_\alpha^{0\leftarrow 1} \operatorname{grad} f(x)\| \ge \|\operatorname{grad} f(x_m)\| - \|P_\alpha^{0\leftarrow 1} \operatorname{grad} f(x) - \operatorname{grad} f(x_m)\| \ge 2\epsilon - \beta_1 \operatorname{dist}(x, x_m) > 2\epsilon - \beta_1 r \ge 2\epsilon - \epsilon = \epsilon.

If the entire sequence {x_k}_{k≥m} stayed inside the ball B_r(x_m), we would have ‖grad f(x_k)‖ > ε for all k ≥ m, which contradicts Theorem 4.2. Hence the sequence eventually leaves the ball B_r(x_m). Let the index l ≥ m be such that x_{l+1} is the first iterate after x_m outside of B_r(x_m). Since

‖grad f(x_k)‖ > ε for k = m, m+1, ..., l, we have

f(x_m) - f(x_{l+1}) = \sum_{k=m}^{l} \left( f(x_k) - f(x_{k+1}) \right) \ge \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \rho' \left( m_{x_k}(0) - m_{x_k}(\eta_k) \right) \ge \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \rho' c_1 \|\operatorname{grad} f(x_k)\| \min\left( \Delta_k, \frac{\|\operatorname{grad} f(x_k)\|}{\beta} \right) \ge \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \rho' c_1 \epsilon \min\left( \Delta_k, \frac{\epsilon}{\beta} \right).

We distinguish two cases. If Δ_k > ε/β in at least one of the terms of the sum, then

f(x_m) - f(x_{l+1}) \ge \rho' c_1 \epsilon \frac{\epsilon}{\beta}.    (25)

In the other case, we have

f(x_m) - f(x_{l+1}) \ge \rho' c_1 \epsilon \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \Delta_k \ge \rho' c_1 \epsilon \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \|\eta_k\|.    (26)

If ‖η_k‖ > δ_µ in at least one term of the sum, then

f(x_m) - f(x_{l+1}) \ge \rho' c_1 \epsilon \delta_\mu.    (27)

Otherwise, (26) and (24) yield

f(x_m) - f(x_{l+1}) \ge \rho' c_1 \epsilon \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \mu\, \operatorname{dist}(x_k, R_{x_k}(\eta_k)) = \rho' c_1 \epsilon \mu \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \operatorname{dist}(x_k, x_{k+1}) \ge \rho' c_1 \epsilon \mu\, r = \rho' c_1 \epsilon \mu \min\left( \frac{\epsilon}{\beta_1}, i(M) \right).    (28)

It follows from (25), (27) and (28) that

f(x_m) - f(x_{l+1}) \ge \rho' c_1 \epsilon \min\left( \frac{\epsilon}{\beta}, \delta_\mu, \frac{\epsilon\mu}{\beta_1}, i(M)\mu \right).    (29)

Because {f(x_k)}_{k=0}^{\infty} is nonincreasing and bounded below, we have

f(x_k) \downarrow f^*    (30)

for some f^* > −∞. It then follows from (29) that

f(x_m) - f^* \ge f(x_m) - f(x_{l+1}) \ge \rho' c_1 \epsilon \min\left( \frac{\epsilon}{\beta}, \delta_\mu, \frac{\epsilon\mu}{\beta_1}, i(M)\mu \right) = \tfrac{1}{2}\rho' c_1 \|\operatorname{grad} f(x_m)\| \min\left( \frac{\|\operatorname{grad} f(x_m)\|}{2\beta}, \delta_\mu, \frac{\|\operatorname{grad} f(x_m)\|\mu}{2\beta_1}, i(M)\mu \right).

Assume, for the purpose of contradiction, that it is not the case that lim_{m→∞} ‖grad f(x_m)‖ = 0. Then there exist ω > 0 and an infinite sequence 𝒦 such that ‖grad f(x_k)‖ > ω for all k ∈ 𝒦. Applying the above bound at each k ∈ 𝒦, we have

f(x_k) - f^* \ge \tfrac{1}{2}\rho' c_1 \|\operatorname{grad} f(x_k)\| \min\left( \frac{\|\operatorname{grad} f(x_k)\|}{2\beta}, \delta_\mu, \frac{\|\operatorname{grad} f(x_k)\|\mu}{2\beta_1}, i(M)\mu \right) > \tfrac{1}{2}\rho' c_1 \omega \min\left( \frac{\omega}{2\beta}, \delta_\mu, \frac{\omega\mu}{2\beta_1}, i(M)\mu \right),

which is a positive constant. This contradicts lim_{k→∞}(f(x_k) − f^*) = 0, so that our hypothetical assumption must be false, and lim_{m→∞} ‖grad f(x_m)‖ = 0.

Note that this theorem reduces gracefully to the classical R^n case, taking M = R^n endowed with the classical inner product and R_x(ξ) := x + ξ. Then i(M) = +∞ > 0, R satisfies (24), and the Lipschitz condition (23) reduces to the classical expression, which subsumes the radially L-C^1 condition.

The following proposition shows that the regularity conditions on f and f̂ required in the previous theorems are satisfied under stronger, but possibly easier to check, conditions. These conditions impose a bound on the Hessian of f and on the acceleration along the curves t ↦ R(tξ). Note also that all these conditions need only be checked on the level set {x ∈ M : f(x) ≤ f(x_0)}.

Proposition 4.5 Suppose that ‖grad f(x)‖ ≤ β_g and ‖Hess f(x)‖ ≤ β_H for some constants β_g, β_H and all x ∈ M. Moreover, suppose that

\left\| \frac{D}{dt} \frac{d}{dt} R(t\xi) \right\| \le \beta_D    (31)

for some constant β_D, for all ξ ∈ TM with ‖ξ‖ = 1 and all t < δ_D, where D/dt denotes the covariant derivative along the curve t ↦ R(tξ) (see [dc92, Ch. 2, Prop. 2.2]). Then the Lipschitz-C^1 condition on f (Definition 4.3) is satisfied with β_1 = β_H; the radially Lipschitz-C^1 condition on f̂ (Definition 4.1) is satisfied for δ_RL < δ_D and β_RL = β_H(1 + β_D δ_D) + β_g β_D; and the condition (24) on R is satisfied for values of µ and δ_µ satisfying δ_µ < δ_D and (1/2)β_D δ_µ < 1/µ − 1.

Proof. By a standard Taylor argument (see Lemma 4.7), boundedness of the Hessian of f implies the Lipschitz-C^1 property of f.

For (24), define u(t) = R(tξ) and observe that

\operatorname{dist}(x, R(t\xi)) \le \int_0^t \|u'(\tau)\|\, d\tau,

where \int_0^t \|u'(\tau)\|\, d\tau is the length of the curve u between 0 and t. Using the Cauchy-Schwarz inequality and the invariance of the metric by the connection, we have, for all t < δ_D,

\frac{d}{d\tau}\|u'(\tau)\| = \frac{d}{d\tau}\sqrt{g_{u(\tau)}(u'(\tau), u'(\tau))} = \frac{g_{u(\tau)}\!\left(\frac{D}{d\tau}u'(\tau), u'(\tau)\right)}{\|u'(\tau)\|} \le \left\|\frac{D}{d\tau}u'(\tau)\right\| \le \beta_D.

Therefore,

\int_0^t \|u'(\tau)\|\, d\tau \le \int_0^t \left( \|u'(0)\| + \beta_D \tau \right) d\tau = \|\xi\| t + \tfrac{1}{2}\beta_D t^2 = t + \tfrac{1}{2}\beta_D t^2,

which is smaller than t/µ if (1/2)β_D t < 1/µ − 1; this establishes (24) for δ_µ < δ_D with (1/2)β_D δ_µ < 1/µ − 1.

For the radially Lipschitz-C^1 condition, let u(t) = R(tξ) and h(t) = f(u(t)) = f̂(tξ) with ξ ∈ T_x M, ‖ξ‖ = 1. Then h'(t) = g_{u(t)}(grad f(u(t)), u'(t)) and

h''(t) = \frac{d}{dt} g_{u(t)}(\operatorname{grad} f(u(t)), u'(t)) = g_{u(t)}\!\left(\frac{D}{dt}\operatorname{grad} f(u(t)), u'(t)\right) + g_{u(t)}\!\left(\operatorname{grad} f(u(t)), \frac{D}{dt}u'(t)\right).

Now, (D/dt) grad f(u(t)) = ∇_{u'(t)} grad f(u(t)) = Hess f(u(t))[u'(t)]. It follows that |h''(t)| is bounded on t ∈ [0, δ_D) by the constant β_RL = β_H(1 + β_D δ_D) + β_g β_D. Then

|h'(t) - h'(0)| \le \int_0^t |h''(\tau)|\, d\tau \le t\,\beta_{RL}.

In many practical cases, the cost function and the retraction are smooth and the Riemannian manifold is compact. This is a comfortable situation, as the next result shows.

Corollary 4.6 (smoothness and compactness) If the cost function f and the retraction R are smooth and the Riemannian manifold M is compact, then all the conditions in Proposition 4.5 are satisfied.

Proof. This result comes from the fact that every continuous real function on a compact space is bounded. The only nontrivial part is to show that the set {ξ ∈ TM : ‖ξ‖ = 1} is compact. Compactness is a topological notion; recall that the topology of a manifold is the topology induced by the topological basis formed by the collection of coordinate domains. First we need to prove that every compact Riemannian manifold has a finite covering by compact subsets of coordinate domains. This can be done via a result on paracompactness found in [Mun00, Lem. 41.3]. Here we give a direct argument. Let M be a compact Riemannian manifold. Since M is Riemannian, it is metrizable [O'N83, §5.18], thus it is a regular topological space.

Let {A_i}_{i=1,...,n} be an (open) covering of M by coordinate domains. Let C_1 = M \ ∪_{i=2}^n A_i. It is a closed, thus compact, set contained in A_1. Since M is a regular space, it follows that for all x in C_1 there is an open set V_x containing x whose closure V̄_x is contained in A_1. The collection {V_x}_{x∈C_1} is an open covering of the compact set C_1, thus it admits a finite subcovering {V_j}_{j=1,...,m}. Let B_1 = ∪_{j=1}^m V_j. Then B_1 is open, its closure B̄_1 is contained in A_1, and {B_1, A_2, ..., A_n} is still an open covering of M. The same operation can now be performed on A_2, to obtain an open covering {B_1, B_2, A_3, ..., A_n}. Proceeding in this way, we finally obtain an open covering {B_i}_{i=1,...,n} such that B̄_i ⊂ A_i, i = 1, ..., n. The collection {B̄_i}_{i=1,...,n} is a covering of M by compact subsets of coordinate domains, as required.

Now let S := {ξ ∈ TM : ‖ξ‖ = 1}. We want to show that S is compact. Using the construction above, let S_i := {ξ ∈ TM : ‖ξ‖ = 1, πξ ∈ B̄_i}, where πξ denotes the foot of ξ, and notice that S = ∪_{i=1}^n S_i. Since each S_i is included in a coordinate domain, it is sufficient to show that the coordinate expression of each S_i is compact. Abusing notation, S_i has the coordinate expression {(x, ξ) : x ∈ B̄_i, Σ_{i,j} ξ^i g_{ij}(x) ξ^j = 1}. It is closed as the inverse image of a closed set under a continuous function, and it is bounded by continuity and nondegeneracy of the metric; hence it is compact, which completes the proof.

4.2 Local convergence

We now state local convergence properties of Algorithm 1-2 (i.e., Algorithm 1 where the trust-region subproblem (7) is solved approximately with Algorithm 2). We first state a few preparation lemmas. As before, (M, g) is a complete Riemannian manifold of dimension d, and R is a retraction on M (Definition 2.1).

The first lemma is a first-order Taylor formula for tangent vector fields. (Similar Taylor developments on manifolds can be found in [Smi94].)

Lemma 4.7 (Taylor) Let x ∈ M, let V be a normal neighborhood of x, and let ζ be a C^1 tangent vector field on M. Then, for all y ∈ V,

P_\gamma^{0\leftarrow 1} \zeta_y = \zeta_x + \nabla_\xi \zeta + \int_0^1 \left( P_\gamma^{0\leftarrow\tau} \nabla_{\gamma'(\tau)}\zeta - \nabla_\xi \zeta \right) d\tau,    (32)

where γ is the unique minimizing geodesic satisfying γ(0) = x and γ(1) = y, and ξ = Exp_x^{-1} y = γ'(0).

Proof. Start from

P_\gamma^{0\leftarrow 1} \zeta_y = \zeta_x + \int_0^1 \frac{d}{d\tau}\left( P_\gamma^{0\leftarrow\tau} \zeta \right) d\tau = \zeta_x + \nabla_\xi \zeta + \int_0^1 \left( \frac{d}{d\tau}\left( P_\gamma^{0\leftarrow\tau} \zeta \right) - \nabla_\xi \zeta \right) d\tau

and use the formula for the connection in terms of the parallel transport (see [dc92, Ch. 2, Ex. 2]) to obtain

\frac{d}{d\tau}\left( P_\gamma^{0\leftarrow\tau} \zeta \right) = \frac{d}{d\epsilon}\left( P_\gamma^{0\leftarrow\tau} P_\gamma^{\tau\leftarrow\tau+\epsilon} \zeta \right)\Big|_{\epsilon=0} = P_\gamma^{0\leftarrow\tau} \nabla_{\gamma'(\tau)}\zeta.

We use this lemma to show that in some neighborhood of a nondegenerate local minimizer v of f, the norm of the gradient of f can be taken as a measure of the Riemannian distance to v.

Lemma 4.8 Let v ∈ M and let f be a C^2 cost function such that grad f(v) = 0 and Hess f(v) is positive definite with maximal and minimal eigenvalues λ_max and λ_min. Then, given c_0 < λ_min and c_1 > λ_max, there exists a neighborhood V of v such that, for all x ∈ V, it holds that

c_0 \operatorname{dist}(v, x) \le \|\operatorname{grad} f(x)\| \le c_1 \operatorname{dist}(v, x).    (33)

Proof. From the Taylor formula (Lemma 4.7), it follows that

P_\gamma^{0\leftarrow 1} \operatorname{grad} f(x) = \operatorname{Hess} f(v)[\gamma'(0)] + \int_0^1 \left( P_\gamma^{0\leftarrow\tau} \operatorname{Hess} f(\gamma(\tau))[\gamma'(\tau)] - \operatorname{Hess} f(v)[\gamma'(0)] \right) d\tau,    (34)

where γ is the minimizing geodesic from v to x. Since f is C^2 and since ‖γ'(τ)‖ = dist(v, x) for all τ ∈ [0, 1], we have the following bound for the integral in (34):

\left\| \int_0^1 \left( P_\gamma^{0\leftarrow\tau} \operatorname{Hess} f(\gamma(\tau))[\gamma'(\tau)] - \operatorname{Hess} f(v)[\gamma'(0)] \right) d\tau \right\| = \left\| \int_0^1 \left( P_\gamma^{0\leftarrow\tau} \operatorname{Hess} f(\gamma(\tau)) P_\gamma^{\tau\leftarrow 0} - \operatorname{Hess} f(v) \right)[\gamma'(0)]\, d\tau \right\| \le \epsilon(\operatorname{dist}(v, x))\, \operatorname{dist}(v, x),

where lim_{t→0} ε(t) = 0. Since Hess f(v) is positive definite, λ_min > 0. Take V sufficiently small so that λ_min − ε(dist(v, x)) > c_0 and λ_max + ε(dist(v, x)) < c_1 for all x in V. Then, using the fact that the parallel translation is an isometry, (33) follows from (34).

We need a relation between the gradient of f at R_x(ξ) and the gradient of f̂_x at ξ.

Lemma 4.9 Let R be a retraction on M and let f be a C^1 cost function on M. Then, given v ∈ M and c_5 > 1, there exist a neighborhood V of v and δ > 0 such that

\|\operatorname{grad} f(R_x(\xi))\| \le c_5\, \|\operatorname{grad} \hat f_x(\xi)\|

for all x ∈ V and all ξ ∈ T_x M with ‖ξ‖ ≤ δ, where f̂ is as in (11).

Proof. Consider a parameterization of M at v, and consider the corresponding parameterization of TM (see [dc92, Ch. 0, Example 4.1]). Using Einstein's convention (see, e.g., [Sak96]), and denoting ∂_i f by f_{,i}, we have f̂_{x,i}(ξ) = f_{,j}(R_x(ξ)) A^j_i(ξ), where A(ξ) stands for the differential of R_x at ξ ∈ T_x M. Then

\|\operatorname{grad} \hat f_x(\xi)\|^2 = \hat f_{x,i}(\xi)\, g^{ij}(x)\, \hat f_{x,j}(\xi) = f_{,k}(R_x(\xi))\, A^k_i(\xi)\, g^{ij}(x)\, A^l_j(\xi)\, f_{,l}(R_x(\xi))

and

\|\operatorname{grad} f(R_x(\xi))\|^2 = f_{,i}(R_x(\xi))\, g^{ij}(R_x(\xi))\, f_{,j}(R_x(\xi)).

The conclusion follows by a real analysis argument, invoking the smoothness properties of R and g, compactness of the set {(x, ξ) : x ∈ V, ξ ∈ T_x M, ‖ξ‖ ≤ δ}, and using A(0_x) = id.

Finally, we need the following result concerning the Hessian at stationary points.

Lemma 4.10 Let R be a C^2 retraction, let f be a C^2 cost function, and let v be a stationary point of f (i.e., grad f(v) = 0). Then Hess f̂_v(0_v) = Hess f(v).

Proof. Let A denote the differential of R_v at 0_v. Working in a parameterization of M around v and using Einstein's convention, one obtains (see [Sak96] for the notation)

(\operatorname{Hess} \hat f_v)^i_{\ j} = g^{ik}\, \partial_j \partial_k \hat f_v = g^{ik}\, \partial_j \partial_k (f \circ R_v) = g^{ik}\, \partial_l f\, \partial_j A^l_k + g^{ik}\, \partial_l \partial_j f\, A^l_k

and

(\operatorname{Hess} f)^i_{\ j} = g^{ik}\, \nabla_k \partial_j f = g^{ik}\, \partial_k \partial_j f - g^{ik}\, \Gamma^l_{kj}\, \partial_l f,

where the Γ^l_{kj} are the Christoffel symbols. At v one has ∂_l f = 0, and A^l_k is the identity, hence Hess f̂_v(0_v) = Hess f(v).

Away from the stationary points, the Hessians Hess f(x) and Hess f̂_x(0_x) do not coincide in general. They do coincide if a zero-acceleration condition (35) is imposed on the retraction. This result will not be used in the convergence analysis, but it can be useful in applications, as explained after (9).

Lemma 4.11 Suppose that

\frac{D}{dt}\left( \frac{d}{dt} R(t\xi) \right)\Big|_{t=0} = 0 \quad \text{for all } \xi \in TM,    (35)

where D/dt denotes the covariant derivative along the curve t ↦ R(tξ) (see [dc92, Ch. 2, Prop. 2.2]). Then Hess f(x) = Hess f̂_x(0_x).

Proof. Observe that

D^2 f(x)[\xi, \xi] = \frac{d^2}{dt^2} f(\operatorname{Exp}_x t\xi)\Big|_{t=0}

and

D^2 \hat f_x(0_x)[\xi, \xi] = \frac{d^2}{dt^2} f(R_x(t\xi))\Big|_{t=0} = \frac{d}{dt}\left( df\!\left( \tfrac{d}{dt}R_x(t\xi) \right) \right)\Big|_{t=0} = \nabla_\xi df\,[\xi] + df\!\left[ \frac{D}{dt}\left( \tfrac{d}{dt}R_x(t\xi) \right)\Big|_{t=0} \right].

Under (35) the last term vanishes, and the result follows from the definitions of the Hessians and the one-to-one correspondence between symmetric bilinear forms and quadratic forms.

We now state and prove the local convergence results. The first result states that the nondegenerate local minima are attractors of Algorithm 1-2. The principle of the argument is closely related to the Capture Theorem; see [Ber95, Theorem 1.2.5].

Theorem 4.12 (local convergence to local minima) Consider Algorithm 1-2, i.e., the Riemannian trust-region algorithm where the trust-region subproblems (7) are solved using the truncated CG algorithm with stopping criterion (10), with all the assumptions of Theorem 4.2. Let v be a nondegenerate local minimizer of f, i.e., grad f(v) = 0 and Hess f(v) is positive definite. Assume that x ↦ ‖H_x^{-1}‖ is bounded on a neighbourhood of v and that (24) holds for some µ > 0 and δ_µ > 0. Then there exists a neighborhood V of v such that, for all x_0 ∈ V, the sequence {x_k} generated by Algorithm 1-2 converges to v.

Proof. Take δ_1 > 0 with δ_1 < δ_µ such that the ball B_{δ_1}(v) contains no stationary point other than v and such that f(x) > f(v) for all x ∈ B_{δ_1}(v) with x ≠ v. Take δ_2 > 0 small enough that, for all x ∈ B_{δ_2}(v), it holds that ‖η*(x)‖ ≤ µ(δ_1 − δ_2), where η*(x) denotes the (unique) solution of H_x η = −grad f(x); such a δ_2 exists because of Lemma 4.8 and the bound on ‖H_x^{-1}‖. Consider a level set L of f such that V := L ∩ B_{δ_1}(v) is a subset of B_{δ_2}(v); the existence of such a level set follows from the continuity of f (f is C^1). Then V is a neighborhood of v, and for all x ∈ V we have

\operatorname{dist}(x, x_+) \le \tfrac{1}{\mu}\|\eta^{tCG}(x, \Delta)\| \le \tfrac{1}{\mu}\|\eta^*(x)\| \le \delta_1 - \delta_2,

where we used the fact that ‖η‖ is increasing along the truncated CG process [Ste83, Thm 2.1]. It follows from the inequality above that x_+ is in B_{δ_1}(v). Moreover, since f(x_+) ≤ f(x), it