A Riemannian symmetric rank-one trust-region method


Tech. report UCL-INMA v2

Wen Huang, P.-A. Absil, K. A. Gallivan

January 15, 2014

Abstract

The well-known symmetric rank-one trust-region method, in which the Hessian approximation is generated by the symmetric rank-one update, is generalized to the problem of minimizing a real-valued function over a d-dimensional Riemannian manifold. The generalization relies on basic differential-geometric concepts, such as tangent spaces, Riemannian metrics, and the Riemannian gradient, as well as on the more recent notions of (first-order) retraction and vector transport. The new method, called RTR-SR1, is shown to converge globally and (d+1)-step q-superlinearly to stationary points of the objective function. A limited-memory version, referred to as LRTR-SR1, is also introduced. In this context, novel efficient strategies are presented to construct a vector transport on a submanifold of a Euclidean space. Numerical experiments (Rayleigh quotient minimization on the sphere and a joint diagonalization problem on the Stiefel manifold) illustrate the value of the new methods.

Key words: Riemannian optimization; optimization on manifolds; symmetric rank-one update; Rayleigh quotient; joint diagonalization; Stiefel manifold

2010 Mathematics Subject Classification: 65K05, 90C48, 90C53

1 Introduction

We consider the problem

$\min_{x \in \mathcal{M}} f(x)$   (1)

of minimizing a smooth real-valued function $f$ defined on a Riemannian manifold $\mathcal{M}$. Recently investigated application areas include image segmentation [RW12] and recognition [TVSC11], electrostatics and electronic structure calculation [WY12], finance and chemistry [Bor12], multilinear algebra [SL10, IAVD11], low-rank learning [MMBS11, BA11], and blind source separation [KS12, SAGQ12].

This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian Science Policy Office. This work was financially supported by the Belgian FRFC (Fonds de la Recherche Fondamentale Collective). This work was performed in part while the third author was a Visiting Professor at the Institut de mathématiques pures et appliquées (MAPA) at Université catholique de Louvain.

Department of Mathematics, 208 Love Building, 1017 Academic Way, Florida State University, Tallahassee, FL, USA

Department of Mathematical Engineering, ICTEAM Institute, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

Corresponding author. absil@inma.ucl.ac.be.

The wealth of applications has stimulated the development of general-purpose methods for (1); see, e.g., [AMS08, RW12, SI13] and references therein, including the trust-region approach upon which we focus in this work. A well-known technique in optimization [CGT00], the trust-region method was extended to Riemannian manifolds in [ABG07] (or see [AMS08, Ch. 7]), and found applications, e.g., in [JBAS10, VV10, IAVD11, MMBS11, BA11]. Trust-region methods construct a quadratic model $m_k$ of the objective function $f$ around the current iterate $x_k$ and produce a candidate new iterate by (approximately) minimizing the model $m_k$ within a region where it is trusted. Depending on the discrepancy between $f$ and $m_k$ at the candidate new iterate, the size of the trust region is updated and the candidate new iterate is accepted or rejected.

For lack of efficient techniques to produce a second-order term in $m_k$ that is inexact but nevertheless guarantees superlinear convergence, the Riemannian trust-region (RTR) framework loses some of its appeal when the exact second-order term, the Hessian of $f$, is not available. This is in contrast with the Euclidean case, where several strategies exist to build an inexact second-order term that preserves superlinear convergence of the trust-region method. Among these strategies, the symmetric rank-one (SR1) update is favored in view of its simplicity and because it preserves symmetry without unnecessarily enforcing positive definiteness; see, e.g., [NW06, Sec. 6.2] for a more detailed discussion. The (n+1)-step q-superlinear rate of convergence of the SR1 trust-region method was shown by Byrd et al. [BKS96] using a sophisticated analysis that builds on [CGT91, KBS93].

The classical (Euclidean) SR1 trust-region method can also be viewed as a quasi-Newton method, enhanced with a trust-region globalization strategy. The idea of quasi-Newton methods on manifolds is not new [Gab82, Sec. 4.5]; however, most of the literature of which we are aware restricts consideration to generalizing the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method combined with a line-search strategy. Early work such as [BM06, SL10] used Riemannian BFGS methods for a specific application without an analysis of convergence properties, but more recently, systematic analyses of Riemannian BFGS methods based on the framework of retraction and vector transport developed in [ADM02, AMS08] have been made. Qi [Qi11] analyzed a version of Riemannian BFGS methods with retraction and vector transport restricted to the exponential mapping and parallel translation and showed superlinear convergence using a Riemannian Dennis-Moré condition. Ring and Wirth [RW12] proposed and analyzed a Riemannian BFGS that avoids the restrictions on retraction and vector transport assumed by Qi but that needs to resort to the derivative of the retraction. Seibert et al. [SKH13] discussed the freedom available when generalizing BFGS to Riemannian manifolds and analyzed one generalization of the BFGS method on Riemannian manifolds that are isometric to $\mathbb{R}^n$. Most recently, Huang [Hua13] developed a complete convergence theory that avoids the restrictions of Qi, Ring and Wirth, guarantees superlinear convergence for the Riemannian Broyden family of quasi-Newton methods (including a version of SR1), and facilitates efficient implementation.

In this paper, motivated by the situation described above, we introduce a generalization of the classical (i.e., Euclidean) SR1 trust-region method to the Riemannian setting (1).
Besides making use of basic Riemannian geometric concepts (tangent space, Riemannian metric, gradient), the new method, called RTR-SR1, relies on the notions of retraction and vector transport introduced in [ADM02, AMS08]. A detailed global and local convergence analysis is given. A limited-memory version of RTR-SR1, referred to as LRTR-SR1, is also introduced. Numerical experiments show that the RTR-SR1 method displays the expected convergence properties. When the Hessian of $f$ is not available, RTR-SR1 thus offers an attractive way of tackling (1) by a trust-region approach. Moreover, even when the Hessian of $f$ is available, making use of it can be expensive computationally, and the numerical experiments show that ignoring the Hessian information and resorting instead to the RTR-SR1 approach can be beneficial.

Another contribution of this paper with respect to [BKS96] is an extension of the analysis to allow for inexact solutions of the trust-region subproblem (compare (10) with [BKS96, (2.4)]). This extension makes it possible to resort to inner iterations such as the Steihaug-Toint truncated CG method (see [AMS08, Sec. 7.3.2] for its Riemannian extension) while staying within the assumptions of the convergence analysis.

The paper is organized as follows. The RTR-SR1 method is stated and discussed in Section 2. The convergence analysis is carried out in Section 3. The limited-memory version is introduced in Section 4. Numerical experiments are reported in Section 5. Conclusions are drawn in Section 6.

2 The Riemannian SR1 trust-region method

The proposed Riemannian SR1 trust-region (RTR-SR1) method is described in Algorithm 1. The algorithm statement is commented in Section 2.1, and the important questions of representing tangent vectors and choosing the vector transport are discussed in Sections 2.2 and 2.3.

2.1 A guide to Algorithm 1

Algorithm 1 can be viewed as a Riemannian version of the classical (Euclidean) SR1 trust-region method (see, e.g., [NW06, Algorithm 6.2]). It can also be viewed as an SR1 version of the Riemannian trust-region framework [AMS08, Algorithm 10, p. 142]. Therefore, several pieces of information given in [AMS08, Ch. 7] remain relevant for Algorithm 1. In particular, the algorithm statement makes use of standard Riemannian concepts that are described, e.g., in [O'N83, AMS08], such as the tangent space $T_x\mathcal{M}$ to the manifold $\mathcal{M}$ at a point $x$, a Riemannian metric $g$, and the gradient $\mathrm{grad}\,f$ of a real-valued function $f$ on $\mathcal{M}$.

The algorithm statement also relies on the notion of retraction, introduced in [ADM02] (or see [AMS08, Sec. 4.1]). A retraction $R$ on $\mathcal{M}$ is a smooth map from the tangent bundle $T\mathcal{M}$ (i.e., the set of all tangent vectors to $\mathcal{M}$) onto $\mathcal{M}$ such that, for all $x \in \mathcal{M}$ and all $\xi_x \in T_x\mathcal{M}$, the curve $t \mapsto R(t\xi_x)$ is tangent to $\xi_x$ at $t = 0$. We let $R_x$ denote the restriction of $R$ to $T_x\mathcal{M}$. The domain of $R$ need not be the entire tangent bundle, but this is usually the case in practice, and in this work we assume throughout that $R$ is defined wherever needed. Specific ways of constructing retractions are proposed in [ADM02, AMS08, AM12]; see also [WY12, JD13] for the important case of the Stiefel manifold.

Within the Riemannian trust-region framework, the characterizing aspect of Algorithm 1 lies in the update mechanism for the Hessian approximation $B_k$. The proposed update mechanism, based on formula (3) and on Step 6 of Algorithm 1, is a rather straightforward Riemannian generalization of the classical SR1 update

$B_{k+1} = B_k + \dfrac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$.

Significantly less straightforward is the Riemannian generalization of the superlinear convergence result, as we will see in Section 3.4. (Observe that the local convergence result in [AMS08, Sec. 7.4] does not apply here because the Hessian approximation condition [AMS08, (7.36)] is not guaranteed to hold.)
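For readers who prefer to see the classical update in computational form, the following is a minimal NumPy sketch of the Euclidean SR1 update together with the standard skip rule that Step 4 of Algorithm 1 generalizes. It is purely illustrative (not part of the paper's algorithmic contribution); the threshold `nu` plays the role of $\nu$ in Algorithm 1.

```python
import numpy as np

def sr1_update(B, s, y, nu=1e-8):
    """Euclidean SR1 update of a symmetric Hessian approximation B.

    B : (n, n) symmetric matrix, current Hessian approximation
    s : (n,) step vector
    y : (n,) gradient difference grad f(x + s) - grad f(x)
    nu: skip threshold, mirroring nu in Algorithm 1
    """
    r = y - B @ s                      # residual y_k - B_k s_k
    denom = r @ s                      # (y_k - B_k s_k)^T s_k
    # Skip the update when the denominator is small relative to ||s|| ||r||,
    # which keeps the update well defined and bounded.
    if abs(denom) < nu * np.linalg.norm(s) * np.linalg.norm(r):
        return B
    return B + np.outer(r, r) / denom  # rank-one correction; symmetry preserved
```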

Algorithm 1 Riemannian trust region with symmetric rank-one update (RTR-SR1)

Input: Riemannian manifold $\mathcal{M}$ with Riemannian metric $g$; retraction $R$; isometric vector transport $\mathcal{T}_S$; differentiable real-valued objective function $f$ on $\mathcal{M}$; initial iterate $x_0 \in \mathcal{M}$; initial Hessian approximation $B_0$, symmetric with respect to $g$.

1: Choose $\Delta_0 > 0$, $\nu \in (0,1)$, $c \in (0, 0.1)$, $\tau_1 \in (0,1)$ and $\tau_2 > 1$; set $k \leftarrow 0$;
2: Obtain $s_k \in T_{x_k}\mathcal{M}$ by (approximately) solving
   $s_k = \arg\min_{s \in T_{x_k}\mathcal{M}} m_k(s) = \arg\min_{s \in T_{x_k}\mathcal{M}} f(x_k) + g(\mathrm{grad}\,f(x_k), s) + \tfrac{1}{2} g(s, B_k s)$, s.t. $\|s\| \le \Delta_k$;   (2)
3: Set $\rho_k = \dfrac{f(x_k) - f(R_{x_k}(s_k))}{m_k(0) - m_k(s_k)}$;
4: Let $y_k = \mathcal{T}_{S_{s_k}}^{-1} \mathrm{grad}\,f(R_{x_k}(s_k)) - \mathrm{grad}\,f(x_k)$; if $|g(s_k, y_k - B_k s_k)| < \nu \|s_k\| \|y_k - B_k s_k\|$, then $\tilde{B}_{k+1} = B_k$; otherwise define the linear operator $\tilde{B}_{k+1} \colon T_{x_k}\mathcal{M} \to T_{x_k}\mathcal{M}$ by
   $\tilde{B}_{k+1} = B_k + \dfrac{(y_k - B_k s_k)(y_k - B_k s_k)^\flat}{g(s_k, y_k - B_k s_k)}$,   (SR1) (3)
   where $a^\flat$ denotes the flat of $a \in T_{x_k}\mathcal{M}$, i.e., $a^\flat \colon T_{x_k}\mathcal{M} \to \mathbb{R} \colon v \mapsto g(a, v)$;
5: if $\rho_k > c$ then
6:   $x_{k+1} \leftarrow R_{x_k}(s_k)$; $B_{k+1} \leftarrow \mathcal{T}_{S_{s_k}} \circ \tilde{B}_{k+1} \circ \mathcal{T}_{S_{s_k}}^{-1}$;
7: else
8:   $x_{k+1} \leftarrow x_k$; $B_{k+1} \leftarrow \tilde{B}_{k+1}$;
9: end if
10: if $\rho_k > \tfrac{3}{4}$ then
11:   if $\|s_k\| \ge 0.8\,\Delta_k$ then
12:     $\Delta_{k+1} = \tau_2 \Delta_k$;
13:   else
14:     $\Delta_{k+1} = \Delta_k$;
15:   end if
16: else if $\rho_k < 0.1$ then
17:   $\Delta_{k+1} = \tau_1 \Delta_k$;
18: else
19:   $\Delta_{k+1} = \Delta_k$;
20: end if
21: $k \leftarrow k + 1$; go to Step 2 until convergence.

Instrumental in the Riemannian SR1 update is the notion of vector transport, introduced in [AMS08, Sec. 8.1] as a generalization of the classical Riemannian concept of parallel translation. A vector transport on a manifold $\mathcal{M}$ on top of a retraction $R$ is a smooth mapping

$T\mathcal{M} \oplus T\mathcal{M} \to T\mathcal{M} \colon (\eta_x, \xi_x) \mapsto \mathcal{T}_{\eta_x}(\xi_x) \in T\mathcal{M}$

satisfying the following properties for all $x \in \mathcal{M}$:

1. (Associated retraction) $\mathcal{T}_{\eta_x}(\xi_x) \in T_{R_x(\eta_x)}\mathcal{M}$ for all $\xi_x \in T_x\mathcal{M}$;

2. (Consistency) $\mathcal{T}_{0_x}(\xi_x) = \xi_x$ for all $\xi_x \in T_x\mathcal{M}$;
3. (Linearity) $\mathcal{T}_{\eta_x}(a\xi_x + b\zeta_x) = a\mathcal{T}_{\eta_x}(\xi_x) + b\mathcal{T}_{\eta_x}(\zeta_x)$.

The Riemannian SR1 update uses tangent vectors at the current iterate to produce a new Hessian approximation at the next iterate, hence the need to perform a vector transport (see Step 6) from the current iterate to the next. In the Input step of Algorithm 1, the requirement that the vector transport $\mathcal{T}_S$ is isometric means that, for all $x \in \mathcal{M}$ and all $\xi_x, \zeta_x, \eta_x \in T_x\mathcal{M}$, the equation

$g(\mathcal{T}_{S_{\eta_x}} \xi_x, \mathcal{T}_{S_{\eta_x}} \zeta_x) = g(\xi_x, \zeta_x)$   (4)

holds. Techniques for constructing an isometric vector transport on submanifolds of Euclidean spaces are described in Section 2.3.

The symmetry requirement on $B_0$ with respect to the Riemannian metric $g$ means that $g(B_0 \xi_{x_0}, \eta_{x_0}) = g(\xi_{x_0}, B_0 \eta_{x_0})$ for all $\xi_{x_0}, \eta_{x_0} \in T_{x_0}\mathcal{M}$. It is readily seen from (3) and Step 6 of Algorithm 1 that $B_k$ is symmetric for all $k$. Note however that $B_k$ is, in general, not positive definite.

A possible stopping criterion for Algorithm 1 is $\|\mathrm{grad}\,f(x_k)\| < \epsilon$ for some specified $\epsilon > 0$, where $\|\cdot\|$, which also appears in the statement of Algorithm 1, denotes the norm induced by the Riemannian metric $g$, i.e.,

$\|\xi\| = \sqrt{g(\xi, \xi)}$.   (5)

In the spirit of [RW12, Remark 4], we point out that it is possible to formulate the SR1 update (3) in the new tangent space $T_{x_{k+1}}\mathcal{M}$; in the present case of SR1, the algorithm remains equivalent since the vector transport is isometric. Otherwise, Algorithm 1 does not call for comments other than those made in [AMS08, Ch. 7]. In particular, we point out that the meaning of "approximately" in Step 2 of Algorithm 1 depends on the desired convergence results. We will see in the convergence analysis (Section 3) that enforcing the Cauchy decrease (9) is enough to ensure global convergence to stationary points, but another condition such as (10) is needed to guarantee superlinear convergence. The truncated CG method, discussed in [AMS08, Sec. 7.3.2] in the Riemannian context, is an inner iteration for Step 2 that returns an $s_k$ satisfying conditions (9) and (10).

2.2 Representation of tangent vectors

Let us now consider the frequently encountered situation where the manifold $\mathcal{M}$ is described as a $d$-dimensional submanifold of an $m$-dimensional Euclidean space $\mathcal{E}$. In particular, this is the case of the sphere and the Stiefel manifold involved in the numerical experiments in Section 5. A tangent vector in $T_x\mathcal{M}$ can be represented either by its $d$-dimensional vector of coordinates in a given basis $B_x$ of $T_x\mathcal{M}$, or else as an $m$-dimensional vector in $\mathcal{E}$, since $T_x\mathcal{M} \subseteq T_x\mathcal{E} \simeq \mathcal{E}$. The latter option may be preferable when the codimension $m - d$ is small (e.g., the sphere) because building, storing and manipulating the basis $B_x$ of $T_x\mathcal{M}$ may be inconvenient. Likewise, since $B_k$ is a linear transformation of $T_{x_k}\mathcal{M}$, it can be represented in the basis $B_{x_k}$ as a $d \times d$ matrix, or as an $m \times m$ matrix restricted to act on $T_{x_k}\mathcal{M}$. Here again, the latter approach may be computationally more efficient when the codimension $m - d$ is small. A related choice has to be made for the representation of the vector transport, since $\mathcal{T}_{\eta_x}$ is a linear map from $T_x\mathcal{M}$ to $T_{R_x(\eta_x)}\mathcal{M}$. This question is addressed in Section 2.3.
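To make the ambient-representation option concrete, the following is a minimal illustrative sketch (it is not the implementation used for the experiments in Section 5) of Steps 3 to 9 of Algorithm 1 for a Riemannian submanifold of $\mathbb{R}^m$ with the induced metric, with tangent vectors and $B_k$ stored as ambient NumPy arrays. The callables `f`, `grad`, `retract` and `transport_matrix` are user-supplied placeholders; `transport_matrix` is assumed to return the matrix of an isometric transport such as the one in Section 2.3.1, whose inverse on tangent vectors is then its transpose.

```python
import numpy as np

def rtr_sr1_step(f, grad, retract, transport_matrix, x, B, s, m_decrease,
                 nu=1e-4, c=1e-3):
    """One pass through Steps 3-9 of Algorithm 1 (illustrative sketch).

    x : current iterate (ambient m-vector on the submanifold)
    B : (m, m) matrix acting on T_x M (ambient representation of B_k)
    s : tangent step from the inner solver for (2)
    m_decrease : model decrease m_k(0) - m_k(s_k) reported by the inner solver
    """
    x_trial = retract(x, s)
    rho = (f(x) - f(x_trial)) / m_decrease               # Step 3: actual/model ratio
    T = transport_matrix(x, s)                           # matrix of T_{S_{s_k}}
    y = T.T @ grad(x_trial) - grad(x)                    # Step 4 (T^{-1} = T^T assumed)
    r = y - B @ s
    denom = s @ r
    if abs(denom) < nu * np.linalg.norm(s) * np.linalg.norm(r):
        B_tilde = B                                      # skip rule
    else:
        B_tilde = B + np.outer(r, r) / denom             # SR1 update (3)
    if rho > c:                                          # Steps 5-6: accept, transport B
        return x_trial, T @ B_tilde @ T.T, rho
    return x, B_tilde, rho                               # Step 8: reject
```

The step $s_k$ and the model decrease are assumed to come from an inner solver for (2), e.g., a truncated CG iteration; the trust-region radius update (Steps 10 to 20) is omitted for brevity.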

2.3 Isometric vector transport

We present two ways of constructing an isometric vector transport on a $d$-dimensional submanifold $\mathcal{M}$ of an $m$-dimensional Euclidean space $\mathcal{E}$.

2.3.1 Vector transport by parallelization

An open subset $\mathcal{U}$ of $\mathcal{M}$ is termed parallelizable if it admits a smooth field of tangent bases, i.e., a smooth function $B \colon \mathcal{U} \to \mathbb{R}^{m \times d} \colon z \mapsto B_z$ where $B_z$ is a basis of $T_z\mathcal{M}$. The whole manifold $\mathcal{M}$ itself may not be parallelizable; in particular, every manifold of nonzero Euler characteristic is not parallelizable [Sti35, Sec. 1.6], and it is also known that the only parallelizable spheres are those of dimension 1, 3, and 7 [Boo03, p. 116]. However, for the global convergence analysis carried out in Section 3.2, the vector transport is not required to be smooth or even continuous, and for the local convergence analysis in Section 3.4, we only need a parallelizable neighborhood $\mathcal{U}$ of the limit point $x^*$. Such a neighborhood always exists (take for example a coordinate neighborhood [AMS08, p. 37]).

Given an orthonormal smooth field of tangent bases $B$, i.e., such that $B_x^T B_x = I$ for all $x$ (where $I$ stands for the identity matrix of adequate size), the proposed isometric vector transport from $T_x\mathcal{M}$ to $T_y\mathcal{M}$ is given by

$\mathcal{T} = B_y B_x^T$.   (6)

The $d \times d$ matrix representation of this vector transport in the pair of bases $(B_x, B_y)$ is simply the identity. This considerably simplifies the implementation of Algorithm 1.

2.3.2 Vector transport by rigging

If $\mathcal{M}$ is described as a $d$-dimensional submanifold of an $m$-dimensional Euclidean space $\mathcal{E}$ and the codimension $m - d$ is much smaller than the dimension $d$, then the vector transport by rigging, introduced next, may be preferable. For generality, we do not assume that $\mathcal{M}$ is a Riemannian submanifold of $\mathcal{E}$; in other words, the Riemannian metric $g$ on $\mathcal{M}$ may not be the one induced by the metric of $\mathcal{E}$. A motivation for this generality is to be able to handle the canonical metric of the Stiefel manifold [EAS98, (2.22)]. For simplicity of the exposition, we work in an orthonormal basis of $\mathcal{E}$ and, for $x \in \mathcal{M}$, we let $G_x$ denote a matrix expression of $g_x$, i.e., $g_x(\xi_x, \eta_x) = \xi_x^T G_x \eta_x$ for all $\xi_x, \eta_x \in T_x\mathcal{M}$.

An open subset $\mathcal{U}$ of $\mathcal{M}$ is termed rigged if it admits a smooth field of normal bases, i.e., a smooth function $N \colon \mathcal{U} \to \mathbb{R}^{m \times (m-d)} \colon z \mapsto N_z$ where $N_z$ is a basis of the normal space $N_z\mathcal{M}$. The whole manifold $\mathcal{M}$ itself may not be rigged, but it is always locally rigged. Given a smooth field of normal bases $N$, the proposed isometric vector transport $\mathcal{T}$ from $T_x\mathcal{M}$ to $T_y\mathcal{M}$ is defined as follows. Compute $(I - N_x(N_x^T N_x)^{-1} N_x^T) N_y$ (i.e., the orthogonal projection of $N_y$ onto $T_x\mathcal{M}$) and observe that its column space is $T_x\mathcal{M} \ominus (T_x\mathcal{M} \cap T_y\mathcal{M})$. Obtain an orthonormal matrix $Q_x$ by Gram-Schmidt orthonormalizing $(I - N_x(N_x^T N_x)^{-1} N_x^T) N_y$. Proceed likewise with $x$ and $y$ interchanged to get $Q_y$. Finally, let

$\mathcal{T} = G_y^{-1/2}(I - Q_x Q_x^T - Q_y Q_x^T) G_x^{1/2}$.   (7)

While it is clear that $\mathcal{T}$ satisfies the three properties of vector transport mentioned in Section 2.1, proving that $\mathcal{T}$ is (locally) smooth remains an open question. Moreover, the column space of $(I - N_x(N_x^T N_x)^{-1} N_x^T) N_y$ gets more sensitive to numerical errors as the distance between $x$ and $y$ decreases. Nevertheless, there is evidence that $\mathcal{T}$ is indeed smooth, and we have observed that using vector transport by rigging in Algorithm 1 is a worthy alternative in large-scale low-codimension problems.
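As an illustration of the rigging construction, the following NumPy sketch computes (7) as reconstructed above, with the Gram-Schmidt step replaced by a QR factorization (whose column-sign convention may differ) and with the induced metric ($G_x = G_y = I$) as the default case; the general-metric branch is a direct transcription of (7) and should be treated with the same caution as the formula itself. The function name and signature are illustrative, not part of the paper.

```python
import numpy as np

def rigging_transport(N_x, N_y, G_sqrt_x=None, G_sqrt_y=None):
    """Sketch of the vector transport by rigging, eq. (7).

    N_x, N_y : m x (m-d) bases of the normal spaces at x and y
               (assumed such that the projected matrices below have full
               column rank, i.e., x and y in "general position").
    G_sqrt_x, G_sqrt_y : optional matrix square roots of G_x and G_y;
               when omitted, the induced metric (G = I) is assumed.
    Returns an m x m matrix whose restriction to T_x M is the transport.
    """
    m = N_x.shape[0]
    P_x = np.eye(m) - N_x @ np.linalg.solve(N_x.T @ N_x, N_x.T)  # projector onto T_x M
    P_y = np.eye(m) - N_y @ np.linalg.solve(N_y.T @ N_y, N_y.T)  # projector onto T_y M
    Q_x, _ = np.linalg.qr(P_x @ N_y)   # orthonormal basis of T_x M minus (T_x M cap T_y M)
    Q_y, _ = np.linalg.qr(P_y @ N_x)   # orthonormal basis of T_y M minus (T_x M cap T_y M)
    U = np.eye(m) - Q_x @ Q_x.T - Q_y @ Q_x.T
    if G_sqrt_x is None:
        return U
    return np.linalg.solve(G_sqrt_y, U) @ G_sqrt_x   # G_y^{-1/2} U G_x^{1/2}
```

For the unit sphere, a globally smooth field of normal bases is simply $N_z = z$ (a single column), so the sketch can be called with `N_x = x.reshape(-1, 1)` and `N_y = y.reshape(-1, 1)`.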

3 Convergence analysis of RTR-SR1

3.1 Notation and standing assumptions

Throughout the convergence analysis, unless otherwise specified, we let $\{x_k\}$, $\{B_k\}$, $\{\tilde{B}_k\}$, $\{s_k\}$, $\{y_k\}$, and $\{\Delta_k\}$ be infinite sequences generated by Algorithm 1, and we make use of the notation introduced in that algorithm. We let $\Omega$ denote the sublevel set of $x_0$, i.e., $\Omega = \{x \in \mathcal{M} : f(x) \le f(x_0)\}$.

The global and local convergence analyses each make standing assumptions at the beginning of their respective sections. The numbered assumptions introduced below are not standing assumptions and will be invoked specifically whenever needed. Note that, apart from Assumption 3.6, all the numbered assumptions are Riemannian generalizations of assumptions made in [BKS96] for the analysis of the Euclidean SR1 trust-region method.

3.2 Global convergence analysis

In some results, we will assume for the retraction $R$ that there exist $\mu > 0$ and $\delta_\mu > 0$ such that

$\|\xi\| \ge \mu\, \mathrm{dist}(x, R_x(\xi))$ for all $x \in \Omega$ and all $\xi \in T_x\mathcal{M}$ with $\|\xi\| \le \delta_\mu$.   (8)

This corresponds to [AMS08, (7.25)] restricted to the sublevel set $\Omega$. Such a condition is instrumental in the global convergence analysis of Riemannian trust-region schemes. Note that, in view of [RW12, Lemma 6], condition (8) can be shown to hold globally under the condition that $R$ has equicontinuous derivatives.

The next assumption corresponds to [BKS96, (A3)].

Assumption 3.1. The sequence of linear operators $\{B_k\}$ is bounded by a constant $M$ such that $\|B_k\| \le M$ for all $k$.

We will often require that the trust-region subproblem (2) is solved accurately enough that, for some positive constants $\sigma_1$ and $\sigma_2$,

$m_k(0) - m_k(s_k) \ge \sigma_1 \|\mathrm{grad}\,f(x_k)\| \min\{\Delta_k, \sigma_2 \|\mathrm{grad}\,f(x_k)\|\}$,   (9)

and that

$B_k s_k = -\mathrm{grad}\,f(x_k) + \delta_k$ with $\|\delta_k\| \le \|\mathrm{grad}\,f(x_k)\|^{1+\theta}$, whenever $\|s_k\| \le 0.8\,\Delta_k$,   (10)

where $\theta > 0$ is a constant. These conditions are generalizations of [BKS96, (2.3)-(2.4)]. Observe that, even if we restrict to the Euclidean case, condition (10) remains weaker than condition [BKS96, (2.4)]. The purpose of introducing $\delta_k$ in (10) is to encompass stopping criteria such as [AMS08, (7.10)] that do not require the computation of an exact solution of the trust-region subproblem.
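As noted below, these conditions hold when the subproblem is solved with a truncated CG inner iteration. For concreteness, the following is a minimal NumPy sketch of such a Steihaug-Toint-type inner solver in a single tangent space, assuming the ambient representation of Section 2.2 and the induced metric; `Hv` applies the current Hessian approximation $B_k$, and the residual-based stopping test mirrors the inexactness that the term $\delta_k$ in (10) is meant to accommodate. This sketch is illustrative and not the implementation analyzed in the paper.

```python
import numpy as np

def truncated_cg(grad, Hv, Delta, theta=1.0, kappa=0.1, max_iter=100):
    """Steihaug-Toint truncated CG sketch for the subproblem (2).

    grad : ambient gradient vector grad f(x_k), tangent to M at x_k
    Hv   : callable applying B_k to a tangent vector (tangent in, tangent out)
    Delta: trust-region radius
    Stops on the boundary, on negative curvature, or once the residual
    satisfies ||r|| <= ||r0|| * min(kappa, ||r0||^theta).
    """
    s = np.zeros_like(grad)
    r = grad.copy()                    # residual of B s + grad = 0 at s = 0
    d = -r
    r0_norm = np.linalg.norm(r)
    for _ in range(max_iter):
        Hd = Hv(d)
        dHd = d @ Hd
        if dHd <= 0:                   # negative curvature: go to the boundary
            return s + _to_boundary(s, d, Delta) * d
        alpha = (r @ r) / dHd
        if np.linalg.norm(s + alpha * d) >= Delta:   # step would leave the region
            return s + _to_boundary(s, d, Delta) * d
        s = s + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) <= r0_norm * min(kappa, r0_norm**theta):
            return s                   # residual-based stopping test
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return s

def _to_boundary(s, d, Delta):
    """Positive tau such that ||s + tau d|| = Delta."""
    a, b, c = d @ d, 2 * (s @ d), s @ s - Delta**2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
```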

We point out in particular that (9) and (10) hold if the approximate solution of the trust-region subproblem (2) is obtained from the truncated CG method, described in [AMS08, Sec. 7.3.2] in the Riemannian context.

We can now state and prove the main global convergence results. Point (iii) generalizes [BKS96, Theorem 2.1] while points (i) and (ii) are based on [AMS08, Sec. 7.4.1].

Theorem 3.1 (convergence). (i) If $f \in C^2$ is bounded below on the sublevel set $\Omega$, Assumption 3.1 holds, condition (9) holds, and (8) is satisfied, then $\lim_{k\to\infty} \|\mathrm{grad}\,f(x_k)\| = 0$. (ii) If $f \in C^2$, $\mathcal{M}$ is compact, Assumption 3.1 holds, and (9) holds, then $\lim_{k\to\infty} \|\mathrm{grad}\,f(x_k)\| = 0$, $\{x_k\}$ has at least one limit point, and every limit point of $\{x_k\}$ is a stationary point of $f$. (iii) If $f \in C^2$, the sublevel set $\Omega$ is compact, $f$ has a unique stationary point $x^*$ in $\Omega$, Assumption 3.1 holds, condition (9) holds, and (8) is satisfied, then $\{x_k\}$ converges to $x^*$.

Proof. (i) Observe that the proof of [AMS08, Theorem 7.4.4] still holds when condition [AMS08, (7.25)] is weakened to its restriction (8) to $\Omega$. Indeed, since the trust-region method is a descent iteration, it follows that all iterates are in $\Omega$. The assumptions thus allow us to conclude, by [AMS08, Theorem 7.4.4], that $\lim_{k\to\infty} \|\mathrm{grad}\,f(x_k)\| = 0$. (ii) It follows from [AMS08, Proposition 7.4.5] and [AMS08, Corollary 7.4.6] that all the assumptions of [AMS08, Theorem 7.4.4] hold. Hence $\lim_{k\to\infty} \|\mathrm{grad}\,f(x_k)\| = 0$, and every limit point is thus a stationary point of $f$. Since $\mathcal{M}$ is compact, $\{x_k\}$ is guaranteed to have at least one limit point. (iii) Again by [AMS08, Theorem 7.4.4], we get that $\lim_{k\to\infty} \|\mathrm{grad}\,f(x_k)\| = 0$. Since $\{x_k\}$ belongs to the compact set $\Omega$ and cannot have limit points other than $x^*$, it follows that $\{x_k\}$ converges to $x^*$.

3.3 More notation and standing assumptions

For the purpose of conducting a local convergence analysis, we now assume that $\{x_k\}$ converges to a point $x^*$. Moreover, we assume throughout that $f \in C^2$. We let $U_{\mathrm{trn}}$ be a totally retractive neighborhood of $x^*$, a concept inspired from the notion of totally normal neighborhood (see [dC92, Sec. 3.3]). By this, we mean that there is $\delta_{\mathrm{trn}} > 0$ such that, for each $y \in U_{\mathrm{trn}}$, we have that $U_{\mathrm{trn}} \subseteq R_y(B(0_y, \delta_{\mathrm{trn}}))$ and $R_y(\cdot)$ is a diffeomorphism on $B(0_y, \delta_{\mathrm{trn}})$, where $B(0_y, \delta_{\mathrm{trn}})$ denotes the ball of radius $\delta_{\mathrm{trn}}$ in $T_y\mathcal{M}$ centered at the origin $0_y$. The existence of a totally retractive neighborhood can be shown along the lines of [dC92, Theorem 3.3.7]. We assume without loss of generality that $\{x_k\} \subseteq U_{\mathrm{trn}}$. Whenever we consider an inverse retraction $R_x^{-1}(y)$, we implicitly assume that $x, y \in U_{\mathrm{trn}}$.

3.4 Local convergence analysis

The purpose of this section is to obtain a superlinear convergence result for Algorithm 1, stated in the theorem that concludes this section. The analysis can be viewed as a Riemannian generalization of the local analysis in [BKS96, Sec. 2]. As we proceed, we will point out the main hurdles that had to be overcome in the generalization. The analysis makes use of several preparation lemmas, independent of Algorithm 1, that are of potential interest in the broader context of Riemannian optimization. These preparation lemmas become trivial or well known in the Euclidean context.

The next assumption corresponds to a part of [BKS96, (A1)].

Assumption 3.2. The point $x^*$ is a nondegenerate local minimizer of $f$. In other words, $\mathrm{grad}\,f(x^*) = 0$ and $\mathrm{Hess}\,f(x^*)$ is positive definite.

The next assumption generalizes the assumption, contained in [BKS96, (A1)], that the Hessian of $f$ is Lipschitz continuous near $x^*$. (Recall that $\mathcal{T}_S$ is the vector transport invoked in Algorithm 1.) Note that the assumption holds if $f \in C^3$; see Lemma 3.5.

Assumption 3.3. There exists a constant $c \ge 0$ such that for all $x, y \in U_{\mathrm{trn}}$,

$\|\mathrm{Hess}\,f(y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,f(x)\, \mathcal{T}_{S_\eta}^{-1}\| \le c\, \mathrm{dist}(x, y)$,

where $\mathrm{Hess}\,f(x)$ is the Riemannian Hessian of $f$ at $x$ (see, e.g., [AMS08, Sec. 5.5]), $\eta = R_x^{-1}(y)$, and $\|\cdot\|$ is also used to denote the operator norm induced by the Riemannian norm (5).

The next assumption is introduced to handle the Riemannian case; in the classical Euclidean setting, Assumption 3.4 follows from Assumption 3.3. Assumption 3.4 is mild since it holds if $f \in C^3$, as shown in Lemma 3.5.

Assumption 3.4. There exists a constant $c_0 \ge 0$ such that for all $x, y \in U_{\mathrm{trn}}$, all $\xi_x \in T_x\mathcal{M}$ with $R_x(\xi_x) \in U_{\mathrm{trn}}$, and all $\xi_y \in T_y\mathcal{M}$ with $R_y(\xi_y) \in U_{\mathrm{trn}}$, the inequality

$\|\mathrm{Hess}\,\hat{f}_y(\xi_y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,\hat{f}_x(\xi_x)\, \mathcal{T}_{S_\eta}^{-1}\| \le c_0 (\|\xi_y\| + \|\xi_x\| + \|\eta\|)$

holds, where $\eta = R_x^{-1}(y)$, $\hat{f}_x = f \circ R_x$, and $\hat{f}_y = f \circ R_y$.

The next assumption corresponds to [BKS96, (A2)]. It implies that no updates of $B_k$ are skipped. In the Euclidean case, Khalfan et al. [KBS93] show that this is usually the case in practice.

Assumption 3.5. The inequality

$|g(s_k, y_k - B_k s_k)| \ge \nu \|s_k\| \|y_k - B_k s_k\|$

holds.

The next assumption is introduced to handle the Riemannian case. It states that the iterates eventually continuously stay in the totally retractive neighborhood $U_{\mathrm{trn}}$ (the terminology is borrowed from [ATV13, Definition 2.8]). The assumption is needed, in particular, for Lemma 3.6. Note that, whereas in the Euclidean setting the assumption follows from the standing assumption that $\{x_k\}$ converges to $x^*$, this is no longer the case on some Riemannian manifolds, where $\{x_k\}$ may converge to $x^*$ while the connecting segments $\{R_{x_k}(t s_k) : t \in [0,1]\}$ do not. Assumption 3.6 is thus invoked to ensure that we are in a position to carry out a local convergence analysis.

Assumption 3.6. There exists $N$ such that, for all $k \ge N$ and all $t \in [0,1]$, $R_{x_k}(t s_k) \in U_{\mathrm{trn}}$.

The next lemma is proved in [GQA12, Lemma 14.1].

Lemma 3.2. Let $\mathcal{M}$ be a Riemannian manifold, let $\mathcal{U}$ be a compact coordinate neighborhood in $\mathcal{M}$, and let the hat denote coordinate expressions. Then there are $c_2 > c_1 > 0$ such that, for all $x, y \in \mathcal{U}$, we have

$c_1 \|\hat{x} - \hat{y}\|_2 \le \mathrm{dist}(x, y) \le c_2 \|\hat{x} - \hat{y}\|_2$,

where $\|\cdot\|_2$ denotes the Euclidean norm, i.e., $\|\hat{x}\|_2 = \sqrt{\hat{x}^T \hat{x}}$.

Lemma 3.3. Let $\mathcal{M}$ be a Riemannian manifold endowed with a retraction $R$ and let $\bar{x} \in \mathcal{M}$. Then there exist $a_0 > 0$, $a_1 > 0$, and $\delta_{a_0,a_1} > 0$ such that for all $x$ in a sufficiently small neighborhood of $\bar{x}$ and all $\xi, \eta \in T_x\mathcal{M}$ with $\|\xi\| \le \delta_{a_0,a_1}$ and $\|\eta\| \le \delta_{a_0,a_1}$, the inequalities

$a_0 \|\xi - \eta\| \le \mathrm{dist}(R_x(\eta), R_x(\xi)) \le a_1 \|\xi - \eta\|$

hold.

Proof. Since $R$ is smooth, we can choose a neighborhood small enough such that $R$ satisfies the condition of [RW12, Lemma 6], and the result follows from that lemma.

The following lemma follows from Lemma 3.3 by setting $\eta = 0$. We state it separately for convenience as we will frequently invoke it in the analysis.

Lemma 3.4. Let $\mathcal{M}$ be a Riemannian manifold endowed with retraction $R$ and let $\bar{x} \in \mathcal{M}$. Then there exist $a_0 > 0$, $a_1 > 0$, and $\delta_{a_0,a_1} > 0$ such that for all $x$ in a sufficiently small neighborhood of $\bar{x}$ and all $\xi \in T_x\mathcal{M}$ with $\|\xi\| \le \delta_{a_0,a_1}$, the inequalities

$a_0 \|\xi\| \le \mathrm{dist}(x, R_x(\xi)) \le a_1 \|\xi\|$

hold.

Lemma 3.5. If $f \in C^3$, then Assumptions 3.3 and 3.4 hold.

Proof. First, we prove that Assumption 3.3 holds. Define a function $h \colon \mathcal{M} \times \mathcal{M} \times T\mathcal{M} \to T\mathcal{M}$, $(x, y, \xi_y) \mapsto \mathcal{T}_{S_\eta} \mathrm{Hess}\,f(x)\, \mathcal{T}_{S_\eta}^{-1} \xi_y$, where $\eta = R_x^{-1}(y)$. Since $f \in C^3$, we know that $h(x, y, \xi_y)$ is $C^1$. Then there exist constants $b_0$, $b_1$ and $b_2$ such that, for all $x, y \in U_{\mathrm{trn}}$ and all $\xi_y \in T_y\mathcal{M}$ with $\|\xi_y\| = 1$,

$\|(\mathrm{Hess}\,f(y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,f(x)\, \mathcal{T}_{S_\eta}^{-1}) \xi_y\| = \|h(y, y, \xi_y) - h(x, y, \xi_y)\| \le b_0\, \mathrm{dist}((y, y, \xi_y), (x, y, \xi_y)) \le b_1 \|(\hat{y}, \hat{y}, \hat{\xi}_y) - (\hat{x}, \hat{y}, \hat{\xi}_y)\|_2$   (by Lemma 3.2)
$= b_1 \|\hat{y} - \hat{x}\|_2 \le b_2\, \mathrm{dist}(y, x)$.   (by Lemma 3.2)

Given any linear operator $A$ on $T_y\mathcal{M}$, $\|A\|$ by definition is $\sup_{\|\xi\|=1} \|A\xi\|$. Note that $\{\xi : \|\xi\| = 1\}$ is a compact set. Hence, there exists $\xi$ with $\|\xi\| = 1$ such that $\|A\| = \|A\xi\|$. Therefore, we can choose $\xi_y$ with $\|\xi_y\| = 1$ such that

$\|(\mathrm{Hess}\,f(y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,f(x)\, \mathcal{T}_{S_\eta}^{-1}) \xi_y\| = \|\mathrm{Hess}\,f(y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,f(x)\, \mathcal{T}_{S_\eta}^{-1}\|$.

We obtain $\|\mathrm{Hess}\,f(y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,f(x)\, \mathcal{T}_{S_\eta}^{-1}\| \le b_2\, \mathrm{dist}(y, x)$.

To prove Assumption 3.4, we redefine $h$ as $h(y, x, \xi_x) = \mathcal{T}_{S_\eta} \mathrm{Hess}\,\hat{f}_x(\xi_x)\, \mathcal{T}_{S_\eta}^{-1}$. If we use orthonormal vector fields to obtain the coordinate expression of $h$, denoted by $\hat{h}$, then the manifold norm and the Euclidean norm of coordinate expressions are the same and we have

$\|\mathrm{Hess}\,\hat{f}_y(\xi_y) - \mathcal{T}_{S_\eta} \mathrm{Hess}\,\hat{f}_x(\xi_x)\, \mathcal{T}_{S_\eta}^{-1}\| = \|\mathrm{Hess}\,\hat{f}_y(\hat{\xi}_y) - \hat{\mathcal{T}}_{S_\eta} \mathrm{Hess}\,\hat{f}_x(\hat{\xi}_x)\, \hat{\mathcal{T}}_{S_\eta}^{-1}\|_2$.   (11)

Since $f \in C^3$, we know that $\hat{h}$ is also in $C^1$. Hence there exists a constant $b_3$ such that

$\|\hat{h}(\hat{y}, \hat{y}, \hat{\xi}_y) - \hat{h}(\hat{y}, \hat{x}, \hat{\xi}_x)\|_2 \le b_3 \|(\hat{y}, \hat{y}, \hat{\xi}_y) - (\hat{y}, \hat{x}, \hat{\xi}_x)\|_2$.

Therefore

$\|\mathrm{Hess}\,\hat{f}_y(\hat{\xi}_y) - \hat{\mathcal{T}}_{S_\eta} \mathrm{Hess}\,\hat{f}_x(\hat{\xi}_x)\, \hat{\mathcal{T}}_{S_\eta}^{-1}\|_2 = \|\hat{h}(\hat{y}, \hat{y}, \hat{\xi}_y) - \hat{h}(\hat{y}, \hat{x}, \hat{\xi}_x)\|_2 \le b_3 \|(\hat{y}, \hat{y}, \hat{\xi}_y) - (\hat{y}, \hat{x}, \hat{\xi}_x)\|_2$
$\le b_4 (\|\hat{y} - \hat{x}\|_2 + \|\hat{\xi}_y\|_2 + \|\hat{\xi}_x\|_2) \le b_5 (\mathrm{dist}(x, y) + \|\hat{\xi}_y\|_2 + \|\hat{\xi}_x\|_2)$   (by Lemma 3.2)
$\le b_6 (\|\eta\| + \|\xi_y\| + \|\xi_x\|)$.   (by Lemma 3.4)

This and (11) give us Assumption 3.4.

The next lemma generalizes [BKS96, Lemma 2.2]. The key difference with the Euclidean case is the following: in the Euclidean case, when $s_k$ is accepted, we simply have $s_k = x_{k+1} - x_k$, while in the Riemannian generalization, we invoke Assumption 3.6 and Lemma 3.4 to deduce that $\|s_k\| \le \frac{1}{a_0} \mathrm{dist}(x_{k+1}, x_k)$. Note that Assumption 3.6 cannot be removed. To see this, consider for example the unit sphere with the exponential retraction, where we can have $x_k = x_{k+1}$ with $\|s_k\| = 2\pi$.

Lemma 3.6. Suppose Assumption 3.6 holds. Then either

$\Delta_k \to 0$   (12)

or there exist $K > 0$ and $\bar{\Delta} > 0$ such that

$\Delta_k = \bar{\Delta}$ for all $k > K$.   (13)

In either case $\|s_k\| \to 0$.

Proof. Let $\Delta^* = \liminf_{k\to\infty} \Delta_k$ and suppose first that $\Delta^* > 0$. From line 11 of Algorithm 1, if $\Delta_k$ is increased, then $\|s_k\| \ge 0.8\,\Delta_k$ and $x_{k+1} = R_{x_k}(s_k)$, which implies by Lemma 3.4 and Assumption 3.6 that $\mathrm{dist}(x_k, x_{k+1}) \ge 0.8\, a_0 \Delta_k$. The latter inequality cannot hold for infinitely many values of $k$ since $x_k \to x^*$ and $\liminf_{k\to\infty} \Delta_k > 0$. Hence, there exists $K_0$ such that $\Delta_k$ is not increased for any $k \ge K_0$. Since $\Delta^* > 0$, this implies that $\Delta_k \ge \Delta^*$ for all $k \ge K_0$. In view of the trust-region update mechanism in Algorithm 1 and since $\Delta^* = \liminf_{k\to\infty} \Delta_k$, we also know that, for some $K_1 > K_0$, $\Delta_{K_1} < \Delta^*/\tau_1$. If the trust-region radius were to be decreased at some $k \ge K_1$, we would have $\Delta_{k+1} < \Delta^*$, which we have ruled out. Since neither increase nor decrease can occur, we must have $\Delta_k = \Delta_{K_1} =: \bar{\Delta}$ for all $k \ge K_1$.

Suppose now that $\Delta^* = 0$. Since $x_k \to x^*$, for every $\epsilon > 0$ there exists $K_\epsilon \ge 0$ such that $\mathrm{dist}(x_{k+1}, x_k) < \epsilon$ for all $k \ge K_\epsilon$. Since $\liminf_{k\to\infty} \Delta_k = 0$, there exists $j \ge K_\epsilon$ such that $\Delta_j < \epsilon$.

But since $\Delta_k$ is increased only if $\|s_k\| \ge 0.8\,\Delta_k$, which implies $\Delta_k \le \frac{1}{0.8\, a_0} \mathrm{dist}(x_{k+1}, x_k) < \frac{\epsilon}{0.8\, a_0}$, and the increase factor is $\tau_2$, we have that $\Delta_k < \frac{\tau_2 \epsilon}{0.8\, a_0}$ for all $k \ge j$. Therefore (12) follows.

To show that $\|s_k\| \to 0$, note that if (12) is true, then clearly $\|s_k\| \to 0$. If (13) is true, then for all $k > K$, the step $s_k$ is accepted (a rejected step would trigger a decrease of $\Delta_k$, contradicting (13)) and $\|s_k\| \le \frac{1}{a_0} \mathrm{dist}(x_{k+1}, x_k)$ (by Lemma 3.4), hence $\|s_k\| \to 0$ since $\{x_k\}$ converges.

Lemma 3.7. Let $\mathcal{M}$ be a Riemannian manifold endowed with two vector transports $\mathcal{T}_1$ and $\mathcal{T}_2$, and let $\bar{x} \in \mathcal{M}$. Then there exist a constant $a_4$ and a neighborhood $\mathcal{U}$ of $\bar{x}$ such that for all $x, y \in \mathcal{U}$ and all $\xi \in T_y\mathcal{M}$,

$\|\mathcal{T}_{1_\eta}^{-1} \xi - \mathcal{T}_{2_\eta}^{-1} \xi\| \le a_4 \|\xi\| \|\eta\|$,

where $\eta = R_x^{-1}(y)$.

Proof. We use the hat to denote coordinate expressions. Let $T_1(\hat{x}, \hat{\eta})$ and $T_2(\hat{x}, \hat{\eta})$ denote the coordinate expressions of $\mathcal{T}_{1_\eta}^{-1}$ and $\mathcal{T}_{2_\eta}^{-1}$, respectively. Then

$\|\mathcal{T}_{1_\eta}^{-1} \xi - \mathcal{T}_{2_\eta}^{-1} \xi\| \le b_0 \|(T_1(\hat{x}, \hat{\eta}) - T_2(\hat{x}, \hat{\eta}))\hat{\xi}\|_2 \le b_0 \|\hat{\xi}\|_2 \|T_1(\hat{x}, \hat{\eta}) - T_2(\hat{x}, \hat{\eta})\|_2$
$\le b_1 \|\hat{\xi}\|_2 \|\hat{\eta}\|_2$   (since $T_1(\hat{x}, 0) = T_2(\hat{x}, 0)$ and both $T_1$ and $T_2$ are smooth)
$\le b_2 \|\xi\| \|\eta\|$

for some constants $b_0$, $b_1$, and $b_2$.

The next lemma is proved in [GQA12, Lemma 14.5].

Lemma 3.8. Let $F$ be a $C^1$ vector field on a Riemannian manifold $\mathcal{M}$ and let $\bar{x} \in \mathcal{M}$ be a nondegenerate zero of $F$. Then there exist a neighborhood $\mathcal{U}$ of $\bar{x}$ and $a_5, a_6 > 0$ such that for all $x \in \mathcal{U}$,

$a_5\, \mathrm{dist}(x, \bar{x}) \le \|F(x)\| \le a_6\, \mathrm{dist}(x, \bar{x})$.

In the Euclidean case, the next lemma holds with $\tilde{a}_7 = 0$ and reduces to the Fundamental Theorem of Calculus.

Lemma 3.9. Let $F$ be a $C^1$ vector field on a Riemannian manifold $\mathcal{M}$, let $R$ be a retraction on $\mathcal{M}$, and let $\bar{x} \in \mathcal{M}$. Then there exist a neighborhood $\mathcal{U}$ of $\bar{x}$ and a constant $\tilde{a}_7$ such that for all $x, y \in \mathcal{U}$,

$\left\| P_\gamma^{0 \leftarrow 1} F(y) - F(x) - \left[ \int_0^1 P_\gamma^{0 \leftarrow t}\, \mathrm{D}F(\gamma(t))\, P_\gamma^{t \leftarrow 0}\, dt \right] \eta \right\| \le \tilde{a}_7 \|\eta\|^2$,

where $\eta = R_x^{-1}(y)$ and $P_\gamma$ is the parallel translation along the curve $\gamma$ given by $\gamma(t) = R_x(t\eta)$.

Proof. Define $G \colon [0,1] \to T_x\mathcal{M} \colon t \mapsto G(t) = P_\gamma^{0 \leftarrow t} F(\gamma(t))$. Observe that $G(0) = F(x)$ and $G(1) = P_\gamma^{0 \leftarrow 1} F(y)$. We have

$G'(t) = \frac{d}{d\epsilon} G(t+\epsilon)\big|_{\epsilon=0} = P_\gamma^{0 \leftarrow t} \frac{d}{d\epsilon} P_\gamma^{t \leftarrow t+\epsilon} F(\gamma(t+\epsilon))\big|_{\epsilon=0} = P_\gamma^{0 \leftarrow t}\, \mathrm{D}F(\gamma(t))\Big[\tfrac{d}{d\epsilon}\gamma(t+\epsilon)\big|_{\epsilon=0}\Big] = P_\gamma^{0 \leftarrow t}\, \mathrm{D}F(\gamma(t))[\mathcal{T}_{R(t\eta)}\eta]$,

where we have used an expression of the covariant derivative $\mathrm{D}$ in terms of the parallel translation $P$ (see, e.g., [Cha06, Theorem I.2.1]), and where $\mathcal{T}_{R(t\eta)}\eta = \frac{d}{dt}R(t\eta)$. Since $G(1) - G(0) = \int_0^1 G'(t)\, dt$, we obtain

$\left\| P_\gamma^{0\leftarrow 1} F(y) - F(x) - \int_0^1 P_\gamma^{0\leftarrow t}\, \mathrm{D}F(\gamma(t))\, P_\gamma^{t\leftarrow 0}\, \eta\, dt \right\|
= \left\| \int_0^1 P_\gamma^{0\leftarrow t}\, \mathrm{D}F(\gamma(t)) \big( \mathcal{T}_{R(t\eta)}\eta - P_\gamma^{t\leftarrow 0}\eta \big)\, dt \right\|
\le \int_0^1 \|\mathrm{D}F(\gamma(t))\| \, \big\| P_\gamma^{0\leftarrow t} \mathcal{T}_{R(t\eta)}\eta - \mathcal{T}_{R(t\eta)}^{-1}\mathcal{T}_{R(t\eta)}\eta \big\|\, dt
\le b_0 \|\eta\|^2$,   (by Lemma 3.7)

where $b_0$ is some constant.

Lemma 3.10. Suppose Assumptions 3.2 and 3.3 hold. Then there exist a neighborhood $\mathcal{U}$ of $x^*$ and a constant $a_7$ such that for all $x_1, \bar{x}_1, x_2, \bar{x}_2 \in \mathcal{U}$, we have

$|g(\mathcal{T}_{S_\zeta} \xi_1, y_2) - g(\mathcal{T}_{S_\zeta} y_1, \xi_2)| \le a_7 \max\{\mathrm{dist}(x_1, x^*), \mathrm{dist}(x_2, x^*), \mathrm{dist}(\bar{x}_1, x^*), \mathrm{dist}(\bar{x}_2, x^*)\}\, \|\xi_1\| \|\xi_2\|$,

where $\zeta = R_{x_1}^{-1}(x_2)$, $\xi_1 = R_{x_1}^{-1}(\bar{x}_1)$, $\xi_2 = R_{x_2}^{-1}(\bar{x}_2)$, $y_1 = \mathcal{T}_{S_{\xi_1}}^{-1} \mathrm{grad}\,f(\bar{x}_1) - \mathrm{grad}\,f(x_1)$, and $y_2 = \mathcal{T}_{S_{\xi_2}}^{-1} \mathrm{grad}\,f(\bar{x}_2) - \mathrm{grad}\,f(x_2)$.

Proof. Define $\bar{y}_1 = P_{\gamma_1}^{0\leftarrow 1} \mathrm{grad}\,f(\bar{x}_1) - \mathrm{grad}\,f(x_1)$ and $\bar{y}_2 = P_{\gamma_2}^{0\leftarrow 1} \mathrm{grad}\,f(\bar{x}_2) - \mathrm{grad}\,f(x_2)$, where $P$ is the parallel translation, $\gamma_1(t) = R_{x_1}(t\xi_1)$, and $\gamma_2(t) = R_{x_2}(t\xi_2)$. From Lemma 3.9, we have

$\|\bar{y}_1 - \bar{H}_1(x_1, \bar{x}_1)\xi_1\| \le b_0 \|\xi_1\|^2$ and $\|\bar{y}_2 - \bar{H}_2(x_2, \bar{x}_2)\xi_2\| \le b_0 \|\xi_2\|^2$,   (14)

where $\bar{H}_1(x_1, \bar{x}_1) = \int_0^1 P_{\gamma_1}^{0\leftarrow t} \mathrm{Hess}\,f(\gamma_1(t)) P_{\gamma_1}^{t\leftarrow 0}\, dt$, $\bar{H}_2(x_2, \bar{x}_2) = \int_0^1 P_{\gamma_2}^{0\leftarrow t} \mathrm{Hess}\,f(\gamma_2(t)) P_{\gamma_2}^{t\leftarrow 0}\, dt$, and $b_0$ is a constant. It follows that

$|g(\mathcal{T}_{S_\zeta}\xi_1, y_2) - g(\mathcal{T}_{S_\zeta} y_1, \xi_2)|$
$\le |g(\mathcal{T}_{S_\zeta}\xi_1, \bar{y}_2) - g(\mathcal{T}_{S_\zeta}\bar{y}_1, \xi_2)| + |g(\mathcal{T}_{S_\zeta}\xi_1, y_2 - \bar{y}_2)| + |g(\mathcal{T}_{S_\zeta}(y_1 - \bar{y}_1), \xi_2)|$
$\le |g(\mathcal{T}_{S_\zeta}\xi_1, \bar{H}_2(x_2,\bar{x}_2)\xi_2) - g(\mathcal{T}_{S_\zeta}\bar{H}_1(x_1,\bar{x}_1)\xi_1, \xi_2)| + b_1(\|\xi_1\| + \|\xi_2\|)\|\xi_1\|\|\xi_2\|$   (by (14))
$\quad + |g(\mathcal{T}_{S_\zeta}\xi_1, \mathcal{T}_{S_{\xi_2}}^{-1}\mathrm{grad}\,f(\bar{x}_2) - P_{\gamma_2}^{0\leftarrow 1}\mathrm{grad}\,f(\bar{x}_2))| + |g(\mathcal{T}_{S_\zeta}(\mathcal{T}_{S_{\xi_1}}^{-1}\mathrm{grad}\,f(\bar{x}_1) - P_{\gamma_1}^{0\leftarrow 1}\mathrm{grad}\,f(\bar{x}_1)), \xi_2)|$
$\le |g(\mathcal{T}_{S_\zeta}\xi_1, \bar{H}_2(x_2,\bar{x}_2)\xi_2) - g(\mathcal{T}_{S_\zeta}\bar{H}_1(x_1,\bar{x}_1)\xi_1, \xi_2)| + b_1(\|\xi_1\| + \|\xi_2\|)\|\xi_1\|\|\xi_2\| + b_2\|\xi_1\|\|\xi_2\|\|\mathrm{grad}\,f(\bar{x}_2)\| + b_3\|\xi_1\|\|\xi_2\|\|\mathrm{grad}\,f(\bar{x}_1)\|$   (by Lemma 3.7)
$\le |g(\bar{H}_2(x_2,\bar{x}_2)\mathcal{T}_{S_\zeta}\xi_1, \xi_2) - g(\mathcal{T}_{S_\zeta}\bar{H}_1(x_1,\bar{x}_1)\xi_1, \xi_2)|$   (the average Hessian is also self-adjoint)
$\quad + b_4\|\xi_1\|\|\xi_2\|(\mathrm{dist}(x_1,\bar{x}_1) + \mathrm{dist}(x_2,\bar{x}_2) + \mathrm{dist}(\bar{x}_2,x^*) + \mathrm{dist}(\bar{x}_1,x^*))$   (by Lemmas 3.4 and 3.8)
$\le |g(\bar{H}_2(x_2,\bar{x}_2)\mathcal{T}_{S_\zeta}\xi_1, \xi_2) - g(\mathcal{T}_{S_\zeta}\bar{H}_1(x_1,\bar{x}_1)\xi_1, \xi_2)| + b_5\|\xi_1\|\|\xi_2\|\max\{\mathrm{dist}(x_1,x^*), \mathrm{dist}(x_2,x^*), \mathrm{dist}(\bar{x}_1,x^*), \mathrm{dist}(\bar{x}_2,x^*)\}$,   (by the triangle inequality of the distance)   (15)

where $b_1, b_2, b_3, b_4$ and $b_5$ are some constants. Using the hat to denote coordinate expressions, $T(\hat{x}_1, \hat{x}_2)$ to denote $\mathcal{T}_{S_\zeta}$, and $\hat{G}(\hat{x}_2)$ to denote the matrix expression of the Riemannian metric at $x_2$, we have

$|g(\bar{H}_2(x_2,\bar{x}_2)\mathcal{T}_{S_\zeta}\xi_1, \xi_2) - g(\mathcal{T}_{S_\zeta}\bar{H}_1(x_1,\bar{x}_1)\xi_1, \xi_2)|
= |\hat{\xi}_1^T T(\hat{x}_1,\hat{x}_2)^T \hat{\bar{H}}_2(\hat{x}_2,\hat{\bar{x}}_2)^T \hat{G}(\hat{x}_2) \hat{\xi}_2 - \hat{\xi}_1^T \hat{\bar{H}}_1(\hat{x}_1,\hat{\bar{x}}_1)^T T(\hat{x}_1,\hat{x}_2)^T \hat{G}(\hat{x}_2) \hat{\xi}_2|
\le \|\hat{\xi}_1\|_2\, \|T(\hat{x}_1,\hat{x}_2)^T \hat{\bar{H}}_2(\hat{x}_2,\hat{\bar{x}}_2)^T - \hat{\bar{H}}_1(\hat{x}_1,\hat{\bar{x}}_1)^T T(\hat{x}_1,\hat{x}_2)^T\|_2\, \|\hat{G}(\hat{x}_2)\|_2\, \|\hat{\xi}_2\|_2$,   (16)

where $\|\cdot\|_2$ denotes the Euclidean norm. Define a function $J(\hat{x}_1, \hat{\bar{x}}_1, \hat{x}_2, \hat{\bar{x}}_2) = T(\hat{x}_1,\hat{x}_2)^T \hat{\bar{H}}_2(\hat{x}_2,\hat{\bar{x}}_2)^T - \hat{\bar{H}}_1(\hat{x}_1,\hat{\bar{x}}_1)^T T(\hat{x}_1,\hat{x}_2)^T$. We can see that when $(\hat{x}_1^T, \hat{\bar{x}}_1^T) = (\hat{x}_2^T, \hat{\bar{x}}_2^T)$, $J = 0$. Since, in view of Assumption 3.3, $J$ is Lipschitz continuous, it follows that (16) becomes

$|g(\bar{H}_2 \mathcal{T}_{S_\zeta}\xi_1, \xi_2) - g(\mathcal{T}_{S_\zeta}\bar{H}_1\xi_1, \xi_2)| \le b_6 \|(\hat{x}_1^T, \hat{\bar{x}}_1^T) - (\hat{x}_2^T, \hat{\bar{x}}_2^T)\|_2\, \|\hat{\xi}_1\|_2 \|\hat{\xi}_2\|_2 \le b_7 \|\xi_1\|\|\xi_2\| \max\{\mathrm{dist}(x_1, x_2), \mathrm{dist}(\bar{x}_1, \bar{x}_2)\}$,

where $b_6, b_7$ are some constants. Combining this equation with (15), we obtain

$|g(\mathcal{T}_{S_\zeta}\xi_1, y_2) - g(\mathcal{T}_{S_\zeta}y_1, \xi_2)| \le b_8 \|\xi_1\|\|\xi_2\| \max\{\mathrm{dist}(x_1,x^*), \mathrm{dist}(x_2,x^*), \mathrm{dist}(\bar{x}_1,x^*), \mathrm{dist}(\bar{x}_2,x^*)\}$,

where $b_8$ is a constant.

Lemma 3.11. Let $\mathcal{M}$ be a Riemannian manifold endowed with a vector transport $\mathcal{T}$ with associated retraction $R$, and let $\bar{x} \in \mathcal{M}$. Then there are a neighborhood $\mathcal{U}$ of $\bar{x}$ and a constant $a_8$ such that for all $x, y \in \mathcal{U}$,

$\|\mathrm{id} - \mathcal{T}_\xi^{-1} \mathcal{T}_\eta^{-1} \mathcal{T}_\zeta\| \le a_8 \max(\mathrm{dist}(x, \bar{x}), \mathrm{dist}(y, \bar{x}))$,
$\|\mathrm{id} - \mathcal{T}_\zeta^{-1} \mathcal{T}_\eta \mathcal{T}_\xi\| \le a_8 \max(\mathrm{dist}(x, \bar{x}), \mathrm{dist}(y, \bar{x}))$,

where $\xi = R_{\bar{x}}^{-1}(x)$, $\eta = R_x^{-1}(y)$, $\zeta = R_{\bar{x}}^{-1}(y)$, $\mathrm{id}$ is the identity operator, and $\|\cdot\|$ is the operator norm induced by the Riemannian metric.

Proof. Let the hat denote coordinate expressions, chosen such that the matrix expression of the Riemannian metric at $\bar{x}$ is the identity. Let $L(x, y)$ denote the coordinate expression of $\mathcal{T}_{R_x^{-1}(y)}$. We have $\|\mathrm{id} - \mathcal{T}_\xi^{-1}\mathcal{T}_\eta^{-1}\mathcal{T}_\zeta\| = \|I - L(\bar{x}, x)^{-1} L(x, y)^{-1} L(\bar{x}, y)\|$. Define a function $J(\bar{x}, \xi, \zeta) = I - L(\bar{x}, R_{\bar{x}}(\xi))^{-1} L(R_{\bar{x}}(\xi), R_{\bar{x}}(\zeta))^{-1} L(\bar{x}, R_{\bar{x}}(\zeta))$. Notice that $J$ is a smooth function and $J(\bar{x}, 0_{\bar{x}}, 0_{\bar{x}}) = 0$. So

$\|J(\bar{x}, \xi, \zeta)\| = \|J(\bar{x}, \xi, \zeta) - J(\bar{x}, 0_{\bar{x}}, 0_{\bar{x}})\| = \|\hat{J}(\hat{\bar{x}}, \hat{\xi}, \hat{\zeta}) - \hat{J}(\hat{\bar{x}}, \hat{0}_{\bar{x}}, \hat{0}_{\bar{x}})\|_2 \le b_0(\|\hat{\xi}\|_2 + \|\hat{\zeta}\|_2)$   (smoothness of $J$)
$\le b_1(\mathrm{dist}(x, \bar{x}) + \mathrm{dist}(y, \bar{x}))$   (by Lemma 3.4)
$\le b_2 \max(\mathrm{dist}(x, \bar{x}), \mathrm{dist}(y, \bar{x}))$,

where $b_0, b_1$ and $b_2$ are some constants and $\|\cdot\|_2$ denotes the Euclidean norm. So $\|\mathrm{id} - \mathcal{T}_\xi^{-1}\mathcal{T}_\eta^{-1}\mathcal{T}_\zeta\| \le b_2 \max(\mathrm{dist}(x, \bar{x}), \mathrm{dist}(y, \bar{x}))$. This concludes the first part of the proof. The second part of the result follows from a similar argument.

The next lemma generalizes [CGT91, Lemma 1]. It is instrumental in the proof of Lemma 3.14 below. In the Euclidean setting, it is possible to give an expression for $a_9$ and $a_{10}$ in terms of $c$ of Assumption 3.3 and $\nu$ of Assumption 3.5. In the Riemannian setting, we could not obtain such an expression, in part because the constant $b_2$ that appears in the proof below is no longer zero. However, the existence of $a_9$ and $a_{10}$ can still be shown, under the assumption that $\{x_k\}$ converges to $x^*$, and this is all we need in order to carry on with Lemma 3.14.

Lemma 3.12. Suppose Assumptions 3.1, 3.2, 3.3, and 3.5 hold. Then

$y_j - \tilde{B}_{j+1} s_j = 0$   (17)

for all $j$. Moreover, there exist constants $a_9$ and $a_{10}$ such that for all $j$ and $i \ge j+1$,

$\|y_j - (B_i)_j s_j\| \le a_9 a_{10}^{i-j-2} \epsilon_{i,j} \|s_j\|$,   (18)

where $\epsilon_{i,j} = \max_{j \le k \le i} \mathrm{dist}(x_k, x^*)$ and $(B_i)_j = \mathcal{T}_{S_{\zeta_{j,i}}}^{-1} B_i \mathcal{T}_{S_{\zeta_{j,i}}}$ with $\zeta_{j,i} = R_{x_j}^{-1}(x_i)$.

Proof. From (3), we have

$\tilde{B}_{j+1} s_j = \left( B_j + \frac{(y_j - B_j s_j)(y_j - B_j s_j)^\flat}{g(s_j, y_j - B_j s_j)} \right) s_j = y_j$.

This yields (17), as well as (18) with $i = j+1$. The proof of (18) for $i > j+1$ is by induction. We choose $k \ge j+1$ and assume that (18) holds for all $i = j+1, \ldots, k$. Let $r_k = y_k - B_k s_k$. We have

$|g(r_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j)| = |g(y_k - B_k s_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j)|$
$\le |g(y_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j) - g(s_k, \mathcal{T}_{S_{\zeta_{j,k}}} y_j)| + |g(s_k, \mathcal{T}_{S_{\zeta_{j,k}}}(y_j - (B_k)_j s_j))| + |g(s_k, \mathcal{T}_{S_{\zeta_{j,k}}}((B_k)_j s_j)) - g(B_k s_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j)|$
$\le |g(y_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j) - g(s_k, \mathcal{T}_{S_{\zeta_{j,k}}} y_j)| + \|\mathcal{T}_{S_{\zeta_{j,k}}}(y_j - (B_k)_j s_j)\|\|s_k\| + |g(s_k, B_k \mathcal{T}_{S_{\zeta_{j,k}}} s_j) - g(B_k s_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j)|$
$\le |g(y_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j) - g(s_k, \mathcal{T}_{S_{\zeta_{j,k}}} y_j)| + b_0 a_9 a_{10}^{k-j-2} \epsilon_{k,j} \|s_j\| \|s_k\|$   ($B_k$ self-adjoint and induction assumption)
$\le b_0 a_9 a_{10}^{k-j-2} \epsilon_{k,j} \|s_j\| \|s_k\| + b_1 \epsilon_{k+1,j} \|s_k\| \|s_j\|$,   (by Lemma 3.10)

where $b_0$ and $b_1$ are some constants. It follows that

$\|y_j - (B_{k+1})_j s_j\| = \|y_j - \mathcal{T}_{S_{\zeta_{j,k+1}}}^{-1} B_{k+1} \mathcal{T}_{S_{\zeta_{j,k+1}}} s_j\| = \|y_j - \mathcal{T}_{S_{\zeta_{j,k+1}}}^{-1} \mathcal{T}_{S_{s_k}} \tilde{B}_{k+1} \mathcal{T}_{S_{s_k}}^{-1} \mathcal{T}_{S_{\zeta_{j,k+1}}} s_j\|$
$\le \|y_j - \mathcal{T}_{S_{\zeta_{j,k}}}^{-1} \tilde{B}_{k+1} \mathcal{T}_{S_{\zeta_{j,k}}} s_j\| + \|\mathcal{T}_{S_{\zeta_{j,k}}}^{-1} \tilde{B}_{k+1} \mathcal{T}_{S_{\zeta_{j,k}}} s_j - \mathcal{T}_{S_{\zeta_{j,k+1}}}^{-1} \mathcal{T}_{S_{s_k}} \tilde{B}_{k+1} \mathcal{T}_{S_{s_k}}^{-1} \mathcal{T}_{S_{\zeta_{j,k+1}}} s_j\|$
$\le \left\| y_j - \left( (B_k)_j + \mathcal{T}_{S_{\zeta_{j,k}}}^{-1} \frac{(r_k)(r_k)^\flat}{g(s_k, r_k)} \mathcal{T}_{S_{\zeta_{j,k}}} \right) s_j \right\| + b_2 \epsilon_{k+1,j} \|s_j\|$   (by Lemma 3.11, Assumption 3.1, and (3))
$\le \|y_j - (B_k)_j s_j\| + b_3 \frac{|g(r_k, \mathcal{T}_{S_{\zeta_{j,k}}} s_j)|}{\|s_k\|} + b_2 \epsilon_{k+1,j} \|s_j\|$   (by Assumption 3.5)
$\le a_9 a_{10}^{k-j-2} \epsilon_{k,j} \|s_j\| + b_3 b_0 a_9 a_{10}^{k-j-2} \epsilon_{k,j} \|s_j\| + b_3 b_1 \epsilon_{k+1,j} \|s_j\| + b_2 \epsilon_{k+1,j} \|s_j\|$
$\le (a_9 a_{10}^{k-j-2} + b_3 b_0 a_9 a_{10}^{k-j-2} + b_3 b_1 + b_2)\, \epsilon_{k+1,j} \|s_j\|$,   (note that $\epsilon_{k,j} \le \epsilon_{k+1,j}$)

where $b_2, b_3$ are some constants. Because $b_0, b_1, b_2$ and $b_3$ are independent of $a_9$ and $a_{10}$, we can choose $a_9$ and $a_{10}$ large enough such that

$(a_9 a_{10}^{k-j-2} + b_3 b_0 a_9 a_{10}^{k-j-2} + b_3 b_1 + b_2) \le a_9 a_{10}^{k+1-j-2}$ for all $j$ and $k \ge j+1$.

Take for example $a_9 > 1$ and $a_{10} > 1 + b_3 b_0 + b_3 b_1 + b_2$. Therefore

$\|y_j - (B_{k+1})_j s_j\| \le a_9 a_{10}^{k+1-j-2} \epsilon_{k+1,j} \|s_j\|$.

This concludes the argument by induction.

Lemma 3.13. If Assumption 3.3 holds, then there exist a neighborhood $\mathcal{U}$ of $x^*$ and a constant $a_{11}$ such that for all $x_1, x_2 \in \mathcal{U}$, the inequality

$\|y - \mathcal{T}_{S_{\zeta_1}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_1}}^{-1} s\| \le a_{11} \|s\| \max\{\mathrm{dist}(x_1, x^*), \mathrm{dist}(x_2, x^*)\}$

holds, where $\zeta_1 = R_{x^*}^{-1}(x_1)$, $s = R_{x_1}^{-1}(x_2)$, and $y = \mathcal{T}_{S_s}^{-1} \mathrm{grad}\,f(x_2) - \mathrm{grad}\,f(x_1)$.

Proof. Define $\bar{y} = P_\gamma^{0\leftarrow 1} \mathrm{grad}\,f(x_2) - \mathrm{grad}\,f(x_1)$, where $P$ is the parallel translation along the curve $\gamma$ defined by $\gamma(t) = R_{x_1}(ts)$. From Lemma 3.9, we have

$\|\bar{y} - \bar{H}s\| \le b_0 \|s\|^2$,   (19)

where $\bar{H} = \int_0^1 P_\gamma^{0\leftarrow t} \mathrm{Hess}\,f(\gamma(t)) P_\gamma^{t\leftarrow 0}\, dt$ and $b_0$ is a constant. We then have

$\|y - \mathcal{T}_{S_{\zeta_1}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_1}}^{-1} s\| \le \|y - \bar{y}\| + \|\bar{y} - \bar{H}s\| + \|\bar{H}s - \mathcal{T}_{S_{\zeta_1}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_1}}^{-1} s\|$
$= \|\mathcal{T}_{S_s}^{-1} \mathrm{grad}\,f(x_2) - P_\gamma^{0\leftarrow 1} \mathrm{grad}\,f(x_2)\| + b_0 \|s\|^2 + \|\bar{H} - \mathcal{T}_{S_{\zeta_1}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_1}}^{-1}\| \|s\|$
$\le b_1 \|s\| \max\{\mathrm{dist}(x_1, x^*), \mathrm{dist}(x_2, x^*)\} + b_0 \|s\|^2$   (by Lemma 3.7)
$\quad + \left( \left\| \int_0^1 P_\gamma^{0\leftarrow t} \mathrm{Hess}\,f(\gamma(t)) P_\gamma^{t\leftarrow 0}\, dt - \mathrm{Hess}\,f(x_1) \right\| + \|\mathrm{Hess}\,f(x_1) - \mathcal{T}_{S_{\zeta_1}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_1}}^{-1}\| \right) \|s\|$
$\le b_2 \|s\| \max\{\mathrm{dist}(x_1, x^*), \mathrm{dist}(x_2, x^*)\}$,   (by Assumption 3.3)

where $b_1$ and $b_2$ are some constants.

With these technical lemmas in place, we now start the Riemannian generalization of the sequence of lemmas in [BKS96] that leads to the main result [BKS96, Theorem 2.7], generalized here as the final theorem of this section. For an easier comparison with [BKS96], in the rest of the convergence analysis we let $n$ (instead of $d$) denote the dimension of the manifold $\mathcal{M}$.

The next lemma generalizes [BKS96, Lemma 2.3], itself a slight variation of [KBS93, Lemma 3.2]. The proof of [BKS96, Lemma 2.3] involves considering the span of a few $s_j$'s. In the Riemannian setting, a difficulty arises from the fact that the $s_j$'s are not in the same tangent space. We overcome this difficulty by transporting the $s_j$'s to $T_{x^*}\mathcal{M}$.

Lemma 3.14. Let $s_k$ be such that $R_{x_k}(s_k) \to x^*$. If Assumptions 3.1, 3.2, 3.3, and 3.5 hold, then there exists $K \ge 0$ such that for any set of $n+1$ steps $S = \{s_{k_j} : K \le k_1 < \ldots < k_{n+1}\}$, there exists an index $k_m$ with $m \in \{2, 3, \ldots, n+1\}$ such that

$\frac{\|(B_{k_m} - H_{k_m}) s_{k_m}\|}{\|s_{k_m}\|} < (a_{12} a_{10}^n + \bar{a}_{12})\, \epsilon_S^{1/n}$,

where $\epsilon_S = \max_{1 \le j \le n+1}\{\mathrm{dist}(x_{k_j}, x^*), \mathrm{dist}(R_{x_{k_j}}(s_{k_j}), x^*)\}$, $H_{k_m} = \mathcal{T}_{S_{\zeta_m}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_m}}^{-1}$, $\zeta_m = R_{x^*}^{-1}(x_{k_m})$, $a_{12}, \bar{a}_{12}$ are some constants, and $n$ is the dimension of the manifold.

Proof. Given $S$, for $j = 1, 2, \ldots, n+1$, define

$S_j = \left[ \frac{\bar{s}_1}{\|\bar{s}_1\|}, \frac{\bar{s}_2}{\|\bar{s}_2\|}, \ldots, \frac{\bar{s}_j}{\|\bar{s}_j\|} \right]$,

where $\bar{s}_i = \mathcal{T}_{S_{\zeta_i}}^{-1} s_{k_i}$ and $\zeta_i = R_{x^*}^{-1}(x_{k_i})$, $i = 1, 2, \ldots, j$.

The proof is organized as follows. We will first obtain in (28) that there exist $m \in [2, n+1]$, $u \in \mathbb{R}^{m-1}$ and $w \in T_{x^*}\mathcal{M}$ such that $\bar{s}_m/\|\bar{s}_m\| = S_{m-1}u - w$, $S_{m-1}$ has full column rank and is well conditioned, and $\|w\|$ is small. We will also obtain in (30) that $\|(\mathcal{T}_{S_{\zeta_m}}^{-1} B_{k_m} \mathcal{T}_{S_{\zeta_m}} - \mathrm{Hess}\,f(x^*))S_{m-1}\|$ is small due to the Hessian-approximating properties of the SR1 update established above. The conclusion follows from these two results.

Let $G$ denote the matrix expression of the inner product on $T_{x^*}\mathcal{M}$ and let $\hat{S}_j$ denote the coordinate expression of $S_j$, for $j \in \{1, \ldots, n\}$. Let $\kappa_j$ be the smallest singular value of $G^{1/2}\hat{S}_j$ and define $\kappa_{n+1} = 0$. We have $1 = \kappa_1 \ge \kappa_2 \ge \ldots \ge \kappa_{n+1} = 0$. Let $m$ be the smallest integer for which

$\frac{\kappa_m}{\kappa_{m-1}} < \epsilon_S^{1/n}$.   (20)

Since $m \le n+1$ and $\kappa_1 = 1$, we have

$\kappa_{m-1} = \kappa_1 \left(\frac{\kappa_2}{\kappa_1}\right) \cdots \left(\frac{\kappa_{m-1}}{\kappa_{m-2}}\right) > \epsilon_S^{(m-2)/n} \ge \epsilon_S^{(n-1)/n}$.   (21)

Since $x_k \to x^*$ and $R_{x_k}(s_k) \to x^*$, we can assume that $\epsilon_S \in (0, (\tfrac{1}{4})^n)$. Now, we choose $z \in \mathbb{R}^m$ such that

$\|G^{1/2}\hat{S}_m z\|_2 = \kappa_m \|z\|_2$   (22)

and $z = \begin{pmatrix} u \\ -1 \end{pmatrix}$, where $u \in \mathbb{R}^{m-1}$. (The last component of $z$ is nonzero because $m$ is the smallest integer such that (20) is true.) Let $w = S_m z$ and let $\hat{w} = \hat{S}_m z$ be its coordinate expression. From the definition of $G^{1/2}\hat{S}_m$ and $z$, we have

$G^{1/2}\hat{S}_{m-1} u - G^{1/2}\hat{w} = \frac{G^{1/2}\hat{\bar{s}}_m}{\|G^{1/2}\hat{\bar{s}}_m\|_2}$,   (23)

where $\hat{\bar{s}}_m$ is the coordinate expression of $\bar{s}_m$. Since $\kappa_{m-1}$ is the smallest singular value of $G^{1/2}\hat{S}_{m-1}$, we have that

$\|u\|_2 \le \frac{1}{\kappa_{m-1}} \|G^{1/2}\hat{S}_{m-1}u\|_2 = \frac{1}{\kappa_{m-1}} \left\| G^{1/2}\hat{w} + \frac{G^{1/2}\hat{\bar{s}}_m}{\|G^{1/2}\hat{\bar{s}}_m\|_2} \right\|_2 \le \frac{\|G^{1/2}\hat{w}\|_2 + 1}{\kappa_{m-1}} = \frac{\|w\| + 1}{\kappa_{m-1}}$   (24)
$< \frac{\|w\| + 1}{\epsilon_S^{(n-1)/n}}$.   (by (21))   (25)

Using (22) and (24), we have that

$\|w\|^2 = \|G^{1/2}\hat{w}\|_2^2 = \|G^{1/2}\hat{S}_m z\|_2^2 = \kappa_m^2 \|z\|_2^2 = \kappa_m^2 (1 + \|u\|_2^2) \le \kappa_m^2 + \left(\frac{\kappa_m}{\kappa_{m-1}}\right)^2 (\|G^{1/2}\hat{w}\|_2 + 1)^2 = \kappa_m^2 + \left(\frac{\kappa_m}{\kappa_{m-1}}\right)^2 (\|w\| + 1)^2$.

Therefore, since (20) implies that $\kappa_m < \epsilon_S^{1/n}$, using (20),

$\|w\|^2 < \epsilon_S^{2/n} + \epsilon_S^{2/n}(\|w\| + 1)^2 < 4\epsilon_S^{2/n}(\|w\| + 1)^2$.   (26)

This implies

$\|w\|(1 - 2\epsilon_S^{1/n}) < 2\epsilon_S^{1/n}$,

and hence $\|w\| < 1$, since $\epsilon_S < (\tfrac{1}{4})^n$. Therefore, (25) and (26) imply that

$\|u\|_2 < \frac{2}{\epsilon_S^{(n-1)/n}}$,   (27)
$\|w\| < 4\epsilon_S^{1/n}$.   (28)

Equation (28) is the announced result that $\|w\|$ is small. The bound (27) will also be invoked below.

Now we show that $\|(\mathcal{T}_{S_{\zeta_j}}^{-1} B_{k_j} \mathcal{T}_{S_{\zeta_j}} - \mathrm{Hess}\,f(x^*)) S_{j-1}\|$ is small for all $j \in [2, n+1]$ (and thus in particular for $j = m$). By Lemma 3.12, we have

$\|y_{k_i} - (B_{k_j})_{k_i} s_{k_i}\| \le a_9 a_{10}^{k_j - k_i - 2} \epsilon_{k_j, k_i} \|s_{k_i}\| \le a_9 a_{10}^n \epsilon_S \|s_{k_i}\|$   (29)

for all $i \in \{1, 2, \ldots, j-1\}$. Therefore,

$\frac{\|(\mathcal{T}_{S_{\zeta_j}}^{-1} B_{k_j} \mathcal{T}_{S_{\zeta_j}} - \mathrm{Hess}\,f(x^*)) \bar{s}_i\|}{\|\bar{s}_i\|}
\le \frac{\|\mathcal{T}_{S_{\zeta_i}}^{-1} y_{k_i} - \mathcal{T}_{S_{\zeta_j}}^{-1} B_{k_j} \mathcal{T}_{S_{\zeta_j}} \bar{s}_i\|}{\|\bar{s}_i\|} + \frac{\|\mathcal{T}_{S_{\zeta_i}}^{-1} y_{k_i} - \mathrm{Hess}\,f(x^*) \bar{s}_i\|}{\|\bar{s}_i\|}$
$\le \frac{\|\mathcal{T}_{S_{\zeta_i}}^{-1}(y_{k_i} - \mathcal{T}_{S_{\zeta_i}} \mathcal{T}_{S_{\zeta_j}}^{-1} B_{k_j} \mathcal{T}_{S_{\zeta_j}} \mathcal{T}_{S_{\zeta_i}}^{-1} s_{k_i})\|}{\|s_{k_i}\|} + b_1 \epsilon_S$   (by Lemma 3.13)
$\le b_2 \frac{\|y_{k_i} - (B_{k_j})_{k_i} s_{k_i}\|}{\|s_{k_i}\|} + b_3 \epsilon_S$   (by Lemma 3.11 and Assumption 3.1)
$\le (b_4 a_{10}^n + b_3)\epsilon_S$,   (by (29))

where $b_1, b_2, b_3$ and $b_4$ are some constants. Therefore, we have that for any $j \in [2, n+1]$,

$\|(\mathcal{T}_{S_{\zeta_j}}^{-1} B_{k_j} \mathcal{T}_{S_{\zeta_j}} - \mathrm{Hess}\,f(x^*)) S_{j-1}\|_{g,2} \le b_5 \epsilon_S$,   (30)

where $b_5 = \sqrt{n}(b_4 a_{10}^n + b_3)$ and $\|\cdot\|_{g,2}$ is the norm induced by the Riemannian metric $g$ and the Euclidean norm, i.e., $\|A\|_{g,2} = \sup_{v \ne 0} \|Av\|/\|v\|_2$ with $\|\cdot\|$ defined in (5).

We can now conclude the proof as follows. Using (23) and (30) with $j = m$, together with (27) and (28), we have

$\frac{\|(\mathcal{T}_{S_{\zeta_m}}^{-1} B_{k_m} \mathcal{T}_{S_{\zeta_m}} - \mathrm{Hess}\,f(x^*))\bar{s}_m\|}{\|\bar{s}_m\|} = \|(\mathcal{T}_{S_{\zeta_m}}^{-1} B_{k_m} \mathcal{T}_{S_{\zeta_m}} - \mathrm{Hess}\,f(x^*))(S_{m-1}u - w)\|$
$\le \|(\mathcal{T}_{S_{\zeta_m}}^{-1} B_{k_m} \mathcal{T}_{S_{\zeta_m}} - \mathrm{Hess}\,f(x^*))S_{m-1}\|_{g,2} \|u\|_2 + \|\mathcal{T}_{S_{\zeta_m}}^{-1} B_{k_m} \mathcal{T}_{S_{\zeta_m}} - \mathrm{Hess}\,f(x^*)\| \|w\|$
$\le b_5 \epsilon_S \frac{2}{\epsilon_S^{(n-1)/n}} + (M + \|\mathrm{Hess}\,f(x^*)\|)\, 4\epsilon_S^{1/n}$   (by Assumption 3.1)
$\le (2b_5 + b_6)\epsilon_S^{1/n}$,

where $b_6$ is some constant. Finally,

$\frac{\|(B_{k_m} - H_{k_m})s_{k_m}\|}{\|s_{k_m}\|} = \frac{\|(B_{k_m} - \mathcal{T}_{S_{\zeta_m}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_m}}^{-1})s_{k_m}\|}{\|s_{k_m}\|} = \frac{\|(\mathcal{T}_{S_{\zeta_m}}^{-1} B_{k_m} \mathcal{T}_{S_{\zeta_m}} - \mathrm{Hess}\,f(x^*))\bar{s}_m\|}{\|\bar{s}_m\|} \le (2b_5 + b_6)\epsilon_S^{1/n}$.

The next lemma generalizes [BKS96, Lemma 2.4]. Its proof is a translation of the proof of [BKS96, Lemma 2.4], where we invoke two manifold-specific results: the equality of $\mathrm{Hess}\,f(x^*)$ and $\mathrm{Hess}(f \circ R_{x^*})(0_{x^*})$ (which holds in view of [AMS08, Proposition 5.5.6] since $x^*$ is a critical point of $f$), and the bound in Lemma 3.4 on the retraction $R$.

Lemma 3.15. Suppose that Assumptions 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6 hold and the trust-region subproblem (2) is solved accurately enough for (9) to hold. Then there exists $N$ such that for any set of $p > n$ consecutive steps $s_{k+1}, s_{k+2}, \ldots, s_{k+p}$ with $k \ge N$, there exists a set $\mathcal{G}$ of at least $p - n$ indices contained in the set $\{i : k+1 \le i \le k+p\}$ such that for all $j \in \mathcal{G}$,

$\frac{\|(B_j - H_j)s_j\|}{\|s_j\|} < a_{13}\,\epsilon_k^{1/n}$,

where $a_{13} = a_{12}a_{10}^p + \bar{a}_{12}$, $H_j = \mathcal{T}_{S_{\zeta_j}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_j}}^{-1}$, $\zeta_j = R_{x^*}^{-1}(x_j)$, and

$\epsilon_k = \max_{k+1 \le j \le k+p} \{\mathrm{dist}(x_j, x^*), \mathrm{dist}(R_{x_j}(s_j), x^*)\}$.

Furthermore, for $k$ sufficiently large, if $j \in \mathcal{G}$, then

$\|s_j\| < a_{14}\, \mathrm{dist}(x_j, x^*)$,   (31)

where $a_{14}$ is a constant, and

$\rho_j > \tfrac{3}{4}$.   (32)

Proof. By Lemma 3.6, $\|s_k\| \to 0$. Therefore, by Lemma 3.14, applied to the set

$\{s_{k+1}, s_{k+2}, \ldots, s_{k+p}\}$,   (33)

there exists $N$ such that for any $k \ge N$ there exists an index $l_1$, with $k+1 \le l_1 \le k+p$, satisfying

$\frac{\|(B_{l_1} - H_{l_1})s_{l_1}\|}{\|s_{l_1}\|} < a_{13}\,\epsilon_k^{1/n}$,

where $a_{13} = a_{12}a_{10}^p + \bar{a}_{12}$. Now we can apply Lemma 3.14 to the set $\{s_{k+1}, s_{k+2}, \ldots, s_{k+p}\} \setminus \{s_{l_1}\}$ to get $l_2$. Repeating this $p - n$ times, we get a set of $p - n$ indices $\mathcal{G} = \{l_1, l_2, \ldots, l_{p-n}\}$ such that if $j \in \mathcal{G}$, then

$\|(B_j - H_j)s_j\| < a_{13}\,\epsilon_k^{1/n}\|s_j\|$.   (34)

We show (31) next. Consider $j \in \mathcal{G}$. By (34), we have

$|g(s_j, (H_j - B_j)s_j)| \le \|s_j\| \|(H_j - B_j)s_j\| \le a_{13}\,\epsilon_k^{1/n}\|s_j\|^2$.

Therefore,

$g(s_j, B_j s_j) \ge g(s_j, H_j s_j) - a_{13}\,\epsilon_k^{1/n}\|s_j\|^2 > b_0\|s_j\|^2$   (choosing $k$ large enough)

where $b_0$ is a constant, and we have

$0 \le m_j(0) - m_j(s_j) = -g(\mathrm{grad}\,f(x_j), s_j) - \tfrac{1}{2}g(s_j, B_j s_j) \le \|\mathrm{grad}\,f(x_j)\|\|s_j\| - \tfrac{1}{2}b_0\|s_j\|^2 \le b_1\, \mathrm{dist}(x_j, x^*)\|s_j\| - \tfrac{1}{2}b_0\|s_j\|^2$,   (by Lemma 3.8)

where $b_1$ is some constant. This yields (31).

Finally, we show (32). Let $j \in \mathcal{G}$ and define $\hat{f}_x(\eta) = f(R_x(\eta))$. It follows that

$f(x_j) - f(R_{x_j}(s_j)) - (m_j(0) - m_j(s_j)) = f(x_j) - f(R_{x_j}(s_j)) + g(\mathrm{grad}\,f(x_j), s_j) + \tfrac{1}{2}g(s_j, B_j s_j)$
$= \hat{f}_{x_j}(0_{x_j}) - \hat{f}_{x_j}(s_j) + g(\mathrm{grad}\,f(x_j), s_j) + \tfrac{1}{2}g(s_j, B_j s_j)$
$= \tfrac{1}{2}g(s_j, B_j s_j) - \int_0^1 g(\mathrm{Hess}\,\hat{f}_{x_j}(\tau s_j)[s_j], s_j)(1-\tau)\, d\tau$   (by Taylor's theorem)
$= \tfrac{1}{2}g(s_j, (B_j - H_j)s_j) + \int_0^1 \big(g(s_j, \mathcal{T}_{S_{\zeta_j}} \mathrm{Hess}\,f(x^*)\, \mathcal{T}_{S_{\zeta_j}}^{-1} s_j) - g(\mathrm{Hess}\,\hat{f}_{x_j}(\tau s_j)[s_j], s_j)\big)(1-\tau)\, d\tau$,

so that

$|f(x_j) - f(R_{x_j}(s_j)) - (m_j(0) - m_j(s_j))|
\le \tfrac{1}{2}\|s_j\| \|(B_j - H_j)s_j\| + \|s_j\|^2 \int_0^1 \|\mathcal{T}_{S_{\zeta_j}} \mathrm{Hess}\,\hat{f}_{x^*}(0_{x^*})\, \mathcal{T}_{S_{\zeta_j}}^{-1} - \mathrm{Hess}\,\hat{f}_{x_j}(\tau s_j)\|(1-\tau)\, d\tau$   (by [AMS08, Proposition 5.5.6])
$\le b_2 \|s_j\|^2 \epsilon_k^{1/n} + b_3 \|s_j\|^2 (\mathrm{dist}(x_j, x^*) + \|s_j\|)$   (by (34), Lemma 3.4 and Assumption 3.4)
$\le b_4 \|s_j\|^2 \epsilon_k^{1/n}$,   (by (31) and since $\mathrm{dist}(x_j, x^*)$ is eventually smaller than $\epsilon_k^{1/n}$)

where $b_2, b_3$ and $b_4$ are some constants. In view of (31) and Lemma 3.8, we have $\|s_j\| < b_5\|\mathrm{grad}\,f(x_j)\|$, where $b_5$ is some constant. Combining this with $\|s_j\| \le \Delta_j$, we obtain

$\|s_j\|^2 \le b_5\|\mathrm{grad}\,f(x_j)\| \min\{\Delta_j, b_5\|\mathrm{grad}\,f(x_j)\|\}$.

Noticing (9), we have

$f(x_j) - f(R_{x_j}(s_j)) - (m_j(0) - m_j(s_j)) \ge -b_6\,\epsilon_k^{1/n}(m_j(0) - m_j(s_j))$,

where $b_6$ is a constant. This implies (32).

The next result generalizes [BKS96, Lemma 2.5] in two ways: the Euclidean setting is extended to the Riemannian setting, and inexact solves are allowed by the presence of $\delta_k$. The main hurdle that we had to overcome in the Riemannian generalization is that the Euclidean identity $\mathrm{dist}(x_k + s_k, x^*) = \|s_k - \xi_k\|$, where $\xi_k$ stands for $x^* - x_k$, does not necessarily carry over. As we will see, Lemma 3.3 comes to our rescue.

Lemma 3.16. Suppose Assumptions 3.2 and 3.3 hold. If the quantities $e_k = \mathrm{dist}(x_k, x^*)$ and $\frac{\|(B_k - H_k)s_k\|}{\|s_k\|}$

are sufficiently small and if $B_k s_k = -\mathrm{grad}\,f(x_k) + \delta_k$ with $\|\delta_k\| \le \|\mathrm{grad}\,f(x_k)\|^{1+\theta}$, then

$\mathrm{dist}(R_{x_k}(s_k), x^*) \le a_{15} \frac{\|(B_k - H_k)s_k\|}{\|s_k\|} e_k + a_{16} e_k^{1+\min\{\theta,1\}}$,   (35)

$h(R_{x_k}(s_k)) \le a_{17} \frac{\|(B_k - H_k)s_k\|}{\|s_k\|} h(x_k) + a_{18} h^{1+\min\{\theta,1\}}(x_k)$,   (36)

and

$a_{19} h(x_k) \le e_k \le a_{20} h(x_k)$,   (37)

where $a_{15}, a_{16}, a_{17}, a_{18}, a_{19}$ and $a_{20}$ are some constants and $h(x) = (f(x) - f(x^*))^{1/2}$.

Proof. By definition of $s_k$, we have

$s_k = H_k^{-1}[(H_k - B_k)s_k - \mathrm{grad}\,f(x_k) + \delta_k]$.   (38)

Define $\xi_k = R_{x_k}^{-1}x^*$. Therefore, letting $\gamma$ be the curve defined by $\gamma(t) = R_{x_k}(t\xi_k)$ and $\bar{H}_k = \int_0^1 P_\gamma^{0\leftarrow t}\mathrm{Hess}\,f(\gamma(t))P_\gamma^{t\leftarrow 0}\,dt$, we have

$\|s_k - \xi_k\| = \|H_k^{-1}[(H_k - B_k)s_k - \mathrm{grad}\,f(x_k) + \delta_k - H_k\xi_k]\|$
$\le b_0\big(\|(H_k - B_k)s_k\| + \|\delta_k\| + \|P_\gamma^{0\leftarrow 1}\mathrm{grad}\,f(x^*) - \mathrm{grad}\,f(x_k) - \bar{H}_k\xi_k\| + \|\bar{H}_k\xi_k - \mathrm{Hess}\,f(x_k)\xi_k\| + \|\mathrm{Hess}\,f(x_k)\xi_k - H_k\xi_k\|\big)$
$\le b_0\|(H_k - B_k)s_k\| + b_0 b_1\|\xi_k\|^{1+\min\{\theta,1\}} + b_0\big(\|\bar{H}_k\xi_k - \mathrm{Hess}\,f(x_k)\xi_k\| + \|\mathrm{Hess}\,f(x_k)\xi_k - H_k\xi_k\|\big)$   (by Lemmas 3.8 and 3.9)
$\le b_0\|(H_k - B_k)s_k\| + b_0 b_1\|\xi_k\|^{1+\min\{\theta,1\}} + b_0 b_3\|\xi_k\|^2$   (by Assumption 3.3)
$\le b_0\|(H_k - B_k)s_k\| + b_4\|\xi_k\|^{1+\min\{\theta,1\}}$,   (39)

where $b_0$, $b_1$, $b_2$, $b_3$ and $b_4$ are some constants. From Lemma 3.3, we have

$\mathrm{dist}(R_{x_k}(s_k), x^*) = \mathrm{dist}(R_{x_k}(s_k), R_{x_k}(\xi_k)) \le b_5\|s_k - \xi_k\|$,   (40)

where $b_5$ is a constant. Combining (39) and (40) and using Lemma 3.4, we obtain

$\mathrm{dist}(R_{x_k}(s_k), x^*) \le b_0 b_5\|(H_k - B_k)s_k\| + b_4 b_5 e_k^{1+\min\{\theta,1\}}$.   (41)

From (38), if $\frac{\|(H_k - B_k)s_k\|}{\|s_k\|}$ is small enough that $\|H_k^{-1}(H_k - B_k)s_k\| \le \tfrac{1}{2}\|s_k\|$, we have

$\|s_k\| \le \tfrac{1}{2}\|s_k\| + \|H_k^{-1}\|(\|\mathrm{grad}\,f(x_k)\| + \|\mathrm{grad}\,f(x_k)\|^{1+\theta})$.

Using Lemma 3.8, this yields $\|s_k\| \le b_6\, \mathrm{dist}(x_k, x^*)$,

where $b_6$ is a constant. Using the latter in (41) yields

$\mathrm{dist}(R_{x_k}(s_k), x^*) \le b_0 b_5 b_6 \frac{\|(H_k - B_k)s_k\|}{\|s_k\|}\, \mathrm{dist}(x_k, x^*) + b_4 b_5 e_k^{1+\min\{\theta,1\}}$,

which shows (35).

We show (37) next. Define $\hat{f}_{x^*}(\eta) = f(R_{x^*}(\eta))$ and let $\zeta_k = R_{x^*}^{-1}(x_k)$. We have, for some $t \in (0,1)$,

$\hat{f}_{x^*}(\zeta_k) - \hat{f}_{x^*}(0_{x^*}) = g(\mathrm{grad}\,f(x^*), \zeta_k) + \tfrac{1}{2}g(\mathrm{Hess}\,\hat{f}_{x^*}(t\zeta_k)[\zeta_k], \zeta_k) = \tfrac{1}{2}g(\mathrm{Hess}\,\hat{f}_{x^*}(t\zeta_k)[\zeta_k], \zeta_k)$,

where we have used (Euclidean) Taylor's theorem to get the first equality and the fact that $x^*$ is a critical point of $f$ (Assumption 3.2) for the second one. Therefore, since $\mathrm{Hess}\,\hat{f}_{x^*}(0_{x^*}) = \mathrm{Hess}\,f(x^*)$ is positive definite (in view of [AMS08, Proposition 5.5.6] and Assumption 3.2), there exist $b_7$ and $b_8$ such that

$b_7(\hat{f}_{x^*}(\zeta_k) - \hat{f}_{x^*}(0_{x^*})) \le \|\zeta_k\|^2 \le b_8(\hat{f}_{x^*}(\zeta_k) - \hat{f}_{x^*}(0_{x^*}))$.

Then, using Lemma 3.4, we obtain that there exist $b_9$ and $b_{10}$ such that

$b_9(f(x_k) - f(x^*)) \le \mathrm{dist}(x_k, x^*)^2 \le b_{10}(f(x_k) - f(x^*))$.

In other words,

$b_9 h^2(x_k) \le e_k^2 \le b_{10}h^2(x_k)$,

and we have shown (37). Combining it with (35), we get (36).

With Lemmas 3.15 and 3.16 in place, the rest of the local convergence analysis is essentially a translation of the analysis in [BKS96]. The next lemma generalizes [BKS96, Lemma 2.6].

Lemma 3.17. If Assumptions 3.1, 3.2, 3.3, 3.4, 3.5, and 3.6 hold and the subproblem is solved accurately enough for (9) and (10) to hold, then

$\lim_{k\to\infty} \frac{h_k}{\Delta_k} = 0$,

where $h_k = h(x_k)$.

Proof. Let $p$ be the smallest integer greater than $2n + n(-\ln\tau_1/\ln\tau_2)$, where $\tau_1$ and $\tau_2$ are defined in Algorithm 1. Then

$\tau_1^n \tau_2^{p-2n} \ge 1$.   (42)

Applying Lemma 3.15 with this value of $p$, there exists $N$ such that if $k \ge N$, then there exists a set of at least $p - n$ indices, $\mathcal{G} \subseteq \{j : k+1 \le j \le k+p\}$, such that if $j \in \mathcal{G}$, then

$\frac{\|(B_j - H_j)s_j\|}{\|s_j\|} < a_{13}\,\epsilon_k^{1/n}$.   (43)

We now show that for such steps,

$\frac{h_{j+1}}{\Delta_{j+1}} \le \frac{1}{\tau_2}\, \frac{h_j}{\Delta_j}$.   (44)


Existence and uniqueness of solutions for a continuous-time opinion dynamics model with state-dependent connectivity Existence and uniqueness of solutions for a continuous-time opinion dynamics model with state-dependent connectivity Vincent D. Blondel, Julien M. Hendricx and John N. Tsitsilis July 24, 2009 Abstract

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Course 212: Academic Year Section 1: Metric Spaces

Course 212: Academic Year Section 1: Metric Spaces Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent Nonlinear Optimization Steepest Descent and Niclas Börlin Department of Computing Science Umeå University niclas.borlin@cs.umu.se A disadvantage with the Newton method is that the Hessian has to be derived

More information

Implicit Functions, Curves and Surfaces

Implicit Functions, Curves and Surfaces Chapter 11 Implicit Functions, Curves and Surfaces 11.1 Implicit Function Theorem Motivation. In many problems, objects or quantities of interest can only be described indirectly or implicitly. It is then

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M =

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = CALCULUS ON MANIFOLDS 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = a M T am, called the tangent bundle, is itself a smooth manifold, dim T M = 2n. Example 1.

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

Cubic regularization of Newton s method for convex problems with constraints

Cubic regularization of Newton s method for convex problems with constraints CORE DISCUSSION PAPER 006/39 Cubic regularization of Newton s method for convex problems with constraints Yu. Nesterov March 31, 006 Abstract In this paper we derive efficiency estimates of the regularized

More information

DEVELOPMENT OF MORSE THEORY

DEVELOPMENT OF MORSE THEORY DEVELOPMENT OF MORSE THEORY MATTHEW STEED Abstract. In this paper, we develop Morse theory, which allows us to determine topological information about manifolds using certain real-valued functions defined

More information

Chapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem

Chapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem Chapter 4 Inverse Function Theorem d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d dd d d d d This chapter

More information

arxiv: v2 [math.oc] 28 Apr 2018

arxiv: v2 [math.oc] 28 Apr 2018 Global rates of convergence for nonconvex optimization on manifolds arxiv:1605.08101v2 [math.oc] 28 Apr 2018 Nicolas Boumal P.-A. Absil Coralia Cartis May 1, 2018 Abstract We consider the minimization

More information

5 Quasi-Newton Methods

5 Quasi-Newton Methods Unconstrained Convex Optimization 26 5 Quasi-Newton Methods If the Hessian is unavailable... Notation: H = Hessian matrix. B is the approximation of H. C is the approximation of H 1. Problem: Solve min

More information

Quasi-Newton Methods

Quasi-Newton Methods Quasi-Newton Methods Werner C. Rheinboldt These are excerpts of material relating to the boos [OR00 and [Rhe98 and of write-ups prepared for courses held at the University of Pittsburgh. Some further references

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 While these notes are under construction, I expect there will be many typos. The main reference for this is volume 1 of Hörmander, The analysis of liner

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

Elements of Linear Algebra, Topology, and Calculus

Elements of Linear Algebra, Topology, and Calculus Appendix A Elements of Linear Algebra, Topology, and Calculus A.1 LINEAR ALGEBRA We follow the usual conventions of matrix computations. R n p is the set of all n p real matrices (m rows and p columns).

More information

APPLICATIONS OF DIFFERENTIABILITY IN R n.

APPLICATIONS OF DIFFERENTIABILITY IN R n. APPLICATIONS OF DIFFERENTIABILITY IN R n. MATANIA BEN-ARTZI April 2015 Functions here are defined on a subset T R n and take values in R m, where m can be smaller, equal or greater than n. The (open) ball

More information

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Oleg Burdakov a,, Ahmad Kamandi b a Department of Mathematics, Linköping University,

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Programming, numerics and optimization

Programming, numerics and optimization Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428

More information

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 18, No. 1, pp. 106 13 c 007 Society for Industrial and Applied Mathematics APPROXIMATE GAUSS NEWTON METHODS FOR NONLINEAR LEAST SQUARES PROBLEMS S. GRATTON, A. S. LAWLESS, AND N. K.

More information

Maria Cameron. f(x) = 1 n

Maria Cameron. f(x) = 1 n Maria Cameron 1. Local algorithms for solving nonlinear equations Here we discuss local methods for nonlinear equations r(x) =. These methods are Newton, inexact Newton and quasi-newton. We will show that

More information

A COMBINED CLASS OF SELF-SCALING AND MODIFIED QUASI-NEWTON METHODS

A COMBINED CLASS OF SELF-SCALING AND MODIFIED QUASI-NEWTON METHODS A COMBINED CLASS OF SELF-SCALING AND MODIFIED QUASI-NEWTON METHODS MEHIDDIN AL-BAALI AND HUMAID KHALFAN Abstract. Techniques for obtaining safely positive definite Hessian approximations with selfscaling

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

arxiv: v1 [math.oc] 22 May 2018

arxiv: v1 [math.oc] 22 May 2018 On the Connection Between Sequential Quadratic Programming and Riemannian Gradient Methods Yu Bai Song Mei arxiv:1805.08756v1 [math.oc] 22 May 2018 May 23, 2018 Abstract We prove that a simple Sequential

More information

1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

Optimisation on Manifolds

Optimisation on Manifolds Optimisation on Manifolds K. Hüper MPI Tübingen & Univ. Würzburg K. Hüper (MPI Tübingen & Univ. Würzburg) Applications in Computer Vision Grenoble 18/9/08 1 / 29 Contents 2 Examples Essential matrix estimation

More information

Linear Algebra. Session 12

Linear Algebra. Session 12 Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization Frank E. Curtis Department of Industrial and Systems Engineering, Lehigh University Daniel P. Robinson Department

More information

MATH 4030 Differential Geometry Lecture Notes Part 4 last revised on December 4, Elementary tensor calculus

MATH 4030 Differential Geometry Lecture Notes Part 4 last revised on December 4, Elementary tensor calculus MATH 4030 Differential Geometry Lecture Notes Part 4 last revised on December 4, 205 Elementary tensor calculus We will study in this section some basic multilinear algebra and operations on tensors. Let

More information

Numerical Methods for Large-Scale Nonlinear Systems

Numerical Methods for Large-Scale Nonlinear Systems Numerical Methods for Large-Scale Nonlinear Systems Handouts by Ronald H.W. Hoppe following the monograph P. Deuflhard Newton Methods for Nonlinear Problems Springer, Berlin-Heidelberg-New York, 2004 Num.

More information

A Variational Model for Data Fitting on Manifolds by Minimizing the Acceleration of a Bézier Curve

A Variational Model for Data Fitting on Manifolds by Minimizing the Acceleration of a Bézier Curve A Variational Model for Data Fitting on Manifolds by Minimizing the Acceleration of a Bézier Curve Ronny Bergmann a Technische Universität Chemnitz Kolloquium, Department Mathematik, FAU Erlangen-Nürnberg,

More information

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005 3 Numerical Solution of Nonlinear Equations and Systems 3.1 Fixed point iteration Reamrk 3.1 Problem Given a function F : lr n lr n, compute x lr n such that ( ) F(x ) = 0. In this chapter, we consider

More information

The nonsmooth Newton method on Riemannian manifolds

The nonsmooth Newton method on Riemannian manifolds The nonsmooth Newton method on Riemannian manifolds C. Lageman, U. Helmke, J.H. Manton 1 Introduction Solving nonlinear equations in Euclidean space is a frequently occurring problem in optimization and

More information

Math 6455 Nov 1, Differential Geometry I Fall 2006, Georgia Tech

Math 6455 Nov 1, Differential Geometry I Fall 2006, Georgia Tech Math 6455 Nov 1, 26 1 Differential Geometry I Fall 26, Georgia Tech Lecture Notes 14 Connections Suppose that we have a vector field X on a Riemannian manifold M. How can we measure how much X is changing

More information

Riemannian geometry of surfaces

Riemannian geometry of surfaces Riemannian geometry of surfaces In this note, we will learn how to make sense of the concepts of differential geometry on a surface M, which is not necessarily situated in R 3. This intrinsic approach

More information

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf. Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

A brief introduction to Semi-Riemannian geometry and general relativity. Hans Ringström

A brief introduction to Semi-Riemannian geometry and general relativity. Hans Ringström A brief introduction to Semi-Riemannian geometry and general relativity Hans Ringström May 5, 2015 2 Contents 1 Scalar product spaces 1 1.1 Scalar products...................................... 1 1.2 Orthonormal

More information

Choice of Riemannian Metrics for Rigid Body Kinematics

Choice of Riemannian Metrics for Rigid Body Kinematics Choice of Riemannian Metrics for Rigid Body Kinematics Miloš Žefran1, Vijay Kumar 1 and Christopher Croke 2 1 General Robotics and Active Sensory Perception (GRASP) Laboratory 2 Department of Mathematics

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Tao Wu Institute for Mathematics and Scientific Computing Karl-Franzens-University of Graz joint work with Prof.

More information

Quasi-Newton methods for minimization

Quasi-Newton methods for minimization Quasi-Newton methods for minimization Lectures for PHD course on Numerical optimization Enrico Bertolazzi DIMS Universitá di Trento November 21 December 14, 2011 Quasi-Newton methods for minimization 1

More information

On Lagrange multipliers of trust-region subproblems

On Lagrange multipliers of trust-region subproblems On Lagrange multipliers of trust-region subproblems Ladislav Lukšan, Ctirad Matonoha, Jan Vlček Institute of Computer Science AS CR, Prague Programy a algoritmy numerické matematiky 14 1.- 6. června 2008

More information

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS RALPH HOWARD DEPARTMENT OF MATHEMATICS UNIVERSITY OF SOUTH CAROLINA COLUMBIA, S.C. 29208, USA HOWARD@MATH.SC.EDU Abstract. This is an edited version of a

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The notion of complexity (per iteration)

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Introduction to Nonlinear Optimization Paul J. Atzberger

Introduction to Nonlinear Optimization Paul J. Atzberger Introduction to Nonlinear Optimization Paul J. Atzberger Comments should be sent to: atzberg@math.ucsb.edu Introduction We shall discuss in these notes a brief introduction to nonlinear optimization concepts,

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

CHAPTER 1 PRELIMINARIES

CHAPTER 1 PRELIMINARIES CHAPTER 1 PRELIMINARIES 1.1 Introduction The aim of this chapter is to give basic concepts, preliminary notions and some results which we shall use in the subsequent chapters of the thesis. 1.2 Differentiable

More information

A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties

A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties André L. Tits Andreas Wächter Sasan Bahtiari Thomas J. Urban Craig T. Lawrence ISR Technical

More information

Spectral gradient projection method for solving nonlinear monotone equations

Spectral gradient projection method for solving nonlinear monotone equations Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department

More information

Riemannian Optimization and its Application to Phase Retrieval Problem

Riemannian Optimization and its Application to Phase Retrieval Problem Riemannian Optimization and its Application to Phase Retrieval Problem 1/46 Introduction Motivations Optimization History Phase Retrieval Problem Summary Joint work with: Pierre-Antoine Absil, Professor

More information

2. Quasi-Newton methods

2. Quasi-Newton methods L. Vandenberghe EE236C (Spring 2016) 2. Quasi-Newton methods variable metric methods quasi-newton methods BFGS update limited-memory quasi-newton methods 2-1 Newton method for unconstrained minimization

More information

ADJOINTS, ABSOLUTE VALUES AND POLAR DECOMPOSITIONS

ADJOINTS, ABSOLUTE VALUES AND POLAR DECOMPOSITIONS J. OPERATOR THEORY 44(2000), 243 254 c Copyright by Theta, 2000 ADJOINTS, ABSOLUTE VALUES AND POLAR DECOMPOSITIONS DOUGLAS BRIDGES, FRED RICHMAN and PETER SCHUSTER Communicated by William B. Arveson Abstract.

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Inexact Newton Methods Applied to Under Determined Systems. Joseph P. Simonis. A Dissertation. Submitted to the Faculty

Inexact Newton Methods Applied to Under Determined Systems. Joseph P. Simonis. A Dissertation. Submitted to the Faculty Inexact Newton Methods Applied to Under Determined Systems by Joseph P. Simonis A Dissertation Submitted to the Faculty of WORCESTER POLYTECHNIC INSTITUTE in Partial Fulfillment of the Requirements for

More information

LECTURE: KOBORDISMENTHEORIE, WINTER TERM 2011/12; SUMMARY AND LITERATURE

LECTURE: KOBORDISMENTHEORIE, WINTER TERM 2011/12; SUMMARY AND LITERATURE LECTURE: KOBORDISMENTHEORIE, WINTER TERM 2011/12; SUMMARY AND LITERATURE JOHANNES EBERT 1.1. October 11th. 1. Recapitulation from differential topology Definition 1.1. Let M m, N n, be two smooth manifolds

More information

A Convexity Theorem For Isoparametric Submanifolds

A Convexity Theorem For Isoparametric Submanifolds A Convexity Theorem For Isoparametric Submanifolds Marco Zambon January 18, 2001 1 Introduction. The main objective of this paper is to discuss a convexity theorem for a certain class of Riemannian manifolds,

More information

arxiv: v2 [math.oc] 31 Jan 2019

arxiv: v2 [math.oc] 31 Jan 2019 Analysis of Sequential Quadratic Programming through the Lens of Riemannian Optimization Yu Bai Song Mei arxiv:1805.08756v2 [math.oc] 31 Jan 2019 February 1, 2019 Abstract We prove that a first-order Sequential

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information