A BROYDEN CLASS OF QUASI-NEWTON METHODS FOR RIEMANNIAN OPTIMIZATION


Tech. report UCL-INMA

WEN HUANG, K. A. GALLIVAN, AND P.-A. ABSIL

Department of Mathematical Engineering, ICTEAM Institute, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium. Department of Mathematics, 208 Love Building, 1017 Academic Way, Florida State University, Tallahassee, FL, USA. Corresponding author: huwst08@gmail.com.

Abstract. This paper develops and analyzes a generalization of the Broyden class of quasi-Newton methods to the problem of minimizing a smooth objective function f on a Riemannian manifold. A condition on vector transport and retraction that guarantees convergence and facilitates efficient computation is derived. Experimental evidence is presented demonstrating the value of the extension to the Riemannian Broyden class through superior performance for some problems compared to existing Riemannian BFGS methods, in particular those that depend on differentiated retraction.

Key words. Riemannian optimization; manifold optimization; quasi-Newton methods; Broyden methods; Stiefel manifold

AMS subject classifications. 65K05, 90C48, 90C53

1. Introduction. In the classical Euclidean setting, the Broyden class (see, e.g., [22, Section 6.3]) is a family of quasi-Newton methods that depend on a real parameter φ_k. Its Hessian approximation update formula is B_{k+1} = (1 − φ_k) B_{k+1}^{BFGS} + φ_k B_{k+1}^{DFP}, where B_{k+1}^{BFGS} stands for the update obtained by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and B_{k+1}^{DFP} for the update of the Davidon-Fletcher-Powell (DFP) method. Therefore, all members of the Broyden class satisfy the well-known secant equation, central to many quasi-Newton methods. For many years, BFGS, φ_k = 0, was the preferred member of the family, as it tends to perform better in numerical experiments. Analyzing the entire Broyden class was nevertheless a topic of interest in view of the insight that it gives into the properties of quasi-Newton methods; see [11] and the many references therein. Subsequently, it was found that negative values of φ_k are desirable [34, 10], and recent results reported in [19] indicate that a significant improvement can be obtained by exploiting the freedom offered by φ_k.

The problem of minimizing a smooth objective function f on a Riemannian manifold has been a topic of much interest over the past few years due to several important applications. Recently considered applications include matrix completion problems [8, 21, 12, 33], truss optimization [26], finite-element discretization of Cosserat rods [27], matrix mean computation [6, 4], and independent component analysis [31, 30]. Research efforts to develop and analyze optimization methods on manifolds can be traced back to the work of Luenberger [20]; they concern, among others, steepest-descent methods [20], conjugate gradients [32], Newton's method [32, 3], and trust-region methods [1, 5]; see also [2] for an overview. The idea of quasi-Newton methods on manifolds is not new; however, the literature of which we are aware restricts consideration to the BFGS quasi-Newton method. Gabay [15] discussed a version using parallel transport on submanifolds of R^n. Brace and Manton [9] applied a version on the Grassmann manifold to the problem of weighted low-rank approximations.

Qi [25] compared the performance of different vector transports for a version of BFGS on Riemannian manifolds. Savas and Lim [28] proposed BFGS and limited-memory BFGS methods for problems with cost functions defined on a Grassmannian and applied the methods to the best multilinear rank approximation problem. Ring and Wirth [26] systematically analyzed a version of BFGS on Riemannian manifolds which requires differentiated retraction. Seibert et al. [29] discussed the freedom available when generalizing BFGS to Riemannian manifolds and analyzed one generalization of the BFGS method on Riemannian manifolds that are isometric to R^n.

In view of the above considerations, generalizing the Broyden family to manifolds is an appealing endeavor, which we undertake in this paper. For φ_k = 0 (BFGS) the proposed algorithm is quite similar to the BFGS method of Ring and Wirth [26]. Notably, both methods rely on the framework of retraction and vector transport developed in [3, 2]. The BFGS method of [26] is more general in the sense that it also considers infinite-dimensional manifolds. On the other hand, a characteristic of our work is that we strive to resort as little as possible to the derivative of the retraction. Specifically, the definition of y_k (which corresponds to the usual difference of gradients) in [26] involves Df_{R_{x_k}}(s_k), whose Riesz representation is (DR_{x_k}(s_k))^* grad f(R_{x_k}(s_k)); here we use the notation of [26], i.e., R is the retraction, f_{R_x} = f ∘ R_x, s_k = R_{x_k}^{-1}(x_{k+1}), ^* represents the Riemannian adjoint operator, and grad f is the Riemannian gradient. In contrast, our definition of y_k relies on the same isometric vector transport as the one that appears in the Hessian approximation update formula. This can be advantageous in situations where R is defined by means of a constraint restoration procedure that does not admit a closed-form expression. It may also be the case that the chosen R admits a closed-form expression but that its derivative is unknown to the user. The price to pay for using the isometric vector transport in y_k is satisfying a novel locking condition. Fortunately, we show simple procedures that can produce a retraction or an isometric vector transport such that the pair satisfies the locking condition. As a result, efficient and convergent algorithms can be developed. Another contribution with respect to [26] is that we propose a limited-memory version of the quasi-Newton algorithm for large-scale problems.

The paper is organized as follows. The Riemannian Broyden (RBroyden) family of algorithms and the locking condition are defined in Section 2. A convergence analysis is presented in Section 3. Two methods of constructing an isometric vector transport and a method of constructing a retraction related to the locking condition are derived in Section 4. The limited-memory RBFGS is described in Section 5. The inverse Hessian approximation formula of Ring and Wirth's RBFGS is derived in Section 6, and numerical experiments are reported in Section 7 for all versions of the methods. Conclusions are presented and future work suggested in Section 8.

2. RBroyden family of methods. The RBroyden family of methods is stated in Algorithm 1, and the details, in particular the Hessian approximation update formulas in Steps 6 and 7, are explained next. A basic background in differential geometry is assumed; such a background can be found, e.g., in [7, 2]. The concepts of retraction and vector transport can be found in [2] or [25]. Practically, the input statement of Algorithm 1 means the following.
The retraction R is a C^2 mapping (the C^2 assumption is used in Lemma 3.5) from the tangent bundle TM onto M such that (i) R(0_x) = x for all x ∈ M (where 0_x denotes the origin of T_xM) and (ii) (d/dt) R(t ξ_x)|_{t=0} = ξ_x for all ξ_x ∈ T_xM. The restriction of R to T_xM is denoted by R_x.

The domain of R does not need to be the whole tangent bundle, but in practice it often is, and in this paper we make the blanket assumption that R is defined wherever needed.

Algorithm 1 RBroyden family
Input: Riemannian manifold M with Riemannian metric g; retraction R; isometric vector transport T_S, with R as associated retraction, that satisfies (2.8); continuously differentiable real-valued function f on M; initial iterate x_0 ∈ M; initial Hessian approximation B_0, which is a linear transformation of the tangent space T_{x_0}M that is symmetric positive definite with respect to the metric g; convergence tolerance ε > 0; Wolfe condition constants 0 < c_1 < 1/2 < c_2 < 1;
1: k ← 0;
2: while ||grad f(x_k)|| > ε do
3: Obtain η_k ∈ T_{x_k}M by solving B_k η_k = −grad f(x_k);
4: Compute α_k > 0 from a line search procedure so that the Wolfe conditions hold:
(2.1) f(R_{x_k}(α_k η_k)) ≤ f(x_k) + c_1 α_k g(grad f(x_k), η_k),
(2.2) (d/dt) f(R(t η_k))|_{t=α_k} ≥ c_2 (d/dt) f(R(t η_k))|_{t=0};
5: Set x_{k+1} = R_{x_k}(α_k η_k);
6: Define s_k = T_{S_{α_k η_k}} α_k η_k and y_k = β_k^{-1} grad f(x_{k+1}) − T_{S_{α_k η_k}} grad f(x_k), where β_k = ||α_k η_k|| / ||T_{R_{α_k η_k}} α_k η_k|| and T_R is the differentiated retraction of R;
7: Define the linear operator B_{k+1}: T_{x_{k+1}}M → T_{x_{k+1}}M by
B_{k+1} p = B̃_k p − [g(s_k, B̃_k p)/g(s_k, B̃_k s_k)] B̃_k s_k + [g(y_k, p)/g(y_k, s_k)] y_k + φ_k g(s_k, B̃_k s_k) g(v_k, p) v_k, for all p ∈ T_{x_{k+1}}M,
or equivalently
(2.3) B_{k+1} = B̃_k − [B̃_k s_k (B̃_k s_k)^♭]/[(B̃_k s_k)^♭ s_k] + [y_k y_k^♭]/[y_k^♭ s_k] + φ_k g(s_k, B̃_k s_k) v_k v_k^♭,
where v_k = y_k/g(y_k, s_k) − B̃_k s_k/g(s_k, B̃_k s_k), φ_k is any number in the open interval (φ_k^c, ∞), B̃_k = T_{S_{α_k η_k}} ∘ B_k ∘ T_{S_{α_k η_k}}^{-1}, φ_k^c = 1/(1 − u_k), u_k = g(y_k, B̃_k^{-1} y_k) g(s_k, B̃_k s_k)/g(y_k, s_k)^2, g(·,·) denotes the Riemannian metric, and a^♭ represents the flat of a, i.e., a^♭: T_xM → R: v ↦ g(a, v);
8: k ← k + 1;
9: end while

It can be shown using the Cauchy-Schwarz inequality that u_k ≥ 1, and u_k = 1 if and only if there exists a constant κ such that y_k = κ B̃_k s_k.
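As a point of reference for Step 7, the following is a minimal Python (NumPy) sketch of the Broyden-class update (2.3) written in the coordinates of an orthonormal basis of T_{x_{k+1}}M, so that the metric is the identity and the flat a^♭ is the ordinary transpose. The function name and the secant check are illustrative and not part of Algorithm 1.

import numpy as np

def broyden_update(B_tilde, s, y, phi):
    # B_tilde: coordinate matrix of the transported approximation \tilde{B}_k
    #          (symmetric positive definite).
    # s, y:    coordinate vectors of s_k and y_k; requires s @ y > 0.
    # phi:     Broyden parameter, any value larger than phi_k^c = 1/(1 - u_k).
    Bs = B_tilde @ s
    sBs = s @ Bs                      # g(s_k, \tilde{B}_k s_k)
    ys = y @ s                        # g(y_k, s_k)
    v = y / ys - Bs / sBs             # v_k from Step 7
    return (B_tilde
            - np.outer(Bs, Bs) / sBs  # remove old curvature along s_k
            + np.outer(y, y) / ys     # enforce the secant equation
            + phi * sBs * np.outer(v, v))

Because v_k^♭ s_k = 0, every member of the family returns the same vector B_{k+1} s_k = y_k, which is the secant equation mentioned in the introduction; for instance, np.allclose(broyden_update(B, s, y, 0.5) @ s, y) holds for any admissible data.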

A vector transport T: TM ⊕ TM → TM, (η_x, ξ_x) ↦ T_{η_x} ξ_x, with associated retraction R is a smooth mapping such that, for all (x, η_x) in the domain of R and all ξ_x, ζ_x ∈ T_xM, it holds that (i) T_{η_x} ξ_x ∈ T_{R(η_x)}M, (ii) T_{0_x} ξ_x = ξ_x, and (iii) T_{η_x} is a linear map. In Algorithm 1, the vector transport T_S is isometric, i.e., it additionally satisfies (iv)
(2.4) g_{R(η_x)}(T_{S_{η_x}} ξ_x, T_{S_{η_x}} ζ_x) = g_x(ξ_x, ζ_x).
In most practical cases, T_{S_η} exists for all η, and we make this assumption throughout. In fact, we do not require that the isometric vector transport be smooth. The required properties are T_S ∈ C^0 and, for any x̄ ∈ M, there exist a neighborhood U of x̄ and a constant c_0 such that for all x, y ∈ U
(2.5) ||T_{S_η} − T_{R_η}|| ≤ c_0 ||η||,
(2.6) ||T_{S_η}^{-1} − T_{R_η}^{-1}|| ≤ c_0 ||η||,
where η = R_x^{-1}(y), T_R denotes the differentiated retraction, i.e.,
(2.7) T_{R_{η_x}} ξ_x = DR(η_x)[ξ_x] = (d/dt) R_x(η_x + t ξ_x)|_{t=0},
and ||·|| denotes the induced norm of the Riemannian metric g. In the following analysis, we use only these two properties of the isometric vector transport.

In Algorithm 1, we require the isometric vector transport T_S to satisfy the locking condition
(2.8) T_{S_ξ} ξ = β T_{R_ξ} ξ, with β = ||ξ|| / ||T_{R_ξ} ξ||,
for all ξ ∈ T_xM and all x ∈ M. Practical ways of building such a T_S are discussed in Section 4. Observe that, throughout Algorithm 1, the differentiated retraction T_R only appears in the form T_{R_ξ} ξ, which is equal to (d/dt) R(tξ)|_{t=1}. Hence T_{R_{α_k η_k}} α_k η_k is just the velocity vector of the line search curve α ↦ R(α η_k) at time α_k, and we are only required to be able to evaluate the differentiated retraction in the direction transported. The computational efficiency that results is also discussed in Section 4.

The isometry condition (2.4) and the locking condition (2.8) are imposed on T_S notably because, as shown in Lemma 2.1, they ensure that the second Wolfe condition (2.2) implies g(s_k, y_k) > 0. Much as in the Euclidean case, it is essential that g(s_k, y_k) > 0; otherwise the secant condition B_{k+1} s_k = y_k cannot hold with B_{k+1} positive definite, whereas positive definiteness of the B_k's is key to guaranteeing that the search directions η_k are descent directions. It is possible to state Algorithm 1 without imposing the isometry and locking conditions, but then it becomes an open question whether the main convergence results would still hold. Clearly, some intermediate results would fail to hold and, assuming that the main results still hold, a completely different approach would probably be required to prove them.

When φ_k = 0, the updating formula (2.3) reduces to the Riemannian BFGS formula of [25]. However, a crucial difference between Algorithm 1 and the Riemannian BFGS of [25] lies in the definition of y_k. Its definition in [25] corresponds to setting β_k to 1 instead of ||α_k η_k|| / ||T_{R_{α_k η_k}} α_k η_k||. Our choice of β_k allows for a convergence analysis under more general assumptions than those of the convergence analysis of Qi [23]. Indeed, the convergence analysis of the Riemannian BFGS of [25], found in [23], assumes that the retraction R is set to the exponential mapping and that the vector transport T_S is set to the parallel transport. These specific choices remain legitimate in Algorithm 1, hence the convergence analysis given here subsumes the one in [23]; however, several other choices become possible, as discussed in more detail in Section 4.

Lemma 2.1 proves that Algorithm 1 is well-defined for φ_k ∈ (φ_k^c, ∞), where φ_k^c is defined in Step 7 of Algorithm 1.

Lemma 2.1. Algorithm 1 constructs infinite sequences {x_k}, {B_k}, {B̃_k}, {α_k}, {s_k}, and {y_k}, unless the stopping criterion in Step 2 is satisfied at some iteration. For all k, the Hessian approximation B_k is symmetric positive definite with respect to the metric g, η_k ≠ 0, and
(2.9) g(s_k, y_k) ≥ (c_2 − 1) α_k g(grad f(x_k), η_k).

Proof. We first show that (2.9) holds when all the involved quantities exist and η_k ≠ 0. Define m_k(t) = f(R_{x_k}(t η_k / ||η_k||)). We have
g(s_k, y_k) = g(T_{S_{α_k η_k}} α_k η_k, β_k^{-1} grad f(x_{k+1}) − T_{S_{α_k η_k}} grad f(x_k))
= g(T_{S_{α_k η_k}} α_k η_k, β_k^{-1} grad f(x_{k+1})) − g(T_{S_{α_k η_k}} α_k η_k, T_{S_{α_k η_k}} grad f(x_k))
= g(β_k^{-1} T_{S_{α_k η_k}} α_k η_k, grad f(x_{k+1})) − g(α_k η_k, grad f(x_k))   (by isometry)
= g(T_{R_{α_k η_k}} α_k η_k, grad f(x_{k+1})) − g(α_k η_k, grad f(x_k))   (by (2.8))
(2.10) = α_k ||η_k|| [ (dm_k/dt)(α_k ||η_k||) − (dm_k/dt)(0) ].
Note that guaranteeing (2.10), which will be used frequently, is the key reason for imposing the locking condition (2.8). From the second Wolfe condition (2.2), we have
(2.11) (dm_k/dt)(α_k ||η_k||) ≥ c_2 (dm_k/dt)(0).
Therefore, it follows that
(2.12) (dm_k/dt)(α_k ||η_k||) − (dm_k/dt)(0) ≥ (c_2 − 1) (dm_k/dt)(0) = (c_2 − 1) (1/||η_k||) g(grad f(x_k), η_k).
The claim (2.9) follows from (2.10) and (2.12). When B_k is symmetric positive definite, η_k is a descent direction. Observe that the function α ↦ f(R(α η_k)) remains a continuously differentiable function from R to R which is bounded below. Therefore, the classical result in [22, Lemma 3.1] guarantees the existence of a step size α_k that satisfies the Wolfe conditions.

The claims are proved by induction. They hold for k = 0 in view of the assumptions on B_0, of Step 3, and of the results above. Assume now that the claims hold for some k. From (2.9), we have
g(s_k, y_k) ≥ (1 − c_2) α_k g(grad f(x_k), −η_k) = (1 − c_2) α_k g(grad f(x_k), B_k^{-1} grad f(x_k)) > 0.
Recall that, in the Euclidean case, s_k^T y_k > 0 is a necessary and sufficient condition for the existence of a positive-definite secant update (see [14, Lemma 9.2.1]), and that BFGS is such an update [14, (9.2.10)]. From the generalization of these results to the Riemannian case (see [23, Lemmas 2.4.1 and 2.4.2]), it follows that B_{k+1} is symmetric positive definite when φ_k = 0.

Consider the function h(φ_k): R → R^d which gives the eigenvalues of B_{k+1}. Since B_{k+1} is symmetric positive definite when φ_k = 0, we know all entries of h(0) are greater than 0. By calculations similar to those for the Euclidean case [10], we have
det(B_{k+1}) = det(B̃_k) [g(y_k, s_k)/g(s_k, B̃_k s_k)] (1 + φ_k(u_k − 1)),
where φ_k and u_k are defined in Step 7 of Algorithm 1. So det(B_{k+1}) = 0 if and only if φ_k = φ_k^c < 0. In other words, h(φ_k) has one or more 0 entries if and only if φ_k = φ_k^c. In addition, since all entries of h(0) are greater than 0 and h(φ_k) is a continuous function, we have that all entries of h(φ_k) are greater than 0 if and only if φ_k > φ_k^c. Therefore, the operator B_{k+1} is positive definite when φ_k > φ_k^c. Noting that the vector transport is isometric in (2.3), the symmetry of B_{k+1} is easily verified.

3. Global convergence analysis. In this section, global convergence is proven under a generalized convexity assumption and for φ_k ∈ [0, 1 − δ], where δ is any number in (0, 1]. The behavior of the Riemannian Broyden methods with φ_k not necessarily in this interval is explored in the experiments. Note that the result derived in this section also guarantees local convergence to an isolated local minimizer.

3.1. Basic assumptions and definitions. Throughout the convergence analysis, {x_k}, {B_k}, {B̃_k}, {α_k}, {s_k}, {y_k}, and {η_k} are infinite sequences generated by Algorithm 1, Ω denotes the sublevel set {x : f(x) ≤ f(x_0)}, and x* is a local minimizer of f in the level set Ω. The existence of such an x* is guaranteed if Ω is compact, which happens, in particular, whenever the manifold M is compact. Coordinate expressions in a neighborhood and in tangent spaces are used when appropriate. For an element of the manifold, v ∈ M, v̂ ∈ R^d denotes the coordinates defined by a chart ϕ over a neighborhood U, i.e., v̂ = ϕ(v) for v ∈ U. Coordinate expressions, F̂(x), for elements, F(x), of a vector field F on M are written in terms of the canonical basis of the associated tangent space, T_xM, via the coordinate vector fields defined by the chart ϕ.

The convergence analysis depends on the property of (strong) retraction-convexity formalized in Definition 3.1 and the following three additional assumptions.

Definition 3.1. For a function f: M → R: x ↦ f(x) on a Riemannian manifold M with retraction R, define m_{x,η}(t) = f(R_x(tη)) for x ∈ M and η ∈ T_xM. The function f is retraction-convex with respect to the retraction R in a set S if, for all x ∈ S and all η ∈ T_xM with ||η|| = 1, m_{x,η}(t) is convex for all t which satisfy R_x(τη) ∈ S for all τ ∈ [0, t]. Moreover, f is strongly retraction-convex in S if m_{x,η}(t) is strongly convex, i.e., there exist two constants 0 < a_0 < a_1 such that a_0 ≤ (d²m_{x,η}/dt²)(t) ≤ a_1, for all x ∈ S, all ||η|| = 1, and all t such that R_x(τη) ∈ S for all τ ∈ [0, t].

Assumption 3.1. The objective function f is twice continuously differentiable. Let Ω* be a neighborhood of x* and ρ be a positive constant such that, for all y ∈ Ω*, Ω* ⊆ R_y(B(0_y, ρ)) and R_y(·) is a diffeomorphism on B(0_y, ρ). The existence of Ω*, termed a ρ-totally retractive neighborhood of x*, is guaranteed [18, Section 3.3]. Shrinking Ω* if necessary, we further assume that it is an R-star shaped neighborhood of x*, i.e., R_{x*}(t R_{x*}^{-1}(x)) ∈ Ω* for all x ∈ Ω* and t ∈ [0, 1].

The next two assumptions are a Riemannian generalization of a weakened version of [11, Assumption 2.1]: in the Euclidean setting (M is the Euclidean space R^n and the retraction R is the standard one, i.e., R_x(η) = x + η), if [11, Assumption 2.1] holds, then the next two assumptions are satisfied by letting Ω* be the sublevel set Ω.

Assumption 3.2. The iterates x_k stay continuously in Ω*, meaning that R_{x_k}(t η_k) ∈ Ω* for all t ∈ [0, α_k].
Observe that, in view of Lemma 2.1 (B_k remains symmetric positive definite with respect to g), for all K > 0, the sequences {x_k}, {B_k}, {B̃_k}, {α_k}, {s_k}, {y_k}, and {η_k}, for k ≥ K, are still generated by Algorithm 1.

Hence, Assumption 3.2 amounts to requiring that the iterates x_k eventually stay continuously in Ω*. Note that Assumption 3.2 cannot be removed. To see this, consider for example the unit sphere with the exponential retraction, where we can have x_k = x_{k+1} with α_k ||η_k|| = 2π. (A similar comment was made in [18] before Lemma 3.6.)

Assumption 3.3. f is strongly retraction-convex (Definition 3.1) with respect to the retraction R in Ω*.

The definition of retraction-convexity generalizes standard Euclidean and Riemannian concepts. It is easily seen that (strong) retraction-convexity reduces to (strong) convexity when the function is defined on a Euclidean space. It can be shown that when R is the exponential mapping, (strong) retraction-convexity is ordinary (strong) convexity for a C^2 function based on the definiteness of its Hessian. It can also be shown that a set S as in Definition 3.1 always exists around a nondegenerate local minimizer of a C^2 function (Lemma 3.1).

Lemma 3.1. Suppose Assumption 3.1 holds and Hess f(x*) is positive definite. Define m_{x,η}(t) = f(R_x(tη)). Then there exist a ϖ-totally retractive neighborhood N of x* and two constants 0 < ã_0 < ã_1 such that
(3.1) ã_0 ≤ (d²m_{x,η}/dt²)(t) ≤ ã_1,
for all x ∈ N, η ∈ T_xM with ||η|| = 1, and |t| < ϖ.

Proof. By definition, we have
(d/dt) f(R_x(tη)) = g(grad f(R_x(tη)), DR_x(tη)[η])
and
(d²/dt²) f(R_x(tη)) = g(Hess f(R_x(tη))[DR_x(tη)[η]], DR_x(tη)[η]) + g(grad f(R_x(tη)), (D/dt) DR_x(tη)[η]),
where the definition of D/dt can be found in [2, Section 5.4]. Since DR_x(tη)[η]|_{t=0} = η and grad f(R_x(tη))|_{t=0, x=x*} = 0_{x*}, it holds that (d²/dt²) f(R_x(tη))|_{t=0, x=x*} = g(Hess f(x*)[η], η). In addition, since Hess f(x*) is positive definite, there exist two constants 0 < b_0 < b_1 such that the inequalities b_0 < (d²/dt²) f(R_x(tη))|_{t=0, x=x*} < b_1 hold for all η ∈ T_{x*}M with ||η|| = 1. Note that (d²/dt²) f(R_x(tη)) is continuous with respect to (x, t, η). Therefore, there exist a neighborhood U of x* and a neighborhood V of 0 such that b_0/2 < (d²/dt²) f(R_x(tη)) < 2 b_1 for all (x, t) ∈ U × V, η ∈ T_xM and ||η|| = 1. Choose ϖ > 0 such that [−ϖ, ϖ] ⊆ V and let N ⊆ U be a ϖ-totally retractive neighborhood of x*.

3.2. Preliminary lemmas. The lemmas in this section provide the results needed to show the global convergence stated in Theorem 3.1. The strategy generalizes that for the Euclidean case in [11]. Where appropriate, comments are included indicating important adaptations of the reasoning to Riemannian manifolds. The main adaptations originate from the fact that the Euclidean analysis exploits the relationship y_k = Ḡ_k s_k ([11, (2.3)]), where Ḡ_k is an average Hessian of f, and this relationship does not gracefully generalize to Riemannian manifolds (unless the isometric vector transport T_S in Algorithm 1 is chosen as the parallel transport, a choice we often want to avoid in view of its computational cost). This difficulty requires alternative approaches in several proofs. The key idea of the approaches is to make use of the scaled function m_{x,η}(t) rather than f, a well-known strategy in the Riemannian setting.

The first result, Lemma 3.2, is used to prove Lemma 3.4.

Lemma 3.2. If Assumptions 3.1, 3.2 and 3.3 hold, then there exists a constant a_0 > 0 such that
(3.2) (1/2) a_0 ||s_k||^2 ≤ (c_1 − 1) α_k g(grad f(x_k), η_k).
The constant a_0 can be chosen as in Definition 3.1.

Proof. In Euclidean space, Taylor's Theorem is used to characterize a function around a point. A generalization of Taylor's formula to Riemannian manifolds was proposed in [32, Remark 3.2], but it is restricted to the exponential mapping rather than allowing for an arbitrary retraction. This difficulty is overcome by defining a function on a curve on the manifold and applying Taylor's Theorem. Define m_k(t) = f(R_{x_k}(t η_k / ||η_k||)). Since f ∈ C^2 is strongly retraction-convex on Ω* by Assumption 3.3, there exist constants 0 < a_0 < a_1 such that a_0 ≤ (d²m_k/dt²)(t) ≤ a_1 for all t ∈ [0, α_k ||η_k||]. From Taylor's theorem, we know
(3.3) f(x_{k+1}) − f(x_k) = m_k(α_k ||η_k||) − m_k(0)
= (dm_k/dt)(0) α_k ||η_k|| + (1/2) (d²m_k/dt²)(p_k) (α_k ||η_k||)^2
= g(grad f(x_k), α_k η_k) + (1/2) (d²m_k/dt²)(p_k) (α_k ||η_k||)^2
≥ g(grad f(x_k), α_k η_k) + (1/2) a_0 (α_k ||η_k||)^2,
where 0 ≤ p_k ≤ α_k ||η_k||. Using (3.3), the first Wolfe condition (2.1) and that ||s_k|| = α_k ||η_k||, we obtain (c_1 − 1) g(grad f(x_k), α_k η_k) ≥ a_0 ||s_k||^2/2.

Lemma 3.3 generalizes [11, (2.4)].

Lemma 3.3. If Assumptions 3.1, 3.2 and 3.3 hold, then there exist two constants 0 < a_0 ≤ a_1 such that
(3.4) a_0 g(s_k, s_k) ≤ g(s_k, y_k) ≤ a_1 g(s_k, s_k), for all k.
The constants a_0 and a_1 can be chosen as in Definition 3.1.

Proof. In the Euclidean case of [11, (2.4)], the proof follows easily from the convexity of the cost function and the resulting positive definiteness of the Hessian over the entire relevant set. The Euclidean proof exploits the relationship y_k = Ḡ_k s_k, where Ḡ_k is the average Hessian, and that Ḡ_k must be positive definite to bound the inner product s_k^T y_k using the largest and smallest eigenvalues, which can in turn be bounded on the relevant set. We do not have this property on a Riemannian manifold, but the locking condition, retraction-convexity, and replacing the average Hessian with a quantity derived from a function defined on a curve on the manifold allow the generalization.

Define m_k(t) = f(R_{x_k}(t η_k / ||η_k||)). Using the locking condition (2.10) and Taylor's Theorem yields
g(s_k, y_k) = α_k ||η_k|| [ (dm_k/dt)(α_k ||η_k||) − (dm_k/dt)(0) ] = α_k ||η_k|| ∫_0^{α_k ||η_k||} (d²m_k/dt²)(τ) dτ,
and since g(s_k, s_k) = α_k^2 ||η_k||^2, we have
g(s_k, y_k)/g(s_k, s_k) = (1/(α_k ||η_k||)) ∫_0^{α_k ||η_k||} (d²m_k/dt²)(τ) dτ.

By Assumption 3.3, it follows that a_0 ≤ g(s_k, y_k)/g(s_k, s_k) ≤ a_1.

Lemma 3.4 generalizes [11, Lemma 2.1].

Lemma 3.4. Suppose Assumptions 3.1, 3.2 and 3.3 hold. Then there exist two constants 0 < a_2 < a_3 such that
(3.5) a_2 ||grad f(x_k)|| cos θ_k ≤ ||s_k|| ≤ a_3 ||grad f(x_k)|| cos θ_k, for all k,
where cos θ_k = −g(grad f(x_k), η_k)/(||grad f(x_k)|| ||η_k||), i.e., θ_k is the angle between the search direction η_k and the steepest descent direction −grad f(x_k).

Proof. Define m_k(t) = f(R_{x_k}(t η_k/||η_k||)). By (2.9) of Lemma 2.1, we have
g(s_k, y_k) ≥ α_k (c_2 − 1) g(grad f(x_k), η_k) = α_k (1 − c_2) ||grad f(x_k)|| ||η_k|| cos θ_k.
Using (3.4) and noticing α_k ||η_k|| = ||s_k||, we know ||s_k|| ≥ a_2 ||grad f(x_k)|| cos θ_k, where a_2 = (1 − c_2)/a_1, proving the left inequality. By (3.2) of Lemma 3.2, we have (c_1 − 1) g(grad f(x_k), α_k η_k) ≥ a_0 ||s_k||^2/2. Noting ||s_k|| = α_k ||η_k|| and by the definition of cos θ_k, we have ||s_k|| ≤ a_3 ||grad f(x_k)|| cos θ_k, where a_3 = 2(1 − c_1)/a_0.

Lemma 3.5 is needed to prove Lemmas 3.6 and 3.9. Lemma 3.5 gives a Lipschitz-like relationship between two related vector transports applied to the same tangent vector. T_1 below plays the same role as T_S in Algorithm 1.

Lemma 3.5. Let M be a Riemannian manifold endowed with two vector transports T_1 ∈ C^0 and T_2 ∈ C^∞, where T_1 satisfies (2.5) and (2.6) and both transports are associated with the same retraction R. Then for any x̄ ∈ M there exist a constant a_4 > 0 and a neighborhood U of x̄ such that for all x, y ∈ U,
||T_{1_η} ξ − T_{2_η} ξ|| ≤ a_4 ||ξ|| ||η||,
where η = R_x^{-1} y and ξ ∈ T_xM.

Proof. Let L_R(x̂, η̂) and L_2(x̂, η̂) denote the coordinate forms of T_{R_η} and T_{2_η}, respectively. Since all norms are equivalent on a finite dimensional space, there exist 0 < b_1(x) < b_2(x) such that b_1(x) ||ξ_x|| ≤ ||ξ̂_x||_2 ≤ b_2(x) ||ξ_x|| for all x ∈ U, where ||·||_2 denotes the Euclidean norm, i.e., the 2-norm. By choosing U compact, it follows that b_1 ||ξ_x|| ≤ ||ξ̂_x||_2 ≤ b_2 ||ξ_x|| for all x ∈ U, where 0 < b_1 < b_2, b_1 = min_{x∈U} b_1(x) and b_2 = max_{x∈U} b_2(x). It follows that
||T_{1_η} ξ − T_{2_η} ξ|| = ||T_{1_η} ξ − T_{R_η} ξ + T_{R_η} ξ − T_{2_η} ξ||
≤ c_0 ||η|| ||ξ|| + ||T_{R_η} ξ − T_{2_η} ξ||
≤ c_0 ||η|| ||ξ|| + (1/b_0) ||(L_R(x̂, η̂) − L_2(x̂, η̂)) ξ̂||_2
≤ c_0 ||η|| ||ξ|| + (1/b_0) ||ξ̂||_2 ||L_R(x̂, η̂) − L_2(x̂, η̂)||_2
≤ c_0 ||η|| ||ξ|| + b_3 ||ξ̂||_2 ||η̂||_2   (since L_R(x̂, 0) = L_2(x̂, 0), L_2 is smooth, and L_R ∈ C^1 because R ∈ C^2)
≤ a_4 ||ξ|| ||η||,
where b_0 and b_3 are positive constants.

Lemma 3.6 is a consequence of Lemma 3.5.

Lemma 3.6. Let M be a Riemannian manifold endowed with a retraction R whose differentiated retraction is denoted T_R. Let x̄ ∈ M. Then there are a neighborhood U of x̄ and a constant ã_4 > 0 such that for all x, y ∈ U and any ξ ∈ T_xM with ||ξ|| = 1, the effect of the differentiated retraction is bounded with
| ||T_{R_η} ξ|| − 1 | ≤ ã_4 ||η||,
where η = R_x^{-1} y.

Proof. Applying Lemma 3.5 with T_1 = T_R and T_2 isometric, we have ||T_{R_η} ξ − T_{2_η} ξ|| ≤ b_0 ||ξ|| ||η||, where b_0 is a positive constant. Noticing ||ξ|| = 1 and that ||·|| is the induced norm, we have, by an application of the triangle inequality,
b_0 ||η|| ≥ ||T_{R_η} ξ − T_{2_η} ξ|| ≥ ||T_{R_η} ξ|| − ||T_{2_η} ξ|| = ||T_{R_η} ξ|| − 1.
Similarly, we have b_0 ||η|| ≥ ||T_{2_η} ξ − T_{R_η} ξ|| ≥ ||T_{2_η} ξ|| − ||T_{R_η} ξ|| = 1 − ||T_{R_η} ξ||, which completes the proof.

Lemma 3.7 generalizes [11, (2.13)] and implies a generalization of the Zoutendijk condition [22, Theorem 3.2], i.e., if cos θ_k does not approach 0, then according to this lemma, the algorithm is convergent.

Lemma 3.7. Suppose Assumptions 3.1, 3.2 and 3.3 hold. Then there exists a constant a_5 > 0 such that, for all k,
f(x_{k+1}) − f(x*) ≤ (1 − a_5 cos^2 θ_k)(f(x_k) − f(x*)),
where cos θ_k = −g(grad f(x_k), η_k)/(||grad f(x_k)|| ||η_k||).

Proof. The original proof in [11, (2.13)] uses the average Hessian. As when proving Lemma 3.2, this is replaced by considering a function defined on a curve on the manifold. Let z_k = ||R_{x*}^{-1} x_k|| and ζ_k = (R_{x*}^{-1} x_k)/z_k. Define m_k(t) = f(R_{x*}(t ζ_k)). From Taylor's Theorem, we have
(3.6) m_k(0) − m_k(z_k) = (dm_k/dt)(z_k)(0 − z_k) + (1/2)(d²m_k/dt²)(p_k)(0 − z_k)^2,
where p_k is some number between 0 and z_k. Notice that x* is the minimizer, so m_k(0) − m_k(z_k) ≤ 0. Also note that a_0 ≤ (d²m_k/dt²)(p_k) ≤ a_1 by Assumption 3.3 and Definition 3.1. We thus have
(3.7) (dm_k/dt)(z_k) ≥ (1/2) a_0 z_k.
Still using (3.6) and noticing that (d²m_k/dt²)(p_k)(0 − z_k)^2 ≥ 0, we have
(3.8) f(x_k) − f(x*) ≤ (dm_k/dt)(z_k) z_k.
Combining (3.7) and (3.8) and noticing that (dm_k/dt)(z_k) = g(grad f(x_k), T_{R_{z_k ζ_k}} ζ_k), we have
f(x_k) − f(x*) ≤ (2/a_0) g^2(grad f(x_k), T_{R_{z_k ζ_k}} ζ_k) and
(3.9) f(x_k) − f(x*) ≤ (2/a_0) ||grad f(x_k)||^2 ||T_{R_{z_k ζ_k}} ζ_k||^2 ≤ b_0 ||grad f(x_k)||^2,
by Lemma 3.6 and the assumptions on Ω*, where b_0 is a positive constant. Using (3.5), the first Wolfe condition (2.1) and the definition of cos θ_k, we obtain f(x_{k+1}) − f(x_k) ≤ −b_1 ||grad f(x_k)||^2 cos^2 θ_k, where b_1 is some positive constant. Using (3.9), we obtain
f(x_{k+1}) − f(x*) = f(x_{k+1}) − f(x_k) + f(x_k) − f(x*) ≤ −b_1 ||grad f(x_k)||^2 cos^2 θ_k + f(x_k) − f(x*) ≤ −(b_1/b_0) cos^2 θ_k (f(x_k) − f(x*)) + f(x_k) − f(x*) = (1 − a_5 cos^2 θ_k)(f(x_k) − f(x*)),
where a_5 = b_1/b_0 is a positive constant.

Lemma 3.8 generalizes [11, Lemma 2.2].

Lemma 3.8. Suppose Assumptions 3.1, 3.2 and 3.3 hold. Then there exist two constants 0 < a_6 < a_7 such that
(3.10) a_6 g(s_k, B̃_k s_k)/||s_k||^2 ≤ α_k ≤ a_7 g(s_k, B̃_k s_k)/||s_k||^2, for all k.

Proof. We have
(1 − c_2) g(s_k, B̃_k s_k) = (1 − c_2) g(α_k η_k, α_k B_k η_k) = (c_2 − 1) α_k^2 g(η_k, grad f(x_k))   (by Step 3 of Algorithm 1)
≤ α_k g(s_k, y_k)   (by (2.9) of Lemma 2.1)
≤ α_k a_1 ||s_k||^2   (by (3.4)).
Therefore, we have α_k ≥ a_6 g(s_k, B̃_k s_k)/||s_k||^2, where a_6 = (1 − c_2)/a_1, giving the left inequality. By (3.2) of Lemma 3.2, we have (c_1 − 1) α_k g(grad f(x_k), η_k) ≥ (1/2) a_0 ||s_k||^2. It follows that
(c_1 − 1) α_k g(grad f(x_k), η_k) = (1 − c_1) g(B_k η_k, α_k η_k) = [(1 − c_1)/α_k] g(s_k, B̃_k s_k).
Therefore, we have α_k ≤ a_7 g(s_k, B̃_k s_k)/||s_k||^2, where a_7 = 2(1 − c_1)/a_0, giving the right inequality.

Lemma 3.9 generalizes [11, Lemma 3.1, Equation (3.3)].

Lemma 3.9. Suppose Assumption 3.1 holds. Then there exists a constant a_9 > 0 such that, for all k,
(3.11) g(y_k, y_k) ≤ a_9 g(s_k, y_k).

Proof. Define y_k^P = grad f(x_{k+1}) − P_{γ_k}^{1←0} grad f(x_k), where γ_k(t) = R_{x_k}(t α_k η_k), i.e., the retraction line from x_k to x_{k+1}, and P_γ is the parallel transport along γ_k(t). Details about the definition of parallel transport can be found, e.g., in [2, Section 5.4]. From [18, Lemma 8], we have
||P_{γ_k}^{0←1} y_k^P − H̄_k α_k η_k|| ≤ b_0 ||α_k η_k||^2 = b_0 ||s_k||^2,
where H̄_k = ∫_0^1 P_{γ_k}^{0←t} Hess f(γ_k(t)) P_{γ_k}^{t←0} dt and b_0 > 0. It follows that
||y_k|| ≤ ||y_k − y_k^P|| + ||y_k^P||
= ||y_k − y_k^P|| + ||P_{γ_k}^{0←1} y_k^P||
≤ ||y_k − y_k^P|| + ||P_{γ_k}^{0←1} y_k^P − H̄_k α_k η_k|| + ||H̄_k α_k η_k||
≤ ||grad f(x_{k+1})/β_k − T_{S_{α_k η_k}} grad f(x_k) − grad f(x_{k+1}) + P_{γ_k}^{1←0} grad f(x_k)|| + ||H̄_k α_k η_k|| + b_0 ||s_k||^2
≤ ||grad f(x_{k+1})/β_k − grad f(x_{k+1})|| + ||P_{γ_k}^{1←0} grad f(x_k) − T_{S_{α_k η_k}} grad f(x_k)|| + ||H̄_k α_k η_k|| + b_0 ||s_k||^2
≤ b_1 ||s_k|| ||grad f(x_{k+1})|| + b_2 ||s_k|| ||grad f(x_k)|| + ||H̄_k α_k η_k|| + b_0 ||s_k||^2   (by Lemmas 3.5 and 3.6)
≤ b_1 ||s_k|| ||grad f(x_{k+1})|| + b_2 ||s_k|| ||grad f(x_k)|| + b_3 ||s_k|| + b_0 ||s_k||^2   (by Assumption 3.2 and Hess f being bounded above on a compact set)
≤ b_4 ||s_k||   (by Assumption 3.2),
where b_1, b_2, b_3, and b_4 are positive constants.

Therefore, by Lemma 3.3, we have g(y_k, y_k)/g(s_k, y_k) ≤ g(y_k, y_k)/(a_0 g(s_k, s_k)) ≤ b_4^2/a_0, giving the desired result.

Lemma 3.10 generalizes [11, Lemma 3.1] and, as with the earlier lemmas, the proof does not use an average Hessian.

Lemma 3.10. Suppose Assumptions 3.1, 3.2 and 3.3 hold. Then there exist constants a_10 > 0, a_11 > 0, a_12 > 0 such that
(3.12) g(s_k, B̃_k s_k)/g(s_k, y_k) ≤ a_10 α_k,
(3.13) ||B̃_k s_k||^2/g(s_k, B̃_k s_k) ≥ a_11 α_k / cos^2 θ_k,
(3.14) |g(y_k, B̃_k s_k)|/g(y_k, s_k) ≤ a_12 α_k / cos θ_k,
for all k.

Proof. By (2.9) of Lemma 2.1, we have g(s_k, y_k) ≥ (c_2 − 1) g(grad f(x_k), α_k η_k). So by Step 3 of Algorithm 1, we obtain g(s_k, y_k) ≥ [(1 − c_2)/α_k] g(s_k, B̃_k s_k) and therefore g(s_k, B̃_k s_k)/g(s_k, y_k) ≤ a_10 α_k, where a_10 = 1/(1 − c_2), proving (3.12).

Inequality (3.13) follows from
||B̃_k s_k||^2/g(s_k, B̃_k s_k) = α_k^2 ||grad f(x_k)||^2 / (α_k ||s_k|| ||grad f(x_k)|| cos θ_k)   (by Step 3 of Algorithm 1 and the definition of cos θ_k)
= α_k ||grad f(x_k)|| / (||s_k|| cos θ_k)
≥ a_11 α_k / cos^2 θ_k,   (by (3.5))
where a_11 > 0. Finally, inequality (3.14) follows from
|g(y_k, B̃_k s_k)|/g(s_k, y_k) ≤ α_k ||y_k|| ||grad f(x_k)|| / g(s_k, y_k)   (by Step 3 of Algorithm 1)
≤ a_9^{1/2} α_k ||grad f(x_k)|| / g^{1/2}(s_k, y_k)   (by (3.11))
≤ a_9^{1/2} α_k ||grad f(x_k)|| / (a_0^{1/2} ||s_k||)   (by (3.4))
≤ a_12 α_k / cos θ_k,   (by (3.5))
where a_12 is a positive constant.

Lemma 3.11 generalizes [11, Lemma 3.2].

Lemma 3.11. Suppose Assumptions 3.1, 3.2 and 3.3 hold and φ_k ∈ [0, 1]. Then there exists a constant a_13 > 0 such that
(3.15) ∏_{j=1}^{k} α_j ≥ a_13^k,
for all k ≥ 1.

Proof. The major difference between the Euclidean and Riemannian proofs is that in the Riemannian case we have two operators, B_k and B̃_k, as opposed to a single operator in the Euclidean case. Once we have proven that they have the same trace and determinant, the proof unfolds similarly to the Euclidean proof. The details are given for the reader's convenience.

Use a hat to denote the coordinate expressions of the operators B_k and B̃_k in Algorithm 1 and consider trace(B̂) and det(B̂). Note that trace(B̂) and det(B̂) are independent of the chosen basis. Since T_S is an isometric vector transport, T_{S_{α_k η_k}} is invertible for all k, and thus
trace(T̂_{S_{α_k η_k}} B̂_k T̂_{S_{α_k η_k}}^{-1}) = trace(B̂_k) and det(T̂_{S_{α_k η_k}} B̂_k T̂_{S_{α_k η_k}}^{-1}) = det(B̂_k),
so the coordinate expression of B̃_k has the same trace and determinant as B̂_k. From the update formula of B_{k+1} in Algorithm 1, the trace of the update formula is
(3.16) trace(B̂_{k+1}) = trace(B̂_k) + ||y_k||^2/g(y_k, s_k) + φ_k ||y_k||^2 g(s_k, B̃_k s_k)/g(y_k, s_k)^2 − (1 − φ_k) ||B̃_k s_k||^2/g(s_k, B̃_k s_k) − 2 φ_k g(y_k, B̃_k s_k)/g(y_k, s_k).
Recall that φ_k g(s_k, B̃_k s_k) ≥ 0. If we choose a particular basis such that the expression of the metric is the identity, then the Broyden update equation (2.3) is exactly the classical Broyden update equation, except that B_k is replaced by the coordinate expression of B̃_k, where B_k is defined in [11, (1.4)], and by [11, (3.9)] we have
(3.17) det(B̂_{k+1}) ≥ det(B̂_k) g(y_k, s_k)/g(s_k, B̃_k s_k).
Since det and g(·,·) are independent of the basis, it follows that (3.17) holds regardless of the chosen basis.

Using (3.11), (3.12), (3.13) and (3.14) in (3.16), we obtain
(3.18) trace(B̂_{k+1}) ≤ trace(B̂_k) + a_9 + φ_k a_9 a_10 α_k − a_11 (1 − φ_k) α_k / cos^2 θ_k + 2 φ_k a_12 α_k / cos θ_k.
Notice that
(3.19) α_k = α_k ||grad f(x_k)|| ||η_k|| cos θ_k / (−g(grad f(x_k), η_k)) = α_k ||B̃_k s_k|| ||s_k|| cos θ_k / g(s_k, B̃_k s_k) = [||B̃_k s_k|| cos θ_k / ||s_k||] [α_k ||s_k||^2 / g(s_k, B̃_k s_k)] ≤ a_7 ||B̃_k s_k|| cos θ_k / ||s_k||   (by (3.10)).
Since the fourth term in (3.18) is always negative, cos θ_k ≤ 1, φ_k ≥ 0, (3.19) and (3.18) imply that
trace(B̂_{k+1}) ≤ trace(B̂_k) + a_9 + (φ_k a_9 a_10 a_7 + 2 φ_k a_12 a_7) ||B̃_k s_k||/||s_k||.

Since ||B̃_k s_k||/||s_k|| ≤ ||B̃_k|| = σ_1 ≤ Σ_{i=1}^{d} σ_i = trace(B̃_k), where σ_1 ≥ σ_2 ≥ ... ≥ σ_d are the singular values of B̃_k, we have trace(B̂_{k+1}) ≤ a_9 + (1 + φ_k a_9 a_10 a_7 + 2 φ_k a_12 a_7) trace(B̂_k). This inequality implies that there exists a constant b_0 > 0 such that
(3.20) trace(B̂_{k+1}) ≤ b_0^{k+1}.
Using (3.12) and (3.17), we have
(3.21) det(B̂_{k+1}) ≥ [1/(a_10 α_k)] det(B̂_k) ≥ det(B̂_1) ∏_{j=1}^{k} 1/(a_10 α_j).
From the geometric/arithmetic mean inequality (for x_i ≥ 0, (∏_{i=1}^{d} x_i)^{1/d} ≤ Σ_{i=1}^{d} x_i/d) applied to the eigenvalues of B̂_{k+1}, we know det(B̂_{k+1}) ≤ (trace(B̂_{k+1})/d)^d, where d is the dimension of the manifold M. Therefore, by (3.20) and (3.21),
∏_{j=1}^{k} 1/(a_10 α_j) ≤ det(B̂_{k+1})/det(B̂_1) ≤ [1/det(B̂_1)] (trace(B̂_{k+1})/d)^d ≤ [1/det(B̂_1)] d^{−d} b_0^{d(k+1)}.
Thus there exists a constant a_13 > 0 such that ∏_{j=1}^{k} α_j ≥ a_13^k, for all k ≥ 1.

3.3. Main convergence result. With the preliminary lemmas in place, the main convergence result can be stated and proven in a manner that closely follows the Euclidean proof of [11, Theorem 3.1].

Theorem 3.1. Suppose Assumptions 3.1, 3.2 and 3.3 hold and φ_k ∈ [0, 1 − δ]. Then the sequence {x_k} generated by Algorithm 1 converges to a minimizer x* of f.

Proof. Inequality (3.18) can be written as
(3.22) trace(B̂_{k+1}) ≤ trace(B̂_k) + a_9 + t_k α_k,
where t_k = φ_k a_9 a_10 − a_11 (1 − φ_k)/cos^2 θ_k + 2 φ_k a_12 / cos θ_k. The proof is by contradiction. Assume cos θ_k → 0; then t_k → −∞ since φ_k is bounded away from 1, i.e., φ_k ≤ 1 − δ. So there exists a constant K_0 > 0 such that t_k < −2 a_9/a_13 for all k ≥ K_0. Using (3.22) and that B̂_{k+1} is positive definite, we have
(3.23) 0 < trace(B̂_{k+1}) ≤ trace(B̂_{K_0}) + a_9 (k + 1 − K_0) + Σ_{j=K_0}^{k} t_j α_j < trace(B̂_{K_0}) + a_9 (k + 1 − K_0) − (2 a_9/a_13) Σ_{j=K_0}^{k} α_j.
Applying the geometric/arithmetic mean inequality to (3.15), we get Σ_{j=1}^{k} α_j ≥ k a_13 and therefore
(3.24) Σ_{j=K_0}^{k} α_j ≥ k a_13 − Σ_{j=1}^{K_0 − 1} α_j.

Plugging (3.24) into (3.23), we obtain
0 < trace(B̂_{K_0}) + a_9 (k + 1 − K_0) − (2 a_9/a_13) [k a_13 − Σ_{j=1}^{K_0−1} α_j] = trace(B̂_{K_0}) + a_9 (1 − K_0) − a_9 k + (2 a_9/a_13) Σ_{j=1}^{K_0−1} α_j.
For k large enough, the right-hand side of the inequality is negative, which contradicts the assumption that cos θ_k → 0. Therefore there exist a constant ϑ and a subsequence such that cos θ_{k_j} > ϑ > 0 for all j, i.e., there is a subsequence that does not converge to 0. Applying Lemma 3.7 completes the proof.

4. Ensuring the locking condition. In order to apply an algorithm in the RBroyden family, we must specify a retraction R and an isometric vector transport T_S that satisfy the locking condition (2.8). The exponential mapping and parallel transport satisfy condition (2.8) with β = 1. However, for some manifolds, we do not have an analytical form of the exponential mapping and parallel transport. Even if a form is known, its evaluation may be unacceptably expensive. Two methods of constructing an isometric vector transport given a retraction, and a method for constructing a retraction and an isometric vector transport simultaneously, are discussed in this section. In practice, the choice of the pair must also consider whether an efficient implementation is possible.

In Sections 4.1 and 4.2, the differentiated retraction T_{R_η} ξ is only needed for η and ξ in the same direction, where η, ξ ∈ T_xM and x ∈ M. A reduction of complexity can follow from this restricted usage of T_R. We illustrate this point for the Stiefel manifold St(p, n) = {X ∈ R^{n×p} : X^T X = I_p}, as it is used in our experiments. Consider the retraction by polar decomposition [2, (4.8)],
(4.1) R_x(η) = (x + η)(I_p + η^T η)^{-1/2}.
For retraction (4.1), as shown in [17, (10.2.7)], the differentiated retraction is T_{R_η} ξ = z Λ + (I_n − z z^T) ξ (z^T(x + η))^{-1}, where z = R_x(η), Λ is the unique solution of the Sylvester equation
(4.2) z^T ξ − ξ^T z = Λ P + P Λ,
and P = z^T(x + η) = (I_p + η^T η)^{1/2}. Now consider the more specific task of computing T_{R_η} ξ for η and ξ in the same direction. Since T_{R_η} ξ is linear in ξ, we can assume without loss of generality that ξ = η. Then the solution of (4.2) has a closed form, i.e., Λ = P^{-1} x^T η P^{-1}. This can be seen from
z^T η − η^T z = z^T(η + x − x) − (η + x − x)^T z = x^T z − z^T x + z^T(η + x) − (η + x)^T z
= x^T z − z^T x = x^T (x + η)(I_p + η^T η)^{-1/2} − (I_p + η^T η)^{-1/2} (x + η)^T x
= x^T η P^{-1} − P^{-1} η^T x = x^T η P^{-1} + P^{-1} x^T η   (x^T η is a skew-symmetric matrix)
= P (P^{-1} x^T η P^{-1}) + (P^{-1} x^T η P^{-1}) P.
The closed-form solution of Λ yields a form, T_{R_η} η = (I_n − z P^{-1} η^T) η P^{-1}, with lower complexity. We are not aware of such a low-complexity closed-form solution for T_{R_η} ξ when η and ξ are not in the same direction.
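To illustrate the closed form just derived, here is a small Python (NumPy/SciPy) sketch that evaluates the polar retraction (4.1) and T_{R_η} η = (I_n − z P^{-1} η^T) η P^{-1} on St(p, n). The function names are ours, and scipy.linalg.sqrtm is used for the matrix square root P = (I_p + η^T η)^{1/2}.

import numpy as np
from scipy.linalg import sqrtm

def polar_retraction(X, eta):
    # R_X(eta) = (X + eta)(I_p + eta^T eta)^{-1/2}, see (4.1);
    # eta is assumed to be a tangent vector, i.e., X^T eta is skew-symmetric.
    p = X.shape[1]
    P = np.real(sqrtm(np.eye(p) + eta.T @ eta))   # P = (I_p + eta^T eta)^{1/2}
    return (X + eta) @ np.linalg.inv(P), P

def diff_retraction_along_step(X, eta):
    # Closed form of T_{R_eta} eta = (I_n - z P^{-1} eta^T) eta P^{-1},
    # valid when the two arguments of T_R point in the same direction.
    z, P = polar_retraction(X, eta)
    Pinv = np.linalg.inv(P)
    return (eta - z @ (Pinv @ (eta.T @ eta))) @ Pinv

The scalar β = ||η||/||T_{R_η} η|| needed in Step 6 of Algorithm 1 then costs only one additional Frobenius norm; a finite-difference quotient such as (R_X((1 + t)η) − R_X((1 − t)η))/(2t) for small t provides a quick consistency check of the closed form.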

4.1. Method 1: from a retraction and an isometric vector transport. Given a retraction R, if an associated isometric vector transport T_I for which there is an efficient implementation is known, then T_I can be modified so that it satisfies condition (2.8). Consider x ∈ M, η ∈ T_xM, y = R_x(η), and define the tangent vectors ξ_1 = T_{I_η} η and ξ_2 = β T_{R_η} η with the normalizing scalar β = ||η||/||T_{R_η} η||. The desired isometric vector transport is
(4.3) T_{S_η} ξ = [id − 2 ν_2 ν_2^♭/(ν_2^♭ ν_2)] [id − 2 ν_1 ν_1^♭/(ν_1^♭ ν_1)] T_{I_η} ξ,
where ν_1 = ξ_1 − ω and ν_2 = ω − ξ_2. Here ω can be any tangent vector in T_yM which satisfies ||ω|| = ||ξ_1|| = ||ξ_2||. If ω is chosen in the space spanned by {ξ_1, ξ_2}, e.g., ω = −ξ_1 or −ξ_2, then the resulting map is the well-known direct rotation from ξ_1 to ξ_2 in the inner product that defines ||·||. The use of the negative sign avoids numerical cancellation as ξ_1 approaches ξ_2, i.e., near convergence. Two Householder reflectors are used to preserve the orientation, which is sufficient to make T_S satisfy the consistency condition (ii) of vector transport. It can be shown that this T_S satisfies conditions (2.5) and (2.6) [17, Theorem 4.4.1].

4.2. Method 2: from a retraction and bases of tangent spaces. Method 1 modifies a given isometric vector transport T_I. In this section, we show how to construct an isometric vector transport T_I from a field of orthonormal tangent bases. Let d denote the dimension of the manifold M and let the function giving a basis of T_xM be B: x ↦ B(x) = (b_1, b_2, ..., b_d), where b_i, 1 ≤ i ≤ d, form an orthonormal basis of T_xM. Consider x ∈ M, η, ξ ∈ T_xM, y = R_x(η), B_1 = B(x) and B_2 = B(y). Define
B_1^♭: T_xM → R^d: η_x ↦ [g_x(b_1^{(1)}, η_x) ... g_x(b_d^{(1)}, η_x)]^T,
where b_i^{(1)} is the i-th element of B_1, and likewise for B_2^♭. Then T_I in Method 1 can be chosen to be defined by T_{I_η} = B_2 B_1^♭. Simplifying (4.3) yields the desired isometric vector transport
T_{S_η} ξ = B_2 [I − 2 v_2 v_2^T/(v_2^T v_2)] [I − 2 v_1 v_1^T/(v_1^T v_1)] B_1^♭ ξ,
where v_1 = B_1^♭ η − w and v_2 = w − β B_2^♭ T_{R_η} η. Here w can be any vector such that ||w|| = ||B_1^♭ η|| = ||β B_2^♭ T_{R_η} η||, and choosing w = −B_1^♭ η or −β B_2^♭ T_{R_η} η yields a direct rotation.

The problem therefore becomes how to build the function B. Absil et al. [2, p. 37] give an approach based on (U, ϕ), a chart of the manifold M, that yields a smooth basis field defined in the chart domain. E_i, the i-th coordinate vector field of (U, ϕ) on U, is defined by (E_i f)(x) := ∂_i (f ∘ ϕ^{-1})(ϕ(x)) = D(f ∘ ϕ^{-1})(ϕ(x))[e_i]. These coordinate vector fields are smooth and every vector field ξ on M admits the decomposition ξ = Σ_i (ξ ϕ_i) E_i on U, where ϕ_i is the i-th component of ϕ. The function B̃: x ↦ B̃(x) = (E_1, E_2, ..., E_d) is a smooth function that builds a basis on U. Finally, any orthogonalization method, such as the Gram-Schmidt algorithm or a QR decomposition, can be used to get an orthonormal basis, giving the function B: x ↦ B(x) = B̃(x) M(x), where M(x) is an upper triangular matrix with positive diagonal entries.
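The two-reflection construction used in Methods 1 and 2 can be sketched in coordinates as follows; this is a minimal Python (NumPy) illustration under the sign convention adopted above, with the particular choice ω = −ξ_2, and the function names are ours rather than the paper's. The composed map is isometric, acts as the identity on the orthogonal complement of span{ξ_1, ξ_2}, and sends ξ_1 to ξ_2.

import numpy as np

def householder_apply(v, x):
    # Apply the reflector id - 2 v v^T / (v^T v) to x (no-op if v = 0).
    vv = v @ v
    return x if vv == 0.0 else x - 2.0 * (v @ x) / vv * v

def direct_rotation_apply(xi1, xi2, x):
    # Isometric map sending xi1 to xi2 (assumes ||xi1|| = ||xi2||), built from
    # two Householder reflections through omega = -xi2: first xi1 -> omega,
    # then omega -> xi2.  No small Householder vectors arise when xi1 ~ xi2.
    omega = -xi2
    nu1 = xi1 - omega      # reflects xi1 onto omega
    nu2 = omega - xi2      # reflects omega onto xi2
    return householder_apply(nu2, householder_apply(nu1, x))

In Method 1 this map is applied, in T_yM, to T_{I_η} ξ with ξ_1 = T_{I_η} η and ξ_2 = β T_{R_η} η, so that the modified transport satisfies T_{S_η} η = β T_{R_η} η, which is the locking condition (2.8).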

4.3. Method 3: from a transporter. Let L(TM, TM) denote the fiber bundle with base space M × M such that the fiber over (x, y) ∈ M × M is L(T_xM, T_yM), the set of all linear maps from T_xM to T_yM. We define a transporter L on M to be a smooth section of the bundle L(TM, TM), that is, for (x, y) ∈ M × M, L(x, y) ∈ L(T_xM, T_yM) such that, for all x ∈ M, L(x, x) = id. Given a retraction R, it can be shown that T defined by
(4.4) T_{η_x} ξ_x = L(x, R_x(η_x)) ξ_x
is a vector transport with associated retraction R. Moreover, if L(x, y) is isometric from T_xM to T_yM, then the vector transport (4.4) is isometric. (The term transporter was used previously in [24] in the context of embedded submanifolds.)

In this section, it is assumed that an efficient transporter L is given, and we show that the retraction R defined by the differential equation
(4.5) (d/dt) R_x(tη) = L(x, R_x(tη)) η, R_x(0) = x,
and the resulting vector transport T defined by (4.4) satisfy the locking condition (2.8). To this end, let η be an arbitrary vector in T_xM. We have
T_η η = L(x, R_x(η)) η = (d/dt) R_x(tη)|_{t=1} = (d/dτ) R_x(η + τη)|_{τ=0} = T_{R_η} η,
where we have used (4.4), (4.5) and (2.7). That is, R and T satisfy the locking condition (2.8) with β = 1. For some manifolds, there exist transporters such that (4.5) has a closed-form solution, e.g., the Stiefel manifold [17] and the Grassmann manifold [17].

5. Limited-memory RBFGS. In the form of the RBroyden methods discussed above, explicit representations are needed for the operators B_k, B̃_k, T_{S_{α_k η_k}}, and T_{S_{α_k η_k}}^{-1}. These may not be available. Furthermore, even if explicit expressions are known, applying them may be unacceptably expensive computationally, e.g., the matrix multiplication required in the update of B_k. Generalizations of the Euclidean limited-memory BFGS method can solve this problem for RBFGS. The idea of limited-memory RBFGS (LRBFGS) is to store some number of the most recent s_k and y_k and to transport those vectors to the new tangent space rather than the entire matrix H_k. For RBFGS, the inverse update formula is
H_{k+1} = V_k^* H̃_k V_k + ρ_k s_k s_k^♭,
where H̃_k = T_{S_{α_k η_k}} ∘ H_k ∘ T_{S_{α_k η_k}}^{-1}, ρ_k = 1/g(y_k, s_k), V_k = id − ρ_k y_k s_k^♭, and V_k^* denotes the adjoint of V_k. If the l + 1 most recent s_k and y_k are stored, then we have
H_{k+1} = Ṽ_k^* Ṽ_{k−1}^* ⋯ Ṽ_{k−l}^* H_{k+1}^0 Ṽ_{k−l} ⋯ Ṽ_{k−1} Ṽ_k + ρ_{k−l} Ṽ_k^* ⋯ Ṽ_{k−l+1}^* s_{k−l}^{(k+1)} (s_{k−l}^{(k+1)})^♭ Ṽ_{k−l+1} ⋯ Ṽ_k + ⋯ + ρ_k s_k^{(k+1)} (s_k^{(k+1)})^♭,
where Ṽ_i = id − ρ_i y_i^{(k+1)} (s_i^{(k+1)})^♭, H_{k+1}^0 is the initial Hessian approximation for step k + 1, s_i^{(k+1)} represents a tangent vector in T_{x_{k+1}}M given by transporting s_i, and likewise for y_i^{(k+1)}. The details of transporting s_i, y_i to s_i^{(k+1)}, y_i^{(k+1)} are given later. Note that H_{k+1}^0 is not necessarily H̃_{k−l}. It can be any positive definite self-adjoint operator. Similarly to the Euclidean case, we use
(5.1) H_{k+1}^0 = [g(s_k, y_k)/g(y_k, y_k)] id.
It is easily seen that Steps 4 to 13 of Algorithm 2 yield η_{k+1} = −H_{k+1} grad f(x_{k+1}) (generalized from the two-loop recursion, see, e.g., [22, Algorithm 7.4]), and only the action of T_S is needed.

Algorithm 2 LRBFGS
Input: Riemannian manifold M with Riemannian metric g; retraction R; isometric vector transport T_S that satisfies (2.8); smooth function f on M; initial iterate x_0 ∈ M; an integer l > 0.
1: k = 0, ε > 0, 0 < c_1 < 1/2 < c_2 < 1, γ_0 = 1, ℓ = 0;
2: while ||grad f(x_k)|| > ε do
3: H_k^0 = γ_k id. Obtain η_k ∈ T_{x_k}M by the following algorithm:
4: q ← grad f(x_k);
5: for i = k − 1, k − 2, ..., ℓ do
6: ξ_i ← ρ_i g(s_i^{(k)}, q);
7: q ← q − ξ_i y_i^{(k)};
8: end for
9: r ← H_k^0 q;
10: for i = ℓ, ℓ + 1, ..., k − 1 do
11: ω ← ρ_i g(y_i^{(k)}, r);
12: r ← r + s_i^{(k)} (ξ_i − ω);
13: end for
14: set η_k = −r;
15: find α_k that satisfies the Wolfe conditions
f(x_{k+1}) ≤ f(x_k) + c_1 α_k g(grad f(x_k), η_k),
(d/dt) f(R_{x_k}(t η_k))|_{t=α_k} ≥ c_2 (d/dt) f(R_{x_k}(t η_k))|_{t=0};
16: Set x_{k+1} = R_{x_k}(α_k η_k);
17: Define s_k^{(k+1)} = T_{S_{α_k η_k}} α_k η_k, y_k^{(k+1)} = grad f(x_{k+1})/β_k − T_{S_{α_k η_k}} grad f(x_k), ρ_k = 1/g(s_k^{(k+1)}, y_k^{(k+1)}), γ_{k+1} = g(s_k^{(k+1)}, y_k^{(k+1)})/||y_k^{(k+1)}||^2, and β_k = ||α_k η_k||/||T_{R_{α_k η_k}} α_k η_k||;
18: Let ℓ = max{k − l, 0}. Add s_k^{(k+1)}, y_k^{(k+1)} and ρ_k into storage and, if k > l, then discard the vector pair {s_{ℓ−1}^{(k)}, y_{ℓ−1}^{(k)}} and the scalar ρ_{ℓ−1} from storage; transport s_ℓ^{(k)}, s_{ℓ+1}^{(k)}, ..., s_{k−1}^{(k)} and y_ℓ^{(k)}, y_{ℓ+1}^{(k)}, ..., y_{k−1}^{(k)} from T_{x_k}M to T_{x_{k+1}}M by T_S, to get s_ℓ^{(k+1)}, s_{ℓ+1}^{(k+1)}, ..., s_{k−1}^{(k+1)} and y_ℓ^{(k+1)}, y_{ℓ+1}^{(k+1)}, ..., y_{k−1}^{(k+1)};
19: k = k + 1;
20: end while

Algorithm 2 is a limited-memory algorithm based on this idea. Note Step 18 of Algorithm 2. The vector s_m^{(k+1)} is obtained by transporting s_m^{(m+1)} a total of k − m times. If the vector transport is insensitive to finite precision arithmetic, then the approach is acceptable. Otherwise, s_m^{(k+1)} may not be in T_{x_{k+1}}M. Care must be taken to avoid this situation. One possibility is to project s_i^{(k+1)}, y_i^{(k+1)}, i = ℓ, ℓ + 1, ..., k − 1, to the tangent space T_{x_{k+1}}M after every transport.
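For reference, Steps 4 to 14 of Algorithm 2 are the Riemannian analogue of the classical two-loop recursion. The following Python (NumPy) sketch shows the Euclidean-coordinate version, i.e., what the loop computes once all stored s_i^{(k)}, y_i^{(k)} have been transported into T_{x_k}M and expressed in an orthonormal basis; the names are illustrative.

import numpy as np

def lrbfgs_direction(grad, s_list, y_list, rho_list, gamma):
    # s_list, y_list, rho_list: stored pairs s_i^{(k)}, y_i^{(k)} and scalars
    # rho_i = 1/g(s_i, y_i), ordered from oldest to newest.
    # gamma: scaling of the initial approximation H_k^0 = gamma * id, cf. (5.1).
    q = grad.copy()
    xi = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):   # Steps 5-8: newest to oldest
        xi[i] = rho_list[i] * (s_list[i] @ q)
        q = q - xi[i] * y_list[i]
    r = gamma * q                            # Step 9: r = H_k^0 q
    for i in range(len(s_list)):             # Steps 10-13: oldest to newest
        omega = rho_list[i] * (y_list[i] @ r)
        r = r + s_list[i] * (xi[i] - omega)
    return -r                                # Step 14: eta_k = -r

The returned vector equals −H grad f, where H is the operator obtained from the recursive expansion displayed in Section 5, but the computation needs only inner products and vector updates, which is why only the action of T_S on the stored pairs is required.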

6. Ring and Wirth's RBFGS update formula. In Ring and Wirth's RBFGS [26] for infinite-dimensional Riemannian manifolds, the direction vector η_k is chosen as the solution in T_{x_k}M of the equation B_k^{RW}(η_k, ξ) = −Df_{R_{x_k}}(0)[ξ] = g(−grad f(x_k), ξ) for all ξ ∈ T_{x_k}M, where f_{R_{x_k}} = f ∘ R_{x_k}. The update in [26] is
B_{k+1}^{RW}(T_{S_{α_k η_k}} ζ, T_{S_{α_k η_k}} ξ) = B_k^{RW}(ζ, ξ) − B_k^{RW}(s_k, ζ) B_k^{RW}(s_k, ξ)/B_k^{RW}(s_k, s_k) + y_k^{RW}(ζ) y_k^{RW}(ξ)/y_k^{RW}(s_k),
where s_k = R_{x_k}^{-1}(x_{k+1}) ∈ T_{x_k}M and y_k^{RW} = Df_{R_{x_k}}(s_k) − Df_{R_{x_k}}(0) is a cotangent vector at x_k, i.e., Df_{R_{x_k}}(s_k)[ξ] = g(grad f(R_{x_k}(s_k)), T_{R_{s_k}} ξ) for all ξ ∈ T_{x_k}M. One can equivalently define y_k^{RW} to be y_k^♭, where y_k = T_{R_{s_k}}^* grad f(R_{x_k}(s_k)) − grad f(x_k) is in the tangent space T_{x_k}M. Similarly, in order to ease the comparison by falling back to the conventions of Algorithm 1, define B_k to be the linear transformation of T_{x_k}M given by B_k^{RW}(ζ, ξ) = ζ^♭ B_k ξ. Then the update in [26] becomes
T_{S_{α_k η_k}}^{-1} B_{k+1} T_{S_{α_k η_k}} = B_k − B_k s_k (B_k s_k)^♭/[(B_k s_k)^♭ s_k] + y_k y_k^♭/[y_k^♭ s_k].
Since the vector transport is isometric, the update can be equivalently applied on T_{x_{k+1}}M and yields
B_{k+1} = B̃_k − B̃_k s_k (B̃_k s_k)^♭/[(B̃_k s_k)^♭ s_k] + ỹ_k ỹ_k^♭/[ỹ_k^♭ s_k],
where s_k and B̃_k are defined in Algorithm 1 and
(6.1) ỹ_k = T_{S_{α_k η_k}} T_{R_{s_k}}^* grad f(R_{x_k}(s_k)) − T_{S_{α_k η_k}} grad f(x_k) ∈ T_{x_{k+1}}M.
Comparing the definition of y_k in Algorithm 1 to (6.1), one can see that the difference between the updates in Algorithm 1 and [26] is that y_k uses the scalar 1/β_k while ỹ_k uses T_{S_{α_k η_k}} T_{R_{s_k}}^*.

7. Experiments. To investigate the performance of the RBroyden family of methods, we consider the minimization of the Brockett cost function f: St(p, n) → R: X ↦ trace(X^T A X N), which finds the p smallest eigenvalues and corresponding eigenvectors of A, where N = diag(µ_1, ..., µ_p) with µ_1 > ⋯ > µ_p > 0, A ∈ R^{n×n} and A = A^T. It has been shown that the columns of a global minimizer, X_* e_i, are eigenvectors for the p smallest eigenvalues λ_i, ordered so that λ_1 ≤ ⋯ ≤ λ_p [2, Section 4.8]. The Brockett cost function is not a retraction-convex function on the entire domain. However, for generic choices of A, the cost function is strongly retraction-convex in a sublevel set around any global minimizer. RBroyden algorithms do not require retraction-convexity to be well-defined, and all converge to a global minimizer of the Brockett cost function if started sufficiently close. For our experiments, we view St(p, n) as an embedded submanifold of the Euclidean space R^{n×p} endowed with the Euclidean metric ⟨A, B⟩ = trace(A^T B). The corresponding gradient can be found in [2, p. 80].

7.1. Methods tested. For problems with sizes moderate enough to make dense matrix operations acceptable in the complexity of a single step, we tested algorithms in the RBroyden family using the inverse Hessian approximation update
(7.1) H_{k+1} = H̃_k − H̃_k y_k (H̃_k y_k)^♭/[y_k^♭ H̃_k y_k] + s_k s_k^♭/[y_k^♭ s_k] + φ̂_k g(y_k, H̃_k y_k) u_k u_k^♭,
where u_k = s_k/g(s_k, y_k) − H̃_k y_k/g(y_k, H̃_k y_k).

It can be shown that if H_k = B_k^{-1}, φ̂_k ∈ [0, 1), and
(7.2) φ_k = (1 − φ̂_k) g^2(y_k, s_k) / [(1 − φ̂_k) g^2(y_k, s_k) + φ̂_k g(y_k, H̃_k y_k) g(s_k, B̃_k s_k)] ∈ (0, 1],
then H_{k+1} = B_{k+1}^{-1}. The Euclidean version of the relationship between φ̂_k and φ_k can be found, e.g., in [16, (50)]. In our tests, we set the variable φ̂_k to Davidon's value φ̂_k^D defined in (7.6), or we set φ̂_k to a constant value φ̂ ∈ {1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.01, 0}. For these problem sizes, the inverse Hessian approximation update tends to be preferred to the Hessian approximation update since it avoids solving a linear system. Our experiments on problems of moderate size also include the RBFGS in [26] with the inverse Hessian update formula [26, (7)]. Finally, we investigate the performance of the LRBFGS for problems of moderate size and problems with much larger sizes. In the latter cases, since dense matrix operations are too expensive computationally and spatially, LRBFGS is compared to a Riemannian conjugate gradient method, RCG, described in [2]. All experiments were performed in Matlab R2014a on a 64-bit Ubuntu platform with a 3.6 GHz CPU (Intel(R) Core(TM) i7-4790).

7.2. Vector transport and retraction. Sections 2.2 and 2.3 in [18] provide methods for representing a tangent vector and constructing isometric vector transports when the d-dimensional manifold M is a submanifold of R^w. If the codimension w − d is not much smaller than w, then a d-dimensional representation for a tangent vector and an isometric transporter by parallelization, L(x, y) = B_y B_x^♭, are preferred, where B: V → R^{w×d}: z ↦ B_z is a smooth function defined on an open set V of M and the columns of B_z form an orthonormal basis of T_zM. Otherwise, a w-dimensional representation for a tangent vector and an isometric transporter by rigging,
L(x, y) = G_y^{-1/2} (I − Q_x Q_x^T − Q_y Q_x^T) G_x^{1/2},
are preferred, where G_z denotes a matrix expression of g_z, i.e., g_z(η_z, ξ_z) = η_z^T G_z ξ_z, Q_x is given by orthonormalizing (I − N_x (N_x^T N_x)^{-1} N_x^T) N_y, likewise with x and y interchanged for Q_y, and N: V → R^{w×(w−d)}: z ↦ N_z is a smooth function defined on an open set V of M whose columns N_z form an orthonormal basis of the normal space at z. In particular, for St(p, n), w and d are np and np − p(p + 1)/2, respectively. Methods of obtaining smooth functions to build smooth bases of the tangent and normal spaces for St(p, n) are discussed in [18, Section 5].

We use the ideas in Section 4.3 applied to the isometric transporter derived from the parallelization B introduced in [18, Section 5] to define a retraction R. The details, worked out in [17], yield the following result. Let X ∈ St(p, n). We have
(7.3) (R_X(η)  R_X(η)_⊥) = (X  X_⊥) exp( [Ω  −K^T; K  0_{(n−p)×(n−p)}] ),
where Ω = X^T η, K = X_⊥^T η, and, given M ∈ R^{n×p}, M_⊥ ∈ R^{n×(n−p)} denotes a matrix that satisfies M_⊥^T M_⊥ = I_{(n−p)×(n−p)} and M_⊥^T M = 0_{(n−p)×p}. The function
(7.4) Y = R_X(η)
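For concreteness, the following Python (NumPy) sketch evaluates the Brockett cost function and its Riemannian gradient on St(p, n) viewed as an embedded submanifold of R^{n×p} with the Euclidean metric, using the standard tangent-space projection ξ ↦ ξ − X sym(X^T ξ); the function names are ours, and the gradient expression is the one referred to above from [2, p. 80].

import numpy as np

def brockett_cost(X, A, N):
    # f(X) = trace(X^T A X N) on St(p, n).
    return np.trace(X.T @ A @ X @ N)

def brockett_rgrad(X, A, N):
    # Euclidean gradient 2 A X N projected onto T_X St(p, n):
    # grad f(X) = G - X sym(X^T G), with sym(M) = (M + M^T)/2.
    G = 2.0 * A @ X @ N
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2.0

A typical test instance (an assumption for illustration, not the paper's exact setup) takes A a random symmetric n×n matrix and N = diag(p, p − 1, ..., 1), which satisfies µ_1 > ⋯ > µ_p > 0; at a global minimizer the columns of X span the eigenspace associated with the p smallest eigenvalues of A and the Riemannian gradient vanishes.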


More information

Statistics 580 Optimization Methods

Statistics 580 Optimization Methods Statistics 580 Optimization Methods Introduction Let fx be a given real-valued function on R p. The general optimization problem is to find an x ɛ R p at which fx attain a maximum or a minimum. It is of

More information

Implicit Functions, Curves and Surfaces

Implicit Functions, Curves and Surfaces Chapter 11 Implicit Functions, Curves and Surfaces 11.1 Implicit Function Theorem Motivation. In many problems, objects or quantities of interest can only be described indirectly or implicitly. It is then

More information

A Variational Model for Data Fitting on Manifolds by Minimizing the Acceleration of a Bézier Curve

A Variational Model for Data Fitting on Manifolds by Minimizing the Acceleration of a Bézier Curve A Variational Model for Data Fitting on Manifolds by Minimizing the Acceleration of a Bézier Curve Ronny Bergmann a Technische Universität Chemnitz Kolloquium, Department Mathematik, FAU Erlangen-Nürnberg,

More information

Quasi-Newton methods for minimization

Quasi-Newton methods for minimization Quasi-Newton methods for minimization Lectures for PHD course on Numerical optimization Enrico Bertolazzi DIMS Universitá di Trento November 21 December 14, 2011 Quasi-Newton methods for minimization 1

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Math Linear Algebra II. 1. Inner Products and Norms

Math Linear Algebra II. 1. Inner Products and Norms Math 342 - Linear Algebra II Notes 1. Inner Products and Norms One knows from a basic introduction to vectors in R n Math 254 at OSU) that the length of a vector x = x 1 x 2... x n ) T R n, denoted x,

More information

R-Linear Convergence of Limited Memory Steepest Descent

R-Linear Convergence of Limited Memory Steepest Descent R-Linear Convergence of Limited Memory Steepest Descent Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 16T-010 R-Linear Convergence

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

ON THE CONNECTION BETWEEN THE CONJUGATE GRADIENT METHOD AND QUASI-NEWTON METHODS ON QUADRATIC PROBLEMS

ON THE CONNECTION BETWEEN THE CONJUGATE GRADIENT METHOD AND QUASI-NEWTON METHODS ON QUADRATIC PROBLEMS ON THE CONNECTION BETWEEN THE CONJUGATE GRADIENT METHOD AND QUASI-NEWTON METHODS ON QUADRATIC PROBLEMS Anders FORSGREN Tove ODLAND Technical Report TRITA-MAT-203-OS-03 Department of Mathematics KTH Royal

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

An implementation of the steepest descent method using retractions on riemannian manifolds

An implementation of the steepest descent method using retractions on riemannian manifolds An implementation of the steepest descent method using retractions on riemannian manifolds Ever F. Cruzado Quispe and Erik A. Papa Quiroz Abstract In 2008 Absil et al. 1 published a book with optimization

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Justin Solomon CS 205A: Mathematical Methods Optimization II: Unconstrained Multivariable 1

More information

fy (X(g)) Y (f)x(g) gy (X(f)) Y (g)x(f)) = fx(y (g)) + gx(y (f)) fy (X(g)) gy (X(f))

fy (X(g)) Y (f)x(g) gy (X(f)) Y (g)x(f)) = fx(y (g)) + gx(y (f)) fy (X(g)) gy (X(f)) 1. Basic algebra of vector fields Let V be a finite dimensional vector space over R. Recall that V = {L : V R} is defined to be the set of all linear maps to R. V is isomorphic to V, but there is no canonical

More information

2. Review of Linear Algebra

2. Review of Linear Algebra 2. Review of Linear Algebra ECE 83, Spring 217 In this course we will represent signals as vectors and operators (e.g., filters, transforms, etc) as matrices. This lecture reviews basic concepts from linear

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Convergence analysis of Riemannian trust-region methods

Convergence analysis of Riemannian trust-region methods Convergence analysis of Riemannian trust-region methods P.-A. Absil C. G. Baker K. A. Gallivan Optimization Online technical report Compiled June 19, 2006 Abstract This document is an expanded version

More information

DEVELOPMENT OF MORSE THEORY

DEVELOPMENT OF MORSE THEORY DEVELOPMENT OF MORSE THEORY MATTHEW STEED Abstract. In this paper, we develop Morse theory, which allows us to determine topological information about manifolds using certain real-valued functions defined

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

Quasi-Newton Methods

Quasi-Newton Methods Newton s Method Pros and Cons Quasi-Newton Methods MA 348 Kurt Bryan Newton s method has some very nice properties: It s extremely fast, at least once it gets near the minimum, and with the simple modifications

More information

4.2. ORTHOGONALITY 161

4.2. ORTHOGONALITY 161 4.2. ORTHOGONALITY 161 Definition 4.2.9 An affine space (E, E ) is a Euclidean affine space iff its underlying vector space E is a Euclidean vector space. Given any two points a, b E, we define the distance

More information

I teach myself... Hilbert spaces

I teach myself... Hilbert spaces I teach myself... Hilbert spaces by F.J.Sayas, for MATH 806 November 4, 2015 This document will be growing with the semester. Every in red is for you to justify. Even if we start with the basic definition

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Quasi Newton Methods Barnabás Póczos & Ryan Tibshirani Quasi Newton Methods 2 Outline Modified Newton Method Rank one correction of the inverse Rank two correction of the

More information

Chapter 4. Unconstrained optimization

Chapter 4. Unconstrained optimization Chapter 4. Unconstrained optimization Version: 28-10-2012 Material: (for details see) Chapter 11 in [FKS] (pp.251-276) A reference e.g. L.11.2 refers to the corresponding Lemma in the book [FKS] PDF-file

More information

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM JIAYI GUO AND A.S. LEWIS Abstract. The popular BFGS quasi-newton minimization algorithm under reasonable conditions converges globally on smooth

More information

Spectral Processing. Misha Kazhdan

Spectral Processing. Misha Kazhdan Spectral Processing Misha Kazhdan [Taubin, 1995] A Signal Processing Approach to Fair Surface Design [Desbrun, et al., 1999] Implicit Fairing of Arbitrary Meshes [Vallet and Levy, 2008] Spectral Geometry

More information

1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Matrix Manifolds: Second-Order Geometry

Matrix Manifolds: Second-Order Geometry Chapter Five Matrix Manifolds: Second-Order Geometry Many optimization algorithms make use of second-order information about the cost function. The archetypal second-order optimization algorithm is Newton

More information

Curvature homogeneity of type (1, 3) in pseudo-riemannian manifolds

Curvature homogeneity of type (1, 3) in pseudo-riemannian manifolds Curvature homogeneity of type (1, 3) in pseudo-riemannian manifolds Cullen McDonald August, 013 Abstract We construct two new families of pseudo-riemannian manifolds which are curvature homegeneous of

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

Lecture 18: November Review on Primal-dual interior-poit methods

Lecture 18: November Review on Primal-dual interior-poit methods 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Lecturer: Javier Pena Lecture 18: November 2 Scribes: Scribes: Yizhu Lin, Pan Liu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The notion of complexity (per iteration)

More information

A Riemannian conjugate gradient method for optimization on the Stiefel manifold

A Riemannian conjugate gradient method for optimization on the Stiefel manifold A Riemannian conjugate gradient method for optimization on the Stiefel manifold Xiaojing Zhu Abstract In this paper we propose a new Riemannian conjugate gradient method for optimization on the Stiefel

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Static unconstrained optimization

Static unconstrained optimization Static unconstrained optimization 2 In unconstrained optimization an objective function is minimized without any additional restriction on the decision variables, i.e. min f(x) x X ad (2.) with X ad R

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

Chapter Three. Matrix Manifolds: First-Order Geometry

Chapter Three. Matrix Manifolds: First-Order Geometry Chapter Three Matrix Manifolds: First-Order Geometry The constraint sets associated with the examples discussed in Chapter 2 have a particularly rich geometric structure that provides the motivation for

More information

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf. Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of

More information

arxiv: v2 [math.oc] 31 Jan 2019

arxiv: v2 [math.oc] 31 Jan 2019 Analysis of Sequential Quadratic Programming through the Lens of Riemannian Optimization Yu Bai Song Mei arxiv:1805.08756v2 [math.oc] 31 Jan 2019 February 1, 2019 Abstract We prove that a first-order Sequential

More information

MATH 4030 Differential Geometry Lecture Notes Part 4 last revised on December 4, Elementary tensor calculus

MATH 4030 Differential Geometry Lecture Notes Part 4 last revised on December 4, Elementary tensor calculus MATH 4030 Differential Geometry Lecture Notes Part 4 last revised on December 4, 205 Elementary tensor calculus We will study in this section some basic multilinear algebra and operations on tensors. Let

More information

Tikhonov Regularization of Large Symmetric Problems

Tikhonov Regularization of Large Symmetric Problems NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi

More information

Supplementary Materials for Riemannian Pursuit for Big Matrix Recovery

Supplementary Materials for Riemannian Pursuit for Big Matrix Recovery Supplementary Materials for Riemannian Pursuit for Big Matrix Recovery Mingkui Tan, School of omputer Science, The University of Adelaide, Australia Ivor W. Tsang, IS, University of Technology Sydney,

More information

Notes on Numerical Optimization

Notes on Numerical Optimization Notes on Numerical Optimization University of Chicago, 2014 Viva Patel October 18, 2014 1 Contents Contents 2 List of Algorithms 4 I Fundamentals of Optimization 5 1 Overview of Numerical Optimization

More information

Max-Planck-Institut für Mathematik in den Naturwissenschaften Leipzig

Max-Planck-Institut für Mathematik in den Naturwissenschaften Leipzig Max-Planck-Institut für Mathematik in den Naturwissenschaften Leipzig Convergence analysis of Riemannian GaussNewton methods and its connection with the geometric condition number by Paul Breiding and

More information

1. Subspaces A subset M of Hilbert space H is a subspace of it is closed under the operation of forming linear combinations;i.e.,

1. Subspaces A subset M of Hilbert space H is a subspace of it is closed under the operation of forming linear combinations;i.e., Abstract Hilbert Space Results We have learned a little about the Hilbert spaces L U and and we have at least defined H 1 U and the scale of Hilbert spaces H p U. Now we are going to develop additional

More information

Numerical Optimization

Numerical Optimization Numerical Optimization Unit 2: Multivariable optimization problems Che-Rung Lee Scribe: February 28, 2011 (UNIT 2) Numerical Optimization February 28, 2011 1 / 17 Partial derivative of a two variable function

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

INNER PRODUCT SPACE. Definition 1

INNER PRODUCT SPACE. Definition 1 INNER PRODUCT SPACE Definition 1 Suppose u, v and w are all vectors in vector space V and c is any scalar. An inner product space on the vectors space V is a function that associates with each pair of

More information

Optimization and Root Finding. Kurt Hornik

Optimization and Root Finding. Kurt Hornik Optimization and Root Finding Kurt Hornik Basics Root finding and unconstrained smooth optimization are closely related: Solving ƒ () = 0 can be accomplished via minimizing ƒ () 2 Slide 2 Basics Root finding

More information

11 a 12 a 21 a 11 a 22 a 12 a 21. (C.11) A = The determinant of a product of two matrices is given by AB = A B 1 1 = (C.13) and similarly.

11 a 12 a 21 a 11 a 22 a 12 a 21. (C.11) A = The determinant of a product of two matrices is given by AB = A B 1 1 = (C.13) and similarly. C PROPERTIES OF MATRICES 697 to whether the permutation i 1 i 2 i N is even or odd, respectively Note that I =1 Thus, for a 2 2 matrix, the determinant takes the form A = a 11 a 12 = a a 21 a 11 a 22 a

More information

Numerical Methods for Large-Scale Nonlinear Systems

Numerical Methods for Large-Scale Nonlinear Systems Numerical Methods for Large-Scale Nonlinear Systems Handouts by Ronald H.W. Hoppe following the monograph P. Deuflhard Newton Methods for Nonlinear Problems Springer, Berlin-Heidelberg-New York, 2004 Num.

More information

Differential Topology Final Exam With Solutions

Differential Topology Final Exam With Solutions Differential Topology Final Exam With Solutions Instructor: W. D. Gillam Date: Friday, May 20, 2016, 13:00 (1) Let X be a subset of R n, Y a subset of R m. Give the definitions of... (a) smooth function

More information

Solving linear equations with Gaussian Elimination (I)

Solving linear equations with Gaussian Elimination (I) Term Projects Solving linear equations with Gaussian Elimination The QR Algorithm for Symmetric Eigenvalue Problem The QR Algorithm for The SVD Quasi-Newton Methods Solving linear equations with Gaussian

More information

Chapter 3. Riemannian Manifolds - I. The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves

Chapter 3. Riemannian Manifolds - I. The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves Chapter 3 Riemannian Manifolds - I The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves embedded in Riemannian manifolds. A Riemannian manifold is an abstraction

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 )

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 ) A line-search algorithm inspired by the adaptive cubic regularization framewor with a worst-case complexity Oɛ 3/ E. Bergou Y. Diouane S. Gratton December 4, 017 Abstract Adaptive regularized framewor

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Riemannian Optimization and its Application to Phase Retrieval Problem

Riemannian Optimization and its Application to Phase Retrieval Problem Riemannian Optimization and its Application to Phase Retrieval Problem 1/46 Introduction Motivations Optimization History Phase Retrieval Problem Summary Joint work with: Pierre-Antoine Absil, Professor

More information

Intrinsic Representation of Tangent Vectors and Vector Transports on Matrix Manifolds

Intrinsic Representation of Tangent Vectors and Vector Transports on Matrix Manifolds Tech. report UCL-INMA-2016.08.v2 Intrinsic Representation of Tangent Vectors and Vector Transports on Matrix Manifolds Wen Huang P.-A. Absil K. A. Gallivan Received: date / Accepted: date Abstract The

More information

Linear Algebra. Session 12

Linear Algebra. Session 12 Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)

More information

Prof. M. Saha Professor of Mathematics The University of Burdwan West Bengal, India

Prof. M. Saha Professor of Mathematics The University of Burdwan West Bengal, India CHAPTER 9 BY Prof. M. Saha Professor of Mathematics The University of Burdwan West Bengal, India E-mail : mantusaha.bu@gmail.com Introduction and Objectives In the preceding chapters, we discussed normed

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

October 25, 2013 INNER PRODUCT SPACES

October 25, 2013 INNER PRODUCT SPACES October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal

More information

HOMEWORK 2 - RIEMANNIAN GEOMETRY. 1. Problems In what follows (M, g) will always denote a Riemannian manifold with a Levi-Civita connection.

HOMEWORK 2 - RIEMANNIAN GEOMETRY. 1. Problems In what follows (M, g) will always denote a Riemannian manifold with a Levi-Civita connection. HOMEWORK 2 - RIEMANNIAN GEOMETRY ANDRÉ NEVES 1. Problems In what follows (M, g will always denote a Riemannian manifold with a Levi-Civita connection. 1 Let X, Y, Z be vector fields on M so that X(p Z(p

More information

THE EULER CHARACTERISTIC OF A LIE GROUP

THE EULER CHARACTERISTIC OF A LIE GROUP THE EULER CHARACTERISTIC OF A LIE GROUP JAY TAYLOR 1 Examples of Lie Groups The following is adapted from [2] We begin with the basic definition and some core examples Definition A Lie group is a smooth

More information