THE RIEMANNIAN BARZILAI-BORWEIN METHOD WITH NONMONOTONE LINE SEARCH AND THE MATRIX GEOMETRIC MEAN COMPUTATION

BRUNO IANNAZZO AND MARGHERITA PORCELLI

Abstract. The Barzilai-Borwein method, an effective gradient descent method with a clever choice of the step-length, is adapted from nonlinear optimization to Riemannian manifold optimization. More generally, global convergence of a nonmonotone line-search strategy for Riemannian optimization algorithms is proved under some standard assumptions. By a set of numerical tests, the Riemannian Barzilai-Borwein method with nonmonotone line-search is shown to be competitive in several Riemannian optimization problems. When used to compute the matrix geometric mean, known as the Karcher mean of positive definite matrices, it notably outperforms existing first-order optimization methods.

Riemannian optimization; manifold optimization; Barzilai-Borwein algorithm; nonmonotone line-search; Karcher mean; matrix geometric mean; positive definite matrix.

1. Introduction. Globally convergent Barzilai-Borwein (BB) methods are well-known optimization methods for the solution of unconstrained and constrained optimization problems formulated in the Euclidean space. These algorithms are rather appealing for their simplicity, low cost per iteration (they are gradient-type algorithms) and good practical performance due to a clever choice of the step-length ([2, 11, 13, 9]).

The key to the success of BB methods lies, on the one side, in the explicit use of first-order information on the cost function and, on the other side, in the implicit use of second-order information embedded in the step-length through a rough approximation of the Hessian of the cost function. This is crucial in the solution of problems where computing the Hessian represents a heavy burden, as when the problem dimension is large. For strictly convex quadratic problems, global convergence of the BB method has been established in [28], while in the general nonquadratic case this property is guaranteed if BB is incorporated in a globalization strategy, see [29]. Due to the nonmonotone behaviour of the cost function along the BB steps, BB methods are generally combined with nonmonotone line-search strategies that do not impose a sufficient decrease condition on the cost function at each step, but rather on the maximum of the cost function values over the last, say, M steps (with M > 1) ([16, 29, 17]). It has been observed that if M is sufficiently large this strategy does not spoil the average rate of convergence of the BB method ([29, 17]) and yields impressively good practical performance, especially in comparison with classical conjugate gradient methods, see [29].

Motivated by the nice theoretical and numerical properties of the (globalized) BB methods in the Euclidean space, we consider here a more general setting: BB methods for Riemannian manifold optimization based on retractions. Therefore, in this work we address the following optimization problem

    $\min_{x \in \mathcal{M}} f(x),$    (1.1)

where $f$ is a smooth real-valued function (the cost function) defined over a Riemannian manifold $\mathcal{M}$.

Version of February 14,
Dipartimento di Matematica e Informatica, Università degli Studi di Perugia, Via Vanvitelli 1, Perugia, Italy (bruno.iannazzo@dmi.unipg.it).
Dipartimento di Ingegneria Industriale, Università degli Studi di Firenze, Viale Morgagni 40/44, Firenze, Italy (margherita.porcelli@unifi.it).

To the best of our knowledge, globally convergent nonmonotone BB-type methods have been proposed in Riemannian optimization only for the special case of Stiefel manifolds, see [14, 33, 21]. More precisely, in [14] the global convergence of a trust-region nonmonotone Levenberg-Marquardt algorithm for minimization problems on closed domains is provided, and the application of the proposed method to minimization on Stiefel manifolds results in a BB method with a nonmonotone trust-region. The papers [21] and [33] are concerned with the BB method with nonmonotone line-search, and the global convergence of the procedure is proved in [21] for Stiefel manifolds.

As a first contribution of this paper, under mild assumptions we prove global convergence of general (gradient-related) Riemannian optimization algorithms, when equipped with a nonmonotone line-search strategy. The line-search will be suitably adjusted to fit the framework of Riemannian manifold optimization and will be performed on the tangent space at a point. We therefore extend to Riemannian optimization the convergence results proved for the Euclidean space in [16, 29, 8]. Secondly, we introduce the Riemannian Barzilai-Borwein algorithm, which generalizes the Euclidean BB method, and discuss the corresponding algorithm embedded in a nonmonotone line-search strategy. Thirdly, we apply the proposed algorithm to the computation of the geometric mean of positive definite matrices, the so-called Karcher mean, which is very popular in many areas of applied mathematics and engineering, see, e.g., part II of [26]. The proposed method is a first-order algorithm; nevertheless, as pointed out in [20], the Riemannian Hessian (or its full approximation) of the cost function defining the Karcher mean of positive definite matrices has a high computational complexity, which makes the use of second-order algorithms prohibitive. Finally, we benchmark the Riemannian BB over a set of problems involving different manifold structures and compare it with the trust-region and the steepest-descent line-search algorithms. Remarkably, in all our experiments, the performance of the proposed algorithm is superior to those of the existing first-order optimization algorithms for computing the Karcher mean of positive definite matrices ([31, 20, 5, 34]).

The paper is organized as follows. For later convenience, we recall in Section 2 some concepts from Riemannian optimization, using the language and the tools described in the book [1]. Section 3 is devoted to the global convergence analysis of the nonmonotone line-search strategy for gradient-related Riemannian optimization algorithms; then, in Section 4, we introduce the Riemannian BB algorithm and its globalization via nonmonotone line-search. Section 5 is devoted to the Karcher mean of positive definite matrices and to the implementation of the Riemannian BB method for its computation. In Section 6 numerical tests are performed to confirm the good behaviour of the Riemannian BB method for computing the Karcher mean of positive definite matrices and the possibility to apply the method to other problems of Riemannian optimization. Conclusions are drawn in Section 7.

2. Preliminaries on Riemannian optimization. The subject of Riemannian optimization is the computation of the minima of a differentiable function $f : \mathcal{M} \to \mathbb{R}$, where $\mathcal{M}$ is a real Riemannian manifold. Standard nonlinear optimization, which will be referred to as Euclidean optimization, can be interpreted as a special case of Riemannian optimization, where $\mathcal{M} = \mathbb{R}^n$. For the ease of the reader, we briefly recall some of the basic concepts of Riemannian optimization based on retractions.
We refer the reader to the book [1] for a thorough treatment of the topic. In Riemannian optimization algorithms, the iterate $x_k$ belongs to a manifold $\mathcal{M}$, while the gradient at $x_k$ belongs to the tangent space $T_{x_k}\mathcal{M}$ at $x_k$, which is isomorphic to a Euclidean space. The gradient of $f$ related to the geometry of $\mathcal{M}$ is denoted by $\nabla^{(R)} f$ and defined as follows: for any point $x \in \mathcal{M}$, $\nabla^{(R)} f(x)$ is the unique element of $T_x\mathcal{M}$ that satisfies

    $Df(x)[h] = \langle \nabla^{(R)} f(x), h \rangle_x, \qquad h \in T_x\mathcal{M},$

3 The Riemannian Barzilai-Borwein algorithm 3 where, x is the Riemannian scalar product in T x M and Df(x) : T x M T x M is the differential of f at x. In the paper, we refer to (R) f(x) as the Riemannian gradient of f at x and frequently use the notation g(x) := (R) f(x) and g k := (R) f(x k ). A step of a gradient-like method in R n involves summing points and gradient vectors. While in R n, the concept of moving in the direction of the negative gradient is straightforward, on a manifold, using the gradient (R) f(x k ) to update x k is tricky since x k lies on the manifold while the gradient belongs to T xk M. Indeed, the next iterate x k+1 ought to be constructed following a geodesic starting at x k and with tangent vector (R) f(x k ), that is a curve γ(t) : [0, t 0 ) M, with t 0 (0, ]. In contrast with the standard gradient-like methods for Euclidean unconstrained optimization, where t is an arbitrary positive number, in the Riemannian case, t must be chosen between 0 and t 0. Nevertheless, there are special manifolds, said to be geodesically complete, for which t 0 can be always chosen to be. Geodesics can be followed using the exponential map exp x (v), which gives the point γ(1) of a geodesic γ(t) starting at x = γ(0) and with tangent vector v. Since it is not always easy to find and follow geodesics, in manifold optimization the exponential map is approximated with a retraction map and we refer to retraction-based Riemannian optimization. The retraction is defined as a smooth map R from the tangent bundle T M to M, such that R x (0 x ) = x and DR x (0 x ) = id TxM, where R x denotes the restriction of R to T x M, 0 x is the zero element of T x M, id TxM is the identity map on T x M and we have identified the tangent space to T x M with itself. The exponential map is a special case of retraction since it satisfies the above properties too. In practice using a retraction, in lieu of the exponential map, yields the same convergence results in an optimization algorithm, being possibly much easier to compute. In general, the domain of R is not necessarily the whole tangent bundle but for each point x we may assume that there is a neighbourhood of 0 x in T x M such that R x is defined at any point of this neighbourhood. The step of a gradient-like method based on a retraction is of the form x k+1 = R xk ( α k g k ), where α k is a step-length such that α k g k belongs to the domain of R xk. The extension of the Euclidean BB algorithm to the Riemannian setting requires the sum of gradients evaluated at different points (see Section 4), that is vectors belonging to different tangent spaces. In retraction-based optimization algorithms, the standard tool to move vectors from a tangent space to another is the vector transport. Roughly speaking, given two tangent vectors at x, say v x, w x T x M, a vector transport T associated with a retraction R, is a smooth map that generates a point T wx (v x ) that belongs to T Rx(w x)m, and that is linear with respect to the argument. For the precise definition of the vector transport, we refer the reader to Sec. 8.1 of [1]. In many algorithms, a vector v T x M has to be transported to the tangent space T y M, with y x. Given a vector transport T, for the sake of simplicity, we use the notation T x y (v) to indicate the transport T wx (v), where w x is such that R x (w x ) = y. 
To ensure the existence of a $w_x$ such that $R_x(w_x) = y$, we have to assume that the retraction has a sufficiently large domain, so as to allow the search step and the vector transport. A special case of vector transport, well known in geometry, is the parallel transport, which can be interpreted as a vector transport where the underlying retraction is the exponential map.

3. Nonmonotone line-search in Riemannian optimization. The Armijo line-search is a standard optimization strategy that consists in successively reducing the step-length (backtracking) until a sufficient decrease criterion, called Armijo's condition, is met. When combined with a gradient-type algorithm, this strategy guarantees global convergence of the overall procedure under mild assumptions, see e.g. [27].
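Before giving the precise Riemannian conditions, here is a minimal Python sketch of the backtracking idea in the Euclidean setting (an illustration written for this transcription, not code from the paper). The nonmonotone variant used below simply replaces $f(x_k)$ on the right-hand side by the maximum of the last $\min\{k+1, M\}$ cost values, and the Riemannian version evaluates the trial point through a retraction, $R_{x_k}(\alpha d_k)$.

```python
import numpy as np

def armijo_backtracking(f, grad, x, d, lam=1.0, sigma=0.5, gamma=1e-4, max_red=50):
    """Return alpha = sigma^h * lam satisfying the (monotone) Armijo condition
       f(x + alpha*d) <= f(x) + gamma*alpha*grad(x)^T d, for a descent direction d."""
    fx, slope = f(x), grad(x) @ d
    assert slope < 0, "d must be a descent direction"
    alpha = lam
    for _ in range(max_red):
        if f(x + alpha * d) <= fx + gamma * alpha * slope:
            break
        alpha *= sigma
    return alpha

# Example: one steepest-descent step on f(x) = ||x||^2.
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([3.0, -4.0])
d = -grad(x)
alpha = armijo_backtracking(f, grad, x, d)
print(alpha, f(x + alpha * d) < f(x))
```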

In Euclidean geometry, given a point $x_k \in \mathbb{R}^n$ and a search direction $d_k$, the Armijo line-search strategy is as follows: starting from a fixed step-length $\lambda_k > 0$, compute the smallest nonnegative integer $h_k$ such that

    $f(x_k + \sigma^{h_k}\lambda_k d_k) \le f(x_k) + \gamma\sigma^{h_k}\lambda_k \nabla f(x_k)^T d_k,$

for a suitable $\gamma \in (0,1)$ and reduction factor $\sigma \in (0,1)$; set the current step-length $\alpha_k = \sigma^{h_k}\lambda_k$ and update $x_{k+1} = x_k + \alpha_k d_k$. We have denoted by $\nabla f(x_k)$ the (Euclidean) gradient of $f$ at $x_k$, that is, the unique vector such that $\nabla f(x_k)^T h = Df(x_k)[h]$ for $h \in \mathbb{R}^n$.

A generalization of the Armijo line-search to Riemannian manifold optimization can be found in [1, Sec. 4.2], where the condition to find $h_k$ is

    $f(R_{x_k}(\sigma^{h_k}\lambda_k d_k)) \le f(x_k) + \gamma\sigma^{h_k}\lambda_k \langle g_k, d_k \rangle_{x_k},$    (3.1)

where $R$ is a retraction on $\mathcal{M}$ and $g_k$ is the Riemannian gradient of $f$ at $x_k$. Then the iterate is updated as $x_{k+1} = R_{x_k}(\alpha_k d_k)$ with $\alpha_k = \sigma^{h_k}\lambda_k$. This strategy is sometimes referred to as curvilinear search since, in some sense, it is a search on a curve in $\mathcal{M}$ ([33]). Alternatively, it can be seen as a line-search on the tangent space.

In certain optimization algorithms, such as the BB method, global convergence is still ensured if one allows the cost function to increase from time to time, while asking that the maximum of the cost function over the last $M > 1$ steps decreases. This can be achieved using a nonmonotone Armijo line-search strategy and choosing $h_k$ as the smallest nonnegative integer such that

    $f(x_k + \sigma^{h_k}\lambda_k d_k) \le \max_{1 \le j \le \min\{k+1, M\}} \{f(x_{k+1-j})\} + \gamma\sigma^{h_k}\lambda_k \nabla f(x_k)^T d_k,$

where $\gamma \in (0,1)$ and $\sigma \in (0,1)$. The adaptation of the nonmonotone line-search to Riemannian optimization consists in finding $h_k$ as the smallest nonnegative integer such that

    $f(R_{x_k}(\lambda_k\sigma^{h_k} d_k)) \le \max_{1 \le j \le \min\{k+1, M\}} \{f(x_{k+1-j})\} + \gamma\sigma^{h_k}\lambda_k \langle g_k, d_k \rangle_{x_k}.$    (3.2)

We now give our main result on the global convergence of the nonmonotone Armijo line-search in Riemannian optimization when the search direction $d_k$ satisfies a general set of suitable conditions, as, e.g., that of being gradient-related. The proof closely follows and generalizes the ones of [16], [8] and [29], where the case $\mathcal{M} = \mathbb{R}^n$ is considered.

Theorem 3.1. Let $\mathcal{M}$ be a Riemannian manifold with metric $\langle \cdot, \cdot \rangle_x$ at $x \in \mathcal{M}$, and let $R$ be a retraction on $\mathcal{M}$ whose domain is the whole tangent bundle. Let $f : \mathcal{M} \to \mathbb{R}$ be a smooth function and $x_0 \in \mathcal{M}$ be such that $\mathcal{M}_{(x_0)} := \{x \in \mathcal{M} : f(x) \le f(x_0)\}$ is a compact set. Let $\{x_k\} \subset \mathcal{M}$ be a sequence defined by $x_{k+1} = R_{x_k}(\alpha_k d_k)$, $k = 0, 1, 2, \ldots$, with $\alpha_k \in \mathbb{R}$ and $d_k \in T_{x_k}\mathcal{M}$, and let $g_k$ denote the Riemannian gradient of $f$ at $x_k$. Let $\sigma \in (0,1)$, $\gamma \in (0,1)$, $M$ be a nonnegative integer and $\{\lambda_k\}$ be a sequence such that $0 < a \le \lambda_k \le b$, where $a$ and $b$ are independent of $k$. Assume that:
(a) $d_k$ is a descent direction, that is, $\langle g_k, d_k \rangle_{x_k} < 0$ for each $k$;
(b) the sequence $\{d_k\}$ is gradient-related, namely, for any subsequence $\{x_k\}_{k \in K}$ of $\{x_k\}$ that converges to a nonstationary point of $f$, the corresponding subsequence $\{d_k\}_{k \in K}$ is bounded and satisfies $\limsup_{k \in K} \langle g_k, d_k \rangle_{x_k} < 0$;

5 The Riemannian Barzilai-Borwein algorithm 5 (c) for any subsequence {x k } k K of {x k } that converges to a stationary point, then there exists a subsequence {d k } k K2 of {d k }, with K 2 K, such that lim d k = 0; k K 2 (d) α k = σ h k λ k, where, for each k, h k is the smallest nonnegative integer such that f(r xk (σ h k λ k d k )) Then every limit point of the sequence {x k } is stationary. max {f(x k+1 j)} + γσ h k λ k g k, d k xk. (3.3) 1 j min{k+1,m} We prove the theorem using a few lemmas. Lemma 3.2. In the hypotheses of Theorem 3.1, for j = 1, 2, 3,... let V j = max{f(x jm M+1 ), f(x jm M+2 ),..., f(x jm )}, and ν(j) {jm M + 1, jm M + 2,..., jm} be such that Then, f(x ν(j) ) = V j. V j+1 V j + γα ν(j+1) 1 g ν(j+1) 1, d ν(j+1) 1 xν(j+1) 1, (3.4) for all j = 1, 2,.... Proof. We prove that, for l = 1, 2,..., M, and for all j = 1, 2,..., we have f(x jm+l ) V j + γα jm+l 1 g jm+l 1, d jm+l 1 xjm+l 1 V j. (3.5) Observe that from assumption (a) and the definition of α jm+l 1 and γ we have the latter inequality. For any j, the case l = 1 follows by the nonmonotone line-search assumption (d), while assuming that the statement is true for any l < l, with 1 < l M, we can write f(x jm+l ) max{f(x jm M+l ),..., f(x jm ), f(x jm+1 ),..., f(x jm+l 1 )} + γα jm+l 1 g jm+l 1, d jm+l 1 xjm+l 1. Since f(x jm M+l ),..., f(x jm ) V j by definition and f(x jm+1 ),..., f(x jm+l 1 ) V j by inductive hypothesis, we have that (3.5) holds for l. Since x ν(j+1) = x jm+l for some l {1,..., M}, the inequality (3.4) follows from (3.5). Let us define and observe that ν(j) < ν(j + 1) < ν(j) + 2M. K = {ν(1) 1, ν(2) 1,... }, (3.6) Lemma 3.3. In the hypotheses of Theorem 3.1 and with K as in (3.6), every limit point of {x k } k K is stationary. Proof. Since f is bounded in M (x0) = {x M : f(x) f(x 0 )} and it is monotone in {x k+1 } k K, there exists f such that lim k K f(x k+1 ) = f. Taking the limit of both sides of (3.4), and using the assumption (a) of Theorem 3.1, we get lim γα ν(j+1) 1 g ν(j+1) 1, d ν(j+1) 1 xν(j+1) 1 = 0, j

6 6 B. Iannazzo and M. Porcelli that is lim α k g k, d k xk = 0. (3.7) k K By contradiction, assume that there exists a nonstationary limit point, namely, there exists K 2 K such that lim k K2 x k = x and g(x ) 0. Since d k is gradient-related by assumption (b), there exists K 3 K 2 such that lim k K3 g k, d k xk exists and is a negative number. Thus, lim k K3 α k = 0, in view of (3.7). Since λ k a > 0, the condition lim k K3 α k = 0 implies that, for a sufficiently large k K 3, say for k K 4 K 3, we have h k 1, and thus condition (3.3) is not verified for α k = α k/σ. We have found a sequence {α k } k K 4 such that lim k K4 α k = 0 and f ( R xk (α kd k ) ) > max {f(x k+1 j)} + γα k g k, d k xk f(x k ) + γα k g k, d k xk, k K 4. 1 j min{k+1,m} (3.8) By taking a subsequence K 5 K 4 we may assume that every point {x k } k K5 and x belong to a single coordinate chart. By the previous relation (3.8), the fact that R x (0) = x for x M, and the mean value theorem applied to the smooth function f R xk : T xk M R, there exists ξ k (0, α k ), such that f ( R xk (α k d k) ) f ( R xk (0) ) = D(f R xk )(ξ k d k )[d k ] α k = Df(R xk (ξ k d k ))[DR xk (ξ k d k )[d k ]] = g(r xk (ξ k d k )), DR xk (ξ k d k )[d k ] Rxk (ξ k d k ) > γ g k, d k xk. Since d k is bounded by assumption (b), there exists K 6 K 5 such that lim k K6 d k = d 0 and thus lim k K6 ξ k d k = 0, and since the retraction R x (v) is smooth as a function of both x and v, we have lim k K6 R xk (ξ k d k ) = R x (0) = x and lim k K6 DR xk (ξ k d k )[d k ] = DR x (0)[d] = d. Finally, since f and the Riemannian metric are smooth, using the chart where the sequence and x lie, and taking limits on the last line of (3.9) we get lim k K 6 g(r xk (ξ k d k )), DR xk (ξ k d k )[d k ] Rxk (ξ k d k ) = g(x ), d x γ g(x ), d x. Now we have (1 γ) g(x ), d x 0, with γ (0, 1), and on the other hand, by the choice of K 3, g(x ), d x < 0 which leads to a contradiction. Lemma 3.4. In the hypotheses of Theorem 3.1, let K(l) for l 0 be the set {k + l, k K}. For each l 0, every limit point of {x k } k K(l) is stationary. Proof. We give a proof by induction on l. Since K(0) = K, the case l = 0 is Lemma 3.3. We assume that the property is true for K(l) and we prove it for K(l + 1). Let x be a limit point of {x k } k K(l+1), namely, there exists H 1 K(l + 1) such that lim k H1 x k = x ; we want to prove that x is stationary. The sequence {x k 1 } k H1 belongs to the compact set M (x0) and thus it has a subsequence {x k 1 } k H2, with H 2 H 1, converging to a certain limit point y. By inductive hypothesis, since {x k 1 } k H2 is a subsequence of {x k } k K(l), we have that y is a stationary point. By assumption (c), there exists H 3 H 2 such that lim k H3 d k 1 = 0, and since α j λ j b for any j, we have hence, x is stationary. x = lim k H 3 x k = lim k H 3 R xk 1 (α k 1 d k 1 ) = R y (0) = y; (3.9)

7 The Riemannian Barzilai-Borwein algorithm 7 Proof. (of Theorem 3.1). Since {1, 2,...} = 2M l=0 K(l), one can see that for any sequence {x k } k K0 converging to x, there exists 0 l 2M such that infinitely many terms of {x k } k K0 belong to {x k } k K( l), that is, there exists K 00 K 0 K( l) such that lim k K00 x k = x. Since the sequence {x k } k K00 belongs to {x k } k K( l), using Lemma 3.4, we conclude that x is a stationary point. A few remarks on the assumptions of Theorem 3.1 are in order. Provided that x k is not stationary for all k, assumptions (a)-(c) are always fulfilled for non-finite sequences generated by a gradient-type algorithm, being d k = g k. The requirement that M (x0) is a compact set is natural in nonlinear minimization. The strongest hypothesis is that the retraction R has to be defined in the whole tangent bundle. This occurs in several practical cases, but it is not always the case. On the other hand, in order to apply the algorithm and prove its global convergence, we only need that for each k, R xk (td k ) is defined for t [0, λ k ]; this might be true also if R is not globally defined. 4. The Riemannian Barzilai-Borwein method. In this section we propose an adaptation of the Euclidean BB method to the solution of the Riemannian manifold optimization problem (1.1). The BB method belongs to the class of first-order (or gradient-type) optimization algorithms, namely algorithms that only use the gradient of the cost function, while disregarding the Hessian (second-order information). Given the problem min f(x), x R n where f : R n R is a given smooth cost function, the simplest gradient-type method is a linesearch method based on the steepest descent direction: given x 0 R n, the algorithm generates a sequence {x k } k with x k+1 = x k α k f(x k ), (4.1) where α k is a suitable step-length. The majority of gradient-type methods are also descent methods, that is, {x k } k is built so that the cost function decreases monotonically on {x k } k, i.e., f(x k+1 ) < f(x k ) for all k, unless the sequence converges in a finite number of steps. The ideal step-length α k satisfies α k = argmin f(x k α f(x k )) (4.2) α ( exact line-search ); however this value is usually unnecessarily expensive to compute for a general nonlinear cost function f. More practical strategies perform an inexact line-search to identify a step-length that achieves adequate reductions in f at minimal cost. The steepest-descent method with exact line-search is commonly considered ineffective because of its slow convergence rate and its oscillatory behaviour, that has been completely understood and studied in the convex quadratic case, see [27]. The BB method provides an alternative strategy for the choice of the step-length; while it does not guarantee the cost function decrease at each step, in some problems, it yields impressive good practical performance. The basic idea of the BB method is to solve, for k 1, the leastsquares problem min t s k t y k 2, (4.3) with s k := x k+1 x k and y k := f(x k+1 ) f(x k ), which, assuming that x k+1 x k, has the

unique solution $t^* = \frac{s_k^T y_k}{s_k^T s_k}$. When $s_k^T y_k > 0$, the BB step-length is chosen to be

    $\alpha^{BB}_{k+1} = \frac{s_k^T s_k}{s_k^T y_k}.$    (4.4)

The overdetermined system $s_k t = y_k$ is in fact a Quasi-Newton secant equation of the type $A_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$, where we require the matrix $A_{k+1} \in \mathbb{R}^{n \times n}$ to be a scalar multiple of the identity matrix, and then treat it as a linear least-squares problem. The Quasi-Newton method thus results in a gradient-type method where second-order information is implicitly embedded in the step-length through a cheap approximation of the Hessian.

As in the Euclidean case, the idea of the Riemannian BB method is to approximate the action of the Riemannian Hessian of $f$ at a certain point by a multiple of the identity. At the $(k+1)$-st step, the Hessian is a linear map from $T_{x_{k+1}}\mathcal{M}$ to $T_{x_{k+1}}\mathcal{M}$. Instead of the increment $x_{k+1} - x_k$, we can consider the vector $\eta_k = -\alpha_k g_k$ belonging to $T_{x_k}\mathcal{M}$, and transport it to $T_{x_{k+1}}\mathcal{M}$, yielding

    $s_k := T_{\eta_k}(\eta_k) = T_{x_k \to x_{k+1}}(-\alpha_k g_k) = -\alpha_k T_{x_k \to x_{k+1}}(g_k).$    (4.5)

We assume here that a vector transport between $x_k$ and $x_{k+1}$ exists. To obtain $y_k$ we need to subtract two gradients that lie in two different tangent spaces. To be coherent with the manifold structure and to work on $T_{x_{k+1}}\mathcal{M}$, this difference should be made after $g_k$ is transported to $T_{x_{k+1}}\mathcal{M}$, so that

    $y_k := g_{k+1} - T_{x_k \to x_{k+1}}(g_k) = g_{k+1} + \frac{1}{\alpha_k} T_{\eta_k}(\eta_k).$    (4.6)

Imposing a scalar multiple of the identity as the approximation of the Hessian, the Riemannian counterpart of the secant equation (4.3) can be written again as $s_k t = y_k$, in the unknown $t \in \mathbb{R}$. The least-squares approximation with respect to the metric of $T_{x_{k+1}}\mathcal{M}$ yields $t^* = \frac{\langle s_k, y_k \rangle_{x_{k+1}}}{\langle s_k, s_k \rangle_{x_{k+1}}}$. Therefore the Riemannian BB step-length has the form

    $\alpha^{BB}_{k+1} = \frac{\langle s_k, s_k \rangle_{x_{k+1}}}{\langle s_k, y_k \rangle_{x_{k+1}}},$    (4.7)

provided that $\langle s_k, y_k \rangle_{x_{k+1}} > 0$.

Remark 4.1. Alternative strategies for the BB step-length have been proposed in Euclidean optimization, see e.g. [2, 11, 30, 15]. A standard choice is the step-length obtained by considering the least-squares problem $\min_t \|s_k - y_k t\|^2$ ([2]) which, by symmetry with respect to (4.3), yields the Riemannian BB step-length

    $\tilde{\alpha}^{BB}_{k+1} = \frac{\langle s_k, y_k \rangle_{x_{k+1}}}{\langle y_k, y_k \rangle_{x_{k+1}}},$    (4.8)

or the one obtained by alternating $\alpha^{BB}_{k+1}$ and $\tilde{\alpha}^{BB}_{k+1}$.
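Once a retraction, a vector transport and the Riemannian metric are available as callables, the update (4.5)-(4.7) amounts to a few lines of code. The following Python sketch is an illustration under that assumption: transport and inner are placeholder function handles (not a specific library API), and the treatment of the case $\langle s_k, y_k \rangle_{x_{k+1}} \le 0$ mirrors the safeguard of Step 6 of Algorithm 1 below.

```python
def riemannian_bb_step(x_k, x_k1, g_k, g_k1, alpha_k, transport, inner,
                       alpha_min=1e-3, alpha_max=1e3):
    """One Barzilai-Borwein step-length update, Riemannian version, cf. (4.5)-(4.7).

    transport(x, y, v): vector transport of v in T_x M to T_y M along the step taken.
    inner(x, u, v):     Riemannian metric <u, v>_x.
    """
    g_k_moved = transport(x_k, x_k1, g_k)      # T_{x_k -> x_{k+1}}(g_k)
    s_k = -alpha_k * g_k_moved                 # transported increment, eq. (4.5)
    y_k = g_k1 - g_k_moved                     # gradient difference, eq. (4.6)
    sy = inner(x_k1, s_k, y_k)
    if sy > 0:
        tau = inner(x_k1, s_k, s_k) / sy       # eq. (4.7)
        return min(alpha_max, max(alpha_min, tau))
    return alpha_max                           # safeguard when <s_k, y_k> <= 0
```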

Remark 4.2. As opposed to the Euclidean case, different alternatives are available for $y_k$. For instance, we can do the subtraction in $T_{x_k}\mathcal{M}$ after $g_{k+1}$ is transported over there. Or, we can transport the two vectors to the tangent space at any point of the geodesic joining $x_k$ and $x_{k+1}$, or even at any point of the manifold reachable from $x_k$ and $x_{k+1}$ through a vector transport. All these alternatives for $y_k$ lead to different versions of the Riemannian BB method. Yet another possibility is to avoid the vector transport completely, identifying $T_{x_k}\mathcal{M}$ with $T_{x_{k+1}}\mathcal{M}$ and thus approximating the transport with the identity. This approximation is justified by the property $T_{0_x}(v) = v$, where $v, 0_x \in T_x\mathcal{M}$ (see [1]), but it is not recommended if one wants to preserve the structure.

4.1. The globalized Riemannian Barzilai-Borwein algorithm. Algorithm 1 summarizes the main steps of the Riemannian BB method with a nonmonotone line-search strategy. Here $\alpha_k^{BB}$ plays the role of $\lambda_k$ used in Section 3 and its updating at Step 5 follows [8]. Moreover, we assume that a retraction $R$ and a vector transport $T$ are supplied for the manifold under consideration.

Algorithm 1 (The Riemannian Barzilai-Borwein with nonmonotone line-search (RBB-NMLS) algorithm).
Set the line-search parameters: the step-length reduction factor $\sigma \in (0,1)$; the sufficient decrease parameter $\gamma \in (0,1)$; the integer parameter for the nonmonotone line-search $M > 0$; the upper and lower bounds for the step-length $\alpha_{\max} > \alpha_{\min} > 0$. Set the initial values: the starting point $x_0 \in \mathcal{M}$; the starting step-length $\alpha_0^{BB} \in [\alpha_{\min}, \alpha_{\max}]$.
1. Compute $f_0 = f(x_0)$ and $g_0 = \nabla^{(R)} f(x_0)$.
2. For $k = 0, 1, 2, \ldots$ find the smallest $h = 0, 1, 2, \ldots$ such that
   $f(R_{x_k}(-\sigma^h \alpha_k^{BB} g_k)) \le \max_{1 \le j \le \min\{k+1, M\}} f_{k+1-j} - \gamma\sigma^h \alpha_k^{BB} \langle g_k, g_k \rangle_{x_k}$
   and set $\alpha_k := \sigma^h \alpha_k^{BB}$.
3. Compute $x_{k+1} = R_{x_k}(-\alpha_k g_k)$, $f_{k+1} = f(x_{k+1})$ and $g_{k+1} = \nabla^{(R)} f(x_{k+1})$.
4. Compute $y_k$ and $s_k$ as in (4.6) and (4.5), respectively.
5. Set $\tau_{k+1} = \frac{\langle s_k, s_k \rangle_{x_{k+1}}}{\langle s_k, y_k \rangle_{x_{k+1}}}$.
6. Set $\alpha_{k+1}^{BB} = \min\{\alpha_{\max}, \max\{\alpha_{\min}, \tau_{k+1}\}\}$ if $\langle s_k, y_k \rangle_{x_{k+1}} > 0$, and $\alpha_{k+1}^{BB} = \alpha_{\max}$ otherwise.

We point out that $\tau_{k+1}$ can be chosen using alternative strategies, as mentioned in Remark 4.1. Moreover, the backtracking strategy considered in Step 2 of Algorithm 1 could be slightly generalized without affecting the global convergence. In particular, the contraction factor $\sigma$ could vary at each iteration of the line-search and, for example, could be chosen by safeguarded interpolation, see Sec. 3.5 in [27].

If $x_k$ is not a stationary point, then, assuming that $R_{x_k}(-\alpha g_k)$ is defined for $\alpha > 0$, by the mean value theorem (compare equation (3.9)) there exists $\xi \in (0, \alpha)$ such that

    $\frac{f(R_{x_k}(-\alpha g_k)) - f(x_k)}{\alpha} + \gamma \langle g_k, g_k \rangle_{x_k} = \langle g(R_{x_k}(-\xi g_k)), DR_{x_k}(-\xi g_k)[-g_k] \rangle_{R_{x_k}(-\xi g_k)} + \gamma \langle g_k, g_k \rangle_{x_k} \xrightarrow{\alpha \to 0} (\gamma - 1)\langle g_k, g_k \rangle_{x_k} < 0,$

since $\gamma < 1$. Thus, for $\alpha$ sufficiently small, we have

    $f(R_{x_k}(-\alpha g_k)) \le f(x_k) - \gamma\alpha \langle g_k, g_k \rangle_{x_k} \le \max_{1 \le j \le \min\{k+1, M\}} \{f(x_{k+1-j})\} - \gamma\alpha \langle g_k, g_k \rangle_{x_k},$

that is, the nonmonotone Armijo condition at Step 2 of Algorithm 1 is verified after a finite number of step reductions.

The practical implementation of the various steps strictly depends on the structure of the cost function $f$ and of the underlying manifold $\mathcal{M}$; further details will be discussed in the next section for the computation of the matrix geometric mean. Other implementation issues are postponed to Section 6. We remark that the use of the nonmonotone strategy requires the additional cost of evaluating the cost function at each trial point $x_k$ (Step 3). Indeed, a pure BB algorithm (i.e., without line-search) consists in setting $x_{k+1} = R_{x_k}(-\alpha_k^{BB} g_k)$ without computing $\alpha_k$ at Step 2. We will refer to RBB-NMLS and to RBB to indicate the Riemannian BB algorithm with and without nonmonotone line-search, respectively. Finally, note that if $M = 1$ then Algorithm 1 corresponds to the standard (monotone) line-search with Armijo's rule.

5. Computing the Karcher mean of positive definite matrices. In this section we recall the Riemannian optimization problem whose solution is the Karcher mean of positive definite matrices, and propose the adaptation of the RBB method for this problem. First, we describe the set of positive definite matrices as a Riemannian manifold.

Let $P_n$ denote the set of positive definite matrices of size $n$ and $H_n$ denote the real vector space of Hermitian matrices. The set $P_n$ is an open subset of the Euclidean space $H_n$ (with the scalar product $\langle E, F \rangle = \mathrm{trace}(EF)$ for $E, F \in H_n$) and thus it is a differentiable manifold, with $T_X(P_n)$ isomorphic to $H_n$ for each $X \in P_n$. Two main Riemannian structures on $P_n$ are commonly used in the literature: the first one is obtained by using the Euclidean scalar product of $H_n$ at each point $X \in P_n$; the second is the one induced by the scalar product

    $\langle E, F \rangle_X = \mathrm{trace}(X^{-1} E X^{-1} F), \qquad E, F \in H_n,\ X \in P_n,$    (5.1)

see Ch. XII in [22] and Sec. 6.3 in [25]. Since the first structure is trivial and the second one does not have an established name, in the following we will refer to the second geometry as the Riemannian structure of $P_n$. The scalar product (5.1) induces a metric on $P_n$ such that the distance between $A, B \in P_n$ is

    $\delta(A, B) = \| \log(A^{-1/2} B A^{-1/2}) \|_F,$    (5.2)

where $\|\cdot\|_F$ is the Frobenius norm. The resulting metric space is complete [3], simply connected and with nonpositive curvature; thus it is an example of a Cartan-Hadamard manifold. There exists a unique geodesic joining $A$ and $B$ in $P_n$, whose natural parametrization is

    $\gamma(t) = A(A^{-1}B)^t = A^{1/2}(A^{-1/2} B A^{-1/2})^t A^{1/2} =: A \#_t B, \qquad t \in [0, 1],$

see [23]. Notice that the geodesic can be indefinitely extended for $t \in \mathbb{R}$. A nice feature of this geometry is that for any set of matrices $A_1, \ldots, A_m \in P_n$ there exists a unique $X \in P_n$ that minimizes the function

    $f(X) := f(X; A_1, \ldots, A_m) = \sum_{k=1}^m \delta^2(X, A_k).$    (5.3)

This point is called the matrix geometric mean or Karcher mean of $A_1, \ldots, A_m$, and it is established as the geometric mean of matrices. Therefore, we state the problem of computing the Karcher mean as the problem of minimizing the above function $f$ over the manifold $P_n$. A class of algorithms developed for this problem can be found in [5] and in [20].
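The distance (5.2) and the cost (5.3) are straightforward to evaluate; the following NumPy/SciPy sketch (my own illustration, not the implementation used in the paper) exploits the fact that the eigenvalues of $A^{-1/2} B A^{-1/2}$ coincide with the generalized eigenvalues of the pencil $(B, A)$. The symmetry check at the end is only illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def delta(A, B):
    """Affine-invariant distance (5.2): ||log(A^{-1/2} B A^{-1/2})||_F,
    computed from the generalized eigenvalues of (B, A)."""
    lam = eigh(B, A, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def karcher_cost(X, As):
    """Cost function (5.3): f(X) = sum_k delta^2(X, A_k)."""
    return sum(delta(X, A) ** 2 for A in As)

# Symmetry of the distance: karcher_cost(A, [A, B]) = delta(A, B)^2 = karcher_cost(B, [A, B]).
rng = np.random.default_rng(1)
def rand_spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)
A, B = rand_spd(4), rand_spd(4)
print(karcher_cost(A, [A, B]), karcher_cost(B, [A, B]))
```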

Since the cost function $f$ in (5.3) is smooth, the corresponding derivative can be computed and two different gradients can be defined, one for each of the considered geometries, i.e., the Euclidean and the Riemannian geometry. A tedious computation proves that the gradient with respect to the Euclidean geometry, that is, the unique matrix $\nabla^{(E)} f(X)$ such that $Df(X)[H] = \mathrm{trace}(\nabla^{(E)} f(X) H)$, is

    $\nabla^{(E)} f(X) = 2 \sum_{k=1}^m X^{-1} \log(X A_k^{-1}),$

while the gradient with respect to the Riemannian structure, namely the Riemannian gradient, defined as the unique Hermitian matrix $\nabla^{(R)} f(X)$ such that $Df(X)[H] = \langle \nabla^{(R)} f(X), H \rangle_X$ for $H \in H_n$, turns out to be

    $\nabla^{(R)} f(X) = -2 \sum_{k=1}^m X \log(X^{-1} A_k),$    (5.4)

see [24]. Notice that $\nabla^{(E)} f(X) = 0$ if and only if $\nabla^{(R)} f(X) = 0$.

An interesting property of the Riemannian geometry of $P_n$ is that the function $f$ in (5.3) is geodesically strictly convex with respect to the Riemannian geometry of $P_n$, that is, for any $A, B \in P_n$ with $A \ne B$ we have (see [3])

    $f(A \#_t B) < (1 - t) f(A) + t f(B), \qquad t \in (0, 1);$

on the contrary, it can be shown that $f$ is not convex in the Euclidean sense.

The application of Algorithm 1 to the Karcher mean computation requires the definition of the retraction $R$ and vector transport $T$ mappings for the manifold $P_n$. Since the final point of a geodesic starting at $A$ and with tangent vector $V$ is $A \exp(A^{-1} V)$, the retraction at the point $A$ has the form $R_A(V) = A \exp(A^{-1} V)$. It follows that the iterate update of RBB is

    $X_{k+1} = R_{X_k}(-\alpha_k^{BB} g_k) = X_k \exp(-\alpha_k^{BB} X_k^{-1} g_k),$    (5.5)

where $\alpha_k^{BB}$ is defined by (4.7) (or in an alternative way, as discussed in Remark 4.1) and $s_k$ and $y_k$ are given below. Notice that the retraction (5.5) is the exponential map with respect to the Riemannian geometry of $P_n$, and thus the vector transport is given by the parallel transport. In the Riemannian geometry of $P_n$, the parallel transport of a vector $V \in H_n$ from the tangent space at $A$ to the tangent space at $B$ is given by (see [32])

    $T_{A \to B}(V) = (B A^{-1})^{1/2} V (A^{-1} B)^{1/2}.$

Notice that the parallel transport is a congruence of the type $M V M^*$, with $M = (B A^{-1})^{1/2}$. Using the $\mathrm{vec}(\cdot)$ operator, which stacks the columns of a matrix into a long vector, and the Kronecker product, we get the simple expression

    $T_{A \to B}\, \mathrm{vec}(V) = (\overline{M} \otimes M)\, \mathrm{vec}(V), \qquad M = (B A^{-1})^{1/2}.$

Alternatively, we can also write $T_{A \to B}(V) = (A \#_{1/2} B) A^{-1} V A^{-1} (A \#_{1/2} B)$.

In view of (4.5) and (4.6), it is sufficient to transport a vector along itself or, equivalently, along the geodesic tangent to it. We now derive the formula for this case, which is simpler than those above.

Let $V_A \in T_A P_n$ and let $B := R_A(V_A) = A \exp(A^{-1} V_A)$; then $A^{-1} B = \exp(A^{-1} V_A)$ and

    $T_{V_A}(V_A) = T_{A \to B}(V_A) = A (A^{-1} B)^{1/2} A^{-1} V_A (A^{-1} B)^{1/2} = A \exp(A^{-1} V_A)^{1/2} A^{-1} V_A \exp(A^{-1} V_A)^{1/2} = V_A A^{-1} R_A(V_A) = V_A \exp(A^{-1} V_A),$    (5.6)

where we have used standard properties of matrix functions (compare Theorem 1.13 in [18]). The formula (5.6) yields simple expressions for the $s_k$ and $y_k$ of equations (4.5) and (4.6), respectively,

    $s_k = -\alpha_k g_k \exp(-\alpha_k X_k^{-1} g_k), \qquad y_k = g_{k+1} - g_k \exp(-\alpha_k X_k^{-1} g_k),$

which characterize the Riemannian BB method for the Karcher mean computation.

As a final remark, we observe that the retraction used in the Riemannian geometry of the positive definite matrices is defined in the whole tangent bundle and that, for any $X_0 \in P_n$, the level set $\mathcal{M}_{(X_0)} = \{X \in P_n : f(X) \le f(X_0)\}$ is a compact set (compare the proof of Theorem 2.1 in [7]). Thus, this problem fits the hypotheses of Theorem 3.1.

5.1. Implementation of the RBB method for the Karcher mean. The RBB method requires the computation of several quantities of the type $A \varphi(A^{-1} B)$, where $A$ is positive definite, $B$ is Hermitian and $\varphi$ is a real function. This is indeed the case for the Riemannian gradient (5.4) and the retraction (5.5), while formula (5.6) is fairly similar. Following [19], these quantities can be efficiently computed by forming the Cholesky factorization $A = R_A^* R_A$. Using the similarity invariance of matrix functions, we write

    $A \varphi(A^{-1} B) = R_A^* R_A \varphi(R_A^{-1} (R_A^*)^{-1} B) = R_A^* \varphi((R_A^*)^{-1} B R_A^{-1}) R_A,$

and observe that $(R_A^*)^{-1} B R_A^{-1} = U D U^*$, where $D = (d_{ij})$ is diagonal real and $U$ is unitary, so that we obtain

    $A \varphi(A^{-1} B) = R_A^* U\, \mathrm{diag}(\varphi(d_{11}), \ldots, \varphi(d_{nn}))\, U^* R_A.$

The cost of this basic operation is approximately $\gamma_1 n^3$, where $\gamma_1$ is a moderate constant. Since one step of the RBB algorithm requires $m + 1$ evaluations of the type $A \varphi(A^{-1} B)$ ($m$ are needed to compute the gradient and one for the retraction), the cost of the algorithm is about $\gamma_1 (m+1) n^3 k$ arithmetic operations (ops), where $k$ is the number of steps needed to achieve convergence, while $m$ is the number of matrices and $n$ is their size. For comparison, the Richardson algorithm in [5], the steepest descent with fixed step-length and the Majorization-Minimization algorithm in [34] have essentially the same cost.

Regarding the cost of the version of the RBB algorithm with the nonmonotone line-search, we need to take into account the number of ops needed to compute the distance (5.2) during the cost function evaluation. By a similar use of Cholesky factorizations and an eigenvalue computation, the distance can be computed with $\gamma_2 n^3$ ops, where $\gamma_2$ is smaller than $\gamma_1$. Overall, the cost function evaluation requires $\gamma_2 m n^3$ ops. This is not negligible with respect to the cost of a single step of the RBB algorithm. Hence algorithms without line-search are usually more effective, when the number of steps performed is the same.

6. Experiments. In this section we explore the behaviour of the Riemannian BB method (RBB) and the Riemannian BB method with nonmonotone line-search (RBB-NMLS) for the solution of various optimization problems on Riemannian manifolds. Section 6.1 is devoted to the computation of the Karcher mean of a set of positive definite matrices. To this purpose, we consider several test matrices constructed with the random function of the Matrix Means Toolbox ([6]).

These tests have been performed using MATLAB R2011b on a machine with a double Intel Xeon X5680 processor and 24 GB of RAM. The experiments in Section 6.2 are carried out using Manopt ([10]), a toolbox which provides a large library of manifolds and ready-to-use Riemannian optimization algorithms, together with the flexibility of including user-defined solvers. These experiments are run using MATLAB R2015a on a machine with a double Intel Xeon E v2 processor and 128 GB of DDR3 RAM.

6.1. Tests on the Karcher mean computation. We investigate the performance of the Riemannian BB method (RBB) in comparison with four other first-order optimization algorithms for computing the Karcher mean of positive definite matrices: the Richardson iteration (Richardson) proposed in [5], the Majorization-Minimization algorithm (MajMin) of [34], and the standard Steepest-Descent (SD) and the Fletcher-Reeves variant of the Conjugate Gradient (CG FR) algorithms described in [10, 20]. Comparisons with the (Euclidean) Barzilai-Borwein algorithm (BB NoRiem) are also included.

For the SD and CG FR algorithms we implemented the Riemannian Armijo line-search (3.1), where the parameters $\lambda_k = 1$, $\sigma = 0.5$ and $\gamma = 0.5$ are chosen in accordance with [20]; the maximum number of reductions has been set to 10. For the Conjugate Gradient algorithm we have chosen the Fletcher-Reeves variant because in our experience it showed slightly better results. The implementation of the standard Barzilai-Borwein algorithm (BB NoRiem) consists in using the Euclidean gradient as a search direction, avoiding the parallel transport and using the Euclidean scalar product in the definition of $\alpha_k$; moreover, in order to ensure that all iterates are positive definite, the iteration update has been performed using the retraction, i.e., setting $X_{k+1} = R_{X_k}(-\alpha_k g_k)$.

We compare the number of iterations needed for convergence, since all the algorithms require roughly the same computational cost at each step, with the exception of SD and CG FR, where some extra cost function evaluations are needed. RBB has been implemented following Algorithm 1. We have considered only the RBB without nonmonotone line-search, since we have noticed that for this problem the line-search is not required. The first step-length $\alpha_0^{BB}$ for the RBB has been chosen using the strategy of the Richardson iteration in [5]. The step-length has been computed using (4.7) in both RBB and BB NoRiem. The updating (4.8) and the alternating strategy for the step-length described in Remark 4.1 have been tested as well, giving essentially the same results (not reported here).

As a first test, we consider the three matrices

    A_1 = , A_2 = , A_3 = ,    (6.1)

and we run the algorithms for the Karcher mean for a fixed number of steps and with no stopping criterion. At each step of each algorithm, we compute the relative error of the current value $X_k$, that is

    $\varepsilon_k = \frac{\| X_k - K \|_2}{\| K \|_2},$

where $K$ is a reference value of the Karcher mean obtained using variable precision arithmetic in MATLAB and rounded to 16 significant digits. The results are presented in Figure 6.1 and show that RBB works very well in this example and outperforms the other first-order algorithms. Notice that BB NoRiem does not exploit the geometric structure of the problem and thus gives worse results, converging in a nonmonotone way and in a larger number of steps.
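The ingredients described in Section 5 combine into a short iteration. The following Python/NumPy sketch is an illustration written for this transcription (not the authors' MATLAB implementation): it computes the basic operation $A\varphi(A^{-1}B)$ through a Cholesky factorization as in Section 5.1, the gradient (5.4), the retraction (5.5) and the BB step-length (4.7) with $s_k$ and $y_k$ as above. The random data and the initial step-length $1/(2m)$ at the end are placeholder choices, since the matrices (6.1) and the Richardson-based initialization of [5] are not reproduced here.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

def a_phi(A, B, phi):
    """A * phi(A^{-1} B) for A positive definite and B symmetric (cf. Section 5.1).
    With A = L L^T and L^{-1} B L^{-T} = U diag(d) U^T, similarity invariance gives
    A phi(A^{-1} B) = (L U) diag(phi(d)) (L U)^T."""
    L = cholesky(A, lower=True)
    M = solve_triangular(L, solve_triangular(L, B, lower=True).T, lower=True).T
    d, U = eigh((M + M.T) / 2)                      # symmetrize against rounding
    W = L @ U
    return (W * phi(d)) @ W.T

def karcher_grad(X, As):
    """Riemannian gradient (5.4): -2 * sum_k X log(X^{-1} A_k)."""
    return -2 * sum(a_phi(X, A, np.log) for A in As)

def rbb_karcher(As, X0, alpha0, tol=1e-10, max_iter=100):
    """Riemannian BB iteration without line-search for the Karcher mean (Section 5)."""
    X, g, alpha = X0, karcher_grad(X0, As), alpha0
    for _ in range(max_iter):
        eta = -alpha * g                            # step in T_X P_n
        X_new = a_phi(X, eta, np.exp)               # retraction (5.5): X exp(X^{-1} eta)
        E = np.linalg.solve(X, X_new)               # exp(X^{-1} eta)
        s = eta @ E                                 # s_k = eta exp(X^{-1} eta), cf. (5.6)
        g_new = karcher_grad(X_new, As)
        y = g_new + s / alpha                       # y_k = g_{k+1} - T(g_k), with T(g_k) = -s_k/alpha_k
        Zs, Zy = np.linalg.solve(X_new, s), np.linalg.solve(X_new, y)
        ss, sy = np.trace(Zs @ Zs), np.trace(Zs @ Zy)             # metric (5.1) inner products
        alpha = min(1e3, max(1e-3, ss / sy)) if sy > 0 else 1e3   # Step 6 of Algorithm 1
        X, g = X_new, g_new
        Zg = np.linalg.solve(X, g)
        if np.sqrt(np.trace(Zg @ Zg)) < tol:        # stop on the Riemannian gradient norm
            break
    return X

# Placeholder data: four random 3 x 3 positive definite matrices, arithmetic mean as X0.
rng = np.random.default_rng(0)
G = [rng.standard_normal((3, 3)) for _ in range(4)]
As = [M @ M.T + 3 * np.eye(3) for M in G]
K = rbb_karcher(As, X0=sum(As) / len(As), alpha0=1 / (2 * len(As)))
```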
Fig. 6.1. Comparison of different first-order Riemannian optimization methods and the (Euclidean) Barzilai-Borwein algorithm for the Karcher mean of the matrices in (6.1) (relative error versus iteration).

As a second test, we investigate how convergence depends on the initial value. We consider the matrices (6.1) with four different initial values: the arithmetic mean of the data, the Cheap mean of the data, known to be a good approximation of the Karcher mean ([4]), one of the data matrices, and an ill-conditioned matrix of norm 1. The results are presented in Figure 6.2. The relative behaviour of the methods is mostly independent of the initial value in the later phase, while there can be some differences in the early steps. In all cases RBB performs better than the other methods. The step-length choice of the Richardson method is based on optimizing the rate of convergence near the fixed point (see [5]), and therefore the method is penalized whenever the initial value is far from the Karcher mean.

As a third test, we run the algorithms on 4 different data sets: 100 random matrices, obtained with the command random(10) of the Matrix Means Toolbox; 10 random matrices, obtained with the command random(100); 10 ill-conditioned matrices, with condition number $10^5$, obtained with the command random(10,2,1e5); 10 ill-conditioned matrices clustered far from the identity, obtained with the command random(10,2,1e5)*0.01+A, where A is a fixed matrix generated as in data set 3. In Figure 6.3 we show the relative error with respect to a reference approximation of the Karcher mean computed with variable precision arithmetic. We observe that the convergence behaviour of RBB is always the best, while the performance of Richardson, MajMin, SD and CG FR depends on the test case. In particular, failures of SD and CG FR are due to the computation of tiny steps that prevent progress in the iteration. We remark that, since the data are randomly generated, each test has been repeated several times and we have always obtained the qualitative behaviour shown in Figure 6.3.
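The test matrices above are produced by the Matrix Means Toolbox command random, whose implementation is not reproduced here. As a rough stand-in for experimenting with the sketches above, a random positive definite matrix with a prescribed condition number can be generated as follows (my own construction, not the toolbox's algorithm).

```python
import numpy as np

def random_spd(n, cond=1.0e5, rng=None):
    """Random n x n symmetric positive definite matrix with condition number ~cond:
    Q diag(lambda) Q^T with a random orthogonal Q and log-spaced eigenvalues."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal factor
    lam = np.logspace(0, np.log10(cond), n)            # eigenvalues from 1 to cond
    return (Q * lam) @ Q.T

# e.g. a set of 10 ill-conditioned 10 x 10 matrices, loosely mimicking data set 3:
As = [random_spd(10, 1e5) for _ in range(10)]
```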

Fig. 6.2. Comparison of different first-order optimization methods for the Karcher mean of the matrices in (6.1), using different initial values: the arithmetic mean (top-left), the Cheap mean (top-right), one of the data matrices (bottom-left) and a matrix with large condition number and unit norm (bottom-right).

6.2. Testing RBB using Manopt. We consider the trust-region (TR) and steepest-descent (with monotone line-search and Armijo step-length) (SD) algorithms available in Manopt ([10]) and, for comparison, we consider an implementation of our RBB algorithm obtained by modifying the steepest-descent implementation of Manopt. This allows us to use the execution time as a fair measure of algorithmic comparison. In our tests, we use the RBB method without line-search (RBB) and with nonmonotone line-search (RBB-NMLS) as described in Algorithm 1, both using the step-length (4.7). Default parameters, stopping criteria and starting guess choices provided in Manopt have been used. Regarding the parameter setting of RBB-NMLS, we set $M = 10$, $\gamma = 10^{-4}$, $\alpha_{\min} = 10^{-3}$, $\alpha_{\max} = 10^{3}$ and $\sigma = 0.5$ in Algorithm 1, as indicated in [8]. The vectors $s_k$ and $y_k$ at Step 6 are computed using the vector transport functions provided by Manopt for both RBB and RBB-NMLS. Moreover, the initial $\alpha_0^{BB}$ is chosen so that the first trial point $x_1$ is the same as the one computed by SD.

We consider all test problems available in [10] and follow the formulations and implementations provided therein. In particular, Table 6.1 summarizes the test problems together with a brief description and the various tested sizes. Table 6.2 collects the results in terms of average number of iterations Av-it, average execution time in seconds Av-cpu and number of failures F computed over 10 repeated runs.

Fig. 6.3. Comparison of different first-order optimization methods for the Karcher mean using different data sets: 100 random matrices (top-left), 10 random matrices (top-right), 10 ill-conditioned matrices (bottom-left) and 10 ill-conditioned matrices clustered far from the identity (bottom-right).

In the presence of failures within the 10 runs, the average values are computed taking into account successful runs for all solvers only. The overall number of runs is 240. The comments below follow from Table 6.2:
- As expected, the use of the nonmonotone strategy makes the RBB procedure much more robust: overall, RBB and RBB-NMLS fail 33 and 3 times, respectively. Clearly, when RBB is successful, that is, when the nonmonotone strategy is not needed and therefore not activated in RBB-NMLS, RBB is faster than RBB-NMLS, since there is no need to compute and store the cost function at each iteration.
- Focusing on first-order procedures, the use of the BB step-length significantly enhances the speed of convergence of the gradient-type algorithms: on average, SD takes many more iterations than RBB and RBB-NMLS, resulting in a much higher computational time. This behaviour is especially evident in the solution of KM, where the Av-cpu of SD is at least a factor of 6 larger than that of RBB.
- Regarding the comparison of RBB with the second-order procedures, we note that the behaviour of TR and RBB-NMLS is comparable. For KM and increasing values of n, RBB is faster than TR since, as expected, the computation of the Hessian of the problem cost function becomes increasingly costly.

Name | Sizes | Description | Manifold
KM | n = 50; n = 100; n = 200 | Computing the Karcher Mean of a set of n x n positive definite matrices. | SPD
SPCA | n = 100, p = 10, m = 2; n = 500, p = 15, m = 5; n = 1000, p = 30, m = 10 | Computing the m Sparse Principal Components of a p x n matrix encoding p samples of n variables. | Stiefel
SP | n = 8; n = 12; n = 24 | Finding the largest diameter of n equal circles that can be placed on the sphere without overlap (the Sphere Packing problem). | Product of spheres
DIS | n = 128; n = 500; n = 1000 | Finding an orthonormal basis of the Dominant Invariant 3-Subspace of an n x n matrix. | Grassmann
LRMC | n = 100, p = 10, m = 2; n = 500, p = 15, m = 5; n = 1000, p = 30, m = 10 | Low-Rank Matrix Completion: given a partial observation of an m x n matrix of rank p, attempting to complete it. | Fixed-rank matrices
MC | n = 20; n = 100; n = 300 | Max-Cut: given an n x n Laplacian matrix of a graph, finding the max-cut, or an upper bound on the maximum-cut value. | SPD matrices of rank 2
TSVD | n = 60, m = 42, p = 5; n = 100, m = 60, p = 7; n = 200, m = 70, p = 8 | Computing the SVD decomposition of an m x n matrix truncated to rank p. | Grassmann
GP | m = 10, N = 50; m = 50, N = 100; m = 50, N = 500 | Generalized Procrustes: rotationally align clouds of points. Data: matrix A in R^{3 x m x N}, each slice is a cloud of m points in R^3. | Product of rotations with the Euclidean space for A

Table 6.1. Test problems and manifold structure provided in Manopt.

     | TR              | SD              | RBB             | RBB-NMLS
     | Av-it Av-cpu F  | Av-it Av-cpu F  | Av-it Av-cpu F  | Av-it Av-cpu F
KM   |
SPCA |
SP   |
DIS  |
LRMC |
MC   |
TSVD |
GP   |

Table 6.2. Results on problems in Table 6.1.

Fig. 6.4. CPU time performance profiles on the problems in Table 6.1. Percentage of tests solved by each solver: TR 100%, SD LS 59.6%, RBB 85.8%, RBB-NMLS 98.7%.

We also summarize the results by plotting the performance profile function $\pi_S$ defined as

    $\pi_S(\tau) = \frac{\text{number of problems s.t. } q_{P,S} \le \tau\, q_P}{\text{number of problems}}, \qquad \tau \ge 1,$

that is, the probability for solver $S$ that a performance ratio $q_{P,S}/q_P$ is within a factor $\tau$ of the best possible ratio ([12]). Here $q_{P,S}$ denotes the computational effort of the solver $S$ to solve problem $P$, and $q_P$ is the computational effort of the best solver to solve problem $P$. Note that $\pi_S(1)$ is the fraction of problems for which solver $S$ performs best, $\pi_S(2)$ gives the fraction of problems for which the algorithm is within a factor of 2 of the best algorithm, and, for $\tau$ sufficiently large, $\pi_S(\tau)$ is the fraction of problems solved by $S$. In Figure 6.4 we consider the total CPU time as the measure of computational effort for the compared solvers and plot $\pi_S(\tau)$ for $\tau \le 6$ in order to zoom in on the behaviour of the solvers for small values of $\tau$. Therefore, we include in the legend the percentage of tests solved by each solver (retrievable in the plot for large values of $\tau$). We observe that RBB is the most efficient for 60% of the runs, TR and RBB-NMLS are comparable in both robustness and efficiency, and SD shows the worst performance.

7. Conclusions. We have adapted the Barzilai-Borwein method to the framework of Riemannian optimization. The resulting algorithm is competitive in several cases, since it requires just first-order information, while it usually converges faster than other first-order algorithms such as the steepest-descent method. We have also provided a general convergence result for gradient-related Riemannian optimization algorithms, when a nonmonotone line-search is considered. This strategy can be applied, in


More information

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods Quasi-Newton Methods General form of quasi-newton methods: x k+1 = x k α

More information

Numerical Optimization of Partial Differential Equations

Numerical Optimization of Partial Differential Equations Numerical Optimization of Partial Differential Equations Part I: basic optimization concepts in R n Bartosz Protas Department of Mathematics & Statistics McMaster University, Hamilton, Ontario, Canada

More information

On efficiency of nonmonotone Armijo-type line searches

On efficiency of nonmonotone Armijo-type line searches Noname manuscript No. (will be inserted by the editor On efficiency of nonmonotone Armijo-type line searches Masoud Ahookhosh Susan Ghaderi Abstract Monotonicity and nonmonotonicity play a key role in

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Numerical optimization

Numerical optimization Numerical optimization Lecture 4 Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Numerical geometry of non-rigid shapes Stanford University, Winter 2009 2 Longest Slowest Shortest Minimal Maximal

More information

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems Outline Scientific Computing: An Introductory Survey Chapter 6 Optimization 1 Prof. Michael. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Residual iterative schemes for largescale linear systems

Residual iterative schemes for largescale linear systems Universidad Central de Venezuela Facultad de Ciencias Escuela de Computación Lecturas en Ciencias de la Computación ISSN 1316-6239 Residual iterative schemes for largescale linear systems William La Cruz

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012. Math 5620 - Introduction to Numerical Analysis - Class Notes Fernando Guevara Vasquez Version 1990. Date: January 17, 2012. 3 Contents 1. Disclaimer 4 Chapter 1. Iterative methods for solving linear systems

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Spectral gradient projection method for solving nonlinear monotone equations

Spectral gradient projection method for solving nonlinear monotone equations Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search

More information

II. DIFFERENTIABLE MANIFOLDS. Washington Mio CENTER FOR APPLIED VISION AND IMAGING SCIENCES

II. DIFFERENTIABLE MANIFOLDS. Washington Mio CENTER FOR APPLIED VISION AND IMAGING SCIENCES II. DIFFERENTIABLE MANIFOLDS Washington Mio Anuj Srivastava and Xiuwen Liu (Illustrations by D. Badlyans) CENTER FOR APPLIED VISION AND IMAGING SCIENCES Florida State University WHY MANIFOLDS? Non-linearity

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

Chapter 8 Gradient Methods

Chapter 8 Gradient Methods Chapter 8 Gradient Methods An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Introduction Recall that a level set of a function is the set of points satisfying for some constant. Thus, a point

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

On Lagrange multipliers of trust-region subproblems

On Lagrange multipliers of trust-region subproblems On Lagrange multipliers of trust-region subproblems Ladislav Lukšan, Ctirad Matonoha, Jan Vlček Institute of Computer Science AS CR, Prague Programy a algoritmy numerické matematiky 14 1.- 6. června 2008

More information

Tangent spaces, normals and extrema

Tangent spaces, normals and extrema Chapter 3 Tangent spaces, normals and extrema If S is a surface in 3-space, with a point a S where S looks smooth, i.e., without any fold or cusp or self-crossing, we can intuitively define the tangent

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The notion of complexity (per iteration)

More information

Extending a two-variable mean to a multi-variable mean

Extending a two-variable mean to a multi-variable mean Extending a two-variable mean to a multi-variable mean Estelle M. Massart, Julien M. Hendrickx, P.-A. Absil Universit e catholique de Louvain - ICTEAM Institute B-348 Louvain-la-Neuve - Belgium Abstract.

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Trust-region methods for rectangular systems of nonlinear equations

Trust-region methods for rectangular systems of nonlinear equations Trust-region methods for rectangular systems of nonlinear equations Margherita Porcelli Dipartimento di Matematica U.Dini Università degli Studi di Firenze Joint work with Maria Macconi and Benedetta Morini

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Convergence of a Two-parameter Family of Conjugate Gradient Methods with a Fixed Formula of Stepsize

Convergence of a Two-parameter Family of Conjugate Gradient Methods with a Fixed Formula of Stepsize Bol. Soc. Paran. Mat. (3s.) v. 00 0 (0000):????. c SPM ISSN-2175-1188 on line ISSN-00378712 in press SPM: www.spm.uem.br/bspm doi:10.5269/bspm.v38i6.35641 Convergence of a Two-parameter Family of Conjugate

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

4 damped (modified) Newton methods

4 damped (modified) Newton methods 4 damped (modified) Newton methods 4.1 damped Newton method Exercise 4.1 Determine with the damped Newton method the unique real zero x of the real valued function of one variable f(x) = x 3 +x 2 using

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT -09 Computational and Sensitivity Aspects of Eigenvalue-Based Methods for the Large-Scale Trust-Region Subproblem Marielba Rojas, Bjørn H. Fotland, and Trond Steihaug

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

DISSERTATION MEAN VARIANTS ON MATRIX MANIFOLDS. Submitted by. Justin D Marks. Department of Mathematics. In partial fulfillment of the requirements

DISSERTATION MEAN VARIANTS ON MATRIX MANIFOLDS. Submitted by. Justin D Marks. Department of Mathematics. In partial fulfillment of the requirements DISSERTATION MEAN VARIANTS ON MATRIX MANIFOLDS Submitted by Justin D Marks Department of Mathematics In partial fulfillment of the requirements For the Degree of Doctor of Philosophy Colorado State University

More information

THEODORE VORONOV DIFFERENTIABLE MANIFOLDS. Fall Last updated: November 26, (Under construction.)

THEODORE VORONOV DIFFERENTIABLE MANIFOLDS. Fall Last updated: November 26, (Under construction.) 4 Vector fields Last updated: November 26, 2009. (Under construction.) 4.1 Tangent vectors as derivations After we have introduced topological notions, we can come back to analysis on manifolds. Let M

More information

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005 3 Numerical Solution of Nonlinear Equations and Systems 3.1 Fixed point iteration Reamrk 3.1 Problem Given a function F : lr n lr n, compute x lr n such that ( ) F(x ) = 0. In this chapter, we consider

More information

Step-size Estimation for Unconstrained Optimization Methods

Step-size Estimation for Unconstrained Optimization Methods Volume 24, N. 3, pp. 399 416, 2005 Copyright 2005 SBMAC ISSN 0101-8205 www.scielo.br/cam Step-size Estimation for Unconstrained Optimization Methods ZHEN-JUN SHI 1,2 and JIE SHEN 3 1 College of Operations

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell DAMTP 2014/NA02 On fast trust region methods for quadratic models with linear constraints M.J.D. Powell Abstract: Quadratic models Q k (x), x R n, of the objective function F (x), x R n, are used by many

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

Step lengths in BFGS method for monotone gradients

Step lengths in BFGS method for monotone gradients Noname manuscript No. (will be inserted by the editor) Step lengths in BFGS method for monotone gradients Yunda Dong Received: date / Accepted: date Abstract In this paper, we consider how to directly

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Handling nonpositive curvature in a limited memory steepest descent method

Handling nonpositive curvature in a limited memory steepest descent method IMA Journal of Numerical Analysis (2016) 36, 717 742 doi:10.1093/imanum/drv034 Advance Access publication on July 8, 2015 Handling nonpositive curvature in a limited memory steepest descent method Frank

More information

Sample ECE275A Midterm Exam Questions

Sample ECE275A Midterm Exam Questions Sample ECE275A Midterm Exam Questions The questions given below are actual problems taken from exams given in in the past few years. Solutions to these problems will NOT be provided. These problems and

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

GRADIENT = STEEPEST DESCENT

GRADIENT = STEEPEST DESCENT GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT

More information

, b = 0. (2) 1 2 The eigenvectors of A corresponding to the eigenvalues λ 1 = 1, λ 2 = 3 are

, b = 0. (2) 1 2 The eigenvectors of A corresponding to the eigenvalues λ 1 = 1, λ 2 = 3 are Quadratic forms We consider the quadratic function f : R 2 R defined by f(x) = 2 xt Ax b T x with x = (x, x 2 ) T, () where A R 2 2 is symmetric and b R 2. We will see that, depending on the eigenvalues

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

Pseudo-Poincaré Inequalities and Applications to Sobolev Inequalities

Pseudo-Poincaré Inequalities and Applications to Sobolev Inequalities Pseudo-Poincaré Inequalities and Applications to Sobolev Inequalities Laurent Saloff-Coste Abstract Most smoothing procedures are via averaging. Pseudo-Poincaré inequalities give a basic L p -norm control

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

17 Solution of Nonlinear Systems

17 Solution of Nonlinear Systems 17 Solution of Nonlinear Systems We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1 Let D = {x 1, x 2,..., x m

More information

Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

Scaled gradient projection methods in image deblurring and denoising

Scaled gradient projection methods in image deblurring and denoising Scaled gradient projection methods in image deblurring and denoising Mario Bertero 1 Patrizia Boccacci 1 Silvia Bonettini 2 Riccardo Zanella 3 Luca Zanni 3 1 Dipartmento di Matematica, Università di Genova

More information

Chapter 10 Conjugate Direction Methods

Chapter 10 Conjugate Direction Methods Chapter 10 Conjugate Direction Methods An Introduction to Optimization Spring, 2012 1 Wei-Ta Chu 2012/4/13 Introduction Conjugate direction methods can be viewed as being intermediate between the method

More information

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION JOEL A. TROPP Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional

More information

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Oleg Burdakov a,, Ahmad Kamandi b a Department of Mathematics, Linköping University,

More information

Matrix Derivatives and Descent Optimization Methods

Matrix Derivatives and Descent Optimization Methods Matrix Derivatives and Descent Optimization Methods 1 Qiang Ning Department of Electrical and Computer Engineering Beckman Institute for Advanced Science and Techonology University of Illinois at Urbana-Champaign

More information

Course 212: Academic Year Section 1: Metric Spaces

Course 212: Academic Year Section 1: Metric Spaces Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........

More information

Review of Classical Optimization

Review of Classical Optimization Part II Review of Classical Optimization Multidisciplinary Design Optimization of Aircrafts 51 2 Deterministic Methods 2.1 One-Dimensional Unconstrained Minimization 2.1.1 Motivation Most practical optimization

More information

Keywords: Nonlinear least-squares problems, regularized models, error bound condition, local convergence.

Keywords: Nonlinear least-squares problems, regularized models, error bound condition, local convergence. STRONG LOCAL CONVERGENCE PROPERTIES OF ADAPTIVE REGULARIZED METHODS FOR NONLINEAR LEAST-SQUARES S. BELLAVIA AND B. MORINI Abstract. This paper studies adaptive regularized methods for nonlinear least-squares

More information

Directional Field. Xiao-Ming Fu

Directional Field. Xiao-Ming Fu Directional Field Xiao-Ming Fu Outlines Introduction Discretization Representation Objectives and Constraints Outlines Introduction Discretization Representation Objectives and Constraints Definition Spatially-varying

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M =

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = CALCULUS ON MANIFOLDS 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = a M T am, called the tangent bundle, is itself a smooth manifold, dim T M = 2n. Example 1.

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information