Max-Planck-Institut für Mathematik in den Naturwissenschaften Leipzig

Convergence analysis of Riemannian Gauss–Newton methods and its connection with the geometric condition number

by Paul Breiding and Nick Vannieuwenhoven

Preprint no.: 69, 2017

Convergence analysis of Riemannian Gauss–Newton methods and its connection with the geometric condition number

Paul Breiding¹ (Max-Planck-Institute for Mathematics in the Sciences, Leipzig, Germany)
Nick Vannieuwenhoven² (KU Leuven, Department of Computer Science, Leuven, Belgium)

Abstract. We obtain estimates of the multiplicative constants appearing in local convergence results of the Riemannian Gauss–Newton method for least squares problems on manifolds and relate them to the geometric condition number of [P. Bürgisser and F. Cucker, Condition: The Geometry of Numerical Algorithms, 2013].

Keywords: Riemannian Gauss–Newton method, convergence analysis, geometric condition number, CPD

1. Introduction

Many problems in science and engineering are parameter identification problems (PIPs). Herein, there is a parameter domain $\mathcal{M} \subseteq \mathbb{R}^M$ and a function $\Phi : \mathcal{M} \to \mathbb{R}^N$. Given a point $y$ in the image of $\Phi$, the PIP asks to identify parameters $x \in \mathcal{M}$ such that $y = \Phi(x)$; note that there could be several such parameters. For example, computing the QR, LU, Cholesky, polar, singular value and eigendecompositions of a given matrix $A \in \mathbb{R}^{m \times n} \cong \mathbb{R}^{mn}$ are instances of this. In other cases we have a tensor $\mathcal{A} \in \mathbb{R}^{m_1 \times \cdots \times m_d}$ and need to compute CP, Tucker, block term, hierarchical Tucker, or tensor train decompositions [7].

If the object $\widetilde{y} \in \mathbb{R}^N$ whose parameters should be identified originates from applications, then usually $\widetilde{y} \notin \Phi(\mathcal{M})$. Nevertheless, in this setting one seeks parameters $x \in \mathcal{M}$ such that $y := \Phi(x)$ is as close as possible to $\widetilde{y}$, e.g., in the Euclidean norm. This can be formulated as a nonlinear least squares problem:

$\min_{x \in \mathcal{M}} \ \tfrac{1}{2}\,\|\Phi(x) - \widetilde{y}\|^2.$  (1)

Here, we deal with functions $\Phi$ that offer differentiability guarantees, so that continuous optimization methods can be employed for solving (1). Specifically, we assume that $\mathcal{M}$ is a smooth embedded submanifold³ of $\mathbb{R}^M$ and that $\Phi$ is a smooth function on $\mathcal{M}$ [11, Chapters 1 and 2]. Hence, (1) is a Riemannian optimization problem that can be solved using, e.g., Riemannian Gauss–Newton (RGN) methods [2]; see Section 2.

The sensitivity of $x \in \mathcal{M}$ with respect to perturbations of $y = \Phi(x)$ might impact the performance of these RGN methods.

Email addresses: breiding@mis.mpg.de (Paul Breiding), nick.vannieuwenhoven@cs.kuleuven.be (Nick Vannieuwenhoven)
¹ Funding: The author was partially supported by DFG research grant BU 1371/2-2.
² Funding: The author was supported by a Postdoctoral Fellowship of the Research Foundation Flanders (FWO).
³ Both the optimization problem (1) and the condition number of maps between manifolds can be defined for abstract manifolds. Nevertheless, we consider embedded manifolds because this greatly simplifies the proof of the main theorem, allowing us to compare tangent spaces in the ambient space using Wedin's theorem [13, Chapter III, Theorem 3.9]. This is no longer possible for abstract manifolds, which would make the letter much more difficult to understand. In practice, many manifolds are naturally embedded.

Preprint submitted to Elsevier, October 16, 2017

Let $\Psi : \mathcal{X} \to \mathcal{Y}$ be a smooth map between manifolds $\mathcal{X}$ and $\mathcal{Y}$, and let $T_x\mathcal{X}$ denote the tangent space to the manifold $\mathcal{X}$ at $x \in \mathcal{X}$. We recall from [5, Section 14.3] that the geometric condition number $\kappa(x)$ characterizes to first order the sensitivity of the output $y = \Psi(x)$ to input perturbations as the spectral norm of the derivative operator $d_x\Psi : T_x\mathcal{X} \to T_{\Psi(x)}\mathcal{Y}$; that is, $\kappa(x) := \|d_x\Psi\| := \max_{t \in T_x\mathcal{X}} \|d_x\Psi(t/\|t\|)\|$.

In the case of PIPs, the geometric condition number is derived as follows. Assume that there exists an open neighborhood $\mathcal{N}$ of $x \in \mathcal{M}$ such that $\mathcal{M}' = \Phi(\mathcal{N})$ is a smooth manifold with $m = \dim \mathcal{M}' = \dim \mathcal{N}$. Since $\Phi|_{\mathcal{N}} : \mathcal{N} \to \mathcal{M}'$ is a smooth map between manifolds, the inverse function theorem for manifolds [11, Theorem 4.5] entails that there exists a unique inverse function $\Phi_x^{-1}$ whose derivative satisfies $d_{\Phi(x)}\Phi_x^{-1} = (d_x\Phi)^{-1}$, provided that $d_x\Phi$ is injective. Hence, the geometric condition number of the parameters⁴ $x$ is

$\kappa(x) := \|d_{\Phi(x)}\Phi_x^{-1}\| = \|(d_x\Phi)^{-1}\| = \frac{1}{\varsigma_m(d_x\Phi)},$  (2)

where $\varsigma_m(A)$ is the $m$th largest singular value of the linear operator $A$. If the derivative is not injective, then the condition number is defined to be $\infty$.

In this letter, we show that the condition number of the parameters $x$ in (2) appears naturally in the multiplicative constants in convergence estimates of RGN methods. Our main contribution is Theorem 1.

2. The Riemannian Gauss–Newton method

Recall that a Riemannian manifold $(\mathcal{M}, \langle \cdot, \cdot \rangle)$ is a smooth manifold $\mathcal{M}$ where for each $p \in \mathcal{M}$ the tangent space $T_p\mathcal{M}$ is equipped with an inner product $\langle \cdot, \cdot \rangle_p$ that varies smoothly with $p$; see [11, Chapter 13]. The zero element of $T_p\mathcal{M}$ is denoted by $0_p$. Since we deal exclusively with embedded submanifolds $\mathcal{M} \subseteq \mathbb{R}^M$, we take $\langle a, b \rangle_p := a^T b$ equal to the standard inner product on $\mathbb{R}^M$. In the following we drop the subscript $p$. The induced norm is $\|v\| = \sqrt{\langle v, v \rangle}$. The tangent bundle of a manifold $\mathcal{M}$ is the smooth vector bundle $T\mathcal{M} := \{(p, v) \mid p \in \mathcal{M},\ v \in T_p\mathcal{M}\}$. In the remainder of this letter, we let $\mathcal{M} \subseteq \mathbb{R}^M$ be an embedded submanifold with $m = \dim\mathcal{M} \leq M$, equipped with the standard Riemannian metric inherited from $\mathbb{R}^M$.

Riemannian optimization methods can be applied to the minimization of a least-squares cost function

$f : \mathcal{M} \to \mathbb{R},\quad p \mapsto \tfrac{1}{2}\|F(p)\|^2 \quad \text{with } F : \mathcal{M} \to \mathbb{R}^N.$  (3)

Recall that Newton's method for minimizing $f$ consists of choosing an $x_0 \in \mathcal{M}$ and then generating a sequence of iterates $x_1, x_2, \ldots$ in $\mathcal{M}$ according to the following process:

$x_{k+1} \leftarrow R_{x_k}(\eta_k) \quad \text{with} \quad (\nabla^2_{x_k} f)\,\eta_k = -\nabla_{x_k} f;$  (4)

herein, $\nabla_{x_k} f : T_{x_k}\mathcal{M} \to \mathbb{R}$ is the Riemannian gradient, and $\nabla^2_{x_k} f : T_{x_k}\mathcal{M} \to T_{x_k}\mathcal{M}$ is the Riemannian Hessian; for details see [2, Chapter 6]. The map $R_{x_k} : T_{x_k}\mathcal{M} \to \mathcal{M}$ is a retraction operator.

Definition 1 (Retraction [2, 9]). A retraction $R$ is a map from an open subset $\mathcal{U} \subseteq T\mathcal{M}$ to $\mathcal{M}$ that satisfies all of the following properties for every $p \in \mathcal{M}$:
1. $R(p, 0_p) = p$;
2. $\mathcal{U}$ contains a neighborhood $\mathcal{N}$ of $(p, 0_p)$ such that the restriction $R|_{\mathcal{N}}$ is smooth;
3. $R$ satisfies the local rigidity condition $d_{0_x} R(x, \cdot) = \mathrm{id}_{T_x\mathcal{M}}$ for all $(x, 0_x) \in \mathcal{N}$.

We let $R_p(\cdot) := R(p, \cdot)$ be the retraction $R$ with foot at $p$. A retraction is a first-order approximation of the exponential map [2]; Lemma 1 below is well-known.

⁴ Note that this is the geometric condition number at the output rather than the input of $\Phi_x^{-1}$. The reason is that the PIP can have several $x_i \in \mathcal{M}$ as solutions. Since the RGN methods will only output one of these solutions, say $x_1$, the natural question is whether this computed solution $x_1$ is stable to perturbations of $\Phi(x_1)$.
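Numerically, the geometric condition number in (2) is straightforward to obtain from any matrix representation of $d_x\Phi$ (for instance, with respect to the standard bases, as is done later in the proof of Theorem 1): it is the reciprocal of the $m$th largest singular value. The following minimal Python/NumPy sketch is our own illustration and not part of the paper, whose experiments use Matlab; the callable `jac_phi` and the argument `m` are assumptions of the sketch.

```python
import numpy as np

def geometric_condition_number(jac_phi, x, m):
    """kappa(x) = 1 / sigma_m(d_x Phi), cf. equation (2).

    jac_phi : callable returning an N x M matrix of d_x Phi at x
              (hypothetical helper, not an API from the paper).
    m       : dimension of the parameter manifold.
    Returns numpy.inf when d_x Phi is (numerically) not injective.
    """
    J = np.asarray(jac_phi(x), dtype=float)
    sigma = np.linalg.svd(J, compute_uv=False)  # singular values, descending order
    sigma_m = sigma[m - 1]                      # m-th largest singular value
    if sigma_m <= np.finfo(float).eps * sigma[0]:
        return np.inf
    return 1.0 / sigma_m
```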

Lemma 1. Let $R$ be a retraction. Then for all $x \in \mathcal{M}$ there exists some $\delta_x > 0$ such that for all $\eta \in T_x\mathcal{M}$ with $\|\eta\| < \delta_x$ one has $R_x(\eta) = x + \eta + O(\|\eta\|^2)$.

As stated in [2, Section 8.4.1], the RGN method for minimizing $f$ is obtained by replacing the Riemannian Hessian $\nabla^2_{x_k} f$ in the Newton process (4) by the Gauss–Newton approximation $(d_{x_k}F)^*(d_{x_k}F)$; herein $A^*$ denotes the adjoint of the bounded linear operator $A$ with respect to the inner product $\langle \cdot, \cdot \rangle$. Note that an explicit expression for the update direction $\eta_k$ can be obtained. The Riemannian gradient is

$\nabla_{x_k} f = \nabla_{x_k} \tfrac{1}{2}\langle F(x_k), F(x_k)\rangle = (d_{x_k}F)^*\bigl(F(x_k)\bigr);$  (5)

see [2, Section 8.4.1]. If $d_{x_k}F$ is injective, then the solution of the system in (4) with the Riemannian Hessian replaced by the Gauss–Newton approximation is given explicitly by

$\eta_k = -\bigl((d_{x_k}F)^*(d_{x_k}F)\bigr)^{-1}(d_{x_k}F)^*\bigl(F(x_k)\bigr) =: -(d_{x_k}F)^{\dagger}\bigl(F(x_k)\bigr).$

3. Main result: Convergence analysis of the RGN method

We prove in this section that both the convergence rate and the convergence radius of the RGN method are influenced by the condition number of the PIP at the local minimizer. In the case of PIPs, we have $F(x) := \Phi(x) - \widetilde{y}$ for some fixed $\widetilde{y} \in \mathbb{R}^N$. Hence, $d_x F = d_x\Phi$, so that the next theorem relates the geometric condition number (2) to the convergence properties of the RGN method for solving the least-squares problem (3).

Remark 1. The RGN method is only locally convergent. Practical methods are obtained by adding a globalization strategy [2, 12] such as a line search or trust region scheme. The goal of these strategies is to guarantee sufficient descent for global convergence, while preserving the local rate of convergence. In the main theorem, we present the analysis without a globalization strategy, so as to focus on the main idea of the proof. In the case of a trust region scheme, the usual approach for extending the proof consists of showing that, close to a local minimizer, the unconstrained Newton step is always contained in the trust region and hence selected. This will be true if the starting point is sufficiently close to the local minimizer. Then, the local rate of convergence will be the same as when no trust region scheme is employed.

In the remainder of this section, let $B_\tau(x)$ denote the ball of radius $\tau$ centered at $x \in \mathbb{R}^M$. The following is the main theorem of this letter.

Theorem 1. Assume that $x_* \in \mathcal{M}$ is a local minimum of the objective function $f$ from (3), where $d_{x_*}F$ is injective. Let $\kappa := (\varsigma_m(d_{x_*}F))^{-1} > 0$. Then, there exists $\hat\varepsilon > 0$ such that for all $0 < \omega < 1$ there exists a universal constant $c > 0$ depending on $\hat\varepsilon$, $F$, $x_*$, $\mathcal{M}$, and $R$ so that the following holds.

1. (Linear convergence): If $c\kappa^2\|F(x_*)\| < 1$, then for all $x_0 \in B_\varepsilon(x_*) \cap \mathcal{M}$ with

$\varepsilon := \min\Bigl\{ \frac{1}{c\kappa},\ \frac{\hat\varepsilon}{1 + \omega + c\kappa^2\|F(x_*)\|} \Bigr\},$

the RGN method generates a sequence $x_0, x_1, \ldots$ that converges linearly to $x_*$. In fact,

$\|x_* - x_{k+1}\| \leq c\kappa^2\|F(x_*)\|\,\|x_* - x_k\| + O(\|x_* - x_k\|^2).$

2. (Quadratic convergence): If $x_*$ is a zero of the objective function $f$, then for all $x_0 \in B_\varepsilon(x_*) \cap \mathcal{M}$ with

$\varepsilon := \min\Bigl\{ \frac{1}{c\kappa},\ \frac{\hat\varepsilon}{1 + \omega} \Bigr\},$

the RGN method generates a sequence $x_0, x_1, \ldots$ that converges quadratically to $x_*$. In fact,

$\|x_* - x_{k+1}\| \leq c(\kappa + 1)\,\|x_* - x_k\|^2 + O(\|x_* - x_k\|^3).$
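To make the iteration concrete: one RGN step evaluates the residual, forms the Gauss–Newton direction $\eta_k = -(d_{x_k}F)^{\dagger}F(x_k)$, and retracts back to the manifold. The sketch below is our own minimal Python/NumPy illustration under stated assumptions; it is not the authors' Matlab implementation, and `F`, `jac_F`, and `retract` are hypothetical user-supplied callables.

```python
import numpy as np

def rgn_step(F, jac_F, retract, x):
    """One Riemannian Gauss-Newton step for min_p 0.5*||F(p)||^2, cf. (3)-(5).

    F       : callable, residual vector F(x) in R^N
    jac_F   : callable, N x m matrix of d_x F with respect to an orthonormal
              basis of T_x M (an assumption of this sketch)
    retract : callable, retraction (x, eta) -> R_x(eta), with eta given in the
              same tangent-space coordinates
    """
    J = np.asarray(jac_F(x), dtype=float)   # assumed to have full column rank
    r = np.asarray(F(x), dtype=float)
    eta = -np.linalg.lstsq(J, r, rcond=None)[0]   # eta_k = -(d_x F)^+ F(x_k)
    return retract(x, eta)                        # x_{k+1} = R_{x_k}(eta_k)
```

In practice the pseudoinverse is applied through a factorization, as above with a least-squares solve, and a globalization strategy as in Remark 1 is wrapped around this step.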

Remark 2. The order of convergence may also be established from [2, Theorem 8.2.1]. However, intrinsic multiplicative constants are not derived there, as that analysis is founded on coordinate expressions that depend on the chosen chart; it thus only yields chart-dependent multiplicative constants.

In the following, let $P_{\mathcal{A}}$ denote the orthogonal projection onto the linear subspace $\mathcal{A} \subseteq \mathbb{R}^N$. Recall the next lemma from [3, Section 2], which we need in the proof of Theorem 1.

Lemma 2. Let $F : \mathcal{M} \to \mathbb{R}^N$ be a smooth function and $x \in \mathcal{M}$. Then, there exist constants $r_F > 0$ and $\gamma_F \geq 0$ such that for all $y \in B_{r_F}(x) \cap \mathcal{M}$ we have $F(y) = F(x) + (d_x F)P_{T_x\mathcal{M}}\Delta + v_{x,y}$, where $\Delta := y - x \in \mathbb{R}^M$ and $\|v_{x,y}\| \leq \gamma_F\|\Delta\|^2$.

We can now prove Theorem 1.

Proof of Theorem 1. We begin with some general considerations. In Lemma 1 we choose $\delta$ small enough such that it applies to all $x \in B_\delta(x_*) \cap \mathcal{M}$. Let $0 < \hat\varepsilon \leq \delta$. Then, there exists a constant $\gamma_R > 0$, depending on the retraction operator $R$, such that for all $x \in B_{\hat\varepsilon}(x_*) \cap \mathcal{M}$ we have

$\|R_x(\eta) - (x + \eta)\| \leq \gamma_R\|\eta\|^2 \quad$ for every $\eta \in B_{\hat\varepsilon}(0) \subseteq T_x\mathcal{M}$.  (6)

By applying Lemma 2 to the smooth functions $F$ and $\mathrm{id}_{\mathcal{M}}$, respectively, and using the smoothness of (the derivative of) $F$, we see that there exist constants $\gamma_F, \gamma_I > 0$ so that for all $x \in B_{\hat\varepsilon}(x_*) \cap \mathcal{M}$ we have

$F(x_*) - F(x) - (d_x F)P_{T_x\mathcal{M}}(x_* - x) = v \quad$ with $\|v\| \leq \gamma_F\|x_* - x\|^2$, and  (7)

$x_* - x - P_{T_x\mathcal{M}}(x_* - x) = w \quad$ with $\|w\| \leq \gamma_I\|x_* - x\|^2$.  (8)

Moreover, we define the Lipschitz constant

$C := \max_{x \in B_{\hat\varepsilon}(x_*) \cap \mathcal{M}} \frac{\|d_{x_*}F\,P_{T_{x_*}\mathcal{M}} - d_x F\,P_{T_x\mathcal{M}}\|}{\|x_* - x\|}.$  (9)

We choose a constant $c$, depending on $R$, $F$ and $\hat\varepsilon$, that satisfies

$\varepsilon \leq \min\Bigl\{ \frac{1}{\kappa\gamma_F},\ \frac{1-\omega}{C\kappa},\ \bigl(1 + 1 + \tfrac{1}{2}(1+\sqrt{5})C\kappa^2\|F(x_*)\|\bigr)^{-1}\hat\varepsilon \Bigr\} \quad$ and $\quad c \geq \max\bigl\{ \tfrac{1}{2}(1+\sqrt{5})C,\ \gamma_F,\ \gamma_R,\ \gamma_I \bigr\}.$  (10)

The rest of the proof is by induction. Suppose that the RGN method applied to the starting point $x_0 \in \mathcal{M}$ generated the sequence of points $x_0, \ldots, x_k \in \mathcal{M}$. First, we show that $d_{x_k}F$ is injective, so that the update direction $\eta = -(d_{x_k}F)^{\dagger}F(x_k)$, and, hence, $x_{k+1}$ is defined. Thereafter, we prove the asserted bounds on $\|x_* - x_{k+1}\|$.

To avoid subscripts, let $x := x_k \in \mathcal{M}$ and $y := x_{k+1} = R_x\bigl(-(d_x F)^{\dagger}F(x)\bigr) \in \mathcal{M}$. By induction, we can assume $\|x_k - x_*\| \leq \|x_0 - x_*\|$; indeed, the base case $k = 0$ is trivially true, and we will prove that it also holds for $k+1$ at the end of this proof, completing the induction. Hence, $x \in B_\varepsilon(x_*) \cap \mathcal{M}$.

Let $J \in \mathbb{R}^{N \times M}$ denote the matrix of $d_x F$ with respect to the standard bases on $\mathbb{R}^M$ and $\mathbb{R}^N$. Let $J = U\Sigma V^T$ be its (compact) singular value decomposition (SVD), where $U \in \mathbb{R}^{N \times m}$ and $V \in \mathbb{R}^{M \times m}$ have orthonormal columns, the columns of $V$ span $T_x\mathcal{M}$, and $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix containing the singular values. Then, the matrix of $(d_x F)^{\dagger}$ with respect to the standard bases is $J^{\dagger}$, i.e., the Moore–Penrose pseudoinverse of $J$, and $\kappa(x) = \|J^{\dagger}\|$. Similarly, let $J_* \in \mathbb{R}^{N \times M}$ denote the matrix of $d_{x_*}F$, and let $U_*\Sigma_* V_*^T$ be its SVD. By assumption, $d_{x_*}F$ is injective and thus we have $\kappa^{-1} = \varsigma_{\min}(d_{x_*}F) = \varsigma_{\min}(J_*) > 0$. The matrix of $P_{T_x\mathcal{M}}$ is $VV^T$, and similarly for $P_{T_{x_*}\mathcal{M}}$. Then, by the definition of $C$ in (9), we have

$\|J_* - J\| = \|J_*(V_*V_*^T) - J(VV^T)\| \leq C\|x_* - x\|,$  (11)

and hence $\|J_* - J\| \leq C\varepsilon \leq (1-\omega)\,\varsigma_{\min}(J_*)$, because $x \in B_\varepsilon(x_*)$ and by the choice of $c$ in (10). From Weyl's perturbation lemma it follows that $\varsigma_{\min}(J_*) - \varsigma_{\min}(J) \leq \|J_* - J\| \leq (1-\omega)\,\varsigma_{\min}(J_*)$. We obtain $\varsigma_{\min}(J) \geq \omega\,\varsigma_{\min}(J_*) > 0$, where the last inequality is by the assumption $\omega > 0$. It follows that

$\|J^{\dagger}\| = \bigl(\varsigma_{\min}(J)\bigr)^{-1} \leq \kappa\,\omega^{-1} < \infty,$  (12)

so that $d_x F$ is indeed injective. This shows that the RGN update direction $\eta$ is well defined. It remains to prove the bound on $\|y - x_*\|$.

First we show that $\|\eta\| = \|J^{\dagger}F(x)\| < \hat\varepsilon \leq \delta$, so that the retraction satisfies (6). By assumption $x_*$ is a local minimum of (3), so that from (5) we obtain $0 = \nabla_{x_*}f = J_*^T F(x_*) = V_*\Sigma_* U_*^T F(x_*)$. By [13, Chapter III, Theorem 1.2 (9)] and the assumption that $d_{x_*}F$ is injective, we have $J_*^{\dagger} = V_*\Sigma_*^{-1}U_*^T$, from which we conclude $J_*^{\dagger}F(x_*) = 0$. Let $P = P_{T_x\mathcal{M}}$. From (7),

$J^{\dagger}F(x) = J^{\dagger}F(x_*) - P(x_* - x) - J^{\dagger}v,$  (13)

so that

$\eta = (J_*^{\dagger} - J^{\dagger})F(x_*) + P(x_* - x) + J^{\dagger}v \quad\text{and hence}\quad \|\eta\| \leq \|J_*^{\dagger} - J^{\dagger}\|\,\|F(x_*)\| + \|P\|\,\|x_* - x\| + \|J^{\dagger}\|\,\|v\|.$  (14)

From Wedin's theorem [13, Chapter III, Theorem 3.9] we obtain

$\|J_*^{\dagger} - J^{\dagger}\| \leq \frac{1+\sqrt{5}}{2}\,\|J_*^{\dagger}\|\,\|J^{\dagger}\|\,\|J_* - J\| \leq \frac{1+\sqrt{5}}{2}\,C\kappa^2\,\|x_* - x\|,$  (15)

where the last step is because of (11) and (12). Using $\|P\| = 1$ for orthogonal projectors, the assumption $\|x_* - x\| \leq (\kappa\gamma_F)^{-1}$, (12), (15), and the bound on $\|v\|$ in (7), it follows from (14) that

$\|\eta\| \leq \bigl(1 + \kappa\gamma_F\|x_* - x\| + \tfrac{1}{2}(1+\sqrt{5})C\kappa^2\|F(x_*)\|\bigr)\,\|x_* - x\|.$  (16)

By the definition of $\varepsilon$ and the assumption $\|x_* - x\| < \varepsilon$, we have $\|x_* - x\| < \varepsilon \leq \frac{1}{\kappa\gamma_F}$, so that by (16),

$\|\eta\| \leq \bigl(1 + 1 + \tfrac{1}{2}(1+\sqrt{5})C\kappa^2\|F(x_*)\|\bigr)\,\|x_* - x\|.$  (17)

Using the third bound on $\varepsilon$ in (10), we have $\|x_* - x\| < \varepsilon \leq \bigl(1 + 1 + \tfrac{1}{2}(1+\sqrt{5})C\kappa^2\|F(x_*)\|\bigr)^{-1}\hat\varepsilon$, which when plugged into (17) yields $\|\eta\| < \hat\varepsilon$. From the foregoing discussion, we conclude that (6) applies to $R_x(\eta) = R_x(-J^{\dagger}F(x))$, so that

$\|y - x_*\| = \|R_x(-J^{\dagger}F(x)) - x_*\| \leq \|x - J^{\dagger}F(x) - x_*\| + \gamma_R\|\eta\|^2.$  (18)

Let $\zeta := \|x - J^{\dagger}F(x) - x_*\|$. We use $J_*^{\dagger}F(x_*) = 0$ and the formula from (13) to derive that

$\zeta = \|x - x_* - (J^{\dagger}F(x) - J_*^{\dagger}F(x_*))\| = \|x - x_* - (J^{\dagger} - J_*^{\dagger})F(x_*) + P(x_* - x) + J^{\dagger}v\| = \|(J_*^{\dagger} - J^{\dagger})F(x_*) - w + J^{\dagger}v\| \leq \|J_*^{\dagger} - J^{\dagger}\|\,\|F(x_*)\| + \gamma_I\|x_* - x\|^2 + \|J^{\dagger}\|\,\gamma_F\|x_* - x\|^2,$

where the second-to-last equality is due to (8), and in the last line we have used the triangle inequality and the bounds on $\|v\|$ and $\|w\|$ from (7) and (8). Combining this with (15) and (12) yields

$\zeta \leq \tfrac{1}{2}(1+\sqrt{5})\,C\kappa^2\|F(x_*)\|\,\|x_* - x\| + (\gamma_I + \gamma_F\kappa)\,\|x_* - x\|^2.$  (19)

Note that we have chosen the constant $c$ large enough so that $\tfrac{1}{2}(1+\sqrt{5})C \leq c$. Plugging (19) and (17) into (18) yields the first bound.

For the second assertion we have the additional assumption that $x_*$ is a zero of the objective function $f(x) = \tfrac{1}{2}\|F(x)\|^2$. From (19) we obtain $\zeta \leq (\kappa\gamma_F + \gamma_I)\|x_* - x\|^2$. From (14) we get $\eta = P(x_* - x) + J^{\dagger}v$, so that we can bound $\|\eta\|^2$ by

$\|P(x_* - x)\|^2 + 2\langle P(x_* - x), J^{\dagger}v\rangle + \|J^{\dagger}v\|^2 \leq \|x_* - x\|^2 + 2\gamma_F\|J^{\dagger}\|\,\|x_* - x\|^3 + \gamma_F^2\|J^{\dagger}\|^2\|x_* - x\|^4,$

where the inequality is by the Cauchy–Schwarz inequality and the fact that $\|P\| = 1$ for orthogonal projectors. As before, plugging these bounds for $\zeta$ and $\|\eta\|$ into (18) and exploiting that $c \geq \max\{\gamma_F, \gamma_I, \gamma_R\}$, the second bound is obtained.
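The driving perturbation tool in this proof is Wedin's bound (15) on the difference of Moore–Penrose pseudoinverses. The following small Python/NumPy script is our own illustrative check of the inequality $\|J_*^{\dagger} - J^{\dagger}\| \leq \tfrac{1+\sqrt{5}}{2}\|J_*^{\dagger}\|\,\|J^{\dagger}\|\,\|J_* - J\|$ on random full-rank matrices; the matrix size $27 \times 14$ merely echoes the dimensions of the experiment in Section 4, and the perturbation level is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = (1.0 + np.sqrt(5.0)) / 2.0   # the constant in Wedin's bound, cf. (15)

for trial in range(5):
    J_star = rng.standard_normal((27, 14))              # full column rank a.s.
    J = J_star + 1e-3 * rng.standard_normal((27, 14))   # a nearby "Jacobian"
    lhs = np.linalg.norm(np.linalg.pinv(J_star) - np.linalg.pinv(J), 2)
    rhs = (mu * np.linalg.norm(np.linalg.pinv(J_star), 2)
              * np.linalg.norm(np.linalg.pinv(J), 2)
              * np.linalg.norm(J_star - J, 2))
    assert lhs <= rhs                       # Wedin's inequality holds
    print(f"trial {trial}: lhs/rhs = {lhs / rhs:.2e}")
```

The printed ratios also illustrate the point made in Section 4.1: the bound, while sharp in the worst case [13, p. 152], can be quite pessimistic for a particular perturbation.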

A reviewer asked how critical the injectivity assumption on the derivative $d_{x_*}F$ in the above theorem is. The brief answer is that it is usually a very weak assumption in practice. First, we need a lemma.

Lemma 3. Let $\mathcal{M} \subseteq \mathbb{R}^M$ be an embedded manifold whose projectivization is a smooth projective variety, and let $\Phi : \mathcal{M} \to \mathbb{R}^N$ be a regular map. Let $\mathcal{N}$ denote the $\mathbb{R}$-variety that is the Zariski closure of the image $\Phi(\mathcal{M})$. If the dimensions satisfy $\dim\mathcal{M} = \dim\mathcal{N}$, then the locus of points

$G := \{ x \in \mathcal{M} \mid d_x\Phi \text{ is injective and } \Phi(x) \text{ is a smooth point of } \mathcal{N} \}$

is a dense subset of $\mathcal{M}$ in the Euclidean topology.

Proof. This is essentially a restatement of [8, Theorem 11.12].

The following proposition shows, under the assumptions of Lemma 3, that the local optimizer $x_*$ in Theorem 1 has an injective derivative $d_{x_*}F = d_{x_*}\Phi$ for a set of inputs (in $\mathbb{R}^N$) of positive measure.

Proposition 2. Let $\mathcal{M}$, $G$, $\mathcal{N}$, and $\Phi$ be as in Lemma 3. Assume that we have the equality $\dim\mathcal{M} = \dim\mathcal{N}$. Let

$B := \{ y \in \mathbb{R}^N \mid \arg\min_{x \in \mathcal{M}} \|\Phi(x) - y\| \subseteq G \}$

be the set of points $y$ all of whose closest approximations on $\Phi(\mathcal{M})$, i.e., $C_y := \arg\min_{x \in \mathcal{M}} \|\Phi(x) - y\|$, lie in $G$. Then, $B$ has positive Lebesgue measure; in fact, it contains a nonempty open subset in the Euclidean topology. Moreover, $\mathcal{N} \cap B$ is Euclidean dense in $\Phi(\mathcal{M})$.

Proof. Let $W_1 \subseteq \mathcal{M}$ be the locus where the dimension of the fiber $\Phi^{-1}(\Phi(x))$ is strictly positive, i.e., $W_1 = \{ x \in \mathcal{M} \mid \dim\Phi^{-1}(\Phi(x)) > 0 \}$. By [8, Theorem 11.12], $W_1$ is a Zariski-closed set. The last claim of the proposition is also a corollary of this theorem and the assumption that the generic fiber is $0$-dimensional. It remains to show the first claim.

Let $W_2 \subseteq \mathcal{M}$ be the Zariski-closed subset of points $x \in \mathcal{M}$ for which $\Phi(x)$ lies in the singular locus of $\mathcal{N}$. Set $W := \overline{(W_1 \cup W_2)} \cap \mathcal{M}$, where the overline denotes the closure in the Zariski topology. Note that $\mathcal{M} \setminus W \subseteq G$. Let $x \in \mathcal{M} \setminus W$. Since the derivative $d_x\Phi$ is injective, there exists a local diffeomorphism between an open neighborhood $\mathcal{M}_0 \subseteq \mathcal{M}$ of $x$ and an open neighborhood $\mathcal{N}_0 \subseteq \mathcal{N}$ of $\Phi(x)$. By restricting neighborhoods, we can assume that the Euclidean closure of $\mathcal{N}_0$ is contained in the smooth locus of $\mathcal{N}$ and that $\mathcal{M}_0$ is contained in $\mathcal{M} \setminus W$. Take a tubular neighborhood $T$ of $\mathcal{N}_0 \subseteq \mathbb{R}^N$ that does not intersect $\mathcal{N} \setminus \mathcal{N}_0$, and let $h$ be its height; note that $h > 0$, because the closure of $\mathcal{N}_0$ does not contain singular points of $\mathcal{N}$. Then, there exists an open ball $B_\delta(\Phi(x))$ in $\mathbb{R}^N$ of positive radius $0 < \delta < h$, centered at $\Phi(x)$, whose intersection with $\mathcal{N}$ is contained in $\mathcal{N}_0$. By construction, $B_\delta(\Phi(x)) \cap \mathcal{N} \subseteq \mathcal{N}_0 \cap T$. It follows from the triangle inequality that the closest point on $\mathcal{N}$ to any point of $B_x := B_{\delta/2}(\Phi(x))$ is contained in $\mathcal{N}_0 \subseteq \Phi(\mathcal{M}) \subseteq \mathcal{N}$. Since $\mathcal{N}_0 = \Phi(\mathcal{M}_0)$ and because $\mathcal{M}_0 \subseteq G$, it follows that $B_x \subseteq B$ for all $x \in \mathcal{M} \setminus W$.

4. Numerical experiments

Here we experimentally verify the dependence of the multiplicative constant on the geometric condition number for a special case of the PIP (1), namely the tensor rank decomposition (TRD) problem. The model is

$\Phi : \mathcal{S} \times \cdots \times \mathcal{S} \to \mathbb{R}^N, \quad (a_1^1 \otimes \cdots \otimes a_1^d,\ \ldots,\ a_r^1 \otimes \cdots \otimes a_r^d) \mapsto \sum_{i=1}^r a_i^1 \otimes \cdots \otimes a_i^d,$

where $d \geq 3$, $N = m_1 \cdots m_d$, and $\mathcal{S} \subseteq \mathbb{R}^N$ is the manifold of $m_1 \times \cdots \times m_d$ rank-1 tensors [8]. The image of $\Phi$ is called a join set and the PIP is a special case of the join decomposition problem [3]. To put emphasis on the join structure of the image of $\Phi$, we denote $\mathcal{J} := \Phi(\mathcal{S} \times \cdots \times \mathcal{S})$. In the numerical experiments of this section we apply an RGN method to

$\min_{x \in \mathcal{M}} \ \tfrac{1}{2}\|\Phi(x) - \mathcal{A}\|^2,$

where $\mathcal{M} := \mathcal{S} \times \cdots \times \mathcal{S}$ is the $r$-fold product manifold of $\mathcal{S}$, and $\mathcal{A} \in \mathbb{R}^N$ is the given tensor to approximate. We choose the retraction operator $R : T\mathcal{M} \to \mathcal{M}$ from [4].

The projectivization of the manifold $\mathcal{S}$ is called the Segre variety; it is a smooth, irreducible projective variety with affine dimension $\dim\mathcal{S} = 1 + \sum_{k=1}^d (m_k - 1)$. The problem of computing the dimension of the Zariski closure $\overline{\mathcal{J}}$, which is called the $r$-secant variety of $\mathcal{S}$, has been classically studied; see [10, Section 5.5] for an overview. The results of [6] entail that the dimension equality $\dim\mathcal{M} = \dim\overline{\mathcal{J}}$ is satisfied for all $r\dim\mathcal{S} < N$ and $N \leq 15000$, subject to a few theoretically characterized exceptions.
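For orientation, the map $\Phi$ and the objective $\tfrac{1}{2}\|\Phi(x) - \mathcal{A}\|^2$ are easy to evaluate from the factor vectors of a decomposition. The following Python/NumPy sketch is our own illustration for third-order tensors ($d = 3$); it is not the authors' Matlab code, and the chosen argument layout (a list of $r$ triples of factor vectors) is an assumption of the sketch.

```python
import numpy as np

def phi(factors):
    """Phi(x): the sum of r rank-1 tensors a_i^1 (x) a_i^2 (x) a_i^3 (d = 3)."""
    return sum(np.einsum('i,j,k->ijk', a1, a2, a3) for (a1, a2, a3) in factors)

def objective(factors, A):
    """Least-squares objective 0.5 * ||Phi(x) - A||^2 from Section 4."""
    return 0.5 * np.linalg.norm(phi(factors) - A) ** 2

# Example: a random rank-2 decomposition of a 3 x 3 x 3 tensor,
# the setting of the experiments below.
rng = np.random.default_rng(1)
x = [tuple(rng.standard_normal(3) for _ in range(3)) for _ in range(2)]
A = phi(x)
print(objective(x, A))   # ~0.0: A lies in the image of Phi by construction
```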

In the example below, we take $r = 2$, for which the dimension equality is always satisfied [1]. Hence, Proposition 2 entails that the injectivity assumption in Theorem 1 is satisfied at least on a set of positive Lebesgue measure. Therefore, the convergence rate of the RGN method is influenced by the geometric condition number of the optimal parameters $x_* \in \mathcal{M}$ that minimize the objective function.

We showed in [3, Section 5.1] that the condition number of the above PIP at $x = (a_i^1 \otimes \cdots \otimes a_i^d)_{i=1}^r$ is $\kappa(x) = (\varsigma_m(U))^{-1}$, where $m = r\dim\mathcal{S}$, and the matrix $U \in \mathbb{R}^{N \times m}$ is given by $U = [U_1 \ \cdots \ U_r]$ with

$U_i := \Bigl[\ \tfrac{a_i^1}{\|a_i^1\|} \otimes \cdots \otimes \tfrac{a_i^d}{\|a_i^d\|} \quad Q_{1,i} \otimes \tfrac{a_i^2}{\|a_i^2\|} \otimes \cdots \otimes \tfrac{a_i^d}{\|a_i^d\|} \quad \cdots \quad \tfrac{a_i^1}{\|a_i^1\|} \otimes \cdots \otimes \tfrac{a_i^{d-1}}{\|a_i^{d-1}\|} \otimes Q_{d,i}\ \Bigr],$

where $Q_{k,i} \in \mathbb{R}^{m_k \times (m_k - 1)}$ is a matrix containing an orthonormal basis of the orthogonal complement of $a_i^k$ in $\mathbb{R}^{m_k}$. These expressions allow us to compute the condition number at any given decomposition $x \in \mathcal{M}$.

All of the following computations were performed in Matlab R2016b. For clearly illustrating the rates of convergence, we used variable precision arithmetic (vpa) with 400 digits of accuracy. Since performing experiments in vpa is very expensive, we consider only the tiny example of a rank-2 tensor of size $3 \times 3 \times 3$. We showed in [4] that an implementation of the RGN method with a trust region globalization strategy applied to the above PIP formulation can outperform state-of-the-art optimization methods for the tensor rank approximation problem on small-scale, dense problems with $r\sum_{k=1}^d m_k \leq 1000$.
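The condition number formula $\kappa(x) = (\varsigma_m(U))^{-1}$ above can be evaluated in a few lines. The Python/NumPy sketch below is our own illustration for $d = 3$ (the paper's computations use Matlab with vpa); the column ordering of $U_i$ follows our reading of the formula, and the orthonormal complements $Q_{k,i}$ are obtained from a full QR factorization.

```python
import numpy as np

def orth_complement(a):
    """Orthonormal basis of the orthogonal complement of a nonzero vector a."""
    Q, _ = np.linalg.qr(a.reshape(-1, 1), mode='complete')  # Q is m x m orthogonal
    return Q[:, 1:]                                          # drop the direction of a

def cpd_condition_number(factors):
    """kappa(x) = 1 / sigma_m(U) for x = (a_i^1 (x) a_i^2 (x) a_i^3)_{i=1..r}."""
    blocks = []
    for (a1, a2, a3) in factors:
        u1, u2, u3 = (a / np.linalg.norm(a) for a in (a1, a2, a3))
        Q1, Q2, Q3 = orth_complement(u1), orth_complement(u2), orth_complement(u3)
        cols = [np.einsum('i,j,k->ijk', u1, u2, u3).ravel()]
        cols += [np.einsum('i,j,k->ijk', q, u2, u3).ravel() for q in Q1.T]
        cols += [np.einsum('i,j,k->ijk', u1, q, u3).ravel() for q in Q2.T]
        cols += [np.einsum('i,j,k->ijk', u1, u2, q).ravel() for q in Q3.T]
        blocks.append(np.column_stack(cols))
    U = np.hstack(blocks)                 # N x m with m = r * dim S
    sigma = np.linalg.svd(U, compute_uv=False)
    return 1.0 / sigma[U.shape[1] - 1]    # reciprocal of the m-th singular value
```

As a consistency check, for the orthogonal configuration $(e_1^{\otimes 3}, e_2^{\otimes 3})$ used at $s = 0$ in Experiment 1 below, the 14 columns of $U$ are orthonormal and the function returns $\kappa = 1.0$, matching the value reported for $s = 0$ in Figure 1.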
4.1. Experiment 1: Random perturbations

Consider the following parametrized tensors in $\mathbb{R}^3 \otimes \mathbb{R}^3 \otimes \mathbb{R}^3$. For $s \geq 0$ we let $x(s) = (\hat{x}(s), e_2^{\otimes 3}) \in \mathcal{S} \times \mathcal{S}$, where $\hat{x}(0) := e_1^{\otimes 3}$ and $\hat{x}(s) := (e_2 - 2^{-s} e_1)^{\otimes 3}$ for $s > 0$, and $e_k \in \mathbb{R}^3$ is the $k$th standard basis vector. Then, we define $\mathcal{A}(s) := \Phi(x(s)) = \hat{x}(s) + e_2^{\otimes 3}$.

For every $s = 0, 1, 3, 5$, we created a perturbed decomposition $x_0(s) = R(x(s), 10^{-20}\,X/\|X\|)$, where $R$ is the aforementioned retraction and the entries of $X$ are chosen from the standard normal distribution. We also sampled a perturbed tensor $\mathcal{A}'(s) := \mathcal{A}(s) + 10^{-10}\,Z/\|Z\|$, where the entries of $Z$ are also standard normal. For verifying the linear convergence, the RGN method was applied to $\mathcal{A}'(s)$, while the quadratic convergence was checked by applying the RGN method to $\mathcal{A}(s)$, both starting from $x_0(s)$.

In all tested cases, the RGN method generated a sequence $x_1(s), x_2(s), \ldots$ in $\mathcal{M}$ converging to a local minimizer $x_*(s) \in \mathcal{M}$. The residual $\|F(x_*(s))\|$ was approximately $7 \cdot 10^{-11}$ in all cases. The results are shown in Figures 1(a) and 1(c), illustrating respectively the predicted linear and quadratic convergence. The graphs confirm the prime message of this letter: the convergence speed of the RGN method deteriorates when the geometric condition number increases, as Theorem 1 predicts.

As the full lines in Figure 1(a) show, the multiplicative constants derived in Theorem 1 can be pessimistic, especially when the condition number is large. We attribute this to the bound (15); while it is sharp [13, p. 152], it is very pessimistic in this experiment. A qualitatively better description of the convergence is shown as the dashed lines in Figure 1(a), where the constant in (15) was estimated heuristically as

$E(s) := \frac{\|J^{\dagger} - J_*^{\dagger}\|}{\|x_1(s) - x_*(s)\|},$

where $J$ is the matrix of $d_{x_1(s)}F$ and $J_*$ is the matrix of $d_{x_*(s)}F$, as in the proof of Theorem 1.

4.2. Experiment 2: Adversarial perturbations

For illustrating the sharpness of the bound (15), we performed an additional experiment with tensors in $\mathbb{R}^3 \otimes \mathbb{R}^3 \otimes \mathbb{R}^3 \cong \mathbb{R}^{27}$. This time we constructed an adversarially perturbed starting point $x_0(s)$ by generating a random tensor $N \in \mathbb{R}^{27}$ with entries sampled from the standard normal distribution, then computing numerically the gradient $g$ of the function $f(x) = \tfrac{1}{2}\|\bigl((d_{x(s)}\Phi)^{\dagger} - (d_x\Phi)^{\dagger}\bigr)N\|^2$, and finally setting $x_0(s) = R(x(s), 10^{-20}\,g)$. As adversarial perturbation of $\mathcal{A}(s) = \Phi(x(s))$, we chose $Z \in \mathbb{R}^{27}$ equal to the left singular vector $u_{14} \in \mathbb{R}^{27}$ corresponding to the smallest nonzero singular value $\varsigma_{14}$ of $d_{x_0(s)}\Phi$; note that $\dim\mathcal{M} = r\dim\mathcal{S} = 2(1 + 2 + 2 + 2) = 14$. As before, we set $\mathcal{A}'(s) = \mathcal{A}(s) + 10^{-10}\,Z/\|Z\|$.

The result of applying the RGN method to $\mathcal{A}'(s)$ from the starting point $x_0(s)$ is shown in Figure 1(b). In all cases, the method converged. The condition numbers at the local minimizers $x_*(s)$ are about the same as in the previous experiment: the respective relative differences were less than $10^{-2}$. The final residuals $\|F(x_*(s))\|$ depended on $s$, however; they were $6.90 \cdot 10^{-46}$, $8.60 \cdot 10^{-34}$, $3.58 \cdot 10^{-31}$ and $1.10 \cdot 10^{-26}$ for $s = 0, 1, 3$, and $5$, respectively.

[Figure 1 about here: three log-scale panels (a), (b), (c) plotting $\|x_k(s) - x_*(s)\|$ against the iteration number $k$ for $s = 0, 1, 3, 5$.]

Figure 1: The data points show the distance $\|x_k(s) - x_*(s)\|$ for the sequence of points $x_k(s)$, $k = 1, 2, \ldots$, computed by the RGN method as a function of $k$ for $s = 0, 1, 3, 5$. The condition numbers of the local optimizers, rounded to two significant digits, were $\kappa = 1.0 \cdot 10^{0}$, $2.7 \cdot 10^{1}$, $1.5 \cdot 10^{3}$, and $9.3 \cdot 10^{4}$ for respectively $s = 0, 1, 3$, and $5$. In (a) and (b) linear convergence rates are illustrated when $\|F(x_*(s))\| \neq 0$, and (c) shows the quadratic convergence rate when the residual $F(x_*)$ vanishes. Figures (a) and (c) show instances of random perturbations, while (b) employed adversarial perturbations. In figure (b), the linear rate of convergence is not immediately observed; therefore the theoretical estimates were applied starting from the least $k$ where $\|x_k(s) - x_*(s)\| \leq 10^{-50}$. In figures (a) and (b), the full lines indicate the theoretical upper bounds from Theorem 1, i.e., with the constant $\tfrac{1+\sqrt{5}}{2}C\kappa^2$ from (15). The dashed lines indicate the upper bounds obtained from Theorem 1, where the constants on the right-hand side of (15) are estimated heuristically as $E(s)$.

This is why the convergence may appear at first sight to be better than in the case of random perturbations. Nevertheless, it is observed that the theoretical estimate in Theorem 1 is indeed much closer to the observed convergence. In fact, the bounds involving the heuristic estimate $E(s)$ are visually indistinguishable from the actual data. Note in particular for $s = 0$, where $\kappa = 1$, that also the theoretical convergence rate from Theorem 1 is visually indistinguishable from the data, illustrating the sharpness of the bound in (15).

Acknowledgements

We thank two anonymous reviewers for their insightful and critical remarks that improved this letter.

References

[1] Abo, H., Ottaviani, G., Peterson, C., 2009. Induction for secant varieties of Segre varieties. Trans. Amer. Math. Soc. 361, 767–792.
[2] Absil, P.-A., Mahony, R., Sepulchre, R., 2008. Optimization Algorithms on Matrix Manifolds. Princeton University Press.
[3] Breiding, P., Vannieuwenhoven, N., 2017. The condition number of join decompositions. arXiv:1611.08117. Submitted.
[4] Breiding, P., Vannieuwenhoven, N., 2017. A Riemannian trust region method for the canonical tensor rank approximation problem. arXiv:1709.00033. Submitted.
[5] Bürgisser, P., Cucker, F., 2013. Condition: The Geometry of Numerical Algorithms. Springer, Heidelberg.
[6] Chiantini, L., Ottaviani, G., Vannieuwenhoven, N., 2014. An algorithm for generic and low-rank specific identifiability of complex tensors. SIAM J. Matrix Anal. Appl. 35 (4), 1265–1287.
[7] Grasedyck, L., Kressner, D., Tobler, C., 2013. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen 36 (1), 53–78.
[8] Harris, J., 1992. Algebraic Geometry, A First Course. Vol. 133 of Graduate Texts in Mathematics. Springer-Verlag.
[9] Kressner, D., Steinlechner, M., Vandereycken, B., 2014. Low-rank tensor completion by Riemannian optimization. BIT Numer. Math. 54 (2), 447–468.
[10] Landsberg, J. M., 2012. Tensors: Geometry and Applications. Vol. 128 of Graduate Studies in Mathematics. AMS, Providence, Rhode Island.
[11] Lee, J. M., 2013. Introduction to Smooth Manifolds, 2nd Edition. Springer, New York, USA.
[12] Nocedal, J., Wright, S. J., 2006. Numerical Optimization, 2nd Edition. Springer Series in Operations Research and Financial Engineering. Springer.
[13] Stewart, G. W., Sun, J.-G., 1990. Matrix Perturbation Theory. Academic Press.