Handling High Dimensional Problems with Multi-Objective Continuation Methods via Successive Approximation of the Tangent Space

Similar documents
Unit: Optimality Conditions and Karush Kuhn Tucker Theorem

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations

Optimality Conditions for Constrained Optimization

A Descent Method for Equality and Inequality Constrained Multiobjective Optimization Problems

Higher-Order Methods

5 Handling Constraints

1. Introduction. We consider the general global optimization problem

Math Advanced Calculus II

Implicit Functions, Curves and Surfaces

Symmetric Matrices and Eigendecomposition

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

Numerical Optimization

Convex Optimization. Problem set 2. Due Monday April 26th

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Chapter 2. Solving Systems of Equations. 2.1 Gaussian elimination

Topological properties

Newton-type Methods for Solving the Nonsmooth Equations with Finitely Many Maximum Functions

TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems

UC Berkeley Department of Electrical Engineering and Computer Science. EECS 227A Nonlinear and Convex Optimization. Solutions 5 Fall 2009

NORMS ON SPACE OF MATRICES

EC 521 MATHEMATICAL METHODS FOR ECONOMICS. Lecture 1: Preliminaries

Implications of the Constant Rank Constraint Qualification

Numerical optimization

Analysis-3 lecture schemes

Lecture 1: Basic Concepts

On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

Tangent spaces, normals and extrema

Search Directions for Unconstrained Optimization

INTEGRATION ON MANIFOLDS and GAUSS-GREEN THEOREM

SECTION C: CONTINUOUS OPTIMISATION LECTURE 9: FIRST ORDER OPTIMALITY CONDITIONS FOR CONSTRAINED NONLINEAR PROGRAMMING

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

l(y j ) = 0 for all y j (1)

Exercise Solutions to Functional Analysis

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

REAL AND COMPLEX ANALYSIS

g 2 (x) (1/3)M 1 = (1/3)(2/3)M.

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

Topological properties of Z p and Q p and Euclidean models

Stochastic Optimization Algorithms Beyond SG

Quasi-Newton Methods

Analysis and Linear Algebra. Lectures 1-3 on the mathematical tools that will be used in C103

10 Numerical methods for constrained problems

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.

c 2007 Society for Industrial and Applied Mathematics

Nonlinear equations. Norms for R n. Convergence orders for iterative methods

Least Sparsity of p-norm based Optimization Problems with p > 1

COURSE Iterative methods for solving linear systems

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Convex Functions and Optimization

DEVELOPMENT OF MORSE THEORY

5 Quasi-Newton Methods

An Introduction to Correlation Stress Testing

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

Optimality, Duality, Complementarity for Constrained Optimization

Differential Topology Solution Set #2

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

Numerical Optimization of Partial Differential Equations

Optimization and Root Finding. Kurt Hornik

Iterative Methods for Solving A x = b

Zangwill s Global Convergence Theorem

Lebesgue Measure on R n

MTH 309 Supplemental Lecture Notes Based on Robert Messer, Linear Algebra Gateway to Mathematics

Introduction to Real Analysis Alternative Chapter 1

Lecture notes: Applied linear algebra Part 1. Version 2

ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM

Numerical Methods I Non-Square and Sparse Linear Systems

MATHEMATICAL ECONOMICS: OPTIMIZATION. Contents

Unconstrained optimization

Gradient Descent. Dr. Xiaowei Huang

The Steepest Descent Algorithm for Unconstrained Optimization

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

Lecture V. Numerical Optimization

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM

In English, this means that if we travel on a straight line between any two points in C, then we never leave C.

Multivariable Calculus

A derivative-free nonmonotone line search and its application to the spectral residual method

EE731 Lecture Notes: Matrix Computations for Signal Processing

A projection algorithm for strictly monotone linear complementarity problems.

MVE165/MMG631 Linear and integer optimization with applications Lecture 13 Overview of nonlinear programming. Ann-Brith Strömberg

fy (X(g)) Y (f)x(g) gy (X(f)) Y (g)x(f)) = fx(y (g)) + gx(y (f)) fy (X(g)) gy (X(f))

MATH 5720: Unconstrained Optimization Hung Phan, UMass Lowell September 13, 2018

Theorem 3.11 (Equidimensional Sard). Let f : M N be a C 1 map of n-manifolds, and let C M be the set of critical points. Then f (C) has measure zero.

Chapter 7. Extremal Problems. 7.1 Extrema and Local Extrema

The following definition is fundamental.

A new nonmonotone Newton s modification for unconstrained Optimization

Handlebody Decomposition of a Manifold

8 Numerical methods for unconstrained problems

arxiv: v1 [math.oc] 1 Jul 2016

CONSTRAINED NONLINEAR PROGRAMMING

Solution of Linear Equations

Linear Algebra Massoud Malek

Real Analysis Problems

Problem set 1, Real Analysis I, Spring, 2015.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE

GENERALIZED CONVEXITY AND OPTIMALITY CONDITIONS IN SCALAR AND VECTOR OPTIMIZATION

Transcription:

Handling High Dimensional Problems with Multi-Objective Continuation Methods via Successive Approximation of the Tangent Space Maik Ringkamp 1, Sina Ober-Blöbaum 1, Michael Dellnitz 1, and Oliver Schütze 1 University of Paderborn Faculty of Computer Science, Electrical Engineering and Mathematics Warburger Strasse 100 Paderborn, Germany {ringkamp,dellnitz,sinaob}@math.uni-paderborn.de CINVESTAV-IPN Computer Science Department Av. IPN 508, C. P. 07360, Col. San Pedro Zacatenco Mexico City, Mexico schuetze@cs.cinvestav.mx Abstract. In many applications, several conflicting objectives have to be optimized concurrently leading to a multi-objective optimization problem. Since the set of solutions, the so-called Pareto set, typically forms a (k 1)-dimensional manifold, where k is the number of objectives considered in the model, continuation methods such as predictor-corrector (PC) methods are in certain cases very efficient tools to rapidly compute a finite size representation of the set of interest. However, their classical implementation leads to trouble when considering higher dimensional models (i.e., for dimension n > 1000 of the parameter space). In this work, it is proposed to perform a successive approximation of the tangent space which allows to find promising predictor points with lower effort in particular for high dimensional models since no Hessians of the objectives have to be calculated. The applicability of the resulting PC variant is demonstrated on a benchmark model for up to n = 100, 000 parameters. Keywords: multi-objective optimization, continuation, tangent space approximation, high dimensional problems. 1 Introduction In a variety of applications in industry and finance the problem arises that several objective functions have to be optimized concurrently. For instance, for a perfect economical production plan, the ultimate desire would be to simultaneously minimize cost and maximize quality. This example already illustrates

a natural feature of these problems, namely that the different objectives typically contradict each other and therefore certainly do not have identical optima. Thus, the question arises how to approximate a particular optimal compromise (see e.g. Miettinen (1999) for an overview of widely used interactive methods) or how to compute or approximate all optimal compromises of this multi objective optimization problem (MOP). The latter will be considered in this article. Mathematically speaking, in a MOP there are given k objective functions f 1,..., f k : R n R which have to be minimized. The set of optimal compromises with respect to the objective functions is called the Pareto set 1. A point x R n in parameter space is said to be optimal or a Pareto point if there is no other point which is at least as good as x in all the objectives and strictly better in at least one objective. This work will concentrate on the approximation of the Pareto set. Multi objective optimization is currently a very active area of research. By far most of the methods for the computation of single Pareto points or the entire Pareto set are based on a scalarization of the MOP (i.e., a transformation of the original MOP into one or a sequence of scalar optimization problems, see, e.g., Das and Dennis (1998), Fliege (004) or Eichfelder (008)). For a survey of these and further methods the reader is referred to Miettinen (1999) for nonlinear MOPs and to Jahn (1986) and Steuer (1986) in the linear case. Another way to attack the problem is by using bio-inspired heuristics such as Evolutionary Algorithms (e.g., Deb (001), Coello Coello et al. (007)). In such methods, an entire set of candidate solutions (population) is considered and iterated (evolved) simultaneously which allows for a finite size representation of the entire Pareto set in one run of the algorithm. These methods work without gradient information of the objectives and are particularly advantageous in the situation where the MOP is not differentiable and/or where the objectives contain many local minima. A method which is based on a stochastic approach is presented in Schäffler et al. (00). In this work, the authors derive a stochastic differential equation which has the property that it is very likely to observe corresponding solutions in a neighborhood of the set of (local) Pareto points. Similar to the evolutionary strategies the idea of Schäffler et al. (00) is to directly approximate the entire solution set and not just single Pareto points on the set. Another way to compute the entire Pareto set is to use subdivision techniques (Dellnitz et al. 005, Jahn 006). These algorithms start with a compact subset Q R n of the domain and generate outer approximations of the Pareto set which get finer under iteration (see Dellnitz et al. (00) for a convergence result). The approach is of global nature and hence in practice restricted to moderate dimensions of the parameter space. Typically that is, under mild regularity conditions on the model the set of Pareto points forms locally a (k 1)-dimensional manifold if there are k smooth objective functions. This is the reason why continuation methods such as predictor- 1 Named after the economist Vilfredo Pareto, 1848-193.

3 corrector (PC) methods for the computation of general implicitly defined manifolds (e.g., Rheinboldt (1986), Allgower and Georg (1990)) can be successfully applied to the context of multi-objective optimization, see e.g. Guddat et al. (1985), Rakowska et al. (1991), Hillermeier (001) or Schütze et al. (005). In the following the working principle of a PC method is described. For sake of a better understanding the particular case of a bi-objective problem is considered here (i.e., a MOP with k = objectives, where it is assumed that the Pareto set can be expressed by a curve. See Figure 1 for such an example). For a more general and thorough description of PC methods the reader is referred e.g. to Allgower and Georg (1990) or Section. Given a Pareto optimal solution x i P Q, where P Q denotes the Pareto set of the given problem, a further solution x i+1 P Q near to x i is computed in the following two steps: (P) Predict a point p i+1 R n such that p i+1 x i points along the solution set. Typically, this is done by linearizing P Q at x i. That is, one can choose p i+1 := x i + tν, where t R\{0} is a step size and ν is the tangent vector of P Q at x i. (C) Compute a solution x i+1 P Q in the vicinity of p i+1 (p i+1 is corrected to the solution set). It is widely accepted (e.g., Allgower and Georg (1990)) that the additional selection of the predictor is beneficial for the overall computational efficiency than directly computing x i+1 from x i due to the locality in the search in step (C) (in case t is chosen properly p i+1 is already very close to P Q ). Proceeding in the same manner one obtains a method that generates solutions along the Pareto set starting from the initial solution x i. Though PC methods are at least locally typically quite effective, they are, however, based on some assumptions. First, an initial solution has to be known or computed before the process can start. Further, it can be the case that P Q falls into several connected components (which may happen, for instance, if one or more objectives contain discontinuities). Due to their local nature, PC methods are restricted to the connected component that contains the given initial solution. Hence, in order to be able to approximate the entire Pareto set, the PC method has to be fed with multiple initial solutions. As a possible remedy for both problems PC methods can be combined with global search strategies such as evolutionary algorithms. This has been done e.g. in Schütze et al. (003), Harada et al. (007), Schütze et al. (008) and Lara et al. (010). One problem remaining is that PC methods may run into trouble for the treatment of higher dimensional MOPs as it is addressed here. Given a solution x, the main requirements of a classical (multi-objective) PC method to obtain a further solution are as follows: (P) In the predictor step, the Hessians f i (x) of the objective functions f i have to be computed. Further, a QR-decomposition of the Jacobian F of an auxiliary map F has to be computed. F is at least in the unconstrained case

4 Fig. 1. Given x i P Q a further solution is computed in two steps by PC methods: (P) a predictor is generated by linearizing P Q at x i ( p i+1), and (C) this point is corrected back to the solution set ( x i+1). mainly given by a weighted sum H(x, α) := k i=1 α if i (x) of the Hessians f i. This yields a linearization of the solution set at x. (C) For the correction of the predicted solution p obtained via linearization, typically the Gauss-Newton method applied on F is used which requires H(x (i), α (i) ) in each iteration i and the solution of a linear system of equation (of dimension m > n, where x R n, however, for large values of n one can assume m n). Hence, the cost to obtain a further solution is O(n ) in terms of memory and O(n 3 ) in terms of flops for full matrices H(x, α) and the algorithm runs into trouble on a standard PC for n > 1000. One possible remedy for high dimensional MOPs is certainly to exploit the sparsity of the model (if given). Here, an alternative approach is followed by changing the PC method as follows: (P) perform a successive approximation of the tangent space of the implicitly defined manifold at a given solution x which is based on some analysis performed on the geometry of the problem (and which is also the main contribution of this work), and (C) change the Gauss-Newton method against the limited memory BFGS method proposed in Shanno (1978) which is designed and approved for large scale problems. The cost of the novel method is O(n) in terms of memory and O(n ) in terms of flops. The remainder of this paper is organized as follows: In Section, the required background for the understanding of the sequel is given. In Section 3, the analysis for the successive approximation of the tangent space, and in Section 4 the resulting algorithms are presented. In Section 5, some numerical results on an

5 academic model are shown with up to n = 100, 000 dimensions, and finally, some conclusions are drawn in Section 6. Background This section gives the required background for the predictor-corrector algorithm which is described in the next section: the basic idea of continuaton methods are addressed (following mainly Rheinboldt (1986) and Allgower and Georg (1990)), and further the connection to multi-objective optimization is given. Continuation Methods Assume a differentiable map H : S R N R M, d := N M 1, (1) of class C r, r 1, is given on an open subset S R N. A point x S is called regular if the first derivative, H (x) R M N, has maximal rank M. Further, assume one is interested in the following system of equations: In the case the regular solution set H(x) = 0, x S. () M = {x S H(x) = 0, x regular} (3) is non-empty, it is well-known that M is a d-dimensional C r -manifold in R N without boundary (e.g., Rheinboldt (1986)). One possible way to tackle problem () numerically is to use continuation methods such as PC methods. Given an initial (approximated) solution x M further solutions x (i) M near x are found by PC methods via the following two steps: (P) Predict a set {p 1,..., p s } of distinct (and well distributed) points which are near both to x and to M. (C) For i = 1,..., s Starting with the predicted point p i, compute by some (typically few) iterative steps an approximated element x (i) of M, i.e. H(x (i) ) 0. One way to obtain well distributed predictors near to a solution x M is to compute an orthonormal basis of the tangent space at x via a QR-decomposition of H (x) T : The tangent space at a point x M is given by T x M = kerh (x) = {u R N H (x)u = 0}. (4)

6 The normal space N x M at x M is the orthogonal complement of T x M: N x M = (T x M) = (kerh (x)) = rgeh (x) T. (5) Let Q = (Q 1 Q ) R N N be an orthogonal matrix and R = where R 1 R M M is an upper triangular matrix, such that H (x) T = QR = (Q 1 Q ) ( ) R1 R 0 N M, ( ) R1. (6) 0 If x is regular it follows that the diagonal elements of R 1 do not vanish and hence it is straightforward to see that the columns of Q 1 R N M provide an orthonormal basis of rgeh (x) T = N x M. Thus, an orthonormal basis of T x M is given by the columns of Q R N d. Hence, one may choose predictors p R n e.g. by p := x + tq, (7) where t R\{0} is a step size and q a column vector of Q (compare to the example related to Figure 1). For particular choices to spread the predictors along the tangent space as well as for step size strategies refer to Ringkamp (009). The efficient approximation of Q is the main focus in this work. For the realization of the corrector step (C), typically the Gauss-Newton method x (i+1) = x (i) H (x (i) ) + H(x (i) ), i = 0, 1,..., (8) where H (x (i) ) + R N M is the Moore-Penrose inverse of H (x (i) ), is applied. It is well-known that this method converges quadratically to a point x M if the starting vector x (0) R N is chosen close enough to M. Refer e.g. to Deuflhard (004) for a local convergence result. In case of higher dimensions it is suggested here to use the limited memory BFGS method introduced by Shanno (1978) x (i+1) := x (i) + t i d (i), i = 0, 1,... (9) where t i is an exact, Powell- or Armijo-step size. With f(x) = H(x) it holds d (0) = f(x (0) ) and for i = 0, 1,... d (i+1) := (y(i) ) T s (i) y (i) g (i+1) ( (s(i) ) T g (i+1) (y(i) ) T g (i+1) ) (y (i) ) T s (i) y (i) s (i) + (s(i) ) T g (i+1) y (i) y (i) where g (i) := f(x (i) ), s (i) := x (i+1) x (i) and y (i) := f(x (i+1) ) f(x (i) ).

7 Multi-Objective Optimization In a multi objective optimization problem (MOP) the task is to simultaneously optimize k objective functions f 1,..., f k : R n R. More precisely, a general MOP can be stated as follows: min {F (x)}, Q := {x x Q Rn h(x) = 0, g(x) 0}, (MOP) where the function F is defined as the vector of the objective functions F : Q R k, F (x) = (f 1 (x),..., f k (x)), and h : Q R m, m n, and g : Q R q. The optimality of a MOP is defined by the concept of dominance (Pareto 1971). Definition 1. (a) Let v, w R k. Then the vector v is less than w (v < p w), if v i < w i for all i {1,..., k}. The relation p is defined analogously. (b) A vector y R n is dominated by a vector x R n (x y) with respect to (MOP) if F (x) p F (y) and F (x) F (y), else y is called non-dominated by x. (c) A point x Q is called (Pareto) optimal or a Pareto point if there is no y Q which dominates x. The set of all Pareto optimal solutions is called the Pareto set, and is denoted by P Q. The image F (P Q ) of the Pareto set is called the Pareto front. Fundamental for many methods for the numerical treatment of MOPs is the following theorem of Kuhn and Tucker (1951) which states a necessary condition for Pareto optimality for MOPs with equality constraints. Theorem 1. Let x be a Pareto point of (MOP) with q = 0. Suppose that the set of vectors { h i (x) i = 1,..., m} is linearly independent. Then there exist vectors λ R m and α R k with α i 0, i = 1,..., k, and k i=1 α i = 1 such that k m α i f i (x ) + λ j h j (x ) = 0 (10) i=1 j=1 h i (x ) = 0, i = 1,..., m. In the unconstrained case i.e. for m = 0 the theorem says that the vector of zeros can be written as a convex combination of the gradients of the objectives at every Pareto point. Obviously, (10) is not a sufficient condition for (local) Pareto optimality. On the other hand, points satisfying (10) are certainly Pareto candidates, and thus, following Miettinen (1999), their relevance is now emphasized by the following Definition. A point x R n is called a substationary point or Karush Kuhn Tucker point 3 (KKT point) if there exist scalars α 1,..., α k 0 and λ R m such that (10) is satisfied. Without loss of generality only equality constraints are considered here. For a more general formulation of the theorem refer e.g. to Miettinen (1999). 3 Named after the works of Karush (1939) and Kuhn and Tucker (1951) for scalar valued optimization problems.

8 Having stated Theorem 1, one is (following Hillermeier (001)) in the position to give a qualitative description of the set of Pareto optimal solutions which gives at the same time the link to (): Denote by F : R n+m+k R n+m+1 the following auxiliary map: k α i f i (x) + m λ j h j (x) i=1 j=1 F (x, λ, α) = h(x). (11) k α i 1 By Theorem 1 it follows that for every substationary point x R n there exist vectors λ R m and α R k such that i=1 F (x, λ, α ) = 0. (1) Hence, one expects that the set of KKT-points defines a (k 1)-dimensional manifold. This is indeed the case under certain smoothness assumptions, see Hillermeier (001) for a thorough discussion of this topic. To estimate the approximation quality of a candidate set generated by an optimization procedure to the Pareto front the Hausdorff distance will be used in this work which is defined as follows: Definition 3. Let u, v R n, A, B R n, and d(, ) be a metric on R n. Then, the Hausdorff distance d H (, ) is defined as follows: (a) dist(u, A) := inf d(u, v) v A (b) dist(b, A) := sup dist(u, A) u B (c) d H (A, B) := max {dist(a, B), dist(b, A)} 3 Approximation of T x M In this section, the geometry of the problem will be analyzed. The results will be the basis for the successive approximation of the tangent space which will be done in the next section. In the following, assume that M as defined in (3) is a sufficiently smooth d- dimensional manifold, and that a point x (0) M is given. In the sequel, matrices are used for the representation of approximations of the tangent space T x (0)M which are defined as follows: Definition 4. Let c, δ R with c 1 and δ > 0. Denote by T x (0)M(c, δ) the set of all matrices A R N d with rank(a) = d, condition number κ (A) c,

9 and A i + x (0) M B δ (x (0) ) i = 1,..., d, where A i are the columns of A, i.e., T x (0)M(c, δ) := {A = (A 1,..., A d ) R N d rank(a) = d, κ (A) c, A i + x (0) M B δ (x (0) ) i = 1,..., d} (13) Remark 1. (a) Given A T x (0)M(c, δ), the columns A i can be interpreted as secants that intersect M in the two points x (0) and x (i) := A i + x (0). In case the A i s are linearly independent, the image of the linear map A : R N R d, is a d-dimensional subspace of R N. In this way, A(R N ) can be viewed as an approximation of T x (0)M (or the matrix Q as defined in Equation (6)). In the context of PC methods it means that if A is accepted as a suitable approximation of T x (0)M predictors can be chosen e.g. as p := x (0) + t A i A i, (14) where A i is a column vector of A and t is chosen as in (7). (b) For all δ > 0 and 1 c 1 c it holds T x (0)M(c 1, δ) T x (0)M(c, δ). Analog for all c 1 and 0 < δ 1 δ it is T x (0)M(c, δ 1 ) T x (0)M(c, δ ). Lemma 1. There exists δ > 0 such that for all δ R with 0 < δ < δ there exists a matrix A = (A 1,..., A d ) T x (0)M(1, δ) with A i = δ i = 1,..., d and A i A j i j. Proof. The proof is separated into two parts. In (a), the existence of x M l = (H l ) 1 (0) with x x (0) = δ will be shown under some requirements on M l. In (b), it will be first proven for all l {0,..., d} that these requirements hold for some points x (1),..., x (l) M. After that, part (a) will be repeatedly used to construct these points and they will finally be used to create the matrix A. (a): For 1 l d let M l := M M l be a (d l + 1) dimensional submanifold given by M l = (H l ) 1 (0) where H l : B δ (x (0) ) R N R N d+l 1 and x (0) M l. Further let (i) ϕ : B δ (x 0 ) V be a local chart for the d-dimensional submanifold M with ϕ(b δ (x 0 ) M) = V (R d {0}), (ii) M l be a (N l + 1)-dimensional submanifold with a chart ϕ l : R N R N, ϕ l ( M l ) = R N l+1 {0}, (iii) the N d + l vectors (x x (0) ), H1(x), l..., HN d+l 1 l (x) be linearly independent for all x M l Ḃδ(x (0) ) where Ḃδ(x (0) ) := B δ (x (0) ) \ {x (0) }. Then, for all δ > 0 with δ < δ it follows the existence of x M l with x x (0) = δ.

10 Proof: Let K := M l B δ(x (0) ). First it is proven that K is compact by showing that each sequence in K has a convergent subsequence with limit in K. Let (x n ) n N K be a sequence, (x n ) n N is bounded and due to the Bolzano- Weierstrass theorem it has a convergent subsequence. In abuse of notation the same index n is used for the subsequence, since the entire sequence is not needed any longer. Thus, it is x n x B δ(x (0) ) B δ (x (0) ) and according to (i) and (ii) it follows that ϕ(x) V and ϕ l (x) N N l+1. By continuity of ϕ and using x n M l n N it is 0 = lim n ϕ d+i(x n ) = ϕ d+i (x) i {1,..., N d}, (15) which implies ϕ(x) V (R d {0}) and so x M. By continuity of ϕ l it follows analogously that x M l, and hence, also that x K which shows the compactness of K. By this it follows the existence of a vector x K with x x (0) = max x K x x(0). (16) In the case x x (0) = δ the claim follows. Now assume that x x (0) < δ. Consider the following optimization problem min x R N x x (0) s.t. H l (x) = 0 g(x) := x x (0) δ 0. (17) It is x x (0) and so x Ḃδ(x (0) ) since M l is a (d l + 1)-dimensional submanifold with x (0) M l. As a result the inequality constraint is not active at x and using (iii) it follows that the vectors H1( x), l..., HN d+l 1 l ( x) are linearly independent. That means that the Mangasarian-Fromowitz Constraint Qualification (e.g., Nocedal and Wright (006)) is fulfilled and since x is a local minimizer for the optimization problem (17), there exists λ R N d+l such that ( x, λ) is a Karush-Kuhn-Tucker point. Therefore, it holds λ N d+l g( x) = 0 and N d+l 0 = ( x x (0) ) + λ i Hi( x) l + λ N d+l ( x x (0) ). (18) i=1 Since g( x) 0 it follows that λ N d+l = 0, and thus, that the vectors ( x x (0) ), H1( x), l..., HN d+l 1 l ( x) are linearly dependent. That is a contradiction, and hence, it holds that x x (0) = δ, and the claim follows. (b): Let ϕ : U V be a local chart for the d-dimensional submanifold M and let U, V R N be open sets with x (0) U, ϕ(x (0) ) = 0 V and ϕ(u M) = V (R d {0}) and let l {1,..., d}. First of all, it will be proven by contradiction that a rank criterion holds for some special x (1),..., x (l) M.

Assumption: δ > 0 x (1),..., x (l) Ḃδ(x (0) ) M, x (i) x (0) x (j) x (0) i j and x B δ (x (0) ) with H (x) (x (1) x (0) ) T rank N d + l.. (19) (x (l) x (0) ) T By choosing the sequence (δ n ) n N = ( ) 1 n the assumption leads to the existence of sequences x (1)n) ( ( n N,..., x (l)n) Ḃ 1 (x(0) ) M and (x n ) n N n N n N n B 1 ) with x (n), x (1)n,..., x (l)n x (0) and x (i) x (0) x (j) x (0) i j. n Since the rank in (19) has the upper bound N d + l, it is H (x n ) (x (1)n x (0) ) T rank < N d + l. n N. (0) (x (l)n x (0) ) T Equation (0) is used in the following to get a contradiction. In fact, a sequence of zeros with non-zero limit will be constructed. According to standard analysis (e.g. Königsberger (1997)) there exists an embedding γ : V R N with an open set V R d and γ(v ) = M U. W.l.o.g. assume 0 V, γ(0) = x (0) and B 1 (x (0) ) U. Therefore, it holds: (i) γ : V M U is bijective, (ii) γ is continuously differentiable, (iii) γ (0) R N d has maximal rank and T x (0)M = γ (0)R d, (iv) γ 1 : M U V exists and is continuous. Define t (i)n := γ 1 (x (i)n ), i {1,..., l}, n N. By γ 1 (x (0) ) = 0 and (i) it follows that t (i)n 0. Simple calculations show that the following sequence is bounded (more concrete, its Euclidean norm is bounded by l) t (1)n t (1)n. t (l)n t (l)n n N 11. (1) Therefore one can apply the theorem of Bolzano-Weierstrass to obtain a convergent subsequence. In abuse of notation the same index n is used for that subsequence, since the entire sequence is not needed ( any longer. Subdividing that t subsequence in the l d-dimensional parts (i)n t (i)n leads to l convergent )n N sequences. It is t(i)n t t (i)n = 1 n N, hence, it follows (i)n t (i)n t (i) 0. Due to (iii) it follows 0 b (i) := γ (0)t (i) T x (0)M ()

1 and due to (ii) it is x (i)n x (0) t (i)n Using x (i) x (0) x (j) x (0) By 0 = lim n = γ(t(i)n ) γ(0) t (i)n b (i). (3) i j, it follows x (i)n x (0), x(j)n x (0) t (i)n t (j)n = b (i), b (j) i j. (4) T x (0)M = {v R N H (x (0) ) v = 0} (5) and () it follows that b (i) H l (x (0) ) i {1,..., l}, l {1,..., N d}. Using that and (4) it becomes apparent that the matrix H (x (0) ) b (1)T. R(N d+l) N (6) b (l)t has full rank. Let det be a composition of a function which projects the matrix in (6) to a regular (N d + l) (N d + l) submatrix and the determinant. It holds that det is a continuous function since it is a composition of continuous functions. Therefore, it is H (x (0) ) H (x n ) 0 det b (1)T. b (k)t (3) = lim n det (x (1)n x (0) ) T t (1)n. (x (k)n x (0) ) T t (k)n (0) = lim 0 = 0, (7) n which is a contradiction. Thus, the initial assumption is false. Hence, the following statement holds: l {1,..., d} δ > 0 such that x (1),..., x (l) Ḃ δ (x (0) ) M, x (i) x (0) x (j) x (0) i j and x B δ (x (0) ), and hence H (x) (x (1) x (0) ) T rank = N d + l. (8). (x (l) x (0) ) T Since there are just finitely many integers l {1,..., d} to consider, it follows the existence of such a radius δ > 0 for all l {1,..., d}. Using such a radius δ with B δ (x (0) ) U mathematical induction will be used to show the existence of points x (1),..., x (l) M with x (i) x (0) = δ i {1,..., l} and x (i) x (0) x (j) x (0) i, j {1,..., l}, i j for all l d. Basis l = 1:

Define M 1 := R N and use some orthogonal vectors v (1),..., v (N) R N to define a chart ϕ 1 : R N R N, (v (1) ) T (x x (0) ) ϕ 1 (x) :=. (9) (v (N) ) T (x x (0) ) for the N-dimensional submanifold M 1. Further define the d-dimensional submanifold M 1 := M M 1 = M = (H 1 ) 1 (0) with H 1 = H. Therefore (a) (ii) and (i) are fulfilled. According to the rank condition (8) it follows the linear independence of (x (1) x (0) ), H 1 1 (x (1) ),..., H 1 N d(x (1) ) (30) for all x (1) M 1 Ḃδ(x (0) ) as desired in (a) (iii). As a result (a) yields a point x (1) M 1 with x (1) x (0) = δ. Inductive step: Let points x (1),..., x (l) M exist with x (i) x (0) = δ i {1,..., l} and x (i) x (0) x (j) x (0) i, j {1,..., l}, i j for l < d. It has to be shown that this is also true for l + 1 d. Let H l+1 : R N R l with ( x (1) x (0) ) T (x x (0) ) H l+1 (x) :=.. (31) ( x (l) x (0) ) T (x x (0) ) According to standard linear algebra there exist orthogonal vectors v (1),..., v (N l) R N with ( x (i) x (0) ) v (j) for all i {0,..., l}, j {1,..., N l}. Hence, it follows that ϕ l+1 : R N R N, (v (1) ) T (x x (0) ). ϕ l+1 (v (N l) ) T (x x (0) ) (x) := ( x (1) x (0) ) T (x x (0) (3) ). ( x (l) x (0) ) T (x x (0) ) is a chart for the (N l)-dimensional submanifold M l+1 := ( H l+1 ) 1 (0). Further, let H l+1 : B δ (x (0) ) R N d+l with ( ) H(x) H l+1 (x) := H l+1 (33) (x) and define M l+1 := (H l+1 ) 1 (0) = H 1 (0) ( H l+1 ) 1 (0) = M M l+1. According to (8) it is rank ( (H l+1 ) (x) ) = N d + l for all x B δ (x (0) ) and 13

14 with x (0) M l+1 it follows with the regular value theorem that M l+1 is a (d l)-dimensional submanifold. Therefore (a) (i) and (ii) are fulfilled. Because of Hl+1 (x (l+1) ) = 0 x (l+1) M l+1 it follows x (l+1) x 0 x (i) x 0 i {1,..., l}. Again, using the rank condition (8) yields the linear independence of H 1 (x (l+1) ),..., H N d (x (l+1) ), ( x (1) x 0 ),..., ( x (l) x 0 ), (x (l+1) x (0) ) = H l+1 1 (x (l+1) ),..., H l+1 N d+l (x(l+1) ), (x (l+1) x (0) ) (34) for all x (l+1) M l+1 Ḃδ(x (0) ) as desired in (a) (iii). As a result (a) yields a point x (l+1) M l+1 with x (l+1) x (0) = δ and x (i) x (0) x (j) x (0) {1,..., l + 1}, i j. That completes the mathematical induction. Finally, define A i := x (i) x (0) i 1,..., d, then with i, j A T A = ( A T ) d i A j = diag(δ,..., δ) (35) i,j=1 it follows κ (A) = λmax(a T A) λ min(a T A) = δ δ = 1, where λ max(a T A) is the maximal and λ min (A T A) the minimal singular value of A T A. Thus, A T x (0)M(1, δ), and the claim follows. The following result shows that for every c 1 there exists a δ > 0 such that the image of every matrix A T x (0)M(c, δ) approximates the tangent space within a pre-defined tolerance ɛ R +. Theorem. Let N, d N, d N and M R N be a d-dimensional submanifold with tangent space T x (0)M in x (0) M. Then it holds: (a) δ > 0, c 1 it is T x (0)M(c, δ). (b) ɛ > 0, c 1 there exists a δ > 0 such that A T x (0)M(c, δ) it holds: B R N d with rank(b) = d, B(R d ) = T x (0)M and At Bt ɛ Bt t R d (36) Proof. Ad (a): Let δ > 0 and c 1. Due to Lemma 1 there exists δ > 0 such that δ 1 R with 0 < δ 1 < δ there exists a matrix A = (A 1,..., A d ) T x (0)M(1, δ ) with A i = δ 1 i = 1,..., d. Hence, it is A T x (0)M(1, δ 1 ), and for δ 1 < δ it follows with Remark 1 (b): T x (0)M(1, δ 1 ) T x (0)M(c, δ). Ad (b): According e.g. to Königsberger (1997) there exists an embedding γ : V R N, U R N, V R d with γ(v ) = M U and x (0) M U. W.l.o.g. assume 0 V and γ(0) = x (0). Therefore, it holds: (i) γ (0) R N d has maximal rank, (ii) γ is continuously differentiable, (iii) γ 1 : M U V exists and is continuous.

15 Given a matrix C R N d with rank(c) = d, it follows that also the pseudo inverse C + R d N has maxmial rank. Further, since C + C = I R d d, it holds: t = C + Ct C + Ct, t R d. (37) Because of (ii) for each a > 0 there exists a constant δ > 0, such that t R d B δ(0) it holds γ(t) γ(0) γ (0)t a t. (38) Due to (i) it holds γ (0) + > 0 and one can define a := ɛ (c + 1) d γ (0) +. (39) By (iii) and γ 1 (x (0) ) = 0 it follows that for every δ > 0 there exists a δ > 0 such that B δ (x (0) ) U and for all x M B δ (x (0) ) it holds γ 1 (x) < δ. Defining t := γ 1 (x) it follows x x (0) γ (0)t = γ(t) γ(0) γ (0)t (38) (37) ɛ (c + 1) d γ (0)t. ɛ (c + 1) d γ (0) + t (40) Let A := (A 1,..., A d ) T x (0)M(c, δ) and define x (i) := A i + x (0) i = 1,..., d, then it is x (i) M B δ (x (0) ). Further, define t (i) := γ 1 (x (i) ) and B := (B 1,..., B d ) = (γ (0)t (1),..., γ (0)t (d) ) R N d, then it follows by (40) A i B i ɛ (c + 1) d B i, i = 1,..., d. (41) W.l.o.g. assume that ɛ < 1. Using the Frobenius norm. F : R N d R, d (C 1,..., C d ) F = i=1 C i, and the inequalities C C F d C (e.g., Golub and Loan (1996)) it follows A B A B F = d A i B i = i=1 ɛ (c + 1) d B F (41) ɛ (c + 1) B ɛ (c + 1) d B i d i=1 ɛ (c + 1) ( A B + A ) ɛ<1 A B + ɛ A (c + 1). (4)

16 By this it follows that A B ɛ A c κ (A) c ɛ A A + A = ɛ A + (43) which leads to A + B A < 1. Using the perturbation lemma (e.g., Golub and Loan (1996)) it follows and rank(b) = rank(a + (B A)) = d (44) B + = (A + (B A)) + By (44) it follows B(R d ) = T x (0)M. Further, it holds A + 1 A + B A. (45) (43) A B ɛ A + A B ( 1 A + A B = ɛ A + ɛ<1 ɛ A + ɛ A B ) (45) ɛ B +. (46) For all t R d with t = 1 it follows (A B)t max (A B) t = A B t =1 (46) ɛ B + = ɛ t (37) B + ɛ Bt. (47) Finally, since (A B) is a linear function, it holds t R d At Bt ɛ Bt. (48) That is, roughly speaking, for every c > 1 there exists a δ > 0 such that the relative deviation v ṽ v of every vector v T x (0)M \ {0} to a vector ṽ A(R d ) is less than a given tolerance ɛ > 0. Here, A T x (0)M(c, δ) can be chosen arbitrarily. However, in general, no δ > 0 can be determined such that this property holds for all c > 1 which will be demonstrated in the following example: In each neighborhood around x 0 M there exist further points x (1), x () M such that A 1 := x (1) x (0) and A := x () x (0) are linearly dependent, however, in the vector space A(R d ) there exist always vectors which are perpendicular to vectors in T x (0)M. Example 1. Consider the -dimensional manifold M := {x R 3 x 1 + x + x 3 = 0, x i < 1, i = 1, } (49)

17 with the embedding ϕ : U R 3, ϕ(t) := (t 1, t, t 1 t ) T, (50) ), where U := {t R t i < 1}. It is ϕ(u) = M. Let B := ϕ (0) = ( 1 0 0 1 0 0 then it is T x (0)M = B(R ). For 1 > δ > 0 and t (0) := (0, 0) T, t (1) := ( δ, 0)T, t () := ( δ 4, 0)T let x (0) := ϕ(t (0) ) = (0, 0, 0) T, x (1) := ϕ(t 1 ) = ( δ, 0, δ 4 )T, and x () := ϕ(t () ) = ( δ δ 4, 0, 16 )T. Then, it is x (0), x (1), x () M B δ (x 0 ), and the vectors A 1 := x (1) x (0) = ( δ, 0, δ 4 A := x () x (0) δ4 δ = (, 0, 16 ) T ) T (51) are linearly independent for every value of δ 0. However, the subspace spanned by A := (A 1, A ) contains vectors that are orthogonal to T x (0)M such as v = (0, 0, 1) T (compare to Figure ). It holds A + = (A + 1, A+, A+ 3 ), with A+ 1 = ( δ, 8 δ )T, A + = (0, 0)T, A + 3 = ( 8 δ, 16 δ ) T. Thus, for the condition number of A it holds κ (A) = A A + Ae A + e 3 = δ + 4 δ + 1 (5) 8 + 1 for δ 0 and thus there does not exist any constant c 1 with κ (A) c required in Theorem. δ > 0 as Corollary 1. For all ɛ > 0, δ > 0, c 1 there exists δ > 0 such that A T x (0)M(c, δ) it holds v T x (0)M B δ(0) ṽ A(R d ) : ṽ v ɛ. (53) Proof. Let v T x (0)M B δ(0). By Theorem there exists a matrix B R N d such that v T x (0)M B δ(0) t R d with v = Bt. Let ṽ := At, it follows ṽ v = At Bt ɛ δ Bt = ɛ δ v ɛ. (54) The above result gives a hint of how to approximate the tangent space: Given x (0) M, one can compute d further solutions x (i) M, i = 1,..., d, in the vicinity of x (0). If for the matrix A of secants it holds rank(a) = d (55) κ (A) c (56) A i + x (0) M B δ (x (0) ), (57)

18 Fig.. The vectors v i, i = 1, (which are multiples of the secants A i for sake of a better visualization) span a subspace that contains vectors that are orthogonal to the (exact) tangent space T x (0)M. then it is A T x (0)M(c, δ). If further c and δ are small enough, then one can expect due to Corollary 1 that rge(a) serves as a good approximation of T x (0)M. However, so far nothing is gained from the practical point of view since it is still unclear how to choose the neighboring solutions x (i), and for a given set of solutions the conditions (55) to (57) have to be checked. In the following, a result is stated that is the basis for the successive approximation of the tangent space that is proposed in the next section. As an additional bonus, the verification of the conditions (55) to (57) will get superfluous. Theorem 3. Let c 1, then it holds: (a) There exists δ > 0 such that δ l, δ u R [ ] with 0 < δ u < δ and δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u there exist vectors x (i) M B δ (x (0) ) i = 1,..., d such that (a1) x (i) x (0) [δ l, δ u ] i = 1,..., d (a) 1 x (i) x (0) (x (j) x (0) ) ] [δ u c δ l δ u (1+c )(d 1), δ l + c δ l δ u (1+c )(d 1) i j. [ ] (b) δ, δ l, δ u R with 0 < δ u < δ and δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u and x (i) M B δ (x (0) ) such that (a1) and (a) are satisfied, it holds A := (x (1) x (0),..., x (d) x (0) ) T x (0)M(c, δ). (58) Proof. Ad (a): Due to Lemma 1 there exists δ > 0, such that δ R with 0 < δ < δ there exists a matrix A = (A 1,..., A d ) T x (0)M(1, δ) with A i = δ,

19 i = 1,..., d, and A i A j i j. Let δ l, δ u R with 0 < δ u < δ and δ l 0 < δ l δ u, since 0 < (1+c )(d 1)+ (1+c )(d 1)+c [ (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u ], then it holds (1+c )(d 1)+ (1+c )(d 1)+ = 1. It follows that [δ l, δ u ]. Let x (i) := A i + x (0), then it is x (i) M B δ (x (0) ) i = 1,..., d. Choosing δ δl = +δ u δl it follows that δ l = +δ l δ δu +δ u = δ u, and it is x (i) x (0) = A i = δ [δ l, δ u ] i = 1,..., d, (59) i.e., condition (a1) holds for the chosen solutions x (i). Further, it holds for all i j Furthermore, it is 1 x (i) x (0) (x (j) x (0) ) = 1 A i A j (A i A j) = = δ. 1 ( A i + A j ) (60) (1 + c )(d 1) + (1 + c )(d 1) + c δ u δ l (61) (1 + c )(d 1)δ u + δ u (1 + c )(d 1)δ l + c δ l (6) δ u + δ u (1 + c )(d 1) δ l + c δ l (1 + c )(d 1) (63) δ u δ l + c δ l δ u (1 + c )(d 1). (64) It follows that δ l + δ u δ l + c δ l δ u (1 + c )(d 1) (65) and δu c δl δ u (1 + c )(d 1) δ l + δ u. (66) Combining the above results it follows i j 1 x (i) x (0) (x (j) x (0) ) = δ = δ l + δ u [ δu c δ l δ u (1 + c )(d 1), δ l + c δ l δ u (1 + c )(d 1) ].

0 Hence, condition (a) holds for the chosen x (i), and the claim follows. [ ] Ad (b): Let δ, δ l, δ u R with 0 < δ u < δ and δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u and let x (i) M B δ (x (0) ) i = 1,..., d. Let A i := x (i) x (0) and A := (A 1,..., A d ) R N d, then A T A = ( A i, A j ) d i,j=1 R d d is symmetric and has d real eigenvalues λ i, i = 1,..., d. Let K i := z R z A i, A i d j=1,j i A i, A j, (67) then it follows by the Theorem of Gerschgorin (e.g., Atkinson (1989)) that every eigenvalue λ i of A T A is contained in d i=1 K i. By condition (a1) it holds A i, A i = A i [δl, δ u], i = 1,..., d, and by condition (a)it is 1 A i A j [δ u c δ l δ u (1+c )(d 1), δ l + Putting this together it follows i j c δ l δ u (1+c )(d 1) ] i j. A i, A j = 1 Ai + A j A i A j = { 1 1 (a1),(a) = ( ) Ai + A j A i A j, if Ai + A j A i A j ( ) Ai A j A i A j, if Ai + A j < A i A j δu δl + ( δ u c δ l δ u (1 + c )(d 1). ) c δ l δ u (1+c )(d 1), if A i + A j A i A j c δ l δ u (1+c )(d 1) δ l, if A i + A j < A i A j For all z K i with z A i it follows z A i = z A i, A i z A i + d j=1,j i = δ u + c δ l δ u (1 + c ). d j=1,j i A i, A j A i, A j δu c δl + (d 1) δ u (1 + c )(d 1)

1 Furthermore, for all z K i with z A i it is d A i z = z A i, A i A i, A j d z A i j=1,j i j=1,j i A i, A j δl c δl (d 1) δ u (1 + c )(d 1) Hence, it is K i = δ l c δ l δ u (1 + c ). [ ] δl c δ l δ u (1+c ), δ u + c δ l δ u (1+c ) i = 1,..., d and it follows K := d K i i=1 [ δl c δl δ u (1 + c ), δ u + c δl ] δ u (1 + c. (68) ) The following consideration shows that all eigenvalues λ 1... λ d of A T A are strictly positive. It is Hence, λ1 λ d λ i inf z K z δ l c δ l δ u (1 + c ) > δ l c δ l + δ l (1 + c ) = δ l (c + 1) (1 + c ) δ l = 0. is defined and it holds c δ l δ u (1+c ) λ 1 δ u + λ d δl c δl δ u (1+c ) = (1 + c )δ u + c δ l δ u (1 + c )δ l c δ l + δ u = c (δ u + δ l ) δ l + δ u = c. Since A T A has only strictly positive eigenvalues it is rank(a T A) = d and it follows that also rank(a) = d. Let σ 1... σ d be the singular values of A, then it holds σ i = λ i, and it follows κ (A) = σ 1 λ1 = c. σ d λ d Hence, it is A T x (0)M(c, δ), and the proof is complete.

4 The Algorithms Here, three different strategies for the successive approximation of the tangent space are presented that are based on the considerations made in the previous section. Given an initial solution z (0) = (x (0), α (0) ) M R N, N = n + k, where M = F 1 (0) and F the map defined in (11), all methods aim to find suitable neighboring solutions z (i) M in the vicinity of z (0). While the first two approaches work directly in the complete (x, α)-space, the third approach splits the x- and the α-space for the successive approximation. The first method considered here (see Algorithm 1) is straightforward: successively d neighboring solutions z (i) M B δ (z (0) ) are computed starting from randomly chosen starting points in the vicinity of z (0). If the resulting matrix A satisfies (a1) and (a) from Theorem 3, then rge(a) can be viewed as a suitable approximation of T x (0)M. Algorithm 1 (Randomly chosen solutions z (i) ) (S1) Choose δ > 0, set i := 1. (S) Choose z (i) B δ (z (0) ) R N uniformly at random. (S3) Solve F ( ) starting with z (i). (S4) If no acceptable solution has been found in (S3), go to (S), else proceed with the solution z (i) with F (z (i) ) 0. (S5) Set A i := z (i) z (0). (S6) If i < d + 1 go to (S), else STOP. Here, a solution z is defined to be acceptable if the value F (z ) is below a given (low) threshold. In practice, it has been observed that Algorithm 1 already yields sufficient approximations of the tangent spaces even though it does not check the conditions (a1) and (a) from Theorem 3 (refer to the numerical results presented in the next section). However, as Example 1 shows, an acceptable approximation cannot be expected in general, even for arbitrarily small values of δ. To prevent such cases, the next algorithm is constructed. For this, the following penalty functions will be needed: ( ) z (i) z (0) δ l, if z (i) z (0) < δl h 0 (δ l, δ u, z (0), z (i) ( ) ) := z (i) z (0) δu, if z (i) z (0) > δu 0, else (69) h 1 (d, c, δ l, δ u, z (i), z (j) ) := ( ( )) 1 z(i) z (j) δu c δ l δ u (1+c )(d 1), if 1 ( ( 1 z(i) z (j) δl + c δ l δ u (1+c )(d 1) 0, else z(i) z (j) < δ u c δ l δ u (1+c )(d 1) )), if 1 z(i) z (j) > δ l c δ l δ u (1+c )(d 1) (70)

3 By construction of h 0 and h 1 it holds: h 0 (δ l, δ u, x (0), x (i) ) = 0 z (i) z (0) satisfies (a1) from Theorem 3. h 1 (d, c, δ l, δ u, z (i), z (j) ) = 0 j < i 1 z(i) z (j) satisfies (a) from Theorem 3. Algorithm is based on the result in the previous section since it aims to find a distribution of the solutions z (i) as discussed in Theorem 3. Algorithm (Distribution of solutions z (i) via penalization) [ ] (S1) Choose δ u > 0, c 1, δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u and set i := 1. (S) Choose z (i) B δu (z (0) ) R N uniformly at random. (S3) Solve F ( ) + w 0 h 0 (δ l, δ u, z (0), ( )) + w 1 j<i h 1(d, c, δ l, δ u, ( ), z (j) ) with weights w 0, w 1 R + starting with z (i). (S4) If no acceptable solution has been found in (S3), go to (S), else proceed with the solution z (i) with F (z (i) ) +w 0 h 0 (δ l, δ u, z (0), z (i) )+w 1 j<i h 1(d, c, δ l, δ u, z (i), z (j) ) = 0. (S5) Set A i := z (i) z (0) and i := i + 1. (S6) If i < d + 1 go to (S), else STOP. Crucial is of course the computation of the minimizers of g MOP (x) = F ( ). Note that F already contains the gradient information of each objective. Hence, if the gradient g MOP ( z) is evaluated directly, the Hessians H i (z) of each objective at x have to be computed. To prevent this, it is suggested here to approximate g MOP ( z) by finite differences or (which is more accurate) to compute it using automatic differentiation (Griewank 000). In that case and assuming that n >> k, the cost to obtain g MOP ( z) scales basically linearly with n in terms of memory (if not the entire matrix is stored at once but evaluated one by one) and quadratic in terms of flops. If a derivative free solver is used, the number of flops grows only linearly with n. These costs hold ideally also for the entire approximation of the tangent space. The above methods can in principle be applied to any given problem of the form (1). Based on numerical experiments on high dimensional MOPs the authors of this work have, however, observed a different sensibility in x and α space of the map F leading to the problem of finding a proper value of δ for the neighborhood search. As a possible remedy, it is here suggested to split the two different spaces as follows: instead of choosing a neighbor solution z (i) = ( x (i), α (i) ) B δ (z (i) ) which is corrected to the solution set (steps (S) and (S3) of Algorithm 1), one computes neighboring solution by varying the weight vector α (i) once in the beginning and tackling F α (i)(x) := F (x, α (i) ) for fixed α (i). Algorithm 3 details this procedure. Algorithm 3 (Distribution of solutions z (i) via variation of α)

4 (S1) Choose δ α, δ z > 0, i := 1. (S) Choose α (i) B δα (α (0) ) R k with α (i) 1 = 1 and α (i) 0 at random. (S3) Solve F α (i)(x) starting with x (0). (S4) If (x (i), α (i) ) / B δz (x (0), α (0) ), set δ α := δα and go to (S ). (S5) If no acceptable solution could be computed go to (S ), else set x (i) as the obtained solution. (S6) Set A i := (x (i), α (i) ) (x (0), α (0) ). (S7) If i < d + 1 go to (S ), else STOP. 5 Results In this section, first the mechanism of Algorithm to select new solutions z (i) is demonstrated. Further on, the performances of the resulting PC methods are tested and compared against the classical implementation. 5.1 Revisit of Example 1 In Example 1, two points x (1) and x () have been chosen such that the space spanned by x (1) x (0) and x () x (0) is orthogonal to the tangent space T x (0)M, and one could find such points in every ball B δ (x (0) ). The example demonstrated what can go wrong if one does not take care of the condition constraint in Theorem. The following discussion shows that for a given point x (1) Algorithm computes another point x () which in case δ is small enough prevents the spanned space to be orthogonal to T x (0)M and which even serves as an approximation of T x (0)M. Consider the -dimensional manifold from Example 1 and the vector given therein. For δ = 3 Setting δ u := 1, c := M := {x R 3 x 1 + x + x 3 = 0, x i < 1, i = 1, } (71) ) T A 1 := x (1) x (0) δ = (, 0, δ (7) 4 it holds A 1 = ( 3 4, 0, 9 ) T. (73) 16 3 3 and d =, it follows that (1+c )(d 1)+ (1+c )(d 1)+c = 3. Choosing δ l := 3 δu+δu = 5 6 leads to δ l [ 3 δ ] u, δ u as required in Algorithm, and also A 1 satisfies A 1 [δ l, δ u ] = [ 5 6, 1]. Similar to Example 1, one can choose another vector A := x () x (0) = x () M, but in contrast to Example 1

5 A is chosen here such that it satisfies conditions (a1) and (a) as Algorithm does. Defining the vector leads [ to A = 15 δu c δ l δ u ( A := 0, 3 4, 9 ) T (74) 16 16 [ 5 6, 1] and 1 A 1 A = 9 16 [ 469 936, ] ] 1117 936 = (1+c )(d 1), δ l + c δ l δ u (1+c )(d 1). Thus, A 1 and A satisfy (a1) and (a). Figure 3 illustrates the areas of points which satisfy (a1) and (a) and the approximated tangent space. Choosing a smaller value for δ leads to a better approximation of T x (0)M, e.g. for δ = 3 4 it holds A 3 = ( 3 8, 0, 9 ) T. (75) 64 Setting δ u := 9 0, c := 3 3, d = and choosing δ l in the same way as above leads to δ l = 3 8 [ 3 δ ] u, δ u. In addition, A3 satisfies A 3 [δ l, δ u ] = [ 3 8, 0] 9. Defining the vector A 4 := leads to A 4 = 3 64 73 [ 3 [ δu c δ l δ u ] 8, 9 0 ( 0, 3 8, 9 ) T (76) 64 ] and 1 A 3 A 4 = 9 64 [ 41 41600, 41600] 10053 =. Thus, A 3 and A 4 satisfy (a1) and (a). Fig- (1+c )(d 1), δ l + c δ l δ u (1+c )(d 1) ure 4 illustrates the areas of points which satisfy (a1) and (a) and the approximated tangent space spanned by A 3 and A 4. 5. Testing the PC Methods Now the performances of the different PC methods when approximating the tangent space successively are tested and compared. As base algorithm it was chosen to use the recovering technique presented in Dellnitz et al. (005) and Schütze et al. (005). This method uses boxes as a tool to maintain a spread of the solutions: The domain R is partitioned by a set of boxes. Every solution z of F is associated with the box which contains z, and only one solution is associated with each box. The idea of the recovering algorithm is to detect from a given box which contains a solution of F neighboring boxes which contain further solutions of F, and so on. By this, the solution set is represented by a box collection C. Ideally, i.e., for a perfect outcome set, the associated box collection C covers the entire Pareto set tightly. In the following, it will be distinguished between the classical recover algorithm R QR as described in Schütze et al. (005) which uses a QR-decomposition of F to obtain the tangent space, and the modifications R Alg.1, R Alg., and R Alg.3 which are obtained via a successive approximation of the tangent space via Algorithm 1 to 3, respectively.

6 Fig. 3. The vectors v i, i = 1, (which are multiples of the secants A i for sake of a better visualization) span a subspace that approximates the (exact) tangent space T x (0)M. The horizontal area marks the points which satisfy (a1), the vertical area marks the points wich satisfy (a) and their intersection marks the points wich satisfy both. To compare the performance of the three different PC algorithms the following scalable MOP taken from Schütze et al. (005) is used: f 1, f, f 3 : R R n R n f i (x) = (x j a i j) + (x i a i i) 4, j=1 j i (77) where a 1 = (1, 1, 1, 1,...) R n a = ( 1, 1, 1, 1,...) R n a 3 = (1, 1, 1, 1,...) R n. For the application of the recovering techniques the domain R = [ 1.5, 1.5] n has been chosen. Table 1 shows a comparison of the algorithms R QR, R Alg.1, and R Alg. for different values of n on the benchmark problem. Here, all procedures have been started with one single solution z = (a 1, α 1 ), where α 1 = (1, 0, 0) T, i.e., with the solution of the first objective f 1. For all required scalar optimization problems in both the predictor and the corrector step the derivative free Quasi Newton method e04jyf of the NAG Fortran package 4 has been used. For all cases in Table 1, it holds for two box collections C 1 and C with C 1 > C that the 4 http://www.nag.co.uk/

7 Fig. 4. The vectors v i, i = 3, 4 (which are multiples of the secants A i for sake of a better visualization) span a subspace that approximates the (exact) tangent space T x (0)M. The horizontal area marks the points which satisfy (a1), the vertical area marks the points wich satisfy (a) and their intersection marks the points wich satisfy both. collection C 1 is indeed a superset of C (though the differences do not play a major role in this case since additional boxes are mostly neighboring solutions, and no significant difference could be observed when considering the Pareto fronts. In all cases, the box collections are near to a perfect covering of the Pareto set.). Though the approximation qualities are basically equal, this does not hold for the computational times. In all cases, R Alg.1 is the fastest method, and for n = 1, 000, R QR is the slowest method (which holds as well for all larger values of n, where R QR is applicable). Among the two novel methods R Alg.1 and R Alg., it can be observed as anticipated that R Alg.1 is a bit faster while R Alg. tends to find more solutions which is probably due to the better approximation of the tangent space. Figure 5 shows the result obtained by R Alg. for n = 1, 000. In order to treat parameter dimensions n > 1, 000 it was observed that the best strategy is to approximate the tangent space via a split of x- and α-space as done in Algorithm 6. Further, it is advantageous to use a solver that uses gradient information such as the limited memory BFGS method (Shanno 1978). Table lists the number of function and derivative calls as well as the CPU time of R Alg.3 applied on MOP 77 for n = 100 up to n = 100, 000, where # F is the number of derivative calls of F α (i)(x). Figure 6 shows the result for n = 100, 000. None of the other methods could obtain similar results. Finally, an attempt has been made to estimate the Hausdorff distance between the x-part of the candidate set obtained by the different PC methods (X QR, X Alg.1, X Alg., X Alg.3 ) and the x-part of the Pareto set P Q = (X PQ, α PQ ). To approximate X PQ, all the algorithms described above have been run using a much smaller box size with different starting values, the resulting

Finally, an attempt has been made to estimate the Hausdorff distance between the x-part of the candidate set obtained by the different PC methods (X_QR, X_Alg.1, X_Alg.2, X_Alg.3) and the x-part of the Pareto set P_Q = (X_PQ, α_PQ). To approximate X_PQ, all the algorithms described above have been run using a much smaller box size and different starting values; the resulting non-dominated solutions have been merged to obtain a reference set X_ref ⊂ X_PQ (this led to 70,000 non-dominated solutions). Since boxes are used to represent the sets of interest, the metric induced by the 2-norm has been chosen to calculate the Hausdorff distance (compare to Definition 3). The boxes used to compute the box collections shown in Table 1 have a side length of d_b = 0.1875. Table 3 lists the resulting Hausdorff distances obtained by the different methods for the dimensions n = 100 and n = 1,000. Since all values are below d_b, all approximations can be considered good enough with respect to the precision induced by the boxes.

Table 1. Comparison of the recovering methods R_QR, R_Alg.1, and R_Alg.2 on MOP (77), where the derivative-free routine e04jyf has been used to solve the required scalar optimization problems. Listed are the number of boxes generated by the recovering algorithms, the CPU time (in seconds), and the numbers of function calls #F and derivative calls #∇F.

   n      |        |      R_QR |   R_Alg.1 |   R_Alg.2
   100    | #boxes |         3 |        33 |        39
          | CPU    |        .9 |        .9 |        5.
          | #F     |    895,59 |   1,83,57 | 1,848,806
          | #∇F    |         3 |         0 |         0
   200    | #boxes |        37 |        41 |        50
          | CPU    |        14 |      11.9 |         1
          | #F     |  ,169,144 |   ,89,431 |  4,178,00
          | #∇F    |        37 |         0 |         0
   500    | #boxes |        30 |        30 |        37
          | CPU    |       134 |        91 |       144
          | #F     |  5,765,70 |  8,69,507 | 11,70,643
          | #∇F    |        30 |         0 |         0
   1,000  | #boxes |        36 |        57 |        41
          | CPU    |       965 |       500 |       709
          | #F     |  1,654,54 | 0,700,411 |  5,561,88
          | #∇F    |        36 |         0 |         0

Table 2. Numerical results for R_Alg.3 on MOP (77) for n = 100 to n = 100,000 parameters.

   n        |     #F |    #∇F |   CPU | #boxes
   100      |   ,947 |  7,390 |  0.09 |     67
   200      |  1,893 |  7,050 |   0.3 |     76
   500      |    1,7 |  6,984 |  0.47 |      7
   1,000    |  3,155 |  7,387 |   0.9 |     63
   2,000    |    ,15 |  7,073 |       |      7
   5,000    |  1,170 |  6,754 |   4.5 |     61
   10,000   |   ,069 |   7,03 |   9.4 |      6
   20,000   |    1,3 |  6,796 |     1 |     60
   50,000   |   ,385 |  7,158 |    70 |      6
   100,000  |   1,67 |   6,91 |   183 |     69

Table 3. Hausdorff distance of the results of all PC methods to the estimated Pareto front of MOP (77) for n = 100 and n = 1,000 parameters.

   n      | d_H(X_ref, X_QR) | d_H(X_ref, X_Alg.1) | d_H(X_ref, X_Alg.2) | d_H(X_ref, X_Alg.3)
   100    |           0.1348 |              0.1481 |              0.1485 |               0.160
   1,000  |           0.1709 |                0.17 |               0.154 |              0.1543

6 Conclusions and Future Work

In this paper, the numerical treatment of high dimensional multi-objective optimization problems has been addressed by means of continuation methods. The bottleneck for such problems is the approximation of the tangent space: given a solution, the cost to obtain a further solution is, when full Hessians of the objectives are used, O(n^2) in terms of memory and O(n^3) in terms of flops, where n is the dimension of the parameter space. Alternatively, it was suggested to perform a successive approximation of the tangent space, which is based on an analysis of the geometry of the problem as presented in Section 3. The cost of the novel method is O(n) in terms of memory and O(n^2) in terms of flops. Finally, the new approach has been tested within a particular predictor-corrector method on a benchmark model with up to n = 100,000 dimensions, yielding superior results to the classical implementation.

The presented approach is not restricted to the solution of high dimensional multi-objective optimization problems. In fact, any parameter-dependent root-finding problem of high dimension can be tackled efficiently with the presented continuation method (under the assumptions discussed in Section 1).

For the future, it would be interesting to apply the novel method to a real-world problem, which would also allow for an improved comparison with other methods. The techniques are of importance for various technical and engineering applications. For example, in the numerical solution of optimal control problems for complex systems such as multi-body systems, one is faced with high dimensional multi-objective optimization problems. Using a direct method, the optimal control problem is transformed into a nonlinear optimization problem by discretizing all state and control variables in time. The discrete states and controls defined on a discrete time grid are the optimization variables of the optimization problem (see, e.g., Leyendecker et al. (2009) or Ober-Blöbaum et al. (2010) for single-objective optimal control problems). To meet accuracy requirements, the time discretization has to be fine enough, which leads, especially for long time spans such as those arising in the trajectory design for space missions, to a high number of optimization variables. Since the minimization of not only one but rather multiple conflicting objectives is of interest (e.g., minimal fuel consumption and minimal flight time), the high dimensional optimization problem becomes a multi-objective problem (Dellnitz et al. 2009). Considering partial differential equation constrained optimization problems (for an overview see Biegler et al. (2003)), a discretization in space and time leads to an even higher number of optimization parameters. Using the presented algorithms, one is able to compute the entire Pareto set for these multi-objective optimal control problems in an efficient way.

Acknowledgments

The last author acknowledges support from CONACyT project no. 18554.

References

Allgower, E. L. and Georg, K. (1990), Numerical Continuation Methods, Springer.
Atkinson, K. E. (1989), An Introduction to Numerical Analysis, Springer, New York.
Biegler, L. T., Ghattas, O., Heinkenschloss, M. and van Bloemen Waanders, B. (2003), Large-Scale PDE-Constrained Optimization, Springer Lecture Notes in Computational Science and Engineering.
Coello Coello, C. A., Lamont, G. B. and Van Veldhuizen, D. A. (2007), Evolutionary Algorithms for Solving Multi-Objective Problems, second edn, Springer, New York. ISBN 978-0-387-3354-3.
Das, I. and Dennis, J. (1998), Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems, SIAM Journal on Optimization 8, 631–657.
Deb, K. (2001), Multi-Objective Optimization Using Evolutionary Algorithms, Wiley.
Dellnitz, M., Ober-Blöbaum, S., Post, M., Schütze, O. and Thiere, B. (2009), A multiobjective approach to the design of low thrust space trajectories using optimal control, Celestial Mechanics and Dynamical Astronomy 105(1), 33–59.
Dellnitz, M., Schütze, O. and Hestermeyer, T. (2005), Covering Pareto sets by multilevel subdivision techniques, Journal of Optimization Theory and Applications 14, 113–155.
Dellnitz, M., Schütze, O. and Sertl, S. (2002), Finding zeros by multilevel subdivision techniques, IMA Journal of Numerical Analysis 22(2), 167–185.
Deuflhard, P. (2004), Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive Algorithms, Springer.
Eichfelder, G. (2008), Adaptive Scalarization Methods in Multiobjective Optimization, Springer, Berlin Heidelberg. ISBN 978-3-540-79157-7.
Fliege, J. (2004), Gap-free computation of Pareto-points by quadratic scalarizations, Mathematical Methods of Operations Research 59, 69–89.
Golub, G. H. and Van Loan, C. F. (1996), Matrix Computations, Johns Hopkins University Press.
Griewank, A. (2000), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, number 1 in Frontiers in Appl. Math., SIAM, Philadelphia, PA.
Guddat, J., Vasquez, F. G., Tammer, K. and Wendler, K. (1985), Multiobjective and Stochastic Optimization Based on Parametric Optimization, Akademie-Verlag.
Harada, K., Sakuma, J., Kobayashi, S. and Ono, I. (2007), Uniform sampling of local Pareto-optimal solution curves by Pareto path following and its applications in multi-objective GA, in GECCO, pp. 813–820.
Hillermeier, C. (2001), Nonlinear Multiobjective Optimization - A Generalized Homotopy Approach, Birkhäuser.
Jahn, J. (1986), Mathematical Vector Optimization in Partially Ordered Linear Spaces, Verlag Peter Lang GmbH, Frankfurt am Main.
Jahn, J. (2006), Multiobjective search algorithm with subdivision technique, Computational Optimization and Applications 35(2), 161–175.
Karush, W. E. (1939), Minima of functions of several variables with inequalities as side conditions, PhD thesis, University of Chicago.
Königsberger, K. (1997), Analysis, Springer.
Kuhn, H. and Tucker, A. (1951), Nonlinear programming, in J. Neyman, ed., Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, pp. 481–492.
Lara, A., Sanchez, G., Coello, C. A. C. and Schütze, O. (2010), HCS: A new local search strategy for memetic multiobjective evolutionary algorithms, IEEE Transactions on Evolutionary Computation 14(1), 112–132.
Leyendecker, S., Ober-Blöbaum, S., Marsden, J. and Ortiz, M. (2009), Discrete mechanics and optimal control for constrained systems, Optimal Control Applications & Methods, DOI: 10.1002/oca.912.
Miettinen, K. (1999), Nonlinear Multiobjective Optimization, Kluwer Academic Publishers.
Nocedal, J. and Wright, S. (2006), Numerical Optimization, Springer Series in Operations Research and Financial Engineering, Springer.
Ober-Blöbaum, S., Junge, O. and Marsden, J. (2010), Discrete mechanics and optimal control: an analysis, ESAIM: Control, Optimisation and Calculus of Variations. DOI: 10.1051/cocv/2010012.
Pareto, V. (1971), Manual of Political Economy, The MacMillan Press.
Rakowska, J., Haftka, R. T. and Watson, L. T. (1991), Tracing the efficient curve for multi-objective control-structure optimization, Computing Systems in Engineering 2(6), 461–471.
Rheinboldt, W. C. (1986), Numerical Analysis of Parametrized Nonlinear Equations, Wiley.
Ringkamp, M. (2009), Fortsetzungsalgorithmen für hochdimensionale Mehrzieloptimierungsprobleme, Diploma thesis, University of Paderborn.
Schäffler, S., Schultz, R. and Weinzierl, K. (2002), A stochastic method for the solution of unconstrained vector optimization problems, Journal of Optimization Theory and Applications 114(1), 209–222.
Schütze, O., Coello, C. A. C., Mostaghim, S. and Talbi, E.-G. (2008), Hybridizing evolutionary strategies with continuation methods for solving multi-objective problems, Engineering Optimization 40(5), 383–402.
Schütze, O., Dell'Aere, A. and Dellnitz, M. (2005), On continuation methods for the numerical treatment of multi-objective optimization problems, in J. Branke, K. Deb, K. Miettinen and R. E. Steuer, eds, Practical Approaches to Multi-Objective Optimization, number 04461 in Dagstuhl Seminar Proceedings, Internationales Begegnungs- und Forschungszentrum (IBFI), Schloss Dagstuhl, Germany. <http://drops.dagstuhl.de/opus/volltexte/005/349>.
Schütze, O., Mostaghim, S., Dellnitz, M. and Teich, J. (2003), Covering Pareto sets by multilevel evolutionary subdivision techniques, in C. M. Fonseca, P. J. Fleming, E. Zitzler, K. Deb and L. Thiele, eds, Evolutionary Multi-Criterion Optimization, Lecture Notes in Computer Science.
Shanno, D. F. (1978), On the convergence of a new conjugate gradient algorithm, SIAM Journal on Numerical Analysis 15, 1247–1257.
Steuer, R. E. (1986), Multiple Criteria Optimization: Theory, Computation, and Applications, John Wiley & Sons, Inc.

Fig. 5. Numerical result of R_Alg.2 on MOP (77) for n = 1,000. (a) Parameter space: the projection of the final box collection onto x_1, x_2, and x_3. (b) Image space: the obtained Pareto front.