Handling High Dimensional Problems with Multi-Objective Continuation Methods via Successive Approximation of the Tangent Space


Maik Ringkamp¹, Sina Ober-Blöbaum¹, Michael Dellnitz¹, and Oliver Schütze²

¹ University of Paderborn, Faculty of Computer Science, Electrical Engineering and Mathematics, Warburger Strasse 100, Paderborn, Germany
{ringkamp,dellnitz,sinaob}@math.uni-paderborn.de

² CINVESTAV-IPN, Computer Science Department, Av. IPN 2508, Col. San Pedro Zacatenco, Mexico City, Mexico
schuetze@cs.cinvestav.mx

Abstract. In many applications, several conflicting objectives have to be optimized concurrently, leading to a multi-objective optimization problem. Since the set of solutions, the so-called Pareto set, typically forms a (k−1)-dimensional manifold, where k is the number of objectives considered in the model, continuation methods such as predictor-corrector (PC) methods are in certain cases very efficient tools to rapidly compute a finite size representation of the set of interest. However, their classical implementation leads to trouble when considering higher dimensional models (i.e., for dimension n > 1000 of the parameter space). In this work, it is proposed to perform a successive approximation of the tangent space which allows one to find promising predictor points with lower effort, in particular for high dimensional models, since no Hessians of the objectives have to be calculated. The applicability of the resulting PC variant is demonstrated on a benchmark model for up to n = 100,000 parameters.

Keywords: multi-objective optimization, continuation, tangent space approximation, high dimensional problems.

1 Introduction

In a variety of applications in industry and finance the problem arises that several objective functions have to be optimized concurrently. For instance, for a perfect economical production plan, the ultimate desire would be to simultaneously minimize cost and maximize quality. This example already illustrates

a natural feature of these problems, namely that the different objectives typically contradict each other and therefore certainly do not have identical optima. Thus, the question arises how to approximate a particular optimal compromise (see e.g. Miettinen (1999) for an overview of widely used interactive methods) or how to compute or approximate all optimal compromises of this multi-objective optimization problem (MOP). The latter will be considered in this article.

Mathematically speaking, in a MOP there are given k objective functions f_1, ..., f_k : R^n → R which have to be minimized. The set of optimal compromises with respect to the objective functions is called the Pareto set¹. A point x ∈ R^n in parameter space is said to be optimal or a Pareto point if there is no other point which is at least as good as x in all the objectives and strictly better in at least one objective. This work will concentrate on the approximation of the Pareto set.

Multi-objective optimization is currently a very active area of research. By far most of the methods for the computation of single Pareto points or the entire Pareto set are based on a scalarization of the MOP (i.e., a transformation of the original MOP into one or a sequence of scalar optimization problems, see, e.g., Das and Dennis (1998), Fliege (2004) or Eichfelder (2008)). For a survey of these and further methods the reader is referred to Miettinen (1999) for nonlinear MOPs and to Jahn (1986) and Steuer (1986) in the linear case. Another way to attack the problem is by using bio-inspired heuristics such as Evolutionary Algorithms (e.g., Deb (2001), Coello Coello et al. (2007)). In such methods, an entire set of candidate solutions (population) is considered and iterated (evolved) simultaneously, which allows for a finite size representation of the entire Pareto set in one run of the algorithm. These methods work without gradient information of the objectives and are particularly advantageous in the situation where the MOP is not differentiable and/or where the objectives contain many local minima. A method which is based on a stochastic approach is presented in Schäffler et al. (2002). In this work, the authors derive a stochastic differential equation which has the property that it is very likely to observe corresponding solutions in a neighborhood of the set of (local) Pareto points. Similar to the evolutionary strategies, the idea of Schäffler et al. (2002) is to directly approximate the entire solution set and not just single Pareto points on the set. Another way to compute the entire Pareto set is to use subdivision techniques (Dellnitz et al. 2005, Jahn 2006). These algorithms start with a compact subset Q ⊂ R^n of the domain and generate outer approximations of the Pareto set which get finer under iteration (see Dellnitz et al. (2002) for a convergence result). The approach is of global nature and hence in practice restricted to moderate dimensions of the parameter space.

Typically, that is, under mild regularity conditions on the model, the set of Pareto points forms locally a (k−1)-dimensional manifold if there are k smooth objective functions. This is the reason why continuation methods such as predictor-

¹ Named after the economist Vilfredo Pareto (1848–1923).

corrector (PC) methods for the computation of general implicitly defined manifolds (e.g., Rheinboldt (1986), Allgower and Georg (1990)) can be successfully applied in the context of multi-objective optimization, see e.g. Guddat et al. (1985), Rakowska et al. (1991), Hillermeier (2001) or Schütze et al. (2005).

In the following, the working principle of a PC method is described. For the sake of a better understanding, the particular case of a bi-objective problem is considered here (i.e., a MOP with k = 2 objectives, where it is assumed that the Pareto set can be expressed by a curve; see Figure 1 for such an example). For a more general and thorough description of PC methods the reader is referred e.g. to Allgower and Georg (1990) or Section 2. Given a Pareto optimal solution x_i ∈ P_Q, where P_Q denotes the Pareto set of the given problem, a further solution x_{i+1} ∈ P_Q near to x_i is computed in the following two steps:

(P) Predict a point p_{i+1} ∈ R^n such that p_{i+1} − x_i points along the solution set. Typically, this is done by linearizing P_Q at x_i. That is, one can choose p_{i+1} := x_i + tν, where t ∈ R\{0} is a step size and ν is the tangent vector of P_Q at x_i.

(C) Compute a solution x_{i+1} ∈ P_Q in the vicinity of p_{i+1} (p_{i+1} is corrected to the solution set).

It is widely accepted (e.g., Allgower and Georg (1990)) that the additional computation of the predictor is more beneficial for the overall computational efficiency than directly computing x_{i+1} from x_i, due to the locality of the search in step (C) (in case t is chosen properly, p_{i+1} is already very close to P_Q). Proceeding in the same manner, one obtains a method that generates solutions along the Pareto set starting from the initial solution x_i.

Though PC methods are at least locally typically quite effective, they are, however, based on some assumptions. First, an initial solution has to be known or computed before the process can start. Further, it can be the case that P_Q falls into several connected components (which may happen, for instance, if one or more objectives contain discontinuities). Due to their local nature, PC methods are restricted to the connected component that contains the given initial solution. Hence, in order to be able to approximate the entire Pareto set, the PC method has to be fed with multiple initial solutions. As a possible remedy for both problems, PC methods can be combined with global search strategies such as evolutionary algorithms. This has been done e.g. in Schütze et al. (2003), Harada et al. (2007), Schütze et al. (2008) and Lara et al. (2010). One problem remaining is that PC methods may run into trouble for the treatment of higher dimensional MOPs, as addressed here. Given a solution x, the main requirements of a classical (multi-objective) PC method to obtain a further solution are as follows:

(P) In the predictor step, the Hessians ∇²f_i(x) of the objective functions f_i have to be computed. Further, a QR-decomposition of the Jacobian F' of an auxiliary map F has to be computed. F' is, at least in the unconstrained case,

Fig. 1. Given x_i ∈ P_Q, a further solution is computed in two steps by PC methods: (P) a predictor p_{i+1} is generated by linearizing P_Q at x_i, and (C) this point is corrected back to the solution set, yielding x_{i+1}.

mainly given by a weighted sum H(x, α) := Σ_{i=1}^k α_i ∇²f_i(x) of the Hessians ∇²f_i. This yields a linearization of the solution set at x.

(C) For the correction of the predicted solution p obtained via linearization, typically the Gauss-Newton method applied to F is used, which requires H(x^(i), α^(i)) in each iteration i and the solution of a linear system of equations (of dimension m > n, where x ∈ R^n; however, for large values of n one can assume m ≈ n).

Hence, the cost to obtain a further solution is O(n²) in terms of memory and O(n³) in terms of flops for full matrices H(x, α), and the algorithm runs into trouble on a standard PC for n > 1,000. One possible remedy for high dimensional MOPs is certainly to exploit the sparsity of the model (if given). Here, an alternative approach is followed by changing the PC method as follows: (P) perform a successive approximation of the tangent space of the implicitly defined manifold at a given solution x, which is based on some analysis performed on the geometry of the problem (and which is also the main contribution of this work), and (C) replace the Gauss-Newton method by the limited memory BFGS method proposed in Shanno (1978), which is designed and approved for large scale problems. The cost of the novel method is O(n) in terms of memory and O(n²) in terms of flops.

The remainder of this paper is organized as follows: In Section 2, the required background for the understanding of the sequel is given. In Section 3, the analysis for the successive approximation of the tangent space is presented, and in Section 4 the resulting algorithms. In Section 5, some numerical results on an academic model are shown with up to n = 100,000 dimensions, and finally, some conclusions are drawn in Section 6.
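To make the two steps concrete, the following is a minimal sketch of one predictor-corrector iteration as described above. The function names and the use of a simple norm-minimization corrector are illustrative assumptions for this sketch, not the implementation used later in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def pc_step(F, x, tangent, t=0.1):
    """One predictor-corrector step along the zero set of F.

    F       : map R^N -> R^M whose regular zero set is the manifold of interest
    x       : current (approximate) solution, F(x) ~ 0
    tangent : unit vector approximating the tangent of the solution curve at x
    t       : step size
    """
    # (P) predictor: move along the (approximated) tangent direction
    p = x + t * tangent
    # (C) corrector: drive the residual back to zero, starting from the predictor
    res = minimize(lambda z: float(np.dot(F(z), F(z))), p, method="BFGS")
    return res.x
```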

2 Background

This section gives the required background for the predictor-corrector algorithm which is described in the next section: the basic idea of continuation methods is addressed (following mainly Rheinboldt (1986) and Allgower and Georg (1990)), and further the connection to multi-objective optimization is given.

2.1 Continuation Methods

Assume a differentiable map

H : S ⊂ R^N → R^M,  d := N − M ≥ 1,  (1)

of class C^r, r ≥ 1, is given on an open subset S ⊂ R^N. A point x ∈ S is called regular if the first derivative, H'(x) ∈ R^{M×N}, has maximal rank M. Further, assume one is interested in the following system of equations:

H(x) = 0,  x ∈ S.  (2)

In case the regular solution set

M = {x ∈ S : H(x) = 0, x regular}  (3)

is non-empty, it is well-known that M is a d-dimensional C^r-manifold in R^N without boundary (e.g., Rheinboldt (1986)). One possible way to tackle problem (2) numerically is to use continuation methods such as PC methods. Given an initial (approximated) solution x ∈ M, further solutions x^(i) ∈ M near x are found by PC methods via the following two steps:

(P) Predict a set {p_1, ..., p_s} of distinct (and well distributed) points which are near both to x and to M.

(C) For i = 1, ..., s: starting with the predicted point p_i, compute by some (typically few) iterative steps an approximated element x^(i) of M, i.e. H(x^(i)) ≈ 0.

One way to obtain well distributed predictors near to a solution x ∈ M is to compute an orthonormal basis of the tangent space at x via a QR-decomposition of H'(x)^T: The tangent space at a point x ∈ M is given by

T_x M = ker H'(x) = {u ∈ R^N : H'(x)u = 0}.  (4)
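As a quick numerical illustration of (4), an orthonormal basis of T_xM can be obtained directly from the Jacobian H'(x). The sketch below uses SciPy's SVD-based null_space routine instead of the QR-decomposition discussed next, purely for brevity; it is not the construction used in the paper.

```python
import numpy as np
from scipy.linalg import null_space

def tangent_basis(Hprime):
    """Orthonormal basis of ker H'(x) = T_x M for a regular point x.

    Hprime : (M, N) Jacobian of H at x with full row rank M
    returns: (N, d) matrix whose columns span the tangent space, d = N - M
    """
    Q = null_space(Hprime)  # columns form an orthonormal basis of the kernel
    assert Q.shape[1] == Hprime.shape[1] - Hprime.shape[0], "x is not regular"
    return Q
```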

The normal space N_x M at x ∈ M is the orthogonal complement of T_x M:

N_x M = (T_x M)^⊥ = (ker H'(x))^⊥ = rge H'(x)^T.  (5)

Let Q = (Q_1 Q_2) ∈ R^{N×N} be an orthogonal matrix and R = (R_1; 0) ∈ R^{N×M}, where R_1 ∈ R^{M×M} is an upper triangular matrix and 0 ∈ R^{(N−M)×M}, such that

H'(x)^T = QR = (Q_1 Q_2) (R_1; 0).  (6)

If x is regular, it follows that the diagonal elements of R_1 do not vanish, and hence it is straightforward to see that the columns of Q_1 ∈ R^{N×M} provide an orthonormal basis of rge H'(x)^T = N_x M. Thus, an orthonormal basis of T_x M is given by the columns of Q_2 ∈ R^{N×d}. Hence, one may choose predictors p ∈ R^N e.g. by

p := x + tq,  (7)

where t ∈ R\{0} is a step size and q is a column vector of Q_2 (compare to the example related to Figure 1). For particular choices to spread the predictors along the tangent space as well as for step size strategies refer to Ringkamp (2009). The efficient approximation of Q_2 is the main focus of this work.

For the realization of the corrector step (C), typically the Gauss-Newton method

x^(i+1) = x^(i) − H'(x^(i))^+ H(x^(i)),  i = 0, 1, ...,  (8)

where H'(x^(i))^+ ∈ R^{N×M} is the Moore-Penrose inverse of H'(x^(i)), is applied. It is well-known that this method converges quadratically to a point x ∈ M if the starting vector x^(0) ∈ R^N is chosen close enough to M. Refer e.g. to Deuflhard (2004) for a local convergence result. In case of higher dimensions it is suggested here to use the limited memory BFGS method introduced by Shanno (1978),

x^(i+1) := x^(i) + t_i d^(i),  i = 0, 1, ...,  (9)

where t_i is an exact, Powell- or Armijo-step size. With f(x) := ‖H(x)‖₂² it holds d^(0) = −∇f(x^(0)) and for i = 0, 1, ...

d^(i+1) := − ((y^(i))^T s^(i) / ‖y^(i)‖₂²) g^(i+1) − ( 2 (s^(i))^T g^(i+1) / ((y^(i))^T s^(i)) − (y^(i))^T g^(i+1) / ‖y^(i)‖₂² ) s^(i) + ((s^(i))^T g^(i+1) / ‖y^(i)‖₂²) y^(i),

where g^(i) := ∇f(x^(i)), s^(i) := x^(i+1) − x^(i) and y^(i) := ∇f(x^(i+1)) − ∇f(x^(i)).
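For illustration, the two corrector variants just described can be sketched as follows. Treating the correction as the minimization of f(x) = ‖H(x)‖² and delegating the limited memory BFGS iteration to SciPy's L-BFGS-B implementation is an assumption made here for brevity; it is not the exact routine of Shanno (1978).

```python
import numpy as np
from scipy.optimize import minimize

def correct_gauss_newton(H, Hprime, p, tol=1e-10, max_iter=20):
    """Gauss-Newton corrector (8): x <- x - H'(x)^+ H(x), started at the predictor p."""
    x = p.copy()
    for _ in range(max_iter):
        r = H(x)
        if np.linalg.norm(r) < tol:
            break
        x = x - np.linalg.pinv(Hprime(x)) @ r   # Moore-Penrose pseudoinverse step
    return x

def correct_lbfgs(H, p):
    """Corrector suggested for high dimensions: minimize f(x) = ||H(x)||^2 by limited memory BFGS."""
    f = lambda x: float(np.dot(H(x), H(x)))
    res = minimize(f, p, method="L-BFGS-B")
    return res.x
```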

2.2 Multi-Objective Optimization

In a multi-objective optimization problem (MOP) the task is to simultaneously optimize k objective functions f_1, ..., f_k : R^n → R. More precisely, a general MOP can be stated as follows:

min_{x∈Q} F(x),  Q := {x ∈ R^n : h(x) = 0, g(x) ≤ 0},  (MOP)

where the function F is defined as the vector of the objective functions F : Q → R^k, F(x) = (f_1(x), ..., f_k(x)), and h : Q → R^m, m ≤ n, and g : Q → R^q. Optimality of a MOP is defined by the concept of dominance (Pareto 1971).

Definition 1.
(a) Let v, w ∈ R^k. Then the vector v is less than w (v <_p w) if v_i < w_i for all i ∈ {1, ..., k}. The relation ≤_p is defined analogously.
(b) A vector y ∈ R^n is dominated by a vector x ∈ R^n (x ≺ y) with respect to (MOP) if F(x) ≤_p F(y) and F(x) ≠ F(y); else y is called non-dominated by x.
(c) A point x ∈ Q is called (Pareto) optimal or a Pareto point if there is no y ∈ Q which dominates x.

The set of all Pareto optimal solutions is called the Pareto set and is denoted by P_Q. The image F(P_Q) of the Pareto set is called the Pareto front. Fundamental for many methods for the numerical treatment of MOPs is the following theorem of Kuhn and Tucker (1951), which states a necessary condition for Pareto optimality for MOPs with equality constraints².

Theorem 1. Let x* be a Pareto point of (MOP) with q = 0. Suppose that the set of vectors {∇h_i(x*) : i = 1, ..., m} is linearly independent. Then there exist vectors λ ∈ R^m and α ∈ R^k with α_i ≥ 0, i = 1, ..., k, and Σ_{i=1}^k α_i = 1 such that

Σ_{i=1}^k α_i ∇f_i(x*) + Σ_{j=1}^m λ_j ∇h_j(x*) = 0,  (10)
h_i(x*) = 0,  i = 1, ..., m.

In the unconstrained case, i.e. for m = 0, the theorem says that the vector of zeros can be written as a convex combination of the gradients of the objectives at every Pareto point. Obviously, (10) is not a sufficient condition for (local) Pareto optimality. On the other hand, points satisfying (10) are certainly Pareto candidates, and thus, following Miettinen (1999), their relevance is now emphasized by the following

Definition 2. A point x ∈ R^n is called a substationary point or Karush-Kuhn-Tucker point³ (KKT point) if there exist scalars α_1, ..., α_k ≥ 0 and λ ∈ R^m such that (10) is satisfied.

² Without loss of generality only equality constraints are considered here. For a more general formulation of the theorem refer e.g. to Miettinen (1999).
³ Named after the works of Karush (1939) and Kuhn and Tucker (1951) for scalar valued optimization problems.
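As a small illustration of Definition 1 and of condition (10) in the unconstrained case, the following sketch checks dominance between two objective vectors and evaluates the residual ‖Σ α_i ∇f_i(x)‖ for given gradients and weights. The function names are illustrative and not part of the paper's implementation.

```python
import numpy as np

def dominates(Fx, Fy):
    """True if the objective vector Fx dominates Fy (Definition 1 (b))."""
    Fx, Fy = np.asarray(Fx), np.asarray(Fy)
    return bool(np.all(Fx <= Fy) and np.any(Fx < Fy))

def kkt_residual(grads, alpha):
    """Norm of the convex combination of objective gradients, condition (10) with m = 0.

    grads : (k, n) array, row i holds the gradient of f_i at x
    alpha : (k,) weights with alpha_i >= 0 and sum(alpha) = 1
    """
    grads, alpha = np.asarray(grads), np.asarray(alpha)
    assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
    return float(np.linalg.norm(alpha @ grads))
```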

Having stated Theorem 1, one is (following Hillermeier (2001)) in the position to give a qualitative description of the set of Pareto optimal solutions, which gives at the same time the link to (2): Denote by F : R^{n+m+k} → R^{n+m+1} the following auxiliary map:

F(x, λ, α) = ( Σ_{i=1}^k α_i ∇f_i(x) + Σ_{j=1}^m λ_j ∇h_j(x) ;  h(x) ;  Σ_{i=1}^k α_i − 1 ).  (11)

By Theorem 1 it follows that for every substationary point x ∈ R^n there exist vectors λ* ∈ R^m and α* ∈ R^k such that

F(x, λ*, α*) = 0.  (12)

Hence, one expects that the set of KKT points defines a (k−1)-dimensional manifold. This is indeed the case under certain smoothness assumptions, see Hillermeier (2001) for a thorough discussion of this topic.

To estimate the approximation quality of a candidate set generated by an optimization procedure relative to the Pareto front, the Hausdorff distance will be used in this work, which is defined as follows:

Definition 3. Let u, v ∈ R^n, A, B ⊂ R^n, and d(·,·) be a metric on R^n. Then, the Hausdorff distance d_H(·,·) is defined as follows:
(a) dist(u, A) := inf_{v∈A} d(u, v)
(b) dist(B, A) := sup_{u∈B} dist(u, A)
(c) d_H(A, B) := max{dist(A, B), dist(B, A)}

3 Approximation of T_{x^(0)}M

In this section, the geometry of the problem will be analyzed. The results will be the basis for the successive approximation of the tangent space which will be done in the next section.

In the following, assume that M as defined in (3) is a sufficiently smooth d-dimensional manifold, and that a point x^(0) ∈ M is given. In the sequel, matrices are used for the representation of approximations of the tangent space T_{x^(0)}M, which are defined as follows:

Definition 4. Let c, δ ∈ R with c ≥ 1 and δ > 0. Denote by T_{x^(0)}M(c, δ) the set of all matrices A ∈ R^{N×d} with rank(A) = d, condition number κ₂(A) ≤ c,

9 9 and A i + x (0) M B δ (x (0) ) i = 1,..., d, where A i are the columns of A, i.e., T x (0)M(c, δ) := {A = (A 1,..., A d ) R N d rank(a) = d, κ (A) c, A i + x (0) M B δ (x (0) ) i = 1,..., d} (13) Remark 1. (a) Given A T x (0)M(c, δ), the columns A i can be interpreted as secants that intersect M in the two points x (0) and x (i) := A i + x (0). In case the A i s are linearly independent, the image of the linear map A : R N R d, is a d-dimensional subspace of R N. In this way, A(R N ) can be viewed as an approximation of T x (0)M (or the matrix Q as defined in Equation (6)). In the context of PC methods it means that if A is accepted as a suitable approximation of T x (0)M predictors can be chosen e.g. as p := x (0) + t A i A i, (14) where A i is a column vector of A and t is chosen as in (7). (b) For all δ > 0 and 1 c 1 c it holds T x (0)M(c 1, δ) T x (0)M(c, δ). Analog for all c 1 and 0 < δ 1 δ it is T x (0)M(c, δ 1 ) T x (0)M(c, δ ). Lemma 1. There exists δ > 0 such that for all δ R with 0 < δ < δ there exists a matrix A = (A 1,..., A d ) T x (0)M(1, δ) with A i = δ i = 1,..., d and A i A j i j. Proof. The proof is separated into two parts. In (a), the existence of x M l = (H l ) 1 (0) with x x (0) = δ will be shown under some requirements on M l. In (b), it will be first proven for all l {0,..., d} that these requirements hold for some points x (1),..., x (l) M. After that, part (a) will be repeatedly used to construct these points and they will finally be used to create the matrix A. (a): For 1 l d let M l := M M l be a (d l + 1) dimensional submanifold given by M l = (H l ) 1 (0) where H l : B δ (x (0) ) R N R N d+l 1 and x (0) M l. Further let (i) ϕ : B δ (x 0 ) V be a local chart for the d-dimensional submanifold M with ϕ(b δ (x 0 ) M) = V (R d {0}), (ii) M l be a (N l + 1)-dimensional submanifold with a chart ϕ l : R N R N, ϕ l ( M l ) = R N l+1 {0}, (iii) the N d + l vectors (x x (0) ), H1(x), l..., HN d+l 1 l (x) be linearly independent for all x M l Ḃδ(x (0) ) where Ḃδ(x (0) ) := B δ (x (0) ) \ {x (0) }. Then, for all δ > 0 with δ < δ it follows the existence of x M l with x x (0) = δ.

10 10 Proof: Let K := M l B δ(x (0) ). First it is proven that K is compact by showing that each sequence in K has a convergent subsequence with limit in K. Let (x n ) n N K be a sequence, (x n ) n N is bounded and due to the Bolzano- Weierstrass theorem it has a convergent subsequence. In abuse of notation the same index n is used for the subsequence, since the entire sequence is not needed any longer. Thus, it is x n x B δ(x (0) ) B δ (x (0) ) and according to (i) and (ii) it follows that ϕ(x) V and ϕ l (x) N N l+1. By continuity of ϕ and using x n M l n N it is 0 = lim n ϕ d+i(x n ) = ϕ d+i (x) i {1,..., N d}, (15) which implies ϕ(x) V (R d {0}) and so x M. By continuity of ϕ l it follows analogously that x M l, and hence, also that x K which shows the compactness of K. By this it follows the existence of a vector x K with x x (0) = max x K x x(0). (16) In the case x x (0) = δ the claim follows. Now assume that x x (0) < δ. Consider the following optimization problem min x R N x x (0) s.t. H l (x) = 0 g(x) := x x (0) δ 0. (17) It is x x (0) and so x Ḃδ(x (0) ) since M l is a (d l + 1)-dimensional submanifold with x (0) M l. As a result the inequality constraint is not active at x and using (iii) it follows that the vectors H1( x), l..., HN d+l 1 l ( x) are linearly independent. That means that the Mangasarian-Fromowitz Constraint Qualification (e.g., Nocedal and Wright (006)) is fulfilled and since x is a local minimizer for the optimization problem (17), there exists λ R N d+l such that ( x, λ) is a Karush-Kuhn-Tucker point. Therefore, it holds λ N d+l g( x) = 0 and N d+l 0 = ( x x (0) ) + λ i Hi( x) l + λ N d+l ( x x (0) ). (18) i=1 Since g( x) 0 it follows that λ N d+l = 0, and thus, that the vectors ( x x (0) ), H1( x), l..., HN d+l 1 l ( x) are linearly dependent. That is a contradiction, and hence, it holds that x x (0) = δ, and the claim follows. (b): Let ϕ : U V be a local chart for the d-dimensional submanifold M and let U, V R N be open sets with x (0) U, ϕ(x (0) ) = 0 V and ϕ(u M) = V (R d {0}) and let l {1,..., d}. First of all, it will be proven by contradiction that a rank criterion holds for some special x (1),..., x (l) M.

11 Assumption: δ > 0 x (1),..., x (l) Ḃδ(x (0) ) M, x (i) x (0) x (j) x (0) i j and x B δ (x (0) ) with H (x) (x (1) x (0) ) T rank N d + l.. (19) (x (l) x (0) ) T By choosing the sequence (δ n ) n N = ( ) 1 n the assumption leads to the existence of sequences x (1)n) ( ( n N,..., x (l)n) Ḃ 1 (x(0) ) M and (x n ) n N n N n N n B 1 ) with x (n), x (1)n,..., x (l)n x (0) and x (i) x (0) x (j) x (0) i j. n Since the rank in (19) has the upper bound N d + l, it is H (x n ) (x (1)n x (0) ) T rank < N d + l. n N. (0) (x (l)n x (0) ) T Equation (0) is used in the following to get a contradiction. In fact, a sequence of zeros with non-zero limit will be constructed. According to standard analysis (e.g. Königsberger (1997)) there exists an embedding γ : V R N with an open set V R d and γ(v ) = M U. W.l.o.g. assume 0 V, γ(0) = x (0) and B 1 (x (0) ) U. Therefore, it holds: (i) γ : V M U is bijective, (ii) γ is continuously differentiable, (iii) γ (0) R N d has maximal rank and T x (0)M = γ (0)R d, (iv) γ 1 : M U V exists and is continuous. Define t (i)n := γ 1 (x (i)n ), i {1,..., l}, n N. By γ 1 (x (0) ) = 0 and (i) it follows that t (i)n 0. Simple calculations show that the following sequence is bounded (more concrete, its Euclidean norm is bounded by l) t (1)n t (1)n. t (l)n t (l)n n N 11. (1) Therefore one can apply the theorem of Bolzano-Weierstrass to obtain a convergent subsequence. In abuse of notation the same index n is used for that subsequence, since the entire sequence is not needed ( any longer. Subdividing that t subsequence in the l d-dimensional parts (i)n t (i)n leads to l convergent )n N sequences. It is t(i)n t t (i)n = 1 n N, hence, it follows (i)n t (i)n t (i) 0. Due to (iii) it follows 0 b (i) := γ (0)t (i) T x (0)M ()

12 1 and due to (ii) it is x (i)n x (0) t (i)n Using x (i) x (0) x (j) x (0) By 0 = lim n = γ(t(i)n ) γ(0) t (i)n b (i). (3) i j, it follows x (i)n x (0), x(j)n x (0) t (i)n t (j)n = b (i), b (j) i j. (4) T x (0)M = {v R N H (x (0) ) v = 0} (5) and () it follows that b (i) H l (x (0) ) i {1,..., l}, l {1,..., N d}. Using that and (4) it becomes apparent that the matrix H (x (0) ) b (1)T. R(N d+l) N (6) b (l)t has full rank. Let det be a composition of a function which projects the matrix in (6) to a regular (N d + l) (N d + l) submatrix and the determinant. It holds that det is a continuous function since it is a composition of continuous functions. Therefore, it is H (x (0) ) H (x n ) 0 det b (1)T. b (k)t (3) = lim n det (x (1)n x (0) ) T t (1)n. (x (k)n x (0) ) T t (k)n (0) = lim 0 = 0, (7) n which is a contradiction. Thus, the initial assumption is false. Hence, the following statement holds: l {1,..., d} δ > 0 such that x (1),..., x (l) Ḃ δ (x (0) ) M, x (i) x (0) x (j) x (0) i j and x B δ (x (0) ), and hence H (x) (x (1) x (0) ) T rank = N d + l. (8). (x (l) x (0) ) T Since there are just finitely many integers l {1,..., d} to consider, it follows the existence of such a radius δ > 0 for all l {1,..., d}. Using such a radius δ with B δ (x (0) ) U mathematical induction will be used to show the existence of points x (1),..., x (l) M with x (i) x (0) = δ i {1,..., l} and x (i) x (0) x (j) x (0) i, j {1,..., l}, i j for all l d. Basis l = 1:

13 Define M 1 := R N and use some orthogonal vectors v (1),..., v (N) R N to define a chart ϕ 1 : R N R N, (v (1) ) T (x x (0) ) ϕ 1 (x) :=. (9) (v (N) ) T (x x (0) ) for the N-dimensional submanifold M 1. Further define the d-dimensional submanifold M 1 := M M 1 = M = (H 1 ) 1 (0) with H 1 = H. Therefore (a) (ii) and (i) are fulfilled. According to the rank condition (8) it follows the linear independence of (x (1) x (0) ), H 1 1 (x (1) ),..., H 1 N d(x (1) ) (30) for all x (1) M 1 Ḃδ(x (0) ) as desired in (a) (iii). As a result (a) yields a point x (1) M 1 with x (1) x (0) = δ. Inductive step: Let points x (1),..., x (l) M exist with x (i) x (0) = δ i {1,..., l} and x (i) x (0) x (j) x (0) i, j {1,..., l}, i j for l < d. It has to be shown that this is also true for l + 1 d. Let H l+1 : R N R l with ( x (1) x (0) ) T (x x (0) ) H l+1 (x) :=.. (31) ( x (l) x (0) ) T (x x (0) ) According to standard linear algebra there exist orthogonal vectors v (1),..., v (N l) R N with ( x (i) x (0) ) v (j) for all i {0,..., l}, j {1,..., N l}. Hence, it follows that ϕ l+1 : R N R N, (v (1) ) T (x x (0) ). ϕ l+1 (v (N l) ) T (x x (0) ) (x) := ( x (1) x (0) ) T (x x (0) (3) ). ( x (l) x (0) ) T (x x (0) ) is a chart for the (N l)-dimensional submanifold M l+1 := ( H l+1 ) 1 (0). Further, let H l+1 : B δ (x (0) ) R N d+l with ( ) H(x) H l+1 (x) := H l+1 (33) (x) and define M l+1 := (H l+1 ) 1 (0) = H 1 (0) ( H l+1 ) 1 (0) = M M l+1. According to (8) it is rank ( (H l+1 ) (x) ) = N d + l for all x B δ (x (0) ) and 13

14 14 with x (0) M l+1 it follows with the regular value theorem that M l+1 is a (d l)-dimensional submanifold. Therefore (a) (i) and (ii) are fulfilled. Because of Hl+1 (x (l+1) ) = 0 x (l+1) M l+1 it follows x (l+1) x 0 x (i) x 0 i {1,..., l}. Again, using the rank condition (8) yields the linear independence of H 1 (x (l+1) ),..., H N d (x (l+1) ), ( x (1) x 0 ),..., ( x (l) x 0 ), (x (l+1) x (0) ) = H l+1 1 (x (l+1) ),..., H l+1 N d+l (x(l+1) ), (x (l+1) x (0) ) (34) for all x (l+1) M l+1 Ḃδ(x (0) ) as desired in (a) (iii). As a result (a) yields a point x (l+1) M l+1 with x (l+1) x (0) = δ and x (i) x (0) x (j) x (0) {1,..., l + 1}, i j. That completes the mathematical induction. Finally, define A i := x (i) x (0) i 1,..., d, then with i, j A T A = ( A T ) d i A j = diag(δ,..., δ) (35) i,j=1 it follows κ (A) = λmax(a T A) λ min(a T A) = δ δ = 1, where λ max(a T A) is the maximal and λ min (A T A) the minimal singular value of A T A. Thus, A T x (0)M(1, δ), and the claim follows. The following result shows that for every c 1 there exists a δ > 0 such that the image of every matrix A T x (0)M(c, δ) approximates the tangent space within a pre-defined tolerance ɛ R +. Theorem. Let N, d N, d N and M R N be a d-dimensional submanifold with tangent space T x (0)M in x (0) M. Then it holds: (a) δ > 0, c 1 it is T x (0)M(c, δ). (b) ɛ > 0, c 1 there exists a δ > 0 such that A T x (0)M(c, δ) it holds: B R N d with rank(b) = d, B(R d ) = T x (0)M and At Bt ɛ Bt t R d (36) Proof. Ad (a): Let δ > 0 and c 1. Due to Lemma 1 there exists δ > 0 such that δ 1 R with 0 < δ 1 < δ there exists a matrix A = (A 1,..., A d ) T x (0)M(1, δ ) with A i = δ 1 i = 1,..., d. Hence, it is A T x (0)M(1, δ 1 ), and for δ 1 < δ it follows with Remark 1 (b): T x (0)M(1, δ 1 ) T x (0)M(c, δ). Ad (b): According e.g. to Königsberger (1997) there exists an embedding γ : V R N, U R N, V R d with γ(v ) = M U and x (0) M U. W.l.o.g. assume 0 V and γ(0) = x (0). Therefore, it holds: (i) γ (0) R N d has maximal rank, (ii) γ is continuously differentiable, (iii) γ 1 : M U V exists and is continuous.

15 15 Given a matrix C R N d with rank(c) = d, it follows that also the pseudo inverse C + R d N has maxmial rank. Further, since C + C = I R d d, it holds: t = C + Ct C + Ct, t R d. (37) Because of (ii) for each a > 0 there exists a constant δ > 0, such that t R d B δ(0) it holds γ(t) γ(0) γ (0)t a t. (38) Due to (i) it holds γ (0) + > 0 and one can define a := ɛ (c + 1) d γ (0) +. (39) By (iii) and γ 1 (x (0) ) = 0 it follows that for every δ > 0 there exists a δ > 0 such that B δ (x (0) ) U and for all x M B δ (x (0) ) it holds γ 1 (x) < δ. Defining t := γ 1 (x) it follows x x (0) γ (0)t = γ(t) γ(0) γ (0)t (38) (37) ɛ (c + 1) d γ (0)t. ɛ (c + 1) d γ (0) + t (40) Let A := (A 1,..., A d ) T x (0)M(c, δ) and define x (i) := A i + x (0) i = 1,..., d, then it is x (i) M B δ (x (0) ). Further, define t (i) := γ 1 (x (i) ) and B := (B 1,..., B d ) = (γ (0)t (1),..., γ (0)t (d) ) R N d, then it follows by (40) A i B i ɛ (c + 1) d B i, i = 1,..., d. (41) W.l.o.g. assume that ɛ < 1. Using the Frobenius norm. F : R N d R, d (C 1,..., C d ) F = i=1 C i, and the inequalities C C F d C (e.g., Golub and Loan (1996)) it follows A B A B F = d A i B i = i=1 ɛ (c + 1) d B F (41) ɛ (c + 1) B ɛ (c + 1) d B i d i=1 ɛ (c + 1) ( A B + A ) ɛ<1 A B + ɛ A (c + 1). (4)

16 16 By this it follows that A B ɛ A c κ (A) c ɛ A A + A = ɛ A + (43) which leads to A + B A < 1. Using the perturbation lemma (e.g., Golub and Loan (1996)) it follows and rank(b) = rank(a + (B A)) = d (44) B + = (A + (B A)) + By (44) it follows B(R d ) = T x (0)M. Further, it holds A + 1 A + B A. (45) (43) A B ɛ A + A B ( 1 A + A B = ɛ A + ɛ<1 ɛ A + ɛ A B ) (45) ɛ B +. (46) For all t R d with t = 1 it follows (A B)t max (A B) t = A B t =1 (46) ɛ B + = ɛ t (37) B + ɛ Bt. (47) Finally, since (A B) is a linear function, it holds t R d At Bt ɛ Bt. (48) That is, roughly speaking, for every c > 1 there exists a δ > 0 such that the relative deviation v ṽ v of every vector v T x (0)M \ {0} to a vector ṽ A(R d ) is less than a given tolerance ɛ > 0. Here, A T x (0)M(c, δ) can be chosen arbitrarily. However, in general, no δ > 0 can be determined such that this property holds for all c > 1 which will be demonstrated in the following example: In each neighborhood around x 0 M there exist further points x (1), x () M such that A 1 := x (1) x (0) and A := x () x (0) are linearly dependent, however, in the vector space A(R d ) there exist always vectors which are perpendicular to vectors in T x (0)M. Example 1. Consider the -dimensional manifold M := {x R 3 x 1 + x + x 3 = 0, x i < 1, i = 1, } (49)

17 17 with the embedding ϕ : U R 3, ϕ(t) := (t 1, t, t 1 t ) T, (50) ), where U := {t R t i < 1}. It is ϕ(u) = M. Let B := ϕ (0) = ( then it is T x (0)M = B(R ). For 1 > δ > 0 and t (0) := (0, 0) T, t (1) := ( δ, 0)T, t () := ( δ 4, 0)T let x (0) := ϕ(t (0) ) = (0, 0, 0) T, x (1) := ϕ(t 1 ) = ( δ, 0, δ 4 )T, and x () := ϕ(t () ) = ( δ δ 4, 0, 16 )T. Then, it is x (0), x (1), x () M B δ (x 0 ), and the vectors A 1 := x (1) x (0) = ( δ, 0, δ 4 A := x () x (0) δ4 δ = (, 0, 16 ) T ) T (51) are linearly independent for every value of δ 0. However, the subspace spanned by A := (A 1, A ) contains vectors that are orthogonal to T x (0)M such as v = (0, 0, 1) T (compare to Figure ). It holds A + = (A + 1, A+, A+ 3 ), with A+ 1 = ( δ, 8 δ )T, A + = (0, 0)T, A + 3 = ( 8 δ, 16 δ ) T. Thus, for the condition number of A it holds κ (A) = A A + Ae A + e 3 = δ + 4 δ + 1 (5) for δ 0 and thus there does not exist any constant c 1 with κ (A) c required in Theorem. δ > 0 as Corollary 1. For all ɛ > 0, δ > 0, c 1 there exists δ > 0 such that A T x (0)M(c, δ) it holds v T x (0)M B δ(0) ṽ A(R d ) : ṽ v ɛ. (53) Proof. Let v T x (0)M B δ(0). By Theorem there exists a matrix B R N d such that v T x (0)M B δ(0) t R d with v = Bt. Let ṽ := At, it follows ṽ v = At Bt ɛ δ Bt = ɛ δ v ɛ. (54) The above result gives a hint of how to approximate the tangent space: Given x (0) M, one can compute d further solutions x (i) M, i = 1,..., d, in the vicinity of x (0). If for the matrix A of secants it holds rank(a) = d (55) κ (A) c (56) A i + x (0) M B δ (x (0) ), (57)

18 18 Fig.. The vectors v i, i = 1, (which are multiples of the secants A i for sake of a better visualization) span a subspace that contains vectors that are orthogonal to the (exact) tangent space T x (0)M. then it is A T x (0)M(c, δ). If further c and δ are small enough, then one can expect due to Corollary 1 that rge(a) serves as a good approximation of T x (0)M. However, so far nothing is gained from the practical point of view since it is still unclear how to choose the neighboring solutions x (i), and for a given set of solutions the conditions (55) to (57) have to be checked. In the following, a result is stated that is the basis for the successive approximation of the tangent space that is proposed in the next section. As an additional bonus, the verification of the conditions (55) to (57) will get superfluous. Theorem 3. Let c 1, then it holds: (a) There exists δ > 0 such that δ l, δ u R [ ] with 0 < δ u < δ and δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u there exist vectors x (i) M B δ (x (0) ) i = 1,..., d such that (a1) x (i) x (0) [δ l, δ u ] i = 1,..., d (a) 1 x (i) x (0) (x (j) x (0) ) ] [δ u c δ l δ u (1+c )(d 1), δ l + c δ l δ u (1+c )(d 1) i j. [ ] (b) δ, δ l, δ u R with 0 < δ u < δ and δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u and x (i) M B δ (x (0) ) such that (a1) and (a) are satisfied, it holds A := (x (1) x (0),..., x (d) x (0) ) T x (0)M(c, δ). (58) Proof. Ad (a): Due to Lemma 1 there exists δ > 0, such that δ R with 0 < δ < δ there exists a matrix A = (A 1,..., A d ) T x (0)M(1, δ) with A i = δ,

19 19 i = 1,..., d, and A i A j i j. Let δ l, δ u R with 0 < δ u < δ and δ l 0 < δ l δ u, since 0 < (1+c )(d 1)+ (1+c )(d 1)+c [ (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u ], then it holds (1+c )(d 1)+ (1+c )(d 1)+ = 1. It follows that [δ l, δ u ]. Let x (i) := A i + x (0), then it is x (i) M B δ (x (0) ) i = 1,..., d. Choosing δ δl = +δ u δl it follows that δ l = +δ l δ δu +δ u = δ u, and it is x (i) x (0) = A i = δ [δ l, δ u ] i = 1,..., d, (59) i.e., condition (a1) holds for the chosen solutions x (i). Further, it holds for all i j Furthermore, it is 1 x (i) x (0) (x (j) x (0) ) = 1 A i A j (A i A j) = = δ. 1 ( A i + A j ) (60) (1 + c )(d 1) + (1 + c )(d 1) + c δ u δ l (61) (1 + c )(d 1)δ u + δ u (1 + c )(d 1)δ l + c δ l (6) δ u + δ u (1 + c )(d 1) δ l + c δ l (1 + c )(d 1) (63) δ u δ l + c δ l δ u (1 + c )(d 1). (64) It follows that δ l + δ u δ l + c δ l δ u (1 + c )(d 1) (65) and δu c δl δ u (1 + c )(d 1) δ l + δ u. (66) Combining the above results it follows i j 1 x (i) x (0) (x (j) x (0) ) = δ = δ l + δ u [ δu c δ l δ u (1 + c )(d 1), δ l + c δ l δ u (1 + c )(d 1) ].

20 0 Hence, condition (a) holds for the chosen x (i), and the claim follows. [ ] Ad (b): Let δ, δ l, δ u R with 0 < δ u < δ and δ l (1+c )(d 1)+ (1+c )(d 1)+c δ u, δ u and let x (i) M B δ (x (0) ) i = 1,..., d. Let A i := x (i) x (0) and A := (A 1,..., A d ) R N d, then A T A = ( A i, A j ) d i,j=1 R d d is symmetric and has d real eigenvalues λ i, i = 1,..., d. Let K i := z R z A i, A i d j=1,j i A i, A j, (67) then it follows by the Theorem of Gerschgorin (e.g., Atkinson (1989)) that every eigenvalue λ i of A T A is contained in d i=1 K i. By condition (a1) it holds A i, A i = A i [δl, δ u], i = 1,..., d, and by condition (a)it is 1 A i A j [δ u c δ l δ u (1+c )(d 1), δ l + Putting this together it follows i j c δ l δ u (1+c )(d 1) ] i j. A i, A j = 1 Ai + A j A i A j = { 1 1 (a1),(a) = ( ) Ai + A j A i A j, if Ai + A j A i A j ( ) Ai A j A i A j, if Ai + A j < A i A j δu δl + ( δ u c δ l δ u (1 + c )(d 1). ) c δ l δ u (1+c )(d 1), if A i + A j A i A j c δ l δ u (1+c )(d 1) δ l, if A i + A j < A i A j For all z K i with z A i it follows z A i = z A i, A i z A i + d j=1,j i = δ u + c δ l δ u (1 + c ). d j=1,j i A i, A j A i, A j δu c δl + (d 1) δ u (1 + c )(d 1)

21 1 Furthermore, for all z K i with z A i it is d A i z = z A i, A i A i, A j d z A i j=1,j i j=1,j i A i, A j δl c δl (d 1) δ u (1 + c )(d 1) Hence, it is K i = δ l c δ l δ u (1 + c ). [ ] δl c δ l δ u (1+c ), δ u + c δ l δ u (1+c ) i = 1,..., d and it follows K := d K i i=1 [ δl c δl δ u (1 + c ), δ u + c δl ] δ u (1 + c. (68) ) The following consideration shows that all eigenvalues λ 1... λ d of A T A are strictly positive. It is Hence, λ1 λ d λ i inf z K z δ l c δ l δ u (1 + c ) > δ l c δ l + δ l (1 + c ) = δ l (c + 1) (1 + c ) δ l = 0. is defined and it holds c δ l δ u (1+c ) λ 1 δ u + λ d δl c δl δ u (1+c ) = (1 + c )δ u + c δ l δ u (1 + c )δ l c δ l + δ u = c (δ u + δ l ) δ l + δ u = c. Since A T A has only strictly positive eigenvalues it is rank(a T A) = d and it follows that also rank(a) = d. Let σ 1... σ d be the singular values of A, then it holds σ i = λ i, and it follows κ (A) = σ 1 λ1 = c. σ d λ d Hence, it is A T x (0)M(c, δ), and the proof is complete.
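Before turning to the algorithms, the requirements just established for a secant matrix can also be checked numerically. The following sketch verifies, for a base point x0 and neighboring solutions x1, ..., xd on M, that the secants have full rank, that the 2-norm condition number stays below c, and that the secant lengths lie in [δ_l, δ_u] (the reading of condition (a1) used here); it is an illustration under these assumptions, not code from the paper.

```python
import numpy as np

def check_secant_matrix(x0, neighbors, c, delta_l, delta_u):
    """Check the secant matrix A = (x1-x0, ..., xd-x0) against Definition 4 / condition (a1).

    x0        : (N,) base point on the manifold
    neighbors : list of d points x_i on the manifold near x0
    c         : bound on the condition number kappa_2(A)
    delta_l, delta_u : admissible range for the secant lengths
    """
    A = np.column_stack([np.asarray(x) - np.asarray(x0) for x in neighbors])
    lengths = np.linalg.norm(A, axis=0)
    kappa = np.linalg.cond(A, 2)                      # 2-norm condition number
    return {
        "full_rank": bool(np.linalg.matrix_rank(A) == A.shape[1]),
        "condition_ok": bool(kappa <= c),
        "lengths_ok": bool(np.all((lengths >= delta_l) & (lengths <= delta_u))),
        "kappa": float(kappa),
    }
```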

4 The Algorithms

Here, three different strategies for the successive approximation of the tangent space are presented that are based on the considerations made in the previous section. Given an initial solution z^(0) = (x^(0), α^(0)) ∈ M ⊂ R^N, N = n + k, where M = F^{-1}(0) and F is the map defined in (11), all methods aim to find suitable neighboring solutions z^(i) ∈ M in the vicinity of z^(0). While the first two approaches work directly in the complete (x, α)-space, the third approach splits the x- and the α-space for the successive approximation.

The first method considered here (see Algorithm 1) is straightforward: successively, d neighboring solutions z^(i) ∈ M ∩ B_δ(z^(0)) are computed starting from randomly chosen starting points in the vicinity of z^(0). If the resulting matrix A satisfies (a1) and (a2) from Theorem 3, then rge(A) can be viewed as a suitable approximation of T_{x^(0)}M.

Algorithm 1 (Randomly chosen solutions z^(i))
(S1) Choose δ > 0, set i := 1.
(S2) Choose z̃^(i) ∈ B_δ(z^(0)) ⊂ R^N uniformly at random.
(S3) Solve min ‖F(·)‖² starting with z̃^(i).
(S4) If no acceptable solution has been found in (S3), go to (S2), else proceed with the solution z^(i) with ‖F(z^(i))‖ ≈ 0.
(S5) Set A_i := z^(i) − z^(0) and i := i + 1.
(S6) If i < d + 1 go to (S2), else STOP.

Here, a solution z* is defined to be acceptable if the value ‖F(z*)‖ is below a given (low) threshold. In practice, it has been observed that Algorithm 1 already yields sufficient approximations of the tangent spaces even though it does not check the conditions (a1) and (a2) from Theorem 3 (refer to the numerical results presented in the next section). However, as Example 1 shows, an acceptable approximation cannot be expected in general, even for arbitrarily small values of δ. To prevent such cases, the next algorithm is constructed. For this, the following penalty functions will be needed (a code sketch of both follows below):

h_0(δ_l, δ_u, z^(0), z^(i)) :=
  (‖z^(i) − z^(0)‖ − δ_l)²,  if ‖z^(i) − z^(0)‖ < δ_l,
  (‖z^(i) − z^(0)‖ − δ_u)²,  if ‖z^(i) − z^(0)‖ > δ_u,
  0,  else,   (69)

h_1(d, c, δ_l, δ_u, z^(i), z^(j)) :=
  (½‖z^(i) − z^(j)‖² − (δ_u² − c²δ_lδ_u/((1+c²)(d−1))))²,  if ½‖z^(i) − z^(j)‖² < δ_u² − c²δ_lδ_u/((1+c²)(d−1)),
  (½‖z^(i) − z^(j)‖² − (δ_l² + c²δ_lδ_u/((1+c²)(d−1))))²,  if ½‖z^(i) − z^(j)‖² > δ_l² + c²δ_lδ_u/((1+c²)(d−1)),
  0,  else.   (70)
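The following is a minimal sketch of the two penalty terms; the interval bounds in h1 follow the reading of condition (a2) and of (70) used above and should be checked against Theorem 3 before any reuse.

```python
import numpy as np

def h0(delta_l, delta_u, z0, zi):
    """Quadratic penalty for violating (a1): ||z_i - z_0|| must lie in [delta_l, delta_u]."""
    r = np.linalg.norm(np.asarray(zi) - np.asarray(z0))
    if r < delta_l:
        return (r - delta_l) ** 2
    if r > delta_u:
        return (r - delta_u) ** 2
    return 0.0

def h1(d, c, delta_l, delta_u, zi, zj):
    """Quadratic penalty for violating (a2) on the pair (z_i, z_j), as reconstructed in (70)."""
    x = c**2 * delta_l * delta_u / ((1 + c**2) * (d - 1))  # width term of the admissible interval
    lo, hi = delta_u**2 - x, delta_l**2 + x
    s = 0.5 * np.linalg.norm(np.asarray(zi) - np.asarray(zj)) ** 2
    if s < lo:
        return (s - lo) ** 2
    if s > hi:
        return (s - hi) ** 2
    return 0.0
```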

By construction of h_0 and h_1 it holds:

h_0(δ_l, δ_u, z^(0), z^(i)) = 0 ⟺ z^(i) − z^(0) satisfies (a1) from Theorem 3.
h_1(d, c, δ_l, δ_u, z^(i), z^(j)) = 0 ∀ j < i ⟺ ½‖z^(i) − z^(j)‖² satisfies (a2) from Theorem 3.

Algorithm 2 is based on the result in the previous section since it aims to find a distribution of the solutions z^(i) as discussed in Theorem 3.

Algorithm 2 (Distribution of solutions z^(i) via penalization)
(S1) Choose δ_u > 0, c ≥ 1, δ_l ∈ [√(((1+c²)(d−1)+2)/((1+c²)(d−1)+2c²)) δ_u, δ_u] and set i := 1.
(S2) Choose z̃^(i) ∈ B_{δ_u}(z^(0)) ⊂ R^N uniformly at random.
(S3) Solve min ‖F(·)‖² + w_0 h_0(δ_l, δ_u, z^(0), ·) + w_1 Σ_{j<i} h_1(d, c, δ_l, δ_u, ·, z^(j)) with weights w_0, w_1 ∈ R_+ starting with z̃^(i).
(S4) If no acceptable solution has been found in (S3), go to (S2), else proceed with the solution z^(i) with ‖F(z^(i))‖² + w_0 h_0(δ_l, δ_u, z^(0), z^(i)) + w_1 Σ_{j<i} h_1(d, c, δ_l, δ_u, z^(i), z^(j)) ≈ 0.
(S5) Set A_i := z^(i) − z^(0) and i := i + 1.
(S6) If i < d + 1 go to (S2), else STOP.

Crucial is, of course, the computation of the minimizers of g_MOP(z) = ‖F(z)‖². Note that F already contains the gradient information of each objective. Hence, if the gradient ∇g_MOP(z̃) is evaluated directly, the Hessians ∇²f_i of each objective at x̃ have to be computed. To prevent this, it is suggested here to approximate ∇g_MOP(z̃) by finite differences or (which is more accurate) to compute it using automatic differentiation (Griewank 2000). In that case, and assuming that n ≫ k, the cost to obtain ∇g_MOP(z̃) scales basically linearly with n in terms of memory (if not the entire matrix is stored at once but evaluated one by one) and quadratically in terms of flops. If a derivative free solver is used, the number of flops grows only linearly with n. These costs hold ideally also for the entire approximation of the tangent space.

The above methods can in principle be applied to any given problem of the form (1). Based on numerical experiments on high dimensional MOPs, the authors of this work have, however, observed a different sensitivity in x- and α-space of the map F, leading to the problem of finding a proper value of δ for the neighborhood search. As a possible remedy, it is suggested here to split the two spaces as follows: instead of choosing a neighboring solution z̃^(i) = (x̃^(i), α̃^(i)) ∈ B_δ(z^(0)) which is corrected to the solution set (steps (S2) and (S3) of Algorithm 1), one computes neighboring solutions by varying the weight vector α^(i) once in the beginning and tackling F_{α^(i)}(x) := F(x, α^(i)) for fixed α^(i). Algorithm 3 details this procedure.

Algorithm 3 (Distribution of solutions z^(i) via variation of α)

(S1) Choose δ_α, δ_z > 0, set i := 1.
(S2) Choose α^(i) ∈ B_{δ_α}(α^(0)) ⊂ R^k with ‖α^(i)‖₁ = 1 and α^(i) ≥ 0 at random.
(S3) Solve min ‖F_{α^(i)}(x)‖² starting with x^(0).
(S4) If (x^(i), α^(i)) ∉ B_{δ_z}((x^(0), α^(0))), set δ_α := δ_α/2 and go to (S2).
(S5) If no acceptable solution could be computed, go to (S2), else set x^(i) as the obtained solution.
(S6) Set A_i := (x^(i), α^(i)) − (x^(0), α^(0)) and i := i + 1.
(S7) If i < d + 1 go to (S2), else STOP.

5 Results

In this section, first the mechanism of Algorithm 2 to select new solutions z^(i) is demonstrated. Further on, the performances of the resulting PC methods are tested and compared against the classical implementation.

5.1 Revisit of Example 1

In Example 1, two points x^(1) and x^(2) have been chosen such that the space spanned by x^(1) − x^(0) and x^(2) − x^(0) is orthogonal to the tangent space T_{x^(0)}M, and one could find such points in every ball B_δ(x^(0)). The example demonstrated what can go wrong if one does not take care of the condition number constraint in Theorem 2. The following discussion shows that for a given point x^(1), Algorithm 2 computes another point x^(2) which, in case δ is small enough, prevents the spanned space from being orthogonal to T_{x^(0)}M and which even serves as an approximation of T_{x^(0)}M.

Consider the 2-dimensional manifold from Example 1,

M := {x ∈ R³ : x₁² + x₂² − x₃ = 0, |x_i| < 1, i = 1, 2},  (71)

and the vector given therein,

A₁ := x^(1) − x^(0) = (δ/2, 0, δ²/4)^T.  (72)

For δ = 3/2 it holds

A₁ = (3/4, 0, 9/16)^T.  (73)

Setting δ_u := 1, c := √(23/3), and d = 2, it follows that √(((1+c²)(d−1)+2)/((1+c²)(d−1)+2c²)) = 2/3. Choosing δ_l := ((2/3)δ_u + δ_u)/2 = 5/6 leads to δ_l ∈ [(2/3)δ_u, δ_u] as required in Algorithm 2, and also A₁ satisfies ‖A₁‖ ∈ [δ_l, δ_u] = [5/6, 1]. Similar to Example 1, one can choose another vector A₂ := x^(2) − x^(0) with x^(2) ∈ M, but in contrast to Example 1

A₂ is chosen here such that it satisfies conditions (a1) and (a2), as Algorithm 2 does. Defining the vector

A₂ := (0, 3/4, 9/16)^T  (74)

leads to ‖A₂‖ = 15/16 ∈ [5/6, 1] and ½‖A₁ − A₂‖² = 9/16 ∈ [δ_u² − c²δ_lδ_u/((1+c²)(d−1)), δ_l² + c²δ_lδ_u/((1+c²)(d−1))]. Thus, A₁ and A₂ satisfy (a1) and (a2). Figure 3 illustrates the areas of points which satisfy (a1) and (a2) and the approximated tangent space.

Choosing a smaller value for δ leads to a better approximation of T_{x^(0)}M, e.g. for δ = 3/4 it holds

A₃ = (3/8, 0, 9/64)^T.  (75)

Setting δ_u := 9/20, c := √(23/3), d = 2 and choosing δ_l in the same way as above leads to δ_l = 3/8 ∈ [(2/3)δ_u, δ_u]. In addition, A₃ satisfies ‖A₃‖ ∈ [δ_l, δ_u] = [3/8, 9/20]. Defining the vector

A₄ := (0, 3/8, 9/64)^T  (76)

leads to ‖A₄‖ ∈ [3/8, 9/20] and ½‖A₃ − A₄‖² = 9/64 ∈ [δ_u² − c²δ_lδ_u/((1+c²)(d−1)), δ_l² + c²δ_lδ_u/((1+c²)(d−1))]. Thus, A₃ and A₄ satisfy (a1) and (a2). Figure 4 illustrates the areas of points which satisfy (a1) and (a2) and the approximated tangent space spanned by A₃ and A₄.

5.2 Testing the PC Methods

Now the performances of the different PC methods when approximating the tangent space successively are tested and compared. As base algorithm it was chosen to use the recovering technique presented in Dellnitz et al. (2005) and Schütze et al. (2005). This method uses boxes as a tool to maintain a spread of the solutions: the domain R is partitioned by a set of boxes, every solution z of F is associated with the box which contains z, and only one solution is associated with each box. The idea of the recovering algorithm is to detect, from a given box which contains a solution of F, neighboring boxes which contain further solutions of F, and so on. By this, the solution set is represented by a box collection C. Ideally, i.e., for a perfect outcome set, the associated box collection C covers the entire Pareto set tightly. In the following, it will be distinguished between the classical recover algorithm R_QR as described in Schütze et al. (2005), which uses a QR-decomposition of F' to obtain the tangent space, and the modifications R_Alg.1, R_Alg.2, and R_Alg.3, which are obtained via a successive approximation of the tangent space via Algorithms 1 to 3, respectively.

Fig. 3. The vectors v_i, i = 1, 2 (which are multiples of the secants A_i for the sake of a better visualization) span a subspace that approximates the (exact) tangent space T_{x^(0)}M. The horizontal area marks the points which satisfy (a1), the vertical area marks the points which satisfy (a2), and their intersection marks the points which satisfy both.

To compare the performance of the three different PC algorithms, the following scalable MOP taken from Schütze et al. (2005) is used:

f_1, f_2, f_3 : R^n → R,
f_i(x) = Σ_{j=1, j≠i}^n (x_j − a^i_j)² + (x_i − a^i_i)⁴,  (77)

where

a¹ = (1, 1, 1, 1, ...) ∈ R^n,
a² = (−1, −1, −1, −1, ...) ∈ R^n,
a³ = (1, −1, 1, −1, ...) ∈ R^n.

For the application of the recovering techniques the domain R = [−1.5, 1.5]^n has been chosen. Table 1 shows a comparison of the algorithms R_QR, R_Alg.1, and R_Alg.2 for different values of n on the benchmark problem. Here, all procedures have been started with one single solution z = (a¹, α¹), where α¹ = (1, 0, 0)^T, i.e., with the solution of the first objective f_1. For all required scalar optimization problems in both the predictor and the corrector step, the derivative free Quasi-Newton method e04jyf of the NAG Fortran package has been used. For all cases in Table 1, it holds for two box collections C_1 and C_2 with |C_1| > |C_2| that the collection C_1 is indeed a superset of C_2.
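For reference, the following is a small sketch of the benchmark objectives (77) as read here, vectorized with NumPy; it is not taken from the authors' implementation.

```python
import numpy as np

def make_benchmark(n):
    """Objectives f_1, f_2, f_3 of MOP (77) for parameter dimension n."""
    a = [np.ones(n),                                  # a^1 = (1, 1, 1, ...)
         -np.ones(n),                                 # a^2 = (-1, -1, -1, ...)
         np.array([(-1.0) ** j for j in range(n)])]   # a^3 = (1, -1, 1, -1, ...)

    def f(i, x):
        x = np.asarray(x, dtype=float)
        diff = x - a[i]
        # sum of squares over j != i plus the fourth power in component i
        return float(np.sum(diff ** 2) - diff[i] ** 2 + diff[i] ** 4)

    return [lambda x, i=i: f(i, x) for i in range(3)]

# usage: f1, f2, f3 = make_benchmark(1000); f1(np.zeros(1000))
```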

Fig. 4. The vectors v_i, i = 3, 4 (which are multiples of the secants A_i for the sake of a better visualization) span a subspace that approximates the (exact) tangent space T_{x^(0)}M. The horizontal area marks the points which satisfy (a1), the vertical area marks the points which satisfy (a2), and their intersection marks the points which satisfy both.

(Though the differences do not play a major role in this case, since additional boxes are mostly neighboring solutions, and no significant difference could be observed when considering the Pareto fronts; in all cases, the box collections are near to a perfect covering of the Pareto set.) Though the approximation qualities are basically equal, this does not hold for the computational times. In all cases, R_Alg.1 is the fastest method, and for n = 1,000, R_QR is the slowest method (which holds as well for all larger values of n where R_QR is applicable). Among the two novel methods R_Alg.1 and R_Alg.2, it can be observed, as anticipated, that R_Alg.1 is a bit faster while R_Alg.2 tends to find more solutions, which is probably due to the better approximation of the tangent space. Figure 5 shows the result obtained by R_Alg.2 for n = 1,000.

In order to treat parameter dimensions n > 1,000, it was observed that the best strategy is to approximate the tangent space via a split of x- and α-space as done in Algorithm 3. Further, it is advantageous to use a solver that uses gradient information such as the limited memory BFGS method (Shanno 1978). Table 2 lists the number of function and derivative calls as well as the CPU time of R_Alg.3 applied on MOP (77) for n = 100 up to n = 100,000, where #F' is the number of derivative calls of F_{α^(i)}(x). Figure 6 shows the result for n = 100,000. None of the other methods could obtain similar results.

Finally, an attempt has been made to estimate the Hausdorff distance between the x-part of the candidate set obtained by the different PC methods (X_QR, X_Alg.1, X_Alg.2, X_Alg.3) and the x-part of the Pareto set P_Q = (X_{P_Q}, α_{P_Q}). To approximate X_{P_Q}, all the algorithms described above have been run using a much smaller box size with different starting values, the resulting

28 8 Table 1. Comparison of the recovering methods R QR, R Alg.1, and R Alg. on MOP (77), where the derivative free routine e04jyf has been used to solve the required scalar optimization problems. Listed are the number of boxes generated by the recovering algorithms, the CPU time (in seconds), the numbers of function calls # F and derivative calls # F. n R QR R Alg.1 R Alg. #boxes CPU # F 895, 59 1, 83, 57 1, 848, 806 # F , 000 #boxes CPU # F, 169, 144, 89, 431 4, 178, 00 # F #boxes CPU # F 5, 765, 70 8, 69, , 70, 643 # F #boxes CPU # F 1, 654, 54 0, 700, 411 5, 561, 88 # F non-dominated solutions have been merged to get a reference set X ref X PQ (this led to an amount of 70,000 non-dominated solutions). Since boxes are used to represent the sets of interest, the metric induced by the -norm has been chosen to calculate the Hausdorff distance (compare to Definition 3). The boxes used to calculate the box collections as shown in Table 1 have a side length of d b = In Table 5. the resulting Hausdorff distances obtained by the different methods for dimensions n = 100 and n = 1, 000 are listed. Since all values are below d b, all approximations can be considered as good enough according to the given precision induced by the boxes. lar approximation results for 6 Conclusions and Future Work In this paper, the numerical treatment of high dimensional multi-objective optimization problems has been addressed by means of continuation methods. The bottleneck for such problems is the approximation of the tangent space. Given a solution, the cost to obtain a further solution is for full Hessians of the objectives O(n ) in terms of memory and O(n 3 ) in terms of flops, where n is the dimension of the parameter space. Alternatively, is was suggested to perform a successive approximation of the tangent space which is based on some analysis

29 Table. Numerical results for R Alg.3 on MOP (77) for n = 1, 000 to n = 100, 000 parameters. n # F # F CPU boxes 100, 947 7, , 893 7, , 7 6, , 000 3, 155 7, , 000, 15 7, , 000 1, 170 6, , 000, 069 7, , 000 1, 3 6, , 000, 385 7, , 000 1, 67 6, Table 3. Hausdorff distance of the results of all PC methods to the estimated Pareto front of MOP (77) for n = 100 and n = 1, 000 parameters. n d H(X ref, X QR) d H(X ref, X Alg.1 ) d H(X ref, X Alg. ) d H(X ref, X Alg.3 ) , on the geometry of the problem as presented in Section 3. The cost of the novel method is O(n) in terms of memory and O(n ) in terms of flops. Finally, the new approach has been tested within a particular predictor corrector method on a benchmark model with up to n = 100, 000 dimensions yielding superior results to the classical implementation. The presented approach is not restricted to the solution of high dimensional multi-objective optimization problems. In fact, any parameter-dependent rootfinding problem of high dimension can be efficiently tackled using the presented continuation method (under the assumptions discussed in Section 1). For the future, it would be interesting to apply the novel method to a real world problem which will also allow for an improved comparison to other methods. The techniques are of importance for various technical and engineering applications. For example, for the numerical solution of optimal control problems for complex systems such as multi-body systems, one is faced with high dimensional multiobjective optimization problems. Using a direct method, the optimal control problem is transformed to a nonlinear optimization problem by discretizing all state and control variables in time. The discrete states and controls defined on a discrete time grid are the optimization variables for the optimization problem (see e.g. Leyendecker et al. (009) or Ober-Blöbaum et al. (010) for singleobjective optimal control problems). To meet accuracy requirements, the time discretization has to be fine enough which leads especially for long time spans as e.g. for the trajectory design for space missions to a high number of optimization

variables. Since the minimization of not only one but rather multiple conflicting objectives is of interest (e.g., minimal fuel consumption and minimal flight time), the high dimensional optimization problem results in a multi-objective problem (Dellnitz et al. 2009). Considering partial differential equation constrained optimization problems (for an overview see Biegler et al. (2003)), a discretization in space and time leads to an even higher number of optimization parameters. Using the presented algorithms, one is able to compute the entire Pareto set for these multi-objective optimal control problems in an efficient way.

Acknowledgments

The last author acknowledges support from CONACyT project no.

References

Allgower, E. L. and Georg, K. (1990), Numerical Continuation Methods, Springer.
Atkinson, K. E. (1989), An Introduction to Numerical Analysis, Springer, New York.
Biegler, L. T., Ghattas, O., Heinkenschloss, M. and van Bloemen Waanders, B. (2003), Large-Scale PDE-Constrained Optimization, Springer Lecture Notes in Computational Science and Engineering.
Coello Coello, C. A., Lamont, G. B. and Van Veldhuizen, D. A. (2007), Evolutionary Algorithms for Solving Multi-Objective Problems, second edn, Springer, New York.
Das, I. and Dennis, J. (1998), Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems, SIAM Journal of Optimization 8.
Deb, K. (2001), Multi-Objective Optimization Using Evolutionary Algorithms, Wiley.
Dellnitz, M., Ober-Blöbaum, S., Post, M., Schütze, O. and Thiere, B. (2009), A multiobjective approach to the design of low thrust space trajectories using optimal control, Celestial Mechanics and Dynamical Astronomy 105(1).
Dellnitz, M., Schütze, O. and Hestermeyer, T. (2005), Covering Pareto sets by multilevel subdivision techniques, Journal of Optimization Theory and Applications 124.
Dellnitz, M., Schütze, O. and Sertl, S. (2002), Finding zeros by multilevel subdivision techniques, IMA Journal of Numerical Analysis 22(2).
Deuflhard, P. (2004), Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive Algorithms, Springer.
Eichfelder, G. (2008), Adaptive Scalarization Methods in Multiobjective Optimization, Springer, Berlin Heidelberg.
Fliege, J. (2004), Gap-free computation of Pareto-points by quadratic scalarizations, Mathematical Methods of Operations Research 59.
Golub, G. H. and Loan, C. F. V. (1996), Matrix Computations, Johns Hopkins University Press.
Griewank, A. (2000), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Frontiers in Appl. Math., SIAM, Philadelphia, PA.
Guddat, J., Vasquez, F. G., Tammer, K. and Wendler, K. (1985), Multiobjective and Stochastic Optimization based on Parametric Optimization, Akademie-Verlag.

31 Harada, K., Sakuma, J., Kobayashi, S. and Ono, I. (007), Uniform sampling of local Pareto-optimal solution curves by pareto path following and its applications in multi-objective GA, in GECCO, pp Hillermeier, C. (001), Nonlinear Multiobjective Optimization - A Generalized Homotopy Approach, Birkhäuser. Jahn, J. (1986), Mathematical Vector Optimization in Partially Ordered Linear Spaces, Verlag Peter Lang GmbH, Frankfurt am Main. Jahn, J. (006), Multiobjective search algorithm with subdivision technique, Computational Optimization and Applications 35(), Karush, W. E. (1939), Minima of functions of several variables with inequalities as side conditions, PhD thesis, University of Chicago. Königsberger, K. (1997), Analysis, Springer. Kuhn, H. and Tucker, A. (1951), Nonlinear programming, in J. Neumann, ed., Proceeding of the nd Berkeley Symposium on Mathematical Statistics and Probability, pp Lara, A., Sanchez, G., Coello, C. A. C. and Schütze, O. (010), HCS: A new local search strategy for memetic multiobjective evolutionary algorithms, IEEE Transactions on Evolutionary Computation 14(1), Leyendecker, S., Ober-Blöbaum, S., Marsden, J. and Ortiz, M. (009), Discrete mechanics and optimal control for constrained systems, Optimal Control Applications & Methods, DOI: /oca.91. Miettinen, K. (1999), Nonlinear Multiobjective Optimization, Kluwer Academic Publishers. Nocedal, J. and Wright, S. (006), Numerical Optimization, Springer Series in Operations Research and Financial Engineering, Springer. Ober-Blöbaum, S., Junge, O. and Marsden, J. (010), Discrete mechanics and optimal control: an analysis, ESAIM: Control, Optimisation and Calculus of Variations. DOI: /cocv/ Pareto, V. (1971), Manual of Political Economy, The MacMillan Press. Rakowska, J., Haftka, R. T. and Watson, L. T. (1991), Tracing the efficient curve for multi-objective control-structure optimization, Computing Systems in Engineering (6), Rheinboldt, W. C. (1986), Numerical Analysis of Parametrized Nonlinear Equations, Wiley. Ringkamp, M. (009), Fortsetzungsalgorithmen für hochdimensionale Mehrzieloptimierungsprobleme, Diploma thesis, University of Paderborn. Schäffler, S., Schultz, R. and Weinzierl, K. (00), A stochastic method for the solution of unconstrained vector optimization problems, Journal of Optimization Theory and Applications 114(1), 09. Schütze, O., Coello, C. A. C., Mostaghim, S. and Talbi, E.-G. (008), Hybridizing evolutionary strategies with continuation methods for solving multi-objective problems, Engineering Optimization 40(5), Schütze, O., Dell Aere, A. and Dellnitz, M. (005), On continuation methods for the numerical treatment of multi-objective optimization problems, in J. Branke, K. Deb, K. Miettinen and R. E. Steuer, eds, Practical Approaches to Multi- Objective Optimization, number in Dagstuhl Seminar Proceedings, Internationales Begegnungs- und Forschungszentrum (IBFI), Schloss Dagstuhl, Germany. < Schütze, O., Mostaghim, S., Dellnitz, M. and Teich, J. (003), Covering Pareto sets by multilevel evolutionary subdivision techniques, in C. M. Fonseca, P. J. Fleming, 31

32 3 E. Zitzler, K. Deb and L. Thiele, eds, Evolutionary Multi-Criterion Optimization, Lecture Notes in Computer Science. Shanno, D. F. (1978), On the convergence of a new conjugate gradient algorithm., SIAM J. Numer. Anal. 15, Steuer, R. E. (1986), Multiple Criteria Optimization: Theory, Computation, and Applications, John Wiley & Sons, Inc.

Fig. 5. Numerical result of R Alg. 2 on MOP (77) for n = 1,000: (a) parameter space, showing the projection of the final box collection onto x_1, x_2, and x_3; (b) image space, showing the obtained Pareto front.
