Non Convex-Concave Saddle Point Optimization


Non Convex-Concave Saddle Point Optimization
Master Thesis
Leonard Adolphs
April 09, 2018
Advisors: Prof. Dr. Thomas Hofmann, M.Sc. Hadi Daneshmand
Department of Computer Science, ETH Zürich


Abstract

This thesis investigates the theoretical properties of commonly used optimization methods on the saddle point problem of the form

\min_{\theta \in \mathbb{R}^n} \max_{\phi \in \mathbb{R}^m} f(\theta, \phi),

where f is neither convex in θ nor concave in ϕ. We show that gradient-based optimization schemes have undesired stable stationary points; hence, even if convergent, they are not guaranteed to yield a solution to the problem. To remedy this issue, we propose a novel optimizer that exploits extreme curvatures to escape from non-optimal stationary points. We theoretically and empirically prove the advantage of extreme curvature exploitation on the saddle point problem. Moreover, we explore the idea of using curvature information even further and investigate the properties of second-order methods in saddle point optimization. In this vein, we theoretically analyze the issues that arise in this context and propose a novel approach that uses second-order information to find a structured saddle point.

Acknowledgements

First and foremost, I would like to express my gratitude to my supervisor Hadi Daneshmand for his advice and assistance. Without his vital support and constant help, this thesis would not have been possible. Furthermore, I would like to thank Prof. Dr. Thomas Hofmann for providing me with the opportunity to do this thesis in his group and his guidance along the way. Also, I would like to thank Dr. Aurelien Lucchi for the interesting research discussions and his valuable comments.

Contents

1 Introduction
    Saddle Point Problem
    Structure of the Thesis and Contributions
2 Examples of Saddle Point Problems
    Convex-Concave Saddle Point Problems
    Generative Adversarial Network
        Framework
    Robust Optimization for Empirical Risk Minimization
3 Preliminaries
    Notation and Definitions
        Stability
        Critical Points
    Assumptions
4 Saddle Point Optimization
    Gradient-Based Optimization
    Linear-transformed Gradient Steps
    Asymptotic Behavior of Gradient Iterations
    Convergence Analysis on a Convex-Concave Objective
        Gradient-Based Optimizer
        Gradient-Based Optimizer with Preconditioning Matrix
        Experiment
5 Stability Analysis
    Local Stability
    Convergence to Non-Stable Stationary Points
    Undesired Stable Points

6 Curvature Exploitation
    Curvature Exploitation for the Saddle Point Problem
    Theoretical Analysis
        Stability
        Guaranteed Improvement
    Efficient Implementation
        Hessian-Vector Product
        Implicit Curvature Exploitation
    Curvature Exploitation for linear-transformed Gradient Steps
    Experiments
        Escaping from Undesired Stationary Points of the Toy-Example
        Generative Adversarial Networks
        Robust Optimization
7 Second-order Optimization
    Introduction
    Dynamics of Newton's Method
    Avoiding Undesired Saddle Points
    Simultaneous vs. full Newton Updates
    Generalized Trust Region Method for the Saddle Point Problem
    Support for Different Loss Functions
    Experiments
        SPNewton vs. Newton and Gradient Descent
        SPNewton vs. Simultaneous SFNewton
        Robust Optimization
    Problems with Second-order Methods and Future Work
Bibliography
Appendix
    Convergence Analysis on a Convex-Concave Objective

Chapter 1 Introduction

1.1 Saddle Point Problem

Throughout the thesis, we consider the problem of finding a structured saddle point on a smooth objective, namely the saddle point problem (SPP). The optimization is done over the function f : R^{n+m} → R, parameterized with θ ∈ R^n, ϕ ∈ R^m, and the goal is to find a pair of arguments such that f is minimized with respect to the former and maximized with respect to the latter, i.e.,

\min_{\theta} \max_{\phi} f(\theta, \phi).    (1.1)

Here, we assume that f is smooth in θ and ϕ but not necessarily convex in θ or concave in ϕ. As we will see in the upcoming chapter, this problem arises in many applications: e.g., probability density estimation [10], robust optimization [21], and game theory [15]. Solving the saddle point problem of Eq. (1.1) is equivalent to finding a point (θ*, ϕ*) such that

f(\theta^*, \phi) \le f(\theta^*, \phi^*) \le f(\theta, \phi^*)    (1.2)

holds for all θ ∈ R^n and ϕ ∈ R^m. For a non convex-concave function f, finding such a saddle point is computationally infeasible. For example, even if the parameter vector ϕ* is given, finding θ* is a global non-convex minimization problem that is generally known to be NP-hard.

Local Optima of Saddle Point Optimization Instead of optimizing for a global saddle point, we consider a more modest goal: finding a locally optimal saddle point, i.e., a point (θ*, ϕ*) for which the condition of Eq. (1.2)

holds true in a local neighborhood

K_\gamma = \{(\theta, \phi) : \|\theta - \theta^*\| \le \gamma, \; \|\phi - \phi^*\| \le \gamma\}    (1.3)

around (θ*, ϕ*), with a sufficiently small γ > 0. We will call such points locally optimal saddle points.¹

¹ In the context of game theory or Generative Adversarial Networks [10] these points are called local Nash-equilibria. We will formalize this expression and the relationship to locally optimal saddle points in section 2.2.

Definition 1.1 The point (θ*, ϕ*) is a locally optimal saddle point of the problem in Eq. (1.1) if

f(\theta^*, \phi) \le f(\theta^*, \phi^*) \le f(\theta, \phi^*)    (1.4)

holds for all (θ, ϕ) ∈ K_γ.

Let S_f ⊂ R^n × R^m be the set of all locally optimal saddle points; then the problem reduces to finding a point (θ*, ϕ*) ∈ S_f. Here, we do not consider any partial ordering on members of S_f. We assume that all elements in S_f are equally good solutions to our saddle point problem.

1.2 Structure of the Thesis and Contributions

The thesis starts by introducing the reader to the saddle point problem and by stating the issues that arise for non convex-concave functions. By showing, in chapter 2, that multiple practical learning scenarios give rise to such a problem, we emphasize the relevance of our following analysis. In chapter 3, we lay out the mathematical foundation for the thesis by introducing the notation, basic definitions, and assumptions. We define the most important gradient-based optimization methods for the saddle point problem in chapter 4 and try to build an intuition for their properties by investigating their behavior on a simple example. Chapter 5 analyzes the dynamics of gradient-based optimization in terms of stability. Through this analysis, we discover a severe issue of gradient-based methods that is unique to saddle point optimization, namely that they introduce stable points that are no solution to the problem we are trying to solve. This is in clear contrast to GD for minimization tasks. Our observation leads to the design of a novel optimizer in chapter 6. With the use of curvature exploitation, we can remedy the identified issue. Moreover, we establish theoretical guarantees for the new method and provide an efficient implementation. We empirically support our arguments with multiple experiments on artificial and real-world data. In chapter 7, we extend our idea of using curvature information in saddle point optimization: instead of considering only the extreme curvature, we explore the advantages of using a full second-order method. We theoretically

identify the problems that arise in this context and propose a novel second-order optimizer that empirically outperforms gradient-based optimization on the saddle point problem.

The contribution of this thesis is fivefold:

1. We show that the Stable-Center manifold theorem [14] also applies to the gradient-based method for the saddle point problem. This implies that a convergent series will almost surely yield a solution that is in the stable set of the method.

2. We provide evidence that gradient-based optimization on the saddle point problem introduces stable points that are not optimal and therefore not a solution to Eq. (1.2).

3. We propose a novel optimizer, called CESP, that uses extreme curvature exploitation [7] to remedy the identified issues. We theoretically prove that, if convergent, this method will yield a solution to the saddle point problem.

4. We prove that optimization methods with a linear-transformed gradient step, such as Adagrad, cannot solve the issue of GD, as they also introduce non-optimal stable points. For this class of optimizers, we propose a modified variant of CESP.

5. We use the theoretical framework of the generalized trust region method [8] to design a second-order optimizer for saddle point problems.


Chapter 2 Examples of Saddle Point Problems

The problem of finding a structured saddle point arises in varying forms in many different applications. We start this chapter by introducing examples that give rise to a classical saddle point problem with a convex-concave structure. This case has been studied extensively and provides rich theoretical guarantees for conventional optimization methods. In the next step, we motivate the need for local saddle point optimization by providing prominent examples of applications where the objective does not have a convex-concave structure.

2.1 Convex-Concave Saddle Point Problems

In many practical learning scenarios the saddle point function f in Eq. (1.1) is convex in θ and concave in ϕ. Optimization on such problems has been studied in great depth and, therefore, we only mention one example here for completeness. A detailed list of different applications of convex-concave saddle point problems, ranging from fluid dynamics to economics, is presented in [4].

Constrained Optimization To solve a constrained minimization problem, we need to construct the Lagrangian and minimize it with respect to the model parameters, while maximizing with respect to its Lagrangian multipliers. For example, let's consider the following quadratic problem

\min_{x \in \mathbb{R}^n} x^\top A x - b^\top x    (2.1)

where A ∈ R^{n×n} and b ∈ R^n. The minimization is subject to the constraint

B x = g    (2.2)

where B ∈ R^{m×n} (m < n). By introducing the Lagrangian multiplier y, we can re-formulate it as an unconstrained problem:

\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} \; x^\top A x - b^\top x + y^\top (B x - g)    (2.3)

Hence, transforming a constrained problem of the form (2.1) into the equivalent unconstrained problem in Eq. (2.3) gives rise to a saddle point problem of the general form (1.1). In many applications of computer science and engineering, the Lagrangian multiplier y has a physical interpretation and its computation is also of interest.

2.2 Generative Adversarial Network

In recent years, we have seen a great rise of interest in the (non convex-concave) saddle point problem, mostly due to the invention of Generative Adversarial Networks (GANs) by Goodfellow et al. [10]. The introduction of GANs in 2014 was a significant improvement in the field of generative models. The principal problem in previous generative models was the difficult approximation of intractable probabilistic computations. The goal of a generative model is to capture the underlying data distribution. Observing that most interesting distributions, such as images, are high-dimensional and potentially very complicated, we intuitively realize that modeling the distribution explicitly may be intractable. GANs overcome this problem by modeling the data distribution in an implicit form. Instead of computing the distribution itself, they model a device called the generator which can draw samples from the distribution. The breakthrough idea of GANs is embedded in the training procedure of this generator. Inspired by game theory, the generator is faced with an opponent, the discriminator, during the training process. The two networks compete in a minimax game where the generator tries to minimize the same objective the discriminator seeks to maximize. Iteratively optimizing this objective with respect to both networks leads to more realistic generated samples over time. Intuitively, this procedure can be described as a simple game in which the generator wants to fool the discriminator into believing the generated samples are real. Initially, the generator achieves this goal with very low quality samples. But over time, as the discriminator gets stronger, the generator needs to generate more realistic samples. Eventually, the generator is able to fool any discriminator once it can perfectly imitate the data distribution.

Framework During the learning process the generator is adjusted such that the generative distribution p_g gets closer to the data distribution p_d. The distribution p_g is implicitly represented by a neural network G from which we can draw

instances. The network is parameterized with θ and takes as input a latent variable vector z that we sample from a noise prior p_z. The goal of the training is to learn a parameterization θ of the network such that its implicit sample distribution mimics the data distribution, i.e., G(z; θ) ∼ p_d. In order to achieve this goal, GANs use a discriminator network D : X → [0, 1] parameterized by ϕ that distinguishes real samples from generated samples by assigning each a probability (0: generated; 1: real). The discriminator is trained in parallel with the generator with real samples x ∈ X and generated samples x̂ ∼ G(z; θ). Through the simultaneous updates, both networks advance in their task concurrently, inducing the need for their opponent to further improve. A high-level view of the network structure with MNIST images as data source is shown in figure 2.1.

Figure 2.1: High-level sketch of the GAN structure and the interaction of the generator and discriminator networks with an MNIST image as input.

Objective The two networks are optimized over the same objective f(θ, ϕ). While the generator tries to minimize the expression, the discriminator tries to maximize it. Mathematically, this is described with the following saddle point problem:

\min_{\theta \in \mathbb{R}^n} \max_{\phi \in \mathbb{R}^m} f(\theta, \phi)    (2.4)
= \min_{\theta} \max_{\phi} \; \mathbb{E}_{x \sim p_d}[\log D(x; \phi)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z; \theta); \phi))]

The original GAN training, proposed by Goodfellow et al. [10], uses an alternating stochastic gradient descent/ascent approach as shown in Algorithm 1.

Algorithm 1 GAN Training
1: for number of training iterations do
2:    Sample {z^(1), ..., z^(m)} from noise prior p_z.
3:    Sample {x^(1), ..., x^(m)} from data distribution p_d.
4:    Update discriminator by ascending its stochastic gradient:
         \nabla_\phi \frac{1}{m} \sum_{i=1}^{m} \big[\log D(x^{(i)}; \phi) + \log(1 - D(G(z^{(i)}; \theta); \phi))\big]
5:    Update generator by descending its stochastic gradient:
         \nabla_\theta \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)}; \theta); \phi))
6: end for

Mathematical evidence on why GANs work In this paragraph, we want to formalize our intuitive understanding of the GAN training. The following argument relies on the assumption that for any fixed generator we have access to an optimal discriminator, which is given by:

D^*_G(x; \theta) = \frac{p_d(x)}{p_d(x) + p_g(x; \theta)},    (2.5)

where p_g is the modeled distribution that is implicitly defined through the generator network G [10]. Using the expression for the optimal discriminator, the training objective in Eq. (2.4) can be re-written as:

\min_\theta \max_\phi f(\theta, \phi) = \min_\theta c(\theta)    (2.6)
= \min_\theta \; \mathbb{E}_{x \sim p_d}\Big[\log \frac{p_d(x)}{p_d(x) + p_g(x; \theta)}\Big] + \mathbb{E}_{x \sim p_g}\Big[\log \frac{p_g(x; \theta)}{p_d(x) + p_g(x; \theta)}\Big]    (2.7)
= \min_\theta \; -\log(4) + \mathrm{KL}\Big(p_d \,\Big\|\, \frac{p_d + p_g}{2}\Big) + \mathrm{KL}\Big(p_g \,\Big\|\, \frac{p_d + p_g}{2}\Big)    (2.8)
= \min_\theta \; -\log(4) + 2\,\mathrm{JSD}(p_d \,\|\, p_g)    (2.9)

Note that the Jensen-Shannon divergence (JSD) has a single minimum at p_d = p_g, which is the global minimum of c(θ). Hence, for an optimal discriminator the minimization of the objective results in a perfectly replicated data generating distribution.
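The derivation above is easy to check numerically for discrete distributions: plugging the optimal discriminator of Eq. (2.5) into the objective of Eq. (2.4) reproduces −log(4) + 2·JSD(p_d‖p_g). The following minimal NumPy sketch does exactly that; the two distributions are arbitrary illustrative choices, not taken from the thesis.

```python
import numpy as np

# Two arbitrary discrete distributions over four outcomes (chosen only for illustration).
p_d = np.array([0.40, 0.30, 0.20, 0.10])   # "data" distribution
p_g = np.array([0.25, 0.25, 0.25, 0.25])   # "generator" distribution

# Optimal discriminator of Eq. (2.5): D*(x) = p_d(x) / (p_d(x) + p_g(x)).
d_star = p_d / (p_d + p_g)

# GAN objective of Eq. (2.4) evaluated at D*: E_{p_d}[log D*] + E_{p_g}[log(1 - D*)].
value_at_d_star = np.sum(p_d * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Right-hand side of Eq. (2.9): -log(4) + 2 * JSD(p_d || p_g).
def kl(p, q):
    return np.sum(p * np.log(p / q))

m = 0.5 * (p_d + p_g)
rhs = -np.log(4.0) + kl(p_d, m) + kl(p_g, m)   # the two KL terms equal 2 * JSD

print(value_at_d_star, rhs)   # both print the same value (slightly above -log(4) ≈ -1.386)
```

Since JSD(p_d‖p_g) ≥ 0 with equality only for p_d = p_g, the printed value can never fall below −log(4), mirroring the claim that the objective is globally minimized when the generator matches the data distribution.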

Problems of GANs Generative Adversarial Networks have massively improved state-of-the-art performance in learning complicated probability distributions. Like no other model, they are able to generate impressive results, even on very high-dimensional image data [24, 5, 26]. Despite all their success, GANs are also known to be notoriously hard to train [18]. Many theoretical and practical adjustments to the training procedure have been proposed in order to stabilize it [25, 1, 18]. While most of the modifications are out of the scope of this thesis, we will cover the earliest and probably most popular fix to GAN training: the use of individual loss functions.

From the Saddle Point Problem to the Zero-Sum Game Already in the initial GAN paper [10], Goodfellow pointed out that the saddle point objective of Eq. (2.4) does not work well in practice. He argues that early in learning the discriminator network is able to reject generated samples really well, leading to a saturation of the generator's objective log(1 − D(G(z; θ); ϕ)). To remedy this issue, he proposes to optimize the generator's parameters over the non-saturating loss −log D(G(z; θ); ϕ). This results in the following optimization objective:

\min_\theta f_1(\theta, \phi), \qquad \max_\phi f_2(\theta, \phi)    (2.10)
f_1(\theta, \phi) = -\mathbb{E}_{z \sim p_z}[\log D(G(z; \theta); \phi)]    (2.11)
f_2(\theta, \phi) = \mathbb{E}_{x \sim p_d}[\log D(x; \phi)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z; \theta); \phi))].    (2.12)

Note that this change preserves the fixed points of the training dynamics, but adjusts the generator's objective to be convex in D(G(z; θ); ϕ), instead of concave. Even though this modification leads to serious improvement in practice, it can theoretically no longer be described as a saddle point problem. Instead, the new objective gives rise to a Zero-Sum game [15]. However, since the fixed points of the dynamics are preserved, we can further generalize our optimization goal in Eq. (1.2) to handle the new objective. In particular, we do this generalization by defining a local Nash-equilibrium, which is a well-known notion in game theory [15] and the GAN community [19, 18, 17].

Definition 2.1 (Local Nash-Equilibrium) The point (θ*, ϕ*) is a local Nash-equilibrium of the adapted saddle point problem of the form

\min_\theta f_1(\theta, \phi), \qquad \max_\phi f_2(\theta, \phi)    (2.13)

if for all (θ, ϕ) ∈ K_γ

f_1(\theta^*, \phi^*) \le f_1(\theta, \phi^*)    (2.14)
f_2(\theta^*, \phi) \le f_2(\theta^*, \phi^*)    (2.15)

holds. Note that for f_1 = f_2 the definition of the local Nash-equilibrium reduces to definition 1.1 of the locally optimal saddle point.

2.3 Robust Optimization for Empirical Risk Minimization

The idea of robust optimization [3] is to adapt the optimization procedure such that it becomes more robust against uncertainty, which is usually represented by variability in the data. For a loss function l : Θ × X → R, we consider the typical problem of finding a parameter vector θ ∈ Θ to minimize the risk

R(X; \theta) = \mathbb{E}_P[l(X; \theta)],    (2.16)

where P denotes the distribution of the data X ∈ X. Since the true distribution is unknown, the problem is usually reduced to empirical risk minimization. Instead of taking the expected value over the true distribution, the empirical distribution \hat{P}_n, which places mass 1/n on each of the training points {X_1, ..., X_n} ⊂ X, is used:

R_{\mathrm{erm}}(X; \theta) = \mathbb{E}_{\hat{P}_n}[l(X; \theta)] = \frac{1}{n} \sum_{i=1}^{n} l(X_i; \theta)    (2.17)

Robust optimization aims to minimize this quantity while simultaneously being robust against variation in the data, i.e., discrepancy between \hat{P}_n and P. The approach it hereby takes is to consider the worst possible distribution, within the limit of some distance constraint to \hat{P}_n, at training time. It, therefore, becomes more resilient against problems caused by a training set that does not represent the true distribution appropriately. Putting this idea into an objective yields

\min_\theta \sup_{P} \Big\{ R_r(X; \theta, P) = \mathbb{E}_P[l(X; \theta)] \;:\; D(P \,\|\, \hat{P}_n) \le \frac{\rho}{n} \Big\}    (2.18)

where D(P‖\hat{P}_n) is a divergence measure between the distribution P and the empirical distribution of the training data \hat{P}_n. Hence, we arrive at an objective that is minimized over the model parameter θ and maximized with respect to the parameters of the distribution P. This is the standard form of

a saddle point problem as defined in Eq. (1.1). Contrary to the GAN objective, it has an additional divergence constraint that needs to be fulfilled.

Particularly interesting is the connection between robust optimization and the bias-variance tradeoff in statistical learning. Recent work by Namkoong et al. [21] showed that robust optimization is a good surrogate for the variance-regularized empirical risk

R_{\mathrm{erm}}(X; \theta) + c \sqrt{\frac{1}{n} \mathrm{Var}_{\hat{P}_n}(l(X; \theta))}.    (2.19)

Being a good approximation to this particular quantity is very desirable because it has been shown by [2] that, under appropriate assumptions, the true error is upper bounded with high probability by

R(X; \theta) \le R_{\mathrm{erm}}(X; \theta) + c_1 \sqrt{\frac{1}{n} \mathrm{Var}(l(X; \theta))} + \frac{c_2}{n}.    (2.20)

Therefore, robust optimization is a good surrogate for an upper bound on the true error and holds great promise for reliable risk minimization.
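To make the inner maximization of Eq. (2.18) concrete, the sketch below replaces the hard divergence constraint with a KL penalty against the uniform empirical weights. This is a common relaxation used purely for illustration here, not the formulation studied later in the thesis; its worst-case distribution has the closed softmax form p_i ∝ exp(l_i/λ).

```python
import numpy as np

def robust_risk_kl(losses, lam):
    """Penalized inner maximization  max_p  sum_i p_i * l_i - lam * KL(p || uniform)
    over the probability simplex; the maximizer is p_i ∝ exp(l_i / lam)."""
    losses = np.asarray(losses, dtype=float)
    n = len(losses)
    w = np.exp((losses - losses.max()) / lam)   # numerically stabilized softmax weights
    p = w / w.sum()
    kl = np.sum(p * np.log(p * n))              # KL(p || uniform)
    return np.dot(p, losses) - lam * kl, p

losses = np.array([0.2, 0.1, 1.5, 0.3, 0.2])    # made-up per-sample losses
plain = losses.mean()                           # R_erm of Eq. (2.17)
robust, p = robust_risk_kl(losses, lam=0.5)
print(plain, robust, p)   # the robust risk is larger and up-weights the hard sample (loss 1.5)
```

The penalized value always dominates the plain empirical risk (the uniform weighting is a feasible choice for p), which is the mechanism by which the robust objective accounts for unfavorable deviations between the empirical and the true distribution.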


Chapter 3 Preliminaries

3.1 Notation and Definitions

Throughout the thesis, vectors are denoted by lower case bold symbols, scalars by roman symbols and matrices by bold upper case letters. For a vector v, we use ‖v‖ to denote the l_2-norm, whereas for a matrix M, we use ‖M‖ to denote the spectral norm. We regularly use the short notation z := (θ, ϕ) to refer to the parameter vector and H := ∇²f(θ, ϕ) for the Hessian. For two symmetric matrices A and B, we use A ⪰ B to express that A − B is positive semi-definite.

Stability

Major parts of our theoretical analysis of the saddle point problem rely on the notion of stability. It is a well-known concept and tool from non-linear systems theory. In the following, we present a definition of the most important types of stability of dynamic systems [12]. Consider a system with parameters x ∈ R^n, whose derivative with respect to time is given by the function g(x), i.e.,

\dot{x} = g(x)    (3.1)

Assume, without loss of generality, that the origin is a critical point (equilibrium) of the system. Hence, it holds that g(0) = 0.

Definition 3.1 (Stability; Definition 4.1 from [12]) The equilibrium point x = 0 of (3.1) is

stable, if for each ε > 0, there is δ = δ(ε) > 0 such that

\|x_0\| < \delta \;\Rightarrow\; \|x_t\| < \varepsilon, \quad \forall t \ge 0    (3.2)

unstable, if it is not stable.

asymptotically stable, if it is stable and δ can be chosen such that

\|x_0\| < \delta \;\Rightarrow\; \lim_{t \to \infty} x_t = 0    (3.3)

exponentially stable, if it is asymptotically stable and δ, k, λ > 0 can be chosen such that

\|x_0\| < \delta \;\Rightarrow\; \|x_t\| \le k \|x_0\| e^{-\lambda t}    (3.4)

The system is called stable if we can be sure that it stays within a ball of radius ε when it is initialized within a ball of radius δ(ε). This notion alone does not guarantee convergence to the equilibrium point. It might as well orbit around the point as long as it stays within the ε-ball. The stronger concept of asymptotic stability, on the other hand, doesn't allow infinite orbiting around the equilibrium, but requires convergence in the limit t → ∞.

Critical Points

Definition 3.2 A critical point (θ̄, ϕ̄) of the function f(θ, ϕ) is a point where the gradient of the function is zero, i.e.

\nabla f(\bar{\theta}, \bar{\phi}) = 0.    (3.5)

Critical points can be further analyzed by the curvature of the function, i.e., the Hessian of f. The structure of the function f, with its two different input vectors, gives rise to the following block structure of the Hessian:

H = \begin{bmatrix} \nabla^2_\theta f(\theta, \phi) & \nabla_{\theta\phi} f(\theta, \phi) \\ \nabla_{\phi\theta} f(\theta, \phi) & \nabla^2_\phi f(\theta, \phi) \end{bmatrix}    (3.6)

Note that the Hessian is a symmetric matrix and, thus, all its eigenvalues are real numbers. There are four important types of critical points that can be categorized through the eigenvalues of the Hessian and its individual blocks:

- If all eigenvalues of H are strictly positive, then the critical point is a local minimum.
- If all eigenvalues of H are strictly negative, then the critical point is a local maximum.
- If H has both positive and negative eigenvalues, then the critical point is a saddle point.

- If all eigenvalues of ∇²_θ f(θ, ϕ) are strictly positive and all eigenvalues of ∇²_ϕ f(θ, ϕ) are strictly negative, then the critical point is a locally optimal saddle point (local Nash-equilibrium). This claim is formalized in lemma 3.3.

Note that the set of locally optimal saddle points is a subset of the general saddle points defined before. Note also that for a function f(θ, ϕ) that is convex in θ and concave in ϕ, all its saddle points are locally optimal, and according to the Minimax Theorem (Sion-Kakutani) it holds that

\min_\theta \max_\phi f(\theta, \phi) = \max_\phi \min_\theta f(\theta, \phi).    (3.7)

Since we are considering a more general saddle point problem, where f(θ, ϕ) is neither convex in θ nor concave in ϕ, these guarantees don't hold and a more complex analysis of the saddle points is necessary.

Characterization of locally optimal saddle points With the use of the non-degeneracy assumption on the Hessian matrices (Assumption 3.5), we are able to establish sufficient conditions for (θ*, ϕ*) to be a locally optimal saddle point.

Lemma 3.3 Suppose that f satisfies Assumption 3.5; then, z* := (θ*, ϕ*) is a locally optimal saddle point on K_γ if and only if the gradient with respect to the parameters is zero, i.e.

\nabla f(\theta^*, \phi^*) = 0,    (3.8)

and the second derivative at (θ*, ϕ*) is positive definite in θ and negative definite in ϕ, i.e. there exist µ_θ, µ_ϕ > 0 such that

\nabla^2_\theta f(\theta^*, \phi^*) \succeq \mu_\theta I, \qquad \nabla^2_\phi f(\theta^*, \phi^*) \preceq -\mu_\phi I.    (3.9)

Proof From definition 1.1 it follows that a locally optimal saddle point (θ*, ϕ*) ∈ K_γ is a point for which the following two conditions hold:

f(\theta^*, \phi^*) \le f(\theta, \phi^*) \quad \text{and} \quad f(\theta^*, \phi) \le f(\theta^*, \phi^*) \quad \forall (\theta, \phi) \in K_\gamma    (3.10)

Hence, θ* is a local minimizer of f and ϕ* is a local maximizer. We therefore, without loss of generality, prove the statement of the lemma only for the minimizer θ*, namely that

1. ∇_θ f(θ*, ϕ) = 0 for all ϕ s.t. ‖ϕ − ϕ*‖ ≤ γ,
2. ∇²_θ f(θ*, ϕ) ⪰ µ_θ I for all ϕ s.t. ‖ϕ − ϕ*‖ ≤ γ, with µ_θ > 0.

The proof for the maximizer ϕ* directly follows from this.

1. If we assume that ∇_θ f(θ*, ϕ) ≠ 0, then there exists a feasible direction d ∈ R^n such that ∇_θ f(θ*, ϕ)ᵀd < 0, and we can find a step size α > 0 for θ(α) = θ* + αd s.t. ‖αd‖ ≤ γ. Now, with the use of Taylor's theorem, we arrive at the following inequality:

f(\theta(\alpha), \phi) = f(\theta^*, \phi) + \alpha \nabla_\theta f(\theta^*, \phi)^\top d + O(\alpha^2)    (3.11)
< f(\theta^*, \phi) \quad \text{for sufficiently small } \alpha,    (3.12)

which contradicts that θ* is a local minimizer of f(·, ϕ). Hence, ∇_θ f(θ*, ϕ) = 0 is a necessary condition for a local minimizer.

2. To prove the second statement, we again make use of Taylor's theorem with α > 0, d ∈ R^n for θ(α) = θ* + αd s.t. ‖αd‖ ≤ γ (the first-order term vanishes by part 1):

f(\theta(\alpha), \phi) = f(\theta^*, \phi) + \frac{\alpha^2}{2} d^\top \nabla^2_\theta f(\theta^*, \phi) d + O(\alpha^3)    (3.13)

The inequality f(θ(α), ϕ) ≥ f(θ*, ϕ) holds for sufficiently small α if and only if dᵀ∇²_θ f(θ*, ϕ)d ≥ 0, which means that ∇²_θ f(θ*, ϕ) needs to be positive semi-definite. Because we assume that the second derivative is non-degenerate (Assumption 3.5), we arrive at the sufficient condition that

\nabla^2_\theta f(\theta^*, \phi) \succeq \mu_\theta I, \quad \text{with } \mu_\theta > 0,    (3.14)

which concludes the proof.

3.2 Assumptions

For the upcoming theoretical analysis of different optimization procedures on the saddle point problem, we need to make several assumptions on the function f(θ, ϕ).

Smoothness We assume that the function f(θ, ϕ) is smooth, which is formalized in the following assumption.

Assumption 3.4 (Smoothness) We assume that f(z) := f(θ, ϕ) is a C² function, and that its gradient and Hessian are Lipschitz with respect to the

parameters θ and ϕ, i.e., we assume that the following inequalities hold:

\|\nabla f(z) - \nabla f(\tilde{z})\| \le L_z \|z - \tilde{z}\|    (3.15)
\|\nabla^2 f(z) - \nabla^2 f(\tilde{z})\| \le \rho_z \|z - \tilde{z}\|    (3.16)
\|\nabla f(\theta, \phi) - \nabla f(\tilde{\theta}, \phi)\| \le L_\theta \|\theta - \tilde{\theta}\|    (3.17)
\|\nabla^2_\theta f(\theta, \phi) - \nabla^2_\theta f(\tilde{\theta}, \phi)\| \le \rho_\theta \|\theta - \tilde{\theta}\|    (3.18)
\|\nabla f(\theta, \phi) - \nabla f(\theta, \tilde{\phi})\| \le L_\phi \|\phi - \tilde{\phi}\|    (3.19)
\|\nabla^2_\phi f(\theta, \phi) - \nabla^2_\phi f(\theta, \tilde{\phi})\| \le \rho_\phi \|\phi - \tilde{\phi}\|    (3.20)
\|\nabla_\theta f(z)\| \le l_\theta, \quad \|\nabla_\phi f(z)\| \le l_\phi, \quad \|\nabla_z f(z)\| \le l_z    (3.21)

Non-Degenerate For our theoretical guarantees, we require a non-degeneracy assumption on the Hessian matrix of f.

Assumption 3.5 (Non-degenerate Hessian) We assume that the matrices ∇²_θ f(θ, ϕ) and ∇²_ϕ f(θ, ϕ) are non-degenerate for all (θ ∈ R^n, ϕ ∈ R^m).
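As a concrete illustration of the critical-point classification of section 3.1 and Lemma 3.3, the following NumPy sketch checks the block-eigenvalue conditions at the critical point of a small quadratic toy function; the function itself is an arbitrary example introduced only here, not one used in the thesis.

```python
import numpy as np

# Toy objective f(theta, phi) = theta^2 - phi^2 + 0.5 * theta * phi (illustrative assumption).
# Its only critical point is (0, 0); we classify it via the Hessian blocks of Eq. (3.6).
H = np.array([[2.0, 0.5],    # [ d^2f/dtheta^2      d^2f/dtheta dphi ]
              [0.5, -2.0]])  # [ d^2f/dphi dtheta   d^2f/dphi^2      ]

H_theta = H[:1, :1]   # upper-left block  (theta-curvature)
H_phi   = H[1:, 1:]   # lower-right block (phi-curvature)

is_saddle = np.linalg.eigvalsh(H).min() < 0 < np.linalg.eigvalsh(H).max()
is_locally_optimal = np.linalg.eigvalsh(H_theta).min() > 0 and np.linalg.eigvalsh(H_phi).max() < 0

print(is_saddle, is_locally_optimal)   # True True: (0, 0) is a locally optimal saddle point
```

The full Hessian is indefinite (a saddle point in the classical sense), and the sign conditions on the two diagonal blocks are exactly the sufficient conditions of Lemma 3.3.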


Chapter 4 Saddle Point Optimization

4.1 Gradient-Based Optimization

The saddle point problem of Eq. (1.1) is usually solved by a gradient-based optimization method. In particular, the problem is split into two individual optimization problems that can be solved simultaneously with gradient steps in different directions, i.e.,

\begin{bmatrix} \theta_{t+1} \\ \phi_{t+1} \end{bmatrix} = \begin{bmatrix} \theta_t \\ \phi_t \end{bmatrix} + \eta \begin{bmatrix} -\nabla_\theta f(\theta_t, \phi_t) \\ \nabla_\phi f(\theta_t, \phi_t) \end{bmatrix}.    (4.1)

This method is used in various applications of saddle point optimization, as explained in chapter 2. Stationary points of the above dynamics are critical points of the function f (cf. section 3.1). More precisely, the point z* := (θ*, ϕ*) is a stationary point of the above iterates if the mapping function, implicitly defined through the recursion, maps z* to itself, i.e., if ∇_z f(z*) = 0. With F_G ⊂ R^n × R^m we denote the set of all stationary points of the gradient iterates, i.e.,

F_G := \{ z^* \in \mathbb{R}^n \times \mathbb{R}^m \mid \nabla_z f(z^*) = 0 \}.    (4.2)

4.2 Linear-transformed Gradient Steps

Linear transformation of the gradient updates has been shown to accelerate optimization for various types of problems, including the saddle point problem. Such methods can be expressed by the following recursion

\begin{bmatrix} \theta_{t+1} \\ \phi_{t+1} \end{bmatrix} = \begin{bmatrix} \theta_t \\ \phi_t \end{bmatrix} + \eta A_{\theta,\phi} \begin{bmatrix} -\nabla_\theta f(\theta_t, \phi_t) \\ \nabla_\phi f(\theta_t, \phi_t) \end{bmatrix}    (4.3)

where A_{θ,ϕ} is a symmetric ((n + m) × (n + m))-matrix. Different optimization methods use a different linear transformation A_{θ,ϕ}.

Most common optimization schemes use a block diagonal matrix A_{θ,ϕ} = [[A, 0], [0, B]]. Table 4.1 illustrates the choice of A_{θ,ϕ} for different optimizers.

Table 4.1: Update matrices for commonly used optimization schemes.

Method                  | Formula                                                                                   | Positive definite?
Gradient Descent        | A_t = I,  B_t = I                                                                         | Yes.
Adagrad [9]             | A_{t,ii} = (\sqrt{\sum_{\tau=1}^{t} (\nabla_{\theta_i} f(\theta_\tau, \phi_\tau))^2} + \varepsilon)^{-1}  | Yes.
                        | B_{t,ii} = (\sqrt{\sum_{\tau=1}^{t} (\nabla_{\phi_i} f(\theta_\tau, \phi_\tau))^2} + \varepsilon)^{-1}    |
Newton                  | A_t = (\nabla^2_\theta f(\theta_t, \phi_t))^{-1}                                          | Around local min of f(θ, ϕ).
                        | B_t = -(\nabla^2_\phi f(\theta_t, \phi_t))^{-1}                                           | Around local max of f(θ, ϕ).
Saddle-Free Newton [8]  | A_t = |\nabla^2_\theta f(\theta_t, \phi_t)|^{-1},  B_t = |\nabla^2_\phi f(\theta_t, \phi_t)|^{-1}         | Yes.

4.3 Asymptotic Behavior of Gradient Iterations

There are three different asymptotic scenarios for the gradient iterations in Eq. (4.1):

1. Divergence, i.e., lim_{t→∞} ‖∇f(θ_t, ϕ_t)‖ = ∞.
2. Being trapped in a loop, i.e., lim_{t→∞} ‖∇f(θ_t, ϕ_t)‖ > 0.
3. Convergence to a stationary point of the gradient updates, i.e., lim_{t→∞} ‖∇f(θ_t, ϕ_t)‖ = 0.

To the best of our knowledge, there is no global convergence guarantee for general saddle point optimization. Typical convergence guarantees require convex-concavity or some form of quasiconvex-quasiconcavity of f [6]. Instead of global convergence, we consider the rather modest analysis of convergent iterations for the upcoming chapters of the thesis. Hence, we focus on the third outlined case, where we are sure that the method converges to some stationary point (θ, ϕ) for which ∇f(θ, ϕ) = 0 holds. The question of interest for the upcoming chapters is whether such a sequence always yields a locally optimal saddle point as defined in Definition 1.1.
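The simultaneous update of Eq. (4.1) is straightforward to implement. The following NumPy sketch is purely illustrative (the convex-concave test function, step size, and initialization are assumptions, not taken from the thesis); it runs the descent/ascent dynamics and checks that the gradient norm vanishes, i.e., that the third scenario above occurs.

```python
import numpy as np

# Illustrative convex-concave test function f(theta, phi) = 0.5*theta^2 - 0.5*phi^2 + theta*phi.
grad_theta = lambda th, ph: th + ph          # df/dtheta
grad_phi   = lambda th, ph: th - ph          # df/dphi

theta, phi, eta = 3.0, -2.0, 0.1             # arbitrary initialization and step size
for t in range(500):
    g_th, g_ph = grad_theta(theta, phi), grad_phi(theta, phi)
    # Simultaneous update of Eq. (4.1): descent in theta, ascent in phi.
    theta, phi = theta - eta * g_th, phi + eta * g_ph

grad_norm = np.hypot(grad_theta(theta, phi), grad_phi(theta, phi))
print(theta, phi, grad_norm)   # iterates approach the saddle point (0, 0); gradient norm ~ 0
```

For this particular function the unique stationary point (0, 0) is also a locally optimal saddle point, so convergence of the gradient norm is the desired outcome; the next sections show that this is not guaranteed in general.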

4.4 Convergence Analysis on a Convex-Concave Objective

In this section, we want to analyze the differences between common optimization methods for the saddle point problem in Eq. (1.1). As described before, it is very difficult to make any global argument for a non convex-concave function f. In order to still get an intuition about the different optimization schemes on the saddle point problem, we consider (in this section) a function f(θ, ϕ) that is convex in θ and concave in ϕ, i.e., we assume that

\nabla^2_\theta f(\theta, \phi) \succeq \alpha I \quad \text{and} \quad \nabla^2_\phi f(\theta, \phi) \preceq -\alpha I, \quad \alpha > 0    (4.4)

for all (θ, ϕ) ∈ R^n × R^m. Moreover, we assume that the function f has a unique critical point (θ*, ϕ*), which is the solution to the saddle point problem (1.1). In this simplified setting, we want to obtain insights into the convergence behavior of commonly used optimization methods. In particular, we are interested in the convergence rate to the optimal saddle point (θ*, ϕ*).

Gradient-Based Optimizer

The following lemma shows that the convergence rate of gradient descent is independent of cross-dependencies between the parameters θ and ϕ in the strictly convex-concave setting. By choosing a proper step size γ we are guaranteed to converge to the saddle point (θ*, ϕ*).

Lemma 4.1 Suppose that ∇²_θ f(θ, ϕ) ⪰ αI and ∇²_ϕ f(θ, ϕ) ⪯ −αI with α > 0, and assumptions 3.4 and 3.5 hold. Let (θ*, ϕ*) be the unique solution of the saddle point problem; then t gradient steps obtain

\left\| \begin{bmatrix} \theta^{(t)} - \theta^* \\ \phi^{(t)} - \phi^* \end{bmatrix} \right\|^2 \le (1 + \gamma(\gamma L_z - 2\alpha))^t \left\| \begin{bmatrix} \theta^{(0)} - \theta^* \\ \phi^{(0)} - \phi^* \end{bmatrix} \right\|^2    (4.5)

Proof See appendix.

Gradient-Based Optimizer with Preconditioning Matrix

In non-convex minimization it is common practice to use a gradient-based optimizer with a preconditioning matrix, such as, for example, Adagrad [9]. Usually this matrix is diagonal and positive definite. These methods have been adapted to the saddle point problem, which yields the simultaneous update step of Eq. (4.3). In this section, we try to understand the convergence behavior of this more general update step on the convex-concave objective. The following lemma establishes an expression for the distance to the unique saddle point after one iteration of the update in Eq. (4.3).

An interesting observation is that, as opposed to gradient descent before, the convergence depends on the cross-dependencies between the parameters θ and ϕ.

Lemma 4.2 Suppose that ∇²_θ f ⪰ γI and ∇²_ϕ f ⪯ −γI with γ > 0, and assumptions 3.4 and 3.5 hold. The step size matrices A and B are diagonal, positive semi-definite matrices with

\alpha_{\min} I \preceq A \preceq \alpha_{\max} I    (4.6)
\beta_{\min} I \preceq B \preceq \beta_{\max} I    (4.7)

and a constant

R := \frac{\min(\alpha_{\min}, \beta_{\min})}{\max(\alpha_{\max}, \beta_{\max})}.    (4.8)

Let (θ*, ϕ*) be the unique solution of the saddle point problem; then one modified gradient step of Eq. (4.3) with step size

\eta = \frac{\gamma R}{L_z \max(\alpha_{\max}, \beta_{\max})}    (4.9)

leads to

\left\| \begin{bmatrix} \theta^+ - \theta^* \\ \phi^+ - \phi^* \end{bmatrix} \right\|^2 = \Big(1 - \frac{\gamma^2 R^2}{L_z}\Big) \left\| \begin{bmatrix} \tilde\theta - \theta^* \\ \tilde\phi - \phi^* \end{bmatrix} \right\|^2 - 2\gamma \sum_{i=1}^{n} \sum_{j=1}^{m} (A_{ii} - B_{jj})(\tilde\theta_i - \theta^*_i)(\tilde\phi_j - \phi^*_j) \, \partial_{\theta_i} \partial_{\phi_j} f(\tilde\theta, \tilde\phi)    (4.10)

Proof See appendix 7.5.

The lemma provides an upper bound for the convergence of a general optimization scheme that uses diagonal step-matrices, as for example Adagrad. As we can see, the right-hand side of Eq. (4.10) consists of a summation of two terms. The first one, depending on the problem and the chosen step size, is smaller than ‖(θ̃ − θ*, ϕ̃ − ϕ*)‖². It therefore drives the new parameters closer to the optimum. The value of the second summand is rather unclear. In general, we cannot make any statement about the sign of the products (A_ii − B_jj)(θ̃_i − θ*_i)(ϕ̃_j − ϕ*_j) ∂_{θ_i}∂_{ϕ_j} f(θ̃, ϕ̃). Hence, from this form it is not possible to tell whether this cross-dependency term increases or decreases the upper bound. What we can see, however, is that the product of any ij-combination vanishes if either θ̃_i or ϕ̃_j gets close to its optimum value. This ensures that for θ̃, ϕ̃ close enough around the optimum the algorithm converges to θ*, ϕ*, respectively.

Experiment

Let's consider a simple convex-concave saddle point problem of the form

\min_{\theta \in \mathbb{R}^n} \max_{\phi \in \mathbb{R}^n} \Big( f(\theta, \phi) = \frac{\alpha}{2} \|\theta\|^2 - \frac{\beta}{2} \|\phi\|^2 + \theta^\top M \phi \Big)    (4.11)

where M ∈ R^{n×n} is a diagonal matrix. The Jacobian and Hessian of the objective are given, respectively, by

\nabla f(\theta, \phi) = \begin{bmatrix} \alpha\theta + M\phi \\ -\beta\phi + M^\top\theta \end{bmatrix}, \qquad \nabla^2 f(\theta, \phi) = \begin{bmatrix} \alpha I & M \\ M^\top & -\beta I \end{bmatrix}    (4.12)

Experimentally, we want to compare the convergence to the saddle point of Eq. (4.11) for gradient descent of Eq. (4.1) (GD) and Adagrad (a special form of Eq. (4.3)). Lemma 4.1 indicates that with a sufficiently small step size GD always converges to the saddle point. Since Adagrad, however, is part of the class of algorithms that use a more general update step matrix, we need to use the convergence upper bound given by Lemma 4.2, which, in general, does not provide the same guarantees. The problematic part of this bound is the additive cross-dependency term given by

\gamma \sum_{i=1}^{n} (B_{ii} - A_{ii})(\theta_i - \theta^*_i)(\phi_i - \phi^*_i) M_{ii}.    (4.13)

In general, it is not possible to make a statement about the sign of this term. To experimentally show that the cross-dependency term can either improve or worsen the convergence rate of Adagrad, we design specific problems based on the knowledge about the factors in expression (4.13). Since it is an additive term, we are interested in its sign to determine whether it increases or decreases the bound. It consists of four different factors, each of them being either positive or negative. Therefore, we need to make some assumptions in order to investigate the sign-change behavior of the whole term. In particular, we would like to fix the sign of (θ_i − θ*_i)(ϕ_i − ϕ*_i)M_ii for all i ∈ {1, ..., n} such that, by varying the diagonal values of the step matrices A and B, we can achieve a change in the sign of the term. While this is trivial to achieve for the cross-dependency matrix M (we can choose its values arbitrarily), it is not obvious how to fix the sign of the other two factors. However, we can approximately make them positive by initializing all parameters such that

\theta^{(0)}_i \gg \theta^*_i \quad \text{and} \quad \phi^{(0)}_i \gg \phi^*_i \qquad \forall i \in \{1, \dots, n\}.    (4.14)

Hence, by choosing all diagonal values of M to be greater than zero, we can see that the cross-dependency term improves the algorithm when A_ii > B_ii

for all i ∈ {1, ..., n}. The preconditioning matrices A and B for Adagrad depend on the history of the optimization process (cf. table 4.1). Their diagonal values are approximately inversely proportional to the sum of squared past gradients. Therefore, extreme parameter updates get dampened, while parameters that receive few or small updates get higher learning rates. By using the knowledge about the Jacobian of the objective from equation 4.12, we can influence the values of the step matrices A and B by varying the parameters α and β. Intuitively, a large value of α together with a small β should lead to a positive sign of the cross-dependency term and, therefore, worsen Adagrad compared to GD.

Figure 4.1: L2 distance to the saddle point over iterations of the optimization procedure for the problem formulation in equation 4.11 with θ, ϕ ∈ R^30. The different plots show varying values for α and β, while the diagonal cross-dependency matrix M has fixed values coming from N(5, 1).

The results of the experiment, shown in figure 4.1, are in accordance with this intuition. The parameters of this experiment are chosen based on the argumentation above, i.e., large positive initial values for θ and ϕ and positive values for the matrix M. The variation of the α and β parameterization shows the anticipated behavior. If α ≫ β (top right plot), the cross-dependency term hinders Adagrad from finding the optimal parameters, and thus it is outperformed by GD. Conversely, if β ≫ α (bottom left plot), the additive term becomes negative and improves the performance of Adagrad.

The designed experiment provides evidence for the argument arising from Lemma 4.2, namely that Adagrad can either benefit or be hurt by the cross-dependency term, depending on the specific parameterization of the convex-concave saddle point problem.
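A compact NumPy re-implementation of this comparison is sketched below. The dimension matches the experiment (n = 30), but the step size, number of iterations, and the exact α, β values are illustrative choices, not the precise settings behind Figure 4.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, steps = 30, 0.05, 2000
M = np.diag(rng.normal(5.0, 1.0, n))       # diagonal cross-dependency matrix, M_ii ~ N(5, 1)
alpha, beta = 10.0, 0.1                    # alpha >> beta: regime where Adagrad should suffer

def grads(th, ph):
    # Gradients of f = alpha/2 ||theta||^2 - beta/2 ||phi||^2 + theta^T M phi  (Eq. 4.12)
    return alpha * th + M @ ph, -beta * ph + M.T @ th

def final_distance(adagrad):
    th, ph = np.full(n, 10.0), np.full(n, 10.0)           # large positive initialization, Eq. (4.14)
    s_th, s_ph, eps = np.zeros(n), np.zeros(n), 1e-8
    for _ in range(steps):
        g_th, g_ph = grads(th, ph)
        if adagrad:                                       # Adagrad preconditioning (Table 4.1)
            s_th += g_th**2; s_ph += g_ph**2
            a, b = 1.0 / (np.sqrt(s_th) + eps), 1.0 / (np.sqrt(s_ph) + eps)
        else:                                             # plain gradient descent/ascent
            a, b = 1.0, 1.0
        th, ph = th - eta * a * g_th, ph + eta * b * g_ph # simultaneous step of Eq. (4.3)
    return np.sqrt(th @ th + ph @ ph)                     # distance to the saddle point (0, 0)

print(final_distance(adagrad=False), final_distance(adagrad=True))
```

Swapping the roles of α and β (β ≫ α) flips the sign of the cross-dependency term and should reverse the comparison, mirroring the two regimes discussed above.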


Chapter 5 Stability Analysis

5.1 Local Stability

In general, a stationary point of the gradient iterates in Eq. (4.1) can be either stable or unstable, where the notion of stability is defined as in Definition 3.1. Stability characterizes the behavior of the optimization dynamics in a local region around a stationary point. Within an epsilon-ball around a stable stationary point, successive iterates of the method are not able to escape the region. Conversely, for unstable points, successive iterates can escape this region. Let's consider the gradient iterations of Eq. (4.1) as a dynamical system with the following mapping function

G(\theta_t, \phi_t) = \begin{bmatrix} \theta_t \\ \phi_t \end{bmatrix} + \eta \begin{bmatrix} -\nabla_\theta f(\theta_t, \phi_t) \\ \nabla_\phi f(\theta_t, \phi_t) \end{bmatrix}    (5.1)

and assume that it has a stationary point at z* := (θ*, ϕ*). Then, we can linearize the system by a first-order Taylor approximation around this point, i.e.,

G(z) \approx G(z^*) + \nabla G(z^*)(z - z^*)    (5.2)
= z^* + \nabla G(z^*)(z - z^*).    (5.3)

Obviously, this approximation is only valid for points z in a small neighborhood around the critical point z*. To quantify the previously stated intuition about stability, we can say that a fixed point z* is unstable if successive iterates of (4.1) increase the distance to the point z*. Conversely, a fixed point z* is stable if

\frac{\|z_t - z^*\|}{\|z_{t-1} - z^*\|} \le 1    (5.4)

which, with the help of the linearization (5.2), leads to the following expression:

\frac{\|z_t - z^*\|}{\|z_{t-1} - z^*\|} = \frac{\|G(z_{t-1}) - z^*\|}{\|z_{t-1} - z^*\|}    (5.5)
\approx \frac{\|\nabla G(z^*)(z_{t-1} - z^*)\|}{\|z_{t-1} - z^*\|} \le \|\nabla G(z^*)\| \le 1.    (5.6)

Hence, we can analyze the stability of a stationary point of the gradient iterates through the magnitude of the eigenvalues of its Jacobian ∇G(z*). In particular, a stationary point z* is locally stable if all the eigenvalues of the corresponding Jacobian ∇G(z*) lie inside the unit sphere.

5.2 Convergence to Non-Stable Stationary Points

With the notion of stability, we are able to analyze the asymptotic behavior of the gradient method. It has been shown by [14] that for a non-convex minimization problem of the form

\min_{\theta \in \mathbb{R}^n} f(\theta)    (5.7)

gradient descent with random initialization almost surely converges to a stable stationary point of its dynamics. The next lemma extends this result to gradient iterations of Eq. (4.1) for saddle point problems.

Lemma 5.1 (Random Initialization) Suppose that assumptions 3.4 and 3.5 hold. Consider gradient iterations of Eq. (4.1) starting from a random initial point. If the iterations converge to a stationary point, then the stationary point is almost surely stable.

Proof The following Lemma 5.2 proves that the gradient update from Eq. (4.1) for the saddle point problem is a diffeomorphism. The remaining part of the proof follows directly from theorem 4.1 from [14].

Lemma 5.2 The gradient mapping for the saddle point problem

G(\theta, \phi) = (\theta, \phi) + \eta(-\nabla_\theta f(\theta, \phi), \nabla_\phi f(\theta, \phi))    (5.8)

with step size η < min(1/L_θ, 1/L_ϕ, 1/(2L_z)) is a diffeomorphism.

Proof The following proof is very much based on the proof of proposition 4.5 from [14]. A necessary condition for a diffeomorphism is bijectivity. Hence, we need to check that G is

1. injective, and
2. surjective

for η < min(1/L_θ, 1/L_ϕ, 1/(2L_z)).

1. Consider two points z := (θ, ϕ), z̃ := (θ̃, ϕ̃) ∈ K_γ for which

G(z) = G(\tilde{z})    (5.9)

holds. Then, we have that

z - \tilde{z} = \eta \begin{bmatrix} \nabla_\theta f(z) \\ -\nabla_\phi f(z) \end{bmatrix} - \eta \begin{bmatrix} \nabla_\theta f(\tilde{z}) \\ -\nabla_\phi f(\tilde{z}) \end{bmatrix}    (5.10)
= \eta \begin{bmatrix} \nabla_\theta f(z) - \nabla_\theta f(\tilde{z}) \\ -(\nabla_\phi f(z) - \nabla_\phi f(\tilde{z})) \end{bmatrix}.    (5.11)

Note that

\|\nabla_\theta f(z) - \nabla_\theta f(\tilde{z})\| \le \|\nabla f(z) - \nabla f(\tilde{z})\| \le L_z \|z - \tilde{z}\|    (5.12)
\|\nabla_\phi f(z) - \nabla_\phi f(\tilde{z})\| \le \|\nabla f(z) - \nabla f(\tilde{z})\| \le L_z \|z - \tilde{z}\|,    (5.13)

from which it follows that

\|z - \tilde{z}\| \le \eta \sqrt{L_z^2 \|z - \tilde{z}\|^2 + L_z^2 \|z - \tilde{z}\|^2}    (5.14)
= \sqrt{2}\, \eta L_z \|z - \tilde{z}\|.    (5.15)

For 0 < η < 1/(2L_z) this means z = z̃, and therefore G is injective.

2. We are going to show that G is surjective by constructing an explicit inverse function for both optimization problems individually. As suggested by [14], we make use of the proximal point algorithm on the function f for the parameters θ and ϕ individually. For the parameter θ, the proximal point mapping of f centered at θ̃ is given by

\theta(\tilde{\theta}) = \arg\min_\theta \; \underbrace{\tfrac{1}{2}\|\theta - \tilde{\theta}\|^2 - \eta f(\theta, \phi)}_{h(\theta)}    (5.16)

Moreover, note that h(θ) is strongly convex in θ if η < 1/L_θ:

(\nabla_\theta h(\theta) - \nabla_\theta h(\hat{\theta}))^\top (\theta - \hat{\theta})    (5.17)
= (\theta - \eta \nabla_\theta f(\theta, \phi) - \hat{\theta} + \eta \nabla_\theta f(\hat{\theta}, \phi))^\top (\theta - \hat{\theta})    (5.18)
= \|\theta - \hat{\theta}\|^2 - \eta (\nabla_\theta f(\theta, \phi) - \nabla_\theta f(\hat{\theta}, \phi))^\top (\theta - \hat{\theta})    (5.19)
\ge (1 - \eta L_\theta) \|\theta - \hat{\theta}\|^2    (5.20)

Hence, the function h(θ) has a unique minimizer, given by

0 \overset{!}{=} \nabla_\theta h(\theta) = \theta - \tilde{\theta} - \eta \nabla_\theta f(\theta, \phi)    (5.21)
\Leftrightarrow \quad \tilde{\theta} = \theta - \eta \nabla_\theta f(\theta, \phi)    (5.22)

which means that for every θ̃ there is a unique θ that is mapped to it under the gradient mapping G if η < 1/L_θ. The same line of reasoning can be applied to the parameter ϕ with the negative proximal point mapping of f centered at ϕ̃, i.e.

\phi(\tilde{\phi}) = \arg\max_\phi \; \underbrace{-\tfrac{1}{2}\|\phi - \tilde{\phi}\|^2 + \eta f(\theta, \phi)}_{h(\phi)}    (5.23)

Similarly as before, we can observe that h(ϕ) is strictly concave for η < 1/L_ϕ and that the unique maximizer of h(ϕ) yields the ϕ update step of G. From this, we conclude that the mapping G is surjective for (θ, ϕ) if η < min(1/L_θ, 1/L_ϕ).

Observing that, under Assumption 3.5, G⁻¹ is continuously differentiable concludes the proof that G is a diffeomorphism.

Conclusion The result of lemma 5.1 implies that a convergent sequence of the gradient iterates almost surely yields a stable stationary point of its dynamics. Remembering that our analysis is solely concerned with convergent sequences (cf. section 4.3), the result suggests that we only need to consider stable stationary points of the gradient dynamics. In a minimization problem, as in Eq. (5.7), this result implies that if the gradient dynamics of the form

\theta_{t+1} = \theta_t - \eta \nabla_\theta f(\theta_t)    (5.24)

converge to a solution θ*, this solution is not just a stable stationary point but also a minimizer of the function f. This is due to the fact that the Jacobian of the update in Eq. (5.24) is given by:

J(\theta_t) = I - \eta \nabla^2_\theta f(\theta_t)    (5.25)

As mentioned earlier, a point is stable for a certain dynamic only if the eigenvalues of its Jacobian lie within the unit sphere. From the above equation, we see that this is only possible - with a sufficiently small step size - if −∇²_θ f(θ_t) is a Hurwitz matrix. Since any Hessian matrix is symmetric, the eigenvalues of ∇²_θ f(θ_t) are real numbers. Therefore, the stability condition reduces to ∇²_θ f(θ_t) being a positive definite matrix. As defined in section 3.1, this implies that the stationary point is a minimum of the function f. Based on the intuition from the minimization problem, one might jump to a quick conclusion and claim that, since we do simultaneous minimization/maximization on two individual problems, the gradient iterates of the saddle point problem yield a locally optimal saddle point. We will show in the upcoming sections that this claim is not true, and outline the need for a more complex analysis of the dynamics of the gradient iterates for the saddle point problem.

5.3 Undesired Stable Points

As explained in the previous section, a convergent sequence of gradient iterations on the minimization problem (Eq. 5.24) almost surely yields a solution that locally minimizes the function value. This is a desirable property of the optimizer, because (with random initialization) we are guaranteed to obtain a solution to the posed minimization problem. However, this section will show that the gradient dynamics for the saddle point problem do not share the same property, and therefore, we cannot be certain that a convergent series yields a solution to the problem. It is already known that every locally optimal saddle point is a stable stationary point of the gradient dynamics in Eq. (4.1) [18, 20]. However, the gradient dynamics might introduce additional stable points that are not locally optimal saddle points. This fact is in clear contrast to GD for minimization, where the set of stable stationary points of GD is the same as the set of local minima. Next, we use an example to illustrate our claim.

Example Consider the function

f(x, y) = 2x^2 + y^2 + 4xy + \frac{4}{3}y^3 - \frac{1}{4}y^4    (5.26)

with x, y ∈ R.¹ The saddle point problem defined with this function is given by

\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}} f(x, y)    (5.27)

The critical points of the function, i.e., points for which ∇f(x, y) = 0, are

z_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad z_1 = \begin{bmatrix} -2-\sqrt{2} \\ 2+\sqrt{2} \end{bmatrix}, \quad z_2 = \begin{bmatrix} -2+\sqrt{2} \\ 2-\sqrt{2} \end{bmatrix}    (5.28)

Evaluating the Hessian at the three critical points gives rise to the following three matrices:

H(z_0) = \begin{bmatrix} 4 & 4 \\ 4 & 2 \end{bmatrix}, \quad H(z_1) = \begin{bmatrix} 4 & 4 \\ 4 & -4\sqrt{2} \end{bmatrix}, \quad H(z_2) = \begin{bmatrix} 4 & 4 \\ 4 & 4\sqrt{2} \end{bmatrix}    (5.29)

We see that only z_1 is a locally optimal saddle point, because ∂²_x f(z_1) = 4 > 0 and ∂²_y f(z_1) = −4√2 < 0 (cf. lemma 3.3), whereas the two other points are both a local minimum in the parameter y, rather than a maximum. Figure 5.1 (a) illustrates gradient steps converging to the undesired stationary point z_0. Due to the stability of this point, even a small perturbation of the gradient in each step cannot avoid convergence to it (see Figure 5.1 (b)).

¹ To guarantee smoothness, one can restrict the domain of f to a bounded set.

(a) The background shows a contour plot of the gradient norm ‖∇f(x, y)‖. (b) Optimization trajectory when adding Gaussian noise from N(0, σ) to the gradient in every step.

Figure 5.1: Optimization trajectory of the gradient method (4.1) on the function in Eq. (5.26) with a small constant step size η. The method converges to the critical point (0, 0), even though it is not a locally optimal saddle point and, therefore, not a solution to the problem defined in (5.27).
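The claims of this example are easy to verify numerically. The following NumPy sketch (the step size is an arbitrary illustrative choice) evaluates the gradient, the Hessian blocks, and the eigenvalues of the Jacobian of the gradient descent-ascent map at the three critical points of Eq. (5.26).

```python
import numpy as np

# f(x, y) = 2x^2 + y^2 + 4xy + 4/3 y^3 - 1/4 y^4  (Eq. 5.26)
def grad(x, y):
    return np.array([4*x + 4*y, 2*y + 4*x + 4*y**2 - y**3])

def hess(x, y):
    return np.array([[4.0, 4.0], [4.0, 2.0 + 8*y - 3*y**2]])

eta = 0.01
points = [(0.0, 0.0), (-2 - np.sqrt(2), 2 + np.sqrt(2)), (-2 + np.sqrt(2), 2 - np.sqrt(2))]
for (x, y) in points:
    H = hess(x, y)
    # Jacobian of the GDA map (4.1): descent in x (first row), ascent in y (second row).
    J = np.eye(2) + eta * np.array([[-H[0, 0], -H[0, 1]], [H[1, 0], H[1, 1]]])
    stable = np.all(np.abs(np.linalg.eigvals(J)) < 1)
    locally_optimal = H[0, 0] > 0 and H[1, 1] < 0
    print((round(x, 3), round(y, 3)), np.round(grad(x, y), 10), stable, locally_optimal)
# z_0 = (0, 0) comes out stable but not locally optimal: the undesired stable point of this section.
```

The output reproduces the discussion above: z_1 is both stable and locally optimal, z_2 is unstable, and z_0 is stable despite not being a solution to problem (5.27).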

Chapter 6 Curvature Exploitation

A recent approach for solving non-convex minimization problems, such as (5.7), involves curvature exploitation [7, 27]. The main idea behind this method is to follow, if possible, the negative curvature of the function. To get an intuition on why this might be a useful direction, consider first an ordinary gradient-based minimization. It ensures convergence to a first-order stationary point, i.e., a point for which the gradient vanishes. However, to find a solution to the minimization problem (5.7), we require second-order stationarity, i.e., the Hessian of the function is positive definite (cf. the section on critical points). By following the negative curvature, on the other hand, it is possible to escape from stationary points that are not local minima. Interestingly, we will show in this chapter how we can use the approach of extreme curvature exploitation not to escape from saddle points, but to find locally optimal saddle points.

6.1 Curvature Exploitation for the Saddle Point Problem

The example in section 5.3 has shown that gradient iterations on the saddle point problem introduce undesired stable points. In this section, we propose a strategy to escape from these stationary points. Our approach is based on extreme curvature exploitation [7].

Extreme curvature direction Let (λ_θ, v_θ) be the minimum eigenvalue of ∇²_θ f(z) with its associated eigenvector, and (λ_ϕ, v_ϕ) be the maximum

eigenvalue of ∇²_ϕ f(z) with its associated eigenvector. Then, we define

v^{(-)}_z = \begin{cases} \frac{\lambda_\theta}{2\rho_\theta} \operatorname{sgn}(v_\theta^\top \nabla_\theta f(z))\, v_\theta & \text{if } \lambda_\theta < 0 \\ 0 & \text{otherwise} \end{cases}    (6.1)
v^{(+)}_z = \begin{cases} \frac{\lambda_\phi}{2\rho_\phi} \operatorname{sgn}(v_\phi^\top \nabla_\phi f(z))\, v_\phi & \text{if } \lambda_\phi > 0 \\ 0 & \text{otherwise} \end{cases}    (6.2)
v_z = (v^{(-)}_z, v^{(+)}_z)    (6.3)

where sgn : R → {−1, 1} is the sign function. We call the vector v_z the extreme curvature direction at z.

Algorithm Using the extreme curvature direction, we modify the gradient steps to obtain our new update as

\begin{bmatrix} \theta_{t+1} \\ \phi_{t+1} \end{bmatrix} = \begin{bmatrix} \theta_t \\ \phi_t \end{bmatrix} + v_{z_t} + \eta \begin{bmatrix} -\nabla_\theta f(\theta_t, \phi_t) \\ \nabla_\phi f(\theta_t, \phi_t) \end{bmatrix}    (6.4)

The new update is constructed by adding the extreme curvature direction to the gradient method of Eq. (4.1). From now on we will refer to this modified update as the CESP method (curvature exploitation for the saddle point problem).

6.2 Theoretical Analysis

Stability Extreme curvature exploitation has already been used for escaping from unstable stationary points (i.e., saddle points) of gradient descent for minimization problems. In saddle point problems, curvature exploitation is advantageous not only for escaping from unstable stationary points but also for escaping from undesired stable stationary points. In the next lemma, we prove that all stationary points of the CESP method are locally optimal saddle points.

Lemma 6.1 The point z* := (θ, ϕ) is a stationary point of the iterates in Eq. (6.4) if and only if z* is a locally optimal saddle point.

Proof The point z* := (θ, ϕ) is a stationary point of the iterates if and only if v_{z*} + η(−∇_θ f(z*), ∇_ϕ f(z*)) = 0. Let's consider w.l.o.g. only the stationary point condition with respect to θ, i.e.

v^{(-)}_{z^*} = \eta \nabla_\theta f(z^*)    (6.5)

We prove that the above equation holds only if ∇_θ f(z*) = v^{(−)}_{z*} = 0. This can be proven by a simple contradiction: suppose that ∇_θ f(z*) ≠ 0; then multiplying both sides of the above equation by ∇_θ f(z*)ᵀ yields

\underbrace{\frac{\lambda_\theta}{2\rho_\theta} \operatorname{sgn}(v_\theta^\top \nabla_\theta f(z^*))\, v_\theta^\top \nabla_\theta f(z^*)}_{<\,0} = \eta \underbrace{\|\nabla_\theta f(z^*)\|^2}_{>\,0}    (6.6)

Since the left-hand side is negative and the right-hand side is positive, the above equation leads to a contradiction. Therefore, ∇f(z*) = 0 and v_{z*} = 0. This means that λ_θ ≥ 0 and λ_ϕ ≤ 0, and therefore, according to lemma 3.3, z* is a locally optimal saddle point.

From lemma 6.1, we know that every stationary point of the CESP dynamics is a locally optimal saddle point. The next lemma establishes a stability condition for the iterates, namely that every locally optimal saddle point is a stable point.

Lemma 6.2 Suppose that assumptions 3.4 and 3.5 hold. Let z* := (θ*, ϕ*) be a locally optimal saddle point, i.e.,

\nabla f(z^*) = 0, \quad \nabla^2_\theta f(z^*) \succeq \mu_\theta I, \quad \nabla^2_\phi f(z^*) \preceq -\mu_\phi I, \quad (\mu_\theta, \mu_\phi > 0)    (6.7)

Then the iterates of Eq. (6.4) are stable in K_γ as long as

\gamma \le \min\{\mu_\theta/(\sqrt{2}\rho_\theta), \; \mu_\phi/(\sqrt{2}\rho_\phi)\}    (6.8)

Proof The proof is based on a simple idea: in a K_γ neighbourhood of a locally optimal saddle point, f cannot have extreme curvature, i.e., v_z = 0. Hence, within K_γ the update of Eq. (6.4) reduces to the gradient update in Eq. (4.1), which is stable according to [20, 18]. To prove our claim that negative curvature does not exist in K_γ, we make use of the smoothness assumption. Suppose that z := (θ, ϕ) ∈ K_γ; then the smoothness assumption 3.4 implies

\nabla^2_\theta f(z) = \nabla^2_\theta f(z^*) - (\nabla^2_\theta f(z^*) - \nabla^2_\theta f(z))    (6.9)
\succeq \nabla^2_\theta f(z^*) - \rho_\theta \|z - z^*\| I    (6.10)
\succeq \nabla^2_\theta f(z^*) - \sqrt{2}\rho_\theta \gamma I    (6.11)
\succeq (\mu_\theta - \sqrt{2}\rho_\theta \gamma) I    (6.12)
\succ 0 \qquad [\gamma < \mu_\theta/(\sqrt{2}\rho_\theta)]    (6.13)

Similarly, one can show that

\nabla^2_\phi f(z) \prec 0 \qquad [\gamma < \mu_\phi/(\sqrt{2}\rho_\phi)].    (6.14)

Therefore, the extreme curvature direction is zero according to the definition in Eq. (6.1).
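To see Lemma 6.1 in action on the toy problem of section 5.3, the sketch below evaluates the extreme curvature direction of Eqs. (6.1)–(6.3) near the undesired stable point (0, 0) and at the locally optimal saddle point. Here x plays the role of θ and y the role of ϕ; the constants ρ_θ, ρ_ϕ are treated as tunable hyperparameters and are set to 1 purely for illustration.

```python
import numpy as np

rho_theta = rho_phi = 1.0   # Hessian-Lipschitz constants, set to 1 only for illustration

def extreme_curvature_direction(g, H):
    """v_z of Eq. (6.3) for the scalar-x / scalar-y toy problem of Eq. (5.26)."""
    lam_theta, v_theta = H[0, 0], 1.0      # 1x1 theta-block: trivial eigenpair
    lam_phi,   v_phi   = H[1, 1], 1.0      # 1x1 phi-block
    v_minus = lam_theta / (2 * rho_theta) * np.sign(v_theta * g[0]) * v_theta if lam_theta < 0 else 0.0
    v_plus  = lam_phi   / (2 * rho_phi)   * np.sign(v_phi   * g[1]) * v_phi   if lam_phi   > 0 else 0.0
    return np.array([v_minus, v_plus])

grad = lambda x, y: np.array([4*x + 4*y, 2*y + 4*x + 4*y**2 - y**3])
hess = lambda x, y: np.array([[4.0, 4.0], [4.0, 2.0 + 8*y - 3*y**2]])

for (x, y) in [(0.01, 0.01), (-2 - np.sqrt(2), 2 + np.sqrt(2))]:
    print((round(x, 3), round(y, 3)), extreme_curvature_direction(grad(x, y), hess(x, y)))
# Near (0, 0) the phi-component of v_z is bounded away from zero, so the CESP update keeps
# moving; at the locally optimal saddle point both the gradient and v_z vanish (Lemma 6.1).
```

This mirrors the lemma: the only points where the CESP update can come to rest are locally optimal saddle points, whereas the undesired stable point of the plain gradient dynamics is no longer stationary.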

Guarantees for a convergent series Combining the statements of the previous two lemmas, we conclude that the set of locally optimal saddle points and the set of stable stationary points of our new method are the same. As a consequence, we prove that a convergent series of our method is guaranteed to reach a locally optimal saddle point.

Corollary 6.3 Suppose that assumptions 3.4 and 3.5 hold. Then, a convergent series of the iterates of Eq. (6.4) yields a locally optimal saddle point.

Proof It is a direct consequence of lemmas 6.1 and 6.2.

Guaranteed Improvement The gradient update step of Eq. (4.1) has the property that an update with respect to θ (ϕ) decreases (increases) the function value. The next lemma proves that the modified method of Eq. (6.4) shares the same desirable property.

Lemma 6.4 In each iteration of Eq. (6.4), f decreases in θ with

f(\theta_{t+1}, \phi_t) \le f(\theta_t, \phi_t) - (\eta/2)\|\nabla_\theta f(z_t)\|^2 + \lambda_\theta^3/(24\rho_\theta^2),    (6.15)

and increases in ϕ with

f(\theta_t, \phi_{t+1}) \ge f(\theta_t, \phi_t) + (\eta/2)\|\nabla_\phi f(z_t)\|^2 + \lambda_\phi^3/(24\rho_\phi^2),    (6.16)

as long as the step size is chosen as

\eta \le \min\Big\{ \frac{1}{2 L_\theta l_\theta \rho_\theta}, \; \frac{1}{2 L_\phi l_\phi \rho_\phi} \Big\}    (6.17)

Proof The Lipschitzness of the Hessian (Assumption 3.4) implies that for Δ ∈ R^n

\Big| f(\theta + \Delta, \phi) - f(\theta, \phi) - \nabla_\theta f(\theta, \phi)^\top \Delta - \tfrac{1}{2} \Delta^\top \nabla^2_\theta f(\theta, \phi) \Delta \Big| \le \frac{\rho_\theta}{6} \|\Delta\|^3    (6.18)

holds. The update in Eq. (6.4) for θ is given by Δ = αv − η∇_θ f(θ, ϕ), where α = −λ/(2ρ_θ) ≥ 0 (and we assume w.l.o.g. that vᵀ∇_θ f(θ, ϕ) < 0) and v is the eigenvector associated with the minimum eigenvalue λ of the Hessian matrix ∇²_θ f(θ_t, ϕ_t). In the following, we use the shorter notation ∇f := ∇_θ f(θ, ϕ) and H := ∇²_θ f(θ, ϕ). We can construct a bound on the left-hand side of Eq. (6.18) as

f(\theta + \Delta, \phi) - f(\theta, \phi) \ge \nabla f^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta - \frac{\rho_\theta}{6}\|\Delta\|^3    (6.19)
f(\theta + \Delta, \phi) - f(\theta, \phi) \le \nabla f^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta + \frac{\rho_\theta}{6}\|\Delta\|^3    (6.20)

Inserting Δ = αv − η∇f and using Hv = λv, ‖v‖ = 1 as well as ∇fᵀH∇f ≤ L_θ‖∇f‖² gives

\nabla f^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta \le \alpha v^\top \nabla f - \big(\eta - \tfrac{\eta^2}{2} L_\theta\big)\|\nabla f\|^2 + \tfrac{1}{2}\alpha^2 \lambda - \alpha\eta\lambda v^\top \nabla f,    (6.21)

which leads to the following inequality:

f(\theta + \Delta, \phi) - f(\theta, \phi)    (6.22)
\le \alpha v^\top \nabla f - \big(\eta - \tfrac{\eta^2}{2} L_\theta\big)\|\nabla f\|^2 + \tfrac{1}{2}\alpha^2 \lambda - \alpha\eta\lambda v^\top \nabla f + \frac{\rho_\theta}{6}\|\Delta\|^3    (6.23)
\le \tfrac{1}{2}\alpha^2 \lambda - \big(\eta - \tfrac{\eta^2}{2} L_\theta\big)\|\nabla f\|^2 + \frac{\rho_\theta}{6}\|\Delta\|^3,    (6.24)

where the last step uses vᵀ∇f < 0, λ < 0 and α ≥ 0. By using the triangle inequality, we obtain the following bound on the cubic term:

\|\Delta\|^3 \le (\eta\|\nabla f\| + \alpha\|v\|)^3    (6.25)
\le 4\eta^3\|\nabla f\|^3 + 4\alpha^3\|v\|^3    (6.26)
= 4\eta^3\|\nabla f\|^3 + 4\alpha^3.    (6.27)

Replacing this bound in the upper bound of Eq. (6.24) yields

f(\theta + \Delta, \phi) - f(\theta, \phi) \le \tfrac{1}{2}\alpha^2\lambda - \big(\eta - \tfrac{\eta^2}{2} L_\theta\big)\|\nabla f\|^2 + \frac{\rho_\theta}{6}\big(4\eta^3\|\nabla f\|^3 + 4\alpha^3\big)    (6.28)

The choice of α = −λ/(2ρ_θ) leads to further simplification of the above bound:

f(\theta + \Delta, \phi) - f(\theta, \phi) \le \frac{\lambda^3}{24\rho_\theta^2} - \eta\Big(1 - \frac{\eta}{2} L_\theta - \frac{2}{3}\rho_\theta \eta^2 l_\theta\Big)\|\nabla f\|^2    (6.29)

Now, we choose the step size η such that

1 - \frac{\eta}{2} L_\theta - \frac{2}{3}\rho_\theta \eta^2 l_\theta \ge \frac{1}{2}    (6.30)

For η ≤ 1/(2L_θ ρ_θ l_θ), the above inequality holds. Therefore, the following decrease in the function value is guaranteed:

f(\theta + \Delta, \phi) - f(\theta, \phi) \le \lambda^3/(24\rho_\theta^2) - (\eta/2)\|\nabla f\|^2    (6.31)

Similarly, one can derive the lower bound for the function increase in ϕ.

6.3 Efficient Implementation

Hessian-Vector Product

Storing and computing the Hessian in high dimensions is very costly; therefore, we need to find a way to efficiently extract the extreme curvature direction. The most prominent approach for obtaining the eigenvector corresponding to the largest absolute eigenvalue (and the eigenvalue itself) of ∇²_θ f(z) is power iterations on I − β∇²_θ f(z), i.e.,

v_{t+1} = (I - \beta \nabla^2_\theta f(z)) v_t    (6.32)

where v_{t+1} is normalized after every iteration and β > 0 is chosen such that I − β∇²_θ f(z) ⪰ 0. Since this method only requires implicit Hessian computation through a Hessian-vector product, it can be implemented about as efficiently as gradient evaluations [23]. The results of [13] imply that, for the case of λ_min(∇²_θ f(z)) ≤ −γ, we need at most (L_θ/γ) log(k/δ²) iterations of the power method to find a vector v̂ such that v̂ᵀ∇²_θ f(z)v̂ ≤ −γ/2 with probability 1 − δ (cf. [14]). The pseudo code in Algorithm 2 shows an efficient realization of the CESP method using the Hessian-vector product.

Implicit Curvature Exploitation

Recent results on the non-convex minimization problem have shown that perturbed gradient steps with isotropic noise can approximately move along the extreme curvature direction [11]. Suppose that ∇_θ f(θ_0, ϕ) = 0 and λ_min(∇²_θ f(θ_0, ϕ)) < 0. Perturbed gradient steps can be written as

\theta_1 = \theta_0 + r\xi    (6.33)
\theta_{t+1} = \theta_t - \eta \nabla_\theta f(\theta_t, \phi),    (6.34)

where ξ is drawn uniformly from the unit sphere. There is a choice for r, the noise term ξ, and the step size η such that θ_T exploits the negative curvature of ∇²_θ f(θ_0, ϕ). In other words, one can naturally exploit the negative curvature by perturbation of parameters and gradient steps on θ and ϕ. Nonetheless, this approach requires a very careful adjustment of hyperparameters and the number of iterations. In our experiments, we only use Hessian-vector products to approximate the extreme curvature direction. The source code for an implementation of this algorithm in TensorFlow is available on GitHub.
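The sketch below is a minimal NumPy version of the power iteration in Eq. (6.32). It approximates the Hessian-vector product by a central finite difference of the gradient, which is one way to avoid forming the Hessian; the thesis's own implementation relies on exact Hessian-vector products in TensorFlow instead, so this is an illustrative stand-in, not the reference implementation.

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    """Hessian-vector product H v via central finite differences of the gradient."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def min_eigenpair(grad_fn, x, beta=0.1, iters=200, seed=0):
    """Power iteration on (I - beta * H), Eq. (6.32): converges to the eigenvector of H
    with the smallest eigenvalue, provided beta is small enough that I - beta*H >= 0."""
    v = np.random.default_rng(seed).normal(size=x.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = v - beta * hvp(grad_fn, x, v)
        v /= np.linalg.norm(v)
    lam = v @ hvp(grad_fn, x, v)          # Rayleigh quotient: the associated eigenvalue
    return lam, v

# Example: gradient of g(x) = 0.5 * x^T diag(3, -2) x, whose Hessian has eigenvalues {3, -2}.
grad_fn = lambda x: np.array([3.0, -2.0]) * x
lam, v = min_eigenpair(grad_fn, np.zeros(2))
print(lam, v)   # ≈ -2 and the corresponding unit eigenvector (0, ±1)
```

Algorithm 2 below uses the same power-iteration idea, with an additional sign correction and a cutoff that returns a zero direction whenever no negative curvature is present.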

Algorithm 2 Efficient CESP implementation
Require: smooth f, initial parameters (θ, ϕ), β, ρ_θ, ρ_ϕ
 1: function MinimumEigenvalue(function f, parameters x)
 2:   v_0 ← random vector in R^{dim(x)} with unit length
 3:   for i in 1, ..., k do
 4:     v_i ← (I − β∇²_x f) v_{i−1}
 5:     v_i ← v_i / ‖v_i‖
 6:   end for
 7:   v ← v_k · sgn(v_k^⊤ ∇_x f)
 8:   λ ← v^⊤ ∇²_x f v
 9:   if λ ≥ 0 then
10:     λ ← 0, v ← 0
11:   end if
12:   return v, λ
13: end function
14: for t in 1, ..., T do
15:   v_θ, λ_θ ← MinimumEigenvalue(f(θ_t, ϕ_t), θ_t)
16:   v_ϕ, λ_ϕ ← MinimumEigenvalue(−f(θ_t, ϕ_t), ϕ_t)
17:   θ_{t+1} ← θ_t + (λ_θ/(2ρ_θ)) v_θ − η∇_θ f(θ_t, ϕ_t)
18:   ϕ_{t+1} ← ϕ_t − (λ_ϕ/(2ρ_ϕ)) v_ϕ + η∇_ϕ f(θ_t, ϕ_t)
19: end for

6.4 Curvature Exploitation for linear-transformed Gradient Steps

As described in section 4.2, it is common practice to use linear-transformed gradient update steps as an optimization procedure for the saddle point problem. This gives rise to optimization methods like Adagrad that have been shown to accelerate learning. In this section, we propose a modified CESP method that adds the extreme curvature direction to the linear-transformed gradient update of Eq. (4.3), and prove that it keeps the same desirable properties concerning its stable points. We can adapt our CESP method to a linear-transformed variant as

    [θ_{t+1}; ϕ_{t+1}] = [θ_t; ϕ_t] + v_{z_t} + η A_{θ_t,ϕ_t} [−∇_θ f(θ_t, ϕ_t); ∇_ϕ f(θ_t, ϕ_t)],   (6.35)

where we choose the linear transformation matrix A_{θ_t,ϕ_t} to be positive definite. Even for this adapted method we can show that it filters out the undesired stable stationary points of the gradient method for the saddle point problem. Moreover, we can show that a convergent series enjoys the same guarantees as its non-transformed counterpart, namely that the set

of its stable stationary points and the set of locally optimal saddle points are the same.

Lemma 6.5 A point z is a stationary point of the linear-transformed CESP update in Eq. (6.35) if and only if z is a locally optimal saddle point.

Proof It follows directly from Lemma 6.1 and the positive definiteness of the linear transformation matrix that all stationary points of the linear-transformed CESP update in Eq. (6.35) are locally optimal saddle points. □

Lemma 6.6 Every locally optimal saddle point is a stable stationary point of the linear-transformed CESP update in Eq. (6.35).

Proof Let us consider some locally optimal saddle point z* := (θ*, ϕ*) in the sense of Definition 1.1. From Lemma 3.3 it follows that

    ∇²_θ f(θ*, ϕ*) ⪰ µ_θ I,   ∇²_ϕ f(θ*, ϕ*) ⪯ −µ_ϕ I   (6.36)

for µ_θ, µ_ϕ > 0. As a direct consequence, the extreme curvature direction is zero, i.e., v_{z*} = 0. Hence, the Jacobian of the update in Eq. (6.35) is given by

    I + η A_{z*} [ −∇²_θ f(z*)   −∇_θϕ f(z*) ; ∇_ϕθ f(z*)   ∇²_ϕ f(z*) ].   (6.37)

The point z* is a stable point of the dynamics if all eigenvalues of its Jacobian lie within the unit sphere. With a sufficiently small step size, this condition can be fulfilled if and only if all real parts of the eigenvalues of J(z*) are negative. Hence, to prove stability of the update for a locally optimal saddle point z*, we have to show that the following expression is a Hurwitz matrix:

    J(z*) = A_{z*} [ −∇²_θ f(z*)   −∇_θϕ f(z*) ; ∇_ϕθ f(z*)   ∇²_ϕ f(z*) ] =: A H.   (6.38)

Since A is a positive definite matrix, we can construct its square root A^{1/2} such that A = A^{1/2} A^{1/2}. The matrix product AH can be re-written as

    AH = A^{1/2} (A^{1/2} H A^{1/2}) A^{−1/2}.   (6.39)

Since we multiply the matrix J̃ := A^{1/2} H A^{1/2} from the left by the inverse of the matrix by which we multiply from the right, we

can observe that J̃ has the same eigenvalues as J(z*). The symmetric part of J̃ is given by

    ½ (J̃ + J̃^⊤) = ½ (A^{1/2} H A^{1/2} + A^{1/2} H^⊤ A^{1/2}) = ½ A^{1/2} (H + H^⊤) A^{1/2}   (6.40)
    = A^{1/2} [ −∇²_θ f(z*)   0 ; 0   ∇²_ϕ f(z*) ] A^{1/2}.   (6.41)

From assumption (6.36) it follows that the block diagonal matrix ½(H + H^⊤) is a symmetric, negative definite matrix, and therefore x^⊤(J̃ + J̃^⊤)x < 0 holds for any nonzero x ∈ R^{n+m}. The remaining part of the proof follows the argument of Theorem 3.6 in [4]. Let (λ, v) be an eigenpair of J̃. Then the following two equalities hold:

    v^* J̃ v = λ,   (6.42)
    (v^* J̃ v)^* = v^* J̃^⊤ v = λ̄.   (6.43)

Therefore, we can re-write the real part of the eigenvalue λ as

    Re(λ) = (λ + λ̄)/2 = ½ v^* (J̃ + J̃^⊤) v.   (6.44)

By observing that

    v^* (J̃ + J̃^⊤) v = Re(v)^⊤ (J̃ + J̃^⊤) Re(v) + Im(v)^⊤ (J̃ + J̃^⊤) Im(v)   (6.45)

is a real, negative quantity, we can be sure that the real part of any eigenvalue of J̃ is negative. Therefore, it directly follows that, with a sufficiently small step size η > 0, any locally optimal saddle point z* is a stable stationary point of the linear-transformed update method in Eq. (6.35). □

Corollary 6.7 The set of locally optimal saddle points as defined in 1.1 and the set of stable points of the linear-transformed CESP update method in Eq. (6.35) are the same.

Proof It is a direct consequence of Lemmas 6.5 and 6.6.

6.5 Experiments

Escaping from Undesired Stationary Points of the Toy-Example

Previously, we saw that for the two-dimensional saddle point problem on the function of Eq. (5.26), gradient iterates might converge to an undesired stationary point that is not locally optimal.

[Figure 6.1: Comparison of GD and the new CESP method on the function in Eq. (5.26). Plots (a) and (b) show the two trajectories from the fixed starting point (−3, 1); plots (c) and (d) show the basin of attraction of the locally optimal saddle point (green area) and of the undesired saddle point (red area). The background shows the vector field of the extreme curvature, as defined in Eq. (6.1); note that for the function in Eq. (5.26) the curvature in the x-dimension is constant positive, and therefore the corresponding component of v_z is always zero.]

Figure 6.1 illustrates that CESP solves this issue. In this example, simultaneous gradient iterates converge to the undesired stationary point z_0 = (0, 0) for many different initialization parameters, whereas our method always converges to the locally optimal saddle point. The source code of this experiment is available on GitHub.
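For completeness, a compact NumPy sketch of the CESP iteration used in this comparison is given below. It assumes the problem is low-dimensional enough to form the exact block Hessians; grad_f and hess_f are assumed callables, and the particular function of Eq. (5.26) is not reproduced here.

    import numpy as np

    def extreme_curvature_step(H, grad, rho):
        """Curvature step (lambda / (2*rho)) * v for the most negative eigenvalue of H,
        with the eigenvector sign aligned with the gradient (cf. Algorithm 2, line 7).
        Returns the zero vector if H has no negative eigenvalue."""
        lam, V = np.linalg.eigh(H)
        if lam[0] >= 0.0:
            return np.zeros(H.shape[0])
        v = V[:, 0]
        if v @ grad < 0.0:
            v = -v
        return lam[0] / (2.0 * rho) * v

    def cesp(theta, phi, grad_f, hess_f, eta=0.05, rho=1.0, steps=1000):
        """Simultaneous CESP iterations of Eq. (6.4). grad_f returns (grad_theta, grad_phi),
        hess_f returns the block Hessians (H_theta, H_phi) at the current point."""
        for _ in range(steps):
            g_th, g_ph = grad_f(theta, phi)
            H_th, H_ph = hess_f(theta, phi)
            theta = theta + extreme_curvature_step(H_th, g_th, rho) - eta * g_th
            # the phi block applies the same exploitation to -f (ascent on f)
            phi = phi - extreme_curvature_step(-H_ph, -g_ph, rho) + eta * g_ph
        return theta, phi

Both parameter blocks are updated from the gradients and Hessians evaluated at the same iterate, mirroring the simultaneous scheme analyzed above.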

Generative Adversarial Networks

This experiment evaluates the performance of the CESP method on a Generative Adversarial Network (GAN), which reduces to the saddle point problem

    min_θ max_ϕ f(θ, ϕ) = E_{x∼p_d}[log D_θ(x)] + E_{z∼p_z}[log(1 − D_θ(G_ϕ(z)))],   (6.46)

where the functions D: R^{d_x} → [0, 1] and G: R^{d_z} → R^{d_x} are neural networks, parameterized with the variables θ and ϕ, respectively. We use the MNIST data set and a simple GAN architecture, described in Table 6.1. To accelerate training we use the linear-transformed CESP method, where the transformation matrix corresponds to the Adagrad update.

Table 6.1: Parameters of the GAN model.
    Parameter                               Discriminator   Generator
    Input Dimension                         -               -
    Hidden Layers                           1               1
    Hidden Units / Layer                    -               -
    Activation Function                     Leaky ReLU      Leaky ReLU
    Output Dimension                        -               -
    Batch Size                              1000
    Learning Rate η                         0.05
    Learning Rate α := 1/(2ρ_θ) = 1/(2ρ_ϕ)  0.01

On this objective, we evaluate the influence of the negative curvature step of our new method. In particular, we compare the plain Adagrad optimizer (GD) with the newly proposed linear-transformed CESP method with regard to the spectrum of f at a convergent solution z*. Note that we are interested in any solution that gives rise to a locally optimal saddle point, as defined in 1.1. In this regard, we track the smallest eigenvalue of ∇²_θ f(z*) and the largest eigenvalue of ∇²_ϕ f(z*) throughout the optimization. If the former is positive and the latter negative, the optimization procedure has achieved its goal and found a locally optimal saddle point. The results are shown in Figure 6.2.

[Figure 6.2: The first two plots show the minimum eigenvalue of ∇²_θ f(θ, ϕ) and the maximum eigenvalue of ∇²_ϕ f(θ, ϕ), respectively. The third plot shows ‖∇f(z_t)‖². The transparent graph shows the original values, whereas the solid graph is smoothed with a Gaussian filter.]

The decrease in the squared norm of the gradient indicates that both methods converge to a solution. Moreover, both fulfill the condition for a locally optimal saddle point in the parameter ϕ, i.e., the maximum eigenvalue of ∇²_ϕ f(z*) is negative. However, the graph of the minimum eigenvalue of ∇²_θ f(z*) shows that the CESP method converges faster, and with less frequent and severe spikes, to a solution where the minimum eigenvalue is zero. Hence, the negative curvature step seems to be able to drive the optimization procedure to regions that yield points closer to a locally optimal saddle point.

Using two individual loss functions. It is common practice in GAN training not to consider the saddle point problem as defined in Eq. (6.46), but rather to split the training into two individual optimization problems over different functions (cf. section 2.2.1). In particular, one usually considers

    min_θ f_1(θ, ϕ) = E_{z∼p_z}[log D_θ(G_ϕ(z))],   (6.47)
    max_ϕ f_2(θ, ϕ) = E_{x∼p_d}[log D_θ(x)] + E_{z∼p_z}[log(1 − D_θ(G_ϕ(z)))].   (6.48)
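As a purely illustrative aid, the following NumPy snippet evaluates empirical minibatch estimates of the two objectives in Eqs. (6.47) and (6.48) from discriminator outputs. The array names and the clipping constant are assumptions, and the networks themselves are not reproduced here.

    import numpy as np

    def gan_losses(d_real, d_fake, eps=1e-8):
        """Minibatch estimates of f_1 (Eq. 6.47) and f_2 (Eq. 6.48).

        d_real : array of discriminator outputs D(x) on real data
        d_fake : array of discriminator outputs D(G(z)) on generated samples
        """
        f1 = np.mean(np.log(d_fake + eps))
        f2 = np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
        return f1, f2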

[Figure 6.3: The first two plots show the minimum eigenvalue of ∇²_θ f_1(θ, ϕ) and the maximum eigenvalue of ∇²_ϕ f_2(θ, ϕ), respectively. The third plot shows ‖∇f(z_t)‖². The transparent graph shows the original values, whereas the solid graph is smoothed with a Gaussian filter.]

Our CESP optimization method is defined individually for the two parameter sets θ and ϕ and can therefore also be applied to a setting with two individual objectives. Figure 6.3 shows the results on the GAN problem trained with two individual losses. In this experiment, CESP decreases the negative curvature of ∇²_θ f_1, while the gradient method cannot exploit the negative curvature properly (the negative curvature oscillates throughout the iterations). More details on the parameters of the experiments are provided in the accompanying table.

Robust Optimization

Setup. The approach of Robust Optimization [3], as defined in section 2.3, gives rise to a general saddle point problem. Even though robust optimization is often formulated as a convex-concave saddle point problem, we

consider it on a neural network that does not fulfill this assumption. The optimization problem that we target here is an application of robust optimization to empirical risk minimization [21], namely solving

    min_θ sup_P { f(X; θ, P) = E_P[ℓ(X; θ)] : D(P ‖ P̂_n) ≤ ρ/n },   (6.49)

where ℓ(X; θ) denotes the cost function to minimize, X the data, and D(P ‖ P̂_n) a divergence measure between the true data distribution P and the empirical data distribution P̂_n. We use this framework on the Wisconsin breast cancer data set [16], which is a binary classification task with 30 attributes and 569 samples. Our classification model is chosen to be a Multilayer Perceptron with the following prediction formula:

    ŷ(x) = σ(W_2 σ(W_1 x + b_1) + b_2).   (6.50)

The non-convex sigmoid activation function σ: R → (0, 1) is applied elementwise. The weights and biases of the layers are the parameters of the model, i.e.,

    θ = { W_1 ∈ R^{2×30}, b_1 ∈ R², W_2 ∈ R^{1×2}, b_2 ∈ R }.   (6.51)

The robust loss function is given by

    ℓ(X; θ, P) = −∑_{i=1}^n p_i [ y_i log(ŷ(x_i)) + (1 − y_i) log(1 − ŷ(x_i)) ],   (6.52)

where x_i and y_i correspond to the i-th data point and label, respectively. The values of P = {p_i}_{i=1}^n form a trainable, not necessarily normalized distribution over which we maximize the training loss. To enforce that the distance between the normalized version of P and the empirical distribution P̂_n = {1/n}_{i=1}^n is bounded, we add the following regularization term:

    r(P) = ∑_{i=1}^n (p_i − 1/n)².   (6.53)

By subtracting the regularization r(P), scaled by λ > 0, from the robust loss ℓ(X; θ, P), we construct the optimization objective for this experiment as

    min_θ max_P ℓ(X; θ, P) − λ r(P).   (6.54)

The subtraction comes from the fact that the regularization depends only on P, and since we are maximizing over this parameter we need to subtract it from the loss.
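The objective of Eq. (6.54) is straightforward to evaluate; a minimal NumPy sketch is given below. The function names, array shapes, and the clipping constant are assumptions for illustration and do not reproduce the exact training code.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def predict(x, W1, b1, W2, b2):
        """Two-layer sigmoid MLP of Eq. (6.50) for a single input x."""
        return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

    def robust_objective(y_hat, y, p, lam, eps=1e-8):
        """Objective of Eq. (6.54): weighted cross-entropy loss (Eq. 6.52) minus the
        lambda-scaled regularizer of Eq. (6.53). Minimized over the model parameters,
        maximized over the weights p."""
        n = y.shape[0]
        ce = -(y * np.log(y_hat + eps) + (1.0 - y) * np.log(1.0 - y_hat + eps))
        loss = np.sum(p * ce)                    # robust loss l(X; theta, P)
        reg = np.sum((p - 1.0 / n) ** 2)         # r(P)
        return loss - lam * reg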

[Figure 6.4: The left plot shows the minimum eigenvalue of ∇²_θ f(X; θ, P) and the right plot the squared sum of gradients. The solid line shows the mean value over 10 runs with different initializations, whereas the shaded area indicates the 90th percentile.]

Results. Figure 6.4 shows the comparison of the gradient method (GD) and our CESP optimizer on this problem in terms of the minimum eigenvalue of ∇²_θ f(X; θ, P). Note that f is concave with respect to P, and therefore its Hessian with respect to P is constant and negative definite. The results indicate the anticipated behavior: CESP is able to more reliably drive a convergent series towards a solution where the minimum eigenvalue of ∇²_θ f(X; θ, P) is positive.


Chapter 7

Second-order Optimization

7.1 Introduction

In the previous chapter, we used extreme curvature exploitation to design an optimizer that provably avoids undesired stable points of gradient-based optimization. Hence, with the information about the most extreme curvature alone, we were already able to significantly improve the theoretical guarantees and practical performance of an optimizer on the saddle point problem. This naturally raises the question whether an optimization procedure can be further improved by considering additional curvature information. Taking this idea to the extreme means having access to the full Hessian, which leads us to the field of second-order optimization. In this chapter, we explore the properties of second-order methods for the saddle point problem, identify issues that arise in this context, and propose a modified Newton method for non convex-concave saddle point optimization.

Benefits of Second-order Optimization. Second-order optimization methods, like Newton's method, hold great potential for accelerating training in many learning tasks. By making use of the curvature information of the function, these methods can often find optima in fewer steps than their first-order counterparts. Even though the additional computational complexity these methods introduce makes them impractical for most contemporary deep learning tasks, the rapid growth of available computational power may pave the way for them to become the standard for general optimization tasks. Currently, the family of quasi-Newton methods is bridging the gap between first- and second-order methods. While technically still being (efficient) first-order methods, they approximate the curvature information and try to imitate the update of the second-order method. Hence, with the rise of quasi-Newton methods and the growing computational power of modern systems, the analysis of second-order methods

becomes an essential task. For solving the saddle point problem in practice, the conventional approach is to swap in different optimization schemes like GD, Adagrad, or Adam in a modular way, assuming that they are all capable of optimizing the function as intended. However, as already seen in previous chapters, this assumption is not true in general, and therefore optimizers need to be analyzed carefully regarding their suitability for the saddle point problem. This chapter shows that Newton's method suffers from two major problems, which can be circumvented by two individual modifications. The first problem is closely related to the attraction of Newton's method to saddle points in non-convex minimization [8]. While this can be solved with the Saddle Free Newton (SFN) approach from [8], the second problem is caused by the simultaneous optimization procedure itself and thus requires a new problem formulation. Instead of splitting the problem into two individual minimization problems, we need to apply the Newton method to the full objective to make sure that we obtain the best estimate of the update directions.

7.2 Dynamics of Newton's Method

We can apply Newton's method to the saddle point problem in a straightforward manner by using the simultaneous optimization approach. The particular update equations for this method are given in Table 4.1. Stacking the two equations together yields the optimization step in matrix form as

    [∆θ; ∆ϕ] = −η [ ∇²_θ f   0 ; 0   ∇²_ϕ f ]^{−1} [ ∇_θ f ; ∇_ϕ f ],   (7.1)

where we used the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of its blocks. With the help of the following lemma, we show that the simultaneous Newton update step shares a desirable property with Gradient Descent, namely that all locally optimal saddle points are stable stationary points of its dynamics [18, 20].

Lemma 7.1 Consider a locally optimal saddle point z* := (θ*, ϕ*). This point is a stable stationary point of the general update dynamics of Eq. (4.3) if A_t and B_t are positive definite matrices.

Proof It is a direct consequence of Lemma 6.6. □

As mentioned in Table 4.1, (∇²_θ f)^{−1} and −(∇²_ϕ f)^{−1} are positive definite matrices around any locally optimal saddle point. Therefore, we can apply the result of Lemma 7.1: every locally optimal saddle point is a stable stationary point of the update dynamics in Eq. (7.1).
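A minimal NumPy sketch of the simultaneous update in Eq. (7.1) is given below; the callables grads and hessians are assumptions standing in for whatever provides the block gradients and Hessians of f.

    import numpy as np

    def simultaneous_newton_step(theta, phi, grads, hessians, eta=1.0):
        """One step of Eq. (7.1): each parameter block is preconditioned by the inverse
        of its own Hessian, ignoring the cross terms of the full Hessian."""
        g_th, g_ph = grads(theta, phi)
        H_th, H_ph = hessians(theta, phi)      # block Hessians w.r.t. theta and phi only
        theta_new = theta - eta * np.linalg.solve(H_th, g_th)
        phi_new = phi - eta * np.linalg.solve(H_ph, g_ph)
        return theta_new, phi_new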

Avoiding Undesired Saddle Points

Minimizing non-convex functions with Newton's method is known to be problematic due to the prevalence of saddle points in high-dimensional error surfaces [8]. In this section, we establish a connection between this issue in non-convex minimization and the problem of converging to non-optimal saddle points in saddle point optimization. We follow the argument of [8] to evaluate the optimization step direction of Newton's method around critical points. We analyze a critical point (θ̄, ϕ̄) locally by using a second-order Taylor expansion, together with a re-parameterization coming from Morse's lemma. Because we are optimizing two functions separately with respect to different sets of parameters, the expansions are also done individually:

    f(θ̄ + ∆θ, ϕ̄) = f(θ̄, ϕ̄) + ½ ∑_{i=1}^n λ_θi (v_θi)²,   (7.2)
    f(θ̄, ϕ̄ + ∆ϕ) = f(θ̄, ϕ̄) + ½ ∑_{i=1}^m λ_ϕi (v_ϕi)²,   (7.3)

where λ_θi and λ_ϕi are the eigenvalues of ∇²_θ f(θ̄, ϕ̄) and ∇²_ϕ f(θ̄, ϕ̄), respectively. The values v_θi and v_ϕi are the updates of the parameters corresponding to motion along the eigenvectors of the respective Hessian, i.e.,

    v_θi = (e_θi)^⊤ ∆θ,   (7.4)
    v_ϕi = (e_ϕi)^⊤ ∆ϕ,   (7.5)

where e_θi and e_ϕi are the eigenvectors of ∇²_θ f(θ̄, ϕ̄) and ∇²_ϕ f(θ̄, ϕ̄), respectively. The way we defined the simultaneous Newton method (Table 4.1) makes the update of the parameters θ independent of the change in ϕ, and vice versa. Therefore, both individual updates re-scale the step along each eigenvector with the inverse of the corresponding eigenvalue. Using the Taylor expansions from above, this yields the following update steps along the respective eigenvectors:

    −(λ_θi/λ_θi) v_θi   and   −(λ_ϕi/λ_ϕi) v_ϕi.   (7.6)

We can make two interesting observations from these updates when comparing them to the corresponding Gradient Descent (GD) updates. The GD steps along the eigendirections of the two Hessians are given by

    −λ_θi v_θi   and   λ_ϕi v_ϕi.   (7.7)

The two differences between the methods are that (i) Newton's method re-scales the step length with the inverse of the absolute value of the eigenvalue and (ii) the direction of the step changes when λ_θi < 0 (λ_ϕi > 0). While the re-scaling of the step length avoids slowing down in directions with a small absolute eigenvalue, which gives Newton's method an advantage over GD, the change in direction can lead to a problem. Let us consider (without loss of generality) the update of the minimization over θ along the i-th eigendirection. If the i-th eigenvalue λ_θi of ∇²_θ f is negative, the updates of GD and Newton along the eigenvector e_θi point in different directions:

    GD: −λ_θi v_θi    Newton: −v_θi.   (7.8)

The authors of [8] argue that with the Newton method we are then moving in a direction of increasing error, towards the critical point. Hence, any critical point of the individual dynamics of θ (ϕ) becomes an attractor, even if it is not a local minimum (maximum) of the objective but rather a saddle point, i.e., even if ∇²_θ f(θ̄, ϕ̄) is not positive semi-definite (∇²_ϕ f(θ̄, ϕ̄) is not negative semi-definite). Summarizing this finding for our problem: the simultaneous Newton method is attracted to all critical points, not just to locally optimal saddle points.

The authors of [8] propose the Saddle Free Newton (SFN) method to circumvent this problem of the vanilla Newton method for non-convex optimization. The SFN method replaces the inverse Hessian H^{−1} in the update equation with |H|^{−1}, where |H| is the matrix obtained by taking the absolute value of each of the eigenvalues of H. Through this change, we make sure that the updates of GD and Newton along each eigendirection in Eq. (7.8) always have the same sign. Applying this concept to our simultaneous Newton method for the saddle point problem is straightforward and results in the following update equation:

    [∆θ; ∆ϕ] = −η [ |∇²_θ f|   0 ; 0   −|∇²_ϕ f| ]^{−1} [ ∇_θ f ; ∇_ϕ f ],   (7.9)

which yields the two parameter updates along the corresponding eigenvectors as

    −(λ_θi/|λ_θi|) v_θi   and   (λ_ϕi/|λ_ϕi|) v_ϕi.   (7.10)

As we can see, this method combines the benefit of re-scaling with the eigenvalue while keeping the same direction as GD.
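A small NumPy sketch of this modification is shown below: the helper builds |H| from an eigendecomposition, and the step applies it per block with the signs of Eq. (7.9). The function names and callables are assumptions, and forming dense Hessians is only practical for small models.

    import numpy as np

    def abs_eig(H):
        """|H|: same eigenvectors as the symmetric matrix H, absolute eigenvalues."""
        lam, V = np.linalg.eigh(H)
        return (V * np.abs(lam)) @ V.T

    def sfn_simultaneous_step(theta, phi, g_th, g_ph, H_th, H_ph, eta=1.0):
        """Simultaneous Saddle-Free Newton step (Eq. 7.9): descend on theta, ascend on phi,
        each block preconditioned by the inverse of the |Hessian| of that block."""
        theta_new = theta - eta * np.linalg.solve(abs_eig(H_th), g_th)
        phi_new = phi + eta * np.linalg.solve(abs_eig(H_ph), g_ph)
        return theta_new, phi_new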

Simultaneous vs. full Newton Updates

The simultaneous Newton update equation (7.1) contains the major simplification of this method: we approximate the Hessian of the objective with a block diagonal matrix, disregarding all curvature information of the cross terms. In this section, we show that this gives rise to different update directions, and we visualize the consequences for a simple convex-concave objective. By the full Newton method we denote the parameter updates according to the following equation:

    [∆θ; ∆ϕ] = −η [ ∇²_θ f   ∇_θϕ f ; ∇_ϕθ f   ∇²_ϕ f ]^{−1} [ ∇_θ f ; ∇_ϕ f ].   (7.11)

We update both sets of parameters at once by using the Hessian of the full problem, which includes the curvature information of the cross terms, i.e., ∇_θϕ f and ∇_ϕθ f. Note that the update equation for the simultaneous method in Eq. (7.1) is a special instance of this where ∇_θϕ f = 0. Therefore, in general, the two methods yield different update equations, which means that they change the parameters along different directions. Obviously, with a shrinking relative magnitude of the cross-term curvature ∇_θϕ f, the block diagonal approximation gets closer to the true Hessian. Intuitively, it seems clear that with a better approximation the update directions of the simultaneous and the full method become more similar. Hence, the similarity of the update directions of the two methods is controlled by the relative magnitude of ∇_θϕ f. The following paragraph gives a quantitative analysis of this argument, using a simple convex-concave example.

Convex-Concave Example. To study the difference between the simultaneous Newton update in Eq. (7.1) and the full Newton update in Eq. (7.11) on an actual example, we consider the following simple convex-concave function

    f(θ, ϕ) = (α/2)‖θ‖² + θ^⊤ M ϕ − (β/2)‖ϕ‖²,   (7.12)

with θ ∈ R^n, ϕ ∈ R^m, M ∈ R^{n×m}, and α, β > 0. Since this function is convex in θ and concave in ϕ, the only critical point (θ*, ϕ*) = (0, 0) is a locally optimal saddle point. The gradient and Hessian of the function are given by

    ∇f(θ, ϕ) = [ αθ + Mϕ ; −βϕ + M^⊤θ ],   H(θ, ϕ) = [ αI   M ; M^⊤   −βI ].   (7.13)

To compare the simultaneous method to the full method, we evaluate the parameter update step of both at a point (θ, ϕ). First, observe that derivatives of f of order higher than two are zero. Hence, the second-order Taylor expansion is exact, i.e.,

    f([θ + ∆θ; ϕ + ∆ϕ]) = f([θ; ϕ]) + [∆θ; ∆ϕ]^⊤ [∇_θ f(θ, ϕ); ∇_ϕ f(θ, ϕ)] + ½ [∆θ; ∆ϕ]^⊤ H(θ, ϕ) [∆θ; ∆ϕ].   (7.14)

Taking the derivative of the expansion with respect to [∆θ; ∆ϕ] and setting it to zero yields

    [∇_θ f(θ, ϕ); ∇_ϕ f(θ, ϕ)] + H(θ, ϕ) [∆θ; ∆ϕ] = 0   (7.15)
    [∆θ; ∆ϕ] = −H^{−1} [∇_θ f(θ, ϕ); ∇_ϕ f(θ, ϕ)],   (7.16)

which is exactly the update equation of Newton's method. Since our particular function has a zero gradient only at the origin, the parameter update of the full Newton method at any point (θ, ϕ) leads straight to the origin:

    [∆θ; ∆ϕ] = −[θ; ϕ].   (7.17)

For the simultaneous update, on the other hand, we have the following two update equations:

    ∆θ = −(∇²_θ f(θ, ϕ))^{−1} ∇_θ f(θ, ϕ)   (7.18)
       = −(1/α)(αθ + Mϕ) = −θ − (1/α)Mϕ,   (7.19)
    ∆ϕ = −(∇²_ϕ f(θ, ϕ))^{−1} ∇_ϕ f(θ, ϕ)   (7.20)
       = (1/β)(−βϕ + M^⊤θ) = −ϕ + (1/β)M^⊤θ,   (7.21)

which results in the overall update

    [∆θ; ∆ϕ] = −[θ; ϕ] + [ −(1/α)Mϕ ; (1/β)M^⊤θ ].   (7.22)
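The difference between Eqs. (7.17) and (7.22) can be checked numerically; the small NumPy script below uses the values shown in Figure 7.1 (α = β = 2, M = 1, start point (−4, 2)), which are otherwise arbitrary.

    import numpy as np

    alpha, beta = 2.0, 2.0
    M = np.array([[1.0]])
    theta, phi = np.array([-4.0]), np.array([2.0])

    grad = np.concatenate([alpha * theta + M @ phi, -beta * phi + M.T @ theta])
    H_full = np.block([[alpha * np.eye(1), M], [M.T, -beta * np.eye(1)]])

    full_step = -np.linalg.solve(H_full, grad)            # Eq. (7.16) on the exact quadratic
    sim_step = np.concatenate([-theta - M @ phi / alpha,  # Eq. (7.19): block-diagonal approximation
                               -phi + M.T @ theta / beta])  # Eq. (7.21)

    print("full Newton step:        ", full_step)  # [4, -2]: lands exactly at the origin
    print("simultaneous Newton step:", sim_step)   # [3, -4]: lands at (-M@phi/alpha, M.T@theta/beta)

The simultaneous step misses the saddle point by exactly the cross-term contribution, which is the additive error discussed next.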

[Figure 7.1: The function of Eq. (7.12) with α, β = 2 and M = 1. The blue and black arrows show one optimization step of the full Newton and the simultaneous Newton method, respectively, from the start point (θ, ϕ) = (−4, 2).]

Hence, when disregarding the cross terms, we introduce an additive error term to the optimization scheme that depends on the matrix M. In Figure 7.1, we see one example where this additive error term leads to a significantly worse update. This example shows that even in a simple convex-concave setting, the simultaneous Newton method, with its block diagonal approximation, can lead to bad updates when the cross-term dependencies in the matrix M are large.

7.3 Generalized Trust Region Method for the Saddle Point Problem

In the previous section, we have seen that naively replacing a common GD optimizer with the Newton method introduces two severe problems: first, that it is attracted to any critical point, and second, that the simultaneous approach can yield bad approximations to the true eigendirections of the full Hessian. We have already identified the solutions to both problems, namely using the SFN method for the former and a full Newton update for the latter. However, combining these two approaches

is not straightforward, as the SFN method is only defined for minimization problems. Therefore, in this section, we extend the generalized trust region approach from [8] to the saddle point problem, yielding a modified second-order method to find structured saddle points. To extend the generalized trust region method, we must account for the fact that we are performing a min-max operation over the function f with respect to the parameters θ and ϕ, respectively. Therefore, we formulate two individual optimization problems within this framework:

    ∆θ = argmin_{∆θ} T_1(f, [θ; ϕ], [∆θ; ∆ϕ])    s.t.  d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) ≤ ∆,   (7.23)
    ∆ϕ = argmin_{∆ϕ} −T_1(f, [θ; ϕ], [∆θ; ∆ϕ])   s.t.  d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) ≤ ∆.   (7.24)

The expression T_1(f, [θ; ϕ], [∆θ; ∆ϕ]) denotes the first-order Taylor approximation of f around the point [θ; ϕ], evaluated at [θ + ∆θ; ϕ + ∆ϕ]. In the generalized trust region framework, the authors of [8] use the discrepancy between the first-order and the second-order Taylor approximation as the distance measure d. To extend this idea to our saddle point problem, with two individual optimization objectives, we use the mean of the discrepancies to ensure that both problems have the same constraint. Note that the discrepancy is defined as an absolute difference, and therefore it is independent of the sign of f. Thus, the distance simplifies to the discrepancy of f, which is given by

    d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) = | f(θ, ϕ) + [∆θ; ∆ϕ]^⊤ [∇_θ f; ∇_ϕ f] + ½ [∆θ; ∆ϕ]^⊤ H [∆θ; ∆ϕ] − f(θ, ϕ) − [∆θ; ∆ϕ]^⊤ [∇_θ f; ∇_ϕ f] | = ½ | [∆θ; ∆ϕ]^⊤ H [∆θ; ∆ϕ] |.   (7.25)

The Hessian, denoted by H, is defined as

    H = [ ∇²_θ f   ∇_θϕ f ; ∇_ϕθ f   ∇²_ϕ f ].   (7.26)

Lemma 7.2 (Lemma 1 from [8]) Let A be a nonsingular square matrix in R^{n×n}, and let x ∈ R^n be some vector. Then it holds that

    |x^⊤ A x| ≤ x^⊤ |A| x,   (7.27)

where |A| is the matrix obtained by taking the absolute value of each of the eigenvalues of A.

Proof Let e_1, ..., e_n be the eigenvectors of A and λ_1, ..., λ_n the corresponding eigenvalues. By decomposing the matrix A, we can re-write the left-hand side of the inequality as

    |x^⊤ A x| = | x^⊤ ( ∑_{i=1}^n λ_i e_i e_i^⊤ ) x | = | ∑_{i=1}^n λ_i (e_i^⊤ x)(e_i^⊤ x) |   (7.28)
              = | ∑_{i=1}^n λ_i (e_i^⊤ x)² |.   (7.29)

With the use of the triangle inequality

    | ∑_i x_i | ≤ ∑_i |x_i|   (7.30)

we obtain the upper bound of the expression as

    |x^⊤ A x| ≤ ∑_{i=1}^n |λ_i| (e_i^⊤ x)² = x^⊤ ( ∑_{i=1}^n |λ_i| e_i e_i^⊤ ) x   (7.31)
             = x^⊤ |A| x.   (7.32)
                                                                              □

With the help of Lemma 7.2, we can construct an upper bound on the distance measure d with

    d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) ≤ ½ [∆θ; ∆ϕ]^⊤ |H| [∆θ; ∆ϕ],   (7.33)

where |H| = [ H_1   H_2 ; H_3   H_4 ] is the matrix obtained by taking the absolute value of each of the eigenvalues of the Hessian of f. To solve the two constrained optimization problems, we use Lagrange multipliers, which yields the following two functions:

    L_1 = f(θ, ϕ) + [∆θ; ∆ϕ]^⊤ [∇_θ f; ∇_ϕ f] + λ ( ½ [∆θ; ∆ϕ]^⊤ |H| [∆θ; ∆ϕ] ),   (7.34)

    L_2 = f(θ, ϕ) − [∆θ; ∆ϕ]^⊤ [∇_θ f; ∇_ϕ f] + λ ( ½ [∆θ; ∆ϕ]^⊤ |H| [∆θ; ∆ϕ] ).   (7.35)

Minimizing L_1 with respect to ∆θ yields

    ∂L_1/∂∆θ = ∇_θ f(θ, ϕ) + λ ( H_1 ∆θ + ½ H_2 ∆ϕ + ½ H_3^⊤ ∆ϕ ).   (7.36)

Similarly, minimizing L_2 with respect to ∆ϕ yields

    ∂L_2/∂∆ϕ = −∇_ϕ f(θ, ϕ) + λ ( H_4 ∆ϕ + ½ H_3 ∆θ + ½ H_2^⊤ ∆θ ).   (7.37)

Lemma 7.3 Let A ∈ R^{n×n} be a symmetric matrix. Then |A| is also symmetric, where |A| is the matrix obtained by taking the absolute value of each of the eigenvalues of A.

Proof Let e_1, ..., e_n be the eigenvectors of A and λ_1, ..., λ_n the corresponding eigenvalues. Using the eigendecomposition of symmetric real matrices, we can re-write the matrix A as

    ∑_{i=1}^n λ_i e_i e_i^⊤.   (7.38)

Taking the absolute value of every eigenvalue leads to the following expression for |A|:

    ∑_{i=1}^n |λ_i| e_i e_i^⊤.   (7.39)

Observing that the outer product of each eigenvector forms a symmetric matrix, and that the set of symmetric matrices is closed under weighted sums, concludes the proof. □

Using the statement of Lemma 7.3, and observing that ∇_θϕ f = (∇_ϕθ f)^⊤ due to the symmetry of the Hessian, leads to the conclusion that H_2 = H_3^⊤. Hence, we can replace H_3^⊤ with H_2 in equation (7.36) and H_2^⊤ with H_3 in equation (7.37). Setting both derivatives to zero then yields the following two equations, respectively:

    ∇_θ f = −λ [ H_1   H_2 ] [∆θ; ∆ϕ],   (7.40)
    ∇_ϕ f = λ [ H_3   H_4 ] [∆θ; ∆ϕ].   (7.41)

This set of equations can be re-written as

    [ ∇_θ f ; −∇_ϕ f ] = −λ [ H_1   H_2 ; H_3   H_4 ] [∆θ; ∆ϕ],   (7.42)

from which we obtain the following solution:

    [∆θ; ∆ϕ] = −(1/λ) |H|^{−1} [ ∇_θ f ; −∇_ϕ f ].   (7.43)

Hence, the optimal update step of the generalized trust region method uses the inverse of the modified Hessian of f, which is constructed using the absolute value of every eigenvalue of the Hessian. Additionally, it uses a negative sign for the gradient of the parameters to be maximized. We call the method using this update SPNewton (saddle point Newton) from now on.

Support for Different Loss Functions

So far, we have adapted the Saddle Free Newton procedure to the saddle point problem via an extended version of the generalized trust region method. However, we have seen in chapter 2 that in many practical use cases we can accelerate training by considering two individual loss functions, a setting that is not covered by our proposed SPNewton method. In this section, we generalize the saddle-free Newton idea even further to the case where we have two individual loss functions (cf. section 2.2.1) f_1, f_2 over which we minimize and maximize, respectively. That is, instead of the standard saddle point problem, where we optimize

    min_θ max_ϕ f(θ, ϕ)   ⟺   { min_θ f(θ, ϕ),  min_ϕ −f(θ, ϕ) },   (7.44)

we consider the more general case with two individual loss functions:

    { min_θ f_1(θ, ϕ),  min_ϕ f_2(θ, ϕ) }.   (7.45)

While adapting common first-order methods to this adjustment is straightforward, our SPNewton method relies on the fact that we optimize a single loss function for both sets of parameters. Therefore, in this section, we further generalize the method to handle this more generic case.
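A minimal NumPy sketch of the resulting SPNewton step is given below. The callables, the dense eigendecomposition, and the learning-rate parameterization (playing the role of 1/λ) are assumptions for illustration; forming and factorizing the full Hessian is only practical for small models.

    import numpy as np

    def spnewton_step(theta, phi, grad_theta, grad_phi, hessian_full, lr=1.0):
        """One SPNewton step (Eq. 7.43): precondition with |H| of the full Hessian and
        flip the sign of the gradient block belonging to the maximization variables."""
        n = theta.size
        H = hessian_full(theta, phi)             # full (n+m) x (n+m) Hessian, cross terms included
        lam, V = np.linalg.eigh(H)
        H_abs = (V * np.abs(lam)) @ V.T          # |H|: same eigenvectors, absolute eigenvalues
        g = np.concatenate([grad_theta(theta, phi), -grad_phi(theta, phi)])
        step = -lr * np.linalg.solve(H_abs, g)
        return theta + step[:n], phi + step[n:]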

Adapting the framework. The first obvious modification we have to make is to re-formulate the individual optimization problems within the generalized trust region method (7.23, 7.24). This is straightforward, as we just have to replace the respective functions, which results in

    ∆θ = argmin_{∆θ} T_1(f_1, [θ; ϕ], [∆θ; ∆ϕ])   s.t.  d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) ≤ ∆,   (7.46)
    ∆ϕ = argmin_{∆ϕ} T_1(f_2, [θ; ϕ], [∆θ; ∆ϕ])   s.t.  d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) ≤ ∆.   (7.47)

However, now that we have two different loss functions, we obviously have two different Hessian matrices, which means that the two distance constraints are no longer the same. We defined the distance as the mean absolute discrepancy between the first-order and the second-order Taylor approximation, which now becomes the following expression:

    d([θ; ϕ], [θ + ∆θ; ϕ + ∆ϕ]) = ½ | [∆θ; ∆ϕ]^⊤ H^{(1)} [∆θ; ∆ϕ] | + ½ | [∆θ; ∆ϕ]^⊤ H^{(2)} [∆θ; ∆ϕ] |
        ≤ ½ [∆θ; ∆ϕ]^⊤ ( |H^{(1)}| + |H^{(2)}| ) [∆θ; ∆ϕ],   (7.48)

where H^{(i)} is the Hessian of the function f_i with i ∈ {1, 2}, and |H^{(i)}| denotes the corresponding matrix constructed by taking the absolute value of each eigenvalue. The inequality for the distance is a direct consequence of Lemma 7.2. With the given expression for the distance, we can use the same line of argument as in the previous section to arrive at the following update step equation:

    [∆θ; ∆ϕ] = −(1/λ) ( |H^{(1)}| + |H^{(2)}| )^{−1} [ ∇_θ f_1 ; ∇_ϕ f_2 ].   (7.49)

This method uses the inverse of the sum of the positive semi-definite matrices |H^{(1)}| and |H^{(2)}| of the individual functions. With this optimizer (which we call AdditiveSPNewton from now on) we are able to use the saddle-free Newton idea in a modified saddle point problem setting with different loss functions (as it usually arises in the training of GANs), at the cost of constructing two Hessian matrices over the full set of parameters.

Parameters of the GAN model (cf. Figure 7.2 (b)):
    Parameter              Discriminator   Generator
    Input Dimension        2               2
    Hidden Layers          1               1
    Hidden Units / Layer   -               -
    Activation Function    ReLU            ReLU
    Output Dimension       1               2
    Batch Size             8192
    Learning Rate          0.05

[Figure 7.2: Plot (a) shows the data distribution. The four means of the Gaussian mixture are arranged on a circle of radius two around the origin and indicated by the red dots. All mixture components have equal probability weight and a standard deviation of diag(0.1, 0.1). The table in (b) shows the network architecture parameters.]

7.4 Experiments

SPNewton vs. Newton and Gradient Descent

Objective and setup. In the first experiment, we want to compare the performance of SPNewton against Newton's method (Newton) and Gradient Descent (SGD). We assess the quality of the generative model on a simple 2D toy problem, with data coming from a mixture of four Gaussians. The data distribution is visualized in Figure 7.2 (a). The detailed parameters of the model are shown in the table of Figure 7.2 (b).

Results. The visual outcome of the comparison of the three different optimization schemes is summarized in Figure 7.3. The kernel density estimate of the generator's distribution is shown in green, while the means of the Gaussian mixture are represented by the red dots. The results of this experiment show very clearly the aforementioned flaw of Newton's method for this type of problem: it converges very quickly to a critical point that is not an optimal solution.

[Figure 7.4: Logarithmic sum of gradients.]

Evidence that it actually converged to a critical

point is given by the small value of the absolute gradient sum in Figure 7.4. The SPNewton method, on the other hand, is not attracted by one of these undesired critical points. On the contrary, it converges quickly to something close to the true data-generating distribution, even outpacing Gradient Descent by far.

[Figure 7.3: Plots of the kernel density estimate of the generator's distribution at iterations 1, 900, 1,800, and 10,000 (left to right) for the different optimization methods Newton, SGD, and SPNewton (top to bottom). Note that between the third and the fourth column there is a relatively larger gap in iterations.]

SPNewton vs. Simultaneous SFNewton

Objective and setup. In the second experiment, we compare the SPNewton method against the simultaneously applied Saddle Free Newton method (SFNewton) of Dauphin et al. [8]. This comparison is important, as it gives empirical evidence for the problem of the discrepancy in the eigenvectors described in the section on simultaneous vs. full Newton updates above. It is, therefore, closely related to the toy experiment on the convex-concave function, but on the significantly larger and more complex 2D-Gaussian GAN problem. We use the same data and model architecture for SPNewton as in the previous experiment.

Results. In Figure 7.5, we see a scatter plot of the resulting distribution of the generator (blue), compared to a subset of the data samples (green). The top row shows the results for the simultaneous SFNewton, and the bottom row for the SPNewton optimizer. We can see that the SPNewton optimizer finds the modes of the Gaussians much faster and quickly converges to an almost perfect distribution. The simultaneous approach, on the other hand,

has trouble identifying the modes and does not seem to converge within ten thousand iterations. The result of this comparison confirms the theoretical argument for the superiority of a Newton method applied to the full objective over the simultaneous approach. It illustrates the consequences of approximating the Hessian with a block diagonal matrix.

[Figure 7.5: Scatter plots of a subset of the data (green) and generated samples (blue) over the course of the iterations (0, 2500, 4500, 8000, and 10,000). The upper row shows the results for SFNewton applied simultaneously, and the bottom row for the proposed SPNewton optimizer.]

Robust Optimization

Objective and Setup. In this last experiment with the SPNewton method, we take a step back from the GAN objective and move to the framework of Robust Optimization (cf. section 2.3). Robust Optimization, in the sense of empirical risk minimization, gives rise to a saddle point problem that has already been described in more detail in the robust optimization experiment of the previous chapter. In this experiment we again consider this framework with the same architectural parameters.

Experiment 1. First, we test the SPNewton optimizer against the Newton method in both variations: applied simultaneously (Eq. (7.1)) and on the full objective (Eq. (7.11)). We compare the methods by their test set accuracy over the number of epochs. We run the experiment for 2000 epochs with 10 different random parameter initializations and different training- and test-set splits. The accuracy results are reported in the left plot of Figure 7.6 with their corresponding 90 percent confidence intervals. To be sure that the methods have converged, we plot the squared sum of gradients over the epochs in the right plot of Figure 7.6. The source code for the SPNewton optimizer and the experiments is available on GitHub.

[Figure 7.6: Robust optimization experiment on the breast cancer Wisconsin dataset [16] (binary classification on 30 attributes). The left plot shows the test set accuracy and the right plot the corresponding squared sum of gradients for the SPNewton, Newton, and simultaneous Newton optimizers. The experiment has been run over 2000 epochs for 10 different initializations and training- and test-set splits. The solid line shows the mean over the different runs and the shaded area the corresponding 90 percent confidence interval.]

The results clearly support the intuition from the previous section. The original Newton method, in both forms, converges to a non-ideal critical point where the test set accuracy is low. Our SPNewton optimizer, on the other hand, is able to find a significantly better solution.

Experiment 2. In this experiment, we want to analyze the behavior of the different optimizers around critical points. In the previous section we argued that the major shortcoming of Newton's method is its attraction to any saddle point. We constructed SPNewton with the specific goal of not being attracted to bad saddle points while still being attracted to locally optimal ones. The following experiment aims to validate this property of SPNewton. We first run Newton's method until we can be sure, based on the squared sum of gradients, that we have found a critical point. Then we perturb the parameters of the model with Gaussian noise drawn from N(0, 0.1), and compare the performance of SPNewton and Newton over the subsequent epochs. An optimizer that is attracted to the initial critical point will quickly converge to the same point again, whereas SPNewton should be able to escape and converge to a different point. The results of this experiment are shown in Figure 7.7. The test set accuracy plot on the left gives evidence that, in fact, the SPNewton optimizer is able to escape from the critical point and find a better solution. On the contrary, the simultaneous Newton method seems to converge back to the initial critical

point. The squared sum of gradients plot on the right shows that (i) we really did find a critical point with the initial Newton run, and (ii) both the Newton and the SPNewton method converge after the perturbation.

[Figure 7.7: Robust optimization experiment on the breast cancer Wisconsin dataset [16] (binary classification on 30 attributes). The left plot shows the test set accuracy and the right plot the corresponding logarithm of the squared sum of gradients for the SPNewton and the simultaneous Newton optimization. The vertical black line indicates the epoch at which the model parameters have been perturbed with additive noise drawn from N(0, 0.1).]

7.5 Problems with Second-order Methods and Future Work

The main drawback of any second-order method is the need to compute the Hessian. The size of this matrix grows quadratically with the number of parameters. Hence, for high-dimensional problems it is very costly or even infeasible to compute and store the Hessian. Most of the saddle point problems we are addressing in this thesis are based on deep neural networks, which are naturally very high-dimensional. Therefore, the proposed SPNewton optimizer is not applicable in these settings due to the computational complexity that the Hessian computation entails. However, to overcome this limitation we could relinquish the exact Hessian computation and rely on an approximation, i.e., an SPNewton optimizer within the framework of quasi-Newton methods [22].

Quasi-SPNewton. Quasi-Newton methods use the alternating structure of the algorithm to update their approximation of the Hessian B after each step with the gained gradient information. This can be done by observing that


More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Tangent spaces, normals and extrema

Tangent spaces, normals and extrema Chapter 3 Tangent spaces, normals and extrema If S is a surface in 3-space, with a point a S where S looks smooth, i.e., without any fold or cusp or self-crossing, we can intuitively define the tangent

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

Arc Search Algorithms

Arc Search Algorithms Arc Search Algorithms Nick Henderson and Walter Murray Stanford University Institute for Computational and Mathematical Engineering November 10, 2011 Unconstrained Optimization minimize x D F (x) where

More information

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation

More information

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality Existence of Global Minimum For example: f (x, y) = x 2 + y 2 is coercive on R 2 (global min. at (0, 0)) f (x) = x 3

More information

Algorithms for constrained local optimization

Algorithms for constrained local optimization Algorithms for constrained local optimization Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Algorithms for constrained local optimization p. Feasible direction methods Algorithms for constrained

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

Fundamentals of Unconstrained Optimization

Fundamentals of Unconstrained Optimization dalmau@cimat.mx Centro de Investigación en Matemáticas CIMAT A.C. Mexico Enero 2016 Outline Introduction 1 Introduction 2 3 4 Optimization Problem min f (x) x Ω where f (x) is a real-valued function The

More information

Non-convex optimization. Issam Laradji

Non-convex optimization. Issam Laradji Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. xx, No. x, Xxxxxxx 00x, pp. xxx xxx ISSN 0364-765X EISSN 156-5471 0x xx0x 0xxx informs DOI 10.187/moor.xxxx.xxxx c 00x INFORMS On the Power of Robust Solutions in

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

Unconstrained Optimization

Unconstrained Optimization 1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015 2 / 36 3 / 36 4 / 36 5 / 36 1. preliminaries 1.1 local approximation

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Optimized first-order minimization methods

Optimized first-order minimization methods Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

Math 273a: Optimization Netwon s methods

Math 273a: Optimization Netwon s methods Math 273a: Optimization Netwon s methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 some material taken from Chong-Zak, 4th Ed. Main features of Newton s method Uses both first derivatives

More information

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL) Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Deep Learning. Authors: I. Goodfellow, Y. Bengio, A. Courville. Chapter 4: Numerical Computation. Lecture slides edited by C. Yim. C.

Deep Learning. Authors: I. Goodfellow, Y. Bengio, A. Courville. Chapter 4: Numerical Computation. Lecture slides edited by C. Yim. C. Chapter 4: Numerical Computation Deep Learning Authors: I. Goodfellow, Y. Bengio, A. Courville Lecture slides edited by 1 Chapter 4: Numerical Computation 4.1 Overflow and Underflow 4.2 Poor Conditioning

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage

More information

Lecture 6 Optimization for Deep Neural Networks

Lecture 6 Optimization for Deep Neural Networks Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Review of Optimization Methods

Review of Optimization Methods Review of Optimization Methods Prof. Manuela Pedio 20550 Quantitative Methods for Finance August 2018 Outline of the Course Lectures 1 and 2 (3 hours, in class): Linear and non-linear functions on Limits,

More information

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Tijmen Daniëls Universiteit van Amsterdam. Abstract

Tijmen Daniëls Universiteit van Amsterdam. Abstract Pure strategy dominance with quasiconcave utility functions Tijmen Daniëls Universiteit van Amsterdam Abstract By a result of Pearce (1984), in a finite strategic form game, the set of a player's serially

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

A New Trust Region Algorithm Using Radial Basis Function Models

A New Trust Region Algorithm Using Radial Basis Function Models A New Trust Region Algorithm Using Radial Basis Function Models Seppo Pulkkinen University of Turku Department of Mathematics July 14, 2010 Outline 1 Introduction 2 Background Taylor series approximations

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

Computational Finance

Computational Finance Department of Mathematics at University of California, San Diego Computational Finance Optimization Techniques [Lecture 2] Michael Holst January 9, 2017 Contents 1 Optimization Techniques 3 1.1 Examples

More information