arxiv: v1 [math.ap] 10 Oct 2013

The Exponential Formula for the Wasserstein Metric

Katy Craig

10/10/13

arXiv:1310.2912v1 [math.AP] 10 Oct 2013

Abstract

We adapt Crandall and Liggett's method from the Banach space case to give a new proof of the exponential formula for the Wasserstein metric. In doing this, we introduce a new class of metrics, transport metrics, that have stronger convexity properties than the Wasserstein metric. With these, we prove an Euler-Lagrange equation characterizing the discrete gradient flow. We also prove an almost contraction inequality that controls the distance between discrete gradient flows with different initial data. Combining these results, we obtain the exponential formula and quantify the rate at which the discrete gradient flow converges to the continuous gradient flow. We then apply our estimates to give simple proofs of properties of the gradient flow, including the contracting semigroup property and energy dissipation inequality.

Contents

1 Introduction
2 Discrete Gradient Flow in $W_2$: Background and New Results
2.1 Wasserstein Metric
2.2 Geodesics and Generalized Geodesics
2.3 Convexity
2.4 Differentiability
2.5 Transport Metrics
2.6 Discrete Gradient Flow
2.7 Euler-Lagrange Equation
2.8 Discrete Variational Inequality
3 Exponential Formula for the Wasserstein Metric
3.1 Almost Contraction Inequality
3.2 Relation Between Proximal Maps with Different Time Steps
3.3 Asymmetric Recursive Inequality
3.4 Inductive Bound
3.5 Exponential Formula for the Wasserstein Metric
3.6 Gradient Flow with Initial Conditions $\mu \in \overline{D(E)}$
4 Appendix
4.1 Varying Time Steps
4.2 Allowing $E(\mu) < +\infty$ when $\mu$ charges small sets

Work partially supported by U.S. National Science Foundation grants DMS and DMS. © 2013 by the author. This paper may be reproduced, in its entirety, for non-commercial purposes.

Key words: Wasserstein metric, gradient flow, exponential formula; Math Subject Classification: 47J, 49K, 49J

1 Introduction

Given a continuously differentiable, convex function $E : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$, the gradient flow of $E$ is the solution to the Cauchy problem

$$\frac{d}{dt} u(t) = -\nabla E(u(t)), \qquad u(0) \in D(E) = \{v \in \mathbb{R}^d : E(v) < +\infty\}. \tag{1}$$

Through suitable generalizations of the notion of the gradient, the theory of gradient flows has been extensively studied in Hilbert spaces [3], Banach spaces [6, 7], nonpositively curved metric spaces [9], and general metric spaces [1, 5], including the space of probability measures with finite second moment $P_2(\mathbb{R}^d)$ endowed with the Wasserstein metric $W_2$. Gradient flow in the Wasserstein metric is of particular interest due to the few restrictions it imposes on initial data (no regularity, just $P_2(\mathbb{R}^d)$) and the wide variety of partial differential equations that can be studied with this perspective. Formally, the gradient vector field with respect to $W_2$ is $\nabla_{W_2} E(\mu) = -\nabla \cdot \left( \mu \nabla \frac{\delta E}{\delta \rho}(\mu) \right)$ [11, 12]. However, to rigorously prove existence of the gradient flow and study its properties, one typically works with a discretized version of the problem that doesn't rely on a rigorous notion of the gradient.

In the Euclidean case, this discretized version of (1) is obtained by the implicit Euler method: the discrete gradient flow sequence with time step $\tau$ is

$$\frac{u_n - u_{n-1}}{\tau} = -\nabla E(u_n), \qquad u_0 = u.$$

We define the proximal map $J_\tau : \mathbb{R}^d \to \mathbb{R}^d : u \mapsto J_\tau u$ by

$$\frac{J_\tau u - u}{\tau} = -\nabla E(J_\tau u) \iff J_\tau u = (\mathrm{id} + \tau \nabla E)^{-1} u. \tag{2}$$

Thus, the $n$th element of the discrete gradient flow sequence can be expressed as $J_\tau^n u$. Setting $\tau = t/n$ and sending $n \to \infty$ gives the exponential formula relating the discrete gradient flow to the gradient flow $u(t)$:

$$\lim_{n \to \infty} J_{t/n}^n u = \lim_{n \to \infty} \left( \mathrm{id} + \frac{t}{n} \nabla E \right)^{-n} u = u(t).$$

In this paper, we apply Crandall and Liggett's method for Banach space gradient flow [6] to prove the exponential formula for the Wasserstein metric, uniting the Banach space and Wasserstein theories.
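As a purely illustrative sketch of the Euclidean exponential formula described above, the following code iterates the proximal map for the one-dimensional quadratic energy $E(u) = \frac{\lambda}{2}u^2$, for which the implicit Euler step has the closed form $J_\tau u = u/(1+\lambda\tau)$ and the gradient flow is $u(t) = u_0 e^{-\lambda t}$. The choice of energy and the function names (`proximal_map`, `discrete_gradient_flow`, `exact_gradient_flow`) are assumptions made for this example and do not appear in the paper.

```python
import math


def proximal_map(u, tau, lam=1.0):
    """Implicit Euler step for the assumed energy E(u) = lam/2 * u**2.

    Solves (J - u) / tau = -lam * J for J, i.e. J = (id + tau * grad E)^{-1} u.
    """
    return u / (1.0 + lam * tau)


def discrete_gradient_flow(u0, t, n, lam=1.0):
    """Apply the proximal map n times with time step tau = t / n."""
    tau = t / n
    u = u0
    for _ in range(n):
        u = proximal_map(u, tau, lam)
    return u


def exact_gradient_flow(u0, t, lam=1.0):
    """Solution of u'(t) = -lam * u(t) with u(0) = u0."""
    return u0 * math.exp(-lam * t)


# Distance between the discrete and continuous flows at time t = 1
# for an increasing number of steps.
errors = [
    abs(discrete_gradient_flow(1.0, 1.0, n) - exact_gradient_flow(1.0, 1.0))
    for n in (10, 100, 1000)
]
```

In this example the error shrinks as the number of steps grows, which is the behavior the exponential formula predicts in the simplest possible setting.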
To generalize the notion of the proximal map to a metric space, note that equation (2) is the Euler-Lagrange equation corresponding to the minimization problem

$$J_\tau u = \operatorname*{argmin}_{v \in \mathbb{R}^d} \left\{ \frac{1}{2\tau} |v - u|^2 + E(v) \right\}.$$

This expression for $J_\tau$ allows us to define the proximal map and discrete gradient flow without a rigorous notion of the gradient. We define the Wasserstein proximal map by

$$J_\tau \mu := \operatorname*{argmin}_{\nu \in P_2(\mathbb{R}^d)} \left\{ \frac{1}{2\tau} W_2^2(\nu, \mu) + E(\nu) \right\}. \tag{3}$$

Likewise, the exponential formula is

$$\lim_{n \to \infty} J_{t/n}^n \mu = S(t)\mu, \tag{4}$$

where $S(t)\mu$ denotes the gradient flow at time $t$ with initial conditions $\mu$. (See Definition 1.3.)

The exponential formula (4) was first proved in the Wasserstein metric by Ambrosio, Gigli, and Savaré, through a careful analysis of affine interpolations of functions of the discrete gradient flow, for example, interpolations of $W_2^2(J_\tau^{n-1}\mu, \nu)$ and $W_2^2(J_\tau^n \mu, \nu)$ [1]. With this method, they obtain the sharp rate of convergence of the discrete gradient flow to the gradient flow and use this to develop many properties of the gradient flow. In spite of the definitive results obtained by their method, Ambrosio, Gigli, and Savaré also raised the question of whether it might be possible to obtain the same results using Crandall and Liggett's method, bringing together the Banach space and Wasserstein theories of gradient flow. Crandall and Liggett's method is appealing due to its robustness and simplicity, relying on convexity inequalities to quantify the behavior of the discrete gradient flow and iterating these inequalities to prove convergence as the time step goes to zero [6].

At first glance, an adaptation of Crandall and Liggett's method to the Wasserstein metric seems unlikely. For $E$ convex, the generalization of $\nabla E$ in the Banach space case is an accretive operator, which, by definition, is an operator for which the corresponding proximal map satisfies

$$\| J_\tau u - J_\tau v \| \le \| u - v \|. \tag{5}$$

While such an inequality does hold in metric spaces of nonpositive curvature (Mayer used it to adapt Crandall and Liggett's method to this case [9]), the Wasserstein metric is nonnegatively curved [1, Theorem 7.3.2], and it is unknown if such a contraction holds.
Still, there exist almost contraction inequalities, such as [1, Lemma 4.2.4] or [4, Theorem 1.3], and Carlen and the author demonstrated their usefulness for studying qualitative properties of gradient flow, applying them to show that many of the remarkable features of the solutions to the porous media and fast diffusion equations, such as convergence to Barenblatt profiles, are also present in the discrete gradient flow [4]. In this paper, we use this type of almost contraction inequality to adapt Crandall and Liggett's method to the Wasserstein metric, giving new proofs of the exponential formula and of several properties of the gradient flow.

A fundamental difference between our method and Crandall and Liggett's is that our almost contraction inequality involves the square distance, rather than the distance itself. This prevents us from applying the triangle inequality, as they did, to control the distance between different elements of the discrete gradient flow. Furthermore, unlike in the Hilbertian case where $x \mapsto \frac{1}{2}\|x - y\|^2$ is 1-convex along geodesics, $\mu \mapsto \frac{1}{2} W_2^2(\mu, \omega)$ is not [1, Example 9.1.5]. This is a recurring difficulty when extending results from Hilbert and Banach spaces to the Wasserstein metric. Ambrosio, Gigli, and Savaré circumvent this in [1] by introducing a different class of curves, generalized geodesics, along which the square distance is 1-convex. We further develop this idea, introducing a class of transport metrics $W_{2,\omega}$ whose geodesics correspond exactly to the generalized geodesics. These metrics satisfy the key property that $\mu \mapsto \frac{1}{2} W_{2,\omega}^2(\mu, \omega)$ is convex along the geodesics induced by $W_{2,\omega}$. This turns out to be the essential fact needed to control the discrete gradient flow and adapt Crandall and Liggett's method to the Wasserstein case.

In sections 2.1 through 2.4, we recall general facts about the Wasserstein metric and functionals defined on this metric space. We will often impose the following assumptions on our functionals.

ASSUMPTION 1.1 (optional domain assumption). $E(\mu) < +\infty$ only if $\mu$ is absolutely continuous with respect to Lebesgue measure.

This assumption ensures that for all $\mu \in D(E)$ and $\nu \in P(\mathbb{R}^d)$ there exists an optimal transport map $t_\mu^\nu$ from $\mu$ to $\nu$ [2] (see section 2.1). This is purely for notational convenience. In section 4.2 we describe how to remove this assumption.

ASSUMPTION 1.2 (convexity assumption). $E$ is proper, coercive, lower semicontinuous, and $\lambda$-convex along generalized geodesics for $\lambda \in \mathbb{R}$.

This assumption is essential. In particular, the fact that $E$ is $\lambda$-convex along generalized geodesics ensures that $E$ is $\lambda$-convex in the transport metric $W_{2,\omega}$.

In section 2.5, we define the transport metric $W_{2,\omega}$ and the corresponding subdifferential $\partial_{2,\omega}$ and study their properties. In section 2.6, we recall basic facts of the discrete gradient flow, the proximal map $J_\tau$, and the associated minimization problem. In section 2.7, we reframe the minimization problem in terms of the transport metrics, allowing us to prove an Euler-Lagrange equation for the minimizer $J_\tau \mu$. In section 2.8, we recall the discrete variational inequality from [1, Theorem 4.1.2] and prove a stronger version using transport metrics.

In section 3.1, we begin our proof of the exponential formula by proving an asymmetric almost contraction inequality. In section 3.2, we apply our Euler-Lagrange equation to obtain an expression relating proximal maps with different time steps. In sections 3.3 and 3.4, we combine these results to bound the distance between gradient flow sequences with different time steps via an asymmetric induction in the style of Rasmussen [13]. Finally, in section 3.5, we prove the exponential formula (4) and quantify the convergence of the discrete gradient flow to the gradient flow.

Following Ambrosio, Gigli, and Savaré [1, Equation (4.0.13)], we define the Wasserstein gradient flow as follows:

DEFINITION 1.3 (gradient flow).
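A curve $S(t)\mu : (0, +\infty) \to P_2(\mathbb{R}^d)$ is the gradient flow of a functional $E$ with initial data $\mu \in \overline{D(E)}$ if $S(t)\mu \to \mu$ as $t \to 0$ and

$$\frac{1}{2} \frac{d}{dt} W_2^2(S(t)\mu, \omega) + \frac{\lambda}{2} W_2^2(S(t)\mu, \omega) \le E(\omega) - E(S(t)\mu), \quad \forall \omega \in D(E), \ \text{Lebesgue a.e. } t > 0. \tag{6}$$

We will sometimes refer to $S(t)\mu$ as the continuous gradient flow, to distinguish it from the discrete gradient flow.

In the particular case that $E$ satisfies convexity assumption 1.2 and is of the form

$$E(\mu) = \begin{cases} \int_{\mathbb{R}^d} F(x, \rho(x), \nabla \rho(x)) \, dx & \text{for } \mu = \rho \, dx, \ \rho \in C^1(\mathbb{R}^d), \\ +\infty & \text{otherwise,} \end{cases}$$

with $F \in C^2(\mathbb{R}^d \times [0, +\infty) \times \mathbb{R}^d)$, (6) is equivalent to the gradient flow $\mu_t := S(t)\mu$ satisfying

$$\frac{d}{dt} \mu_t = \nabla \cdot \left( \mu_t \nabla \frac{\delta E}{\delta \rho_t} \right), \qquad \mu_t = \rho_t \, dx$$

in a weak sense, in the duality with $C_c^\infty(\mathbb{R}^d \times (0, +\infty))$ [1, Lemma 10.4.1, Theorem 11.1.4]. Formally, $\nabla_W E(\mu) = -\nabla \cdot \left( \mu \nabla \frac{\delta E}{\delta \rho} \right)$ is the Wasserstein gradient vector field, where $\frac{\delta E}{\delta \rho_t}$ is the first variation of $E$ at $\rho_t$ [11, 12].

We close section 3.5 by applying our estimates to give simple proofs of properties of the continuous gradient flow, including the contracting semigroup property and the energy dissipation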

inequality. Finally, in section 3.6, we extend our results, which only applied to gradient flows with initial conditions $\mu \in D(|\partial E|)$, to include initial conditions $\mu \in \overline{D(E)}$. (See Definition 2.8 of the metric slope $|\partial E|$.)

In the appendix, we describe two extensions. In section 4.1, we adapt our proof of the exponential formula to include discrete gradient flows with varying time steps. In section 4.2, we describe how to remove the domain assumption 1.1, which we imposed for notational convenience.

This new proof of the convergence of discrete gradient flow suggests several directions for future work. Is there an underlying geometric structure that relates the transport metrics $W_{2,\omega}$ to the Wasserstein metric $W_2$, making the assumption that $E$ is convex along generalized geodesics more geometrically natural? Can this new proof of the exponential formula be used to study the behavior of the gradient flow as the functional $E$ is perturbed or regularized? Is there a generalization of $\nabla E$ in the Wasserstein metric that corresponds to the Banach space notion of an accretive operator?

Acknowledgements: The author thanks Prof. Giuseppe Savaré for suggesting this problem. The author also thanks Prof. Eric Carlen for suggesting the form of Theorem 3.2 and for many helpful conversations.

2 Discrete Gradient Flow in $W_2$: Background and New Results

2.1 Wasserstein Metric

Let $P(\mathbb{R}^d)$ denote the set of Borel probability measures on $\mathbb{R}^d$. Given $\mu, \nu \in P(\mathbb{R}^d)$, a Borel measurable map $t : \mathbb{R}^d \to \mathbb{R}^d$ transports $\mu$ onto $\nu$ if $\nu(B) = \mu(t^{-1}(B))$ for all Borel sets $B \subseteq \mathbb{R}^d$. We call $\nu$ the push-forward of $\mu$ under $t$ and write $\nu = t\#\mu$.

Consider a measure $\boldsymbol{\mu} \in P(\mathbb{R}^d \times \mathbb{R}^d)$. (We distinguish probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ or $\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d$ from probability measures on $\mathbb{R}^d$ by writing them in bold font.) Let $\pi^1$ be the projection onto the first component of $\mathbb{R}^d \times \mathbb{R}^d$, and let $\pi^2$ be the projection onto the second component. The first and second marginals of $\boldsymbol{\mu}$ are $\pi^1\#\boldsymbol{\mu} \in P(\mathbb{R}^d)$ and $\pi^2\#\boldsymbol{\mu} \in P(\mathbb{R}^d)$.
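Given $\mu, \nu \in P(\mathbb{R}^d)$, the set of transport plans from $\mu$ to $\nu$ is

$$\Gamma(\mu, \nu) := \{ \boldsymbol{\mu} \in P(\mathbb{R}^d \times \mathbb{R}^d) : \pi^1\#\boldsymbol{\mu} = \mu, \ \pi^2\#\boldsymbol{\mu} = \nu \}.$$

The Wasserstein distance between $\mu$ and $\nu$ is

$$W_2(\mu, \nu) := \left( \inf \left\{ \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, d\boldsymbol{\mu}(x, y) : \boldsymbol{\mu} \in \Gamma(\mu, \nu) \right\} \right)^{1/2}. \tag{7}$$

When $W_2(\mu, \nu) < +\infty$, there exist plans which attain the infimum. We denote this set of optimal transport plans by $\Gamma_0(\mu, \nu)$. When $\mu$ is absolutely continuous with respect to Lebesgue measure, there is a unique optimal transport plan from $\mu$ to $\nu$ of the form $(\mathrm{id} \times t)\#\mu$, where $\mathrm{id}(x) = x$ is the identity transformation and $t$ is unique $\mu$-a.e. [2]. In particular, there is a map $t$ satisfying $t\#\mu = \nu$ and

$$W_2(\mu, \nu) = \left( \int_{\mathbb{R}^d} |\mathrm{id} - t|^2 \, d\mu \right)^{1/2}.$$

We denote this unique optimal transport map by $t_\mu^\nu$. Furthermore, a Borel measurable map $t$ that transports $\mu$ to $\nu$ is optimal if and only if it is cyclically monotone $\mu$-a.e. [2, 10], i.e. if there exists $N \subseteq \mathbb{R}^d$ with $\mu(N) = 0$ such that for every finite sequence of distinct points $\{x_1, \dots, x_m\} \subseteq \mathbb{R}^d \setminus N$,

$$t(x_1) \cdot (x_2 - x_1) + t(x_2) \cdot (x_3 - x_2) + \dots + t(x_m) \cdot (x_1 - x_m) \le 0.$$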
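To make the definition of the Wasserstein distance concrete, here is a minimal sketch for a special case that is not treated in the paper: two empirical measures on the real line with the same number of equally weighted atoms. In that setting it is a standard fact that an optimal plan pairs the atoms in sorted order, and the sketch checks this against a brute-force search over all pairings. The function names are hypothetical and chosen for this example only.

```python
import itertools
import math


def wasserstein2_1d(xs, ys):
    """W_2 between uniform empirical measures on the atoms xs and ys (1D).

    Assumes both lists have the same length; pairs atoms in sorted order,
    which is a monotone (hence cyclically monotone) matching.
    """
    if len(xs) != len(ys):
        raise ValueError("both measures need the same number of atoms")
    n = len(xs)
    cost = sum((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys)))
    return math.sqrt(cost / n)


def wasserstein2_bruteforce(xs, ys):
    """Same quantity, minimizing the transport cost over all pairings."""
    n = len(xs)
    best = min(
        sum((xs[i] - ys[p[i]]) ** 2 for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return math.sqrt(best / n)


xs = [0.0, 2.0, 5.0, -1.0]
ys = [3.0, 1.0, 4.0, 0.5]
sorted_value = wasserstein2_1d(xs, ys)
brute_value = wasserstein2_bruteforce(xs, ys)
```

The brute-force search is only feasible for a handful of atoms; it is included here solely as a cross-check of the sorted pairing.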

One technical difficulty when working with the Wasserstein distance on $P(\mathbb{R}^d)$ is that there exist measures that are infinite distances apart. Throughout this paper, we denote by $\omega_0$ some fixed reference measure and define

$$P_{2,\omega_0}(\mathbb{R}^d) = \{ \mu \in P(\mathbb{R}^d) : W_2(\mu, \omega_0) < +\infty \}.$$

By the triangle inequality, $(P_{2,\omega_0}(\mathbb{R}^d), W_2)$ is a metric space. When $\omega_0 = \delta_0$, the Dirac mass at the origin, this is $P_2(\mathbb{R}^d)$, the subset of $P(\mathbb{R}^d)$ with finite second moment.

2.2 Geodesics and Generalized Geodesics

DEFINITION 2.1 (constant speed geodesic). Given a metric space $(X, d)$, a constant speed geodesic $u : [0, 1] \to X$ is a curve satisfying

$$d(u_\alpha, u_\beta) = |\beta - \alpha| \, d(u_0, u_1) \quad \text{for all } \alpha, \beta \in [0, 1].$$

We will often refer to constant speed geodesics simply as geodesics. As shown in [1, Theorem 7.2.2], all geodesics in $P_{2,\omega_0}(\mathbb{R}^d)$ are curves of the form

$$\mu_\alpha = \left( (1 - \alpha)\pi^1 + \alpha \pi^2 \right)\#\boldsymbol{\mu}, \qquad \boldsymbol{\mu} \in \Gamma_0(\mu_0, \mu_1).$$

If $\mu_0$ is absolutely continuous with respect to Lebesgue measure, the geodesic from $\mu_0$ to $\mu_1$ is unique and of the form

$$\mu_\alpha = \left( (1 - \alpha)\mathrm{id} + \alpha t_{\mu_0}^{\mu_1} \right)\#\mu_0.$$

We now recall the definition of generalized geodesics from [1, Definition 9.2.2]. Given a finite product $\mathbb{R}^d \times \mathbb{R}^d \times \dots \times \mathbb{R}^d$, let $\pi^i$ be the projection onto the $i$th component and $\pi^{i,j}$ be the projection onto the $i$th and $j$th components.

DEFINITION 2.2 (generalized geodesic). Given $\mu_0, \mu_1, \omega \in P_{2,\omega_0}(\mathbb{R}^d)$, a generalized geodesic from $\mu_0$ to $\mu_1$ with base $\omega$ is a curve $\mu_\alpha : [0, 1] \to P(\mathbb{R}^d)$ of the form

$$\mu_\alpha := \left( (1 - \alpha)\pi^2 + \alpha \pi^3 \right)\#\boldsymbol{\mu},$$

where $\boldsymbol{\mu} \in P(\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d)$ satisfies

$$\pi^{1,2}\#\boldsymbol{\mu} \in \Gamma_0(\omega, \mu_0) \quad \text{and} \quad \pi^{1,3}\#\boldsymbol{\mu} \in \Gamma_0(\omega, \mu_1). \tag{8}$$

We refer to any $\boldsymbol{\mu} \in P(\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d)$ that satisfies (8) as a plan that induces a generalized geodesic from $\mu_0$ to $\mu_1$ with base $\omega$.

REMARK 2.3. Such a $\boldsymbol{\mu}$ always exists by [1, Lemma 5.3.2]. If the base $\omega$ equals either $\mu_0$ or $\mu_1$, then $\mu_\alpha$ is a geodesic joining $\mu_0$ and $\mu_1$.

REMARK 2.4. If $\omega$ is absolutely continuous with respect to Lebesgue measure, the generalized geodesic from $\mu_0$ to $\mu_1$ with base $\omega$ is unique and of the form

$$\mu_\alpha = \left( (1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1} \right)\#\omega.$$
Since $(1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1}$ is a positive combination of optimal transport maps, it is cyclically monotone, hence it is the optimal transport map from $\omega$ to $\mu_\alpha$.

2.3 Convexity

Given a metric space $(X, d)$, we consider functionals $E : X \to \mathbb{R} \cup \{+\infty\}$ that satisfy the following conditions.

proper: $D(E) := \{ u \in X : E(u) < +\infty \} \ne \emptyset$.

coercive: There exist $\tau_0 > 0$, $u_0 \in X$ such that

$$\inf_{v \in X} \left\{ \frac{1}{2\tau_0} d^2(u_0, v) + E(v) \right\} > -\infty. \tag{9}$$

lower semicontinuous: For all $u_n, u \in X$ such that $u_n \to u$, $\liminf_{n \to \infty} E(u_n) \ge E(u)$.

$\lambda$-convex along a curve $u_\alpha$: Given $\lambda \in \mathbb{R}$ and a curve $u_\alpha \in X$,

$$E(u_\alpha) \le (1 - \alpha)E(u_0) + \alpha E(u_1) - \alpha(1 - \alpha)\frac{\lambda}{2} d^2(u_0, u_1), \quad \forall \alpha \in [0, 1]. \tag{10}$$

$\lambda$-convex along geodesics: Given $\lambda \in \mathbb{R}$, for all $u_0, u_1 \in X$, there exists a geodesic $u_\alpha$ from $u_0$ to $u_1$ along which (10) holds.

We will often simply say that $E$ is $\lambda$-convex, or in the case $\lambda = 0$, convex.

Fix $\omega_0 \in P(\mathbb{R}^d)$ and suppose $(X, d) = (P_{2,\omega_0}(\mathbb{R}^d), W_2)$. In this setting, convexity is often referred to as displacement convexity [10], and we have the additional stronger notion of convexity along generalized geodesics [1, Definition 9.2.2].

DEFINITION 2.5 ($\lambda$-convex along generalized geodesics). Given $\lambda \in \mathbb{R}$, a functional $E : P_{2,\omega_0}(\mathbb{R}^d) \to \mathbb{R} \cup \{+\infty\}$ is $\lambda$-convex along a generalized geodesic $\mu_\alpha$ if

$$E(\mu_\alpha) \le (1 - \alpha)E(\mu_0) + \alpha E(\mu_1) - \alpha(1 - \alpha)\frac{\lambda}{2} \int |x_2 - x_3|^2 \, d\boldsymbol{\mu}, \tag{11}$$

where $\boldsymbol{\mu}$ is the plan that induces the generalized geodesic. $E$ is convex along generalized geodesics if, for all $\mu_0, \mu_1, \omega \in P_{2,\omega_0}(\mathbb{R}^d)$, there exists a generalized geodesic $\mu_\alpha$ from $\mu_0$ to $\mu_1$ with base $\omega$ along which $E$ is convex according to (11).

REMARK 2.6. This definition is slightly different from $E$ being $\lambda$-convex along all of the curves $\mu_\alpha$ according to equation (10), since

$$W_2^2(\mu_0, \mu_1) \le \int |x_2 - x_3|^2 \, d\boldsymbol{\mu}(x). \tag{12}$$

When $\lambda > 0$, equation (11) is stronger, and when $\lambda < 0$, it is weaker.

REMARK 2.7. When $\omega = \mu_0$ or $\mu_1$, $\mu_\alpha$ is simply the geodesic from $\mu_0$ to $\mu_1$ and equality holds in (12). Therefore, $\lambda$-convexity along generalized geodesics implies $\lambda$-convexity along geodesics.

2.4 Differentiability

Given a functional $E$ on a metric space $(X, d)$, we may consider the metric slope.

DEFINITION 2.8 (metric slope). Given a metric space $(X, d)$ and a functional $E : X \to \mathbb{R} \cup \{+\infty\}$, for every $u \in D(E)$, the metric slope of $E$ at $u$ is

$$|\partial E|(u) := \limsup_{v \to u} \frac{(E(u) - E(v))^+}{d(u, v)}.$$

If our metric space is $(X, d) = (P_{2,\omega_0}(\mathbb{R}^d), W_2)$, we may also consider the subdifferential of $E$ [1, Definition 10.1.1]. For ease of notation, we assume $E$ satisfies domain assumption 1.1. This ensures that, for any $\mu \in D(E)$, $\nu \in P_{2,\omega_0}(\mathbb{R}^d)$, there exists a unique optimal transport map $t_\mu^\nu$ from $\mu$ to $\nu$. The subdifferential can be defined without this assumption, but the notation becomes more cumbersome. We explain how to extend these results to the general case in section 4.2.

The Wasserstein subdifferential, as originally defined in [1, Section 10], is inspired by the standard Euclidean subdifferential. One simply replaces the Euclidean distance with the Wasserstein metric and the vector $v - u$ with the transport map $t_\mu^\nu - \mathrm{id}$.

DEFINITION 2.9 (Wasserstein subdifferential). Consider $E : P_{2,\omega_0}(\mathbb{R}^d) \to \mathbb{R} \cup \{+\infty\}$ proper, lower semicontinuous, and satisfying domain assumption 1.1. Given $\mu \in D(|\partial E|)$, $\xi \in L^2(\mu)$ belongs to the Wasserstein subdifferential of $E$ at $\mu$ if

$$E(\nu) - E(\mu) \ge \int_{\mathbb{R}^d} \langle \xi, t_\mu^\nu - \mathrm{id} \rangle \, d\mu + o(W_2(\mu, \nu)) \quad \text{as } \nu \xrightarrow{W_2} \mu.$$

We write $\xi \in \partial E(\mu)$.

REMARK 2.10 (Wasserstein subdifferential and metric slope). [1, Lemma 10.1.5] relates the Wasserstein subdifferential and the metric slope: If $E$ satisfies Assumptions 1.1 and 1.2, then $\mu \in D(|\partial E|)$ if and only if $\partial E(\mu)$ is nonempty. In this case,

$$|\partial E|(\mu) = \min\{ \|\xi\|_{L^2(\mu)} : \xi \in \partial E(\mu) \}.$$

Finally, we recall the definition of the strong subdifferential from [1, 10.1.1]. This quantifies the rate of change of $E$ at $\mu$ when approaching $\mu$ via any transport map, optimal or not.

DEFINITION 2.11 (strong subdifferential). Consider $E : P_{2,\omega_0}(\mathbb{R}^d) \to \mathbb{R} \cup \{+\infty\}$ proper, lower semicontinuous, and satisfying domain assumption 1.1.
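$\xi \in \partial E(\mu)$ is a strong subdifferential in case for all measurable maps $t : \mathbb{R}^d \to \mathbb{R}^d$ such that $\|t - \mathrm{id}\|_{L^2(\mu)} < +\infty$,

$$E(t\#\mu) - E(\mu) \ge \int_{\mathbb{R}^d} \langle \xi, t - \mathrm{id} \rangle \, d\mu + o(\|t - \mathrm{id}\|_{L^2(\mu)}) \quad \text{as } t \xrightarrow{L^2(\mu)} \mathrm{id}.$$

2.5 Transport Metrics

A recurring difficulty in extending results from a Hilbert space $(H, \|\cdot\|)$ to the Wasserstein metric $(P_{2,\omega_0}, W_2)$ is that while $x \mapsto \frac{1}{2}\|x - y\|^2$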

is 1-convex along geodesics, $\mu \mapsto \frac{1}{2} W_2^2(\mu, \omega)$ is not [1, Example 9.1.5]. Ambrosio, Gigli, and Savaré circumvent this difficulty by introducing the notion of generalized geodesics and showing that $\mu \mapsto \frac{1}{2} W_2^2(\mu, \omega)$ is 1-convex along generalized geodesics with base $\omega$ [1, Lemma 9.2.1].

In this section, we introduce a class of metrics whose geodesics correspond exactly to the generalized geodesics with a given base. Furthermore, these metrics satisfy the key property that the square metric is convex along geodesics. This convexity turns out to provide the necessary control over the discrete gradient flow to adapt Crandall and Liggett's method to the Wasserstein metric. For simplicity of notation, we make the following assumption on the measure $\omega$:

ASSUMPTION 2.12 ($\omega$ doesn't charge small sets). $\omega \in P(\mathbb{R}^d)$ is absolutely continuous with respect to Lebesgue measure.

This ensures the existence of an optimal transport map $t_\omega^\mu$ from $\omega$ to any $\mu \in P_{2,\omega}(\mathbb{R}^d)$ [2]. We use these optimal transport maps to define the $(2, \omega)$-transport distance. See section 4.2 for how to extend this definition to measures $\omega$ that are not absolutely continuous with respect to Lebesgue measure.

DEFINITION 2.13 ($(2, \omega)$-transport metric). The $(2, \omega)$-transport metric is $W_{2,\omega} : P_{2,\omega}(\mathbb{R}^d) \times P_{2,\omega}(\mathbb{R}^d) \to \mathbb{R}$,

$$W_{2,\omega}(\mu, \nu) := \left( \int |t_\omega^\mu - t_\omega^\nu|^2 \, d\omega \right)^{1/2}.$$

REMARK 2.14. If $\mu = \omega$ or $\nu = \omega$, this reduces to the Wasserstein metric. In general, $W_{2,\omega}(\mu, \nu) \ge W_2(\mu, \nu)$.

In the following proposition, we prove a few key properties of transport metrics. In particular, we show that the geodesics of the $W_{2,\omega}$ metric are exactly the generalized geodesics with base $\omega$, and hence the function $\mu \mapsto W_{2,\omega}^2(\nu, \mu)$ is convex in this metric for any $\nu \in P_{2,\omega}(\mathbb{R}^d)$.

PROPOSITION 2.15 (properties of the $(2, \omega)$-transport metric).

(i) $W_{2,\omega}$ is a metric on $P_{2,\omega}(\mathbb{R}^d)$.

(ii) The constant speed geodesics with respect to the $W_{2,\omega}$ metric are exactly the generalized geodesics with base $\omega$.
Furthermore, these generalized geodesics $\mu_\alpha$ satisfy

$$W_{2,\omega}^2(\nu, \mu_\alpha) = (1 - \alpha) W_{2,\omega}^2(\nu, \mu_0) + \alpha W_{2,\omega}^2(\nu, \mu_1) - \alpha(1 - \alpha) W_{2,\omega}^2(\mu_0, \mu_1) \quad \forall \nu \in P_{2,\omega}(\mathbb{R}^d). \tag{13}$$

(iii) Generalized geodesics with base $\omega$ are the unique constant speed geodesics in the $W_{2,\omega}$ metric. Consequently, a functional $E$ is $\lambda$-convex along generalized geodesics with base $\omega$ if and only if it is $\lambda$-convex in the $W_{2,\omega}$ metric. In particular, the function $\mu \mapsto W_{2,\omega}^2(\nu, \mu)$ is 2-convex in the $W_{2,\omega}$ metric for any $\nu \in P_{2,\omega}(\mathbb{R}^d)$.

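Proof. (i) $W_{2,\omega}$ is symmetric and nonnegative by definition. It is non-degenerate since

$$0 = W_{2,\omega}(\mu, \nu) \ge W_2(\mu, \nu) \implies \mu = \nu.$$

$W_{2,\omega}$ satisfies the triangle inequality since $L^2(\omega)$ satisfies the triangle inequality:

$$W_{2,\omega}(\mu, \nu) = \|t_\omega^\mu - t_\omega^\nu\|_{L^2(\omega)} \le \|t_\omega^\mu - t_\omega^\rho\|_{L^2(\omega)} + \|t_\omega^\rho - t_\omega^\nu\|_{L^2(\omega)} = W_{2,\omega}(\mu, \rho) + W_{2,\omega}(\rho, \nu).$$

(ii) Let $\mu_\alpha := ((1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1})\#\omega$ be the generalized geodesic with base $\omega$ from $\mu_0$ to $\mu_1$ at time $\alpha \in [0, 1]$. By Remark 2.4, $t_\omega^{\mu_\alpha} = (1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1}$. Consequently,

$$\begin{aligned} W_{2,\omega}(\mu_\alpha, \mu_\beta) &= \left( \int \left| \left( (1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1} \right) - \left( (1 - \beta) t_\omega^{\mu_0} + \beta t_\omega^{\mu_1} \right) \right|^2 d\omega \right)^{1/2} \\ &= \left( \int \left| (\beta - \alpha) t_\omega^{\mu_0} + (\alpha - \beta) t_\omega^{\mu_1} \right|^2 d\omega \right)^{1/2} \\ &= |\beta - \alpha| \, W_{2,\omega}(\mu_0, \mu_1). \end{aligned}$$

This shows that $\mu_\alpha$ is a constant speed geodesic. The second result follows from the corresponding identity of the $L^2(\omega)$ norm:

$$\begin{aligned} W_{2,\omega}^2(\nu, \mu_\alpha) &= \| (1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1} - t_\omega^\nu \|_{L^2(\omega)}^2 \\ &= (1 - \alpha) \| t_\omega^{\mu_0} - t_\omega^\nu \|_{L^2(\omega)}^2 + \alpha \| t_\omega^{\mu_1} - t_\omega^\nu \|_{L^2(\omega)}^2 - \alpha(1 - \alpha) \| t_\omega^{\mu_0} - t_\omega^{\mu_1} \|_{L^2(\omega)}^2 \\ &= (1 - \alpha) W_{2,\omega}^2(\mu_0, \nu) + \alpha W_{2,\omega}^2(\mu_1, \nu) - \alpha(1 - \alpha) W_{2,\omega}^2(\mu_0, \mu_1). \end{aligned}$$

(iii) Suppose $\tilde{\mu}_\alpha$ is a constant speed geodesic in the $W_{2,\omega}$ metric from $\mu_0$ to $\mu_1$. Let $\mu_\alpha := ((1 - \alpha) t_\omega^{\mu_0} + \alpha t_\omega^{\mu_1})\#\omega$ be the generalized geodesic with base $\omega$ from $\mu_0$ to $\mu_1$. Setting $\nu = \tilde{\mu}_\alpha$ in equation (13) gives

$$W_{2,\omega}^2(\tilde{\mu}_\alpha, \mu_\alpha) = (1 - \alpha) W_{2,\omega}^2(\tilde{\mu}_\alpha, \mu_0) + \alpha W_{2,\omega}^2(\tilde{\mu}_\alpha, \mu_1) - \alpha(1 - \alpha) W_{2,\omega}^2(\mu_0, \mu_1).$$

Using the fact that $\tilde{\mu}_\alpha$ is a constant speed geodesic shows

$$\begin{aligned} W_{2,\omega}^2(\tilde{\mu}_\alpha, \mu_\alpha) &= (1 - \alpha)\alpha^2 W_{2,\omega}^2(\mu_1, \mu_0) + \alpha(1 - \alpha)^2 W_{2,\omega}^2(\mu_0, \mu_1) - \alpha(1 - \alpha) W_{2,\omega}^2(\mu_0, \mu_1) \\ &= (\alpha + (1 - \alpha) - 1)(1 - \alpha)\alpha W_{2,\omega}^2(\mu_0, \mu_1) = 0. \end{aligned}$$

Therefore $\tilde{\mu}_\alpha = \mu_\alpha$ and generalized geodesics are the unique constant speed geodesics in the $W_{2,\omega}$ metric.

We may define the subdifferential with respect to $W_{2,\omega}$ in analogy with the Wasserstein subdifferential, Definition 2.9.

DEFINITION 2.16 ($W_{2,\omega}$ subdifferential). Given $E : P_{2,\omega}(\mathbb{R}^d) \to \mathbb{R} \cup \{+\infty\}$ proper and lower semicontinuous in $W_{2,\omega}$, $\xi \in L^2(\omega)$ belongs to the $W_{2,\omega}$ subdifferential $\partial_{2,\omega} E(\mu)$ in case

$$E(\nu) - E(\mu) \ge \int \langle \xi, t_\omega^\nu - t_\omega^\mu \rangle \, d\omega + o(W_{2,\omega}(\mu, \nu)) \quad \text{as } \nu \to \mu.$$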
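The comparison between the transport metric and the Wasserstein metric can be illustrated numerically in a discrete setting. The sketch below is an assumption-laden analogue, not the construction used in the paper: the base measure is a uniform measure on finitely many points (which violates the absolute continuity assumption on $\omega$, so optimal maps need not be unique in general), optimal maps are found by brute force over permutations, and all function names are hypothetical.

```python
import itertools
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def optimal_assignment(src, dst):
    """Brute-force optimal matching; src[i] is sent to dst[perm[i]]."""
    n = len(src)
    return min(
        itertools.permutations(range(n)),
        key=lambda p: sum(sq_dist(src[i], dst[p[i]]) for i in range(n)),
    )


def w2(src, dst):
    """Wasserstein distance between uniform measures on src and dst."""
    perm = optimal_assignment(src, dst)
    n = len(src)
    return math.sqrt(sum(sq_dist(src[i], dst[perm[i]]) for i in range(n)) / n)


def w2_transport(omega, mu, nu):
    """Discrete analogue of the (2, omega)-transport metric.

    Compares the optimal maps from omega to mu and from omega to nu
    in the L^2(omega) norm.
    """
    t_mu = optimal_assignment(omega, mu)
    t_nu = optimal_assignment(omega, nu)
    n = len(omega)
    return math.sqrt(
        sum(sq_dist(mu[t_mu[i]], nu[t_nu[i]]) for i in range(n)) / n
    )


omega = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
mu = [(2.0, 0.1), (0.3, 2.0), (1.5, 1.5)]
nu = [(-1.0, 0.2), (0.1, -1.0), (-0.7, -0.9)]

transport_distance = w2_transport(omega, mu, nu)
wasserstein_distance = w2(mu, nu)
```

Because pairing atoms of `mu` and `nu` through their common preimage in `omega` is one particular coupling, `transport_distance` can never be smaller than `wasserstein_distance`, and the two agree when one of the measures is the base itself.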

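REMARK 2.17 (lower semicontinuity in $W_2$ vs. $W_{2,\omega}$). By Remark 2.14, $W_{2,\omega} \ge W_2$, so convergence in $W_{2,\omega}$ implies convergence in $W_2$; that is, $W_{2,\omega}$ induces a stronger topology than $W_2$. Consequently, if $E$ is lower semicontinuous in $W_2$, it is lower semicontinuous in $W_{2,\omega}$.

REMARK 2.18 (additivity of $W_{2,\omega}$ subdifferential). If $\xi_1 \in \partial_{2,\omega} E_1(\mu)$ and $\xi_2 \in \partial_{2,\omega} E_2(\mu)$,

$$E_1(\nu) + E_2(\nu) - E_1(\mu) - E_2(\mu) \ge \int \langle \xi_1 + \xi_2, t_\omega^\nu - t_\omega^\mu \rangle \, d\omega + o(W_{2,\omega}(\mu, \nu)),$$

so $\xi_1 + \xi_2 \in \partial_{2,\omega}(E_1 + E_2)(\mu)$.

The next proposition provides a characterization of the $W_{2,\omega}$ subdifferential for functionals that are convex in $W_{2,\omega}$, in analogy with [1, Equation (10.1.7)].

PROPOSITION 2.19 ($W_{2,\omega}$ subdifferential for convex function). Given $E$ satisfying the conditions of Definition 2.16 and $\lambda$-convex in the $W_{2,\omega}$ metric, $\xi \in \partial_{2,\omega} E(\mu)$ if and only if

$$E(\nu) - E(\mu) \ge \int \langle \xi, t_\omega^\nu - t_\omega^\mu \rangle \, d\omega + \frac{\lambda}{2} W_{2,\omega}^2(\mu, \nu) \quad \forall \nu. \tag{14}$$

Proof. If (14) holds, then $\xi \in \partial_{2,\omega} E(\mu)$ by Definition 2.16. For the converse, assume $\xi \in \partial_{2,\omega} E(\mu)$. Define $\mu_\alpha = ((1 - \alpha) t_\omega^\mu + \alpha t_\omega^\nu)\#\omega$ to be the generalized geodesic from $\mu$ to $\nu$ with base $\omega$. Since $E$ is $\lambda$-convex in the $W_{2,\omega}$ metric,

$$\frac{E(\mu_\alpha) - E(\mu)}{\alpha} \le E(\nu) - E(\mu) - \frac{\lambda}{2}(1 - \alpha) W_{2,\omega}^2(\mu, \nu). \tag{15}$$

By Proposition 2.15, $W_{2,\omega}(\mu, \mu_\alpha) = \alpha W_{2,\omega}(\mu, \nu)$, and by Remark 2.4, $t_\omega^{\mu_\alpha} = (1 - \alpha) t_\omega^\mu + \alpha t_\omega^\nu$. Combining these with the definition of $\xi \in \partial_{2,\omega} E(\mu)$ gives

$$\begin{aligned} \liminf_{\alpha \to 0} \frac{E(\mu_\alpha) - E(\mu)}{\alpha} &\ge \liminf_{\alpha \to 0} \frac{1}{\alpha} \int \langle \xi, t_\omega^{\mu_\alpha} - t_\omega^\mu \rangle \, d\omega \\ &= \liminf_{\alpha \to 0} \frac{1}{\alpha} \int \langle \xi, (1 - \alpha) t_\omega^\mu + \alpha t_\omega^\nu - t_\omega^\mu \rangle \, d\omega \\ &= \int \langle \xi, t_\omega^\nu - t_\omega^\mu \rangle \, d\omega. \end{aligned}$$

Sending $\alpha \to 0$ in equation (15) shows

$$E(\nu) - E(\mu) \ge \int \langle \xi, t_\omega^\nu - t_\omega^\mu \rangle \, d\omega + \frac{\lambda}{2} W_{2,\omega}^2(\mu, \nu).$$

COROLLARY 2.20. Given $E$ satisfying the conditions of Definition 2.16 with $\lambda \ge 0$, $\mu$ is a minimizer for $E$ if and only if $0 \in \partial_{2,\omega} E(\mu)$.

PROPOSITION 2.21 ($W_{2,\omega}$ subdifferential of $W_2^2(\omega, \cdot)$). The $W_{2,\omega}$ subdifferential of $W_2^2(\omega, \cdot)$ evaluated at $\mu$ contains the element $2(t_\omega^\mu - \mathrm{id})$.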

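Proof.

$$\begin{aligned} W_2^2(\omega, \nu) - W_2^2(\omega, \mu) &= \int |t_\omega^\nu - \mathrm{id}|^2 \, d\omega - \int |t_\omega^\mu - \mathrm{id}|^2 \, d\omega \\ &= \int |t_\omega^\nu - t_\omega^\mu|^2 + 2\langle t_\omega^\nu, t_\omega^\mu \rangle - 2\langle t_\omega^\nu, \mathrm{id} \rangle + 2\langle t_\omega^\mu, \mathrm{id} \rangle - 2|t_\omega^\mu|^2 \, d\omega \\ &= W_{2,\omega}^2(\mu, \nu) + \int 2\langle t_\omega^\nu, t_\omega^\mu - \mathrm{id} \rangle + 2\langle t_\omega^\mu, \mathrm{id} - t_\omega^\mu \rangle \, d\omega \\ &= W_{2,\omega}^2(\mu, \nu) + 2 \int \langle t_\omega^\nu - t_\omega^\mu, t_\omega^\mu - \mathrm{id} \rangle \, d\omega. \end{aligned}$$

By Proposition 2.19, this implies that $2(t_\omega^\mu - \mathrm{id}) \in \partial_{2,\omega} W_2^2(\omega, \mu)$.

Finally, if $E$ has a strong subdifferential (Definition 2.11), $E$ has a $W_{2,\omega}$ subdifferential.

LEMMA 2.22 (strong subdifferential vs. $W_{2,\omega}$ subdifferential). Given $E$ satisfying the conditions of Definition 2.11, if $\xi \in \partial E(\mu)$ is a strong subdifferential, then $\xi \circ t_\omega^\mu \in \partial_{2,\omega} E(\mu)$.

Proof. If $E$ has a strong subdifferential $\xi$ at $\mu$, $\xi \in L^2(\mu)$, hence $\xi \circ t_\omega^\mu \in L^2(\omega)$. Furthermore,

$$\begin{aligned} E(\nu) - E(\mu) &\ge \int_{\mathbb{R}^d} \langle \xi, t_\omega^\nu \circ t_\mu^\omega - \mathrm{id} \rangle \, d\mu + o(\| t_\omega^\nu \circ t_\mu^\omega - \mathrm{id} \|_{L^2(\mu)}) \\ &= \int_{\mathbb{R}^d} \langle \xi \circ t_\omega^\mu, t_\omega^\nu - t_\omega^\mu \rangle \, d\omega + o(W_{2,\omega}(\mu, \nu)) \quad \forall \nu. \end{aligned}$$

Therefore, $\xi \circ t_\omega^\mu \in \partial_{2,\omega} E(\mu)$.

2.6 Discrete Gradient Flow

Given a functional $E$, a time step $\tau > 0$, and $\mu, \nu \in P_{2,\omega_0}(\mathbb{R}^d)$, the quadratic perturbation of $E$ is

$$\Phi(\tau, \mu; \nu) := \frac{1}{2\tau} W_2^2(\mu, \nu) + E(\nu). \tag{16}$$

The proximal set $J_\tau : P_{2,\omega_0}(\mathbb{R}^d) \to 2^{P_{2,\omega_0}(\mathbb{R}^d)}$ corresponding to $E$ is

$$J_\tau(\mu) := \operatorname*{argmin}_{\nu \in P_{2,\omega_0}(\mathbb{R}^d)} \left\{ \frac{1}{2\tau} W_2^2(\mu, \nu) + E(\nu) \right\}. \tag{17}$$

Define $J_0(\mu) := \mu$. For the remainder of this section, we consider functionals that satisfy the convexity assumption 1.2. In order to jointly consider the cases $\lambda \ge 0$ and $\lambda < 0$, we follow [1] and define the negative part of $\lambda$ to be

$$\lambda^- = \begin{cases} -\lambda & \text{if } \lambda < 0, \\ 0 & \text{if } \lambda \ge 0. \end{cases}$$

In the case $\lambda \ge 0$, we interpret $\frac{1}{\lambda^-} = +\infty$.

Suppose $\mu \in \overline{D(E)}$ and $0 < \tau < \frac{1}{\lambda^-}$. (When $\lambda < 0$, the size restriction $0 < \tau < \frac{1}{\lambda^-}$ ensures that $0 < 1 + \lambda\tau < 1$.) Then there exists a unique element in $J_\tau(\mu)$ and the proximal map $J_\tau : \overline{D(E)} \to D(E) : \mu \mapsto \mu_\tau$ is continuous [1, Theorem 4.1.2].

In [1, Theorem 3.1.6], Ambrosio, Gigli, and Savaré unite the notions of subdifferential and proximal map through the following chain of inequalities. Recall that $|\partial E| : P_{2,\omega_0} \to \mathbb{R} \cup \{+\infty\}$ is the metric slope (see Definition 2.8).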

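THEOREM AGS1. Given $E$ satisfying convexity assumption 1.2 and $\mu \in D(|\partial E|)$ and $0 < \tau < \frac{1}{\lambda^-}$,

$$\frac{\tau^2}{2} |\partial E|^2(\mu_\tau) \le \frac{W_2^2(\mu, \mu_\tau)}{2} \le \frac{\tau}{1 + \lambda\tau} \left( E(\mu) - E(\mu_\tau) - \frac{1}{2\tau} W_2^2(\mu, \mu_\tau) \right) \le \frac{\tau^2}{2(1 + \lambda\tau)^2} |\partial E|^2(\mu). \tag{18}$$

The discrete gradient flow sequence with time step $\tau$ is constructed via repeated applications of the proximal map:

$$\mu_n = J_\tau(\mu_{n-1}), \qquad \mu_0 \in \overline{D(E)}.$$

We write $J_\tau^n$ to indicate $n$ repeated applications of the proximal map, so that $\mu_n = J_\tau^n \mu_0$.

2.7 Euler-Lagrange Equation

THEOREM 2.23 (Euler-Lagrange equation). Assume that $E$ satisfies assumptions 1.1 and 1.2 and $\omega \in D(E)$. Then for $0 < \tau < \frac{1}{\lambda^-}$, $\nu$ is the unique minimizer of the quadratic perturbation $\Phi(\tau, \omega; \cdot)$ if and only if

$$\frac{1}{\tau}(t_\nu^\omega - \mathrm{id}) \in \partial E(\nu) \text{ is a strong subdifferential.} \tag{19}$$

Hence, $\omega_\tau$ is characterized by the fact that $\frac{1}{\tau}(t_{\omega_\tau}^\omega - \mathrm{id}) \in \partial E(\omega_\tau)$.

We assume $\omega \in D(E)$ and $E$ satisfies domain assumption 1.1 to ease notation. See section 4.2 for how the assumption on $\omega$ can be relaxed to $\omega \in \overline{D(E)}$ and the domain assumption can be removed.

Proof of Theorem 2.23. The fact that $\nu$ minimizes $\Phi(\tau, \omega; \cdot)$ implies that $\frac{1}{\tau}(t_\nu^\omega - \mathrm{id}) \in \partial E(\nu)$ is a strong subdifferential is proved in [1, Lemma 10.1.2] using a type of argument introduced by Otto [11, 12].

To see the other direction, note that if $\frac{1}{\tau}(t_\nu^\omega - \mathrm{id}) \in \partial E(\nu)$ is a strong subdifferential then by Lemma 2.22, $\frac{1}{\tau}(\mathrm{id} - t_\omega^\nu) \in \partial_{2,\omega} E(\nu)$. Combining Remark 2.18 and Proposition 2.21 shows

$$\frac{1}{2\tau} \, 2(t_\omega^\nu - \mathrm{id}) + \frac{1}{\tau}(\mathrm{id} - t_\omega^\nu) = 0 \in \partial_{2,\omega} \Phi(\tau, \omega; \nu).$$

Since $W_2^2(\omega, \cdot) = W_{2,\omega}^2(\omega, \cdot)$ is 2-convex in the $W_{2,\omega}$ metric and $E$ is $\lambda$-convex in the $W_{2,\omega}$ metric, $\Phi(\tau, \omega; \cdot)$ is $\left( \frac{1}{\tau} + \lambda \right)$-convex in the $W_{2,\omega}$ metric, with $\left( \frac{1}{\tau} + \lambda \right) > 0$ when $0 < \tau < \frac{1}{\lambda^-}$. Therefore, by Corollary 2.20, $0 \in \partial_{2,\omega} \Phi(\tau, \omega; \nu)$ implies that $\nu$ minimizes $\Phi(\tau, \omega; \cdot)$.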

2.8 Discrete Variational Inequality

The notion of a discrete variational inequality was introduced in [1] to gain quantitative control over the discrete gradient flow for functionals that are convex along generalized geodesics. The starting point for this variational inequality in [1] is the observation that, if $E$ is convex along generalized geodesics, it satisfies [1, Assumption 4.0.1], which we recall for the reader's convenience:

ASSUMPTION 4.0.1 [1]. Given $\lambda \in \mathbb{R}$, for every choice of $\mu$, $\nu_0$, and $\nu_1 \in D(E)$ and $0 < \tau < \frac{1}{\lambda^-}$, there exists a curve $\nu_\alpha$, $\alpha \in [0, 1]$, such that $\nu \mapsto \Phi(\tau, \mu; \nu)$ is $\left( \frac{1}{\tau} + \lambda \right)$-convex on $\nu_\alpha$. In other words, the quadratic perturbation satisfies the inequality

$$\Phi(\tau, \mu; \nu_\alpha) \le (1 - \alpha)\Phi(\tau, \mu; \nu_0) + \alpha \Phi(\tau, \mu; \nu_1) - \frac{1}{2}\left( \frac{1}{\tau} + \lambda \right) \alpha(1 - \alpha) W_2^2(\nu_0, \nu_1). \tag{20}$$

The convexity of $\Phi(\tau, \mu; \cdot)$ implies the following discrete variational inequality [1, Theorem 4.1.2].

THEOREM AGS2. If [1, Assumption 4.0.1] holds and $E$ is proper, coercive, and lower semicontinuous, then for all $0 < \tau < \frac{1}{\lambda^-}$, $\mu \in \overline{D(E)}$, and $\nu \in D(E)$,

$$\frac{1}{2\tau} \left[ W_2^2(\mu_\tau, \nu) - W_2^2(\mu, \nu) \right] + \frac{\lambda}{2} W_2^2(\mu_\tau, \nu) \le E(\nu) - E(\mu_\tau) - \frac{1}{2\tau} W_2^2(\mu, \mu_\tau). \tag{21}$$

For our purposes, we require not only control of the Wasserstein metric along the discrete gradient flows, but also control over transport metrics along discrete gradient flows. Luckily, the convexity of $E$ along generalized geodesics implies something slightly stronger than [1, Assumption 4.0.1], and we are able to obtain the following slightly stronger notion of convexity of the quadratic perturbation.

In the next two theorems, we assume the base point $\mu \ll \mathcal{L}^d$ so that the transport metric $W_{2,\mu}$ is well defined by Definition 2.13. As before, this assumption is only for ease of notation, and we describe how to remove it in section 4.2.

THEOREM 2.24 (transport metric convexity of quadratic perturbation). Fix $\mu \in P_{2,\omega_0}(\mathbb{R}^d)$. If $E$ is $\lambda$-convex along generalized geodesics, then for $0 < \tau < \frac{1}{\lambda^-}$, $\nu \mapsto \Phi(\tau, \mu; \nu)$ is $\left( \frac{1}{\tau} + \lambda \right)$-convex on generalized geodesics with base point $\mu$.
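In other words, for every choice of $\nu_0, \nu_1 \in D(E)$, there exists a generalized geodesic $\nu_\alpha$ from $\nu_0$ to $\nu_1$ with base $\mu$ such that

$$\Phi(\tau, \mu; \nu_\alpha) \le (1 - \alpha)\Phi(\tau, \mu; \nu_0) + \alpha \Phi(\tau, \mu; \nu_1) - \frac{1}{2}\left( \frac{1}{\tau} + \lambda \right) \alpha(1 - \alpha) W_{2,\mu}^2(\nu_0, \nu_1). \tag{22}$$

Proof. By Proposition 2.15, $\frac{1}{2\tau} W_2^2(\mu, \cdot) = \frac{1}{2\tau} W_{2,\mu}^2(\mu, \cdot)$ is $\frac{1}{\tau}$-convex along all generalized geodesics with base $\mu$. Therefore, if $E$ is $\lambda$-convex along generalized geodesics, their sum $\Phi(\tau, \mu; \cdot)$ is $\left( \frac{1}{\tau} + \lambda \right)$-convex along generalized geodesics with base $\mu$.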

THEOREM 2.25 (discrete variational inequality). Suppose $E$ satisfies assumption 1.2. Then for all $\mu \in \overline{D(E)}$ and $\nu \in D(E)$,

$$\frac{1}{2\tau} \left[ W_{2,\mu}^2(\mu_\tau, \nu) - W_2^2(\mu, \nu) \right] + \frac{\lambda}{2} W_{2,\mu}^2(\mu_\tau, \nu) \le E(\nu) - E(\mu_\tau) - \frac{1}{2\tau} W_2^2(\mu, \mu_\tau)$$

or, equivalently,

$$(1 + \lambda\tau) W_{2,\mu}^2(\mu_\tau, \nu) - W_2^2(\mu, \nu) \le 2\tau \left[ E(\nu) - E(\mu_\tau) - \frac{1}{2\tau} W_2^2(\mu, \mu_\tau) \right].$$

Proof. The following proof is nearly identical to [1, Theorem 4.1.2 (ii)], except for the stronger convexity assumption on $\Phi$. By Theorem 2.24, there exists a generalized geodesic $\mu_\alpha$ from $\mu_\tau$ to $\nu$ with base point $\mu$ along which $\Phi(\tau, \mu; \cdot)$ satisfies inequality (22). Combining this with the fact that $\mu_\tau$ is the minimizer of $\Phi(\tau, \mu; \cdot)$ gives

$$\Phi(\tau, \mu; \mu_\tau) \le \Phi(\tau, \mu; \mu_\alpha) \le (1 - \alpha)\Phi(\tau, \mu; \mu_\tau) + \alpha \Phi(\tau, \mu; \nu) - \frac{1}{2}\left( \frac{1}{\tau} + \lambda \right) \alpha(1 - \alpha) W_{2,\mu}^2(\mu_\tau, \nu).$$

Rearranging and dividing by $\alpha$,

$$0 \le \Phi(\tau, \mu; \nu) - \Phi(\tau, \mu; \mu_\tau) - \frac{1}{2}\left( \frac{1}{\tau} + \lambda \right)(1 - \alpha) W_{2,\mu}^2(\mu_\tau, \nu).$$

Sending $\alpha \to 0$ and expanding $\Phi$ according to its definition gives the result.

3 Exponential Formula for the Wasserstein Metric

Given $E$ satisfying convexity assumption 1.2, we aim to show that, as the time step goes to zero, the discrete gradient flow converges to the continuous gradient flow:

$$\lim_{n \to \infty} J_{t/n}^n \mu = S(t)\mu. \tag{23}$$

The key difficulty in showing (23) is proving that the limit exists, which we accomplish by proving the sequence is Cauchy and using the fact that $W_2$ is complete [1, Prop 7.1.5]. First we consider initial data $\mu \in D(|\partial E|)$. In section 3.6, we extend our results to $\mu \in \overline{D(E)}$.

3.1 Almost Contraction Inequality

In this subsection, we use the discrete variational inequality Theorem AGS2 to prove an almost contraction inequality for the discrete gradient flow. (Theorem AGS2 is sufficient for this purpose; we use the stronger discrete variational inequality of Theorem 2.25 in a later section.) Our approach is similar to [4], though instead of symmetrizing the contraction inequality, we leave the inequality in an asymmetric form that is more compatible with the asymmetric induction in sections 3.3 and 3.4. The asymmetry is useful a second time when we consider gradient flow with initial conditions $\nu \in \overline{D(E)}$ (see section 3.6).
For the $\lambda \le 0$ case, we follow the proof of [1, Lemma 4.2.4]. For the $\lambda > 0$ case, we use a new approach. In this case, we rely on the fact that $\lambda > 0$ implies $E$ is bounded below [1, Lemma 2.4.8].

THEOREM 3.1 (almost contraction inequality). Suppose $E$ satisfies convexity assumption 1.2, $\mu \in D(|\partial E|)$, and $\nu \in D(E)$. If $\lambda > 0$, then for all $\tau > 0$,
\[
(1+\lambda\tau)^2\, W_2^2(\mu_\tau,\nu_\tau) \le W_2^2(\mu,\nu) + \tau^2 |\partial E|^2(\mu) + 2\lambda\tau^2\left[ E(\nu) - \inf E \right]. \tag{24}
\]

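If $\lambda \le 0$, then for all $0 < \tau < \frac{1}{\lambda^-}$,
\[
(1+\lambda\tau)^2\, W_2^2(\mu_\tau,\nu_\tau) \le W_2^2(\mu,\nu) + \tau^2 |\partial E|^2(\mu). \tag{25}
\]

When $\lambda > 0$, $(1+\lambda\tau)^2$ may be large. Consequently, it is not surprising that we must compensate with extra terms on the right hand side of (24) that are not needed when $\lambda \le 0$.

Proof. By Theorem AGS2, recalled for the reader's convenience in section 2.8,
\[
(1+\lambda\tau)\,W_2^2(\mu_\tau,\nu_\tau) - W_2^2(\mu,\nu_\tau) \le 2\tau\left( E(\nu_\tau) - E(\mu_\tau) - \frac{1}{2\tau} W_2^2(\mu,\mu_\tau) \right), \tag{26}
\]
\[
(1+\lambda\tau)\,W_2^2(\nu_\tau,\mu) - W_2^2(\nu,\mu) \le 2\tau\left( E(\mu) - E(\nu_\tau) - \frac{1}{2\tau} W_2^2(\nu,\nu_\tau) \right). \tag{27}
\]

Consider the case $\lambda > 0$. Dropping the $\frac{1}{2\tau}W_2^2(\nu,\nu_\tau)$ term from (27), dividing by $(1+\lambda\tau)$, and adding to (26) gives
\[
(1+\lambda\tau)\,W_2^2(\mu_\tau,\nu_\tau) - \frac{1}{1+\lambda\tau} W_2^2(\mu,\nu) \le 2\tau\left( E(\nu_\tau) - \frac{1}{1+\lambda\tau}E(\nu_\tau) + \frac{1}{1+\lambda\tau}E(\mu) - E(\mu_\tau) - \frac{1}{2\tau}W_2^2(\mu,\mu_\tau) \right),
\]
\[
(1+\lambda\tau)^2\,W_2^2(\mu_\tau,\nu_\tau) - W_2^2(\mu,\nu) \le 2\tau\left( (1+\lambda\tau)E(\nu_\tau) - E(\nu_\tau) + E(\mu) - (1+\lambda\tau)\left[ E(\mu_\tau) + \frac{1}{2\tau}W_2^2(\mu,\mu_\tau) \right] \right).
\]
Since $\lambda > 0$, $E$ is bounded below [1, Lemma 2.4.8]. Applying Theorem AGS1 and the fact that $E(\nu_\tau) \le E(\nu)$, we have
\[
(1+\lambda\tau)^2\,W_2^2(\mu_\tau,\nu_\tau) - W_2^2(\mu,\nu) \le 2\lambda\tau^2 E(\nu_\tau) + \frac{\tau^2}{1+\lambda\tau}|\partial E|^2(\mu) - 2\lambda\tau^2 \inf E \le \tau^2 |\partial E|^2(\mu) + 2\lambda\tau^2\left[ E(\nu) - \inf E \right],
\]
which gives the result.

Now consider the case $\lambda \le 0$. Adding (26) and (27) and then applying Theorem AGS1 gives
\[
(1+\lambda\tau)\,W_2^2(\mu_\tau,\nu_\tau) - W_2^2(\nu,\mu) + \lambda\tau\,W_2^2(\nu_\tau,\mu) \le 2\tau\left[ E(\mu) - E(\mu_\tau) - \frac{1}{2\tau}W_2^2(\mu,\mu_\tau) \right] - W_2^2(\nu,\nu_\tau) \le \frac{\tau^2}{1+\lambda\tau}|\partial E|^2(\mu) - W_2^2(\nu,\nu_\tau). \tag{28}
\]
Since for $a, b > 0$ and $0 < \epsilon < 1$, the convex function
\[
\varphi(\epsilon) := \frac{a^2}{\epsilon} + \frac{b^2}{1-\epsilon}
\]
has the minimum value $(a+b)^2$, attained at $\epsilon = a/(a+b)$, we have
\[
(a+b)^2 \le \frac{a^2}{\epsilon} + \frac{b^2}{1-\epsilon}.
\]
Consequently, with $\epsilon := \lambda^-\tau$ (so that $1-\epsilon = 1+\lambda\tau$, since $\lambda \le 0$), we obtain
\[
W_2^2(\nu_\tau,\mu) \le \left( W_2(\nu_\tau,\nu) + W_2(\nu,\mu) \right)^2 \le \frac{1}{\lambda^-\tau} W_2^2(\nu_\tau,\nu) + \frac{1}{1+\lambda\tau} W_2^2(\nu,\mu). \tag{29}
\]
Multiplying by $\lambda^-\tau$, summing with (28), multiplying the total by $(1+\lambda\tau)$, and using the fact that $\lambda^-\tau < 1$, we obtain
\[
(1+\lambda\tau)^2\,W_2^2(\mu_\tau,\nu_\tau) \le W_2^2(\mu,\nu) + \tau^2|\partial E|^2(\mu),
\]
which gives the result.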
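The almost contraction inequality can be illustrated numerically in the Euclidean setting described in the introduction, where the proximal map is $J_\tau = (\mathrm{id} + \tau\nabla E)^{-1}$ and distances are absolute values. The sketch below is only a hedged illustration of (24) and (25) on the real line: the potential $V(y) = y^4/4 + \lambda y^2/2$ (for which $\inf V = 0$ when $\lambda > 0$), the bisection solver `prox`, and the sample points are assumptions made for this example.

```python
def prox(x, tau, lam, iters=200):
    # Euclidean proximal map J_tau x for the hypothetical potential
    # V(y) = y**4/4 + lam*y**2/2: solve y + tau*V'(y) = x by bisection.
    # The left side is increasing in y whenever 1 + lam*tau > 0.
    def g(y):
        return y + tau * (y**3 + lam * y) - x
    r = abs(x) / (1.0 + lam * tau) + 1.0   # bracket: g(-r) < 0 < g(r)
    lo, hi = -r, r
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def slack(x, y, tau, lam):
    # Right-hand side minus left-hand side of (24) if lam > 0, of (25) otherwise.
    grad_sq = (x**3 + lam * x) ** 2   # |V'(x)|^2 plays the role of |dE|^2(mu)
    lhs = (1.0 + lam * tau) ** 2 * (prox(x, tau, lam) - prox(y, tau, lam)) ** 2
    rhs = (x - y) ** 2 + tau**2 * grad_sq
    if lam > 0.0:
        V_y = 0.25 * y**4 + 0.5 * lam * y**2   # V(y) - inf V, with inf V = 0
        rhs += 2.0 * lam * tau**2 * V_y
    return rhs - lhs

tau = 0.4   # satisfies tau < 1/lam_minus for every lam used below
slacks = [slack(x, y, tau, lam)
          for lam in (2.0, 0.0, -1.0)
          for x in (-1.5, 0.7, 2.0)
          for y in (-0.3, 1.1)]
print("smallest slack over all samples:", min(slacks))
```

In this one-dimensional example the proximal map is in fact $(1+\lambda\tau)^{-1}$-Lipschitz, so the error terms on the right hand side are not needed; the example only confirms that the stated inequalities are consistent with the Euclidean case.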

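3.2 Relation Between Proximal Maps with Different Time Steps

We now apply the Euler-Lagrange equation, Theorem 2.23, to prove a theorem relating the proximal map with a large time step $\tau$ to the proximal map with a small time step $h$. Assumption 1.1 is purely for notational convenience. See Theorem 4.1 for the general case.

THEOREM 3.2. Suppose $E$ satisfies assumptions 1.1 and 1.2. Then if $\mu \in D(E)$ and $0 < h \le \tau < \frac{1}{\lambda^-}$,
\[
J_\tau\mu = J_h\left[ \left( \frac{\tau-h}{\tau}\, t_\mu^{\mu_\tau} + \frac{h}{\tau}\,\mathrm{id} \right)\#\mu \right].
\]

COROLLARY 3.3. Under the assumptions of the previous theorem, if $\mu \in D(E)$, $n \ge 1$,
\[
J_\tau^n\mu = J_\tau(J_\tau^{n-1}\mu) = J_h\left[ \left( \frac{\tau-h}{\tau}\, t_{J_\tau^{n-1}\mu}^{J_\tau^n\mu} + \frac{h}{\tau}\,\mathrm{id} \right)\# J_\tau^{n-1}\mu \right].
\]

Proof of Theorem 3.2. By Theorem 2.23,
\[
\xi := \frac{1}{\tau}\left( t_{\mu_\tau}^{\mu} - \mathrm{id} \right) \in \partial E(\mu_\tau) \tag{30}
\]
is a strong subdifferential. Next, since $h/\tau \le 1$,
\[
(\mathrm{id} + h\xi) = \left( \mathrm{id} + \frac{h}{\tau}\left( t_{\mu_\tau}^{\mu} - \mathrm{id} \right) \right) = \left( \frac{\tau-h}{\tau}\,\mathrm{id} + \frac{h}{\tau}\, t_{\mu_\tau}^{\mu} \right)
\]
is cyclically monotone. Consequently, if we define $\nu := (\mathrm{id} + h\xi)\#\mu_\tau$, the optimal transport map is $t_{\mu_\tau}^{\nu} = \mathrm{id} + h\xi$. Rearranging shows $\frac{1}{h}\left( t_{\mu_\tau}^{\nu} - \mathrm{id} \right) = \xi \in \partial E(\mu_\tau)$, so by a second application of Theorem 2.23, $\mu_\tau = \nu_h$.

We now rewrite $\nu$ in terms of its push forward off of $\mu$ to obtain the result. By equation (30), $t_{\mu_\tau}^{\mu} = (\mathrm{id} + \tau\xi)$, so $(\mathrm{id} + h\xi) = \left( \frac{\tau-h}{\tau}\, t_\mu^{\mu_\tau} + \frac{h}{\tau}\,\mathrm{id} \right) \circ (\mathrm{id} + \tau\xi)$. Therefore,
\[
\nu = (\mathrm{id} + h\xi)\#\mu_\tau = \left( \left( \frac{\tau-h}{\tau}\, t_\mu^{\mu_\tau} + \frac{h}{\tau}\,\mathrm{id} \right) \circ (\mathrm{id} + \tau\xi) \right)\#\mu_\tau = \left( \frac{\tau-h}{\tau}\, t_\mu^{\mu_\tau} + \frac{h}{\tau}\,\mathrm{id} \right)\#\mu.
\]

After proving Theorem 3.2, we discovered another proof of the same result in [8, 9]. It is non-variational and quite different from the proof given above, and we hope our proof is of independent interest.

3.3 Asymmetric Recursive Inequality

The following inequality bounds the Wasserstein distance between discrete gradient flow sequences with different time steps in terms of a convex combination of earlier elements of the sequences, plus a small error term. The recursion of this inequality is asymmetric: the $(n, m)$th term is controlled in terms of the $(n-1, m-1)$th term and the $(n, m-1)$th term. A fundamental difference between Crandall and Liggett's asymmetric recursive inequality and Theorem 3.4 is that the former involves the distance while the latter involves the square distance.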
(This is a consequence of the fact that our contraction inequality Theorem 3.1 involves the square distance plus error terms.) Therefore, where Crandall and Liggett are able to use the triangle inequality, we have to use the convexity of the square transport metrics. Passing from the transport metrics back to the Wasserstein metric consumes the bulk of the proof.
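The identity of Theorem 3.2 has a transparent Euclidean counterpart: with $J_\tau = (\mathrm{id} + \tau\nabla E)^{-1}$ as in the introduction, the point $p = \frac{\tau-h}{\tau}J_\tau u + \frac{h}{\tau}u$ satisfies $p = J_\tau u + h\nabla E(J_\tau u)$, so $J_h p = J_\tau u$. The sketch below checks this numerically on the real line; the potential, the bisection solver `prox`, and the name `two_step` are assumptions introduced for illustration only.

```python
def prox(x, tau, lam, iters=200):
    # Euclidean proximal map J_tau x for the hypothetical potential
    # V(y) = y**4/4 + lam*y**2/2: solve y + tau*V'(y) = x by bisection.
    # The left side is increasing in y whenever 1 + lam*tau > 0.
    def g(y):
        return y + tau * (y**3 + lam * y) - x
    r = abs(x) / (1.0 + lam * tau) + 1.0   # bracket: g(-r) < 0 < g(r)
    lo, hi = -r, r
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def two_step(u, tau, h, lam):
    # Left and right sides of the Euclidean analogue of Theorem 3.2.
    j_tau = prox(u, tau, lam)
    # Counterpart of ((tau - h)/tau * t + h/tau * id) # mu: a point on the
    # segment between J_tau u and u.
    p = (tau - h) / tau * j_tau + h / tau * u
    return j_tau, prox(p, h, lam)

cases = [(2.0, 0.5, 0.25, 1.0), (-1.3, 0.8, 0.1, 0.0), (1.5, 0.5, 0.5, -1.0)]
for u, tau, h, lam in cases:
    large, small = two_step(u, tau, h, lam)
    print(u, tau, h, lam, abs(large - small))
```

Each case satisfies $0 < h \le \tau < 1/\lambda^-$, and the printed differences should be at the level of floating point error.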

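[Figure: lattice diagram with horizontal axis $n$ and vertical axis $m$, marking the points $(n,m)$, $(n-1,m-1)$, and $(n,m-1)$ that enter the recursion.]

To consider $\lambda \ge 0$ and $\lambda < 0$ jointly in the following theorem, we replace $\lambda$ by $-\lambda^-$: any function that is $\lambda$ convex is also $-\lambda^-$ convex.

THEOREM 3.4 (asymmetric recursive inequality). Suppose $E$ satisfies convexity assumption 1.2 and $\mu \in D(|\partial E|)$. If $0 < h \le \tau < \frac{1}{\lambda^-}$,
\[
(1-\lambda^- h)^2\, W_2^2(J_\tau^n\mu, J_h^m\mu) \le \frac{h}{\tau}(1-\lambda^-\tau)^{-1}\, W_2^2(J_\tau^{n-1}\mu, J_h^{m-1}\mu) + \frac{\tau-h}{\tau}\, W_2^2(J_\tau^n\mu, J_h^{m-1}\mu) + 2h^2 (1-\lambda^- h)^{-2m} |\partial E|^2(\mu).
\]

Proof. To simplify notation, we abbreviate $J_\tau^n\mu$ by $J^n$ and $J_h^m\mu$ by $J^m$. We also write $\mu^{J^{n-1}\to J^n}_{\frac{\tau-h}{\tau}} := \left( \frac{\tau-h}{\tau}\, t_{J^{n-1}}^{J^n} + \frac{h}{\tau}\,\mathrm{id} \right)\# J^{n-1}$ for the point at time $\frac{\tau-h}{\tau}$ on the geodesic from $J^{n-1}$ to $J^n$. First, note that
\[
(1-\lambda^- h)^2\, W_2^2(J^n, J^m) = (1-\lambda^- h)^2\, W_2^2\left( J_h\left( \mu^{J^{n-1}\to J^n}_{\frac{\tau-h}{\tau}} \right), J^m \right) \quad \text{by Theorem 3.2}
\]
\[
\le W_2^2\left( \mu^{J^{n-1}\to J^n}_{\frac{\tau-h}{\tau}}, J^{m-1} \right) + h^2 |\partial E|^2(J^{m-1}) \quad \text{by Theorem 3.1}
\]
\[
\le W_{2,J^{n-1}}^2\left( \mu^{J^{n-1}\to J^n}_{\frac{\tau-h}{\tau}}, J^{m-1} \right) + h^2 |\partial E|^2(J^{m-1}).
\]
By Proposition 2.5, the $W_{2,J^{n-1}}$ metric is convex along generalized geodesics with base $J^{n-1}$. In particular, it is convex along the geodesic $\mu^{J^{n-1}\to J^n}_{\alpha}$, which gives
\[
(1-\lambda^- h)^2\, W_2^2(J^n, J^m) \le \frac{h}{\tau}\, W_{2,J^{n-1}}^2(J^{n-1}, J^{m-1}) + \frac{\tau-h}{\tau}\, W_{2,J^{n-1}}^2(J^n, J^{m-1}) + h^2 |\partial E|^2(J^{m-1}). \tag{31}
\]
The first term on the right hand side coincides with the standard Wasserstein metric. To control the second term, we use the stronger version of the discrete variational inequality, Theorem 2.25. Specifically, replacing $(\mu, \nu)$ in Theorem 2.25 with $(J^{m-1}, J^n)$ and $(J^{n-1}, J^{m-1})$ gives
\[
(1-\lambda^- h)\, W_{2,J^{m-1}}^2(J^m, J^n) - W_2^2(J^{m-1}, J^n) \le 2h\left[ E(J^n) - E(J^m) - \frac{1}{2h} W_2^2(J^{m-1}, J^m) \right],
\]
\[
(1-\lambda^-\tau)\, W_{2,J^{n-1}}^2(J^n, J^{m-1}) - W_2^2(J^{n-1}, J^{m-1}) \le 2\tau\left[ E(J^{m-1}) - E(J^n) - \frac{1}{2\tau} W_2^2(J^{n-1}, J^n) \right].
\]
Multiplying the first inequality by $\tau$, the second inequality by $h$, adding them together, and then applying Theorem AGS1 gives
\[
\tau(1-\lambda^- h)\, W_{2,J^{m-1}}^2(J^m, J^n) + h(1-\lambda^-\tau)\, W_{2,J^{n-1}}^2(J^n, J^{m-1})
\]
\[
\le \tau W_2^2(J^{m-1}, J^n) + h W_2^2(J^{n-1}, J^{m-1}) + 2h\tau\left[ E(J^{m-1}) - E(J^m) - \frac{1}{2h} W_2^2(J^{m-1}, J^m) \right] - h W_2^2(J^{n-1}, J^n)
\]
\[
\le \tau W_2^2(J^{m-1}, J^n) + h W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau h^2}{1-\lambda^- h} |\partial E|^2(J^{m-1}) - h W_2^2(J^{n-1}, J^n). \tag{32}
\]
As in equation (29) we have
\[
\lambda^-\tau\, W_{2,J^{n-1}}^2(J^{m-1}, J^n) \le W_2^2(J^{n-1}, J^n) + \frac{\lambda^-\tau}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}).
\]
Multiplying this by $h$ and adding it to (32) gives
\[
\tau(1-\lambda^- h)\, W_{2,J^{m-1}}^2(J^m, J^n) + h\, W_{2,J^{n-1}}^2(J^n, J^{m-1}) \le \tau W_2^2(J^{m-1}, J^n) + \frac{h}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau h^2}{1-\lambda^- h} |\partial E|^2(J^{m-1}).
\]
Rearranging and dividing by $h$ gives the upper bound
\[
W_{2,J^{n-1}}^2(J^n, J^{m-1}) \le \frac{\tau}{h}\left( W_2^2(J^{m-1}, J^n) - (1-\lambda^- h)\, W_{2,J^{m-1}}^2(J^m, J^n) \right) + \frac{1}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau h}{1-\lambda^- h} |\partial E|^2(J^{m-1}). \tag{33}
\]
We now combine this with equation (31) to prove the theorem. Substituting (33) into (31) and using $W_2 \le W_{2,J^{m-1}}$ together with $(1-\lambda^- h) \ge (1-\lambda^- h)^2$ gives
\[
(1-\lambda^- h)^2\, W_2^2(J^n, J^m) \le \frac{h}{\tau}\, W_2^2(J^{n-1}, J^{m-1}) + h^2 |\partial E|^2(J^{m-1}) + \frac{\tau-h}{\tau}\left[ \frac{\tau}{h}\left( W_2^2(J^{m-1}, J^n) - (1-\lambda^- h)^2\, W_2^2(J^m, J^n) \right) + \frac{1}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau h}{1-\lambda^- h} |\partial E|^2(J^{m-1}) \right].
\]
Simplifying and rearranging,
\[
\frac{\tau}{h}(1-\lambda^- h)^2\, W_2^2(J^n, J^m) \le \left( \frac{h}{\tau} + \frac{\tau-h}{\tau}\,\frac{1}{1-\lambda^-\tau} \right) W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau-h}{h}\, W_2^2(J^n, J^{m-1}) + h^2 |\partial E|^2(J^{m-1}) + \frac{(\tau-h)h}{1-\lambda^- h} |\partial E|^2(J^{m-1}).
\]
Therefore,
\[
(1-\lambda^- h)^2\, W_2^2(J^n, J^m) \le \frac{h}{\tau}\,\frac{1-\lambda^- h}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau-h}{\tau}\, W_2^2(J^n, J^{m-1}) + \left[ \frac{h^3}{\tau} + \frac{(\tau-h)h^2}{\tau(1-\lambda^- h)} \right] |\partial E|^2(J^{m-1})
\]
\[
\le \frac{h}{\tau}\,\frac{1}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau-h}{\tau}\, W_2^2(J^n, J^{m-1}) + \frac{2h^2}{1-\lambda^- h} |\partial E|^2(J^{m-1}),
\]
since $0 < h \le \tau < \frac{1}{\lambda^-}$. Finally, applying Theorem AGS1 and the fact that $(1-\lambda^- h)^{-1} \le (1-\lambda^- h)^{-2}$ gives the result:
\[
(1-\lambda^- h)^2\, W_2^2(J^n, J^m) \le \frac{h}{\tau}\,\frac{1}{1-\lambda^-\tau}\, W_2^2(J^{n-1}, J^{m-1}) + \frac{\tau-h}{\tau}\, W_2^2(J^n, J^{m-1}) + 2h^2 (1-\lambda^- h)^{-2m} |\partial E|^2(\mu).
\]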
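To make the shape of the recursion concrete, the following sketch evaluates both sides of the asymmetric recursive inequality for a Euclidean example on the real line, where each discrete gradient flow sequence is a sequence of numbers and the metric is the absolute value. The double-well potential (with $\lambda = -1$, so $\lambda^- = 1$), the bisection solver `prox`, and the helper names are illustrative assumptions, not part of the paper.

```python
def prox(x, tau, lam, iters=200):
    # Euclidean proximal map J_tau x for the hypothetical potential
    # V(y) = y**4/4 + lam*y**2/2: solve y + tau*V'(y) = x by bisection.
    # The left side is increasing in y whenever 1 + lam*tau > 0.
    def g(y):
        return y + tau * (y**3 + lam * y) - x
    r = abs(x) / (1.0 + lam * tau) + 1.0   # bracket: g(-r) < 0 < g(r)
    lo, hi = -r, r
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def flow(u, step, count, lam):
    # Discrete gradient flow sequence [u, J u, J^2 u, ..., J^count u].
    seq = [u]
    for _ in range(count):
        seq.append(prox(seq[-1], step, lam))
    return seq

def recursion_slack(u, tau, h, lam, N=6):
    # Smallest value of (right side - left side) of the recursive
    # inequality over 1 <= n, m <= N.
    lam_minus = max(0.0, -lam)
    a = flow(u, tau, N, lam)   # a[n] = J_tau^n u
    b = flow(u, h, N, lam)     # b[m] = J_h^m u
    grad_sq = (u**3 + lam * u) ** 2
    worst = float("inf")
    for n in range(1, N + 1):
        for m in range(1, N + 1):
            lhs = (1.0 - lam_minus * h) ** 2 * (a[n] - b[m]) ** 2
            rhs = (h / tau / (1.0 - lam_minus * tau) * (a[n - 1] - b[m - 1]) ** 2
                   + (tau - h) / tau * (a[n] - b[m - 1]) ** 2
                   + 2.0 * h**2 * (1.0 - lam_minus * h) ** (-2 * m) * grad_sq)
            worst = min(worst, rhs - lhs)
    return worst

print(recursion_slack(1.5, 0.5, 0.25, -1.0))
print(recursion_slack(1.5, 0.5, 0.25, 1.0))
```

In the Euclidean case the inequality already follows from the Lipschitz bound on the proximal map and convexity of the squared distance, so this is a consistency check rather than a test of sharpness.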

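3.4 Inductive Bound

The following inductive bound follows the simplification of Crandall and Liggett's method introduced by Rasmussen in [3]. (See also [5].) One key difference is that, in the Banach space case, one works with the distance, rather than the square distance. While this complicated matters in the proof of Theorem 3.4, it simplifies the induction in Theorem 3.6.

We begin by proving a bound on the distance between the 0th and $n$th terms of the discrete gradient flow sequence.

LEMMA 3.5. Given $E$ as in Assumption 1.2 and $\mu \in D(|\partial E|)$, for all $0 < \tau < \frac{1}{\lambda^-}$,
\[
W_2(J_\tau^n\mu, \mu) \le n\tau\,(1-\lambda^-\tau)^{-n}\, |\partial E|(\mu).
\]

Proof. This follows from the triangle inequality, Theorem AGS1, and the inequalities $\frac{1}{1+\lambda\tau} \le \frac{1}{1-\lambda^-\tau}$ and $1 \le \frac{1}{1-\lambda^-\tau}$:
\[
W_2(J_\tau^n\mu, \mu) \le \sum_{i=1}^n W_2(J_\tau^i\mu, J_\tau^{i-1}\mu) \le \sum_{i=1}^n \frac{\tau}{1+\lambda\tau}\, |\partial E|(J_\tau^{i-1}\mu) \le \sum_{i=1}^n \frac{\tau}{(1+\lambda\tau)^i}\, |\partial E|(\mu) \le n\tau\,(1-\lambda^-\tau)^{-n}\, |\partial E|(\mu).
\]

THEOREM 3.6 (a Rasmussen type inductive bound). Suppose $E$ satisfies convexity assumption 1.2. Then if $\mu \in D(|\partial E|)$ and $0 < h \le \tau < \frac{1}{\lambda^-}$,
\[
W_2^2(J_\tau^n\mu, J_h^m\mu) \le \left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] (1-\lambda^-\tau)^{-2n} (1-\lambda^- h)^{-2m}\, |\partial E|^2(\mu). \tag{34}
\]

Proof. We proceed by induction. The base case, when either $n = 0$ or $m = 0$, follows from the linear growth estimate Lemma 3.5. We assume the inequality holds for $(n-1, m)$ and $(n, m)$ and show that this implies it holds for $(n, m+1)$. First, we apply the Asymmetric Recursive Inequality, Theorem 3.4,
\[
(1-\lambda^- h)^2\, W_2^2(J_\tau^n\mu, J_h^{m+1}\mu) \le \frac{h}{\tau}(1-\lambda^-\tau)^{-1}\, W_2^2(J_\tau^{n-1}\mu, J_h^m\mu) + \frac{\tau-h}{\tau}\, W_2^2(J_\tau^n\mu, J_h^m\mu) + 2h^2 (1-\lambda^- h)^{-2(m+1)}\, |\partial E|^2(\mu).
\]
Next, we divide by $(1-\lambda^- h)^2$ and apply the inductive hypothesis:
\[
W_2^2(J_\tau^n\mu, J_h^{m+1}\mu) \le \frac{h}{\tau}\left[ ((n-1)\tau - mh)^2 + \tau h m + 2\tau^2(n-1) \right] (1-\lambda^-\tau)^{-2(n-1)-1} (1-\lambda^- h)^{-2(m+1)}\, |\partial E|^2(\mu)
\]
\[
+ \frac{\tau-h}{\tau}\left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] (1-\lambda^-\tau)^{-2n} (1-\lambda^- h)^{-2(m+1)}\, |\partial E|^2(\mu) + 2h^2 (1-\lambda^- h)^{-2(m+1)-2}\, |\partial E|^2(\mu).
\]
To control the first term, note that $(1-\lambda^-\tau)^{-2(n-1)-1} = (1-\lambda^-\tau)^{-2n+1} \le (1-\lambda^-\tau)^{-2n}$ and
\[
\left[ ((n-1)\tau - mh)^2 + \tau h m + 2\tau^2(n-1) \right] = \left[ (n\tau - mh)^2 - 2\tau(n\tau - mh) + \tau^2 + \tau h m + 2\tau^2(n-1) \right].
\]
To control the third term, note that since $0 < h \le \tau < \frac{1}{\lambda^-}$, $(1-\lambda^- h)^{-2} \le (1-\lambda^-\tau)^{-2} \le (1-\lambda^-\tau)^{-2n}$.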

Using these estimates, we may group together the three terms and obtain the following bound:
\[
W_2^2(J_\tau^n\mu, J_h^{m+1}\mu) \le \left\{ \frac{h}{\tau}\left[ (n\tau - mh)^2 - 2\tau(n\tau - mh) + \tau^2 + \tau h m + 2\tau^2(n-1) \right] + \frac{\tau-h}{\tau}\left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] + 2h^2 \right\} (1-\lambda^-\tau)^{-2n} (1-\lambda^- h)^{-2(m+1)}\, |\partial E|^2(\mu).
\]
We now consider the convex combination (plus an additional $2h^2$ term) within the brackets:
\[
\frac{h}{\tau}\left[ (n\tau - mh)^2 - 2\tau(n\tau - mh) + \tau^2 + \tau h m + 2\tau^2(n-1) \right] + \frac{\tau-h}{\tau}\left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] + 2h^2
\]
\[
= \frac{h}{\tau}\left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] + \frac{\tau-h}{\tau}\left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] + \frac{h}{\tau}\left[ -2\tau(n\tau - mh) - \tau^2 \right] + 2h^2
\]
\[
= \left[ (n\tau - mh)^2 + \tau h m + 2\tau^2 n \right] - 2(n\tau - mh)h - \tau h + 2h^2
\]
\[
= (n\tau - mh)^2 - 2(n\tau - mh)h + \tau h m - \tau h + 2\tau^2 n + 2h^2
\]
\[
= (n\tau - (m+1)h)^2 + h^2 + \tau h (m+1) - 2\tau h + 2\tau^2 n
\]
\[
\le (n\tau - (m+1)h)^2 + \tau h (m+1) + 2\tau^2 n,
\]
where the last step uses $h^2 \le 2\tau h$. Therefore,
\[
W_2^2(J_\tau^n\mu, J_h^{m+1}\mu) \le \left[ (n\tau - (m+1)h)^2 + \tau h (m+1) + 2\tau^2 n \right] (1-\lambda^-\tau)^{-2n} (1-\lambda^- h)^{-2(m+1)}\, |\partial E|^2(\mu).
\]

3.5 Exponential Formula for the Wasserstein Metric

We now combine our previous results to prove the exponential formula for the Wasserstein metric.

THEOREM 3.7 (exponential formula). Suppose $E$ satisfies convexity assumption 1.2. For $\mu \in D(|\partial E|)$, $t \ge 0$, the discrete gradient flow sequence $J^n_{t/n}\mu$ converges as $n \to \infty$. Denote the limit by $S(t)\mu$. The convergence is uniform in $t$ on compact subsets of $[0, +\infty)$, and when $n \ge 2\lambda^- t$, the distance between $J^n_{t/n}\mu$ and $S(t)\mu$ is bounded by
\[
W_2(J^n_{t/n}\mu, S(t)\mu) \le \sqrt{3}\,\frac{t}{\sqrt{n}}\, e^{3\lambda^- t}\, |\partial E|(\mu). \tag{35}
\]

REMARK 3.8 (range of $S(t)$). Given $\mu \in D(|\partial E|)$, we may use the fact that $|\partial E|$ is lower semicontinuous [1, Corollary 2.4.10] and Theorem AGS1 to conclude
\[
|\partial E|(S(t)\mu) \le \liminf_{n\to\infty} |\partial E|(J^n_{t/n}\mu) \le \liminf_{n\to\infty} (1-\lambda^- t/n)^{-n}\, |\partial E|(\mu) = e^{\lambda^- t}\, |\partial E|(\mu).
\]
Therefore, $S(t)\mu \in D(|\partial E|)$.

We have shown $W_2(J^n_{t/n}\mu, S(t)\mu) \le O(n^{-1/2})$, which agrees with the rate Crandall and Liggett obtained in a Banach space [6]. By a different method, Ambrosio, Gigli, and Savaré showed $W_2(J^n_{t/n}\mu, S(t)\mu) \le O(n^{-1})$ [1, Theorem 4.0.4], which agrees with the optimal rate in a Hilbert space [4]. Our rate improves upon the rate obtained by Clément and Desch [5], $d(J^n_{t/n}\mu, S(t)\mu) \le O(n^{-1/4})$, though they considered the more general case of gradient flow on a metric space $(X, d)$ satisfying [1, Assumption 4.0.1]. (See section 2.8 for the role this assumption played in our own proof.) Though we do not obtain the optimal rate of convergence, we demonstrate that Crandall and Liggett's approach extends to the Wasserstein metric, providing a simple and robust route to the exponential formula and properties of continuous gradient flow. This brings together the Banach space theory with the Wasserstein theory, and it is hoped that this method will help extend the abstract theory of Wasserstein gradient flow to a broader class of functionals.

REMARK 3.9 (varying time steps). In fact, for any partition of the interval $[0, t]$ into $n$ time steps $\tau_1, \dots, \tau_n$, the corresponding discrete gradient flow with varying time steps $\prod_{i=1}^n J_{\tau_i}\mu$ converges to $S(t)\mu$ as the maximum step size goes to zero. See section 4.1.

Our estimates lead to a simple proof of the fact that $S(t)\mu$ is a $\lambda$-contracting semigroup, as originally shown in [1, Proposition 4.3.1].

THEOREM 3.10 ($S(t)$ is a $\lambda$-contracting semigroup). Given $E$ satisfying convexity assumption 1.2, the function
\[
S(t) : D(|\partial E|) \to D(|\partial E|) : \mu \mapsto S(t)\mu
\]
on $[0, +\infty)$ is a $\lambda$-contracting semigroup, i.e.
(i) $\lim_{t\to 0} S(t)\mu = S(0)\mu = \mu$
(ii) $S(t+s)\mu = S(t)S(s)\mu$ for $t, s \ge 0$
(iii) $W_2(S(t)\mu, S(t)\nu) \le e^{-\lambda t}\, W_2(\mu, \nu)$

Next, we apply the semigroup property (ii) to conclude that $E(S(t)\mu)$ is non-increasing.

COROLLARY 3.11. For all $\mu \in D(|\partial E|)$, $E(S(t)\mu)$ is non-increasing for $t \in [0, +\infty)$.

Combining the previous results, we prove that $S(t)$ is the continuous gradient flow, in the sense of Definition 1.3.

THEOREM 3.12 ($S(t)\mu$ is the continuous gradient flow). Given $E$ satisfying convexity assumption 1.2 and $\mu \in D(|\partial E|)$, $S(t)\mu$ is the continuous gradient flow for $E$ with initial conditions $\mu$. Furthermore,
\[
W_2(S(t)\mu, S(s)\mu) \le |t-s|\, e^{\lambda^- t} e^{\lambda^- s}\, |\partial E|(\mu), \tag{36}
\]
so $S(t)\mu$ is locally Lipschitz on $[0, +\infty)$.

Finally, we use our method to give a simple proof of the energy dissipation inequality, which shows the regularizing effect of the gradient flow.
COROLLARY 3.13 (Energy Dissipation Inequality). Given $E$ satisfying convexity assumption 1.2 and $\mu \in D(|\partial E|)$, for all $t_0, t_1 \ge 0$,
\[
\int_{t_0}^{t_1} |\partial E|^2(S(s)\mu)\, ds \le E(S(t_0)\mu) - E(S(t_1)\mu).
\]

We now turn to the proofs of these results.
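Before the proofs, the convergence rate in the exponential formula can be illustrated in the simplest Euclidean example, $E(x) = \lambda x^2/2$ on the real line, where everything is explicit: the proximal map is $J_\tau x = x/(1+\lambda\tau)$ and the continuous gradient flow is $u(t) = e^{-\lambda t}u$. The sketch below compares the actual error with the right hand side of (35); the choice of energy, the function names, and the parameter values are assumptions made for illustration.

```python
import math

def discrete_flow(u, t, n, lam):
    # n implicit Euler steps of size t/n for E(x) = lam*x**2/2,
    # using the closed form J_tau x = x / (1 + lam*tau).
    return u / (1.0 + lam * t / n) ** n

def error_and_bound(u, t, n, lam):
    # Actual error |J^n_{t/n} u - u(t)| and the right-hand side of (35),
    # with |E'(u)| = |lam*u| playing the role of the slope.
    lam_minus = max(0.0, -lam)
    exact = math.exp(-lam * t) * u
    err = abs(discrete_flow(u, t, n, lam) - exact)
    bound = (math.sqrt(3.0) * t / math.sqrt(n)
             * math.exp(3.0 * lam_minus * t) * abs(lam * u))
    return err, bound

# lam = -1 gives lam_minus = 1, so (35) requires n >= 2*lam_minus*t = 2.
results = [(n, *error_and_bound(1.0, 1.0, n, -1.0)) for n in (2, 4, 8, 16, 32)]
for n, err, bound in results:
    print(n, err, bound)
```

For this smooth quadratic energy the observed error shrinks faster than the guaranteed $n^{-1/2}$ rate, which is consistent with the discussion above of the sharper $O(n^{-1})$ rate known in the Hilbert space setting.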

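Proof of Theorem 3.7. By Theorem 3.6, for fixed $t \ge 0$, if we define $\tau := \frac{t}{n}$, $h := \frac{t}{m}$, with $m \ge n > 2t\lambda^-$, so $0 \le h \le \tau < \frac{1}{2\lambda^-}$,
\[
W_2^2(J^n_{t/n}\mu, J^m_{t/m}\mu) \le \frac{3t^2}{n} (1-\lambda^- t/n)^{-2n} (1-\lambda^- t/m)^{-2m}\, |\partial E|^2(\mu) \le \frac{3t^2}{n}\, e^{8\lambda^- t}\, |\partial E|^2(\mu). \tag{37}
\]
In the second inequality, we use that $(1-\alpha)^{-1} \le e^{2\alpha}$ for $\alpha \in [0, 1/2]$. Thus, the sequence $J^n_{t/n}\mu$ is Cauchy, and $\lim_{n\to\infty} J^n_{t/n}\mu$ exists. The estimate (37) shows that the convergence is uniform in $t$ on compact subsets of $[0, +\infty)$. If $S(t)\mu$ denotes the limit, then sending $m \to \infty$ in the first inequality of (37) gives the error estimate
\[
W_2^2(J^n_{t/n}\mu, S(t)\mu) \le \frac{3t^2}{n}\, e^{6\lambda^- t}\, |\partial E|^2(\mu). \tag{38}
\]

Proof of Theorem 3.10. (i) follows from Lemma 3.5, since
\[
W_2(S(t)\mu, \mu) = \lim_{n\to\infty} W_2(J^n_{t/n}\mu, \mu) \le \lim_{n\to\infty} n\,\frac{t}{n}\, (1-\lambda^- t/n)^{-n}\, |\partial E|(\mu) = t e^{\lambda^- t}\, |\partial E|(\mu) \xrightarrow{t\to 0} 0.
\]

We now turn to the contraction property (iii). Our proof of the $\lambda > 0$ case is new, using the almost contraction inequality, Theorem 3.1. For completeness, we recall the proof of [1, Proposition 4.3.1], which shows the $\lambda \le 0$ case. Iterating the contraction inequality from Theorem 3.1 for $\lambda > 0$ and applying Theorem AGS1,
\[
W_2^2(J^n_{t/n}\mu, J^n_{t/n}\nu) \le (1+\lambda(t/n))^{-2n}\, W_2^2(\mu,\nu) + \sum_{i=1}^n (t/n)^2 (1+\lambda(t/n))^{-2i} \left( |\partial E|^2(J^{n-i}_{t/n}\mu) + 2\lambda\left[ E(J^{n-i}_{t/n}\nu) - \inf E \right] \right) \tag{39}
\]
\[
\le (1+\lambda(t/n))^{-2n}\, W_2^2(\mu,\nu) + n(t/n)^2 \left( |\partial E|^2(\mu) + 2\lambda\left[ E(\nu) - \inf E \right] \right). \tag{40}
\]
Likewise, for $\lambda \le 0$, $n > t\lambda^-$, we have
\[
W_2^2(J^n_{t/n}\mu, J^n_{t/n}\nu) \le (1+\lambda(t/n))^{-2n}\, W_2^2(\mu,\nu) + \sum_{i=1}^n (t/n)^2 (1+\lambda(t/n))^{-2i}\, |\partial E|^2(J^{n-i}_{t/n}\mu)
\]
\[
\le (1+\lambda(t/n))^{-2n}\, W_2^2(\mu,\nu) + n(t/n)^2 (1+\lambda(t/n))^{-2n}\, |\partial E|^2(\mu). \tag{41}
\]
Sending $n \to \infty$ in both cases shows
\[
W_2^2(S(t)\mu, S(t)\nu) \le e^{-2\lambda t}\, W_2^2(\mu,\nu).
\]

We now prove the semigroup property (ii). First, we show that $S(t)^m\mu = S(mt)\mu$ for fixed $m \in \mathbb{N}$. To consider $\lambda \ge 0$ and $\lambda < 0$ jointly, we replace $\lambda$ by $-\lambda^- \le 0$, since any function that is $\lambda$ convex is also $-\lambda^-$ convex. First, note that
\[
W_2(S(t)^m\mu, (J^n_{t/n})^m\mu) = W_2(S(t)^m\mu, J^n_{t/n}(J^n_{t/n})^{m-1}\mu) \le W_2(S(t)^m\mu, J^n_{t/n}S(t)^{m-1}\mu) + W_2(J^n_{t/n}S(t)^{m-1}\mu, J^n_{t/n}(J^n_{t/n})^{m-1}\mu). \tag{42}
\]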

Remark 3.8 ensures $S(t)^{m-1}\mu \in D(|\partial E|)$, so by Theorem 3.7, $J^n_{t/n}S(t)^{m-1}\mu \to S(t)^m\mu$ as $n \to \infty$. Consequently, we may choose $n$ large enough so that the first term is arbitrarily small for fixed $m \in \mathbb{N}$. We bound the second term in (42) using (41). By Remark 3.8, $|\partial E|^2(S(t)^{m-1}\mu) \le e^{2(m-1)\lambda^- t}\, |\partial E|^2(\mu)$. Therefore,
\[
W_2^2(J^n_{t/n}S(t)^{m-1}\mu, J^n_{t/n}(J^n_{t/n})^{m-1}\mu) \le (1-\lambda^-(t/n))^{-2n}\, W_2^2(S(t)^{m-1}\mu, (J^n_{t/n})^{m-1}\mu) + n(t/n)^2 e^{2(m-1)\lambda^- t} (1-\lambda^-(t/n))^{-2n}\, |\partial E|^2(\mu).
\]
Thus, taking square roots of both sides and combining with (42) shows that for all $\epsilon > 0$, there exists $n$ large enough so that
\[
W_2(S(t)^m\mu, (J^n_{t/n})^m\mu) \le \epsilon + e^{4\lambda^- t}\, W_2(S(t)^{m-1}\mu, (J^n_{t/n})^{m-1}\mu).
\]
Iterating this shows that for $n$ large enough
\[
W_2(S(t)^m\mu, (J^n_{t/n})^m\mu) \le \epsilon \left( \sum_{i=0}^{m-1} e^{4i\lambda^- t} \right) + e^{4m\lambda^- t}\, W_2(\mu,\mu) \le \epsilon\, m\, e^{4(m-1)\lambda^- t}. \tag{43}
\]
We now apply this to show $S(t)^m\mu = S(mt)\mu$. By the triangle inequality,
\[
W_2(S(t)^m\mu, S(mt)\mu) \le W_2(S(t)^m\mu, (J^n_{t/n})^m\mu) + W_2((J^n_{t/n})^m\mu, S(mt)\mu).
\]
The first term can be made arbitrarily small by (43). Since $W_2((J^n_{t/n})^m\mu, S(mt)\mu) = W_2(J^{nm}_{tm/nm}\mu, S(mt)\mu)$, by Theorem 3.7 we may choose $n$ large so the second term is arbitrarily small. Therefore, $S(t)^m\mu = S(mt)\mu$. This shows that for any $l, k, r, s \in \mathbb{N}$,
\[
S\left( \frac{l}{k} + \frac{r}{s} \right)\mu = S\left( \frac{ls + rk}{ks} \right)\mu = \left[ S\left( \frac{1}{ks} \right) \right]^{ls+rk}\mu = \left[ S\left( \frac{1}{ks} \right) \right]^{ls} \left[ S\left( \frac{1}{ks} \right) \right]^{rk}\mu = S\left( \frac{l}{k} \right) S\left( \frac{r}{s} \right)\mu.
\]
Since $S(t)\mu$ is continuous in $t \in [0, +\infty)$, $S(t+s)\mu = S(t)S(s)\mu$ for all $t, s \ge 0$.

Proof of Corollary 3.11. For $t \ge 0$, the lower semicontinuity of $E$ and definition of the proximal map (7) imply
\[
E(S(t)\mu) \le \liminf_{n\to\infty} E(J^n_{t/n}\mu) \le \liminf_{n\to\infty} E(\mu) = E(\mu).
\]
The result then follows from the semigroup property, Theorem 3.10 (ii).

Proof of Theorem 3.12. First, we show that $S(t)\mu$ is locally Lipschitz continuous in $t$. Given $t, s \ge 0$, define $\tau := \frac{t}{n}$, $h := \frac{s}{m}$ for $m$ and $n$ large enough so that $0 \le h \le \tau < \frac{1}{\lambda^-}$. By Theorem 3.6,
\[
W_2^2(J^n_{t/n}\mu, J^m_{s/m}\mu) \le \left[ (t-s)^2 + \frac{ts}{n} + \frac{2t^2}{n} \right] (1-\lambda^- t/n)^{-2n} (1-\lambda^- s/m)^{-2m}\, |\partial E|^2(\mu). \tag{44}
\]
Sending $n, m \to \infty$ and taking the square root of both sides gives
\[
W_2(S(t)\mu, S(s)\mu) \le |t-s|\, e^{\lambda^- t} e^{\lambda^- s}\, |\partial E|(\mu). \tag{45}
\]

We now turn to the proof that $S(t)\mu$ is the continuous gradient flow for $E$ with initial conditions $\mu$ in the sense of Definition 1.3. We already showed $S(t)\mu \to \mu$ as $t \to 0$ in part (i) of Theorem 3.10, so it remains to show that $S(t)\mu$ satisfies (6).
