Optimal Transport: A Crash Course Soheil Kolouri and Gustavo K. Rohde HRL Laboratories, University of Virginia
Introduction
What is Optimal Transport? The optimal transport problem seeks the most efficient way of transforming one distribution of mass to another.
What is Optimal Transport? The problem was originally studied by Gaspard Monge in the 18 th century.
What is Optimal Transport? I Minimizing the transportation cost: Heavy mass shouldn t be transported far! S. Kolouri and G. K. Rohde OT Crash Course
Monge Formulation
Monge formulation and Transport Maps: A map, f : X Y, for measures µ and ν defined on spaces X and Y is called a transport map iff, f 1 (A) I 0 (x)dx = I 1 (y)dy (1) A
Monge formulation and Transport Maps: A map, f : X Y, for measures µ and ν defined on spaces X and Y is called a transport map iff, f 1 (A) I 0 (x)dx = I 1 (y)dy (1) A Find the optimal transport map f : X Y that minimizes the expected cost of transportation, M(µ, ν) = inf f MP c(x, f(x))i 0 (x)dx (2) X
Monge formulation and Transport Maps: In the majority of engineering applications the cost is the Euclidean distance, M(µ, ν) = inf f MP x f(x) 2 I 0 (x)dx X MP = {f : X Y I 0 (x)dx = I 1 (y)dy} (3) f 1 (A) A Note that the objective function and the constraint in Eq. (3) are both nonlinear with respect to f.
Monge formulation and Transport Maps: In the majority of engineering applications the cost is the Euclidean distance, M(µ, ν) = inf f MP x f(x) 2 I 0 (x)dx X MP = {f : X Y I 0 (x)dx = I 1 (y)dy} (3) f 1 (A) A Note that the objective function and the constraint in Eq. (3) are both nonlinear with respect to f. When f exists and it is differentiable, above constraint can be written in differential form as, MP = {f : X Y det(df(x))i 1 (f(x)) = I 0 (x), x X} (4)
A Transport Map May Not Exist: A transport map, f, exists only if µ is an absolutely continuous measure with compact support and c(x, f(x)) is convex. Here an example where the transport map does not exists:
A Transport Map May Not Exist: A transport map, f, exists only if µ is an absolutely continuous measure with compact support and c(x, f(x)) is convex. Here an example where the transport map does not exists: Monge formulation is not suitable for analyzing point cloud distributions or any particle like distributions. This is when Kantorovich s formulation comes to rescue!
Kantorovich Formulation
Kantorovich formulation and Transport Plans: Working on optimal allocation of scarce resources, Kantorovich revisited the optimal transport problem in 1942.
Kantorovich formulation and Transport Plans: Working on optimal allocation of scarce resources, Kantorovich revisited the optimal transport problem in 1942.
Kantorovich formulation and Transport Plans: A transport plan between measures µ and ν defined on X and Y is a probability measure γ X Y with marginals, γ(x, A) = ν(a), γ(b, Y ) = µ(b) (5)
Kantorovich formulation and Transport Plans: A transport plan between measures µ and ν defined on X and Y is a probability measure γ X Y with marginals, γ(x, A) = ν(a), γ(b, Y ) = µ(b) (5) Kantorovich then formulated the transport problem as finding the transport plan that minimizes the expected cost with respect to γ, K(µ, ν) = min γ Γ(µ,ν) c(x, y)dγ(x, y) X Y Γ(µ, ν) = {γ γ(a, Y ) = µ(a), γ(x, B) = ν(b)} (6)
Kantorovich formulation and Transport Plans: A transport plan between measures µ and ν defined on X and Y is a probability measure γ X Y with marginals, γ(x, A) = ν(a), γ(b, Y ) = µ(b) (5) Kantorovich then formulated the transport problem as finding the transport plan that minimizes the expected cost with respect to γ, K(µ, ν) = min γ Γ(µ,ν) c(x, y)dγ(x, y) X Y Γ(µ, ν) = {γ γ(a, Y ) = µ(a), γ(x, B) = ν(b)} (6) Note that unlike Monge s formulation, the objective function and the constraints in Kantorvich s formulation are linear with respect to γ.
Discrete Kantorovich formulation (Earth Mover s Distance) Let µ = N i=1 p iδ xi and µ = M j=1 q jδ yj, where δ xi is a Dirac measure, K(µ, ν) = min γ c(x i, y j )γ ij i j s.t. γ ij = p i, γ ij = q j, γ ij 0 (7) j i
Kantorovich vs. Monge The following relationship holds between Monge s and Kantorovich s formulation, K(µ, ν) M(µ, ν) (8) When an optimal transport map exists, f : X Y, the optimal transport plan and the optimal transport map are related through, X Y c(x, y)dγ(x, y) = c(x, f(x))dµ(x) (9) X
Existence and uniqueness Brenier s theorem Let c(x, y) = x y 2 and let µ be absolutely continuous with respect to the Lebesgue measure. Then, there exists a unique optimal transport map f : X Y such that, f 1 (A) which is characterized as, dµ(x) = dν(y) A f(x) = x ψ(x) = ( 1 2 x 2 ψ(x)) }{{} φ(x) (10) for some concave scalar function ψ. In other words, f is the gradient of a convex scalar function φ, and therefore it is curl free.
Transport-Based Metrics
p-wasserstein distance Let P p(ω) be the set of Borel probability measures with finite p th moment defined on a given metric space (Ω, d). The p-wasserstein metric, W p, for p 1 on P p(ω) is then defined as the optimal transport problem with the cost function c(x, y) = d p (x, y). Let µ and ν be in P p(ω), then, ( 1 W p(µ, ν) = min d p p (x, y)dγ(x, y)) (11) γ Γ(µ,ν) Ω Ω
p-wasserstein distance Let P p(ω) be the set of Borel probability measures with finite p th moment defined on a given metric space (Ω, d). The p-wasserstein metric, W p, for p 1 on P p(ω) is then defined as the optimal transport problem with the cost function c(x, y) = d p (x, y). Let µ and ν be in P p(ω), then, ( 1 W p(µ, ν) = min d p p (x, y)dγ(x, y)) (11) γ Γ(µ,ν) Ω Ω or equivalently when the optimal transport map, f, exists, ( 1 W p(µ, ν) = min d p p (x, f(x))dµ(x)). (12) f MP Ω
p-wasserstein distance Let P p(ω) be the set of Borel probability measures with finite p th moment defined on a given metric space (Ω, d). The p-wasserstein metric, W p, for p 1 on P p(ω) is then defined as the optimal transport problem with the cost function c(x, y) = d p (x, y). Let µ and ν be in P p(ω), then, ( 1 W p(µ, ν) = min d p p (x, y)dγ(x, y)) (11) γ Γ(µ,ν) Ω Ω or equivalently when the optimal transport map, f, exists, ( 1 W p(µ, ν) = min d p p (x, f(x))dµ(x)). (12) f MP Ω In most engineering applications Ω R d and d(x, y) = x y.
p-wasserstein for 1D probability measures For absolutely continuous one-dimensional probability measures µ and ν on R with positive probability density functions I 0 and I 1, the optimal transport map has a closed form solution.
p-wasserstein for 1D probability measures For absolutely continuous one-dimensional probability measures µ and ν on R with positive probability density functions I 0 and I 1, the optimal transport map has a closed form solution. Let F µ and F ν be the cumulative distribution functions, { Fµ(x) = µ((, x)) F ν(y) = ν((, y)) (13)
p-wasserstein for 1D probability measures For absolutely continuous one-dimensional probability measures µ and ν on R with positive probability density functions I 0 and I 1, the optimal transport map has a closed form solution. Let F µ and F ν be the cumulative distribution functions, { Fµ(x) = µ((, x)) F ν(y) = ν((, y)) (13) In one dimension, there only exists one monotonically increasing transport map f : R R such that f MP, and it is defined as, f(x) := min{t R : F ν(t) F µ(x)}. (14) or equivalently f(x) = Fν 1 F µ(x).
p-wasserstein distance for 1D probability measures 1 W p(µ, ν) = ( F 1 1 µ (t) Fν (t) p dt) p 1 (15) 0 Figure: Note that, the Euclidean distance does not provide a sensible distance between I 0, I 1 and I 2 while the p-wasserstein distance does.
p-wasserstein distance for 1D probability measures 1 W p(µ, ν) = ( F 1 1 µ (t) Fν (t) p dt) p 1 (16) 0 Figure: Note that, the Euclidean distance does not provide a sensible distance between I 0, I 1 and I 2 while the p-wasserstein distance does.
Sliced p-wasserstein distance Slice an n-dimensional probability distribution (n > 1) into one-dimensional representations through projections and measure p-wasserstein distance between these representations.
Sliced p-wasserstein distance Slice an n-dimensional probability distribution (n > 1) into one-dimensional representations through projections and measure p-wasserstein distance between these representations. Where R denotes Radon transform and is defined as, RI(t, θ) := S d 1 I(x)δ(t θ x)dx t R, θ S d 1 (Unit sphere in R d ) (17)
Sliced p-wasserstein distance Slice an n-dimensional probability distribution (n > 1) into one-dimensional representations through projections and measure p-wasserstein distance between these representations. Where R denotes Radon transform and is defined as, RI(t, θ) := S d 1 I(x)δ(t θ x)dx t R, θ S d 1 (Unit sphere in R d ) (17) and the p-sliced-wasserstein (p-sw) distance is defined as: ( ) 1 SW p(i 0, I 1 ) = W p S d 1 p (RI p 0(., θ), RI 1 (., θ))dθ (18)
Geometric Properties
2-Wasserstein geodesics The set of continuous measures together with the 2-Wasserstein metric forms a Riemmanian manifold. Given the 2-Wasserstein space, (P 2 (Ω), W 2 ), the geodesic between µ, ν P 2 (Ω) is the shortest curve on P 2 (Ω) that connects these measures.
2-Wasserstein geodesics The set of continuous measures together with the 2-Wasserstein metric forms a Riemmanian manifold. Given the 2-Wasserstein space, (P 2 (Ω), W 2 ), the geodesic between µ, ν P 2 (Ω) is the shortest curve on P 2 (Ω) that connects these measures. Let ρ t for t [0, 1] parametrizes a curve on P 2 (Ω) with ρ 0 = µ and ρ 1 = ν, and let I t denote the density of ρ t, I t(x)dx = dρ t(x).
2-Wasserstein geodesics The set of continuous measures together with the 2-Wasserstein metric forms a Riemmanian manifold. Given the 2-Wasserstein space, (P 2 (Ω), W 2 ), the geodesic between µ, ν P 2 (Ω) is the shortest curve on P 2 (Ω) that connects these measures. Let ρ t for t [0, 1] parametrizes a curve on P 2 (Ω) with ρ 0 = µ and ρ 1 = ν, and let I t denote the density of ρ t, I t(x)dx = dρ t(x). For the optimal transport map, f(x), between µ and ν the geodesic is parametrized as, I t(x) = det(df t(x))i 1 (f t(x)), f t(x) = (1 t)x + tf(x) (19)
2-Wasserstein geodesics The set of continuous measures together with the 2-Wasserstein metric forms a Riemmanian manifold. Given the 2-Wasserstein space, (P 2 (Ω), W 2 ), the geodesic between µ, ν P 2 (Ω) is the shortest curve on P 2 (Ω) that connects these measures. Let ρ t for t [0, 1] parametrizes a curve on P 2 (Ω) with ρ 0 = µ and ρ 1 = ν, and let I t denote the density of ρ t, I t(x)dx = dρ t(x). For the optimal transport map, f(x), between µ and ν the geodesic is parametrized as, I t(x) = det(df t(x))i 1 (f t(x)), f t(x) = (1 t)x + tf(x) (19) It is straightforward to show that, W 2 (µ, ρ t) = tw 2 (µ, ν) (20)
2-Wasserstein geodesics
2-Wasserstein geodesics
2-Wasserstein geodesics S. Kolouri and G. K. Rohde OT Crash Course
2-Wasserstein geodesics S. Kolouri and G. K. Rohde OT Crash Course
Numerical Solvers
Flow Minimization (Angenent, Haker, and Tannenbaum) The flow minimization method finds the optimal transport map following below steps: 1. Obtain an initial mass preserving transport map using the Knothe-Rosenblatt coupling 2. Update the initial map to obtain a curl free mass preserving transport map that minimizes the transport cost f k+1 = f k + ɛ 1 I 0 Df k (f k ( 1 div(f k ))), 1 : Poisson solver (21)
Flow Minimization (Angenent, Haker, and Tannenbaum) The flow minimization method finds the optimal transport map following below steps: 1. Obtain an initial mass preserving transport map using the Knothe-Rosenblatt coupling 2. Update the initial map to obtain a curl free mass preserving transport map that minimizes the transport cost f k+1 = f k + ɛ 1 I 0 Df k (f k ( 1 div(f k ))), 1 : Poisson solver (21) Angenent, S., et al. Minimizing flows for the Monge Kantorovich problem. SIAM 2003
Gradient descent on the dual problem (Chartrand et al.) I For the strictly convex cost function, c(x, y) = 1 x y 2, the dual of 2 Kantorovich problem can be formalized as minimizing, Z Z M (η) = η(x)dµ(x) + X η c (y)dν(y) Y η c (y) := maxx X (x y η(x)) is the Legendre-Fenchel transform of η(x). S. Kolouri and G. K. Rohde OT Crash Course (22)
Gradient descent on the dual problem (Chartrand et al.) I For the strictly convex cost function, c(x, y) = 1 x y 2, the dual of 2 Kantorovich problem can be formalized as minimizing, Z Z M (η) = η(x)dµ(x) + X η c (y)dν(y) (22) Y η c (y) := maxx X (x y η(x)) is the Legendre-Fenchel transform of η(x). I Then the potential transport field, η, is updated to minimize M (η) through, ηk+1 = ηk (I0 det(i Hη cc )I1 (id η cc )), S. Kolouri and G. K. Rohde OT Crash Course H: Hessian matrix (23)
Gradient descent on the dual problem (Chartrand et al.) I For the strictly convex cost function, c(x, y) = 1 x y 2, the dual of 2 Kantorovich problem can be formalized as minimizing, Z Z M (η) = η(x)dµ(x) + X η c (y)dν(y) (22) Y η c (y) := maxx X (x y η(x)) is the Legendre-Fenchel transform of η(x). I Then the potential transport field, η, is updated to minimize M (η) through, ηk+1 = ηk (I0 det(i Hη cc )I1 (id η cc )), H: Hessian matrix (23) Chartrand, R., et al. A gradient descent solution to the Monge-Kantorovich problem. AMS 2009 S. Kolouri and G. K. Rohde OT Crash Course
Linear programming Let µ = N i=1 p iδ xi and µ = M j=1 q jδ yj, where δ xi is a Dirac measure, K(µ, ν) = min γ c(x i, y j )γ ij i j s.t. γ ij = p i, γ ij = q j, γ ij 0 j i Can be solved through the Simplex algorithm or interior point techniques. Computational complexity of solvers: O(N 3 logn)
Linear programming Let µ = N i=1 p iδ xi and µ = M j=1 q jδ yj, where δ xi is a Dirac measure, K(µ, ν) = min γ c(x i, y j )γ ij i j s.t. γ ij = p i, γ ij = q j, γ ij 0 j i Can be solved through the Simplex algorithm or interior point techniques. Computational complexity of solvers: O(N 3 logn) Multi-Scale Approaches To improve computational complexity of several multi-scale approaches have been proposed The idea behind all these multi-scale techniques is to obtain a coarse transport plan and refine the transport plan iteratively.
Entropy Regularization I Cuturi proposed a regularized version of the Kantorovich problem which can be solved in O(N logn ), p (µ, ν) = Wp,λ Z dp (x, y)γ(x, y) + λγ(x, y) ln(γ(x, y))dxdy. inf γ Γ(µ,ν) Ω Ω S. Kolouri and G. K. Rohde OT Crash Course (24)
Entropy Regularization I Cuturi proposed a regularized version of the Kantorovich problem which can be solved in O(N logn ), p (µ, ν) = Wp,λ Z dp (x, y)γ(x, y) + λγ(x, y) ln(γ(x, y))dxdy. inf γ Γ(µ,ν) (24) Ω Ω I It is straightforward to show that the entropy regularized p-wasserstein distance in Equation (24) can be reformulated as, p Wp,λ (µ, ν) = λ inf γ Γ(µ,ν) KL(γ Kλ ), S. Kolouri and G. K. Rohde Kλ (x, y) = exp( OT Crash Course dp (x, y) ) λ (25)
Entropy Regularization I Cuturi proposed a regularized version of the Kantorovich problem which can be solved in O(N logn ), p (µ, ν) = Wp,λ Z dp (x, y)γ(x, y) + λγ(x, y) ln(γ(x, y))dxdy. inf γ Γ(µ,ν) (24) Ω Ω I It is straightforward to show that the entropy regularized p-wasserstein distance in Equation (24) can be reformulated as, p Wp,λ (µ, ν) = λ inf γ Γ(µ,ν) KL(γ Kλ ), Kλ (x, y) = exp( dp (x, y) ) λ (25) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. NIPS 2013. S. Kolouri and G. K. Rohde OT Crash Course
Summary Introduction: 1. Monge formulation transport maps 2. Kantorovich formulation transport plans
Summary Introduction: 1. Monge formulation transport maps 2. Kantorovich formulation transport plans Transport-based metrics and geometric properties: 1. p-wasserstein metric 2. p-sliced Wasserstein metric 3. 2-Wasserstein geodesics
Summary Introduction: 1. Monge formulation transport maps 2. Kantorovich formulation transport plans Transport-based metrics and geometric properties: 1. p-wasserstein metric 2. p-sliced Wasserstein metric 3. 2-Wasserstein geodesics Numerical solvers: 1. Flow minimization (Monge) 2. Gradient descent on the dual problem (Monge) 3. Linear programming and multi-scale methods (Kantorovich) 4. Entropy regularized solver (Kantorovich)
Summary Introduction: 1. Monge formulation transport maps 2. Kantorovich formulation transport plans Transport-based metrics and geometric properties: 1. p-wasserstein metric 2. p-sliced Wasserstein metric 3. 2-Wasserstein geodesics Numerical solvers: 1. Flow minimization (Monge) 2. Gradient descent on the dual problem (Monge) 3. Linear programming and multi-scale methods (Kantorovich) 4. Entropy regularized solver (Kantorovich) Coming Up Next Transport-based transformations
Thank you!