Optimal Transport in ML

1 Optimal Transport in ML
Rémi Gilleron, Inria Lille & CRIStAL & Univ. Lille, Feb.
Main source (also some figures): Computational Optimal Transport, G. Peyré and M. Cuturi

2 Compare documents using word embeddings
Given: word embeddings in R^d and a similarity on R^d
Problem: compare documents (phrases, sentences)
Idea: represent a document as a histogram of frequencies of the words of the vocabulary V
Figure: three documents in R^2; each circle corresponds to a word vector, its size proportional to the word's frequency.
Measure the cost of a transportation plan between histograms over the vocabulary, using the similarity between word vectors.

3 Unsupervised domain adaptation
Source: labeled data; target: unlabeled data; problem: classify data in the target domain.
Base idea: transport data from the source domain into the target domain, then learn a classifier.
The transportation plan between the two point clouds uses a ground distance and the empirical distributions of the clouds.
Figure: source in blue and red; target in green.

4 Histogram propagation over graphs
Example: traffic estimation
Given: a graph representing roads
Given: at some nodes, sensors provide traffic histograms over a 24h period
Problem: compute traffic histograms for every node
Base idea: use a propagation algorithm over the graph
The similarity should combine a similarity between histograms with a similarity on the graph, which brings in spatial information.

5 Optimal Transport (OT)
What is it? A method for comparing probability distributions that can incorporate spatial information.
Pros
  Distance between distributions based on a ground distance
  Defined for all distributions: discrete, with density, arbitrary
  Solid mathematical foundations; works well in applications
Cons
  Computing OT means solving an optimization problem
  The mathematics are not easy. Methods and algorithms depend on
    the measures (often discrete) and the dimension
    the ground cost: arbitrary, distance, squared distance, geodesic distance
    the ground space: R, R^d, a geodesic space

6 Research on computational OT
Close to Magnet's research problems:
  Word mover's distance
  Optimal transport for domain adaptation
  Histogram propagation over graphs
But also: signal and image processing, Wasserstein distances and divergences, efficient computation of (regularized) OT, generative models, Wasserstein GANs, among others.

7 Plan
1 Optimal Transport (OT)
  Monge Problem and Kantorovitch Problem
  Wasserstein Distance
  Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion

8 Intuitions
The goal of OT is to define geometric tools for comparing probability distributions.
Earth mover's distance
  probability distribution = pile of sand
  move one pile of sand into another one
  local cost: move one grain of sand from one place to another
  OT = minimal global cost
Mines and factories problem
  Mines produce resources across a country
  Factories consume resources across a country
  Local cost for delivering one resource from a mine to a factory
  OT = least costly transportation plan from the mines to the factories

11 Main Scenarios
Distributions
  Discrete measure: α = Σ_{i=1}^n a_i δ_{x_i}, where δ_x is the Dirac at x and a ∈ R_+^n. Then, for f ∈ C(X), ∫_X f(x) dα(x) = Σ_{i=1}^n a_i f(x_i).
  Probability measure: Σ_{i=1}^n a_i = 1, i.e. a ∈ Σ_n is a histogram
  Lagrangian (point clouds): (x_i)_{i=1}^n with a_i = 1/n
  Eulerian (histograms): grid, a_i is the probability mass at cell i
  Measure with density ρ_α: dα(x) = ρ_α(x) dx. Then, for h ∈ C(X), ∫_X h(x) dα(x) = ∫_X h(x) ρ_α(x) dx.
  Arbitrary measure
Ground space and ground cost
  X with a distance; in general X = R^d with d = 1 or d > 1
  c is a cost function from X × Y to R_+. When X = Y = R^d, c is generally a distance d or a squared distance d^2. For discrete measures it is given by a matrix C.
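
To make the discrete setting concrete, here is a minimal NumPy sketch (not from the slides; the points and weights are arbitrary placeholders) that builds two weighted point clouds and the cost matrices for the usual choices of ground cost.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# A discrete measure alpha = sum_i a_i delta_{x_i}: support points and weights.
n, m, d = 5, 4, 2
X = rng.normal(size=(n, d))          # support of alpha in R^d
Y = rng.normal(size=(m, d))          # support of beta in R^d
a = np.full(n, 1.0 / n)              # Lagrangian case: uniform weights, a in Sigma_n
b = rng.random(m); b /= b.sum()      # a general histogram b in Sigma_m

# Ground cost matrices for two common choices of c.
C_dist = cdist(X, Y)                 # c = Euclidean distance d
C_sq = cdist(X, Y, "sqeuclidean")    # c = squared Euclidean distance d^2
```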

12 Monge Problem
Transport map: Monge problem for discrete measures
  Let α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j} be two discrete measures.
  Find a map T from {x_1, ..., x_n} into {y_1, ..., y_m} such that, for all j, b_j = Σ_{i : T(x_i) = y_j} a_i.
  T defines a pushforward operator T_# between measures such that T_# α = β.
  The transport map should minimize Σ_i c(x_i, T(x_i)).
Examples (each pair is (point, mass))
  (x_1, 1), (x_2, 2), (x_3, 3), (x_4, 1), (x_5, 1); (y_1, 4), (y_2, 2), (y_3, 2), with c = 1 and with c = d the Euclidean distance
  (x_1, 2), (x_2, 2), (x_3, 2); (y_1, 1), (y_2, 2), (y_3, 2), (y_4, 1), c = 1
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1), c = 1

13 Monge problem for histograms
Optimal assignment problem
  Let n = m, let a = b = 1_n / n, and let C be a cost matrix in R_+^{n×n} giving the cost of moving x_i to y_j.
  Find σ in the set Perm(n) of permutations of n elements solving min_{σ ∈ Perm(n)} (1/n) Σ_{i=1}^n C_{i,σ(i)}
Remarks
  The naive algorithm is intractable because Perm(n) has n! elements
  ((2, 1), 1/2), ((2, 3), 1/2); ((1, 2), 1/2), ((3, 2), 1/2), Euclidean distance
  (x_1, 1/2), (x_2, 1/2); (y_1, 1/4), (y_2, 1/4), (y_3, 1/4), (y_4, 1/4), c = 1
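
The assignment formulation can be checked numerically with SciPy's Hungarian-algorithm solver; the following sketch uses the slide's first 2D example and is only an illustration of the formula min_σ (1/n) Σ_i C_{i,σ(i)}.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Point clouds from the slide's first example, uniform weights 1/n.
X = np.array([[2.0, 1.0], [2.0, 3.0]])
Y = np.array([[1.0, 2.0], [3.0, 2.0]])

C = cdist(X, Y)                         # Euclidean ground cost C[i, j] = ||x_i - y_j||
rows, cols = linear_sum_assignment(C)   # optimal permutation sigma
cost = C[rows, cols].mean()             # (1/n) * sum_i C[i, sigma(i)]
print(cols, cost)
```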

14 Relaxation of the Monge problem
Limitations of the Monge problem
  Feasible solutions may not exist
  Multiple solutions may exist
  The assignment problem is combinatorial
  The Monge problem for arbitrary measures is not convex
  Existence and uniqueness of the Monge map for c the squared Euclidean distance and measures with density (Brenier 91)
Relax the deterministic nature of transportation
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1), c = 1.
  Kantorovitch's relaxation: mass splitting from a source towards several targets
  A coupling maps x_1 to y_1, and 1/2 of the mass at x_2 to y_1 and 1/2 to y_2
  Where are x_1 and x_2 transported?

15 Kantorovitch's OT problem for discrete measures
Formulation
  Let a ∈ R_+^n and b ∈ R_+^m be two mass vectors for x_1, ..., x_n and y_1, ..., y_m, and let C be the cost matrix.
  A coupling is a matrix P ∈ R_+^{n×m} where P_{i,j} is the amount of mass flowing from x_i to y_j.
  The set of admissible couplings is U(a, b) = {P | P 1_m = a, P^t 1_n = b}. It is a bounded convex polytope.
  Kantorovitch's OT problem is L_C(a, b) = min_{P ∈ U(a,b)} <C, P> = min_{P ∈ U(a,b)} Σ_{i,j} C_{i,j} P_{i,j}
Mines and factories: find an optimal transportation plan between mines and factories.
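
As a quick numerical sketch of L_C(a, b), assuming the POT library (`pip install pot`) and its exact solver `ot.emd`; the 2×2 cost matrix below is a placeholder, only the masses come from the slide's running example.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed installed)

# Masses from the example (x1,3),(x2,2); (y1,4),(y2,1), normalized to sum to 1.
a = np.array([3.0, 2.0]) / 5.0
b = np.array([4.0, 1.0]) / 5.0

# An arbitrary 2x2 ground cost C[i, j] between x_i and y_j (placeholder values).
C = np.array([[0.0, 1.0],
              [1.0, 2.0]])

P = ot.emd(a, b, C)            # exact solver: returns an optimal coupling in U(a, b)
print(P)                       # rows sum to a, columns sum to b
print(np.sum(P * C))           # L_C(a, b) = <C, P>
```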

16 Kantorovitch's OT problem
Formulation for arbitrary measures
  Let α and β be two measures; a coupling π is a joint distribution over X × Y.
  Set of admissible couplings: U(α, β) = {π | P_X# π = α and P_Y# π = β}, where P_X# and P_Y# are the push-forward projections.
  Kantorovitch's OT problem for a cost function c is L_c(α, β) = min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y)
For discrete measures
  Let α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j} be two discrete measures.
  Then L_c(α, β) = L_C(a, b), where C is the cost matrix defined from c on the supports of α and β.

17 Examples of Kantorovitch's OT problem
X = R with the Euclidean distance
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1) with x_1 < x_2, y_1 < y_2
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1) with x_1 < x_2, y_1 > y_2
Binary cost matrix
  C = [[0, 1, 1], [1, 0, 1], [1, 1, 0]], a = (1/3, 1/3, 1/3), b = (2/3, 1/6, 1/6), P = ?
Assignment problem on X = R^2 with the Euclidean distance
  x_1 = (0, 1), x_2 = (0, 2), x_3 = (0, 3), y_1 = (1, 5/2), y_2 = (1, 3/2), y_3 = (2, 2)
  C = [[5/2, 5/4, 5], [5/4, 5/4, 2], [5/4, 5/2, 5]], a = (1/3, 1/3, 1/3), b = (1/3, 1/3, 1/3), P = ?

18 Examples of Kantorovitch's OT problem (solutions)
X = R with the Euclidean distance
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1); x_1 < x_2, y_1 < y_2: x_1 → y_1; 1/2 of x_2 → y_1; 1/2 of x_2 → y_2
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1); x_1 < x_2, y_1 > y_2: 1/3 of x_1 → y_2; 2/3 of x_1 → y_1; x_2 → y_1
Binary cost matrix
  C = [[0, 1, 1], [1, 0, 1], [1, 1, 0]], a = (1/3, 1/3, 1/3), b = (2/3, 1/6, 1/6), P = [[1/3, 0, 0], [1/6, 1/6, 0], [1/6, 0, 1/6]]
Assignment problem on X = R^2 with the Euclidean distance
  C = [[5/2, 5/4, 5], [5/4, 5/4, 2], [5/4, 5/2, 5]], a = (1/3, 1/3, 1/3), b = (1/3, 1/3, 1/3), P = (1/3) [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

19 Kantorovitch relaxation is tight for assignment problems
Permutation matrices are couplings
  Let n = m, let a = b = 1_n / n, and let C be a cost matrix in R_+^{n×n}.
  Kantorovitch's problem: L_C(1_n/n, 1_n/n) = min_{P ∈ U(1_n/n, 1_n/n)} <C, P>
  For σ ∈ Perm(n), the rescaled permutation matrix P_σ / n is in U(1_n/n, 1_n/n).
Kantorovitch for matching
  Proposition: there exists an optimal solution of Kantorovitch's problem which is a permutation matrix associated with an optimal permutation for the assignment problem (an optimal transport map for the Monge problem between histograms).
  Proof: the extremal points of U(1_n/n, 1_n/n) are the rescaled permutation matrices (Birkhoff's theorem), and the minimum of a linear objective is reached at an extremal point of the polytope (Bertsimas and Tsitsiklis).

20 Plan (outline recalled; next: Wasserstein Distance)

21 Definition of the Wasserstein Distance
OT defines a distance between measures when C satisfies some properties.
p-Wasserstein distance between measures
  Let X = Y, let d be a distance on X, let p ≥ 1, and let c = d^p; then W_p is a distance on M_+^1(X), where W_p(α, β) = L_{d^p}(α, β)^{1/p}
p-Wasserstein distance between histograms
  Let n = m and X = Y, let D be a distance matrix on the n support points, let p ≥ 1, and let C = D^p be the matrix of the D_{i,j}^p; then W_p is a distance on Σ_n, where W_p(a, b) = L_{D^p}(a, b)^{1/p}
First remarks
  W_p on histograms (resp. on measures) depends on D (resp. d)
  W_1 is exactly the optimal transport cost for C = D (resp. c = d)
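
A small sketch of this definition for histograms, assuming POT's `ot.emd2` returns the optimal cost L_C(a, b); the two-Dirac check at the end illustrates the later remark that W_p(δ_x, δ_y) = d(x, y).

```python
import numpy as np
import ot  # POT, assumed available
from scipy.spatial.distance import cdist

def wasserstein_p(a, b, X, Y, p=2):
    """W_p(a, b) = L_{D^p}(a, b)^(1/p) for histograms supported on points X and Y."""
    D = cdist(X, Y)                              # ground distance matrix
    return ot.emd2(a, b, D ** p) ** (1.0 / p)    # ot.emd2 returns the optimal cost

# Two Diracs: W_p(delta_x, delta_y) = d(x, y).
x, y = np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]])
print(wasserstein_p(np.ones(1), np.ones(1), x, y, p=2))   # 5.0
```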

22 The p-Wasserstein distance is a distance
Proof for histograms
Note that C = D^p is symmetric and has a null diagonal. Then:
  W_p(a, b) = 0 if and only if a = b: easy
  W_p(a, b) = W_p(b, a): easy
  W_p(a, c) ≤ W_p(a, b) + W_p(b, c):
    Let S = P diag(1/b̃) Q, where P (resp. Q) is an optimal coupling between a and b (resp. between b and c), and b̃ is b with null values set to 1.
    W_p(a, c) ≤ (<S, D^p>)^{1/p} because S ∈ U(a, c).
    Then use the triangle inequality for D and the Minkowski inequality to show that (<S, D^p>)^{1/p} ≤ W_p(a, b) + W_p(b, c).

23 Properties of the p-Wasserstein distance
Geometric intuition
  W_p is a distance, while many other comparisons are divergences, for instance the KL divergence.
  W_p can compare singular distributions, for instance discrete ones, whereas classical distances or divergences cannot meaningfully compare discrete distributions (e.g., Diracs at different locations).
  W_p quantifies the spatial shift between the supports.
  Barycenters of distributions can be defined by ᾱ = arg min_α Σ_i λ_i W_p^p(α_i, α).
  W_p(δ_x, δ_y) = d(x, y), and W_p(δ_x, δ_y) → 0 when x → y. This allows one to define a more general notion of weak convergence for distributions.

24 Plan (outline recalled; next: Special Cases)

25 Special case: binary cost matrix
Binary cost matrix
  Let a and b be two mass vectors in R_+^n and let C be the cost matrix 1_{n×n} − I_{n×n}; then L_C(a, b) = (1/2) ||a − b||_1
Kronecker cost function and total variation
  Let α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j} be two discrete measures, and let c be the Kronecker cost function defined by c(x, y) = 0 if x = y and c(x, y) = 1 otherwise. Then L_c(α, β) = TVD(α, β) = (1/2) ||α − β||_1
  TVD(α, β) = sup_A |α(A) − β(A)| is the total variation distance between α and β.
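
A quick numerical check of the identity L_C(a, b) = (1/2)||a − b||_1 for the binary cost, assuming POT's `ot.emd2`; the histograms are random placeholders.

```python
import numpy as np
import ot  # POT, assumed available

rng = np.random.default_rng(0)
a = rng.random(5); a /= a.sum()
b = rng.random(5); b /= b.sum()

C = 1.0 - np.eye(5)                 # binary cost matrix 1_{nxn} - I_{nxn}
print(ot.emd2(a, b, C))             # OT cost L_C(a, b)
print(0.5 * np.abs(a - b).sum())    # (1/2) ||a - b||_1 -- should match
```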

26 Special case: dimension 1
Discrete case
  Let X = R, let α = (1/n) Σ_{i=1}^n δ_{x_i} and β = (1/n) Σ_{j=1}^n δ_{y_j}, and suppose that x_1 < x_2 < ... < x_n and y_1 < y_2 < ... < y_n. Then W_p is the L^p-norm between the two vectors of ordered values of α and β:
  W_p^p(α, β) = (1/n) Σ_{i=1}^n |x_i − y_i|^p
  W_p(α, β) changes as soon as the ordering of the points changes.
  This extends to the case n ≠ m with the condition: if x_i < x_{i'}, P_{i,j} ≠ 0 and P_{i',j'} ≠ 0, then necessarily y_j ≤ y_{j'}.
Arbitrary measures
  W_p(α, β) = ||C_α^{-1} − C_β^{-1}||_{L^p((0,1])}, where C_α^{-1} is the quantile function (pseudo-inverse of the cumulative distribution function C_α).
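
The sorted-matching formula is easy to verify numerically; the sketch below (with arbitrary Gaussian samples) compares it for p = 1 with SciPy's 1D Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # reference implementation for p = 1

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(loc=1.0, size=200)

def wasserstein_1d(x, y, p=2):
    """W_p between two empirical measures on R with the same number of points: sort and match."""
    xs, ys = np.sort(x), np.sort(y)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

print(wasserstein_1d(x, y, p=1))
print(wasserstein_distance(x, y))   # SciPy's 1D W_1, should agree
```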

27 Special case: Gaussian distributions
Two Gaussians in R w.r.t. the Euclidean distance
  Let α = N(m_1, σ_1) and β = N(m_2, σ_2); then the optimal transport map is T(x) = m_2 + (σ_2/σ_1)(x − m_1)
  W_2^2(α, β) = (m_2 − m_1)^2 + (σ_2 − σ_1)^2
Two Gaussians in R^d w.r.t. the Euclidean distance
  Let α = N(m_α, Σ_α) and β = N(m_β, Σ_β); then the optimal map is T(x) = m_β + A(x − m_α), where A = Σ_α^{-1/2} (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2} Σ_α^{-1/2} = A^t
  W_2^2(α, β) = ||m_α − m_β||^2 + tr(Σ_α + Σ_β − 2 (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2})
  W_2^2(α, β) = ||m_α − m_β||^2 + ||√r − √s||^2 if Σ_α = diag(r) and Σ_β = diag(s)
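
The closed form in R^d is straightforward to evaluate with a matrix square root; here is a sketch with arbitrary test Gaussians that also checks the diagonal-covariance simplification.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, S1, m2, S2):
    """W_2^2 between N(m1, S1) and N(m2, S2) via the closed form above."""
    S1h = sqrtm(S1)                                   # Sigma_alpha^{1/2}
    cross = sqrtm(S1h @ S2 @ S1h)                     # (S1^{1/2} S2 S1^{1/2})^{1/2}
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross).real

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.ones(2), np.diag([2.0, 0.5])
print(gaussian_w2_squared(m1, S1, m2, S2))
# Diagonal check: ||m1 - m2||^2 + ||sqrt(r) - sqrt(s)||^2
print(np.sum((m1 - m2) ** 2) + np.sum((np.sqrt(np.diag(S1)) - np.sqrt(np.diag(S2))) ** 2))
```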

28 Plan (outline recalled; next: Algorithms for OT)

29 Reminder
Main cases for discrete measures
  Points in R^d or cells: (x_i)_{i=1}^n, (y_j)_{j=1}^m
  Discrete measures α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j}, with a ∈ R_+^n, b ∈ R_+^m, a_i the mass at x_i and b_j the mass at y_j
  Special cases: a ∈ Σ_n (histograms: the weights sum to 1), a_i = 1/n, the case n = m
  c is a cost; c may be a distance, the Euclidean distance, the squared Euclidean distance, ...; C is the cost matrix of the c_{i,j}
OT and Wasserstein distance
  L_C(a, b) = min_{P ∈ U(a,b)} <C, P> = min_{P1_m = a, P^t 1_n = b} Σ_{i,j} C_{i,j} P_{i,j}
  Let n = m, let d be a distance, let p ≥ 1, and let C = d^p be the matrix of the d_{i,j}^p; then W_p is a distance on Σ_n, where W_p(a, b) = L_C(a, b)^{1/p}. Note that W_p(δ_x, δ_y) = d(x, y).

30 OT is a linear program
Kantorovitch linear program
Recall that Kantorovitch's OT problem is L_C(a, b) = min_{P ∈ U(a,b)} <C, P> = min_{P1_m = a, P^t 1_n = b} Σ_{i,j} C_{i,j} P_{i,j}
It can be formulated as
  L_C(a, b) = min_p c^t p   s.t.   p ∈ R_+^{nm},  A p = [a; b]
where the n×m matrices P and C have been replaced by nm-dimensional vectors p and c, and admissible couplings are defined with a well-chosen matrix A.
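
A direct, if inefficient, way to check this formulation is to hand the vectorized problem to a generic LP solver; the sketch below builds the constraint matrix A explicitly and reuses the binary-cost example of slides 17-18.

```python
import numpy as np
from scipy.optimize import linprog

def ot_linprog(a, b, C):
    """Solve L_C(a, b) as the linear program min c^t p s.t. A p = [a; b], p >= 0."""
    n, m = C.shape
    # Constraint matrix A: row sums of P equal a (n rows), column sums equal b (m rows).
    A = np.zeros((n + m, n * m))
    for i in range(n):
        A[i, i * m:(i + 1) * m] = 1.0           # sum_j P[i, j] = a[i]
    for j in range(m):
        A[n + j, j::m] = 1.0                    # sum_i P[i, j] = b[j]
    res = linprog(C.ravel(), A_eq=A, b_eq=np.concatenate([a, b]), bounds=(0, None))
    return res.x.reshape(n, m), res.fun         # an optimal coupling P and the cost L_C(a, b)

# The binary-cost example from slides 17-18.
a = np.array([1/3, 1/3, 1/3]); b = np.array([2/3, 1/6, 1/6])
C = 1.0 - np.eye(3)
P, cost = ot_linprog(a, b, C)
print(np.round(P, 3), cost)                     # cost = 1/2 * ||a - b||_1 = 1/3
```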

31 Network simplex algorithm for OT
Algorithm complexity in O(n^3)
Because one can restrict the search to the extremal points of the polytope U(a, b), and because the structure of the matrices P can be expressed with bipartite graphs, one can use a network flow solver.
For matching problems (n = m, a_i = b_i = 1/n), the auction algorithm runs in O(n^3 ||C||_∞ / ε) and the cost of its output is at most nε above the optimum.

32 Regularized Optimal Transport
Adding a regularization penalty
  The regularized Kantorovitch OT problem is L_C^λ(a, b) = min_{P ∈ U(a,b)} <C, P> + λ Ω(P)
What for?
  Encode prior knowledge
  Better-posed problem w.r.t. stability, because couplings are denser
  Smooth approximate distance w.r.t. the input histogram weights and the positions of the Diracs
  Better complexity, by making the problem strongly convex
  Regularization: quadratic, entropic, group Lasso, KL divergence, ...

35 Entropic Regularized Optimal Transport
Entropic regularization
  The entropic regularized OT problem (Cuturi 2013) is
  L_C^λ(a, b) = min_{P ∈ U(a,b)} <C, P> − λ H(P)   (1)
  i.e. L_C^λ(a, b) = min_{P ∈ U(a,b)} <C, P> + λ Σ_{i,j} P_{i,j} log(P_{i,j})
Convergence with λ
  The entropy is strongly convex, so the objective is λ-strongly convex.
  Proposition: the unique solution P^λ of problem (1) converges, as λ → 0, to the optimal solution with maximal entropy within the set of all optimal solutions of Kantorovitch's OT problem. In particular, L_C^λ(a, b) → L_C(a, b) as λ → 0.
  Also, P^λ → a ⊗ b = a b^t = (a_i b_j)_{i,j} as λ → ∞.

36 Computing Entropic Regularized Optimal Transport
Entropic regularized OT as matrix scaling
  Introducing dual variables and writing the Lagrangian of (1), it can be shown that the solution of (1) has the form P = diag(u) K diag(v) for (unknown) vectors u ∈ R_+^n, v ∈ R_+^m, and K defined by K_{i,j} = e^{−C_{i,j}/λ}.
  This leads to Sinkhorn's algorithm:
    Init v^(0) = 1_m
    Repeat u^(l+1) = a / (K v^(l)) and v^(l+1) = b / (K^t u^(l+1))   (elementwise divisions)
Results
  Complexity in O(n^2) per iteration, only matrix operations
  Convergence, but numerical problems and difficulties for small λ
  Complexity in O(n^2 log(n) ε^{−3}) to reach an ε-approximation
  GPU-friendly, especially when solving multiple OT problems
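
A minimal NumPy sketch of these updates (no stabilization for small λ, so it is only illustrative):

```python
import numpy as np

def sinkhorn(a, b, C, lam, n_iter=1000):
    """Sinkhorn iterations for entropic OT, following the updates above.

    Returns the coupling P = diag(u) K diag(v) and the transport cost <C, P>.
    """
    K = np.exp(-C / lam)                 # Gibbs kernel K_ij = exp(-C_ij / lambda)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                  # enforce the row marginals
        v = b / (K.T @ u)                # enforce the column marginals
    P = u[:, None] * K * v[None, :]      # diag(u) K diag(v)
    return P, np.sum(P * C)

# Sanity check on the slide-17 binary-cost example.
a = np.array([1/3, 1/3, 1/3]); b = np.array([2/3, 1/6, 1/6])
C = 1.0 - np.eye(3)
P, cost = sinkhorn(a, b, C, lam=0.01)
print(np.round(P, 3), cost)              # close to the unregularized cost 1/3 for small lambda
```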

37 Conclusion on the first part
Also in Computational Optimal Transport, G. Peyré and M. Cuturi:
  Semi-discrete OT: one measure is discrete and the other arbitrary (often with density)
  W_1 OT, i.e. c is a distance
  W_2 OT for a geodesic distance
  Approximating OT with discrete samples (Eulerian or Lagrangian)
  Variational OT, i.e. using the Wasserstein distance as a loss function
  Algorithms for computing Wasserstein barycenters
Software for OT
  Python Optimal Transport (POT) library by Rémi Flamary

38 Plan (outline recalled; next: Word Mover's Distance)

39 A new distance between text documents
Base ideas
  Word embeddings: represent every word by a vector
  Word mover's distance (WMD): the distance between two text documents A and B is the minimum cumulative distance that the words of document A need to travel to match exactly the point cloud of document B
  Kusner et al. study k-NN with the WMD for classifying documents
  Huang et al. define metric learning algorithms for the WMD
References
  From word embeddings to document distances, Kusner et al., ICML 15
  Supervised word mover's distance, Huang et al., NIPS 16

40 From word embeddings to document distances
Representation of text documents
  A vocabulary of size n
  The i-th word w_i is represented by a word vector x_i ∈ R^d
  A text document A is represented as a histogram a ∈ Σ_n defined by a_i = c_i / Σ_{i=1}^n c_i, where c_i is the count of word w_i in A
Word mover's distance between text documents
  The cost (ground distance) is chosen to be the Euclidean distance
  The word mover's distance between text documents A and B is the W_1 distance between a and b, i.e.
  WMD(A, B) = W_1(a, b) = min_{P1_n = a, P^t 1_n = b} Σ_{i,j} ||x_i − x_j||_2 P_{i,j}
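
A sketch of the WMD computation, assuming POT's `ot.emd2` and a hypothetical `embeddings` dict mapping words to vectors; restricting the histograms to the words actually present in the two documents is equivalent, since the other coordinates have zero mass.

```python
import numpy as np
import ot  # POT, assumed available
from collections import Counter
from scipy.spatial.distance import cdist

def wmd(doc_a, doc_b, embeddings):
    """Word mover's distance between two tokenized documents (lists of words)."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    words_a, words_b = list(ca), list(cb)
    a = np.array([ca[w] for w in words_a], float); a /= a.sum()   # normalized word frequencies
    b = np.array([cb[w] for w in words_b], float); b /= b.sum()
    C = cdist(np.array([embeddings[w] for w in words_a]),         # Euclidean ground distance
              np.array([embeddings[w] for w in words_b]))         # between word vectors
    return ot.emd2(a, b, C)                                       # W_1(a, b), i.e. the WMD

# Toy usage with random "embeddings", for illustration only.
rng = np.random.default_rng(0)
vocab = "obama speaks media illinois president greets press chicago".split()
emb = {w: rng.normal(size=50) for w in vocab}
print(wmd("obama speaks media illinois".split(),
          "president greets press chicago".split(), emb))
```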

41 WMD for text classification
k-NN with WMD works well 8-)
  Outperforms known methods on several datasets
  The word2vec embeddings work well on several domains
  Largest runtime: the time complexity of computing the WMD is O(q^3 log q), where q is the maximum number of unique words in A or B
  n is too large for GPU computation of multiple WMDs with entropic regularization
Prefetch and prune
  The authors introduce two lower bounds for WMD(A, B):
    Word centroid distance: WCD(A, B) = ||Xa − Xb||_2, where X is the matrix of word vectors
    Relaxed WMD: RWMD(A, B) = max{WMD_1(A, B), WMD_2(A, B)}, where WMD_1 and WMD_2 keep only one of the two marginal constraints
  Use the lower bounds to compute faster an approximation of the k-nearest neighbours of a text query.
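
The word centroid distance is cheap to evaluate; a tiny sketch with placeholder data (X is the d×n matrix of word vectors, a and b the two document histograms):

```python
import numpy as np

# Placeholder word-vector matrix and histograms, assumed built as on the previous slide.
d, n = 50, 1000
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()

wcd = np.linalg.norm(X @ a - X @ b)   # WCD(A, B) = ||Xa - Xb||_2 <= WMD(A, B)
print(wcd)
```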

42 Supervised WMD for text classification
WMD to be learned
  Squared generalized Euclidean distance: c(i, j) = ||A(x_i − x_j)||_2^2
  Histogram reweighting: a document histogram a is re-represented as ã = (a ∘ w)/(w^t a)
  Then, WMD_{A,w} = min_{P1_n = ã, P^t 1_n = b̃} Σ_{i,j} ||A(x_i − x_j)||_2^2 P_{i,j}
Learning the WMD from labeled data
  Learn A ∈ R^{r×d} and w ∈ R^n such that WMD_{A,w} reflects the labels. Method:
    Stochastic neighborhood relaxation of the LOO loss, as for NCA
    Express the gradient w.r.t. A and the gradient w.r.t. w
    Compute gradients using entropic regularization (O(q^2) instead of O(q^3)) on a subset of neighbors chosen with WCD
    Clever initialization using WCD
    Batch stochastic gradient descent

43 Conclusion on WMD for texts
Pros
  A well-defined distance between text documents based on OT
  It works well 8-) for classification with k-NN
Cons = my comments 8-)
  The choice of the ground distance is ad hoc: the Euclidean distance (why not the cosine distance?), and the squared Euclidean distance when gradients are needed
  Many tricks are required to solve supervised WMD efficiently: loss, choice of the neighbors with WCD, initialization, regularization
  Experimental results for supervised WMD are not convincing
Perspective: WMD with Gaussian embeddings; learn Gaussian embeddings with a WMD-based loss.

44 Plan (outline recalled; next: OT for Domain Adaptation)

45 Domain adaptation with regularized optimal transport
Source: Courty et al., ECML 2014, IEEE PAMI
Unsupervised domain adaptation
  Source: labeled data; target: unlabeled data; problem: classify data in the target domain.
  Base idea: transport data from the source domain into the target domain, then learn a classifier.
Figure: left, source labeled data (blue and red) and target data (green); right, source points are transported and a linear classifier H is learned.

46 Solution of Courty et al.
The OT formulation
  Ground distance: squared Euclidean distance
  Entropic regularization for efficiency
  Introducing domain adaptation into OT: the coupling should be such that a target point does not receive mass from source points with different labels
  This leads to P̂ = arg min_{P ∈ U(a,b)} <C, P> − λH(P) + ηL(P), where L(P) expresses group-label sparsity of P with an L1 norm over label groups
  An alternating optimization algorithm is used
Learning in the target domain
  Given a solution P̂, every source point x_i is transported to the barycenter x̂_i of its images
  Learn a classifier on the images x̂_i with the labels of the x_i (see the sketch below).
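
A minimal sketch of this two-step recipe, assuming POT's `ot.dist` and `ot.sinkhorn` and using plain entropic OT in place of the label-regularized coupling of Courty et al.; the arrays Xs, ys, Xt are synthetic placeholders.

```python
import numpy as np
import ot  # POT, assumed available

rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, 2)); ys = (Xs[:, 0] > 0).astype(int)   # labeled source data
Xt = rng.normal(loc=[2.0, 1.0], size=(80, 2))                      # unlabeled target data

a = np.full(100, 1 / 100)            # empirical source distribution
b = np.full(80, 1 / 80)              # empirical target distribution
C = ot.dist(Xs, Xt)                  # squared Euclidean cost matrix
P = ot.sinkhorn(a, b, C, reg=0.5)    # entropic coupling (no label regularizer here)

# Barycentric mapping: each source point goes to the weighted average of its images.
Xs_mapped = (P @ Xt) / P.sum(axis=1, keepdims=True)
# A classifier can now be trained on (Xs_mapped, ys) and applied to Xt.
```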

47 Discussion on domain adaptation with OT
Pros
  Elegant formulation of domain adaptation based on OT
  It works well 8-) for balanced problems
  Other regularizers and other algorithms in the long version
  Extended to the semi-supervised case in another paper
Cons = my comments 8-)
  Antagonism between λH(P), which promotes non-sparsity, and ηL(P), which promotes group-label sparsity
  The choice of η is not discussed, and the choice of λ is not so easy.

48 Mapping estimation for discrete optimal transport
Source: Perrot et al., NIPS 16
Motivation
  Consider an OT problem in Kantorovitch's formulation for point clouds (x_i)_{i=1}^n and (y_j)_{j=1}^m.
  The optimal coupling P allows one to define a transportation map T by T(x_i) = arg min_{y ∈ Y} Σ_{j=1}^m P_{i,j} d(y, y_j),
  i.e. T maps x_i to the barycenter of its images; it is the weighted average when d is the squared Euclidean distance on Y.
  But T is defined only at the source points x_i.
  Idea: learn a transformation T from X into Y.

49 Mapping estimation for discrete optimal transport
Formulation of Perrot et al.
  They propose the following optimization problem:
  arg min_{T ∈ H, P ∈ U(a,b)}  (1/(n d_t)) ||T(X_s) − n P X_t||_F^2  +  (λ / max(C)) <P, C>  +  (γ / (d_s d_t)) R(T)
  The problem is jointly convex if H is a convex set of transformations and R is a convex function.
  H is taken to be a set of linear transformations induced by a matrix, or non-linear transformations using kernels.
Comments and results
  Theoretical bounds are provided, but...
  Alternating optimization algorithm
  It works 8-)

50 Plan (outline recalled; next: Conclusion)

51 Summary
To remember 8-)
  OT theory defines distances between distributions that use spatial information
  For every type of distribution
  Computing OT (or a Wasserstein distance) means solving an optimization problem: complexity in O(n^3), reduced to O(n^2) with entropic regularization
  Solid mathematical foundations, but not so easy to understand
OT in the discussed papers
  Ad hoc choice of the ground distance
  Intricate optimization problems derived from OT
  Ad hoc optimization algorithms
  But many opportunities to use OT in Magnet research problems

52 Conclusion
For Magnet
  NLP: Mangoes, cosine distance, Gaussian embeddings, applications
  Domain adaptation? Distributed learning?
  For Jan et al.: histogram prediction in graphs; Wasserstein propagation for semi-supervised learning, Solomon et al., ICML 14
Some recent papers, among others at NIPS 17 and ICLR 18
  Joint Distribution Optimal Transportation for Domain Adaptation
  Near-linear time approximation algorithms for optimal transport...
  Large Scale Optimal Transport and Mapping Estimation
  Improved Training of Wasserstein GANs
  Learning Wasserstein Embeddings
  Wasserstein Auto-Encoders
  Improving GANs Using Optimal Transport

Generative Models and Optimal Transport

Generative Models and Optimal Transport Generative Models and Optimal Transport Marco Cuturi Joint work / work in progress with G. Peyré, A. Genevay (ENS), F. Bach (INRIA), G. Montavon, K-R Müller (TU Berlin) Statistics 0.1 : Density Fitting

More information

Mathematical Foundations of Data Sciences

Mathematical Foundations of Data Sciences Mathematical Foundations of Data Sciences Gabriel Peyré CNRS & DMA École Normale Supérieure gabriel.peyre@ens.fr https://mathematical-tours.github.io www.numerical-tours.com January 7, 2018 Chapter 18

More information

softened probabilistic

softened probabilistic Justin Solomon MIT Understand geometry from a softened probabilistic standpoint. Somewhere over here. Exactly here. One of these two places. Query 1 2 Which is closer, 1 or 2? Query 1 2 Which is closer,

More information

Submodular Functions Properties Algorithms Machine Learning

Submodular Functions Properties Algorithms Machine Learning Submodular Functions Properties Algorithms Machine Learning Rémi Gilleron Inria Lille - Nord Europe & LIFL & Univ Lille Jan. 12 revised Aug. 14 Rémi Gilleron (Mostrare) Submodular Functions Jan. 12 revised

More information

Optimal Transport: A Crash Course

Optimal Transport: A Crash Course Optimal Transport: A Crash Course Soheil Kolouri and Gustavo K. Rohde HRL Laboratories, University of Virginia Introduction What is Optimal Transport? The optimal transport problem seeks the most efficient

More information

Numerical Optimal Transport and Applications

Numerical Optimal Transport and Applications Numerical Optimal Transport and Applications Gabriel Peyré Joint works with: Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, Justin Solomon www.numerical-tours.com Histograms in Imaging

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Beyond the Point Cloud: From Transductive to Semi-Supervised Learning

Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of

More information

Supervised Word Mover s Distance

Supervised Word Mover s Distance Supervised Word Mover s Distance Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Accurately measuring the similarity between text documents lies at

More information

Inderjit Dhillon The University of Texas at Austin

Inderjit Dhillon The University of Texas at Austin Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Optimal Transport for Domain Adaptation

Optimal Transport for Domain Adaptation Optimal Transport for Domain Adaptation Nicolas Courty, Rémi Flamary, Devis Tuia, Alain Rakotomamonjy To cite this version: Nicolas Courty, Rémi Flamary, Devis Tuia, Alain Rakotomamonjy. Optimal Transport

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Optimal transport for machine learning

Optimal transport for machine learning Optimal transport for machine learning Rémi Flamary AG GDR ISIS, Sète, 16 Novembre 2017 2 / 37 Collaborators N. Courty A. Rakotomamonjy D. Tuia A. Habrard M. Cuturi M. Perrot C. Févotte V. Emiya V. Seguy

More information

Convex relaxation for Combinatorial Penalties

Convex relaxation for Combinatorial Penalties Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Geometric Inference for Probability distributions

Geometric Inference for Probability distributions Geometric Inference for Probability distributions F. Chazal 1 D. Cohen-Steiner 2 Q. Mérigot 2 1 Geometrica, INRIA Saclay, 2 Geometrica, INRIA Sophia-Antipolis 2009 June 1 Motivation What is the (relevant)

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

Joint distribution optimal transportation for domain adaptation

Joint distribution optimal transportation for domain adaptation Optimal Transport for Domain Adaptation (TPAMI 2016) Joint distribution optimal transportation for domain adaptation (NIPS 2017) Joint distribution optimal transportation for domain adaptation Nicolas

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation

More information

Wasserstein Training of Boltzmann Machines

Wasserstein Training of Boltzmann Machines Wasserstein Training of Boltzmann Machines Grégoire Montavon, Klaus-Rober Muller, Marco Cuturi Presenter: Shiyu Liang December 1, 2016 Coordinated Science Laboratory Department of Electrical and Computer

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Stability of boundary measures

Stability of boundary measures Stability of boundary measures F. Chazal D. Cohen-Steiner Q. Mérigot INRIA Saclay - Ile de France LIX, January 2008 Point cloud geometry Given a set of points sampled near an unknown shape, can we infer

More information

Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Optimal Transport and Wasserstein Distance

Optimal Transport and Wasserstein Distance Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Linear and Logistic Regression. Dr. Xiaowei Huang

Linear and Logistic Regression. Dr. Xiaowei Huang Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

A Closed-form Gradient for the 1D Earth Mover s Distance for Spectral Deep Learning on Biological Data

A Closed-form Gradient for the 1D Earth Mover s Distance for Spectral Deep Learning on Biological Data A Closed-form Gradient for the D Earth Mover s Distance for Spectral Deep Learning on Biological Data Manuel Martinez, Makarand Tapaswi, and Rainer Stiefelhagen Karlsruhe Institute of Technology, Karlsruhe,

More information

Unsupervised Learning

Unsupervised Learning CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality

More information

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods) Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Penalized Barycenters in the Wasserstein space

Penalized Barycenters in the Wasserstein space Penalized Barycenters in the Wasserstein space Elsa Cazelles, joint work with Jérémie Bigot & Nicolas Papadakis Université de Bordeaux & CNRS Journées IOP - Du 5 au 8 Juillet 2017 Bordeaux Elsa Cazelles

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Integer Programming ISE 418. Lecture 8. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 8. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 8 Dr. Ted Ralphs ISE 418 Lecture 8 1 Reading for This Lecture Wolsey Chapter 2 Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Duality for Mixed-Integer

More information

Semidefinite Programming Basics and Applications

Semidefinite Programming Basics and Applications Semidefinite Programming Basics and Applications Ray Pörn, principal lecturer Åbo Akademi University Novia University of Applied Sciences Content What is semidefinite programming (SDP)? How to represent

More information

Online Manifold Regularization: A New Learning Setting and Empirical Study

Online Manifold Regularization: A New Learning Setting and Empirical Study Online Manifold Regularization: A New Learning Setting and Empirical Study Andrew B. Goldberg 1, Ming Li 2, Xiaojin Zhu 1 1 Computer Sciences, University of Wisconsin Madison, USA. {goldberg,jerryzhu}@cs.wisc.edu

More information

ICS-E4030 Kernel Methods in Machine Learning

ICS-E4030 Kernel Methods in Machine Learning ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

arxiv: v1 [cs.lg] 13 Nov 2018

arxiv: v1 [cs.lg] 13 Nov 2018 SEMI-DUAL REGULARIZED OPTIMAL TRANSPORT MARCO CUTURI AND GABRIEL PEYRÉ arxiv:1811.05527v1 [cs.lg] 13 Nov 2018 Abstract. Variational problems that involve Wasserstein distances and more generally optimal

More information

Supervised Metric Learning with Generalization Guarantees

Supervised Metric Learning with Generalization Guarantees Supervised Metric Learning with Generalization Guarantees Aurélien Bellet Laboratoire Hubert Curien, Université de Saint-Etienne, Université de Lyon Reviewers: Pierre Dupont (UC Louvain) and Jose Oncina

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Learning SVM Classifiers with Indefinite Kernels

Learning SVM Classifiers with Indefinite Kernels Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

The role of dimensionality reduction in classification

The role of dimensionality reduction in classification The role of dimensionality reduction in classification Weiran Wang and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced http://eecs.ucmerced.edu

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

BBM402-Lecture 20: LP Duality

BBM402-Lecture 20: LP Duality BBM402-Lecture 20: LP Duality Lecturer: Lale Özkahya Resources for the presentation: https://courses.engr.illinois.edu/cs473/fa2016/lectures.html An easy LP? which is compact form for max cx subject to

More information

arxiv: v2 [cs.cl] 1 Jan 2019

arxiv: v2 [cs.cl] 1 Jan 2019 Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Optimal mass transport as a distance measure between images

Optimal mass transport as a distance measure between images Optimal mass transport as a distance measure between images Axel Ringh 1 1 Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden. 21 st of June 2018 INRIA, Sophia-Antipolis, France

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Polyhedral Approaches to Online Bipartite Matching

Polyhedral Approaches to Online Bipartite Matching Polyhedral Approaches to Online Bipartite Matching Alejandro Toriello joint with Alfredo Torrico, Shabbir Ahmed Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Industrial

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Logistic Regression. William Cohen

Logistic Regression. William Cohen Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting

More information

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

More information

Structured deep models: Deep learning on graphs and beyond

Structured deep models: Deep learning on graphs and beyond Structured deep models: Deep learning on graphs and beyond Hidden layer Hidden layer Input Output ReLU ReLU, 25 May 2018 CompBio Seminar, University of Cambridge In collaboration with Ethan Fetaya, Rianne

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Algorithms for Picture Analysis. Lecture 07: Metrics. Axioms of a Metric

Algorithms for Picture Analysis. Lecture 07: Metrics. Axioms of a Metric Axioms of a Metric Picture analysis always assumes that pictures are defined in coordinates, and we apply the Euclidean metric as the golden standard for distance (or derived, such as area) measurements.

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Correlation Autoencoder Hashing for Supervised Cross-Modal Search

Correlation Autoencoder Hashing for Supervised Cross-Modal Search Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia

More information

Curvature and the continuity of optimal transportation maps

Curvature and the continuity of optimal transportation maps Curvature and the continuity of optimal transportation maps Young-Heon Kim and Robert J. McCann Department of Mathematics, University of Toronto June 23, 2007 Monge-Kantorovitch Problem Mass Transportation

More information