Generative Models and Optimal Transport

1 Generative Models and Optimal Transport Marco Cuturi Joint work / work in progress with G. Peyré, A. Genevay (ENS), F. Bach (INRIA), G. Montavon, K-R Müller (TU Berlin)

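2 Statistics 0.1: Density Fitting. We collect data $\nu_{\text{data}} = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}$.

3 Statistics 0.1: Density Fitting. We fit a parametric family of densities $\{p_\theta,\ \theta \in \Theta\}$, e.g. $\theta = (m, \Sigma)$; $p_\theta = \mathcal{N}(m, \Sigma)$. [Figure: an initial candidate $p_{\theta_0}$ next to $\nu_{\text{data}}$.]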

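4 Density Fitting. [Figure: candidate $p_{\theta_1}$ against $\nu_{\text{data}}$.]

5 Density Fitting. [Figure: candidate $p_{\theta_2}$ against $\nu_{\text{data}}$.]

6 Density Fitting. [Figure: $p_{\theta_{\text{done}}}$ against $\nu_{\text{data}}$.] We stop when there is a good fit.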

7 Maximum Likelihood Estimation: $\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$. [Figure: $\nu_{\text{data}}$ and $p_{\theta_{\text{done}}}$.]

8 Maximum Likelihood Estimation. Since $\log 0 = -\infty$, $p_\theta(x_i)$ must be $> 0$ at every data point.
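
To make the maximum-likelihood criterion concrete, here is a minimal sketch for a one-dimensional Gaussian family, where the maximiser is available in closed form (sample mean and population standard deviation). The data values and the function names `gaussian_log_likelihood` and `gaussian_mle` are made up for illustration and are not taken from the slides.

```python
import math
import statistics

def gaussian_log_likelihood(xs, m, sigma):
    """Average log-density of N(m, sigma^2) over the points xs."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - m) ** 2 / (2 * sigma**2)
        for x in xs
    ) / len(xs)

def gaussian_mle(xs):
    """Closed-form maximiser of the average log-likelihood."""
    m = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population std: divides by N, not N - 1
    return m, sigma

xs = [1.2, 0.7, 2.1, 1.9, 1.1]  # toy data, made up for illustration
m_hat, sigma_hat = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, m_hat, sigma_hat)
```

Perturbing either fitted parameter can only lower `best`, which is what "we stop when there is a good fit" means for this family.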

9 Maximum Likelihood Estimation. Equivalent to a KL projection in the space of probability measures: $\min_{\theta \in \Theta} \mathrm{KL}(\nu_{\text{data}} \| p_\theta)$. [Figure: $\nu_{\text{data}}$ projected in the KL sense onto the family $\{p_\theta,\ \theta \in \Theta\}$, which contains $p_{\theta_0}$, $p_{\theta_1}$, $p_{\theta_2}$ and $p_{\theta_{\text{done}}}$.]

11 In higher dimensional spaces: $\min_{\theta \in \Theta} \mathrm{KL}(\nu_{\text{data}} \| p_\theta)$.

13 In higher dimensional spaces. [Figure: $p_\theta$ and $\nu_{\text{data}}$.] Data space has dimension

14 Generative Models. [Figure: $\nu_{\text{data}}$ in the data space.]

16 Generative Models. A measure $\mu$ on the latent space; a map $f_\theta$: latent space $\to$ data space; $\nu_{\text{data}}$ on the data space.

18 Generative Models. A latent point $z$ is mapped to $f_\theta(z)$ in the data space.

21 Generative Models. Push-forward: $\forall B,\ f_{\theta\sharp}\mu(B) := \mu(f_\theta^{-1}(B))$.

22 Generative Models. Goal: find $\theta$ such that $f_{\theta\sharp}\mu$ fits $\nu_{\text{data}}$.
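
Sampling from a push-forward measure only requires sampling the latent measure and applying the map. The sketch below assumes a standard Gaussian latent measure and a hypothetical generator `f` that embeds a one-dimensional latent variable as a line in the plane; neither choice comes from the slides.

```python
import random

def f(theta, z):
    """Hypothetical generator: embeds a 1-D latent z as a point on a line in R^2."""
    return (z, theta * z)

def sample_pushforward(theta, n, seed=0):
    """Draw n samples from the push-forward of mu = N(0, 1) through f(theta, .)."""
    rng = random.Random(seed)
    return [f(theta, rng.gauss(0.0, 1.0)) for _ in range(n)]

samples = sample_pushforward(theta=2.0, n=1000)
# Every sample sits exactly on the line y = 2 x: the push-forward is
# concentrated on a set of zero area, so it has no density on R^2.
on_line = all(y == 2.0 * x for x, y in samples)
```

Because the generated measure lives on a line, a data point off that line would receive zero likelihood, which is the difficulty with plugging a push-forward measure into a log-likelihood.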

24 Generative Models. Difference between fitting a push-forward measure $f_{\theta\sharp}\mu$ vs. a density $p_\theta$?

25 Generative Models. MLE: $\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i) \Leftrightarrow \min_{\theta \in \Theta} \mathrm{KL}(\nu_{\text{data}} \| p_\theta)$.

26 Generative Models. MLE: $\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \log f_{\theta\sharp}\mu(x_i)$, $\min_{\theta \in \Theta} \mathrm{KL}(\nu_{\text{data}} \| f_{\theta\sharp}\mu)$.

28 Generative Models. Need a more flexible discrepancy function to compare $\nu_{\text{data}}$ and $f_{\theta\sharp}\mu$.

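29 Workarounds? Formulation as adversarial problem [GPM 14]: $\min_{\theta \in \Theta} \max_{\text{classifiers } g} \mathrm{Accuracy}_g\big((f_{\theta\sharp}\mu, +1), (\nu_{\text{data}}, -1)\big)$. Use a richer metric $\Delta$ for probability measures, able to handle measures with non-overlapping supports: $\min_{\theta \in \Theta} \Delta(\nu_{\text{data}}, p_\theta)$, not $\min_{\theta \in \Theta} \mathrm{KL}(\nu_{\text{data}} \| p_\theta)$.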

30 Minimum $\Delta$ Estimation

31 Minimum Kantorovich Estimation. Use optimal transport theory, namely Wasserstein distances, to define the discrepancy: $\min_{\theta \in \Theta} W(\nu_{\text{data}}, f_{\theta\sharp}\mu)$. Optimal transport? A fertile field in mathematics: Monge, Kantorovich, Koopmans, Dantzig, Brenier, Otto, McCann, Villani. Nobel '75, Fields '10.

32 What is Optimal Transport? A geometric toolbox to compare probability measures supported on a metric space. [Figure: objects that can be compared this way: statistical models, bags of features, brain activation maps, color histograms, empirical measures (i.e. data).]

34 What is Optimal Transport? A powerful geometric toolbox to compare probability measures. Wasserstein Barycenters [Agueh 11]. [Figure from [SDPC.. 15].]

37 Origins: Monge's Problem

46 Origins: Monge's Problem. [Figure: a point $x$ is moved to $y = T(x)$.]

47 Origins: Monge's Problem. [Figure: moving $x$ to $y = T(x)$ costs $D(x, T(x))$.]

48 Origins: Monge's Problem. $\Omega$ a probability space, $c: \Omega \times \Omega \to \mathbb{R}$. $\mu, \nu$ two probability measures in $\mathcal{P}(\Omega)$. [Monge 81] problem: find a map $T: \Omega \to \Omega$ attaining $\inf_{T_\sharp\mu = \nu} \int_\Omega c(x, T(x))\, \mu(dx)$.

49 Origins: Monge's Problem. [Brenier 87] If $\Omega = \mathbb{R}^d$, $c = \|\cdot - \cdot\|^2$, and $\mu, \nu$ are a.c., then $T = \nabla u$, $u$ convex.
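
For two uniform empirical measures with the same number of distinct points, a map pushing one onto the other is a permutation, so Monge's problem becomes a search over permutations. The sketch below is a toy illustration with made-up points and hypothetical helper names: it compares an exhaustive search with the monotone (sorted) matching, which is known to be optimal on the real line for convex costs such as the squared distance.

```python
from itertools import permutations

def monge_cost(xs, ys, perm, p=2):
    """Cost of the map T(x_i) = y_perm[i] between uniform empirical measures."""
    return sum(abs(x - ys[j]) ** p for x, j in zip(xs, perm)) / len(xs)

def brute_force_monge(xs, ys, p=2):
    """Exhaustive search over all maps pushing mu onto nu (n! candidates)."""
    return min(permutations(range(len(xs))), key=lambda s: monge_cost(xs, ys, s, p))

def sorted_monge(xs, ys):
    """Monotone rearrangement: the k-th smallest x goes to the k-th smallest y."""
    order_x = sorted(range(len(xs)), key=xs.__getitem__)
    order_y = sorted(range(len(ys)), key=ys.__getitem__)
    perm = [0] * len(xs)
    for i, j in zip(order_x, order_y):
        perm[i] = j
    return tuple(perm)

xs = [0.3, 2.0, -1.0, 1.1]  # toy source points (made up)
ys = [1.5, -0.2, 0.4, 3.0]  # toy target points (made up)
best = brute_force_monge(xs, ys)
mono = sorted_monge(xs, ys)
```

The exhaustive search is only feasible for a handful of points; the sorting shortcut is specific to one dimension and does not extend to general spaces.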

52 [Kantorovich 42] Relaxation. Instead of maps $T: \Omega \to \Omega$, consider probabilistic maps, i.e. couplings $P \in \mathcal{P}(\Omega \times \Omega)$: $\Pi(\mu, \nu) \overset{\text{def}}{=} \{P \in \mathcal{P}(\Omega \times \Omega) \mid \forall A, B \subset \Omega,\ P(A \times \Omega) = \mu(A),\ P(\Omega \times B) = \nu(B)\}$.

54 [Kantorovich 42] Relaxation. Joint probabilities of $(\mu, \nu)$: for probability measures $\mu, \nu$ on $\Omega$, $\Pi(\mu, \nu)$ is the set of $P \in \mathcal{P}(\Omega^2)$ with marginals $\mu$ and $\nu$, i.e. $P(\{x\} \times \Omega) = \mu(\{x\})$ and $P(\Omega \times \{y\}) = \nu(\{y\})$. [Figure: a discrete coupling $P(x, y)$ with its marginals $\mu(x)$ and $\nu(y)$.]

55 Wasserstein Distances. Def. For $p \geq 1$, the $p$-Wasserstein distance between $\mu, \nu$ in $\mathcal{P}(\Omega)$, defined by a metric $D$ on $\Omega$: $W_p^p(\mu, \nu) \overset{\text{def}}{=} \inf_{P \in \Pi(\mu, \nu)} \iint D(x, y)^p\, P(dx, dy)$. (PRIMAL)

57 Wasserstein Distances. $W_p^p(\mu, \nu) = \sup_{\varphi \in L_1(\mu),\ \psi \in L_1(\nu),\ \varphi(x) + \psi(y) \leq D^p(x, y)} \int \varphi\, d\mu + \int \psi\, d\nu$. (DUAL)

58 W is versatile. Discrete - Discrete: network flow solvers, entropic regularization. Discrete - Continuous: low dim. [M 11][KMB 16][L 15]. Continuous - Continuous: stochastic optimization [GCPB 16].

60 Minimum Kantorovich Estimators: $\min_{\theta \in \Theta} W(\nu_{\text{data}}, f_{\theta\sharp}\mu)$. [Bassetti 06]: first reference discussing this approach. [MMC 16]: use regularization in a finite setting. [ACB 17] (WGAN), [BJGR 17] (Wasserstein ABC). Hot topic: approximating and differentiating $W$ efficiently. Today: ideas from our recent preprint [GPC 17].

61 Wasserstein on Empirical Measures. On $(\Omega, D)$, consider $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$.

62 Wasserstein on Empirical Measures. $M_{XY} \overset{\text{def}}{=} [D(x_i, y_j)^p]_{ij}$; $U(a, b) \overset{\text{def}}{=} \{P \in \mathbb{R}_+^{n \times m} \mid P 1_m = a,\ P^T 1_n = b\}$. [Figure: the $n \times m$ cost matrix with rows $x_1, \dots, x_n$ and columns $y_1, \dots, y_m$; the row sums of $P$ equal $a$ and its column sums equal $b$.]

64 Wasserstein on Empirical Measures. Def. Optimal Transport Problem: $W_p^p(\mu, \nu) = \min_{P \in U(a, b)} \langle P, M_{XY} \rangle$.
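
The discrete problem minimises a linear cost over the set of nonnegative matrices with prescribed row and column sums. The sketch below does not solve that linear program; it only checks membership in the feasible set and evaluates the transport cost of two feasible couplings on made-up data, so each value is an upper bound on the optimum. The helper names `cost_matrix`, `in_U` and `transport_cost` are hypothetical.

```python
def cost_matrix(xs, ys, p=2):
    """M[i][j] = |x_i - y_j|^p for points on the real line."""
    return [[abs(x - y) ** p for y in ys] for x in xs]

def in_U(P, a, b, tol=1e-9):
    """Check P >= 0, row sums equal to a, column sums equal to b."""
    rows_ok = all(abs(sum(row) - ai) <= tol for row, ai in zip(P, a))
    cols_ok = all(abs(sum(col) - bj) <= tol for col, bj in zip(zip(*P), b))
    nonneg = all(pij >= 0 for row in P for pij in row)
    return rows_ok and cols_ok and nonneg

def transport_cost(P, M):
    """Frobenius inner product <P, M>."""
    return sum(pij * mij for prow, mrow in zip(P, M) for pij, mij in zip(prow, mrow))

xs, a = [0.0, 1.0], [0.5, 0.5]                # toy source measure
ys, b = [0.0, 1.0, 2.0], [0.25, 0.5, 0.25]    # toy target measure
M = cost_matrix(xs, ys)
P_indep = [[ai * bj for bj in b] for ai in a]          # independent coupling
P_mono = [[0.25, 0.25, 0.0], [0.0, 0.25, 0.25]]        # monotone coupling
```

Here the independent coupling costs 1.0 and the monotone one 0.5, which shows how much the choice of coupling matters even on five points.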

65 Discrete OT Problem. [Figure: the polytope $U(a, b)$, the cost $M_{XY}$ and an optimal solution $P^\star$.]

67 Discrete OT Problem. Def. Dual OT problem: $W_p^p(\mu, \nu) = \max_{\alpha \in \mathbb{R}^n,\ \beta \in \mathbb{R}^m,\ \alpha_i + \beta_j \leq D(x_i, y_j)^p} \alpha^T a + \beta^T b$.

68 Discrete OT Problem. Network flow solver used in practice: $O(n^3 \log(n))$. Note: flow/PDE formulations [Beckman 61]/[Benamou 98] can be used for $p = 1$/$p = 2$ for a sparse-graph metric/Euclidean metric.

70 Discrete OT Problem. Solution $P^\star$ unstable and not always unique. [Figure: a set $\{P^\star\}$ of optimal solutions in $U(a, b)$.]

74 Discrete OT Problem. $W_p^p(\mu, \nu)$ not differentiable.

75 Entropic Regularization [Wilson 62]. Def. Regularized Wasserstein, $\gamma \geq 0$: $W_\gamma(\mu, \nu) \overset{\text{def}}{=} \min_{P \in U(a, b)} \langle P, M_{XY} \rangle - \gamma E(P)$, where $E(P) \overset{\text{def}}{=} -\sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij} (\log P_{ij})$. Note: for $\gamma > 0$ the optimal solution $P_\gamma$ is unique because of the strong concavity of the entropy. [Figure: $\mu$, $\nu$ and $P_\gamma$.]

77 Fast & Scalable Algorithm. Prop. If $P_\gamma \overset{\text{def}}{=} \operatorname{argmin}_{P \in U(a, b)} \langle P, M_{XY} \rangle - \gamma E(P)$, then $\exists!\, u \in \mathbb{R}_+^n, v \in \mathbb{R}_+^m$ (up to the rescaling $u \to tu$, $v \to v/t$, $t > 0$) such that $P_\gamma = \mathrm{diag}(u) K \mathrm{diag}(v)$, $K \overset{\text{def}}{=} e^{-M_{XY}/\gamma}$.

78 Fast & Scalable Algorithm. Proof sketch: $L(P, \alpha, \beta) = \sum_{ij} \big(P_{ij} M_{ij} + \gamma P_{ij} \log P_{ij}\big) + \alpha^T (P 1_m - a) + \beta^T (P^T 1_n - b)$, so $\partial L / \partial P_{ij} = M_{ij} + \gamma (\log P_{ij} + 1) + \alpha_i + \beta_j$, and $(\partial L / \partial P_{ij} = 0) \Rightarrow P_{ij} = e^{-\frac{1}{2} - \frac{\alpha_i}{\gamma}}\, e^{-\frac{M_{ij}}{\gamma}}\, e^{-\frac{1}{2} - \frac{\beta_j}{\gamma}} = u_i K_{ij} v_j$.

79 Fast & Scalable Algorithm. [Sinkhorn 64] fixed-point iterations for $(u, v)$: $u \leftarrow a / Kv$, $v \leftarrow b / K^T u$. $O(nm)$ complexity, GPGPU parallel [C 13]. $O(n^{d+1})$ if $\Omega = \{1, \dots, n\}^d$ and $D^p$ is separable [S..C.. 15].
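
The fixed-point iterations above can be sketched in a few lines of plain Python. This is a toy, dependency-free version on a made-up problem; the parameter name `gamma` for the regularization strength and the iteration count `n_iter` are naming assumptions, and a practical implementation would use vectorised matrix products and often log-domain updates for small `gamma`.

```python
import math

def sinkhorn(a, b, M, gamma, n_iter=200):
    """Return P = diag(u) K diag(v) after n_iter updates u <- a/Kv, v <- b/K^T u."""
    n, m = len(a), len(b)
    K = [[math.exp(-M[i][j] / gamma) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Rescale rows so that the row sums of diag(u) K diag(v) match a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the column sums match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]
b = [0.25, 0.5, 0.25]
M = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0]]  # squared distances between {0, 1} and {0, 1, 2}
P = sinkhorn(a, b, M, gamma=1.0)
row_sums = [sum(row) for row in P]
col_sums = [sum(col) for col in zip(*P)]
```

Unlike a vertex solution of the linear program, the resulting matrix has strictly positive entries everywhere, with marginals that match the two weight vectors up to numerical precision.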

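80 Sinkhorn Divergence. Def. For $\gamma > 0$, let $W_\gamma(\mu, \nu) \overset{\text{def}}{=} \langle P_\gamma, M_{XY} \rangle$. Prop. $W_\gamma(\mu, \mu) > 0$. Def. Normalized Sinkhorn Divergence: $\bar{W}_\gamma(\mu, \nu) \overset{\text{def}}{=} W_\gamma(\mu, \nu) - \frac{1}{2} \big(W_\gamma(\mu, \mu) + W_\gamma(\nu, \nu)\big)$. Prop. If $p = 1$, $\bar{W}_\gamma(\mu, \nu) \to \mathrm{ED}(\mu, \nu)$ as $\gamma \to \infty$, where ED is the energy distance.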
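
The need for the normalization can be seen numerically: the regularized transport cost of a measure to itself is strictly positive, while the normalized divergence of a measure to itself vanishes. The sketch below uses the absolute difference on the real line as cost and made-up measures; `sinkhorn_cost` and `sinkhorn_divergence` are hypothetical helper names, and a fixed number of Sinkhorn iterations stands in for the exact regularized solution.

```python
import math

def sinkhorn_cost(xs, a, ys, b, gamma, n_iter=200):
    """<P, M> for P = diag(u) K diag(v), with M[i][j] = |x_i - y_j| (p = 1)."""
    n, m = len(a), len(b)
    M = [[abs(x - y) for y in ys] for x in xs]
    K = [[math.exp(-M[i][j] / gamma) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * K[i][j] * v[j] * M[i][j] for i in range(n) for j in range(m))

def sinkhorn_divergence(xs, a, ys, b, gamma):
    """Cross term minus half the sum of the two self terms."""
    return sinkhorn_cost(xs, a, ys, b, gamma) - 0.5 * (
        sinkhorn_cost(xs, a, xs, a, gamma) + sinkhorn_cost(ys, b, ys, b, gamma)
    )

xs, a = [0.0, 1.0], [0.5, 0.5]
ys, b = [0.0, 1.0, 2.0], [0.25, 0.5, 0.25]
self_cost = sinkhorn_cost(xs, a, xs, a, gamma=1.0)       # strictly positive
self_div = sinkhorn_divergence(xs, a, xs, a, gamma=1.0)  # zero by construction
cross_div = sinkhorn_divergence(xs, a, ys, b, gamma=1.0)
```

The self cost is positive because the entropic term spreads mass off the diagonal of the coupling, and subtracting the two self terms removes exactly that bias when both arguments coincide.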

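81 Algorithmic Formulation. Def. For $L \geq 1$, define $W_L(\mu, \nu) \overset{\text{def}}{=} \langle P_L, M_{XY} \rangle$, where $P_L \overset{\text{def}}{=} \mathrm{diag}(u_L) K \mathrm{diag}(v_L)$, $v_0 = 1_m$; for $l \geq 0$, $u_l \overset{\text{def}}{=} a / K v_l$, $v_{l+1} \overset{\text{def}}{=} b / K^T u_l$. Prop. Derivatives of $W_L$ can be computed recursively, in $O(L)$ kernel $K \times$ vector products.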

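82 Proposal: Autodiff OT using Sinkhorn. Approximate the $W$ loss by the transport cost $W_L$ after $L$ Sinkhorn iterations. [Figure from [GPC 17]: computational graph. Generative model: latent samples $(z_1, \dots, z_m)$ are mapped to $(x_1, \dots, x_m)$. Input data: $(y_1, \dots, y_n)$. Cost matrix $C = (c(x_i, y_j))_{i,j}$ and kernel $K = e^{-C/\varepsilon}$. Sinkhorn updates for $\ell = 1, \dots, L-1$. Output: $\langle (C \odot K) b_L, a_L \rangle$.]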
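
As a rough illustration of fitting a generator by descending the transport cost after a fixed number of Sinkhorn iterations, the sketch below trains a one-parameter model on toy data. Several things here are assumptions rather than the method of the slides: the generator is a hypothetical pure shift `theta + z`, the cost is the squared distance on the real line, and a central finite difference stands in for automatic differentiation so that the example needs no external library.

```python
import math

def w_L(generated, data, gamma=4.0, L=50):
    """Transport cost <P_L, M> after L Sinkhorn iterations (uniform weights, p = 2)."""
    n, m = len(generated), len(data)
    a, b = [1.0 / n] * n, [1.0 / m] * m
    M = [[(x - y) ** 2 for y in data] for x in generated]
    K = [[math.exp(-M[i][j] / gamma) for j in range(m)] for i in range(n)]
    v = [1.0] * m  # v_0
    for _ in range(L):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # v now holds v_L; compute the matching u_L before forming P_L.
    u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    return sum(u[i] * K[i][j] * v[j] * M[i][j] for i in range(n) for j in range(m))

def generate(theta, zs):
    """Hypothetical generator: shifts every latent sample by theta."""
    return [theta + z for z in zs]

def fit(zs, data, theta=0.0, lr=0.25, steps=40, h=1e-4):
    """Gradient descent on theta -> w_L(generate(theta, zs), data)."""
    for _ in range(steps):
        # Central finite difference stands in for automatic differentiation.
        up = w_L(generate(theta + h, zs), data)
        down = w_L(generate(theta - h, zs), data)
        theta -= lr * (up - down) / (2 * h)
    return theta

zs = [-1.0, 0.0, 1.0]   # fixed latent samples
data = [2.0, 3.0, 4.0]  # toy data: the latent samples shifted by 3
theta_hat = fit(zs, data)
```

On this symmetric toy problem the loss behaves like a quadratic in the shift, so the iterates settle near 3; with a neural generator and automatic differentiation the same loop structure would apply, but none of that is shown here.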

83 Example: MNIST, Learning $f_\theta$

84 Example: Generation of Images. [Figure: CIFAR 10 images; panels (a) MMD, (b), (c); labels MMD-GAN, $\tau = 1000$, $\tau = 10$.] In these examples the cost function is also learned adversarially, as a NN mapping onto feature vectors.

85 Concluding Remarks. Regularized OT is much faster than OT. Regularized OT can interpolate between W and the MMD / Energy distance metrics. The solution of regularized OT is auto-differentiable. Many open problems remain! NIPS 17 workshop. NIPS 17 tutorial.
