Theory and Applications of Wasserstein Distance. Yunsheng Bai Mar 2, 2018


Roadmap
1. Why Study Wasserstein Distance?
2. Elementary Distances Between Two Distributions
3. Applications of Wasserstein Distance

Why Study Wasserstein Distance?

Motivation: Why Study Wasserstein Distance?
1. Recent trend in machine learning:
   a. Appears in ICLR 2018 top-10 papers
   b. Explosion in DBLP "Wasserstein" keyword search results
   c. A wide variety of applications, including GANs
2. Useful in our project; used by MNE (Nikolentzos, Giannis, Polykarpos Meladianos, and Michalis Vazirgiannis. "Matching Node Embeddings for Graph Similarity." AAAI. 2017.)

http://search.iclr2018.smerity.com/

Ranked 7th http://search.iclr2018.smerity.com/

Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arxiv preprint arxiv:1711.01558 (2017). https://openreview.net/forum?id=hkl7n1-0b

http://dblp.uni-trier.de/

Credit: Yichao Zhou

What Researchers Say about Wasserstein Distance 1. Learning under a Wasserstein loss, a.k.a. Wasserstein loss minimization (WLM), is an emerging research topic for gaining insights from a large set of structured objects (Ye, Jianbo, James Z. Wang, and Jia Li. "A Simulated Annealing Based Inexact Oracle for Wasserstein Loss Minimization." arXiv preprint arXiv:1608.03859 (2016).) 2. The use of the EMD has been pioneered in the computer vision literature (Rubner et al., 1998; Ren et al., 2011). Several publications investigate approximations of the EMD for image retrieval applications (Grauman & Darrell, 2004; Shirdhonkar & Jacobs, 2008; Levina & Bickel, 2001). As word embeddings improve in quality, document retrieval enters an analogous setup, where each word is associated with a highly informative feature vector. (Kusner, Matt, et al. "From word embeddings to document distances." ICML. 2015.)

Roadmap
1. Why Study Wasserstein Distance?
2. Elementary Distances Between Two Distributions
3. Applications of Wasserstein Distance

Elementary Distances Between Two Distributions

Infimum and Supremum
Infimum: greatest lower bound. Supremum: least upper bound.
[Figure: a set T of real numbers (red and green balls), a subset S of T (green balls), and the infimum of S.]
https://en.wikipedia.org/wiki/infimum_and_supremum

Total Variation (TV) Distance
1. X: compact metric set
2. Σ: set of all the Borel subsets of X
3. Prob(X): space of probability measures defined on X
4. P_r, P_g ∈ Prob(X): two distributions (r -- real; g -- learnable)
δ(P_r, P_g) = sup_{A ∈ Σ} |P_r(A) − P_g(A)|: the largest possible difference between the probabilities that the two probability distributions can assign to the same event (Wikipedia)
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Example: let Z ~ U[0, 1]; P_0 is the distribution of (0, Z) and P_θ the distribution of (θ, Z) -- two parallel vertical segments, from (0, 0) to (0, 1) and from (θ, 0) to (θ, 1). Then δ(P_0, P_θ) = 1 if θ ≠ 0 and 0 if θ = 0: TV jumps and says nothing about how close the segments are.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Kullback-Leibler (KL) Divergence
KL(P_r || P_g) = ∫ log(P_r(x)/P_g(x)) P_r(x) dμ(x) = E_{x~P_r}[log(P_r(x)/P_g(x))]
1. Prob(X): space of probability measures defined on X
2. P_r, P_g ∈ Prob(X): two absolutely continuous (w.r.t. a measure μ) distributions
3. The expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P (Wikipedia)
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

KL Divergence: Information Theory
KL(P || Q) = H(P, Q) − H(P), the cross entropy of P and Q minus the entropy of P: the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P (Wikipedia).
https://en.wikipedia.org/wiki/kullback%e2%80%93leibler_divergence

KL Divergence: Information Theory
Cross entropy of P and Q: H(P, Q) = −Σ_x p(x) log q(x). Entropy of P: H(P) = −Σ_x p(x) log p(x).
https://en.wikipedia.org/wiki/kullback%e2%80%93leibler_divergence

KL Divergence: Information Theory -- Intuition 1
Shannon's source coding theorem says at least log(1/p_i) bits are needed to encode a symbol that appears with probability p_i. The more likely the symbol is, the less information it carries, and the fewer bits are needed to encode it.
https://en.wikipedia.org/wiki/entropy_(information_theory)

KL Divergence: Information Theory -- Intuition 2
Entropy measures the average number of bits per symbol needed to encode the data. The more skewed the distribution is, the more predictable it is, the less surprise we get, the less information it contains, the smaller its entropy, and the fewer bits we need on average to encode it.
https://en.wikipedia.org/wiki/entropy_(information_theory)

KL Divergence: Information Theory
For every symbol, if a wrong distribution Q is used, the cross entropy H(P, Q) measures the new average number of bits per symbol. Per symbol, the extra cost log(p(x)/q(x)) can be positive (use more bits than necessary) or negative (save some bits), and the total can be +∞ if p(x) > 0 while q(x) = 0.
https://en.wikipedia.org/wiki/entropy_(information_theory)

KL Divergence: Information Theory
Using Intuition 1, we should be able to understand the cross entropy H(P, Q) = −Σ_x p(x) log q(x). Intuition 1 comes from Shannon's source coding theorem. Gibbs' inequality, H(P, Q) ≥ H(P), is used to prove both the theorem and the non-negativity of KL; in essence it uses the bound ln x ≤ x − 1.
https://en.wikipedia.org/wiki/kullback%e2%80%93leibler_divergence
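A quick numeric sketch tying the two intuitions together (the distributions p and q below are made up for illustration): the extra bits paid for coding data from P with a code built for Q equal H(P, Q) − H(P), which is exactly the KL divergence.

import numpy as np

p = np.array([0.5, 0.25, 0.25])     # true distribution P
q = np.array([0.8, 0.1, 0.1])       # wrong coding distribution Q

H_p  = -np.sum(p * np.log2(p))      # entropy of P: average bits with the optimal code
H_pq = -np.sum(p * np.log2(q))      # cross entropy: average bits when coding with Q
kl   = np.sum(p * np.log2(p / q))   # KL(P || Q) in bits
print(H_p, H_pq, H_pq - H_p, kl)    # the last two numbers agree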

Kullback-Leibler (KL) Divergence :(
1. Asymmetric: KL(P_r || P_g) ≠ KL(P_g || P_r) in general, while mathematically a distance metric should be symmetric.
2. Possibly infinite.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
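A small illustration of both complaints (the distributions are made up): swapping the arguments changes the value, and KL blows up to +∞ as soon as the second argument assigns zero probability where the first does not.

import numpy as np

def kl(p, q):
    # convention: 0 * log(0/q) = 0; p > 0 with q = 0 contributes +inf
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(p > 0, p * np.log(p / q), 0.0)
    return terms.sum()

p = [0.5, 0.5, 0.0]
q = [0.9, 0.05, 0.05]
print(kl(p, q), kl(q, p))   # asymmetric, and KL(q || p) = inf since q puts mass where p has none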

Example (continued): for the two parallel segments, KL(P_0 || P_θ) = KL(P_θ || P_0) = +∞ if θ ≠ 0 and 0 if θ = 0.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

f-divergences: a family of divergences D_f(P || Q) = ∫ f(dP/dQ) dQ that includes KL, reverse KL, TV, and JS as special cases.
Bousquet, Olivier, et al. "From optimal transport to generative modeling: the VEGAN cookbook." arXiv preprint arXiv:1705.07642 (2017). https://en.wikipedia.org/wiki/f-divergence

Jensen-Shannon (JS) Divergence
JS(P_r, P_g) = ½ KL(P_r || P_m) + ½ KL(P_g || P_m), where P_m = (P_r + P_g)/2.
For the parallel-segments example, JS(P_0, P_θ) = log 2 if θ ≠ 0 and 0 if θ = 0: not continuous and does not provide a usable gradient :(
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
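A discrete stand-in for the parallel-segments example (the four-bin distributions below are illustrative, not from the slide): once the supports are disjoint, JS sticks at log 2 and carries no information about how far apart the distributions are.

import numpy as np
from scipy.stats import entropy   # entropy(p, m) computes KL(p || m)

def js(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

p0     = [0.5, 0.5, 0.0, 0.0]      # "P0": mass on the first two bins
ptheta = [0.0, 0.0, 0.5, 0.5]      # "P_theta": mass on the last two bins
print(js(p0, ptheta), np.log(2))   # both print log 2 ~ 0.693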

Generative Adversarial Networks (GAN) Credit: Junheng Hao

JS Divergence and GAN
P_r, P_data: real data distribution; P_g: generator distribution; P_z: prior on input noise; D (D*): (optimal) discriminator; G: generator.
In the limit, the maximum of the discriminator objective is (up to constants) the JS divergence!
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017). Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014.
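For reference, the GAN objective and its link to JS (from Goodfellow et al. 2014, written out here rather than shown on the slide):

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))], \qquad D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)},$$
$$\max_D V(D, G) = 2\,\mathrm{JS}(P_r \,\|\, P_g) - 2\log 2.$$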

Earth-Mover (EM) Distance (Wasserstein-1)?
[Figure: the parallel-segments example, P_0 vs P_θ.]
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Earth-Mover (EM) Distance (Wasserstein-1)

Earth-Mover (EM) Distance (Wasserstein-1) https://vincentherrmann.github.io/blog/wasserstein/

Earth-Mover (EM) Distance (Wasserstein-1) Credit: Junheng Hao

Earth-Mover (EM) Distance (Wasserstein-1)
Optimal transport problem: for a distribution of mass μ(x), we wish to transport the mass so that it is transformed into ν(x).
c(x, y): cost of transporting a unit mass from the point x to the point y.
γ(x, y): amount of mass to move from x to y; γ is a joint probability distribution with marginals μ and ν.
https://en.wikipedia.org/wiki/wasserstein_metric

Earth-Mover (EM) Distance (Wasserstein-1)
The move is reversible (transporting mass from x to y costs the same as transporting it back), so the distance is symmetric.
https://en.wikipedia.org/wiki/wasserstein_metric

Earth-Mover (EM) Distance (Wasserstein-1)
c(x, y): cost of transporting a unit mass from x to the point y; γ(x, y): a joint probability distribution with marginals μ and ν.
Total cost of a transport plan γ: ∫∫ c(x, y) γ(x, y) dx dy.
Cost of the optimal plan: W(μ, ν) = inf_{γ ∈ Γ(μ, ν)} ∫∫ c(x, y) dγ(x, y), where Γ(μ, ν) is the set of all joint distributions whose marginals are μ and ν.
If c(x, y) = ||x − y||, the distance is called W-1 (p = 1).
https://en.wikipedia.org/wiki/wasserstein_metric

Earth-Mover (EM) Distance (Wasserstein-1)
For the parallel-segments example, W(P_0, P_θ) = |θ|: the distance is continuous in θ and shrinks as the two segments approach each other.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
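A one-dimensional sanity check in the spirit of the example (collapsing each segment to a point mass; scipy.stats.wasserstein_distance computes W-1 between 1-D empirical distributions): the distance equals |θ| and shrinks smoothly as θ -> 0, unlike TV, KL and JS.

from scipy.stats import wasserstein_distance

for theta in [2.0, 1.0, 0.5, 0.1, 0.0]:
    # P0 is a point mass at 0, P_theta a point mass at theta
    print(theta, wasserstein_distance([0.0], [theta]))   # prints |theta|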

Continuity and Differentiability Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arxiv preprint arxiv:1701.07875 (2017).

EM Distance Intuitively Makes Sense
TV, KL and JS focus on the probabilities p(x) and p(y) and ignore x and y themselves: they look at event probabilities but ignore where the events occur -- a bag of P's.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

EM Distance Intuitively Makes Sense
The EM distance compares two distributions using the separation information explicitly, through the ground distance ||x − y||.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Problem: Reasonable and Efficient Approximation
Finding the optimal transport plan γ is virtually impossible in general! See examples in the next section.
https://en.wikipedia.org/wiki/wasserstein_metric

History: From Math to Computer Science The concept was first introduced by Gaspard Monge (a French mathematician) in 1781, and anchors the field of transportation theory. The name "Wasserstein distance" was coined by R. L. Dobrushin (a Russian mathematician) in 1970, after Leonid Vaseršteĭn (a Russian-American mathematician) who introduced the concept in 1969. The use of the EMD as a distance measure for monochromatic images was described in 1989 by S. Peleg, M. Werman and H. Rom (Jerusalem CS researchers). The name "earth movers' distance" was proposed by J. Stolfi (a Brazilian CS Professor) in 1994, and was used in print in 1998. https://en.wikipedia.org/wiki/wasserstein_metric https://en.wikipedia.org/wiki/earth_mover%27s_distance

Roadmap
1. Why Study Wasserstein Distance?
2. Elementary Distances Between Two Distributions
3. Applications of Wasserstein Distance

Applications of Wasserstein Distance

Computer Vision: Image Retrieval
dist = ||emb_i − emb_j||_1
Grauman, Kristen, and Trevor Darrell. "Fast contour matching using approximate earth mover's distance." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 1. IEEE, 2004.

Computer Vision: Image Retrieval To embed a set of 2-D contour points, a hierarchy of grids is imposed on the image coordinates. This is a fast way to approximate EMD. Contour point coordinates identify the events. Grauman, Kristen, and Trevor Darrell. "Fast contour matching using approximate earth mover's distance." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 1. IEEE, 2004.

Computer Vision: Image Retrieval
[Figure: input shapes; 33 points are randomly chosen from each shape's contour to be duplicated (circled points); the resulting EMD flow.]
Grauman, Kristen, and Trevor Darrell. "Fast contour matching using approximate earth mover's distance." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 1. IEEE, 2004.

Computer Vision: Image Retrieval
Real image queries: examples of query contours from a real person (left, blue) and the 5 nearest neighbors retrieved from a synthetic database of 136,500 images.
Grauman, Kristen, and Trevor Darrell. "Fast contour matching using approximate earth mover's distance." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 1. IEEE, 2004.

Computer Vision: GAN
"In my limited GAN experience, one of the big problems is that the loss doesn't really mean anything, thanks to adversarial training, which makes it hard to judge if models are training or not. Reinforcement learning has a similar problem with its loss functions, but there we at least get mean episode reward. Even a rough quantitative measure of training progress could be good enough to use automated hyperparam optimization tricks, like Bayesian optimization." -- A current Google engineer on the Brain Robotics team
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Computer Vision: GAN
Wasserstein GAN has significant practical benefits: (1) a meaningful loss metric that correlates with the generator's convergence and sample quality.
With the JS divergence: increasing error :( With the EM distance: lower error -> better sample quality :)
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Computer Vision: GAN
Wasserstein GAN has significant practical benefits: (2) improved stability of the optimization process.
[Generated samples:] DCGAN + WGAN :) | DCGAN + GAN :)
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Computer Vision: GAN
[Generated samples:] no batch normalization + WGAN :) | MLP + WGAN | no batch normalization + GAN :( | MLP + GAN (mode collapse) :(
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

Duality for Wasserstein-1
Kantorovich-Rubinstein Theorem: finding an optimal f is easier than finding an optimal transport plan γ.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Duality for Wasserstein-1
Kantorovich-Rubinstein Theorem: finding an optimal f is easier than finding an optimal transport plan γ. The double integral over the 2-D joint γ(x, y) becomes a single integral over the 1-D function f.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html
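Written out, the theorem the two slides above refer to (Kantorovich-Rubinstein duality for the W-1 cost):

$$W_1(\mathbb{P}_r, \mathbb{P}_g) \;=\; \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] \;-\; \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)],$$

where the supremum runs over all 1-Lipschitz functions f.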

Finding an optimal f is easier than finding an optimal transport plan γ: in WGAN, f is parameterized by weights w (a neural network).
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
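Concretely, WGAN restricts f to a parametric family {f_w} (a neural network with weights w lying in a compact set W) and solves

$$\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))],$$

which approximates W-1 between the data and generator distributions up to a constant scaling factor.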

f parameterized by weights Credit: Junheng Hao

Matrix-Level Calculation of Wasserstein-1
Cost = <D, Γ>_F, the Frobenius inner product (sum of all the element-wise products) of the distance matrix D and the transport-plan matrix Γ.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Matrix-Level Calculation of Wasserstein-1
The transport plan is a matrix Γ with entries γ(x_i, y_j), i, j = 1..5, and the cost is the Frobenius inner product (sum of all the element-wise products) of Γ with the distance matrix D. Assume x = y, i.e. the two probability distributions share the same events; then D_ij = |i − j|:
0 1 2 3 4
1 0 1 2 3
2 1 0 1 2
3 2 1 0 1
4 3 2 1 0
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Calculating Wasserstein-1 Linear Programming https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Calculating Wasserstein-1 Linear Programming

import numpy as np
from scipy.optimize import linprog

# Example inputs (illustrative): l events, two distributions P_r and P_t over them,
# and the pairwise distance matrix D with D[i, j] = |i - j|.
l = 5
P_r = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
P_t = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
D = np.abs(np.arange(l)[:, None] - np.arange(l)[None, :])

# We construct our A matrix by creating two 3-way tensors,
# and then reshaping and concatenating them: row i of A_r sums the mass
# leaving event i, row i of A_t sums the mass arriving at event i.
A_r = np.zeros((l, l, l))
A_t = np.zeros((l, l, l))
for i in range(l):
    for j in range(l):
        A_r[i, i, j] = 1
        A_t[i, j, i] = 1
A = np.concatenate((A_r.reshape((l, l**2)), A_t.reshape((l, l**2))), axis=0)
b = np.concatenate((P_r, P_t), axis=0)
c = D.reshape((l**2))

opt_res = linprog(c, A_eq=A, b_eq=b)   # minimize <D, gamma> subject to the marginal constraints
emd = opt_res.fun                      # the Wasserstein-1 distance
gamma = opt_res.x.reshape((l, l))      # the optimal transport plan

https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Linear Programming: Always Has a Dual Form
Weak Duality Theorem: any dual-feasible y gives a lower bound on the primal optimum, i.e. b^T y <= c^T x for every primal-feasible x.
x: transport plan (to find); c: distance vector; z = c^T x: cost; A: selector matrix; b: stacked probability distributions.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html
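In the slide's notation, the primal and dual programs are the standard LP pair (written here for completeness):

$$\text{Primal:}\quad \min_{x \ge 0} \; c^\top x \;\; \text{s.t.}\;\; A x = b, \qquad \text{Dual:}\quad \max_{y} \; b^\top y \;\; \text{s.t.}\;\; A^\top y \le c,$$

and weak duality says $b^\top y \le c^\top x$ for any feasible pair.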

Farkas' Lemma (on the solvability of Ax = b with x >= 0) is the key ingredient for strong duality: the primal and dual optima coincide. See details on strong duality at https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Dual Form
x: transport plan (to find); c: distance vector; z: cost; A: selector matrix; b: probability distributions.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Dual Form
x: transport plan (to find); c: distance vector; z: cost; A: selector matrix; b: probability distributions.
The dual objective is b^T y = Σ_i P_r(x_i) f(x_i) + Σ_i P_g(x_i) g(x_i): the dual variable y splits into two vectors f = (f(x_1), ..., f(x_n)) and g = (g(x_1), ..., g(x_n)).
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

With g(x) = −f(x), the dual objective max Σ_i P_r(x_i) f(x_i) + Σ_i P_g(x_i) g(x_i) becomes max Σ_i P_r(x_i) f(x_i) − Σ_i P_g(x_i) f(x_i).
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Why should g(x) be equal to −f(x)? The constraint for the pair (x, x) reads f(x) + g(x) <= D(x, x) = 0, so g(x) <= −f(x); to maximize the objective we push g(x) up to its maximum, −f(x).

With g = −f, what do the remaining constraints f(x_i) + g(x_j) <= D_ij, i.e. f(x_i) − f(x_j) <= D_ij, say about f?
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

f must be a 1-Lipschitz Function
f(x_i) − f(x_j) <= D_ij = ||x_i − x_j|| for all i, j pairs: the slopes of f must be within [−1, 1].
A real-valued function f : R -> R is called Lipschitz continuous if there exists a positive real constant K such that |f(x_1) − f(x_2)| <= K |x_1 − x_2| for all real x_1 and x_2. Here K = 1.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Duality for Wasserstein-1
Kantorovich-Rubinstein Theorem: finding an optimal f is easier than finding an optimal transport plan γ.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

GAN vs WGAN
GAN: constrain D(x) to always be a probability within [0, 1]; the loss is (in the limit) the JS divergence :(
WGAN: the authors tend to call f the critic instead of the discriminator -- it's not explicitly trying to classify inputs as real or fake. The EM distance is differentiable nearly everywhere, so we can (and should) train f to convergence before each generator update.
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

We can (and should) train f (the critic) to convergence before each generator update.
[Figure: the WGAN training algorithm -- several critic updates with weight clipping per generator update.]
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017). https://www.alexirpan.com/2017/02/22/wasserstein-gan.html
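A minimal PyTorch sketch of this training loop (Algorithm 1 of the WGAN paper); the toy 1-D Gaussian data and network sizes are made up, while the learning rate, clipping value and n_critic follow the paper's defaults:

import torch
import torch.nn as nn

critic    = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
clip, n_critic, batch = 0.01, 5, 64

for step in range(1000):
    for _ in range(n_critic):                        # several critic steps per generator step
        x_real = torch.randn(batch, 1) * 0.5 + 2.0   # toy "real" data
        x_fake = generator(torch.randn(batch, 8)).detach()
        loss_c = -(critic(x_real).mean() - critic(x_fake).mean())   # maximize E[f(real)] - E[f(fake)]
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():                # weight clipping keeps f (roughly) K-Lipschitz
            p.data.clamp_(-clip, clip)
    x_fake = generator(torch.randn(batch, 8))
    loss_g = -critic(x_fake).mean()                  # generator pushes the critic score of fakes up
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

The negative of the critic loss, critic(x_real).mean() - critic(x_fake).mean(), is the quantity that tracks the (scaled) EM distance and correlates with sample quality.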

Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.

Computer Vision: GAN
"I heard that in Wasserstein GAN, you can (and should) train the discriminator to convergence. If true, it would remove needing to balance generator updates with discriminator updates, which feels like one of the big sources of black magic for making GANs train. In my limited GAN experience, one of the big problems is that the loss doesn't really mean anything, thanks to adversarial training, which makes it hard to judge if models are training or not." -- A current Google engineer on the Brain Robotics team
https://www.alexirpan.com/2017/02/22/wasserstein-gan.html

Natural Language Processing: Doc Classification
[Pipeline figure: given a query and a document collection, first prune candidates using average word embeddings (fast), then compare the query against each remaining candidate with the EMD to produce the result.]
Kusner, Matt, et al. "From word embeddings to document distances." ICML. 2015.

Natural Language Processing: Doc Classification
Finding the optimal flow matrix T takes O(n^3 log n) time, where n is the number of unique words. The word embeddings x identify the events.
Kusner, Matt, et al. "From word embeddings to document distances." ICML. 2015.

Graph/Network Classification
Each node of a graph with n_1 nodes receives mass P(node_i) = 1/n_1. Finding the optimal flow matrix T takes O(n^3 log n) time, where n is the number of nodes of the two graphs. Node labels may also be considered. The node embeddings u identify the events.
Nikolentzos, Giannis, Polykarpos Meladianos, and Michalis Vazirgiannis. "Matching Node Embeddings for Graph Similarity." AAAI. 2017.

Approximation: O(n^3 log n) -> O(n^2)
We don't need a differentiable EMD; we just need the distance value. Remove the second constraint; then the optimal solution is obvious: if each word is only allowed to move its mass to ONE word, that word must be its nearest neighbor. The problem reduces to nearest-neighbor search/matching.
Kusner, Matt, et al. "From word embeddings to document distances." ICML. 2015.
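A hedged sketch of this relaxation (the function name and toy embeddings are mine, not from the paper): with the second marginal constraint dropped, each query word optimally sends all of its mass to its nearest neighbor in the candidate document, giving an O(n^2) lower bound on the exact WMD.

import numpy as np

def relaxed_wmd(X_query, w_query, X_cand):
    # X_query: (n, d) embeddings of the query's unique words, w_query: their normalized weights
    # X_cand:  (m, d) embeddings of the candidate document's unique words
    dists = np.linalg.norm(X_query[:, None, :] - X_cand[None, :, :], axis=2)  # (n, m) pairwise distances
    return float(np.dot(w_query, dists.min(axis=1)))   # each word moves all its mass to its nearest neighbor

# toy usage with made-up 3-D "embeddings"
Xq = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
wq = np.array([0.5, 0.5])
Xc = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
print(relaxed_wmd(Xq, wq, Xc))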

Thank you!