Optimal Transport in ML

1 Optimal Transport in ML
Rémi Gilleron, Inria Lille & CRIStAL & Univ. Lille, Feb.
Main source (also some figures): Computational Optimal Transport, G. Peyré and M. Cuturi

2 Compare documents using word embeddings
Given: word embeddings in R^d and a similarity on R^d
Problem: compare documents (phrases, sentences)
Idea: represent a document as a histogram of frequencies of the words of the vocabulary V
Figure: three documents in R^2; each circle corresponds to a word vector, its size proportional to the word's frequency.
Measure the cost of a transportation plan between histograms over the vocabulary, using the similarity between word vectors.

3 Unsupervised domain adaptation
Source: labeled data; target: unlabeled data; problem: classify data in the target domain.
Base idea: transport data from the source domain into the target domain, then learn a classifier.
The transportation plan between the two point clouds uses a ground distance and the empirical distributions of the clouds.
Figure: source in blue and red; target in green.

4 Histogram propagation over graphs
Example: traffic estimation
Given: a graph representing roads
Given: at some nodes, sensors provide traffic histograms over a 24h period
Problem: compute traffic histograms for every node
Base idea: use a propagation algorithm over the graph
The similarity should combine a similarity between histograms with a similarity on the graph, which brings in spatial information.

5 Optimal Transport (OT)
What is it? A method for comparing probability distributions that can incorporate spatial information.
Pros
  Distance between distributions based on a ground distance
  Defined for all distributions: discrete, with density, arbitrary
  Solid mathematical foundations; works well in applications
Cons
  Computing OT means solving an optimization problem
  The mathematics are not easy. Methods and algorithms depend on
    the measures (often discrete) and the dimension
    the ground cost: arbitrary, distance, squared distance, geodesic distance
    the ground space: R, R^d, a geodesic space

6 Research on computational OT
Close to Magnet's research problems:
  Word mover's distance
  Optimal transport for domain adaptation
  Histogram propagation over graphs
But also: signal and image processing, Wasserstein distances and divergences, efficient computation of (regularized) OT, generative models, Wasserstein GANs, among others.

7 Plan
1 Optimal Transport (OT)
  Monge Problem and Kantorovitch Problem
  Wasserstein Distance
  Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion

8 Intuitions
The goal of OT is to define geometric tools for comparing probability distributions.
Earth mover's distance
  probability distribution = pile of sand
  move one pile of sand into another one
  local cost: move one grain of sand from one place to another
  OT = minimal global cost
Mines and factories problem
  Mines produce resources across a country
  Factories consume resources across a country
  Local cost for delivering one resource from a mine to a factory
  OT = least costly transportation plan from the mines to the factories

11 Main Scenarios
Distributions
  Discrete measure: α = Σ_{i=1}^n a_i δ_{x_i}, where δ_x is the Dirac at x and a ∈ R_+^n. Then, for f ∈ C(X), ∫_X f(x) dα(x) = Σ_{i=1}^n a_i f(x_i).
  Probability measure: Σ_{i=1}^n a_i = 1, i.e. a ∈ Σ_n is a histogram
  Lagrangian (point clouds): (x_i)_{i=1}^n with a_i = 1/n
  Eulerian (histograms): grid, a_i is the probability mass at cell i
  Measure with density ρ_α: dα(x) = ρ_α(x) dx. Then, for h ∈ C(X), ∫_X h(x) dα(x) = ∫_X h(x) ρ_α(x) dx.
  Arbitrary measure
Ground space and ground cost
  X with a distance; in general X = R^d with d = 1 or d > 1
  c is a cost function from X × Y to R_+. When X = Y = R^d, c is generally a distance d or a squared distance d^2. For discrete measures it is given by a matrix C.
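
To make the discrete setting concrete, here is a minimal NumPy sketch (not from the slides; the points and weights are arbitrary placeholders) that builds two weighted point clouds and the cost matrices for the usual choices of ground cost.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# A discrete measure alpha = sum_i a_i delta_{x_i}: support points and weights.
n, m, d = 5, 4, 2
X = rng.normal(size=(n, d))          # support of alpha in R^d
Y = rng.normal(size=(m, d))          # support of beta in R^d
a = np.full(n, 1.0 / n)              # Lagrangian case: uniform weights, a in Sigma_n
b = rng.random(m); b /= b.sum()      # a general histogram b in Sigma_m

# Ground cost matrices for two common choices of c.
C_dist = cdist(X, Y)                 # c = Euclidean distance d
C_sq = cdist(X, Y, "sqeuclidean")    # c = squared Euclidean distance d^2
```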

12 Monge Problem
Transport map: Monge problem for discrete measures
  Let α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j} be two discrete measures.
  Find a map T from {x_1, ..., x_n} into {y_1, ..., y_m} such that, for all j, b_j = Σ_{i : T(x_i) = y_j} a_i.
  T defines a pushforward operator T_# between measures such that T_# α = β.
  The transport map should minimize Σ_i c(x_i, T(x_i)).
Examples (each pair is (point, mass))
  (x_1, 1), (x_2, 2), (x_3, 3), (x_4, 1), (x_5, 1); (y_1, 4), (y_2, 2), (y_3, 2), with c = 1 and with c = d the Euclidean distance
  (x_1, 2), (x_2, 2), (x_3, 2); (y_1, 1), (y_2, 2), (y_3, 2), (y_4, 1), c = 1
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1), c = 1

13 Monge problem for histograms
Optimal assignment problem
  Let n = m, let a = b = 1_n / n, and let C be a cost matrix in R_+^{n×n} giving the cost of moving x_i to y_j.
  Find σ in the set Perm(n) of permutations of n elements solving min_{σ ∈ Perm(n)} (1/n) Σ_{i=1}^n C_{i,σ(i)}
Remarks
  The naive algorithm is intractable because Perm(n) has n! elements
  ((2, 1), 1/2), ((2, 3), 1/2); ((1, 2), 1/2), ((3, 2), 1/2), Euclidean distance
  (x_1, 1/2), (x_2, 1/2); (y_1, 1/4), (y_2, 1/4), (y_3, 1/4), (y_4, 1/4), c = 1
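
The assignment formulation can be checked numerically with SciPy's Hungarian-algorithm solver; the following sketch uses the slide's first 2D example and is only an illustration of the formula min_σ (1/n) Σ_i C_{i,σ(i)}.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Point clouds from the slide's first example, uniform weights 1/n.
X = np.array([[2.0, 1.0], [2.0, 3.0]])
Y = np.array([[1.0, 2.0], [3.0, 2.0]])

C = cdist(X, Y)                         # Euclidean ground cost C[i, j] = ||x_i - y_j||
rows, cols = linear_sum_assignment(C)   # optimal permutation sigma
cost = C[rows, cols].mean()             # (1/n) * sum_i C[i, sigma(i)]
print(cols, cost)
```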

14 Relaxation of the Monge problem
Limitations of the Monge problem
  Feasible solutions may not exist
  Multiple solutions may exist
  The assignment problem is combinatorial
  The Monge problem for arbitrary measures is not convex
  Existence and uniqueness of the Monge map for c the squared Euclidean distance and measures with density (Brenier 91)
Relax the deterministic nature of transportation
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1), c = 1.
  Kantorovitch's relaxation: mass splitting from a source towards several targets
  A coupling maps x_1 to y_1, and 1/2 of the mass at x_2 to y_1 and 1/2 to y_2
  Where are x_1 and x_2 transported?

15 Kantorovitch's OT problem for discrete measures
Formulation
  Let a ∈ R_+^n and b ∈ R_+^m be two mass vectors for x_1, ..., x_n and y_1, ..., y_m, and let C be the cost matrix.
  A coupling is a matrix P ∈ R_+^{n×m} where P_{i,j} is the amount of mass flowing from x_i to y_j.
  The set of admissible couplings is U(a, b) = {P | P 1_m = a, P^t 1_n = b}. It is a bounded convex polytope.
  Kantorovitch's OT problem is L_C(a, b) = min_{P ∈ U(a,b)} <C, P> = min_{P ∈ U(a,b)} Σ_{i,j} C_{i,j} P_{i,j}
Mines and factories: find an optimal transportation plan between mines and factories.
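
As a quick numerical sketch of L_C(a, b), assuming the POT library (`pip install pot`) and its exact solver `ot.emd`; the 2×2 cost matrix below is a placeholder, only the masses come from the slide's running example.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed installed)

# Masses from the example (x1,3),(x2,2); (y1,4),(y2,1), normalized to sum to 1.
a = np.array([3.0, 2.0]) / 5.0
b = np.array([4.0, 1.0]) / 5.0

# An arbitrary 2x2 ground cost C[i, j] between x_i and y_j (placeholder values).
C = np.array([[0.0, 1.0],
              [1.0, 2.0]])

P = ot.emd(a, b, C)            # exact solver: returns an optimal coupling in U(a, b)
print(P)                       # rows sum to a, columns sum to b
print(np.sum(P * C))           # L_C(a, b) = <C, P>
```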

16 Kantorovitch's OT problem
Formulation for arbitrary measures
  Let α and β be two measures; a coupling π is a joint distribution over X × Y.
  Set of admissible couplings: U(α, β) = {π | P_X# π = α and P_Y# π = β}, where P_X# and P_Y# are the push-forward projections.
  Kantorovitch's OT problem for a cost function c is L_c(α, β) = min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y)
For discrete measures
  Let α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j} be two discrete measures.
  Then L_c(α, β) = L_C(a, b), where C is the cost matrix defined from c on the supports of α and β.

17 Examples of Kantorovitch's OT problem
X = R with the Euclidean distance
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1) with x_1 < x_2, y_1 < y_2
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1) with x_1 < x_2, y_1 > y_2
Binary cost matrix
  C = [[0, 1, 1], [1, 0, 1], [1, 1, 0]], a = (1/3, 1/3, 1/3), b = (2/3, 1/6, 1/6), P = ?
Assignment problem on X = R^2 with the Euclidean distance
  x_1 = (0, 1), x_2 = (0, 2), x_3 = (0, 3), y_1 = (1, 5/2), y_2 = (1, 3/2), y_3 = (2, 2)
  C = [[5/2, 5/4, 5], [5/4, 5/4, 2], [5/4, 5/2, 5]], a = (1/3, 1/3, 1/3), b = (1/3, 1/3, 1/3), P = ?

18 Examples of Kantorovitch's OT problem (solutions)
X = R with the Euclidean distance
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1); x_1 < x_2, y_1 < y_2: x_1 → y_1; 1/2 of x_2 → y_1; 1/2 of x_2 → y_2
  (x_1, 3), (x_2, 2); (y_1, 4), (y_2, 1); x_1 < x_2, y_1 > y_2: 1/3 of x_1 → y_2; 2/3 of x_1 → y_1; x_2 → y_1
Binary cost matrix
  C = [[0, 1, 1], [1, 0, 1], [1, 1, 0]], a = (1/3, 1/3, 1/3), b = (2/3, 1/6, 1/6), P = [[1/3, 0, 0], [1/6, 1/6, 0], [1/6, 0, 1/6]]
Assignment problem on X = R^2 with the Euclidean distance
  C = [[5/2, 5/4, 5], [5/4, 5/4, 2], [5/4, 5/2, 5]], a = (1/3, 1/3, 1/3), b = (1/3, 1/3, 1/3), P = (1/3) [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

19 Kantorovitch relaxation is tight for assignment problems
Permutation matrices are couplings
  Let n = m, let a = b = 1_n / n, and let C be a cost matrix in R_+^{n×n}.
  Kantorovitch's problem: L_C(1_n/n, 1_n/n) = min_{P ∈ U(1_n/n, 1_n/n)} <C, P>
  For σ ∈ Perm(n), the rescaled permutation matrix P_σ / n is in U(1_n/n, 1_n/n).
Kantorovitch for matching
  Proposition: there exists an optimal solution of Kantorovitch's problem which is a permutation matrix associated with an optimal permutation for the assignment problem (an optimal transport map for the Monge problem between histograms).
  Proof: the extremal points of U(1_n/n, 1_n/n) are the rescaled permutation matrices (Birkhoff's theorem), and the minimum of a linear objective is reached at an extremal point of the polytope (Bertsimas and Tsitsiklis).

20 Plan (outline recalled; next: Wasserstein Distance)

21 Definition of the Wasserstein Distance
OT defines a distance between measures when C satisfies some properties.
p-Wasserstein distance between measures
  Let X = Y, let d be a distance on X, let p ≥ 1, and let c = d^p; then W_p is a distance on M_+^1(X), where W_p(α, β) = L_{d^p}(α, β)^{1/p}
p-Wasserstein distance between histograms
  Let n = m and X = Y, let D be a distance matrix on the n support points, let p ≥ 1, and let C = D^p be the matrix of the D_{i,j}^p; then W_p is a distance on Σ_n, where W_p(a, b) = L_{D^p}(a, b)^{1/p}
First remarks
  W_p on histograms (resp. on measures) depends on D (resp. d)
  W_1 is exactly the optimal transport cost for C = D (resp. c = d)
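
A small sketch of this definition for histograms, assuming POT's `ot.emd2` returns the optimal cost L_C(a, b); the two-Dirac check at the end illustrates the later remark that W_p(δ_x, δ_y) = d(x, y).

```python
import numpy as np
import ot  # POT, assumed available
from scipy.spatial.distance import cdist

def wasserstein_p(a, b, X, Y, p=2):
    """W_p(a, b) = L_{D^p}(a, b)^(1/p) for histograms supported on points X and Y."""
    D = cdist(X, Y)                              # ground distance matrix
    return ot.emd2(a, b, D ** p) ** (1.0 / p)    # ot.emd2 returns the optimal cost

# Two Diracs: W_p(delta_x, delta_y) = d(x, y).
x, y = np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]])
print(wasserstein_p(np.ones(1), np.ones(1), x, y, p=2))   # 5.0
```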

22 The p-Wasserstein distance is a distance
Proof for histograms
Note that C = D^p is symmetric and has a null diagonal. Then:
  W_p(a, b) = 0 if and only if a = b: easy
  W_p(a, b) = W_p(b, a): easy
  W_p(a, c) ≤ W_p(a, b) + W_p(b, c):
    Let S = P diag(1/b̃) Q, where P (resp. Q) is an optimal coupling between a and b (resp. between b and c), and b̃ is b with null values set to 1.
    W_p(a, c) ≤ (<S, D^p>)^{1/p} because S ∈ U(a, c).
    Then use the triangle inequality for D and the Minkowski inequality to show that (<S, D^p>)^{1/p} ≤ W_p(a, b) + W_p(b, c).

23 Properties of the p-Wasserstein distance
Geometric intuition
  W_p is a distance, while many other comparisons are divergences, for instance the KL divergence.
  W_p can compare singular distributions, for instance discrete ones, whereas classical distances or divergences cannot meaningfully compare discrete distributions (e.g., Diracs at different locations).
  W_p quantifies the spatial shift between the supports.
  Barycenters of distributions can be defined by ᾱ = arg min_α Σ_i λ_i W_p^p(α_i, α).
  W_p(δ_x, δ_y) = d(x, y), and W_p(δ_x, δ_y) → 0 when x → y. This allows one to define a more general notion of weak convergence for distributions.

24 Plan (outline recalled; next: Special Cases)

25 Special case: binary cost matrix
Binary cost matrix
  Let a and b be two mass vectors in R_+^n and let C be the cost matrix 1_{n×n} − I_{n×n}; then L_C(a, b) = (1/2) ||a − b||_1
Kronecker cost function and total variation
  Let α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j} be two discrete measures, and let c be the Kronecker cost function defined by c(x, y) = 0 if x = y and c(x, y) = 1 otherwise. Then L_c(α, β) = TVD(α, β) = (1/2) ||α − β||_1
  TVD(α, β) = sup_A |α(A) − β(A)| is the total variation distance between α and β.
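
A quick numerical check of the identity L_C(a, b) = (1/2)||a − b||_1 for the binary cost, assuming POT's `ot.emd2`; the histograms are random placeholders.

```python
import numpy as np
import ot  # POT, assumed available

rng = np.random.default_rng(0)
a = rng.random(5); a /= a.sum()
b = rng.random(5); b /= b.sum()

C = 1.0 - np.eye(5)                 # binary cost matrix 1_{nxn} - I_{nxn}
print(ot.emd2(a, b, C))             # OT cost L_C(a, b)
print(0.5 * np.abs(a - b).sum())    # (1/2) ||a - b||_1 -- should match
```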

26 Special case: dimension 1
Discrete case
  Let X = R, let α = (1/n) Σ_{i=1}^n δ_{x_i} and β = (1/n) Σ_{j=1}^n δ_{y_j}, and suppose that x_1 < x_2 < ... < x_n and y_1 < y_2 < ... < y_n. Then W_p is the L^p-norm between the two vectors of ordered values of α and β:
  W_p^p(α, β) = (1/n) Σ_{i=1}^n |x_i − y_i|^p
  W_p(α, β) changes as soon as the ordering of the points changes.
  This extends to the case n ≠ m with the condition: if x_i < x_{i'}, P_{i,j} ≠ 0 and P_{i',j'} ≠ 0, then necessarily y_j ≤ y_{j'}.
Arbitrary measures
  W_p(α, β) = ||C_α^{-1} − C_β^{-1}||_{L^p((0,1])}, where C_α^{-1} is the quantile function (pseudo-inverse of the cumulative distribution function C_α).
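
The sorted-matching formula is easy to verify numerically; the sketch below (with arbitrary Gaussian samples) compares it for p = 1 with SciPy's 1D Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # reference implementation for p = 1

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(loc=1.0, size=200)

def wasserstein_1d(x, y, p=2):
    """W_p between two empirical measures on R with the same number of points: sort and match."""
    xs, ys = np.sort(x), np.sort(y)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

print(wasserstein_1d(x, y, p=1))
print(wasserstein_distance(x, y))   # SciPy's 1D W_1, should agree
```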

27 Special case: Gaussian distributions
Two Gaussians in R w.r.t. the Euclidean distance
  Let α = N(m_1, σ_1) and β = N(m_2, σ_2); then the optimal transport map is T(x) = m_2 + (σ_2/σ_1)(x − m_1)
  W_2^2(α, β) = (m_2 − m_1)^2 + (σ_2 − σ_1)^2
Two Gaussians in R^d w.r.t. the Euclidean distance
  Let α = N(m_α, Σ_α) and β = N(m_β, Σ_β); then the optimal map is T(x) = m_β + A(x − m_α), where A = Σ_α^{-1/2} (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2} Σ_α^{-1/2} = A^t
  W_2^2(α, β) = ||m_α − m_β||^2 + tr(Σ_α + Σ_β − 2 (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2})
  W_2^2(α, β) = ||m_α − m_β||^2 + ||√r − √s||^2 if Σ_α = diag(r) and Σ_β = diag(s)
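
The closed form in R^d is straightforward to evaluate with a matrix square root; here is a sketch with arbitrary test Gaussians that also checks the diagonal-covariance simplification.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, S1, m2, S2):
    """W_2^2 between N(m1, S1) and N(m2, S2) via the closed form above."""
    S1h = sqrtm(S1)                                   # Sigma_alpha^{1/2}
    cross = sqrtm(S1h @ S2 @ S1h)                     # (S1^{1/2} S2 S1^{1/2})^{1/2}
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross).real

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.ones(2), np.diag([2.0, 0.5])
print(gaussian_w2_squared(m1, S1, m2, S2))
# Diagonal check: ||m1 - m2||^2 + ||sqrt(r) - sqrt(s)||^2
print(np.sum((m1 - m2) ** 2) + np.sum((np.sqrt(np.diag(S1)) - np.sqrt(np.diag(S2))) ** 2))
```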

28 Plan (outline recalled; next: Algorithms for OT)

29 Reminder
Main cases for discrete measures
  Points in R^d or cells: (x_i)_{i=1}^n, (y_j)_{j=1}^m
  Discrete measures α = Σ_{i=1}^n a_i δ_{x_i} and β = Σ_{j=1}^m b_j δ_{y_j}, with a ∈ R_+^n, b ∈ R_+^m, a_i the mass at x_i and b_j the mass at y_j
  Special cases: a ∈ Σ_n (histograms: the weights sum to 1), a_i = 1/n, the case n = m
  c is a cost; c may be a distance, the Euclidean distance, the squared Euclidean distance, ...; C is the cost matrix of the c_{i,j}
OT and Wasserstein distance
  L_C(a, b) = min_{P ∈ U(a,b)} <C, P> = min_{P1_m = a, P^t 1_n = b} Σ_{i,j} C_{i,j} P_{i,j}
  Let n = m, let d be a distance, let p ≥ 1, and let C = d^p be the matrix of the d_{i,j}^p; then W_p is a distance on Σ_n, where W_p(a, b) = L_C(a, b)^{1/p}. Note that W_p(δ_x, δ_y) = d(x, y).

30 OT is a linear program
Kantorovitch linear program
Recall that Kantorovitch's OT problem is L_C(a, b) = min_{P ∈ U(a,b)} <C, P> = min_{P1_m = a, P^t 1_n = b} Σ_{i,j} C_{i,j} P_{i,j}
It can be formulated as
  L_C(a, b) = min_p c^t p   s.t.   p ∈ R_+^{nm},  A p = [a; b]
where the n×m matrices P and C have been replaced by nm-dimensional vectors p and c, and admissible couplings are defined with a well-chosen matrix A.
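
A direct, if inefficient, way to check this formulation is to hand the vectorized problem to a generic LP solver; the sketch below builds the constraint matrix A explicitly and reuses the binary-cost example of slides 17-18.

```python
import numpy as np
from scipy.optimize import linprog

def ot_linprog(a, b, C):
    """Solve L_C(a, b) as the linear program min c^t p s.t. A p = [a; b], p >= 0."""
    n, m = C.shape
    # Constraint matrix A: row sums of P equal a (n rows), column sums equal b (m rows).
    A = np.zeros((n + m, n * m))
    for i in range(n):
        A[i, i * m:(i + 1) * m] = 1.0           # sum_j P[i, j] = a[i]
    for j in range(m):
        A[n + j, j::m] = 1.0                    # sum_i P[i, j] = b[j]
    res = linprog(C.ravel(), A_eq=A, b_eq=np.concatenate([a, b]), bounds=(0, None))
    return res.x.reshape(n, m), res.fun         # an optimal coupling P and the cost L_C(a, b)

# The binary-cost example from slides 17-18.
a = np.array([1/3, 1/3, 1/3]); b = np.array([2/3, 1/6, 1/6])
C = 1.0 - np.eye(3)
P, cost = ot_linprog(a, b, C)
print(np.round(P, 3), cost)                     # cost = 1/2 * ||a - b||_1 = 1/3
```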

31 Network simplex algorithm for OT
Algorithm complexity in O(n^3)
Because one can restrict the search to the extremal points of the polytope U(a, b), and because the structure of the matrices P can be expressed with bipartite graphs, one can use a network flow solver.
For matching problems (n = m, a_i = b_i = 1/n), the auction algorithm runs in O(n^3 ||C||_∞ / ε) and the cost of its output is at most nε above the optimum.

32 Regularized Optimal Transport
Adding a regularization penalty
  The regularized Kantorovitch OT problem is L_C^λ(a, b) = min_{P ∈ U(a,b)} <C, P> + λ Ω(P)
What for?
  Encode prior knowledge
  Better-posed problem w.r.t. stability, because couplings are denser
  Smooth approximate distance w.r.t. the input histogram weights and the positions of the Diracs
  Better complexity, by making the problem strongly convex
  Regularization: quadratic, entropic, group Lasso, KL divergence, ...

35 Entropic Regularized Optimal Transport
Entropic regularization
  The entropic regularized OT problem (Cuturi 2013) is
  L_C^λ(a, b) = min_{P ∈ U(a,b)} <C, P> − λ H(P)   (1)
  i.e. L_C^λ(a, b) = min_{P ∈ U(a,b)} <C, P> + λ Σ_{i,j} P_{i,j} log(P_{i,j})
Convergence with λ
  The entropy is strongly convex, so the objective is λ-strongly convex.
  Proposition: the unique solution P^λ of problem (1) converges, as λ → 0, to the optimal solution with maximal entropy within the set of all optimal solutions of Kantorovitch's OT problem. In particular, L_C^λ(a, b) → L_C(a, b) as λ → 0.
  Also, P^λ → a ⊗ b = a b^t = (a_i b_j)_{i,j} as λ → ∞.

36 Computing Entropic Regularized Optimal Transport
Entropic regularized OT as matrix scaling
  Introducing dual variables and writing the Lagrangian of (1), it can be shown that the solution of (1) has the form P = diag(u) K diag(v) for (unknown) vectors u ∈ R_+^n, v ∈ R_+^m, and K defined by K_{i,j} = e^{−C_{i,j}/λ}.
  This leads to Sinkhorn's algorithm:
    Init v^(0) = 1_m
    Repeat u^(l+1) = a / (K v^(l)) and v^(l+1) = b / (K^t u^(l+1))   (elementwise divisions)
Results
  Complexity in O(n^2) per iteration, only matrix operations
  Convergence, but numerical problems and difficulties for small λ
  Complexity in O(n^2 log(n) ε^{−3}) to reach an ε-approximation
  GPU-friendly, especially when solving multiple OT problems
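
A minimal NumPy sketch of these updates (no stabilization for small λ, so it is only illustrative):

```python
import numpy as np

def sinkhorn(a, b, C, lam, n_iter=1000):
    """Sinkhorn iterations for entropic OT, following the updates above.

    Returns the coupling P = diag(u) K diag(v) and the transport cost <C, P>.
    """
    K = np.exp(-C / lam)                 # Gibbs kernel K_ij = exp(-C_ij / lambda)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                  # enforce the row marginals
        v = b / (K.T @ u)                # enforce the column marginals
    P = u[:, None] * K * v[None, :]      # diag(u) K diag(v)
    return P, np.sum(P * C)

# Sanity check on the slide-17 binary-cost example.
a = np.array([1/3, 1/3, 1/3]); b = np.array([2/3, 1/6, 1/6])
C = 1.0 - np.eye(3)
P, cost = sinkhorn(a, b, C, lam=0.01)
print(np.round(P, 3), cost)              # close to the unregularized cost 1/3 for small lambda
```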

37 Conclusion on the first part
Also in Computational Optimal Transport, G. Peyré and M. Cuturi:
  Semi-discrete OT: one measure is discrete and the other arbitrary (often with density)
  W_1 OT, i.e. c is a distance
  W_2 OT for a geodesic distance
  Approximating OT with discrete samples (Eulerian or Lagrangian)
  Variational OT, i.e. using the Wasserstein distance as a loss function
  Algorithms for computing Wasserstein barycenters
Software for OT
  Python Optimal Transport (POT) library by Rémi Flamary

38 Plan (outline recalled; next: Word Mover's Distance)

39 A new distance between text documents
Base ideas
  Word embeddings: represent every word by a vector
  Word mover's distance (WMD): the distance between two text documents A and B is the minimum cumulative distance that the words of document A need to travel to match exactly the point cloud of document B
  Kusner et al. study k-NN with the WMD for classifying documents
  Huang et al. define metric learning algorithms for the WMD
References
  From word embeddings to document distances, Kusner et al., ICML 15
  Supervised word mover's distance, Huang et al., NIPS 16

40 From word embeddings to document distances
Representation of text documents
  A vocabulary of size n
  The i-th word w_i is represented by a word vector x_i ∈ R^d
  A text document A is represented as a histogram a ∈ Σ_n defined by a_i = c_i / Σ_{i=1}^n c_i, where c_i is the count of word w_i in A
Word mover's distance between text documents
  The cost (ground distance) is chosen to be the Euclidean distance
  The word mover's distance between text documents A and B is the W_1 distance between a and b, i.e.
  WMD(A, B) = W_1(a, b) = min_{P1_n = a, P^t 1_n = b} Σ_{i,j} ||x_i − x_j||_2 P_{i,j}
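
A sketch of the WMD computation, assuming POT's `ot.emd2` and a hypothetical `embeddings` dict mapping words to vectors; restricting the histograms to the words actually present in the two documents is equivalent, since the other coordinates have zero mass.

```python
import numpy as np
import ot  # POT, assumed available
from collections import Counter
from scipy.spatial.distance import cdist

def wmd(doc_a, doc_b, embeddings):
    """Word mover's distance between two tokenized documents (lists of words)."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    words_a, words_b = list(ca), list(cb)
    a = np.array([ca[w] for w in words_a], float); a /= a.sum()   # normalized word frequencies
    b = np.array([cb[w] for w in words_b], float); b /= b.sum()
    C = cdist(np.array([embeddings[w] for w in words_a]),         # Euclidean ground distance
              np.array([embeddings[w] for w in words_b]))         # between word vectors
    return ot.emd2(a, b, C)                                       # W_1(a, b), i.e. the WMD

# Toy usage with random "embeddings", for illustration only.
rng = np.random.default_rng(0)
vocab = "obama speaks media illinois president greets press chicago".split()
emb = {w: rng.normal(size=50) for w in vocab}
print(wmd("obama speaks media illinois".split(),
          "president greets press chicago".split(), emb))
```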

41 WMD for text classification
k-NN with WMD works well 8-)
  Outperforms known methods on several datasets
  The word2vec embeddings work well on several domains
  Largest runtime: the time complexity of computing the WMD is O(q^3 log q), where q is the maximum number of unique words in A or B
  n is too large for GPU computation of multiple WMDs with entropic regularization
Prefetch and prune
  The authors introduce two lower bounds for WMD(A, B):
    Word centroid distance: WCD(A, B) = ||Xa − Xb||_2, where X is the matrix of word vectors
    Relaxed WMD: RWMD(A, B) = max{WMD_1(A, B), WMD_2(A, B)}, where WMD_1 and WMD_2 keep only one of the two marginal constraints
  Use the lower bounds to compute faster an approximation of the k-nearest neighbours of a text query.
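
The word centroid distance is cheap to evaluate; a tiny sketch with placeholder data (X is the d×n matrix of word vectors, a and b the two document histograms):

```python
import numpy as np

# Placeholder word-vector matrix and histograms, assumed built as on the previous slide.
d, n = 50, 1000
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()

wcd = np.linalg.norm(X @ a - X @ b)   # WCD(A, B) = ||Xa - Xb||_2 <= WMD(A, B)
print(wcd)
```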

42 Supervised WMD for text classification
WMD to be learned
  Squared generalized Euclidean distance: c(i, j) = ||A(x_i − x_j)||_2^2
  Histogram reweighting: a document histogram a is re-represented as ã = (a ∘ w)/(w^t a)
  Then, WMD_{A,w} = min_{P1_n = ã, P^t 1_n = b̃} Σ_{i,j} ||A(x_i − x_j)||_2^2 P_{i,j}
Learning the WMD from labeled data
  Learn A ∈ R^{r×d} and w ∈ R^n such that WMD_{A,w} reflects the labels. Method:
    Stochastic neighborhood relaxation of the LOO loss, as for NCA
    Express the gradient w.r.t. A and the gradient w.r.t. w
    Compute gradients using entropic regularization (O(q^2) instead of O(q^3)) on a subset of neighbors chosen with WCD
    Clever initialization using WCD
    Batch stochastic gradient descent

43 Conclusion on WMD for texts
Pros
  A well-defined distance between text documents based on OT
  It works well 8-) for classification with k-NN
Cons = my comments 8-)
  The choice of the ground distance is ad hoc: the Euclidean distance (why not the cosine distance?), and the squared Euclidean distance when gradients are needed
  Many tricks are required to solve supervised WMD efficiently: loss, choice of the neighbors with WCD, initialization, regularization
  Experimental results for supervised WMD are not convincing
Perspective: WMD with Gaussian embeddings; learn Gaussian embeddings with a WMD-based loss.

44 Plan (outline recalled; next: OT for Domain Adaptation)

45 Domain adaptation with regularized optimal transport
Source: Courty et al., ECML 2014, IEEE PAMI
Unsupervised domain adaptation
  Source: labeled data; target: unlabeled data; problem: classify data in the target domain.
  Base idea: transport data from the source domain into the target domain, then learn a classifier.
Figure: left, source labeled data (blue and red) and target data (green); right, source points are transported and a linear classifier H is learned.

46 Solution of Courty et al.
The OT formulation
  Ground distance: squared Euclidean distance
  Entropic regularization for efficiency
  Introducing domain adaptation into OT: the coupling should be such that a target point does not receive mass from source points with different labels
  This leads to P̂ = arg min_{P ∈ U(a,b)} <C, P> − λH(P) + ηL(P), where L(P) expresses group-label sparsity of P with an L1 norm over label groups
  An alternating optimization algorithm is used
Learning in the target domain
  Given a solution P̂, every source point x_i is transported to the barycenter x̂_i of its images
  Learn a classifier on the images x̂_i with the labels of the x_i (see the sketch below).
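
A minimal sketch of this two-step recipe, assuming POT's `ot.dist` and `ot.sinkhorn` and using plain entropic OT in place of the label-regularized coupling of Courty et al.; the arrays Xs, ys, Xt are synthetic placeholders.

```python
import numpy as np
import ot  # POT, assumed available

rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, 2)); ys = (Xs[:, 0] > 0).astype(int)   # labeled source data
Xt = rng.normal(loc=[2.0, 1.0], size=(80, 2))                      # unlabeled target data

a = np.full(100, 1 / 100)            # empirical source distribution
b = np.full(80, 1 / 80)              # empirical target distribution
C = ot.dist(Xs, Xt)                  # squared Euclidean cost matrix
P = ot.sinkhorn(a, b, C, reg=0.5)    # entropic coupling (no label regularizer here)

# Barycentric mapping: each source point goes to the weighted average of its images.
Xs_mapped = (P @ Xt) / P.sum(axis=1, keepdims=True)
# A classifier can now be trained on (Xs_mapped, ys) and applied to Xt.
```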

47 Discussion on domain adaptation with OT
Pros
  Elegant formulation of domain adaptation based on OT
  It works well 8-) for balanced problems
  Other regularizers and other algorithms in the long version
  Extended to the semi-supervised case in another paper
Cons = my comments 8-)
  Antagonism between λH(P), which promotes non-sparsity, and ηL(P), which promotes group-label sparsity
  The choice of η is not discussed, and the choice of λ is not so easy.

48 Mapping estimation for discrete optimal transport
Source: Perrot et al., NIPS 16
Motivation
  Consider an OT problem in Kantorovitch's formulation for point clouds (x_i)_{i=1}^n and (y_j)_{j=1}^m.
  The optimal coupling P allows one to define a transportation map T by T(x_i) = arg min_{y ∈ Y} Σ_{j=1}^m P_{i,j} d(y, y_j),
  i.e. T maps x_i to the barycenter of its images; it is the weighted average when d is the squared Euclidean distance on Y.
  But T is defined only at the source points x_i.
  Idea: learn a transformation T from X into Y.

49 Mapping estimation for discrete optimal transport
Formulation of Perrot et al.
  They propose the following optimization problem:
  arg min_{T ∈ H, P ∈ U(a,b)}  (1/(n d_t)) ||T(X_s) − n P X_t||_F^2  +  (λ / max(C)) <P, C>  +  (γ / (d_s d_t)) R(T)
  The problem is jointly convex if H is a convex set of transformations and R is a convex function.
  H is taken to be a set of linear transformations induced by a matrix, or non-linear transformations using kernels.
Comments and results
  Theoretical bounds are provided, but...
  Alternating optimization algorithm
  It works 8-)

50 Plan (outline recalled; next: Conclusion)

51 Summary
To remember 8-)
  OT theory defines distances between distributions that use spatial information
  For every type of distribution
  Computing OT (or a Wasserstein distance) means solving an optimization problem: complexity in O(n^3), reduced to O(n^2) with entropic regularization
  Solid mathematical foundations, but not so easy to understand
OT in the discussed papers
  Ad hoc choice of the ground distance
  Intricate optimization problems derived from OT
  Ad hoc optimization algorithms
  But many opportunities to use OT in Magnet research problems

52 Conclusion
For Magnet
  NLP: Mangoes, cosine distance, Gaussian embeddings, applications
  Domain adaptation? Distributed learning?
  For Jan et al.: histogram prediction in graphs; Wasserstein propagation for semi-supervised learning, Solomon et al., ICML 14
Some recent papers, among others at NIPS 17 and ICLR 18
  Joint Distribution Optimal Transportation for Domain Adaptation
  Near-linear time approximation algorithms for optimal transport...
  Large Scale Optimal Transport and Mapping Estimation
  Improved Training of Wasserstein GANs
  Learning Wasserstein Embeddings
  Wasserstein Auto-Encoders
  Improving GANs Using Optimal Transport

Generative Models and Optimal Transport

Generative Models and Optimal Transport Generative Models and Optimal Transport Marco Cuturi Joint work / work in progress with G. Peyré, A. Genevay (ENS), F. Bach (INRIA), G. Montavon, K-R Müller (TU Berlin) Statistics 0.1 : Density Fitting

More information

Mathematical Foundations of Data Sciences

Mathematical Foundations of Data Sciences Mathematical Foundations of Data Sciences Gabriel Peyré CNRS & DMA École Normale Supérieure gabriel.peyre@ens.fr https://mathematical-tours.github.io www.numerical-tours.com January 7, 2018 Chapter 18

More information

softened probabilistic

softened probabilistic Justin Solomon MIT Understand geometry from a softened probabilistic standpoint. Somewhere over here. Exactly here. One of these two places. Query 1 2 Which is closer, 1 or 2? Query 1 2 Which is closer,

More information

Submodular Functions Properties Algorithms Machine Learning

Submodular Functions Properties Algorithms Machine Learning Submodular Functions Properties Algorithms Machine Learning Rémi Gilleron Inria Lille - Nord Europe & LIFL & Univ Lille Jan. 12 revised Aug. 14 Rémi Gilleron (Mostrare) Submodular Functions Jan. 12 revised

More information

Optimal Transport: A Crash Course

Optimal Transport: A Crash Course Optimal Transport: A Crash Course Soheil Kolouri and Gustavo K. Rohde HRL Laboratories, University of Virginia Introduction What is Optimal Transport? The optimal transport problem seeks the most efficient

More information

Numerical Optimal Transport and Applications

Numerical Optimal Transport and Applications Numerical Optimal Transport and Applications Gabriel Peyré Joint works with: Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, Justin Solomon www.numerical-tours.com Histograms in Imaging

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Beyond the Point Cloud: From Transductive to Semi-Supervised Learning

Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of

More information

Supervised Word Mover s Distance

Supervised Word Mover s Distance Supervised Word Mover s Distance Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Accurately measuring the similarity between text documents lies at

More information

Inderjit Dhillon The University of Texas at Austin

Inderjit Dhillon The University of Texas at Austin Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Optimal Transport for Domain Adaptation

Optimal Transport for Domain Adaptation Optimal Transport for Domain Adaptation Nicolas Courty, Rémi Flamary, Devis Tuia, Alain Rakotomamonjy To cite this version: Nicolas Courty, Rémi Flamary, Devis Tuia, Alain Rakotomamonjy. Optimal Transport

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Optimal transport for machine learning

Optimal transport for machine learning Optimal transport for machine learning Rémi Flamary AG GDR ISIS, Sète, 16 Novembre 2017 2 / 37 Collaborators N. Courty A. Rakotomamonjy D. Tuia A. Habrard M. Cuturi M. Perrot C. Févotte V. Emiya V. Seguy

More information

Convex relaxation for Combinatorial Penalties

Convex relaxation for Combinatorial Penalties Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Geometric Inference for Probability distributions

Geometric Inference for Probability distributions Geometric Inference for Probability distributions F. Chazal 1 D. Cohen-Steiner 2 Q. Mérigot 2 1 Geometrica, INRIA Saclay, 2 Geometrica, INRIA Sophia-Antipolis 2009 June 1 Motivation What is the (relevant)

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

Joint distribution optimal transportation for domain adaptation

Joint distribution optimal transportation for domain adaptation Optimal Transport for Domain Adaptation (TPAMI 2016) Joint distribution optimal transportation for domain adaptation (NIPS 2017) Joint distribution optimal transportation for domain adaptation Nicolas

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation

More information

Wasserstein Training of Boltzmann Machines

Wasserstein Training of Boltzmann Machines Wasserstein Training of Boltzmann Machines Grégoire Montavon, Klaus-Rober Muller, Marco Cuturi Presenter: Shiyu Liang December 1, 2016 Coordinated Science Laboratory Department of Electrical and Computer

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Stability of boundary measures

Stability of boundary measures Stability of boundary measures F. Chazal D. Cohen-Steiner Q. Mérigot INRIA Saclay - Ile de France LIX, January 2008 Point cloud geometry Given a set of points sampled near an unknown shape, can we infer

More information

Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Optimal Transport and Wasserstein Distance

Optimal Transport and Wasserstein Distance Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Linear and Logistic Regression. Dr. Xiaowei Huang

Linear and Logistic Regression. Dr. Xiaowei Huang Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

A Closed-form Gradient for the 1D Earth Mover s Distance for Spectral Deep Learning on Biological Data

A Closed-form Gradient for the 1D Earth Mover s Distance for Spectral Deep Learning on Biological Data A Closed-form Gradient for the D Earth Mover s Distance for Spectral Deep Learning on Biological Data Manuel Martinez, Makarand Tapaswi, and Rainer Stiefelhagen Karlsruhe Institute of Technology, Karlsruhe,

More information

Unsupervised Learning

Unsupervised Learning CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality

More information

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods) Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Penalized Barycenters in the Wasserstein space

Penalized Barycenters in the Wasserstein space Penalized Barycenters in the Wasserstein space Elsa Cazelles, joint work with Jérémie Bigot & Nicolas Papadakis Université de Bordeaux & CNRS Journées IOP - Du 5 au 8 Juillet 2017 Bordeaux Elsa Cazelles

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Integer Programming ISE 418. Lecture 8. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 8. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 8 Dr. Ted Ralphs ISE 418 Lecture 8 1 Reading for This Lecture Wolsey Chapter 2 Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Duality for Mixed-Integer

More information

Semidefinite Programming Basics and Applications

Semidefinite Programming Basics and Applications Semidefinite Programming Basics and Applications Ray Pörn, principal lecturer Åbo Akademi University Novia University of Applied Sciences Content What is semidefinite programming (SDP)? How to represent

More information

Online Manifold Regularization: A New Learning Setting and Empirical Study

Online Manifold Regularization: A New Learning Setting and Empirical Study Online Manifold Regularization: A New Learning Setting and Empirical Study Andrew B. Goldberg 1, Ming Li 2, Xiaojin Zhu 1 1 Computer Sciences, University of Wisconsin Madison, USA. {goldberg,jerryzhu}@cs.wisc.edu

More information

ICS-E4030 Kernel Methods in Machine Learning

ICS-E4030 Kernel Methods in Machine Learning ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

arxiv: v1 [cs.lg] 13 Nov 2018

arxiv: v1 [cs.lg] 13 Nov 2018 SEMI-DUAL REGULARIZED OPTIMAL TRANSPORT MARCO CUTURI AND GABRIEL PEYRÉ arxiv:1811.05527v1 [cs.lg] 13 Nov 2018 Abstract. Variational problems that involve Wasserstein distances and more generally optimal

More information

Supervised Metric Learning with Generalization Guarantees

Supervised Metric Learning with Generalization Guarantees Supervised Metric Learning with Generalization Guarantees Aurélien Bellet Laboratoire Hubert Curien, Université de Saint-Etienne, Université de Lyon Reviewers: Pierre Dupont (UC Louvain) and Jose Oncina

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Learning SVM Classifiers with Indefinite Kernels

Learning SVM Classifiers with Indefinite Kernels Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

The role of dimensionality reduction in classification

The role of dimensionality reduction in classification The role of dimensionality reduction in classification Weiran Wang and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced http://eecs.ucmerced.edu

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

BBM402-Lecture 20: LP Duality

BBM402-Lecture 20: LP Duality BBM402-Lecture 20: LP Duality Lecturer: Lale Özkahya Resources for the presentation: https://courses.engr.illinois.edu/cs473/fa2016/lectures.html An easy LP? which is compact form for max cx subject to

More information

arxiv: v2 [cs.cl] 1 Jan 2019

arxiv: v2 [cs.cl] 1 Jan 2019 Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Optimal mass transport as a distance measure between images

Optimal mass transport as a distance measure between images Optimal mass transport as a distance measure between images Axel Ringh 1 1 Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden. 21 st of June 2018 INRIA, Sophia-Antipolis, France

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Polyhedral Approaches to Online Bipartite Matching

Polyhedral Approaches to Online Bipartite Matching Polyhedral Approaches to Online Bipartite Matching Alejandro Toriello joint with Alfredo Torrico, Shabbir Ahmed Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Industrial

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Logistic Regression. William Cohen

Logistic Regression. William Cohen Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting

More information

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

More information

Structured deep models: Deep learning on graphs and beyond

Structured deep models: Deep learning on graphs and beyond Structured deep models: Deep learning on graphs and beyond Hidden layer Hidden layer Input Output ReLU ReLU, 25 May 2018 CompBio Seminar, University of Cambridge In collaboration with Ethan Fetaya, Rianne

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Algorithms for Picture Analysis. Lecture 07: Metrics. Axioms of a Metric

Algorithms for Picture Analysis. Lecture 07: Metrics. Axioms of a Metric Axioms of a Metric Picture analysis always assumes that pictures are defined in coordinates, and we apply the Euclidean metric as the golden standard for distance (or derived, such as area) measurements.

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Correlation Autoencoder Hashing for Supervised Cross-Modal Search

Correlation Autoencoder Hashing for Supervised Cross-Modal Search Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia

More information

Curvature and the continuity of optimal transportation maps

Curvature and the continuity of optimal transportation maps Curvature and the continuity of optimal transportation maps Young-Heon Kim and Robert J. McCann Department of Mathematics, University of Toronto June 23, 2007 Monge-Kantorovitch Problem Mass Transportation

More information