Optimal Transport in ML
1 Optimal Transport in ML
Rémi Gilleron, Inria Lille & CRIStAL & Univ. Lille, Feb.
Main source (also some figures): Computational Optimal Transport, G. Peyré and M. Cuturi.
Rémi Gilleron (Magnet) OT Feb / 52
2 Compare documents using word embeddings
Given: word embeddings in $\mathbb{R}^d$ and a similarity on $\mathbb{R}^d$.
Problem: compare documents (phrases, sentences).
Idea: represent a document as a histogram of the frequencies of the words in the vocabulary V.
Figure: Three documents in $\mathbb{R}^2$. Each circle corresponds to a word vector; the size of the circle is proportional to the word's frequency.
Measure the cost of a transportation plan between histograms over the vocabulary, using the similarity between word vectors.
3 Unsupervised domain adaptation
Source: labeled data; target: unlabeled data; problem: classify data in the target domain.
Base idea: transport data from the source domain into the target domain, then learn a classifier.
The transportation plan between the two clouds uses a ground distance and the empirical distributions of the clouds.
Figure: Source: blue and red; target: green.
4 Histogram propagation over graphs
Example: traffic estimation.
Given: a graph representing roads.
Given: at some nodes, sensors make it possible to compute traffic histograms over a 24h period.
Problem: compute traffic histograms for every node.
Base idea: use a propagation algorithm over the graph.
The similarity should combine a similarity between histograms and a similarity on the graph, which includes spatial information.
5 Optimal Transport (OT)
What is it? A method for comparing probability distributions, with the ability to incorporate spatial information.
Pros:
- A distance between distributions based on a ground distance
- Defined for all distributions: discrete, with density, arbitrary
- Solid mathematical foundations; works well in applications
Cons:
- Computing OT means solving an optimization problem
- The mathematics is not easy
- Methods and algorithms depend on the measures (often discrete) and dimensions, on the ground cost (arbitrary, distance, squared distance, geodesic distance), and on the ground space ($\mathbb{R}$, $\mathbb{R}^d$, geodesic)
6 Research on computational OT
Close to Magnet's research problems:
- Word mover's distance
- Optimal transport for domain adaptation
- Histogram propagation over graphs
But also: signal and image processing, Wasserstein distances and divergences, efficient computation of (regularized) OT, generative models, Wasserstein GANs, among others.
7 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
8 Intuitions
The goal of OT is to define geometric tools for comparing probability distributions.
Earth mover's distance: a probability distribution is a pile of sand; move one pile of sand into another; the local cost is that of moving one grain of sand from one place to another; OT is the minimal global cost.
Mines and factories problem: mines produce resources across a country; factories consume resources across a country; there is a local cost for shipping one resource from a mine to a factory; OT is the least costly transportation plan from mines to factories.
9-11 Main Scenarios
Distributions:
- Discrete measure $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}$, where $\delta_x$ is the Dirac at $x$ and $a \in \mathbb{R}^n_+$. Then, for $f \in \mathcal{C}(\mathcal{X})$, $\int_{\mathcal{X}} f(x)\, d\alpha(x) = \sum_{i=1}^{n} a_i f(x_i)$.
- Probability measure: $\sum_{i=1}^{n} a_i = 1$, i.e. $a \in \Sigma_n$ is a histogram
- Lagrangian (point clouds): points $(x_i)_{i=1}^{n}$ with uniform weights $a_i = 1/n$
- Eulerian (histograms): a grid, with $a_i$ the probability mass at cell $i$
- Measure with density: $d\alpha(x) = \rho_\alpha(x)\, dx$. Then, for $h \in \mathcal{C}(\mathcal{X})$, $\int_{\mathcal{X}} h(x)\, d\alpha(x) = \int_{\mathcal{X}} h(x) \rho_\alpha(x)\, dx$.
- Arbitrary measure
Ground space and ground cost:
- $\mathcal{X}$ with a distance; in general $\mathcal{X} = \mathbb{R}^d$ with $d = 1$ or $d > 1$
- $c$ is a cost function from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}_+$. When $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$, $c$ is in general a distance $d$ or a squared distance $d^2$. It is a matrix $C$ in the case of discrete measures.
12 Monge Problem
Transport map; Monge problem for discrete measures:
Let $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\beta = \sum_{j=1}^{m} b_j \delta_{y_j}$ be two discrete measures.
Find a map $T$ from $\{x_1, \ldots, x_n\}$ into $\{y_1, \ldots, y_m\}$ such that, for all $j$, $b_j = \sum_{i : T(x_i) = y_j} a_i$.
$T$ defines a pushforward operator $T_\#$ between measures such that $T_\# \alpha = \beta$.
The transport map should minimize $\sum_i a_i\, c(x_i, T(x_i))$.
Examples:
- $(x_1, 1), (x_2, 2), (x_3, 3), (x_4, 1), (x_5, 1)$; $(y_1, 4), (y_2, 2), (y_3, 2)$; $c = 1$, and with $c = d$, the Euclidean distance
- $(x_1, 2), (x_2, 2), (x_3, 2)$; $(y_1, 1), (y_2, 2), (y_3, 2), (y_4, 1)$; $c = 1$
- $(x_1, 3), (x_2, 2)$; $(y_1, 4), (y_2, 1)$; $c = 1$
13 Monge problem for histograms
Optimal assignment problem:
Let $n = m$, let $a = b = \mathbf{1}_n / n$, and let $C$ be a cost matrix in $\mathbb{R}^{n \times n}_+$ giving the cost $C_{i,j}$ of moving $x_i$ to $y_j$.
Find $\sigma$ in the set $\mathrm{Perm}(n)$ of permutations of $n$ elements solving
$\min_{\sigma \in \mathrm{Perm}(n)} \frac{1}{n} \sum_{i=1}^{n} C_{i, \sigma(i)}$
Remarks:
- The naive algorithm is intractable because $\mathrm{Perm}(n)$ has $n!$ elements
- $((2,1), \frac{1}{2}), ((2,3), \frac{1}{2})$; $((1,2), \frac{1}{2}), ((3,2), \frac{1}{2})$, Euclidean distance
- $(x_1, \frac{1}{2}), (x_2, \frac{1}{2})$; $(y_1, \frac{1}{4}), (y_2, \frac{1}{4}), (y_3, \frac{1}{4}), (y_4, \frac{1}{4})$, $c = 1$
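The naive search over all $n!$ permutations is intractable at scale, but it is a useful sanity check on tiny examples. Below is a minimal pure-Python sketch (the function name `optimal_assignment` is ours, not from the slides), applied to the point clouds of the later assignment example ($x_1=(0,1)$, $x_2=(0,2)$, $x_3=(0,3)$ vs. $y_1=(1,5/2)$, $y_2=(1,3/2)$, $y_3=(2,2)$ with the Euclidean ground distance):

```python
from itertools import permutations
from math import dist  # Euclidean distance between points (Python 3.8+)

def optimal_assignment(xs, ys):
    """Brute-force Monge/assignment problem: minimize (1/n) sum_i ||x_i - y_sigma(i)||.
    Enumerates all n! permutations, so only usable for tiny n."""
    n = len(xs)
    best = min(permutations(range(n)),
               key=lambda s: sum(dist(xs[i], ys[s[i]]) for i in range(n)))
    cost = sum(dist(xs[i], ys[best[i]]) for i in range(n)) / n
    return best, cost

# The two point clouds from the slides' assignment example.
xs = [(0, 1), (0, 2), (0, 3)]
ys = [(1, 2.5), (1, 1.5), (2, 2)]
sigma, cost = optimal_assignment(xs, ys)
print(sigma)  # (1, 2, 0): x1 -> y2, x2 -> y3, x3 -> y1
```

The optimal permutation sends $x_1 \to y_2$, $x_2 \to y_3$, $x_3 \to y_1$, with average cost $(\sqrt{5}/2 + 2 + \sqrt{5}/2)/3 = (\sqrt{5}+2)/3$.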
14 Relaxation of the Monge problem
Limitations of the Monge problem:
- Feasible solutions may not exist
- Multiple solutions may exist
- The assignment problem is combinatorial
- The Monge problem for arbitrary measures is not convex
- Existence and uniqueness of the Monge map for $c$ the squared Euclidean distance and measures with density (Brenier 91)
Relax the deterministic nature of transportation:
- $(x_1, 3), (x_2, 2)$; $(y_1, 4), (y_2, 1)$; $c = 1$. Where are $x_1$ and $x_2$ transported?
- Kantorovitch's relaxation: mass splitting from a source towards several targets
- A coupling maps $x_1$ into $y_1$, then $\frac{1}{2}$ of the mass at $x_2$ into $y_1$ and $\frac{1}{2}$ into $y_2$
15 Kantorovitch's OT problem for discrete measures
Formulation:
Let $a \in \mathbb{R}^n_+$ and $b \in \mathbb{R}^m_+$ be two mass vectors for $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$, and let $C$ be the cost matrix.
A coupling is a matrix $P \in \mathbb{R}^{n \times m}_+$ where $P_{i,j}$ is the amount of mass flowing from $x_i$ to $y_j$.
The set of admissible couplings is $U(a, b) = \{P : P \mathbf{1}_m = a,\ P^t \mathbf{1}_n = b\}$. It is bounded and is a convex polytope.
Kantorovitch's OT problem is $L_C(a, b) = \min_{P \in U(a,b)} \langle C, P \rangle = \min_{P \in U(a,b)} \sum_{i,j} C_{i,j} P_{i,j}$
Mines and factories: find an optimal transportation plan between mines and factories.
16 Kantorovitch's OT problem
Formulation for arbitrary measures:
Let $\alpha$ and $\beta$ be two measures; a coupling $\pi$ is a joint distribution over $\mathcal{X} \times \mathcal{Y}$.
The set of admissible couplings is $U(\alpha, \beta) = \{\pi : P_{\mathcal{X}\#} \pi = \alpha \text{ and } P_{\mathcal{Y}\#} \pi = \beta\}$, where $P_{\mathcal{X}}$ and $P_{\mathcal{Y}}$ are the push-forward projections.
Kantorovitch's OT problem for a cost function $c$ is $L_c(\alpha, \beta) = \min_{\pi \in U(\alpha,\beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y)$
For discrete measures: let $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\beta = \sum_{j=1}^{m} b_j \delta_{y_j}$ be two discrete measures; then $L_c(\alpha, \beta) = L_C(a, b)$, where $C$ is the cost matrix defined from $c$ on the supports of $\alpha$ and $\beta$.
17 Examples of Kantorovitch's OT problem
$\mathcal{X} = \mathbb{R}$ and the Euclidean distance:
- $(x_1, 3), (x_2, 2)$; $(y_1, 4), (y_2, 1)$ with $x_1 < x_2$, $y_1 < y_2$
- $(x_1, 3), (x_2, 2)$; $(y_1, 4), (y_2, 1)$ with $x_1 < x_2$, $y_1 > y_2$
Binary cost matrix:
$C = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$, $a = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$, $b = \begin{pmatrix} 2/3 \\ 1/6 \\ 1/6 \end{pmatrix}$, $P = ?$
Assignment problem on $\mathcal{X} = \mathbb{R}^2$ with the Euclidean distance:
$x_1 = (0, 1)$, $x_2 = (0, 2)$, $x_3 = (0, 3)$, $y_1 = (1, \frac{5}{2})$, $y_2 = (1, \frac{3}{2})$, $y_3 = (2, 2)$.
$C = \begin{pmatrix} \sqrt{13}/2 & \sqrt{5}/2 & \sqrt{5} \\ \sqrt{5}/2 & \sqrt{5}/2 & 2 \\ \sqrt{5}/2 & \sqrt{13}/2 & \sqrt{5} \end{pmatrix}$, $a = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$, $b = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$, $P = ?$
18 Examples of Kantorovitch's OT problem
$\mathcal{X} = \mathbb{R}$ and the Euclidean distance:
- $(x_1, 3), (x_2, 2)$; $(y_1, 4), (y_2, 1)$; $x_1 < x_2$, $y_1 < y_2$: $x_1 \to y_1$; $\frac{1}{2} x_2 \to y_1$; $\frac{1}{2} x_2 \to y_2$
- $(x_1, 3), (x_2, 2)$; $(y_1, 4), (y_2, 1)$; $x_1 < x_2$, $y_1 > y_2$: $\frac{1}{3} x_1 \to y_2$; $\frac{2}{3} x_1 \to y_1$; $x_2 \to y_1$
Binary cost matrix:
$C = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$, $a = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$, $b = \begin{pmatrix} 2/3 \\ 1/6 \\ 1/6 \end{pmatrix}$, $P = \begin{pmatrix} 1/3 & 0 & 0 \\ 1/6 & 1/6 & 0 \\ 1/6 & 0 & 1/6 \end{pmatrix}$
Assignment problem on $\mathcal{X} = \mathbb{R}^2$ with the Euclidean distance:
$C = \begin{pmatrix} \sqrt{13}/2 & \sqrt{5}/2 & \sqrt{5} \\ \sqrt{5}/2 & \sqrt{5}/2 & 2 \\ \sqrt{5}/2 & \sqrt{13}/2 & \sqrt{5} \end{pmatrix}$, $a = b = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$, $P = \begin{pmatrix} 0 & 1/3 & 0 \\ 0 & 0 & 1/3 \\ 1/3 & 0 & 0 \end{pmatrix}$
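The marginal constraints defining $U(a,b)$ are easy to check numerically. Here is a small pure-Python sketch (helper names `is_coupling` and `transport_cost` are ours) verifying, on the binary-cost example, that the proposed coupling is admissible and that its cost $\langle C, P \rangle$ equals $\frac{1}{2}\|a-b\|_1 = 1/3$:

```python
def is_coupling(P, a, b, tol=1e-9):
    """Check the admissibility constraints P 1_m = a and P^t 1_n = b."""
    rows_ok = all(abs(sum(row) - ai) < tol for row, ai in zip(P, a))
    cols_ok = all(abs(sum(P[i][j] for i in range(len(P))) - bj) < tol
                  for j, bj in enumerate(b))
    return rows_ok and cols_ok

def transport_cost(P, C):
    """<C, P> = sum_ij C_ij P_ij."""
    return sum(C[i][j] * P[i][j]
               for i in range(len(P)) for j in range(len(P[0])))

# Binary-cost example: C = 1_{3x3} - I_3.
a = [1/3, 1/3, 1/3]
b = [2/3, 1/6, 1/6]
P = [[1/3, 0, 0], [1/6, 1/6, 0], [1/6, 0, 1/6]]
C = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(is_coupling(P, a, b))   # True
print(transport_cost(P, C))   # 1/3, i.e. (1/2) * ||a - b||_1
```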
19 Kantorovitch relaxation is tight for assignment problems
Permutation matrices are couplings:
Let $n = m$, let $a = b = \mathbf{1}_n / n$, and let $C$ be a cost matrix in $\mathbb{R}^{n \times n}_+$.
Kantorovitch's problem is $L_C(\mathbf{1}_n/n, \mathbf{1}_n/n) = \min_{P \in U(\mathbf{1}_n/n, \mathbf{1}_n/n)} \langle C, P \rangle$
For $\sigma \in \mathrm{Perm}(n)$, the (rescaled) permutation matrix $P_\sigma = \frac{1}{n} (\mathbb{1}_{\sigma(i) = j})_{i,j}$ is in $U(\mathbf{1}_n/n, \mathbf{1}_n/n)$.
Kantorovitch for matching:
Proposition: there exists an optimal solution of Kantorovitch's problem which is a permutation matrix, associated with an optimal permutation for the assignment problem (an optimal transport map for the Monge problem between histograms).
Proof: the extremal points of $U(\mathbf{1}_n/n, \mathbf{1}_n/n)$ are the permutation matrices (Birkhoff's theorem), and the minimum of a linear objective is reached at an extremal point of the polytope (Bertsimas and Tsitsiklis).
20 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
21 Definition of the Wasserstein Distance
OT defines a distance between measures when $C$ satisfies some properties.
$p$-Wasserstein distance between measures: let $\mathcal{X} = \mathcal{Y}$, let $d$ be a distance on $\mathcal{X}$, let $p \geq 1$, and let $c = d^p$; then $W_p$ is a distance on $\mathcal{M}^1_+(\mathcal{X})$, where $W_p(\alpha, \beta) = L_{d^p}(\alpha, \beta)^{1/p}$
$p$-Wasserstein distance between histograms: let $n = m$, let $D$ be a distance matrix, let $p \geq 1$, and let $C = D^p$ be the matrix of entries $D_{i,j}^p$; then $\mathrm{W}_p$ is a distance on $\Sigma_n$, where $\mathrm{W}_p(a, b) = L_{D^p}(a, b)^{1/p}$
First remarks:
- $\mathrm{W}_p$ (resp. $W_p$) depends on $D$ (resp. $d$)
- $\mathrm{W}_1$ (resp. $W_1$) is the optimal transport cost for $C = D$ (resp. $c = d$)
22 p-wasserstein distance is a distance Proof for histograms Note that C = D p is symmetric and has a null diagonal. Then, W p (a, b) = 0 if and only if a = b easy W p (a, b) = W p (b, a) easy W p (a, c) W p (a, b) + W p (b, c) Let S = Pdiag(1/ b)q where P (resp. Q) is an optimal coupling between a and b (resp. between b and c), and b is b where null values are set to 1 Wp (a, c) (< S, D p >) 1/p because S U(a, c) Then use the triangular inequality for D p and the Minkowski inegality to show that (< S, D p >) 1/p W p (a, b) + W p (b, c). Rémi Gilleron (Magnet) OT Feb / 52
23 p-wasserstein distance properties Geometric intuition W p is a distance while many others are divergences as, for instance, the KL-divergence W p allows to compare singular distributions, for instance discrete ones. While classical distances or divergences do not allow to compare discrete distributions. W p allows to quantify spatial shift between the supports Barycenters of distributions can be defined with ᾱ = arg min α i λ iwp p (α i, α). W p (δ x, δ y ) = d(x, y) and W p (δ x, δ y ) 0 if x y. This allows to define a more general notion of weak convergence for distributions. Rémi Gilleron (Magnet) OT Feb / 52
24 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
25 Special case: binary cost matrix
Binary cost matrix: let $a$ and $b$ be two mass vectors in $\mathbb{R}^n_+$ and let the cost matrix be $C = \mathbf{1}_{n \times n} - I_n$; then $L_C(a, b) = \frac{1}{2} \|a - b\|_1$
Kronecker cost function and total variation: let $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\beta = \sum_{j=1}^{m} b_j \delta_{x_j}$ be two discrete measures, and let $c$ be the Kronecker cost function defined by $c(x, y) = 0$ if $x = y$ and $c(x, y) = 1$ otherwise. Then $L_c(\alpha, \beta) = \mathrm{TVD}(\alpha, \beta) = \frac{1}{2} \|\alpha - \beta\|_1$
$\mathrm{TVD}(\alpha, \beta) = \sup_{A} |\alpha(A) - \beta(A)|$ is the total variation distance between $\alpha$ and $\beta$.
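In this special case no optimization is needed: the OT cost is just half the $\ell_1$ distance between the histograms. A one-line sketch (the name `tvd` is ours), reusing the earlier example where the optimal coupling had cost $1/3$:

```python
def tvd(a, b):
    """Total variation distance between two histograms: (1/2) ||a - b||_1,
    which equals the OT cost L_C(a, b) for the binary cost C = 1 - Id."""
    return 0.5 * sum(abs(ai - bi) for ai, bi in zip(a, b))

print(tvd([1/3, 1/3, 1/3], [2/3, 1/6, 1/6]))  # 1/3, no LP solve required
```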
26 Special case: dimension 1
Discrete case: let $\mathcal{X} = \mathbb{R}$, let $\alpha = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ and $\beta = \frac{1}{n} \sum_{j=1}^{n} \delta_{y_j}$, and suppose that $x_1 < x_2 < \ldots < x_n$ and $y_1 < y_2 < \ldots < y_n$. Then $W_p$ is the $L^p$-norm between the two vectors of ordered values of $\alpha$ and $\beta$:
$W_p^p(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|^p$
- $W_p(\alpha, \beta)$ changes as soon as the order changes
- This extends to the case $n \neq m$ with the condition: if $x_i < x_{i'}$, $P_{i,j} \neq 0$ and $P_{i',j'} \neq 0$, then necessarily $y_j \leq y_{j'}$
Arbitrary measures: $W_p(\alpha, \beta) = \|C_\alpha^{-1} - C_\beta^{-1}\|_{L^p((0,1])}$, where $C_\alpha^{-1}$ denotes the generalized inverse (quantile function) of the cumulative distribution function of $\alpha$.
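On the line the formula above reduces to sorting: pair the $i$-th smallest sample of $\alpha$ with the $i$-th smallest sample of $\beta$. A minimal sketch (function name ours; the $1/n$ normalization matches uniform weights $a_i = 1/n$):

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between two uniform empirical measures on the real line:
    sort both samples and pair order statistics.
    Returns ((1/n) * sum_i |x_(i) - y_(i)|^p)^(1/p)."""
    assert len(xs) == len(ys)
    n = len(xs)
    return (sum(abs(x - y) ** p
                for x, y in zip(sorted(xs), sorted(ys))) / n) ** (1 / p)

print(wasserstein_1d([0.0, 1.0, 3.0], [1.0, 2.0, 0.5], p=2))
```

Note that no coupling matrix is ever built: sorting gives the optimal monotone pairing directly, in $O(n \log n)$ instead of solving a linear program.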
27 Special case: Gaussian distributions
Two Gaussians in $\mathbb{R}$ w.r.t. the Euclidean distance: let $\alpha = \mathcal{N}(m_1, \sigma_1)$ and $\beta = \mathcal{N}(m_2, \sigma_2)$; then the optimal transport map is $T(x) = m_2 + (x - m_1) \frac{\sigma_2}{\sigma_1}$ and
$W_2^2(\alpha, \beta) = (m_2 - m_1)^2 + (\sigma_2 - \sigma_1)^2$
Two Gaussians in $\mathbb{R}^d$ w.r.t. the Euclidean distance: let $\alpha = \mathcal{N}(m_\alpha, \Sigma_\alpha)$ and $\beta = \mathcal{N}(m_\beta, \Sigma_\beta)$; then the optimal map is $T(x) = m_\beta + A(x - m_\alpha)$, where
$A = \Sigma_\alpha^{-1/2} (\Sigma_\alpha^{1/2} \Sigma_\beta \Sigma_\alpha^{1/2})^{1/2} \Sigma_\alpha^{-1/2} = A^t$
$W_2^2(\alpha, \beta) = \|m_\alpha - m_\beta\|^2 + \mathrm{tr}\big(\Sigma_\alpha + \Sigma_\beta - 2 (\Sigma_\alpha^{1/2} \Sigma_\beta \Sigma_\alpha^{1/2})^{1/2}\big)$
$W_2^2(\alpha, \beta) = \|m_\alpha - m_\beta\|^2 + \|\sqrt{r} - \sqrt{s}\|^2$ if $\Sigma_\alpha = \mathrm{diag}(r)$ and $\Sigma_\beta = \mathrm{diag}(s)$
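The one-dimensional closed forms are short enough to code directly (function names ours; `s1`, `s2` are the standard deviations, not variances):

```python
from math import sqrt

def w2_gaussians_1d(m1, s1, m2, s2):
    """Closed-form W_2 between N(m1, s1^2) and N(m2, s2^2) on the line:
    W_2^2 = (m2 - m1)^2 + (s2 - s1)^2."""
    return sqrt((m2 - m1) ** 2 + (s2 - s1) ** 2)

def monge_map_1d(x, m1, s1, m2, s2):
    """Optimal (Monge) map between the two Gaussians:
    T(x) = m2 + (sigma2 / sigma1) * (x - m1)."""
    return m2 + (x - m1) * s2 / s1

print(w2_gaussians_1d(0.0, 1.0, 3.0, 5.0))  # sqrt(9 + 16) = 5.0
```

As a sanity check, the map sends the mean $m_1$ to the mean $m_2$.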
28 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
29 Reminder
Main cases for discrete measures:
- Points in $\mathbb{R}^d$ or cells: $(x_i)_{i=1}^{n}$, $(y_j)_{j=1}^{m}$
- Discrete measures $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\beta = \sum_{j=1}^{m} b_j \delta_{y_j}$, with $a \in \mathbb{R}^n_+$, $b \in \mathbb{R}^m_+$, $a_i$ the mass at $x_i$ and $b_j$ the mass at $y_j$
- $a \in \Sigma_n$ (histograms: the masses sum to 1); $a_i = 1/n$; the case $n = m$
- $c$ is a cost; $c$ is a distance; $c$ is the Euclidean distance; $c$ is the squared Euclidean distance, ...
- $C$ is the cost matrix with entries $C_{i,j} = c(x_i, y_j)$
OT and Wasserstein distance:
$L_C(a, b) = \min_{P \in U(a,b)} \langle C, P \rangle = \min_{P \mathbf{1}_m = a,\ P^t \mathbf{1}_n = b} \sum_{i,j} C_{i,j} P_{i,j}$
Let $n = m$, let $d$ be a distance, let $p \geq 1$, and let $C = d^p$ be the matrix of entries $d_{i,j}^p$; then $\mathrm{W}_p$ is a distance on $\Sigma_n$, where $\mathrm{W}_p(a, b) = L_C(a, b)^{1/p}$. Note that $W_p(\delta_x, \delta_y) = d(x, y)$.
30 OT is a linear program
Kantorovitch linear program: recall that Kantorovitch's OT problem is
$L_C(a, b) = \min_{P \in U(a,b)} \langle C, P \rangle = \min_{P \mathbf{1}_m = a,\ P^t \mathbf{1}_n = b} \sum_{i,j} C_{i,j} P_{i,j}$
It can be formulated as
$L_C(a, b) = \min \Big\{ c^t p \ :\ p \in \mathbb{R}^{nm}_+,\ A p = \begin{bmatrix} a \\ b \end{bmatrix} \Big\}$
where the $n \times m$ matrices $P$ and $C$ have been replaced by $nm$-dimensional vectors $p$ and $c$, and the admissible couplings are defined with a well-chosen constraint matrix $A$.
31 Simplex network algorithm for OT
Algorithm complexity in $O(n^3)$: because one can restrict the search to extremal points of the polytope $U(a, b)$, and because the structure of the matrices $P$ can be expressed with bipartite graphs, one can use a network flow solver.
For matching problems ($n = m$, $a_i = b_j = \frac{1}{n}$), the auction algorithm runs in $O(n^3 \|C\|_\infty / \epsilon)$ and the cost of the output is $n\epsilon$-suboptimal.
32 Regularized Optimal Transport
Adding a regularization penalty: the regularized Kantorovitch OT problem is
$L_C^{\Omega}(a, b) = \min_{P \in U(a,b)} \langle C, P \rangle + \lambda \Omega(P)$
What for?
- Encode prior knowledge
- A better-posed problem w.r.t. stability, because the couplings are denser
- A smooth approximate distance w.r.t. the input histogram weights and the positions of the Diracs
- Better complexity, by making the problem strongly convex
Regularization: quadratic, entropic, group lasso, KL divergence, ...
33-35 Entropic Regularized Optimal Transport
Entropic regularization: the entropic regularized OT problem (Cuturi 2013) is
$L_C^\lambda(a, b) = \min_{P \in U(a,b)} \langle C, P \rangle - \lambda H(P) \qquad (1)$
i.e. $L_C^\lambda(a, b) = \min_{P \in U(a,b)} \langle C, P \rangle + \lambda \sum_{i,j} P_{i,j} \log(P_{i,j})$
Convergence with $\lambda$: the negative entropy is strongly convex, so the objective is $\lambda$-strongly convex.
Proposition: the unique solution $P_\lambda$ of problem (1) converges, as $\lambda \to 0$, to the optimal solution with maximal entropy within the set of all optimal solutions of Kantorovitch's OT problem. In particular, $L_C^\lambda(a, b) \xrightarrow{\lambda \to 0} L_C(a, b)$. Also, $P_\lambda \xrightarrow{\lambda \to \infty} a b^t = (a_i b_j)_{i,j}$.
36 Computing Entropic Regularized Optimal Transport
Entropic regularized OT as matrix scaling: introducing dual variables and writing the Lagrangian of (1), one can show that the solution of (1) has the form $P = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ for (unknown) vectors $u \in \mathbb{R}^n_+$, $v \in \mathbb{R}^m_+$, with $K$ defined by $K_{i,j} = e^{-C_{i,j}/\lambda}$. This leads to Sinkhorn's algorithm:
- Init $v^{(0)} = \mathbf{1}_m$
- Repeat $u^{(l+1)} = \frac{a}{K v^{(l)}}$ and $v^{(l+1)} = \frac{b}{K^t u^{(l+1)}}$ (entrywise divisions)
Results: complexity in $O(n^2)$ per iteration
- Only matrix operations
- Convergence, but numerical problems and difficulties for small $\lambda$
- Complexity in $O(n^2 \log(n)\, \epsilon^{-3})$ to reach an $\epsilon$-approximation
- GPU version when solving multiple OT problems
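The two alternating updates above are all there is to the algorithm. A minimal pure-Python sketch (the function name and the stopping rule, a fixed iteration count, are ours; a practical implementation would monitor marginal violations and work in the log domain for small $\lambda$):

```python
from math import exp

def sinkhorn(a, b, C, lam, iters=2000):
    """Sinkhorn iterations for entropic regularized OT.
    lam is the regularization strength; K_ij = exp(-C_ij / lam).
    Returns the coupling P = diag(u) K diag(v)."""
    n, m = len(a), len(b)
    K = [[exp(-C[i][j] / lam) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # u <- a / (K v), then v <- b / (K^t u), entrywise
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Small example: the coupling's marginals converge to (a, b),
# and for small lam the coupling approaches the unregularized optimum.
a = [0.5, 0.5]
b = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(a, b, C, lam=0.1)
print([sum(row) for row in P])  # ~ a
```

Each iteration costs two matrix-vector products, hence the $O(n^2)$ per-iteration complexity stated above.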
37 Conclusion on the first part
Also in Computational Optimal Transport, G. Peyré and M. Cuturi:
- Semi-discrete OT: one measure is discrete and the other is arbitrary (often with density)
- $W_1$ OT, i.e. $c$ is a distance
- $W_2$ OT for a geodesic distance
- Approximating OT with discrete samples (Eulerian or Lagrangian)
- Variational OT, i.e. using the Wasserstein distance as a loss function
- Algorithms for computing Wasserstein barycenters
Software for OT: the Python Optimal Transport (POT) library by Rémi Flamary
38 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
39 A new distance between text documents
Base ideas:
- Word embeddings: represent every word by a vector
- Word mover's distance (WMD): the distance between two text documents A and B is the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B
- Kusner et al. study k-NN with the WMD for classifying documents
- Huang et al. define metric learning algorithms for the WMD
References:
- From word embeddings to document distances, Kusner et al., ICML 15
- Supervised word mover's distance, Huang et al., NIPS 16
40 From word embeddings to document distances
Representation of text documents: a vocabulary of size $n$; the $i$-th word $w_i$ is represented by a word vector $x_i \in \mathbb{R}^d$; a text document A is represented as a histogram $a \in \Sigma_n$ defined by $a_i = \frac{c_i}{\sum_{i'=1}^{n} c_{i'}}$, where $c_i$ is the count of word $w_i$ in A.
Word mover's distance between text documents: the cost, or ground distance, is chosen to be the Euclidean distance. The word mover's distance between documents A and B is the $W_1$ distance between $a$ and $b$, i.e.
$\mathrm{WMD}(A, B) = W_1(a, b) = \min_{P \mathbf{1}_n = a,\ P^t \mathbf{1}_n = b} \sum_{i,j} \|x_i - x_j\|_2\, P_{i,j}$
41 WMD for text classification
k-NN with WMD works well 8-)
- Outperforms known methods on several datasets
- The word2vec embeddings work well across several domains
- But the largest runtime: the time complexity of computing the WMD is $O(q^3 \log q)$, where $q$ is the maximum number of unique words in A or B
- $n$ is too large for GPU computation of multiple WMDs with entropic regularization
Prefetch and prune: the authors introduce two lower bounds on $\mathrm{WMD}(A, B)$:
- Word centroid distance: $\mathrm{WCD}(A, B) = \|X a - X b\|_2$
- Relaxed WMD: $\mathrm{RWMD} = \max\{\mathrm{WMD}_1(A, B), \mathrm{WMD}_2(A, B)\}$
They use the lower bounds to speed up an approximate k-nearest-neighbour search for a text query.
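The word centroid distance is cheap precisely because it collapses each document to a single point, the embedding-weighted centroid $Xa$, before measuring distance. A small pure-Python sketch (function name ours; a realistic setup would use real word vectors instead of this hypothetical toy matrix):

```python
from math import sqrt

def wcd(X, a, b):
    """Word centroid distance, a lower bound on WMD:
    WCD(A, B) = ||X a - X b||_2, where the columns of the d x n matrix X
    are the word vectors and a, b are the two documents' histograms."""
    d, n = len(X), len(a)
    diff = [sum(X[k][i] * (a[i] - b[i]) for i in range(n)) for k in range(d)]
    return sqrt(sum(t * t for t in diff))

# Hypothetical toy embeddings: d = 2, vocabulary of 3 words.
X = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0]]
print(wcd(X, [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # 0.5
```

Computing WCD is one matrix-vector product per document, so it can prune most candidates before any expensive WMD evaluation.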
42 Supervised WMD for text classification
The WMD to be learned:
- Squared generalized Euclidean distance: $c(i, j) = \|A(x_i - x_j)\|_2^2$
- Histogram reweighting: document A is represented as $\tilde{a} = (a \odot w) / (w^t a)$
- Then $\mathrm{WMD}_{A,w} = \min_{P \mathbf{1}_n = \tilde{a},\ P^t \mathbf{1}_n = \tilde{b}} \sum_{i,j} \|A(x_i - x_j)\|_2^2\, P_{i,j}$
Learning the WMD from labeled data: learn $A \in \mathbb{R}^{r \times d}$ and $w \in \mathbb{R}^n$ such that $\mathrm{WMD}_{A,w}$ reflects the labels. Method:
- Stochastic neighborhood relaxation of the leave-one-out loss, as in NCA
- Express the gradient w.r.t. $A$ and the gradient w.r.t. $w$
- Compute gradients using entropic regularization ($O(q^2)$ instead of $O(q^3)$) on a subset of neighbors chosen with WCD
- Clever initialization using WCD
- Batch stochastic gradient descent
43 Conclusion on WMD for texts
Pros:
- A well-defined distance between text documents based on OT
- It works well 8-) for classification with k-NN
Cons = my comments 8-)
- The choice of the ground distance is an ad hoc Euclidean distance. Why not the cosine distance?
- The squared Euclidean distance is used when gradient computation is needed
- Many tricks are needed to solve supervised WMD efficiently: the loss, the choice of neighbors with WCD, the initialization, the regularization
- The experimental results for supervised WMD are not convincing
Perspective: WMD with Gaussian embeddings; learn Gaussian embeddings with a WMD-based loss.
44 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
45 Domain adaptation with regularized optimal transport
Source: Courty et al., ECML 2014, IEEE PAMI
Unsupervised domain adaptation: source: labeled data; target: unlabeled data; problem: classify data in the target domain. Base idea: transport data from the source domain into the target domain, then learn a classifier.
Figure: Left: source labeled data (blue and red) and target data (green); right: source points are transported and a linear classifier H is learned.
46 Solution of Courty et al.
The OT formulation:
- Ground distance: the squared Euclidean distance
- Entropic regularization for efficiency
- Introducing domain adaptation into OT: the coupling should be such that a target point does not receive mass from source points with different labels
- This leads to $P^* = \arg\min_{P \in U(a,b)} \langle C, P \rangle - \lambda H(P) + \eta L(P)$, where $L(P)$ expresses group-label sparsity of $P$ with an $\ell_1$-norm over groups
- An alternating optimization algorithm is used
Learning in the target domain: given a solution $\hat{P}$, every source point $x_i$ is transported to the barycenter $\hat{x}_i$ of its images; then learn a classifier on the images $\hat{x}_i$ with the labels of the $x_i$.
47 Discussion on domain adaptation with OT
Pros:
- An elegant formulation of domain adaptation based on OT
- It works well 8-) for balanced problems
- Other regularizers and other algorithms in the long version
- Extended to the semi-supervised case in another paper
Cons = my comments 8-)
- Antagonism between $-\lambda H(P)$, which promotes non-sparsity, and $\eta L(P)$, which promotes group-label sparsity
- The choice of $\eta$ is not discussed, and the choice of $\lambda$ is not so easy either
48 Mapping estimation for discrete optimal transport
Source: Perrot et al., NIPS 16
Motivations: consider an OT problem in Kantorovitch's formulation for point clouds $(x_i)_{i=1}^{n}$ and $(y_j)_{j=1}^{m}$.
- The optimal coupling $P$ defines a transportation map $T$ by $T(x_i) = \arg\min_{y \in \mathcal{Y}} \sum_{j=1}^{m} P_{i,j}\, d(y, y_j)$
- I.e. $T$ maps $x_i$ to the barycenter of its images; it is the weighted average $\frac{1}{a_i} \sum_j P_{i,j} y_j$ when $d$ is the squared Euclidean distance on $\mathcal{Y}$
- But $T$ is defined only at the source points $x_i$
- Idea: learn a transformation $T$ from $\mathcal{X}$ into $\mathcal{Y}$
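The barycentric projection above is a one-liner once a coupling is known. A pure-Python sketch for the squared-Euclidean case (function name and the toy coupling are ours):

```python
def barycentric_map(P, Y):
    """Barycentric projection of each source point: with the squared
    Euclidean ground cost, T(x_i) is the P-weighted average of the targets,
    T(x_i) = (1 / a_i) * sum_j P_ij y_j, where a_i = sum_j P_ij."""
    mapped = []
    for row in P:
        a_i = sum(row)  # mass of source point i
        mapped.append(tuple(sum(p * y[k] for p, y in zip(row, Y)) / a_i
                            for k in range(len(Y[0]))))
    return mapped

# Toy coupling: source point 0 splits its mass evenly between both targets,
# source point 1 sends everything to the second target.
Y = [(0.0, 0.0), (2.0, 2.0)]
P = [[0.25, 0.25],
     [0.0, 0.5]]
print(barycentric_map(P, Y))  # [(1.0, 1.0), (2.0, 2.0)]
```

This makes the limitation on the slide concrete: the output is a list indexed by the source points, so nothing is defined for a new point $x$ outside the training cloud, which is exactly what motivates learning a parametric $T$.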
49 Mapping estimation for discrete optimal transport
Formulation of Perrot et al.: they propose the optimization problem
$\arg\min_{T \in \mathcal{H},\, P \in U(a,b)} \frac{1}{n d_t} \|T(X_s) - n P X_t\|_F^2 + \frac{\lambda}{\max(C)} \langle P, C \rangle + \frac{\gamma}{d_s d_t} R(T)$
- The problem is jointly convex if $\mathcal{H}$ is a convex set of transformations and $R$ is a convex function
- $\mathcal{H}$ is taken to be a set of linear transformations induced by a matrix, or nonlinear ones using kernels
Comments and results:
- Theoretical bounds are provided, but...
- An alternating optimization algorithm
- It works 8-)
50 Plan
1 Optimal Transport (OT): Monge Problem and Kantorovitch Problem; Wasserstein Distance; Special Cases
2 Algorithms for OT
3 Word Mover's Distance
4 OT for Domain Adaptation
5 Conclusion
51 Summary
Key takeaways 8-)
- OT theory defines distances between distributions using spatial information, for every type of distribution
- Computing OT (or the Wasserstein distance) means solving an optimization problem: complexity in $O(n^3)$, reduced to $O(n^2)$ per iteration with entropic regularization
- Mathematical foundations, but not so easy to understand
OT in the discussed papers:
- Ad hoc choices of the ground distance
- Intricate optimization problems derived from OT
- Ad hoc optimization algorithms
But there are many opportunities to use OT in Magnet research problems.
52 Conclusion
For Magnet:
- NLP: Mangoes, cosine distance, Gaussian embeddings, applications
- Domain adaptation? Distributed learning?
- For Jan et al.: histogram prediction in graphs; Wasserstein propagation for semi-supervised learning, Solomon et al., ICML 14
Some recent papers, among others at NIPS 17 and ICLR 18:
- Joint Distribution Optimal Transportation for Domain Adaptation
- Near-linear time approximation algorithms for optimal transport...
- Large Scale Optimal Transport and Mapping Estimation
- Improved Training of Wasserstein GANs
- Learning Wasserstein Embeddings
- Wasserstein Auto-Encoders
- Improving GANs Using Optimal Transport
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationActive and Semi-supervised Kernel Classification
Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationStability of boundary measures
Stability of boundary measures F. Chazal D. Cohen-Steiner Q. Mérigot INRIA Saclay - Ile de France LIX, January 2008 Point cloud geometry Given a set of points sampled near an unknown shape, can we infer
More informationFantope Regularization in Metric Learning
Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationSUPPORT VECTOR MACHINE
SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition
More informationOptimal Transport and Wasserstein Distance
Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review
More informationConvex Optimization in Classification Problems
New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1
More informationClustering. CSL465/603 - Fall 2016 Narayanan C Krishnan
Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification
More informationLinear and Logistic Regression. Dr. Xiaowei Huang
Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics
More informationSupport Vector Machines: Maximum Margin Classifiers
Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind
More informationA Randomized Approach for Crowdsourcing in the Presence of Multiple Views
A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion
More informationCertifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering
Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More informationNetwork Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)
Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationA Closed-form Gradient for the 1D Earth Mover s Distance for Spectral Deep Learning on Biological Data
A Closed-form Gradient for the D Earth Mover s Distance for Spectral Deep Learning on Biological Data Manuel Martinez, Makarand Tapaswi, and Rainer Stiefelhagen Karlsruhe Institute of Technology, Karlsruhe,
More informationUnsupervised Learning
CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality
More informationMachine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)
Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationPenalized Barycenters in the Wasserstein space
Penalized Barycenters in the Wasserstein space Elsa Cazelles, joint work with Jérémie Bigot & Nicolas Papadakis Université de Bordeaux & CNRS Journées IOP - Du 5 au 8 Juillet 2017 Bordeaux Elsa Cazelles
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationInteger Programming ISE 418. Lecture 8. Dr. Ted Ralphs
Integer Programming ISE 418 Lecture 8 Dr. Ted Ralphs ISE 418 Lecture 8 1 Reading for This Lecture Wolsey Chapter 2 Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Duality for Mixed-Integer
More informationSemidefinite Programming Basics and Applications
Semidefinite Programming Basics and Applications Ray Pörn, principal lecturer Åbo Akademi University Novia University of Applied Sciences Content What is semidefinite programming (SDP)? How to represent
More informationOnline Manifold Regularization: A New Learning Setting and Empirical Study
Online Manifold Regularization: A New Learning Setting and Empirical Study Andrew B. Goldberg 1, Ming Li 2, Xiaojin Zhu 1 1 Computer Sciences, University of Wisconsin Madison, USA. {goldberg,jerryzhu}@cs.wisc.edu
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More informationarxiv: v1 [cs.lg] 13 Nov 2018
SEMI-DUAL REGULARIZED OPTIMAL TRANSPORT MARCO CUTURI AND GABRIEL PEYRÉ arxiv:1811.05527v1 [cs.lg] 13 Nov 2018 Abstract. Variational problems that involve Wasserstein distances and more generally optimal
More informationSupervised Metric Learning with Generalization Guarantees
Supervised Metric Learning with Generalization Guarantees Aurélien Bellet Laboratoire Hubert Curien, Université de Saint-Etienne, Université de Lyon Reviewers: Pierre Dupont (UC Louvain) and Jose Oncina
More informationMLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT
MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationLearning SVM Classifiers with Indefinite Kernels
Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationThe role of dimensionality reduction in classification
The role of dimensionality reduction in classification Weiran Wang and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced http://eecs.ucmerced.edu
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationBBM402-Lecture 20: LP Duality
BBM402-Lecture 20: LP Duality Lecturer: Lale Özkahya Resources for the presentation: https://courses.engr.illinois.edu/cs473/fa2016/lectures.html An easy LP? which is compact form for max cx subject to
More informationarxiv: v2 [cs.cl] 1 Jan 2019
Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2
More information18.6 Regression and Classification with Linear Models
18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationOptimal mass transport as a distance measure between images
Optimal mass transport as a distance measure between images Axel Ringh 1 1 Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden. 21 st of June 2018 INRIA, Sophia-Antipolis, France
More informationApproximate Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More informationSignal Recovery from Permuted Observations
EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,
More informationPolyhedral Approaches to Online Bipartite Matching
Polyhedral Approaches to Online Bipartite Matching Alejandro Toriello joint with Alfredo Torrico, Shabbir Ahmed Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Industrial
More informationSupport Vector Machines and Kernel Methods
2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationStructured deep models: Deep learning on graphs and beyond
Structured deep models: Deep learning on graphs and beyond Hidden layer Hidden layer Input Output ReLU ReLU, 25 May 2018 CompBio Seminar, University of Cambridge In collaboration with Ethan Fetaya, Rianne
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationAlgorithms for Picture Analysis. Lecture 07: Metrics. Axioms of a Metric
Axioms of a Metric Picture analysis always assumes that pictures are defined in coordinates, and we apply the Euclidean metric as the golden standard for distance (or derived, such as area) measurements.
More informationSparse and Robust Optimization and Applications
Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability
More informationCorrelation Autoencoder Hashing for Supervised Cross-Modal Search
Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia
More informationCurvature and the continuity of optimal transportation maps
Curvature and the continuity of optimal transportation maps Young-Heon Kim and Robert J. McCann Department of Mathematics, University of Toronto June 23, 2007 Monge-Kantorovitch Problem Mass Transportation
More information