Supplementary material for the paper Optimal Transport for Domain Adaptation


Nicolas Courty, Member, IEEE, Rémi Flamary, Devis Tuia, Senior Member, IEEE, Alain Rakotomamonjy, Member, IEEE

I. OPTIMAL TRANSPORT FOR AFFINE TRANSFORMATIONS

Our main hypothesis is that the transformation $T(\cdot)$ preserves the conditional distribution: $P_s(y|x^s) = P_t(y|T(x^s))$. Hence, if we are able to estimate a transformation $\hat{T}(\cdot)$ such that $\hat{T}(x) = T(x)$ for all $x$, we can compute the exact conditional probability distribution of any example $x$ in the target domain through $P_t(y|x) = P_s(y|T^{-1}(x))$, supposing that $T$ is invertible. We can remark that the ability of our approach to recover the conditional probability of an example essentially depends on how well the condition $\hat{T}(x) = T(x)\ \forall x$ is respected. In the following, we prove that, for empirical distributions and under some mild conditions (positive definite $A$), the estimated transportation map $\hat{T}$ obtained by minimizing the $\ell_2$ ground metric does recover exactly the true transformation $T$.

We want to show that, if $\mu_t(x) = \mu_s(Ax+b)$ where $A \in S_+$ and $b \in \mathbb{R}^d$, then the estimated optimal transportation map is $\hat{T}(x) = Ax + b$. This boils down to proving the following theorem.

Theorem 1.1: Let $\mu_s$ and $\mu_t$ be two discrete distributions, each composed of $n$ Diracs as defined in Equation (1) of the paper. If the following conditions hold:
1) the source samples of $\mu_s$ are $x_i^s \in \mathbb{R}^d$, $i \in 1,\dots,n$, such that $x_i^s \neq x_j^s$ if $i \neq j$;
2) all weights in the source and target distributions are equal to $\frac{1}{n}$;
3) the target samples are defined as $x_i^t = A x_i^s + b$, i.e. an affine transformation of the source samples;
4) $b \in \mathbb{R}^d$ and $A \in S_+$ is a strictly positive definite matrix;
5) the cost function is $c(x_i^s, x_j^t) = \|x_i^s - x_j^t\|^2$;
then the solution $\hat{T}$ of the optimal transport problem gives $\hat{T}(x_i^s) = A x_i^s + b = x_i^t$, $\forall i \in 1,\dots,n$.

First, note that the OT problem (and in particular the sum and positivity constraints) forces the solution $\gamma_0$ to be a doubly stochastic matrix weighted by $\frac{1}{n}$. Since the objective function is linear, the solution will be a vertex of the set of doubly stochastic matrices, which is a permutation matrix as stated by the Birkhoff-von Neumann theorem [1].

To prove Theorem 1.1, we first show that, given the above conditions, if the solution of the discrete OT problem is the identity matrix $\gamma_0 = \frac{1}{n}\mathbf{I}$, then the corresponding transportation map gives $\hat{T}(x_i^s) = A x_i^s + b = x_i^t$, $\forall i \in 1,\dots,n$. This property results from the interpolation formula in Equation (14) of the paper, which states that an interpolation along the Wasserstein manifold from the source to the target distribution, with $t \in [0, 1]$, is defined as
$$\hat{\mu} = \sum_{i,j} \gamma_0(i,j)\, \delta_{(1-t)x_i^s + t x_j^t}.$$
When $\gamma_0 = \frac{1}{n}\mathbf{I}$, it leads to
$$\hat{\mu} = \frac{1}{n}\sum_{i} \delta_{(1-t)x_i^s + t x_i^t}.$$
This equation shows that the mass from each sample $x_i^s$ is transported, without any splitting, to its corresponding target sample $x_i^t$, hence $\hat{T}(x_i^s) = x_i^t$. Indeed, we have
$$(1-t)x_i^s + t x_i^t = (1-t)x_i^s + t(A x_i^s + b) = \big((1-t)\mathbf{I} + tA\big)x_i^s + tb, \qquad (1)$$
which for $t = 1$ gives $\hat{T}(x_i^s) = x_i^t$. Hence, the OT solution $\gamma_0 = \frac{1}{n}\mathbf{I}$ yields an exact reconstruction of the transformation on the samples and $\hat{T}(x_i^s) = A x_i^s + b = x_i^t$, $\forall i \in 1,\dots,n$.

Consequently, to prove Theorem 1.1 we just need to prove that the OT solution $\gamma_0$ is equal to $\frac{1}{n}\mathbf{I}$ given the above conditions. This is done in the following for some particular cases of affine transformations and for the general case where $A \in S_+$ and $b \in \mathbb{R}^d$. For better readability, we will denote a sample in the source domain $x_i^s$ as $x_i$, whereas a sample in the target domain is $x_i^t$.

A. Case with no transformation ($A = \mathbf{I}_d$ and $b = 0$, Figure 1)

In this case, there is no displacement and
$$\gamma_0 = \mathop{\arg\min}_{\gamma}\; \langle \gamma, C \rangle_F. \qquad (2)$$
The solution is obviously $\gamma_0 = \frac{1}{n}\mathbf{I}$, since it does not violate the constraints and is the only possible solution with a 0 loss (any mass not on the diagonal of $\gamma$ would imply an increase in the loss).
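To make the interpolation formula above concrete, here is a minimal NumPy sketch (array names and the toy affine map are illustrative, not part of the paper) that builds the support points and weights of the interpolated measure $\hat{\mu}$ for an arbitrary coupling; when the coupling is the scaled identity, each source point travels in a straight line to its target point, which is exactly the displacement used in the proof.

```python
import numpy as np

def interpolated_measure(Xs, Xt, gamma, t):
    """Support points and weights of mu_hat = sum_ij gamma[i, j] * delta_{(1-t) xs_i + t xt_j}."""
    n, m = gamma.shape
    # all pairwise barycenters (1-t) * xs_i + t * xt_j, shape (n, m, d)
    points = (1.0 - t) * Xs[:, None, :] + t * Xt[None, :, :]
    return points.reshape(n * m, -1), gamma.reshape(n * m)

# Toy usage (illustrative): A = 2 I, b = 1, identity coupling gamma_0 = I / n.
rng = np.random.default_rng(0)
n, d = 5, 2
Xs = rng.normal(size=(n, d))
Xt = 2.0 * Xs + 1.0
gamma0 = np.eye(n) / n
pts, w = interpolated_measure(Xs, Xt, gamma0, t=1.0)
# at t = 1, the points carrying non-zero mass coincide with the target samples Xt
```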

Fig. 1: Optimal transport in the affine transformation 1: case with no transformation.

Fig. 2: Optimal transport in the affine transformation 2: case with translation.

Fig. 3: Optimal transport in the affine transformation 3: general case.

B. Case with translation ($A = \mathbf{I}_d$ and $b \in \mathbb{R}^d$, Figure 2)

In this case, the metric matrix becomes $C$ such that
$$C(i,j) = \|x_i - x_j - b\|^2 = \|x_i - x_j\|^2 + \|b\|^2 - 2b^\top(x_i - x_j),$$
and the solution is again $\gamma_0 = \frac{1}{n}\mathbf{I}$. The corresponding permutation $\mathrm{perm} = \mathbf{I}$ leads to the loss
$$J(\mathbf{I}) = \frac{1}{n}\sum_i \|x_i - x_i - b\|^2 = \|b\|^2.$$
In the general case, the loss for a permutation $\mathrm{perm}$ can be expressed as
$$J(\mathrm{perm}) = \frac{1}{n}\sum_i \|x_i - x_{\mathrm{perm}(i)} - b\|^2 = \frac{1}{n}\sum_i \|x_i - x_{\mathrm{perm}(i)}\|^2 + \|b\|^2 - \frac{2}{n}\sum_i (x_i - x_{\mathrm{perm}(i)})^\top b$$
$$= \|b\|^2 + \frac{1}{n}\sum_i \|x_i - x_{\mathrm{perm}(i)}\|^2,$$
where the cross term vanishes because $\sum_i x_{\mathrm{perm}(i)} = \sum_i x_i$ for any permutation. Since $x_i \neq x_j$ if $i \neq j$, for any permutation term $\mathrm{perm}(i) \neq i$ we have $\|x_i - x_{\mathrm{perm}(i)}\|^2 > 0$. This means that any permutation that is not the identity permutation will have a loss strictly larger than the loss of the identity permutation. Therefore, the solution of the optimization problem is the identity permutation. This shows that a translation of the data does not change the solution of the optimal transport.

C. General case ($A \in S_+$ and $b \in \mathbb{R}^d$, Figure 3)

In this case, we will set $b = 0$ since, as proven previously, a translation does not change the optimal transport solution. The loss can be expressed as
$$J(\gamma) = \frac{1}{n}\sum_i \|x_i - A x_{\mathrm{perm}(i)}\|^2 = \frac{1}{n}\sum_i \Big( \|x_i\|^2 + \|A x_{\mathrm{perm}(i)}\|^2 - 2\, x_i^\top A x_{\mathrm{perm}(i)} \Big) = \frac{1}{n}\sum_i \|x_i\|^2 + \frac{1}{n}\sum_i \|A x_i\|^2 - \frac{2}{n}\sum_i x_i^\top A x_{\mathrm{perm}(i)}.$$
The solution of the optimization will then be the permutation that maximizes the term $\sum_i x_i^\top A x_{\mathrm{perm}(i)}$. $A$ is a positive definite matrix, which means that the previous term can be seen as a sum of scalar products $\langle A^{1/2}x_i, A^{1/2}x_{\mathrm{perm}(i)} \rangle = \langle \tilde{x}_i, \tilde{x}_{\mathrm{perm}(i)} \rangle$, where $\tilde{x}_i = A^{1/2}x_i$. It is easy to show that
$$\sum_i \|\tilde{x}_i - \tilde{x}_{\mathrm{perm}(i)}\|^2 = 2\sum_i \|\tilde{x}_i\|^2 - 2\sum_i \langle \tilde{x}_i, \tilde{x}_{\mathrm{perm}(i)} \rangle,$$
which means that maximizing the cross scalar product amounts to minimizing $\sum_i \|\tilde{x}_i - \tilde{x}_{\mathrm{perm}(i)}\|^2 \ge 0$. The identity permutation is then a maximum of the cross scalar product and a minimum of $J$. Since $\tilde{x}_i \neq \tilde{x}_j$ if $i \neq j$ (as $A^{1/2}$ is invertible), the equality stands only for the identity permutation. This means that once again the solution of the optimization problem is the identity matrix.

Note that if $A$ has negative eigenvalues, then the transport fails in reconstructing the original transformation, as illustrated in Figure 4.

Fig. 4: Optimal transport in the affine transformation 4: case with negative eigenvalues.
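Theorem 1.1 is easy to check numerically. The sketch below is illustrative rather than the implementation used in the paper: it relies on SciPy's Hungarian solver `scipy.optimize.linear_sum_assignment`, which is sufficient here because the optimal plan is a permutation matrix by the Birkhoff-von Neumann argument. It recovers the identity permutation for a strictly positive definite $A$, and typically fails to do so when $A$ has negative eigenvalues, as in Figure 4.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d = 50, 3

Xs = rng.normal(size=(n, d))                       # distinct source samples (a.s.)
B = rng.normal(size=(d, d))
A_pd = B @ B.T + d * np.eye(d)                     # strictly positive definite matrix
b = rng.normal(size=d)

def recovered_permutation(A):
    Xt = Xs @ A.T + b                              # target samples x_i^t = A x_i^s + b
    C = cdist(Xs, Xt, metric="sqeuclidean")        # squared-Euclidean ground cost
    _, perm = linear_sum_assignment(C)             # optimal permutation = OT plan support
    return perm

print(np.array_equal(recovered_permutation(A_pd), np.arange(n)))   # True: identity recovered

A_neg = A_pd - 2.0 * np.max(np.linalg.eigvalsh(A_pd)) * np.eye(d)  # all eigenvalues negative
print(np.array_equal(recovered_permutation(A_neg), np.arange(n)))  # usually False
```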

II. INFLUENCE OF THE CHOICE OF THE COST FUNCTION

We examine here the role played by the type of cost function $C$ used in the adaptation. This cost function naturally allows one to take into account the specificity of the data space.

In the case of the Office-Caltech dataset, the features are histograms. We have tested our OT-IT adaptation strategy with alternative costs that operate directly on histograms ($\ell_2$ and $\ell_1$ norms), as it is common practice to consider those metrics when comparing distributions. Also, as mostly done in the literature, we consider the pre-processed features (normalization + z-score) together with the Euclidean and squared Euclidean norms. As can be seen in Table I, the best results are not always obtained with the same metric. In the paper, we chose to only consider the $\ell_2$ norm on preprocessed data, as it is the one that gave the best result on average, but it is relevant to question what is the best metric for an adaptation task. Several other metrics could be considered, like geodesic distances in the case where data live on a manifold. Also, a better tailored metric could be learnt from the data. Some recent works are considering this option [2], [3] for other tasks. This constitutes an interesting perspective for the following extensions of our methodology for the domain adaptation problem.

TABLE I: Results of changing the cost function matrix C on the Office-Caltech dataset.

Domains   l2      l2 + preprocessing   l2^2 + preprocessing   l1
C -> A    36.1    39.14                43.1                   39.77
C -> W    25.8    35.59                37.63                  37.63
C -> D    31.21   43.31                42.68                  42.68
A -> C    32.41   34.64                34.46                  38.29
A -> W    32.88   34.92                32.2                   34.24
A -> D    35.3    36.94                33.76                  36.94
W -> C    22.35   32.59                31.79                  25.56
W -> A    27.56   39.98                38.0                   31.84
W -> D    84.71   90.45                89.81                  92.99
D -> C    26.36   31.79                32.32                  30.63
D -> A    29.54   32.36                30.69                  32.67
D -> W    74.92   87.8                 88.81                  87.12
mean      38.17   44.96                44.6                   44.2
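As an illustration of how the alternative ground costs of Table I can be instantiated, the following sketch builds the four cost matrices with SciPy. The variable names are hypothetical (`Xs`, `Xt` are assumed to hold the histogram features as rows of NumPy arrays), and the exact normalization/z-score choice shown here (statistics taken on the source domain) is an assumption, not necessarily the preprocessing used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def zscore(X, mean, std):
    return (X - mean) / (std + 1e-12)

def cost_matrices(Xs, Xt):
    """Ground cost matrices corresponding to the columns of Table I (illustrative)."""
    # preprocessing: unit-sum normalization then z-score (statistics from the source here)
    Xs_n = Xs / Xs.sum(axis=1, keepdims=True)
    Xt_n = Xt / Xt.sum(axis=1, keepdims=True)
    mu, sigma = Xs_n.mean(axis=0), Xs_n.std(axis=0)
    Xs_p, Xt_p = zscore(Xs_n, mu, sigma), zscore(Xt_n, mu, sigma)
    return {
        "l2":            cdist(Xs,   Xt,   metric="euclidean"),
        "l2+preproc":    cdist(Xs_p, Xt_p, metric="euclidean"),
        "sq-l2+preproc": cdist(Xs_p, Xt_p, metric="sqeuclidean"),
        "l1":            cdist(Xs,   Xt,   metric="cityblock"),
    }
```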
III. REMARKS ON GENERALIZED CONDITIONAL GRADIENT

We first present the generalized conditional gradient algorithm as introduced by Bredies et al. [4] and, in the subsequent sections, we provide additional properties that are of interest when applying this algorithm to regularized optimal transport problems.

A. The generalized conditional gradient algorithm

We are interested in the problem of minimizing under constraints a composite function such as
$$\min_{\gamma \in \mathcal{B}} F(\gamma) = f(\gamma) + g(\gamma), \qquad (3)$$
where both $f(\cdot)$ and $g(\cdot)$ are convex and differentiable functions and $\mathcal{B}$ is a compact set of $\mathbb{R}^{n \times n}$. One might want to benefit from this composite structure during the optimization procedure. For instance, if we have an efficient solver for optimizing
$$\min_{\gamma \in \mathcal{B}} \langle \nabla f, \gamma \rangle + g(\gamma), \qquad (4)$$
we propose to use this solver in the optimization scheme instead of linearizing the whole objective function (as one would do with a conditional gradient algorithm [5]). The resulting approach is defined in Algorithm 1.

Conceptually, our algorithm lies in-between the original optimization problem and the conditional gradient. Indeed, if we do not consider any linearization, step 3 of the algorithm is equivalent to solving the original problem and one iterate will suffice for convergence. If we use a full linearization as in the conditional gradient approach, step 3 is equivalent to solving a rough approximation of the original problem. By linearizing only a part of the objective function, we thus optimize a better approximation of that function. This will lead to a provably better certificate of optimality than the one of the conditional gradient algorithm [6]. This makes us believe that if an efficient solver of the partially linearized problem is available, our algorithm is of strong interest.

Note that this partial linearization idea has already been introduced by Bredies et al. [4] for solving problem (3). Their theoretical results related to the resulting algorithm apply when $f(\cdot)$ is differentiable, $g(\cdot)$ is convex, and both $f$ and $g$ satisfy some other mild conditions like coercivity. These results state that the generalized conditional gradient algorithm is a descent method and that any limit point of the algorithm is a stationary point of $f + g$.

In what follows, we provide some results when $f$ and $g$ are differentiable. Some of these results provide novel insights on generalized gradient algorithms (relation between optimality and minimizer of the search direction, convergence rate, and optimality certificate), while some are redundant with those proposed by Bredies (convergence).

B. Convergence of the algorithm

1) Reformulating the search direction: Before discussing convergence of the algorithm, we first reformulate its step 3 so as to make its properties more accessible and its convergence analysis more amenable. The reformulation we propose is
$$s^k = \mathop{\arg\min}_{s \in \mathcal{B}} \; \langle \nabla f(\gamma^k), s - \gamma^k \rangle + g(s) - g(\gamma^k), \qquad (5)$$
and it is easy to note that the problem in line 3 of Algorithm 1 is equivalent to this one and leads to the same solution, since the two objectives differ only by terms that are constant with respect to $s$.

Algorithm 1 Generalized Conditional Gradient (GCG)
1: Initialize $k = 0$ and $\gamma^0 \in \mathcal{B}$
2: repeat
3:   Solve problem $s^k = \mathop{\arg\min}_{\gamma \in \mathcal{B}} \; \langle \nabla f(\gamma^k), \gamma \rangle + g(\gamma)$
4:   Find the optimal step $\alpha^k$, with $\Delta\gamma = s^k - \gamma^k$:
     $\alpha^k = \mathop{\arg\min}_{0 \le \alpha \le 1} f(\gamma^k + \alpha\Delta\gamma) + g(\gamma^k + \alpha\Delta\gamma)$,
     or choose $\alpha^k$ so that it satisfies the Armijo rule.
5:   $\gamma^{k+1} \leftarrow \gamma^k + \alpha^k \Delta\gamma$, set $k \leftarrow k + 1$
6: until convergence

2) Relation between minimizers of Problems (3) and (5): The above formulation allows us to derive a property that highlights the relation between problems (3) and (5).

Proposition 3.1: $\gamma^\star$ is a minimizer of problem (3) if and only if
$$\gamma^\star = \mathop{\arg\min}_{s \in \mathcal{B}} \; \langle \nabla f(\gamma^\star), s - \gamma^\star \rangle + g(s) - g(\gamma^\star). \qquad (6)$$

Proof: The proof relies on the optimality conditions of constrained convex optimization problems. Indeed, for convex and differentiable $f$ and $g$, $\gamma^\star$ is a solution of problem (3) if and only if [7]
$$-\nabla f(\gamma^\star) - \nabla g(\gamma^\star) \in N_{\mathcal{B}}(\gamma^\star), \qquad (7)$$
where $N_{\mathcal{B}}(\gamma)$ is the normal cone of $\mathcal{B}$ at $\gamma$. In the same way, a minimizer $s^\star$ of problem (5) at $\gamma^k$ can also be characterized as
$$-\nabla f(\gamma^k) - \nabla g(s^\star) \in N_{\mathcal{B}}(s^\star). \qquad (8)$$
Now suppose that $\gamma^\star$ is a minimizer of problem (3). It is easy to see that if we choose $\gamma^k = \gamma^\star$, then because $\gamma^\star$ satisfies Equation (7), Equation (8) also holds. Conversely, if $\gamma^\star$ is a minimizer of problem (5) at $\gamma^\star$, then $\gamma^\star$ also satisfies Equation (7).

3) Intermediate results and gap certificate: We prove several lemmas and exhibit a gap certificate that provides a bound on the difference between the objective value along the iterations and the optimal objective value. As one may remark, our algorithm is very similar to a conditional gradient algorithm, the Frank-Wolfe algorithm. As such, our proof of convergence of the algorithm follows similar lines as those used by Bertsekas. Our convergence results are based on the following proposition and definition given in [5].

Proposition 3.2 (from [5]): Let $\{\gamma^k\}$ be a sequence generated by the feasible direction method $\gamma^{k+1} = \gamma^k + \alpha^k \Delta\gamma^k$ with $\Delta\gamma^k = s^k - \gamma^k$. Assume that $\{\Delta\gamma^k\}$ is gradient-related and that $\alpha^k$ is chosen by the limited minimization or the Armijo rule; then every limit point of $\{\gamma^k\}$ is a stationary point.

Definition: A sequence $\Delta\gamma^k$ is said to be gradient-related to the sequence $\gamma^k$ if, for any subsequence of $\{\gamma^k\}_{k \in K}$ that converges to a non-stationary point, the corresponding subsequence $\{\Delta\gamma^k\}_{k \in K}$ is bounded and satisfies
$$\limsup_{k \to \infty,\, k \in K} \nabla F(\gamma^k)^\top \Delta\gamma^k < 0.$$
Basically, this property says that if a subsequence converges to a non-stationary point, then at the limit point the feasible direction defined by $\Delta\gamma$ is still a descent direction.

Before proving that the sequence defined by $\{\Delta\gamma^k\}$ is gradient-related, we prove useful lemmas.

Lemma 3.3: For any $\gamma^k \in \mathcal{B}$, each $\Delta\gamma^k = s^k - \gamma^k$ defines a feasible descent direction.

Proof: By definition, $s^k$ belongs to the convex set $\mathcal{B}$. Hence, for any $\alpha^k \in [0, 1]$, $\gamma^{k+1}$ defines a feasible point. Hence $\Delta\gamma^k$ is a feasible direction. Now let us show that it also defines a descent direction. By definition of the minimizer $s^k$, we have for all $s \in \mathcal{B}$
$$\langle \nabla f(\gamma^k), s^k - \gamma^k \rangle + g(s^k) - g(\gamma^k) \le \langle \nabla f(\gamma^k), s - \gamma^k \rangle + g(s) - g(\gamma^k).$$
Because the above inequality also holds for $s = \gamma^k$, we have
$$\langle \nabla f(\gamma^k), s^k - \gamma^k \rangle + g(s^k) - g(\gamma^k) \le 0. \qquad (9)$$
By convexity of $g(\cdot)$ we have
$$g(s^k) - g(\gamma^k) \ge \langle \nabla g(\gamma^k), s^k - \gamma^k \rangle,$$
which, plugged into Equation (9), leads to
$$\langle \nabla f(\gamma^k) + \nabla g(\gamma^k), s^k - \gamma^k \rangle \le 0$$
and thus $\langle \nabla F(\gamma^k), s^k - \gamma^k \rangle \le 0$, which proves that $\Delta\gamma^k$ is a descent direction.

The next lemma provides an interesting feature of our algorithm. Indeed, the lemma states that the difference between the optimal objective value and the current objective value can be easily monitored.

Lemma 3.4: For all $\gamma^k \in \mathcal{B}$, the following property holds:
$$\min_{s \in \mathcal{B}} \Big[ \langle \nabla f(\gamma^k), s - \gamma^k \rangle + g(s) - g(\gamma^k) \Big] \le F(\gamma^\star) - F(\gamma^k) \le 0,$$
where $\gamma^\star$ is a minimizer of $F$. In addition, if $\gamma^k$ does not belong to the set of minimizers of $F(\cdot)$, then the second inequality is strict.

Proof: By convexity of $f$, we have
$$f(\gamma^\star) - f(\gamma^k) \ge \nabla f(\gamma^k)^\top (\gamma^\star - \gamma^k).$$
By adding $g(\gamma^\star) - g(\gamma^k)$ to both sides of the inequality, we obtain
$$F(\gamma^\star) - F(\gamma^k) \ge \nabla f(\gamma^k)^\top (\gamma^\star - \gamma^k) + g(\gamma^\star) - g(\gamma^k),$$
and because $\gamma^\star$ is a minimizer of $F$, we also have $F(\gamma^\star) - F(\gamma^k) \le 0$. Hence, the following holds:
$$\langle \nabla f(\gamma^k), \gamma^\star - \gamma^k \rangle + g(\gamma^\star) - g(\gamma^k) \le F(\gamma^\star) - F(\gamma^k) \le 0,$$

and we also have
$$\min_{s \in \mathcal{B}} \langle \nabla f(\gamma^k), s - \gamma^k \rangle + g(s) - g(\gamma^k) \le F(\gamma^\star) - F(\gamma^k) \le 0,$$
which concludes the first part of the proof. Finally, if $\gamma^k$ is not a minimizer of $F$, then we naturally have $0 > F(\gamma^\star) - F(\gamma^k)$.

4) Proof of convergence: Now that we have all the pieces of the proof, let us show the key ingredient.

Lemma 3.5: The sequence $\{\Delta\gamma^k\}$ of our algorithm is gradient-related.

Proof: For showing that our direction sequence is gradient-related, we have to show that, given a subsequence $\{\gamma^k\}_{k \in K}$ that converges to a non-stationary point $\bar{\gamma}$, the sequence $\{\Delta\gamma^k\}_{k \in K}$ is bounded and that
$$\limsup_{k \to \infty,\, k \in K} \nabla F(\gamma^k)^\top \Delta\gamma^k < 0.$$
Boundedness of the sequence naturally derives from the facts that $s^k \in \mathcal{B}$, $\gamma^k \in \mathcal{B}$ and the set $\mathcal{B}$ is bounded. The second part of the proof starts by showing that
$$\nabla F(\gamma^k)^\top (s^k - \gamma^k) = \langle \nabla f(\gamma^k) + \nabla g(\gamma^k), s^k - \gamma^k \rangle \le \langle \nabla f(\gamma^k), s^k - \gamma^k \rangle + g(s^k) - g(\gamma^k),$$
where the last inequality is obtained owing to the convexity of $g$. Because that inequality holds for the minimizer $s^k$, it also holds for any vector $s \in \mathcal{B}$:
$$\nabla F(\gamma^k)^\top (s^k - \gamma^k) \le \langle \nabla f(\gamma^k), s - \gamma^k \rangle + g(s) - g(\gamma^k).$$
Taking the limit yields
$$\limsup_{k \to \infty,\, k \in K} \nabla F(\gamma^k)^\top (s^k - \gamma^k) \le \langle \nabla f(\bar{\gamma}), s - \bar{\gamma} \rangle + g(s) - g(\bar{\gamma})$$
for all $s \in \mathcal{B}$. As such, this inequality also holds for the minimizer:
$$\limsup_{k \to \infty,\, k \in K} \nabla F(\gamma^k)^\top (s^k - \gamma^k) \le \min_{s \in \mathcal{B}} \langle \nabla f(\bar{\gamma}), s - \bar{\gamma} \rangle + g(s) - g(\bar{\gamma}).$$
Now, since $\bar{\gamma}$ is not stationary, it is not optimal and it does not belong to the minimizers of $F$; hence, according to Lemma 3.4 above,
$$\min_{s \in \mathcal{B}} \langle \nabla f(\bar{\gamma}), s - \bar{\gamma} \rangle + g(s) - g(\bar{\gamma}) < 0,$$
which concludes the proof.

This latter lemma proves that our direction sequence is gradient-related, thus Proposition 3.2 applies.

5) Rate of convergence: We can show that the objective value $F(x^k)$ converges towards $F(x^\star)$ at a rate of $O(1/k)$ if we have some additional smoothness conditions on $F(\cdot)$. We can easily prove this statement by following the steps proposed by Jaggi et al. [6] for the conditional gradient algorithm. We make the hypothesis that there exists a constant $C_F$ so that, for any $x, y \in \mathcal{B}$ and any $\alpha \in [0, 1]$, the inequality
$$F((1-\alpha)x + \alpha y) \le F(x) + \alpha \nabla F(x)^\top (y - x) + \frac{C_F}{2}\alpha^2$$
holds. Based on this inequality, for a sequence $\{x^k\}$ obtained from the generalized conditional gradient algorithm we have
$$F(x^{k+1}) - F(x^\star) = F((1-\alpha_k)x^k + \alpha_k s^k) - F(x^\star) \le F(x^k) - F(x^\star) + \alpha_k \nabla F(x^k)^\top (s^k - x^k) + \frac{C_F}{2}\alpha_k^2. \qquad (10)$$
Let us denote $h(x^k) = F(x^k) - F(x^\star)$. Now, by adding $\alpha_k[g(s^k) - g(x^k)]$ to both sides of the inequality, we have
$$h(x^{k+1}) + \alpha_k[g(s^k) - g(x^k)] \le h(x^k) + \alpha_k\big[\nabla f(x^k)^\top (s^k - x^k) + g(s^k) - g(x^k)\big] + \alpha_k \nabla g(x^k)^\top (s^k - x^k) + \frac{C_F}{2}\alpha_k^2$$
$$\le h(x^k) + \alpha_k\big[\nabla f(x^k)^\top (x^\star - x^k) + g(x^\star) - g(x^k)\big] + \alpha_k \nabla g(x^k)^\top (s^k - x^k) + \frac{C_F}{2}\alpha_k^2,$$
where the second inequality comes from the definition of the search direction $s^k$. Because $f(\cdot)$ is convex, we have $f(x^\star) - f(x^k) \ge \nabla f(x^k)^\top (x^\star - x^k)$. Thus, we have
$$h(x^{k+1}) + \alpha_k[g(s^k) - g(x^k)] \le h(x^k) + \alpha_k\big[f(x^\star) - f(x^k) + g(x^\star) - g(x^k)\big] + \alpha_k \nabla g(x^k)^\top (s^k - x^k) + \frac{C_F}{2}\alpha_k^2$$
$$\le (1-\alpha_k)h(x^k) + \alpha_k \nabla g(x^k)^\top (s^k - x^k) + \frac{C_F}{2}\alpha_k^2.$$
Now, again owing to the convexity of $g(\cdot)$, we have $-g(s^k) + g(x^k) \le -\nabla g(x^k)^\top (s^k - x^k)$. Using this fact in the last inequality above leads to
$$h(x^{k+1}) \le (1-\alpha_k)h(x^k) + \frac{C_F}{2}\alpha_k^2. \qquad (11)$$
Based on this result, we can now state the following theorem.

Theorem 3.6: For each $k \ge 1$, the iterates $x^k$ of Algorithm 1 satisfy
$$F(x^k) - F(x^\star) \le \frac{2C_F}{k + 2}.$$
Proof: The proof stands on Equation (11) and on the same induction as the one used by Jaggi et al. [6]. Note that this convergence property would also hold if we chose the step size as $\alpha_k = \frac{2}{k+2}$.
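To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 using the fixed step size $\alpha_k = 2/(k+2)$ mentioned above (the experiments in the paper use the line search of step 4 instead). The functions `grad_f`, `g` and `solve_partial` are placeholders introduced for illustration: `solve_partial(G)` is assumed to return $\arg\min_{\gamma \in \mathcal{B}} \langle G, \gamma \rangle + g(\gamma)$, e.g. a Sinkhorn-type solver when $g$ is an entropic regularizer.

```python
def gcg(gamma0, grad_f, solve_partial, n_iter=100):
    """Generalized conditional gradient (Algorithm 1) with step size 2 / (k + 2).

    solve_partial(G) must return argmin_{gamma in B} <G, gamma> + g(gamma),
    i.e. the partially linearized problem of step 3.
    """
    gamma = gamma0
    for k in range(n_iter):
        s_k = solve_partial(grad_f(gamma))   # step 3: partial linearization of f only
        delta = s_k - gamma                  # feasible descent direction (Lemma 3.3)
        alpha = 2.0 / (k + 2.0)              # step size for which Theorem 3.6 still holds
        gamma = gamma + alpha * delta        # step 5: convex combination stays in B
    return gamma
```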

6) Computational performance: Let us first show that the GCG algorithm is more efficient than a classical conditional gradient method such as the one used in [8]. We illustrate this in Figure 5, showing the convergence (top panel) and the corresponding computational times (bottom panel). We take as an example the case of computing the OT-GL transport plan between digits 1 and 2 of USPS and those of MNIST. For this example, we have allowed a maximum of 50 iterations. Regarding convergence of the cost function along the iterations, we can clearly see that, while GCG reaches a nearly optimal objective value in around 10 iterations, the CG approach is still far from convergence after 50 iterations. In addition, the per-iteration cost is significantly higher for the conditional gradient algorithm. For this example, we save an order of magnitude of running time, and yet CG has not converged.

Fig. 5: Example of the evolution of the objective value along the iterations (top) and of the running times (bottom) for the generalized conditional gradient (CGS) and the conditional gradient (CG + Mosek) algorithms.

We now study the computational performance of the different optimal transport strategies in the visual object recognition tasks considered above. We use Python implementations of the different OT methods. The tests were run on a simple MacBook Pro with a 2.4 GHz processor. The original OT-exact solution to the optimal transport problem is computed with the MOSEK [9] linear programming solver (other publicly available solvers were considered, but it turned out this particular one was an order of magnitude faster than the others), whereas the other strategies follow our own implementation based on the Sinkhorn-Knopp method. For the regularized optimal transport problems OT-GL and OT-Laplace, we used the conditional gradient splitting algorithm (the source code will be made available upon acceptance of the article).

We report in Table II the computational time needed by the OT methods. As expected, OT-IT is the least computationally intensive. The solution of the exact optimal transport (OT-exact) is longer to compute by a factor of 4. Also, as expected, the two regularized versions OT-GL and OT-Laplace are the most demanding methods. We recall here that the maximum number of inner loops of the GCG approach was set to 10, meaning that each of those methods made 10 calls to the Sinkhorn-Knopp solver used by OT-IT. However, the added computational cost is mostly due to the line search procedure (line 4 in Algorithm 1), which involves several computations of the cost function. We explain the difference between OT-GL and OT-Laplace by the difference in computation time needed by this procedure. Summing up, one can notice that, even for large problems (case P1 -> P4 for instance, involving 3332 x 3329 variables), the computation time is not prohibitive and remains tractable.

TABLE II: Computational time (seconds) for the best set of regularization parameters.

Domains    OT-exact   OT-IT   OT-GL   OT-Laplace
U -> M     86.0       4.6     92.5    55.6
M -> U     85.0       2.3     75.4    2.5
P1 -> P2   131.0      3.8     432.7   333.2
P1 -> P3   133.9      27.9    456.6   296.2
P1 -> P4   132.5      21.3    276.5   153.4
P2 -> P1   13.9       38.3    666.8   413.1
P2 -> P3   65.1       14.3    418.9   182.8
P2 -> P4   67.8       11.7    27.1    12.0
P3 -> P1   135.7      36.4    519.0   389.0
P3 -> P2   67.6       14.3    297.8   156.6
P3 -> P4   62.2       9.8     248.3   18.8
P4 -> P1   134.6      22.9    54.4    272.6
P4 -> P2   66.7       11.1    262.5   123.0
P4 -> P3   65.3       12.6    269.6   127.5
C -> A     26.9       1.2     22.5    17.8
C -> W     6.8        0.3     6.6     7.4
C -> D     3.4        0.2     3.5     2.4
A -> C     26.5       1.1     29.4    23.8
A -> W     5.6        0.3     6.0     6.0
A -> D     2.9        0.2     3.2     4.1
W -> C     6.8        0.3     16.2    7.6
W -> A     5.7        0.3     4.1     7.4
W -> D     0.8        0.1     2.2     1.1
D -> C     3.6        0.2     14.7    5.9
D -> A     3.0        0.2     12.5    5.1
D -> W     0.8        0.1     3.8     1.1
mean       86.8       2.4     349.1   246.4
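For completeness, here is a minimal sketch of the Sinkhorn-Knopp iterations underlying OT-IT (and used as the inner solver of the GCG loop for the regularized variants). It is illustrative rather than the implementation used in the experiments; the regularization strength and iteration count are placeholder values.

```python
import numpy as np

def sinkhorn_knopp(a, b, C, reg=1e-1, n_iter=1000):
    """Entropically regularized OT plan between histograms a and b for cost C,
    via Sinkhorn-Knopp matrix scaling (illustrative parameters)."""
    K = np.exp(-C / reg)                    # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                   # scale columns to match the marginal b
        u = a / (K @ v)                     # scale rows to match the marginal a
    return u[:, None] * K * v[None, :]      # transport plan diag(u) K diag(v)
```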
REFERENCES

[1] J. von Neumann, "A certain zero-sum two-person game equivalent to the optimal assignment problem," Contributions to the Theory of Games, vol. 2, 1953.

[2] G. Zen, E. Ricci, and N. Sebe, "Simultaneous ground metric learning and matrix factorization with earth mover's distance," in Pattern Recognition (ICPR), 2014 22nd International Conference on, Aug. 2014, pp. 3690-3695.

[3] M. Cuturi and D. Avis, "Ground metric learning," J. Mach. Learn. Res., vol. 15, no. 1, pp. 533-564, Jan. 2014.

[4] K. Bredies, D. Lorenz, and P. Maass, "Equivalence of a generalized conditional gradient method and the method of surrogate functionals," Zentrum für Technomathematik, 2005.

[5] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.

[6] M. Jaggi, "Revisiting Frank-Wolfe: projection-free sparse convex optimization," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 427-435.

[7] D. Bertsekas, A. Nedic, and A. Ozdaglar, Convex Analysis and Optimization, Athena Scientific, 2003.

[8] S. Ferradans, N. Papadakis, G. Peyré, and J.-F. Aujol, "Regularized discrete optimal transport," SIAM Journal on Imaging Sciences, vol. 7, no. 3, 2014.

[9] E. Andersen and K. Andersen, "The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm," in High Performance Optimization, vol. 33 of Applied Optimization, pp. 197-232, Springer US, 2000.