OPTIMAL PRIMAL-DUAL METHODS FOR A CLASS OF SADDLE POINT PROBLEMS


YUNMEI CHEN, GUANGHUI LAN, AND YUYUAN OUYANG

Abstract. We present a novel accelerated primal-dual (APD) method for solving a class of deterministic and stochastic saddle point problems (SPP). The basic idea of this algorithm is to incorporate a multi-step acceleration scheme into the primal-dual method without smoothing the objective function. For deterministic SPP, the APD method achieves the same optimal rate of convergence as Nesterov's smoothing technique. Our stochastic APD method exhibits an optimal rate of convergence for stochastic SPP not only in terms of its dependence on the number of iterations, but also on a variety of problem parameters. To the best of our knowledge, this is the first time that such an optimal algorithm has been developed for stochastic SPP in the literature. Furthermore, for both deterministic and stochastic SPP, the developed APD algorithms can deal with the situation when the feasible region is unbounded, as long as a saddle point exists. In the unbounded case, we incorporate the modified termination criterion introduced by Monteiro and Svaiter for solving SPP posed as monotone inclusions, and demonstrate that the rate of convergence of the APD method depends on the distance from the initial point to the set of optimal solutions.

Keywords: saddle point problem, optimal methods, stochastic approximation, stochastic programming, complexity, large deviation.

1. Introduction. Let X and Y denote finite-dimensional vector spaces equipped with an inner product ⟨·,·⟩ and norm ‖·‖, and let X ⊆ X, Y ⊆ Y be given closed convex sets. The basic problem of interest in this paper is the saddle-point problem (SPP) given in the form of

  min_{x∈X} { f(x) := max_{y∈Y} G(x) + ⟨Kx, y⟩ − J(y) }.  (1.1)

Here, G(x) is a general smooth convex function and K is a linear operator such that, for some L_G, L_K ≥ 0,

  G(u) − G(x) − ⟨∇G(x), u − x⟩ ≤ (L_G/2)‖u − x‖² and ‖Ku − Kx‖ ≤ L_K‖u − x‖, ∀x, u ∈ X,  (1.2)

and J : Y → R is a relatively simple, proper, convex, lower semi-continuous (l.s.c.) function (i.e., problem (2.5) below is easy to solve). In particular, if J is the convex conjugate of some convex function F and Y = Y, then (1.1) is equivalent to the primal problem

  min_{x∈X} G(x) + F(Kx).  (1.3)

Problems of these types have recently found many applications in data analysis, especially in imaging processing and machine learning. In many of these applications, G(x) is a convex data fidelity term, while F(Kx) is a certain regularization, e.g., total variation [47], low rank tensor [50], overlapped group lasso [9, 30], and graph regularization [9, 49]. This paper focuses on first-order methods for solving both deterministic SPP, where exact first-order information on f is available, and stochastic SPP, where we only have access to inexact information about f. Let us start by reviewing a few existing first-order methods in both cases.

1.1. Deterministic SPP. Since the objective function f defined in (1.1) is nonsmooth in general, traditional nonsmooth optimization methods, e.g., subgradient methods, would exhibit an O(1/√N) rate of convergence when applied to (1.1) [36], where N denotes the number of iterations.

Department of Mathematics, University of Florida (yun@math.ufl.edu). This author was partially supported by NSF grants DMS-1115568 and IIP-1237814.
Department of Industrial and Systems Engineering, University of Florida (glan@ise.ufl.edu). This author was partially supported by NSF grant CMMI-1000347, ONR grant N00014-13-1-0036, NSF grant DMS-1319050, and NSF CAREER Award CMMI-1254446.
Department of Industrial and Systems Engineering, University of Florida (ouyang@ufl.edu). Part of the research was done while this author was a PhD student at the Department of Mathematics, University of Florida. This author was partially supported by the AFRL Mathematical Modeling and Optimization Institute.
The authors acknowledge the University of Florida Research Computing (http://researchcomputing.ufl.edu) for providing computational resources.
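For illustration, the following minimal NumPy sketch (our own; all names and dimensions are hypothetical, not part of the paper) builds a concrete instance of (1.1) with G(x) = ½‖Ax − b‖², J ≡ 0 and Y the Euclidean unit ball, in which case the inner maximization has a closed form and f(x) = ½‖Ax − b‖² + ‖Kx‖:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 40, 30
A = rng.standard_normal((k, n)) / np.sqrt(k)  # data matrix inside G
b = rng.standard_normal(k)                    # observations
K = rng.standard_normal((m, n))               # linear coupling operator

def G(x):
    """Smooth convex data-fidelity term; grad G is L_G-Lipschitz."""
    r = A @ x - b
    return 0.5 * r @ r

def f(x):
    # With J = 0 and Y = {y : ||y|| <= 1}, max_y <Kx, y> = ||Kx||,
    # so the saddle-point form (1.1) collapses to G(x) + ||Kx||.
    return G(x) + np.linalg.norm(K @ x)

# Constants appearing in (1.2):
L_G = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad G
L_K = np.linalg.norm(K, 2)        # operator norm of K
```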

However, following the breakthrough paper by Nesterov [4], much research effort has been devoted to the development of more efficient methods for solving problem (1.1).

1) Smoothing techniques. In [4], Nesterov proposed to approximate the nonsmooth objective function f in (1.1) by a smooth one with Lipschitz-continuous gradient. The smooth approximation is then minimized by an accelerated gradient method from [39, 40]. Nesterov demonstrated in [4] that, if X and Y are compact, the rate of convergence of this smoothing scheme applied to (1.1) can be bounded by

  O( L_G/N² + L_K/N ),  (1.4)

which significantly improves the previous bound O(1/√N). It can be seen that the rate of convergence in (1.4) is actually optimal, based on the following observations: (a) there exists a function G with Lipschitz continuous gradients, such that for any first-order method, the rate of convergence for solving min_{x∈X} G(x) is at most O(L_G/N²) [40]; (b) there exist b ∈ Y, where Y is a convex compact set of R^m for some m > 0, and a bounded linear operator K, such that for any first-order method, the rate of convergence for solving min_{x∈X} max_{y∈Y} ⟨Kx, y⟩ − J(y) := min_{x∈X} max_{y∈Y} ⟨Kx − b, y⟩ is at most O(L_K/N) [37, 34]. Nesterov's smoothing technique has been extensively studied (see, e.g., [38, 6, 10, 4, 5]). Observe that in order to properly apply these smoothing techniques, we need to assume either X or Y to be bounded.

2) Primal-dual methods. While Nesterov's smoothing scheme and its variants rely on a smooth approximation to the original problem (1.1), primal-dual methods work directly with the original saddle-point problem. This type of method was first presented by Arrow et al. [1] and named the primal-dual hybrid gradient (PDHG) method in [5]. The results in [5, 9] showed that the PDHG algorithm, if employed with well-chosen stepsize policies, exhibits very fast convergence in practice, especially for some imaging applications. Recently Chambolle and Pock [9] presented a unified form of primal-dual algorithms, and demonstrated that, with a properly specified stepsize policy and averaging scheme, these algorithms can also achieve the O(1/N) rate of convergence. They also discussed possible ways to extend primal-dual algorithms to deal with the case when both X and Y are unbounded. In the original work of Chambolle and Pock, G is assumed to be relatively simple so that the subproblems can be solved efficiently. With little additional effort, one can show that, by linearizing G at each step, their method can also be applied to a general smooth convex function G, and the rate of convergence of this modified algorithm is given by

  O( (L_G + L_K)/N ).  (1.5)

The rate of convergence in (1.4) has a significantly better dependence on L_G than that in (1.5). Therefore, Nesterov's smoothing scheme allows a very large Lipschitz constant L_G (as big as O(N)) without affecting the rate of convergence (up to a constant factor). This is desirable in many data analysis applications (e.g., image processing), where L_G is usually significantly bigger than L_K. Note that the primal-dual methods are also related to the Douglas-Rachford splitting method [9] and to a pre-conditioned version of the alternating direction method of multipliers [3, 6] (see, e.g., [9, 8, 33] for detailed reviews on the relationship between primal-dual methods and other algorithms, as well as recent theoretical developments).

3) Extragradient methods for the variational inequality (VI) reformulation. Motivated by Nesterov's work, Nemirovski presented a mirror-prox method, by modifying Korpelevich's extragradient algorithm [3], for solving a more general class of variational inequalities [34] (see also [10]). Similar to the primal-dual methods mentioned above, the extragradient methods update iterates in both the primal space X and the dual space Y, and do not require any smoothing technique.
The difference is that each iteration of an extragradient method requires an extra gradient descent step. Nemirovski's method, when specialized to (1.1), also exhibits the rate of convergence given by (1.5), which, in view of our previous discussion, is not optimal in terms of its dependence on L_G.
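The practical gap between (1.4) and (1.5) is easy to quantify numerically. The snippet below (our own illustration; hidden constants set to one, constants L_G, L_K chosen arbitrarily) prints both bounds:

```python
# Compare the scaling of bounds (1.4) and (1.5); absolute constants set to 1.
L_G, L_K = 1.0e4, 1.0e1  # a large smooth constant, as in imaging problems
for N in (100, 1000, 10000):
    smoothing_bound = L_G / N**2 + L_K / N   # Nesterov smoothing, (1.4)
    primal_dual_bound = (L_G + L_K) / N      # linearized primal-dual, (1.5)
    print(f"N={N:6d}  (1.4)={smoothing_bound:.3e}  (1.5)={primal_dual_bound:.3e}")
```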

It can be shown that, in some special cases (e.g., when G is quadratic), one can write explicitly the (strongly concave) dual function of G(x) and obtain a result similar to (1.4), e.g., by applying an improved algorithm in [10]. However, this approach would increase the dimension of the problem and cannot be applied to a general smooth function G. It should be noted that, while Nemirovski's initial work only considers the case when both X and Y are bounded, Monteiro and Svaiter [3] recently showed that extragradient methods can deal with unbounded sets X and Y by using a slightly modified termination criterion.

1.2. Stochastic SPP. While deterministic SPP has been extensively explored, the study of stochastic first-order methods for stochastic SPP is still quite limited. In the stochastic setting, we assume that there exists a stochastic oracle (SO) that can provide unbiased estimators of the gradient operators ∇G(x) and (Kx, K^T y). More specifically, at the i-th call to the SO with input (x_i, y_i) ∈ X × Y, the oracle outputs the stochastic gradients (Ĝ(x_i), K̂_x(x_i), K̂_y(y_i)) ≡ (G(x_i, ξ_i), K_x(x_i, ξ_i), K_y(y_i, ξ_i)) such that

  E[Ĝ(x_i)] = ∇G(x_i), E[(K̂_x(x_i), K̂_y(y_i))] = (Kx_i, K^T y_i).  (1.6)

Here {ξ_i ∈ R^d}_{i≥1} is a sequence of i.i.d. random variables. In addition, we assume that, for some σ_{x,G}, σ_y, σ_{x,K} ≥ 0, the following assumption holds for all x_i ∈ X and y_i ∈ Y:

A1. E[‖Ĝ(x_i) − ∇G(x_i)‖²] ≤ σ²_{x,G}, E[‖K̂_x(x_i) − Kx_i‖²] ≤ σ_y² and E[‖K̂_y(y_i) − K^T y_i‖²] ≤ σ²_{x,K}.

Sometimes we simply denote σ_x := (σ²_{x,G} + σ²_{x,K})^{1/2} for the sake of notational convenience. Stochastic SPP often appears in machine learning applications. For example, for problems given in the form of (1.3), G(x) (resp., F(Kx)) can be used to denote a smooth (resp., nonsmooth) expected convex loss function. It should also be noted that deterministic SPP is a special case of the above setting with σ_x = σ_y = 0.

In view of the classic complexity theory for convex programming [36], a lower bound on the rate of convergence for solving stochastic SPP is given by

  Ω( L_G/N² + L_K/N + (σ_x + σ_y)/√N ),  (1.7)

where the first two terms follow from the discussion after (1.4) and the last term follows from Sections 5.3 and 6.3 of [36]. However, to the best of our knowledge, there does not exist an optimal algorithm in the literature which exhibits exactly the same rate of convergence as in (1.7), although a few general-purpose stochastic optimization algorithms possess different nearly optimal rates of convergence when applied to the above stochastic SPP.

1) Mirror-descent stochastic approximation (MD-SA). The MD-SA method developed by Nemirovski et al. in [35] originates from the classical stochastic approximation (SA) of Robbins and Monro [46]. The classical SA mimics the simple gradient descent method by replacing exact gradients with stochastic gradients, but can only be applied to solve strongly convex problems (see also Polyak [44] and Polyak and Juditsky [45], and Nemirovski et al. [35] for an account of the earlier development of SA methods). By properly modifying the classical SA, Nemirovski et al. showed in [35] that the MD-SA method can optimally solve general nonsmooth stochastic programming problems. The rate of convergence of this algorithm, when applied to the stochastic SPP, is given by (see Section 3 of [35])

  O( (L_G + L_K + σ_x + σ_y)/√N ).  (1.8)

However, the above bound is significantly worse than the lower bound in (1.7) in terms of its dependence on both L_G and L_K.
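Assumption A1 above is easy to realize in simulation. Below is a minimal sketch (our own illustration; the additive Gaussian noise model and all names are hypothetical) of a stochastic oracle satisfying (1.6), for the instance constructed in Section 1; the A1 variance bounds then hold with variances proportional to the chosen noise levels (times dimension):

```python
import numpy as np

def make_oracle(A, b, K, sig_xG=0.1, sig_y=0.1, sig_xK=0.1, seed=1):
    """Stochastic first-order oracle in the sense of (1.6): unbiased
    estimates of grad G(x) = A^T(Ax - b), of Kx, and of K^T y."""
    rng = np.random.default_rng(seed)
    def oracle(x, y):
        g_hat = A.T @ (A @ x - b) + sig_xG * rng.standard_normal(x.shape)
        Kx_hat = K @ x + sig_y * rng.standard_normal(K.shape[0])
        KTy_hat = K.T @ y + sig_xK * rng.standard_normal(x.shape)
        return g_hat, Kx_hat, KTy_hat
    return oracle
```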

2) Stochastic mirror-prox (SMP). In order to improve the convergence of the MD-SA method, Juditsky et al. [11] developed a stochastic counterpart of Nemirovski's mirror-prox method for solving general variational inequalities. The stochastic mirror-prox method, when specialized to the above stochastic SPP, yields a rate of convergence given by

  O( (L_G + L_K)/N + (σ_x + σ_y)/√N ).  (1.9)

Note, however, that the above bound is still significantly worse than the lower bound in (1.7) in terms of its dependence on L_G.

3) Accelerated stochastic approximation (AC-SA). More recently, Lan presented in [14] (see also [4, 5]) a unified optimal method for solving smooth, nonsmooth and stochastic optimization problems by developing a stochastic version of Nesterov's method [39, 40]. The AC-SA algorithm of [14], when applied to the aforementioned stochastic SPP, possesses the rate of convergence

  O( L_G/N² + (L_K + σ_x + σ_y)/√N ).  (1.10)

However, since the nonsmooth term in f of (1.1) has a certain special structure, the above bound is still significantly worse than the lower bound in (1.7) in terms of its dependence on L_K. It should be noted that some improvement over AC-SA has been made by Lin et al. [8] by applying the smoothing technique to (1.1). However, such an improvement works only when Y is bounded and σ_y = σ_{x,K} = 0. Otherwise, the rate of convergence of the AC-SA algorithm will depend on the variance of the stochastic gradients computed for the smooth approximation problem, which is usually unknown and difficult to characterize (see Section 3 for more discussion). Therefore, none of the stochastic optimization algorithms mentioned above can achieve the lower bound on the rate of convergence in (1.7).

1.3. Contribution of this paper. Our contribution mainly consists of the following three aspects. Firstly, we present a new primal-dual type method, namely the accelerated primal-dual (APD) method, that can achieve the optimal rate of convergence in (1.4) for deterministic SPP. The basic idea of this algorithm is to incorporate a multi-step acceleration scheme into the primal-dual method of [9]. We demonstrate that, without requiring the application of the smoothing technique, this method achieves the same optimal rate of convergence as Nesterov's smoothing scheme when applied to (1.1). We also show that the cost per iteration of APD is comparable to that of Nesterov's smoothing scheme. Hence our method can efficiently solve problems with a big Lipschitz constant L_G.

Secondly, in order to solve stochastic SPP, we develop a stochastic counterpart of the APD method, namely stochastic APD, and demonstrate that it actually achieves the lower bound on the rate of convergence in (1.7). Therefore, this algorithm exhibits an optimal rate of convergence for stochastic SPP not only in terms of its dependence on N, but also on a variety of problem parameters including L_G, L_K, σ_x and σ_y. To the best of our knowledge, this is the first time that such an optimal algorithm has been developed for stochastic SPP in the literature. In addition, we investigate the stochastic APD method in more detail, e.g., by developing the large-deviation results associated with its rate of convergence.

Thirdly, for both deterministic and stochastic SPP, we demonstrate that the developed APD algorithms can deal with the situation when either X or Y is unbounded, as long as a saddle point of problem (1.1) exists. We incorporate into the APD method the termination criterion employed by Monteiro and Svaiter [3] for solving variational inequalities, and generalize it for solving stochastic SPP. In both the deterministic and stochastic cases, the rate of convergence of the APD algorithms will depend on the distance from the initial point to the set of optimal solutions.

Finally, we demonstrate the advantages of the proposed deterministic and stochastic APD methods for solving certain classes of SPP through numerical experiments.

1.4. Organization of the paper.
We present the APD methods and discuss their main convergence properties for solving deterministic and stochastic SPP, respectively, in Sections 2 and 3. To facilitate the reader, we put the proofs of our main results in Section 4. Experimental results on the deterministic and stochastic APD methods, including comparisons with several existing algorithms, are presented in Section 5. Some brief concluding remarks are made in Section 6.

2. Accelerated primal-dual method for deterministic SPP. Our goal in this section is to present an accelerated primal-dual method for deterministic SPP and discuss its main convergence properties. The study of first-order primal-dual methods for nonsmooth convex optimization has been mainly motivated by solving total variation based image processing problems (see, e.g., [5, 43, 9, 6, 7]). Algorithm 1 shows a primal-dual method summarized in [9] for solving a special case of problem (1.1), where Y = R^m for some m > 0, and J(y) = F*(y) is the convex conjugate of a convex and l.s.c. function F.

Algorithm 1 Primal-dual method for solving deterministic SPP
1: Choose x_1 ∈ X, y_1 ∈ Y. Set x̄_1 = x_1.
2: For t = 1, ..., N, calculate
  y_{t+1} = argmin_{y∈Y} ⟨−Kx̄_t, y⟩ + J(y) + (1/(2τ_t))‖y − y_t‖²,  (2.1)
  x_{t+1} = argmin_{x∈X} G(x) + ⟨Kx, y_{t+1}⟩ + (1/(2η_t))‖x − x_t‖²,  (2.2)
  x̄_{t+1} = θ_{t+1}(x_{t+1} − x_t) + x_{t+1}.  (2.3)
3: Output x^N = (1/N)∑_{t=1}^N x_t, y^N = (1/N)∑_{t=1}^N y_t.

Algorithm 2 Accelerated primal-dual method for deterministic SPP
1: Choose x_1 ∈ X, y_1 ∈ Y. Set x_1^{ag} = x_1, y_1^{ag} = y_1, x̄_1 = x_1.
2: For t = 1, 2, ..., N, calculate
  x_t^{md} = (1 − β_t^{-1})x_t^{ag} + β_t^{-1}x_t,  (2.4)
  y_{t+1} = argmin_{y∈Y} ⟨−Kx̄_t, y⟩ + J(y) + (1/τ_t)V_Y(y_t, y),  (2.5)
  x_{t+1} = argmin_{x∈X} ⟨∇G(x_t^{md}), x⟩ + ⟨x, K^T y_{t+1}⟩ + (1/η_t)V_X(x_t, x),  (2.6)
  x_{t+1}^{ag} = (1 − β_t^{-1})x_t^{ag} + β_t^{-1}x_{t+1},  (2.7)
  y_{t+1}^{ag} = (1 − β_t^{-1})y_t^{ag} + β_t^{-1}y_{t+1},  (2.8)
  x̄_{t+1} = θ_{t+1}(x_{t+1} − x_t) + x_{t+1}.  (2.9)
3: Output x_N^{ag}, y_N^{ag}.

The convergence of the sequence {(x_t, y_t)} generated by Algorithm 1 has been studied in [43, 9, 6, 7] for various choices of θ_t and under different conditions on the stepsizes τ_t and η_t. In the study by Chambolle and Pock [9], constant stepsizes are used, i.e., τ_t = τ, η_t = η and θ_t = θ for some τ, η, θ > 0 and all t. If τηL_K² < 1, where L_K is defined in (1.2), then the output (x^N, y^N) possesses a rate of convergence of O(1/N) for θ = 1, and of O(1/√N) for θ = 0, in terms of the partial duality gap (the duality gap restricted to a bounded domain; see (2.13) below). One possible limitation of [9] is that both G and J need to be simple enough so that the two subproblems (2.1) and (2.2) in Algorithm 1 are easy to solve. To make Algorithm 1 applicable to more practical problems, we consider more general cases, where J is simple, but G may not be. In particular, we assume that G is a general smooth convex function satisfying (1.2).
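For concreteness, the following NumPy sketch (ours; the projection arguments, stepsize callables and all names are our own illustrative choices) implements Algorithm 2 in the Euclidean setting, i.e., with V_X and V_Y taken as half squared distances, for an instance with J ≡ 0 so that (2.5) and (2.6) reduce to projected gradient-type updates:

```python
import numpy as np

def apd(gradG, K, proj_X, proj_Y, x1, y1, N, beta, theta, eta, tau):
    """Accelerated primal-dual (Algorithm 2) with Euclidean prox terms.
    beta, theta, eta, tau are callables t -> parameter value; J = 0."""
    x, x_ag, y_ag, x_bar = x1.copy(), x1.copy(), y1.copy(), x1.copy()
    y = y1.copy()
    for t in range(1, N + 1):
        b = 1.0 / beta(t)
        x_md = (1 - b) * x_ag + b * x                          # (2.4)
        y = proj_Y(y + tau(t) * (K @ x_bar))                   # (2.5), J = 0
        x_new = proj_X(x - eta(t) * (gradG(x_md) + K.T @ y))   # (2.6)
        x_ag = (1 - b) * x_ag + b * x_new                      # (2.7)
        y_ag = (1 - b) * y_ag + b * y                          # (2.8)
        x_bar = x_new + theta(t + 1) * (x_new - x)             # (2.9)
        x = x_new
    return x_ag, y_ag
```

With the parameters of Corollary 2.2 below, this loop attains the optimal O(L_G/N² + L_K/N) bound.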

In this case, we can replace G in (2.2) by its linear approximation G(x_t) + ⟨∇G(x_t), x − x_t⟩. Then (2.2) becomes

  x_{t+1} = argmin_{x∈X} ⟨∇G(x_t), x⟩ + ⟨Kx, y_{t+1}⟩ + (1/(2η_t))‖x − x_t‖².  (2.10)

In what follows, we refer to this modified algorithm as the linearized version of Algorithm 1. With some extra effort one can show that, if for t = 1, ..., N we have 0 < θ_t = τ_{t−1}/τ_t = η_{t−1}/η_t ≤ 1 and L_Gη_t + L_K²η_tτ_t ≤ 1, then (x^N, y^N) has an O((L_G + L_K)/N) rate of convergence in the sense of the partial duality gap. As discussed in Section 1, this rate of convergence for the linearized version of Algorithm 1 is the same as that proved in [9], and not optimal in terms of its dependence on L_G (see (1.5)). However, this algorithm solves problem (1.1) directly, without smoothing the nonsmooth objective function. Considering the primal-dual method as an alternative to Nesterov's smoothing method, and inspired by his idea of using an accelerated gradient descent algorithm to solve the smoothed problem [39, 40, 4], we propose the following accelerated primal-dual algorithm that integrates the accelerated gradient descent algorithm into the linearized version of Algorithm 1. Our accelerated primal-dual (APD) method is presented in Algorithm 2. Observe that in this algorithm, the superscript "ag" stands for "aggregated", and "md" stands for "middle". For any x, u ∈ X and y, v ∈ Y, the functions V_X(·,·) and V_Y(·,·) are Bregman divergences defined as

  V_X(x, u) := d_X(x) − d_X(u) − ⟨∇d_X(u), x − u⟩, and V_Y(y, v) := d_Y(y) − d_Y(v) − ⟨∇d_Y(v), y − v⟩,  (2.11)

where d_X(·) and d_Y(·) are strongly convex functions with strong convexity parameters α_X and α_Y. For example, under the Euclidean setting, we can simply set V_X(x, x_t) := ‖x − x_t‖²/2 and V_Y(y, y_t) := ‖y − y_t‖²/2, and α_X = α_Y = 1. We assume that J(y) is a simple convex function, so that the optimization problem in (2.5) can be solved efficiently.

Note that if β_t = 1 for all t, then x_t^{md} = x_t, x_{t+1}^{ag} = x_{t+1}, and Algorithm 2 is the same as the linearized version of Algorithm 1. However, by specifying a different selection of β_t (e.g., β_t = O(t)), we can significantly improve the rate of convergence of Algorithm 2 in terms of its dependence on L_G. It should be noted that the iteration cost of the APD algorithm is about the same as that of the linearized version of Algorithm 1.

In order to analyze the convergence of Algorithm 2, it is necessary to introduce a notion to characterize the solutions of (1.1). Specifically, denoting Z = X × Y, for any z̃ = (x̃, ỹ) ∈ Z and z = (x, y) ∈ Z, we define

  Q(z̃, z) := [G(x̃) + ⟨Kx̃, y⟩ − J(y)] − [G(x) + ⟨Kx, ỹ⟩ − J(ỹ)].  (2.12)

It can easily be seen that z̃ is a solution of problem (1.1) if and only if Q(z̃, z) ≤ 0 for all z ∈ Z. Therefore, if Z is bounded, it is suggestive to use the gap function

  g(z̃) := max_{z∈Z} Q(z̃, z)  (2.13)

to assess the quality of a feasible solution z̃ ∈ Z. In fact, we can show that f(x̃) − f* ≤ g(z̃) for all z̃ ∈ Z, where f* denotes the optimal value of problem (1.1). However, if Z is unbounded, then g(z̃) is not well-defined even for a nearly optimal solution z̃ ∈ Z. Hence, in the sequel, we consider the bounded and unbounded cases separately, employing a slightly different error measure for the latter situation.
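Two standard choices of the Bregman divergences in (2.11) are sketched below (our own illustration): the Euclidean setting used in Sections 2 and 5.1, and the entropy setting used on simplices in Section 5.2. The assertion checks Pinsker's inequality, which gives strong convexity parameter 1 for the entropy under the ℓ₁-norm:

```python
import numpy as np

def V_euclid(x, u):
    """Euclidean Bregman divergence, d(x) = ||x||^2 / 2, alpha = 1."""
    return 0.5 * np.sum((x - u) ** 2)

def V_entropy(y, v, eps=1e-12):
    """Entropy Bregman (KL) divergence on the simplex, d(y) = sum y log y;
    strongly convex with alpha = 1 w.r.t. the l1-norm on the simplex."""
    return float(np.sum(y * (np.log(y + eps) - np.log(v + eps))))

# Sanity check of the strong-convexity bound V(y, v) >= ||y - v||_1^2 / 2
rng = np.random.default_rng(2)
y = rng.random(5); y /= y.sum()
v = rng.random(5); v /= v.sum()
assert V_entropy(y, v) >= 0.5 * np.sum(np.abs(y - v)) ** 2 - 1e-9  # Pinsker
```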

The following theorem describes the convergence properties of Algorithm 2 when Z is bounded.

Theorem 2.1. Suppose that for some Ω_X, Ω_Y > 0,

  sup_{x,x̄∈X} V_X(x, x̄) ≤ Ω_X and sup_{y,ȳ∈Y} V_Y(y, ȳ) ≤ Ω_Y.  (2.14)

Also assume that the parameters β_t, θ_t, η_t, τ_t in Algorithm 2 are chosen such that for all t ≥ 1,

  β_1 = 1, β_{t+1} − 1 = β_tθ_{t+1},  (2.15)
  0 < θ_t ≤ min{ η_{t−1}/η_t, τ_{t−1}/τ_t },  (2.16)
  α_X/η_t − L_G/β_t − L_K²τ_t/α_Y ≥ 0.  (2.17)

Then for all t ≥ 1,

  g(z_{t+1}^{ag}) ≤ (β_tη_t)^{-1}Ω_X + (β_tτ_t)^{-1}Ω_Y.  (2.18)

There are various options for choosing the parameters β_t, η_t, τ_t and θ_t such that (2.15)-(2.17) hold. Below we provide one such example.

Corollary 2.2. Suppose that (2.14) holds. In Algorithm 2, if the parameters are set to

  β_t = (t+1)/2, θ_t = (t−1)/t, η_t = (α_X t)/(2(L_G + L_K N D_Y/D_X)) and τ_t = (α_Y D_Y t)/(2 L_K N D_X),  (2.19)

where D_X := (Ω_X/α_X)^{1/2} and D_Y := (Ω_Y/α_Y)^{1/2}, then for all t ≥ 2,

  g(z_t^{ag}) ≤ (4L_G D_X²)/(t(t−1)) + (8L_K N D_X D_Y)/(t(t−1)).  (2.20)

Proof. It suffices to verify that the parameters in (2.19) satisfy (2.15)-(2.17) in Theorem 2.1. It is easy to check that (2.15) and (2.16) hold. Furthermore,

  α_X/η_t − L_G/β_t − L_K²τ_t/α_Y = 2L_G(1/t − 1/(t+1)) + (L_K D_Y/D_X)(2N/t − t/(2N)) ≥ 0

for 1 ≤ t ≤ N, so (2.17) holds. Therefore, by (2.18), for all t ≥ 1 we have

  g(z_{t+1}^{ag}) ≤ (β_tη_t)^{-1}Ω_X + (β_tτ_t)^{-1}Ω_Y = (4L_G D_X²)/(t(t+1)) + (8L_K N D_X D_Y)/(t(t+1)). □

In particular, at t = N the bound (2.20) is of order O(L_G/N² + L_K/N); hence, in view of (1.4), the rate of convergence of Algorithm 2 applied to problem (1.1) is optimal when the parameters are chosen according to (2.19). Also observe that we need to estimate the ratio D_Y/D_X to use these parameters. It should be pointed out, however, that replacing the ratio D_Y/D_X in (2.19) by any positive constant only increases the RHS of (2.20) by a constant factor.

Now, we study the convergence properties of the APD algorithm for the case when Z = X × Y is unbounded, by using a perturbation-based termination criterion recently employed by Monteiro and Svaiter and applied to SPP [3, 33]. This termination criterion is based on the enlargement of a maximal monotone operator, first introduced in [7]. One advantage of this criterion is that its definition does not depend on the boundedness of the domain of the operator. More specifically, as shown in [3], there always exists a perturbation vector v such that

  g̃(z̃, v) := max_{z∈Z} Q(z̃, z) − ⟨v, z̃ − z⟩  (2.21)

is well-defined, although the value of g(z̃) in (2.13) may be unbounded if Z is unbounded. In the following result, we show that the APD algorithm can compute a nearly optimal solution z̃ with a small residue g̃(z̃, v) for a small perturbation vector v (i.e., ‖v‖ is small).
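The parameter policy (2.19) is straightforward to compute from the problem constants. A minimal helper (our own wrapper code, compatible with the APD sketch given after Algorithm 2):

```python
def apd_params(L_G, L_K, N, D_X, D_Y, alpha_X=1.0, alpha_Y=1.0):
    """Stepsize policy of Corollary 2.2 for Algorithm 2 (bounded Z)."""
    beta = lambda t: (t + 1) / 2.0
    theta = lambda t: (t - 1.0) / t
    eta = lambda t: alpha_X * t / (2.0 * (L_G + L_K * N * D_Y / D_X))
    tau = lambda t: alpha_Y * D_Y * t / (2.0 * L_K * N * D_X)
    return beta, theta, eta, tau

# e.g.: x_ag, y_ag = apd(gradG, K, proj_X, proj_Y, x0, y0, N,
#                        *apd_params(L_G, L_K, N, DX, DY))
```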

In addition, our derived iteration complexity bounds are proportional to the distance from the initial point to the solution set.

Theorem 2.3. Let {z_t^{ag}} = {(x_t^{ag}, y_t^{ag})} be the iterates generated by Algorithm 2 with V_X(x, x_t) = ‖x − x_t‖²/2 and V_Y(y, y_t) = ‖y − y_t‖²/2. Assume that the parameters β_t, θ_t, η_t and τ_t satisfy (2.15),

  θ_t = η_{t−1}/η_t = τ_{t−1}/τ_t,  (2.22)
  1/η_t − L_G/β_t − L_K²τ_t/p ≥ 0  (2.23)

for all t and some 0 < p < 1. Then there exists a perturbation vector v_{t+1} such that

  g̃(z_{t+1}^{ag}, v_{t+1}) ≤ ((2−p)D²)/(β_tη_t(1−p)) =: ε_{t+1}  (2.24)

for any t ≥ 1. Moreover, we have

  ‖v_{t+1}‖ ≤ (1/(β_tη_t))‖x̂ − x_1‖ + (1/(β_tτ_t))‖ŷ − y_1‖ + D[ 1/(β_tη_t) + 1/(β_t(η_tτ_t(1−p))^{1/2}) + 2L_K/β_t ],  (2.25)

where (x̂, ŷ) is a pair of solutions of problem (1.1) and

  D := ( ‖x̂ − x_1‖² + (η_t/τ_t)‖ŷ − y_1‖² )^{1/2}  (2.26)

(note that the ratio η_t/τ_t is constant by (2.22)). Below we suggest a specific parameter setting which satisfies (2.15), (2.22) and (2.23).

Corollary 2.4. In Algorithm 2, if N is given and the parameters are set to

  β_t = (t+1)/2, θ_t = (t−1)/t, η_t = t/(2(L_G + L_K(N+1))), and τ_t = t/(2L_K(N+1)),  (2.27)

then there exists v_N that satisfies (2.24) with

  ε_N ≤ (10L_G D̂²)/N² + (10L_K D̂²)/N and ‖v_N‖ ≤ (15L_G D̂)/N² + (19L_K D̂)/N,  (2.28)

where D̂ = (‖x̂ − x_1‖² + ‖ŷ − y_1‖²)^{1/2}.

Proof. For the parameters β_t, θ_t, η_t, τ_t in (2.27), it is clear that (2.15) and (2.22) hold. Furthermore, letting p = 1/4, for any t = 1, ..., N we have

  1/η_t − L_G/β_t − L_K²τ_t/p = 2L_G(1/t − 1/(t+1)) + 2L_K((N+1)/t − t/(N+1)) ≥ 0,

thus (2.23) holds. By Theorem 2.3, inequalities (2.24) and (2.25) hold. Noting that η_t ≤ τ_t, in (2.24) and (2.25) we have D ≤ D̂ and ‖x̂ − x_1‖ + ‖ŷ − y_1‖ ≤ √2 D̂, hence

  ‖v_{t+1}‖ ≤ (√2 D̂)/(β_tη_t) + D̂[ 1/(β_tη_t) + 2/(β_t(3η_tτ_t)^{1/2}) + 2L_K/β_t ] and ε_{t+1} ≤ (4/3)·(7/4)·D̂²/(β_tη_t) = (7D̂²)/(3β_tη_t).

Also note that by (2.27), 1/(β_tη_t) = 4(L_G + L_K(N+1))/(t(t+1)). Using these three relations and the definition of β_t in (2.27), we obtain (2.28) after simplifying the constants. □

It is interesting to notice that, if the parameters in Algorithm 2 are set to (2.27), then both residues ε_N and ‖v_N‖ in (2.28) reduce to zero at approximately the same rate of convergence (up to a factor of D̂). Also observe that in Theorem 2.3 and Corollary 2.4 we fix V_X(·,·) and V_Y(·,·) to be regular distance functions rather than more general Bregman divergences. This is due to the fact that we need to apply the triangle inequality associated with V_X(·,·) and V_Y(·,·), while such an inequality does not necessarily hold for Bregman divergences in general.

3. Stochastic APD method for stochastic SPP. Our goal in this section is to present a stochastic APD method for stochastic SPP (i.e., problem (1.1) with a stochastic oracle) and demonstrate that it actually achieves the lower bound (1.7) on the rate of convergence for stochastic SPP.

The stochastic APD method is a stochastic counterpart of the APD algorithm in Section 2, obtained by simply replacing the gradient operators −Kx̄_t, ∇G(x_t^{md}) and K^Ty_{t+1} used in (2.5) and (2.6) with the stochastic gradient operators computed by the SO, i.e., K̂_x(x̄_t), Ĝ(x_t^{md}) and K̂_y(y_{t+1}), respectively. The algorithm is formally described as Algorithm 3.

Algorithm 3 Stochastic APD method for stochastic SPP
Modify (2.5) and (2.6) in Algorithm 2 to
  y_{t+1} = argmin_{y∈Y} ⟨−K̂_x(x̄_t), y⟩ + J(y) + (1/τ_t)V_Y(y_t, y),  (3.1)
  x_{t+1} = argmin_{x∈X} ⟨Ĝ(x_t^{md}), x⟩ + ⟨x, K̂_y(y_{t+1})⟩ + (1/η_t)V_X(x_t, x).  (3.2)

A few remarks about the development of the above stochastic APD method are in order. Firstly, observe that, although primal-dual methods have been extensively studied for solving deterministic saddle-point problems, it seems that these types of methods have not yet been generalized to stochastic SPP in the literature. Secondly, as noted in Section 1, one possible way to solve stochastic SPP is to apply the AC-SA algorithm of [14] to a certain smooth approximation of (1.1) by Nesterov [4]. However, the rate of convergence of this approach will depend on the variance of the stochastic gradients computed for the smooth approximation problem, which is usually unknown and difficult to characterize. On the other hand, the stochastic APD method described above works directly with the original problem without requiring the application of the smoothing technique, and its rate of convergence depends on the variance of the stochastic gradient operators computed for the original problem, i.e., σ²_{x,G}, σ_y² and σ²_{x,K} in A1. We will show that it achieves exactly the lower bound (1.7) on the rate of convergence for stochastic SPP.

Similarly to Section 2, we use the two gap functions g(·) and g̃(·,·), defined in (2.13) and (2.21), respectively, as the termination criteria for the stochastic APD algorithm, depending on whether the feasible set Z = X × Y is bounded or not. Since the algorithm is stochastic in nature, for both cases we establish its expected rate of convergence in terms of g(·) or g̃(·,·), i.e., the average rate of convergence over many runs of the algorithm. In addition, we show that if Z is bounded, then the convergence of the APD algorithm can be strengthened under the following light-tail assumption on the SO:

A2. E[exp{‖∇G(x) − Ĝ(x)‖²/σ²_{x,G}}] ≤ exp{1}, E[exp{‖Kx − K̂_x(x)‖²/σ_y²}] ≤ exp{1} and E[exp{‖K^Ty − K̂_y(y)‖²/σ²_{x,K}}] ≤ exp{1}.

It is easy to see that A2 implies A1 by Jensen's inequality.
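Relative to the deterministic sketch given after Algorithm 2, the only change is that the exact operators in the two prox steps are replaced by oracle samples. A minimal sketch (our own illustration; Euclidean proxes, J ≡ 0, oracle as in the sketch of Section 1):

```python
def stochastic_apd(oracle, proj_X, proj_Y, x1, y1, N, beta, theta, eta, tau):
    """Stochastic APD (Algorithm 3): Algorithm 2 with the exact operators
    in (2.5)-(2.6) replaced by stochastic oracle outputs, as in (3.1)-(3.2)."""
    x, x_ag, y_ag, x_bar = x1.copy(), x1.copy(), y1.copy(), x1.copy()
    y = y1.copy()
    for t in range(1, N + 1):
        b = 1.0 / beta(t)
        x_md = (1 - b) * x_ag + b * x
        _, Kx_hat, _ = oracle(x_bar, y)                   # sample K x_bar
        y = proj_Y(y + tau(t) * Kx_hat)                   # (3.1)
        g_hat, _, KTy_hat = oracle(x_md, y)               # sample at (x_md, y_{t+1})
        x_new = proj_X(x - eta(t) * (g_hat + KTy_hat))    # (3.2)
        x_ag = (1 - b) * x_ag + b * x_new
        y_ag = (1 - b) * y_ag + b * y
        x_bar = x_new + theta(t + 1) * (x_new - x)
        x = x_new
    return x_ag, y_ag
```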

Theorem 3.1 below summarizes the convergence properties of Algorithm 3 when Z is bounded. Note that the following quantity will be used in the statement of this result and in the convergence analysis of the APD algorithms (see Section 4):

  γ_1 = 1, γ_t = θ_t^{-1}γ_{t−1}, t ≥ 2.  (3.3)

Theorem 3.1. Suppose that (2.14) holds for some Ω_X, Ω_Y > 0. Also assume that for all t ≥ 1, the parameters β_t, θ_t, η_t and τ_t in Algorithm 3 satisfy (2.15), (2.16), and

  qα_X/η_t − L_G/β_t − L_K²τ_t/(pα_Y) ≥ 0  (3.4)

for some p, q ∈ (0, 1). Then:

a) Under assumption A1, for all t ≥ 1,

  E[g(z_{t+1}^{ag})] ≤ Q_0(t),  (3.5)

where

  Q_0(t) := (β_tγ_t)^{-1}[ (2γ_t/η_t)Ω_X + (2γ_t/τ_t)Ω_Y ] + (β_tγ_t)^{-1}∑_{i=1}^t [ ((2−q)η_iγ_iσ_x²)/(2(1−q)α_X) + ((2−p)τ_iγ_iσ_y²)/(2(1−p)α_Y) ].  (3.6)

b) Under assumption A2, for all λ > 0 and t ≥ 1,

  Prob{ g(z_{t+1}^{ag}) > Q_0(t) + λQ_1(t) } ≤ 3exp{−λ²/3} + 3exp{−λ},  (3.7)

where

  Q_1(t) := (β_tγ_t)^{-1}( σ_x(2Ω_X/α_X)^{1/2} + σ_y(2Ω_Y/α_Y)^{1/2} )(∑_{i=1}^t γ_i²)^{1/2} + (β_tγ_t)^{-1}∑_{i=1}^t [ ((2−q)η_iγ_iσ_x²)/(2(1−q)α_X) + ((2−p)τ_iγ_iσ_y²)/(2(1−p)α_Y) ].  (3.8)

We provide below a specific choice of the parameters β_t, θ_t, η_t and τ_t for the stochastic APD method when Z is bounded.

Corollary 3.2. Suppose that (2.14) holds and let D_X and D_Y be defined as in Corollary 2.2. In Algorithm 3, if the parameters are set to

  β_t = (t+1)/2, θ_t = (t−1)/t, η_t = (α_X D_X t)/(2(6L_G D_X + 3L_K N D_Y + 3σ_x N^{3/2})) and τ_t = (α_Y D_Y t)/(2(3L_K N D_X + 3σ_y N^{3/2})),  (3.9)

then under Assumption A1, (3.5) holds, and

  Q_0(N) ≤ (16L_G D_X²)/(N(N+1)) + (16L_K D_X D_Y)/N + (16(σ_x D_X + σ_y D_Y))/√N.  (3.10)

If in addition Assumption A2 holds, then for all λ > 0, (3.7) holds, and

  Q_1(N) ≤ (15σ_x D_X + 14σ_y D_Y)/√N.  (3.11)

Proof. First we check that the parameters in (3.9) satisfy the conditions of Theorem 3.1. Inequalities (2.15) and (2.16) can be checked easily. Furthermore, setting p = q = 2/3, we have for all 1 ≤ t ≤ N,

  qα_X/η_t − L_G/β_t − L_K²τ_t/(pα_Y) ≥ (4(L_G D_X + L_K N D_Y))/(D_X t) − (2L_G)/t − (L_K D_Y t)/(4N D_X) ≥ 0,

thus (3.4) holds, and hence Theorem 3.1 applies. It now suffices to show that (3.10) and (3.11) hold. Observe that by (3.3) and (3.9), we have γ_t = t, and hence ∑_{i=1}^t γ_i² ≤ (t+1)³/3. Moreover, since η_iγ_i ≤ (α_X D_X i²)/(6σ_x N^{3/2}) and τ_iγ_i ≤ (α_Y D_Y i²)/(6σ_y N^{3/2}), we have

  (1/γ_t)∑_{i=1}^t η_iγ_i ≤ (α_X D_X(t+1)²)/(9σ_x √N) and (1/γ_t)∑_{i=1}^t τ_iγ_i ≤ (α_Y D_Y(t+1)²)/(9σ_y √N).

Applying the above bounds to (3.6) and (3.8) and simplifying the constants, we see that (3.10) and (3.11) hold. □
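As with Corollary 2.2, the policy (3.9) is directly computable from the problem constants and the variance bounds. A small helper in the same style as before (our own wrapper; here α_X = α_Y = 1):

```python
def stochastic_apd_params(L_G, L_K, N, D_X, D_Y, sigma_x, sigma_y):
    """Stepsize policy of Corollary 3.2 for Algorithm 3 (bounded Z)."""
    beta = lambda t: (t + 1) / 2.0
    theta = lambda t: (t - 1.0) / t
    den_x = 2.0 * (6 * L_G * D_X + 3 * L_K * N * D_Y + 3 * sigma_x * N**1.5)
    den_y = 2.0 * (3 * L_K * N * D_X + 3 * sigma_y * N**1.5)
    eta = lambda t: D_X * t / den_x
    tau = lambda t: D_Y * t / den_y
    return beta, theta, eta, tau
```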

Comparing the rate of convergence established in (3.10) with the lower bound in (1.7), we clearly see that the stochastic APD algorithm is an optimal method for solving stochastic saddle-point problems. More specifically, in view of (3.10), this algorithm allows very large Lipschitz constants L_G (as big as O(N^{3/2})) and L_K (as big as O(N^{1/2})) without significantly affecting its rate of convergence.

We now present the convergence results for the stochastic APD method applied to stochastic saddle-point problems with possibly unbounded feasible set Z. It appears that solution methods for these types of problems have not been well studied in the literature.

Theorem 3.3. Let {z_t^{ag}} = {(x_t^{ag}, y_t^{ag})} be the iterates generated by Algorithm 3 with V_X(x, x_t) = ‖x − x_t‖²/2 and V_Y(y, y_t) = ‖y − y_t‖²/2. Assume that the parameters β_t, θ_t, η_t and τ_t satisfy (2.15), (2.22) and (3.4) for all t ≥ 1 and some p, q ∈ (0, 1). Then there exists a perturbation vector v_{t+1} such that

  E[g̃(z_{t+1}^{ag}, v_{t+1})] ≤ ((6−4p)D² + (5−3p)C_t)/(β_tη_t(1−p)) =: ε_{t+1}  (3.12)

for any t ≥ 1. Moreover, we have

  E[‖v_{t+1}‖] ≤ (2‖x̂ − x_1‖)/(β_tη_t) + (2‖ŷ − y_1‖)/(β_tτ_t) + (D + √C_t)[ (2√2)/(β_tη_t) + (1/(β_tτ_t))(2τ_t/(η_t(1−p)))^{1/2} + (2√2 L_K)/β_t ],  (3.13)

where (x̂, ŷ) is a pair of solutions of problem (1.1), D is defined in (2.26) and

  C_t := ∑_{i=1}^t (η_i²σ_x²)/(1−q) + ∑_{i=1}^t (η_iτ_iσ_y²)/(1−p).  (3.14)

Below we specialize the results of Theorem 3.3 by choosing a set of parameters satisfying (2.15), (2.22) and (3.4).

Corollary 3.4. In Algorithm 3, if N is given and the parameters are set to

  β_t = (t+1)/2, θ_t = (t−1)/t, η_t = (3t)/(4η̃), and τ_t = t/η̃,  (3.15)

where

  η̃ = 2(L_G + L_K(N+1)) + σN^{3/2}/D̃ for some D̃ > 0, σ := ((9/4)σ_x² + σ_y²)^{1/2},  (3.16)

then there exists v_N that satisfies (3.12) with

  ε_N ≤ (36L_G D²)/N² + (36L_K D²)/N + (σD(18D/D̃ + 6D̃/D))/√N,  (3.17)
  E[‖v_N‖] ≤ (50L_G D)/N² + (L_K D(55 + 4D̃/D))/N + (σ(9 + 15D/D̃))/√N,  (3.18)

where D is defined in (2.26).

Proof. For the parameters in (3.15), it is clear that (2.15) and (2.22) hold. Furthermore, let p = 1/4 and q = 3/4; then for all t = 1, ..., N, we have

  q/η_t − L_G/β_t − L_K²τ_t/p = η̃/t − 2L_G/(t+1) − 4L_K²t/η̃ ≥ 2L_G(1/t − 1/(t+1)) + 2L_K((N+1)/t − t/(N+1)) ≥ 0,

thus (3.4) holds. By Theorem 3.3, we obtain (3.12) and (3.13). Note that η_t/τ_t = 3/4, so that in (3.12) and (3.13) we have D ≤ D̂, (1/(β_tη_t))‖x̂ − x_1‖ ≤ D/(β_tη_t) and (1/(β_tτ_t))‖ŷ − y_1‖ ≤ (√3 D)/(β_tη_t)·(√3/2), hence

  ε_{t+1} ≤ (20(D² + C_t))/(3β_tη_t),  (3.19)
  E[‖v_{t+1}‖] ≤ ((2 + √3)D + 3√2(D + √C_t))/(β_tη_t) + (2√2 L_K(D + √C_t))/β_t.  (3.20)

By (3.14) and the fact that ∑_{i=1}^t i² ≤ (t+1)³/3, we have

  C_t = ∑_{i=1}^t (i²/η̃²)((9/4)σ_x² + σ_y²) = (σ²/η̃²)∑_{i=1}^t i² ≤ (σ²(t+1)³)/(3η̃²).

Applying the above bound to (3.19) and (3.20), and using (3.16) and the fact that (D² + C_t)^{1/2} ≤ D + √C_t, we obtain (3.17) and (3.18) after simplifying the constants. □

Observe that the parameter settings in (3.15)-(3.16) are more complicated than those in (2.27) for the deterministic unbounded case. In particular, for the stochastic unbounded case we need to choose a parameter D̃ which is not required in the deterministic case. Clearly, the optimal selection of D̃, minimizing the RHS of (3.17), is given by √3 D. Note, however, that the value of D is very difficult to estimate in the unbounded case, and hence one often has to resort to a suboptimal selection of D̃. For example, if D̃ = 1, then the RHS of (3.17) and (3.18) become O(L_GD²/N² + L_KD²/N + σD²/√N) and O(L_GD/N² + L_KD/N + σD/√N), respectively.

4. Convergence analysis. Our goal in this section is to prove the main results presented in Sections 2 and 3, namely, Theorems 2.1, 2.3, 3.1 and 3.3.

4.1. Convergence analysis for the deterministic APD algorithm. In this subsection, we prove Theorems 2.1 and 2.3, which describe the convergence properties of the deterministic APD algorithm for bounded and unbounded SPP, respectively. Before proving Theorem 2.1, we first establish two technical results: Proposition 4.1 shows some important properties of the function Q(·,·) in (2.12), and Lemma 4.2 establishes a bound on Q(z^{ag}_{t+1}, z).

Proposition 4.1. Assume that β_t ≥ 1 for all t. If z^{ag}_{t+1} = (x^{ag}_{t+1}, y^{ag}_{t+1}) is generated by Algorithm 2, then for all z = (x, y) ∈ Z,

  β_tQ(z^{ag}_{t+1}, z) − (β_t − 1)Q(z^{ag}_t, z) ≤ ⟨∇G(x_t^{md}), x_{t+1} − x⟩ + (L_G/(2β_t))‖x_{t+1} − x_t‖² + [J(y_{t+1}) − J(y)] + ⟨Kx_{t+1}, y⟩ − ⟨Kx, y_{t+1}⟩.  (4.1)

Proof. By equations (2.4) and (2.7), x^{ag}_{t+1} − x_t^{md} = β_t^{-1}(x_{t+1} − x_t). Using this observation, the convexity of G(·) and (1.2), we have

  β_tG(x^{ag}_{t+1}) ≤ β_tG(x_t^{md}) + β_t⟨∇G(x_t^{md}), x^{ag}_{t+1} − x_t^{md}⟩ + (β_tL_G/2)‖x^{ag}_{t+1} − x_t^{md}‖²
  = β_tG(x_t^{md}) + β_t⟨∇G(x_t^{md}), x^{ag}_{t+1} − x_t^{md}⟩ + (L_G/(2β_t))‖x_{t+1} − x_t‖²
  = (β_t − 1)[G(x_t^{md}) + ⟨∇G(x_t^{md}), x_t^{ag} − x_t^{md}⟩] + [G(x_t^{md}) + ⟨∇G(x_t^{md}), x_t − x_t^{md}⟩] + ⟨∇G(x_t^{md}), x_{t+1} − x_t⟩ + (L_G/(2β_t))‖x_{t+1} − x_t‖²
  = (β_t − 1)[G(x_t^{md}) + ⟨∇G(x_t^{md}), x_t^{ag} − x_t^{md}⟩] + [G(x_t^{md}) + ⟨∇G(x_t^{md}), x − x_t^{md}⟩] + ⟨∇G(x_t^{md}), x_{t+1} − x⟩ + (L_G/(2β_t))‖x_{t+1} − x_t‖²
  ≤ (β_t − 1)G(x_t^{ag}) + G(x) + ⟨∇G(x_t^{md}), x_{t+1} − x⟩ + (L_G/(2β_t))‖x_{t+1} − x_t‖².

Moreover, by (2.8) and the convexity of J(·), we have

  β_tJ(y^{ag}_{t+1}) − β_tJ(y) ≤ (β_t − 1)[J(y_t^{ag}) − J(y)] + J(y_{t+1}) − J(y).

By (2.12), (2.7), (2.8) and the above two inequalities, we obtain

  β_tQ(z^{ag}_{t+1}, z) − (β_t − 1)Q(z^{ag}_t, z)
  = β_tG(x^{ag}_{t+1}) − (β_t − 1)G(x^{ag}_t) − G(x) + β_t[J(y^{ag}_{t+1}) − J(y)] − (β_t − 1)[J(y^{ag}_t) − J(y)] + ⟨K(β_tx^{ag}_{t+1} − (β_t − 1)x^{ag}_t), y⟩ − ⟨Kx, β_ty^{ag}_{t+1} − (β_t − 1)y^{ag}_t⟩
  ≤ ⟨∇G(x_t^{md}), x_{t+1} − x⟩ + (L_G/(2β_t))‖x_{t+1} − x_t‖² + J(y_{t+1}) − J(y) + ⟨Kx_{t+1}, y⟩ − ⟨Kx, y_{t+1}⟩. □

Lemma 4.2 establishes a bound on Q(z^{ag}_{t+1}, z) for all z ∈ Z, which will be used in the proofs of both Theorems 2.1 and 2.3.

Lemma 4.2. Let z^{ag}_{t+1} = (x^{ag}_{t+1}, y^{ag}_{t+1}) be the iterates generated by Algorithm 2. Assume that the parameters β_t, θ_t, η_t and τ_t satisfy (2.15), (2.16) and (2.17). Then, for any z ∈ Z, we have

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ B_t(z, z_{[t]}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − (γ_t/2)(α_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖²,  (4.2)

where γ_t is defined in (3.3), z_{[t]} := {(x_i, y_i)}_{i=1}^{t+1}, and

  B_t(z, z_{[t]}) := ∑_{i=1}^t { (γ_i/η_i)[V_X(x, x_i) − V_X(x, x_{i+1})] + (γ_i/τ_i)[V_Y(y, y_i) − V_Y(y, y_{i+1})] }.  (4.3)

Proof. First we explore the optimality conditions of iterations (2.5) and (2.6). Applying Lemma 2 of [5] to (2.5), we have

  ⟨−Kx̄_t, y_{t+1} − y⟩ + J(y_{t+1}) − J(y) ≤ (1/τ_t)V_Y(y, y_t) − (1/τ_t)V_Y(y_{t+1}, y_t) − (1/τ_t)V_Y(y, y_{t+1}) ≤ (1/τ_t)V_Y(y, y_t) − (α_Y/(2τ_t))‖y_{t+1} − y_t‖² − (1/τ_t)V_Y(y, y_{t+1}),  (4.4)

where the last inequality follows from the fact that, by the strong convexity of d_Y(·) and (2.11),

  V_Y(y, ỹ) ≥ (α_Y/2)‖y − ỹ‖², for all y, ỹ ∈ Y.  (4.5)

Similarly, from (2.6) we can derive that

  ⟨∇G(x_t^{md}), x_{t+1} − x⟩ + ⟨x_{t+1} − x, K^Ty_{t+1}⟩ ≤ (1/η_t)V_X(x, x_t) − (α_X/(2η_t))‖x_{t+1} − x_t‖² − (1/η_t)V_X(x, x_{t+1}).  (4.6)

Our next step is to establish a crucial recursion for Algorithm 2. It follows from (4.1), (4.4) and (4.6) that

  β_tQ(z^{ag}_{t+1}, z) − (β_t − 1)Q(z^{ag}_t, z)
  ≤ (1/η_t)V_X(x, x_t) − (1/η_t)V_X(x, x_{t+1}) − (α_X/(2η_t) − L_G/(2β_t))‖x_{t+1} − x_t‖² + (1/τ_t)V_Y(y, y_t) − (1/τ_t)V_Y(y, y_{t+1}) − (α_Y/(2τ_t))‖y_{t+1} − y_t‖² − ⟨x_{t+1} − x, K^Ty_{t+1}⟩ + ⟨Kx̄_t, y_{t+1} − y⟩ + ⟨Kx_{t+1}, y⟩ − ⟨Kx, y_{t+1}⟩.  (4.7)

Also observe that by (2.9),

  −⟨x_{t+1} − x, K^Ty_{t+1}⟩ + ⟨Kx̄_t, y_{t+1} − y⟩ + ⟨Kx_{t+1}, y⟩ − ⟨Kx, y_{t+1}⟩
  = ⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − θ_t⟨K(x_t − x_{t−1}), y − y_{t+1}⟩
  = ⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − θ_t⟨K(x_t − x_{t−1}), y − y_t⟩ − θ_t⟨K(x_t − x_{t−1}), y_t − y_{t+1}⟩.

Multiplying both sides of (4.7) by γ_t, using the above identity and the fact that γ_tθ_t = γ_{t−1} due to (3.3), we obtain

  β_tγ_tQ(z^{ag}_{t+1}, z) − (β_t − 1)γ_tQ(z^{ag}_t, z)
  ≤ (γ_t/η_t)V_X(x, x_t) − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_t/τ_t)V_Y(y, y_t) − (γ_t/τ_t)V_Y(y, y_{t+1}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − γ_{t−1}⟨K(x_t − x_{t−1}), y − y_t⟩ − (γ_t/2)(α_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² − (α_Yγ_t/(2τ_t))‖y_{t+1} − y_t‖² − γ_{t−1}⟨K(x_t − x_{t−1}), y_t − y_{t+1}⟩.  (4.8)

Now, applying the Cauchy-Schwarz inequality to the last term in (4.8), using the notation L_K in (1.2) and noticing that γ_{t−1}/γ_t = θ_t ≤ min{η_{t−1}/η_t, τ_{t−1}/τ_t} from (2.16), we have

  −γ_{t−1}⟨K(x_t − x_{t−1}), y_t − y_{t+1}⟩ ≤ L_Kγ_{t−1}‖x_t − x_{t−1}‖‖y_t − y_{t+1}‖ ≤ (L_K²γ_{t−1}²τ_t/(2α_Yγ_t))‖x_t − x_{t−1}‖² + (α_Yγ_t/(2τ_t))‖y_t − y_{t+1}‖² ≤ (L_K²γ_{t−1}τ_{t−1}/(2α_Y))‖x_t − x_{t−1}‖² + (α_Yγ_t/(2τ_t))‖y_t − y_{t+1}‖².

Noting that θ_{t+1} = γ_t/γ_{t+1}, by (2.15) we have (β_{t+1} − 1)γ_{t+1} = β_tγ_t. Combining the above two relations with inequality (4.8), we get the following recursion for Algorithm 2:

  (β_{t+1} − 1)γ_{t+1}Q(z^{ag}_{t+1}, z) − (β_t − 1)γ_tQ(z^{ag}_t, z) = β_tγ_tQ(z^{ag}_{t+1}, z) − (β_t − 1)γ_tQ(z^{ag}_t, z)
  ≤ (γ_t/η_t)V_X(x, x_t) − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_t/τ_t)V_Y(y, y_t) − (γ_t/τ_t)V_Y(y, y_{t+1}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − γ_{t−1}⟨K(x_t − x_{t−1}), y − y_t⟩ − (γ_t/2)(α_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² + (L_K²γ_{t−1}τ_{t−1}/(2α_Y))‖x_t − x_{t−1}‖², ∀t ≥ 1.

Applying the above inequality inductively and assuming that x_0 = x_1, we conclude that

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ B_t(z, z_{[t]}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − (γ_t/2)(α_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² − ∑_{i=1}^{t−1} (γ_i/2)(α_X/η_i − L_G/β_i − L_K²τ_i/α_Y)‖x_{i+1} − x_i‖²,

which, in view of (2.17) and the facts that β_1 = 1 and (β_{t+1} − 1)γ_{t+1} = β_tγ_t by (2.15), implies (4.2). □

We are now ready to prove Theorem 2.1, which follows as an immediate consequence of Lemma 4.2.

Proof of Theorem 2.1. Let B_t(z, z_{[t]}) be defined in (4.3). First note that by the definition of γ_t in (3.3) and relation (2.16), we have θ_t = γ_{t−1}/γ_t ≤ η_{t−1}/η_t and hence γ_{t−1}/η_{t−1} ≤ γ_t/η_t. Using this observation and (2.14), we conclude that

  B_t(z, z_{[t]}) = (γ_1/η_1)V_X(x, x_1) − ∑_{i=1}^{t−1}(γ_i/η_i − γ_{i+1}/η_{i+1})V_X(x, x_{i+1}) − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_1/τ_1)V_Y(y, y_1) − ∑_{i=1}^{t−1}(γ_i/τ_i − γ_{i+1}/τ_{i+1})V_Y(y, y_{i+1}) − (γ_t/τ_t)V_Y(y, y_{t+1})
  ≤ (γ_1/η_1)Ω_X − ∑_{i=1}^{t−1}(γ_i/η_i − γ_{i+1}/η_{i+1})Ω_X − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_1/τ_1)Ω_Y − ∑_{i=1}^{t−1}(γ_i/τ_i − γ_{i+1}/τ_{i+1})Ω_Y − (γ_t/τ_t)V_Y(y, y_{t+1})
  = (γ_t/η_t)Ω_X − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_t/τ_t)Ω_Y − (γ_t/τ_t)V_Y(y, y_{t+1}).  (4.9)

Now applying the Cauchy-Schwarz inequality to the inner product term in (4.2), we get

  γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ ≤ L_Kγ_t‖x_{t+1} − x_t‖‖y − y_{t+1}‖ ≤ (L_K²γ_tτ_t/(2α_Y))‖x_{t+1} − x_t‖² + (α_Yγ_t/(2τ_t))‖y − y_{t+1}‖².  (4.10)

Using the above two relations, (2.17), (4.2) and (4.5), we have

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ (γ_t/η_t)Ω_X + (γ_t/τ_t)Ω_Y − (γ_t/τ_t)[V_Y(y, y_{t+1}) − (α_Y/2)‖y − y_{t+1}‖²] − (γ_t/2)(α_X/η_t − L_G/β_t − L_K²τ_t/α_Y)‖x_{t+1} − x_t‖² ≤ (γ_t/η_t)Ω_X + (γ_t/τ_t)Ω_Y, ∀z ∈ Z,

which together with (2.13) clearly implies (2.18). □

Our goal in the remaining part of this subsection is to prove Theorem 2.3, which summarizes the convergence properties of Algorithm 2 when X or Y is unbounded. We first prove a technical result which specializes Lemma 4.2 to the case when (2.15), (2.22) and (2.23) hold.

Lemma 4.3. Let ẑ = (x̂, ŷ) ∈ Z be a saddle point of (1.1). If V_X(x, x_t) = ‖x − x_t‖²/2 and V_Y(y, y_t) = ‖y − y_t‖²/2 in Algorithm 2, and the parameters β_t, θ_t, η_t and τ_t satisfy (2.15), (2.22) and (2.23), then

a) ‖x̂ − x_{t+1}‖² + ((1−p)η_t/τ_t)‖ŷ − y_{t+1}‖² ≤ ‖x̂ − x_1‖² + (η_t/τ_t)‖ŷ − y_1‖², for all t ≥ 1;  (4.11)

b) g̃(z^{ag}_{t+1}, v_{t+1}) ≤ (1/(2β_tη_t))‖x^{ag}_{t+1} − x_1‖² + (1/(2β_tτ_t))‖y^{ag}_{t+1} − y_1‖² =: δ_{t+1}, for all t ≥ 1,  (4.12)

where g̃(·,·) is defined in (2.21) and

  v_{t+1} = ( (x_1 − x_{t+1})/(β_tη_t), (y_1 − y_{t+1})/(β_tτ_t) − K(x_{t+1} − x_t)/β_t ).  (4.13)

Proof. It is easy to check that the conditions of Lemma 4.2 are satisfied. By (2.22), (4.2) in Lemma 4.2 becomes

  β_tQ(z^{ag}_{t+1}, z) ≤ (1/(2η_t))‖x − x_1‖² − (1/(2η_t))‖x − x_{t+1}‖² + (1/(2τ_t))‖y − y_1‖² − (1/(2τ_t))‖y − y_{t+1}‖² + ⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − (1/2)(1/η_t − L_G/β_t)‖x_{t+1} − x_t‖².  (4.14)

To prove (4.11), observe that

  ⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ ≤ (L_K²τ_t/(2p))‖x_{t+1} − x_t‖² + (p/(2τ_t))‖y − y_{t+1}‖²,  (4.15)

where p is the constant in (2.23). By (2.23) and the above two inequalities, we get

  β_tQ(z^{ag}_{t+1}, z) ≤ (1/(2η_t))‖x − x_1‖² − (1/(2η_t))‖x − x_{t+1}‖² + (1/(2τ_t))‖y − y_1‖² − ((1−p)/(2τ_t))‖y − y_{t+1}‖².

Letting z = ẑ above, and using the fact that Q(z^{ag}_{t+1}, ẑ) ≥ 0, we obtain (4.11).

Now we prove (4.12). Noting the identity

  ‖x − x_1‖² − ‖x − x_{t+1}‖² = 2⟨x_{t+1} − x_1, x − x^{ag}_{t+1}⟩ + ‖x^{ag}_{t+1} − x_1‖² − ‖x^{ag}_{t+1} − x_{t+1}‖²,  (4.16)

we conclude from (2.21) and (4.14) that for any z ∈ Z,

  β_tQ(z^{ag}_{t+1}, z) + ⟨K(x_{t+1} − x_t), y^{ag}_{t+1} − y⟩ − (1/η_t)⟨x_{t+1} − x_1, x^{ag}_{t+1} − x⟩ − (1/τ_t)⟨y_{t+1} − y_1, y^{ag}_{t+1} − y⟩
  ≤ (1/(2η_t))(‖x^{ag}_{t+1} − x_1‖² − ‖x^{ag}_{t+1} − x_{t+1}‖²) + (1/(2τ_t))(‖y^{ag}_{t+1} − y_1‖² − ‖y^{ag}_{t+1} − y_{t+1}‖²) + ⟨K(x_{t+1} − x_t), y^{ag}_{t+1} − y_{t+1}⟩ − (1/2)(1/η_t − L_G/β_t)‖x_{t+1} − x_t‖²
  ≤ (1/(2η_t))‖x^{ag}_{t+1} − x_1‖² + (1/(2τ_t))‖y^{ag}_{t+1} − y_1‖²,

where the last inequality follows from (2.23) and an argument similar to (4.15). The results (4.12) and (4.13) immediately follow from the above inequality and (2.21). □

We are now ready to prove Theorem 2.3.

Proof of Theorem 2.3. We have established the expressions of v_{t+1} and δ_{t+1} in Lemma 4.3; it suffices to bound ‖v_{t+1}‖ and δ_{t+1}. It follows from the definition of D in (2.26) and (4.11) that for all t ≥ 1,

  ‖x̂ − x_{t+1}‖ ≤ D and ‖ŷ − y_{t+1}‖ ≤ D(τ_t/(η_t(1−p)))^{1/2}.

Now by (4.13), we have

  ‖v_{t+1}‖ ≤ (1/(β_tη_t))‖x_1 − x_{t+1}‖ + (1/(β_tτ_t))‖y_1 − y_{t+1}‖ + (L_K/β_t)‖x_{t+1} − x_t‖
  ≤ (1/(β_tη_t))(‖x̂ − x_1‖ + ‖x̂ − x_{t+1}‖) + (1/(β_tτ_t))(‖ŷ − y_1‖ + ‖ŷ − y_{t+1}‖) + (L_K/β_t)(‖x̂ − x_{t+1}‖ + ‖x̂ − x_t‖)
  ≤ (1/(β_tη_t))‖x̂ − x_1‖ + (1/(β_tτ_t))‖ŷ − y_1‖ + D[ 1/(β_tη_t) + 1/(β_t(η_tτ_t(1−p))^{1/2}) + 2L_K/β_t ],

which gives (2.25). To estimate the bound on δ_{t+1}, consider the sequence {γ_t} defined in (3.3). Using the fact that (β_{t+1} − 1)γ_{t+1} = β_tγ_t due to (2.15) and (3.3), and applying (2.7) and (2.8) inductively, we have

  x^{ag}_{t+1} = (1/(β_tγ_t))∑_{i=1}^t γ_ix_{i+1}, y^{ag}_{t+1} = (1/(β_tγ_t))∑_{i=1}^t γ_iy_{i+1} and (1/(β_tγ_t))∑_{i=1}^t γ_i = 1.  (4.17)

Thus x^{ag}_{t+1} and y^{ag}_{t+1} are convex combinations of the sequences {x_{i+1}}_{i=1}^t and {y_{i+1}}_{i=1}^t. Using these relations and (4.11), we have

  δ_{t+1} = (1/(2β_tη_t))‖x^{ag}_{t+1} − x_1‖² + (1/(2β_tτ_t))‖y^{ag}_{t+1} − y_1‖²
  ≤ (1/(β_tη_t))(‖x̂ − x^{ag}_{t+1}‖² + ‖x̂ − x_1‖²) + (1/(β_tτ_t))(‖ŷ − y^{ag}_{t+1}‖² + ‖ŷ − y_1‖²)
  ≤ (1/(β_tη_t))[ D² + (1/(β_tγ_t))∑_{i=1}^t γ_i( ‖x̂ − x_{i+1}‖² + ((1−p)η_t/τ_t)‖ŷ − y_{i+1}‖² + (pη_t/τ_t)‖ŷ − y_{i+1}‖² ) ]
  ≤ (1/(β_tη_t))[ D² + D² + (p/(1−p))D² ] = ((2−p)D²)/(β_tη_t(1−p)). □

4.2. Convergence analysis for the stochastic APD algorithm. In this subsection, we prove Theorems 3.1 and 3.3, which describe the convergence properties of the stochastic APD algorithm presented in Section 3.

Let Ĝ(x_t^{md}), K̂_x(x̄_t) and K̂_y(y_{t+1}) be the output of the SO at the t-th iteration of Algorithm 3. Throughout this subsection, we denote

  Δ_t^{x,G} := Ĝ(x_t^{md}) − ∇G(x_t^{md}), Δ_t^{x,K} := K̂_y(y_{t+1}) − K^Ty_{t+1}, Δ_t^y := −K̂_x(x̄_t) + Kx̄_t, Δ_t^x := Δ_t^{x,G} + Δ_t^{x,K} and Δ_t := (Δ_t^x, Δ_t^y).

Moreover, for a given z = (x, y) ∈ Z, let us denote ‖z‖² := ‖x‖² + ‖y‖², with associated dual norm ‖Δ‖²_* := ‖Δ^x‖²_* + ‖Δ^y‖²_* for Δ = (Δ^x, Δ^y). We also define the Bregman divergence V(z, z̃) := V_X(x, x̃) + V_Y(y, ỹ) for z = (x, y) and z̃ = (x̃, ỹ).

Before proving Theorem 3.1, we first estimate a bound on Q(z^{ag}_{t+1}, z) for all z ∈ Z. This result is analogous to Lemma 4.2 for the deterministic APD method.

Lemma 4.4. Let z^{ag}_t = (x^{ag}_t, y^{ag}_t) be the iterates generated by Algorithm 3. Assume that the parameters β_t, θ_t, η_t and τ_t satisfy (2.15), (2.16) and (3.4). Then, for any z ∈ Z, we have

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ B_t(z, z_{[t]}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − (γ_t/2)(qα_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² + ∑_{i=1}^t Λ_i(z),  (4.18)

where γ_t and B_t(z, z_{[t]}) are defined in (3.3) and (4.3), respectively, z_{[t]} = {(x_i, y_i)}_{i=1}^{t+1} and

  Λ_i(z) := −((1−q)α_Xγ_i/(2η_i))‖x_{i+1} − x_i‖² − ((1−p)α_Yγ_i/(2τ_i))‖y_{i+1} − y_i‖² − γ_i⟨Δ_i, z_{i+1} − z⟩.  (4.19)

Proof. Similar to (4.4) and (4.6), we conclude from the optimality conditions of (3.1) and (3.2) that

  ⟨−K̂_x(x̄_t), y_{t+1} − y⟩ + J(y_{t+1}) − J(y) ≤ (1/τ_t)V_Y(y, y_t) − (α_Y/(2τ_t))‖y_{t+1} − y_t‖² − (1/τ_t)V_Y(y, y_{t+1}),
  ⟨Ĝ(x_t^{md}), x_{t+1} − x⟩ + ⟨x_{t+1} − x, K̂_y(y_{t+1})⟩ ≤ (1/η_t)V_X(x, x_t) − (α_X/(2η_t))‖x_{t+1} − x_t‖² − (1/η_t)V_X(x, x_{t+1}).

Now we establish an important recursion for Algorithm 3. Observing that Proposition 4.1 also holds for Algorithm 3, and applying the above two inequalities to (4.1) in Proposition 4.1, similar to (4.8) we have

  β_tγ_tQ(z^{ag}_{t+1}, z) − (β_t − 1)γ_tQ(z^{ag}_t, z)
  ≤ (γ_t/η_t)V_X(x, x_t) − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_t/τ_t)V_Y(y, y_t) − (γ_t/τ_t)V_Y(y, y_{t+1}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − γ_{t−1}⟨K(x_t − x_{t−1}), y − y_t⟩ − (γ_t/2)(α_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² − (α_Yγ_t/(2τ_t))‖y_{t+1} − y_t‖² − γ_{t−1}⟨K(x_t − x_{t−1}), y_t − y_{t+1}⟩ − γ_t⟨Δ_t^{x,G} + Δ_t^{x,K}, x_{t+1} − x⟩ − γ_t⟨Δ_t^y, y_{t+1} − y⟩, ∀z ∈ Z.  (4.20)

By the Cauchy-Schwarz inequality and (2.16), for all p ∈ (0, 1),

  −γ_{t−1}⟨K(x_t − x_{t−1}), y_t − y_{t+1}⟩ ≤ L_Kγ_{t−1}‖x_t − x_{t−1}‖‖y_t − y_{t+1}‖ ≤ (L_K²γ_{t−1}²τ_t/(2pα_Yγ_t))‖x_t − x_{t−1}‖² + (pα_Yγ_t/(2τ_t))‖y_t − y_{t+1}‖² ≤ (L_K²γ_{t−1}τ_{t−1}/(2pα_Y))‖x_t − x_{t−1}‖² + (pα_Yγ_t/(2τ_t))‖y_t − y_{t+1}‖².  (4.21)

By (2.15), (4.19), (4.20) and (4.21), we can develop the following recursion for Algorithm 3:

  (β_{t+1} − 1)γ_{t+1}Q(z^{ag}_{t+1}, z) − (β_t − 1)γ_tQ(z^{ag}_t, z) = β_tγ_tQ(z^{ag}_{t+1}, z) − (β_t − 1)γ_tQ(z^{ag}_t, z)
  ≤ (γ_t/η_t)V_X(x, x_t) − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_t/τ_t)V_Y(y, y_t) − (γ_t/τ_t)V_Y(y, y_{t+1}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − γ_{t−1}⟨K(x_t − x_{t−1}), y − y_t⟩ − (γ_t/2)(qα_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² + (L_K²γ_{t−1}τ_{t−1}/(2pα_Y))‖x_t − x_{t−1}‖² + Λ_t(z), ∀z ∈ Z.

Applying the above inequality inductively and assuming that x_0 = x_1, we obtain

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ B_t(z, z_{[t]}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ − (γ_t/2)(qα_X/η_t − L_G/β_t)‖x_{t+1} − x_t‖² − ∑_{i=1}^{t−1}(γ_i/2)(qα_X/η_i − L_G/β_i − L_K²τ_i/(pα_Y))‖x_{i+1} − x_i‖² + ∑_{i=1}^t Λ_i(z), ∀z ∈ Z.

Relation (4.18) then follows immediately from the above inequality, (2.15) and (3.4). □

We also need the following technical result, whose proof is based on Lemma 2.1 of [35].

Lemma 4.5. Let η_i, τ_i and γ_i, i = 1, 2, ..., be given positive constants. For any z_1 ∈ Z, if we define z_1^v = z_1 and

  z^v_{i+1} = argmin_{z=(x,y)∈Z} { −η_i⟨Δ_i^x, x⟩ − τ_i⟨Δ_i^y, y⟩ + V(z, z_i^v) },  (4.22)

then

  ∑_{i=1}^t γ_i⟨−Δ_i, z_i^v − z⟩ ≤ B_t(z, z^v_{[t]}) + ∑_{i=1}^t (η_iγ_i/(2α_X))‖Δ_i^x‖²_* + ∑_{i=1}^t (τ_iγ_i/(2α_Y))‖Δ_i^y‖²_*,  (4.23)

where z^v_{[t]} := {z_i^v}_{i=1}^{t+1} and B_t(z, z^v_{[t]}) is defined in (4.3).

Proof. Noting that (4.22) implies z^v_{i+1} = (x^v_{i+1}, y^v_{i+1}) with x^v_{i+1} = argmin_{x∈X}{−η_i⟨Δ_i^x, x⟩ + V_X(x, x_i^v)} and y^v_{i+1} = argmin_{y∈Y}{−τ_i⟨Δ_i^y, y⟩ + V_Y(y, y_i^v)}, from Lemma 2.1 of [35] we have

  V_X(x, x^v_{i+1}) ≤ V_X(x, x_i^v) − η_i⟨Δ_i^x, x − x_i^v⟩ + (η_i²/(2α_X))‖Δ_i^x‖²_* and V_Y(y, y^v_{i+1}) ≤ V_Y(y, y_i^v) − τ_i⟨Δ_i^y, y − y_i^v⟩ + (τ_i²/(2α_Y))‖Δ_i^y‖²_*

for all i ≥ 1. Thus

  (γ_i/η_i)V_X(x, x^v_{i+1}) − (γ_i/η_i)V_X(x, x_i^v) ≤ −γ_i⟨Δ_i^x, x − x_i^v⟩ + (γ_iη_i/(2α_X))‖Δ_i^x‖²_*, and (γ_i/τ_i)V_Y(y, y^v_{i+1}) − (γ_i/τ_i)V_Y(y, y_i^v) ≤ −γ_i⟨Δ_i^y, y − y_i^v⟩ + (γ_iτ_i/(2α_Y))‖Δ_i^y‖²_*.

Adding the above two inequalities together and summing from i = 1 to t, we get

  0 ≤ B_t(z, z^v_{[t]}) + ∑_{i=1}^t γ_i⟨Δ_i, z − z_i^v⟩ + ∑_{i=1}^t (γ_iη_i/(2α_X))‖Δ_i^x‖²_* + ∑_{i=1}^t (γ_iτ_i/(2α_Y))‖Δ_i^y‖²_*,

so (4.23) holds. □

We are now ready to prove Theorem 3.1.

Proof of Theorem 3.1. Firstly, applying the bounds in (4.9) and (4.10) to (4.18), we get

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ (γ_t/η_t)Ω_X − (γ_t/η_t)V_X(x, x_{t+1}) + (γ_t/τ_t)Ω_Y − (γ_t/τ_t)V_Y(y, y_{t+1}) + (α_Yγ_t/(2τ_t))‖y − y_{t+1}‖² − (γ_t/2)(qα_X/η_t − L_G/β_t − L_K²τ_t/α_Y)‖x_{t+1} − x_t‖² + ∑_{i=1}^t Λ_i(z) ≤ (γ_t/η_t)Ω_X + (γ_t/τ_t)Ω_Y + ∑_{i=1}^t Λ_i(z), ∀z ∈ Z.  (4.24)

By (4.19), we have

  Λ_i(z) = −((1−q)α_Xγ_i/(2η_i))‖x_{i+1} − x_i‖² − ((1−p)α_Yγ_i/(2τ_i))‖y_{i+1} − y_i‖² − γ_i⟨Δ_i, z_{i+1} − z_i⟩ − γ_i⟨Δ_i, z_i − z⟩
  ≤ (η_iγ_i/(2(1−q)α_X))‖Δ_i^x‖²_* + (τ_iγ_i/(2(1−p)α_Y))‖Δ_i^y‖²_* + γ_i⟨Δ_i, z − z_i⟩,  (4.25)

where the last relation follows from Young's inequality. For all i ≥ 1, letting z_1^v = z_1 and defining z^v_{i+1} as in (4.22), we conclude from (4.25) and Lemma 4.5 that, ∀z ∈ Z,

  ∑_{i=1}^t Λ_i(z) ≤ B_t(z, z^v_{[t]}) + ∑_{i=1}^t { ((2−q)η_iγ_i/(2(1−q)α_X))‖Δ_i^x‖²_* + ((2−p)τ_iγ_i/(2(1−p)α_Y))‖Δ_i^y‖²_* } + ∑_{i=1}^t γ_i⟨Δ_i, z_i^v − z_i⟩ =: B_t(z, z^v_{[t]}) + U_t,  (4.26)

where, similar to (4.9), B_t(z, z^v_{[t]}) ≤ (γ_t/η_t)Ω_X + (γ_t/τ_t)Ω_Y. Using this inequality, (2.13), (2.14) and (4.24), we obtain

  β_tγ_t g(z^{ag}_{t+1}) ≤ (2γ_t/η_t)Ω_X + (2γ_t/τ_t)Ω_Y + U_t.  (4.27)

Now it suffices to bound the quantity U_t, both in expectation (part a)) and in probability (part b)).

We first show part a). Note that by our assumptions on the SO, at iteration i of Algorithm 3 the random noise Δ_i is independent of z_i, and hence E[⟨Δ_i, z − z_i⟩] = 0. In addition, Assumption A1 implies that E[‖Δ_i^x‖²_*] ≤ σ²_{x,G} + σ²_{x,K} = σ_x² (noting that Δ_i^{x,G} and Δ_i^{x,K} are independent at iteration i), and E[‖Δ_i^y‖²_*] ≤ σ_y². Therefore,

  E[U_t] ≤ ∑_{i=1}^t { ((2−q)η_iγ_iσ_x²)/(2(1−q)α_X) + ((2−p)τ_iγ_iσ_y²)/(2(1−p)α_Y) }.  (4.28)

Taking expectations on both sides of (4.27) and using the above inequality, we obtain (3.5).

We now show part b). Note that by our assumptions on the SO and the definition of z_i^v, the sequence {⟨Δ_i^{x,G}, x_i^v − x_i⟩}_{i≥1} is a martingale-difference sequence. By the well-known large-deviation theorem for martingale-difference sequences (e.g., Lemma 2 of [7]), and the fact that

  E[exp{ α_X⟨Δ_i^{x,G}, x_i^v − x_i⟩²/(2Ω_Xσ²_{x,G}) }] ≤ E[exp{ α_X‖Δ_i^{x,G}‖²_*‖x_i^v − x_i‖²/(2Ω_Xσ²_{x,G}) }] ≤ E[exp{ ‖Δ_i^{x,G}‖²_*V_X(x_i^v, x_i)/(Ω_Xσ²_{x,G}) }] ≤ E[exp{ ‖Δ_i^{x,G}‖²_*/σ²_{x,G} }] ≤ exp{1},

we conclude that

  Prob{ ∑_{i=1}^t γ_i⟨Δ_i^{x,G}, x_i^v − x_i⟩ > λσ_{x,G}((2Ω_X/α_X)∑_{i=1}^t γ_i²)^{1/2} } ≤ exp{−λ²/3}, ∀λ > 0.

By a similar argument, we can show that, ∀λ > 0,

  Prob{ ∑_{i=1}^t γ_i⟨Δ_i^y, y_i^v − y_i⟩ > λσ_y((2Ω_Y/α_Y)∑_{i=1}^t γ_i²)^{1/2} } ≤ exp{−λ²/3},
  Prob{ ∑_{i=1}^t γ_i⟨Δ_i^{x,K}, x_i^v − x_i⟩ > λσ_{x,K}((2Ω_X/α_X)∑_{i=1}^t γ_i²)^{1/2} } ≤ exp{−λ²/3}.

Using the previous three inequalities and the fact that σ_{x,G} + σ_{x,K} ≤ √2 σ_x, we have, ∀λ > 0,

  Prob{ ∑_{i=1}^t γ_i⟨Δ_i, z_i^v − z_i⟩ > λ( σ_x(2Ω_X/α_X)^{1/2} + σ_y(2Ω_Y/α_Y)^{1/2} )(∑_{i=1}^t γ_i²)^{1/2} } ≤ 3exp{−λ²/3}.  (4.29)

Now let S_i := (2−q)η_iγ_i/(2(1−q)α_X) and S := ∑_{i=1}^t S_i. By the convexity of the exponential function, we have

  E[exp{ (1/S)∑_{i=1}^t S_i‖Δ_i^{x,G}‖²_*/σ²_{x,G} }] ≤ E[(1/S)∑_{i=1}^t S_i exp{‖Δ_i^{x,G}‖²_*/σ²_{x,G}}] ≤ exp{1},

where the last inequality follows from Assumption A2. Therefore, by Markov's inequality, for all λ > 0,

  Prob{ ∑_{i=1}^t ((2−q)η_iγ_i/(2(1−q)α_X))‖Δ_i^{x,G}‖²_* > (1+λ)σ²_{x,G}∑_{i=1}^t ((2−q)η_iγ_i/(2(1−q)α_X)) }
  = Prob{ exp{ (1/S)∑_{i=1}^t S_i‖Δ_i^{x,G}‖²_*/σ²_{x,G} } > exp{1+λ} } ≤ exp{−λ}.

Using a similar argument, we can show the corresponding bounds for ‖Δ_i^{x,K}‖²_* and ‖Δ_i^y‖²_*. Combining the previous three inequalities, we obtain

  Prob{ ∑_{i=1}^t [ ((2−q)η_iγ_i/(2(1−q)α_X))‖Δ_i^x‖²_* + ((2−p)τ_iγ_i/(2(1−p)α_Y))‖Δ_i^y‖²_* ] > (1+λ)[ σ_x²∑_{i=1}^t ((2−q)η_iγ_i/(2(1−q)α_X)) + σ_y²∑_{i=1}^t ((2−p)τ_iγ_i/(2(1−p)α_Y)) ] } ≤ 3exp{−λ}.  (4.30)

Our result now follows directly from (4.26), (4.27), (4.29) and (4.30). □

In the remaining part of this subsection, our goal is to prove Theorem 3.3, which describes the convergence rate of Algorithm 3 when X and Y are both unbounded. As in the proof of Theorem 2.3, we first specialize the result of Lemma 4.4 under (2.15), (2.22) and (3.4). The following lemma is analogous to Lemma 4.3.

Lemma 4.6. Let ẑ = (x̂, ŷ) ∈ Z be a saddle point of (1.1). If V_X(x, x_t) = ‖x − x_t‖²/2 and V_Y(y, y_t) = ‖y − y_t‖²/2 in Algorithm 3, and the parameters β_t, θ_t, η_t and τ_t satisfy (2.15), (2.22) and (3.4), then

a) ‖x̂ − x_{t+1}‖² + ‖x̂ − x^v_{t+1}‖² + ((1−p)η_t/τ_t)‖ŷ − y_{t+1}‖² + (η_t/τ_t)‖ŷ − y^v_{t+1}‖² ≤ 2‖x̂ − x_1‖² + 2(η_t/τ_t)‖ŷ − y_1‖² + (2η_t/γ_t)U_t, for all t ≥ 1;  (4.31)

b) g̃(z^{ag}_{t+1}, v_{t+1}) ≤ (1/(2β_tη_t))‖x^{ag}_{t+1} − x_1‖² + (1/(2β_tτ_t))‖y^{ag}_{t+1} − y_1‖² + (β_tγ_t)^{-1}U_t =: δ_{t+1}, for all t ≥ 1,  (4.32)

where (x^v_{t+1}, y^v_{t+1}) and U_t are defined in (4.22) and (4.26), respectively, g̃(·,·) is defined in (2.21), and

  v_{t+1} = ( (2x_1 − x_{t+1} − x^v_{t+1})/(β_tη_t), (2y_1 − y_{t+1} − y^v_{t+1})/(β_tτ_t) − K(x_{t+1} − x_t)/β_t ).  (4.33)

Proof. Applying (3.4), (4.15) and (4.16) to (4.18) in Lemma 4.4, we get

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ B̄_t(z, z_{[t]}) − ((1−p)γ_t/(2τ_t))‖y − y_{t+1}‖² + B̄_t(z, z^v_{[t]}) + U_t,

where B̄_t(·,·) is defined as

  B̄_t(z, z̃_{[t]}) := (γ_t/(2η_t))‖x − x̃_1‖² − (γ_t/(2η_t))‖x − x̃_{t+1}‖² + (γ_t/(2τ_t))‖y − ỹ_1‖² − (γ_t/(2τ_t))‖y − ỹ_{t+1}‖², ∀z ∈ Z and z̃_{[t]} ⊂ Z,

thanks to (2.22). Now letting z = ẑ, and noting that Q(z^{ag}_{t+1}, ẑ) ≥ 0, we get (4.31). On the other hand, if we only apply (3.4) and (4.16) to (4.18) in Lemma 4.4, then we get

  β_tγ_tQ(z^{ag}_{t+1}, z) ≤ B̄_t(z, z_{[t]}) + γ_t⟨K(x_{t+1} − x_t), y − y_{t+1}⟩ + B̄_t(z, z^v_{[t]}) + U_t.

Applying (2.21) and (4.16) to B̄_t(z, z_{[t]}) and B̄_t(z, z^v_{[t]}) in the above inequality, we get (4.32). □

With the help of Lemma 4.6, we are now ready to prove Theorem 3.3.

Proof of Theorem 3.3. Let δ_{t+1} and v_{t+1} be defined in (4.32) and (4.33), respectively, and let C_t and D be defined in (3.14) and (2.26). It suffices to estimate E[‖v_{t+1}‖] and E[δ_{t+1}]. First, it follows from (2.22), (3.4) and (4.28) that

  E[U_t] ≤ (γ_t/η_t)C_t.  (4.34)

Using this inequality, (2.22), (2.26) and (4.31), we have E[‖x̂ − x_{t+1}‖²] ≤ 2(D² + C_t) and E[‖ŷ − y_{t+1}‖²] ≤ 2(D² + C_t)τ_t/(η_t(1−p)), which, by Jensen's inequality, imply that

  E[‖x̂ − x_{t+1}‖] ≤ (2(D² + C_t))^{1/2} and E[‖ŷ − y_{t+1}‖] ≤ (2(D² + C_t)τ_t/(η_t(1−p)))^{1/2}.

Similarly, we can show that E[‖x̂ − x^v_{t+1}‖] ≤ (2(D² + C_t))^{1/2} and E[‖ŷ − y^v_{t+1}‖] ≤ (2(D² + C_t)τ_t/η_t)^{1/2}. Therefore, by (4.33), the above four inequalities and the fact that (2(D² + C_t))^{1/2} ≤ √2(D + √C_t), we have

  E[‖v_{t+1}‖] ≤ E[ (1/(β_tη_t))(2‖x̂ − x_1‖ + ‖x̂ − x_{t+1}‖ + ‖x̂ − x^v_{t+1}‖) + (1/(β_tτ_t))(2‖ŷ − y_1‖ + ‖ŷ − y_{t+1}‖ + ‖ŷ − y^v_{t+1}‖) + (L_K/β_t)(‖x̂ − x_{t+1}‖ + ‖x̂ − x_t‖) ]
  ≤ (2‖x̂ − x_1‖)/(β_tη_t) + (2‖ŷ − y_1‖)/(β_tτ_t) + (D + √C_t)[ (2√2)/(β_tη_t) + (1/(β_tτ_t))(2τ_t/(η_t(1−p)))^{1/2} + (2√2 L_K)/β_t ],

thus (3.13) holds. Now let us estimate a bound on δ_{t+1}. By (4.17), (4.28), (4.31) and (4.34), we have

  E[δ_{t+1}] = E[ (1/(2β_tη_t))‖x^{ag}_{t+1} − x_1‖² + (1/(2β_tτ_t))‖y^{ag}_{t+1} − y_1‖² ] + (β_tγ_t)^{-1}E[U_t]
  ≤ E[ (1/(β_tη_t))(‖x̂ − x^{ag}_{t+1}‖² + ‖x̂ − x_1‖²) + (1/(β_tτ_t))(‖ŷ − y^{ag}_{t+1}‖² + ‖ŷ − y_1‖²) ] + C_t/(β_tη_t)
  ≤ (1/(β_tη_t))[ D² + (1/(β_tγ_t))∑_{i=1}^t γ_i( E‖x̂ − x_{i+1}‖² + ((1−p)η_t/τ_t)E‖ŷ − y_{i+1}‖² + (pη_t/τ_t)E‖ŷ − y_{i+1}‖² ) + C_t ]
  ≤ ((6−4p)D² + (5−3p)C_t)/(β_tη_t(1−p)),

where the last step follows after simplification. Therefore (3.12) holds. □

5. Numerical examples. In this section we present our experimental results on solving three saddle point problems using the deterministic or stochastic APD algorithm. Comparisons with the linearized version of the primal-dual algorithm in [9], Nesterov's smoothing technique in [43], Nemirovski's mirror-prox method in [36], the mirror-descent stochastic approximation method in [35] and the stochastic mirror-prox method in [11] are provided for a better examination of the performance of the APD algorithm.

5.1. Image reconstruction. Our primary goal in this subsection is to compare the performance of Algorithms 1 and 2. Consider the following total variation (TV) regularized linear inversion problem, which has been widely used as a framework for image reconstruction:

  min_{x∈X} f(x) := (1/2)‖Ax − b‖² + λ‖Dx‖_{2,1},  (5.1)

where x is the reconstructed image, ‖Dx‖_{2,1} is the discrete form of the TV semi-norm, A is a given structure matrix (depending on the physics of the data acquisition), b represents the observed data, and X := {x ∈ R^n : l ≤ x^{(i)} ≤ u, i = 1, ..., n}. For simplicity, we consider x as the n-vector form of a two-dimensional image. Problem (5.1) can be reformulated as an SPP of the form (1.1):

  min_{x∈X} max_{y∈Y} { (1/2)‖Ax − b‖² + λ⟨Dx, y⟩ },

where Y := {y ∈ R^{2n} : ‖y‖_{2,∞} := max_{i=1,...,n} ‖y^{(i)}‖ ≤ 1}, and ‖y^{(i)}‖ is the Euclidean norm of y^{(i)} ∈ R².

In our experiment, we consider two types of instances, depending on how the structure matrix A ∈ R^{k×n} is generated. More specifically, the entries of A are normally distributed according to N(0, 1/k) for the first type of instance, while for the second one, the entries of A are generated independently from a Bernoulli distribution, i.e., each entry of A takes value 1/√k or −1/√k with equal probability. Both types of structure matrices are widely used in compressive sensing (see, e.g., [3]). For a given A, the measurements are generated by b = Ax^{true} + ε, where x^{true} is a 64 × 64 Shepp-Logan phantom [48] with intensities in [0, 1], and ε ∼ N(0, 10^{−6}I_k) with k = 2048. We set X := {x ∈ R^n : 0 ≤ x^{(i)} ≤ 1, i = 1, ..., n} and λ = 10^{−3} in (5.1).

We applied the linearized version of Algorithm 1, denoted LPD, in which (2.2) is replaced by (2.10), and the APD algorithm to solve problem (5.1). In LPD, the stepsize parameters are set to η = 1/(L_G + L_K D_Y/D_X), τ = D_Y/(L_K D_X) and θ = 1. The stepsizes in APD are chosen as in Corollary 2.2, and the Bregman divergences are set to V_X(x̄, x) := ‖x − x̄‖²/2 and V_Y(ȳ, y) := ‖y − ȳ‖²/2, hence D_Y/D_X = 1. In addition, we also applied the APD algorithm for unbounded feasible sets, denoted APD-U, to solve (5.1) by assuming that X is unbounded. The stepsizes in APD-U are chosen as in Corollary 2.4, and we set N = 50.

To have a fair comparison, we use the same Lipschitz constants L_G and L_K for all algorithms, without performing backtracking. It can be easily seen that L_G = λ_max(A^TA) and L_K = λ√8 (see [8]) are the smallest Lipschitz constants that satisfy (1.2). Moreover, since in many applications the Lipschitz constants are either unknown or expensive to compute, the robustness of an algorithm to overestimated Lipschitz constants is important in practice. Hence, we also compare the sensitivity of APD, APD-U and LPD to overestimated Lipschitz constants in image reconstruction. To do so, we first supply all algorithms with the best Lipschitz constants L_G = L̄_G and L_K = L̄_K. For an approximate solution x̄ ∈ R^n, we report both the primal objective value f(x̄) and the reconstruction error relative to the ground truth, i.e., r(x̄) := ‖x̄ − x^{true}‖/‖x^{true}‖, versus CPU time, as shown in Figure 5.1. Moreover, to test the sensitivity of all the algorithms with respect to L_G and L_K, we also supply them with overestimated Lipschitz constants L_G = ζ_G L̄_G and L_K = ζ_K L̄_K, where ζ_G, ζ_K ∈ {2^{i/2}}_{i=0}^{8}. We report in Figure 5.2 the relationship between the multipliers ζ_G, ζ_K and the primal objective value obtained by each algorithm after N iterations.
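For reference, here is a minimal sketch (our own illustration) of the ingredients of the reformulation above: a forward-difference discrete gradient D for an h × w image, the TV term ‖Dx‖_{2,1}, and the classical bound ‖D‖² ≤ 8 underlying the value of L_K:

```python
import numpy as np

def grad2d(x, h, w):
    """Forward-difference discrete gradient: maps an image (as an n-vector,
    n = h*w) to per-pixel 2-vectors Dx, with zero boundary conditions."""
    img = x.reshape(h, w)
    dx = np.zeros((h, w)); dx[:, :-1] = img[:, 1:] - img[:, :-1]
    dy = np.zeros((h, w)); dy[:-1, :] = img[1:, :] - img[:-1, :]
    return np.stack([dx.ravel(), dy.ravel()], axis=1)   # shape (n, 2)

def tv(x, h, w):
    """Discrete TV semi-norm ||Dx||_{2,1}: sum of per-pixel gradient norms."""
    return np.linalg.norm(grad2d(x, h, w), axis=1).sum()

# The operator norm of D satisfies ||D||^2 <= 8, so the coupling operator
# lam * D in the saddle-point reformulation has L_K = lam * sqrt(8).
```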
We make a few observations about the obtained results. Firstly, for the image reconstruction problem (5.1), both APD and APD-U outperform LPD in terms of the decrease of the objective value and of the relative error. Secondly, although APD-U has the same rate of convergence as APD, its practical performance is not as good as that of APD. A plausible explanation is that we need to specify more conservative stepsize parameters in APD-U (see (2.23) and (2.27)) in order to ensure its convergence over unbounded sets X and Y, which may contribute to its inferior practical performance. Finally, the performance of both APD and APD-U is more robust than that of LPD when L_G is overestimated. This is consistent with our theoretical observation that both APD and APD-U have better rates of convergence than LPD in terms of the dependence on L_G.

5.2. Nonlinear game. Our next experiment considers the nonlinear two-person game

  min_{x∈Δ_n} max_{y∈Δ_m} { ⟨Qx, x⟩ + ⟨Kx, y⟩ },  (5.2)

where Q = A^TA is a positive semidefinite matrix with A ∈ R^{k×n}, and Δ_n and Δ_m are standard simplices: Δ_n := {x ∈ R^n_+ : ∑_{i=1}^n x^{(i)} = 1} and Δ_m := {y ∈ R^m_+ : ∑_{i=1}^m y^{(i)} = 1}. We generate each entry of A independently from the standard normal distribution, and each entry of K independently and uniformly from the interval [−1, 1]. Problem (5.2) can be interpreted as a two-person game, in which the first player has n strategies and chooses the i-th strategy with probability x^{(i)}, i = 1, ..., n. On the other hand, the second player has m strategies and chooses strategy i = 1, ..., m with probability y^{(i)}. The goal of the first player is to minimize the loss while the second player aims to maximize the gain, and the payoff of the game is a quadratic function that depends on the strategies of both players. A saddle point of (5.2) is a Nash equilibrium of this nonlinear game.

It has been shown (see, e.g., [4, 34, 35]) that the Euclidean distances V_X(x̄, x) = ‖x − x̄‖²/2 and V_Y(ȳ, y) = ‖y − ȳ‖²/2 are not the most suitable for solving optimization problems over simplices.

Fig. 5.1: Comparisons of APD, APD-U and LPD in image reconstruction. The top and bottom rows, respectively, show the performance of these algorithms on the Gaussian and Bernoulli instances. Left: the objective values f(x_t^{ag}) from APD and APD-U, and f(x_t) from LPD, vs. CPU time; the straight line at the bottom is f(x^{true}). Right: the relative errors r(x_t^{ag}) from APD and APD-U and r(x_t) from LPD vs. CPU time.

Fig. 5.2: Sensitivity to the overestimated Lipschitz constants: comparisons of APD, APD-U and LPD in image reconstruction. Left: the primal objective values f(x^{ag}_N) from APD and APD-U, and f(x_N) from LPD, vs. ζ_G and ζ_K on the Gaussian instance. Right: the same quantities on the Bernoulli instance.

In this experiment, we choose ‖·‖ := ‖·‖₁ in both spaces X and Y, and use the following entropy setting for the Bregman divergences V_X(·,·) and V_Y(·,·):

  V_X(x̄, x) := ∑_{i=1}^n (x^{(i)} + ν/n) ln( (x^{(i)} + ν/n)/(x̄^{(i)} + ν/n) ), V_Y(ȳ, y) := ∑_{i=1}^m (y^{(i)} + ν/m) ln( (y^{(i)} + ν/m)/(ȳ^{(i)} + ν/m) ),
  L_G = 2max_{i,j} |Q^{(i,j)}|, L_K = max_{i,j} |K^{(i,j)}|, α_X = 1/(1+ν), α_Y = 1/(1+ν),
  Ω_X = (1+ν) ln(n/ν + 1), Ω_Y = (1+ν) ln(m/ν + 1), D_X = (Ω_X/α_X)^{1/2}, D_Y = (Ω_Y/α_Y)^{1/2},  (5.3)

where L_G and L_K are the smallest Lipschitz constants, and ν is arbitrarily small (e.g., ν = 10^{−6}); see [5] for the calculation of α_X, α_Y, Ω_X and Ω_Y. With this setting, the subproblems in (2.5) and (2.6) can be solved within machine accuracy [5].

In this experiment, we compare the proposed APD algorithm with Nesterov's smoothing technique in [4] and Nemirovski's mirror-prox method in [34]. APD denotes the APD algorithm with the stepsizes of Corollary 2.2 under (5.3). NEST denotes Nesterov's algorithm in Section 5.3 of [4] with entropy distance (see Theorem 3 and Section 4.1 in [4] for details about the setting of Nesterov's algorithm). NEM denotes Nemirovski's mirror-prox method in (3.1)-(3.4) of [34], in which L = max_{i,j}|Q^{(i,j)}|D_X² + max_{i,j}|K^{(i,j)}|D_XD_Y (see "Mixed setups" in Section 5 of [34] for the variational inequality formulation of SPP (5.2); in particular, we set L_{11} = L_G, L_{12} = L_{21} = L_K, L_{22} = 0, Θ_1 = Ω_X, Θ_2 = Ω_Y, α_1 = α_X and α_2 = α_Y in (5.2) of [34]).

Our basic observations are as follows. First, the performance of APD is comparable to that of NEST. Second, both APD and NEST decrease the primal objective value faster than NEM, and are more robust against the overestimation of L_G. This is consistent with our theoretical observations that both APD and NEST enjoy the optimal rate of convergence (1.4), while NEM obeys the suboptimal rate of convergence (1.5). In addition, both APD and NEST have lower iteration cost than NEM, since NEM requires an extragradient step in its inner iterations.
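Under the entropy setting (5.3), the prox-mappings in (2.5) and (2.6) over a simplex reduce (for ν = 0) to a softmax-style multiplicative update. A minimal sketch (our own illustration, ignoring the ν-smoothing for clarity; assumes the current point has strictly positive entries):

```python
import numpy as np

def entropy_prox(x, g, step):
    """argmin_{u in simplex} <g, u> + (1/step) * KL(u, x): the entropic
    prox-mapping, solvable in closed form by a multiplicative update."""
    logu = np.log(x) - step * g
    logu -= logu.max()              # stabilize the exponential
    u = np.exp(logu)
    return u / u.sum()

# One APD dual step (2.5) for the game (5.2): J = 0, maximize <K x_bar, y>,
# i.e. minimize <-K x_bar, y> over the simplex:
# y_next = entropy_prox(y, -K @ x_bar, tau_t)
```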

[Table 5.1: Nonlinear game. For each algorithm (APD, NEST, NEM), the table reports the objective value (Obj. Val.) and CPU time on four instances: Instance 1 ($k = 100$, $n = 1000$, $m = 1000$), Instance 2 ($k = 1000$, $n = 1000$, $m = 1000$), Instance 3 ($k = 100$, $n = 10000$, $m = 1000$), and Instance 4 ($k = 1000$, $n = 10000$, $m = 1000$), together with each instance's constants $L_G$ and $L_K$; the numerical entries were not recoverable from the transcription.]

[Fig. 5.3: Sensitivity to the overestimated Lipschitz constants: comparisons of APD, NEST and NEM in the nonlinear game. Panels (a)-(d) correspond to Instances 1-4 and show the primal objective function values vs. $\zeta_G$ and $\zeta_K$ after $t = 1000$ iterations.]
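The sensitivity protocol behind Figs. 5.2 and 5.3 can be summarized with a short sketch. We read the experiment as follows (our assumption, not a statement quoted from the paper): each solver receives the inflated estimates $\zeta_G L_G$ and $\zeta_K L_K$ in place of the true Lipschitz constants, and the primal objective attained after a fixed iteration budget is recorded; `run_solver` below is a hypothetical stand-in for APD, NEST or NEM.

```python
# Hypothetical sensitivity sweep over the over-estimation factors
# zeta_G and zeta_K (the grid values are illustrative only).
import itertools

L_G, L_K = 100.0, 1.0    # placeholders; the true constants are instance-dependent

def run_solver(LG_hat, LK_hat, iters=1000):
    """Hypothetical stand-in: run APD/NEST/NEM with the given (possibly
    over-estimated) constants and return the final primal objective value."""
    ...

results = {}
for zeta_G, zeta_K in itertools.product([1, 2, 4, 8], repeat=2):
    results[(zeta_G, zeta_K)] = run_solver(zeta_G * L_G, zeta_K * L_K)
```

Under this reading, a method is "robust" in the sense of the figures if its recorded objective degrades slowly as $\zeta_G$ and $\zeta_K$ grow.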
