Accelerated Method for Stochastic Composition Optimization with Nonsmooth Regularization
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Accelerated Method for Stochastic Composition Optimization with Nonsmooth Regularization

Zhouyuan Huo,1 Bin Gu,1 Ji Liu,2 Heng Huang1
1 Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA
2 Department of Computer Science, University of Rochester, Rochester, NY 14627, USA
zhouyuan.huo@pitt.edu, big0@pitt.edu, jliu@cs.rochester.edu, heng.huang@pitt.edu

Abstract

Stochastic composition optimization has drawn much attention recently and has been successful in many emerging applications of machine learning, statistical analysis, and reinforcement learning. In this paper, we focus on the composition problem with a nonsmooth regularization penalty. Previous works either have slow convergence rates or do not provide a complete convergence analysis for the general problem. We tackle these two issues by proposing a new stochastic composition optimization method for the composition problem with a nonsmooth regularization penalty. In our method, we apply the variance reduction technique to accelerate convergence. To the best of our knowledge, our method admits the fastest convergence rate for stochastic composition optimization: for the strongly convex composition problem, our algorithm is proved to admit linear convergence; for the general composition problem, our algorithm significantly improves the state-of-the-art convergence rate from O(T^{-1/2}) to O((n_1+n_2)^{2/3}/T). Finally, we apply our proposed algorithm to portfolio management and policy evaluation in reinforcement learning. Experimental results verify our theoretical analysis.

Introduction

Stochastic composition optimization has drawn much attention recently and has been successful in addressing many emerging applications in different areas, such as reinforcement learning (Dai et al. 2016; Wang and Liu 2016), statistical learning (Wang, Fang, and Liu 2014), and risk management (Dentcheva, Penev, and Ruszczyński 2016).
The authors in (Wang, Fang, and Liu 2014; Wang and Liu 2016) proposed the composition problem, which is the composition of two expected-value functions:

    \min_{x \in \mathbb{R}^N} \underbrace{\mathbb{E}_i F_i(\mathbb{E}_j G_j(x))}_{f(x)} + h(x),    (1)

where G_j(x): R^N -> R^M are inner component functions and F_i(y): R^M -> R are outer component functions. The regularization penalty h(x) is a closed convex function, but not necessarily smooth. In practice, we usually solve the finite-sum scenario of the composition problem, which can be represented as follows:

    \min_{x \in \mathbb{R}^N} H(x) = \min_{x \in \mathbb{R}^N} \underbrace{\frac{1}{n_1}\sum_{i=1}^{n_1} F_i\Big(\frac{1}{n_2}\sum_{j=1}^{n_2} G_j(x)\Big)}_{f(x)} + h(x),    (2)

where we define F(y) = \frac{1}{n_1}\sum_{i=1}^{n_1} F_i(y) and G(x) = \frac{1}{n_2}\sum_{j=1}^{n_2} G_j(x). Throughout this paper, we mainly focus on the case where F_i and G_j are smooth. However, we do not require F_i and G_j to be convex.

[Footnote: To whom all correspondence should be addressed. Copyright (c) 2018, Association for the Advancement of Artificial Intelligence ( All rights reserved.]

Minimizing the composition of expected-value functions (1) or finite-sum functions (2) is challenging. The classical stochastic gradient method (SGD) and its variants are well suited to minimizing traditional finite-sum functions (Bottou, Curtis, and Nocedal 2016). However, they are not directly applicable to the composition problem. To apply SGD, we need to compute the unbiased sampling gradient (\partial G_j(x))^T \nabla F_i(G(x)) of problem (2), which is time-consuming when G(x) is unknown. Evaluating G(x) requires traversing all inner component functions, which is unacceptable to compute in each iteration if n_2 is a large number. In (Wang, Fang, and Liu 2014), the authors considered the problem with h(x) = 0 and proposed the stochastic compositional gradient descent algorithm (SCGD), which is the first stochastic method for the composition problem. In their paper, they proved that the convergence rate of SCGD is O(T^{-2/3}) for the strongly convex composition problem and O(T^{-1/4}) for the general problem. They also proposed accelerated SCGD by using the Nesterov smoothing technique (Nesterov 1983), which is proved to admit a faster convergence rate.
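The difficulty with plugging the chain rule into SGD can be seen numerically. The following sketch (ours, not from the paper; a toy scalar composition is an assumption for illustration) compares the naive single-sample estimator, which substitutes a sampled inner value G_j(x) for the unknown G(x), against the true gradient; even in expectation the two disagree when F is nonlinear:

```python
import numpy as np

# Toy composition with n1 = 1, n2 = 2, scalar x:
#   G_1(x) = x, G_2(x) = 2x  =>  G(x) = 1.5x;  F(y) = y^2  =>  f(x) = (1.5x)^2.
def true_grad(x):
    # Chain rule through the *averaged* inner function: f'(x) = 1.5 * 2 * (1.5x).
    return 1.5 * 2.0 * (1.5 * x)

def naive_estimator_mean(x):
    # Expectation over j of the single-sample estimator G_j'(x) * F'(G_j(x)).
    grads = [g * 2.0 * (g * x) for g in (1.0, 2.0)]  # g = G_j'(x)
    return float(np.mean(grads))

print(true_grad(1.0), naive_estimator_mean(1.0))  # 4.5 vs 5.0: biased
```

The gap (4.5 vs 5.0) is exactly why evaluating the full inner average G(x), or tracking it with an auxiliary variable as SCGD does, is unavoidable.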
SCGD has constant query complexity per iteration; however, its convergence rate is far worse than that of the full gradient method because of the noise induced by sampling gradients. Recently, the variance reduction technique (Johnson and Zhang 2013) was applied to accelerate the convergence of stochastic composition optimization. Lian, Wang, and Liu (2016) first utilized the variance reduction technique and proposed two variance-reduced stochastic compositional gradient descent methods, Compositional-SVRG-1 and Compositional-SVRG-2. Both methods are proved to admit a linear convergence rate. However, the methods proposed in (Wang, Fang, and Liu 2014)
Table 1: Comparison of SCGD, Accelerated SCGD, ASC-PG, Compositional-SVRG-1, Compositional-SVRG-2, com-SVR-ADMM, and our VRSC-PG in terms of convergence. For a fair comparison, we account for query complexity in the convergence rate. One query of the Sampling Oracle (SO) has three cases: (1) given x in R^N and j in {1, 2, ..., n_2}, SO returns G_j(x) in R^M; (2) given x in R^N and j in {1, 2, ..., n_2}, SO returns \partial G_j(x) in R^{M x N}; (3) given y in R^M and i in {1, 2, ..., n_1}, SO returns \nabla F_i(y) in R^M. T denotes the total number of iterations, \kappa denotes the condition number, and 0 < \rho < 1.

Algorithm | h(x) != 0 | Strongly Convex | General Problem
SCGD (Wang, Fang, and Liu 2014) | no | O(T^{-2/3}) | O(T^{-1/4})
Accelerated SCGD (Wang, Fang, and Liu 2014) | no | O(T^{-4/5}) | O(T^{-2/7})
Compositional-SVRG-1 (Lian, Wang, and Liu 2016) | no | O(\rho^{T/(n_1+n_2+\kappa^4)}) | -
Compositional-SVRG-2 (Lian, Wang, and Liu 2016) | no | O(\rho^{T/(n_1+n_2+\kappa^3)}) | -
ASC-PG (Wang and Liu 2016) | yes | O(T^{-4/5}) | O(T^{-4/9})
ASC-PG if G_j(x) are linear (Wang and Liu 2016) | yes | O(1/T) | O(T^{-1/2})
com-SVR-ADMM (Yu and Huang 2017) | yes | O(\rho^{T/(n_1+n_2+\kappa^4)}) | -
VRSC-PG (Ours) | yes | O(\rho^{T/(n_1+n_2+\kappa^3)}) | O((n_1+n_2)^{2/3}/T)

and (Lian, Wang, and Liu 2016) are not applicable to the composition problem with a nonsmooth regularization penalty. The composition problem with nonsmooth regularization was then considered in (Wang and Liu 2016; Yu and Huang 2017). In (Wang and Liu 2016), the authors proposed the accelerated stochastic compositional proximal gradient algorithm (ASC-PG). They proved that the optimal convergence rates of ASC-PG for the strongly convex problem and the general problem are O(1/T) and O(T^{-1/2}), respectively. However, ASC-PG suffers from slow convergence because of the noise of the sampling gradients. Yu and Huang (2017) proposed com-SVR-ADMM using variance reduction. Although com-SVR-ADMM admits linear convergence for the strongly convex composition problem, it is not optimal. Besides, they did not analyze the convergence for the general nonconvex composition problem either. We review the convergence rates of stochastic composition optimization methods in Table 1. In this paper, we propose the variance-reduced stochastic compositional proximal gradient method (VRSC-PG) for the composition problem with a nonsmooth regularization penalty.
Applying the variance reduction technique to the composition problem is nontrivial because the optimization procedure and the convergence analysis are essentially different. We investigate the convergence rate of our method: (1) for the strongly convex problem, we prove that VRSC-PG has a linear convergence rate O(\rho^{T/(n_1+n_2+\kappa^3)}), which is faster than com-SVR-ADMM; (2) for the general problem, sometimes nonconvex, VRSC-PG significantly improves the state-of-the-art convergence rate of ASC-PG from O(T^{-1/2}) to O((n_1+n_2)^{2/3}/T). To the best of our knowledge, our results are the new benchmark for stochastic composition optimization. We further evaluate our method by applying it to portfolio management and reinforcement learning. Experimental results verify our theoretical analysis.

[Footnote: In (Yu and Huang 2017), their result is O(\rho^{T/(n_1+n_2+Am)}). We prove that to obtain linear convergence, A and m must be proportional to \kappa^2, which is not included in their paper. Check Remark 1 in the supplementary material.]

Preliminary

In this section, we briefly review stochastic composition optimization and the proximal stochastic variance reduced gradient method.

Stochastic Composition Optimization

The objective function of stochastic composition optimization is the composition of expected-value (1) or finite-sum (2) functions, which is much more complicated than the traditional finite-sum problem. By the chain rule, the full gradient of the composition problem is

    \nabla f(x) = (\partial G(x))^\top \nabla F(G(x)).

Given x, applying the classical stochastic gradient descent method to compute the unbiased sampling gradient (\partial G_j(x))^\top \nabla F_i(G(x)) in a constant number of queries is not available when G(x) is unknown. In problem (2), evaluating G(x) is time-consuming and requires n_2 queries in each iteration. Therefore, classical SGD is not applicable to composition optimization. In (Wang, Fang, and Liu 2014), the authors proposed the first stochastic compositional gradient descent method (SCGD) for minimizing the stochastic composition problem with h(x) = 0. In their paper, they proposed to use an auxiliary variable y to approximate G(x). In each iteration t, we store x_t and y_t in memory. SCGD is briefly described in Algorithm 1. In the algorithm, \alpha_t and \beta_t are learning rates.
Both of them are decreasing to guarantee convergence because of the noise induced by sampling gradients. In their paper, they supposed that x \in X, and in each iteration x is projected onto X after step 4. Furthermore, the authors proposed Accelerated SCGD by applying Nesterov smoothing (Nesterov 1983), which is proved to converge faster than basic SCGD.
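The two coupled SCGD recursions (3) and (4) can be sketched as follows (a minimal NumPy sketch under our own toy oracle interface, not the authors' code: G(j, x) and dG(j, x) are assumed callables returning the sampled inner value and Jacobian, and dF(i, y) the sampled outer gradient):

```python
import numpy as np

def scgd(G, dG, dF, x0, y0, n1, n2, T, alpha, beta, rng):
    """Basic SCGD: y_t tracks G(x_t) via a running average (Eq. (3));
    x_t takes a stochastic quasi-gradient step through the chain rule (Eq. (4)).
    alpha and beta map the iteration counter t to (decreasing) step sizes."""
    x, y = x0.astype(float), y0.astype(float)
    for t in range(T):
        j = rng.integers(n2)                         # inner sample (2 queries)
        y = (1.0 - beta(t)) * y + beta(t) * G(j, x)  # Eq. (3)
        i = rng.integers(n1)                         # outer sample (1 query)
        x = x - alpha(t) * dG(j, x).T @ dF(i, y)     # Eq. (4)
    return x
```

On a trivial composition (every G_j the identity, every F_i(y) = ||y||^2 / 2), this reduces to gradient descent on ||x||^2 / 2 filtered through the auxiliary variable, and x is driven toward 0.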
Algorithm 1 SCGD
1: Initialize x_0 \in R^N, y_0 \in R^M;
2: for t = 0, 1, 2, ..., T-1 do
3:   Uniformly sample j from {1, 2, ..., n_2} with replacement and query G_j(x_t) and \partial G_j(x_t); (2 queries)
4:   Update y_{t+1} using:
       y_{t+1} <- (1 - \beta_t) y_t + \beta_t G_j(x_t);    (3)
5:   Uniformly sample i from {1, 2, ..., n_1} with replacement and query \nabla F_i(y_{t+1}); (1 query)
6:   Update x_{t+1} using:
       x_{t+1} <- x_t - \alpha_t (\partial G_j(x_t))^\top \nabla F_i(y_{t+1});    (4)
7: end for

Proximal Stochastic Variance Reduced Gradient

Stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013) was proposed to minimize finite-sum functions:

    \min_{x \in \mathbb{R}^N} \frac{1}{n} \sum_{i=1}^{n} f_i(x),    (5)

where the component functions f_i(x): R^N -> R. In large-scale optimization, SGD and its variants use the unbiased sampling gradient \nabla f_i(x) as an approximation of the full gradient, which only requires one query in each iteration. However, the variance induced by sampling gradients forces us to decrease the learning rate to make the algorithm converge. Suppose x^* is the optimal solution to problem (5): the full gradient \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x^*) = 0, while a sampling gradient \nabla f_i(x^*) \neq 0 in general. We must decrease the learning rate, otherwise convergence of the objective value cannot be guaranteed. However, the decreasing learning rate also makes SGD converge very slowly. For example, if problem (5) is strongly convex, the gradient descent method (GD) converges at a linear rate, while SGD converges at a rate of O(1/T). Reducing the variance is one of the most important ways to accelerate SGD, and it has been widely applied to large-scale optimization (Bottou, Curtis, and Nocedal 2016; Defazio, Bach, and Lacoste-Julien 2014; Gu, Huo, and Huang 2016b; Allen-Zhu and Yuan 2016; Huo and Huang 2017; Gu, Huo, and Huang 2016a). In (Xiao and Zhang 2014), the authors considered the nonsmooth regularization penalty h(x) \neq 0 and proposed the proximal stochastic variance reduced gradient method (Proximal SVRG), briefly described in Algorithm 2. In their paper, they used v_t as an approximation of the full gradient, where the estimate is unbiased. It was also proved that the variance of v_t converges to zero:

    \lim_{t \to \infty} \mathbb{E}\Big\| v_t - \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_t) \Big\|^2 = 0.

Therefore, we can keep the learning rate \eta constant in the procedure. In step 7, Prox_{\eta h(\cdot)}
denotes the proximal operator. With the definition of the proximal mapping, we have:

    \mathrm{Prox}_{\eta h(\cdot)}(x) = \arg\min_{y} \; h(y) + \frac{1}{2\eta}\|y - x\|^2.    (6)

Convergence analysis and experimental results confirmed that Proximal SVRG admits linear convergence in expectation for strongly convex optimization. In (Reddi et al. 2016b), the authors proved that Proximal SVRG has a sublinear convergence rate of O(n^{2/3}/T) when the f_i(x) are nonconvex.

Algorithm 2 Proximal SVRG
1: Initialize \tilde{x}^0 \in R^N;
2: for s = 0, 1, 2, ..., S-1 do
3:   x^{s+1}_0 <- \tilde{x}^s;
4:   \tilde{f} <- \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{x}^s); (n queries)
5:   for t = 0, 1, 2, ..., m-1 do
6:     Uniformly sample i from {1, 2, ..., n} with replacement and query \nabla f_i(x^{s+1}_t) and \nabla f_i(\tilde{x}^s); (2 queries)
7:     Update v^{s+1}_t using:
         v^{s+1}_t <- \nabla f_i(x^{s+1}_t) - \nabla f_i(\tilde{x}^s) + \tilde{f};    (7)
8:     Update the model x^{s+1}_{t+1} using:
         x^{s+1}_{t+1} <- \mathrm{Prox}_{\eta h(\cdot)}(x^{s+1}_t - \eta v^{s+1}_t);    (8)
9:   end for
10:  \tilde{x}^{s+1} <- x^{s+1}_m;
11: end for

Variance Reduced Stochastic Compositional Proximal Gradient

In this section, we propose the variance-reduced stochastic compositional proximal gradient method (VRSC-PG) for solving the finite-sum composition problem with a nonsmooth regularization penalty (2). VRSC-PG is presented in Algorithm 3. Similar to the framework of Proximal SVRG (Xiao and Zhang 2014), our VRSC-PG also has two nested loops. At the beginning of outer loop s, we keep a snapshot of the current model \tilde{x}^s in memory and compute the full gradient:

    \nabla f(\tilde{x}^s) = \Big( \frac{1}{n_2}\sum_{j=1}^{n_2} \partial G_j(\tilde{x}^s) \Big)^\top \frac{1}{n_1}\sum_{i=1}^{n_1} \nabla F_i(\tilde{G}^s),    (9)

where \tilde{G}^s = \frac{1}{n_2}\sum_{j=1}^{n_2} G_j(\tilde{x}^s) denotes the value of the inner functions and \partial G(\tilde{x}^s) = \frac{1}{n_2}\sum_{j=1}^{n_2} \partial G_j(\tilde{x}^s) denotes the Jacobian of the inner functions. Computing the full gradient of f(x) in problem (2) requires n_1 + 2n_2 queries. To make the number of queries in each inner iteration independent of n_2, we keep \hat{G}^{s+1}_t and \partial\hat{G}^{s+1}_t in memory as estimates of G(x^{s+1}_t) and \partial G(x^{s+1}_t), respectively. In our algorithm, we query G_{A_t}(x^{s+1}_t) and G_{A_t}(\tilde{x}^s),
then \hat{G}^{s+1}_t is evaluated as follows:

    \hat{G}^{s+1}_t = \tilde{G}^s - \frac{1}{A}\sum_{j=1}^{A}\Big( G_{A_t[j]}(\tilde{x}^s) - G_{A_t[j]}(x^{s+1}_t) \Big),    (10)

where A_t[j] denotes element j of the set A_t and |A_t| = A. The elements of A_t are uniformly sampled from {1, 2, ..., n_2} with replacement. In (10), we reduce the variance of the estimate of G(x^{s+1}_t) by using \tilde{G}^s and G_{A_t}(\tilde{x}^s). Similarly, we sample B_t of size B from {1, 2, ..., n_2} uniformly with replacement, and query \partial G_{B_t}(x^{s+1}_t) and \partial G_{B_t}(\tilde{x}^s). The estimate of \partial G(x^{s+1}_t) is evaluated as follows:

    \partial\hat{G}^{s+1}_t = \partial G(\tilde{x}^s) - \frac{1}{B}\sum_{j=1}^{B}\Big( \partial G_{B_t[j]}(\tilde{x}^s) - \partial G_{B_t[j]}(x^{s+1}_t) \Big),    (11)

where B_t[j] denotes element j of the set B_t and |B_t| = B. It is important to note that A_t and B_t are independent. Computing \hat{G}^{s+1}_t and \partial\hat{G}^{s+1}_t requires 2A + 2B queries in each inner iteration. Now we can compute the estimate of \nabla f(x^{s+1}_t) in inner iteration t as follows:

    v^{s+1}_t = \frac{1}{b}\sum_{i_t \in I_t}\Big( (\partial\hat{G}^{s+1}_t)^\top \nabla F_{i_t}(\hat{G}^{s+1}_t) - (\partial G(\tilde{x}^s))^\top \nabla F_{i_t}(\tilde{G}^s) \Big) + \nabla f(\tilde{x}^s),    (12)

where I_t is a set of indexes uniformly sampled from {1, 2, ..., n_1} and |I_t| = b. As per (12), we need to query \nabla F_{I_t}(\hat{G}^{s+1}_t) and \nabla F_{I_t}(\tilde{G}^s), which requires 2b queries. Finally, we update the model with the proximal operator:

    x^{s+1}_{t+1} = \mathrm{Prox}_{\eta h}(x^{s+1}_t - \eta v^{s+1}_t),    (13)

where \eta is the learning rate.

Convergence Analysis

In this section, we prove that (1) VRSC-PG admits a linear convergence rate for the strongly convex problem; (2) VRSC-PG admits a sublinear convergence rate O((n_1+n_2)^{2/3}/T) for the general problem. To the best of our knowledge, both are the best results so far. The following assumptions are commonly used for stochastic composition optimization (Wang, Fang, and Liu 2014; Wang and Liu 2016; Lian, Wang, and Liu 2016).

Strong convexity: To analyze the convergence of VRSC-PG for the strongly convex composition problem, we assume that the function f is \mu-strongly convex.

Assumption 1 The function f(x) is \mu-strongly convex. Therefore, for all x and y, we have:

    \|\nabla f(x) - \nabla f(y)\| \ge \mu \|x - y\|.    (15)

Equivalently, \mu-strong convexity can also be written as follows:

    f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2.    (16)

Algorithm 3 VRSC-PG
Input: The total number of iterations in the inner loop m, the total number of iterations in the outer loop S, the sizes of the mini-batch sets A, B and b, learning rate \eta.
1: Initialize \tilde{x}^0 \in R^N;
2: for s = 0, 1, 2, ..., S-1 do
3:   x^{s+1}_0 <- \tilde{x}^s;
4:   \tilde{G}^s <- \frac{1}{n_2}\sum_{j=1}^{n_2} G_j(\tilde{x}^s); (n_2 queries)
5:   \partial G(\tilde{x}^s) <- \frac{1}{n_2}\sum_{j=1}^{n_2} \partial G_j(\tilde{x}^s); (n_2 queries)
6:   Compute the full gradient \nabla f(\tilde{x}^s) using (9); (n_1 queries)
7:   for t = 0, 1, 2, ..., m-1 do
8:     Uniformly sample A_t from {1, 2, ..., n_2} with replacement, |A_t| = A;
9:     Update \hat{G}^{s+1}_t using (10); (2A queries)
10:    Uniformly sample B_t from {1, 2, ..., n_2} with replacement, |B_t| = B;
11:    Update \partial\hat{G}^{s+1}_t using (11); (2B queries)
12:    Uniformly sample I_t from {1, 2, ..., n_1} with replacement;
13:    Compute v^{s+1}_t using (12); (2b queries)
14:    Update the model x^{s+1}_{t+1} using:
         x^{s+1}_{t+1} <- \mathrm{Prox}_{\eta h}(x^{s+1}_t - \eta v^{s+1}_t);    (14)
15:  end for
16:  \tilde{x}^{s+1} <- x^{s+1}_m;
17: end for

Lipschitz gradients: We assume that there exist Lipschitz constants L_F, L_G and L_f for \nabla F_i(x), \partial G_j(x) and \nabla f(x), respectively.

Assumption 2 There exist constants L_F, L_G and L_f such that for all x, y, all i \in {1, ..., n_1} and all j \in {1, ..., n_2}:

    \|\nabla F_i(x) - \nabla F_i(y)\| \le L_F \|x - y\|,    (17)
    \|\partial G_j(x) - \partial G_j(y)\| \le L_G \|x - y\|,    (18)
    \|(\partial G_j(x))^\top \nabla F_i(G(x)) - (\partial G_j(y))^\top \nabla F_i(G(y))\| \le L_f \|x - y\|.    (19)

As proved in (Lian, Wang, and Liu 2016), according to (19) we have:

    \|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|, \quad \forall x, y.    (20)

Equivalently, (20) can also be written as follows: for all x, y,

    f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L_f}{2}\|x - y\|^2.    (21)

Bounded gradients: We assume that the gradients \nabla F_i(x) and \partial G_j(x) are upper bounded.

Assumption 3 The gradients \nabla F_i(x) and \partial G_j(x) have upper bounds B_F and B_G, respectively:

    \|\nabla F_i(x)\| \le B_F, \quad \forall x, \; i \in \{1, ..., n_1\},    (22)
    \|\partial G_j(x)\| \le B_G, \quad \forall x, \; j \in \{1, ..., n_2\}.    (23)
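One outer loop of Algorithm 3 can be sketched as follows (our own minimal NumPy sketch, not the authors' implementation; the oracle callables G(j, x), dG(j, x), dF(i, y), and the choice of h(x) = lam * ||x||_1 with its closed-form soft-thresholding proximal operator per Eq. (6), are our assumptions):

```python
import numpy as np

def prox_l1(z, lam, eta):
    """Prox_{eta*h}(z) for h(x) = lam * ||x||_1 (Eq. (6)): soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

def vrsc_pg_epoch(G, dG, dF, x_tilde, n1, n2, m, A, B, b, eta, lam, rng):
    """One outer iteration s of Algorithm 3 (VRSC-PG)."""
    # Snapshot: inner value, inner Jacobian, and full gradient, Eq. (9).
    G_snap = np.mean([G(j, x_tilde) for j in range(n2)], axis=0)
    dG_snap = np.mean([dG(j, x_tilde) for j in range(n2)], axis=0)
    grad_snap = dG_snap.T @ np.mean([dF(i, G_snap) for i in range(n1)], axis=0)
    x = x_tilde.copy()
    for _ in range(m):
        At = rng.integers(n2, size=A)             # Eq. (10): estimate G(x)
        G_hat = G_snap - np.mean([G(j, x_tilde) - G(j, x) for j in At], axis=0)
        Bt = rng.integers(n2, size=B)             # Eq. (11): estimate dG(x)
        dG_hat = dG_snap - np.mean([dG(j, x_tilde) - dG(j, x) for j in Bt], axis=0)
        It = rng.integers(n1, size=b)             # Eq. (12): variance-reduced gradient
        v = grad_snap + np.mean(
            [dG_hat.T @ dF(i, G_hat) - dG_snap.T @ dF(i, G_snap) for i in It], axis=0)
        x = prox_l1(x - eta * v, lam, eta)        # Eqs. (13)/(14)
    return x
```

On a trivial composition (G_j the identity, F_i(y) = ||y - c_i||^2 / 2 with the c_i averaging to zero, lam = 0), the variance-reduced estimate v is exact and one epoch contracts x toward the minimizer 0.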
Note that we do not need the strong convexity assumption when we analyze the convergence of VRSC-PG for the general problem.

Strongly Convex Problem

In this section, we prove that our VRSC-PG admits a linear convergence rate for the strongly convex finite-sum composition problem with nonsmooth regularization penalty (2). We need Assumptions 1, 2 and 3 in this section. Unlike Prox-SVRG in (Xiao and Zhang 2014), the estimate v^{s+1}_t is biased, i.e., \mathbb{E}_{I_t,A_t,B_t}[v^{s+1}_t] \neq \nabla f(x^{s+1}_t). This makes proving the convergence rate of VRSC-PG more challenging than the analysis in (Xiao and Zhang 2014). In spite of this, we can demonstrate that \mathbb{E}\|v^{s+1}_t - \nabla f(x^{s+1}_t)\|^2 is upper bounded as well.

Lemma 1 Let x^* be the optimal solution to problem (2), such that x^* = \arg\min_{x \in \mathbb{R}^N} H(x), and define

    \gamma = 64\Big( \frac{B_F^2 L_G^2}{B} + \frac{B_G^4 L_F^2}{A} \Big) + 8 L_f.

Suppose Assumptions 1, 2 and 3 hold. From the definition of v^{s+1}_t in (12), the following inequality holds:

    \mathbb{E}\|v^{s+1}_t - \nabla f(x^{s+1}_t)\|^2 \le \gamma \big[ H(x^{s+1}_t) - H(x^*) + H(\tilde{x}^s) - H(x^*) \big].    (24)

Therefore, when x^{s+1}_t and \tilde{x}^s converge to x^*, \mathbb{E}\|v^{s+1}_t - \nabla f(x^{s+1}_t)\|^2 also converges to zero. Thus, we can keep the learning rate constant and obtain faster convergence.

Theorem 1 Suppose Assumptions 1, 2 and 3 hold, and let x^* = \arg\min_{x \in \mathbb{R}^N} H(x). If m, A, B and \eta are selected properly so that \rho < 1, where \rho is defined as

    \rho = \frac{2}{\mu\eta(1 - 6\eta L_f - \rho_c)m} + \frac{2\eta(6 L_f + \rho_c/\eta)}{1 - 6\eta L_f - \rho_c},    (25)
    \rho_c = 16 m \eta \Big( \frac{B_F^2 L_G^2}{B} + \frac{B_G^4 L_F^2}{A} \Big),    (26)

then our VRSC-PG admits a linear convergence rate:

    \mathbb{E}\big[H(\tilde{x}^S) - H(x^*)\big] \le \rho^S \, \mathbb{E}\big[H(\tilde{x}^0) - H(x^*)\big].    (27)

As per Theorem 1, we need to choose \eta, m, A and B properly to make \rho < 1. We provide an example to show how to select these parameters.
Corollary 1 According to Theorem 1, if we set \eta, m, A and B as follows:

    \eta = \frac{1}{96 L_f},    (28)
    m = \frac{L_f}{\mu},    (29)
    A = \frac{2048 B_G^4 L_F^2}{\mu^2},    (30)
    B = \frac{2048 B_F^2 L_G^2}{\mu^2},    (31)

then we have the following linear convergence rate for VRSC-PG:

    \mathbb{E}\big[H(\tilde{x}^S) - H(x^*)\big] \le \Big(\frac{2}{3}\Big)^S \, \mathbb{E}\big[H(\tilde{x}^0) - H(x^*)\big].    (32)

Remark 1 According to Theorem 1, to obtain

    \mathbb{E}\big[H(\tilde{x}^s) - H(x^*)\big] \le \varepsilon,    (33)

the number of stages S is required to satisfy:

    S \ge \log\Big( \frac{\mathbb{E}[H(\tilde{x}^0) - H(x^*)]}{\varepsilon} \Big) \Big/ \log\frac{1}{\rho}.    (34)

As per Algorithm 3 and the definition of the Sampling Oracle in (Wang and Liu 2016), to make the objective value gap \mathbb{E}[H(\tilde{x}^s) - H(x^*)] \le \varepsilon, the total query complexity we need is

    O\big( (n_1 + n_2 + m(A + B + b)) \log(1/\varepsilon) \big) = O\big( (n_1 + n_2 + \kappa^3) \log(1/\varepsilon) \big),

where we let \kappa = \max\{L_f/\mu, L_F, L_G\} and b can be smaller than or proportional to \kappa^2. This is better than com-SVR-ADMM (Yu and Huang 2017), whose total query complexity is O((n_1 + n_2 + \kappa^4) \log(1/\varepsilon)).

General Problem

In this section, we prove that VRSC-PG admits a sublinear convergence rate O((n_1+n_2)^{2/3}/T) for the general finite-sum composition problem with a nonsmooth regularization penalty. This is much better than the state-of-the-art method ASC-PG (Wang and Liu 2016), whose optimal convergence rate is O(T^{-1/2}). In this section, we only need Assumptions 2 and 3. The biased v^{s+1}_t makes our analysis nontrivial, and it is much different from previous analyses of the finite-sum problem (Reddi et al. 2016a). In our proof, we define the proximal gradient:

    G_\eta(x) = \frac{1}{\eta}\Big( x - \mathrm{Prox}_{\eta h(\cdot)}\big(x - \eta \nabla f(x)\big) \Big).    (35)

Theorem 2 Suppose Assumptions 2 and 3 hold. Let x^* be the optimal solution to problem (2), i.e., x^* = \arg\min_{x \in \mathbb{R}^N} H(x). If m, A, B, b and \eta are selected properly such that

    \frac{4\eta m^2 L_f^2}{b} + \frac{2\eta m^2 B_G^4 L_F^2}{A} + \frac{2\eta m^2 B_F^2 L_G^2}{B} + \frac{L_f}{2} \le \frac{1}{2\eta},    (36)

then the following inequality holds:

    \mathbb{E}\|G_\eta(x_a)\|^2 \le \frac{2\big( H(\tilde{x}^0) - H(x^*) \big)}{(2\eta - L_f\eta^2)\, T},    (37)

where x_a is uniformly selected from \{\{x^{s+1}_t\}_{t=0}^{m}\}_{s=0}^{S} and T is a multiple of m. As per Theorem 2, we need to choose m, A, B, b and \eta appropriately to satisfy condition (36). We provide an example to show how to select these parameters.
Figure 1: Experimental results for mean-variance portfolio management on synthetic data: (a), (b) \kappa_{cov} = 2; (c), (d) \kappa_{cov} = 10. \kappa_{cov} is the condition number of the covariance matrix of the corresponding Gaussian distribution used to generate the rewards. The x axis is time, which is proportional to the query complexity. On the y axis, the objective value gap is defined as H(x) - H(x^*), where x^* is obtained by running our method for enough iterations until convergence; \|\nabla G(x)\|_2 denotes the \ell_2-norm of the full gradient, where \nabla G(x) = \nabla f(x) + \partial h(x).

Corollary 2 According to Theorem 2, if we let m = (n_1+n_2)^{1/3}, \eta = \frac{1}{4 L_f}, b = 1, and T be a multiple of m, it is easy to see that if A and B are lower bounded as

    A \ge \frac{8 m^2 B_G^4 L_F^2}{L_f},    (38)
    B \ge \frac{8 m^2 B_F^2 L_G^2}{L_f},    (39)

then we obtain the sublinear convergence rate for VRSC-PG:

    \mathbb{E}\|G_\eta(x_a)\|^2 \le \frac{16 L_f \big( H(\tilde{x}^0) - H(x^*) \big)}{T}.    (40)

Remark 2 According to Theorem 2, to obtain

    \mathbb{E}\|G_\eta(x_a)\|^2 \le \varepsilon,    (41)

the number of iterations T is required to satisfy:

    T \ge \frac{16 L_f \, \mathbb{E}[H(\tilde{x}^0) - H(x^*)]}{\varepsilon}.    (42)

As per Algorithm 3 and the definition of the Sampling Oracle in (Wang and Liu 2016), to obtain an \varepsilon-accurate solution, \mathbb{E}\|G_\eta(x_a)\|^2 \le \varepsilon, the total query complexity we need is O((A + B + b)/\varepsilon) = O((n_1+n_2)^{2/3}/\varepsilon), where A, B and b are proportional to (n_1+n_2)^{2/3}. Therefore, our method improves the state-of-the-art convergence rate of stochastic composition optimization for the general problem from O(T^{-1/2}) (the optimal convergence rate of ASC-PG) to O((n_1+n_2)^{2/3}/T).

Experimental Results

We conduct two experiments to evaluate our proposed method: (1) application to portfolio management; (2) application to policy evaluation in reinforcement learning. In the experiments, three methods for stochastic composition optimization are compared:
(1) accelerated stochastic compositional proximal gradient (ASC-PG) (Wang and Liu 2016); (2) stochastic variance reduced ADMM for stochastic composition optimization (com-SVR-ADMM) (Yu and Huang 2017); (3) variance reduced stochastic compositional proximal gradient (VRSC-PG), our method.

In our experiments, the learning rate \eta is tuned over {1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}}. We keep the learning rate constant for com-SVR-ADMM and VRSC-PG during optimization. For ASC-PG, in order to guarantee convergence, the learning rate decreases as \eta/(1+t), where t denotes the number of iterations.

Application to Portfolio Management

Suppose there are N assets in which we can invest, and let r_t \in R^N denote the rewards of the N assets at time t. Our goal is to maximize the return of the investment and to minimize the risk of the investment at the same time. The portfolio management problem can be formulated as the following mean-variance optimization:

    \min_{x \in \mathbb{R}^N} \; -\frac{1}{n}\sum_{t=1}^{n} \langle r_t, x \rangle + \frac{1}{n}\sum_{t=1}^{n} \Big( \langle r_t, x \rangle - \frac{1}{n}\sum_{j=1}^{n} \langle r_j, x \rangle \Big)^2,    (43)

where x \in R^N denotes the investment quantity vector for the N assets. According to (Lian, Wang, and Liu 2016), problem (43) can also be viewed as a composition problem of the form (2). In our experiment, we also add a nonsmooth regularization penalty h(x) = \lambda\|x\|_1 to the mean-variance optimization problem (43). Similar to the experimental settings in (Lian, Wang, and Liu 2016), we let n = 2000 and N = 200. Rewards r_t are generated in two steps: (1) generate a Gaussian distribution on R^N, where we define the condition number of its covariance matrix as \kappa_{cov}; because \kappa_{cov} is proportional to \kappa, in our experiment we control \kappa_{cov} to change the value of \kappa; (2) sample rewards r_t from the Gaussian distribution and make all elements positive to guarantee that this problem has a solution. In the experiment, we compare the three methods on two synthetic datasets, generated from Gaussian distributions with \kappa_{cov} = 2 and \kappa_{cov} = 10 separately. We set \lambda = 10^{-3} and A = B = b = 5. We selected the values of A, B, b casually; it is probable that better results can be obtained by tuning them carefully. Figure 1 shows the convergence of the compared methods with respect to time.
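The reduction of (43) to the composition form (2) can be checked numerically. One standard construction (following Lian, Wang, and Liu 2016; the concrete maps below are our own illustration) takes the inner function G(x) = (x, (1/n) \sum_j <r_j, x>) and outer functions F_i(x, m) = -<r_i, x> + (<r_i, x> - m)^2:

```python
import numpy as np

def mean_variance_direct(R, x):
    """Objective (43) evaluated directly; rows of R are the reward vectors r_t."""
    ret = R @ x
    return -ret.mean() + ((ret - ret.mean()) ** 2).mean()

def mean_variance_composed(R, x):
    """Same objective as (1/n1) sum_i F_i(G(x)), with inner average
    G(x) = (x, mean_j <r_j, x>) and outer F_i(x, m) = -<r_i, x> + (<r_i, x> - m)^2."""
    m = (R @ x).mean()  # inner expectation: second block of (1/n2) sum_j G_j(x)
    return float(np.mean([-R[i] @ x + (R[i] @ x - m) ** 2 for i in range(len(R))]))
```

The two evaluations agree for any data, which is all the reduction claims; the composition form is what lets the algorithms above query single (i, j) samples.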
We suppose that the elapsed time is proportional to the query complexity. The objective value gap means H(x_t) - H(x^*), where x^* is the optimal solution of H(x); we compute H(x^*) by running our method until convergence. First, by observing the x and y axes in Figure 1, we can see that when \kappa_{cov} = 10, all compared methods need more time to minimize problem (43), which is consistent with our analysis: increasing \kappa increases the total query complexity. Second, we can also see that com-SVR-ADMM and VRSC-PG admit linear convergence rates. ASC-PG runs faster at the beginning because of its low query complexity in each iteration; however, its convergence slows down when the learning rate gets small. In all four figures, our VRSC-PG always has the best performance among the compared methods.

Application to Reinforcement Learning

We then apply stochastic composition optimization to reinforcement learning and evaluate the three compared methods on the task of policy evaluation. In reinforcement learning, let V^\pi(s) be the value of state s under policy \pi. The value function V^\pi(s) can be evaluated through the Bellman equation:

    V^\pi(s_1) = \mathbb{E}\big[ r_{s_1,s_2} + \gamma V^\pi(s_2) \mid s_1 \big],    (44)

for all s_1, s_2 \in {1, 2, ..., S}, where S represents the total number of states. According to (Wang and Liu 2016), the Bellman equation (44) can also be written as a composition problem. In our experiment, we also add the sparsity regularization h(x) = \lambda\|x\|_1 to the objective function. Following (Dann, Neumann, and Peters 2014), we generate a Markov decision process (MDP). There are 400 states, and 10 actions at each state. The transition probabilities are generated randomly from the uniform distribution on [0, 1]; we then add 10^{-5} to each element of the transition matrix to ensure the ergodicity of our MDP. The rewards r(s, s') for transitioning from state s to state s' are also sampled uniformly from [0, 1]. In our experiment, we set \lambda = 10^{-3} and A = B = b = 5. We again selected these values casually; better results can be obtained by tuning them carefully. In Figure 2, we plot the convergence of the objective value and \|\nabla G(x)\|_2 with respect to time.
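The MDP construction described above can be sketched as follows (our own sketch; the row normalization after adding 10^{-5} is an assumption needed to keep each row of the transition matrix a probability distribution):

```python
import numpy as np

def make_mdp(n_states, rng):
    """Random MDP for policy evaluation: uniform transition weights in [0, 1]
    plus 1e-5 for ergodicity (then row-normalized), and uniform rewards
    r(s, s') in [0, 1]."""
    P = rng.uniform(size=(n_states, n_states)) + 1e-5
    P /= P.sum(axis=1, keepdims=True)  # each row sums to 1
    r = rng.uniform(size=(n_states, n_states))
    return P, r
```

With every entry of P strictly positive, the chain is ergodic, so the Bellman equation (44) has a unique solution for \gamma < 1.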
We can observe that VRSC-PG is much faster than ASC-PG, which is already reflected in the convergence rate analysis. It is also clear that our VRSC-PG converges faster than com-SVR-ADMM. The experimental results on policy evaluation also verify our theoretical analysis.

Conclusion

In this paper, we propose the variance-reduced stochastic compositional proximal gradient method (VRSC-PG) for the composition problem with a nonsmooth regularization penalty. We also analyze the convergence rate of our method: (1) for the strongly convex composition problem, VRSC-PG is proved to admit linear convergence; (2) for the general composition problem, VRSC-PG significantly improves the state-of-the-art convergence rate from O(T^{-1/2}) to O((n_1+n_2)^{2/3}/T). To the best of our knowledge, both of our theoretical results are the state-of-the-art for stochastic composition optimization. Finally, we apply our method to two different applications, portfolio management and reinforcement learning. Experimental results show that our method always has the best performance in different cases and verify the conclusions of our theoretical analysis.

Acknowledgement

Z. H., B. G., H. H. were partially supported by the following grants: NSF-IIS, NSF-IIS 34452, NSF-DBI, NSF-IIS 69308, NSF-IIS, NIH R0 AG. J. L. was partially supported by NSF-CCF.
Figure 2: Experimental results of policy evaluation in reinforcement learning. We plot the convergence of the objective value and the full gradient \|\nabla G(x)\|_2 with respect to time, where \|\nabla G(x)\|_2 denotes the \ell_2-norm of the full gradient \nabla G(x) = \nabla f(x) + \partial h(x).

References

Allen-Zhu, Z., and Yuan, Y. 2016. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning.
Bottou, L.; Curtis, F. E.; and Nocedal, J. 2016. Optimization methods for large-scale machine learning. arXiv preprint.
Dai, B.; He, N.; Pan, Y.; Boots, B.; and Song, L. 2016. Learning from conditional distributions via dual kernel embeddings. arXiv preprint.
Dann, C.; Neumann, G.; and Peters, J. 2014. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research 15.
Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems.
Dentcheva, D.; Penev, S.; and Ruszczyński, A. 2016. Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics.
Gu, B.; Huo, Z.; and Huang, H. 2016a. Asynchronous stochastic block coordinate descent with variance reduction. arXiv preprint.
Gu, B.; Huo, Z.; and Huang, H. 2016b. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint.
Huo, Z., and Huang, H. 2017. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization. In AAAI.
Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.
Lian, X.; Wang, M.; and Liu, J. 2016. Finite-sum composition optimization via variance reduced gradient descent. arXiv preprint.
Nesterov, Y. 1983. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN SSSR, volume 269.
Reddi, S. J.; Sra, S.; Poczos, B.; and Smola, A. 2016a. Fast stochastic methods for nonsmooth nonconvex optimization.
arXiv preprint.
Reddi, S. J.; Sra, S.; Poczos, B.; and Smola, A. J. 2016b. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems.
Wang, M., and Liu, J. 2016. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems.
Wang, M.; Fang, E. X.; and Liu, H. 2014. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. arXiv preprint.
Xiao, L., and Zhang, T. 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4).
Yu, Y., and Huang, L. 2017. Fast stochastic variance reduced ADMM for stochastic composition optimization. arXiv preprint.
More informationECE 901 Lecture 12: Complexity Regularization and the Squared Loss
ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality
More informationPrecise Rates in Complete Moment Convergence for Negatively Associated Sequences
Commuicatios of the Korea Statistical Society 29, Vol. 16, No. 5, 841 849 Precise Rates i Complete Momet Covergece for Negatively Associated Sequeces Dae-Hee Ryu 1,a a Departmet of Computer Sciece, ChugWoo
More informationElement sampling: Part 2
Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig
More informationON POINTWISE BINOMIAL APPROXIMATION
Iteratioal Joural of Pure ad Applied Mathematics Volume 71 No. 1 2011, 57-66 ON POINTWISE BINOMIAL APPROXIMATION BY w-functions K. Teerapabolar 1, P. Wogkasem 2 Departmet of Mathematics Faculty of Sciece
More informationRegression with an Evaporating Logarithmic Trend
Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,
More informationMonte Carlo Optimization to Solve a Two-Dimensional Inverse Heat Conduction Problem
Australia Joural of Basic Applied Scieces, 5(): 097-05, 0 ISSN 99-878 Mote Carlo Optimizatio to Solve a Two-Dimesioal Iverse Heat Coductio Problem M Ebrahimi Departmet of Mathematics, Karaj Brach, Islamic
More informationSummary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector
Summary ad Discussio o Simultaeous Aalysis of Lasso ad Datzig Selector STAT732, Sprig 28 Duzhe Wag May 4, 28 Abstract This is a discussio o the work i Bickel, Ritov ad Tsybakov (29). We begi with a short
More informationMODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING. University of Illinois at Urbana-Champaign
MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING Yuheg Bu Jiaxu Lu Veugopal V. Veeravalli Uiversity of Illiois at Urbaa-Champaig Tsighua Uiversity Email: bu3@illiois.edu, lujx4@mails.tsighua.edu.c,
More informationLinear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d
Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y
More informationIntroduction to Machine Learning DIS10
CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig
More informationSieve Estimators: Consistency and Rates of Convergence
EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes
More informationMulti parameter proximal point algorithms
Multi parameter proximal poit algorithms Ogaeditse A. Boikayo a,b,, Gheorghe Moroşau a a Departmet of Mathematics ad its Applicatios Cetral Europea Uiversity Nador u. 9, H-1051 Budapest, Hugary b Departmet
More informationQuantile regression with multilayer perceptrons.
Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer
More informationSupplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate
Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We
More informationFastest mixing Markov chain on a path
Fastest mixig Markov chai o a path Stephe Boyd Persi Diacois Ju Su Li Xiao Revised July 2004 Abstract We ider the problem of assigig trasitio probabilities to the edges of a path, so the resultig Markov
More informationarxiv: v1 [stat.ml] 11 Dec 2016
Lock-Free Optimizatio for No-Covex Problems She-Yi Zhao, Gog-Duo Zhag ad Wu-Ju Li Natioal Key Laboratory for Novel Software Techology Departmet of Computer Sciece ad Techology, Najig Uiversity, Chia {zhaosy,
More informationRegression with quadratic loss
Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,
More informationRandom Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.
Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)
More informationPAijpam.eu ON TENSOR PRODUCT DECOMPOSITION
Iteratioal Joural of Pure ad Applied Mathematics Volume 103 No 3 2015, 537-545 ISSN: 1311-8080 (prited versio); ISSN: 1314-3395 (o-lie versio) url: http://wwwijpameu doi: http://dxdoiorg/1012732/ijpamv103i314
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More informationAn Alternative Scaling Factor In Broyden s Class Methods for Unconstrained Optimization
Joural of Mathematics ad Statistics 6 (): 63-67, 00 ISSN 549-3644 00 Sciece Publicatios A Alterative Scalig Factor I Broyde s Class Methods for Ucostraied Optimizatio Muhammad Fauzi bi Embog, Mustafa bi
More informationBull. Korean Math. Soc. 36 (1999), No. 3, pp. 451{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Seung Hoe Choi and Hae Kyung
Bull. Korea Math. Soc. 36 (999), No. 3, pp. 45{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Abstract. This paper provides suciet coditios which esure the strog cosistecy of regressio
More informationA new iterative algorithm for reconstructing a signal from its dyadic wavelet transform modulus maxima
ol 46 No 6 SCIENCE IN CHINA (Series F) December 3 A ew iterative algorithm for recostructig a sigal from its dyadic wavelet trasform modulus maxima ZHANG Zhuosheg ( u ), LIU Guizhog ( q) & LIU Feg ( )
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More informationFast Rates for Regularized Objectives
Fast Rates for Regularized Objectives Karthik Sridhara, Natha Srebro, Shai Shalev-Shwartz Toyota Techological Istitute Chicago Abstract We study covergece properties of empirical miimizatio of a stochastic
More informationA Hadamard-type lower bound for symmetric diagonally dominant positive matrices
A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of
More informationOn Weak and Strong Convergence Theorems for a Finite Family of Nonself I-asymptotically Nonexpansive Mappings
Mathematica Moravica Vol. 19-2 2015, 49 64 O Weak ad Strog Covergece Theorems for a Fiite Family of Noself I-asymptotically Noexpasive Mappigs Birol Güdüz ad Sezgi Akbulut Abstract. We prove the weak ad
More informationVector Quantization: a Limiting Case of EM
. Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z
More information18.657: Mathematics of Machine Learning
18.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 15 Scribe: Zach Izzo Oct. 27, 2015 Part III Olie Learig It is ofte the case that we will be asked to make a sequece of predictios,
More informationSelf-normalized deviation inequalities with application to t-statistic
Self-ormalized deviatio iequalities with applicatio to t-statistic Xiequa Fa Ceter for Applied Mathematics, Tiaji Uiversity, 30007 Tiaji, Chia Abstract Let ξ i i 1 be a sequece of idepedet ad symmetric
More informationResearch Article Approximate Riesz Algebra-Valued Derivations
Abstract ad Applied Aalysis Volume 2012, Article ID 240258, 5 pages doi:10.1155/2012/240258 Research Article Approximate Riesz Algebra-Valued Derivatios Faruk Polat Departmet of Mathematics, Faculty of
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationOn Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities
O Equivalece of Martigale Tail Bouds ad Determiistic Regret Iequalities Sasha Rakhli Departmet of Statistics, The Wharto School Uiversity of Pesylvaia Dec 16, 2015 Joit work with K. Sridhara arxiv:1510.03925
More informationLecture 7: October 18, 2017
Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem
More informationPreponderantly increasing/decreasing data in regression analysis
Croatia Operatioal Research Review 269 CRORR 7(2016), 269 276 Prepoderatly icreasig/decreasig data i regressio aalysis Darija Marković 1, 1 Departmet of Mathematics, J. J. Strossmayer Uiversity of Osijek,
More informationA New Solution Method for the Finite-Horizon Discrete-Time EOQ Problem
This is the Pre-Published Versio. A New Solutio Method for the Fiite-Horizo Discrete-Time EOQ Problem Chug-Lu Li Departmet of Logistics The Hog Kog Polytechic Uiversity Hug Hom, Kowloo, Hog Kog Phoe: +852-2766-7410
More informationEmpirical Process Theory and Oracle Inequalities
Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi
More information6.3 Testing Series With Positive Terms
6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial
More informationLecture 3: August 31
36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,
More information1 Introduction to reducing variance in Monte Carlo simulations
Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by
More informationSupplement to Graph Sparsification Approaches for Laplacian Smoothing
Supplemet to Graph Sparsificatio Approaches for Laplacia Smoothig Veerajaeyulu Sadhaala Yu-Xiag Wag Rya J. Tibshirai Machie Learig Departmet Caregie Mello Uiversity Pittsburgh PA 53 Departmet of Statistics
More informationMAT1026 Calculus II Basic Convergence Tests for Series
MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real
More informationMixed Optimization for Smooth Functions
Mixed Optimizatio for Smooth Fuctios Mehrdad Mahdavi Liju Zhag Rog Ji Departmet of Computer Sciece ad Egieerig, Michiga State Uiversity, MI, USA {mahdavim,zhaglij,rogji}@msu.edu Abstract It is well ow
More information62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +
62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of
More informationDouble Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution
Iteratioal Mathematical Forum, Vol., 3, o. 3, 3-53 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/.9/imf.3.335 Double Stage Shrikage Estimator of Two Parameters Geeralized Expoetial Distributio Alaa M.
More informationOutput Analysis and Run-Length Control
IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%
More informationQuestions and answers, kernel part
Questios ad aswers, kerel part October 8, 205 Questios. Questio : properties of kerels, PCA, represeter theorem. [2 poits] Let F be a RK defied o some domai X, with feature map φ(x) x X ad reproducig kerel
More informationOutline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression
REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques
More informationComputational Tutorial of Steepest Descent Method and Its Implementation in Digital Image Processing
Computatioal utorial of Steepest Descet Method ad Its Implemetatio i Digital Image Processig Vorapoj Pataavijit Departmet of Electrical ad Electroic Egieerig, Faculty of Egieerig Assumptio Uiversity, Bagkok,
More informationA collocation method for singular integral equations with cosecant kernel via Semi-trigonometric interpolation
Iteratioal Joural of Mathematics Research. ISSN 0976-5840 Volume 9 Number 1 (017) pp. 45-51 Iteratioal Research Publicatio House http://www.irphouse.com A collocatio method for sigular itegral equatios
More informationLecture 2: Monte Carlo Simulation
STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More information18.657: Mathematics of Machine Learning
8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric
More informationJanuary 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS
Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we
More informationSequences. Notation. Convergence of a Sequence
Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it
More informationDynamic Policy Programming with Function Approximation: Supplementary Material
Dyamic Policy Programmig with Fuctio pproximatio: Supplemetary Material Mohammad Gheshlaghi zar Radboud Uiversity Nijmege Geert Grooteplei Noord 21 6525 EZ Nijmege Netherlads m.azar@sciece.ru.l Viceç Gómez
More informationBasics of Probability Theory (for Theory of Computation courses)
Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.
More information1 Review and Overview
DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,
More informationOn the convergence rates of Gladyshev s Hurst index estimator
Noliear Aalysis: Modellig ad Cotrol, 2010, Vol 15, No 4, 445 450 O the covergece rates of Gladyshev s Hurst idex estimator K Kubilius 1, D Melichov 2 1 Istitute of Mathematics ad Iformatics, Vilius Uiversity
More informationECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations
ECE-S352 Itroductio to Digital Sigal Processig Lecture 3A Direct Solutio of Differece Equatios Discrete Time Systems Described by Differece Equatios Uit impulse (sample) respose h() of a DT system allows
More informationComparison Study of Series Approximation. and Convergence between Chebyshev. and Legendre Series
Applied Mathematical Scieces, Vol. 7, 03, o. 6, 3-337 HIKARI Ltd, www.m-hikari.com http://d.doi.org/0.988/ams.03.3430 Compariso Study of Series Approimatio ad Covergece betwee Chebyshev ad Legedre Series
More informationOn Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
O Variace Reductio i Stochastic Gradiet Descet ad its Asychroous Variats Sashak J. Reddi Caregie Mello Uiversity sjakkamr@cs.cmu.edu Ahmed Hefy Caregie Mello Uiversity ahefy@cs.cmu.edu Barabás Póczos Caregie
More informationThe Perturbation Bound for the Perron Vector of a Transition Probability Tensor
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Liear Algebra Appl. ; : 6 Published olie i Wiley IterSciece www.itersciece.wiley.com. DOI:./la The Perturbatio Boud for the Perro Vector of a Trasitio
More informationLecture 19: Convergence
Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may
More informationMaximum Likelihood Estimation and Complexity Regularization
ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio
More informationNumerical Method for Blasius Equation on an infinite Interval
Numerical Method for Blasius Equatio o a ifiite Iterval Alexader I. Zadori Omsk departmet of Sobolev Mathematics Istitute of Siberia Brach of Russia Academy of Scieces, Russia zadori@iitam.omsk.et.ru 1
More informationProbabilistic and Average Linear Widths in L -Norm with Respect to r-fold Wiener Measure
joural of approximatio theory 84, 3140 (1996) Article No. 0003 Probabilistic ad Average Liear Widths i L -Norm with Respect to r-fold Wieer Measure V. E. Maiorov Departmet of Mathematics, Techio, Haifa,
More informationDetailed proofs of Propositions 3.1 and 3.2
Detailed proofs of Propositios 3. ad 3. Proof of Propositio 3. NB: itegratio sets are geerally omitted for itegrals defied over a uit hypercube [0, s with ay s d. We first give four lemmas. The proof of
More informationNumerical Solution of the Two Point Boundary Value Problems By Using Wavelet Bases of Hermite Cubic Spline Wavelets
Australia Joural of Basic ad Applied Scieces, 5(): 98-5, ISSN 99-878 Numerical Solutio of the Two Poit Boudary Value Problems By Usig Wavelet Bases of Hermite Cubic Splie Wavelets Mehdi Yousefi, Hesam-Aldie
More informationAppendix to: Hypothesis Testing for Multiple Mean and Correlation Curves with Functional Data
Appedix to: Hypothesis Testig for Multiple Mea ad Correlatio Curves with Fuctioal Data Ao Yua 1, Hog-Bi Fag 1, Haiou Li 1, Coli O. Wu, Mig T. Ta 1, 1 Departmet of Biostatistics, Bioiformatics ad Biomathematics,
More informationImproved Class of Ratio -Cum- Product Estimators of Finite Population Mean in two Phase Sampling
Global Joural of Sciece Frotier Research: F Mathematics ad Decisio Scieces Volume 4 Issue 2 Versio.0 Year 204 Type : Double Blid Peer Reviewed Iteratioal Research Joural Publisher: Global Jourals Ic. (USA
More informationApproximating the ruin probability of finite-time surplus process with Adaptive Moving Total Exponential Least Square
WSEAS TRANSACTONS o BUSNESS ad ECONOMCS S. Khotama, S. Boothiem, W. Klogdee Approimatig the rui probability of fiite-time surplus process with Adaptive Movig Total Epoetial Least Square S. KHOTAMA, S.
More information10-701/ Machine Learning Mid-term Exam Solution
0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 11
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple
More informationApplication to Random Graphs
A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let
More informationA New Multivariate Markov Chain Model with Applications to Sales Demand Forecasting
Iteratioal Coferece o Idustrial Egieerig ad Systems Maagemet IESM 2007 May 30 - Jue 2 BEIJING - CHINA A New Multivariate Markov Chai Model with Applicatios to Sales Demad Forecastig Wai-Ki CHING a, Li-Mi
More informationInformation-based Feature Selection
Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with
More informationExpectation-Maximization Algorithm.
Expectatio-Maximizatio Algorithm. Petr Pošík Czech Techical Uiversity i Prague Faculty of Electrical Egieerig Dept. of Cyberetics MLE 2 Likelihood.........................................................................................................
More informationECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization
ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where
More information