Stochastic Convex Optimization

Shai Shalev-Shwartz, TTI-Chicago (shai@tti-c.org) · Ohad Shamir, The Hebrew University (ohadsh@cs.huji.ac.il) · Nathan Srebro, TTI-Chicago (nati@uchicago.edu) · Karthik Sridharan, TTI-Chicago (karthik@tti-c.org)

Abstract

Recently, regret bounds for online convex optimization have been derived under very general conditions. These results can be used also in the stochastic batch setting by applying online-to-batch conversions. In this paper we study whether stochastic guarantees can be obtained more directly, namely using uniform convergence guarantees. We discover a surprising and complex situation: although the stochastic convex optimization problem is learnable (e.g. using online-to-batch conversions), no uniform convergence holds in the general case, and empirical minimization might fail. Rather than being a difference between online methods and a global minimization approach, we show that the key ingredient is strong convexity and regularization. Using stability arguments, we prove that strongly convex problems are learnable using empirical minimization. We then understand how weakly convex problems can be learned using regularization, and discuss how online algorithms can also be understood in terms of regularization.

1 Introduction

We consider the stochastic convex minimization problem

$\arg\min_{w \in \mathcal{W}} F(w)$ (1)

where $F(w) = \mathbb{E}_Z[f(w; Z)]$ is the expectation, with respect to $Z$, of a random objective that is convex in $w$. The optimization is based on an i.i.d. sample $z_1, \ldots, z_n$ drawn from an unknown distribution. The goal is to choose $w$ based on the sample and full knowledge of $f$ and $\mathcal{W}$, so as to minimize $F(w)$. Alternatively, we can also think of an unknown distribution over convex functions, where we are given a sample of functions $\{w \mapsto f(w; z_i)\}$ and would like to optimize the expected function. A special case is the familiar prediction setting, where $z = (x, y)$ is an instance-label pair and, e.g., $f(w; (x, y)) = \ell(\langle w, \phi(x, y)\rangle)$ for some convex loss function $\ell$ and feature mapping $\phi$.

The situation in which the stochastic dependence on $w$ is linear, as in the preceding example, is fairly well understood. When the domain $\mathcal{W}$ and the mapping $\phi$ are bounded, one can bound, uniformly over all $w \in \mathcal{W}$, the deviation between the expected objective $F(w)$ and the empirical average

$\hat{F}(w) = \hat{\mathbb{E}}[f(w; z)] = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$. (2)
This uniform convergence of $\hat F(w)$ to $F(w)$ justifies choosing the empirical minimizer

$\hat w = \arg\min_{w \in \mathcal{W}} \hat F(w)$, (3)

and guarantees that the expected value of $F(\hat w)$ converges to the optimal value $F(w^\star) = \inf_{w} F(w)$. Furthermore, a similar guarantee can also be obtained for any approximate minimizer of the empirical objective.

Our goal here is to consider the stochastic convex optimization problem more broadly, without assuming any metric or other structure on the parameter $z$ or mappings of it, nor any special structure of the objective function $f(\cdot;\cdot)$. Viewed as optimization based on a sample of functions, we do not impose any constraints on the functions, nor on the relationship between the functions, except that each function $w \mapsto f(w; z)$ separately is convex and Lipschitz-continuous.

An online analogue of this setting has recently received considerable attention. Online convex optimization concerns a sequence of convex functions $f(\cdot; z_1), \ldots, f(\cdot; z_n)$, which can be chosen by an adversary, and a sequence of online predictors $w_1, \ldots, w_n$, where $w_i$ can depend only on $z_1, \ldots, z_{i-1}$. Online guarantees provide an upper bound on the online regret, $\frac{1}{n}\sum_{i=1}^{n} f(w_i; z_i) - \min_{w \in \mathcal{W}} \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$. Note the difference versus the stochastic setting, where we seek a single predictor $\bar w$ and would like to bound the population sub-optimality $F(\bar w) - F(w^\star)$.

Zinkevich [Z03] showed that requiring $f(w; z)$ to be Lipschitz-continuous w.r.t. $w$ is enough for obtaining an online algorithm with online regret which diminishes as $O(1/\sqrt{n})$. If $f(w; z)$ is not merely convex w.r.t. $w$, but also strongly convex, the regret bound can be improved to $\tilde O(1/n)$ [HKKA06]. These online results parallel known results in the stochastic setting when the stochastic dependence on $w$ is linear. However, they apply also in a much broader setting, when the dependence on $w$ is not linear, e.g. when $f(w; z) = \|w - z\|^p$ for $p \geq 2$.

The requirement that the functions $w \mapsto f(w; z)$ be Lipschitz-continuous is much more general than a specific requirement on the structure of the functions, and does not at all constrain the relationship between the functions. That is, we can think of $z$ as parameterizing all possible Lipschitz-continuous convex functions $w \mapsto f(w; z)$. We note that this is quite different from the work of von Luxburg and Bousquet [vLB04], who studied learning with functions that are Lipschitz with respect to $z$.

The results for the online setting prompt us to ask whether similar results, requiring only Lipschitz continuity, can also be obtained for stochastic convex optimization. The answer we discover is surprisingly complex. Our first surprising observation is that requiring Lipschitz continuity is not enough for ensuring uniform convergence of $\hat F(w)$ to $F(w)$, nor for the empirical minimizer $\hat w$ to converge to an optimal solution. We present convex, bounded, Lipschitz-continuous examples where, even as the sample size increases, the expected value of the empirical minimizer $\hat w$ is bounded away from the population optimum: $F(\hat w) = 1/2 > 0 = F(w^\star)$.

In essentially all previously studied settings we are aware of where learning or stochastic optimization is possible, we have at least some form of locally uniform convergence, and an empirical minimization approach is appropriate. In fact, for common models of supervised learning, it is known that uniform convergence is equivalent to stochastic optimization being possible [ABCH97]. This might lead us to think that Lipschitz continuity is not enough to make stochastic convex optimization possible, even though it is enough to ensure online convex optimization is possible. However, such a gap between the online and stochastic settings cannot exist, since it is possible to convert the online methods of Zinkevich and of Hazan et al to batch algorithms, with matching guarantees on the population sub-optimality $F(\bar w) - F(w^\star)$. These guarantees hold for the specific output $\bar w$ of the algorithm, which is not, in general, the empirical minimizer. It seems, then, that we are in a strange situation where stochastic optimization is possible, but only using a specific online algorithm, rather than the more natural empirical minimizer. We show that the magic can be understood not as a gap between online optimization and empirical minimization, but rather in terms of regularization.
To do so, we first show that for a strongly convex stochastic optimization problem, even though we might still have no uniform convergence, the empirical minimizer is guaranteed to converge to the population optimum. This result seems to defy Vapnik's celebrated result on the equivalence of uniform convergence and strict consistency of the empirical minimizer [Vap95, Vap98]. We explain why there is no contradiction here: Vapnik's notion of strict consistency is too strict, and does not capture all situations in which learning is non-trivial, yet still possible.

Convergence of the empirical minimizer to the population optimum for strongly convex objectives justifies stochastic convex optimization of weakly convex Lipschitz-continuous functions using regularized empirical minimization. In fact, we discuss how Zinkevich's algorithm can also be understood in terms of minimizing an implicit regularized problem.

2 Setup and Background

A stochastic convex optimization problem is specified by a convex domain $\mathcal{W}$, which in this paper we always take to be a compact subset of a Hilbert space $\mathcal{H}$, and a function $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ which is convex w.r.t. its first argument. We say that the problem is learnable (or solvable) iff there exists a rule for choosing $w$ based on an i.i.d. sample $z_1, \ldots, z_n$ and complete knowledge of $\mathcal{W}$ and $f(\cdot;\cdot)$, such that for any $\delta > 0$, any $\epsilon > 0$, and a large enough sample size $n$, for any distribution over $z$, with probability at least $1-\delta$ over a sample of size $n$, we have $F(w) \leq F(w^\star) + \epsilon$. We say that such a rule is uniformly consistent, or that it solves the stochastic optimization problem.

We say that the problem is bounded by $B$ iff for all $w \in \mathcal{W}$ we have $\|w\| \leq B$. We say that the problem is $L$-Lipschitz if $f(w; z)$ is $L$-Lipschitz w.r.t. $w$; that is, for any $z \in \mathcal{Z}$ and $w_1, w_2 \in \mathcal{W}$ we have $|f(w_1; z) - f(w_2; z)| \leq L \|w_1 - w_2\|$. We say that the problem is $\lambda$-strongly convex if for any $z \in \mathcal{Z}$, $w_1, w_2 \in \mathcal{W}$ and $\alpha \in [0, 1]$ we have

$f(\alpha w_1 + (1-\alpha) w_2; z) \leq \alpha f(w_1; z) + (1-\alpha) f(w_2; z) - \tfrac{\lambda}{2}\alpha(1-\alpha)\|w_1 - w_2\|^2$.

Note that this strengthens the convexity requirement, which corresponds to setting $\lambda = 0$.

2.1 Generalized Linear Stochastic Optimization

We say that a problem is a generalized linear problem if $f(w; z)$ can be written as

$f(w; z) = g(\langle w, \phi(z)\rangle; z) + r(w)$ (4)

where $g : \mathbb{R} \times \mathcal{Z} \to \mathbb{R}$ is convex w.r.t. its first argument, $r : \mathcal{W} \to \mathbb{R}$ is convex, and $\phi : \mathcal{Z} \to \mathcal{H}$.
A special case is supervised learning of a linear predictor with a convex loss function, where $g(\cdot;\cdot)$ encodes the loss function. Learnability results for linear predictors can in fact be stated more generally as guarantees on stochastic optimization of generalized linear problems:

Theorem 1. Consider a generalized linear stochastic convex optimization problem of the form (4), such that the domain $\mathcal{W}$ is bounded by $B$, the image of $\phi$ is bounded by $R$, and $g(u; z)$ is $L_g$-Lipschitz in $u$. Then for any distribution over $z$ and any $\delta > 0$, with probability at least $1-\delta$ over a sample of size $n$:

$\sup_{w \in \mathcal{W}} |F(w) - \hat F(w)| \leq O\!\left(\sqrt{\frac{B^2 R^2 L_g^2 \log(1/\delta)}{n}}\right)$

That is, the empirical values $\hat F(w)$ converge uniformly, for all $w \in \mathcal{W}$, to their expectations $F(w)$. This ensures that with probability at least $1-\delta$, for all $w \in \mathcal{W}$:

$F(w) - F(w^\star) \leq \left(\hat F(w) - \hat F(\hat w)\right) + O\!\left(\sqrt{\frac{B^2 R^2 L_g^2 \log(1/\delta)}{n}}\right)$ (5)

The empirical suboptimality term on the right-hand side vanishes for the empirical minimizer $\hat w$, establishing that empirical minimization solves the stochastic optimization problem with a rate of $1/\sqrt{n}$. Furthermore, (5) allows us to bound the population suboptimality in terms of the empirical suboptimality, and so to obtain meaningful guarantees even for approximate empirical minimizers.

The non-stochastic term $r(w)$ does not play a role in the above bound, as it can always be canceled out. However, when this term is strongly convex, e.g. when it is a squared-norm regularization term $r(w) = \frac{\lambda}{2}\|w\|^2$, a faster convergence rate can be guaranteed:

Theorem 2. [SSS08] Consider a generalized linear stochastic convex optimization problem of the form (4), such that $r(w)$ is $\lambda$-strongly convex, the image of $\phi$ is bounded by $R$, and $g(u; z)$ is $L_g$-Lipschitz in $u$. Then for any distribution over $z$ and any $\delta > 0$, with probability at least $1-\delta$ over a sample of size $n$, for all $w \in \mathcal{W}$:

$F(w) - F(w^\star) \leq 2\left(\hat F(w) - \hat F(\hat w)\right) + O\!\left(\frac{R^2 L_g^2 \log(1/\delta)}{\lambda n}\right)$

2.2 Online Convex Optimization

Zinkevich [Z03] established that Lipschitz continuity and convexity of the objective functions with respect to the optimization argument are sufficient for online optimization:

Theorem 3. [Sha07, Corollary 1] Let $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be such that $\mathcal{W}$ is bounded by $B$ and $f(w; z)$ is convex and $L$-Lipschitz with respect to $w$. Then there exists an online algorithm such that for any sequence $z_1, \ldots, z_n$, the sequence of online vectors $w_1, \ldots, w_n$ satisfies:

$\frac{1}{n}\sum_{i=1}^{n} f(w_i; z_i) \leq \min_{w \in \mathcal{W}} \frac{1}{n}\sum_{i=1}^{n} f(w; z_i) + O\!\left(\sqrt{\frac{B^2 L^2}{n}}\right)$ (6)

Subsequently, Hazan et al [HKKA06] showed that a faster rate can be obtained when the objective functions are not only convex, but also strongly convex:

Theorem 4. [HKKA06, Theorem 1] Let $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be such that $f(w; z)$ is $\lambda$-strongly convex and $L$-Lipschitz with respect to $w$. Then there exists an online algorithm such that for any sequence $z_1, \ldots, z_n$, the sequence of online vectors $w_1, \ldots, w_n$ satisfies:

$\frac{1}{n}\sum_{i=1}^{n} f(w_i; z_i) \leq \min_{w \in \mathcal{W}} \frac{1}{n}\sum_{i=1}^{n} f(w; z_i) + O\!\left(\frac{L^2 \log n}{\lambda n}\right)$

2.3 Online-to-batch conversions

In this paper we are not interested in the online setting, but rather in the batch stochastic optimization setting, where we would like to obtain a single predictor $\bar w$ with low expected value over future examples, $F(\bar w) = \mathbb{E}_z[f(\bar w; z)]$.
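Theorem 3 is achieved by projected online (sub)gradient descent [Z03]; combined with averaging the online iterates (the standard online-to-batch conversion), a minimal sketch might look as follows. The 1-D absolute-loss objective, domain, and step size here are illustrative assumptions, not an example from the paper.

```python
import numpy as np

def ogd_average(zs, B=1.0, L=1.0):
    """Projected online subgradient descent on f(w; z) = |w - z| over
    W = [0, B], returning the average of the online iterates."""
    n = len(zs)
    eta = B / (L * np.sqrt(n))   # fixed step size, as in Zinkevich's analysis
    w, iterates = 0.0, []
    for z in zs:
        iterates.append(w)
        g = np.sign(w - z)                   # a subgradient of |w - z| at w
        w = np.clip(w - eta * g, 0.0, B)     # projection back onto W
    return float(np.mean(iterates))          # online-to-batch: average

rng = np.random.default_rng(0)
zs = (rng.random(2000) < 0.8).astype(float)  # z = 1 w.p. 0.8, else z = 0
w_bar = ogd_average(zs)

def F(w):  # population objective: E|w - z| = 0.8|w - 1| + 0.2|w|
    return 0.8 * abs(w - 1.0) + 0.2 * abs(w)

# F is minimized at the median w* = 1 with F(w*) = 0.2; the averaged
# iterate w_bar should have population value close to 0.2.
```

Note that the guarantee applies to the averaged iterate $\bar w$, a specific procedural choice, and not to the empirical minimizer, which is the distinction the rest of the paper turns on.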
We present here slightly more general theorem statements than those found in the original papers [Z03, HKKA06]: we do not require differentiability, and instead of bounding the gradient and the Hessian we bound the Lipschitz constant and the parameter of strong convexity. The bound in Theorem 3 is also a bit tighter.

Using martingale inequalities, it is possible to convert an online algorithm to a batch algorithm with a stochastic guarantee. One simple way to do so is to run the online algorithm on the stochastic sequence of functions $f(\cdot, z_1), \ldots, f(\cdot, z_n)$ and set the single predictor $\bar w$ to be the average of the online choices $w_1, \ldots, w_n$. Assuming the conditions of Theorem 3, it is possible to show (e.g. [CCG04]) that with probability at least $1-\delta$ we have

$F(\bar w) - F(w^\star) \leq O\!\left(\sqrt{\frac{B^2 L^2 \log(1/\delta)}{n}}\right)$. (7)

It is also possible to derive a similar guarantee assuming the conditions of Theorem 4 [KT08]:

$F(\bar w) - F(w^\star) \leq O\!\left(\frac{L^2 \log(1/\delta)}{\lambda n}\right)$. (8)

The conditions of Theorem 3 generalize those of Theorem 1 when $r(w) = 0$: if $f(w; z) = g(\langle w, \phi(z)\rangle; z)$ satisfies the conditions of Theorem 1, then it also satisfies the conditions of Theorem 3 with $L = L_g R$, and the bound on the population sub-optimality of $\bar w$ given in (7) matches the guarantee on $\hat w$ from Theorem 1. Similarly, the conditions of Theorem 4 roughly generalize those of Theorem 2, with $L = R L_g + L_r$, and the guarantees are similar up to a log-factor, as long as $L_r = O(R L_g)$.

It is important to note, however, that the guarantees (7) and (8) do not subsume Theorems 1 and 2, as the online-to-batch guarantees apply only to a specific choice $\bar w$, which is defined in terms of the behavior of a specific algorithm. They do not provide guarantees on the empirical minimizer, and certainly not a uniform guarantee in terms of the empirical sub-optimality.

3 Warm-Up: Finite Dimensional Case

We begin by noting that in the finite dimensional case, Lipschitz continuity is enough to guarantee uniform convergence, hence also learnability via empirical minimization.

Theorem 5. Let $\mathcal{W} \subset \mathbb{R}^d$ be bounded by $B$ and let $f(w; z)$ be $L$-Lipschitz w.r.t. $w$. Then with probability at least $1-\delta$ over a sample of size $n$, for all $w \in \mathcal{W}$:

$|F(w) - \hat F(w)| \leq O\!\left(LB\sqrt{\frac{d \log n + \log(1/\delta)}{n}}\right)$

Proof. We will show uniform convergence by bounding the $\ell_\infty$-covering number of the class of functions $\mathcal{F} = \{z \mapsto f(w; z) \mid w \in \mathcal{W}\}$.
To do so, we first note that, as a subset of an $\ell_2$-ball, we can bound the covering number of $\mathcal{W}$ with respect to the Euclidean distance $d_2(w_1, w_2) = \|w_1 - w_2\|$ [VG05]: for $d > 3$,

$N(\epsilon, \mathcal{W}, d_2) = O\!\left(d^2 \left(\frac{B}{\epsilon}\right)^{d}\right)$

We now turn to covering numbers of $\mathcal{F}$ with respect to the $\ell_\infty$ distance $d_\infty(f(w_1; \cdot), f(w_2; \cdot)) = \sup_z |f(w_1; z) - f(w_2; z)|$. By Lipschitz continuity, for any $w_1, w_2 \in \mathcal{W}$ we have

$\sup_z |f(w_1; z) - f(w_2; z)| \leq L \|w_1 - w_2\|$. (9)

An $\epsilon$-covering of $\mathcal{W}$ w.r.t. $d_2$ therefore yields an $L\epsilon$-covering of $\mathcal{F}$ w.r.t. the $d_\infty$ distance, and so:

$N(\epsilon, \mathcal{F}, d_\infty) \leq N(\epsilon/L, \mathcal{W}, d_2) = O\!\left(d^2 \left(\frac{LB}{\epsilon}\right)^{d}\right)$ (10)

Noting that the empirical $\ell_1$ covering number is bounded by the $d_\infty$ covering number, and using a uniform bound in terms of empirical $\ell_1$ covering numbers, we get [Pol84]:

$\Pr\left(\sup_{w} |F(w) - \hat F(w)| \geq \epsilon\right) \leq 8\, N(\epsilon, \mathcal{F}, d_\infty) \exp\!\left(-\frac{n\epsilon^2}{128\, L^2 B^2}\right) \leq O\!\left(d^2 \left(\frac{LB}{\epsilon}\right)^{d}\right) \exp\!\left(-\frac{n\epsilon^2}{128\, L^2 B^2}\right)$.

Equating the right-hand side to $\delta$ and solving for $\epsilon$, we get the bound in the theorem. ∎

We can therefore conclude that empirical minimization is uniformly consistent with the same rate as in Theorem 5:

$F(\hat w) \leq F(w^\star) + O\!\left(LB\sqrt{\frac{d \log n + \log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ over a sample of size $n$. This is the standard approach for establishing learnability. We now turn to ask whether such an approach can also be taken in the infinite dimensional case, i.e. whether a bound that does not depend on the dimensionality is possible.

4 Learnable, but Not with Empirical Minimizer

The results of the previous sections suggest that perhaps Lipschitz continuity is enough for obtaining guarantees on stochastic convex optimization using a more direct approach, even in infinite dimensions. In particular, that perhaps Lipschitz continuity is enough for ensuring uniform convergence, which in turn would imply learnability using empirical minimization, as in the infinite dimensional linear case, the finite dimensional Lipschitz case, and essentially all studied scenarios of stochastic optimization that we are aware of. Ensuring uniform convergence would further enable us to use approximate empirical minimizers, and to bound the stochastic sub-optimality of any vector $w$ in terms of its empirical sub-optimality, rather than obtaining a guarantee only on the stochastic sub-optimality of one specific procedural choice obtained by running an online learning algorithm.

Unfortunately, this is not the case. Despite the fact that a bounded, Lipschitz-continuous, stochastic convex optimization problem is learnable even in infinite dimensions, as discussed in Section 2.2, we show here that uniform convergence does not hold, and that the problem might not be learnable with empirical minimization.
4.1 Empirical Minimizer Far from Population Optimum

Consider a convex stochastic optimization problem given by:

$f_2(w; (x, \alpha)) = \|\alpha \odot (w - x)\| = \sqrt{\textstyle\sum_i \alpha[i]\,(w[i] - x[i])^2}$ (12)

where for now we set the domain to the $d$-dimensional unit ball $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\| \leq 1\}$ and take $z = (x, \alpha)$ with $\alpha \in [0, 1]^d$ and $x \in \mathcal{W}$, and where $u \odot v$ denotes an element-wise product. We will first consider a sequence of problems, where $d = 2^n$ for any sample size $n$, and establish that we cannot expect a convergence rate which is independent of the dimensionality $d$. We then formalize this example in infinite dimensions.

One can think of problem (12) as that of finding the center of an unknown distribution over $x \in \mathbb{R}^d$, where we also have stochastic per-coordinate confidence measures $\alpha[i]$. We will actually focus on the case where some coordinates are missing, i.e. occasionally $\alpha[i] = 0$. In any case, the domain $\mathcal{W}$ is bounded by one, and for any $z = (x, \alpha)$ the function $w \mapsto f_2(w; z)$ is convex and $1$-Lipschitz. Thus, the conditions of Theorem 3 hold, and the convex stochastic optimization problem is learnable by running Zinkevich's online algorithm and taking an average.

Consider the following distribution over $Z = (X, \alpha)$: $X = 0$ with probability one, and $\alpha$ is uniform over $\{0, 1\}^d$. That is, the $\alpha[i]$ are i.i.d. uniform Bernoulli. For a random sample $(x_1, \alpha_1), \ldots, (x_n, \alpha_n)$ we have that, with probability greater than $1 - e^{-1} > 0.63$, there exists a coordinate $j \in 1 \ldots 2^n$ such that all confidence vectors $\alpha_i$ in the sample are zero on coordinate $j$, i.e. $\alpha_i[j] = 0$ for all $i = 1 \ldots n$. Let $e_j \in \mathcal{W}$ be the standard basis vector corresponding to this coordinate. Then

$\hat F_2(e_j) = \frac{1}{n}\sum_{i=1}^{n}\|\alpha_i \odot (e_j - 0)\| = \frac{1}{n}\sum_{i=1}^{n}\alpha_i[j] = 0$

but

$F_2(e_j) = \mathbb{E}_{X,\alpha}\big[\|\alpha \odot (e_j - 0)\|\big] = \mathbb{E}_{X,\alpha}\big[\alpha[j]\big] = 1/2$.

We have established that, for any $n$, we can construct a convex Lipschitz-continuous objective in high enough dimension such that, with probability at least $0.63$ over the sample,

$\sup_{w} \left(F_2(w) - \hat F_2(w)\right) \geq 1/2$.
Furthermore, since $f_2(\cdot;\cdot)$ is non-negative, we have that $e_j$ is an empirical minimizer, but its expected value $F_2(e_j) = 1/2$ is far from the optimal expected value $\min_w F_2(w) = F_2(0) = 0$.

4.2 In Infinite Dimensions: Empirical Minimizer Does Not Converge to Population Optimum

To formalize the example in a sample-size independent way, take $\mathcal{W}$ to be the unit ball of an infinite-dimensional Hilbert space with orthonormal basis $e_1, e_2, \ldots$, where for $v \in \mathcal{W}$ we refer to its coordinates $v[j] = \langle v, e_j \rangle$ w.r.t. this basis. The confidences $\alpha$ are now a mapping of each coordinate to $[0, 1]$, i.e. an infinite sequence of reals in $[0, 1]$. The element-wise product operation $\alpha \odot v$ is defined with respect to this basis, and the objective function $f_2$ of equation (12) is well defined in this infinite-dimensional space.

We again take a distribution over $Z = (X, \alpha)$ where $X = 0$ and $\alpha$ is an i.i.d. sequence of uniform Bernoulli random variables. Now, for any finite sample there is almost surely a coordinate $j$ with $\alpha_i[j] = 0$ for all $i$, and so we a.s. have an empirical minimizer $e_j$ with $\hat F_2(e_j) = 0$ and $F_2(e_j) = 1/2 > 0 = F_2(0)$.
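The infinite-dimensional construction can be simulated by taking a finite dimension much larger than $2^n$, so that a never-observed coordinate exists with overwhelming probability. The dimension and random seed below are arbitrary illustration choices, not values from the paper.

```python
import numpy as np

# Simulating f_2(w; (x, alpha)) = ||alpha ⊙ (w - x)|| with X = 0 and alpha
# uniform on {0,1}^d. With d >> 2^n, some coordinate j is almost surely
# unseen (alpha_i[j] = 0 for every sample point), and the basis vector e_j
# is then an empirical minimizer with bad population value.
rng = np.random.default_rng(0)
n, d = 10, 2 ** 14
alphas = rng.integers(0, 2, size=(n, d))         # n i.i.d. confidence vectors

unseen = np.flatnonzero(alphas.sum(axis=0) == 0)  # coordinates never observed
j = unseen[0]                                     # exists w.h.p. since d >> 2^n

e_j = np.zeros(d)
e_j[j] = 1.0

# Empirical value at e_j: (1/n) sum_i ||alpha_i ⊙ e_j|| = (1/n) sum_i alpha_i[j]
hat_F = float(np.mean(np.linalg.norm(alphas * e_j, axis=1)))

# Population value at e_j: E||alpha ⊙ e_j|| = E[alpha[j]] = 1/2 (exact)
F = 0.5
```

So the empirical minimizer $e_j$ has empirical value $0$ but population value $1/2$, matching the gap derived above, no matter how large $n$ is (as long as $d \gg 2^n$).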

We see that although the stochastic convex optimization problem (12) is learnable (using Zinkevich's online algorithm), the empirical values $\hat F_2(w)$ do not converge uniformly to their expectations, and empirical minimization is not guaranteed to solve the problem!

4.3 Unique Empirical Minimizer Does Not Converge to Population Optimum

It is also possible to construct a sharper counterexample, in which the unique empirical minimizer $\hat w$ is far from having optimal expected value. To do so, we augment $f_2$ by a small term which ensures its empirical minimizer is unique, and far from the origin. Consider:

$f_3(w; (x, \alpha)) = f_2(w; (x, \alpha)) + \epsilon \sum_i 2^{-i}\,(w[i] - 1)^2$ (13)

where $\epsilon = 0.01$. The objective is still convex and $(1+\epsilon)$-Lipschitz. Furthermore, since the additional term is strictly convex, $f_3(w; z)$ is strictly convex w.r.t. $w$, and so the empirical minimizer is unique.

Consider the same distribution over $Z$: $X = 0$ while the $\alpha[i]$ are i.i.d. uniform zero or one. The empirical minimizer is the minimizer of $\hat F_3(w)$ subject to the constraint $\|w\| \leq 1$. Identifying the solution to this constrained optimization problem is tricky, but fortunately not necessary. It is enough to show that the optimum $w_{UC} = \arg\min \hat F_3(w)$ of the unconstrained problem (without constraining $w \in \mathcal{W}$) has norm $\|w_{UC}\| \geq 1$. Notice that in the unconstrained problem, whenever $\alpha_i[j] = 0$ for all $i = 1 \ldots n$, only the second term of $f_3$ depends on $w[j]$, and we have $w_{UC}[j] = 1$. Since this happens a.s. for some coordinate $j$, we can conclude that the solution to the constrained optimization problem lies on the boundary of $\mathcal{W}$, i.e. $\|\hat w\| = 1$. But for such a solution we have

$F_3(\hat w) \geq \mathbb{E}_\alpha\left[\sqrt{\textstyle\sum_i \alpha[i]\,\hat w^2[i]}\right] \geq \mathbb{E}_\alpha\left[\textstyle\sum_i \alpha[i]\,\hat w^2[i]\right] = \tfrac{1}{2}\|\hat w\|^2 = \tfrac{1}{2}$,

while $F(w^\star) \leq F(0) = \epsilon$.

In conclusion, no matter how big the sample size is, the unique empirical minimizer $\hat w$ of the stochastic convex optimization problem (13) is a.s. much worse than the population optimum, $F_3(\hat w) \geq \tfrac{1}{2} > \epsilon \geq F(w^\star)$, and certainly does not converge to it.

5 Empirical Minimization of a Strongly Convex Objective

We saw that empirical minimization is not adequate for stochastic convex optimization, even if the objective is Lipschitz-continuous. We will now show that if the objective $f(w; z)$ is strongly convex w.r.t. $w$, the empirical minimizer does converge to the optimum.
This is despite the fact that even in the strongly convex case, we still might not have uniform convergence of $\hat F(w)$ to $F(w)$.

5.1 Empirical Minimizer Converges to Population Optimum

Theorem 6. Consider a stochastic convex optimization problem such that $f(w; z)$ is $\lambda$-strongly convex and $L$-Lipschitz with respect to $w \in \mathcal{W}$. Let $z_1, \ldots, z_n$ be an i.i.d. sample and let $\hat w$ be the empirical minimizer. Then, with probability at least $1-\delta$ over the sample, we have

$F(\hat w) - F(w^\star) \leq \frac{4L^2}{\delta \lambda n}$. (14)

Proof. We use a stability argument to prove the theorem. Denote by

$\hat F^{(i)}(w) = \frac{1}{n}\Big(\sum_{j \neq i} f(w, z_j) + f(w, z_i')\Big)$

the empirical average with $z_i$ replaced by an independently and identically drawn $z_i'$, and consider its minimizer $\hat w^{(i)} = \arg\min_w \hat F^{(i)}(w)$. We first use strong convexity and Lipschitz continuity to establish that empirical minimization is stable in the following sense: for all $z_1, \ldots, z_n, z_i', z \in \mathcal{Z}$,

$\big|f(\hat w^{(i)}, z) - f(\hat w, z)\big| \leq \beta$ (15)

with $\beta = \frac{4L^2}{\lambda n}$ (this is referred to as CV (replacement) stability [RMP05] and is similar to uniform stability [BE02]). We then show that (15) implies convergence of $F(\hat w)$ to $F(w^\star)$.

Claim 6.1. Under the conditions of Theorem 6, the stability bound (15) holds with $\beta = \frac{4L^2}{\lambda n}$.

Proof. We first calculate:

$\hat F(\hat w^{(i)}) - \hat F(\hat w) = \frac{1}{n}\big(f(\hat w^{(i)}, z_i) - f(\hat w, z_i)\big) + \frac{1}{n}\big(f(\hat w, z_i') - f(\hat w^{(i)}, z_i')\big) + \hat F^{(i)}(\hat w^{(i)}) - \hat F^{(i)}(\hat w) \leq \frac{1}{n}\big|f(\hat w^{(i)}, z_i) - f(\hat w, z_i)\big| + \frac{1}{n}\big|f(\hat w, z_i') - f(\hat w^{(i)}, z_i')\big| \leq \frac{2L}{n}\|\hat w^{(i)} - \hat w\|$ (16)

where the first inequality follows from the fact that $\hat w^{(i)}$ is the minimizer of $\hat F^{(i)}(w)$, and for the second inequality we use Lipschitz continuity. But from strong convexity of $\hat F(w)$ and the fact that $\hat w$ minimizes $\hat F(w)$, we also have

$\hat F(\hat w^{(i)}) \geq \hat F(\hat w) + \tfrac{\lambda}{2}\|\hat w^{(i)} - \hat w\|^2$. (17)

Combining (17) with (16), we get $\|\hat w^{(i)} - \hat w\| \leq \frac{4L}{\lambda n}$. Finally, from Lipschitz continuity, for any $z \in \mathcal{Z}$:

$\big|f(\hat w^{(i)}, z) - f(\hat w, z)\big| \leq L\,\|\hat w^{(i)} - \hat w\| \leq \frac{4L^2}{\lambda n}$. ∎

Claim 6.2. If the stability bound (15) holds, then for any $\delta > 0$, with probability $1-\delta$ over the sample,

$F(\hat w) - F(w^\star) \leq \frac{\beta}{\delta}$. (18)

(A similar claim that is not specific to $\hat w$, but yields only a $\beta + O(1/\sqrt{n})$ rate, appears in [RMP05, Theorem 4.4].)

Proof. Since the sample with $z_i$ and the sample with $z_i'$ are identically distributed, and $z_i'$ is independent of $z_i$, we have

$\mathbb{E}\big[F(\hat w)\big] = \mathbb{E}\big[F(\hat w^{(i)})\big] = \mathbb{E}\big[f(\hat w^{(i)}; z_i)\big]$

where the expectation is over $z_1, \ldots, z_n, z_i'$. This holds for all $i$, and so we can also write

$\mathbb{E}\big[F(\hat w)\big] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[f(\hat w^{(i)}; z_i)\big]$. (19)

We also have

$\mathbb{E}\big[\hat F(\hat w)\big] = \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n} f(\hat w; z_i)\Big] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[f(\hat w; z_i)\big]$. (20)

Combining (19) and (20) and using (15) yields (this is a modification of a derivation extracted from the proof of Theorem 12 in [BE02]):

$\mathbb{E}\big[F(\hat w) - \hat F(\hat w)\big] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[f(\hat w^{(i)}, z_i) - f(\hat w; z_i)\big] \leq \beta$.

We also have that $\mathbb{E}\big[F(w^\star)\big] = \mathbb{E}\big[\hat F(w^\star)\big] \geq \mathbb{E}\big[\hat F(\hat w)\big]$, where the equality is just equating an expectation to an expectation of an average, and the inequality follows from optimality of $\hat w$. We can therefore conclude:

$\mathbb{E}\big[F(\hat w) - F(w^\star)\big] \leq \mathbb{E}\big[F(\hat w) - \hat F(\hat w)\big] \leq \beta$. (21)

Using Markov's inequality yields (18). ∎

We do not know whether the dependence on $\delta$ in the above bound can be improved to $\log(1/\delta)$, matching the online-to-batch guarantee (8). Bousquet and Elisseeff [BE02] do provide arguments for a bound with a $\log(1/\delta)$ dependence, but unfortunately their approach can only yield a bound of $O\big(\sqrt{\log(1/\delta)/n}\big)$, which is much worse than Theorem 6 and equation (8) in terms of the dependence on $n$.

5.2 But Without Uniform Convergence!

We now turn to ask whether the convergence of the empirical minimizer in this case is a result of uniform convergence. Consider augmenting the objective function $f_2$ of Section 4 with a strongly convex term:

$f_{22}(w; (x, \alpha)) = f_2(w; (x, \alpha)) + \tfrac{1}{2}\|w\|^2$ (22)

The modified objective $f_{22}(\cdot;\cdot)$ is $1$-strongly convex and $2$-Lipschitz over the domain $\mathcal{W} = \{w : \|w\| \leq 1\}$, and thus satisfies the conditions of Theorem 6.

Consider the same distribution over $Z = (X, \alpha)$ used in Section 4: $X = 0$ and $\alpha$ is an i.i.d. sequence of uniform zero/one Bernoulli variables. Recall that almost surely we have a coordinate $j$ that is never observed, i.e. such that $\alpha_i[j] = 0$ for all $i$. Consider a vector $t e_j$ of magnitude $0 < t \leq 1$ in the direction of this coordinate. We have $\hat F_{22}(t e_j) = \tfrac{1}{2}t^2$ but $F_{22}(t e_j) = \tfrac{1}{2}t + \tfrac{1}{2}t^2$. Hence

$F_{22}(t e_j) - \hat F_{22}(t e_j) = t/2$.

In particular, we can set $t = 1$ and establish

$\sup_w \left(F_{22}(w) - \hat F_{22}(w)\right) \geq \tfrac{1}{2}$

regardless of the sample size $n$. We see then that the empirical averages $\hat F_{22}(w)$ do not converge uniformly to their expectations, even as the sample size increases.
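The two sides of this section can be seen side by side in a finite-dimensional simulation of $f_{22}$: the empirical minimizer $w = 0$ coincides with the population optimum (as Theorem 6 promises), yet along a never-observed coordinate the empirical and population values stay $1/2$ apart for every sample size. Dimension and seed are arbitrary illustration choices.

```python
import numpy as np

# f_22(w; (0, alpha)) = ||alpha ⊙ w|| + ||w||^2 / 2. Both terms are
# nonnegative and vanish at w = 0, so w = 0 is the empirical minimizer and
# also the population optimum; strong convexity makes it unique.
rng = np.random.default_rng(1)
n, d = 10, 2 ** 14
alphas = rng.integers(0, 2, size=(n, d))
j = np.flatnonzero(alphas.sum(axis=0) == 0)[0]    # a never-observed coordinate

def hat_F22(w):
    """Empirical objective on the sample."""
    return float(np.mean(np.linalg.norm(alphas * w, axis=1)) + 0.5 * w @ w)

def F22_at_t_e_j(t):
    """Population value at t*e_j in closed form: E[alpha[j]]*t + t^2/2."""
    return 0.5 * t + 0.5 * t ** 2

e_j = np.zeros(d)
e_j[j] = 1.0
gap = F22_at_t_e_j(1.0) - hat_F22(e_j)   # = 1 - 1/2 = 1/2, independent of n
```

So convergence of the empirical minimizer here is genuinely not a consequence of uniform convergence: the gap at $e_j$ does not shrink with $n$.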
5.3 Not Even Local Uniform Convergence

For any $\epsilon > 0$, consider limiting our attention only to predictors that are close to being population optimal:

$\mathcal{W}_\epsilon = \{w \in \mathcal{W} : F_{22}(w) \leq F_{22}(w^\star) + \epsilon\}$.

Setting $t = \epsilon$ we have $t e_j \in \mathcal{W}_\epsilon$ (focusing for convenience on $\epsilon < 1$), and so:

$\sup_{w \in \mathcal{W}_\epsilon} \left(F_{22}(w) - \hat F_{22}(w)\right) \geq \frac{\epsilon}{2}$ (23)

regardless of the sample size. And so, even in an arbitrarily small neighborhood of the optimum, the empirical values $\hat F_{22}(w)$ do not converge uniformly to their expected values, even as $n \to \infty$. This is in sharp contrast to essentially all other results on stochastic optimization and learning that we are aware of.

5.4 Bounding Population Sub-Optimality in Terms of Empirical Sub-Optimality

A practical question related to uniform convergence is whether we can obtain a uniform bound on the population sub-optimality in terms of the empirical sub-optimality, as in Theorem 2. We first note that, merely because the empirical objective $\hat F$ is strongly convex, any approximate empirical minimizer must be close to $\hat w$; and because the expected objective $F$ is Lipschitz-continuous, any vector close to $\hat w$ cannot have a much worse value than $\hat w$. We therefore have, under the conditions of Theorem 6, that with probability at least $1-\delta$, for all $w \in \mathcal{W}$:

$F(w) - F(w^\star) \leq 2L\sqrt{\frac{\hat F(w) - \hat F(\hat w)}{\lambda}} + \frac{4L^2}{\delta \lambda n}$ (24)

It is important to emphasize that this is an immediate consequence of (14) and does not involve any further stochastic properties of $\hat F$ or $F$. Although this uniform inequality does allow us to bound the population sub-optimality in terms of the empirical sub-optimality, the empirical sub-optimality must be quadratic in the desired population sub-optimality. Compare this dependence with the more favorable linear dependence of Theorem 2. Unfortunately, as we show next, this is the best that can be ensured.

Consider the objective $f_{22}$ and the same distribution over $Z = (X, \alpha)$ discussed above, and recall that $t e_j$ is a vector of magnitude $t$ along a coordinate $j$ s.t. $\alpha_i[j] = 0$ for all $i$. We have $\hat F_{22}(t e_j) - \hat F_{22}(\hat w) = \tfrac{1}{2}t^2$, and so, setting $t = \sqrt{2\epsilon}$, we get an $\epsilon$-empirically-suboptimal vector with population sub-optimality

$F_{22}(t e_j) - F_{22}(0) = \tfrac{1}{2}t + \tfrac{1}{2}t^2 = \sqrt{\epsilon/2} + \epsilon$.
This establishes that the dependence on $\epsilon$ in the first term of (24) is tight, and that the situation is qualitatively different than in the generalized linear case.

5.5 Contradiction to Vapnik?

At this point, a reader familiar with Vapnik's work on necessary and sufficient conditions for consistency of empirical minimization (i.e. conditions for $F(\hat w) \to F(w^\star)$) might be confused. In seeking such necessary and sufficient conditions [Vap98, Chapter 3], Vapnik excludes certain consistent settings where the consistency is so-called "trivial". The main example of an excluded setting is one in which there is one hypothesis $w_0$ that dominates all others, i.e. $f(w_0; z) < f(w; z)$ for all $w \in \mathcal{W}$ and all $z \in \mathcal{Z}$ [Vap98, Figure 3.2]. When this is the case, empirical minimization will be consistent regardless of the behavior of $\hat F(w)$ for $w \neq w_0$. In order to exclude such trivial cases, Vapnik defines strict (aka non-trivial) consistency of empirical minimization as (in our notation):

$\inf_{w : F(w) \geq c} \hat F(w) \;\xrightarrow{P}\; \inf_{w : F(w) \geq c} F(w) \quad \text{for all } c \in \mathbb{R}$, (25)

where the convergence is in probability. This condition indeed ensures that $F(\hat w) \xrightarrow{P} F(w^\star)$. Vapnik's "Key Theorem on Learning Theory" [Vap98, Theorem 3.1] then states that strict consistency of empirical minimization is equivalent to one-sided uniform convergence, one-sided meaning requiring only $\sup_w \big(F(w) - \hat F(w)\big) \xrightarrow{P} 0$ rather than $\sup_w \big|F(w) - \hat F(w)\big| \xrightarrow{P} 0$. Note that the analysis above shows the lack of such one-sided uniform convergence.

In the example presented above, even though Theorem 6 establishes $F(\hat w) \xrightarrow{P} F(w^\star)$, the consistency isn't strict by the definition above. To see this, for any $c > 0$, consider the vector $t e_j$ (where $\alpha_i[j] = 0$ for all $i$) with $t = 2c$. We have $F_{22}(t e_j) = \tfrac{1}{2}t + \tfrac{1}{2}t^2 > c$ but $\hat F_{22}(t e_j) = \tfrac{1}{2}t^2 = 2c^2$. Focusing on $c < 1/2$, we get

$\inf_{w : F_{22}(w) \geq c} \hat F_{22}(w) \leq 2c^2 < c$ (26)

almost surely for any sample size, violating the strict consistency requirement (25), since the right-hand side of (25) is at least $c$ while the left-hand side of (26) stays strictly below it. Consistency of empirical minimization still holds here, but it is not strict consistency.

We emphasize that stochastic convex optimization is far from trivial in the sense above: there is no dominating hypothesis that will always be selected. Although for convenience of analysis we took $X = 0$, one should think of situations in which $X$ is stochastic with an unknown distribution. We see then that there is no mathematical contradiction here to Vapnik's Key Theorem.
Rather, we see a demonstration that strict consistency is too strict a requirement, and that interesting, non-trivial learning problems might admit non-strict consistency, which is not equivalent to one-sided uniform convergence. We see that uniform convergence is a sufficient, but not at all necessary, condition for consistency of empirical minimization in non-trivial settings.

6 Regularization

We now return to the case where $f(w; z)$ is Lipschitz and convex w.r.t. $w$, but not strongly convex. As we saw, empirical minimization may fail in this case, despite the guaranteed success of an online approach. Our goal in this section is to underscore a more direct, non-procedural, optimization criterion for stochastic optimization. To do so, we define a regularized empirical minimization problem:

$\hat w_\lambda = \arg\min_{w \in \mathcal{W}} \left(\frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} f(w, z_i)\right)$, (27)

where $\lambda$ is a parameter that will be determined later. The following theorem establishes that the minimizer of (27) is a good solution to the stochastic convex optimization problem:

Theorem 7. Let $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be such that $\mathcal{W}$ is bounded by $B$ and $f(w; z)$ is convex and $L$-Lipschitz with respect to $w$. Let $z_1, \ldots, z_n$ be an i.i.d. sample and let $\hat w_\lambda$ be the minimizer of (27) with $\lambda = \sqrt{\frac{8L^2}{\delta B^2 n}}$. Then, with probability at least $1-\delta$, we have

$F(\hat w_\lambda) - F(w^\star) \leq \sqrt{\frac{8L^2 B^2}{\delta n}}$.

Proof. Let $r(w, z) = \frac{\lambda}{2}\|w\|^2 + f(w, z)$ and let $R(w) = \mathbb{E}_z[r(w, z)]$. Note that $\hat w_\lambda$ is the empirical minimizer for the $\lambda$-strongly convex stochastic optimization problem defined by $r(w; z)$. From Theorem 6 we therefore have, with probability at least $1-\delta$:

$\frac{\lambda}{2}\|\hat w_\lambda\|^2 + F(\hat w_\lambda) = R(\hat w_\lambda) \leq \inf_w R(w) + \frac{4L^2}{\delta\lambda n} \leq R(w^\star) + \frac{4L^2}{\delta\lambda n} = \frac{\lambda}{2}\|w^\star\|^2 + F(w^\star) + \frac{4L^2}{\delta\lambda n}$.

Hence

$F(\hat w_\lambda) - F(w^\star) \leq \frac{\lambda}{2}\|w^\star\|^2 + \frac{4L^2}{\delta\lambda n}$.

Bounding $\|w^\star\|^2 \leq B^2$ and substituting the value of $\lambda$ yields the desired bound. ∎

From the above theorem and the discussion in Section 4, we conclude that regularization is an essential tool for convex stochastic optimization. It is interesting to contrast this with the online learning algorithm of Zinkevich [Z03]. Seemingly, the online approach of Zinkevich does not rely on regularization. However, a more careful look reveals an underlying regularization also in the online technique. Indeed, Shalev-Shwartz [Sha07] showed that Zinkevich's online learning algorithm can be viewed as approximate coordinate ascent optimization of the dual of the regularized problem (27).
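On the $f_2$ counterexample of Section 4, the effect of the regularizer in (27) can be checked directly: unregularized empirical minimization cannot distinguish the population optimum $w = 0$ from the bad point $e_j$, while any positive $\lambda$ breaks the tie in favor of $0$. The dimension, seed, and value of $\lambda$ below are arbitrary illustration choices.

```python
import numpy as np

# f_2 with X = 0 and Bernoulli alpha: both w = 0 and the unseen-coordinate
# basis vector e_j achieve empirical value 0, but only w = 0 is population
# optimal. The squared-norm regularizer of (27) strictly prefers w = 0.
rng = np.random.default_rng(2)
n, d, lam = 10, 2 ** 14, 0.05
alphas = rng.integers(0, 2, size=(n, d))
j = np.flatnonzero(alphas.sum(axis=0) == 0)[0]    # a never-observed coordinate
e_j = np.zeros(d)
e_j[j] = 1.0

def hat_F2(w):
    """Empirical objective (1/n) sum_i ||alpha_i ⊙ w||."""
    return float(np.mean(np.linalg.norm(alphas * w, axis=1)))

def reg_obj(w):
    """Regularized empirical objective of (27)."""
    return 0.5 * lam * float(w @ w) + hat_F2(w)

zero = np.zeros(d)
tie = (hat_F2(zero) == hat_F2(e_j) == 0.0)   # ERM is indifferent
prefers_zero = reg_obj(zero) < reg_obj(e_j)  # regularization is not
```

This is only a tie-breaking illustration, of course; Theorem 7 is the quantitative statement, with $\lambda$ trading the regularization bias $\frac{\lambda}{2}\|w^\star\|^2$ against the stability term $\frac{4L^2}{\delta\lambda n}$.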
Furthermore, it is also possible to obtain the same online regret bound using a Follow-The-Regularized-Leader approach, which at each iteration directly solves the regularized minimization problem (27) on $z_1, \ldots, z_{i-1}$. The key, then, seems to be regularization, rather than a procedural online-versus-global-minimization distinction.

6.1 Regularization vs Constraints

The role of regularization here is very different than in familiar settings such as $\ell_2$ regularization in SVMs and $\ell_1$ regularization in LASSO. In those settings, regularization serves to constrain our domain to a low-complexity domain (e.g. low-norm predictors), where we rely on uniform convergence. In fact, almost all learning guarantees for such settings that we are aware of can be expressed in terms of some sort of uniform convergence. And, as we mentioned, learnability under the standard supervised learning model is in fact equivalent to a uniform convergence property.

In our case, constraining the norm of $w$ does not ensure uniform convergence. Consider the example $f_2$ of Section 4. Even over a restricted domain $\mathcal{W}_r = \{w : \|w\| \leq r\}$, for arbitrarily small $r > 0$, the empirical averages $\hat F(w)$ do not uniformly converge to $F(w)$, and

$\Pr\left(\lim_{n} \sup_{\|w\| \leq r} \big|\hat F(w) - F(w)\big| > 0\right) = 1$.

Furthermore, consider replacing the regularization term $\frac{\lambda}{2}\|w\|^2$ with a constraint on the norm of $w$, namely solving the problem

$\hat w_r = \arg\min_{\|w\| \leq r} \hat F(w)$. (28)

As we show below, we cannot solve the stochastic optimization problem by setting $r$ in a distribution-independent way, i.e. without knowing the solution. To see this, note that when $X = 0$ a.s. we must have $r \to 0$ to ensure $F(\hat w_r) \to F(w^\star)$. However, if $X = e_1$ a.s., we must set $r \geq 1$. No single constraint will work for all distributions over $Z = (X, \alpha)$! This sharply contrasts with traditional uses of regularization, where learning guarantees are actually typically stated in terms of a constraint on the norm rather than in terms of a parameter such as $\lambda$, and adding a regularization term of the form $\frac{\lambda}{2}\|w\|^2$ is viewed as a proxy for bounding the norm $\|w\|$.

Figure 1: Lipschitz-continuous convex problems (triangle) are all learnable, but not necessarily using empirical minimization. Lipschitz-continuous strongly convex problems (dotted rectangle) are all learnable with empirical minimization, but uniform convergence might not hold. For bounded generalized linear problems (starred rectangle), uniform convergence always holds. Our two separating examples, $f_2$ and $f_{22}$, are also indicated.

Figure 2: Relationship between different properties of stochastic optimization problems (uniform convergence ⇒ learnable by empirical minimization ⇒ learnable; the converses fail, as witnessed by $f_{22}$ and $f_2$, but all three hold and coincide in supervised learning).
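The "no single constraint works" claim of Section 6.1 can be made quantitative with the closed-form bounds from the text: under $X = 0$ the radius-$r$ ball admits a constrained empirical minimizer with population suboptimality $r/2$, while under $X = e_1$ every point of the ball is at least $(1-r)/2$ suboptimal. The grid search below is an illustrative sketch of this trade-off, not a construction from the paper.

```python
# Closed-form population suboptimality that the constrained problem (28)
# can be forced to suffer under the two distributions of Section 6.1.

def subopt_X_zero(r):
    # X = 0: a never-seen coordinate j lets constrained ERM return r * e_j,
    # whose population value is E[alpha[j]] * r = r / 2, while F(w*) = 0.
    return r / 2.0

def subopt_X_e1(r):
    # X = e_1: every w with ||w|| <= r < 1 has
    # F(w) >= E[alpha[1] * |w[1] - 1|] >= (1 - r) / 2, while F(e_1) = 0.
    return max(0.0, (1.0 - r) / 2.0)

# Whatever radius we commit to in advance, one of the two distributions
# suffers suboptimality at least 1/4 (the best trade-off is at r = 1/2).
worst = min(max(subopt_X_zero(r), subopt_X_e1(r))
            for r in (k / 100.0 for k in range(101)))
```

So no distribution-independent choice of $r$ makes (28) a consistent rule, in contrast to the regularization parameter $\lambda$ of Theorem 7, which is set from $B$, $L$, $n$, $\delta$ alone.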
7 Summary

Following the work of Zinkevich [Z03], we expected to be able to generalize well-established results on stochastic optimization of linear functions also to the more general Lipschitz-convex case. We discovered a complex and unexpected situation, where strong convexity and regularization play a key role, and ultimately did reach an understanding of stochastic convex optimization that does not rely on online techniques (Figure 1).

For stochastic objectives that arise from supervised prediction problems, it is well known that learnability, i.e. solvability of the stochastic optimization problem, is equivalent to uniform convergence, and so whenever the problem is learnable, it is learnable using empirical minimization [ABCH97]. Many might think that this principle, namely that a problem is learnable iff it is learnable using empirical minimization, extends also to the General Setting of Learning [Vap95], which includes the stochastic convex optimization problem studied here. However, we demonstrated stochastic optimization problems in which these equivalences do not hold. There is no contradiction, since stochastic optimization problems that arise from supervised learning have a restricted structure, and in particular the examples we study are not among such problems. In fact, for a reasonable loss function, in order to make $f(w; (x, y)) = \ell(\mathrm{pred}(w, x), y)$ convex for both positive and negative labels, we must essentially make the prediction function $\mathrm{pred}(w, x)$ both convex and concave in $w$, i.e. linear. And so the only stochastic or online convex optimization problems that correspond to supervised problems are generalized linear problems.

To summarize, although there is no contradiction to the work of Vapnik [Vap95] or of Alon et al [ABCH97], we see that learning in the General Setting is more complex than we perhaps appreciate. Empirical minimization might be consistent without local uniform convergence, and, more surprisingly, learning might be possible, but not with empirical minimization (Figure 2).

9 References

[ABCH97] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615-631, 1997.
[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499-526, 2002.
[CCG04] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, September 2004.
[HKKA06] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
[KT08] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In NIPS, 2008.
[Pol84] D. Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.
[RMP05] S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397-419, 2005.
[Sha07] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
[SSS08] K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 22, 2008.
[Vap95] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[Vap98] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[VG05] Jean-Louis Verger-Gaugry. Covering a ball with smaller equal balls in R^n. Discrete Comput. Geom., 33:143-155, 2005.
[vLB04] Ulrike von Luxburg and Olivier Bousquet. Distance based classification with Lipschitz functions. J. Mach. Learn. Res., 5:669-695, 2004.
[Z03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.


AN UPPER BOUND FOR THE PERMANENT VERSUS DETERMINANT PROBLEM BRUNO GRENET AN UPPER BOUND FOR THE PERMANENT VERSUS DETERMINANT PROBLEM BRUNO GRENET Abstract. The Permaet versus Determat problem s the followg: Gve a matrx X of determates over a feld of characterstc dfferet from

More information

THE ROYAL STATISTICAL SOCIETY 2016 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 5

THE ROYAL STATISTICAL SOCIETY 2016 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 5 THE ROYAL STATISTICAL SOCIETY 06 EAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 5 The Socety s provdg these solutos to assst cadtes preparg for the examatos 07. The solutos are teded as learg ads ad should

More information

Binary classification: Support Vector Machines

Binary classification: Support Vector Machines CS 57 Itroducto to AI Lecture 6 Bar classfcato: Support Vector Maches Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square CS 57 Itro to AI Supervsed learg Data: D { D, D,.., D} a set of eamples D, (,,,,,

More information