Supplement for "Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence"

Yi Xu 1, Qihang Lin 2, Tianbao Yang 1

1 Department of Computer Science, The University of Iowa, Iowa City, IA, USA. 2 Department of Management Sciences, The University of Iowa, Iowa City, IA, USA. Correspondence to: Tianbao Yang <tianbao-yang@uiowa.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

1. Proof of Theorem 1

Theorem 1. Suppose Assumption 1 holds and $F(w)$ obeys the LGC (6). Given $\delta\in(0,1)$, let $\tilde\delta=\delta/K$, $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$, $D_1\geq\frac{2c\epsilon_0}{(2\epsilon)^{1-\theta}}$, and let $t$ be the smallest integer such that $t\geq\max\{9,1728\log(1/\tilde\delta)\}\frac{G^2D_1^2}{\epsilon_0^2}$. Then ASSG-c guarantees that, with probability $1-\delta$, $F(w_K)-F_*\leq 2\epsilon$. As a result, the iteration complexity of ASSG-c for achieving a $2\epsilon$-optimal solution with high probability $1-\delta$ is $O\big(c^2G^2\lceil\log_2(\epsilon_0/\epsilon)\rceil\log(1/\tilde\delta)/\epsilon^{2(1-\theta)}\big)$ provided $D_1=O\big(\frac{c\epsilon_0}{\epsilon^{1-\theta}}\big)$.

Proof. Let $w_k^\dagger$ denote the closest point to $w_k$ in $\mathcal{S}_\epsilon$. Define $\epsilon_k=\epsilon_0/2^k$ and $D_k=D_1/2^k$. Note that $D_k\geq c\epsilon_{k-1}^\theta$ and $\eta_k=\epsilon_k/(3G^2)$. We will show by induction that
$$F(w_k)-F_*\leq\epsilon_k+\epsilon$$
for $k=0,1,\ldots$ with a high probability, which leads to our conclusion when $k=K$. The inequality holds obviously for $k=0$. Conditioned on $F(w_{k-1})-F_*\leq\epsilon_{k-1}+\epsilon$, we will show that $F(w_k)-F_*\leq\epsilon_k+\epsilon$ with a high probability. By Lemma 1, we have
$$\|w_{k-1}^\dagger-w_{k-1}\|\leq c\big(F(w_{k-1})-F(w_{k-1}^\dagger)\big)^\theta\leq c\epsilon_{k-1}^\theta\leq D_k.\qquad(1)$$
We apply Lemma 2 to the $k$-th stage of Algorithm 1, conditioned on the randomness in previous stages. With probability $1-\tilde\delta$ we have
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{\eta_kG^2}{2}+\frac{\|w_{k-1}-w_{k-1}^\dagger\|^2}{2\eta_k t}+\frac{4GD_k\sqrt{3\log(1/\tilde\delta)}}{\sqrt{t}}.\qquad(2)$$
We now consider two cases for $w_{k-1}$. First, assume $F(w_{k-1})-F_*\leq\epsilon$, i.e. $w_{k-1}\in\mathcal{S}_\epsilon$. Then $w_{k-1}^\dagger=w_{k-1}$ and
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{\eta_kG^2}{2}+\frac{4GD_k\sqrt{3\log(1/\tilde\delta)}}{\sqrt{t}}\leq\frac{\epsilon_k}{6}+\frac{\epsilon_k}{6}=\frac{\epsilon_k}{3},$$
the second inequality using the facts that $\eta_k=\epsilon_k/(3G^2)$ and $t\geq 1728\log(1/\tilde\delta)\frac{G^2D_k^2}{\epsilon_k^2}$. As a result,
$$F(w_k)-F_*\leq F(w_{k-1}^\dagger)-F_*+\frac{\epsilon_k}{3}\leq\epsilon+\epsilon_k.$$
Next, we consider $F(w_{k-1})-F_*>\epsilon$, i.e. $w_{k-1}\notin\mathcal{S}_\epsilon$. Then $F(w_{k-1}^\dagger)-F_*=\epsilon$. Combining (1) and (2), we get
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{\eta_kG^2}{2}+\frac{D_k^2}{2\eta_k t}+\frac{4GD_k\sqrt{3\log(1/\tilde\delta)}}{\sqrt{t}}.$$
Since $\eta_k=\epsilon_k/(3G^2)$ and $t\geq\max\{9,1728\log(1/\tilde\delta)\}\frac{G^2D_k^2}{\epsilon_k^2}$, each term in the RHS of the above inequality is bounded by $\epsilon_k/3$. As a result,
$F(w_k)-F(w_{k-1}^\dagger)\leq\epsilon_k$, and hence $F(w_k)-F_*\leq\epsilon_k+\epsilon$, with probability $1-\tilde\delta$. Therefore by induction, with probability at least $(1-\tilde\delta)^K$ we have $F(w_K)-F_*\leq\epsilon_K+\epsilon\leq 2\epsilon$. Since $\tilde\delta=\delta/K$, then $(1-\tilde\delta)^K\geq 1-\delta$, and we complete the proof.

2. Proof of Lemma 3

Lemma 3. For any $t$, we have $\|\hat w-w_t\|\leq 3\gamma G$ and $\|w_t-w_1\|\leq 2\gamma G$.

Proof. By the optimality of $\hat w$, we have for any $w\in\mathcal{K}$
$$\Big(\partial F(\hat w)+\frac{\hat w-w_1}{\gamma}\Big)^\top(w-\hat w)\geq 0.$$
Letting $w=w_1$, we have
$$\partial F(\hat w)^\top(w_1-\hat w)\geq\frac{\|\hat w-w_1\|^2}{\gamma}.$$
Because $\|\partial F(\hat w)\|\leq G$ due to $\|\partial f(w;\xi)\|\leq G$, then $\|\hat w-w_1\|\leq\gamma G$. Next, we bound $\|w_t-w_1\|$. According to the update of $w_{t+1}$, we have
$$\|w_{t+1}-w_1\|=\big\|-\eta\,\partial f(w_t;\xi_t)+(1-\eta/\gamma)(w_t-w_1)\big\|.$$
We prove $\|w_t-w_1\|\leq 2\gamma G$ by induction. First, we consider $t=2$; then
$$\|w_2-w_1\|\leq\eta\|\partial f(w_1;\xi_1)\|\leq\eta G\leq 2\gamma G.$$
Then we consider any $t\geq 2$, where $\eta/\gamma\leq 1$. Then
$$\|w_{t+1}-w_1\|\leq\eta\|\partial f(w_t;\xi_t)\|+(1-\eta/\gamma)\|w_t-w_1\|\leq\eta G+2\gamma G-2\eta G\leq 2\gamma G.$$
Therefore
$$\|\hat w-w_t\|\leq\|\hat w-w_1\|+\|w_1-w_t\|\leq\gamma G+2\gamma G=3\gamma G.$$

3. Proof of Theorem 2

Theorem 2. Suppose Assumption 1 holds and $F(w)$ obeys the LGC (6). Given $\delta\in(0,1/e)$, let $\tilde\delta=\delta/K$, $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$, $\gamma_k=2c^2\epsilon_{k-1}^{2\theta}/\epsilon_k$ where $\epsilon_k=\epsilon_0/2^k$, and let $t$ be the smallest integer such that $t\geq\max\big\{3,\max_{k\leq K}\frac{136\gamma_kG^2(1+\log(4\log t/\tilde\delta)+\log t)}{\epsilon_k}\big\}$. Then ASSG-r guarantees that, with probability $1-\delta$, $F(w_K)-F_*\leq 2\epsilon$. As a result, the iteration complexity of ASSG-r for achieving a $2\epsilon$-optimal solution with high probability $1-\delta$ is $O\big(c^2G^2\lceil\log_2(\epsilon_0/\epsilon)\rceil\log(4\log t/\tilde\delta)/\epsilon^{2(1-\theta)}\big)$ provided $\gamma_K=O(c^2\epsilon^{2\theta}/\epsilon)$.

To prove the theorem, we first show the following high-probability convergence bound.

Lemma 4. Given $w_1\in\mathcal{K}$, apply $t$ iterations of (9). For any fixed $w\in\mathcal{K}$, $\delta\in(0,1/e)$, and $t\geq 3$, with probability at least $1-\delta$, the following inequality holds:
$$F(\hat w_t)-F(w)\leq\frac{\|w-w_1\|^2}{2\gamma}+\frac{34\gamma G^2(1+\log t+\log(4\log t/\delta))}{t},$$
where $\hat w_t=\sum_{\tau=1}^t w_\tau/t$.

Proof. Let $g_\tau=\partial f(w_\tau;\xi_\tau)+(w_\tau-w_1)/\gamma$ and $\partial\hat F(w_\tau)=\partial F(w_\tau)+(w_\tau-w_1)/\gamma$. Note that $\|g_\tau\|\leq 3G$ by Lemma 3. According to the standard analysis for the stochastic gradient method, we have
$$g_\tau^\top(w_\tau-\hat w)\leq\frac{\|w_\tau-\hat w\|^2-\|w_{\tau+1}-\hat w\|^2}{2\eta_\tau}+\frac{\eta_\tau}{2}\|g_\tau\|^2.$$
Then
$$\partial\hat F(w_\tau)^\top(w_\tau-\hat w)\leq\frac{\|w_\tau-\hat w\|^2-\|w_{\tau+1}-\hat w\|^2}{2\eta_\tau}+\frac{\eta_\tau}{2}\|g_\tau\|^2+\big(\partial\hat F(w_\tau)-g_\tau\big)^\top(w_\tau-\hat w).$$
By the strong convexity of $\hat F$, we have
$$\hat F(\hat w)\geq\hat F(w_\tau)+\partial\hat F(w_\tau)^\top(\hat w-w_\tau)+\frac{\|\hat w-w_\tau\|^2}{2\gamma}.$$
Then
$$\hat F(w_\tau)-\hat F(\hat w)\leq\frac{\|w_\tau-\hat w\|^2-\|w_{\tau+1}-\hat w\|^2}{2\eta_\tau}+\frac{\eta_\tau}{2}\|g_\tau\|^2-\frac{\|\hat w-w_\tau\|^2}{2\gamma}+\zeta_\tau,$$
where $\zeta_\tau=\big(\partial F(w_\tau)-\partial f(w_\tau;\xi_\tau)\big)^\top(w_\tau-\hat w)$. By summing the above inequalities across $\tau=1,\ldots,t$, we have
$$\sum_{\tau=1}^t\big(\hat F(w_\tau)-\hat F(\hat w)\big)\leq\sum_{\tau=1}^t\Big(\frac{1}{2\eta_\tau}-\frac{1}{2\eta_{\tau-1}}-\frac{1}{2\gamma}\Big)\|w_\tau-\hat w\|^2+\frac{9G^2}{2}\sum_{\tau=1}^t\eta_\tau+\sum_{\tau=1}^t\zeta_\tau\leq-\frac{1}{4\gamma}\sum_{\tau=1}^t\|w_\tau-\hat w\|^2+9\gamma G^2(1+\log t)+\sum_{\tau=1}^t\zeta_\tau,\qquad(3)$$
where the last inequality uses $\eta_\tau=2\gamma/\tau$ (with $1/\eta_0:=0$). Next, we bound the RHS of the above inequality. We need the following lemma.
Lemma 5 (Lemma 3 of Kakade & Tewari (2008)). Suppose $X_1,\ldots,X_t$ is a martingale difference sequence with $|X_\tau|\leq b$. Let $\mathrm{Var}_\tau X_\tau=\mathrm{Var}(X_\tau\mid X_1,\ldots,X_{\tau-1})$, where Var denotes the variance. Let $V_t=\sum_{\tau=1}^t\mathrm{Var}_\tau X_\tau$ be the sum of conditional variances of the $X_\tau$'s, and let $\sigma_t=\sqrt{V_t}$. Then we have, for any $\delta<1/e$ and $t\geq 3$,
$$\Pr\Big(\sum_{\tau=1}^tX_\tau>\max\{2\sigma_t,\,3b\sqrt{\log(1/\delta)}\}\sqrt{\log(1/\delta)}\Big)\leq 4\delta\log t.$$

We now proceed with the proof of Lemma 4. Let $X_\tau=\zeta_\tau=\big(\partial F(w_\tau)-\partial f(w_\tau;\xi_\tau)\big)^\top(w_\tau-\hat w)$ and $D_t^2=\sum_{\tau=1}^t\|w_\tau-\hat w\|^2$. Then $X_1,\ldots,X_t$ is a martingale difference sequence. Let $\bar D=3\gamma G$; by Lemma 3, $|X_\tau|\leq 2G\|w_\tau-\hat w\|\leq 2G\bar D$. By Lemma 5 (applied with $\delta$ replaced by $\delta/(4\log t)$), for any $\delta<1/e$ and $t\geq 3$, with probability $1-\delta$ we have
$$\sum_{\tau=1}^tX_\tau\leq\max\Big\{2\sigma_t,\,6G\bar D\sqrt{\log(4\log t/\delta)}\Big\}\sqrt{\log(4\log t/\delta)}.\qquad(5)$$
Note that
$$\mathrm{Var}_\tau X_\tau\leq\mathrm{E}_\tau\big[|X_\tau|^2\big]\leq 4G^2\|w_\tau-\hat w\|^2,\qquad V_t\leq 4G^2D_t^2.$$
As a result, with probability $1-\delta$,
$$\sum_{\tau=1}^tX_\tau\leq 4G\sqrt{\log(4\log t/\delta)}\,D_t+6G\bar D\log(4\log t/\delta)\leq 16\gamma G^2\log(4\log t/\delta)+\frac{D_t^2}{4\gamma}+6G\bar D\log(4\log t/\delta),$$
where the last step uses $4G\sqrt{\log(\cdot)}\,D_t\leq 16\gamma G^2\log(\cdot)+D_t^2/(4\gamma)$. As a result, with probability $1-\delta$,
$$\sum_{\tau=1}^tX_\tau-\frac{1}{4\gamma}\sum_{\tau=1}^t\|w_\tau-\hat w\|^2\leq 16\gamma G^2\log(4\log t/\delta)+18\gamma G^2\log(4\log t/\delta)=34\gamma G^2\log(4\log t/\delta).$$
Thus, combining with (3) and the convexity of $\hat F$, with probability $1-\delta$,
$$\hat F(\hat w_t)-\hat F(\hat w)\leq\frac{34\gamma G^2\log(4\log t/\delta)+9\gamma G^2(1+\log t)}{t}\leq\frac{34\gamma G^2(1+\log t+\log(4\log t/\delta))}{t}.$$
Using the facts that $F(\hat w_t)\leq\hat F(\hat w_t)$, $\hat F(\hat w)\leq\hat F(w)$, and $\hat F(w)=F(w)+\frac{\|w-w_1\|^2}{2\gamma}$, we have
$$F(\hat w_t)-F(w)-\frac{\|w-w_1\|^2}{2\gamma}\leq\frac{34\gamma G^2(1+\log t+\log(4\log t/\delta))}{t}.$$

Next, let us start to prove Theorem 2.

Proof of Theorem 2. Let $w_{k-1}^\dagger$ denote the closest point to $w_{k-1}$ in the $\epsilon$-sublevel set. Define $\epsilon_k=\epsilon_0/2^k$. First, we note that $\gamma_k\geq 2c^2\epsilon_{k-1}^{2\theta}/\epsilon_k$. We will show by induction that $F(w_k)-F_*\leq\epsilon_k+\epsilon$ for $k=0,1,\ldots$ with a high probability, which leads to our conclusion when $k=K$. The inequality holds obviously for $k=0$. Conditioned on $F(w_{k-1})-F_*\leq\epsilon_{k-1}+\epsilon$, we will show that $F(w_k)-F_*\leq\epsilon_k+\epsilon$ with a high probability. We apply Lemma 4 to the $k$-th stage of Algorithm 2, conditioned on the randomness in previous stages. With probability at least $1-\tilde\delta$ we have
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{\|w_{k-1}^\dagger-w_{k-1}\|^2}{2\gamma_k}+\frac{34\gamma_kG^2(1+\log t+\log(4\log t/\tilde\delta))}{t}.\qquad(6)$$
We now consider two cases for $w_{k-1}$. First, assume $F(w_{k-1})-F_*\leq\epsilon$, i.e. $w_{k-1}\in\mathcal{S}_\epsilon$. Then we have $w_{k-1}^\dagger=w_{k-1}$ and
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{34\gamma_kG^2(1+\log t+\log(4\log t/\tilde\delta))}{t}\leq\frac{\epsilon_k}{4},$$
the last inequality using the fact that $t\geq\frac{136\gamma_kG^2(1+\log(4\log t/\tilde\delta)+\log t)}{\epsilon_k}$. As a result,
$$F(w_k)-F_*\leq F(w_{k-1}^\dagger)-F_*+\frac{\epsilon_k}{4}\leq\epsilon+\epsilon_k.$$
Next, we consider $F(w_{k-1})-F_*>\epsilon$, i.e. $w_{k-1}\notin\mathcal{S}_\epsilon$. Then we have $F(w_{k-1}^\dagger)-F_*=\epsilon$. Similar to the proof of Theorem 1, by Lemma 1 we have
$$\|w_{k-1}^\dagger-w_{k-1}\|^2\leq c^2\epsilon_{k-1}^{2\theta}.\qquad(7)$$
Combining (6) and (7), we have
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{c^2\epsilon_{k-1}^{2\theta}}{2\gamma_k}+\frac{34\gamma_kG^2(1+\log t+\log(4\log t/\tilde\delta))}{t}.$$
Using the facts that $\gamma_k\geq 2c^2\epsilon_{k-1}^{2\theta}/\epsilon_k$ and $t\geq\frac{136\gamma_kG^2(1+\log t+\log(4\log t/\tilde\delta))}{\epsilon_k}$, we get
$$F(w_k)-F(w_{k-1}^\dagger)\leq\frac{\epsilon_k}{4}+\frac{\epsilon_k}{4}=\frac{\epsilon_k}{2},$$
which together with the fact that $F(w_{k-1}^\dagger)=F_*+\epsilon$ implies $F(w_k)-F_*\leq\epsilon+\epsilon_k$. Therefore by induction, we have with probability at least $(1-\tilde\delta)^K$,
$$F(w_K)-F_*\leq\epsilon_K+\epsilon\leq 2\epsilon,$$
where the last inequality is due to the value of $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$. Since $\tilde\delta=\delta/K$, then $(1-\tilde\delta)^K\geq 1-\delta$.

4. Proof of Theorem 3

Theorem 3 (RASSG with unknown $c$). Let $\epsilon\leq\epsilon_0/2$, $\omega=1$, and $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$ in Algorithm 3. Suppose $D^{(1)}$ is sufficiently large so that there exists $\hat\epsilon_0\in[\epsilon,\epsilon_0/2]$, with which $F(\cdot)$ satisfies the LGC (6) on $\mathcal{S}_{\hat\epsilon_0}$ with $\theta\in(0,1)$ and the constant $c$, and $D^{(1)}=c\hat\epsilon_0^\theta$. Let $\hat\delta=\frac{\delta}{K(K+1)}$, and $t_1=\max\{9,1728\log(1/\hat\delta)\}\frac{G^2(D^{(1)})^2}{\hat\epsilon_0^2}$. Then with at most $S=\lceil\log_2(\hat\epsilon_0/\epsilon)\rceil+1$ calls of ASSG-c, Algorithm 3 finds a solution $w^{(S)}$ such that $F(w^{(S)})-F_*\leq 2\epsilon$. The total number of iterations of RASSG for obtaining a $2\epsilon$-optimal solution is upper bounded by $T_S=O\big(\lceil\log_2(\epsilon_0/\epsilon)\rceil\log(1/\hat\delta)\,G^2c^2/\epsilon^{2(1-\theta)}\big)$.

Proof. Since $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil\geq\lceil\log_2(\epsilon_0/\hat\epsilon_0)\rceil$, $D^{(1)}=c\hat\epsilon_0^\theta$, and $t_1=\max\{9,1728\log(1/\hat\delta)\}\frac{G^2(D^{(1)})^2}{\hat\epsilon_0^2}$, following the proof of Theorem 1 we can show that with probability $1-\delta/(K+1)$,
$$F(w^{(1)})-F_*\leq 2\hat\epsilon_0.\qquad(8)$$
By running ASSG-c starting from $w^{(1)}$, which satisfies (8), with $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil\geq\lceil\log_2(2\hat\epsilon_0/\hat\epsilon_0)\rceil$, $D^{(2)}=c(\hat\epsilon_0/2)^\theta$, and $t_2=\max\{9,1728\log(1/\hat\delta)\}\frac{G^2(D^{(2)})^2}{(\hat\epsilon_0/2)^2}$, Theorem 1 ensures that $F(w^{(2)})-F_*\leq\hat\epsilon_0$ with probability at least $(1-\delta/(K+1))^2$. By continuing the process, with $S=\lceil\log_2(\hat\epsilon_0/\epsilon)\rceil+1$ we can prove that with probability at least $(1-\delta/(K+1))^S\geq 1-\frac{S\delta}{K+1}\geq 1-\delta$,
$$F(w^{(S)})-F_*\leq 2\hat\epsilon_0/2^{S-1}\leq 2\epsilon.$$
The total number of iterations for the $S$ calls of ASSG-c is bounded by
$$T_S=K\sum_{s=1}^St_s=Kt_1\sum_{s=1}^S2^{2(s-1)(1-\theta)}\leq Kt_1\,2^{2(S-1)(1-\theta)}\frac{2^{2(1-\theta)}}{2^{2(1-\theta)}-1}\leq O\Big(Kt_1\big(\hat\epsilon_0/\epsilon\big)^{2(1-\theta)}\Big)\leq O\Big(\frac{c^2G^2\lceil\log_2(\epsilon_0/\epsilon)\rceil\log(1/\hat\delta)}{\epsilon^{2(1-\theta)}}\Big).$$

5. Proof of Theorem 4

Theorem 4 (RASSG with unknown $\theta$).
Let $\theta=1$, $\epsilon\leq\epsilon_0/2$, $\omega=1$, and $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$ in Algorithm 3. Assume $D^{(1)}$ is sufficiently large so that there exists $\hat\epsilon_0\in[\epsilon,\epsilon_0/2]$ rendering that $D^{(1)}=B_{\hat\epsilon_0}$. Let $\hat\delta=\frac{\delta}{K(K+1)}$, and $t_1=\max\{9,1728\log(1/\hat\delta)\}\frac{G^2(D^{(1)})^2}{\hat\epsilon_0^2}$. Then with at most $S=\lceil\log_2(\hat\epsilon_0/\epsilon)\rceil+1$ calls of ASSG-c, Algorithm 3 finds a solution $w^{(S)}$ such that $F(w^{(S)})-F_*\leq 2\epsilon$. The total number of iterations of RASSG for obtaining a $2\epsilon$-optimal solution is upper bounded by
$$T_S=O\Big(\lceil\log_2(\epsilon_0/\epsilon)\rceil\log(1/\hat\delta)\,\frac{G^2B^2_{\hat\epsilon_0}}{\hat\epsilon_0^2}\Big).$$

Proof. The proof is similar to the proof of Theorem 3, and we reproduce it for completeness. It is easy to show that $F(\cdot)$ satisfies the LGC on $\mathcal{S}_{\hat\epsilon_0}$ with $\theta=1$ and the constant $B_{\hat\epsilon_0}/\hat\epsilon_0$, so that $D^{(1)}=B_{\hat\epsilon_0}=(B_{\hat\epsilon_0}/\hat\epsilon_0)\,\hat\epsilon_0$. Following the proof of Theorem 1, we then can show that with probability $1-\delta/S$,
$$F(w^{(1)})-F_*\leq 2\hat\epsilon_0,\qquad(9)$$
with $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil\geq\lceil\log_2(\epsilon_0/\hat\epsilon_0)\rceil$. By running ASSG-c starting from $w^{(1)}$, which satisfies (9), with $K\geq\lceil\log_2(2\hat\epsilon_0/\hat\epsilon_0)\rceil$, $D^{(2)}=D^{(1)}/2$, and $t_2=\max\{9,1728\log(1/\hat\delta)\}\frac{G^2(D^{(2)})^2}{(\hat\epsilon_0/2)^2}$, Theorem 1 ensures that $F(w^{(2)})-F_*\leq\hat\epsilon_0$
with probability at least $(1-\delta/S)^2$. By continuing the process, with $S=\lceil\log_2(\hat\epsilon_0/\epsilon)\rceil+1$, we can prove that with probability at least $(1-\delta/S)^S\geq 1-\delta$,
$$F(w^{(S)})-F_*\leq 2\hat\epsilon_0/2^{S-1}\leq 2\epsilon.$$
The total number of iterations for the $S$ calls of ASSG-c is bounded by
$$T_S=K\sum_{s=1}^St_s=KSt_1\leq O\Big(\lceil\log_2(\epsilon_0/\epsilon)\rceil\log(1/\hat\delta)\,\frac{G^2B^2_{\hat\epsilon_0}}{\hat\epsilon_0^2}\Big),$$
where we use that with $\theta=1$ the ratio $G^2(D^{(s)})^2/\hat\epsilon_s^2$ is the same for every call $s$.

6. Monotonicity of $B_\epsilon/\epsilon$

Lemma 6. $B_\epsilon/\epsilon$ is monotonically decreasing in $\epsilon$.

Proof. Consider $\epsilon_2>\epsilon_1>0$. Let $x$ be any point on $\mathcal{L}_{\epsilon_2}$ such that $\mathrm{dist}(x,\Omega_*)=B_{\epsilon_2}$, and let $x^*$ be the closest point to $x$ in $\Omega_*$, so that $\|x-x^*\|=B_{\epsilon_2}$. We define a new point between $x^*$ and $x$ as
$$\tilde x=\frac{B_{\epsilon_1}}{B_{\epsilon_2}}x+\Big(1-\frac{B_{\epsilon_1}}{B_{\epsilon_2}}\Big)x^*.$$
Since $0<B_{\epsilon_1}<B_{\epsilon_2}$, $\tilde x$ is strictly between $x^*$ and $x$, and
$$\mathrm{dist}(\tilde x,\Omega_*)=\|\tilde x-x^*\|=\frac{B_{\epsilon_1}}{B_{\epsilon_2}}\|x-x^*\|=B_{\epsilon_1}.$$
By the convexity of $F$, we have
$$\frac{F(\tilde x)-F_*}{\mathrm{dist}(\tilde x,\Omega_*)}\leq\frac{F(x)-F_*}{\mathrm{dist}(x,\Omega_*)}=\frac{\epsilon_2}{B_{\epsilon_2}}.$$
Note that we must have $F(\tilde x)-F_*\geq\epsilon_1$ since, otherwise, we could move $\tilde x$ towards $x$ until $F(\tilde x)-F_*=\epsilon_1$ but $\mathrm{dist}(\tilde x,\Omega_*)>B_{\epsilon_1}$, contradicting the definition of $B_{\epsilon_1}$. Then the proof is completed by applying $F(\tilde x)-F_*\geq\epsilon_1$ and $\mathrm{dist}(\tilde x,\Omega_*)=B_{\epsilon_1}$ to the previous inequality, which yields $\epsilon_1/B_{\epsilon_1}\leq\epsilon_2/B_{\epsilon_2}$.

7. Additional Experiments

The datasets used in our experiments are from the libsvm website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We summarize the basic statistics of the datasets in Table 1.

Table 1. Statistics of real datasets.
Name | #Training (n) | #Features (d) | Type
covtype.binary | 581,012 | 54 | Classification
real-sim | 72,309 | 20,958 | Classification
url | 2,396,130 | 3,231,961 | Classification
million songs | 463,715 | 90 | Regression
E2006-tfidf | 16,087 | 150,360 | Regression
E2006-log1p | 16,087 | 4,272,227 | Regression

To examine the convergence behavior of ASSG with different values of $t_1$ (the number of iterations per stage), we also provide the results of ASSG with a very large $t_1$ and add them into Figure 2. For completeness, we replot all these results in Figure 2. The results show that ASSG with a smaller $t_1$ converges much faster to an $\epsilon$-level set than ASSG with a larger $t_1$, while ASSG with a larger $t_1$ can converge to a much smaller objective gap. In some cases, ASSG with a larger $t_1$ is not as good as the baseline in earlier stages, but overall it converges faster to a smaller objective gap. We also present the results for the second choice of $t_1$ in Figure 3, which are similar to those in Figure 2. In Figures 2 and 3, we compare RASSG with the full-gradient baseline in terms of running time (CPU time), since the latter computes a full gradient in each outer loop while RASSG only touches one sample in each iteration. Following many previous studies, we also include the results in terms of the number of full gradient passes in Figure 4, for both choices of $t_1$. Similar trends can be found in Figure 4, indicating that RASSG converges faster than the other three algorithms.

References

Kakade, Sham M. and Tewari, Ambuj. On the generalization ability of online strongly convex programming algorithms. In NIPS, pp. 801-808, 2008.
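To make the multi-stage scheme concrete, the following is a minimal sketch of an ASSG-c-style loop in Python. It is our own illustration, not the authors' released code: the oracle interface, the function names, and the per-stage iteration count are our assumptions, and the constants are illustrative rather than those of Theorem 1.

```python
import numpy as np

def stage(grad_oracle, w1, D, eta, t):
    """One stage (sketch): t projected stochastic subgradient steps kept
    inside the ball B(w1, D); returns the average iterate."""
    w = w1.copy()
    avg = np.zeros_like(w1)
    for _ in range(t):
        avg += w / t
        w = w - eta * grad_oracle(w)
        diff = w - w1                      # project back onto the ball
        nrm = np.linalg.norm(diff)
        if nrm > D:
            w = w1 + diff * (D / nrm)
    return avg

def assg_c(grad_oracle, w0, eps0, D1, G, K):
    """Multi-stage sketch: halve the target accuracy eps_k and the radius
    D_k at each stage and set eta_k = eps_k / (3 G^2) as in the analysis.
    The per-stage iteration count below is an illustrative choice only."""
    w, eps, D = w0.copy(), eps0, D1
    for _ in range(K):
        eps, D = eps / 2.0, D / 2.0
        eta = eps / (3.0 * G ** 2)
        t = max(9, int(np.ceil((3.0 * G * D / eps) ** 2)))
        w = stage(grad_oracle, w, D, eta, t)
    return w
```

As a usage example, minimizing $F(w)=\|w\|_1$ (for which the growth condition holds with $\theta=1$) with a noisy sign oracle drives the iterate toward the minimizer while the per-stage search radius shrinks geometrically.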
[Plots omitted. Each figure shows $\log_{10}$(objective gap) versus the number of iterations or CPU time (s) for hinge loss, squared hinge loss, huber loss, and robust loss problems with norm regularization on covtype, real-sim, url, million songs, E2006-tfidf, and E2006-log1p.]

Figure 2. Comparison of different algorithms for solving different problems on different datasets (first choice of $t_1$).

Figure 3. Comparison of different algorithms for solving different problems on different datasets (second choice of $t_1$).

Figure 4. Comparison of different algorithms for solving different problems on different datasets by number of gradient passes; panels (a) and (b) correspond to the two choices of $t_1$.
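For completeness, the regularized inner loop (9) analyzed in Lemmas 3 and 4 also admits a short sketch: stochastic gradient steps on the proximal objective $\hat F(w)=F(w)+\|w-w_1\|^2/(2\gamma)$, returning the average iterate. Again this is our own hedged illustration (the oracle interface and names are assumptions, not the authors' implementation); we use the $\eta_\tau=2\gamma/\tau$ schedule from the proof of Lemma 4.

```python
import numpy as np

def assg_r_stage(grad_oracle, w1, gamma, t):
    """Sketch of the inner loop: stochastic gradient steps on the proximal
    objective F_hat(w) = F(w) + ||w - w1||^2 / (2*gamma), which is
    (1/gamma)-strongly convex, with step sizes eta_tau = 2*gamma/tau;
    returns the average iterate w_hat_t."""
    w = w1.copy()
    avg = np.zeros_like(w1)
    for tau in range(1, t + 1):
        avg += w / t
        g = grad_oracle(w) + (w - w1) / gamma  # stochastic gradient of F_hat
        w = w - (2.0 * gamma / tau) * g
    return avg
```

For example, with $F(w)=|w|$, $w_1=2$, and $\gamma=1$, the proximal objective $|w|+(w-2)^2/2$ is minimized at $w=1$, and the averaged iterate approaches that point; the trajectory also stays near $w_1$, in the spirit of the $O(\gamma G)$ bounds of Lemma 3.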