A Proofs

We now give the details for the proofs of our main results, i.e., Theorems 1 and 2. Below, we outline the steps of the proof of FLAG's Theorem 1. The proof of Theorem 2 for FLARE follows the same line of reasoning. Also, we note that, in what follows, the lemmas/corollaries required for the proof of Theorem 2 are given immediately after those of FLAG.

1. FLAG is essentially a combination of mirror descent and proximal gradient descent steps (Lemmas 1 and 4).
2. $\gamma_k$ in Algorithm 1 plays the role of an effective gradient Lipschitz constant in each iteration. The convergence rate of FLAG ultimately depends on $\sum_k \eta_k\gamma_k$ and $\sum_k g_k^\top S_k^{-1} g_k$ (Lemma 8 and Corollary 3).
3. By picking $S_k$ adaptively like in AdaGrad, we achieve a non-trivial upper bound for $\sum_k g_k^\top S_k^{-1} g_k$ (Lemma 5).
4. FLAG relies on picking an $x_k$ at each iteration that satisfies an inequality involving $\gamma_k$ (Corollary 1). However, because $\gamma_k$ is not known prior to picking $x_k$, we must choose an $x_k$ that roughly satisfies the inequality for all possible values of $\gamma_k$. We do this by picking $x_k$ using binary search (Lemmas 2 and 3 and Corollary 1).
5. Finally, we need to pick the right stepsize for each iteration. Our scheme is very similar to the one used in [1], but generalized to handle a different $\eta_k$ in each iteration (Lemmas 6 and 8 as well as Corollary 3).
6. Theorem 3 combines items 1, 2 and 4 above. Finally, to prove Theorem 1, we combine Theorem 3 with items 3 and 5 above.

A.1 Proof of Theorem 1 and Theorem 2

First, we obtain the following key result (similar to [4, Lemma 2.3]) regarding the vector $p = -L(\mathrm{prox}(x) - x)$, as in Step 3 of FLAG, which is known as the gradient mapping of $F$ on $\mathcal{C}$.

Lemma 1 (Gradient Mapping) For any $x, y \in \mathcal{C}$, we have
$$F(\mathrm{prox}(x)) \le F(y) + L\langle \mathrm{prox}(x) - x,\; y - x\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2,$$
where $\mathrm{prox}(x)$ is defined as in (3). In particular,
$$F(\mathrm{prox}(x)) \le F(x) - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2.$$

Proof of Lemma 1 This result is the same as Lemma 2.3 in [4]. We bring its proof here for completeness.
For any $y \in \mathcal{C}$ and any sub-gradient $v$ of $h$ at $\mathrm{prox}(x)$, i.e., $v \in \partial h(\mathrm{prox}(x))$, by optimality of $\mathrm{prox}(x)$ in (3) we have
$$0 \le \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; y - \mathrm{prox}(x)\rangle = \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; y - x\rangle + \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; x - \mathrm{prox}(x)\rangle,$$
and so
$$\langle \nabla f(x),\; \mathrm{prox}(x) - x\rangle \le \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\; y - x\rangle + \langle v,\; x - \mathrm{prox}(x)\rangle - L\|x - \mathrm{prox}(x)\|_2^2.$$
Now from the $L$-Lipschitz continuity of $\nabla f$ as well as the convexity of $f$ and $h$, we get
$$\begin{aligned}
F(\mathrm{prox}(x)) &= f(\mathrm{prox}(x)) + h(\mathrm{prox}(x))\\
&\le f(x) + \langle \nabla f(x),\, \mathrm{prox}(x) - x\rangle + \frac{L}{2}\|\mathrm{prox}(x) - x\|_2^2 + h(\mathrm{prox}(x))\\
&\le f(x) + \langle \nabla f(x) + v + L(\mathrm{prox}(x) - x),\, y - x\rangle + \langle v,\, x - \mathrm{prox}(x)\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2 + h(\mathrm{prox}(x))\\
&\le f(y) + \langle v + L(\mathrm{prox}(x) - x),\, y - x\rangle + \langle v,\, x - \mathrm{prox}(x)\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2 + h(\mathrm{prox}(x))\\
&= f(y) + L\langle \mathrm{prox}(x) - x,\, y - x\rangle + \langle v,\, y - \mathrm{prox}(x)\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2 + h(\mathrm{prox}(x))\\
&\le F(y) + L\langle \mathrm{prox}(x) - x,\, y - x\rangle - \frac{L}{2}\|x - \mathrm{prox}(x)\|_2^2.
\end{aligned}$$

The following lemma establishes the Lipschitz continuity of the prox operator.

Lemma 2 (Prox Operator Continuity) $\mathrm{prox} : \mathbb{R}^d \to \mathbb{R}^d$ is 2-Lipschitz continuous, that is, for any $x, y \in \mathcal{C}$, we have $\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2 \le 2\|x - y\|_2$.
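Before proceeding to the proofs, the two claims above, the descent guarantee of the gradient mapping (Lemma 1 with $y = x$) and the 2-Lipschitz continuity of the prox operator (Lemma 2), can be probed numerically. The sketch below is ours, not the paper's: it assumes the special case $h \equiv 0$ with a box constraint, where $\mathrm{prox}(x)$ reduces to a clipped gradient step with stepsize $1/L$.

```python
import numpy as np

# Toy composite problem: f(x) = 0.5 x^T Q x, h = 0, C = [-1, 1]^d.
rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
Q = A.T @ A                              # positive semi-definite Hessian of f
L = float(np.linalg.eigvalsh(Q).max())   # Lipschitz constant of grad f

def F(x):
    return 0.5 * x @ Q @ x

def prox(x):
    # argmin_{z in C} <grad f(x), z - x> + (L/2)||z - x||^2  (h = 0),
    # i.e. a projected gradient step with stepsize 1/L.
    return np.clip(x - (Q @ x) / L, -1.0, 1.0)

for _ in range(100):
    x = rng.uniform(-1, 1, size=d)
    y = rng.uniform(-1, 1, size=d)
    px = prox(x)
    # Lemma 1 with y = x: F(prox(x)) <= F(x) - (L/2)||x - prox(x)||^2
    assert F(px) <= F(x) - 0.5 * L * np.sum((x - px) ** 2) + 1e-9
    # Lemma 2: ||prox(x) - prox(y)|| <= 2 ||x - y||
    assert np.linalg.norm(px - prox(y)) <= 2 * np.linalg.norm(x - y) + 1e-9
print("ok")
```

In this special case the map $x \mapsto x - Qx/L$ is in fact 1-Lipschitz and clipping is non-expansive, so the factor 2 of Lemma 2 is loose here; the point of the general lemma is that the factor 2 survives for arbitrary convex $h$ and constraint sets.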
Proof of Lemma 2 By definition (3), for any $x, y, z, z' \in \mathcal{C}$, $v \in \partial h(\mathrm{prox}(x))$ and $w \in \partial h(\mathrm{prox}(y))$, we have
$$\langle v,\; z - \mathrm{prox}(x)\rangle \ge -\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; z - \mathrm{prox}(x)\rangle,$$
$$\langle w,\; z' - \mathrm{prox}(y)\rangle \ge -\langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; z' - \mathrm{prox}(y)\rangle.$$
In particular, for $z = \mathrm{prox}(y)$ and $z' = \mathrm{prox}(x)$, we get
$$\langle v,\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle \ge -\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle,$$
$$\langle w,\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle \ge -\langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle.$$
By monotonicity of the sub-gradient, we get $\langle v,\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle \le \langle w,\; \mathrm{prox}(y) - \mathrm{prox}(x)\rangle$. So
$$\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle \le \langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle,$$
and as a result
$$\begin{aligned}
\langle \nabla f(x) + L(\mathrm{prox}(x) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle
&= \langle \nabla f(x) + L(\mathrm{prox}(x) - \mathrm{prox}(y) + \mathrm{prox}(y) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle\\
&= L\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2^2 + \langle \nabla f(x) + L(\mathrm{prox}(y) - x),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle\\
&\le \langle \nabla f(y) + L(\mathrm{prox}(y) - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle,
\end{aligned}$$
which gives
$$\begin{aligned}
L\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2^2 &\le \langle \nabla f(y) - \nabla f(x) + L(x - y),\; \mathrm{prox}(x) - \mathrm{prox}(y)\rangle\\
&\le \big(\|\nabla f(y) - \nabla f(x)\|_2 + L\|x - y\|_2\big)\,\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2\\
&\le 2L\|x - y\|_2\,\|\mathrm{prox}(x) - \mathrm{prox}(y)\|_2,
\end{aligned}$$
and the result follows.

Using the prox operator continuity of Lemma 2, we can conclude that given any $y, z \in \mathcal{C}$, if $\langle \mathrm{prox}(y) - y,\; y - z\rangle < 0$ and $\langle \mathrm{prox}(z) - z,\; y - z\rangle > 0$, then there must be a $t^* \in (0, 1)$ for which $w = t^* y + (1 - t^*) z$ gives $\langle \mathrm{prox}(w) - w,\; y - z\rangle = 0$. Algorithm 2 finds an approximation to $w$ in $O(\log(1/\epsilon))$ iterations.

Lemma 3 (Binary Search Lemma) Let $x = \mathrm{BinarySearch}(z, y, \epsilon)$ be defined as in Algorithm 2. Then one of 3 cases happens: (i) $x = y$ and $\langle \mathrm{prox}(x) - x,\; x - z\rangle \ge 0$, (ii) $x = z$ and $\langle \mathrm{prox}(x) - x,\; y - x\rangle \le 0$, or (iii) $x = ty + (1 - t)z$ for some $t \in (0, 1)$ and $|\langle \mathrm{prox}(x) - x,\; y - z\rangle| \le 3\epsilon\|y - z\|_2^2$.

Proof of Lemma 3 Items (i) and (ii) are simply Steps 2 and 5, respectively. For item (iii), with $w = t^* y + (1 - t^*) z$ as above, we have
$$\|x - w\|_2 = \|ty + (1 - t)z - t^* y - (1 - t^*)z\|_2 = |t - t^*|\,\|y - z\|_2 \le \epsilon\|y - z\|_2.$$
Now it follows that
$$\begin{aligned}
|\langle \mathrm{prox}(x) - x,\; y - z\rangle| &= |\langle \mathrm{prox}(x) - x,\; y - z\rangle - \langle \mathrm{prox}(w) - w,\; y - z\rangle|\\
&\le |\langle \mathrm{prox}(x) - \mathrm{prox}(w),\; y - z\rangle| + |\langle x - w,\; y - z\rangle|\\
&\le \|\mathrm{prox}(x) - \mathrm{prox}(w)\|_2\,\|y - z\|_2 + \|x - w\|_2\,\|y - z\|_2\\
&\le 2\|x - w\|_2\,\|y - z\|_2 + \|x - w\|_2\,\|y - z\|_2\\
&= 3\|x - w\|_2\,\|y - z\|_2 \le 3\epsilon\|y - z\|_2^2,
\end{aligned}$$
where the third inequality follows by Lemma 2, and the result follows.

Using the above result, we can prove the following:

Corollary 1 Let $x_k$, $y_k$, $z_k$ and $\gamma_k$ be defined as in Algorithms 1 and 2, and let $\epsilon = 1/(6d)$. Then for all $k$,
$$\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle \le \frac{\gamma_k L D_\infty^2}{2}.$$

Proof of Corollary 1 Note that by Step 3 of Algorithm 1, $p_k = -L(\mathrm{prox}(x_k) - x_k)$.
For $k = 1$, since $x_1 = y_0 = z_0$, the inequality is trivially true. For $k \ge 2$, we consider the three cases of Lemma 3: (i) if $x_k = y_{k-1}$, the right hand side is non-negative and the left hand side is
$$\langle p_k,\; x_k - z_{k-1}\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; x_k - z_{k-1}\rangle \le 0,$$
(ii) if $x_k = z_{k-1}$, the left hand side
is 0 and
$$\langle p_k,\; y_{k-1} - x_k\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; y_{k-1} - x_k\rangle \ge 0,$$
so the inequality holds trivially, and (iii) in this last case, for some $t \in (0, 1)$, we have
$$\langle p_k,\; x_k - z_{k-1}\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; ty_{k-1} + (1 - t)z_{k-1} - z_{k-1}\rangle = -tL\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle,$$
and
$$\langle p_k,\; y_{k-1} - x_k\rangle = \langle -L(\mathrm{prox}(x_k) - x_k),\; y_{k-1} - ty_{k-1} - (1 - t)z_{k-1}\rangle = -(1 - t)L\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle.$$
Hence
$$\begin{aligned}
\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle
&= -\big(t - (\gamma_k - 1)(1 - t)\big)\,L\,\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle\\
&\le \big(t + (\gamma_k - 1)(1 - t)\big)\,L\,\big|\langle \mathrm{prox}(x_k) - x_k,\; y_{k-1} - z_{k-1}\rangle\big|\\
&\le 3\gamma_k L\epsilon\,\|y_{k-1} - z_{k-1}\|_2^2 \;\le\; 3\gamma_k L\epsilon\, d\, D_\infty^2 \;=\; \frac{\gamma_k L D_\infty^2}{2},
\end{aligned}$$
where the second inequality uses Lemma 3(iii) together with $t + (\gamma_k - 1)(1 - t) \le \gamma_k$, and in the last line we used the fact that $\|y_{k-1} - z_{k-1}\|_2^2 \le d D_\infty^2$ and $\epsilon = 1/(6d)$, and the result follows.

Similar to Corollary 1 for Algorithm 1, the following proves an analogous result for Algorithm 3.

Corollary 2 Let $x_k$, $y_k$, $z_k$ and $\gamma_k$ be defined as in Algorithm 3, and let $\epsilon = 1/(6d)$. Then for all $k$,
$$\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle \le \frac{\gamma_k L D_\infty^2}{2}.$$

Proof of Corollary 2 We consider two cases:
1. If $x_k$ is generated through Algorithm 5, then $x_k = \mathrm{BinarySearch}(y_{k-1}, z_{k-1}, \epsilon)$, and so the statement follows from Corollary 1.
2. If $x_k$ is generated through Algorithm 4, then $x_k = \frac{1}{\gamma_k} z_{k-1} + \big(1 - \frac{1}{\gamma_k}\big) y_{k-1}$, and so it satisfies
$$\langle p_k,\; x_k - z_{k-1}\rangle - (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle = 0.$$

Next, we state a result regarding the mirror descent step. Similar results can be found in most texts on online optimization, e.g., [1].

Lemma 4 (Mirror Descent Inequality) Let
$$z_k = \arg\min_{z\in\mathcal{C}}\;\eta_k\langle p_k,\; z - z_{k-1}\rangle + \frac{1}{2}\|z - z_{k-1}\|_{S_k}^2,$$
and let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$ be the diameter of $\mathcal{C}$ measured by the infinity norm. Then for any $u \in \mathcal{C}$, we have
$$\sum_{k=1}^{T}\eta_k\langle p_k,\; z_{k-1} - u\rangle \le \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).$$

Proof of Lemma 4 For any $u \in \mathcal{C}$, by optimality of $z_k$ we have $\langle \eta_k p_k,\; z_k - u\rangle \le \langle S_k(z_k - z_{k-1}),\; u - z_k\rangle$. Hence, using (5) and (4), it follows that
$$\begin{aligned}
\eta_k\langle p_k,\; z_{k-1} - u\rangle
&= \eta_k\langle p_k,\; z_{k-1} - z_k\rangle + \eta_k\langle p_k,\; z_k - u\rangle\\
&\le \eta_k\langle p_k,\; z_{k-1} - z_k\rangle + \langle S_k(z_k - z_{k-1}),\; u - z_k\rangle\\
&= \eta_k\langle p_k,\; z_{k-1} - z_k\rangle - \frac{1}{2}\|z_k - z_{k-1}\|_{S_k}^2 + \frac{1}{2}\|u - z_{k-1}\|_{S_k}^2 - \frac{1}{2}\|u - z_k\|_{S_k}^2\\
&\le \sup_{z\in\mathbb{R}^d}\Big\{\eta_k\langle p_k,\; z\rangle - \frac{1}{2}\|z\|_{S_k}^2\Big\} + \frac{1}{2}\|u - z_{k-1}\|_{S_k}^2 - \frac{1}{2}\|u - z_k\|_{S_k}^2\\
&= \frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2}\|u - z_{k-1}\|_{S_k}^2 - \frac{1}{2}\|u - z_k\|_{S_k}^2.
\end{aligned}$$
Now, recalling from Steps 5-7 of Algorithm 1 that $S_k = \mathrm{diag}(s_k) + \delta I$ and $s_k \ge s_{k-1}$, we sum over $k = 1$ to $T$ to
get
$$\begin{aligned}
\sum_{k=1}^{T}\eta_k\langle p_k,\; z_{k-1} - u\rangle
&\le \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2}\|u - z_0\|_{S_1}^2 + \sum_{k=2}^{T}\frac{1}{2}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_{k-1}\|_{S_{k-1}}^2\Big)\\
&= \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2}\|u - z_0\|_{S_1}^2 + \sum_{k=2}^{T}\frac{1}{2}\big\langle (s_k - s_{k-1})\odot(u - z_{k-1}),\; u - z_{k-1}\big\rangle\\
&\le \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{D_\infty^2}{2}\big(\delta d + \langle s_1, \mathbf{1}\rangle\big) + \frac{D_\infty^2}{2}\sum_{k=2}^{T}\langle s_k - s_{k-1}, \mathbf{1}\rangle\\
&= \sum_{k=1}^{T}\frac{\eta_k^2}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).
\end{aligned}$$

Finally, we state a result similar to that of [7] that captures the benefits of using $S_k$ in FLAG.

Lemma 5 (AdaGrad Inequalities) Define $q_T := \sum_{i=1}^{d}\|G_T(i,:)\|_2$, where $G_T$ is as in Step 5 of Algorithm 1. We have (i) $\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le 2q_T$, (ii) $q_T^2 = \min_{S\in\mathcal{S}}\sum_{k=1}^{T} g_k^\top S^{-1} g_k$, where $\mathcal{S} := \{S \in \mathbb{R}^{d\times d} : S \text{ is diagonal},\, S_{ii} > 0,\, \mathrm{trace}(S) \le 1\}$, and (iii) $\sqrt{T} \le q_T \le \sqrt{Td}$.

Proof of Lemma 5 To prove part (i), we use the following inequality, introduced in the proof of Lemma 4 in [7]: for any arbitrary real-valued sequence $\{a_i\}_i$ and its vector representation $a_{1:k} := [a_1, a_2, \ldots, a_k]$, we have
$$\sum_{k=1}^{T}\frac{a_k^2}{\|a_{1:k}\|_2} \le 2\|a_{1:T}\|_2.$$
So it follows that
$$\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le \sum_{k=1}^{T}\sum_{i=1}^{d}\frac{g_k^2(i)}{s_k(i)} = \sum_{i=1}^{d}\sum_{k=1}^{T}\frac{g_k^2(i)}{\|G_k(i,:)\|_2} \le 2\sum_{i=1}^{d}\|G_T(i,:)\|_2 = 2q_T,$$
where the middle equality follows from the definition of $s_k$ in Step 6 of Algorithm 1.

For the rest of the proof, one can easily see that
$$\sum_{k=1}^{T} g_k^\top S^{-1} g_k = \sum_{i=1}^{d}\frac{a(i)}{s(i)},$$
where $a(i) := \sum_{k=1}^{T} g_k^2(i)$ and $s = \mathrm{diag}(S)$. Now the Lagrangian, for $\lambda \ge 0$ and $\theta \ge 0$, can be written as
$$\Lambda(s, \lambda, \theta) = \sum_{i=1}^{d}\frac{a(i)}{s(i)} + \lambda\Big(\sum_{i=1}^{d} s(i) - 1\Big) - \langle \theta, s\rangle.$$
Since strong duality holds, for any primal-dual optimal solutions $s^*$, $\lambda^*$, and $\theta^*$, it follows from complementary slackness that $\theta^* = 0$ (since $s^* > 0$). Now requiring $\partial\Lambda(s^*, \lambda^*, \theta^*)/\partial s(i) = 0$ gives $s^*(i) = \sqrt{a_i/\lambda^*} > 0$, which, since $s^*(i) > 0$, implies that $\lambda^* > 0$. As a result, by using complementary slackness again, we must have $\sum_{i=1}^{d} s^*(i) = 1$. Now simple algebraic calculation gives $s^*(i) = \sqrt{a_i}/(\sum_{j=1}^{d}\sqrt{a_j})$, and part (ii) follows.

For part (iii), recall that $\|g_k\|_2 = 1$. Now, since $\mathrm{trace}(S) \le 1$ implies $\lambda_{\min}(S^{-1}) \ge 1$, one has $g_k^\top S^{-1} g_k \ge 1$, and so $q_T \ge \sqrt{T}$. On the other hand, consider the optimization problem
$$\max\;\sum_{i=1}^{d}\|G_T(i,:)\|_2 = \sum_{i=1}^{d}\sqrt{\sum_{k=1}^{T} g_k^2(i)} \quad \text{s.t.}\quad \|g_k\|_2^2 \le 1,\; k = 1, 2, \ldots, T.$$
The Lagrangian can be written as
$$\Lambda(\{g_k\}, \{\lambda_k\}) = \sum_{i=1}^{d}\sqrt{\sum_{k=1}^{T} g_k^2(i)} + \sum_{k=1}^{T}\lambda_k\Big(1 - \sum_{i=1}^{d} g_k^2(i)\Big).$$
By the KKT necessary conditions, we require $\partial\Lambda(\{g_k^*\}, \{\lambda_k^*\})/\partial g_k(i) = 0$, which implies
$$\lambda_k^* = \frac{1}{2\sqrt{\sum_{t=1}^{T}(g_t^*)^2(i)}}, \qquad i = 1, 2, \ldots, d.$$
Hence, $\sum_{i=1}^{d}\sum_{t=1}^{T}(g_t^*)^2(i) = d/(4(\lambda^*)^2)$, and so $\lambda^* \ge \frac{1}{2}\sqrt{d/T}$, which gives $q_T \le \sqrt{Td}$.

We can now prove the central theorem which is used to obtain FLAG's main result.

Theorem 3 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 1, we get
$$\eta_T\gamma_T^2\big(F(y_T) - F(u)\big) \le \sum_{k=1}^{T}\Big\{\frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2\Big\} + \frac{L D_\infty^2}{2}\sum_{k=1}^{T}\eta_k\gamma_k^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).$$

Proof of Theorem 3 Noting that $p_k = -L(\mathrm{prox}(x_k) - x_k)$ is the gradient mapping of $F$ on $\mathcal{C}$, it follows that
$$\begin{aligned}
F(y_k) - F(u) &= F(\mathrm{prox}(x_k)) - F(u)\\
&\le \langle p_k,\; x_k - u\rangle - \frac{1}{2L}\|p_k\|_2^2\\
&= \langle p_k,\; x_k - z_{k-1}\rangle + \langle p_k,\; z_{k-1} - u\rangle - \frac{1}{2L}\|p_k\|_2^2\\
&\le \langle p_k,\; x_k - z_{k-1}\rangle + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big) - \frac{1}{2L}\|p_k\|_2^2\\
&\le (\gamma_k - 1)\langle p_k,\; y_{k-1} - x_k\rangle + \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big) - \frac{1}{2L}\|p_k\|_2^2\\
&\le (\gamma_k - 1)\big(F(y_{k-1}) - F(y_k)\big) + \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big),
\end{aligned}$$
where the first equality is by Step 8 of Algorithm 1 (i.e., $y_k = \mathrm{prox}(x_k)$), the first inequality is by Lemma 1, the second inequality is by the per-step bound in the proof of Lemma 4 (applied with stepsize $\eta_k\gamma_k$), the third inequality is by Corollary 1, and the last inequality is by Lemma 1 again. Now we have
$$\gamma_k\big(F(y_k) - F(u)\big) - (\gamma_k - 1)\big(F(y_{k-1}) - F(u)\big) \le \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{1}{2L}\|p_k\|_2^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big).$$
Multiplying both sides by $\eta_k\gamma_k$ and using $\eta_k\gamma_k(\gamma_k - 1) = \eta_{k-1}\gamma_{k-1}^2$ (Lemma 6(ii)) gives
$$\eta_k\gamma_k^2\big(F(y_k) - F(u)\big) - \eta_{k-1}\gamma_{k-1}^2\big(F(y_{k-1}) - F(u)\big) \le \frac{\eta_k\gamma_k^2 L D_\infty^2}{2} + \frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2 + \frac{1}{2}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big).$$
Summing over $k = 1$ to $T$ (note that $\gamma_1 = 1$), telescoping the left hand side, and bounding the sum of the last terms by $\frac{D_\infty^2}{2}(\delta d + \|s_T\|_1)$ as in the proof of Lemma 4,
and the result follows.

Once again, we present the analog of Theorem 3 for Algorithm 3.

Theorem 4 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 3, we get
$$\eta_T\gamma_T^2\big(F(y_T) - F(u)\big) \le \sum_{k=1}^{T}\Big\{\frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2\Big\} + \frac{L D_\infty^2}{2}\sum_{k=1}^{T}\eta_k\gamma_k^2 + \frac{D_\infty^2}{2}\big(\delta d + \|s_T\|_1\big).$$

Proof of Theorem 4 The parts of this proof which differ from the proof of Theorem 3 are the use of Corollary 2 in place of Corollary 1 and the algorithm steps referenced below. Noting that $p_k = -L(\mathrm{prox}(x_k) - x_k)$ is the gradient mapping of $F$ on $\mathcal{C}$, the same chain of inequalities as in the proof of Theorem 3 yields
$$F(y_k) - F(u) \le (\gamma_k - 1)\big(F(y_{k-1}) - F(y_k)\big) + \frac{\gamma_k L D_\infty^2}{2} + \frac{\eta_k\gamma_k}{2}\|p_k\|_{S_k^{-1}}^2 + \frac{1}{2\eta_k\gamma_k}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big),$$
where the first inequality follows from Lemma 1, the second inequality follows from Lemma 4, the last equality follows from Steps 9 and 11 of Algorithm 4 and Steps 8 and 9 of Algorithm 5, the second to last inequality follows from Corollary 2, and the last inequality follows from Lemma 1.
Now we have, exactly as in the proof of Theorem 3,
$$\eta_k\gamma_k^2\big(F(y_k) - F(u)\big) - \eta_{k-1}\gamma_{k-1}^2\big(F(y_{k-1}) - F(u)\big) \le \frac{\eta_k\gamma_k^2 L D_\infty^2}{2} + \frac{\eta_k^2\gamma_k^2}{2}\|p_k\|_{S_k^{-1}}^2 - \frac{\eta_k\gamma_k}{2L}\|p_k\|_2^2 + \frac{1}{2}\Big(\|u - z_{k-1}\|_{S_k}^2 - \|u - z_k\|_{S_k}^2\Big),$$
where we used $\eta_k\gamma_k(\gamma_k - 1) = \eta_{k-1}\gamma_{k-1}^2$ (Lemma 7(ii)). Summing over $k = 1$ to $T$ and telescoping as before, the result follows.

We now set out to put the final piece of the proof in place: choosing the stepsize for the mirror descent step.

Lemma 6 For the choice of $\gamma_k$ and $\eta_k$ in Algorithm 1, we have (i) $\eta_k\gamma_k^2 = \sum_{i=1}^{k}\eta_i\gamma_i$, (ii) $\eta_k\gamma_k^2 - \eta_k\gamma_k - \eta_{k-1}\gamma_{k-1}^2 = 0$, and (iii) $\gamma_k \ge 1$.

Proof We prove (i) by induction. For $k = 1$, it is easy to verify that $\gamma_1 = 1$, and so $\eta_1\gamma_1^2 = \eta_1\gamma_1$ and the base case follows trivially. Now suppose that $\eta_{k-1}\gamma_{k-1}^2 = \sum_{i=1}^{k-1}\eta_i\gamma_i$. Re-arranging (i) for $\gamma_k$ gives
$$0 = \eta_k\gamma_k^2 - \eta_k\gamma_k - \sum_{i=1}^{k-1}\eta_i\gamma_i = \eta_k\gamma_k^2 - \eta_k\gamma_k - \eta_{k-1}\gamma_{k-1}^2.$$
Now, it is easy to verify that the choice of $\gamma_k$ in Algorithm 1 is a solution of the above quadratic equation. The rest of the items follow immediately from part (i).

The FLARE analog:

Lemma 7 For the choice of $\gamma_k$ and $\eta_k$ in Algorithm 3, we have (i) $\eta_k\gamma_k^2 = \sum_{i=1}^{k}\eta_i\gamma_i$, (ii) $\eta_k\gamma_k^2 - \eta_k\gamma_k - \eta_{k-1}\gamma_{k-1}^2 = 0$, and (iii) $\gamma_k \ge 1$.

Proof of Lemma 7 Completely identical to the proof of Lemma 6.

Corollary 3 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 1, we get
$$F(y_T) - F(u) \le \frac{L D_\infty^2\sum_{k=1}^{T}\eta_k\gamma_k^2 + D_\infty^2\big(\delta d + \|s_T\|_1\big)}{2\,\eta_T\gamma_T^2}.$$

Proof of Corollary 3 The result follows from Theorem 3 and Lemma 6, as well as noting that the choice of $\eta_k$ guarantees $\eta_k\gamma_k\|p_k\|_{S_k^{-1}}^2 \le \|p_k\|_2^2/L$, so the braced terms in Theorem 3 are non-positive, and that $\eta_T\gamma_T^2 = \sum_{i=1}^{T}\eta_i\gamma_i \le \sum_{i=1}^{T}\eta_i\gamma_i^2$.

Corollary 4 Let $D_\infty := \sup_{x,y\in\mathcal{C}}\|x - y\|_\infty$. For any $u \in \mathcal{C}$, after $T$ iterations of Algorithm 3, we get
$$F(y_T) - F(u) \le \frac{L D_\infty^2\sum_{k=1}^{T}\eta_k\gamma_k^2 + D_\infty^2\big(\delta d + \|s_T\|_1\big)}{2\,\eta_T\gamma_T^2}.$$

Proof of Corollary 4 The result follows from Theorem 4 and Lemma 7, as well as noting that $\eta_T\gamma_T^2 = \sum_{i=1}^{T}\eta_i\gamma_i \le \sum_{i=1}^{T}\eta_i\gamma_i^2$.

Finally, it only remains to lower bound $\eta_T\gamma_T^2 = \sum_{k=1}^{T}\eta_k\gamma_k$, which is done in the following lemma.

Lemma 8 For the choice of $\eta_k$ in Algorithm 1, we have
$$\eta_T\gamma_T^2 \ge \frac{T^3}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k}.$$
Proof of Lemma 8 We prove this by induction on $T$. For $T = 1$, we have $\gamma_1 = 1$ and $\eta_1\gamma_1^2 = \eta_1$, and the base case holds trivially. Suppose the desired relation holds for $T - 1$. We have
$$\eta_T\gamma_T^2 = \eta_T\gamma_T + \eta_{T-1}\gamma_{T-1}^2 \ge \eta_T\gamma_T + \frac{(T-1)^3}{1000\, L\sum_{k=1}^{T-1} g_k^\top S_k^{-1} g_k} \ge \eta_T\gamma_T + \frac{(T-1)^3}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k},$$
where the first inequality is by the induction hypothesis on $T - 1$. Now if
$$\eta_T\gamma_T \ge \frac{3T^2}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k},$$
then we are done, since $(T-1)^3 + 3T^2 \ge T^3$. Otherwise, denoting $Z_T := \sum_{k=1}^{T} g_k^\top S_k^{-1} g_k$, we must have $\eta_T\gamma_T < 3T^2/(1000\, L\, Z_T)$. Hence, using the choice of $\eta_T$ in Algorithm 1 together with Lemma 6(ii), a direct calculation again gives
$$\eta_T\gamma_T^2 \ge \frac{T^3}{1000\, L\, Z_T}.$$

Remark: We note here that we made little effort to minimize constants, and that we used rather sloppy bounds such as $1 - 1/T \ge 1/2$. As a result, the constant 1000 appearing above is very conservative and a mere by-product of our proof technique.

Once again, the FLARE analog of Lemma 8 is the following.

Lemma 9 For the choice of $\eta_k$ in Algorithm 3, we have
$$\eta_T\gamma_T^2 \ge \frac{T^3}{1000\, L\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k}.$$

Proof of Lemma 9 Once again, exactly identically to the proof of Lemma 8, we obtain the same lower bound. Finally, using the corresponding guarantee from Step 11 of Algorithm 4 and Step 9 of Algorithm 5, we get the conclusion.

The proof of FLAG's main result, Theorem 1, follows rather immediately.

Proof of Theorem 1 The result follows immediately from Lemma 8 and Corollary 3, noting that $\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le 2q_T$ by Lemma 5, and $\|s_T\|_1 = q_T$ by Step 6 of Algorithm 1 and the definition of $q_T$ in Lemma 5. This gives
$$F(y_T) - F(u) \le \frac{1000\, L D_\infty^2\, q_T^2}{T^3} + \frac{100\, L D_\infty^2\, q_T^2}{T^3} = \frac{1100\,\beta\, L D_\infty^2}{T^2}.$$
Now from Lemma 5, we see that $\beta := q_T^2/T \in [1, d]$. Finally, the run-time per iteration follows from having to do $\log_2(1/\epsilon)$ calls to bisection, each taking $O(T_{\mathrm{prox}})$ time.

The proof of FLARE's main result, Theorem 2, is obtained similarly to that of Theorem 1.

Proof of Theorem 2 The result follows immediately from Lemma 9 and Corollary 4, noting that $\sum_{k=1}^{T} g_k^\top S_k^{-1} g_k \le 2q_T$ by Lemma 5, and $\|s_T\|_1 = q_T$ by Step 6 of Algorithm 4 and Step 5 of Algorithm 5 and the definition of $q_T$ in Lemma 5. This gives
$$F(y_T) - F(u) \le \frac{1100\,\beta\, L D_\infty^2}{T^2}, \qquad \beta := q_T^2/T \in [1, d].$$
Finally, we try to guess a suitable $\gamma_k$ for $\log(d/\epsilon)$ times, and resort to BinarySearch afterwards. If we resort
to Algorithm 5 (essentially BinarySearch), we make $\log(1/\epsilon)$ calls to bisection, so overall the number of inner iterations per outer iteration is the same as in Algorithm 1. Each inner iteration takes $O(T_{\mathrm{prox}})$ time in the worst case (if we have to resort to Algorithm 5 each time).
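As a closing illustration, the bisection routine analyzed above (Algorithm 2, reused by Algorithm 5) can be sketched in a few lines. This is our reconstruction under stated assumptions, not the paper's pseudocode: the function name and the toy map used in the usage note are ours; only the three-case structure of Lemma 3 and the $\lceil\log_2(1/\epsilon)\rceil$ call count are taken from the text.

```python
import math
import numpy as np

def binary_search(prox, z, y, eps):
    """Bisect on g(t) = <prox(x) - x, y - z> with x = t*y + (1 - t)*z.

    Mirrors the three cases of Lemma 3: return the endpoint y, the
    endpoint z, or an interior point whose mixing weight t is within
    eps of a sign change of g.
    """
    def g(t):
        x = t * y + (1.0 - t) * z
        return float(np.dot(prox(x) - x, y - z))

    if g(1.0) >= 0.0:        # case (i): x = y
        return y
    if g(0.0) <= 0.0:        # case (ii): x = z
        return z
    lo, hi = 0.0, 1.0        # now g(lo) > 0 > g(hi): a root lies in between
    for _ in range(math.ceil(math.log2(1.0 / eps))):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)      # |t - t*| <= eps, giving case (iii)
    return t * y + (1.0 - t) * z
```

For instance, with the monotone one-dimensional toy map `prox(x) = 1 - x`, `y = [1.0]` and `z = [0.0]`, the sign of `g` changes at `t* = 0.5`, and the routine returns a point within `eps` of 0.5 after `ceil(log2(1/eps))` evaluations of `g`, matching the bisection call count used in the runtime discussion above.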
More informationSECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS
SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss
More informationA class of Smoothing Method for Linear Second-Order Cone Programming
Columbia International Publishing Journal of Advanced Computing (13) 1: 9-4 doi:1776/jac1313 Research Article A class of Smoothing Method for Linear Second-Order Cone Programming Zhuqing Gui *, Zhibin
More informationSupplemental Material for Monte Carlo Sampling for Regret Minimization in Extensive Games
Supplemental Material for Monte Carlo Sampling for Regret Minimization in Extensive Games Marc Lanctot Department of Computing Science University of Alberta Edmonton, Alberta, Canada 6G E8 lanctot@ualberta.ca
More informationProximal splitting methods on convex problems with a quadratic term: Relax!
Proximal splitting methods on convex problems with a quadratic term: Relax! The slides I presented with added comments Laurent Condat GIPSA-lab, Univ. Grenoble Alpes, France Workshop BASP Frontiers, Jan.
More informationLecture: Duality of LP, SOCP and SDP
1/33 Lecture: Duality of LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:
More informationWAITING FOR A BAT TO FLY BY (IN POLYNOMIAL TIME)
WAITING FOR A BAT TO FLY BY (IN POLYNOMIAL TIME ITAI BENJAMINI, GADY KOZMA, LÁSZLÓ LOVÁSZ, DAN ROMIK, AND GÁBOR TARDOS Abstract. We observe returns of a simple random wal on a finite graph to a fixed node,
More informationUSA Mathematical Talent Search Round 2 Solutions Year 27 Academic Year
1/2/27. In the grid to the right, the shortest path through unit squares between the pair of 2 s has length 2. Fill in some of the unit squares in the grid so that (i) exactly half of the squares in each
More information17 Solution of Nonlinear Systems
17 Solution of Nonlinear Systems We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1 Let D = {x 1, x 2,..., x m
More informationMatrix Secant Methods
Equation Solving g(x) = 0 Newton-Lie Iterations: x +1 := x J g(x ), where J g (x ). Newton-Lie Iterations: x +1 := x J g(x ), where J g (x ). 3700 years ago the Babylonians used the secant method in 1D:
More informationFor those who want to skip this chapter and carry on, that s fine, all you really need to know is that for the scalar expression: 2 H
1 Matrices are rectangular arrays of numbers. hey are usually written in terms of a capital bold letter, for example A. In previous chapters we ve looed at matrix algebra and matrix arithmetic. Where things
More informationContinuity. Chapter 4
Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More information6.854J / J Advanced Algorithms Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.85J / 8.5J Advanced Algorithms Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 8.5/6.85 Advanced Algorithms
More informationA Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels
A Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels Mehdi Mohseni Department of Electrical Engineering Stanford University Stanford, CA 94305, USA Email: mmohseni@stanford.edu
More informationAccelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)
Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan
More information15-780: LinearProgramming
15-780: LinearProgramming J. Zico Kolter February 1-3, 2016 1 Outline Introduction Some linear algebra review Linear programming Simplex algorithm Duality and dual simplex 2 Outline Introduction Some linear
More informationLECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE
LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization
More informationOptimization, Learning, and Games with Predictable Sequences
Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic
More informationOrthogonal Projection and Least Squares Prof. Philip Pennance 1 -Version: December 12, 2016
Orthogonal Projection and Least Squares Prof. Philip Pennance 1 -Version: December 12, 2016 1. Let V be a vector space. A linear transformation P : V V is called a projection if it is idempotent. That
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems
More informationFast proximal gradient methods
L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient
More informationOptimal Newton-type methods for nonconvex smooth optimization problems
Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations
More informationSize-Depth Tradeoffs for Boolean Formulae
Size-Depth Tradeoffs for Boolean Formulae Maria Luisa Bonet Department of Mathematics Univ. of Pennsylvania, Philadelphia Samuel R. Buss Department of Mathematics Univ. of California, San Diego July 3,
More informationI.3. LMI DUALITY. Didier HENRION EECI Graduate School on Control Supélec - Spring 2010
I.3. LMI DUALITY Didier HENRION henrion@laas.fr EECI Graduate School on Control Supélec - Spring 2010 Primal and dual For primal problem p = inf x g 0 (x) s.t. g i (x) 0 define Lagrangian L(x, z) = g 0
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationRadial Subgradient Descent
Radial Subgradient Descent Benja Grimmer Abstract We present a subgradient method for imizing non-smooth, non-lipschitz convex optimization problems. The only structure assumed is that a strictly feasible
More informationGRADIENT = STEEPEST DESCENT
GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More information