Bregman Divergence and Mirror Descent


1 Bregman Divergence

Motivation. Generalize the squared Euclidean distance to a class of distances that all share similar properties. Such divergences have many applications in machine learning, e.g. clustering and exponential families.

Definition 1 (Bregman divergence) Let ψ : Ω → R be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set Ω. Then the Bregman divergence is defined as

    Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩,   ∀ x, y ∈ Ω.    (1)

That is, the difference between the value of ψ at x and the first-order Taylor expansion of ψ around y evaluated at the point x.

Examples

Euclidean distance. Let ψ(x) = ‖x‖². Then Δ_ψ(x, y) = ‖x − y‖².

KL-divergence. Let Ω = {x ∈ R^n_+ : Σ_i x_i = 1} and ψ(x) = Σ_i x_i log x_i. Then Δ_ψ(x, y) = Σ_i x_i log(x_i / y_i) for x, y ∈ Ω. This is called the relative entropy, or Kullback-Leibler divergence, between the probability distributions x and y.

ℓ_p norm. Let p ≥ 1 and 1/p + 1/q = 1, and let ψ(x) = ½‖x‖²_q. Then Δ_ψ(x, y) = ½‖x‖²_q + ½‖y‖²_q − ⟨x, ∇(½‖y‖²_q)⟩. Note ½‖y‖²_q is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.

Properties of Bregman divergence

Strict convexity in the first argument. Trivial by the strict convexity of ψ.

Nonnegativity: Δ_ψ(x, y) ≥ 0 for all x, y, and Δ_ψ(x, y) = 0 if and only if x = y. Trivial by strict convexity.

Asymmetry: in general, Δ_ψ(x, y) ≠ Δ_ψ(y, x) (e.g. the KL-divergence). Symmetrization is not always useful.

Non-convexity in the second argument. Let Ω = [1, ∞) and ψ(x) = −log x. Then Δ_ψ(x, y) = −log x + log y + x/y − 1. One can check that its second-order derivative in y is (2x − y)/y³, which is negative when 2x < y.

Linearity in ψ. For any a > 0, Δ_{ψ+aϕ}(x, y) = Δ_ψ(x, y) + a Δ_ϕ(x, y).

Gradient in x: ∇_x Δ_ψ(x, y) = ∇ψ(x) − ∇ψ(y). The gradient in y is trickier, and not commonly used.

Generalized triangle inequality:

    Δ_ψ(x, y) + Δ_ψ(y, z) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ + ψ(y) − ψ(z) − ⟨∇ψ(z), y − z⟩    (2)
                          = Δ_ψ(x, z) + ⟨x − y, ∇ψ(z) − ∇ψ(y)⟩.    (3)
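To make the definition concrete, here is a small numerical sketch in Python/NumPy (the helper names bregman, sq_euclidean, and neg_entropy are illustrative, not from the notes) that evaluates Δ_ψ for the squared Euclidean and negative-entropy generators and illustrates nonnegativity and asymmetry.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Delta_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>,  as in (1)."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# Generator 1: squared Euclidean norm, psi(x) = ||x||^2.
sq_euclidean = (lambda x: x @ x, lambda x: 2.0 * x)

# Generator 2: negative entropy on the simplex, psi(x) = sum_i x_i log x_i.
neg_entropy = (lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_euc = bregman(*sq_euclidean, x, y)       # equals ||x - y||^2
d_kl  = bregman(*neg_entropy, x, y)        # equals KL(x || y) on the simplex

print(d_euc, np.sum((x - y) ** 2))         # the two values should match
print(d_kl, np.sum(x * np.log(x / y)))     # the two values should match
print(bregman(*neg_entropy, y, x))         # generally different from d_kl (asymmetry)
```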

Special case: ψ is called strongly convex with respect to some norm ‖·‖ with modulus σ if

    ψ(x) ≥ ψ(y) + ⟨∇ψ(y), x − y⟩ + (σ/2)‖x − y‖².    (4)

Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is equivalent to ψ(x) − (σ/2)‖x‖² being convex. For example, the ψ(x) = Σ_i x_i log x_i used in the KL-divergence is 1-strongly convex over the simplex Ω = {x ∈ R^n_+ : Σ_i x_i = 1} with respect to the ℓ_1 norm.

When ψ is σ strongly convex, we have

    Δ_ψ(x, y) ≥ (σ/2)‖x − y‖².    (5)

Proof: By definition, Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ ≥ (σ/2)‖x − y‖².

Duality. Suppose ψ is strongly convex. Then

    (∇ψ*)(∇ψ(x)) = x,    Δ_ψ(x, y) = Δ_{ψ*}(∇ψ(y), ∇ψ(x)).    (6)

Proof (for the first equality only): Recall

    ψ*(y) = sup_{z∈Ω} {⟨z, y⟩ − ψ(z)}.    (7)

The sup must be attainable because ψ is strongly convex and Ω is closed. x is a maximizer if and only if y = ∇ψ(x). So

    ψ*(y) + ψ(x) = ⟨x, y⟩  ⟺  y = ∇ψ(x).    (8)

Since ψ** = ψ, we also have ψ*(y) + ψ**(x) = ⟨x, y⟩, which means y is the maximizer in

    ψ**(x) = sup_z {⟨x, z⟩ − ψ*(z)}.    (9)

This means x = ∇ψ*(y). To summarize, (∇ψ*)(∇ψ(x)) = x.

Mean of a distribution. Suppose U is a random variable over an open set S with distribution µ. Then

    min_{x∈S} E_{U∼µ}[Δ_ψ(U, x)]    (10)

is optimized at ū := E_µ[U] = ∫_S u µ(u) du.

Proof: For any x ∈ S, we have

    E_{U∼µ}[Δ_ψ(U, x)] − E_{U∼µ}[Δ_ψ(U, ū)]    (11)
    = E_µ[ψ(U) − ψ(x) − ⟨U − x, ∇ψ(x)⟩ − ψ(U) + ψ(ū) + ⟨U − ū, ∇ψ(ū)⟩]    (12)
    = ψ(ū) − ψ(x) + ⟨x, ∇ψ(x)⟩ − ⟨ū, ∇ψ(ū)⟩ + E_µ[−⟨U, ∇ψ(x)⟩ + ⟨U, ∇ψ(ū)⟩]    (13)
    = ψ(ū) − ψ(x) − ⟨ū − x, ∇ψ(x)⟩    (14)
    = Δ_ψ(ū, x).    (15)

This must be nonnegative, and it is 0 if and only if x = ū.

Pythagorean theorem. If x̄ is the projection of x_0 onto a convex set C ⊆ Ω,

    x̄ = argmin_{x∈C} Δ_ψ(x, x_0),    (16)

then for all y ∈ C,

    Δ_ψ(y, x_0) ≥ Δ_ψ(y, x̄) + Δ_ψ(x̄, x_0).    (17)

In the Euclidean case, it means the angle ∠ y x̄ x_0 is obtuse.
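As a quick numerical sanity check on the duality property (6) and the mean property (10), here is a hedged Python sketch using the negative entropy on the positive orthant, for which ∇ψ(x) = log x + 1 and ∇ψ*(y) = exp(y − 1) have closed forms (standard facts, not derived in the notes); the sample distribution is an illustrative choice.

```python
import numpy as np

# Negative entropy on the positive orthant: psi(x) = sum_i x_i log x_i.
grad_psi      = lambda x: np.log(x) + 1.0      # gradient of psi
grad_psi_star = lambda y: np.exp(y - 1.0)      # gradient of the conjugate psi*

x = np.array([0.7, 1.3, 1.8])

# Property (6): the conjugate gradient map inverts the gradient map.
print(grad_psi_star(grad_psi(x)))              # recovers x

# Property (10): E[Delta_psi(U, x)] is minimized at the mean E[U].
rng = np.random.default_rng(0)
U = rng.uniform(0.5, 2.0, size=(10000, 3))     # samples of a random vector U

def expected_div(x):
    # Delta_psi(u, x) averaged over the samples of U
    d = (U * np.log(U)).sum(1) - np.sum(x * np.log(x)) - (U - x) @ grad_psi(x)
    return d.mean()

print(expected_div(U.mean(0)), expected_div(x))  # the mean gives the smaller value
```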

More generally, the projection property extends to the following lemma.

Lemma 2 Suppose L is a proper convex function whose domain is an open set containing C. L is not necessarily differentiable. Let

    x̄ = argmin_{x∈C} {L(x) + Δ_ψ(x, x_0)}.    (18)

Then for any y ∈ C we have

    L(y) + Δ_ψ(y, x_0) ≥ L(x̄) + Δ_ψ(x̄, x_0) + Δ_ψ(y, x̄).    (19)

The projection in (16) is just the special case L = 0. This property is the key to the analysis of many optimization algorithms using Bregman divergences.

Proof: Denote J(x) = L(x) + Δ_ψ(x, x_0). Since x̄ minimizes J over C, there must exist a subgradient d ∈ ∂J(x̄) such that

    ⟨d, x − x̄⟩ ≥ 0,   ∀ x ∈ C.    (20)

Since ∂J(x̄) = {g + ∇_x Δ_ψ(x, x_0)|_{x=x̄} : g ∈ ∂L(x̄)} = {g + ∇ψ(x̄) − ∇ψ(x_0) : g ∈ ∂L(x̄)}, there must be a subgradient g ∈ ∂L(x̄) such that

    ⟨g + ∇ψ(x̄) − ∇ψ(x_0), x − x̄⟩ ≥ 0,   ∀ x ∈ C.    (21)

Therefore, using the property of subgradients, we have for all y ∈ C that

    L(y) ≥ L(x̄) + ⟨g, y − x̄⟩    (22)
         ≥ L(x̄) + ⟨∇ψ(x_0) − ∇ψ(x̄), y − x̄⟩    (23)
         = L(x̄) − ⟨∇ψ(x_0), x̄ − x_0⟩ + ψ(x̄) − ψ(x_0)    (24)
              + ⟨∇ψ(x_0), y − x_0⟩ − ψ(y) + ψ(x_0)    (25)
              − ⟨∇ψ(x̄), y − x̄⟩ + ψ(y) − ψ(x̄)    (26)
         = L(x̄) + Δ_ψ(x̄, x_0) − Δ_ψ(y, x_0) + Δ_ψ(y, x̄).    (27)

Rearranging completes the proof.

2 Mirror Descent

Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the problem. Suppose we want to minimize a function f over a set C. Recall the subgradient descent rule

    x̃_{k+1} = x_k − α_k g_k,   where g_k ∈ ∂f(x_k),    (28)
    x_{k+1} = argmin_{x∈C} ‖x − x̃_{k+1}‖ = argmin_{x∈C} ‖x − (x_k − α_k g_k)‖.    (29)

This can be interpreted as follows. First approximate f around x_k by a first-order Taylor expansion,

    f(x) ≈ f(x_k) + ⟨g_k, x − x_k⟩.    (30)

Then penalize the displacement by (1/(2α_k))‖x − x_k‖². So the update rule is to find a regularized minimizer of the model,

    x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖² }.    (31)

It is trivial to see this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as the measure of displacement:

    x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k) }    (32)
            = argmin_{x∈C} { α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + Δ_ψ(x, x_k) }.    (33)

Mirror descent interpretation. Suppose the constraint set C is the whole space (i.e. there is no constraint). Then we can take the gradient with respect to x and obtain the optimality condition

    g_k + (1/α_k)(∇ψ(x_{k+1}) − ∇ψ(x_k)) = 0    (34)
    ⟺ ∇ψ(x_{k+1}) = ∇ψ(x_k) − α_k g_k    (35)
    ⟺ x_{k+1} = (∇ψ)^{−1}(∇ψ(x_k) − α_k g_k) = (∇ψ*)(∇ψ(x_k) − α_k g_k).    (36)

For example, with the KL-divergence over the simplex, the update rule becomes (up to normalization when the simplex constraint is enforced)

    x_{k+1}(i) = x_k(i) exp(−α_k g_k(i)).    (37)
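The update (36)-(37) is simple to implement. Below is a minimal Python sketch of mirror descent on the probability simplex with the entropy mirror map (exponentiated gradient); the renormalization implements the KL projection onto the simplex (a standard fact, not derived in the notes), and the linear objective and step size are illustrative choices.

```python
import numpy as np

def md_step_simplex(x, g, alpha):
    """One mirror descent step (37) with the entropy mirror map:
    multiplicative update followed by renormalization onto the simplex."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

# Illustrative objective on the simplex: f(x) = <c, x>, so the (sub)gradient is c.
c = np.array([0.9, 0.1, 0.5, 0.3])
f_grad = lambda x: c

x = np.full(4, 0.25)                       # start at the uniform distribution
for k in range(1, 201):
    x = md_step_simplex(x, f_grad(x), alpha=0.5 / np.sqrt(k))

print(x)    # mass concentrates on the coordinate with the smallest cost c_i
```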

Rate of convergence. Recall that in unconstrained subgradient descent we followed 4 steps.

1. Bound on a single update:
    ‖x_{k+1} − x^*‖² ≤ ‖x_k − α_k g_k − x^*‖²    (38)
    = ‖x_k − x^*‖² − 2α_k ⟨g_k, x_k − x^*⟩ + α_k² ‖g_k‖²    (39)
    ≤ ‖x_k − x^*‖² − 2α_k (f(x_k) − f(x^*)) + α_k² ‖g_k‖².    (40)

2. Telescope:
    ‖x_{T+1} − x^*‖² ≤ ‖x_1 − x^*‖² − 2 Σ_{k=1}^T α_k (f(x_k) − f(x^*)) + Σ_{k=1}^T α_k² ‖g_k‖².    (41)

3. Bounding ‖x_1 − x^*‖ by R and ‖g_k‖ by G:
    2 Σ_{k=1}^T α_k (f(x_k) − f(x^*)) ≤ R² + G² Σ_{k=1}^T α_k².    (42)

4. Denote ε_k = f(x_k) − f(x^*) and rearrange:
    min_{k∈{1,…,T}} ε_k ≤ (R² + G² Σ_{k=1}^T α_k²) / (2 Σ_{k=1}^T α_k).    (43)
By setting the step size α_k judiciously, we can achieve
    min_{k∈{1,…,T}} ε_k ≤ RG / √T.    (44)

Suppose C is the simplex. Then R ≤ √2. If each coordinate of each gradient g_k is upper bounded by M, then G can be as large as M√n, i.e. it depends on the dimension.

Clearly steps 2 to 4 can be easily extended by replacing ‖x^* − x_{k+1}‖² with Δ_ψ(x^*, x_{k+1}). So the only challenge left is to extend step 1. This is actually possible via Lemma 2. We further assume ψ is σ strongly convex. In (33), consider α_k (f(x_k) + ⟨g_k, x − x_k⟩) as the L in Lemma 2. Then

    α_k (f(x_k) + ⟨g_k, x^* − x_k⟩) + Δ_ψ(x^*, x_k) ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x^*, x_{k+1}).    (45)

Canceling some terms and rearranging, we obtain

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) + α_k ⟨g_k, x^* − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)    (46)
    = Δ_ψ(x^*, x_k) + α_k ⟨g_k, x^* − x_k⟩ + α_k ⟨g_k, x_k − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)    (47)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + α_k ⟨g_k, x_k − x_{k+1}⟩ − (σ/2)‖x_k − x_{k+1}‖²    (48)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + α_k ‖g_k‖ ‖x_k − x_{k+1}‖ − (σ/2)‖x_k − x_{k+1}‖²    (49)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + (α_k²/(2σ)) ‖g_k‖².    (50)

Comparing with (40), we have successfully replaced ‖x^* − x_{k+1}‖² with Δ_ψ(x^*, x_{k+1}). Again upper bound Δ_ψ(x^*, x_1) by R² and ‖g_k‖ by G. Note the norm on g_k is now the dual norm.

To see the advantage of mirror descent, suppose C is the n dimensional simplex and we use the KL-divergence, for which ψ is 1-strongly convex with respect to the ℓ_1 norm. The dual norm of the ℓ_1 norm is the ℓ_∞ norm. Then we can bound Δ_ψ(x^*, x_1) by using the KL-divergence, and it is at most log n. G can be upper bounded by M. So, in terms of the value of RG, mirror descent is smaller than subgradient descent by an order of O(√(n / log n)).
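To get a feel for the dimension dependence just discussed, here is a small back-of-the-envelope computation in Python (a sketch assuming M = 1, the uniform starting point x_1, and the bounds quoted above) comparing the constant RG that enters the bound (44) for the two geometries.

```python
import numpy as np

M = 1.0                       # bound on each coordinate of the gradients
for n in [10**2, 10**4, 10**6]:
    # Euclidean / subgradient descent on the simplex:
    #   R <= sqrt(2), G <= M * sqrt(n)  (l2 norm of the gradient)
    rg_subgrad = np.sqrt(2.0) * M * np.sqrt(n)
    # Entropic mirror descent on the simplex:
    #   R^2 = KL(x* || uniform) <= log n, G <= M  (l_inf dual norm)
    rg_mirror = np.sqrt(np.log(n)) * M
    print(n, rg_subgrad, rg_mirror, rg_subgrad / rg_mirror)
```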

Acceleration 1: f is strongly convex. We say f is strongly convex with respect to another convex function ψ with modulus λ if

    f(x) ≥ f(y) + ⟨g, x − y⟩ + λ Δ_ψ(x, y),   ∀ g ∈ ∂f(y).    (51)

Note we do not assume f is differentiable. Now, in the step from (47) to (48), we can plug in the definition of strong convexity:

    Δ_ψ(x^*, x_{k+1}) ≤ … + α_k ⟨g_k, x^* − x_k⟩ + …   (copy of (47))    (52)
    ≤ … − α_k (f(x_k) − f(x^*) + λ Δ_ψ(x^*, x_k)) + …    (53)
    ≤ …    (54)
    ≤ (1 − λα_k) Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + (α_k²/(2σ)) ‖g_k‖².    (55)

Denote δ_k = Δ_ψ(x^*, x_k) and set α_k = 1/(λk). Then

    δ_{k+1} ≤ ((k−1)/k) δ_k − (1/(λk)) ε_k + G²/(2σλ²k²)  ⟺  k δ_{k+1} ≤ (k−1) δ_k − (1/λ) ε_k + G²/(2σλ²k).    (56)

Now telescope (sum up both sides from k = 1 to T):

    T δ_{T+1} ≤ −(1/λ) Σ_{k=1}^T ε_k + (G²/(2σλ²)) Σ_{k=1}^T 1/k
    ⟹ min_{k∈{1,…,T}} ε_k ≤ (1/T) Σ_{k=1}^T ε_k ≤ (G²/(2σλT)) Σ_{k=1}^T 1/k ≤ (G²/(σλ)) O((log T)/T).    (57)

Acceleration 2: f has Lipschitz continuous gradient. If the gradient of f is Lipschitz continuous, there exists L > 0 such that

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,   ∀ x, y.    (58)

Sometimes we just directly say f is smooth. It is also known that this is equivalent to

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖².    (59)

We bound the ⟨g_k, x^* − x_{k+1}⟩ term in (46) as follows (here g_k = ∇f(x_k)):

    ⟨g_k, x^* − x_{k+1}⟩ = ⟨g_k, x^* − x_k⟩ + ⟨g_k, x_k − x_{k+1}⟩    (60)
    ≤ f(x^*) − f(x_k) + f(x_k) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖²    (61)
    = f(x^*) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖².    (62)

Plugging into (46), we get

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) + α_k (f(x^*) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖²) − (σ/2)‖x_k − x_{k+1}‖².    (63)

Setting α_k = σ/L, we get

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) − (σ/L)(f(x_{k+1}) − f(x^*)).    (64)

Telescoping, we get

    min_{k∈{2,…,T+1}} f(x_k) − f(x^*) ≤ L Δ_ψ(x^*, x_1)/(σT) ≤ LR²/(σT).    (65)

This gives an O(1/T) convergence rate. But if we are smarter, like Nesterov, the rate can be improved to O(1/T²). We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.

2.1 Composite Objective

Suppose the objective function is h(x) = f(x) + r(x), where f is smooth and r(x) is simple, like ‖x‖_1. If we directly apply the above rates to optimizing h, we get an O(1/√T) rate of convergence because h is not smooth. It would be nice if we could enjoy the O(1/T) rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of r(x), and we only need to extend the proximal operator (33) as follows:

    x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k) }    (66)
            = argmin_{x∈C} { α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + α_k r(x) + Δ_ψ(x, x_k) }.    (67)
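Here is a minimal Python sketch of the composite update (66), assuming the Euclidean case ψ(x) = ½‖x‖², an unconstrained C, and r(x) = λ‖x‖_1, for which the update has the well-known closed-form soft-thresholding solution; the data and the regularization weight are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form minimizer of  t*||x||_1 + 0.5*||x - v||^2  (coordinate-wise)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Illustrative composite problem: h(x) = 0.5*||Ax - b||^2 + lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
lam = 0.5

L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
alpha = 1.0 / L                      # step size sigma/L with sigma = 1
x = np.zeros(20)
for _ in range(500):
    g = A.T @ (A @ x - b)            # gradient of the smooth part f
    x = soft_threshold(x - alpha * g, alpha * lam)   # composite update (66)

print(np.count_nonzero(x))           # the l1 term produces a sparse solution
```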

Here we use a first-order Taylor approximation of f around x_k, but keep r(x) exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We show here only the case of a general f (not necessarily smooth or strongly convex), and leave the other two cases as an exercise. In fact we can again achieve the O(1/T) rate when f has Lipschitz continuous gradient. Consider α_k (f(x_k) + ⟨g_k, x − x_k⟩ + r(x)) as the L in Lemma 2. Then

    α_k (f(x_k) + ⟨g_k, x^* − x_k⟩ + r(x^*)) + Δ_ψ(x^*, x_k)    (68)
    ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩ + r(x_{k+1})) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x^*, x_{k+1}).    (69)

Following exactly the derivations from (46) to (50), we obtain

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) + α_k ⟨g_k, x^* − x_{k+1}⟩ + α_k (r(x^*) − r(x_{k+1})) − Δ_ψ(x_{k+1}, x_k)    (70)
    ≤ …    (71)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) + r(x_{k+1}) − f(x^*) − r(x^*)) + (α_k²/(2σ)) ‖g_k‖².    (72)

This is almost the same as (50), except that we have r(x_{k+1}) here, not r(x_k). Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote δ_k = Δ_ψ(x^*, x_k); then

    f(x_k) + r(x_{k+1}) − f(x^*) − r(x^*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/(2σ)) ‖g_k‖².    (73)

Summing up from k = 1 to T, we obtain

    r(x_{T+1}) − r(x_1) + Σ_{k=1}^T (h(x_k) − h(x^*)) ≤ δ_1/α_1 + Σ_{k=2}^T (1/α_k − 1/α_{k−1}) δ_k − δ_{T+1}/α_T + (G²/(2σ)) Σ_{k=1}^T α_k    (74)
    ≤ R² (1/α_1 + Σ_{k=2}^T (1/α_k − 1/α_{k−1})) + (G²/(2σ)) Σ_{k=1}^T α_k    (75)
    = R²/α_T + (G²/(2σ)) Σ_{k=1}^T α_k.    (76)

Suppose we choose x_1 = argmin_x r(x), which ensures r(x_{T+1}) − r(x_1) ≥ 0. Setting α_k = (R/G)√(σ/k), we get

    Σ_{k=1}^T (h(x_k) − h(x^*)) ≤ (RG/√σ)(√T + Σ_{k=1}^T 1/(2√k)) = (RG/√σ) O(√T).    (77)

Therefore min_{k∈{1,…,T}} {h(x_k) − h(x^*)} decays at the rate O(RG/√(σT)).

Algorithm 1: Protocol of online learning
1  The player initializes a model x_1.
2  for k = 1, 2, … do
3    The player proposes a model x_k.
4    The rival picks a function f_k.
5    The player suffers a loss f_k(x_k).
6    The player gets access to f_k and uses it to update its model to x_{k+1}.

2.2 Online learning

The protocol of online learning is shown in Algorithm 1. The player's goal in online learning is to minimize the regret, i.e. the gap between its cumulative loss and the minimal possible loss Σ_k f_k(x) over all fixed x:

    Regret = Σ_{k=1}^T f_k(x_k) − min_{x∈C} Σ_{k=1}^T f_k(x).    (78)

Note there is no assumption made on how the rival picks f_k, and it can be adversarial. After obtaining f_k at iteration k, let us update the model to x_{k+1} by using the mirror descent rule on the function f_k only:

    x_{k+1} = argmin_{x∈C} { f_k(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k) },   where g_k ∈ ∂f_k(x_k).    (79)
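The online update (79) looks exactly like the batch one; only the loss changes each round. Below is a hedged Python sketch of (79) with the entropy mirror map on the simplex against randomly drawn linear losses f_k(x) = ⟨c_k, x⟩, printing the realized regret (78); the loss sequence and step size are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 10, 5000
C = rng.uniform(0.0, 1.0, size=(T, n))         # rival's linear losses f_k(x) = <c_k, x>

x = np.full(n, 1.0 / n)                        # player starts at the uniform distribution
loss = 0.0
for k in range(T):
    g = C[k]                                   # gradient of f_k at x_k
    loss += g @ x                              # player suffers f_k(x_k)
    alpha = np.sqrt(np.log(n) / (k + 1))       # decreasing step size
    x = x * np.exp(-alpha * g)                 # mirror descent step (79) with KL divergence
    x /= x.sum()

best_fixed = C.sum(axis=0).min()               # min_x sum_k f_k(x), attained at a vertex
print(loss - best_fixed, np.sqrt(T * np.log(n)))   # regret vs. the O(sqrt(T log n)) scale
```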

Then it is easy to derive the regret bound. Using f_k in step (50), we have

    f_k(x_k) − f_k(x^*) ≤ (1/α_k)(Δ_ψ(x^*, x_k) − Δ_ψ(x^*, x_{k+1})) + (α_k/(2σ)) ‖g_k‖².    (80)

Summing up from k = 1 to T and using the same process as in (74) to (77), we get

    Σ_{k=1}^T (f_k(x_k) − f_k(x^*)) ≤ (RG/√σ) O(√T).    (81)

So the regret grows in the order of O(√T).

f_k is strongly convex. Use exactly (55) with f_k in place of f, and we can derive an O(log T) regret bound immediately.

f_k has Lipschitz continuous gradient. The result in (64) can NOT be extended to the online setting, because if we replace f by f_k we will get f_k(x_{k+1}) − f_k(x^*) on the right-hand side. Telescoping will not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from O(√T) (as for a nonsmooth objective) to O(log T).

Composite objective. In the online setting, both the player and the rival know r(x), and the rival changes f_k(x) at each iteration. The loss incurred at each iteration is h_k(x_k) = f_k(x_k) + r(x_k). The update rule is

    x_{k+1} = argmin_{x∈C} { f_k(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k) },   where g_k ∈ ∂f_k(x_k).    (82)

Note that in this setting, (73) becomes

    f_k(x_k) + r(x_{k+1}) − f_k(x^*) − r(x^*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/(2σ)) ‖g_k‖².    (83)

Although we have r(x_{k+1}) here rather than r(x_k), it is fine because r does not change through the iterations. Choosing x_1 = argmin_x r(x) and telescoping in the same way as from (74) to (77), we immediately obtain

    Σ_{k=1}^T (h_k(x_k) − h_k(x^*)) ≤ (RG/√σ) O(√T).    (84)

So the regret grows at O(√T). When the f_k are strongly convex, we can get O(log T) regret in the composite case. But as expected, having Lipschitz continuity of ∇f_k alone cannot reduce the regret from O(√T) to O(log T).

2.3 Stochastic optimization

Let us consider optimizing a function which takes the form of an expectation,

    min_x F(x) := E_{ω∼p}[f(x; ω)],    (85)

where p is a distribution over ω. This subsumes a lot of machine learning models. For example, the SVM objective is

    F(x) = (1/m) Σ_{i=1}^m max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2)‖x‖².    (86)

It can be interpreted as (85) where ω is uniformly distributed in {1, 2, …, m} (i.e. p(ω = i) = 1/m), and

    f(x; i) = max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2)‖x‖².    (87)

When m is large, it can be costly to calculate F and its subgradient. So a simple idea is to base the updates on a single randomly chosen data point. This can be considered as a special case of online learning in Algorithm 1, where the rival in step 4 now randomly picks f_k as f(·; ω_k), with ω_k drawn independently from p. Ideally we hope that, by using the mirror descent updates, x_k will gradually approach the minimizer of F(x).
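As a concrete instance of this recipe, here is a hedged Python sketch of stochastic (Euclidean) mirror descent for the SVM objective (86): at each step the "rival" draws one index i uniformly and the player takes a subgradient step on f(x; i) from (87). The data, the step-size schedule, and λ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 1000, 20, 0.01
A = rng.standard_normal((m, d))                     # rows a_i
w_true = rng.standard_normal(d)
c = np.sign(A @ w_true)                             # labels c_i in {-1, +1}

def subgrad(x, i):
    """A subgradient of f(x; i) = max{0, 1 - c_i <a_i, x>} + (lam/2)||x||^2."""
    g = lam * x
    if c[i] * (A[i] @ x) < 1.0:
        g = g - c[i] * A[i]
    return g

x = np.zeros(d)
for k in range(1, 50001):
    i = rng.integers(m)                             # rival draws omega_k uniformly
    x = x - (1.0 / (lam * k)) * subgrad(x, i)       # step size 1/(lam*k): f is lam-strongly convex

F = np.mean(np.maximum(0.0, 1.0 - c * (A @ x))) + 0.5 * lam * x @ x
print(F)                                            # full objective (86) at the final iterate
```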

Intuitively this is quite reasonable, and by using f_k we can compute an unbiased estimate of F(x_k) and a subgradient of F at x_k (because the ω_k are sampled iid from p). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.

Algorithm 2: Protocol of online learning (stochastic setting)
1  The player initializes a model x_1.
2  for k = 1, 2, … do
3    The player proposes a model x_k.
4    The rival randomly draws ω_k from p, which defines a function f_k(x) := f(x; ω_k).
5    The player suffers a loss f_k(x_k).
6    The player gets access to f_k and uses it to update its model to x_{k+1} by, e.g., mirror descent (79).

In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays ω_k at iteration k. Then an online learning algorithm A is simply a deterministic mapping from an ordered set {ω_1, …, ω_k} to x_{k+1}. Denote by A(∅) the initial model (the image of the empty sequence). Then the following theorem is the key to online-to-batch conversion.

Theorem 3 Suppose an online learning algorithm A has regret bound R_T after running Algorithm 2 for T iterations. Suppose ω_1, …, ω_{T+1} are drawn iid from p. Define x̂ = A(ω_{j+1}, …, ω_T), where j is drawn uniformly at random from {0, …, T}. Then

    E[F(x̂)] − min_x F(x) ≤ R_{T+1}/(T + 1),    (88)

where the expectation is with respect to the randomness of ω_1, …, ω_T and j. Similarly we can have high-probability bounds, which can be stated in a form like (not exactly true)

    F(x̂) − min_x F(x) ≤ R_{T+1}/(T + 1) + O(√((log(1/δ))/T))   with probability at least 1 − δ,    (89)

where the probability is with respect to the randomness of ω_1, …, ω_T and j.

Proof of Theorem 3:

    E[F(x̂)] = E_{j, ω_1,…,ω_{T+1}}[f(x̂; ω_{T+1})] = E_{j, ω_1,…,ω_{T+1}}[f(A(ω_{j+1}, …, ω_T); ω_{T+1})]    (90)
    = E_{ω_1,…,ω_{T+1}}[ (1/(T+1)) Σ_{j=0}^T f(A(ω_{j+1}, …, ω_T); ω_{T+1}) ]   (as j is drawn uniformly at random)    (91)
    = E_{ω_1,…,ω_{T+1}}[ (1/(T+1)) Σ_{j=0}^T f(A(ω_1, …, ω_j); ω_{j+1}) ]   (shift the iteration index, by iid-ness of the ω_i)    (92)
    = (1/(T+1)) E_{ω_1,…,ω_{T+1}}[ Σ_{s=1}^{T+1} f(A(ω_1, …, ω_{s−1}); ω_s) ]   (change of variable s = j + 1)    (93)
    ≤ (1/(T+1)) E_{ω_1,…,ω_{T+1}}[ min_x Σ_{s=1}^{T+1} f(x; ω_s) + R_{T+1} ]   (apply the regret bound)    (94)
    ≤ min_x E_ω[f(x; ω)] + R_{T+1}/(T + 1)   (expectation of a min is smaller than the min of expectations)    (95)
    = min_x F(x) + R_{T+1}/(T + 1).    (96)
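To illustrate the online-to-batch conversion of Theorem 3, here is a hedged Python sketch on a toy problem f(x; ω) = ½(x − ω)², for which F is minimized at E[ω]; the online algorithm A is plain online gradient descent, and the distribution, horizon, and step sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
omega = rng.normal(loc=3.0, scale=1.0, size=T + 1)   # omega_1, ..., omega_{T+1} iid

def A(samples):
    """Online gradient descent on f(x; w) = 0.5*(x - w)^2 over the given sample sequence.
    Returns the model after processing all samples; A of the empty sequence is 0."""
    x = 0.0
    for k, w in enumerate(samples, start=1):
        x -= (1.0 / k) * (x - w)          # gradient of f(.; w) at x is (x - w)
    return x

j = rng.integers(0, T + 1)                # j uniform in {0, ..., T}
x_hat = A(omega[j:T])                     # x_hat = A(omega_{j+1}, ..., omega_T)

# F(x) = 0.5*E[(x - omega)^2] is minimized at E[omega] = 3.
print(x_hat)                              # close to 3 with high probability
```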