Exponentiated Gradient Descent


CSE599s, Spring 2012, Online Learning. Lecture 10 - 04/26/2012. Lecturer: Ofer Dekel. Scribe: Albert Yu.

1 Introduction

In this lecture we review norms, dual norms, strong convexity, Lagrange multipliers and FTRL. We then look at the Lazy Projection Gradient Descent algorithm and develop the Exponentiated Gradient Descent algorithm.

2 Review Topics

2.1 Norms

Definition 1. A norm is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ that measures the size of vectors in a vector space $W$, such that for all $u, v \in W$ and every scalar $a$:

$\|a v\| = |a|\,\|v\|$ (positive homogeneity)

$\|u + v\| \le \|u\| + \|v\|$ (additivity, or the Triangle Inequality)

$\|v\| \ge 0$, and $\|v\| = 0 \iff v = 0$ (positivity)

The norm we are primarily concerned with is the p-norm, defined as

$$\|v\|_p = \Big( \sum_{i=1}^{n} |v_i|^p \Big)^{1/p}, \quad p \ge 1. \tag{1}$$

Remark: The norm when $p = 2$ is called the Euclidean norm.

2.2 Dual Norms

Definition 2. The dual norm is the norm in the dual space $W^*$, the set of continuous linear functionals on $W$, so that

$$\|v\|_* = \sup_{\|u\| \le 1} (u \cdot v). \tag{2}$$

Remark: For a given p-norm with $p \ge 1$, the dual norm is the q-norm satisfying $1/p + 1/q = 1$. Each norm has an associated dual norm. The Euclidean norm is the special case that equals its own dual norm ($p = q = 2$).

Lemma 3. If a convex function $f$ is G-Lipschitz with respect to $\|\cdot\|$, then for all $w$ and all $g \in \partial f(w)$, $\|g\|_* \le G$.

Proof. Since $f$ is G-Lipschitz with respect to $\|\cdot\|$,

$$|f(w') - f(w)| \le G \|w' - w\|. \tag{3}$$

Choose $g \in \partial f(w)$. By convexity (the subgradient inequality) and the Lipschitz condition,

$$g \cdot (w' - w) \le f(w') - f(w) \le G \|w' - w\|. \tag{4}$$

Hölder's inequality, $u \cdot v \le \|u\| \, \|v\|_*$, bounds the same inner product by

$$g \cdot (w' - w) \le \|g\|_* \, \|w' - w\|, \tag{5}$$

and by the definition of the dual norm in (2) this bound is attained in the supremum over $\|w' - w\| \le 1$. Taking that supremum in (4) therefore gives

$$\|g\|_* = \sup_{\|w' - w\| \le 1} g \cdot (w' - w) \le G. \tag{6}$$

Remark: This is roughly the same proof as in Lecture 4, with Hölder's inequality substituted for Cauchy-Schwarz. Note that Cauchy-Schwarz is the special case of Hölder's inequality where $p = q = 2$.

2.3 Strong Convexity (Polyak, 1966 [2])

We expand on last lecture's definition of strong convexity and offer alternative forms.

Definition 4. A convex function $f$ is σ-strongly convex with respect to some norm $\|\cdot\|$ over a set $W$ if for all $u, w \in W$ and every $g \in \partial f(w)$ it holds that

$$f(u) \ge f(w) + g \cdot (u - w) + \frac{\sigma}{2} \|u - w\|^2. \tag{7}$$

This quadratic growth in separation means that halfway between $u$ and $w$,

$$f\Big(\frac{u + w}{2}\Big) \le \frac{1}{2} f(u) + \frac{1}{2} f(w) - \frac{\sigma}{8} \|u - w\|^2. \tag{8}$$

[Figure 1: Depiction of strong convexity, showing the gaps $\frac{\sigma}{2}\|u - w\|^2$ and $\frac{\sigma}{8}\|u - w\|^2$ from (7) and (8).]

If $f$ is twice differentiable, strong convexity is equivalent to the following condition for all $w \in W$ and all $u$, where $\nabla^2 f$ is the Hessian of $f$:

$$u^\top \nabla^2 f(w)\, u \ge \sigma \|u\|^2. \tag{9}$$

This means that the larger the step $(u - w)$, the larger the separation between gradients (in 2D the slope keeps growing), or

$$(\nabla f(u) - \nabla f(w)) \cdot (u - w) \ge \sigma \|u - w\|^2. \tag{10}$$

Remark: Norms tend to matter more in higher dimensions, where the idea of size becomes harder to conceptualize. The above definition means that $\frac{1}{2}\|w\|_2^2$ is 1-strongly convex with respect to $\|w\|_2$, and more generally, for $p \in [1, 2]$, $\frac{1}{2}\|w\|_p^2$ is $(p - 1)$-strongly convex with respect to $\|w\|_p$.

2.4 Lagrange Multipliers

Lagrange multipliers are used when we want to minimize a function subject to equality constraints, via the use of a barrier function $B$.

1. $\min_x f(x)$ such that $g(x) = 0$.

Define a barrier

$$B(x) = \begin{cases} \infty & \text{if } g(x) \ne 0 \\ 0 & \text{if } g(x) = 0 \end{cases} \;=\; \max_{\lambda \in \mathbb{R}} \lambda g(x),$$

where $\lambda$ is the Lagrange multiplier.

2. $\min_x f(x) + B(x)$, where the barrier now enforces $g(x) = 0$.

Then this process becomes

3. $\min_x \max_\lambda \big( f(x) + \lambda g(x) \big) = \max_\lambda \min_x \big( f(x) + \lambda g(x) \big)$, where $f(x) + \lambda g(x)$ is the Lagrangian.

Remark: The exchange of min and max above holds under strong duality, and the Lagrange multiplier is also known as the dual variable. The Lagrange multiplier becomes a dialing knob for sampling along the dominant solutions.

2.5 Follow-the-Regularized-Leader (FTRL) [3]

Previously we looked at FTRL, where at time $t$ the player's next step with hindsight is given by

$$w_t = \arg\min_{w \in W} \big( f_{1:(t-1)}(w) + r(w) \big), \tag{11}$$

where $W$ is the feasible set, $f_{1:(t-1)}(w)$ is the sum of all losses up to time $t - 1$, and $r(w)$ is a strongly convex regularizer (just one, not one per round). We make the following assumptions:

1. each $f_t$ is convex and G-Lipschitz with respect to $\|\cdot\|$
2. $r(w)$ is σ-strongly convex with respect to $\|\cdot\|$
3. $W$ is a convex set

Then we have shown in Lecture 3 that the regret of FTRL for linear $f_t$ is bounded by

$$\mathrm{Regret}(T) \le \frac{T G^2}{\sigma} + r(w^*). \tag{12}$$

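[Figure 2: FTRL. The adversary's loss $f_t$ passes through a transform that outputs $g_t \in \partial f_t(w_t)$, which the online learning algorithm (OLA) uses to produce $w_{t+1}$.]

Recall that linear losses are the hardest case among convex losses, but with a choice of $\sigma = G\sqrt{T}/R$ we can bound the regret at $O(\sqrt{T})$. The player can still play

$$w_t = \arg\min_{w \in W} \big( w \cdot g_{1:(t-1)} + r(w) \big)$$

to get the bound expressed earlier in (12).

Remark: The advantage here is that the player need not keep a backlog of all previous vectors $g$, but only a running total. Each round played updates the total, $g_{1:t} = g_{1:(t-1)} + g_t$.

Now the question is: which $r(w)$ should we choose?

Example: Lazy Projection Gradient Descent Algorithm

Choose $r(w) = \frac{1}{2\eta}\|w\|_2^2$, which is $(1/\eta)$-strongly convex with respect to $\|\cdot\|_2$, and let $\eta = R/(G\sqrt{2T})$, where $R$ bounds $\|w^*\|_2$.

Bound: Our FTRL theorem still applies, with $\sigma = 1/\eta$:

$$\mathrm{Regret}(T) \le \eta T G^2 + \frac{1}{2\eta}\|w^*\|_2^2 \le \eta T G^2 + \frac{R^2}{2\eta}.$$

Choosing $\eta = \frac{R}{G\sqrt{2T}}$ gives

$$\mathrm{Regret}(T) \le \frac{GR}{\sqrt{2}}\sqrt{T} + \frac{GR}{\sqrt{2}}\sqrt{T} = GR\sqrt{2T}.$$

Update Rule: To find the player's next step, find $\arg\min_{w \in W} \; w \cdot g_{1:(t-1)} + \frac{1}{2\eta}\|w\|_2^2$. Multiply the objective by $\eta$ (this does not change the argmin, since $\eta$ is independent of $w$) to get

$$\arg\min_{w \in W} \; w \cdot g_{1:(t-1)} + \frac{1}{2\eta}\|w\|_2^2 = \arg\min_{w \in W} \; w \cdot \eta g_{1:(t-1)} + \frac{1}{2}\|w\|_2^2. \tag{13}$$

Add the constant $\frac{1}{2}\|\eta g_{1:(t-1)}\|_2^2$ to complete the square:

$$= \arg\min_{w \in W} \; \frac{1}{2}\|\eta g_{1:(t-1)}\|_2^2 + w \cdot \eta g_{1:(t-1)} + \frac{1}{2}\|w\|_2^2 = \arg\min_{w \in W} \; \frac{1}{2}\|\eta g_{1:(t-1)} + w\|_2^2 = \mathrm{proj}_W\big(-\eta g_{1:(t-1)}\big). \tag{14}$$

Remark: Abstractly, this means we see in which direction our gradient points and step in the opposite (or "downhill") direction. If the new vector is outside the feasible set, we project it back.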
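The update rule (14) can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the lecture: the feasible set $W$ is taken to be the Euclidean ball of radius $R$ (so the projection has a closed form), the loss vectors are random, and the names `project_l2_ball` and `lazy_projection_gd` are made up for this example.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius} (assumed feasible set)."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def lazy_projection_gd(gradients, radius, eta):
    """Play w_t = proj_W(-eta * g_{1:(t-1)}) on every round t, as in (14)."""
    g_sum = np.zeros(gradients.shape[1])  # running total g_{1:(t-1)}
    plays = []
    for g_t in gradients:
        plays.append(project_l2_ball(-eta * g_sum, radius))
        g_sum = g_sum + g_t  # only the running total is kept, not the history
    return np.array(plays)

# Hypothetical linear losses f_t(w) = g_t . w with random g_t.
rng = np.random.default_rng(0)
T, n, R = 1000, 5, 1.0
gradients = rng.uniform(-1.0, 1.0, size=(T, n))
G = np.max(np.linalg.norm(gradients, axis=1))  # Lipschitz constant of the losses
eta = R / (G * np.sqrt(2 * T))                 # step size from the bound above

plays = lazy_projection_gd(gradients, R, eta)
learner_loss = float(np.sum(plays * gradients))
# For linear losses on the ball, the best fixed point in hindsight is
# -R * g_{1:T} / ||g_{1:T}||, with total loss -R * ||g_{1:T}||.
best_loss = -R * float(np.linalg.norm(gradients.sum(axis=0)))
regret = learner_loss - best_loss
bound = G * R * np.sqrt(2 * T)
print(f"regret = {regret:.2f}, bound GR*sqrt(2T) = {bound:.2f}")
```

The algorithm is "lazy" in that the unprojected sum $-\eta g_{1:(t-1)}$ is what gets carried forward; the projection is applied only to produce the point that is actually played.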

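3 Exponentiated Gradient Descent Algorithm (EG)

3.1 What is EG

Consider a special case of FTRL where we choose the (negative) entropy regularizer

$$r(w) = \frac{1}{\eta} \sum_{i=1}^{n} w_i \log w_i,$$

where $W$ is the probability simplex, $W = \{ w \in \mathbb{R}^n : \sum_{i=1}^{n} w_i = 1, \; \forall i, \; w_i \ge 0 \}$.

Lemma 5. For EG, $r$ is $(1/\eta)$-strongly convex over $W$ with respect to $\|\cdot\|_1$, where $\|w\|_1 = \sum_{i=1}^{n} |w_i|$.

Proof. We want to show strong convexity, which by definition (7) means showing that for all $u, w \in W$,

$$r(u) \overset{?}{\ge} r(w) + (u - w) \cdot \nabla r(w) + \frac{1}{2\eta}\|u - w\|_1^2. \tag{15}$$

Multiplying both sides by $\eta$ and computing $\nabla r(w)$, whose i-th component is $\frac{1}{\eta}(1 + \log w_i)$, this is equivalent to showing

$$\sum_{i=1}^{n} u_i \log u_i \overset{?}{\ge} \sum_{i=1}^{n} w_i \log w_i + \sum_{i=1}^{n} (u_i - w_i)(1 + \log w_i) + \frac{1}{2}\|u - w\|_1^2 = \sum_{i=1}^{n} u_i \log w_i + \frac{1}{2}\|u - w\|_1^2, \tag{16}$$

where the equality uses $\sum_i (u_i - w_i) = 0$ on the simplex. Subtracting $\sum_{i=1}^{n} u_i \log w_i$ from both sides gives the inequality

$$\sum_{i=1}^{n} u_i \log \frac{u_i}{w_i} \overset{?}{\ge} \frac{1}{2}\|u - w\|_1^2. \tag{17}$$

Note that $\sum_{i=1}^{n} u_i \log \frac{u_i}{w_i}$ is the Kullback-Leibler divergence $D(u \| w)$. Pinsker's Inequality states that $D(u \| w) \ge 2\delta(u, w)^2$, where $\delta(u, w) = \frac{1}{2}\|u - w\|_1$ is the total variation distance, so

$$\sum_{i=1}^{n} u_i \log \frac{u_i}{w_i} = D(u \| w) \ge \frac{1}{2}\|u - w\|_1^2, \tag{18}$$

with equality when $u = w$. This satisfies definition (7) for strong convexity.

3.2 Update Rule

Now we look at the update rule for the given choice of $r(w)$. To find the player's next step under EG, find

$$\arg\min_{w \in W} \; w \cdot g_{1:(t-1)} + \frac{1}{\eta} \sum_{i=1}^{n} w_i \log w_i. \tag{19}$$

Use a Lagrange multiplier $\lambda$ for the simplex constraint $\sum_i w_i = 1$. The constrained minimum in (19) equals

$$\min_{w} \max_{\lambda} \Big( w \cdot g_{1:(t-1)} + \frac{1}{\eta} \sum_{i=1}^{n} w_i \log w_i + \lambda \Big( 1 - \sum_{i=1}^{n} w_i \Big) \Big) = \max_{\lambda} \min_{w} L(w, \lambda), \tag{20}$$

where $L(w, \lambda)$ denotes the Lagrangian, the expression in the outer parentheses.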

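Set the derivatives to zero and evaluate

$$\frac{\partial L}{\partial w_i} = g_{1:(t-1),i} + \frac{1}{\eta}(1 + \log w_i) - \lambda = 0. \tag{21}$$

This is equivalent to

$$\log w_i = -\eta g_{1:(t-1),i} - (1 - \eta\lambda) \;\Longrightarrow\; w_i = \exp\big( -\eta g_{1:(t-1),i} - (1 - \eta\lambda) \big) \tag{22}$$

for fixed $\lambda$. Since $\lambda$ is the same for every $i$, the factor $\exp(-(1 - \eta\lambda))$ is a common normalizer determined by the constraint $\sum_i w_i = 1$, so $w_i \propto \exp(-\eta g_{1:(t-1),i})$.

Remark: Exponentiated Gradient Descent is closely related to the algorithms known as Weighted Majority, Winnow and Hedge.

3.3 EG vs. GD

What are the differences between EG and GD? Consider the setting of prediction with expert advice, where on round $t$ the player follows the advice of one expert, indexed by $I_t$ (a random variable), out of $n$ experts. Let

$$g_{t,i} = \begin{cases} 1 & \text{if expert } i \text{ gives wrong advice} \\ 0 & \text{if expert } i \text{ gives good advice} \end{cases}, \qquad f_t = \mathbb{E}[g_{t,I_t}] = g_t \cdot w_t. \tag{23}$$

For GD, in the all-wrong-experts scenario, $G = \|g_t\|_2 = \sqrt{n}$ and $R = 1$ (every $w$ in the simplex has $\|w\|_2 \le 1$), leading to the regret bound, up to constant factors,

$$\mathrm{Regret}(T) = O\big(\sqrt{T n}\big). \tag{24}$$

For EG, $G = \|g_t\|_* = \|g_t\|_\infty = 1$, and the entropy $H(w) \le \log n$ plays the role of $R$, so, up to constant factors,

$$\mathrm{Regret}(T) = O\big(\sqrt{T \log n}\big). \tag{25}$$

Remark: In the experts setting, the EG regret bound grows only as $\sqrt{\log n}$ in the number of experts, while the GD bound grows as $\sqrt{n}$.

[Figure 3: Relative regret bound for EG and GD with n experts (regret versus n, one curve each for GD and EG).]

References

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
[2] E.S. Levitin and B.T. Polyak, Constrained Minimization Methods, USSR Comp. Math and Math Phys. 6, pp. 1-50, 1966.
[3] H. B. McMahan, Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization, Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).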
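As a minimal sketch of the EG update (22) in the experts setting of Section 3.3, the Python below plays $w_i \propto \exp(-\eta g_{1:(t-1),i})$ on simulated 0/1 expert losses. Everything specific here is an assumption made for illustration rather than part of the lecture: the simulated loss matrix, the choice of expert 3 as the reliable one, the step size $\eta = \sqrt{\log n / T}$, and the names `eg_weights` and `run_eg`.

```python
import numpy as np

def eg_weights(g_sum, eta):
    """Return w_i proportional to exp(-eta * g_sum_i), normalized onto the simplex."""
    z = -eta * g_sum
    z = z - z.max()  # shift for numerical stability; it cancels after normalizing
    w = np.exp(z)
    return w / w.sum()

def run_eg(losses, eta):
    """Run EG on a (T, n) matrix of expert losses; return total loss and final weights."""
    g_sum = np.zeros(losses.shape[1])  # running total g_{1:(t-1)}
    total = 0.0
    for g_t in losses:
        w_t = eg_weights(g_sum, eta)
        total += float(w_t @ g_t)      # expected loss f_t = g_t . w_t, as in (23)
        g_sum += g_t
    return total, eg_weights(g_sum, eta)

# Hypothetical data: 10 experts wrong half the time, except expert 3 (wrong 10%).
rng = np.random.default_rng(0)
T, n = 2000, 10
losses = (rng.random((T, n)) < 0.5).astype(float)   # 1 = wrong advice
losses[:, 3] = (rng.random(T) < 0.1).astype(float)
eta = np.sqrt(np.log(n) / T)                        # assumed step size

eg_loss, final_w = run_eg(losses, eta)
regret = eg_loss - losses.sum(axis=0).min()         # versus the best single expert
print(f"regret = {regret:.1f}, sqrt(T log n) = {np.sqrt(T * np.log(n)):.1f}")
```

Subtracting the maximum before exponentiating plays the same role as the multiplier $\lambda$ in (22): any shift common to all coordinates disappears once the weights are normalized to sum to one.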
