EECS 598-005: Theoretical Foundations of Machine Learning                                Fall 2015

Lecture 23: Online convex optimization
Lecturer: Jacob Abernethy                                                    Scribes: Vikas Dhiman

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

23.1 Online convex optimization: a generalization of several algorithms

Many of the algorithms that we studied in this course are special cases of online convex optimization. We first recall a few of the algorithms studied so far; later we will argue how each of them is a special case of online convex optimization.

Expert setting

1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in \Delta_n$;
3     Nature chooses loss $\ell_t \in [0, 1]^n$;
4 end
Algorithm 23.1: Expert setting

Linear Perceptron

1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in \mathbb{R}^n$;
3     Nature chooses $(x_t, y_t) \in \mathbb{R}^n \times \{-1, 1\}$;
4     Algorithm experiences loss $\ell_t = \max(0, -y_t (w_t \cdot x_t))$, a convex surrogate for the mistake indicator $\mathbf{1}[y_t (w_t \cdot x_t) < 0]$;
5 end
Algorithm 23.2: Linear Perceptron

Online linear regression

1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in \mathbb{R}^n$;
3     Nature chooses $x_t \in \mathbb{R}^n$ and $y_t \in \mathbb{R}$;
4     Algorithm experiences loss $\ell_t = (w_t \cdot x_t - y_t)^2$;
5 end
Algorithm 23.3: Online linear regression

Portfolio selection

1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in \Delta_n$;
3     Nature chooses price $\mathrm{Price}_t(i)$ for every asset $i$;
4     Algorithm experiences gain $g_t = w_t \cdot x_t$ where $x_t(i) = \mathrm{Price}_t(i) / \mathrm{Price}_{t-1}(i)$;
5 end
Algorithm 23.4: Portfolio selection
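To see these four settings through a single lens, the following is a minimal Python sketch of the per-round objectives, each written as a function of the learner's choice $w$. The function names and the NumPy formulation are illustrative choices of ours, the expert-setting loss is the usual dot product $\langle w, \ell_t \rangle$, and the negative-log-wealth form of the portfolio objective is the standard convex surrogate for the gain $w \cdot x_t$ rather than something specified in the boxes above.

```python
import numpy as np

# Per-round objectives of the four example settings, each viewed as a
# function of the learner's choice w (everything else is fixed by nature).

def expert_loss(w, loss_vector):
    """Expert setting: w lies in the simplex, the loss is linear in w."""
    return float(np.dot(w, loss_vector))            # <w, l_t>, convex (linear)

def perceptron_loss(w, x, y):
    """Perceptron: hinge surrogate for the mistake indicator."""
    return max(0.0, -y * float(np.dot(w, x)))       # convex in w

def regression_loss(w, x, y):
    """Online linear regression: squared loss."""
    return (float(np.dot(w, x)) - y) ** 2           # convex in w

def portfolio_loss(w, price_ratios):
    """Portfolio selection: negative log-wealth, a standard convex
    surrogate (our choice here) for maximizing the gain w . x_t."""
    return -float(np.log(np.dot(w, price_ratios)))  # convex in w on the simplex
```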
23.2 Online convex optimization

We now introduce the general form of online convex optimization, which generalizes all of the examples above.

Given:
  - A convex compact decision set $X \subseteq \mathbb{R}^n$.
  - A class of loss functions $\mathcal{L} = \{\ell : X \to \mathbb{R} \mid \ell \text{ is convex and Lipschitz continuous}\}$.

1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in X$;
3     Nature chooses $\ell_t \in \mathcal{L}$;
4     Algorithm experiences loss $\ell_t(w_t)$;
5     Algorithm updates $w_{t+1}$;
6 end
Algorithm 23.5: Generalized pattern of online convex optimization

Our goal is to design a rule for step 5 of Algorithm 23.5 that minimizes the regret

    \mathrm{Regret}_T(\mathrm{Alg}) := \sum_{t=1}^{T} \ell_t(w_t) - \min_{w^* \in X} \sum_{t=1}^{T} \ell_t(w^*).        (23.1)

All of the problems listed above are instances of this online convex optimization framework. [1] proposed online gradient descent (OGD) with the regret bound $\mathrm{Regret}_T(\mathrm{OGD}) \le DG\sqrt{T}$, where $D$ is the $L_2$ diameter of the convex compact set $X$ and $G$ is the $L_2$ Lipschitz bound on $\mathcal{L}$.

23.3 Online gradient descent

Initialize: $w_1$ is the center of the set $X$. Let $\eta > 0$ be a step-size parameter.

1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in X$;
3     Nature chooses $\ell_t \in \mathcal{L}$;
4     Algorithm experiences loss $\ell_t(w_t)$;
5     $\tilde{w}_{t+1} \leftarrow w_t - \eta \nabla \ell_t(w_t)$;
6     $w_{t+1} \leftarrow \mathrm{Projection}_X(\tilde{w}_{t+1}) := \arg\inf_{w \in X} \|w - \tilde{w}_{t+1}\|_2$;
7 end
Algorithm 23.6: Online gradient descent (OGD)
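Before analyzing its regret, here is a minimal Python sketch of Algorithm 23.6. The decision set, projection routine, loss functions, and constants are placeholder choices of ours (a Euclidean ball for $X$, squared-loss rounds, and $G$ set to 1 in the step size); they are not prescribed by the notes.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}.
    Stands in for Projection_X; a different X needs its own projection."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def ogd(grad_fns, dim, eta, radius):
    """Algorithm 23.6: w_{t+1} = Proj_X(w_t - eta * grad l_t(w_t)).

    grad_fns: list of callables; grad_fns[t](w) returns the gradient of l_t at w.
    Returns the iterates w_1, ..., w_{T+1}.
    """
    w = np.zeros(dim)                       # "center" of the ball X
    iterates = [w]
    for grad in grad_fns:
        w = project_onto_ball(w - eta * grad(w), radius)
        iterates.append(w)
    return iterates

# Example: T rounds of squared loss l_t(w) = (w . x_t - y_t)^2 on synthetic data.
rng = np.random.default_rng(0)
T, dim, radius = 100, 5, 1.0
data = [(rng.normal(size=dim), rng.normal()) for _ in range(T)]
grads = [lambda w, x=x, y=y: 2.0 * (w @ x - y) * x for (x, y) in data]
eta = (2 * radius) / (1.0 * np.sqrt(T))     # eta = D / (G sqrt(T)), with G = 1 for illustration
iterates = ogd(grads, dim, eta, radius)
```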
Theorem 23.1 The regret defined in (23.1) for Algorithm 23.6 is upper bounded by $DG\sqrt{T}$.

Proof: Define the potential function $\phi_t = \frac{1}{2\eta}\|w^* - w_t\|_2^2$. The change in potential at every iteration is

    \phi_{t+1} - \phi_t = \frac{1}{2\eta}\big(\|w^* - w_{t+1}\|_2^2 - \|w^* - w_t\|_2^2\big)                                              (23.2)
                        = \frac{1}{2\eta}\big(\|w^* - \mathrm{Projection}_X(w_t - \eta \nabla \ell_t(w_t))\|_2^2 - \|w^* - w_t\|_2^2\big)  (23.3)
                        \le \frac{1}{2\eta}\big(\|w^* - w_t + \eta \nabla \ell_t(w_t)\|_2^2 - \|w^* - w_t\|_2^2\big)                       (23.4)
                        = \frac{1}{2\eta}\big(\eta^2 \|\nabla \ell_t(w_t)\|_2^2 + 2\eta\, \nabla \ell_t(w_t) \cdot (w^* - w_t)\big)        (23.5)
                        = \nabla \ell_t(w_t) \cdot (w^* - w_t) + \frac{\eta}{2}\|\nabla \ell_t(w_t)\|_2^2                                  (23.6)
                        \le \big(\ell_t(w^*) - \ell_t(w_t)\big) + \frac{\eta}{2}\|\nabla \ell_t(w_t)\|_2^2                                 (23.7)
                        \le \ell_t(w^*) - \ell_t(w_t) + \frac{\eta}{2} G^2.                                                                (23.8)

Here (23.4) is due to the fact that projecting a point onto $X$ can only bring it closer to any point inside $X$ (in particular to $w^*$). Inequality (23.7) is due to the convexity of $\ell_t \in \mathcal{L}$: for any convex function $f$, $\nabla f(x) \cdot (y - x) \le f(y) - f(x)$. Inequality (23.8) follows from the $G$-Lipschitz continuity of $\ell_t \in \mathcal{L}$, which gives $\|\nabla \ell_t(w_t)\|_2 \le G$. Rearranging (23.8) and summing from $t = 1$ to $t = T$, we get

    \mathrm{Regret}_T(\mathrm{OGD}) = \sum_{t=1}^{T} \big(\ell_t(w_t) - \ell_t(w^*)\big) \le \sum_{t=1}^{T} \Big(\phi_t - \phi_{t+1} + \frac{\eta}{2} G^2\Big)   (23.9)
                                    = \phi_1 - \phi_{T+1} + \frac{\eta T}{2} G^2                                                                                 (23.10)
                                    \le \phi_1 + \frac{\eta T}{2} G^2 = \frac{1}{2\eta}\|w^* - w_1\|_2^2 + \frac{\eta T}{2} G^2                                  (23.11)
                                    \le \frac{D^2}{2\eta} + \frac{\eta T}{2} G^2.                                                                                (23.12)

Equation (23.12) holds because $D$ is the diameter of the convex set $X$. Since (23.12) is true for every $\eta > 0$, we choose $\eta = \frac{D}{G\sqrt{T}}$ to get

    \mathrm{Regret}_T(\mathrm{OGD}) \le DG\sqrt{T}.                                                                                                               (23.13)

Note that the Perceptron is OGD with $\ell_t(w_t) = \max(0, -y_t (w_t^\top x_t))$ and (sub)gradient given by

    \nabla \ell_t(w_t) = \begin{cases} 0 & \text{if } y_t (w_t^\top x_t) > 0 \\ -y_t x_t & \text{otherwise.} \end{cases}        (23.14)

23.4 Further generalization

A more general family of algorithms comes from modifying Follow the Regularized Leader (FRL). In FRL the update step is

    w_{t+1} = \arg\min_{w \in X} \sum_{s=1}^{t} \ell_s(w) + \frac{1}{\eta} R(w),        (23.15)

where $R(w)$ is the regularizer. Define a new, linearized variant of FRL with the update step

    w_{t+1} = \arg\min_{w \in X} \sum_{s=1}^{t} \nabla \ell_s(w_s) \cdot w + \frac{1}{\eta} R(w).        (23.16)

Note that OGD is a special case of this (linearized) FRL with $R(w) = \frac{1}{2}\|w\|_2^2$. Similarly, the Exponential Weights Algorithm (EWA) is also an FRL algorithm, with the entropic regularizer $R(w) = \sum_{i=1}^{k} w_i \log(w_i)$.

These observations are relatively recent: the equivalence between OGD, EWA and FRL has only been well understood since around 2008. The upper bound on the regret of FRL is

    \mathrm{Regret}_T(\mathrm{FRL}) \le \frac{R(w^*) - R(w_1)}{\eta} + \sum_{t=1}^{T} D_R(w_t, w_{t+1}),        (23.17)

where $D_R(w_t, w_{t+1})$ is the Bregman divergence of $R$ between $w_t$ and $w_{t+1}$. For more details please refer to the surveys on the course website.
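To make the FRL template concrete, here is a small Python sketch of the FRL update with the entropic regularizer on the linear losses of the expert setting, which recovers exponential weights as noted above. The closed-form softmax solution is the standard one for this regularizer over the simplex; the function name and the tiny example are illustrative choices of ours.

```python
import numpy as np

def ewa_as_frl(loss_history, eta):
    """One FRL update with the entropic regularizer R(w) = sum_i w_i log w_i.

    loss_history: list of loss vectors l_1, ..., l_t from the expert setting
    (for linear losses the gradient of l_s at any point is just l_s).
    Returns w_{t+1} on the simplex, i.e. the exponential-weights distribution.
    """
    cumulative = np.sum(loss_history, axis=0)
    # argmin_{w in simplex} <cumulative, w> + (1/eta) sum_i w_i log w_i
    # has the closed form of a softmax over -eta * cumulative.
    logits = -eta * cumulative
    logits -= logits.max()                  # for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Tiny example: 3 experts, three rounds of losses in [0, 1].
losses = [np.array([0.2, 0.9, 0.5]), np.array([0.1, 0.8, 0.6]), np.array([0.3, 0.7, 0.4])]
print(ewa_as_frl(losses, eta=1.0))          # puts most weight on the first expert
```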
23.5 Online to Batch Conversion

Standard stochastic setting: assume that we have an unknown distribution $\mathcal{D}$ on $X \times Y$, and let $(x_t, y_t) \sim \mathcal{D}$ independently for all $t \in \{1, \ldots, T\}$. We simply run an online learning algorithm on this sequence:

input : $(x_t, y_t) \sim \mathcal{D}$ for all $t \in \{1, \ldots, T\}$
Batch output: $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$
1 for $t = 1, \ldots, T$ do
2     Algorithm chooses $w_t \in X$;
3     Algorithm observes $(x_t, y_t)$;
4     Algorithm experiences loss given by the convex loss function $\ell(w_t \cdot x_t, y_t)$;
5     $w_{t+1} \leftarrow \mathrm{Projection}_X\big(w_t - \eta \nabla \ell(w_t \cdot x_t, y_t)\big)$;
6 end
Algorithm 23.7: Online to batch conversion

Note that we output the average weight $\bar{w}_T$ rather than the last weight $w_T$. This difference arises because the objective is no longer the regret: in the batch problem we want to minimize the risk,

    \mathrm{Risk}(w) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(w \cdot x, y)\big].        (23.18)
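Before stating the guarantee, here is a minimal Python sketch of Algorithm 23.7 under illustrative assumptions of ours (squared loss, a Euclidean-ball decision set, i.i.d. Gaussian data from a hypothetical linear model, and a fixed step size): run projected OGD on the i.i.d. samples and return the averaged iterate $\bar{w}_T$.

```python
import numpy as np

def online_to_batch(samples, eta, radius):
    """Projected OGD on i.i.d. samples (x_t, y_t) with squared loss, returning
    the averaged iterate w_bar = (1/T) sum_t w_t (Algorithm 23.7)."""
    dim = samples[0][0].shape[0]
    w = np.zeros(dim)                               # center of the ball X
    running_sum = np.zeros(dim)
    for x, y in samples:
        running_sum += w                            # accumulate w_1, ..., w_T
        grad = 2.0 * (w @ x - y) * x                # gradient of (w.x - y)^2 in w
        w = w - eta * grad
        norm = np.linalg.norm(w)                    # project back onto X = ball of given radius
        if norm > radius:
            w *= radius / norm
    return running_sum / len(samples)

# Hypothetical i.i.d. data from a noisy linear model.
rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3, 0.2])
samples = []
for _ in range(500):
    x = rng.normal(size=3)
    samples.append((x, float(x @ w_true + 0.1 * rng.normal())))
w_bar = online_to_batch(samples, eta=0.05, radius=1.0)
print(w_bar)                                        # should be close to w_true
```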
Theorem 23.2 The output of the online to batch conversion enjoys

    \mathbb{E}\big[\mathrm{Risk}(\bar{w}_T)\big] \le \mathrm{Risk}(w^*) + \mathbb{E}\Big[\frac{\mathrm{Regret}_T(\mathrm{Alg})}{T}\Big].

Proof: From the definition of the risk of the averaged weights we get

    \mathrm{Risk}(\bar{w}_T) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(\bar{w}_T \cdot x, y)\big] = \mathbb{E}_{(x,y) \sim \mathcal{D}}\Big[\ell\Big(\frac{1}{T}\sum_{t=1}^{T} w_t \cdot x, \, y\Big)\Big]        (23.19)
        \le \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(w_t \cdot x, y)\big]                                                                                                              (23.20)
        = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x_t,y_t) \sim \mathcal{D}}\big[\ell(w_t \cdot x_t, y_t)\big] = \mathbb{E}\Big[\frac{1}{T} \sum_{t=1}^{T} \underbrace{\ell(w_t \cdot x_t, y_t)}_{\ell_t(w_t)}\Big]   (23.21)
        \le \mathbb{E}\Big[\frac{1}{T} \sum_{t=1}^{T} \ell(w^* \cdot x_t, y_t) + \frac{\mathrm{Regret}_T(\mathrm{Alg})}{T}\Big]                                                                                       (23.22)
        = \mathrm{Risk}(w^*) + \mathbb{E}\Big[\frac{\mathrm{Regret}_T(\mathrm{Alg})}{T}\Big].                                                                                                                          (23.23)

Here (23.20) is due to the convexity of the loss function with respect to $w$ (Jensen's inequality). Note that the random variables $(x, y) \sim \mathcal{D}$ are replaced with the identically distributed $(x_t, y_t) \sim \mathcal{D}$ in (23.21). Equation (23.22) uses the definition of $\mathrm{Regret}_T(\mathrm{Alg})$ given in (23.1).

References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.