CS 294-128: Algorithms and Uncertainty Lecture 14 Date: October 17, 2016 Instructor: Nikhil Bansal Scribe: Antares Chen

1 Introduction

In this lecture, we review results regarding follow the regularized leader (FTRL). We then begin to discuss a new online convex optimization algorithm known as mirror descent. First, we build the intuition behind the algorithm by introducing the Bregman divergence. We then discuss the mechanics of the mirror descent algorithm, show a remarkable equivalence with FTRL, and provide an example application. Finally, we relate online mirror descent to Fenchel duality and provide some intuition behind using the Bregman divergence as a distance measure.

2 Review

2.1 Setting

For the past few lectures, we have discussed online convex optimization (OCO). The problem specification is as follows. We are given a decision domain modeled as a convex set $K$ in Euclidean space. At each time step $t$, the player is hit with a convex cost function $f_t : K \to \mathbb{R}$. The player then chooses $x_t$ so as to minimize the regret:

$$\text{regret} = \sum_t f_t(x_t) - \min_{y \in K} \sum_t f_t(y)$$

For the remainder of these notes, we denote $\nabla_t = \nabla f_t(x_t)$ and assume all cost functions are linear. Our regret analysis will also depend on the notion of diameter, which we now define.

Definition 1 The diameter of $K$ with respect to $R$ is given by

$$D_R^2 = \max_{x, y \in K} \{ R(x) - R(y) \}$$

2.2 Follow the regularized leader

Previously, we discussed an online convex optimization algorithm known as follow the regularized leader (FTRL), which was introduced in [5][6]. The analysis of online mirror descent will rely heavily on this topic, so we review the algorithm here. At the $t$-th time step, the next point $x_{t+1}$ is chosen according to the update rule

$$x_{t+1} = \arg\min_{x \in K} \left\{ \eta (\nabla_1 + \cdots + \nabla_t) \cdot x + R(x) \right\}$$

Here, $R(x)$ is a regularization function that is often chosen to be $\alpha$-strongly convex with respect to some norm. An analysis based on the be-the-leader (BTL) regime [2] yielded the following regret bound.
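Before stating the bound, the update rule is worth seeing concretely. The sketch below (illustrative, not from the lecture) runs FTRL with the specific choice $R(x) = \frac{1}{2}\|x\|_2^2$ and $K$ the Euclidean unit ball; for this regularizer the argmin has a closed form, namely the Euclidean projection of $-\eta(\nabla_1 + \cdots + \nabla_t)$ onto $K$. The function name `ftrl_step` and the toy loss sequence are choices made here.

```python
import numpy as np

def ftrl_step(grad_sum, eta):
    """FTRL update for R(x) = 0.5*||x||_2^2 over the Euclidean unit ball.

    argmin_{||x|| <= 1} eta*<grad_sum, x> + 0.5*||x||^2 is the point
    -eta*grad_sum, projected back onto the ball if it falls outside.
    """
    x = -eta * grad_sum
    return x / max(np.linalg.norm(x), 1.0)

# Toy run with linear costs f_t(x) = <g_t, x>.
rng = np.random.default_rng(0)
eta, grad_sum = 0.1, np.zeros(3)
for _ in range(50):
    g = rng.uniform(-1.0, 1.0, size=3)   # adversary reveals a gradient
    grad_sum += g
    x_next = ftrl_step(grad_sum, eta)    # next prediction, always in K
```

Note that FTRL never leaves $K$: the quadratic regularizer plus the ball constraint amounts to lazy gradient descent with projection, a preview of the OMD picture below.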
Theorem 2 Let $\|\cdot\|_*$ denote the dual norm of $\|\cdot\|$, and suppose $R(x)$ is $\alpha$-strongly convex with respect to $\|\cdot\|$. Then for every $y \in K$, the regret of FTRL is bounded as follows:

$$\text{regret} \le \frac{2\eta}{\alpha} \sum_t \|\nabla_t\|_*^2 + \frac{R(y) - R(x_1)}{\eta}$$

3 Online Mirror Descent

We now introduce online mirror descent (OMD), an online variant of Nemirovski and Yudin's mirror descent algorithm [4]. First discussed in [7], OMD is very similar to online gradient descent in that the algorithm computes the current decision iteratively from a gradient update rule and the previous decision. However, the power behind OMD lies in the update being carried out in a dual space, defined by our choice of regularizer. This follows from viewing $\nabla R$ as a mapping from $\mathbb{R}^n$ onto itself. By carrying out the update in this space, we take advantage of a rich geometry defined only in the dual. Indeed, this view has led to discoveries showing that many algorithms are special cases of online mirror descent [3][9]. More recently, it has been shown that online mirror descent not only applies to a general class of online convex optimization problems, but does so with optimal regret bounds [8].

3.1 The algorithm

Online mirror descent relies on the Bregman divergence.

Definition 3 The Bregman divergence between $x$ and $y$ with respect to the function $R$, denoted $B_R(x \| y)$, is given by

$$B_R(x \| y) = R(x) - R(y) - \nabla R(y) \cdot (x - y)$$

This immediately gives the notion of the Bregman projection of $y$ onto a convex set $K$:

$$\arg\min_{x \in K} B_R(x \| y)$$

We are now ready to discuss online mirror descent. The algorithm takes as input a learning rate $\eta > 0$ and a regularization function $R(x)$. Graphically, the algorithm runs as follows.
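Before moving on to the pseudocode, Definition 3 itself is easy to evaluate numerically. The sketch below (the helper name `bregman` is a choice made here, not from the lecture) computes the divergence for two classical regularizers: the quadratic $R(x) = \frac{1}{2}\|x\|^2$, which recovers half the squared Euclidean distance, and negative entropy, which on the simplex recovers the KL divergence.

```python
import numpy as np

def bregman(R, gradR, x, y):
    """B_R(x||y) = R(x) - R(y) - <gradR(y), x - y>, as in Definition 3."""
    return R(x) - R(y) - gradR(y) @ (x - y)

# Quadratic regularizer: Bregman divergence = half the squared distance.
sq = lambda v: 0.5 * v @ v
d_euclid = bregman(sq, lambda v: v, np.array([1.0, 2.0]), np.array([0.0, 0.0]))

# Negative entropy on the simplex: the divergence reduces to KL, since the
# leftover -x(i) + y(i) terms cancel when both vectors sum to 1.
negent = lambda v: np.sum(v * np.log(v))
p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
d_kl = bregman(negent, lambda v: np.log(v) + 1.0, p, q)
```

Both values agree with the closed forms: `d_euclid` equals $\frac{1}{2}\|x - y\|^2 = 2.5$, and `d_kl` equals $\sum_i p(i) \log(p(i)/q(i))$.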
The pseudocode is provided below.

Algorithm 1 Online mirror descent
1: Initialize $y_1$ such that $\nabla R(y_1) = 0$ and $x_1 = \arg\min_{x \in K} B_R(x \| y_1)$
2: for $t = 1, \ldots, T$ do
3:   Play $x_t$ and receive cost function $f_t$
4:   Update $y_t$ according to the rule $\nabla R(y_{t+1}) = \nabla R(y_t) - \eta \nabla_t$
5:   Bregman project back to $K$: $x_{t+1} = \arg\min_{x \in K} B_R(x \| y_{t+1})$
6: end for

In terms of implementation, $y_{t+1}$ may be recovered by applying the inverse gradient mapping $(\nabla R)^{-1}$. In general, if $R$ is $\alpha$-strongly convex, then $\nabla R$ is a bijective mapping.

3.2 Regret analysis

Hazan and Kale [1] provided an extraordinary result equating FTRL with OMD. This theorem, which we prove below, will later allow us to bootstrap Theorem 2 and obtain regret bounds for online mirror descent.

Theorem 4 Given that $R$ is $\alpha$-strongly convex, the lazy OMD and FTRL algorithms produce equivalent predictions:

$$\arg\min_{x \in K} B_R(x \| y_{t+1}) = \arg\min_{x \in K} \left\{ \eta \sum_{s=1}^t \nabla_s \cdot x + R(x) \right\}$$

Proof: Observe that in lazy OMD, $y_{t+1}$ is updated according to the constraint $\nabla R(y_{t+1}) = \nabla R(y_t) - \eta \nabla_t$. Unrolling this recurrence and using $\nabla R(y_1) = 0$ gives

$$y_{t+1} = (\nabla R)^{-1}\big( \nabla R(y_t) - \eta \nabla_t \big) = (\nabla R)^{-1}\big( \nabla R(y_{t-1}) - \eta \nabla_{t-1} - \eta \nabla_t \big) = \cdots = (\nabla R)^{-1}\Big( -\eta \sum_{s=1}^t \nabla_s \Big)$$

First consider the case where $y_{t+1} \in K$, so the projection is the identity, and in the OMD regime we have $x_{t+1} = y_{t+1}$. Denote the FTRL objective by $\Phi_t(x) = \eta \sum_{s=1}^t \nabla_s \cdot x + R(x)$. Taking the gradient gives

$$\nabla \Phi_t(x) = \eta \sum_{s=1}^t \nabla_s + \nabla R(x)$$
However, at the minimizer of the FTRL objective we must have $\nabla \Phi_t = 0$, so

$$\nabla R(x) = -\eta \sum_{s=1}^t \nabla_s \quad \Longrightarrow \quad x = (\nabla R)^{-1}\Big( -\eta \sum_{s=1}^t \nabla_s \Big)$$

which is exactly $y_{t+1}$. Now if $y_{t+1} \notin K$, we must Bregman project back onto $K$. The projection is given by Definition 3, but since we minimize with respect to $x$, terms independent of this variable can be eliminated, giving

$$\arg\min_{x \in K} B_R(x \| y_{t+1}) = \arg\min_{x \in K} \left\{ R(x) - R(y_{t+1}) - \nabla R(y_{t+1}) \cdot (x - y_{t+1}) \right\}$$
$$= \arg\min_{x \in K} \left\{ R(x) - \nabla R(y_{t+1}) \cdot x \right\}$$
$$= \arg\min_{x \in K} \left\{ R(x) + \eta \sum_{s=1}^t \nabla_s \cdot x \right\}$$

In all cases, the updates for OMD and FTRL are equivalent. Thus the theorem holds. □

4 Experts From Online Mirror Descent

As stated previously, many algorithms arise as special cases of online mirror descent. We now showcase the results of [3]. Recall the setup for the experts problem. At time $t$, a probability distribution $p_t$ is maintained over $n$ experts and a loss vector $l_t$ is revealed. Our goal is to incur total loss close to that of the best single expert over the $T$ time steps.

4.1 Exponentiated gradient algorithm

Let $x(i)$ be the $i$-th component of $x$ and let our regularization function be the negative entropy function $R(x) = \sum_i x(i) \log x(i)$. We then have $\nabla R(x) = (\log x(i) + 1)_i$. From the OMD algorithm, our update rule for $y_{t+1}$ is then the following:

$$\nabla R(y_{t+1}) = \nabla R(y_t) - \eta \nabla_t$$
$$\log y_{t+1}(i) + 1 = \log y_t(i) + 1 - \eta \nabla_t(i)$$
$$\log y_{t+1}(i) = \log y_t(i) - \eta \nabla_t(i)$$
$$y_{t+1}(i) = y_t(i) \, e^{-\eta \nabla_t(i)}$$

Recall that in the expert setting, our convex set $K$ is simply the $n$-dimensional simplex $\Delta_n = \{ x \in \mathbb{R}^n_{\ge 0} : \sum_i x(i) = 1 \}$. We make two critical observations.

• By Theorem 6, the Bregman divergence with respect to the negative entropy function becomes relative entropy, also known as the Kullback-Leibler (KL) divergence.
• By Theorem 7, the Bregman projection with respect to the negative entropy function becomes scaling by the $\ell_1$-norm.

We have fully defined a special case of the OMD update regime called the exponentiated gradient algorithm.

Algorithm 2 Exponentiated gradient
1: Initialize $y_1 = \mathbf{1}$ and $x_1 = y_1 / \|y_1\|_1$
2: for $t = 1, \ldots, T$ do
3:   Play $x_t$ and receive cost function $f_t$
4:   Update $y_t$ according to the rule $y_{t+1}(i) = y_t(i) \, e^{-\eta \nabla_t(i)}$
5:   Bregman project back to $K$: $x_{t+1} = y_{t+1} / \|y_{t+1}\|_1$
6: end for

Previously, we derived a multiplicative weight update method for expert learning and proved regret bounds using a potential function argument. Here, however, the algorithm falls directly out of OMD as a special case!

4.2 Regret analysis

We have demonstrated that OMD is equivalent to FTRL, so we may bootstrap Theorem 2 to bound the regret of exponentiated gradient.

Theorem 5 Suppose all expert costs are 0-1 bounded: $l_t(i) \in [0, 1]$. Then the regret of the exponentiated gradient algorithm satisfies

$$\text{regret} \le O(\sqrt{T \log n})$$

Proof: First, substitute the diameter for $R(y) - R(x_1)$. By Theorem 2, we have

$$\text{regret} \le \frac{2\eta}{\alpha} \sum_t \|\nabla_t\|_*^2 + \frac{D_R^2}{\eta}$$

Differentiating with respect to $\eta$ and minimizing the above expression yields

$$\eta = \sqrt{\frac{\alpha D_R^2}{2 \sum_t \|\nabla_t\|_*^2}} \qquad \text{regret} \le 2 D_R \sqrt{\frac{2}{\alpha} \sum_t \|\nabla_t\|_*^2}$$

Observe that if all expert costs lie in the range $[0, 1]$, then the cost gradient must be bounded in the following manner:

$$\|\nabla_t\|_\infty = \|l_t\|_\infty \le 1$$
By Pinsker's inequality (Theorem 8), the negative entropy function is strongly convex with respect to the $\ell_1$-norm; specifically, it is $\alpha$-strongly convex with $\alpha = \frac{1}{2 \ln 2}$. Moreover, the dual of the $\ell_1$-norm is the $\ell_\infty$-norm, which follows from the generalized Cauchy-Schwarz inequality. Using Jensen's inequality, one may show that $D_R^2 \le \log n$ on the simplex $\Delta_n$. Our regret is now

$$\text{regret} \le 2 D_R \sqrt{\frac{2}{\alpha} \sum_t \|\nabla_t\|_\infty^2} \le 2 \sqrt{(4 \ln 2) \, T \log n} = O(\sqrt{T \log n})$$

This completes the analysis. □

References

[1] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In The 21st Annual Conference on Learning Theory (COLT), pages 57–68, 2008.

[2] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[3] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, 1997.

[4] A. Nemirovski and D. Yudin. On Cesaro's convergence of the gradient descent method for finding saddle points of convex-concave functions. Doklady Akademii Nauk SSSR, 239(4), 1978.

[5] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

[6] S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.

[7] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Advances in Neural Information Processing Systems, 19:1265, 2007.

[8] N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pages 2645–2653, 2011.

[9] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
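As a concrete companion to Algorithm 2, here is a minimal sketch of the exponentiated gradient update (the loss sequence, the value of $\eta$, and the function name are illustrative choices made here, not from the lecture).

```python
import numpy as np

def exponentiated_gradient(losses, eta):
    """Run Algorithm 2 on a (T, n) array of linear expert losses.

    y_{t+1}(i) = y_t(i) * exp(-eta * l_t(i)), followed by the Bregman
    projection onto the simplex, which (by Theorem 7) is l1-scaling.
    Returns the (T, n) array of distributions played.
    """
    T, n = losses.shape
    y = np.ones(n)
    plays = np.empty((T, n))
    for t in range(T):
        plays[t] = y / np.sum(y)          # x_t
        y = y * np.exp(-eta * losses[t])  # multiplicative update
    return plays

# Toy instance: expert 0 always incurs loss 0, expert 1 always incurs loss 1.
losses = np.tile([0.0, 1.0], (100, 1))
plays = exponentiated_gradient(losses, eta=0.5)
```

On this instance the weight on the better expert grows monotonically toward 1, which is exactly the familiar multiplicative-weights behavior.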
A The Negative Entropy Function

In this section we provide calculations establishing the properties relevant to using negative entropy as the regularizer.

Theorem 6 Let $R(x) = \sum_i x(i) \log x(i)$. We have the following:

$$B_R(x \| y) = \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big)$$

Proof: The calculation follows from the definition. Note that $\nabla R(x) = (\log x(i) + 1)_i$.

$$B_R(x \| y) = R(x) - R(y) - \nabla R(y) \cdot (x - y)$$
$$= \sum_i x(i) \log x(i) - \sum_i y(i) \log y(i) - \sum_i (\log y(i) + 1)\big( x(i) - y(i) \big)$$
$$= \sum_i x(i) \log x(i) - \sum_i x(i) \log y(i) - \sum_i x(i) + \sum_i y(i)$$
$$= \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big)$$

The theorem holds. □ Noticeably, when $x$ and $y$ are probability distributions the $-x(i) + y(i)$ terms cancel, so the Bregman divergence of negative entropy is simply the KL-divergence. Given this formulation we prove the following.

Theorem 7 Let $R(x) = \sum_i x(i) \log x(i)$. Then $B_R(x \| y)$ subject to $x \in \Delta_n$ is minimized at the point

$$x^* = \frac{y}{\|y\|_1}$$

Proof: We wish to minimize the following expression with respect to $x$ subject to $\sum_i x(i) = 1$:

$$x^* = \arg\min_{x \in \Delta_n} \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big)$$

This is easily done using Lagrange multipliers. Let $F$ be defined as

$$F(x, \lambda) = \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big) + \lambda \Big( \sum_i x(i) - 1 \Big)$$

Setting $\partial F / \partial x(i) = \log x(i) - \log y(i) + \lambda = 0$ gives

$$x(i) = y(i) \, e^{-\lambda} \qquad \lambda = \ln \sum_i y(i)$$

where the value of $\lambda$ follows from the normalization $\sum_i x(i) = 1$. Substituting in gives the theorem. □ This yields the interpretation that the Bregman projection with respect to negative entropy onto the $n$-dimensional simplex is simply scaling by the $\ell_1$-norm.
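Theorem 7 can be sanity-checked numerically: a brute-force search over the simplex for the minimizer of $B_R(x \| y)$ should land on $y / \|y\|_1$. A small sketch (the test vector $y$ and the grid resolution are arbitrary choices made here):

```python
import numpy as np

def bregman_negent(x, y):
    """B_R(x||y) for R(x) = sum_i x(i) log x(i), per Theorem 6."""
    return np.sum(x * np.log(x / y) - x + y)

y = np.array([2.0, 1.0, 1.0])
claimed = y / np.sum(y)                  # Theorem 7: [0.5, 0.25, 0.25]

# Brute-force search over the 3-point simplex.
best_x, best_val = None, np.inf
grid = np.linspace(1e-3, 1.0, 200)
for a in grid:
    for b in grid:
        c = 1.0 - a - b
        if c <= 0.0:
            continue
        x = np.array([a, b, c])
        val = bregman_negent(x, y)
        if val < best_val:
            best_x, best_val = x, val
```

The grid minimizer matches the closed form up to the grid resolution, and the closed-form point attains a divergence no worse than any grid point, as the theorem predicts.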
B Pinsker's Inequality

In this section, we prove Pinsker's inequality, which gives us the fact that negative entropy is $\alpha$-strongly convex with respect to the $\ell_1$-norm with $\alpha = \frac{1}{2 \ln 2}$.

Theorem 8 Let $P$ and $Q$ be two distributions defined on the sample space $\Omega$. Then the following holds:

$$D_{KL}(P \| Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2$$

Proof: We first show the theorem holds in the case where $P$ and $Q$ are Bernoulli distributions. Let $p, q \in [0, 1]$ and let $P, Q$ be given by

$$P = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1 - p \end{cases} \qquad Q = \begin{cases} 1 & \text{w.p. } q \\ 0 & \text{w.p. } 1 - q \end{cases}$$

Without loss of generality, let $p \ge q$ and define $f$ as follows, using $\|P - Q\|_1 = 2|p - q|$:

$$f(p, q) = D_{KL}(P \| Q) - \frac{1}{2 \ln 2} \|P - Q\|_1^2 = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - \frac{4 (p - q)^2}{2 \ln 2}$$

Observe that $f(p, q) = 0$ when $p = q$. Furthermore, the following holds:

$$\frac{\partial f}{\partial q} = \frac{p - q}{\ln 2} \Big( 4 - \frac{1}{q(1 - q)} \Big) \le 0$$

since $q(1 - q) \le \frac{1}{4}$ and $p \ge q$. Decreasing $q$ from $p$ therefore only increases $f$, so $f(p, q) \ge f(p, p) = 0$ whenever $q \le p$. We conclude that $D_{KL}(P \| Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2$ in the Bernoulli case.

Now consider the case where $P$ and $Q$ are distributed arbitrarily on $\Omega$. Let $A \subseteq \Omega$ be $A = \{ x : P(x) \ge Q(x) \}$ and define the following Bernoulli random variables:

$$P_A = \begin{cases} 1 & \text{w.p. } \sum_{x \in A} P(x) \\ 0 & \text{w.p. } \sum_{x \notin A} P(x) \end{cases} \qquad Q_A = \begin{cases} 1 & \text{w.p. } \sum_{x \in A} Q(x) \\ 0 & \text{w.p. } \sum_{x \notin A} Q(x) \end{cases}$$

We then have the following:

$$\|P - Q\|_1 = \sum_{x \in \Omega} |P(x) - Q(x)| = \sum_{x \in A} \big( P(x) - Q(x) \big) + \sum_{x \notin A} \big( Q(x) - P(x) \big) = \|P_A - Q_A\|_1$$

Now define the random variable $Z$ by $Z(x) = 1$ if $x \in A$ and $Z(x) = 0$ otherwise. By the chain rule for KL-divergence, $D_{KL}(P \| Q) = D_{KL}(P(Z) \| Q(Z)) + D_{KL}(P \| Q \mid Z)$. Since $D_{KL}(P(Z) \| Q(Z)) = D_{KL}(P_A \| Q_A)$ and $D_{KL}(P \| Q \mid Z) \ge 0$, we must have

$$D_{KL}(P \| Q) \ge D_{KL}(P_A \| Q_A) \ge \frac{1}{2 \ln 2} \|P_A - Q_A\|_1^2 = \frac{1}{2 \ln 2} \|P - Q\|_1^2$$

This completes the proof. □
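As a closing sanity check (illustrative, not part of the lecture), Theorem 8 can be spot-checked on random distributions. The KL divergence below is measured in bits (log base 2), matching the convention that produces the $\frac{1}{2 \ln 2}$ constant.

```python
import numpy as np

def kl_bits(p, q):
    """KL divergence in bits (log base 2)."""
    return np.sum(p * np.log2(p / q))

def pinsker_rhs(p, q):
    """Lower bound from Theorem 8: ||p - q||_1^2 / (2 ln 2)."""
    return np.sum(np.abs(p - q)) ** 2 / (2.0 * np.log(2.0))

rng = np.random.default_rng(1)
gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))   # random distributions, strictly positive
    q = rng.dirichlet(np.ones(4))
    gaps.append(kl_bits(p, q) - pinsker_rhs(p, q))
min_gap = min(gaps)                  # never (meaningfully) negative
```

The smallest gap over the random trials stays nonnegative up to floating-point error, as the inequality requires.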