Efficient Bregman Projections onto the Simplex
2015 IEEE 54th Annual Conference on Decision and Control (CDC), December 15-18, 2015. Osaka, Japan

Efficient Bregman Projections onto the Simplex

Walid Krichene, Syrine Krichene, Alexandre Bayen

Abstract— We consider the problem of projecting a vector onto the simplex Δ_d = {x ∈ R^d_+ : Σ_{i=1}^d x_i = 1}, using a Bregman projection. This is a common problem in first-order methods for convex optimization and online-learning algorithms, such as mirror descent. We derive the KKT conditions of the projection problem, and show that for Bregman divergences induced by ω-potentials, one can efficiently compute the solution using a bisection method. More precisely, an ε-approximate projection can be obtained in O(d log(1/ε)). We also consider a class of exponential potentials for which the exact solution can be computed efficiently, and give a O(d log d) deterministic algorithm and O(d) randomized algorithm to compute the projection. In particular, we show that one can generalize the KL divergence to a Bregman divergence which is bounded on the simplex (unlike the KL divergence), strongly convex with respect to the ℓ1 norm, and for which one can still solve the projection in expected linear time.

I. INTRODUCTION

Many first-order methods for convex optimization and online learning can be formulated as iterative projections of a vector on a feasible set. Consider for example the constrained convex problem, minimize_{x ∈ X} f(x), where X is a convex set and f : X → R is convex. This problem can be solved using the mirror descent algorithm, a first-order method proposed by Nemirovski and Yudin in [21] (see also [4]), which generalizes the projected gradient descent method by replacing the Euclidean projection step with a generalized Bregman projection. This method can be summarized in Algorithm 1.

Algorithm 1 Mirror descent method with learning rates (η_τ) and Bregman divergence D_ψ.
1: for τ ∈ N do
2:   Query a sub-gradient vector g^(τ) ∈ ∂f(x^(τ))
3:   Update
       x^(τ+1) = arg min_{x ∈ X} D_ψ(x, (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ)))   (1)
4: end for

Here, D_ψ is the Bregman divergence induced by a distance generating function ψ.
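For concreteness, here is a minimal sketch (ours, in Python with NumPy; the function name is an assumption, not from the paper) of one step of Algorithm 1 in the special case where D_ψ is the KL divergence, for which the update has the closed form of the exponentiated gradient / multiplicative weights update discussed below.

```python
import numpy as np

def mirror_descent_step_kl(x, g, eta):
    """One mirror descent step with the KL divergence.

    With psi the negative entropy, the update of Algorithm 1 reduces to
    the exponentiated gradient update: x_i <- x_i * exp(-eta * g_i),
    renormalized so the iterate stays on the simplex.
    """
    w = x * np.exp(-eta * g)
    return w / w.sum()

# A few steps on the linear cost f(x) = <c, x>, whose subgradient is c:
c = np.array([1.0, 0.5, 0.0, 2.0])
x = np.ones(4) / 4            # start from the uniform distribution
for _ in range(100):
    x = mirror_descent_step_kl(x, c, eta=0.5)
# the iterate concentrates on the coordinate with the smallest cost
```

On this linear cost, the mass of the iterate concentrates on the minimizing coordinate (here, the third), which matches the intuition of the multiplicative weights method.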
Walid Krichene is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA. walid@eecs.berkeley.edu
Syrine Krichene is with the ENSIMAG school of Computer Sciences and Applied Mathematics of Grenoble, France. syrine.krichene@ensimag.grenoble-inp.fr
Alexandre Bayen is with the Department of Electrical Engineering and Computer Sciences, and the Department of Civil and Environmental Engineering, University of California, Berkeley, USA. bayen@berkeley.edu

The definition and properties of Bregman divergences will be reviewed in Section II. Some important instances of the mirror descent method include projected gradient descent, obtained by taking the Bregman divergence to be the squared Euclidean distance, and exponentiated gradient descent [18] (also called the Hedge algorithm or multiplicative weights algorithm [1]), obtained by taking the Bregman divergence to be the KL divergence.

In this article, we focus specifically on simplex-constrained convex problems. That is, we suppose that X is the simplex Δ_d = {x ∈ R^d_+ : Σ_{i=1}^d x_i = 1}, or more generally, a product of scaled simplexes, X = α_1 Δ_{d_1} × ⋯ × α_K Δ_{d_K}. Simplex-constrained problems include nonparametric statistical estimation, see for example Section 7.2 in [8], multi-commodity flow problems, see [10], tomography image reconstruction [5], and learning dynamics in repeated games [20]. Other variants of the mirror descent method have been studied as well, such as stochastic mirror descent [17], [19].

Besides its applications to convex optimization, simplex-constrained mirror descent plays an important role in online learning problems [9], in which a decision maker chooses, at each iteration τ, a distribution x^(τ) over a finite action set A with |A| = d. Then, a bounded loss vector ℓ^(τ) ∈ [0, 1]^d is revealed, and the decision maker incurs the expected loss ⟨ℓ^(τ), x^(τ)⟩ = Σ_{i=1}^d x_i^(τ) ℓ_i^(τ). This sequential decision problem is also called prediction with expert advice [11], and has a long history which dates back to Hannan [15] and Blackwell [6], who studied this problem in the context of repeated games.
In (adversarial) online learning problems, one seeks to design an algorithm which has a guarantee on the worst-case regret, defined as follows: if the algorithm is presented with a sequence of losses (ℓ^(τ))_{τ≤T}, and it generates a sequence of decisions (x^(τ))_{τ≤T}, then the cumulative regret of the algorithm up to iteration T is

  R((ℓ^(τ))_{0≤τ≤T}) = Σ_{τ=1}^T ⟨ℓ^(τ), x^(τ)⟩ − min_{x∈Δ} Σ_{τ=1}^T ⟨ℓ^(τ), x⟩,

and the worst-case regret is the maximum such regret over admissible sequences of losses, max_{(ℓ^(τ))_{0≤τ≤T}} R((ℓ^(τ))_{0≤τ≤T}). An algorithm is said to have sublinear regret if its worst-case regret grows sub-linearly in T, that is,

  lim sup_{T→∞} max_{(ℓ^(τ))_{0≤τ≤T}} R((ℓ^(τ))_{0≤τ≤T}) / T ≤ 0.

The online mirror descent method, obtained simply by replacing the subgradient vector g^(τ) in Algorithm 1 with the loss vector ℓ^(τ), defines a large class of online learning algorithms with sub-linear regret, see for example the survey of Bubeck and Cesa-Bianchi in [9]. The online
mirror descent method is summarized in Algorithm 2.

Algorithm 2 Online mirror descent method with learning rates (η_τ) and Bregman divergence D_ψ.
1: for τ ∈ N do
2:   Play action a^(τ) ∼ x^(τ)
3:   Discover loss vector ℓ^(τ) ∈ [0, 1]^d
4:   Incur expected loss ⟨ℓ^(τ), x^(τ)⟩
5:   Update
       x^(τ+1) = arg min_{x ∈ X} D_ψ(x, (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ ℓ^(τ)))   (2)
6: end for

Online mirror descent, and its stochastic variant, have been applied to several problems including multi-armed bandits [9], [2], machine learning [12] and repeated games [19], to cite a few. In all the variants of simplex-constrained mirror descent, one needs to solve, at each iteration τ, the Bregman projection step given in equation (1) or (2). Some instances of Bregman projections are known to have an exact solution which can be computed efficiently. For example, the solution of the KL divergence projection on the simplex is given by the exponential weights update [1], [3], and the Euclidean projection on the simplex can be computed efficiently either by sorting and thresholding in O(d log d), or by using a randomized pivot method in O(d), see [13].

In this article, we start by deriving the KKT conditions of the Bregman projection problem in Section II, then consider, in Section III, a general class of Bregman divergences, induced by ω-potentials, as defined by Audibert et al. [2]. We show that for this class, the solution can be approximated efficiently: an ε-approximate solution can be computed in O(d log(1/ε)) operations. In Section IV, we consider a class of exponential potentials, and study the resulting Bregman projection, a generalization of the KL-divergence projection. We show that for this class, the exact solution can be computed using a deterministic algorithm with O(d log d) complexity, or a randomized algorithm with expected linear complexity. We also study the properties of the resulting Bregman divergence. In particular, we emphasize a tradeoff between strong convexity and boundedness, two properties which affect the convergence rates of the mirror descent method. II.
BREGMAN PROJECTION AND OPTIMALITY CONDITIONS

Let ψ : X̄ → R be a convex function defined on a convex set X̄, and let X be the subset of X̄ on which ψ is differentiable. Let ∇ψ : X → R^d be the gradient of ψ, and R its range. The Bregman divergence induced by ψ is defined as follows:

  D_ψ : X̄ × X → R_+
  (x, y) ↦ D_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩.   (3)

By convexity of ψ, the Bregman divergence is non-negative, and x ↦ D_ψ(x, y) is convex. We will refer to ψ as the distance-generating function. We say that ψ is ℓ_ψ-strongly convex with respect to a reference norm ‖·‖ if

  D_ψ(x, y) ≥ (ℓ_ψ/2) ‖x − y‖²  for all (x, y) ∈ X̄ × X.

In order for the Bregman projection (1) to be well-defined, the gradient vector (or loss vector) at iteration τ must satisfy the following consistency condition:

  ∇ψ(x^(τ)) − η_τ g^(τ) ∈ R.   (4)

A. Interpretations of the Bregman projection

The Bregman projection, given in equation (1), can be interpreted as projecting on X the vector (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ)), obtained by mapping the current iterate x^(τ) to the set R through ∇ψ, taking a step in the opposite direction of the gradient, then mapping the new vector back through (∇ψ)^{-1}, see Nemirovski and Yudin [21].

A second interpretation can be obtained, as observed by Beck and Teboulle [3], by rewriting the objective function as follows: denoting the vector (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ)) by x̃^(τ), we have by definition of D_ψ

  x^(τ+1) = arg min_{x ∈ X} D_ψ(x, x̃^(τ))
          = arg min_{x ∈ X} ψ(x) − ψ(x̃^(τ)) − ⟨∇ψ(x̃^(τ)), x − x̃^(τ)⟩
          = arg min_{x ∈ X} ψ(x) − ⟨∇ψ(x^(τ)) − η_τ g^(τ), x⟩,

which is equivalent to minimizing

  x^(τ+1) = arg min_{x ∈ X} η_τ (f(x^(τ)) + ⟨g^(τ), x − x^(τ)⟩) + D_ψ(x, x^(τ)),

which can be interpreted as follows: the first term f(x^(τ)) + ⟨g^(τ), x − x^(τ)⟩ is the linear approximation of f around the current iterate x^(τ), and the second term D_ψ(x, x^(τ)) is a non-negative function which penalizes deviations from x^(τ). The step size (or learning rate) η_τ controls the relative weight of both terms.

B. Simplex-constrained Bregman projection

In the remainder of the paper, we will assume, to simplify the discussion, that the feasible set is the simplex Δ_d = {x ∈ R^d_+ : Σ_{i=1}^d x_i = 1}.
We observe that all the results can be readily extended to the case in which X is a product of scaled simplexes, as follows: suppose X = α_1 Δ_{d_1} × ⋯ × α_K Δ_{d_K}, with α_k > 0, and let ψ_k be a distance generating function on Δ_{d_k}. Then consider the function

  ψ : α_1 Δ_{d_1} × ⋯ × α_K Δ_{d_K} → R
  (α_1 x_1, ..., α_K x_K) ↦ Σ_{k=1}^K α_k ψ_k(x_k).

The gradient of ψ is simply

  ∇ψ : α_1 Δ_{d_1} × ⋯ × α_K Δ_{d_K} → R_1 × ⋯ × R_K
  (α_1 x_1, ..., α_K x_K) ↦ (∇ψ_1(x_1), ..., ∇ψ_K(x_K)),

and its inverse is given by
  (∇ψ)^{-1} : R_1 × ⋯ × R_K → α_1 Δ_{d_1} × ⋯ × α_K Δ_{d_K}
  (y_1, ..., y_K) ↦ (α_1 (∇ψ_1)^{-1}(y_1), ..., α_K (∇ψ_K)^{-1}(y_K)).

Finally, the Bregman divergence decomposes as follows:

  D_ψ((α_k x_k)_k, (α_k y_k)_k) = Σ_k α_k ψ_k(x_k) − Σ_k α_k ψ_k(y_k) − Σ_k α_k ⟨∇ψ_k(y_k), x_k − y_k⟩
                               = Σ_k α_k D_{ψ_k}(x_k, y_k).

Therefore, the projection on X with Bregman divergence D_ψ can be decomposed into K projections on Δ_{d_k} with Bregman divergence D_{ψ_k}, as follows:

  arg min_{x_k ∈ Δ_{d_k}} D_ψ(x, (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ)))
    = arg min_{x_k ∈ Δ_{d_k}} α_k D_{ψ_k}(x_k, (∇ψ_k)^{-1}(∇ψ_k(x_k^(τ)) − η_τ g_k^(τ))),

assuming the consistency condition holds for each k.

Example 1 (Euclidean projection): Consider the function ψ(x) = ½‖x‖₂². Then ∇ψ(x) = x, and the Bregman divergence is simply D_ψ(x, y) = ½‖x − y‖₂². As a consequence, the Bregman projection step reduces to

  arg min_{x ∈ Δ} D_ψ(x, (∇ψ)^{-1}(∇ψ(x^(τ)) − η_τ g^(τ))) = arg min_{x ∈ Δ} ½ ‖x − (x^(τ) − η_τ g^(τ))‖₂²,

which corresponds to a projected gradient descent update, with step size η_τ.

C. Optimality conditions

We now derive the KKT conditions for the Bregman projection problem given by

  minimize_{x ∈ R^d}  D_ψ(x, (∇ψ)^{-1}(∇ψ(x̄) − ḡ))
  subject to  x ∈ Δ_d   (5)

where x̄ ∈ Δ_d and ḡ ∈ R^d are given. Note that we combine η_τ g^(τ) into a single vector ḡ, to simplify notation. By strong convexity, the solution is unique.

Proposition 1: Consider the Bregman projection problem (5). Then x* ∈ R^d is optimal if and only if there exist λ ∈ R^d_+ and ν ∈ R such that

  x* = (∇ψ)^{-1}(∇ψ(x̄) − ḡ + λ + ν·1),
  Σ_{i=1}^d x*_i = 1,
  ∀i, x*_i ≥ 0, λ_i x*_i = 0,

where ν·1 is the vector whose entries are all equal to ν.

Proof: Define the Lagrangian, for x ∈ R^d, λ ∈ R^d_+, and ν ∈ R,

  L(x, λ, ν) = D_ψ(x, (∇ψ)^{-1}(∇ψ(x̄) − ḡ)) − ⟨λ, x⟩ + ν(Σ_{i=1}^d x_i − 1).   (6)

For all x, y ∈ X, the gradient of the Bregman divergence is given by ∇_x D_ψ(x, y) = ∇ψ(x) − ∇ψ(y). Thus the gradient of L is given by

  ∇_x L(x, λ, ν) = ∇ψ(x) − ∇ψ(x̄) + ḡ − λ − ν·1.

Writing the KKT conditions of problem (5), we have that (x*, λ*, ν*) is optimal if and only if

  ∇ψ(x*) − ∇ψ(x̄) + ḡ − λ* − ν*·1 = 0,
  Σ_i x*_i = 1,
  ∀i, x*_i ≥ 0, λ*_i ≥ 0, λ*_i x*_i = 0,

and the first equation can be rearranged as x* = (∇ψ)^{-1}(∇ψ(x̄) − ḡ + λ* + ν*·1), which proves the claim.
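As an illustration of Example 1, here is a sketch (ours, in Python; the function name is an assumption) of the standard sort-and-threshold method for the Euclidean projection onto the simplex. The threshold θ it computes plays the role of the dual variable ν in Proposition 1.

```python
import numpy as np

def euclidean_projection_simplex(v):
    """Project v onto the simplex by sorting and thresholding, O(d log d).

    Finds theta such that x = max(v - theta, 0) sums to 1; theta plays
    the role of the dual variable nu in the KKT conditions.
    """
    d = v.size
    u = np.sort(v)[::-1]                    # sort in decreasing order
    cumsum = np.cumsum(u)
    # largest 0-based index rho with u[rho] > (cumsum[rho] - 1)/(rho + 1)
    rho = np.nonzero(u * np.arange(1, d + 1) > cumsum - 1.0)[0][-1]
    theta = (cumsum[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

x = euclidean_projection_simplex(np.array([2.0, 1.0, 0.0]))   # -> (1, 0, 0)
```

A vector already on the simplex is its own projection (θ = 0), which gives a quick sanity check.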
In the next section, we will derive an efficient algorithm to compute an approximate solution for the class of Bregman divergences induced by ω-potentials, by solving the KKT system given in Proposition 1.

III. EFFICIENT APPROXIMATE PROJECTION WITH ω-POTENTIALS

Definition 1: Let a ∈ (−∞, +∞] and ω ≤ 0. An increasing C¹-diffeomorphism φ : (−∞, a) → (ω, +∞) is called an ω-potential if

  lim_{u→−∞} φ(u) = ω,  lim_{u→a} φ(u) = +∞,  ∫_0^1 φ^{-1}(u) du < ∞.

Fig. 1. Illustration of an ω-potential.

We associate, to an ω-potential φ, the distance-generating function ψ defined as follows:

  ψ : (ω, +∞)^d → R
  x ↦ Σ_{i=1}^d ∫_0^{x_i} φ^{-1}(u) du.

By definition, ψ is finite (in particular, the third condition on the potential ensures that ψ is finite on the boundary of the simplex since ∫_0^1 φ^{-1}(u) du < ∞), differentiable on (ω, +∞)^d, and its gradient is given by

  ∇ψ : (ω, +∞)^d → R = (−∞, a)^d
  x ↦ ∇ψ(x) = (φ^{-1}(x_i))_{i=1,...,d},

and since φ is increasing, ψ is convex. Similarly, the inverse of its gradient is

  (∇ψ)^{-1} : (−∞, a)^d → (ω, +∞)^d
  y ↦ (φ(y_i))_{i=1,...,d}.
Proposition 2: Consider the Bregman projection onto the simplex given in Problem (5), and assume that ψ is induced by an ω-potential φ. Then x* is a solution if and only if there exists ν ∈ R such that

  ∀i, x*_i = (φ(φ^{-1}(x̄_i) − ḡ_i + ν))_+,
  Σ_{i=1}^d x*_i = 1,

where (x)_+ denotes the positive part of x, (x)_+ = max(x, 0).

Proof: Combining the expressions of ∇ψ and (∇ψ)^{-1} with Proposition 1, we have that x* is optimal if and only if there exist ν ∈ R and λ ∈ R^d_+ such that

  ∀i, x*_i = φ(φ^{-1}(x̄_i) − ḡ_i + ν + λ_i),
  Σ_{i=1}^d x*_i = 1,
  ∀i, x*_i ≥ 0, x*_i λ_i = 0.

Let I = {i : x*_i > 0} be the support of x*. Then by the complementary slackness condition, we have for all i ∈ I, λ_i = 0, thus x*_i = φ(φ^{-1}(x̄_i) − ḡ_i + ν), and for all i ∉ I,

  φ(φ^{-1}(x̄_i) − ḡ_i + ν) ≤ φ(φ^{-1}(x̄_i) − ḡ_i + ν + λ_i) = x*_i = 0,

since φ is increasing. Therefore x* can be simply written x*_i = (φ(φ^{-1}(x̄_i) − ḡ_i + ν))_+, which proves the claim.

Next, we make the following observation regarding the support of the solution:

Proposition 3: Let x* be the solution to the projection problem (5), and let I be its support. Then for all i, j, if i ∈ I and φ^{-1}(x̄_j) − ḡ_j ≥ φ^{-1}(x̄_i) − ḡ_i, then j ∈ I.

Proof: Follows from Proposition 2 and the fact that φ is increasing.

As a consequence of the previous propositions, computing the projection reduces to computing the optimal dual variable ν, and since the potential is increasing, one can iteratively approximate ν using a bisection method, given in Algorithm 3: we start by defining a bound on the optimal ν, ν̲ ≤ ν ≤ ν̄, then we iteratively halve the size of the interval by inspecting the value of a carefully defined criterion function.

Theorem 1: Consider the Bregman projection onto the simplex given in Problem (5), and assume that ψ is induced by an ω-potential φ. Let ε > 0, and consider the bisection method given in Algorithm 3. Then the algorithm terminates after T = O(log(1/ε)) steps, and its output x(ν̲^(T)) is such that ‖x(ν̲^(T)) − x*‖₁ ≤ ε. Each step of the algorithm has complexity O(d), thus the total complexity is O(d log(1/ε)).

Proof: Define, as in Algorithm 3, the function x(ν) = ((φ(φ^{-1}(x̄_i) − ḡ_i + ν))_+)_{i=1,...,d}.
Since φ is, by assumption, increasing, so is ν ↦ x_i(ν), which is the key fact that allows us to use a bisection. We will denote by a superscript (t) the value of each variable at iteration t of the loop. To prove the claim, we show the following invariant for all t:

Algorithm 3 Bisection method to compute the projection x* with precision ε.
1: Input: x̄, ḡ, ε.
2: Initialize
     ν̄ = φ^{-1}(1) − max_i (φ^{-1}(x̄_i) − ḡ_i),
     ν̲ = φ^{-1}(1/d) − max_i (φ^{-1}(x̄_i) − ḡ_i)
3: Define x(ν) = ((φ(φ^{-1}(x̄_i) − ḡ_i + ν))_+)_{i=1,...,d}
4: while ‖x(ν̄) − x(ν̲)‖₁ > ε do
5:   Let ν⁺ ← (ν̲ + ν̄)/2
6:   if Σ_i x_i(ν⁺) > 1 then
7:     ν̄ ← ν⁺
8:   else
9:     ν̲ ← ν⁺
10:  end if
11: end while
12: Return x(ν̲)

  (i1) 0 ≤ ν̄^(t) − ν̲^(t) ≤ (ν̄^(0) − ν̲^(0)) / 2^t,
  (i2) ∀i, 0 ≤ x_i(ν̲^(t)) ≤ x_i(ν̄^(t)) ≤ 1,
  (i3) Σ_{i=1}^d x_i(ν̲^(t)) ≤ 1 ≤ Σ_{i=1}^d x_i(ν̄^(t)).

We first prove the invariant for t = 0. Let i_0 = arg max_i φ^{-1}(x̄_i) − ḡ_i. By definition of ν̲^(0) and ν̄^(0), we have

  φ^{-1}(1/d) − ν̲^(0) = φ^{-1}(x̄_{i_0}) − ḡ_{i_0} = φ^{-1}(1) − ν̄^(0),   (7)

and it follows that x_{i_0}(ν̲^(0)) = 1/d and x_{i_0}(ν̄^(0)) = 1. By (7), ν̄^(0) − ν̲^(0) = φ^{-1}(1) − φ^{-1}(1/d) ≥ 0 (since φ^{-1} is increasing), which proves (i1). Next, since ν ↦ x_i(ν) is increasing, we have 0 ≤ x_i(ν̲^(0)) ≤ x_i(ν̄^(0)) ≤ x_{i_0}(ν̄^(0)) = 1, which proves (i2). Finally, we have

  Σ_{i=1}^d x_i(ν̲^(0)) ≤ d · x_{i_0}(ν̲^(0)) = 1,
  Σ_{i=1}^d x_i(ν̄^(0)) ≥ x_{i_0}(ν̄^(0)) = 1,

which proves (i3). This proves the invariant for t = 0. Now suppose it holds at iteration t, and let us prove it still holds at t + 1. By definition of the bisection (lines 5-10), we immediately have ν̄^(t+1) − ν̲^(t+1) = (ν̄^(t) − ν̲^(t))/2 ≤ (ν̄^(0) − ν̲^(0))/2^(t+1), which proves (i1). We also have ν̲^(t) ≤ ν̲^(t+1) ≤ ν̄^(t+1) ≤ ν̄^(t), which proves (i2) since ν ↦ x_i(ν) is increasing. Finally, (i3) follows from the condition of the bisection (line 6).

To conclude the proof, we simply observe that since the distance ν̄ − ν̲ decreases exponentially, the algorithm will terminate after a number of steps logarithmic in 1/ε. Indeed, since φ is C¹ on (−∞, a), it is Lipschitz-continuous on
[φ^{-1}(0), φ^{-1}(1)]. Let L be its Lipschitz constant; then

  ‖x(ν̲^(t)) − x(ν̄^(t))‖₁ = Σ_{i=1}^d |x_i(ν̲^(t)) − x_i(ν̄^(t))| ≤ dL (ν̄^(t) − ν̲^(t)) ≤ dL (ν̄^(0) − ν̲^(0)) / 2^t

by (i1), thus the algorithm terminates after T = log₂(dL(ν̄^(0) − ν̲^(0))/ε) iterations, and the last iterate satisfies, by (i2) and since the x_i are increasing,

  ‖x(ν*) − x(ν̲^(T))‖₁ ≤ ‖x(ν̄^(T)) − x(ν̲^(T))‖₁ ≤ ε,

which concludes the proof.

IV. EFFICIENT EXACT PROJECTION WITH EXPONENTIAL POTENTIALS

We now consider a subclass of ω-potentials, for which we derive the exact solution.

Definition 2 (Exponential potential): Let ε ≥ 0. The function

  φ_ε : (−∞, +∞) → (−ε, +∞)
  u ↦ e^{u−1} − ε

is called the exponential potential with parameter ε. It is a (−ε)-potential. The distance generating function induced by this class of potentials is given by

  ψ_ε(x) = Σ_{i=1}^d ∫_0^{x_i} φ_ε^{-1}(u) du = Σ_{i=1}^d ∫_0^{x_i} (1 + ln(u + ε)) du
         = Σ_{i=1}^d (x_i + ε) ln(x_i + ε) − Σ_{i=1}^d ε ln ε
         = H(x + ε1) − H(ε1),

where ε1 is the vector whose entries are all equal to ε, and H is the generalized negative entropy function, defined on R^d_+ by H(x) = Σ_{i=1}^d x_i ln x_i. The corresponding Bregman divergence is

  D_{ψ_ε}(x, y) = H(x + ε1) − H(y + ε1) − ⟨∇H(y + ε1), x − y⟩
              = D_KL(x + ε1, y + ε1)
              = Σ_{i=1}^d (x_i + ε) ln((x_i + ε)/(y_i + ε)),

and will be denoted D_{KL,ε}(x, y). In particular, when ε = 0, D_{KL,ε}(x, y) is the KL divergence between the distribution vectors x and y. When ε > 0, the Bregman divergence is the KL divergence between x + ε1 and y + ε1. In particular, as we will see in Proposition 6, D_{KL,ε}(x, y) is bounded whenever ε > 0, while the KL divergence (ε = 0) can be unbounded.

Fig. 2. Illustration of the distance generating function induced by exponential potentials with parameter ε, for d = 2: H(x) = x₁ ln x₁ + (1 − x₁) ln(1 − x₁).

As mentioned in the introduction, projecting on the simplex with the KL divergence plays a central role in many applications such as online learning. In particular, the projection problem can be solved exactly in O(d) operations, which makes this projection efficient.
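To make Algorithm 3 concrete, here is a sketch (ours, in Python; names and the stopping rule are assumptions) of the bisection for a generic ω-potential given as a pair (φ, φ⁻¹). For simplicity it bisects until the dual interval is small rather than testing ‖x(ν̄) − x(ν̲)‖₁ directly. It is instantiated with the exponential potential with ε = 0, i.e. φ(u) = e^{u−1}, for which the exact answer is the exponential weights update, so the two can be compared.

```python
import numpy as np

def bregman_projection_bisection(x_bar, g_bar, phi, phi_inv, tol=1e-12, max_iter=200):
    """Approximate Bregman projection onto the simplex (Algorithm 3 sketch).

    Bisects on the dual variable nu so that
    sum_i (phi(phi_inv(x_bar_i) - g_bar_i + nu))_+ = 1;
    phi must be an increasing omega-potential.
    """
    z = phi_inv(x_bar) - g_bar
    x_of = lambda nu: np.maximum(phi(z + nu), 0.0)
    nu_lo = phi_inv(1.0 / x_bar.size) - z.max()   # sum of x(nu_lo) <= 1
    nu_hi = phi_inv(1.0) - z.max()                # sum of x(nu_hi) >= 1
    for _ in range(max_iter):
        nu = 0.5 * (nu_lo + nu_hi)
        if x_of(nu).sum() > 1.0:
            nu_hi = nu
        else:
            nu_lo = nu
        if nu_hi - nu_lo < tol:
            break
    return x_of(nu_lo)

# Exponential potential with epsilon = 0: phi(u) = exp(u - 1)
phi = lambda u: np.exp(u - 1.0)
phi_inv = lambda v: 1.0 + np.log(v)
x_bar = np.array([0.1, 0.2, 0.3, 0.4])
g_bar = np.array([0.5, 0.0, 1.0, 0.2])
x = bregman_projection_bisection(x_bar, g_bar, phi, phi_inv)
```

For this potential the positive part is never active, and the output should agree with x̄_i e^{−ḡ_i} / Σ_j x̄_j e^{−ḡ_j} up to the bisection tolerance.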
However, some variants of mirror descent, such as stochastic mirror descent, require the Bregman divergence to be bounded on the simplex in order to have guarantees on the convergence rate, see for example [14]. In the remainder of this section, we will show that projecting with the generalized KL divergence D_{KL,ε} enjoys many desirable properties (strong convexity with respect to the ℓ1 norm, boundedness), and the projection can still be computed efficiently.

A. A sorting algorithm to compute the exact projection

We first apply the optimality conditions of Proposition 2 to this special class, and show that the solution is entirely determined by its support.

Proposition 4: Consider the Bregman projection onto the simplex given in Problem (5), with Bregman divergence D_{KL,ε}. Let x* be the solution and I = {i : x*_i > 0} its support. Then

  ∀i ∈ I, x*_i = −ε + (x̄_i + ε) e^{−ḡ_i} / Z,  Z = Σ_{i∈I} (x̄_i + ε) e^{−ḡ_i} / (1 + |I| ε).   (8)

Proof: Applying Proposition 2 with the expressions φ(u) = e^{u−1} − ε and φ^{-1}(u) = 1 + ln(u + ε), x* is a solution if and only if there exists ν ∈ R such that

  ∀i, x*_i = (−ε + (x̄_i + ε) e^{−ḡ_i} e^{ν})_+,  and  Σ_i x*_i = 1.

Thus, if I is the support of x*, then these optimality conditions are equivalent to

  ∀i ∈ I, x*_i = −ε + (x̄_i + ε) e^{−ḡ_i} e^{ν},
  Σ_{i∈I} (−ε + (x̄_i + ε) e^{−ḡ_i} e^{ν}) = 1,

and the second equation can be rewritten as

  1 + ε|I| = e^{ν} Σ_{i∈I} (x̄_i + ε) e^{−ḡ_i},

which proves the claim, with Z = e^{−ν}.

Proposition 4 shows that solving the Bregman projection with generalized KL divergence reduces to finding the support of the solution. Next, we show that the support has a simple characterization. To this end, we associate to (x̄, ḡ) the vector ȳ defined as ȳ_i = (x̄_i + ε) e^{−ḡ_i}, and we denote by ȳ_{σ(j)} the j-th smallest element of ȳ.
Algorithm 4 Sorting method to compute the Bregman projection with D_{ψ_ε}.
1: Input: x̄, ḡ
2: Output: x*
3: Form the vector ȳ, ȳ_i = (x̄_i + ε) e^{−ḡ_i}
4: Sort ȳ; let ȳ_{σ(j)} be the j-th smallest element of ȳ.
5: Let j* be the smallest index j for which
     c(j) := (1 + ε(d − j + 1)) ȳ_{σ(j)} − ε Σ_{i=j}^d ȳ_{σ(i)} > 0
6: Set Z = Σ_{i=j*}^d ȳ_{σ(i)} / (1 + ε(d − j* + 1))
7: Set x*_i = (−ε + ȳ_i / Z)_+

Proposition 5: The function j ↦ c(j) is increasing, and the support of x* is {σ(j*), ..., σ(d)}, where j* = min{j : c(j) > 0}.

Proof: First, straightforward algebra shows that

  c(j + 1) − c(j) = (1 + ε(d − j)) (ȳ_{σ(j+1)} − ȳ_{σ(j)}) ≥ 0.

Thus c is increasing. To prove the second part of the claim, we know by Proposition 3 that the support is {σ(i*), ..., σ(d)} for some i*, and to show that i* = j* = min{j : c(j) > 0}, it suffices to show that c(i*) > 0 and c(j) ≤ 0 for all j < i*. First, by the expression (8) of x*, we have

  x*_{σ(i*)} = −ε + ȳ_{σ(i*)} (1 + ε(d − i* + 1)) / Σ_{i=i*}^d ȳ_{σ(i)} > 0,

which is equivalent to c(i*) > 0. And if j < i* (i.e. σ(j) is outside the support), then by the expression (8) again,

  0 = x*_{σ(j)} ≥ −ε + ȳ_{σ(j)} (1 + ε(d − i* + 1)) / Σ_{i=i*}^d ȳ_{σ(i)},

which is equivalent to (1 + ε(d − i* + 1)) ȳ_{σ(j)} − ε Σ_{i=i*}^d ȳ_{σ(i)} ≤ 0, but c(j) is smaller than this left-hand side, since

  c(j) − [(1 + ε(d − i* + 1)) ȳ_{σ(j)} − ε Σ_{i=i*}^d ȳ_{σ(i)}] = ε Σ_{i=j}^{i*−1} (ȳ_{σ(j)} − ȳ_{σ(i)}) ≤ 0,

which concludes the proof.

Theorem 2: Algorithm 4 solves the Bregman projection problem with exponential potential φ_ε in O(d log d) operations.

Proof: Correctness of the algorithm follows from the characterization of the support of x* in Proposition 5 and

Algorithm 5 QuickProject algorithm to compute the Bregman projection with D_{ψ_ε}.
1: Input: x̄, ḡ
2: Output: x*
3: Form the vector ȳ, ȳ_i = (x̄_i + ε) e^{−ḡ_i}
4: Initialize J = {1, ..., d}, S = 0, C = 0, s* = d + 1
5: while J ≠ ∅ do
6:   Select a random pivot index j ∈ J
7:   Partition J:
       J⁺ = {i ∈ J : ȳ_i ≥ ȳ_j},  J⁻ = {i ∈ J : ȳ_i < ȳ_j}
     and compute S⁺ = Σ_{i∈J⁺} ȳ_i, C⁺ = |J⁺|
8:   Let γ = (1 + ε(C + C⁺)) ȳ_j − ε(S + S⁺)
9:   if γ > 0 then
10:    J ← J⁻, s* ← j
11:    S ← S + S⁺, C ← C + C⁺
12:  else
13:    J ← J⁺ \ {j}
14:  end if
15: end while
16: Set Z = S / (1 + εC)
17: Set x*_i = (−ε + ȳ_i / Z)_+

the expression of x* in Proposition 4.
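A sketch (ours, in Python; names are assumptions) of Algorithm 4. With ε = 0 the support is the whole index set and the output reduces to the exponential weights update, which gives a simple sanity check.

```python
import numpy as np

def sort_projection(x_bar, g_bar, eps):
    """Exact Bregman projection with D_{KL,eps} by sorting (Algorithm 4 sketch)."""
    y = (x_bar + eps) * np.exp(-g_bar)
    d = y.size
    ys = np.sort(y)                      # ys[j-1] is the j-th smallest element
    tails = np.cumsum(ys[::-1])[::-1]    # tails[j-1] = sum of ys[j-1:], i.e. sum_{i>=j}
    # smallest j with c(j) = (1 + eps*(d-j+1)) * ys[j-1] - eps * tails[j-1] > 0
    for j in range(1, d + 1):
        if (1 + eps * (d - j + 1)) * ys[j - 1] - eps * tails[j - 1] > 0:
            break
    Z = tails[j - 1] / (1 + eps * (d - j + 1))
    return np.maximum(-eps + y / Z, 0.0)

x_bar = np.array([0.1, 0.2, 0.3, 0.4])
g_bar = np.array([2.0, 0.0, 1.0, 0.5])
x = sort_projection(x_bar, g_bar, eps=0.1)   # sums to 1, entries >= 0
```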
The complexity of the sort operation (step 4) is O(d log d), and finding j* (step 5) can be done in linear time since the criterion function c(·) satisfies c(j + 1) − c(j) = (1 + ε(d − j))(ȳ_{σ(j+1)} − ȳ_{σ(j)}), so each criterion evaluation costs O(1). Therefore, the overall complexity of Algorithm 4 is O(d log d).

B. A randomized pivot algorithm to compute the exact solution

We now propose a randomized version of Algorithm 4, which selects a random pivot at each iteration, instead of sorting the full vector. The resulting algorithm, which we call QuickProject, is an extension of the QuickSelect algorithm due to Hoare [16]. A similar idea is used in the randomized version of the ℓ2 projection on the simplex in [13].

Theorem 3: In expectation, the QuickProject algorithm terminates after O(d) operations, and outputs the solution x* of the Bregman projection problem (5) with the Bregman divergence D_{KL,ε}.

Proof: First, we prove that the algorithm has expected linear complexity. Let T(n) be the expected complexity of the while loop when |J| = n. The partition and compute step (7) takes 3n operations, then we recursively apply the loop to J⁻ or J⁺, which have sizes (m, n − m) for any m ∈ {1, ..., n}, with uniform
probability. Thus we can bound T(n) as follows:

  T(n) ≤ 3n + (1/n) Σ_{m=1}^n T(max(m, n − m)) ≤ 3n + (2/n) Σ_{m=⌈n/2⌉}^n T(m),

and we can show by induction that T(n) ≤ 12n, since T(0) = 0 and

  3n + (2/n) Σ_{m=⌈n/2⌉}^n 12m ≤ 3n + 9n = 12n.

To prove the correctness of the algorithm, we will prove that once the while loop terminates, s* = σ(j*), and S, C are respectively the sum and the cardinality of {ȳ_{σ(i)} : i ≥ j*}; then by Proposition 4, we have the correct expression of x*. We start by showing the following invariants:

  (i1) If ȳ_{σ(m_t)} is the largest element in J^(t), then σ(m_t + 1) = (s*)^(t).
  (i2) J^(t) contains σ(j*) or σ(j* − 1).
  (i3) S and C are the sum and the cardinality of the set of accepted elements, {ȳ_{σ(i)} : i ≥ m_t + 1}.
  (i4) γ^(t) = c(j), where c is the criterion function defined in Proposition 5, evaluated at the rank j of the pivot.

The invariant holds for the first iteration since J^(1) = {1, ..., d}, m_1 = d, and S^(1) = C^(1) = 0. Suppose the invariant is true at iteration t of the loop. Then two cases are possible:
1) If γ^(t) ≤ 0, then J^(t+1) ⊆ (J^(t))⁺ and m_{t+1} = m_t, and the invariant still holds.
2) If γ^(t) > 0, then J^(t+1) = (J^(t))⁻ and (s*)^(t+1) = j^(t), thus

  {i : ȳ_i ≥ ȳ_{(s*)^(t+1)}} = {i : ȳ_i ≥ ȳ_{(s*)^(t)}} ∪ (J^(t))⁺,

and by the update step (lines 10-11), the invariant still holds.

To finish the proof, suppose the while loop terminates after T iterations, i.e. J^(T+1) = ∅. We claim that (s*)^(T+1) = σ(j*). During the last update, two cases are possible:
1) If γ^(T) > 0, then ȳ_{j^(T)} is the smallest element of J^(T). In this case, since c(j) ≤ 0 for j < j*, and J^(T) contains σ(j*) or σ(j* − 1), it must be that j^(T) = σ(j*), thus (s*)^(T+1) = j^(T) = σ(j*).
2) If γ^(T) ≤ 0, then ȳ_{j^(T)} is the largest element of J^(T); in this case, since c(j*) > 0, it must be that j^(T) = σ(j* − 1), so m_T = j* − 1 and (s*)^(T+1) = (s*)^(T) = σ(m_T + 1) = σ(j*).

This concludes the proof.

C. Properties of the generalized KL divergence

Algorithms 4 and 5 give efficient methods for computing the projection with generalized KL divergence D_{KL,ε}. In this section, we show that this family of Bregman divergences enjoys additional properties, given below.
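Before turning to these properties, here is a sketch (ours, in Python; names are assumptions) of the QuickProject idea. It deviates slightly from the listing of Algorithm 5: it works on the multiset of values of ȳ and, when γ ≤ 0, discards all elements less than or equal to the pivot (which share the pivot's fate), so every iteration strictly shrinks the candidate set.

```python
import numpy as np

def quickproject(x_bar, g_bar, eps, seed=0):
    """Randomized-pivot Bregman projection with D_{KL,eps} (Algorithm 5 sketch).

    Finds the support without fully sorting y: elements accepted into
    the support accumulate into the running sum S and count C.
    """
    rng = np.random.default_rng(seed)
    y = (x_bar + eps) * np.exp(-g_bar)
    J = y                      # candidate values still to classify
    S, C = 0.0, 0              # sum / cardinality of accepted elements
    while J.size > 0:
        pivot = J[rng.integers(J.size)]
        upper = J[J >= pivot]
        S_plus, C_plus = upper.sum(), upper.size
        gamma = (1 + eps * (C + C_plus)) * pivot - eps * (S + S_plus)
        if gamma > 0:          # pivot in the support, hence so is all of `upper`
            S, C = S + S_plus, C + C_plus
            J = J[J < pivot]
        else:                  # pivot (and anything <= pivot) outside the support
            J = J[J > pivot]
    Z = S / (1 + eps * C)
    return np.maximum(-eps + y / Z, 0.0)

x_bar = np.array([0.1, 0.2, 0.3, 0.4])
g_bar = np.array([1.0, 0.0, 0.5, 0.25])
x = quickproject(x_bar, g_bar, eps=0.1)
```

With ε = 0 every pivot satisfies γ = ȳ_j > 0, so the whole vector is accepted and the output reduces to the exponential weights update, matching Algorithm 4.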
Proposition 6: For all ε > 0, D_{KL,ε} is ℓ_ε-strongly convex and L_ε-smooth w.r.t. ‖·‖₁, and bounded by D_ε on Δ, with

  ℓ_ε ≥ 1/(1 + dε),  L_ε ≤ 1/ε,  D_ε ≤ ln((1 + ε)/ε).

Proof: First, we show strong convexity. Let x, y ∈ Δ. By Taylor's theorem, there exists z on the segment (x + ε1, y + ε1) such that

  D_{KL,ε}(x, y) = H(x + ε1) − H(y + ε1) − ⟨∇H(y + ε1), x − y⟩
               = ½ ⟨x − y, ∇²H(z)(x − y)⟩ = ½ Σ_i (x_i − y_i)² / z_i,

where we used the fact that the Hessian of the negative entropy function is ∇²H(z) = diag(1/z_i). And since for all i, z_i ≥ ε (z belongs to the segment (x + ε1, y + ε1)), it follows that

  D_{KL,ε}(x, y) ≤ (1/(2ε)) Σ_i (x_i − y_i)² ≤ (1/(2ε)) ‖x − y‖₁².

Furthermore, by the Cauchy-Schwarz inequality,

  (Σ_i |x_i − y_i|)² ≤ (Σ_i (x_i − y_i)²/z_i)(Σ_i z_i),

thus, since Σ_i z_i ≤ 1 + dε,

  D_{KL,ε}(x, y) ≥ ‖x − y‖₁² / (2 Σ_i z_i) ≥ ‖x − y‖₁² / (2(1 + dε)).

To compute the upper bound on D_{KL,ε}, we observe that D_{KL,ε}(x, y) is jointly-convex in (x, y) (by joint-convexity of the KL divergence); therefore its maximum on Δ_d × Δ_d is attained on a vertex of the feasible set, that is, for (x, y) = (δ_{i_0}, δ_{j_0}) for some (i_0, j_0), where δ_{i_0} is the Dirac distribution on i_0. Finally, a simple calculation shows that

  D_{KL,ε}(δ_{i_0}, δ_{j_0}) = 0 if i_0 = j_0, and ln((1 + ε)/ε) otherwise.

Fig. 3. Illustration of Proposition 6, when d = 2. The distributions x and y are parameterized as x = (p, 1 − p) and y = (q, 1 − q). The surface plot (left) shows the generalized KL divergence for ε = .1, with, in dashed lines, the quadratic upper and lower bounds (ℓ_ε/2)‖x − y‖₁² and (L_ε/2)‖x − y‖₁². The second plot (right) compares D_{KL,.1}(x, y_0) and D_KL(x, y_0) for a fixed y_0 = (.35, .65).
D. Numerical experiments

We provide a simple Python implementation of the projection algorithms at github.com/walidk/BregmanProjection. The implementation of Algorithm 3 is generic and can be instantiated for any ω-potential by providing the function φ and its inverse. The implementations of Algorithm 4 and QuickProject are specific to the generalized exponential potential. Finally, we report in Figure 4 the run times of both algorithms as the dimension d grows, averaged over 50 runs, for randomly generated, normally distributed vectors x̄ and ḡ. The numerical simulations are also available on the same repository.

Fig. 4. Execution time as a function of the dimension d, with ε = .1, in log-log scale (left). The highlighted region is zoomed-in in linear scale on the right. The simulation confirms that the QuickProject algorithm is, on average, faster than the sorting algorithm, especially for large d.

V. CONCLUSION

We studied the Bregman projection problem on the simplex with ω-potentials, and derived optimality conditions for the solution, which motivated a simple bisection algorithm to compute ε-approximate solutions in O(d log(1/ε)) time. Then we focused on the projection problem with exponential potentials, resulting in a Bregman divergence which generalizes the KL divergence. We showed that in this case, the solution can be computed exactly in O(d log d) time using a sorting algorithm, or in expected O(d) time using a randomized pivot algorithm. This class of divergences is of particular interest because it has a quadratic upper and lower bound (i.e. its distance generating function is both strongly convex and smooth), a property which is essential to obtain convergence guarantees in some settings, such as stochastic mirror descent. A question which remains open is whether one can project in O(d) time using a deterministic algorithm akin to the median of medians algorithm due to Blum et al. [7], which solves the selection problem in deterministic linear time.
The fact that one can efficiently compute the exact solution hinges on the existence of a closed-form solution of the dual variable ν given the support of the solution (Proposition 4). This is also the case for the Euclidean projection, i.e. when D_ψ is the squared Euclidean norm, see [13]. This suggests that one may derive efficient projection algorithms for other classes of Bregman divergences, which would, in turn, lead to new efficient instances of the mirror descent method.

REFERENCES

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.
[2] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31-45, 2014.
[3] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167-175, May 2003.
[4] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. Society for Industrial and Applied Mathematics, 2001.
[5] Aharon Ben-Tal, Tamar Margalit, and Arkadi Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM J. on Optimization, 12(1):79-108, January 2001.
[6] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1-8, 1956.
[7] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. J. Comput. Syst. Sci., 7(4):448-461, August 1973.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[10] Yair Censor and Stavros Zenios. Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press, 1997.
[11] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[12] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML), June 2011.
[13] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 272-279, New York, NY, USA, 2008. ACM.
[14] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael Jordan. Ergodic mirror descent. SIAM Journal on Optimization (SIOPT), 22(4):1549-1578, 2012.
[15] James Hannan. Approximation to Bayes risk in repeated plays. Contributions to the Theory of Games, 3:97-139, 1957.
[16] C. A. R. Hoare. Algorithm 65: Find. Commun. ACM, 4(7):321-322, July 1961.
[17] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst., 1(1):17-58, 2011.
[18] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.
[19] Syrine Krichene, Walid Krichene, Roy Dong, and Alexandre Bayen. Convergence of heterogeneous distributed learning in the stochastic routing game. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, 2015.
[20] Walid Krichene, Syrine Krichene, and Alexandre Bayen. Convergence of mirror descent dynamics in the routing game. In European Control Conference (ECC), 2015.
[21] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, 1983.