Composite Objective Mirror Descent
Composite Objective Mirror Descent

John C. Duchi (UC Berkeley), Shai Shalev-Shwartz (Hebrew University), Yoram Singer (Google Research), Ambuj Tewari (TTI Chicago)

Abstract. We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as l1, mixed norm, and trace-norm.

1 Introduction and Problem Statement

Regularized loss minimization is a common learning paradigm in which one jointly minimizes an empirical loss over a training set plus a regularization term. The paradigm yields an optimization problem of the form

min_{w ∈ Ω} (1/n) Σ_{t=1}^n f_t(w) + r(w),  (1)

where Ω ⊆ R^d is the domain (a closed convex set), f_t : Ω → R is a (convex) loss function associated with a single example in a training set, and r : Ω → R is a (convex) regularization function. A few examples of famous learning problems that fall into this framework are least squares, ridge regression, support vector machines, support vector regression, lasso, and logistic regression. In this paper, we describe and analyze a general framework for solving Eq. (1). The method we propose is a first-order approach, meaning that we access the functions f_t only by receiving subgradients. Recent work has shown that from the perspective of achieving good statistical performance on unseen data, first-order methods are preferable to higher-order approaches, especially when the number of training examples n is very large (Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008).
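As a concrete instance of Eq. (1), the composite lasso objective can be evaluated directly; a minimal Python sketch (the data and function names here are ours, for illustration only):

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """Eq. (1) with f_t(w) = 0.5*(<x_t, w> - y_t)^2 and r(w) = lam*||w||_1."""
    losses = 0.5 * (X @ w - y) ** 2          # one squared loss per training example
    return losses.mean() + lam * np.abs(w).sum()

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w = np.array([1.0, -1.0])
obj = lasso_objective(w, X, y, lam=0.1)      # mean loss 0.25 plus penalty 0.2
```

Each f_t touches a single training example, which is exactly the access pattern the first-order methods below exploit.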
Furthermore, in large-scale problems it is often prohibitively expensive to compute the gradient of the entire objective function (thus accessing all the examples in the training set), and randomly choosing a subset of the training set and computing the gradient over the subset (perhaps only a single example) can be significantly more efficient. This approach is very closely related to online learning. Our general framework handles both cases with ease: it applies to accessing a single example (or a subset of the examples) at each iteration, or to accessing the entire training set at each iteration. The method we describe is an adaptation of the Mirror Descent (MD) algorithm (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003), an iterative method for minimizing a convex function φ : Ω → R. If the dimension d is large enough, MD is optimal among first-order methods, and it has a close connection to online learning since it is possible to bound the regret Σ_t φ_t(w_t) − inf_{w ∈ Ω} Σ_t φ_t(w), where {w_t} is the sequence generated by mirror descent and the φ_t are convex functions. In fact, one can view popular online learning algorithms, such as weighted majority (Littlestone and Warmuth, 1994) and online gradient descent (Zinkevich, 2003), as special cases of mirror descent. A guarantee on the online regret can be translated directly to a guarantee on the convergence rate of the algorithm to the optimum of Eq. (1), as we will show later.
2 Folloing Beck and Teboulle s exposition, a Mirror Descent update in the online setting can be ritten as t+ = argmin B ψ (, t ) + η φ t( t ), t, (2) Ω here B ψ is a Bregman divergence and φ t denotes an arbitrary subgradient of φ t. Intuitively, MD minimizes a first-order approximation of the function φ t at the current iterate t hile forcing the next iterate t+ to lie close to t. The step-size η controls the trade-off beteen these to. Our focus in this paper is to generalize mirror descent to the case hen the functions φ t are composite, that is, they consist of to parts: φ t = f t + r. Here the f t change over time but the function r remains constant. Of course, one can ignore the composite structure of the φ t and use MD. Hoever, doing so can result in undesirable effects. For example, hen r() =, applying MD directly does not lead to sparse updates. Since the sparsity inducing property of the l -norm is a major reason for its use. The modification of mirror descent that e propose is simple: t+ argmin Ω η f t( t ), + B ψ (, t ) + ηr(). (3) This is almost the same as the mirror descent update ith an important difference: e do not linearize r. We call this algorithm Composite Objective MIrror Descent, or Comid. One of our contributions is to sho that, in a variety of cases, the Comid update is no costlier than the usual mirror descent update. In these situations, each Comid update is efficient and benefits from the presence of the regularizer r(). We no outline the remainder of the paper. We begin by revieing related ork, of hich there is a copious amount, though e try to do some justice to prior research. We then give a general O( T ) regret bound for Comid in the online optimization setting, after hich e give several extensions. We sho O(log T ) regret bounds for Comid hen the composite functions f t + r are strongly convex, after hich e sho convergence rates and concentration results for stochastic optimization using Comid. 
The second focus of the paper is the derived algorithms, where we outline step rules for several choices of Bregman function ψ and regularizer r, including l1, l∞, and mixed-norm regularization, as well as presenting new results on efficient matrix optimization with Schatten p-norms.

2 Related Work

Since the idea underlying Comid is simple, it is not surprising that similar algorithms have been proposed. One of our main contributions is to show that Comid generalizes much prior work and to give a clean unifying analysis. We do not have the space to thoroughly review the literature, though we try to do some small justice to what is known. We begin by reviewing work that we will show is a special case of Comid. Forward-backward splitting is a long-studied framework for minimizing composite objective functions (Lions and Mercier, 1979), though it has only recently been analyzed for the online and stochastic case (Duchi and Singer, 2009). Specializations of forward-backward splitting to the case where r(w) = ‖w‖₁ include iterative shrinkage and thresholding from the signal processing literature (Daubechies et al., 2004); from machine learning, Truncated Gradient (Langford et al., 2009) and SMIDAS (Shalev-Shwartz and Tewari, 2009) are both special cases of Comid. In the optimization community there has been significant recent interest, both applied and theoretical, in the minimization of composite objective functions such as that in Eq. (1). Some notable examples include Wright et al. (2009); Nesterov (2007); Tseng (2009). These papers all assume that the objective f + r to be minimized is fixed and that f is smooth, i.e. that it has Lipschitz continuous derivatives. The most closely related of these to Comid is probably Tseng (2009, see his Sec. 3.1 and the references therein), which proposes the same update as ours but gives a Nesterov-like optimal method for the fixed-f case. We place no such restrictions on f, though by going to stochastic, nondifferentiable f we naturally suffer in convergence rate.
Nonetheless, we do answer in the affirmative a question posed by Tseng (2009), which is whether stochastic or incremental subgradient methods work for composite objectives. Two recent papers on online and stochastic composite objective minimization are Xiao (2009) and Duchi and Singer (2009). The former extends Nesterov's 2009 analysis of primal-dual subgradient methods to the composite case, giving an algorithm which is similar to ours; however, our algorithms are different and the analysis for each is completely different. Duchi and Singer (2009) is simply a specialization of Comid to the case where the Euclidean Bregman divergence is used. As a consequence of our general setting, we are able to give elegant new algorithms for minimization of functions on matrices, which include efficient and simple algorithms for trace-norm
minimization. Trace-norm minimization has recently found strong applicability in matrix rank minimization (Recht et al., 2007), which has been shown to be very useful, for example, in collaborative filtering (Srebro et al., 2004). A special case of Comid has recently been developed for this task, which is very similar in spirit to fixed-point and shrinkage methods from signal processing for l1-minimization (Ma et al., 2009). The authors of this paper note that the method is extremely efficient for rank-minimization problems but do not give rates of convergence, which we give as a corollary to our main convergence theorems.

3 Notation and Setting

Before continuing, we establish notation and our problem setting formally. Vectors are lower case bold italic letters, such as x ∈ R^d, and scalars are lower case italics such as x ∈ R. We denote a sequence of vectors by subscripts, i.e. w_t, w_{t+1}, ..., and entries in a vector by non-bold subscripts as in w_j. Matrices are upper case bold italic letters, such as W ∈ R^{d×d}. The subdifferential set of a function f evaluated at w is denoted ∂f(w) and a particular subgradient by f'(w) ∈ ∂f(w). When a function is differentiable, we write ∇f(w). We focus mostly on the problem of regularized online learning, in which the goal is to achieve low regret w.r.t. a static predictor w* ∈ Ω on a sequence of functions φ_t(w) ≜ f_t(w) + r(w). Here, f_t and r ≥ 0 are convex functions, and Ω is some convex set (which could be R^d). Formally, at every round of the algorithm we make a prediction w_t ∈ R^d and then receive the function f_t. We seek bounds on the regularized regret with respect to w*, defined as

R_φ(T, w*) ≜ Σ_{t=1}^T [ f_t(w_t) + r(w_t) − f_t(w*) − r(w*) ].  (4)

In batch optimization we set f_t = f for all t, while in stochastic optimization we choose f_t to be the average of some random subset of {f_1, ..., f_n}. As mentioned previously and as we will show, it is not difficult to transform regret bounds for Eq. (4) into convergence rates in expectation and with high probability for Eq.
(1), which we do using techniques similar to Cesa-Bianchi et al. (2004). Throughout, ψ designates a continuously differentiable function that is α-strongly convex w.r.t. a norm ‖·‖ on the set Ω. Recall that this means that the Bregman divergence associated with ψ,

B_ψ(w, v) = ψ(w) − ψ(v) − ⟨∇ψ(v), w − v⟩,

satisfies B_ψ(w, v) ≥ (α/2)‖w − v‖² for some α > 0.

4 Composite Objective MIrror Descent

We use proof techniques similar to those in Beck and Teboulle (2003) to derive progress bounds on each step of the algorithm. We then use the bounds to straightforwardly prove convergence results for online and batch learning. We begin by bounding the progress made by each step of the algorithm in either an online or a batch setting. This lemma is the key to our later analysis, so we prove it in full here.

Lemma 1. Let the sequence {w_t} be defined by the update in Eq. (3). Assume that B_ψ(·,·) is α-strongly convex with respect to a norm ‖·‖, that is, B_ψ(w, v) ≥ (α/2)‖w − v‖². For any w* ∈ Ω,

η (f_t(w_t) − f_t(w*)) + η (r(w_{t+1}) − r(w*)) ≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η²/(2α)) ‖f_t'(w_t)‖_*².

Proof: The optimality of w_{t+1} for Eq. (3) implies, for all w ∈ Ω and some r'(w_{t+1}) ∈ ∂r(w_{t+1}),

⟨w − w_{t+1}, η f_t'(w_t) + ∇ψ(w_{t+1}) − ∇ψ(w_t) + η r'(w_{t+1})⟩ ≥ 0.  (5)

In particular, this obtains for w = w*. From the subgradient inequality for convex functions, we have f_t(w*) ≥ f_t(w_t) + ⟨f_t'(w_t), w* − w_t⟩, or f_t(w_t) − f_t(w*) ≤ ⟨f_t'(w_t), w_t − w*⟩, and likewise for r(w_{t+1}). We thus have

η [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)]
≤ η ⟨w_t − w*, f_t'(w_t)⟩ + η ⟨w_{t+1} − w*, r'(w_{t+1})⟩
= η ⟨w_{t+1} − w*, f_t'(w_t)⟩ + η ⟨w_{t+1} − w*, r'(w_{t+1})⟩ + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩
= ⟨w* − w_{t+1}, ∇ψ(w_t) − ∇ψ(w_{t+1}) − η f_t'(w_t) − η r'(w_{t+1})⟩ + ⟨w* − w_{t+1}, ∇ψ(w_{t+1}) − ∇ψ(w_t)⟩ + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩.
Now, by Eq. (5), the first term in the last equation is non-positive. Thus we have

η [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)]
≤ ⟨w* − w_{t+1}, ∇ψ(w_{t+1}) − ∇ψ(w_t)⟩ + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩
= B_ψ(w*, w_t) − B_ψ(w_{t+1}, w_t) − B_ψ(w*, w_{t+1}) + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩  (6)
= B_ψ(w*, w_t) − B_ψ(w_{t+1}, w_t) − B_ψ(w*, w_{t+1}) + ⟨√α (w_t − w_{t+1}), (η/√α) f_t'(w_t)⟩
≤ B_ψ(w*, w_t) − B_ψ(w_{t+1}, w_t) − B_ψ(w*, w_{t+1}) + (α/2) ‖w_t − w_{t+1}‖² + (η²/(2α)) ‖f_t'(w_t)‖_*²
≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η²/(2α)) ‖f_t'(w_t)‖_*².

In the above, the first equality follows from simple algebra with B_ψ, that is, ⟨∇ψ(b) − ∇ψ(a), c − a⟩ = B_ψ(c, a) + B_ψ(a, b) − B_ψ(c, b), setting c = w*, a = w_{t+1}, and b = w_t. The second-to-last inequality follows from the Fenchel-Young inequality applied to the conjugate pair ½‖·‖², ½‖·‖_*² (Boyd and Vandenberghe, 2004, Example 3.27). The last inequality follows from the strong convexity of B_ψ with respect to the norm ‖·‖.

The following theorem uses Lemma 1 to establish a general regret bound for the Comid framework.

Theorem 2. Let the sequence {w_t} be defined by the update in Eq. (3). Then for any w* ∈ Ω,

R_φ(T, w*) ≤ (1/η) B_ψ(w*, w_1) + r(w_1) + (η/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*².

Proof: By Lemma 1,

η Σ_{t=1}^T [f_t(w_t) − f_t(w*) + r(w_{t+1}) − r(w*)]
≤ Σ_{t=1}^T [B_ψ(w*, w_t) − B_ψ(w*, w_{t+1})] + (η²/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*²
= B_ψ(w*, w_1) − B_ψ(w*, w_{T+1}) + (η²/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*².

Since Bregman divergences are always non-negative, and recalling our assumption that r(w) ≥ 0, adding ηr(w_1) to both sides of the above equation and dropping the r(w_{T+1}) term gives

η Σ_{t=1}^T [f_t(w_t) − f_t(w*) + r(w_t) − r(w*)] ≤ B_ψ(w*, w_1) + ηr(w_1) + (η²/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*².

Dividing each side by η gives the result.

A few corollaries are immediate from the above result. First, suppose that the functions f_t are Lipschitz continuous. Then there is some G such that ‖f_t'(w_t)‖_* ≤ G. In this case, we have

Corollary 3. Let {w_t} be generated by the update Eq. (3) and assume that the functions f_t are Lipschitz with dual Lipschitz constant G. Then

R_φ(T) ≤ (1/η) B_ψ(w*, w_1) + r(w_1) + (Tη/(2α)) G².

If we take η ∝ 1/√T, then we have a regret which is O(√T) when the functions f_t are Lipschitz. If Ω is compact, the f_t are guaranteed to be Lipschitz continuous (Rockafellar, 1970).
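To illustrate the batch use of these bounds, one can run the Euclidean instantiation of Eq. (3) on a small composite problem and check that the iterates approach the minimizer; a self-contained sketch (the toy problem and all names are ours):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# batch composite problem: f(w) = 0.5*||w - c||^2 and r(w) = lam*||w||_1,
# whose minimizer is the soft-thresholded vector S_lam(c)
c = np.array([1.0, -0.3, 0.05])
lam = 0.2
eta = 0.5

w = np.zeros(3)
for _ in range(200):
    grad = w - c                                   # gradient of f at w
    # Euclidean Comid step: soft threshold the gradient step
    w = soft_threshold(w - eta * grad, eta * lam)

w_star = soft_threshold(c, lam)                    # exact minimizer
```

With a fixed step size the iterates contract geometrically to w_star on this smooth problem; the theorems above cover the harder nonsmooth, online, and stochastic settings.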
Corollary 4. Suppose that either Ω is compact or the functions f_t are Lipschitz, so that ‖f_t'‖_* ≤ G. Also assume r(w_1) = 0. Then setting η = √(2α B_ψ(w*, w_1))/(G√T),

R_φ(T) ≤ √(2T B_ψ(w*, w_1)) G/√α.

It is straightforward to prove results under the slightly different restriction that ‖f_t'(w)‖_*² ≤ ρ f_t(w), which is similar to assuming a Lipschitz condition on the gradient of f_t. A common example in which this holds is linear regression, where f_i(w) = ½(⟨w, x_i⟩ − y_i)², so f_i'(w) = (⟨w, x_i⟩ − y_i) x_i and ρ = 2‖x_i‖². The proof essentially amounts to dividing out constants dependent on η and ρ from both sides of the regret.
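For the linear-regression example, the restriction in fact holds with equality, since ‖f_i'(w)‖₂² = (⟨w, x_i⟩ − y_i)²‖x_i‖₂² = 2‖x_i‖₂² f_i(w); a small numeric check (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), 0.7
w = rng.normal(size=5)

# squared loss for one example and its gradient
f = 0.5 * (w @ x - y) ** 2
grad = (w @ x - y) * x

rho = 2 * np.dot(x, x)  # rho = 2 * ||x_i||_2^2
# ||f'(w)||^2 = 2 * ||x||^2 * f(w) = rho * f(w), so the restriction holds
assert np.isclose(np.dot(grad, grad), rho * f)
```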
Corollary 5. Let ‖f_t'(w)‖_*² ≤ ρ f_t(w), r ≥ 0, and assume r(w_1) = 0. Setting η ∝ 1/√T gives R_φ(T) = O(ρ√T B_ψ(w*, w_1)/α).

Proof: From Theorem 2, the non-negativity of r, and the fact that r(w_1) = 0, we immediately have

(1 − ρη/(2α)) Σ_{t=1}^T [f_t(w_t) + r(w_t)] ≤ (1/η) B_ψ(w*, w_1) + Σ_{t=1}^T [f_t(w*) + r(w*)].

Setting η = α/(ρ√T) gives 1 − ρη/(2α) = (2√T − 1)/(2√T), so that

((2√T − 1)/(2√T)) Σ_{t=1}^T [f_t(w_t) + r(w_t)] ≤ (ρ√T/α) B_ψ(w*, w_1) + Σ_{t=1}^T [f_t(w*) + r(w*)].

5 Logarithmic Regret for Strongly Convex Functions

Following the vein of research begun in Hazan et al. (2006) and Shalev-Shwartz and Singer (2007), we show that Comid attains stronger regret guarantees when we assume curvature of the loss functions f_t or r. Similar to Shalev-Shwartz and Singer, we now assume that for all t, f_t + r is λ-strongly convex with respect to a differentiable function ψ, that is, for any w, v ∈ Ω,

f_t(v) + r(v) ≥ f_t(w) + r(w) + ⟨f_t'(w) + r'(w), v − w⟩ + λ B_ψ(v, w).  (7)

For example, when ψ(w) = ½‖w‖₂², we recover the usual definition of strong convexity. For simplicity, we assume that we push all the strong convexity into the function r so that the f_t are simply convex (clearly, this is possible by redefining f̂_t(w) = f_t(w) − λψ(w) if the f_t are λ-strongly convex). In this case, a straightforward corollary to Lemma 1 follows.

Corollary 6. Let the sequence {w_t} be defined by the update in Eq. (3) with step sizes η_t. Assume that B_ψ(·,·) is α-strongly convex with respect to a norm ‖·‖ and that r is λ-strongly convex with respect to ψ. Then for any w* ∈ Ω,

η_t (f_t(w_t) − f_t(w*)) + η_t (r(w_{t+1}) − r(w*)) ≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η_t²/(2α)) ‖f_t'(w_t)‖_*² − λη_t B_ψ(w*, w_{t+1}).

Proof: The proof is effectively identical to that of Lemma 1. We simply note that

r(w_{t+1}) − r(w*) ≤ ⟨r'(w_{t+1}), w_{t+1} − w*⟩ − λ B_ψ(w*, w_{t+1}),

so that

η_t [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)] ≤ η_t ⟨w_t − w*, f_t'(w_t)⟩ + η_t ⟨w_{t+1} − w*, r'(w_{t+1})⟩ − λη_t B_ψ(w*, w_{t+1}).

Now we simply proceed as in the proof of Lemma 1 following Eq. (5).

The above corollary almost immediately gives a logarithmic regret bound.
Theorem 7. Let r be λ-strongly convex with respect to a differentiable function ψ, and suppose ψ is α-strongly convex with respect to a norm ‖·‖. Assume that r(w_1) = 0. If ‖f_t'(w_t)‖_* ≤ G for all t, then

R_φ(T) ≤ λ B_ψ(w*, w_1) + (G²/(2λα)) (log T + 1) = O((G²/(λα)) log T).

Proof: Rearranging Corollary 6, we have

Σ_{t=1}^T [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)]
≤ Σ_{t=1}^T [ (1/η_t)(B_ψ(w*, w_t) − B_ψ(w*, w_{t+1})) − λ B_ψ(w*, w_{t+1}) ] + Σ_{t=1}^T (η_t/(2α)) ‖f_t'(w_t)‖_*²
= (1/η_1) B_ψ(w*, w_1) − (1/η_T) B_ψ(w*, w_{T+1}) + Σ_{t=1}^{T−1} B_ψ(w*, w_{t+1}) (1/η_{t+1} − 1/η_t − λ) − λ B_ψ(w*, w_{T+1}) + Σ_{t=1}^T (η_t/(2α)) ‖f_t'(w_t)‖_*².
If we set η_t = 1/(λt), then the first summation above is zero, and

Σ_{t=1}^T [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)] ≤ λ B_ψ(w*, w_1) + (1/(2λα)) Σ_{t=1}^T (1/t) ‖f_t'(w_t)‖_*².

Noting that Σ_{t=1}^T 1/t ≤ log T + 1 completes the proof of the theorem.

An interesting point regarding the above theorem is that we do not require B_ψ(w*, w_t) to be bounded or the set Ω to be compact, which previous work assumed. When the functions f_t are Lipschitz, then whenever r is strongly convex, Comid still attains logarithmic regret. Two notable examples attain the logarithmic bounds in the above theorem. It is clear that if r defines a valid Bregman divergence, then r is strongly convex with respect to itself in the sense of Eq. (7). First, consider optimization over the simplex with entropic regularization, that is, we set r(w) = λ Σ_i w_i log w_i and Ω = {w : w ⪰ 0, ⟨w, 1⟩ = 1}. In this case it is straightforward to see that r is λ-strongly convex with respect to ψ(w) = Σ_i w_i log w_i, which in turn is strongly convex with respect to the l1-norm over Ω (see Shalev-Shwartz and Singer, 2007, Definition 2 and Example 2). Since the dual of the l1-norm is the l∞-norm, we have R_φ(T) = O((log T/λ) max_t ‖f_t'(w_t)‖_∞²). We can also use r(w) = (λ/2)‖w‖₂², in which case we recover the same bounds as those in Hazan et al. (2006).

6 Stochastic Convergence Results

In this section, we examine the application of Comid to solving stochastic optimization problems. The techniques we use have a long history in online algorithms and make connections between the regret of the algorithm and generalization performance using martingale concentration results (Littlestone, 1989). We build on known techniques for data-driven generalization bounds (Cesa-Bianchi et al., 2004) to give concentration results for Comid in the stochastic optimization setting. Further work on this subject for the strongly convex case can be found in Kakade and Tewari (2008), though we focus on the case when f_t + r is weakly convex.
We let f(w) = E[f(w; Z)] = ∫ f(w; z) dP(z), and at every step t the algorithm receives an independent random variable Z_t ∼ P that gives an unbiased estimate f_t(w_t) = f(w_t; Z_t) of the function f evaluated at w_t and an unbiased estimate f_t'(w_t) = f'(w_t; Z_t) of an arbitrary subgradient f'(w_t) ∈ ∂f(w_t). We assume that B_ψ(w*, w_t) ≤ D² for all t and, for simplicity, that ‖f_t'(w_t)‖_* ≤ G for all t, both of which are satisfied when Ω is compact. We also assume without loss of generality that r(w_1) = 0. For example, our original problem, in which f(w) = (1/n) Σ_{i=1}^n f_i(w) and we randomly sample one f_i at each iteration, falls into this setup, resolving the question posed by Tseng (2009) on the existence of stochastic composite incremental subgradient methods.

Theorem 8. Given the assumptions on f and Ω in the above paragraph, let {w_t} be the sequence generated by Eq. (3). In addition, let w̄_T = (1/T) Σ_{t=1}^T w_t and η_t = D√α/(G√t). Then

P( f(w̄_T) + r(w̄_T) ≥ f(w*) + r(w*) + 2DG/√(αT) + ε ) ≤ exp(−Tαε²/(16D²G²)).

Alternatively, with probability at least 1 − δ,

f(w̄_T) + r(w̄_T) ≤ f(w*) + r(w*) + (DG/√(αT)) (2 + 4√(log(1/δ))).

Proof: We begin our derivation by recalling Lemma 1. Convexity of f and r imply

η_t [f(w_t) + r(w_{t+1}) − f(w*) − r(w*)]
≤ η_t ⟨w_t − w*, f'(w_t)⟩ + η_t ⟨w_{t+1} − w*, r'(w_{t+1})⟩
= η_t ⟨w_t − w*, f_t'(w_t)⟩ + η_t ⟨w_{t+1} − w*, r'(w_{t+1})⟩ + η_t ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩.

We now follow the same derivation as Lemma 1, leaving η_t ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ intact, thus

η_t [f(w_t) + r(w_{t+1}) − f(w*) − r(w*)] ≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η_t²/(2α)) ‖f_t'(w_t)‖_*² + η_t ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩.  (8)
Now we subtract r(w_{T+1}) ≥ 0 from both sides, use the assumption that r(w_1) = 0, divide by η_t, and sum to get

Σ_{t=1}^T [f(w_t) + r(w_t) − f(w*) − r(w*)]
≤ Σ_{t=1}^T (1/η_t) [B_ψ(w*, w_t) − B_ψ(w*, w_{t+1})] + Σ_{t=1}^T (η_t/(2α)) ‖f_t'(w_t)‖_*² + Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩
≤ (1/η_1) B_ψ(w*, w_1) + Σ_{t=2}^T B_ψ(w*, w_t) (1/η_t − 1/η_{t−1}) + (G²/(2α)) Σ_{t=1}^T η_t + Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩.  (9)

Let F_t be a filtration with Z_τ ∈ F_t for τ ≤ t. Since w_t ∈ F_{t−1},

E[⟨w_t − w*, f'(w_t) − f'(w_t; Z_t)⟩ | F_{t−1}] = ⟨w_t − w*, f'(w_t) − E[f'(w_t; Z_t) | F_{t−1}]⟩ = 0,

and thus the last sum in Eq. (9) is a martingale difference sequence. We next use our assumptions that B_ψ(w*, w_t) ≤ D² and (α/2)‖w* − w_t‖² ≤ B_ψ(w*, w_t), so that ‖w* − w_t‖ ≤ √(2/α) D. Then

⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ ≤ ‖w_t − w*‖ ‖f'(w_t) − f_t'(w_t)‖_* ≤ 2√(2/α) DG.

Thus Eq. (9) consists of a bounded-difference martingale, and we can use standard concentration techniques to get strong convergence guarantees. Applying Azuma's inequality,

P( Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ ≥ ε ) ≤ exp(−αε²/(16TD²G²)).  (10)

Define γ_T = (1/T) Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ and recall that B_ψ(w*, w_t) ≤ D². The convexity of f and r gives T[f(w̄_T) + r(w̄_T)] ≤ Σ_t f(w_t) + r(w_t), so that

T [f(w̄_T) + r(w̄_T)] ≤ T [f(w*) + r(w*)] + D² (1/η_1 + Σ_{t=2}^T (1/η_t − 1/η_{t−1})) + (G²/(2α)) Σ_t η_t + T γ_T
= T [f(w*) + r(w*)] + D²/η_T + (G²/(2α)) Σ_t η_t + T γ_T.

Setting η_t = D√α/(G√t), we have f(w̄_T) + r(w̄_T) ≤ f(w*) + r(w*) + 2DG/√(αT) + γ_T, and we can immediately apply Azuma's inequality from Eq. (10) to complete the theorem.

7 Special Cases and Derived Algorithms

In this section, we show specific instantiations of our framework for different regularization functions r, and we also show that some previously developed algorithms are special cases of the framework for optimization presented here. We also give results on learning matrices with Schatten p-norm divergences that generalize some recent interesting work on trace-norm regularization.

7.1 Fobos

The recently proposed Fobos algorithm of Duchi and Singer (2009) is comprised, at each iteration, of the following two steps:

w_{t+½} = w_t − η f_t'(w_t)   and   w_{t+1} = argmin_w ½‖w − w_{t+½}‖² + ηr(w).

It is straightforward to verify that the update

w_{t+1} = argmin_w ½‖w − w_t‖² + η ⟨f_t'(w_t), w⟩ + ηr(w)

is equivalent to the two-step update above.
Thus, Comid reduces to Fobos when we take ψ(w) = ½‖w‖₂² and Ω = R^d (with constant learning rate η). This also shows that we can run Fobos with iterates restricted to a convex set Ω ⊆ R^d. Further, our results give tighter convergence guarantees than Fobos; in particular, they do not depend in any negative way on the regularization function r.
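The equivalence can be checked numerically: with ψ(w) = ½‖w‖₂² and r(w) = λ‖w‖₁, the one-step Comid objective of Eq. (3) is coordinate-separable, so we can minimize it on a fine grid and compare against the two-step Fobos update; a sketch (all names ours):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fobos_step(w_t, g, eta, lam):
    """Fobos: explicit gradient step, then the prox step for r(w) = lam*||w||_1."""
    w_half = w_t - eta * g                     # step 1: gradient step
    return soft_threshold(w_half, eta * lam)   # step 2: closed-form prox

def comid_step_numeric(w_t, g, eta, lam, grid=np.linspace(-2, 2, 400001)):
    """One-step Comid objective of Eq. (3) with the Euclidean Bregman divergence,
    minimized coordinate by coordinate on a fine grid."""
    out = np.empty_like(w_t)
    for j in range(len(w_t)):
        obj = eta * g[j] * grid + 0.5 * (grid - w_t[j]) ** 2 + eta * lam * np.abs(grid)
        out[j] = grid[np.argmin(obj)]
    return out

w, g = np.array([0.9, -0.2, 0.05]), np.array([0.5, 0.1, 0.0])
a = fobos_step(w, g, eta=0.1, lam=1.0)
b = comid_step_numeric(w, g, eta=0.1, lam=1.0)
# the two updates agree up to the grid resolution
```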
It is also not difficult to show in general that the two-step process of setting

w_{t+½} = argmin_w B_ψ(w, w_t) + η ⟨f_t'(w_t), w⟩   and   w_{t+1} = argmin_w B_ψ(w, w_{t+½}) + ηr(w)

is equivalent to the original Comid update of Eq. (3) when Ω = R^d. Indeed, the optimal solution to the first step satisfies ∇ψ(w_{t+½}) − ∇ψ(w_t) + η f_t'(w_t) = 0, so that w_{t+½} = ∇ψ^{-1}(∇ψ(w_t) − η f_t'(w_t)). Then, looking at the optimal solution for the second step, for some r'(w_{t+1}) ∈ ∂r(w_{t+1}) we have

∇ψ(w_{t+1}) − ∇ψ(w_{t+½}) + η r'(w_{t+1}) = 0,   i.e.   ∇ψ(w_{t+1}) − ∇ψ(w_t) + η f_t'(w_t) + η r'(w_{t+1}) = 0.

This is exactly the optimality condition for the one-step update of Eq. (3).

7.2 p-norm divergences

Now we consider divergence functions ψ which are the l_p-norms squared: ψ(w) = ½‖w‖_p² is (p−1)-strongly convex over R^d with respect to the l_p-norm for any p ∈ (1, 2] (Ball et al., 1994). We see that if we choose ψ(w) = ½‖w‖_p² as the divergence function, we have a corollary to Theorem 2.

Corollary 9. Suppose that r(0) = 0 and that w_1 = 0. Let p = 1 + 1/log d and use the Bregman function ψ(w) = ½‖w‖_p². Further suppose that either Ω is compact or the f_t are Lipschitz, so that for q = 1 + log d, max_t ‖f_t'(w_t)‖_q ≤ G_q. Setting η = ‖w*‖_p/(G_q √(T log d)), the regret of Comid satisfies

R_φ(T) ≤ ‖w*‖_p G_q √(T log d) = O(‖w*‖₁ G_∞ √(T log d)).

Proof: Recall that the dual norm of an l_p-norm is the l_q-norm with q = p/(p−1). From Thm. 2, we immediately have that when w_1 = 0 and ψ(w) = ½‖w‖_p²,

R(T) ≤ (1/(2η)) ‖w*‖_p² + (η/(2(p−1))) Σ_{t=1}^T ‖f_t'(w_t)‖_q².

Now use the assumption that max_t ‖f_t'(w_t)‖_q ≤ G_q, replace p with 1 + 1/log d (so q = 1 + log d), and set η = c/√(T log d), which results in

R(T) ≤ √(T log d) [ ‖w*‖_p²/(2c) + (c/2) G_q² ].

Setting c = ‖w*‖_p/G_q gives us our desired result.

From the above, we see that Comid is a good candidate for (dense) problems in high dimensions, especially when we use l1-regularization. In high dimensions, taking p = 1 + 1/log d means our bounds depend roughly on the l1-norm of the optimal predictor w* ∈ R^d and the infinity norm of the function gradients f_t'.
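The gradient map ∇ψ of ψ(w) = ½‖w‖_p² and its inverse (the same map with the dual exponent q = p/(p−1)) are used repeatedly below; a small sketch of both (function names ours), with a numeric check of the round-trip identity:

```python
import numpy as np

def link(w, p):
    """Gradient of psi(w) = 0.5*||w||_p^2 (the p-norm link map)."""
    norm = np.linalg.norm(w, ord=p)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)

def link_inv(theta, p):
    """Inverse link: the same map applied with the dual exponent q = p/(p-1)."""
    q = p / (p - 1)
    return link(theta, q)

d = 6
p = 1 + 1 / np.log(d)     # the choice of p used in Corollary 9
w = np.array([0.3, -1.2, 0.0, 2.0, -0.7, 0.1])
theta = link(w, p)
w_back = link_inv(theta, p)
```

Note that both maps preserve signs and zeros coordinatewise, which is exactly the gradient condition Eq. (12) below relies on.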
Shalev-Shwartz and Tewari (2009) recently proposed the Stochastic MIrror Descent made Sparse algorithm (SMIDAS) using this intuition. We recover SMIDAS by taking the divergence in Comid to be ψ(w) = ½‖w‖_p² and r(w) = λ‖w‖₁. The Comid update is

∇ψ(w_{t+½}) = ∇ψ(w_t) − η f_t'(w_t),   w_{t+1} = argmin_w B_ψ(w, w_{t+½}) + ηλ‖w‖₁.

The SMIDAS update, on the other hand, is

∇ψ(w_{t+½}) = ∇ψ(w_t) − η f_t'(w_t),   ∇ψ(w_{t+1}) = S_{ηλ}(∇ψ(w_{t+½})),

where S_τ is the shrinkage/thresholding operator defined by

[S_τ(x)]_j = sign(x_j) [|x_j| − τ]_+.  (11)

The following lemma proves that the two updates are identical in cases including p-norm divergences.

Lemma 10. Suppose ψ is strongly convex and its gradient satisfies

sign([∇ψ(w)]_j) = sign(w_j).  (12)

Then the unique solution v of v = argmin_w {B_ψ(w, u) + τ‖w‖₁} is given by

∇ψ(v) = S_τ(∇ψ(u)).  (13)
Proof: Since ψ is strongly convex, the solution is unique. We will show that if v satisfies Eq. (13), then it is a solution to the problem. Therefore, suppose Eq. (13) holds. The proof proceeds by considering three cases.

Case I: [∇ψ(u)]_j > τ. In this case, [∇ψ(v)]_j = [∇ψ(u)]_j − τ > 0, and by Eq. (12), v_j > 0. Thus [∇ψ(v)]_j − [∇ψ(u)]_j + τ sign(v_j) = 0.

Case II: [∇ψ(u)]_j < −τ. In this case, [∇ψ(v)]_j = [∇ψ(u)]_j + τ < 0, and Eq. (12) implies v_j < 0. So [∇ψ(v)]_j − [∇ψ(u)]_j + τ sign(v_j) = 0.

Case III: [∇ψ(u)]_j ∈ [−τ, τ]. Here we can take v_j = 0, and Eq. (12) will give [∇ψ(v)]_j = 0. Thus 0 ∈ [∇ψ(v)]_j − [∇ψ(u)]_j + τ[−1, 1].

Combining the three cases, v satisfies 0 ∈ ∇ψ(v) − ∇ψ(u) + τ∂‖v‖₁, which is the optimality condition for v ∈ argmin_w {B_ψ(w, u) + τ‖w‖₁}. We thus have ∇ψ(v) = S_τ(∇ψ(u)), as desired.

Rewriting the above lemma slightly gives the following result. The solution to

w_{t+1} = argmin_w { B_ψ(w, w_t) + η ⟨f_t'(w_t), w⟩ + ηλ‖w‖₁ }

when ψ satisfies the gradient condition of Eq. (12) is

w_{t+1} = (∇ψ)^{-1}[ sign(∇ψ(w_t) − η f_t'(w_t)) · max{ |∇ψ(w_t) − η f_t'(w_t)| − ηλ, 0 } ] = (∇ψ)^{-1}[ S_{ηλ}(∇ψ(w_t) − η f_t'(w_t)) ].  (14)

Note that when ψ(·) = ½‖·‖_p² we recover Shalev-Shwartz and Tewari's SMIDAS, while with p = 2 we get Langford et al.'s 2009 truncated gradient method. See Shalev-Shwartz and Tewari (2009) or Gentile and Littlestone (1999) for the simple formulae to compute (∇ψ)^{-1} and ∇ψ.

l∞-regularization. Let us now consider the problem of setting r(w) to be a general l_{p₂}-norm (we will specialize this to l∞ shortly). We describe the dual function and then use it to derive a few particular updates, mentioning an open problem. First, let p₁ > 1 be the exponent associated with the Bregman function ψ and p₂ ≥ 1 be the exponent for r(w) = λ‖w‖_{p₂}. Let q_i = p_i/(p_i − 1) be the associated dual exponents. Then, ignoring constants, the minimization problem from Eq. (3) becomes

min_w ⟨v, w⟩ + ½‖w‖²_{p₁} + λ‖w‖_{p₂}.

We introduce a variable z = w and get the equivalent problem min_{w = z} ⟨v, w⟩ + ½‖w‖²_{p₁} + λ‖z‖_{p₂}.
To derive the dual of the problem, we introduce the Lagrange multiplier θ and find the Lagrangian

L(w, z, θ) = ⟨v − θ, w⟩ + ½‖w‖²_{p₁} + λ‖z‖_{p₂} + ⟨θ, z⟩.

Taking the infimum over w and z in the Lagrangian, since the conjugate of ½‖·‖²_p is ½‖·‖²_q when 1/p + 1/q = 1 (Boyd and Vandenberghe, 2004, Example 3.27), we have

inf_w [ ⟨v − θ, w⟩ + ½‖w‖²_{p₁} ] = −½‖v − θ‖²_{q₁},   inf_z [ λ‖z‖_{p₂} + ⟨θ, z⟩ ] = { 0 if ‖θ‖_{q₂} ≤ λ; −∞ otherwise }.

Thus, our dual is the non-Euclidean projection problem

min_θ ½‖v − θ‖²_{q₁}   s.t.   ‖θ‖_{q₂} ≤ λ.

The Lagrangian earlier is differentiable with respect to w, so we can recover the optimal w from θ by noting that when ψ(w) = ½‖w‖²_{p₁}, we have ∇ψ(w) + v − θ = 0 at optimum, or w = (∇ψ)^{-1}(θ − v). When p₂ = 1, we easily recover Eq. (14) as our update. However, the case p₂ = ∞ is more interesting, as it can be a building block for group sparsity (Obozinski et al., 2007). In this case our problem is

min_θ ½‖v − θ‖²_{q₁}   s.t.   ‖θ‖₁ ≤ λ.

It is clear by symmetry in the above that we can assume v ⪰ 0 with no loss of generality. We can raise the l_{q₁}-norm to a power greater than 1 and maintain convexity, so our equivalent problem is

min_θ (1/q) ‖v − θ‖^q_q   s.t.   ⟨1, θ⟩ ≤ λ,  θ ⪰ 0,  (15)

where we abbreviate q = q₁.
Let θ* be the solution of Eq. (15). Clearly, at optimum we will have θ*_i ≤ v_i, though we use this only for clarity in the derivation and omit such constraints, as they do not affect the optimization problem. Introducing Lagrange multipliers ν and α ⪰ 0, we get the Lagrangian

L(θ, ν, α) = (1/q) Σ_{i=1}^d (v_i − θ_i)^q + ν(⟨1, θ⟩ − λ) − ⟨α, θ⟩.

Taking the derivative of the above, we have

−(v_i − θ_i)^{q−1} + ν − α_i = 0   ⟹   θ_i = v_i − (ν − α_i)^{1/(q−1)}.

Now suppose we knew the optimal ν. If an index i satisfies ν ≥ v_i^{q−1}, then we will have θ_i = 0. To see this, suppose for the sake of contradiction that θ_i > 0. The KKT conditions for optimality of Eq. (15) (Boyd and Vandenberghe, 2004) imply that for such i we have α_i = 0, so θ_i = v_i − ν^{1/(q−1)} ≤ 0, a contradiction. Similarly, if ν < v_i^{q−1}, then θ_i > 0. Indeed, since α_i ≥ 0 would only increase θ_i, we have θ_i = v_i − (ν − α_i)^{1/(q−1)} ≥ v_i − ν^{1/(q−1)} > v_i − v_i = 0, so that θ_i > 0 and the KKT conditions imply α_i = 0. Had we known ν, the optimal θ_i would have been easy to obtain as θ_i(ν) = v_i − (min{ν, v_i^{q−1}})^{1/(q−1)} (note that this satisfies 0 ≤ θ_i ≤ v_i). Since we know that the structure of the optimal θ must obey the above equation, we can boil our problem down to finding ν ≥ 0 so that

Σ_{i=1}^d θ_i(ν) = Σ_{i=1}^d [ v_i − (min{ν, v_i^{q−1}})^{1/(q−1)} ] = λ.  (16)

Interestingly, this reduces to exactly the same root-finding problem as that for Euclidean projection onto an l1-ball (Duchi et al., 2008). As shown by Duchi et al., it is straightforward to find the optimal ν in time linear in the dimension d. An open problem is to find an efficient algorithm for solving the generalized projections above when using the l2- rather than the l∞-norm.

Mixed-norm regularization. Now we consider the problem of mixed-norm regularization, in which we wish to minimize functions f_t(W) + r(W) of a matrix W ∈ R^{d×k}. In particular, we define w_i ∈ R^k to be the i-th row of W, and we set r(W) = λ Σ_{i=1}^d ‖w_i‖_{p₂}.
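The root-finding problem of Eq. (16) above can also be solved by simple bisection over ν, since the left-hand side is continuous and nonincreasing in ν; a short sketch (function names ours, assuming v ⪰ 0 and Σ_i v_i > λ so a root exists):

```python
import numpy as np

def find_nu(v, lam, q, tol=1e-10):
    """Bisection for Eq. (16): sum_i [v_i - min(nu, v_i**(q-1))**(1/(q-1))] = lam.

    Assumes v >= 0 (valid by symmetry) and sum(v) > lam, so a root nu > 0 exists.
    """
    theta_sum = lambda nu: np.sum(v - np.minimum(nu, v ** (q - 1)) ** (1.0 / (q - 1)))
    lo, hi = 0.0, float(np.max(v) ** (q - 1))   # theta_sum(lo) = sum(v), theta_sum(hi) = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # theta_sum is nonincreasing in nu: shrink the bracket toward the root
        if theta_sum(mid) > lam:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

v = np.array([0.5, 1.0, 2.0, 0.1])
lam = 1.5
nu = find_nu(v, lam, q=2.0)
theta = v - np.minimum(nu, v)   # for q = 2, theta_i(nu) = max(v_i - nu, 0)
```

For q = 2 this is exactly the thresholding form familiar from l1-ball projection; the linear-time pivot method of Duchi et al. (2008) replaces the bisection when speed matters.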
We also use the p-norm Bregman functions as above, with ψ(W) = ½‖W‖²_p = ½(Σ_{i,j} |W_ij|^p)^{2/p}, which are (p−1)-strongly convex with respect to the l_p-norm. As earlier, our minimization problem becomes

min_W ⟨V, W⟩ + ½‖W‖²_p + λ‖W‖_{l1/l_{p₂}},

whose dual problem is

min_Θ ½‖V − Θ‖²_q   s.t.   ‖θ_i‖_{q₂} ≤ λ for each row i.

Raising the first norm to the q-th power, we see that the problem is separable across rows, and we can solve it using the techniques in the prequel.

7.3 Matrix Composite Mirror Descent

We now consider a setting that generalizes the previous discussions, in which our variables W_t are matrices, W_t ∈ Ω = R^{d1×d2}. We use Bregman functions based on Schatten p-norms (e.g. Horn and Johnson, 1985, Section 7.4). Schatten p-norms are the family of unitarily invariant matrix norms arising from applying p-norms to the singular values of the matrix W. That is, letting σ(W) denote the vector of singular values of W, we set ‖W‖_p = ‖σ(W)‖_p. We use Bregman functions ψ(W) = ½‖W‖²_p, which, similar to the p-norm on vectors, are (p−1)-strongly convex over Ω = R^{d1×d2} with respect to the norm ‖·‖_p (Ball et al., 1994). As in the previous subsection, we mainly consider two values for p: p = 2 and a value very near 1, namely p = 1 + 1/log d. For p = 2, ψ is 1-strongly convex with respect to ‖·‖₂ = ‖·‖_Fr, the
Frobenius norm. For the second value, ψ is (1/log d)-strongly convex with respect to ‖·‖_{1+1/log d}, or, with a bit more work, ψ is (1/(3 log d))-strongly convex w.r.t. ‖·‖₁, the trace or nuclear norm. We focus on the specific setting of trace-norm regularization, r(W) = λ‖W‖₁. This norm, similar to the l1-norm on vectors, encourages sparsity in the singular value spectrum of W and hence is useful for rank minimization (Recht et al., 2007). The generic Comid update with the above choice of ψ gives a Schatten p-norm Comid algorithm for matrix applications:

W_{t+1} = argmin_{W ∈ Ω} η ⟨f_t'(W_t), W⟩ + B_ψ(W, W_t) + ηλ‖W‖₁.  (17)

The update is well defined since ψ is strongly convex, as per the above discussion. We have also defined ⟨W, V⟩ = tr(WᵀV) as the usual matrix inner product. The generic Comid convergence result in Thm. 2 immediately yields the following two corollaries.

Corollary 11. Let the sequence {W_t} be defined by the update in Eq. (17) with p = 2. If each f_t satisfies ‖f_t'(W_t)‖₂ ≤ G₂, then there is a stepsize η for which the regret against W* ∈ Ω is

R_φ(T) ≤ G₂ ‖W*‖₂ √T.

Corollary 12. Let p = 1 + 1/log d in the Schatten Comid update of Eq. (17), and let q = 1 + log d. If each f_t satisfies ‖f_t'(W_t)‖_q ≤ G_q, then

R_φ(T) ≤ G_q ‖W*‖_p √(T log d) ≈ G_∞ ‖W*‖₁ √(T log d),

where G_∞ = max_t ‖f_t'(W_t)‖_∞ = max_t σ_max(f_t'(W_t)).

Let us consider the actual implementation of the Comid update. Similar convergence rates (with worse constants and a negative dependence on the spectrum of r) to those above can be achieved via simple mirror descent, i.e. by linearizing r(W). The advantage of Comid is that it achieves sparsity in the spectrum, as the following proposition demonstrates.

Proposition 13. Let Ψ(w) = ½‖w‖²_p and S_τ(v) = sign(v)[|v| − τ]_+ as in the prequel. For p ∈ (1, 2], the update in Eq. (17) can be implemented as follows.
1. Compute SVD: W_t = U_t diag(σ(W_t)) V_t⊤
2. Gradient step: Θ_t = U_t diag(∇Ψ(σ(W_t))) V_t⊤ − η f_t′(W_t) (8a)
3. Compute SVD: Θ_t = Ũ_t diag(σ(Θ_t)) Ṽ_t⊤
4. Splitting update: W_{t+1} = Ũ_t diag((∇Ψ)^{−1}(S_{ηλ}(σ(Θ_t)))) Ṽ_t⊤ (8b)

Note that the first SVD, in Eq. (8a), is used for notational convenience only and need not be computed at each iteration, since it is maintained at the end of iteration t via Eq. (8b). The computational requirements for Comid are thus the same as for standard mirror descent, which also requires an SVD computation at each step to compute ∇ψ(W_t). The last step of the update, Eq. (8b), applies the shrinkage/thresholding operator S_{ηλ} to the spectrum of Θ_t, which introduces sparsity. Furthermore, due to the sign- (and hence sparsity-) preserving nature of the map (∇Ψ)^{−1}, the sparsity in the spectrum is maintained in W_{t+1}. Lastly, the special case for p = 2, the standard Frobenius norm update, was derived (but without rates or allowing stochastic gradients) by Ma et al. (2009), who report good empirical results for their algorithm. In trace-norm applications, we expect ‖W*‖_1 to be small. Therefore, in such applications, our new Schatten-p COMID algorithm with p near 1 should give strong performance, since G_∞ can be much smaller than G_2.

Proof of Proposition 3: We know from the prequel that the Comid step is equivalent to ∇ψ(W̃_t) = ∇ψ(W_t) − η f_t′(W_t) and W_{t+1} = argmin_W { B_ψ(W, W̃_t) + η r(W) }. Since W_t has singular value decomposition U_t diag(σ(W_t)) V_t⊤ and ψ(W) = Ψ(σ(W)) is unitarily invariant, ∇ψ(W_t) = U_t diag(∇Ψ(σ(W_t))) V_t⊤ (Lewis, 1995, Corollary 2.5). This means that the Θ_t computed in step 2 above is simply ∇ψ(W̃_t). The proof essentially amounts to a reduction to the vector case, since the norms are unitarily invariant, and will be complete if we prove that V = argmin_W { B_ψ(W, W̃_t) + τ‖W‖_1 }
has the unique solution

V = Ũ_t diag((∇Ψ)^{−1}(S_τ(σ(∇ψ(W̃_t))))) Ṽ_t⊤, (9)

where Ũ_t diag(σ(W̃_t)) Ṽ_t⊤ is the SVD of W̃_t. By subgradient optimality conditions, it is sufficient that the proposed solution V satisfy

0_{d₁×d₂} ∈ ∇ψ(V) − ∇ψ(W̃_t) + τ ∂‖V‖_1.

Applying Lewis's Corollary 2.5, we can continue to use the orthonormal matrices Ũ_t and Ṽ_t, and we see that the proposed V in Eq. (9) satisfies ∇ψ(V) = Ũ_t diag(∇Ψ(ν)) Ṽ_t⊤ and V = Ũ_t diag(ν) Ṽ_t⊤, where ν = (∇Ψ)^{−1}(S_τ(σ(∇ψ(W̃_t)))). We have thus reduced the problem to showing that

0_d ∈ ∇Ψ(ν) − ∇Ψ(σ(W̃_t)) + τ ∂‖ν‖_1,

since we chose the matrices to have the same singular vectors by construction. From Lemma 10 presented earlier, we already know that ν satisfies this equation if and only if ∇Ψ(ν) = S_τ(∇Ψ(σ(W̃_t))), which is indeed the case by the definition of ν (noting of course that σ(∇ψ(W̃_t)) = ∇Ψ(σ(W̃_t))).

Acknowledgments

JCD was supported by a National Defense Science and Engineering Graduate (NDSEG) fellowship, and some of this work was completed while JCD was an intern at Google Research.

References

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357–367, 1967.

K. Ball, E. Carlen, and E. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115:463–482, 1994.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161–168, 2008.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

J. Duchi and Y. Singer. Online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

C. Gentile and N. Littlestone. The robustness of the p-norm algorithms. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the 19th Annual Conference on Computational Learning Theory, 2006.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
S. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 22. MIT Press, 2009.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

A. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2:173–183, 1995.

P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16:964–979, 1979.

N. Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, July 1989.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Submitted, 2009.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 2007/76, Center for Operations Research and Econometrics, Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical Report 743, Dept. of Statistics, University of California, Berkeley, 2007.

B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Submitted, 2007.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.

S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, 2008.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, 2009.

N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.

P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Submitted.

S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.
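To make the update of Proposition 3 concrete, here is a minimal NumPy sketch (ours, not from the paper) of one COMID step for the Frobenius case p = 2 with trace-norm regularization. For p = 2 the link function ∇Ψ is the identity, so the two-SVD recipe of Eqs. (8a)–(8b) reduces to soft-thresholding the singular values of the gradient iterate; the function name and toy data below are hypothetical.

```python
import numpy as np

def frobenius_comid_step(W, grad, eta, lam):
    """One COMID step with psi(W) = (1/2)||W||_Fr^2 and r(W) = lam * ||W||_1 (trace norm).

    With p = 2, grad Psi is the identity, so Eq. (8a) is a plain gradient step
    and Eq. (8b) soft-thresholds the singular values, sparsifying the spectrum.
    """
    Theta = W - eta * grad                      # gradient step, Eq. (8a) with p = 2
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    s_thresh = np.maximum(s - eta * lam, 0.0)   # shrinkage operator S_{eta*lam} on the spectrum
    return U @ np.diag(s_thresh) @ Vt           # splitting update, Eq. (8b)

# Toy usage: one step on f(W) = (1/2)||W - A||_Fr^2 starting from W = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
W = np.zeros((5, 4))
W1 = frobenius_comid_step(W, W - A, eta=1.0, lam=0.5)
```

For general p ∈ (1, 2] one would additionally apply ∇Ψ before the gradient step and (∇Ψ)^{−1} after thresholding, exactly as in steps 2 and 4 of Proposition 3.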
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationEcon 201: Problem Set 3 Answers
Econ 20: Problem Set 3 Ansers Instructor: Alexandre Sollaci T.A.: Ryan Hughes Winter 208 Question a) The firm s fixed cost is F C = a and variable costs are T V Cq) = 2 bq2. b) As seen in class, the optimal
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationA survey: The convex optimization approach to regret minimization
A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization
More information