Composite Objective Mirror Descent
Composite Objective Mirror Descent

John C. Duchi (UC Berkeley), Shai Shalev-Shwartz (Hebrew University), Yoram Singer (Google Research), Ambuj Tewari (TTI Chicago)

Abstract. We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as l1, mixed norm, and trace-norm.

1 Introduction and Problem Statement

Regularized loss minimization is a common learning paradigm in which one jointly minimizes an empirical loss over a training set plus a regularization term. The paradigm yields an optimization problem of the form

min_{w ∈ Ω} (1/n) Σ_{t=1}^n f_t(w) + r(w),  (1)

where Ω ⊆ R^d is the domain (a closed convex set), f_t : Ω → R is a (convex) loss function associated with a single example in a training set, and r : Ω → R is a (convex) regularization function. A few examples of famous learning problems that fall into this framework are least squares, ridge regression, support vector machines, support vector regression, lasso, and logistic regression. In this paper, we describe and analyze a general framework for solving Eq. (1). The method we propose is a first-order approach, meaning that we access the functions f_t only by receiving subgradients. Recent work has shown that from the perspective of achieving good statistical performance on unseen data, first-order methods are preferable to higher-order approaches, especially when the number of training examples n is very large (Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008).
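As a concrete instance of Eq. (1), the composite lasso objective can be evaluated directly; a minimal Python sketch (the data and function names here are ours, for illustration only):

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """Eq. (1) with f_t(w) = 0.5*(<x_t, w> - y_t)^2 and r(w) = lam*||w||_1."""
    losses = 0.5 * (X @ w - y) ** 2          # one squared loss per training example
    return losses.mean() + lam * np.abs(w).sum()

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w = np.array([1.0, -1.0])
obj = lasso_objective(w, X, y, lam=0.1)      # mean loss 0.25 plus penalty 0.2
```

Each f_t touches a single training example, which is exactly the access pattern the first-order methods below exploit.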
Furthermore, in large-scale problems it is often prohibitively expensive to compute the gradient of the entire objective function (thus accessing all the examples in the training set), and randomly choosing a subset of the training set and computing the gradient over the subset (perhaps only a single example) can be significantly more efficient. This approach is very closely related to online learning. Our general framework handles both cases with ease: it applies to accessing a single example (or a subset of the examples) at each iteration, or to accessing the entire training set at each iteration. The method we describe is an adaptation of the Mirror Descent (MD) algorithm (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003), an iterative method for minimizing a convex function φ : Ω → R. If the dimension d is large enough, MD is optimal among first-order methods, and it has a close connection to online learning since it is possible to bound the regret Σ_t φ_t(w_t) − inf_{w ∈ Ω} Σ_t φ_t(w), where {w_t} is the sequence generated by mirror descent and the φ_t are convex functions. In fact, one can view popular online learning algorithms, such as weighted majority (Littlestone and Warmuth, 1994) and online gradient descent (Zinkevich, 2003), as special cases of mirror descent. A guarantee on the online regret can be translated directly to a guarantee on the convergence rate of the algorithm to the optimum of Eq. (1), as we will show later.
2 Folloing Beck and Teboulle s exposition, a Mirror Descent update in the online setting can be ritten as t+ = argmin B ψ (, t ) + η φ t( t ), t, (2) Ω here B ψ is a Bregman divergence and φ t denotes an arbitrary subgradient of φ t. Intuitively, MD minimizes a first-order approximation of the function φ t at the current iterate t hile forcing the next iterate t+ to lie close to t. The step-size η controls the trade-off beteen these to. Our focus in this paper is to generalize mirror descent to the case hen the functions φ t are composite, that is, they consist of to parts: φ t = f t + r. Here the f t change over time but the function r remains constant. Of course, one can ignore the composite structure of the φ t and use MD. Hoever, doing so can result in undesirable effects. For example, hen r() =, applying MD directly does not lead to sparse updates. Since the sparsity inducing property of the l -norm is a major reason for its use. The modification of mirror descent that e propose is simple: t+ argmin Ω η f t( t ), + B ψ (, t ) + ηr(). (3) This is almost the same as the mirror descent update ith an important difference: e do not linearize r. We call this algorithm Composite Objective MIrror Descent, or Comid. One of our contributions is to sho that, in a variety of cases, the Comid update is no costlier than the usual mirror descent update. In these situations, each Comid update is efficient and benefits from the presence of the regularizer r(). We no outline the remainder of the paper. We begin by revieing related ork, of hich there is a copious amount, though e try to do some justice to prior research. We then give a general O( T ) regret bound for Comid in the online optimization setting, after hich e give several extensions. We sho O(log T ) regret bounds for Comid hen the composite functions f t + r are strongly convex, after hich e sho convergence rates and concentration results for stochastic optimization using Comid. 
The second focus of the paper is the derived algorithms, where we outline step rules for several choices of Bregman function ψ and regularizer r, including l1, l∞, and mixed-norm regularization, as well as presenting new results on efficient matrix optimization with Schatten p-norms.

2 Related Work

Since the idea underlying Comid is simple, it is not surprising that similar algorithms have been proposed. One of our main contributions is to show that Comid generalizes much prior work and to give a clean unifying analysis. We do not have the space to thoroughly review the literature, though we try to do some small justice to what is known. We begin by reviewing work that we will show is a special case of Comid. Forward-backward splitting is a long-studied framework for minimizing composite objective functions (Lions and Mercier, 1979), though it has only recently been analyzed for the online and stochastic case (Duchi and Singer, 2009). Specializations of forward-backward splitting to the case where r(w) = ‖w‖₁ include iterative shrinkage and thresholding from the signal processing literature (Daubechies et al., 2004); from machine learning, Truncated Gradient (Langford et al., 2009) and SMIDAS (Shalev-Shwartz and Tewari, 2009) are both special cases of Comid. In the optimization community there has been significant recent interest, both applied and theoretical, in the minimization of composite objective functions such as that in Eq. (1). Some notable examples include Wright et al. (2009); Nesterov (2007); Tseng (2009). These papers all assume that the objective f + r to be minimized is fixed and that f is smooth, i.e. that it has Lipschitz continuous derivatives. The most closely related of these to Comid is probably Tseng (2009, see his Sec. 3.1 and the references therein), which proposes the same update as ours but gives a Nesterov-like optimal method for the fixed-f case. We place no such restrictions on f, though by going to stochastic, nondifferentiable f we naturally suffer in convergence rate.
Nonetheless, we do answer in the affirmative a question posed by Tseng (2009), which is whether stochastic or incremental subgradient methods work for composite objectives. Two recent papers on online and stochastic composite objective minimization are Xiao (2009) and Duchi and Singer (2009). The former extends Nesterov's 2009 analysis of primal-dual subgradient methods to the composite case, giving an algorithm which is similar to ours; however, our algorithms are different and the analysis for each is completely different. Duchi and Singer (2009) is simply a specialization of Comid to the case where the Euclidean Bregman divergence is used. As a consequence of our general setting, we are able to give elegant new algorithms for minimization of functions on matrices, which include efficient and simple algorithms for trace-norm
minimization. Trace-norm minimization has recently found strong applicability in matrix rank minimization (Recht et al., 2007), which has been shown to be very useful, for example, in collaborative filtering (Srebro et al., 2004). A special case of Comid has recently been developed for this task, which is very similar in spirit to fixed-point and shrinkage methods from signal processing for l1-minimization (Ma et al., 2009). The authors of this paper note that the method is extremely efficient for rank-minimization problems but do not give rates of convergence, which we give as a corollary to our main convergence theorems.

3 Notation and Setting

Before continuing, we establish notation and our problem setting formally. Vectors are lower case bold italic letters, such as x ∈ R^d, and scalars are lower case italics such as x ∈ R. We denote a sequence of vectors by subscripts, i.e. w_t, w_{t+1}, ..., and entries in a vector by non-bold subscripts as in w_j. Matrices are upper case bold italic letters, such as W ∈ R^{d×d}. The subdifferential set of a function f evaluated at w is denoted ∂f(w) and a particular subgradient by f'(w) ∈ ∂f(w). When a function is differentiable, we write ∇f(w). We focus mostly on the problem of regularized online learning, in which the goal is to achieve low regret w.r.t. a static predictor w* ∈ Ω on a sequence of functions φ_t(w) ≜ f_t(w) + r(w). Here, f_t and r ≥ 0 are convex functions, and Ω is some convex set (which could be R^d). Formally, at every round of the algorithm we make a prediction w_t ∈ R^d and then receive the function f_t. We seek bounds on the regularized regret with respect to w*, defined as

R_φ(T, w*) ≜ Σ_{t=1}^T [ f_t(w_t) + r(w_t) − f_t(w*) − r(w*) ].  (4)

In batch optimization we set f_t = f for all t, while in stochastic optimization we choose f_t to be the average of some random subset of {f_1, ..., f_n}. As mentioned previously and as we will show, it is not difficult to transform regret bounds for Eq. (4) into convergence rates in expectation and with high probability for Eq.
(1), which we do using techniques similar to Cesa-Bianchi et al. (2004). Throughout, ψ designates a continuously differentiable function that is α-strongly convex w.r.t. a norm ‖·‖ on the set Ω. Recall that this means that the Bregman divergence associated with ψ,

B_ψ(w, v) = ψ(w) − ψ(v) − ⟨∇ψ(v), w − v⟩,

satisfies B_ψ(w, v) ≥ (α/2)‖w − v‖² for some α > 0.

4 Composite Objective MIrror Descent

We use proof techniques similar to those in Beck and Teboulle (2003) to derive progress bounds on each step of the algorithm. We then use the bounds to straightforwardly prove convergence results for online and batch learning. We begin by bounding the progress made by each step of the algorithm in either an online or a batch setting. This lemma is the key to our later analysis, so we prove it in full here.

Lemma 1. Let the sequence {w_t} be defined by the update in Eq. (3). Assume that B_ψ(·,·) is α-strongly convex with respect to a norm ‖·‖, that is, B_ψ(w, v) ≥ (α/2)‖w − v‖². For any w* ∈ Ω,

η (f_t(w_t) − f_t(w*)) + η (r(w_{t+1}) − r(w*)) ≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η²/(2α)) ‖f_t'(w_t)‖_*².

Proof: The optimality of w_{t+1} for Eq. (3) implies, for all w ∈ Ω and some r'(w_{t+1}) ∈ ∂r(w_{t+1}),

⟨w − w_{t+1}, η f_t'(w_t) + ∇ψ(w_{t+1}) − ∇ψ(w_t) + η r'(w_{t+1})⟩ ≥ 0.  (5)

In particular, this obtains for w = w*. From the subgradient inequality for convex functions, we have f_t(w*) ≥ f_t(w_t) + ⟨f_t'(w_t), w* − w_t⟩, or f_t(w_t) − f_t(w*) ≤ ⟨f_t'(w_t), w_t − w*⟩, and likewise for r(w_{t+1}). We thus have

η [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)]
≤ η ⟨w_t − w*, f_t'(w_t)⟩ + η ⟨w_{t+1} − w*, r'(w_{t+1})⟩
= η ⟨w_{t+1} − w*, f_t'(w_t)⟩ + η ⟨w_{t+1} − w*, r'(w_{t+1})⟩ + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩
= ⟨w* − w_{t+1}, ∇ψ(w_t) − ∇ψ(w_{t+1}) − η f_t'(w_t) − η r'(w_{t+1})⟩ + ⟨w* − w_{t+1}, ∇ψ(w_{t+1}) − ∇ψ(w_t)⟩ + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩.
Now, by Eq. (5), the first term in the last equation is non-positive. Thus we have

η [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)]
≤ ⟨w* − w_{t+1}, ∇ψ(w_{t+1}) − ∇ψ(w_t)⟩ + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩
= B_ψ(w*, w_t) − B_ψ(w_{t+1}, w_t) − B_ψ(w*, w_{t+1}) + η ⟨w_t − w_{t+1}, f_t'(w_t)⟩  (6)
= B_ψ(w*, w_t) − B_ψ(w_{t+1}, w_t) − B_ψ(w*, w_{t+1}) + ⟨√α (w_t − w_{t+1}), (η/√α) f_t'(w_t)⟩
≤ B_ψ(w*, w_t) − B_ψ(w_{t+1}, w_t) − B_ψ(w*, w_{t+1}) + (α/2) ‖w_t − w_{t+1}‖² + (η²/(2α)) ‖f_t'(w_t)‖_*²
≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η²/(2α)) ‖f_t'(w_t)‖_*².

In the above, the first equality follows from simple algebra with B_ψ, that is, ⟨∇ψ(b) − ∇ψ(a), c − a⟩ = B_ψ(c, a) + B_ψ(a, b) − B_ψ(c, b), setting c = w*, a = w_{t+1}, and b = w_t. The second-to-last inequality follows from the Fenchel-Young inequality applied to the conjugate pair ½‖·‖², ½‖·‖_*² (Boyd and Vandenberghe, 2004, Example 3.27). The last inequality follows from the strong convexity of B_ψ with respect to the norm ‖·‖.

The following theorem uses Lemma 1 to establish a general regret bound for the Comid framework.

Theorem 2. Let the sequence {w_t} be defined by the update in Eq. (3). Then for any w* ∈ Ω,

R_φ(T, w*) ≤ (1/η) B_ψ(w*, w_1) + r(w_1) + (η/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*².

Proof: By Lemma 1,

η Σ_{t=1}^T [f_t(w_t) − f_t(w*) + r(w_{t+1}) − r(w*)]
≤ Σ_{t=1}^T [B_ψ(w*, w_t) − B_ψ(w*, w_{t+1})] + (η²/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*²
= B_ψ(w*, w_1) − B_ψ(w*, w_{T+1}) + (η²/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*².

Since Bregman divergences are always non-negative, and recalling our assumption that r(w) ≥ 0, adding ηr(w_1) to both sides of the above equation and dropping the r(w_{T+1}) term gives

η Σ_{t=1}^T [f_t(w_t) − f_t(w*) + r(w_t) − r(w*)] ≤ B_ψ(w*, w_1) + ηr(w_1) + (η²/(2α)) Σ_{t=1}^T ‖f_t'(w_t)‖_*².

Dividing each side by η gives the result.

A few corollaries are immediate from the above result. First, suppose that the functions f_t are Lipschitz continuous. Then there is some G such that ‖f_t'(w_t)‖_* ≤ G. In this case, we have

Corollary 3. Let {w_t} be generated by the update Eq. (3) and assume that the functions f_t are Lipschitz with dual Lipschitz constant G. Then

R_φ(T) ≤ (1/η) B_ψ(w*, w_1) + r(w_1) + (Tη/(2α)) G².

If we take η ∝ 1/√T, then we have a regret which is O(√T) when the functions f_t are Lipschitz. If Ω is compact, the f_t are guaranteed to be Lipschitz continuous (Rockafellar, 1970).
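To illustrate the batch use of these bounds, one can run the Euclidean instantiation of Eq. (3) on a small composite problem and check that the iterates approach the minimizer; a self-contained sketch (the toy problem and all names are ours):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# batch composite problem: f(w) = 0.5*||w - c||^2 and r(w) = lam*||w||_1,
# whose minimizer is the soft-thresholded vector S_lam(c)
c = np.array([1.0, -0.3, 0.05])
lam = 0.2
eta = 0.5

w = np.zeros(3)
for _ in range(200):
    grad = w - c                                   # gradient of f at w
    # Euclidean Comid step: soft threshold the gradient step
    w = soft_threshold(w - eta * grad, eta * lam)

w_star = soft_threshold(c, lam)                    # exact minimizer
```

With a fixed step size the iterates contract geometrically to w_star on this smooth problem; the theorems above cover the harder nonsmooth, online, and stochastic settings.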
Corollary 4. Suppose that either Ω is compact or the functions f_t are Lipschitz, so that ‖f_t'‖_* ≤ G. Also assume r(w_1) = 0. Then setting η = √(2α B_ψ(w*, w_1))/(G√T),

R_φ(T) ≤ √(2T B_ψ(w*, w_1)) G/√α.

It is straightforward to prove results under the slightly different restriction that ‖f_t'(w)‖_*² ≤ ρ f_t(w), which is similar to assuming a Lipschitz condition on the gradient of f_t. A common example in which this holds is linear regression, where f_i(w) = ½(⟨w, x_i⟩ − y_i)², so f_i'(w) = (⟨w, x_i⟩ − y_i) x_i and ρ = 2‖x_i‖². The proof essentially amounts to dividing out constants dependent on η and ρ from both sides of the regret.
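For the linear-regression example, the restriction in fact holds with equality, since ‖f_i'(w)‖₂² = (⟨w, x_i⟩ − y_i)²‖x_i‖₂² = 2‖x_i‖₂² f_i(w); a small numeric check (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), 0.7
w = rng.normal(size=5)

# squared loss for one example and its gradient
f = 0.5 * (w @ x - y) ** 2
grad = (w @ x - y) * x

rho = 2 * np.dot(x, x)  # rho = 2 * ||x_i||_2^2
# ||f'(w)||^2 = 2 * ||x||^2 * f(w) = rho * f(w), so the restriction holds
assert np.isclose(np.dot(grad, grad), rho * f)
```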
Corollary 5. Let ‖f_t'(w)‖_*² ≤ ρ f_t(w), r ≥ 0, and assume r(w_1) = 0. Setting η ∝ 1/√T gives R_φ(T) = O(ρ√T B_ψ(w*, w_1)/α).

Proof: From Theorem 2, the non-negativity of r, and the fact that r(w_1) = 0, we immediately have

(1 − ρη/(2α)) Σ_{t=1}^T [f_t(w_t) + r(w_t)] ≤ (1/η) B_ψ(w*, w_1) + Σ_{t=1}^T [f_t(w*) + r(w*)].

Setting η = α/(ρ√T) gives 1 − ρη/(2α) = (2√T − 1)/(2√T), so that

((2√T − 1)/(2√T)) Σ_{t=1}^T [f_t(w_t) + r(w_t)] ≤ (ρ√T/α) B_ψ(w*, w_1) + Σ_{t=1}^T [f_t(w*) + r(w*)].

5 Logarithmic Regret for Strongly Convex Functions

Following the vein of research begun in Hazan et al. (2006) and Shalev-Shwartz and Singer (2007), we show that Comid attains stronger regret guarantees when we assume curvature of the loss functions f_t or r. Similar to Shalev-Shwartz and Singer, we now assume that for all t, f_t + r is λ-strongly convex with respect to a differentiable function ψ, that is, for any w, v ∈ Ω,

f_t(v) + r(v) ≥ f_t(w) + r(w) + ⟨f_t'(w) + r'(w), v − w⟩ + λ B_ψ(v, w).  (7)

For example, when ψ(w) = ½‖w‖₂², we recover the usual definition of strong convexity. For simplicity, we assume that we push all the strong convexity into the function r so that the f_t are simply convex (clearly, this is possible by redefining f̂_t(w) = f_t(w) − λψ(w) if the f_t are λ-strongly convex). In this case, a straightforward corollary to Lemma 1 follows.

Corollary 6. Let the sequence {w_t} be defined by the update in Eq. (3) with step sizes η_t. Assume that B_ψ(·,·) is α-strongly convex with respect to a norm ‖·‖ and that r is λ-strongly convex with respect to ψ. Then for any w* ∈ Ω,

η_t (f_t(w_t) − f_t(w*)) + η_t (r(w_{t+1}) − r(w*)) ≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η_t²/(2α)) ‖f_t'(w_t)‖_*² − λη_t B_ψ(w*, w_{t+1}).

Proof: The proof is effectively identical to that of Lemma 1. We simply note that

r(w_{t+1}) − r(w*) ≤ ⟨r'(w_{t+1}), w_{t+1} − w*⟩ − λ B_ψ(w*, w_{t+1}),

so that

η_t [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)] ≤ η_t ⟨w_t − w*, f_t'(w_t)⟩ + η_t ⟨w_{t+1} − w*, r'(w_{t+1})⟩ − λη_t B_ψ(w*, w_{t+1}).

Now we simply proceed as in the proof of Lemma 1 following Eq. (5).

The above corollary almost immediately gives a logarithmic regret bound.
Theorem 7. Let r be λ-strongly convex with respect to a differentiable function ψ, and suppose ψ is α-strongly convex with respect to a norm ‖·‖. Assume that r(w_1) = 0. If ‖f_t'(w_t)‖_* ≤ G for all t, then

R_φ(T) ≤ λ B_ψ(w*, w_1) + (G²/(2λα)) (log T + 1) = O((G²/(λα)) log T).

Proof: Rearranging Corollary 6, we have

Σ_{t=1}^T [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)]
≤ Σ_{t=1}^T [ (1/η_t)(B_ψ(w*, w_t) − B_ψ(w*, w_{t+1})) − λ B_ψ(w*, w_{t+1}) ] + Σ_{t=1}^T (η_t/(2α)) ‖f_t'(w_t)‖_*²
= (1/η_1) B_ψ(w*, w_1) − (1/η_T) B_ψ(w*, w_{T+1}) + Σ_{t=1}^{T−1} B_ψ(w*, w_{t+1}) (1/η_{t+1} − 1/η_t − λ) − λ B_ψ(w*, w_{T+1}) + Σ_{t=1}^T (η_t/(2α)) ‖f_t'(w_t)‖_*².
If we set η_t = 1/(λt), then the first summation above is zero, and

Σ_{t=1}^T [f_t(w_t) + r(w_{t+1}) − f_t(w*) − r(w*)] ≤ λ B_ψ(w*, w_1) + (1/(2λα)) Σ_{t=1}^T (1/t) ‖f_t'(w_t)‖_*².

Noting that Σ_{t=1}^T 1/t ≤ log T + 1 completes the proof of the theorem.

An interesting point regarding the above theorem is that we do not require B_ψ(w*, w_t) to be bounded or the set Ω to be compact, which previous work assumed. When the functions f_t are Lipschitz, then whenever r is strongly convex, Comid still attains logarithmic regret. Two notable examples attain the logarithmic bounds in the above theorem. It is clear that if r defines a valid Bregman divergence, then r is strongly convex with respect to itself in the sense of Eq. (7). First, consider optimization over the simplex with entropic regularization, that is, we set r(w) = λ Σ_i w_i log w_i and Ω = {w : w ⪰ 0, ⟨w, 1⟩ = 1}. In this case it is straightforward to see that r is λ-strongly convex with respect to ψ(w) = Σ_i w_i log w_i, which in turn is strongly convex with respect to the l1-norm over Ω (see Shalev-Shwartz and Singer, 2007, Definition 2 and Example 2). Since the dual of the l1-norm is the l∞-norm, we have R_φ(T) = O((log T/λ) max_t ‖f_t'(w_t)‖_∞²). We can also use r(w) = (λ/2)‖w‖₂², in which case we recover the same bounds as those in Hazan et al. (2006).

6 Stochastic Convergence Results

In this section, we examine the application of Comid to solving stochastic optimization problems. The techniques we use have a long history in online algorithms and make connections between the regret of the algorithm and generalization performance using martingale concentration results (Littlestone, 1989). We build on known techniques for data-driven generalization bounds (Cesa-Bianchi et al., 2004) to give concentration results for Comid in the stochastic optimization setting. Further work on this subject for the strongly convex case can be found in Kakade and Tewari (2008), though we focus on the case when f_t + r is weakly convex.
We let f(w) = E[f(w; Z)] = ∫ f(w; z) dP(z), and at every step t the algorithm receives an independent random variable Z_t ∼ P that gives an unbiased estimate f_t(w_t) = f(w_t; Z_t) of the function f evaluated at w_t and an unbiased estimate f_t'(w_t) = f'(w_t; Z_t) of an arbitrary subgradient f'(w_t) ∈ ∂f(w_t). We assume that B_ψ(w*, w_t) ≤ D² for all t and, for simplicity, that ‖f_t'(w_t)‖_* ≤ G for all t, both of which are satisfied when Ω is compact. We also assume without loss of generality that r(w_1) = 0. For example, our original problem, in which f(w) = (1/n) Σ_{i=1}^n f_i(w) and we randomly sample one f_i at each iteration, falls into this setup, resolving the question posed by Tseng (2009) on the existence of stochastic composite incremental subgradient methods.

Theorem 8. Given the assumptions on f and Ω in the above paragraph, let {w_t} be the sequence generated by Eq. (3). In addition, let w̄_T = (1/T) Σ_{t=1}^T w_t and η_t = D√α/(G√t). Then

P( f(w̄_T) + r(w̄_T) ≥ f(w*) + r(w*) + 2DG/√(αT) + ε ) ≤ exp(−Tαε²/(16D²G²)).

Alternatively, with probability at least 1 − δ,

f(w̄_T) + r(w̄_T) ≤ f(w*) + r(w*) + (DG/√(αT)) (2 + 4√(log(1/δ))).

Proof: We begin our derivation by recalling Lemma 1. Convexity of f and r imply

η_t [f(w_t) + r(w_{t+1}) − f(w*) − r(w*)]
≤ η_t ⟨w_t − w*, f'(w_t)⟩ + η_t ⟨w_{t+1} − w*, r'(w_{t+1})⟩
= η_t ⟨w_t − w*, f_t'(w_t)⟩ + η_t ⟨w_{t+1} − w*, r'(w_{t+1})⟩ + η_t ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩.

We now follow the same derivation as Lemma 1, leaving η_t ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ intact, thus

η_t [f(w_t) + r(w_{t+1}) − f(w*) − r(w*)] ≤ B_ψ(w*, w_t) − B_ψ(w*, w_{t+1}) + (η_t²/(2α)) ‖f_t'(w_t)‖_*² + η_t ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩.  (8)
Now we subtract r(w_{T+1}) ≥ 0 from both sides, use the assumption that r(w_1) = 0, divide by η_t, and sum to get

Σ_{t=1}^T [f(w_t) + r(w_t) − f(w*) − r(w*)]
≤ Σ_{t=1}^T (1/η_t) [B_ψ(w*, w_t) − B_ψ(w*, w_{t+1})] + Σ_{t=1}^T (η_t/(2α)) ‖f_t'(w_t)‖_*² + Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩
≤ (1/η_1) B_ψ(w*, w_1) + Σ_{t=2}^T B_ψ(w*, w_t) (1/η_t − 1/η_{t−1}) + (G²/(2α)) Σ_{t=1}^T η_t + Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩.  (9)

Let F_t be a filtration with Z_τ ∈ F_t for τ ≤ t. Since w_t ∈ F_{t−1},

E[⟨w_t − w*, f'(w_t) − f'(w_t; Z_t)⟩ | F_{t−1}] = ⟨w_t − w*, f'(w_t) − E[f'(w_t; Z_t) | F_{t−1}]⟩ = 0,

and thus the last sum in Eq. (9) is a martingale difference sequence. We next use our assumptions that B_ψ(w*, w_t) ≤ D² and (α/2)‖w* − w_t‖² ≤ B_ψ(w*, w_t), so that ‖w* − w_t‖ ≤ √(2/α) D. Then

⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ ≤ ‖w_t − w*‖ ‖f'(w_t) − f_t'(w_t)‖_* ≤ 2√(2/α) DG.

Thus Eq. (9) consists of a bounded-difference martingale, and we can use standard concentration techniques to get strong convergence guarantees. Applying Azuma's inequality,

P( Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ ≥ ε ) ≤ exp(−αε²/(16TD²G²)).  (10)

Define γ_T = (1/T) Σ_{t=1}^T ⟨w_t − w*, f'(w_t) − f_t'(w_t)⟩ and recall that B_ψ(w*, w_t) ≤ D². The convexity of f and r gives T[f(w̄_T) + r(w̄_T)] ≤ Σ_t f(w_t) + r(w_t), so that

T [f(w̄_T) + r(w̄_T)] ≤ T [f(w*) + r(w*)] + D² (1/η_1 + Σ_{t=2}^T (1/η_t − 1/η_{t−1})) + (G²/(2α)) Σ_t η_t + T γ_T
= T [f(w*) + r(w*)] + D²/η_T + (G²/(2α)) Σ_t η_t + T γ_T.

Setting η_t = D√α/(G√t), we have f(w̄_T) + r(w̄_T) ≤ f(w*) + r(w*) + 2DG/√(αT) + γ_T, and we can immediately apply Azuma's inequality from Eq. (10) to complete the theorem.

7 Special Cases and Derived Algorithms

In this section, we show specific instantiations of our framework for different regularization functions r, and we also show that some previously developed algorithms are special cases of the framework for optimization presented here. We also give results on learning matrices with Schatten p-norm divergences that generalize some recent interesting work on trace-norm regularization.

7.1 Fobos

The recently proposed Fobos algorithm of Duchi and Singer (2009) is comprised, at each iteration, of the following two steps:

w_{t+½} = w_t − η f_t'(w_t)   and   w_{t+1} = argmin_w ½‖w − w_{t+½}‖² + ηr(w).

It is straightforward to verify that the update

w_{t+1} = argmin_w ½‖w − w_t‖² + η ⟨f_t'(w_t), w⟩ + ηr(w)

is equivalent to the two-step update above.
Thus, Comid reduces to Fobos when we take ψ(w) = ½‖w‖₂² and Ω = R^d (with constant learning rate η). This also shows that we can run Fobos with iterates restricted to a convex set Ω ⊆ R^d. Further, our results give tighter convergence guarantees than Fobos; in particular, they do not depend in any negative way on the regularization function r.
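The equivalence can be checked numerically: with ψ(w) = ½‖w‖₂² and r(w) = λ‖w‖₁, the one-step Comid objective of Eq. (3) is coordinate-separable, so we can minimize it on a fine grid and compare against the two-step Fobos update; a sketch (all names ours):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fobos_step(w_t, g, eta, lam):
    """Fobos: explicit gradient step, then the prox step for r(w) = lam*||w||_1."""
    w_half = w_t - eta * g                     # step 1: gradient step
    return soft_threshold(w_half, eta * lam)   # step 2: closed-form prox

def comid_step_numeric(w_t, g, eta, lam, grid=np.linspace(-2, 2, 400001)):
    """One-step Comid objective of Eq. (3) with the Euclidean Bregman divergence,
    minimized coordinate by coordinate on a fine grid."""
    out = np.empty_like(w_t)
    for j in range(len(w_t)):
        obj = eta * g[j] * grid + 0.5 * (grid - w_t[j]) ** 2 + eta * lam * np.abs(grid)
        out[j] = grid[np.argmin(obj)]
    return out

w, g = np.array([0.9, -0.2, 0.05]), np.array([0.5, 0.1, 0.0])
a = fobos_step(w, g, eta=0.1, lam=1.0)
b = comid_step_numeric(w, g, eta=0.1, lam=1.0)
# the two updates agree up to the grid resolution
```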
It is also not difficult to show in general that the two-step process of setting

w_{t+½} = argmin_w B_ψ(w, w_t) + η ⟨f_t'(w_t), w⟩   and   w_{t+1} = argmin_w B_ψ(w, w_{t+½}) + ηr(w)

is equivalent to the original Comid update of Eq. (3) when Ω = R^d. Indeed, the optimal solution to the first step satisfies ∇ψ(w_{t+½}) − ∇ψ(w_t) + η f_t'(w_t) = 0, so that w_{t+½} = ∇ψ^{-1}(∇ψ(w_t) − η f_t'(w_t)). Then, looking at the optimal solution for the second step, for some r'(w_{t+1}) ∈ ∂r(w_{t+1}) we have

∇ψ(w_{t+1}) − ∇ψ(w_{t+½}) + η r'(w_{t+1}) = 0,   i.e.   ∇ψ(w_{t+1}) − ∇ψ(w_t) + η f_t'(w_t) + η r'(w_{t+1}) = 0.

This is exactly the optimality condition for the one-step update of Eq. (3).

7.2 p-norm divergences

Now we consider divergence functions ψ which are the l_p-norms squared: ψ(w) = ½‖w‖_p² is (p−1)-strongly convex over R^d with respect to the l_p-norm for any p ∈ (1, 2] (Ball et al., 1994). We see that if we choose ψ(w) = ½‖w‖_p² as the divergence function, we have a corollary to Theorem 2.

Corollary 9. Suppose that r(0) = 0 and that w_1 = 0. Let p = 1 + 1/log d and use the Bregman function ψ(w) = ½‖w‖_p². Further suppose that either Ω is compact or the f_t are Lipschitz, so that for q = 1 + log d, max_t ‖f_t'(w_t)‖_q ≤ G_q. Setting η = ‖w*‖_p/(G_q √(T log d)), the regret of Comid satisfies

R_φ(T) ≤ ‖w*‖_p G_q √(T log d) = O(‖w*‖₁ G_∞ √(T log d)).

Proof: Recall that the dual norm of an l_p-norm is the l_q-norm with q = p/(p−1). From Thm. 2, we immediately have that when w_1 = 0 and ψ(w) = ½‖w‖_p²,

R(T) ≤ (1/(2η)) ‖w*‖_p² + (η/(2(p−1))) Σ_{t=1}^T ‖f_t'(w_t)‖_q².

Now use the assumption that max_t ‖f_t'(w_t)‖_q ≤ G_q, replace p with 1 + 1/log d (so q = 1 + log d), and set η = c/√(T log d), which results in

R(T) ≤ √(T log d) [ ‖w*‖_p²/(2c) + (c/2) G_q² ].

Setting c = ‖w*‖_p/G_q gives us our desired result.

From the above, we see that Comid is a good candidate for (dense) problems in high dimensions, especially when we use l1-regularization. In high dimensions, taking p = 1 + 1/log d means our bounds depend roughly on the l1-norm of the optimal predictor w* ∈ R^d and the infinity norm of the function gradients f_t'.
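The gradient map ∇ψ of ψ(w) = ½‖w‖_p² and its inverse (the same map with the dual exponent q = p/(p−1)) are used repeatedly below; a small sketch of both (function names ours), with a numeric check of the round-trip identity:

```python
import numpy as np

def link(w, p):
    """Gradient of psi(w) = 0.5*||w||_p^2 (the p-norm link map)."""
    norm = np.linalg.norm(w, ord=p)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)

def link_inv(theta, p):
    """Inverse link: the same map applied with the dual exponent q = p/(p-1)."""
    q = p / (p - 1)
    return link(theta, q)

d = 6
p = 1 + 1 / np.log(d)     # the choice of p used in Corollary 9
w = np.array([0.3, -1.2, 0.0, 2.0, -0.7, 0.1])
theta = link(w, p)
w_back = link_inv(theta, p)
```

Note that both maps preserve signs and zeros coordinatewise, which is exactly the gradient condition Eq. (12) below relies on.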
Shalev-Shwartz and Tewari (2009) recently proposed the Stochastic MIrror Descent made Sparse algorithm (SMIDAS) using this intuition. We recover SMIDAS by taking the divergence in Comid to be ψ(w) = ½‖w‖_p² and r(w) = λ‖w‖₁. The Comid update is

∇ψ(w_{t+½}) = ∇ψ(w_t) − η f_t'(w_t),   w_{t+1} = argmin_w B_ψ(w, w_{t+½}) + ηλ‖w‖₁.

The SMIDAS update, on the other hand, is

∇ψ(w_{t+½}) = ∇ψ(w_t) − η f_t'(w_t),   ∇ψ(w_{t+1}) = S_{ηλ}(∇ψ(w_{t+½})),

where S_τ is the shrinkage/thresholding operator defined by

[S_τ(x)]_j = sign(x_j) [|x_j| − τ]_+.  (11)

The following lemma proves that the two updates are identical in cases including p-norm divergences.

Lemma 10. Suppose ψ is strongly convex and its gradient satisfies

sign([∇ψ(w)]_j) = sign(w_j).  (12)

Then the unique solution v of v = argmin_w {B_ψ(w, u) + τ‖w‖₁} is given by

∇ψ(v) = S_τ(∇ψ(u)).  (13)
Proof: Since ψ is strongly convex, the solution is unique. We will show that if v satisfies Eq. (13), then it is a solution to the problem. Therefore, suppose Eq. (13) holds. The proof proceeds by considering three cases.

Case I: [∇ψ(u)]_j > τ. In this case, [∇ψ(v)]_j = [∇ψ(u)]_j − τ > 0, and by Eq. (12), v_j > 0. Thus [∇ψ(v)]_j − [∇ψ(u)]_j + τ sign(v_j) = 0.

Case II: [∇ψ(u)]_j < −τ. In this case, [∇ψ(v)]_j = [∇ψ(u)]_j + τ < 0, and Eq. (12) implies v_j < 0. So [∇ψ(v)]_j − [∇ψ(u)]_j + τ sign(v_j) = 0.

Case III: [∇ψ(u)]_j ∈ [−τ, τ]. Here we can take v_j = 0, and Eq. (12) will give [∇ψ(v)]_j = 0. Thus 0 ∈ [∇ψ(v)]_j − [∇ψ(u)]_j + τ[−1, 1].

Combining the three cases, v satisfies 0 ∈ ∇ψ(v) − ∇ψ(u) + τ∂‖v‖₁, which is the optimality condition for v ∈ argmin_w {B_ψ(w, u) + τ‖w‖₁}. We thus have ∇ψ(v) = S_τ(∇ψ(u)), as desired.

Rewriting the above lemma slightly gives the following result. The solution to

w_{t+1} = argmin_w { B_ψ(w, w_t) + η ⟨f_t'(w_t), w⟩ + ηλ‖w‖₁ }

when ψ satisfies the gradient condition of Eq. (12) is

w_{t+1} = (∇ψ)^{-1}[ sign(∇ψ(w_t) − η f_t'(w_t)) · max{ |∇ψ(w_t) − η f_t'(w_t)| − ηλ, 0 } ] = (∇ψ)^{-1}[ S_{ηλ}(∇ψ(w_t) − η f_t'(w_t)) ].  (14)

Note that when ψ(·) = ½‖·‖_p² we recover Shalev-Shwartz and Tewari's SMIDAS, while with p = 2 we get Langford et al.'s 2009 truncated gradient method. See Shalev-Shwartz and Tewari (2009) or Gentile and Littlestone (1999) for the simple formulae to compute (∇ψ)^{-1} and ∇ψ.

l∞-regularization. Let us now consider the problem of setting r(w) to be a general l_{p₂}-norm (we will specialize this to l∞ shortly). We describe the dual function and then use it to derive a few particular updates, mentioning an open problem. First, let p₁ > 1 be the exponent associated with the Bregman function ψ and p₂ ≥ 1 be the exponent for r(w) = λ‖w‖_{p₂}. Let q_i = p_i/(p_i − 1) be the associated dual exponents. Then, ignoring constants, the minimization problem from Eq. (3) becomes

min_w ⟨v, w⟩ + ½‖w‖²_{p₁} + λ‖w‖_{p₂}.

We introduce a variable z = w and get the equivalent problem min_{w = z} ⟨v, w⟩ + ½‖w‖²_{p₁} + λ‖z‖_{p₂}.
To derive the dual of the problem, we introduce the Lagrange multiplier θ and find the Lagrangian

L(w, z, θ) = ⟨v − θ, w⟩ + ½‖w‖²_{p₁} + λ‖z‖_{p₂} + ⟨θ, z⟩.

Taking the infimum over w and z in the Lagrangian, since the conjugate of ½‖·‖²_p is ½‖·‖²_q when 1/p + 1/q = 1 (Boyd and Vandenberghe, 2004, Example 3.27), we have

inf_w [ ⟨v − θ, w⟩ + ½‖w‖²_{p₁} ] = −½‖v − θ‖²_{q₁},   inf_z [ λ‖z‖_{p₂} + ⟨θ, z⟩ ] = { 0 if ‖θ‖_{q₂} ≤ λ; −∞ otherwise }.

Thus, our dual is the non-Euclidean projection problem

min_θ ½‖v − θ‖²_{q₁}   s.t.   ‖θ‖_{q₂} ≤ λ.

The Lagrangian earlier is differentiable with respect to w, so we can recover the optimal w from θ by noting that when ψ(w) = ½‖w‖²_{p₁}, we have ∇ψ(w) + v − θ = 0 at optimum, or w = (∇ψ)^{-1}(θ − v). When p₂ = 1, we easily recover Eq. (14) as our update. However, the case p₂ = ∞ is more interesting, as it can be a building block for group sparsity (Obozinski et al., 2007). In this case our problem is

min_θ ½‖v − θ‖²_{q₁}   s.t.   ‖θ‖₁ ≤ λ.

It is clear by symmetry in the above that we can assume v ⪰ 0 with no loss of generality. We can raise the l_{q₁}-norm to a power greater than 1 and maintain convexity, so our equivalent problem is

min_θ (1/q) ‖v − θ‖^q_q   s.t.   ⟨1, θ⟩ ≤ λ,  θ ⪰ 0,  (15)

where we abbreviate q = q₁.
Let θ* be the solution of Eq. (15). Clearly, at optimum we will have θ*_i ≤ v_i, though we use this only for clarity in the derivation and omit such constraints, as they do not affect the optimization problem. Introducing Lagrange multipliers ν and α ⪰ 0, we get the Lagrangian

L(θ, ν, α) = (1/q) Σ_{i=1}^d (v_i − θ_i)^q + ν(⟨1, θ⟩ − λ) − ⟨α, θ⟩.

Taking the derivative of the above, we have

−(v_i − θ_i)^{q−1} + ν − α_i = 0   ⟹   θ_i = v_i − (ν − α_i)^{1/(q−1)}.

Now suppose we knew the optimal ν. If an index i satisfies ν ≥ v_i^{q−1}, then we will have θ_i = 0. To see this, suppose for the sake of contradiction that θ_i > 0. The KKT conditions for optimality of Eq. (15) (Boyd and Vandenberghe, 2004) imply that for such i we have α_i = 0, so θ_i = v_i − ν^{1/(q−1)} ≤ 0, a contradiction. Similarly, if ν < v_i^{q−1}, then θ_i > 0. Indeed, since α_i ≥ 0 would only increase θ_i, we have θ_i = v_i − (ν − α_i)^{1/(q−1)} ≥ v_i − ν^{1/(q−1)} > v_i − v_i = 0, so that θ_i > 0 and the KKT conditions imply α_i = 0. Had we known ν, the optimal θ_i would have been easy to obtain as θ_i(ν) = v_i − (min{ν, v_i^{q−1}})^{1/(q−1)} (note that this satisfies 0 ≤ θ_i ≤ v_i). Since we know that the structure of the optimal θ must obey the above equation, we can boil our problem down to finding ν ≥ 0 so that

Σ_{i=1}^d θ_i(ν) = Σ_{i=1}^d [ v_i − (min{ν, v_i^{q−1}})^{1/(q−1)} ] = λ.  (16)

Interestingly, this reduces to exactly the same root-finding problem as that for Euclidean projection onto an l1-ball (Duchi et al., 2008). As shown by Duchi et al., it is straightforward to find the optimal ν in time linear in the dimension d. An open problem is to find an efficient algorithm for solving the generalized projections above when using the l2- rather than the l∞-norm.

Mixed-norm regularization. Now we consider the problem of mixed-norm regularization, in which we wish to minimize functions f_t(W) + r(W) of a matrix W ∈ R^{d×k}. In particular, we define w_i ∈ R^k to be the i-th row of W, and we set r(W) = λ Σ_{i=1}^d ‖w_i‖_{p₂}.
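The root-finding problem of Eq. (16) above can also be solved by simple bisection over ν, since the left-hand side is continuous and nonincreasing in ν; a short sketch (function names ours, assuming v ⪰ 0 and Σ_i v_i > λ so a root exists):

```python
import numpy as np

def find_nu(v, lam, q, tol=1e-10):
    """Bisection for Eq. (16): sum_i [v_i - min(nu, v_i**(q-1))**(1/(q-1))] = lam.

    Assumes v >= 0 (valid by symmetry) and sum(v) > lam, so a root nu > 0 exists.
    """
    theta_sum = lambda nu: np.sum(v - np.minimum(nu, v ** (q - 1)) ** (1.0 / (q - 1)))
    lo, hi = 0.0, float(np.max(v) ** (q - 1))   # theta_sum(lo) = sum(v), theta_sum(hi) = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # theta_sum is nonincreasing in nu: shrink the bracket toward the root
        if theta_sum(mid) > lam:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

v = np.array([0.5, 1.0, 2.0, 0.1])
lam = 1.5
nu = find_nu(v, lam, q=2.0)
theta = v - np.minimum(nu, v)   # for q = 2, theta_i(nu) = max(v_i - nu, 0)
```

For q = 2 this is exactly the thresholding form familiar from l1-ball projection; the linear-time pivot method of Duchi et al. (2008) replaces the bisection when speed matters.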
We also use the p-norm Bregman functions as above, with ψ(W) = ½‖W‖²_p = ½(Σ_{i,j} |W_ij|^p)^{2/p}, which are (p−1)-strongly convex with respect to the l_p-norm. As earlier, our minimization problem becomes

min_W ⟨V, W⟩ + ½‖W‖²_p + λ‖W‖_{l1/l_{p₂}},

whose dual problem is

min_Θ ½‖V − Θ‖²_q   s.t.   ‖θ_i‖_{q₂} ≤ λ for each row i.

Raising the first norm to the q-th power, we see that the problem is separable across rows, and we can solve it using the techniques in the prequel.

7.3 Matrix Composite Mirror Descent

We now consider a setting that generalizes the previous discussions, in which our variables W_t are matrices, W_t ∈ Ω = R^{d1×d2}. We use Bregman functions based on Schatten p-norms (e.g. Horn and Johnson, 1985, Section 7.4). Schatten p-norms are the family of unitarily invariant matrix norms arising from applying p-norms to the singular values of the matrix W. That is, letting σ(W) denote the vector of singular values of W, we set ‖W‖_p = ‖σ(W)‖_p. We use Bregman functions ψ(W) = ½‖W‖²_p, which, similar to the p-norm on vectors, are (p−1)-strongly convex over Ω = R^{d1×d2} with respect to the norm ‖·‖_p (Ball et al., 1994). As in the previous subsection, we mainly consider two values for p: p = 2 and a value very near 1, namely p = 1 + 1/log d. For p = 2, ψ is 1-strongly convex with respect to ‖·‖₂ = ‖·‖_Fr, the
Frobenius norm. For the second value, ψ is (1/log d)-strongly convex with respect to ‖·‖_{1+1/log d}, or, with a bit more work, ψ is (1/(3 log d))-strongly convex w.r.t. ‖·‖₁, the trace or nuclear norm. We focus on the specific setting of trace-norm regularization, r(W) = λ‖W‖₁. This norm, similar to the l1-norm on vectors, encourages sparsity in the singular value spectrum of W and hence is useful for rank minimization (Recht et al., 2007). The generic Comid update with the above choice of ψ gives a Schatten p-norm Comid algorithm for matrix applications:

W_{t+1} = argmin_{W ∈ Ω} η ⟨f_t'(W_t), W⟩ + B_ψ(W, W_t) + ηλ‖W‖₁.  (17)

The update is well defined since ψ is strongly convex, as per the above discussion. We have also defined ⟨W, V⟩ = tr(WᵀV) as the usual matrix inner product. The generic Comid convergence result in Thm. 2 immediately yields the following two corollaries.

Corollary 11. Let the sequence {W_t} be defined by the update in Eq. (17) with p = 2. If each f_t satisfies ‖f_t'(W_t)‖₂ ≤ G₂, then there is a stepsize η for which the regret against W* ∈ Ω is

R_φ(T) ≤ G₂ ‖W*‖₂ √T.

Corollary 12. Let p = 1 + 1/log d in the Schatten Comid update of Eq. (17), and let q = 1 + log d. If each f_t satisfies ‖f_t'(W_t)‖_q ≤ G_q, then

R_φ(T) ≤ G_q ‖W*‖_p √(T log d) ≈ G_∞ ‖W*‖₁ √(T log d),

where G_∞ = max_t ‖f_t'(W_t)‖_∞ = max_t σ_max(f_t'(W_t)).

Let us consider the actual implementation of the Comid update. Similar convergence rates (with worse constants and a negative dependence on the spectrum of r) to those above can be achieved via simple mirror descent, i.e. by linearizing r(W). The advantage of Comid is that it achieves sparsity in the spectrum, as the following proposition demonstrates.

Proposition 13. Let Ψ(w) = ½‖w‖²_p and S_τ(v) = sign(v)[|v| − τ]_+ as in the prequel. For p ∈ (1, 2], the update in Eq. (17) can be implemented as follows.
1. Compute SVD: W_t = U_t diag(σ(W_t)) V_t⊤
2. Gradient step: Θ_t = U_t diag(∇Ψ(σ(W_t))) V_t⊤ − η f_t′(W_t) (8a)
3. Compute SVD: Θ_t = Ũ_t diag(σ(Θ_t)) Ṽ_t⊤
4. Splitting update: W_{t+1} = Ũ_t diag((∇Ψ)^{−1}(S_{ηλ}(σ(Θ_t)))) Ṽ_t⊤ (8b)

Note that the first SVD, in Eq. (8a), is used for notational convenience only and need not be computed at each iteration, since it is maintained at the end of iteration t via Eq. (8b). The computational requirements for Comid are thus the same as for standard mirror descent, which also requires an SVD computation at each step to compute ∇ψ(W_t). The last step of the update, Eq. (8b), applies the shrinkage/thresholding operator S_{ηλ} to the spectrum of Θ_t, which introduces sparsity. Furthermore, due to the sign- (and hence sparsity-) preserving nature of the map (∇Ψ)^{−1}, the sparsity in the spectrum is maintained in W_{t+1}. Lastly, the special case for p = 2, the standard Frobenius norm update, was derived (but without rates or allowing stochastic gradients) by Ma et al. (2009), who report good empirical results for their algorithm. In trace-norm applications, we expect ‖W*‖_1 to be small. Therefore, in such applications, our new Schatten-p COMID algorithm with p near 1 should give strong performance, since G_∞ can be much smaller than G_2.

Proof of Proposition 3: We know from the prequel that the Comid step is equivalent to ∇ψ(W̃_t) = ∇ψ(W_t) − η f_t′(W_t) and W_{t+1} = argmin_W { B_ψ(W, W̃_t) + η r(W) }. Since W_t has singular value decomposition U_t diag(σ(W_t)) V_t⊤ and ψ(W) = Ψ(σ(W)) is unitarily invariant, ∇ψ(W_t) = U_t diag(∇Ψ(σ(W_t))) V_t⊤ (Lewis, 1995, Corollary 2.5). This means that the Θ_t computed in step 2 above is simply ∇ψ(W̃_t). The proof essentially amounts to a reduction to the vector case, since the norms are unitarily invariant, and will be complete if we prove that V = argmin_W { B_ψ(W, W̃_t) + τ‖W‖_1 }
has the unique solution

V = Ũ_t diag((∇Ψ)^{−1}(S_τ(σ(∇ψ(W̃_t))))) Ṽ_t⊤, (9)

where Ũ_t diag(σ(W̃_t)) Ṽ_t⊤ is the SVD of W̃_t. By subgradient optimality conditions, it is sufficient that the proposed solution V satisfy

0_{d₁×d₂} ∈ ∇ψ(V) − ∇ψ(W̃_t) + τ ∂‖V‖_1.

Applying Lewis's Corollary 2.5, we can continue to use the orthonormal matrices Ũ_t and Ṽ_t, and we see that the proposed V in Eq. (9) satisfies ∇ψ(V) = Ũ_t diag(∇Ψ(ν)) Ṽ_t⊤ and V = Ũ_t diag(ν) Ṽ_t⊤, where ν = (∇Ψ)^{−1}(S_τ(σ(∇ψ(W̃_t)))). We have thus reduced the problem to showing that

0_d ∈ ∇Ψ(ν) − ∇Ψ(σ(W̃_t)) + τ ∂‖ν‖_1,

since we chose the matrices to have the same singular vectors by construction. From Lemma 10 presented earlier, we already know that ν satisfies this equation if and only if ∇Ψ(ν) = S_τ(∇Ψ(σ(W̃_t))), which is indeed the case by the definition of ν (noting of course that σ(∇ψ(W̃_t)) = ∇Ψ(σ(W̃_t))).

Acknowledgments

JCD was supported by a National Defense Science and Engineering Graduate (NDSEG) fellowship, and some of this work was completed while JCD was an intern at Google Research.

References

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357–367, 1967.

K. Ball, E. Carlen, and E. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115:463–482, 1994.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161–168, 2008.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

J. Duchi and Y. Singer. Online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

C. Gentile and N. Littlestone. The robustness of the p-norm algorithms. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the 19th Annual Conference on Computational Learning Theory, 2006.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
S. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 22. MIT Press, 2009.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

A. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2:173–183, 1995.

P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16:964–979, 1979.

N. Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, July 1989.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Submitted, 2009.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 2007/76, Center for Operations Research and Econometrics, Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical Report 743, Dept. of Statistics, University of California, Berkeley, 2007.

B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Submitted, 2007.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.

S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, 2008.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, 2009.

N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.

P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Submitted.

S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.
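To make the update of Proposition 3 concrete, here is a minimal NumPy sketch (ours, not from the paper) of one COMID step for the Frobenius case p = 2 with trace-norm regularization. For p = 2 the link function ∇Ψ is the identity, so the two-SVD recipe of Eqs. (8a)–(8b) reduces to soft-thresholding the singular values of the gradient iterate; the function name and toy data below are hypothetical.

```python
import numpy as np

def frobenius_comid_step(W, grad, eta, lam):
    """One COMID step with psi(W) = (1/2)||W||_Fr^2 and r(W) = lam * ||W||_1 (trace norm).

    With p = 2, grad Psi is the identity, so Eq. (8a) is a plain gradient step
    and Eq. (8b) soft-thresholds the singular values, sparsifying the spectrum.
    """
    Theta = W - eta * grad                      # gradient step, Eq. (8a) with p = 2
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    s_thresh = np.maximum(s - eta * lam, 0.0)   # shrinkage operator S_{eta*lam} on the spectrum
    return U @ np.diag(s_thresh) @ Vt           # splitting update, Eq. (8b)

# Toy usage: one step on f(W) = (1/2)||W - A||_Fr^2 starting from W = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
W = np.zeros((5, 4))
W1 = frobenius_comid_step(W, W - A, eta=1.0, lam=0.5)
```

For general p ∈ (1, 2] one would additionally apply ∇Ψ before the gradient step and (∇Ψ)^{−1} after thresholding, exactly as in steps 2 and 4 of Proposition 3.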
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationEcon 201: Problem Set 3 Answers
Econ 20: Problem Set 3 Ansers Instructor: Alexandre Sollaci T.A.: Ryan Hughes Winter 208 Question a) The firm s fixed cost is F C = a and variable costs are T V Cq) = 2 bq2. b) As seen in class, the optimal
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationA survey: The convex optimization approach to regret minimization
A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization
More information