Composite Objective Mirror Descent

John C. Duchi (UC Berkeley), Shai Shalev-Shwartz (Hebrew University), Yoram Singer (Google Research), Ambuj Tewari (TTI Chicago)

Abstract

We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as $\ell_1$, mixed norm, and trace-norm.

1 Introduction and Problem Statement

Regularized loss minimization is a common learning paradigm in which one jointly minimizes an empirical loss over a training set plus a regularization term. The paradigm yields an optimization problem of the form

$$\min_{w \in \Omega} \; \frac{1}{n} \sum_{t=1}^{n} f_t(w) + r(w), \qquad (1)$$

where $\Omega \subseteq \mathbb{R}^d$ is the domain (a closed convex set), $f_t : \Omega \to \mathbb{R}$ is a (convex) loss function associated with a single example in a training set, and $r : \Omega \to \mathbb{R}$ is a (convex) regularization function. A few examples of famous learning problems that fall into this framework are least squares, ridge regression, support vector machines, support vector regression, lasso, and logistic regression.

In this paper, we describe and analyze a general framework for solving Eq. (1). The method we propose is a first-order approach, meaning that we access the functions $f_t$ only by receiving subgradients. Recent work has shown that, from the perspective of achieving good statistical performance on unseen data, first-order methods are preferable to higher-order approaches, especially when the number of training examples $n$ is very large (Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008). Furthermore, in large scale problems it is often prohibitively expensive to compute the gradient of the entire objective function (thus accessing all the examples in the training set), and randomly choosing a subset of the training set and computing the gradient over the subset (perhaps only a single example) can be significantly more efficient. This approach is very closely related to online learning. Our general framework handles both cases with ease: it applies whether we access a single example (or subset of the examples) at each iteration or the entire training set at each iteration.

The method we describe is an adaptation of the Mirror Descent (MD) algorithm (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003), an iterative method for minimizing a convex function $\phi : \Omega \to \mathbb{R}$. If the dimension $d$ is large enough, MD is optimal among first-order methods, and it has a close connection to online learning, since it is possible to bound the regret $\sum_t \phi_t(w_t) - \inf_{w \in \Omega} \sum_t \phi_t(w)$, where $\{w_t\}$ is the sequence generated by mirror descent and the $\phi_t$ are convex functions. In fact, one can view popular online learning algorithms, such as weighted majority (Littlestone and Warmuth, 1994) and online gradient descent (Zinkevich, 2003), as special cases of mirror descent. A guarantee on the online regret can be translated directly into a guarantee on the convergence rate of the algorithm to the optimum of Eq. (1), as we will show later.
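To make Eq. (1) concrete, the small sketch below (ours, not part of the paper) evaluates the composite objective for the lasso, where each $f_t$ is a squared loss on one example and $r$ is a scaled $\ell_1$-norm; the data and the weight lam are illustrative placeholders.

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """Composite objective of Eq. (1) for the lasso:
    average of f_t(w) = 0.5*(<w, x_t> - y_t)^2 plus r(w) = lam*||w||_1."""
    losses = 0.5 * (X @ w - y) ** 2      # f_t(w) for each example t
    return losses.mean() + lam * np.abs(w).sum()

# illustrative usage with synthetic data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
print(lasso_objective(np.zeros(5), X, y, lam=0.1))
```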

Following Beck and Teboulle's exposition, a Mirror Descent update in the online setting can be written as

$$w_{t+1} = \operatorname*{argmin}_{w \in \Omega} \; B_\psi(w, w_t) + \eta \langle \phi_t'(w_t), w \rangle, \qquad (2)$$

where $B_\psi$ is a Bregman divergence and $\phi_t'$ denotes an arbitrary subgradient of $\phi_t$. Intuitively, MD minimizes a first-order approximation of the function $\phi_t$ at the current iterate $w_t$ while forcing the next iterate $w_{t+1}$ to lie close to $w_t$. The step size $\eta$ controls the trade-off between these two.

Our focus in this paper is to generalize mirror descent to the case when the functions $\phi_t$ are composite, that is, they consist of two parts: $\phi_t = f_t + r$. Here the $f_t$ change over time, but the function $r$ remains constant. Of course, one can ignore the composite structure of the $\phi_t$ and use MD. However, doing so can have undesirable effects. For example, when $r(w) = \|w\|_1$, applying MD directly does not lead to sparse updates, even though the sparsity-inducing property of the $\ell_1$-norm is a major reason for its use. The modification of mirror descent that we propose is simple:

$$w_{t+1} = \operatorname*{argmin}_{w \in \Omega} \; \eta \langle f_t'(w_t), w \rangle + B_\psi(w, w_t) + \eta\, r(w). \qquad (3)$$

This is almost the same as the mirror descent update, with one important difference: we do not linearize $r$. We call this algorithm Composite Objective MIrror Descent, or Comid. One of our contributions is to show that, in a variety of cases, the Comid update is no costlier than the usual mirror descent update. In these situations, each Comid update is efficient and benefits from the presence of the regularizer $r(w)$.

We now outline the remainder of the paper. We begin by reviewing related work, of which there is a copious amount, though we try to do some justice to prior research. We then give a general $O(\sqrt{T})$ regret bound for Comid in the online optimization setting, after which we give several extensions. We show $O(\log T)$ regret bounds for Comid when the composite functions $f_t + r$ are strongly convex, after which we show convergence rates and concentration results for stochastic optimization using Comid. The second focus of the paper is the derived algorithms, where we outline step rules for several choices of Bregman function $\psi$ and regularizer $r$, including $\ell_1$, $\ell_\infty$, and mixed-norm regularization, as well as presenting new results on efficient matrix optimization with Schatten $p$-norms. A concrete sketch of the generic update appears at the end of the next section.

2 Related Work

Since the idea underlying Comid is simple, it is not surprising that similar algorithms have been proposed. One of our main contributions is to show that Comid generalizes much prior work and to give a clean unifying analysis. We do not have the space to thoroughly review the literature, though we try to do some small justice to what is known. We begin by reviewing work that we will show is a special case of Comid. Forward-backward splitting is a long-studied framework for minimizing composite objective functions (Lions and Mercier, 1979), though it has only recently been analyzed for the online and stochastic case (Duchi and Singer, 2009). Specializations of forward-backward splitting to the case where $r(w) = \|w\|_1$ include iterative shrinkage and thresholding from the signal processing literature (Daubechies et al., 2004); from machine learning, Truncated Gradient (Langford et al., 2009) and SMIDAS (Shalev-Shwartz and Tewari, 2009) are both special cases of Comid. In the optimization community there has been significant recent interest, both applied and theoretical, in the minimization of composite objective functions such as that in Eq. (1). Some notable examples include Wright et al. (2009), Nesterov (2007), and Tseng (2009). These papers all assume that the objective $f + r$ to be minimized is fixed and that $f$ is smooth, i.e., that it has Lipschitz continuous derivatives. The most closely related of these to Comid is probably Tseng (2009, see his Sec. 3.1 and the references therein), which proposes the same update as ours but gives a Nesterov-like optimal method for the fixed-$f$ case. We place no such restrictions on $f$, though by going to stochastic, nondifferentiable $f$ we naturally suffer in convergence rate. Nonetheless, we do answer in the affirmative a question posed by Tseng (2009), namely whether stochastic or incremental subgradient methods work for composite objectives. Two recent papers on online and stochastic composite objective minimization are Xiao (2009) and Duchi and Singer (2009). The former extends Nesterov's 2009 analysis of primal-dual subgradient methods to the composite case, giving an algorithm which is similar to ours; however, the algorithms are different, and the analysis for each is completely different. Duchi and Singer (2009) is simply a specialization of Comid to the case where the Euclidean Bregman divergence is used. As a consequence of our general setting, we are able to give elegant new algorithms for the minimization of functions on matrices, which include efficient and simple algorithms for trace-norm minimization.
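Before formalizing the setting, here is a minimal numerical sketch (ours) of the update in Eq. (3) for $\Omega = \mathbb{R}^d$. It solves the Comid subproblem with a generic black-box solver, so any differentiable Bregman function and smooth regularizer can be plugged in; the closed-form updates for specific $\psi$ and $r$ are derived in Section 7.

```python
import numpy as np
from scipy.optimize import minimize

def comid_step(w_t, g_t, eta, bregman, r):
    """One Comid update, Eq. (3): argmin_w eta*<g_t, w> + B_psi(w, w_t) + eta*r(w).
    Solved numerically here for clarity; note that r is NOT linearized."""
    obj = lambda w: eta * g_t @ w + bregman(w, w_t) + eta * r(w)
    return minimize(obj, w_t).x

# illustrative instantiation: Euclidean Bregman divergence and a ridge regularizer
euclid = lambda w, v: 0.5 * np.sum((w - v) ** 2)
ridge = lambda w: 0.5 * np.sum(w ** 2)
w_next = comid_step(np.ones(3), np.array([1.0, -2.0, 0.5]), 0.1, euclid, ridge)
print(w_next)
```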

Trace norm minimization has recently found strong applicability in matrix rank minimization (Recht et al., 2007), which has been shown to be very useful, for example, in collaborative filtering (Srebro et al., 2004). A special case of Comid has recently been developed for this task, which is very similar in spirit to fixed-point and shrinkage methods from signal processing for $\ell_1$-minimization (Ma et al., 2009). The authors of that paper note that the method is extremely efficient for rank-minimization problems but do not give rates of convergence, which we give as a corollary to our main convergence theorems.

3 Notation and Setting

Before continuing, we establish notation and our problem setting formally. Vectors are lower case bold italic letters, such as $x \in \mathbb{R}^d$, and scalars are lower case italics, such as $x \in \mathbb{R}$. We denote a sequence of vectors by subscripts, i.e. $w_t, w_{t+1}, \ldots$, and entries in a vector by non-bold subscripts, as in $w_j$. Matrices are upper case bold italic letters, such as $W \in \mathbb{R}^{d \times d}$. The subdifferential set of a function $f$ evaluated at $w$ is denoted $\partial f(w)$, and a particular subgradient by $f'(w) \in \partial f(w)$. When a function is differentiable, we write $\nabla f(w)$.

We focus mostly on the problem of regularized online learning, in which the goal is to achieve low regret w.r.t. a static predictor $w^\star \in \Omega$ on a sequence of functions $\phi_t(w) \triangleq f_t(w) + r(w)$. Here, $f_t$ and $r \geq 0$ are convex functions, and $\Omega$ is some convex set (which could be $\mathbb{R}^d$). Formally, at every round of the algorithm we make a prediction $w_t \in \mathbb{R}^d$ and then receive the function $f_t$. We seek bounds on the regularized regret with respect to $w^\star$, defined as

$$R_\phi(T, w^\star) \triangleq \sum_{t=1}^{T} \left[ f_t(w_t) + r(w_t) - f_t(w^\star) - r(w^\star) \right]. \qquad (4)$$

In batch optimization we set $f_t = f$ for all $t$, while in stochastic optimization we choose $f_t$ to be the average of some random subset of $\{f_1, \ldots, f_n\}$. As mentioned previously and as we will show, it is not difficult to transform regret bounds for Eq. (4) into convergence rates that hold in expectation and with high probability for Eq. (1), which we do using techniques similar to Cesa-Bianchi et al. (2004).

Throughout, $\psi$ designates a continuously differentiable function that is $\alpha$-strongly convex w.r.t. a norm $\|\cdot\|$ on the set $\Omega$. Recall that this means that the Bregman divergence associated with $\psi$,

$$B_\psi(w, v) = \psi(w) - \psi(v) - \langle \nabla\psi(v), w - v \rangle,$$

satisfies $B_\psi(w, v) \geq \frac{\alpha}{2} \|w - v\|^2$ for some $\alpha > 0$.
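As a quick illustration (ours) of the definition, the sketch below computes $B_\psi$ for two standard Bregman functions: the squared Euclidean norm, whose divergence is the squared distance, and the negative entropy, whose divergence on the simplex is the KL divergence.

```python
import numpy as np

def bregman(psi, grad_psi, w, v):
    """B_psi(w, v) = psi(w) - psi(v) - <grad psi(v), w - v>."""
    return psi(w) - psi(v) - grad_psi(v) @ (w - v)

# psi(w) = 0.5*||w||_2^2: the divergence is 0.5*||w - v||^2 (alpha = 1 w.r.t. l2)
sq = lambda w: 0.5 * w @ w
w, v = np.array([1.0, 2.0]), np.array([0.5, 1.0])
print(bregman(sq, lambda u: u, w, v))                    # equals 0.5*||w - v||^2

# psi(w) = sum_j w_j log w_j: on the simplex the divergence is KL(w || v)
ent = lambda w: np.sum(w * np.log(w))
p, q = np.array([0.3, 0.7]), np.array([0.5, 0.5])
print(bregman(ent, lambda u: np.log(u) + 1.0, p, q))     # equals KL(p || q)
```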

4 Composite Objective MIrror Descent

We use proof techniques similar to those in Beck and Teboulle (2003) to derive progress bounds on each step of the algorithm. We then use the bounds to straightforwardly prove convergence results for online and batch learning. We begin by bounding the progress made by each step of the algorithm in either an online or a batch setting. This lemma is the key to our later analysis, so we prove it in full here.

Lemma 1. Let the sequence $\{w_t\}$ be defined by the update in Eq. (3). Assume that $B_\psi(\cdot,\cdot)$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$, that is, $B_\psi(w, v) \geq \frac{\alpha}{2}\|w - v\|^2$. For any $w^\star \in \Omega$,

$$\eta\,(f_t(w_t) - f_t(w^\star)) + \eta\,(r(w_{t+1}) - r(w^\star)) \leq B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1}) + \frac{\eta^2}{2\alpha}\|f_t'(w_t)\|_*^2.$$

Proof: The optimality of $w_{t+1}$ for Eq. (3) implies that for all $w \in \Omega$ and some $r'(w_{t+1}) \in \partial r(w_{t+1})$,

$$\big\langle w - w_{t+1},\; \eta f_t'(w_t) + \nabla\psi(w_{t+1}) - \nabla\psi(w_t) + \eta\, r'(w_{t+1}) \big\rangle \geq 0. \qquad (5)$$

In particular, this holds for $w = w^\star$. From the subgradient inequality for convex functions, we have $f_t(w^\star) \geq f_t(w_t) + \langle f_t'(w_t), w^\star - w_t\rangle$, or $f_t(w_t) - f_t(w^\star) \leq \langle f_t'(w_t), w_t - w^\star\rangle$, and likewise for $r(w_{t+1})$. We thus have

$$\eta\,[f_t(w_t) + r(w_{t+1}) - f_t(w^\star) - r(w^\star)] \leq \eta\,\langle w_t - w^\star, f_t'(w_t)\rangle + \eta\,\langle w_{t+1} - w^\star, r'(w_{t+1})\rangle$$
$$= \eta\,\langle w_{t+1} - w^\star, f_t'(w_t)\rangle + \eta\,\langle w_{t+1} - w^\star, r'(w_{t+1})\rangle + \eta\,\langle w_t - w_{t+1}, f_t'(w_t)\rangle$$
$$= \big\langle w^\star - w_{t+1},\; \nabla\psi(w_t) - \nabla\psi(w_{t+1}) - \eta f_t'(w_t) - \eta r'(w_{t+1})\big\rangle + \big\langle w^\star - w_{t+1},\; \nabla\psi(w_{t+1}) - \nabla\psi(w_t)\big\rangle + \eta\,\langle w_t - w_{t+1}, f_t'(w_t)\rangle.$$

Now, by Eq. (5), the first term in the last expression is non-positive. Thus we have

$$\eta\,[f_t(w_t) + r(w_{t+1}) - f_t(w^\star) - r(w^\star)] \leq \big\langle w^\star - w_{t+1},\; \nabla\psi(w_{t+1}) - \nabla\psi(w_t)\big\rangle + \eta\,\langle w_t - w_{t+1}, f_t'(w_t)\rangle$$
$$= B_\psi(w^\star, w_t) - B_\psi(w_{t+1}, w_t) - B_\psi(w^\star, w_{t+1}) + \eta\,\langle w_t - w_{t+1}, f_t'(w_t)\rangle \qquad (6)$$
$$= B_\psi(w^\star, w_t) - B_\psi(w_{t+1}, w_t) - B_\psi(w^\star, w_{t+1}) + \Big\langle \sqrt{\alpha}\,(w_t - w_{t+1}),\; \tfrac{\eta}{\sqrt{\alpha}}\, f_t'(w_t)\Big\rangle$$
$$\leq B_\psi(w^\star, w_t) - B_\psi(w_{t+1}, w_t) - B_\psi(w^\star, w_{t+1}) + \frac{\alpha}{2}\|w_t - w_{t+1}\|^2 + \frac{\eta^2}{2\alpha}\|f_t'(w_t)\|_*^2$$
$$\leq B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1}) + \frac{\eta^2}{2\alpha}\|f_t'(w_t)\|_*^2.$$

In the above, the first equality follows from simple algebra with $B_\psi$, namely $\langle\nabla\psi(b) - \nabla\psi(a), c - a\rangle = B_\psi(c, a) + B_\psi(a, b) - B_\psi(c, b)$ with $c = w^\star$, $a = w_{t+1}$, and $b = w_t$. The second-to-last inequality follows from the Fenchel-Young inequality applied to the conjugate pair $\frac{1}{2}\|\cdot\|^2, \frac{1}{2}\|\cdot\|_*^2$ (Boyd and Vandenberghe, 2004, Example 3.27). The last inequality follows from the strong convexity of $B_\psi$ with respect to the norm $\|\cdot\|$.

The following theorem uses Lemma 1 to establish a general regret bound for the Comid framework.

Theorem 2. Let the sequence $\{w_t\}$ be defined by the update in Eq. (3). Then for any $w^\star \in \Omega$,

$$R_\phi(T, w^\star) \leq \frac{1}{\eta} B_\psi(w^\star, w_1) + r(w_1) + \frac{\eta}{2\alpha}\sum_{t=1}^T \|f_t'(w_t)\|_*^2.$$

Proof: By Lemma 1,

$$\eta \sum_{t=1}^T [f_t(w_t) - f_t(w^\star) + r(w_{t+1}) - r(w^\star)] \leq \sum_{t=1}^T \big[B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1})\big] + \frac{\eta^2}{2\alpha}\sum_{t=1}^T \|f_t'(w_t)\|_*^2 = B_\psi(w^\star, w_1) - B_\psi(w^\star, w_{T+1}) + \frac{\eta^2}{2\alpha}\sum_{t=1}^T \|f_t'(w_t)\|_*^2.$$

Noting that Bregman divergences are always non-negative, recall our assumption that $r(w) \geq 0$. Adding $\eta\, r(w_1)$ to both sides of the above and dropping the $r(w_{T+1})$ term gives

$$\eta \sum_{t=1}^T [f_t(w_t) - f_t(w^\star) + r(w_t) - r(w^\star)] \leq B_\psi(w^\star, w_1) + \eta\, r(w_1) + \frac{\eta^2}{2\alpha}\sum_{t=1}^T \|f_t'(w_t)\|_*^2.$$

Dividing each side by $\eta$ gives the result.

A few corollaries are immediate from the above result. First, suppose that the functions $f_t$ are Lipschitz continuous. Then there is some $G_*$ such that $\|f_t'(w_t)\|_* \leq G_*$, and we have

Corollary 3. Let $\{w_t\}$ be generated by the update Eq. (3), and assume that the functions $f_t$ are Lipschitz with dual Lipschitz constant $G_*$. Then

$$R_\phi(T, w^\star) \leq \frac{1}{\eta}B_\psi(w^\star, w_1) + r(w_1) + \frac{T\eta}{2\alpha}G_*^2.$$

If we take $\eta \propto 1/\sqrt{T}$, then the regret is $O(\sqrt{T})$ when the functions $f_t$ are Lipschitz. If $\Omega$ is compact, the $f_t$ are guaranteed to be Lipschitz continuous (Rockafellar, 1970).

Corollary 4. Suppose that either $\Omega$ is compact or the functions $f_t$ are Lipschitz, so that $\|f_t'\|_* \leq G_*$. Also assume $r(w_1) = 0$. Then setting $\eta = \sqrt{2\alpha\, B_\psi(w^\star, w_1)}\,/\,(G_*\sqrt{T})$,

$$R_\phi(T, w^\star) \leq \sqrt{2T\, B_\psi(w^\star, w_1)}\; G_*/\sqrt{\alpha}.$$
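A small helper (ours, purely illustrative) for the tuned step size of Corollary 4: given an upper bound $B \geq B_\psi(w^\star, w_1)$ and a dual-norm gradient bound $G_*$, it returns $\eta$ and the resulting $O(\sqrt{T})$ regret bound.

```python
import numpy as np

def corollary4_step_and_bound(B, G, T, alpha=1.0):
    """Step size and regret bound from Corollary 4.
    B bounds B_psi(w*, w_1); G bounds the dual norms ||f_t'(w_t)||_*."""
    eta = np.sqrt(2 * alpha * B) / (G * np.sqrt(T))
    bound = np.sqrt(2 * T * B) * G / np.sqrt(alpha)
    return eta, bound

eta, bound = corollary4_step_and_bound(B=0.5, G=1.0, T=10_000)
print(eta, bound, bound / 10_000)   # regret grows as sqrt(T); average regret shrinks as 1/sqrt(T)
```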

It is straightforward to prove results under the slightly different restriction that $\|f_t'(w)\|_*^2 \leq \rho f_t(w)$, which is similar to assuming a Lipschitz condition on the gradient of $f_t$. A common example in which this holds is linear regression, where $f_i(w) = \frac{1}{2}(\langle w, x_i\rangle - y_i)^2$, so that $f_i'(w) = (\langle w, x_i\rangle - y_i)\,x_i$ and $\rho = 2\|x_i\|_*^2$. The proof essentially amounts to dividing out constants depending on $\eta$ and $\rho$ from both sides of the regret bound.

Corollary 5. Let $\|f_t'(w)\|_*^2 \leq \rho f_t(w)$, $r \geq 0$, and assume $r(w_1) = 0$. Setting $\eta \propto 1/\sqrt{T}$ gives $R_\phi(T, w^\star) = O(\rho\sqrt{T}\, B_\psi(w^\star, w_1)/\alpha)$.

Proof: From Theorem 2, the non-negativity of $r$, and the fact that $r(w_1) = 0$, we immediately have

$$\Big(1 - \frac{\rho\eta}{2\alpha}\Big)\sum_{t=1}^T [f_t(w_t) + r(w_t)] \leq \sum_{t=1}^T f_t(w_t)\Big(1 - \frac{\rho\eta}{2\alpha}\Big) + \sum_{t=1}^T r(w_t) \leq \frac{1}{\eta}B_\psi(w^\star, w_1) + \sum_{t=1}^T f_t(w^\star) + r(w^\star).$$

Setting $\eta = \alpha/(\rho\sqrt{T})$ gives $1 - \rho\eta/(2\alpha) = (2\sqrt{T} - 1)/(2\sqrt{T})$, so that

$$\sum_{t=1}^T f_t(w_t) + r(w_t) \leq \frac{2\sqrt{T}}{2\sqrt{T} - 1}\Big[\frac{\rho\sqrt{T}}{\alpha} B_\psi(w^\star, w_1) + \sum_{t=1}^T f_t(w^\star) + r(w^\star)\Big].$$

5 Logarithmic Regret for Strongly Convex Functions

Following the vein of research begun in Hazan et al. (2006) and Shalev-Shwartz and Singer (2007), we show that Comid attains stronger regret guarantees when we assume curvature of the loss functions $f_t$ or $r$. Similar to Shalev-Shwartz and Singer, we now assume that for all $t$, $f_t + r$ is $\lambda$-strongly convex with respect to a differentiable function $\psi$, that is, for any $w, v \in \Omega$,

$$f_t(v) + r(v) \geq f_t(w) + r(w) + \langle f_t'(w) + r'(w),\, v - w\rangle + \lambda B_\psi(v, w). \qquad (7)$$

For example, when $\psi(w) = \frac{1}{2}\|w\|_2^2$, we recover the usual definition of strong convexity. For simplicity, we assume that we push all the strong convexity into the function $r$, so that the $f_t$ are simply convex (clearly, this is possible by redefining $\hat f_t(w) = f_t(w) - \lambda\psi(w)$ if the $f_t$ are $\lambda$-strongly convex). In this case, a straightforward corollary to Lemma 1 follows.

Corollary 6. Let the sequence $\{w_t\}$ be defined by the update in Eq. (3) with step sizes $\eta_t$. Assume that $B_\psi(\cdot,\cdot)$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ and that $r$ is $\lambda$-strongly convex with respect to $\psi$. Then for any $w^\star \in \Omega$,

$$\eta_t(f_t(w_t) - f_t(w^\star)) + \eta_t(r(w_{t+1}) - r(w^\star)) \leq B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1}) + \frac{\eta_t^2}{2\alpha}\|f_t'(w_t)\|_*^2 - \lambda\eta_t B_\psi(w^\star, w_{t+1}).$$

Proof: The proof is effectively identical to that of Lemma 1. We simply note that $r(w_{t+1}) - r(w^\star) \leq \langle r'(w_{t+1}), w_{t+1} - w^\star\rangle - \lambda B_\psi(w^\star, w_{t+1})$, so that

$$\eta_t[f_t(w_t) + r(w_{t+1}) - f_t(w^\star) - r(w^\star)] \leq \eta_t\langle w_t - w^\star, f_t'(w_t)\rangle + \eta_t\langle w_{t+1} - w^\star, r'(w_{t+1})\rangle - \lambda\eta_t B_\psi(w^\star, w_{t+1}).$$

Now we simply proceed as in the proof of Lemma 1 following Eq. (5).

The above corollary almost immediately gives a logarithmic regret bound.

Theorem 7. Let $r$ be $\lambda$-strongly convex with respect to a differentiable function $\psi$, and suppose $\psi$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$. Assume that $r(w_1) = 0$. If $\|f_t'(w_t)\|_* \leq G_*$ for all $t$, then taking $\eta_t = 1/(\lambda t)$,

$$R_\phi(T, w^\star) \leq \lambda B_\psi(w^\star, w_1) + \frac{G_*^2}{2\lambda\alpha}(\log T + 1) = O\Big(\frac{G_*^2}{\lambda\alpha}\log T\Big).$$

Proof: Rearranging Corollary 6 and summing, we have

$$\sum_{t=1}^T \big[f_t(w_t) + r(w_{t+1}) - f_t(w^\star) - r(w^\star)\big] \leq \sum_{t=1}^T \Big[\frac{B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1})}{\eta_t} - \lambda B_\psi(w^\star, w_{t+1})\Big] + \sum_{t=1}^T \frac{\eta_t}{2\alpha}\|f_t'(w_t)\|_*^2$$
$$= \frac{1}{\eta_1}B_\psi(w^\star, w_1) - \Big(\frac{1}{\eta_T} + \lambda\Big)B_\psi(w^\star, w_{T+1}) + \sum_{t=1}^{T-1} B_\psi(w^\star, w_{t+1})\Big(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} - \lambda\Big) + \sum_{t=1}^T \frac{\eta_t}{2\alpha}\|f_t'(w_t)\|_*^2.$$

If we set $\eta_t = 1/(\lambda t)$, then the summation involving the divergences $B_\psi(w^\star, w_{t+1})$ is zero (since $1/\eta_{t+1} - 1/\eta_t = \lambda$), the term $-(1/\eta_T + \lambda)B_\psi(w^\star, w_{T+1})$ is non-positive, and

$$\sum_{t=1}^T \big[f_t(w_t) + r(w_{t+1}) - f_t(w^\star) - r(w^\star)\big] \leq \lambda B_\psi(w^\star, w_1) + \frac{1}{2\lambda\alpha}\sum_{t=1}^T \frac{1}{t}\|f_t'(w_t)\|_*^2.$$

Noting that $\sum_{t=1}^T 1/t \leq \log T + 1$ and shifting the $r$ terms as in Theorem 2 (using $r(w_1) = 0$) completes the proof of the theorem.

An interesting point regarding the above theorem is that we do not require $B_\psi(w^\star, w_t)$ to be bounded or the set $\Omega$ to be compact, which previous work assumed. When the functions $f_t$ are Lipschitz, then whenever $r$ is strongly convex, Comid still attains logarithmic regret.

Two notable examples attain the logarithmic bounds in the above theorem. It is clear that if $r$ defines a valid Bregman divergence, then $r$ is strongly convex with respect to itself in the sense of Eq. (7). First, consider optimization over the simplex with entropic regularization, that is, we set $r(w) = \lambda\sum_i w_i\log w_i$ and $\Omega = \{w : w \succeq 0,\, \langle\mathbf{1}, w\rangle = 1\}$. In this case it is straightforward to see that $r$ is $\lambda$-strongly convex with respect to $\psi(w) = \sum_i w_i\log w_i$, which in turn is strongly convex with respect to the $\ell_1$-norm over $\Omega$ (see Shalev-Shwartz and Singer, 2007, Definition 2 and Example 2). Since the dual of the $\ell_1$-norm is the $\ell_\infty$-norm, we have $R_\phi(T, w^\star) = O\big(\frac{\log T}{\lambda}\max_t \|f_t'(w_t)\|_\infty^2\big)$. We can also use $r(w) = \frac{\lambda}{2}\|w\|_2^2$, in which case we recover the same bounds as those in Hazan et al. (2006).
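For the $\ell_2$ case just mentioned, the Comid update of Eq. (3) has a simple closed form. The sketch below (ours) takes $\psi(w) = \frac{1}{2}\|w\|_2^2$ and $r(w) = \frac{\lambda}{2}\|w\|_2^2$ with the $\eta_t = 1/(\lambda t)$ step size from Theorem 7; setting the gradient of the update objective to zero gives $w_{t+1} = (w_t - \eta_t f_t'(w_t))/(1 + \eta_t\lambda)$.

```python
import numpy as np

def comid_step_l2(w_t, g_t, t, lam):
    """Comid update, Eq. (3), with psi(w) = 0.5*||w||^2, r(w) = 0.5*lam*||w||^2,
    and eta_t = 1/(lam*t). The objective eta*<g,w> + 0.5*||w - w_t||^2
    + 0.5*eta*lam*||w||^2 is minimized where its gradient vanishes."""
    eta = 1.0 / (lam * t)
    return (w_t - eta * g_t) / (1.0 + eta * lam)

w = np.zeros(3)
for t, g in enumerate([np.array([1.0, 0.0, -1.0]), np.array([0.5, -0.5, 0.0])], start=1):
    w = comid_step_l2(w, g, t, lam=0.1)
print(w)
```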

6 Stochastic Convergence Results

In this section, we examine the application of Comid to solving stochastic optimization problems. The techniques we use have a long history in online algorithms and make connections between the regret of the algorithm and generalization performance using martingale concentration results (Littlestone, 1989). We build on known techniques for data-driven generalization bounds (Cesa-Bianchi et al., 2004) to give concentration results for Comid in the stochastic optimization setting. Further work on this subject for the strongly convex case can be found in Kakade and Tewari (2008), though we focus on the case when $f_t + r$ is weakly convex.

We let $f(w) = \mathbb{E} f(w; Z) = \int f(w; z)\,dP(z)$, and at every step $t$ the algorithm receives an independent random variable $Z_t \sim P$ that gives an unbiased estimate $f_t(w_t) = f(w_t; Z_t)$ of the function $f$ evaluated at $w_t$ and an unbiased estimate $f_t'(w_t) = f'(w_t; Z_t)$ of an arbitrary subgradient $f'(w_t) \in \partial f(w_t)$. We assume that $B_\psi(w^\star, w_t) \leq D^2$ for all $t$ and, for simplicity, that $\|f_t'(w_t)\|_* \leq G_*$ for all $t$; both are satisfied when $\Omega$ is compact. We also assume without loss of generality that $r(w_1) = 0$. For example, our original problem, in which $f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$ and we randomly sample one $f_i$ at each iteration, falls into this setup, resolving the question posed by Tseng (2009) on the existence of stochastic composite incremental subgradient methods.

Theorem 8. Given the assumptions on $f$ and $\Omega$ in the above paragraph, let $\{w_t\}$ be the sequence generated by Eq. (3). In addition, let $\bar w_T = \frac{1}{T}\sum_{t=1}^T w_t$ and $\eta_t = \frac{D}{G_*}\sqrt{\frac{\alpha}{t}}$. Then

$$P\Big(f(\bar w_T) + r(\bar w_T) \geq f(w^\star) + r(w^\star) + \frac{2DG_*}{\sqrt{\alpha T}} + \varepsilon\Big) \leq \exp\Big(-\frac{T\alpha\varepsilon^2}{16 D^2 G_*^2}\Big).$$

Alternatively, with probability at least $1 - \delta$,

$$f(\bar w_T) + r(\bar w_T) \leq f(w^\star) + r(w^\star) + \frac{2DG_*}{\sqrt{\alpha T}}\Big(1 + 2\sqrt{\log\tfrac{1}{\delta}}\Big).$$

Proof: We begin our derivation by recalling Lemma 1. Convexity of $f$ and $r$ implies

$$\eta_t[f(w_t) + r(w_{t+1}) - f(w^\star) - r(w^\star)] \leq \eta_t\langle w_t - w^\star, f'(w_t)\rangle + \eta_t\langle w_{t+1} - w^\star, r'(w_{t+1})\rangle$$
$$= \eta_t\langle w_t - w^\star, f_t'(w_t)\rangle + \eta_t\langle w_{t+1} - w^\star, r'(w_{t+1})\rangle + \eta_t\langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle.$$

We now follow the same derivation as Lemma 1, leaving $\eta_t\langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle$ intact, thus

$$\eta_t[f(w_t) + r(w_{t+1}) - f(w^\star) - r(w^\star)] \leq B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1}) + \frac{\eta_t^2}{2\alpha}\|f_t'(w_t)\|_*^2 + \eta_t\langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle. \qquad (8)$$

Now we subtract $r(w_{T+1}) \geq 0$ from both sides, use the assumption that $r(w_1) = 0$, and sum to get

$$\sum_{t=1}^T [f(w_t) + r(w_t) - f(w^\star) - r(w^\star)] \leq \sum_{t=1}^T \frac{B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1})}{\eta_t} + \sum_{t=1}^T \frac{\eta_t}{2\alpha}\|f_t'(w_t)\|_*^2 + \sum_{t=1}^T \langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle$$
$$\leq \frac{1}{\eta_1}B_\psi(w^\star, w_1) + \sum_{t=2}^T B_\psi(w^\star, w_t)\Big[\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big] + \frac{G_*^2}{2\alpha}\sum_{t=1}^T \eta_t + \sum_{t=1}^T \langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle. \qquad (9)$$

Let $\mathcal{F}_t$ be the filtration with $Z_\tau \in \mathcal{F}_t$ for $\tau \leq t$. Since $w_t \in \mathcal{F}_{t-1}$,

$$\mathbb{E}\big[\langle w_t - w^\star, f'(w_t) - f'(w_t; Z_t)\rangle \mid \mathcal{F}_{t-1}\big] = \big\langle w_t - w^\star,\, f'(w_t) - \mathbb{E}[f'(w_t; Z_t)\mid\mathcal{F}_{t-1}]\big\rangle = 0,$$

and thus the last sum in Eq. (9) is a martingale difference sequence. We next use our assumptions that $B_\psi(w^\star, w_t) \leq D^2$ and $\frac{\alpha}{2}\|w^\star - w_t\|^2 \leq B_\psi(w^\star, w_t)$, so that $\|w^\star - w_t\| \leq \sqrt{2/\alpha}\,D$. Then

$$\langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle \leq \|w_t - w^\star\|\,\|f'(w_t) - f_t'(w_t)\|_* \leq 2\sqrt{2/\alpha}\,DG_*.$$

Thus Eq. (9) involves a bounded-difference martingale, and we can use standard concentration techniques to get strong convergence guarantees. Applying Azuma's inequality,

$$P\Big(\sum_{t=1}^T \langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle \geq \varepsilon\Big) \leq \exp\Big(-\frac{\alpha\varepsilon^2}{16\,T D^2 G_*^2}\Big). \qquad (10)$$

Define $\gamma_T = \sum_{t=1}^T \langle w_t - w^\star, f'(w_t) - f_t'(w_t)\rangle$ and recall that $B_\psi(w^\star, w_t) \leq D^2$. The convexity of $f$ and $r$ gives $T[f(\bar w_T) + r(\bar w_T)] \leq \sum_{t=1}^T f(w_t) + r(w_t)$, so that

$$T[f(\bar w_T) + r(\bar w_T)] \leq T[f(w^\star) + r(w^\star)] + D^2\Big[\frac{1}{\eta_1} + \sum_{t=2}^T\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big)\Big] + \frac{G_*^2}{2\alpha}\sum_{t=1}^T \eta_t + \gamma_T = T[f(w^\star) + r(w^\star)] + \frac{D^2}{\eta_T} + \frac{G_*^2}{2\alpha}\sum_{t=1}^T \eta_t + \gamma_T.$$

Setting $\eta_t = \frac{D}{G_*}\sqrt{\frac{\alpha}{t}}$, we have $f(\bar w_T) + r(\bar w_T) \leq f(w^\star) + r(w^\star) + 2DG_*\sqrt{T}/(\sqrt{\alpha}\,T) + \frac{1}{T}\gamma_T$, and we can immediately apply Azuma's inequality from Eq. (10) to complete the theorem.

7 Special Cases and Derived Algorithms

In this section, we show specific instantiations of our framework for different regularization functions $r$, and we also show that some previously developed algorithms are special cases of the framework for optimization presented here. We also give results on learning matrices with Schatten $p$-norm divergences that generalize some recent interesting work on trace norm regularization.

7.1 Fobos

The recently proposed Fobos algorithm of Duchi and Singer (2009) consists, at each iteration, of the following two steps:

$$w_{t+\frac{1}{2}} = w_t - \eta f_t'(w_t) \quad\text{and}\quad w_{t+1} = \operatorname*{argmin}_w\; \tfrac{1}{2}\|w - w_{t+\frac{1}{2}}\|_2^2 + \eta\, r(w).$$

It is straightforward to verify that the update

$$w_{t+1} = \operatorname*{argmin}_w\; \tfrac{1}{2}\|w - w_t\|_2^2 + \eta\langle f_t'(w_t), w - w_t\rangle + \eta\, r(w)$$

is equivalent to the two-step update above. Thus, Comid reduces to Fobos when we take $\psi(w) = \frac{1}{2}\|w\|_2^2$ and $\Omega = \mathbb{R}^d$ (with constant learning rate $\eta$). This also shows that we can run Fobos while restricting $w$ to a convex set $\Omega \subset \mathbb{R}^d$. Further, our results give tighter convergence guarantees than Fobos; in particular, they do not depend in any negative way on the regularization function $r$.
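For the common choice $r(w) = \lambda\|w\|_1$, the second Fobos step is coordinatewise soft-thresholding, so the whole Comid/Fobos iteration is two lines; a minimal sketch (ours):

```python
import numpy as np

def soft_threshold(v, tau):
    """Componentwise shrinkage: sign(v_j) * max(|v_j| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fobos_l1_step(w_t, g_t, eta, lam):
    """Fobos (= Comid with psi = 0.5*||.||_2^2, Omega = R^d) for r(w) = lam*||w||_1:
    a gradient step followed by the proximal step for the l1 norm."""
    return soft_threshold(w_t - eta * g_t, eta * lam)

w = fobos_l1_step(np.array([0.3, -0.2, 1.5]), np.array([1.0, -1.0, 0.0]), eta=0.1, lam=0.5)
print(w)   # small coordinates are truncated exactly to zero
```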

It is also not difficult to show in general that the two-step process of setting

$$\tilde w_{t+1} = \operatorname*{argmin}_w\; B_\psi(w, w_t) + \eta\langle f_t'(w_t), w\rangle \quad\text{and}\quad w_{t+1} = \operatorname*{argmin}_w\; B_\psi(w, \tilde w_{t+1}) + \eta\, r(w)$$

is equivalent to the original Comid update of Eq. (3) when $\Omega = \mathbb{R}^d$. Indeed, the optimal solution to the first step satisfies $\nabla\psi(\tilde w_{t+1}) - \nabla\psi(w_t) + \eta f_t'(w_t) = 0$, so that $\tilde w_{t+1} = (\nabla\psi)^{-1}(\nabla\psi(w_t) - \eta f_t'(w_t))$. Then, looking at the optimal solution for the second step, for some $r'(w_{t+1}) \in \partial r(w_{t+1})$ we have $\nabla\psi(w_{t+1}) - \nabla\psi(\tilde w_{t+1}) + \eta r'(w_{t+1}) = 0$, i.e. $\nabla\psi(w_{t+1}) - \nabla\psi(w_t) + \eta f_t'(w_t) + \eta r'(w_{t+1}) = 0$. This is clearly the solution to the one-step update of Eq. (3).

7.2 p-norm divergences

We now consider divergence functions $\psi$ which are the $\ell_p$-norms squared: $\frac{1}{2}\|w\|_p^2$ is $(p-1)$-strongly convex over $\mathbb{R}^d$ with respect to the $\ell_p$-norm for any $p \in (1, 2]$ (Ball et al., 1994). If we choose $\psi(w) = \frac{1}{2}\|w\|_p^2$ as the divergence function, we have a corollary to Theorem 2.

Corollary 9. Suppose that $r(0) = 0$ and that $w_1 = 0$. Let $p = 1 + 1/\log d$ and use the Bregman function $\psi(w) = \frac{1}{2}\|w\|_p^2$. Further suppose that either $\Omega$ is compact or the $f_t$ are Lipschitz, so that for $q = \log d + 1$ we have $\max_t \|f_t'(w_t)\|_q \leq G_q$. Setting $\eta = \frac{\|w^\star\|_p}{G_q\sqrt{T\log d}}$, the regret of Comid satisfies

$$R_\phi(T) \leq \|w^\star\|_p\, G_q\sqrt{T\log d} = O\big(\|w^\star\|_1\, G_\infty\sqrt{T\log d}\big),$$

where $G_\infty \geq \max_t\|f_t'(w_t)\|_\infty$ (note that $\|w^\star\|_p \leq \|w^\star\|_1$ and $G_q \leq d^{1/q}G_\infty \leq e\,G_\infty$).

Proof: Recall that the dual norm of an $\ell_p$-norm is the $\ell_q$-norm with $q = p/(p-1)$. From Thm. 2, we immediately have that when $w_1 = 0$ and $\psi(w) = \frac{1}{2}\|w\|_p^2$,

$$R(T) \leq \frac{1}{2\eta}\|w^\star\|_p^2 + \frac{\eta}{2(p-1)}\sum_{t=1}^T \|f_t'(w_t)\|_q^2.$$

Now use the assumption that $\max_t\|f_t'(w_t)\|_q \leq G_q$, replace $p$ with $1 + 1/\log d$ (so $q = \log d + 1$), and set $\eta = c/\sqrt{T\log d}$, which results in

$$R(T) \leq \sqrt{T\log d}\,\Big[\frac{\|w^\star\|_p^2}{2c} + \frac{c}{2}\,G_q^2\Big].$$

Setting $c = \|w^\star\|_p/G_q$ gives us our desired result.

From the above, we see that Comid is a good candidate for (dense) problems in high dimensions, especially when we use $\ell_1$-regularization. For high dimensions, when $w \in \mathbb{R}^d$, taking $p = 1 + 1/\log d$ means our bounds depend roughly on the $\ell_1$-norm of the optimal predictor and the $\ell_\infty$-norm of the function gradients $f_t'$. Shalev-Shwartz and Tewari (2009) recently proposed the Stochastic Mirror Descent made Sparse algorithm (SMIDAS) using this intuition. We recover SMIDAS by taking the divergence in Comid to be $\psi(w) = \frac{1}{2}\|w\|_p^2$ and $r(w) = \lambda\|w\|_1$. The Comid update is

$$\nabla\psi(\tilde w_{t+1}) = \nabla\psi(w_t) - \eta f_t'(w_t), \qquad w_{t+1} = \operatorname*{argmin}_w\; B_\psi(w, \tilde w_{t+1}) + \eta\lambda\|w\|_1.$$

The SMIDAS update, on the other hand, is

$$\nabla\psi(\tilde w_{t+1}) = \nabla\psi(w_t) - \eta f_t'(w_t), \qquad \nabla\psi(w_{t+1}) = S_{\eta\lambda}(\nabla\psi(\tilde w_{t+1})),$$

where $S_\tau$ is the shrinkage/thresholding operator defined by

$$[S_\tau(x)]_j = \operatorname{sign}(x_j)\,\big[\,|x_j| - \tau\,\big]_+. \qquad (11)$$
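A sketch (ours) of the SMIDAS update just displayed, using the standard formula for $\nabla\psi$ when $\psi(w) = \frac{1}{2}\|w\|_p^2$ (see Gentile and Littlestone, 1999); the inverse link is the same map with the conjugate exponent $q = p/(p-1)$.

```python
import numpy as np

def link(w, p):
    """Gradient of psi(w) = 0.5*||w||_p^2 (the p-norm link function):
    [grad psi(w)]_j = sign(w_j)*|w_j|^(p-1) / ||w||_p^(p-2)."""
    norm = np.sum(np.abs(w) ** p) ** (1.0 / p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)

def smidas_step(w_t, g_t, eta, lam, p):
    """SMIDAS / Comid step: dual gradient step, shrink, map back to the primal."""
    q = p / (p - 1.0)
    theta = link(w_t, p) - eta * g_t                                      # dual step
    theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)   # S_{eta*lam}
    return link(theta, q)                                                 # inverse link

d = 4
p = 1.0 + 1.0 / np.log(d)
print(smidas_step(np.array([0.5, -0.1, 0.0, 0.2]),
                  np.array([0.1, 0.3, -0.2, 0.0]), eta=0.1, lam=0.5, p=p))
```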

The following lemma proves that the two updates are identical in cases that include $p$-norm divergences.

Lemma 10. Suppose $\psi$ is strongly convex and its gradient satisfies

$$\operatorname{sign}([\nabla\psi(w)]_j) = \operatorname{sign}(w_j). \qquad (12)$$

Then the unique solution $v$ of $v = \operatorname{argmin}_w\,\{B_\psi(w, u) + \tau\|w\|_1\}$ is given by

$$\nabla\psi(v) = S_\tau(\nabla\psi(u)). \qquad (13)$$

Proof: Since $\psi$ is strongly convex, the solution is unique. We will show that if $v$ satisfies Eq. (13), then it is a solution to the problem. Therefore, suppose Eq. (13) holds. The proof proceeds by considering three cases.

Case I: $[\nabla\psi(u)]_j > \tau$. In this case, $[\nabla\psi(v)]_j = [\nabla\psi(u)]_j - \tau > 0$, and by Eq. (12), $v_j > 0$. Thus $[\nabla\psi(v)]_j - [\nabla\psi(u)]_j + \tau\operatorname{sign}(v_j) = 0$.

Case II: $[\nabla\psi(u)]_j < -\tau$. In this case, $[\nabla\psi(v)]_j = [\nabla\psi(u)]_j + \tau < 0$, and Eq. (12) implies $v_j < 0$. So $[\nabla\psi(v)]_j - [\nabla\psi(u)]_j + \tau\operatorname{sign}(v_j) = 0$.

Case III: $[\nabla\psi(u)]_j \in [-\tau, \tau]$. Here, we can take $v_j = 0$, and Eq. (12) will give $[\nabla\psi(v)]_j = 0$. Thus $0 \in [\nabla\psi(v)]_j - [\nabla\psi(u)]_j + \tau[-1, 1]$.

Combining the three cases, $v$ satisfies $0 \in \nabla\psi(v) - \nabla\psi(u) + \tau\,\partial\|v\|_1$, which is the optimality condition for $v \in \operatorname{argmin}_w\{B_\psi(w, u) + \tau\|w\|_1\}$. We thus have $\nabla\psi(v) = S_\tau(\nabla\psi(u))$, as desired.

Rewriting the above lemma slightly gives the following result: when $\psi$ satisfies the gradient condition of Eq. (12) and $r(w) = \lambda\|w\|_1$, the solution to $w_{t+1} = \operatorname{argmin}_w\{B_\psi(w, w_t) + \eta\langle f_t'(w_t), w - w_t\rangle + \eta r(w)\}$ is

$$w_{t+1} = (\nabla\psi)^{-1}\big[\operatorname{sign}(\nabla\psi(w_t) - \eta f_t'(w_t))\,\max\{|\nabla\psi(w_t) - \eta f_t'(w_t)| - \eta\lambda,\, 0\}\big] = (\nabla\psi)^{-1}\big[S_{\eta\lambda}(\nabla\psi(w_t) - \eta f_t'(w_t))\big]. \qquad (14)$$

Note that when $\psi(\cdot) = \frac{1}{2}\|\cdot\|_p^2$ we recover Shalev-Shwartz and Tewari's SMIDAS, while with $p = 2$ we get Langford et al.'s (2009) truncated gradient method. See Shalev-Shwartz and Tewari (2009) or Gentile and Littlestone (1999) for the simple formulae to compute $(\nabla\psi)^{-1}$ and $\nabla\psi$.

$\ell_\infty$-regularization. Let us now consider the problem of setting $r(w)$ to be a general $\ell_{p_2}$-norm (we will specialize this to $\ell_\infty$ shortly). We describe the dual function and then use it to derive a few particular updates, mentioning an open problem. First, let $p_1 > 1$ be the norm exponent associated with the Bregman function $\psi$, and let $p_2$ be the norm exponent for $r(w) = \lambda\|w\|_{p_2}$. Let $q_i = p_i/(p_i - 1)$ be the associated dual exponents. Then, ignoring constants, the minimization problem from Eq. (3) becomes

$$\min_w\; \langle v, w\rangle + \tfrac{1}{2}\|w\|_{p_1}^2 + \lambda\|w\|_{p_2}.$$

We introduce a variable $z = w$ and get the equivalent problem $\min_{w = z}\, \langle v, w\rangle + \frac{1}{2}\|w\|_{p_1}^2 + \lambda\|z\|_{p_2}$. To derive the dual of the problem, we introduce a Lagrange multiplier $\theta$ and find the Lagrangian

$$\mathcal{L}(w, z, \theta) = \langle v - \theta, w\rangle + \tfrac{1}{2}\|w\|_{p_1}^2 + \lambda\|z\|_{p_2} + \langle\theta, z\rangle.$$

Taking the infimum over $w$ and $z$ in the Lagrangian, since the conjugate of $\frac{1}{2}\|\cdot\|_p^2$ is $\frac{1}{2}\|\cdot\|_q^2$ when $1/p + 1/q = 1$ (Boyd and Vandenberghe, 2004, Example 3.27), we have

$$\inf_w\Big[\langle v - \theta, w\rangle + \tfrac{1}{2}\|w\|_{p_1}^2\Big] = -\tfrac{1}{2}\|v - \theta\|_{q_1}^2, \qquad \inf_z\Big[\lambda\|z\|_{p_2} + \langle\theta, z\rangle\Big] = \begin{cases} 0 & \text{if } \|\theta\|_{q_2} \leq \lambda \\ -\infty & \text{otherwise.}\end{cases}$$

Thus, our dual is the non-Euclidean projection problem

$$\min_\theta\; \tfrac{1}{2}\|v - \theta\|_{q_1}^2 \quad\text{s.t.}\quad \|\theta\|_{q_2} \leq \lambda.$$

The Lagrangian above is differentiable with respect to $w$, so we can recover the optimal $w$ from the optimal $\theta$ by noting that when $\psi(w) = \frac{1}{2}\|w\|_{p_1}^2$, $\nabla\psi(w) + v - \theta = 0$ at the optimum, i.e. $w = (\nabla\psi)^{-1}(\theta - v)$. When $p_2 = 1$, we easily recover Eq. (14) as our update. However, the case $p_2 = \infty$ is more interesting, as it can be a building block for group sparsity (Obozinski et al., 2007). In this case our problem is $\min_\theta \frac{1}{2}\|v - \theta\|_{q_1}^2$ s.t. $\|\theta\|_1 \leq \lambda$. It is clear by symmetry that we can assume $v \succeq 0$ with no loss of generality. We can raise the $\ell_{q_1}$-norm to a power greater than 1 and maintain convexity, so our equivalent problem is (writing $q = q_1$)

$$\min_\theta\; \tfrac{1}{q}\|v - \theta\|_q^q \quad\text{s.t.}\quad \langle\mathbf{1}, \theta\rangle \leq \lambda,\;\; \theta \succeq 0. \qquad (15)$$

Let $\theta^\star$ be the solution of Eq. (15). Clearly, at the optimum we will have $\theta_i \leq v_i$, though we use this only for clarity in the derivation and omit these constraints, as they do not affect the optimization problem. Introducing Lagrange multipliers $\nu$ and $\alpha \succeq 0$, we get the Lagrangian

$$\mathcal{L}(\theta, \nu, \alpha) = \frac{1}{q}\sum_{i=1}^d (v_i - \theta_i)^q + \nu(\langle\mathbf{1}, \theta\rangle - \lambda) - \langle\alpha, \theta\rangle.$$

Taking the derivative of the above, we have

$$-(v_i - \theta_i)^{q-1} + \nu - \alpha_i = 0 \quad\Longrightarrow\quad \theta_i = v_i - (\nu - \alpha_i)^{1/(q-1)}.$$

Now suppose we knew the optimal $\nu$. If an index $i$ satisfies $\nu \geq v_i^{q-1}$, then we will have $\theta_i = 0$. To see this, suppose for the sake of contradiction that $\theta_i > 0$. The KKT conditions for optimality of Eq. (15) (Boyd and Vandenberghe, 2004) imply that for such $i$ we have $\alpha_i = 0$, so $\theta_i = v_i - \nu^{1/(q-1)} \leq 0$, a contradiction. Similarly, if $\nu < v_i^{q-1}$, then $\theta_i > 0$. Indeed, since $\alpha_i \geq 0$,

$$\theta_i = v_i - (\nu - \alpha_i)^{1/(q-1)} > v_i - (v_i^{q-1} - \alpha_i)^{1/(q-1)} \geq v_i - v_i = 0,$$

so that $\theta_i > 0$, and the KKT conditions then imply $\alpha_i = 0$. Had we known $\nu$, the optimal $\theta_i$ would therefore have been easy to obtain as

$$\theta_i(\nu) = v_i - \big(\min\{\nu,\, v_i^{q-1}\}\big)^{1/(q-1)}$$

(note that this satisfies $0 \leq \theta_i \leq v_i$). Since we know that the structure of the optimal $\theta$ must obey the above equation, we can boil our problem down to finding $\nu \geq 0$ such that

$$\sum_{i=1}^d \theta_i(\nu) = \sum_{i=1}^d v_i - \min\{\nu,\, v_i^{q-1}\}^{1/(q-1)} = \lambda. \qquad (16)$$

Interestingly, this reduces to exactly the same root-finding problem as that for Euclidean projection onto an $\ell_1$-ball (Duchi et al., 2008). As shown by Duchi et al., it is straightforward to find the optimal $\nu$ in time linear in the dimension $d$. An open problem is to find an efficient algorithm for solving the generalized projections above when using the $\ell_2$- rather than the $\ell_\infty$-norm.
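A minimal sketch (ours) of the root-finding step in Eq. (16). Instead of the linear-time method of Duchi et al. (2008), it uses plain bisection, relying on the fact that $\sum_i \theta_i(\nu)$ decreases continuously from $\|v\|_1$ at $\nu = 0$ to $0$ at $\nu = \max_i v_i^{q-1}$; it assumes $v \succeq 0$ (valid by the symmetry argument above) and $\|v\|_1 > \lambda$ (otherwise $\theta = v$ is already feasible).

```python
import numpy as np

def theta_of_nu(nu, v, q):
    """theta_i(nu) = v_i - min(nu, v_i^(q-1))^(1/(q-1)), as derived above."""
    return v - np.minimum(nu, v ** (q - 1.0)) ** (1.0 / (q - 1.0))

def solve_eq16(v, lam, q, iters=100):
    """Find nu >= 0 with sum_i theta_i(nu) = lam by bisection on [0, max v^(q-1)]."""
    lo, hi = 0.0, np.max(v) ** (q - 1.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if theta_of_nu(mid, v, q).sum() > lam:   # sum still too large: increase nu
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

v, q, lam = np.array([0.9, 0.5, 0.1]), 2.0, 1.0
nu = solve_eq16(v, lam, q)
print(theta_of_nu(nu, v, q), theta_of_nu(nu, v, q).sum())   # sums to ~lam
```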

Mixed-norm regularization. Now we consider the problem of mixed-norm regularization, in which we wish to minimize functions $f_t(W) + r(W)$ of a matrix $W \in \mathbb{R}^{d\times k}$. In particular, we define $w^i \in \mathbb{R}^k$ to be the $i$-th row of $W$, and we set $r(W) = \lambda\sum_{i=1}^d \|w^i\|_{p_2}$. We also use the $p$-norm Bregman functions as above, with $\psi(W) = \frac{1}{2}\|W\|_p^2 = \frac{1}{2}\big(\sum_{i,j}|W_{ij}|^p\big)^{2/p}$, which are $(p-1)$-strongly convex with respect to the $\ell_p$-norm squared. As earlier, our minimization problem becomes

$$\min_W\; \langle V, W\rangle + \tfrac{1}{2}\|W\|_p^2 + \lambda\|W\|_{l_1/l_{p_2}},$$

whose dual problem is

$$\min_\Theta\; \tfrac{1}{2}\|V - \Theta\|_q^2 \quad\text{s.t.}\quad \|\theta^i\|_{q_2} \leq \lambda \;\;\text{for all } i.$$

Raising the first norm to the $q$-th power, we see that the problem is separable across rows, and we can solve it using the techniques in the prequel.

7.3 Matrix Composite Mirror Descent

We now consider a setting that generalizes the previous discussions, in which our variables $W_t$ are matrices $W_t \in \Omega = \mathbb{R}^{d_1\times d_2}$. We use Bregman functions based on Schatten $p$-norms (e.g. Horn and Johnson, 1985, Section 7.4). Schatten $p$-norms are the family of unitarily invariant matrix norms arising from applying $p$-norms to the singular values of the matrix $W$. That is, letting $\sigma(W)$ denote the vector of singular values of $W$, we set $\|W\|_p = \|\sigma(W)\|_p$. We use Bregman functions $\psi(W) = \frac{1}{2}\|W\|_p^2$, which, similar to the $p$-norm on vectors, are $(p-1)$-strongly convex over $\Omega = \mathbb{R}^{d_1\times d_2}$ with respect to the norm $\|\cdot\|_p$ (Ball et al., 1994). As in the previous subsection, we mainly consider two values for $p$: $p = 2$ and a value very near 1, namely $p = 1 + 1/\log d$. For $p = 2$, $\psi$ is 1-strongly convex with respect to $\|\cdot\|_2 = \|\cdot\|_{\mathrm{Fr}}$, the Frobenius norm. For the second value, $\psi$ is $1/\log d$-strongly convex with respect to $\|\cdot\|_{1+1/\log d}$, or, with a bit more work, $\psi$ is $1/(3\log d)$-strongly convex w.r.t. $\|\cdot\|_1$, the trace or nuclear norm.

We focus on the specific setting of trace-norm regularization, that is, $r(W) = \lambda\|W\|_1$. This norm, similar to the $\ell_1$-norm on vectors, gives sparsity in the singular value spectrum of $W$ and hence is useful for rank minimization (Recht et al., 2007). The generic Comid update with the above choice of $\psi$ gives a Schatten $p$-norm Comid algorithm for matrix applications:

$$W_{t+1} = \operatorname*{argmin}_{W\in\Omega}\; \eta\langle f_t'(W_t), W\rangle + B_\psi(W, W_t) + \eta\lambda\|W\|_1. \qquad (17)$$

The update is well defined since $\psi$ is strongly convex, as per the above discussion. Here $\langle W, V\rangle = \operatorname{tr}(W^\top V)$ is the usual matrix inner product. The generic Comid convergence result in Thm. 2 immediately yields the following two corollaries.

Corollary 11. Let the sequence $\{W_t\}$ be defined by the update in Eq. (17) with $p = 2$. If each $f_t$ satisfies $\|f_t'(W_t)\|_2 \leq G_2$, then there is a stepsize $\eta$ for which the regret against $W^\star \in \Omega$ is

$$R_\phi(T) \leq G_2\,\|W^\star\|_2\,\sqrt{T}.$$

Corollary 12. Let $p = 1 + 1/\log d$ in the Schatten Comid update of Eq. (17), and let $q = 1 + \log d$. If each $f_t$ satisfies $\|f_t'(W_t)\|_q \leq G_q$, then

$$R_\phi(T) \leq G_q\,\|W^\star\|_p\,\sqrt{T\log d} = O\big(G_\infty\,\|W^\star\|_1\,\sqrt{T\log d}\big),$$

where $G_\infty = \max_t \|f_t'(W_t)\|_\infty = \max_t \sigma_{\max}(f_t'(W_t))$.

Let us consider the actual implementation of the Comid update. Similar convergence rates to those above (with worse constants and a negative dependence on the spectrum of $r$) can be achieved via simple mirror descent, i.e. by linearizing $r(W)$. The advantage of Comid is that it achieves sparsity in the spectrum, as the following proposition demonstrates.

Proposition 13. Let $\Psi(w) = \frac{1}{2}\|w\|_p^2$ and $S_\tau(v) = \operatorname{sign}(v)[\,|v| - \tau\,]_+$ as in the prequel. For $p \in (1, 2]$, the update in Eq. (17) can be implemented as follows:

Compute SVD: $W_t = U_t\operatorname{diag}(\sigma(W_t))V_t^\top$
Gradient step: $\Theta_t = U_t\operatorname{diag}(\nabla\Psi(\sigma(W_t)))V_t^\top - \eta f_t'(W_t)$ $\qquad$ (18a)
Compute SVD: $\Theta_t = \tilde U_t\operatorname{diag}(\sigma(\Theta_t))\tilde V_t^\top$
Splitting update: $W_{t+1} = \tilde U_t\operatorname{diag}\big((\nabla\Psi)^{-1}(S_{\eta\lambda}(\sigma(\Theta_t)))\big)\tilde V_t^\top$ $\qquad$ (18b)

Note that the first SVD, Eq. (18a), is used for notational convenience only and need not be computed at each iteration, since it is maintained at the end of iteration $t$ via Eq. (18b). The computational requirements for Comid are thus the same as for standard mirror descent, which also requires an SVD computation at each step to compute $\nabla\psi(W)$. The last step of the update, Eq. (18b), applies the shrinkage/thresholding operator $S_{\eta\lambda}$ to the spectrum of $\Theta_t$, which introduces sparsity. Furthermore, due to the sign (and hence sparsity) preserving nature of the map $(\nabla\Psi)^{-1}$, the sparsity in the spectrum is maintained in $W_{t+1}$. Lastly, the special case for $p = 2$, the standard Frobenius norm update, was derived (but without rates or allowing stochastic gradients) by Ma et al. (2009), who report good empirical results for their algorithm. In trace-norm applications, we expect $\|W^\star\|_1$ to be small. Therefore, in such applications, our new Schatten-$p$ Comid algorithm with $p \approx 1$ should give strong performance, since $G_\infty$ can be much smaller than $G_2$.

Proof of Proposition 13: We know from the prequel that the Comid step is equivalent to

$$\nabla\psi(\tilde W_t) = \nabla\psi(W_t) - \eta f_t'(W_t) \quad\text{and}\quad W_{t+1} = \operatorname*{argmin}_W\, \big\{B_\psi(W, \tilde W_t) + \eta\, r(W)\big\}.$$

Since $W_t$ has singular value decomposition $U_t\operatorname{diag}(\sigma(W_t))V_t^\top$ and $\psi(W) = \Psi(\sigma(W))$ is unitarily invariant, we have $\nabla\psi(W_t) = U_t\operatorname{diag}(\nabla\Psi(\sigma(W_t)))V_t^\top$ (Lewis, 1995, Corollary 2.5). This means that the $\Theta_t$ computed in the second step above is simply $\nabla\psi(\tilde W_t)$. The proof essentially amounts to a reduction to the vector case, since the norms are unitarily invariant, and will be complete if we prove that

$$V = \operatorname*{argmin}_W\, \big\{B_\psi(W, \tilde W_t) + \tau\|W\|_1\big\}$$

has the unique solution

$$V = \tilde U_t\operatorname{diag}(\bar\sigma)\tilde V_t^\top, \quad\text{where}\quad \bar\sigma \triangleq (\nabla\Psi)^{-1}\big(S_\tau(\sigma(\nabla\psi(\tilde W_t)))\big), \qquad (19)$$

and $\tilde U_t\operatorname{diag}(\sigma(\tilde W_t))\tilde V_t^\top$ is the SVD of $\tilde W_t$. By subgradient optimality conditions, it is sufficient that the proposed solution $V$ satisfy

$$0_{d_1\times d_2} \in \nabla\psi(V) - \nabla\psi(\tilde W_t) + \tau\,\partial\|V\|_1.$$

Applying Lewis's Corollary 2.5, we can continue to use the orthonormal matrices $\tilde U_t$ and $\tilde V_t$, and we see that the proposed $V$ in Eq. (19) satisfies $\nabla\psi(V) = \tilde U_t\operatorname{diag}(\nabla\Psi(\bar\sigma))\tilde V_t^\top$ and $V = \tilde U_t\operatorname{diag}(\bar\sigma)\tilde V_t^\top$. We have thus reduced the problem to showing that

$$0 \in \nabla\Psi(\bar\sigma) - \nabla\Psi(\sigma(\tilde W_t)) + \tau\,\partial\|\bar\sigma\|_1,$$

since we chose the matrices to have the same singular vectors by construction. From Lemma 10 presented earlier, we already know that $\bar\sigma$ satisfies this inclusion if and only if $\nabla\Psi(\bar\sigma) = S_\tau(\nabla\Psi(\sigma(\tilde W_t)))$, which is indeed the case by the definition of $\bar\sigma$ (noting, of course, that $\sigma(\nabla\psi(\tilde W_t)) = \nabla\Psi(\sigma(\tilde W_t))$).
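For $p = 2$, $\nabla\Psi$ is the identity, and Proposition 13 collapses to a gradient step followed by soft-thresholding of the singular values; a minimal sketch (ours) of this Frobenius-norm special case, the update derived by Ma et al. (2009):

```python
import numpy as np

def schatten2_comid_step(W_t, G_t, eta, lam):
    """Eq. (17)-(18) with p = 2, where grad Psi is the identity: take a gradient
    step, then shrink the singular values by eta*lam (spectral soft-thresholding)."""
    Theta = W_t - eta * G_t                            # gradient step, Eq. (18a)
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    s = np.maximum(s - eta * lam, 0.0)                 # S_{eta*lam} on the spectrum, Eq. (18b)
    return U @ np.diag(s) @ Vt                         # sparse spectrum => low rank

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 4))
W_next = schatten2_comid_step(W, rng.normal(size=(6, 4)), eta=0.5, lam=1.0)
print(np.linalg.matrix_rank(W_next) <= np.linalg.matrix_rank(W))
```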

Acknowledgments

JCD was supported by a National Defense Science and Engineering Graduate (NDSEG) fellowship, and some of this work was completed while JCD was an intern at Google Research.

References

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357-367, 1967.

K. Ball, E. Carlen, and E. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115:463-482, 1994.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167-175, 2003.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161-168, 2008.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, September 2004.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004.

J. Duchi and Y. Singer. Online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

C. Gentile and N. Littlestone. The robustness of the p-norm algorithms. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the 19th Annual Conference on Computational Learning Theory, 2006.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

S. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 22. MIT Press, 2008.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.

A. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2:173-183, 1995.

P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16:964-979, 1979.

N. Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, July 1989.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Submitted, 2009.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics, Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical Report 743, Dept. of Statistics, University of California Berkeley, 2007.

B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Submitted, 2007.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.

S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, 2008.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, 2009.

N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2004.

P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Submitted, 2009.

S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.


More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Online Optimization with Gradual Variations

Online Optimization with Gradual Variations JMLR: Workshop and Conference Proceedings vol (0) 0 Online Optimization with Gradual Variations Chao-Kai Chiang, Tianbao Yang 3 Chia-Jung Lee Mehrdad Mahdavi 3 Chi-Jen Lu Rong Jin 3 Shenghuo Zhu 4 Institute

More information

Max-Margin Ratio Machine

Max-Margin Ratio Machine JMLR: Workshop and Conference Proceedings 25:1 13, 2012 Asian Conference on Machine Learning Max-Margin Ratio Machine Suicheng Gu and Yuhong Guo Department of Computer and Information Sciences Temple University,

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Bivariate Uniqueness in the Logistic Recursive Distributional Equation

Bivariate Uniqueness in the Logistic Recursive Distributional Equation Bivariate Uniqueness in the Logistic Recursive Distributional Equation Antar Bandyopadhyay Technical Report # 629 University of California Department of Statistics 367 Evans Hall # 3860 Berkeley CA 94720-3860

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

On the approximation of real powers of sparse, infinite, bounded and Hermitian matrices

On the approximation of real powers of sparse, infinite, bounded and Hermitian matrices On the approximation of real poers of sparse, infinite, bounded and Hermitian matrices Roman Werpachoski Center for Theoretical Physics, Al. Lotnikó 32/46 02-668 Warszaa, Poland Abstract We describe a

More information

A Modified ν-sv Method For Simplified Regression A. Shilton M. Palaniswami

A Modified ν-sv Method For Simplified Regression A. Shilton M. Palaniswami A Modified ν-sv Method For Simplified Regression A Shilton apsh@eemuozau), M Palanisami sami@eemuozau) Department of Electrical and Electronic Engineering, University of Melbourne, Victoria - 3 Australia

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

A simpler unified analysis of Budget Perceptrons

A simpler unified analysis of Budget Perceptrons Ilya Sutskever University of Toronto, 6 King s College Rd., Toronto, Ontario, M5S 3G4, Canada ILYA@CS.UTORONTO.CA Abstract The kernel Perceptron is an appealing online learning algorithm that has a drawback:

More information

Stat 8112 Lecture Notes Weak Convergence in Metric Spaces Charles J. Geyer January 23, Metric Spaces

Stat 8112 Lecture Notes Weak Convergence in Metric Spaces Charles J. Geyer January 23, Metric Spaces Stat 8112 Lecture Notes Weak Convergence in Metric Spaces Charles J. Geyer January 23, 2013 1 Metric Spaces Let X be an arbitrary set. A function d : X X R is called a metric if it satisfies the folloing

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Maximum Margin Matrix Factorization

Maximum Margin Matrix Factorization Maximum Margin Matrix Factorization Nati Srebro Toyota Technological Institute Chicago Joint work with Noga Alon Tel-Aviv Yonatan Amit Hebrew U Alex d Aspremont Princeton Michael Fink Hebrew U Tommi Jaakkola

More information

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a

More information

Stochastic dynamical modeling:

Stochastic dynamical modeling: Stochastic dynamical modeling: Structured matrix completion of partially available statistics Armin Zare www-bcf.usc.edu/ arminzar Joint work with: Yongxin Chen Mihailo R. Jovanovic Tryphon T. Georgiou

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Convergence rate of SGD

Convergence rate of SGD Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected

More information

UC Berkeley Department of Electrical Engineering and Computer Sciences. EECS 126: Probability and Random Processes

UC Berkeley Department of Electrical Engineering and Computer Sciences. EECS 126: Probability and Random Processes UC Berkeley Department of Electrical Engineering and Computer Sciences EECS 26: Probability and Random Processes Problem Set Spring 209 Self-Graded Scores Due:.59 PM, Monday, February 4, 209 Submit your

More information

Preliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1

Preliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1 90 8 80 7 70 6 60 0 8/7/ Preliminaries Preliminaries Linear models and the perceptron algorithm Chapters, T x + b < 0 T x + b > 0 Definition: The Euclidean dot product beteen to vectors is the expression

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Multi-class classification via proximal mirror descent

Multi-class classification via proximal mirror descent Multi-class classification via proximal mirror descent Daria Reshetova Stanford EE department resh@stanford.edu Abstract We consider the problem of multi-class classification and a stochastic optimization

More information

Inverse Time Dependency in Convex Regularized Learning

Inverse Time Dependency in Convex Regularized Learning Inverse Time Dependency in Convex Regularized Learning Zeyuan Allen Zhu 2*, Weizhu Chen 2, Chenguang Zhu 23, Gang Wang 2, Haixun Wang 2, Zheng Chen 2 Fundamental Science Class, Department of Physics, Tsinghua

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Econ 201: Problem Set 3 Answers

Econ 201: Problem Set 3 Answers Econ 20: Problem Set 3 Ansers Instructor: Alexandre Sollaci T.A.: Ryan Hughes Winter 208 Question a) The firm s fixed cost is F C = a and variable costs are T V Cq) = 2 bq2. b) As seen in class, the optimal

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information