arxiv: v2 [stat.ml] 3 Jun 2017

Size: px
Start display at page:

Download "arxiv: v2 [stat.ml] 3 Jun 2017"

Transcription

1 : A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient Lam M. Nguyen Jie Liu Katya Scheinberg Martin Takáč arxiv: v [stat.ml] 3 Jun 07 Abstract In this paper, we propose a StochAstic Recursive gradient algorithm (), as well as its practical variant +, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as, SGD, and A, admits a simple recursive framework for updating stochastic gradient estimates; when comparing to /A, does not require a storage of past gradients. The linear convergence rate of is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of, the property that does not possess. Numerical experiments demonstrate the efficiency of our algorithm.. Introduction We are interested in solving a problem of the form min w R d def P(w) = n i [n] f i (w), () where each f i, i [n] def = {,...,n}, is convex with a Lipschitz continuous gradient. Throughout the paper, we assume that there exists an optimal solutionw of (). Department of Industrial and Systems Engineering, Lehigh University, USA. On leave at The University of Oxford, UK. All authors were supported by NSF Grant CCF Katya Scheinberg was partially supported by NSF Grants DMS , CCF-3037 and CCF Correspondence to: Lam M. Nguyen <lamnguyen.mltd@gmail.com>, Jie Liu <jie.liu.08@gmail.com>, Katya Scheinberg <katyas@lehigh.edu>, Martin Takáč <Takac.MT@gmail.com>. Proceedings of the 34 th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 07. Copyright 07 by the author(s). Problems of this type arise frequently in supervised learning applications (Hastie et al., 009). Given a training set {(x i,y i )} n i= with x i R d,y i R, the least squares regression model, for example, is written as () withf i (w) def = (x T i w y i) + λ w, where denotes thel -norm. The l -regularized logistic regression for binary classification is written withf i (w) def = log(+exp( y i x T i w))+ λ w (y i {,}). In recent years, many advanced optimization methods have been developed for problem (). While the objective function is smooth and convex, the traditional optimization methods, such as gradient descent (GD) or Newton method are often impractical for this problem, when n the number of training samples and hence the number of f i s is very large. In particular, GD updates iterates as follows w t+ = w t η t P(w t ), t = 0,,,... Under strong convexity assumption on P and with appropriate choice ofη t, GD converges at a linear rate in terms of objective function valuesp(w t ). However, whennis large, computing P(w t ) at each iteration can be prohibitive. As an alternative, stochastic gradient descent (SGD), originating from the seminal work of Robbins and Monro in 95 (Robbins & Monro, 95), has become the method of choice for solving (). At each step, SGD picks an index i [n] uniformly at random, and updates the iterate as w t+ = w t η t f i (w t ), which is up-to n times cheaper than an iteration of a full gradient method. The convergence rate of SGD is slower than that of GD, in particular, it is sublinear in the strongly convex case. The tradeoff, however, is advantageous due to the tremendous per-iteration savings and the fact that low accuracy solutions are sufficient. This trade-off has been thoroughly analyzed in (Bottou, 998). Unfortunately, in practice SGD method is often too slow and its performance is too sensitive to the variance in the sample gradients f i (w t ). Use of mini-batches (averaging multiple sample gradients f i (w t )) was used in (Shalev-Shwartz et al., 007; We mark here that even though stochastic gradient is referred to as SG in literature, the term stochastic gradient descent (SGD) has been widely used in many important works of large-scale learning, including /A, SDCA, and MISO.

2 Table : Comparisons between different algorithms for strongly convex functions. κ = L/µ is the condition number. Method Complexity Fixed Learning Rate Low Storage Cost GD O(nκ log(/ǫ)) SGD O(/ǫ) O((n+κ)log(/ǫ)) /A O((n+κ)log(/ǫ)) O((n+κ)log(/ǫ)) Table : Comparisons between different algorithms for convex functions. Method Complexity GD O(n/ǫ) SGD O ( /ǫ ) O(n+( n/ǫ)) A O(n+(n/ǫ)) O((n+(/ǫ))log(/ǫ)) (one outer loop) O ( n+(/ǫ ) ) Cotter et al., 0; Takáč et al., 03) to reduce the variance and improve convergence rate by constant factors. Using diminishing sequence {η t } is used to control the variance (Shalev-Shwartz et al., 0; Bottou et al., 06), but the practical convergence of SGD is known to be very sensitive to the choice of this sequence, which needs to be hand-picked. Recently, a class of more sophisticated algorithms have emerged, which use the specific finite-sum form of () and combine some deterministic and stochastic aspects to reduce variance of the steps. The examples of these methods are /A (Le Roux et al., 0; Defazio et al., 04), SDCA (Shalev-Shwartz & Zhang, 03), (Johnson & Zhang, 03; Xiao & Zhang, 04), DIAG (Mokhtari et al., 07), MISO (Mairal, 03) and SGD (Konečný & Richtárik, 03), all of which enjoy faster convergence rate than that of SGD and use a fixed learning rate parameterη. In this paper we introduce a new method in this category,, which further improves several aspects of the existing methods. In Table we summarize complexity and some other properties of the existing methods and when applied to strongly convex problems. Although and have the same convergence rate, we introduce a practical variant of that outperforms in our experiments. In addition, theoretical results for complexity of the methods or their variants when applied to general convex functions have been derived (Schmidt et al., 06; Defazio et al., 04; Reddi et al., 06; Allen-Zhu & Yuan, 06; Allen-Zhu, 07). In Table we summarize the key complexity results, noting that convergence rate is now sublinear. Our Contributions. In this paper, we propose a novel algorithm which combines some of the good properties of existing algorithms, such as A and, while aiming to improve on both of these methods. In particular, our algorithm does not take steps along a stochastic gradient direction, but rather along an accumulated direction using past stochastic gradient information (as in A) and occasional exact gradient information (as in ). We summarize the key properties of the proposed algorithm below. Similarly to, s iterations are divided into the outer loop where a full gradient is computed and the inner loop where only stochastic gradient is computed. Unlike the case of, the steps of the inner loop of are based on accumulated stochastic information. Like /A and, has a sublinear rate of convergence for general convex functions, and a linear rate of convergence for strongly convex functions. uses a constant learning rate, whose size is larger than that of. We analyze and discuss the optimal choice of the learning rate and the number of inner loop steps. However, unlike /A but similar to, does not require a storage ofnpast stochastic gradients. We also prove a linear convergence rate (in the strongly convex case) for the inner loop of, the property that does not possess. We show that the variance of the steps inside the inner loop goes to zero, thus is theoretically more stable and reliable than. We provide a practical variant of based on the convergence properties of the inner loop, where the simple stable stopping criterion for the inner loop is used (see Section 4 for more details). This variant shows how can be made more stable than in practice.. Stochastic Recursive Gradient Algorithm Now we are ready to present our (Algorithm ). The key step of the algorithm is a recursive update of the stochastic gradient estimate ( update) v t = f it (w t ) f it (w t )+v t, () followed by the iterate update: w t+ = w t ηv t. (3) For comparison, update can be written in a similar way as v t = f it (w t ) f it (w 0 )+v 0. (4)

3 Algorithm Parameters: the learning rate η > 0 and the inner loop size m. Initialize: w 0 Iterate: fors =,,... do w 0 = w s v 0 = n n i= f i(w 0 ) w = w 0 ηv 0 Iterate: fort =,...,m do Samplei t uniformly at random from[n] v t = f it (w t ) f it (w t )+v t w t+ = w t ηv t end for Set w s = w t with t chosen uniformly at random from {0,,...,m} end for Observe that in, v t is an unbiased estimator of the gradient, while it is not true for. Specifically, E[v t F t ] = P(w t ) P(w t )+v t P(w t ), (5) where 3 F t = σ(w 0,i,i,...,i t ) is the σ-algebra generated by w 0,i,i,...,i t ; F 0 = F = σ(w 0 ). Hence, is different from SGD and type of methods, however, the following total expectation holds, E[v t ] = E[ P(w t )], differentiating from /A. is similar to since they both contain outer loops which require one full gradient evaluation per outer iteration followed by one full gradient descent step with a given learning rate. The difference lies in the inner loop, where updates the stochastic step direction v t recursively by adding and subtracting component gradients to and from the previousv t (t ) in (). Each inner iteration evaluatesstochastic gradients and hence the total work per outer iteration iso(n+m) in terms of the number of gradient evaluations. Note that due to its nature, without running the inner loop, i.e., m =, reduces to the GD algorithm. 3. Theoretical Analysis To proceed with the analysis of the proposed algorithm, we will make the following common assumptions. Assumption (L-smooth). Eachf i : R d R, i [n], is L-smooth, i.e., there exists a constant L > 0 such that f i (w) f i (w ) L w w, w,w R d. E[ F t] = E it [ ], which is expectation with respect to the random choice of indexi t (conditioned on w 0,i,i,...,i t ). 3 F t also contains all the information of w 0,...,w t as well as v 0,...,v t. Note that this assumption implies that P(w) = n n i= f i(w) is also L-smooth. The following strong convexity assumption will be made for the appropriate parts of the analysis, otherwise, it would be dropped. Assumption a (µ-strongly convex). The function P : R d R, is µ-strongly convex, i.e., there exists a constant µ > 0 such that w,w R d, P(w) P(w )+ P(w ) T (w w )+ µ w w. Another, stronger, assumption of µ-strong convexity for () will also be imposed when required in our analysis. Note that Assumption b implies Assumption a but not vice versa. Assumption b. Each function f i : R d R, i [n], is strongly convex with µ > 0. Under Assumption a, let us define the (unique) optimal solution of () as w, Then strong convexity of P implies that µ[p(w) P(w )] P(w), w R d. (6) We note here, for future use, that for strongly convex functions of the form (), arising in machine learning applications, the condition number is defined as κ def = L/µ. Furthermore, we should also notice that Assumptions a and b both cover a wide range of problems, e.g. l -regularized empirical risk minimization problems with convex losses. Finally, as a special case of the strong convexity of all f i s with µ = 0, we state the general convexity assumption, which we will use for convergence analysis. Assumption 3. Each function f i : R d R, i [n], is convex, i.e., f i (w) f i (w )+ f i (w ) T (w w ), i [n]. Again, we note that Assumption b implies Assumption 3, but Assumption a does not. Hence in our analysis, depending on the result we aim at, we will require Assumption 3 to hold by itself, or Assumption a and Assumption 3 to hold together, or Assumption b to hold by itself. We will always use Assumption. Our iteration complexity analysis aims to bound the number of outer iterations T (or total number of stochastic gradient evaluations) which is needed to guarantee that P(w T ) ǫ. In this case we will say that w T is an ǫ-accurate solution. However, as is common practice for stochastic gradient algorithms, we aim to obtain the bound on the number of iterations, which is required to guarantee the bound on the expected squared norm of a gradient, i.e., E[ P(w T ) ] ǫ. (7)

4 x A Simple Example with x x A Simple Example with x Figure : A two-dimensional example ofmin wp(w) withn = 5 for (left) and (right). vt 0 4 rcv, Moving Average with Span vt 0 3 rcv, Moving Average with Span Figure : An example of l -regularized logistic regression on rcv training dataset for,, and with multiple outer iterations (left) and a single outer iteration (right). 3.. Linearly Diminishing Step-Size in a Single Inner Loop The most important property of the algorithm is the variance reduction of the steps. This property holds as the number of outer iteration grows, but it does not hold, if only the number of inner iterations increases. In other words, if we simply run the inner loop for many iterations (without executing additional outer loops), the variance of the steps does not reduce in the case of, while it goes to zero in the case of. To illustrate this effect, let us take a look at Figures and. In Figure, we applied one outer loop of and to a sum of 5 quadratic functions in a twodimensional space, where the optimal solution is at the origin, the black lines and black dots indicate the trajectory of each algorithm and the red point indicates the final iterate. Initially, both and take steps along stochastic gradient directions towards the optimal solution. However, later iterations of wander randomly around the origin with large deviation from it, while follows a much more stable convergent trajectory, with a final iterate falling in a small neighborhood of the optimal solution. In Figure, the x-axis denotes the number of effective passes which is equivalent to the number of passes through all of the data in the dataset, the cost of each pass being equal to the cost of one full gradient evaluation; and y-axis represents v t. Figure shows the evolution of v t for,, (SGD with decreasing learning rate) and (an accelerated version of GD (Beck & Teboulle, 009)) withm = 4n, where the left plot shows the trend over multiple outer iterations and the right plot shows a single outer iteration 4. We can see that for, v t decreases over the outer iterations, while it has an increasing trend or oscillating trend for each inner loop. In contrast, enjoys decreasing trends both in the outer and the inner loop iterations. We will now show that the stochastic steps computed by converge linearly in the inner loop. We present two linear convergence results based on our two different assumptions of µ-strong convexity. These results substantiate our conclusion that uses more stable stochastic gradient estimates than. The following theorem is our first result to demonstrate the linear convergence of our stochastic recursive step v t. Theorem a. Suppose that Assumptions, a and 3 hold. Consider v t defined by () in (Algorithm ) with η < /L. Then, for anyt, [ ( E[ v t ] )µ η ] E[ v t ] [ ( )µ η ] t E[ P(w0 ) ]. This result implies that by choosing η = O(/L), we obtain the linear convergence of v t in expectation with the rate ( /κ ). Below we show that a better convergence rate can be obtained under a stronger convexity assumption. Theorem b. Suppose that Assumptions and b hold. Consider v t defined by () in (Algorithm ) with η /(µ+l). Then the following bound holds, t, ( E[ v t ] µlη µ+l ) E[ v t ] ( µlη µ+l) te[ P(w0 ) ]. Again, by setting η = O(/L), we derive the linear convergence with the rate of ( /κ), which is a significant improvement over the result of Theorem a, when the problem is severely ill-conditioned. 3.. Convergence Analysis In this section, we derive the general convergence rate results for Algorithm. First, we present two important Lemmas as the foundation of our theory. Then, we proceed to prove sublinear convergence rate of a single outer iteration when applied to general convex functions. In the end, we prove that the algorithm with multiple outer iterations has linear convergence rate in the strongly convex case. 4 In the plots of Figure, since the data for is noisy, we smooth it by using moving average filters with spans 00 for the left plot and 0 for the right one.

5 We begin with proving two useful lemmas that do not require any convexity assumption. The first Lemma bounds the sum of expected values of P(w t ). The second, Lemma, boundse[ P(w t ) v t ]. Lemma. Suppose that Assumption holds. Consider (Algorithm ). Then, we have E[ P(w t ) ] η E[P(w 0) P(w )] (8) + E[ P(w t ) v t ] ( Lη) E[ v t ]. Lemma. Suppose that Assumption holds. Consider v t defined by () in (Algorithm ). Then for anyt, E[ P(w t ) v t ] = t E[ v j v j ] j= t E[ P(w j ) P(w j ) ]. j= Now we are ready to provide our main theoretical results GENERAL CONVEX CASE Following from Lemma, we can obtain the following upper bound for E[ P(w t ) v t ] for convex functions f i,i [n]. Lemma 3. Suppose that Assumptions and 3 hold. Consider v t defined as () in (Algorithm ) with η < /L. Then we have that for anyt, E[ P(w t ) v t ] [ ] E[ v 0 ] E[ v t ] E[ v 0 ]. (9) Using the above lemmas, we can state and prove one of our core theorems as follows. Theorem. Suppose that Assumptions and 3 hold. Consider (Algorithm ) with η /L. Then for any s, we have E[ P( w s ) ] η(m+) E[P( w s ) P(w )] + E[ P( w s ) ]. (0) Proof. Sincev 0 = P(w 0 ) implies P(w 0 ) v 0 = 0 then by Lemma 3, we can write m E[ P(w t) v t ] m E[ v 0 ]. () Hence, by Lemma withη /L, we have m E[ P(w t) ] η E[P(w 0) P(w )]+ m E[ P(w t) v t ] () η E[P(w 0) P(w )]+ m E[ v 0 ]. () Since we are considering one outer iteration, with s, then we have v 0 = P(w 0 ) = P( w s ) (since w 0 = w s ), and w s = w t, where t is picked uniformly at random from{0,,...,m}. Therefore, the following holds, E[ P( w s ) ] = m+ m E[ P(w t) ] () η(m+) E[P( w s ) P(w )] + E[ P( w s ) ]. Theorem, in the case whenη /L implies that E[ P( w s ) ] η(m+) E[P( w s ) P(w )] +E[ P( w s ) ]. By choosing the learning rateη = L(m+) (withmsuch that L(m+) /L) we can derive the following convergence result, E[ P( w s ) ] L m+ E[P( w s ) P(w )+ P( w s ) ]. Clearly, this result shows a sublinear convergence rate for under general convexity assumption within a single inner loop, with increasing m, and consequently, we have the following result for complexity bound. Corollary. Suppose that Assumptions and 3 hold. Consider (Algorithm ) within a single outer iteration with the learning rateη = L(m+) wherem L is the total number of iterations, then P(w t ) converges sublinearly in expectation with a rate of L m+, and therefore, the total complexity to achieve anǫ-accurate solution defined in (7) is O(n+/ǫ ). We now turn to estimating convergence of with multiple outer steps. Simply using Theorem for each of the outer steps we have the following result. Theorem 3. Suppose that Assumptions and 3 hold. Consider (Algorithm ) and define δ k = η(m+) E[P( w k) P(w )], k = 0,,...,s, andδ = max 0 k s δ k. Then we have E[ P( w s ) ] α s ( P( w 0 ) ), (3) ( ) where = δ + ( ), andα =.

6 Learning Rate Evolutions of Learning Rates σ m () α m () m 0 7 Convergence Rate 0 6 Evolutions of Convergence Rates σ m () α m () m 0 7 Convergence Rate Evolutions of Convergence Rates σ m () α m () m 0 7 Figure 3: Theoretical comparisons of learning rates (left) and convergence rates (middle and right) withn =,000,000 for and in one inner loop. Based on Theorem 3, we have the following total complexity for in the general convex case. Corollary. Let us choose = ǫ/4, α = / (with η = /(3L)), and m = O(/ǫ) in Theorem 3. Then, the total complexity to achieve an ǫ-accuracy solution defined in (7) iso((n+(/ǫ))log(/ǫ)) STRONGLY CONVEX CASE We now turn to the discussion of the linear convergence rate of under the strong convexity assumption on P. From Theorem, for any s, using property (6) of theµ-strongly convexp, we have E[ P( w s ) ] η(m+) E[P( w s ) P(w )] (6) ( and equivalently, + µη(m+) + E[ P( w s ) ] ) E[ P( w s ) ], E[ P( w s ) ] σ m E[ P( w s ) ]. (4) def Let us define σ m = µη(m+) +. Then by choosing η and m such that σ m <, and applying (4) recursively, we are able to reach the following convergence result. Theorem 4. Suppose that Assumptions, a and 3 hold. Consider (Algorithm ) with the choice ofη andm such that Then, we have σ m def = µη(m+) + <. (5) E[ P( w s ) ] (σ m ) s P( w 0 ). Remark. Theorem 4 implies that any η < /L will work for. Let us compare our convergence rate to that of. The linear rate of, as presented in (Johnson & Zhang, 03), is given by α m = µη( Lη)m + <. We observe that it implies that the learning rate has to satisfy η < /(4L), which is a tighter restriction than η < /L required by. In addition, with the same values of m and η, the rate or convergence of (the outer iterations) of is always smaller than that of. σ m = µη(m+) + = µη(m+) + /() < µη( Lη)m + 0.5/() = α m. Remark. To further demonstrate the better convergence properties of, let us consider following optimization problem min σ m, 0<η</L min α m, 0<η</4L which can be interpreted as the best convergence rates for different values of m, for both and. After simple calculations, we plot both learning rates and the corresponding theoretical rates of convergence, as shown in Figure 3, where the right plot is a zoom-in on a part of the middle plot. The left plot shows that the optimal learning rate for is significantly larger than that of, while the other two plots show significant improvement upon outer iteration convergence rates for over. Based on Theorem 4, we are able to derive the following total complexity for in the strongly convex case. Corollary 3. Fix ǫ (0,), and let us run with η = /(L) and m = 4.5κ for T iterations where T = log( P( w 0 ) /ǫ)/log(9/7), then we can derive an ǫ-accuracy solution defined in (7). Furthermore, we can obtain the total complexity of, to achieve the ǫ-accuracy solution, as O((n + κ) log(/ǫ)). 4. A Practical Variant While is an efficient variance-reducing stochastic gradient method, one of its main drawbacks is the sensitivity of the practical performance with respect to the choice of m. It is know that m should be around O(κ), 5 while it still remains unknown that what the exact best choice is. In this section, we propose a practical variant of as 5 In practice, when n is large, P(w) is often considered as a regularized Empirical Loss Minimization problem with regularization parameter λ =, then κ O(n). n

7 + (Algorithm ), which provides an automatic and adaptive choice of the inner loop sizem. Guided by the linear convergence of the steps in the inner loop, demonstrated in Figure, we introduce a stopping criterion based on the values of v t while upper-bounding the total number of steps by a large enough m for robustness. The other modification compared to (Algorithm ) is the more practical choice w s = w t, where t is the last index of the particular inner loop, instead of randomly selected intermediate index. Algorithm + Parameters: the learning rateη > 0,0 < γ and the maximum inner loop size m. Initialize: w 0 Iterate: fors =,,... do w 0 = w s v 0 = n n i= f i(w 0 ) w = w 0 ηv 0 t = while v t > γ v 0 and t < m do Samplei t uniformly at random from[n] v t = f it (w t ) f it (w t )+v t w t+ = w t ηv t t = t+ end while Set w s = w t end for Different from, + provides a possibility of earlier termination and unnecessary careful choices of m, and it also covers the classical gradient descent when we set γ = (since the while loop does not proceed). In Figure 4 we present the numerical performance of + with different γs on rcv and news0 datasets. The size of the inner loop provides a trade-off between the fast sublinear convergence in the inner loop and linear convergence in the outer loop. From the results, it appears that γ = /8 is the optimal choice. With a larger γ, i.e. γ > /8, the iterates in the inner loop do not provide sufficient reduction, before another full gradient computation is required, while with γ < /8 an unnecessary number of inner steps is performed without gaining substantial progress. Clearly γ is another parameter that requires tuning, however, in our experiments, the performance of + has been very robust with respect to the choices of γ and did not vary much from one data set to another. Similarly to, v t decreases in the outer iterations of +. However, unlike, + also inherits from the consistent decrease of v t in expectation in the inner loops. It is not possible to apply the same idea of adaptively terminating the inner loop of 0 0 rcv γ= γ=/ γ=/4 γ=/8 γ=/6 γ=/64 γ=/ news0 γ= γ=/ γ=/4 γ=/8 γ=/6 γ=/64 γ=/ Figure 4: An example of l -regularized logistic regression on rcv (left) and news0 (right) training datasets for + with different γs on loss residuals P(w) P(w ). Table 3: Summary of datasets used for experiments. Dataset d n (train) Sparsity n (test) L covtype ,709.% 74, ijcnn 9, % 49, news0,355,9 3, % 5, rcv 47,36 677, % 0, based on the reduction in v t, as v t may have side fluctuations as shown in Figure. 5. Numerical Experiments To support the theoretical analyses and insights, we present our empirical experiments, comparing and + with the state-of-the-art first-order methods for l -regularized logistic regression problems with f i (w) = log(+exp( y i x T i w))+ λ w, on datasets covtype, ijcnn, news0 and rcv 6. For ijcnn and rcv we use the predefined testing and training sets, while covtype and news0 do not have test data, hence we randomly split the datasets with 70% for training and 30% for testing. Some statistics of the datasets are summarized in Table 3. The penalty parameter λ is set to /n as is common practice (Le Roux et al., 0). Note that like /SGD and /A, also allows an efficient sparse implementation named lazy updates (Konečný et al., 06). We conduct and compare numerical results of with,, and. (Johnson & Zhang, 03) and (Le Roux et al., 0) are classic modern stochastic methods. is SGD with decreasing learning rate η = η 0 /(k +) where k is the number of effective passes and η 0 is some initial constant learning rate. (Beck & Teboulle, 009) is the Fast Iterative Shrinkage-Thresholding Algorithm, wellknown as an efficient accelerated version of the gradient descent. Even though for each method, there is a theoretical safe learning rate, we compare the results for the best learning rates in hindsight. Figure 5 shows numerical results in terms of loss residuals 6 All datasets are available at edu.tw/ cjlin/libsvmtools/datasets/.

8 0 0 covtype 0 0 ijcnn 0 0 news0 0 0 rcv Test Error Rate covtype Test Error Rate ijcnn Test Error Rate news Test Error Rate rcv Figure 5: Comparisons of loss residuals P(w) P(w ) (top) and test errors (bottom) from different modern stochastic methods on covtype, ijcnn, news0 and rcv. Table 4: Summary of best parameters for all the algorithms on different datasets. Dataset (m,η ) (m,η ) (η ) (η ) (η ) covtype (n, 0.9/L) (n, 0.8/L) 0.3/L 0.06/L 50/L ijcnn (0.5n, 0.8/L) (n, 0.5/L) 0.7/L 0./L 90/L news0 (0.5n, 0.9/L) (n, 0.5/L) 0./L 0./L 30/L rcv (0.7n, 0.7/L) (0.5n, 0.9/L) 0./L 0./L 0/L (top) and test errors (bottom) on the four datasets, is sometimes comparable or a little worse than other methods at the beginning. However, it quickly catches up to or surpasses all other methods, demonstrating a faster rate of decrease across all experiments. We observe that on covtype and rcv,, and are comparable with some advantage of on covtype. On ijcnn and news0, and consistently surpass the other methods. In particular, to validate the efficiency of our practical variant +, we provide an insight into how important the choices of m and η are for and in Table 4 and Figure 6. Table 4 presents the optimal choices of m and η for each of the algorithm, while Figure 6 shows the behaviors of and with different choices of m for covtype and ijcnn, where m s denote the best choices. In Table 4, the optimal learning rates of vary less among different datasets compared to all the other methods and they approximate the theoretical upper bound for (/L); on the contrary, for the other methods the empirical optimal rates can exceed their theoretical limits ( with /(4L), with /(6L), with /L). This empirical studies suggest that it is much easier to tune and find the ideal learning rate for. As observed in Figure 6, the behaviors of both and are quite sensitive to the choices of m. With improper choices of m, the loss residuals can be increased considerably from 0 5 to 0 3 on both covtype in 40 effective passes and ijcnn in 7 effective passes for m*/8 m*/4 m*/ m* m* 4m* 8m* (covtype) (covtype) m*/8 m*/4 m*/ m* m* 4m* 8m* (ijcnn) m*/8 m*/4 m*/ m* m* 4m* 8m* (ijcnn) m*/8 m*/4 m*/ m* m* 4m* 8m* Figure 6: Comparisons of loss residuals P(w) P(w ) for different inner loop sizes with (top) and (bottom) on covtype and ijcnn. /. 6. Conclusion We propose a new variance reducing stochastic recursive gradient algorithm, which combines some of the properties of well known existing algorithms, such as A and. For smooth convex functions, we show a sublinear convergence rate, while for strongly convex cases, we prove the linear convergence rate and the computational complexity as those of and. However, compared to, s convergence rate constant is smaller and the algorithms is more stable both theoretically and numerically. Additionally, we prove the linear convergence for inner loops of which support the claim of stability. Based on this convergence we derive a practical version of, with a simple stopping criterion for the inner loops.

9 Acknowledgements The authors would like to thank the reviewers for useful suggestions which helped to improve the exposition in the paper. References Allen-Zhu, Zeyuan. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. Proceedings of the 49th Annual ACM on Symposium on Theory of Computing (to appear), 07. Allen-Zhu, Zeyuan and Yuan, Yang. Improved for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, pp , 06. Beck, Amir and Teboulle, Marc. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, ():83 0, 009. Bottou, Léon. Online learning and stochastic approximations. In Saad, David (ed.), Online Learning in Neural Networks, pp Cambridge University Press, New York, NY, USA, 998. ISBN Bottou, Léon, Curtis, Frank E, and Nocedal, Jorge. Optimization methods for large-scale machine learning. arxiv: , 06. Cotter, Andrew, Shamir, Ohad, Srebro, Nati, and Sridharan, Karthik. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pp , 0. Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. A: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pp , 04. Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, nd edition, 009. Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp , 03. Konečný, Jakub, Liu, Jie, Richtárik, Peter, and Takáč, Martin. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 0:4 55, 06. Konečný, Jakub and Richtárik, Peter. Semi-stochastic gradient descent methods. arxiv:3.666, 03. Le Roux, Nicolas, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pp , 0. Mairal, Julien. Optimization with first-order surrogate functions. In ICML, pp , 03. Mokhtari, Aryan, Gürbüzbalaban, Mert, and Ribeiro, Alejandro. A double incremental aggregated gradient method with linear convergence rate for large-scale optimization. Proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing (to appear), 07. Nesterov, Yurii. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London, 004. ISBN Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Póczos, Barnabás, and Smola, Alexander J. Stochastic variance reduction for nonconvex optimization. In ICML, pp , 06. Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, (3): , 95. Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pp. 30, 06. Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 4(): , 03. Shalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pp , 007. Shalev-Shwartz, Shai, Singer, Yoram, Srebro, Nathan, and Cotter, Andrew. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 7(): 3 30, 0. Takáč, Martin, Bijral, Avleen Singh, Richtárik, Peter, and Srebro, Nathan. Mini-batch primal and dual methods for SVMs. In ICML, pp , 03. Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 4(4): , 04.

10 Supplementary Material A. Technical Results Lemma 4 (Theorem..5 in (Nesterov, 004)). Suppose thatf is convex andl-smooth. Then, for anyw, w R d, f(w) f(w )+ f(w ) T (w w )+ L w w, (6) f(w) f(w )+ f(w ) T (w w )+ L f(w) f(w ), (7) ( f(w) f(w )) T (w w ) L f(w) f(w ). (8) Note that (6) does not require the convexity off. Lemma 5 (Theorem.. in (Nesterov, 004)). Suppose that f is µ-strongly convex and L-smooth. Then, for any w, w R d, ( f(w) f(w )) T (w w ) µl µ+l w w + µ+l f(w) f(w ). (9) Lemma 6 (Choices ofmandη). Consider the rate of convergenceσ m in Theorem 4. If we chooseη = /(θl) withθ > and fixσ m, then the best choice ofmis m = (θ ) κ, where κ def = L/µ, with θ calculated as: θ = σ m ++ σ m + σ m. Furthermore, we require θ > + / forσ m <. B. Proofs B.. Proof of Lemma By Assumption andw t+ = w t ηv t, we have E[P(w t+ )] (6) E[P(w t )] ηe[ P(w t ) T v t ]+ Lη E[ v t ] = E[P(w t )] η E[ P(w t) ]+ η E[ P(w t) v t ] where the last equality follows from the fact a T b = By summing overt = 0,...,m, we have E[P(w m+ )] E[P(w 0 )] η E[ P(w t ) ]+ η [ a + b a b ]. ( ) η Lη E[ v t ], ( ) η E[ P(w t ) v t ] Lη m E[ v t ],

11 which is equivalent to (η > 0): E[ P(w t ) ] η E[P(w 0) P(w m+ )]+ E[ P(w t ) v t ] ( Lη) E[ v t ] η E[P(w 0) P(w )]+ E[ P(w t ) v t ] ( Lη) E[ v t ], where the last inequality follows since w is a global minimizer ofp. B.. Proof of Lemma Note thatf j contains all the information ofw 0,...,w j as well asv 0,...,v j. Forj, we have E[ P(w j ) v j F j ] = E[ [ P(w j ) v j ]+[ P(w j ) P(w j )] [v j v j ] F j ] where the last equality follows from = P(w j ) v j + P(w j ) P(w j ) +E[ v j v j F j ] +( P(w j ) v j ) T ( P(w j ) P(w j )) ( P(w j ) v j ) T E[v j v j F j ] ( P(w j ) P(w j )) T E[v j v j F j ] = P(w j ) v j P(w j ) P(w j ) +E[ v j v j F j ], E[v j v j F j ] () = E[ f ij (w j ) f ij (w j ) F j ] = P(w j ) P(w j ). By taking expectation for the above equation, we have E[ P(w j ) v j ] = E[ P(w j ) v j ] E[ P(w j ) P(w j ) ]+E[ v j v j ]. Note that P(w 0 ) v 0 = 0. By summing overj =,...,t(t ), we have B.3. Proof of Lemma 3 Forj, we have E[ P(w t ) v t ] = t E[ v j v j ] j= t E[ P(w j ) P(w j ) ]. E[ v j F j ] = E[ v j ( f ij (w j ) f ij (w j )) F j ] ] = v j +E [ f ij (w j ) f ij (w j ) η ( f i j (w j ) f ij (w j )) T (w j w j ) F j (8) [ ] v j +E f ij (w j ) f ij (w j ) Lη f i j (w j ) f ij (w j ) F j ( ) = v j + E[ f ij (w j ) f ij (w j ) F j ] ( ) () = v j + E[ v j v j F j ], which, if we take expectation, implies that whenη < /L. j= E[ v j v j ] [ ] E[ v j ] E[ v j ],

12 By summing the above inequality overj =,...,t(t ), we have By Lemma, we have B.4. Proof of Lemma 6 t j= E[ P(w t ) v t ] E[ v j v j ] [ ] E[ v 0 ] E[ v t ]. (0) t j= E[ v j v j ] (0) [ ] E[ v 0 ] E[ v t ]. Withη = /(θl) andκ = L/µ, the rate of convergenceα m can be written as which is equivalent to (5) σ m = µη(m+) + = θl µ(m+) + /θ /θ = m(θ) def θ(θ ) = m = σ m (θ ) κ. ( ) κ θ + m+ θ, Since σ m is considered fixed, then the optimal choice of m in terms of θ can be solved from min θ m(θ), or equivalently, 0 = ( m)/( θ) = m (θ), and therefore we have the equation with the optimalθ satisfying and by plugging it intom(θ) we conclude the optimalm: σ m = (4θ )/(θ ), () m = m(k ) = (K ) κ, while by solving forθ in () and taking into account that θ >, we have the optimal choice ofθ: Obviously, forσ m <, we requireθ > + /. B.5. Proof of Theorem a θ = σ m ++ σ m + σ m. Fort, we have P(w t ) P(w t ) = n n n [ f i (w t ) f i (w t )] i= n f i (w t ) f i (w t ) i= = E[ f it (w t ) f it (w t ) F t ]. () Using the proof of Lemma 3, fort, we have ( ) E[ v t F t ] v t + E[ f it (w t ) f it (w t ) F t ] () ( ) v t + P(w t ) P(w t ) ( ) v t + µ η v t.

13 Note that < 0 since η < /L. The last inequality follows by the strong convexity of P, that is, µ w t w t P(w t ) P(w t ) and the fact that w t = w t ηv t. By taking the expectation and applying recursively, we have [ ( E[ v t ] )µ η ] E[ v t ] [ ( )µ η ] t E[ v0 ] [ ( = )µ η ] t E[ P(w0 ) ]. B.6. Proof of Theorem b We obviously havee[ v 0 F 0 ] = P(w 0 ). Fort, we have E[ v t F t ] () = E[ v t ( f it (w t ) f it (w t )) F t ] (3) = v t +E[ f it (w t ) f it (w t ) η ( f i t (w t ) f it (w t )) T (w t w t ) F t ] (9) v t µlη µ+l v t +E[ f it (w t ) f it (w t ) F t ] η µ+l E[ f i t (w t ) f it (w t ) F t ] = ( µlη µ+l ) v t +( η µ+l )E[ f i t (w t ) f it (w t ) F t ] ( ) v t, µlη µ+l where in last inequality we have used that η /(µ + L). By taking the expectation and applying recursively, the desired result is achieved. B.7. Proof of Theorem 3 By Theorem, we have E[ P( w s ) ] where the second last equality follows since Hence, the desired result is achieved. B.8. Proof of Corollary η(m+) E[P( w s ) P(w )]+ E[ P( w s ) ] = δ s +αe[ P( w s ) ] δ s +αδ s + +α s δ 0 +α s P( w 0 ) δ +αδ + +α s δ +α s P( w 0 ) δ αs α +αs P( w 0 ) = ( α s )+α s P( w 0 ) = +α s ( P( w 0 ) ), ( ) ( ) δ α = δ = δ + =. ( ) Based on Theorem 3, if we would aim for anǫ-accuracy solution, we can choose = ǫ/4 andα = / (withη = /(3L)). To obtain the convergence to an ǫ-accuracy solution, we need to have δ = O(ǫ), or equivalently, m = O(/ǫ). Then we (3)

14 have E[ P( w s ) ] (3) + E[ P( w s ) ] + + E[ P( w s ) ] ( ) s + s P( w 0) + s P( w 0). To guarantee that E[ P( w s ) ] ǫ, it is sufficient to make P( w s 0 ) 3 4ǫ, ors = O(log(/ǫ)). This implies the total complexity to achieve an ǫ-accuracy solution is(n + m)s = O((n +(/ǫ)) log(/ǫ)). B.9. Proof of Corollary 3 Based on Lemma 6 and Theorem 4, let us pick θ =, i.e, then we have m = 4.5κ. So let us run with η = /(L) andm = 4.5κ, then we can calculateσ m in (5) as σ m = µη(m+) + = [µ/(l)](4.5κ+) + / / < = 7 9. According to Theorem 4, if we run for T iterations where then we have T = log( P( w 0 ) /ǫ)/log(9/7) = log 7/9 (ǫ/ P( w 0 ) ) log 7/9 (ǫ/ P( w 0 ) ), E[ P( w T ) ] (σ m ) T P( w 0 ) < (7/9) T P( w 0 ) (7/9) log 7/9 (ǫ/ P( w0) ) P( w 0 ) = ǫ, thus we can derive (7). If we consider the number of gradient evaluations as the main computational complexity, then the total complexity can be obtained as (n+m)t = O((n+κ)log(/ǫ)).

arxiv: v1 [stat.ml] 20 May 2017

arxiv: v1 [stat.ml] 20 May 2017 Stochastic Recursive Gradient Algorithm for Nonconvex Optimization arxiv:1705.0761v1 [stat.ml] 0 May 017 Lam M. Nguyen Industrial and Systems Engineering Lehigh University, USA lamnguyen.mltd@gmail.com

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Coordinate Descent Faceoff: Primal or Dual?

Coordinate Descent Faceoff: Primal or Dual? JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

arxiv: v2 [stat.ml] 16 Jun 2015

arxiv: v2 [stat.ml] 16 Jun 2015 Semi-Stochastic Gradient Descent Methods Jakub Konečný Peter Richtárik arxiv:1312.1666v2 [stat.ml] 16 Jun 2015 School of Mathematics University of Edinburgh United Kingdom June 15, 2015 (first version:

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

1. Introduction. We consider the problem of minimizing the sum of two convex functions: = F (x)+r(x)}, f i (x),

1. Introduction. We consider the problem of minimizing the sum of two convex functions: = F (x)+r(x)}, f i (x), SIAM J. OPTIM. Vol. 24, No. 4, pp. 2057 2075 c 204 Society for Industrial and Applied Mathematics A PROXIMAL STOCHASTIC GRADIENT METHOD WITH PROGRESSIVE VARIANCE REDUCTION LIN XIAO AND TONG ZHANG Abstract.

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

arxiv: v2 [cs.na] 3 Apr 2018

arxiv: v2 [cs.na] 3 Apr 2018 Stochastic variance reduced multiplicative update for nonnegative matrix factorization Hiroyui Kasai April 5, 208 arxiv:70.078v2 [cs.a] 3 Apr 208 Abstract onnegative matrix factorization (MF), a dimensionality

More information

arxiv: v2 [cs.lg] 4 Oct 2016

arxiv: v2 [cs.lg] 4 Oct 2016 Appearing in 2016 IEEE International Conference on Data Mining (ICDM), Barcelona. Efficient Distributed SGD with Variance Reduction Soham De and Tom Goldstein Department of Computer Science, University

More information

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems

More information

Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models

Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models Hongchang Gao, Heng Huang Department of Electrical and Computer Engineering, University of Pittsburgh, USA hongchanggao@gmail.com,

More information

Minimizing finite sums with the stochastic average gradient

Minimizing finite sums with the stochastic average gradient Math. Program., Ser. A (2017) 162:83 112 DOI 10.1007/s10107-016-1030-6 FULL LENGTH PAPER Minimizing finite sums with the stochastic average gradient Mark Schmidt 1 Nicolas Le Roux 2 Francis Bach 3 Received:

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

arxiv: v5 [cs.lg] 23 Dec 2017 Abstract

arxiv: v5 [cs.lg] 23 Dec 2017 Abstract Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao arxiv:1605.02711v5 [cs.lg] 23 Dec 2017 Abstract We

More information

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

On the Convergence Rate of Incremental Aggregated Gradient Algorithms

On the Convergence Rate of Incremental Aggregated Gradient Algorithms On the Convergence Rate of Incremental Aggregated Gradient Algorithms M. Gürbüzbalaban, A. Ozdaglar, P. Parrilo June 5, 2015 Abstract Motivated by applications to distributed optimization over networks

More information

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal Northwestern University Goldman Lecture, Sept 2016 1 Collaborators Raghu Bollapragada Northwestern University Richard Byrd University of Colorado

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Approximate Newton Methods and Their Local Convergence

Approximate Newton Methods and Their Local Convergence Haishan Ye Luo Luo Zhihua Zhang 2 Abstract Many machine learning models are reformulated as optimization problems. Thus, it is important to solve a large-scale optimization problem in big data applications.

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

IN this work we are concerned with the problem of minimizing. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

IN this work we are concerned with the problem of minimizing. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting Jakub Konečný, Jie Liu, Peter Richtárik, Martin Takáč arxiv:504.04407v2 [cs.lg] 6 Nov 205 Abstract We propose : a method incorporating

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Yossi Arjevani Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot 7610001,

More information

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19 Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal

More information

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

A Proximal Stochastic Gradient Method with Progressive Variance Reduction A Proximal Stochastic Gradient Method with Progressive Variance Reduction arxiv:403.4699v [math.oc] 9 Mar 204 Lin Xiao Tong Zhang March 8, 204 Abstract We consider the problem of minimizing the sum of

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy

Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy Adaptive Neton Method for Empirical Risk Minimization to Statistical Accuracy Aryan Mokhtari? University of Pennsylvania aryanm@seas.upenn.edu Hadi Daneshmand? ETH Zurich, Sitzerland hadi.daneshmand@inf.ethz.ch

More information

Inverse Time Dependency in Convex Regularized Learning

Inverse Time Dependency in Convex Regularized Learning Inverse Time Dependency in Convex Regularized Learning Zeyuan A. Zhu (Tsinghua University) Weizhu Chen (MSRA) Chenguang Zhu (Tsinghua University) Gang Wang (MSRA) Haixun Wang (MSRA) Zheng Chen (MSRA) December

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Stochastic Dual Coordinate Ascent with Adaptive Probabilities Dominik Csiba Zheng Qu Peter Richtárik University of Edinburgh CDOMINIK@GMAIL.COM ZHENG.QU@ED.AC.UK PETER.RICHTARIK@ED.AC.UK Abstract This paper introduces AdaSDCA: an adaptive variant of stochastic dual

More information

Stochastic Methods for l 1 Regularized Loss Minimization

Stochastic Methods for l 1 Regularized Loss Minimization Shai Shalev-Shwartz Ambuj Tewari Toyota Technological Institute at Chicago, 6045 S Kenwood Ave, Chicago, IL 60637, USA SHAI@TTI-CORG TEWARI@TTI-CORG Abstract We describe and analyze two stochastic methods

More information

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms On the Iteration Complexity of Oblivious First-Order Optimization Algorithms Yossi Arjevani Weizmann Institute of Science, Rehovot 7610001, Israel Ohad Shamir Weizmann Institute of Science, Rehovot 7610001,

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods

Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods Robert M. Gower robert.gower@telecom-paristech.fr icolas Le Roux nicolas.le.roux@gmail.com Francis Bach francis.bach@inria.fr

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

l 1 and l 2 Regularization

l 1 and l 2 Regularization David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information