Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization


Mark Schmidt, Nicolas Le Roux, Francis Bach. Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization. NIPS'11 - 25th Annual Conference on Neural Information Processing Systems, Dec 2011, Granada, Spain. <inria v3> HAL Id: inria, submitted on 1 Dec 2011. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

Mark Schmidt, Nicolas Le Roux (nicolas@le-roux.name), Francis Bach (francis.bach@ens.fr)
INRIA - SIERRA Project-Team, Laboratoire d'Informatique de l'École Normale Supérieure, Paris, France
December 1, 2011

Abstract. We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems.

1 Introduction

In recent years the importance of taking advantage of the structure of convex optimization problems has become a topic of intense research in the machine learning community. This is particularly true of techniques for non-smooth optimization, where taking advantage of the structure of non-smooth terms seems to be crucial to obtaining good performance. Proximal-gradient methods and accelerated proximal-gradient methods [1, 2] are among the most important methods for taking advantage of the structure of many of the non-smooth optimization problems that arise in practice. In particular, these methods address composite optimization problems of the form

minimize_{x in R^d} f(x) := g(x) + h(x),   (1)

where g and h are convex functions but only g is smooth. One of the most well-studied instances of this type of problem is l1-regularized least squares [3, 4],

minimize_{x in R^d} (1/2)||Ax - b||^2 + λ||x||_1,

where we use ||·|| to denote the standard l2-norm. Proximal-gradient methods are an appealing approach for solving these types of non-smooth optimization problems because of their fast theoretical convergence rates and strong practical performance. While classical subgradient methods only achieve an error level on the objective function of O(1/sqrt(k)) after k iterations, proximal-gradient methods have an error of O(1/k), while accelerated proximal-gradient methods further reduce this to O(1/k^2) [1, 2]. That is, accelerated proximal-gradient methods for non-smooth convex optimization achieve the same optimal convergence rate that accelerated gradient methods achieve for smooth optimization. Each iteration of a proximal-gradient method requires the calculation of the proximity operator,

prox_L[y] = argmin_{x in R^d} (L/2)||x - y||^2 + h(x),   (2)

where L is the Lipschitz constant of the gradient of g. We can efficiently compute an analytic solution to this problem for several notable choices of h, including the case of l1-regularization and disjoint group l1-regularization [5, 6]. However, in many scenarios the proximity operator may not have an analytic solution, or it may be very expensive to compute this solution exactly. This includes important problems such as total-variation regularization and its generalizations like the graph-guided fused-Lasso [7, 8], nuclear-norm regularization and other regularizers on the singular values of matrices [9, 10], and different formulations of overlapping group l1-regularization with general groups [11, 12].
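The closed-form proximity operators mentioned above are simple to state concretely. The sketch below (Python/NumPy; the function names are our own illustration, not from the paper) computes prox_L for l1-regularization by coordinate-wise soft-thresholding, and for disjoint group regularization by block-wise shrinkage:

```python
import numpy as np

def prox_l1(y, lam, L):
    """prox_L[y] for h(x) = lam*||x||_1: coordinate-wise soft-thresholding."""
    t = lam / L
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_group(y, groups, lam, L):
    """prox_L[y] for h(x) = lam * sum_g ||x_g||_2 with disjoint groups:
    block-wise shrinkage of each group toward zero."""
    x = y.copy()
    t = lam / L
    for g in groups:
        norm = np.linalg.norm(y[g])
        x[g] = 0.0 if norm <= t else (1.0 - t / norm) * y[g]
    return x
```

Both maps are exact minimizers of the strongly convex proximal objective, which is what makes these choices of h convenient in practice.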
Despite the difficulty of computing the exact proximity operator for these regularizers, efficient methods have been developed to compute approximate proximity operators in all of these cases: accelerated projected-gradient and Newton-like methods that work with a smooth dual problem have been used to compute approximate proximity operators in the context of total-variation regularization [7, 13], Krylov subspace methods and low-rank representations have been used in the context of nuclear-norm regularization [9, 10], and variants of Dykstra's algorithm and related dual methods have been used in the context of overlapping group l1-regularization [12, 14, 15]. It is known that proximal-gradient methods that use an approximate proximity operator converge under only weak assumptions [16, 17]; we briefly review this and other related work in the next section. However, despite the many recent works showing impressive empirical performance of accelerated proximal-gradient methods that use an approximate proximity operator [7, 9, 10, 13, 14, 15], until recently there was no theoretical analysis of how the error in the calculation of the proximity operator affects the convergence rate of proximal-gradient methods. In this work, we show in several contexts that, provided the error in the proximity operator calculation is controlled in an appropriate way, inexact proximal-gradient strategies achieve the same convergence rates as the corresponding exact methods. In particular, in Section 4 we first consider convex objectives and analyze the inexact proximal-gradient (Proposition 1) and accelerated proximal-gradient (Proposition 2) methods. We then analyze these two algorithms for strongly convex objectives (Proposition 3 and Proposition 4). Note that, in these analyses, we also consider the possibility

that there is an error in the calculation of the gradient of g. We then present an experimental comparison of various inexact proximal-gradient strategies in the context of solving a structured sparsity problem (Section 5).

2 Related Work

The algorithm we shall focus on in this paper is the proximal-gradient method

x_k = prox_L[ y_{k-1} - (1/L)(g'(y_{k-1}) + e_k) ],   (3)

where e_k is the error in the calculation of the gradient and the proximity problem (2) is solved inexactly, so that x_k has an error of ε_k in terms of the proximal objective function. In the basic proximal-gradient method we choose y_k = x_k, while in the accelerated proximal-gradient method we choose y_k = x_k + β_k(x_k - x_{k-1}), where the sequence {β_k} is chosen to accelerate the convergence rate. There is a substantial amount of work on methods that use an exact proximity operator but have an error in the gradient calculation, corresponding to the special case where ε_k = 0 but e_k is non-zero. For example, when the e_k are independent, zero-mean, and finite-variance random variables, proximal-gradient methods achieve the optimal error level of O(1/sqrt(k)) [18, 19]. This is different from the scenario we analyze in this paper, since we do not assume unbiased or independent errors but instead consider a sequence of errors converging to 0. This leads to faster convergence rates and makes our analysis applicable to the case of deterministic and even adversarial errors. Several authors have recently analyzed the case of a fixed deterministic error in the gradient, and shown that accelerated gradient methods achieve the optimal convergence rate up to some accuracy that depends on the fixed error level [20, 21, 22], while the earlier work of [23] analyzes the gradient method in the context of a fixed error level. This contrasts with our analysis, where by allowing the error to change at every iteration we can achieve convergence to the optimal solution.
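Recursion (3) is easy to simulate. The sketch below is our own minimal Python instance (an l1-regularized least-squares g and h, an exact proximity operator so that ε_k = 0, and a synthetic gradient error e_k of prescribed norm); the accelerated variant uses the standard choice β_k = (k - 1)/(k + 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small l1-regularized least-squares instance: g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1.
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
lam = 0.1
L = np.linalg.norm(A.T @ A, 2)  # Lipschitz constant of g'

def grad_g(x):
    return A.T @ (A @ x - b)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def prox(y):
    # exact proximity operator of h (soft-thresholding), so eps_k = 0 here
    return np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)

def inexact_prox_grad(K, err_norm, accelerated=False):
    """Recursion (3): x_k = prox[y_{k-1} - (1/L)(g'(y_{k-1}) + e_k)], where e_k
    is a synthetic gradient error with ||e_k|| = err_norm(k).
    Basic method: y_k = x_k; accelerated: y_k = x_k + beta_k*(x_k - x_{k-1})."""
    x = np.zeros(A.shape[1])
    x_prev = x.copy()
    for k in range(1, K + 1):
        beta = (k - 1.0) / (k + 2.0) if accelerated else 0.0
        y = x + beta * (x - x_prev)
        n = err_norm(k)
        e = np.zeros_like(x)
        if n > 0:
            e = rng.standard_normal(x.shape)
            e *= n / np.linalg.norm(e)  # rescale error to the prescribed norm
        x_prev, x = x, prox(y - (grad_g(y) + e) / L)
    return x

# Errors with summable norms (here O(1/k^2)) preserve convergence in practice.
x_err = inexact_prox_grad(2000, lambda k: 1.0 / k ** 2)
x_acc = inexact_prox_grad(2000, lambda k: 1.0 / k ** 3, accelerated=True)
x_ref = inexact_prox_grad(5000, lambda k: 0.0)
```

This is only a demonstration of the recursion, not the paper's experimental setup; the error model (adversarially scaled random directions) is an assumption made for illustration.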
Also, we can tolerate a large error in early iterations when we are far from the solution, which may lead to substantial computational gains. Other authors have analyzed the convergence rate of the gradient and projected-gradient methods with a decreasing sequence of errors [24, 25], but this analysis does not consider the important class of accelerated gradient methods. In contrast, the analysis of [22] allows a decreasing sequence of errors (though convergence rates in this context are not explicitly mentioned) and considers the accelerated projected-gradient method. However, the authors of this work only consider the case of an exact projection step, and they assume the availability of an oracle that yields global lower and upper bounds on the function. This non-intuitive oracle leads to a novel analysis of smoothing methods, but also to slower convergence rates than proximal-gradient methods. The analysis of [21] considers errors in both the gradient and projection operators for accelerated projected-gradient methods, but requires that the domain of the function is compact. None of these works consider proximal-gradient methods.

In the context of proximal-point algorithms, there is a substantial literature on using inexact proximity operators with a decreasing sequence of errors, dating back to the seminal work of Rockafellar [26]. Accelerated proximal-point methods with a decreasing sequence of errors have also been examined, beginning with [27]. However, unlike proximal-gradient methods, where the proximity operator is only computed with respect to the non-smooth function h, proximal-point methods require the calculation of the proximity operator with respect to the full objective function. In the context of composite optimization problems of the form (1), this requires the calculation of the proximity operator with respect to g + h. Since it ignores the structure of the problem, this proximity operator may be as difficult to compute (even approximately) as the minimizer of the original problem. Convergence of inexact proximal-gradient methods can be established with only weak assumptions on the method used to approximately solve (2). For example, we can establish that inexact proximal-gradient methods converge under some closedness assumptions on the mapping induced by the approximate proximity operator, along with the assumption that the algorithm used to compute the inexact proximity operator achieves sufficient descent on problem (2) compared to the previous iterate x_{k-1} [16]. Convergence of inexact proximal-gradient methods can also be established under the assumption that the norms of the errors are summable [17]. However, these prior works did not consider the rate of convergence of inexact proximal-gradient methods, nor did they consider accelerated proximal-gradient methods. Indeed, the authors of [7] chose to use the non-accelerated variant of the proximal-gradient algorithm since even convergence of the accelerated proximal-gradient method had not been established under an inexact proximity operator.
While preparing the final version of this work, [28] independently gave an analysis of the accelerated proximal-gradient method with an inexact proximity operator and a decreasing sequence of errors (assuming an exact gradient). Further, their analysis leads to a weaker dependence on the errors than in our Proposition 2. However, while we only assume that the proximal problem can be solved up to a certain accuracy, they make the much stronger assumption that the inexact proximity operator yields an ε_k-subdifferential of h [28, Definition 2.1]. Our analysis can be modified to give an improved dependence on the errors under this stronger assumption. In particular, the terms in sqrt(ε_i) disappear from the expressions of A_k, Ã_k and Â_k. In the case of Propositions 1 and 2, this leads to the optimal convergence rate with a slower decay of ε_k. More details may be found after Lemma 2 in the Appendix. More recently, [29] gave an alternative analysis of an accelerated proximal-gradient method with an inexact proximity operator and a decreasing sequence of errors (assuming an exact gradient), but under a non-intuitive assumption on the relationship between the approximate solution of the proximal problem and the ε_k-subdifferential of h.

3 Notation and Assumptions

In this work, we assume that the smooth function g in (1) is convex and differentiable, and that its gradient g' is Lipschitz-continuous with constant L, meaning that for all x and y in R^d we have

||g'(x) - g'(y)|| <= L ||x - y||.

This is a standard assumption in differentiable optimization, see [30, §2.1.1]. If g is twice-differentiable, this corresponds to the assumption that the eigenvalues of its Hessian are bounded above by L. In Propositions 3 and 4 only, we will also assume that g is µ-strongly convex (see [30, §2.1.3]), meaning that for all x and y in R^d we have

g(y) >= g(x) + <g'(x), y - x> + (µ/2)||y - x||^2.

However, apart from Propositions 3 and 4, we only assume that this holds with µ = 0, which is equivalent to convexity of g. In contrast to these assumptions on g, we will only assume that h in (1) is a lower semi-continuous proper convex function (see [31, §1.2]), but will not assume that h is differentiable or Lipschitz-continuous. This allows h to be any real-valued convex function, but also allows for the possibility that h is an extended real-valued convex function. For example, h could be the indicator function of a convex set, and in this case the proximity operator becomes the projection operator. We will use x_k to denote the parameter vector at iteration k, and x* to denote a minimizer of f. We assume that such an x* exists, but do not assume that it is unique. We use e_k to denote the error in the calculation of the gradient at iteration k, and we use ε_k to denote the error in the proximal objective function achieved by x_k, meaning that

(L/2)||x_k - z_k||^2 + h(x_k) <= ε_k + min_{x in R^d} { (L/2)||x - z_k||^2 + h(x) },   (4)

where z_k = y_{k-1} - (1/L)(g'(y_{k-1}) + e_k). Note that the proximal optimization problem in (4) is strongly convex, and in practice we are often able to obtain such bounds via a duality gap (e.g., see [12] for the case of overlapping group l1-regularization).

4 Convergence Rates of Inexact Proximal-Gradient Methods

In this section we present the analysis of the convergence rates of inexact proximal-gradient methods as a function of the sequence of solution accuracies to the proximal problems, {ε_k}, and the sequence of magnitudes of the errors in the gradient calculations, {||e_k||}.
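For the least-squares g used in the introduction, the constants in the assumptions above are explicit: the Hessian of g(x) = 0.5*||Ax - b||^2 is A^T A everywhere, so L and µ are its extreme eigenvalues. The self-contained sketch below (our own spot-check, not from the paper) verifies both inequalities numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def g(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad_g(x):
    return A.T @ (A @ x - b)

# For g(x) = 0.5*||Ax - b||^2 the Hessian is A^T A, so g' is Lipschitz with
# L = lambda_max(A^T A), and g is mu-strongly convex with mu = lambda_min(A^T A)
# (mu > 0 here since this random A has full column rank).
eigs = np.linalg.eigvalsh(A.T @ A)
mu, L = eigs[0], eigs[-1]

# Spot-check the Lipschitz and strong-convexity inequalities on random pairs.
for _ in range(100):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    assert np.linalg.norm(grad_g(x) - grad_g(y)) <= L * np.linalg.norm(x - y) + 1e-8
    lower = g(x) + grad_g(x) @ (y - x) + 0.5 * mu * np.sum((y - x) ** 2)
    assert g(y) >= lower - 1e-8
```

For quadratics both inequalities are tight along the corresponding eigenvectors, which is why these are the natural choices of L and µ.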
We shall use (H) to denote the set of four assumptions which will be made for each proposition: g is convex and has an L-Lipschitz-continuous gradient; h is a lower semi-continuous proper convex function; the function f = g + h attains its minimum at a certain x* in R^d; and x_k is an ε_k-optimal solution to the proximal problem (2) in the sense of (4). We first consider the basic proximal-gradient method in the convex case:

Proposition 1 (Basic proximal-gradient method - Convexity) Assume (H) and that we iterate recursion (3) with y_k = x_k. Then, for all k >= 1, we have

f( (1/k) sum_{i=1}^k x_i ) - f(x*) <= (L/(2k)) ( ||x_0 - x*|| + 2 A_k + sqrt(2 B_k) )^2,   (5)

with

A_k = sum_{i=1}^k ( ||e_i||/L + sqrt(2 ε_i / L) ),   B_k = sum_{i=1}^k ε_i / L.

The proof may be found in the Appendix. Note that while we have stated the proposition in terms of the function value achieved by the average of the iterates, it trivially also holds for the iterate that achieves the lowest function value. This result implies that the well-known O(1/k) convergence rate for the gradient method without errors still holds when both {||e_k||} and {sqrt(ε_k)} are summable. A sufficient condition to achieve this is for ||e_k|| and sqrt(ε_k) to decrease as O(1/k^{1+δ}) for any δ > 0. Note that a faster convergence of these two errors will not improve the convergence rate but will yield a better constant factor. It is interesting to consider what happens if {||e_k||} or {sqrt(ε_k)} is not summable. For instance, if ||e_k|| and sqrt(ε_k) decrease as O(1/k), then A_k grows as O(log k) (note that B_k is always smaller than A_k^2/2) and the convergence of the function values is in O(log^2 k / k). Finally, a necessary condition to obtain convergence is that the partial sums A_k and sqrt(B_k) need to be in o(sqrt(k)). We now turn to the case of an accelerated proximal-gradient method. We focus on a basic variant of the algorithm where β_k is set to (k - 1)/(k + 2) [32, Eq. (19) and (27)]:

Proposition 2 (Accelerated proximal-gradient method - Convexity) Assume (H) and that we iterate recursion (3) with y_k = x_k + ((k - 1)/(k + 2))(x_k - x_{k-1}). Then, for all k >= 1, we have

f(x_k) - f(x*) <= (2L/(k + 1)^2) ( ||x_0 - x*|| + 2 Ã_k + sqrt(2 B̃_k) )^2,   (6)

with

Ã_k = sum_{i=1}^k i ( ||e_i||/L + sqrt(2 ε_i / L) ),   B̃_k = sum_{i=1}^k i^2 ε_i / L.

In this case, we require the series {k ||e_k||} and {k sqrt(ε_k)} to be summable to achieve the optimal O(1/k^2) rate, which is an unsurprisingly stronger constraint than in the basic case. A sufficient condition is for ||e_k|| and sqrt(ε_k) to decrease as O(1/k^{2+δ}) for any δ > 0. Note that, as opposed to Proposition 1 that is stated for the average iterate, this bound is for the last iterate x_k. Again, it is interesting to see what happens when the summability assumption is not met.
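The role of summability in these bounds can be illustrated numerically: partial sums of an O(1/k^{1+δ}) error schedule stay bounded, while an O(1/k) schedule grows like log k. A small sketch with hypothetical error schedules (names and constants are ours):

```python
def partial_sums(err, K):
    """Partial sums A_k = sum_{i<=k} err(i), as in Propositions 1 and 2."""
    s, out = 0.0, []
    for i in range(1, K + 1):
        s += err(i)
        out.append(s)
    return out

K = 10 ** 5
summable = partial_sums(lambda i: 1.0 / i ** 1.1, K)   # O(1/k^{1+delta}): bounded
log_growth = partial_sums(lambda i: 1.0 / i, K)        # O(1/k): grows like log k
```

Doubling K adds roughly log 2 to the harmonic partial sum, while the summable schedule's partial sums flatten out; plugged into the bounds above, the first schedule preserves the error-free rate and the second costs a poly-logarithmic factor.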
First, if ||e_k|| or sqrt(ε_k) decreases at a rate of O(1/k^2), then the terms i(||e_i||/L + sqrt(2 ε_i / L)) decrease as O(1/i) and Ã_k grows as O(log k) (note that B̃_k is always smaller than Ã_k^2/2), yielding a convergence rate of O(log^2 k / k^2) for f(x_k) - f(x*). Also, and perhaps more interestingly, if ||e_k|| or sqrt(ε_k)

decreases at a rate of O(1/k), Eq. (6) does not guarantee convergence of the function values. More generally, the form of Ã_k and B̃_k indicates that errors have a greater effect on the accelerated method than on the basic method. Hence, as also discussed in [22], unlike in the error-free case, the accelerated method may not necessarily be better than the basic method, because it is more sensitive to errors in the computation. In the case where g is strongly convex it is possible to obtain linear convergence rates that depend on the ratio γ = µ/L, as opposed to the sublinear convergence rates discussed above. In particular, we obtain the following convergence rate on the iterates of the basic proximal-gradient method:

Proposition 3 (Basic proximal-gradient method - Strong convexity) Assume (H), that g is µ-strongly convex, and that we iterate recursion (3) with y_k = x_k. Then, for all k >= 1, we have

||x_k - x*|| <= (1 - γ)^k ( ||x_0 - x*|| + Ā_k ),   (7)

with

Ā_k = sum_{i=1}^k (1 - γ)^{-i} ( ||e_i||/L + sqrt(2 ε_i / L) ).

A consequence of this proposition is that we obtain a linear rate of convergence even in the presence of errors, provided that ||e_k|| and sqrt(ε_k) decrease linearly to 0. If they do so at a rate Q < (1 - γ), then the convergence rate of ||x_k - x*|| is linear with constant (1 - γ), as in the error-free algorithm. If we have Q > (1 - γ), then the convergence of ||x_k - x*|| is linear with constant Q. If we have Q = (1 - γ), then ||x_k - x*|| converges to 0 as O(k (1 - γ)^k) = o( (1 - γ + δ')^k ) for all δ' > 0. Finally, we consider the accelerated proximal-gradient algorithm when g is strongly convex. We focus on a basic variant of the algorithm where β_k is set to (1 - sqrt(γ))/(1 + sqrt(γ)) [30, §2.2.1]:

Proposition 4 (Accelerated proximal-gradient method - Strong convexity) Assume (H), that g is µ-strongly convex, and that we iterate recursion (3) with y_k = x_k + ((1 - sqrt(γ))/(1 + sqrt(γ)))(x_k - x_{k-1}). Then, for all k >= 1, we have

f(x_k) - f(x*) <= (1 - sqrt(γ))^k [ ( sqrt(f(x_0) - f(x*)) + Â_k sqrt(γ/(2µ)) )^2 + B̂_k ],   (8)

with

Â_k = sum_{i=1}^k ( ||e_i|| + sqrt(2 L ε_i) ) (1 - sqrt(γ))^{-i/2},   B̂_k = sum_{i=1}^k ε_i (1 - sqrt(γ))^{-i}.

Note that while we have stated the result in terms of function values, we obtain an analogous result on the iterates because, by the strong convexity of f, we have (µ/2)||x_k - x*||^2 <= f(x_k) - f(x*).
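The three regimes described after Proposition 3 can be seen directly in the right-hand side of (7). The sketch below (our own illustration, assuming a geometric error schedule ||e_i||/L + sqrt(2 ε_i / L) = c*Q^i) evaluates the bound for Q below and above (1 - γ):

```python
gamma = 0.1          # gamma = mu/L
rho = 1.0 - gamma    # error-free linear rate (1 - gamma)

def bound(Q, k, c=1.0, d0=1.0):
    """Right-hand side of Proposition 3 with errors decaying as c*Q^i:
    (1-gamma)^k * ( ||x_0 - x*|| + sum_{i<=k} (1-gamma)^{-i} * c * Q^i )."""
    A = sum(c * (Q / rho) ** i for i in range(1, k + 1))
    return rho ** k * (d0 + A)

# Q < 1-gamma: the sum converges, so the bound shrinks at the error-free rate.
fast = [bound(0.5, k) for k in (20, 40)]
# Q > 1-gamma: the sum dominates, so the bound shrinks only at rate Q^k.
slow = [bound(0.95, k) for k in (20, 40)]
```

With γ = 0.1, the first pair contracts by roughly (0.9)^20 per 20 iterations while the second contracts much more slowly, matching the Q < (1 - γ) versus Q > (1 - γ) discussion.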

This proposition implies that we obtain a linear rate of convergence in the presence of errors, provided that ||e_k|| and sqrt(ε_k) decrease linearly to 0. If they do so at a rate Q < (1 - sqrt(γ)), then the constant is (1 - sqrt(γ)), while if Q > (1 - sqrt(γ)) then the constant will be Q. Thus, the accelerated inexact proximal-gradient method will have a faster convergence rate than the exact basic proximal-gradient method provided that Q < (1 - γ). Oddly, in our analysis of the strongly convex case, the accelerated method is less sensitive to errors than the basic method. However, unlike the basic method, the accelerated method requires knowing µ in addition to L. If µ is misspecified, then the convergence rate of the accelerated method may be slower than that of the basic method.

5 Experiments

We tested the basic inexact proximal-gradient and accelerated proximal-gradient methods on the CUR-like factorization optimization problem introduced in [33] to approximate a given matrix W,

min_X (1/2) ||W - W X W||_F^2 + λ_row sum_{i=1}^{n_r} ||X_i||_p + λ_col sum_{j=1}^{n_c} ||X^j||_p.

Under an appropriate choice of p, this optimization problem yields a matrix X with sparse rows and sparse columns, meaning that entire rows and columns of the matrix X are set to exactly zero. In [33], the authors used an accelerated proximal-gradient method and chose p = ∞, since under this choice the proximity operator can be computed exactly. However, this has the undesirable effect that it also encourages all values in the same row or column to have the same magnitude. The more natural choice of p = 2 was not explored, since in this case there is no known algorithm to exactly compute the proximity operator. Our experiments focused on the case of p = 2. In this case, it is possible to very quickly compute an approximate proximity operator using the block coordinate descent (BCD) algorithm presented in [12], which is equivalent to the proximal variant of Dykstra's algorithm introduced by [34].
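The Dykstra-style alternation underlying such inner solvers can be sketched generically: given only the individual proximity operators of two convex functions h1 and h2, it approximates the proximity operator of h1 + h2. The sketch below is our own simplified illustration (the paper's BCD solver alternates between row and column group norms; here we sanity-check the scheme on indicator functions, where each prox is a projection and the prox of the sum is projection onto the intersection):

```python
import numpy as np

def dykstra_prox(y, prox1, prox2, iters=100):
    """Dykstra-style alternation for approximating prox_{h1+h2}(y) when only
    prox_{h1} and prox_{h2} are available individually."""
    x, p, q = y.copy(), np.zeros_like(y), np.zeros_like(y)
    for _ in range(iters):
        u = prox1(x + p)     # prox step on h1, with correction term p
        p = x + p - u
        x = prox2(u + q)     # prox step on h2, with correction term q
        q = u + q - x
    return x

# Indicator functions of two boxes: prox = projection, and the prox of the sum
# is projection onto the intersection [0, 2]^d, i.e. clipping to [0, 2].
proj1 = lambda z: np.clip(z, -1.0, 2.0)
proj2 = lambda z: np.clip(z, 0.0, 3.0)
y = np.array([-2.5, 0.7, 2.8])
x = dykstra_prox(y, proj1, proj2)
```

Unlike plain alternating projections, the correction terms p and q are what make the limit the proximity operator of the sum rather than just a point satisfying both constraints.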
In our implementation of the BCD method, we alternate between computing the proximity operator with respect to the rows and with respect to the columns. Since the BCD method allows us to compute a duality gap when solving the proximal problem, we can run the method until the duality gap is below a given error threshold ε_k to find an x_{k+1} satisfying (4). In our experiments, we used the four data sets examined by [33]¹ and we chose λ_row = .01 and λ_col = .01, which yielded approximately 25-40% non-zero entries in X (depending on the data set). Rather than assuming we are given the Lipschitz constant L, on the first iteration we set L to 1 and, following [2], we double our estimate anytime

g(x_k) > g(y_{k-1}) + <g'(y_{k-1}), x_k - y_{k-1}> + (L/2)||x_k - y_{k-1}||^2.

We tested three different ways to terminate the approximate proximal problem, each parameterized by a parameter α:

ε_k = 1/k^α: Running the BCD algorithm until the duality gap is below 1/k^α.

¹The data sets are freely available online.

ε_k = α: Running the BCD algorithm until the duality gap is below α.

n_k = α: Running the BCD algorithm for a fixed number of iterations α.

Note that all three strategies lead to global convergence in the case of the basic proximal-gradient method, the first two give a convergence rate up to some fixed optimality tolerance, and in this paper we have shown that the first one (for large enough α) yields a convergence rate for an arbitrary optimality tolerance. Note that the iterates produced by the BCD iterations are sparse, so we expected the algorithms to spend the majority of their time solving the proximity problem. Thus, we used the function value against the number of BCD iterations as a measure of performance. We plot the results after 500 BCD iterations for the four data sets for the proximal-gradient method in Figure 1, and for the accelerated proximal-gradient method in Figure 2. In these plots, the first column varies α using the choice ε_k = 1/k^α, the second column varies α using the choice ε_k = α, and the third column varies α using the choice n_k = α. We also include one of the best methods from the first column in the second and third columns as a reference. In the context of proximal-gradient methods, the choice of ε_k = 1/k^3, which is one choice that achieves the fastest convergence rate according to our analysis, gives the best performance across all four data sets. However, in these plots we also see that reasonable performance can be achieved by any of the three strategies above, provided that α is chosen carefully. For example, choosing n_k = 3 or choosing ε_k = 10^-6 both give reasonable performance. However, these are only empirical observations for these data sets, and they may be ineffective for other data sets or if we change the number of iterations, while we have given theoretical justification for the choice ε_k = 1/k^3.
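The duality-gap termination rule can be made concrete on a toy inner problem. The sketch below is our own illustration (the paper's inner solver is BCD for the row/column group problem, not this l1 instance): it solves the proximal problem for h = lam*||·||_1 by projected gradient ascent on its dual and stops once the gap is below a given ε, exactly the role the threshold ε_k = 1/k^α plays above:

```python
import numpy as np

def inexact_prox_l1(y, lam, L, eps):
    """Approximate prox: min_x (L/2)||x - y||^2 + lam*||x||_1, solved through its
    dual max_{|u_j| <= lam} u.y - ||u||^2/(2L) (with primal point x = y - u/L)
    by projected gradient ascent, stopping once the duality gap is below eps."""
    u = np.zeros_like(y)
    step = 0.5 * L  # safe step size: the dual gradient u -> y - u/L is (1/L)-Lipschitz
    while True:
        x = y - u / L
        primal = 0.5 * L * np.sum((x - y) ** 2) + lam * np.sum(np.abs(x))
        dual = u @ y - np.sum(u ** 2) / (2.0 * L)
        gap = primal - dual
        if gap <= eps:
            return x, gap
        u = np.clip(u + step * (y - u / L), -lam, lam)  # ascent step, then projection

# Looser tolerance early, tighter tolerance later (the eps_k = 1/k^alpha idea).
x_loose, gap_loose = inexact_prox_l1(np.array([2.0, -0.2, 0.6]), 0.5, 1.0, 1e-2)
x_tight, gap_tight = inexact_prox_l1(np.array([2.0, -0.2, 0.6]), 0.5, 1.0, 1e-8)
```

Because the proximal problem is L-strongly convex, a duality gap below ε certifies that the returned point is an ε-optimal solution in exactly the sense of (4).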
Similar trends are observed for the case of accelerated proximal-gradient methods, though the choice of ε_k = 1/k^3 (which no longer achieves the fastest convergence rate according to our analysis) no longer dominates the other methods in the accelerated setting. For the SRBCT data set, the choice ε_k = 1/k^4, which is a choice that achieves the fastest convergence rate up to a poly-logarithmic factor, yields better performance than ε_k = 1/k^3. Interestingly, the only choice that yields the fastest possible convergence rate (ε_k = 1/k^5) had reasonable performance but did not give the best performance on any data set. This seems to reflect the trade-off between performing inner BCD iterations to achieve a small duality gap and performing outer gradient iterations to decrease the value of f. Also, the constant terms, which were not taken into account in the analysis, do play an important role here, due to the relatively small number of outer iterations performed.

6 Discussion

An alternative to inexact proximal methods for solving structured sparsity problems are smoothing methods [35] and alternating direction methods [36]. However, a major disadvantage of both these approaches is that the iterates are not sparse, so they cannot take advantage of the sparsity of the problem when running the algorithm. In contrast, the method proposed in this paper has the appealing property that it tends to generate

Figure 1: Objective function against number of proximal iterations for the proximal-gradient method with different strategies for terminating the approximate proximity calculation. From top to bottom, we have the 9 Tumors, Brain Tumor1, Leukemia1, and SRBCT data sets.

Figure 2: Objective function against number of proximal iterations for the accelerated proximal-gradient method with different strategies for terminating the approximate proximity calculation. From top to bottom, we have the 9 Tumors, Brain Tumor1, Leukemia1, and SRBCT data sets.

sparse iterates. Further, the accelerated smoothing method only has a convergence rate of O(1/k), and the performance of alternating direction methods is often sensitive to the exact choice of their penalty parameter. On the other hand, while our analysis suggests using a sequence of errors like O(1/k^α) for α large enough, the practical performance of inexact proximal-gradient methods will be sensitive to the exact choice of this sequence. Although we have illustrated the use of our results in the context of a structured sparsity problem, inexact proximal-gradient methods are also used in other applications such as total-variation [7, 8] and nuclear-norm [9, 10] regularization. This work provides a theoretical justification for using inexact proximal-gradient methods in these and other applications, and suggests some guidelines for practitioners that do not want to lose the appealing convergence rates of these methods. Further, although our experiments and much of our discussion focus on errors in the calculation of the proximity operator, our analysis also allows for an error in the calculation of the gradient. This may also be useful in a variety of contexts. For example, errors in the calculation of the gradient arise when fitting undirected graphical models and using an iterative method to approximate the gradient of the log-partition function [37]. Other examples include using a reduced set of training examples within kernel methods [38] or subsampling to solve semidefinite programming problems [39]. In our analysis, we assume that the smoothness constant L is known, but it would be interesting to extend methods for estimating L in the exact case [2] to the case of inexact algorithms. In the context of accelerated methods for strongly convex optimization, our analysis also assumes that µ is known, and it would be interesting to explore variants that do not make this assumption.
We also note that if the basic proximal-gradient method is given knowledge of µ, then our analysis can be modified to obtain a faster linear convergence rate of (1 - γ)/(1 + γ) instead of (1 - γ) for strongly-convex optimization, using a step size of 2/(µ + L) (see Theorem 2.1.15 of [30]). Finally, we note that there has been recent interest in inexact proximal Newton-like methods [40], and it would be interesting to analyze the effect of errors on the convergence rates of these methods.

Acknowledgements

Mark Schmidt, Nicolas Le Roux, and Francis Bach are supported by the European Research Council (SIERRA-ERC project).

Appendix: Proofs of the propositions

We first prove a lemma which will be used for the propositions.

Lemma 1 Assume that the nonnegative sequence {u_k} satisfies the following recursion for all k >= 1:

u_k^2 <= S_k + sum_{i=1}^k λ_i u_i,

with {S_k} an increasing sequence, S_0 >= u_0^2, and λ_i >= 0 for all i. Then, for all k >= 1, we have

u_k <= (1/2) sum_{i=1}^k λ_i + ( S_k + ( (1/2) sum_{i=1}^k λ_i )^2 )^{1/2}.

Proof. We prove the result by induction. It is true for k = 0 by assumption. We assume it is true for k - 1, and we denote by v_{k-1} = max{u_1, ..., u_{k-1}}. From the recursion, we thus get

u_k^2 <= S_k + λ_k u_k + v_{k-1} sum_{i=1}^{k-1} λ_i,

leading to

u_k <= λ_k/2 + ( S_k + λ_k^2/4 + v_{k-1} sum_{i=1}^{k-1} λ_i )^{1/2},

and thus

v_k <= max{ v_{k-1}, λ_k/2 + ( S_k + λ_k^2/4 + v_{k-1} sum_{i=1}^{k-1} λ_i )^{1/2} }.

The two terms in the maximum are equal if v_{k-1}^2 = S_k + v_{k-1} sum_{i=1}^k λ_i, i.e., for

v*_{k-1} = (1/2) sum_{i=1}^k λ_i + ( S_k + ( (1/2) sum_{i=1}^k λ_i )^2 )^{1/2}.

If v_{k-1} <= v*_{k-1}, then v_k <= v*_{k-1}, since the two terms in the max are increasing functions of v_{k-1}. If v_{k-1} >= v*_{k-1}, then v_{k-1} >= λ_k/2 + ( S_k + λ_k^2/4 + v_{k-1} sum_{i=1}^{k-1} λ_i )^{1/2}, and hence v_k <= v_{k-1}. In both cases v_k <= max{v_{k-1}, v*_{k-1}}, and the induction hypothesis ensures that the property is satisfied for k.

The following lemma will allow us to characterize the elements of the ε_k-subdifferential of h at x_k, ∂_{ε_k} h(x_k). As a reminder, the ε-subdifferential of a convex function a at x is the set of vectors y such that a(t) >= a(x) + <y, t - x> - ε for all t.

Lemma 2 If x_i is an ε_i-optimal solution to the proximal problem (2) in the sense of (4), then there exists f_i such that ||f_i|| <= sqrt(2 L ε_i) and

L( y_{i-1} - x_i ) - g'(y_{i-1}) - e_i - f_i ∈ ∂_{ε_i} h(x_i).

Proof. We first recall some properties of ε-subdifferentials (see, e.g., [41, Section 4.3] for more details). By definition, x is an ε-minimizer of a convex function a if and only if a(x) <= inf_{y in R^d} a(y) + ε. This is equivalent to 0 belonging to the ε-subdifferential ∂_ε a(x). If a = a_1 + a_2, where both a_1 and a_2 are convex, we have ∂_ε a(x) ⊂ ∂_ε a_1(x) + ∂_ε a_2(x).

If a_1(x) = (L/2)||x - z||^2, then

∂_ε a_1(x) = { y in R^d : ||y - L(x - z)||^2 <= 2 L ε } = { y = L(x - z) + f : ||f|| <= sqrt(2 L ε) }.

If a_2 = h and x is an ε-minimizer of a_1 + a_2, then 0 belongs to ∂_ε a(x). Since ∂_ε a(x) ⊂ ∂_ε a_1(x) + ∂_ε h(x), we have that 0 is the sum of an element of ∂_ε a_1(x) and of an element of ∂_ε h(x). Hence, there is an f such that

L(z - x) - f ∈ ∂_ε h(x), with ||f|| <= sqrt(2 L ε).   (9)

Using z = y_{i-1} - (1/L)(g'(y_{i-1}) + e_i) and x = x_i, this implies that there exists f_i such that ||f_i|| <= sqrt(2 L ε_i) and

L(y_{i-1} - x_i) - g'(y_{i-1}) - e_i - f_i ∈ ∂_{ε_i} h(x_i).

In [28, Definition 2.1], Eq. (9) is replaced by L(z - x) ∈ ∂_ε h(x). Hence, their definition of an approximate solution is equivalent to ours but using f = 0. If we replace f_i by 0 in the proof of Proposition 2, we get the O(1/k^2) convergence rate using any sequence of errors {ε_k} satisfying the conditions necessary to achieve the O(1/k^2) rate in [28, Th. 4.4]. We can also make the same assumption on f in Proposition 1 to achieve the optimal convergence rate with a decay of sqrt(ε_k) in O(1/k^{0.5+δ}) instead of O(1/k^{1+δ}).

6.1 Basic proximal-gradient method with errors in the convex case

We now give the proof of Proposition 1.

Proof. Since x_k is an ε_k-optimal solution to the proximal problem in the sense of (4), we can use Lemma 2 (with y_{k-1} = x_{k-1}) to obtain that there exists f_k such that ||f_k|| <= sqrt(2 L ε_k) and

L(x_{k-1} - x_k) - g'(x_{k-1}) - e_k - f_k ∈ ∂_{ε_k} h(x_k).

We now bound g(x_i) and h(x_i) as follows:

g(x_i) <= g(x_{i-1}) + <g'(x_{i-1}), x_i - x_{i-1}> + (L/2)||x_i - x_{i-1}||^2   (using the L-Lipschitz gradient of g)
       <= g(x*) + <g'(x_{i-1}), x_{i-1} - x*> + <g'(x_{i-1}), x_i - x_{i-1}> + (L/2)||x_i - x_{i-1}||^2   (using the convexity of g).

Using the $\varepsilon_i$-subgradient, we have
\[ h(x_i) \leq h(x^*) + \big\langle L(x_{i-1} - x_i) - g'(x_{i-1}) - e_i - L f_i,\ x_i - x^* \big\rangle + \varepsilon_i. \]
Adding the two together, we get:
\begin{align*}
f(x_i) = g(x_i) + h(x_i) &\leq f(x^*) + \frac{L}{2}\|x_i - x_{i-1}\|^2 - L\langle x_i - x_{i-1}, x_i - x^* \rangle + \varepsilon_i - \langle e_i + L f_i, x_i - x^* \rangle \\
&= f(x^*) + \frac{L}{2}\big( \|x_{i-1} - x^*\|^2 - \|x_i - x^*\|^2 \big) + \varepsilon_i - \langle e_i + L f_i, x_i - x^* \rangle \\
&\leq f(x^*) + \frac{L}{2}\big( \|x_{i-1} - x^*\|^2 - \|x_i - x^*\|^2 \big) + \varepsilon_i + \big( \|e_i\| + \sqrt{2L\varepsilon_i} \big)\|x_i - x^*\|,
\end{align*}
using Cauchy-Schwarz and $\|f_i\| \leq \sqrt{2\varepsilon_i/L}$. Moving $f(x^*)$ to the other side and summing from $i = 1$ to $k$, the quadratic terms telescope and we get
\[ \sum_{i=1}^k \big[ f(x_i) - f(x^*) \big] + \frac{L}{2}\|x_k - x^*\|^2 \leq \frac{L}{2}\|x_0 - x^*\|^2 + \sum_{i=1}^k \varepsilon_i + \sum_{i=1}^k \big( \|e_i\| + \sqrt{2L\varepsilon_i} \big)\|x_i - x^*\|. \tag{10} \]
Eq. (10) has two purposes. The first one is to bound the values of $\|x_i - x^*\|$ using the recursive definition. Once we have a bound on these quantities, we shall be able to bound the function values using only $\|x_0 - x^*\|$ and the values of the errors.

Bounding $\|x_i - x^*\|$. We now need to bound the quantities $\|x_i - x^*\|$ in terms of $\|x_0 - x^*\|$, $\|e_i\|$ and $\varepsilon_i$. Dropping the first term in Eq. (10), which is positive due to the optimality of $f(x^*)$, we have:
\[ \|x_k - x^*\|^2 \leq \|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{2\varepsilon_i}{L} + \sum_{i=1}^k \Big( \frac{2\|e_i\|}{L} + 2\sqrt{\frac{2\varepsilon_i}{L}} \Big)\|x_i - x^*\|. \]
We now use Lemma 1 with $S_k = \|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{2\varepsilon_i}{L}$ and $\lambda_i = \frac{2\|e_i\|}{L} + 2\sqrt{\frac{2\varepsilon_i}{L}}$ to get
\[ \|x_k - x^*\| \leq \sum_{i=1}^k \Big( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \Big) + \Bigg[ \|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{2\varepsilon_i}{L} + \Big( \sum_{i=1}^k \Big( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \Big) \Big)^2 \Bigg]^{1/2}. \]

Denoting $A_k = \sum_{i=1}^k \big( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \big)$ and $B_k = \sum_{i=1}^k \frac{\varepsilon_i}{L}$, we get
\[ \|x_k - x^*\| \leq A_k + \big( \|x_0 - x^*\|^2 + 2B_k + A_k^2 \big)^{1/2}. \]
Since $A_i$ and $B_i$ are increasing sequences ($\|e_i\|$ and $\varepsilon_i$ being positive), we have for $i \leq k$
\begin{align*}
\|x_i - x^*\| &\leq A_i + \big( \|x_0 - x^*\|^2 + 2B_i + A_i^2 \big)^{1/2} \\
&\leq A_k + \big( \|x_0 - x^*\|^2 + 2B_k + A_k^2 \big)^{1/2} \\
&\leq 2A_k + \|x_0 - x^*\| + \sqrt{2B_k},
\end{align*}
using the positivity of $\|x_0 - x^*\|$, $B_k$ and $A_k$.

Bounding the function values. Now that we have a common bound for all $\|x_i - x^*\|$ with $i \leq k$, we can upper-bound the right-hand side of Eq. (10) using only terms depending on $\|x_0 - x^*\|$, $\|e_i\|$ and $\varepsilon_i$. Indeed, discarding $\frac{L}{2}\|x_k - x^*\|^2$, which is positive, Eq. (10) becomes
\begin{align*}
\sum_{i=1}^k \big[ f(x_i) - f(x^*) \big] &\leq \frac{L}{2}\|x_0 - x^*\|^2 + L B_k + L A_k\big( 2A_k + \|x_0 - x^*\| + \sqrt{2B_k} \big) \\
&= \frac{L}{2}\Big( \|x_0 - x^*\|^2 + 2B_k + 4A_k^2 + 2A_k\|x_0 - x^*\| + 2A_k\sqrt{2B_k} \Big) \\
&\leq \frac{L}{2}\big( \|x_0 - x^*\| + 2A_k + \sqrt{2B_k} \big)^2.
\end{align*}
Since $f$ is convex, we get
\[ f\Big( \frac{1}{k}\sum_{i=1}^k x_i \Big) - f(x^*) \leq \frac{1}{k}\sum_{i=1}^k \big[ f(x_i) - f(x^*) \big] \leq \frac{L}{2k}\big( \|x_0 - x^*\| + 2A_k + \sqrt{2B_k} \big)^2. \]

6.2 Accelerated proximal-gradient method with errors in the convex case

We now give the proof of Proposition 2.

Proof. Defining
\[ \theta_k = \frac{2}{k+1}, \qquad v_k = x_{k-1} + \frac{1}{\theta_k}(x_k - x_{k-1}), \]

we can rewrite the update for $y_k$ as $y_k = (1 - \theta_{k+1}) x_k + \theta_{k+1} v_k$, because
\begin{align*}
(1 - \theta_{k+1}) x_k + \theta_{k+1} v_k &= \frac{k}{k+2}\, x_k + \frac{2}{k+2}\Big[ x_{k-1} + \frac{k+1}{2}(x_k - x_{k-1}) \Big] \\
&= \frac{2k+1}{k+2}\, x_k - \frac{k-1}{k+2}\, x_{k-1} \\
&= x_k + \frac{k-1}{k+2}(x_k - x_{k-1}) = y_k.
\end{align*}
Because $g'$ is $L$-Lipschitz and $g$ is convex, we get for any $z$ that
\begin{align*}
g(x_k) &\leq g(y_{k-1}) + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \\
&\leq g(z) + \langle g'(y_{k-1}), y_{k-1} - z \rangle + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2.
\end{align*}
Because $L(y_{k-1} - x_k) - g'(y_{k-1}) - e_k - L f_k \in \partial_{\varepsilon_k} h(x_k)$, we have for any $z$ that
\begin{align*}
h(x_k) &\leq \varepsilon_k + h(z) + \big\langle L(y_{k-1} - x_k) - g'(y_{k-1}) - e_k - L f_k,\ x_k - z \big\rangle \\
&= \varepsilon_k + h(z) + \langle g'(y_{k-1}), z - x_k \rangle + L\langle x_k - y_{k-1}, z - x_k \rangle + \langle e_k + L f_k, z - x_k \rangle.
\end{align*}
Adding these bounds together gives:
\[ f(x_k) \leq \varepsilon_k + f(z) + L\langle x_k - y_{k-1}, z - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 + \langle e_k + L f_k, z - x_k \rangle. \]
Choosing $z = \theta_k x^* + (1 - \theta_k) x_{k-1}$ gives
\begin{align*}
f(x_k) &\leq \varepsilon_k + f\big( \theta_k x^* + (1 - \theta_k) x_{k-1} \big) + L\big\langle x_k - y_{k-1},\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle \\
&\qquad + \frac{L}{2}\|x_k - y_{k-1}\|^2 + \big\langle e_k + L f_k,\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle \\
&\leq \varepsilon_k + \theta_k f(x^*) + (1 - \theta_k) f(x_{k-1}) + L\big\langle x_k - y_{k-1},\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle \\
&\qquad + \frac{L}{2}\|x_k - y_{k-1}\|^2 + \big\langle e_k + L f_k,\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle, \tag{11}
\end{align*}
using the convexity of $f$ and the fact that $\theta_k$ is in $[0, 1]$. Since
\[ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k = \theta_k (x^* - v_k) \]
and
\[ x_k - y_{k-1} = \theta_k v_k + (1 - \theta_k) x_{k-1} - \big( (1 - \theta_k) x_{k-1} + \theta_k v_{k-1} \big) = \theta_k (v_k - v_{k-1}), \]

we have
\begin{align*}
\big\langle x_k - y_{k-1},\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle &= \theta_k^2 \langle v_k - v_{k-1}, x^* - v_k \rangle \\
&= \frac{\theta_k^2}{2}\big( \|v_{k-1} - x^*\|^2 - \|v_k - x^*\|^2 - \|v_k - v_{k-1}\|^2 \big), \tag{12} \\
\|x_k - y_{k-1}\|^2 &= \theta_k^2 \|v_k - v_{k-1}\|^2, \tag{13} \\
\big\langle e_k + L f_k,\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle &= \theta_k \langle e_k + L f_k, x^* - v_k \rangle.
\end{align*}
Summing Eqs. (12) and (13), we get
\[ L\big\langle x_k - y_{k-1},\ \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \big\rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 = \frac{L\theta_k^2}{2}\big( \|v_{k-1} - x^*\|^2 - \|v_k - x^*\|^2 \big). \]
Moving all function values in Eq. (11) to the left-hand side, we then get
\[ f(x_k) - \theta_k f(x^*) - (1 - \theta_k) f(x_{k-1}) \leq \frac{L\theta_k^2}{2}\big( \|v_{k-1} - x^*\|^2 - \|v_k - x^*\|^2 \big) + \varepsilon_k + \theta_k \langle e_k + L f_k, x^* - v_k \rangle. \]
Reordering the terms and dividing by $\theta_k^2$ gives
\[ \frac{f(x_k) - f(x^*)}{\theta_k^2} + \frac{L}{2}\|v_k - x^*\|^2 \leq \frac{1 - \theta_k}{\theta_k^2}\big( f(x_{k-1}) - f(x^*) \big) + \frac{L}{2}\|v_{k-1} - x^*\|^2 + \frac{\varepsilon_k}{\theta_k^2} + \frac{1}{\theta_k}\langle e_k + L f_k, x^* - v_k \rangle. \]
Now we use that, for all $k$ greater than or equal to $1$,
\[ \frac{1 - \theta_k}{\theta_k^2} \leq \frac{1}{\theta_{k-1}^2} \]
to apply this recursively and obtain
\[ \frac{f(x_k) - f(x^*)}{\theta_k^2} + \frac{L}{2}\|v_k - x^*\|^2 \leq \frac{1 - \theta_0}{\theta_0^2}\big( f(x_0) - f(x^*) \big) + \frac{L}{2}\|v_0 - x^*\|^2 + \sum_{i=1}^k \Big[ \frac{\varepsilon_i}{\theta_i^2} + \frac{\big( \|e_i\| + \sqrt{2L\varepsilon_i} \big)\|x^* - v_i\|}{\theta_i} \Big], \]
using $\|f_i\| \leq \sqrt{2\varepsilon_i/L}$. Since $v_0 = x_0$ and $\theta_0 = 1$, we get
\[ f(x_k) - f(x^*) + \frac{L\theta_k^2}{2}\|v_k - x^*\|^2 \leq \frac{L\theta_k^2}{2}\|x_0 - x^*\|^2 + \theta_k^2 \sum_{i=1}^k \frac{\varepsilon_i}{\theta_i^2} + \theta_k^2 \sum_{i=1}^k \frac{\big( \|e_i\| + \sqrt{2L\varepsilon_i} \big)\|x^* - v_i\|}{\theta_i}. \tag{14} \]
As in the previous proof, we will now use Eq. (14) to first bound the values of $\|v_i - x^*\|$, then, using these bounds, bound the function values.

6.2.1 Bounding $\|v_i - x^*\|$

We now need to bound the quantities $\|v_i - x^*\|$ in terms of $\|x_0 - x^*\|$, $\|e_i\|$ and $\varepsilon_i$. Dropping the first term of Eq. (14), which is positive, and dividing by $\frac{L\theta_k^2}{2}$,
\[ \|v_k - x^*\|^2 \leq \|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{2\varepsilon_i}{L\theta_i^2} + \sum_{i=1}^k \frac{2\big( \|e_i\| + \sqrt{2L\varepsilon_i} \big)}{L\theta_i}\|x^* - v_i\|. \]
Since $\theta_i = \frac{2}{i+1}$, we have $\frac{1}{\theta_i} = \frac{i+1}{2} \leq i$ since $i \geq 1$. Thus, we have
\[ \|v_k - x^*\|^2 \leq \|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{2 i^2 \varepsilon_i}{L} + \sum_{i=1}^k \Big( \frac{2 i \|e_i\|}{L} + 2 i \sqrt{\frac{2\varepsilon_i}{L}} \Big)\|x^* - v_i\|. \]
From Lemma 1 using $S_k = \|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{2 i^2 \varepsilon_i}{L}$ and $\lambda_i = \frac{2 i \|e_i\|}{L} + 2 i \sqrt{\frac{2\varepsilon_i}{L}}$, and denoting $\tilde A_k = \sum_{i=1}^k i \big( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \big)$ and $\tilde B_k = \sum_{i=1}^k \frac{i^2 \varepsilon_i}{L}$, we get
\[ \|v_k - x^*\| \leq \tilde A_k + \big( \|x_0 - x^*\|^2 + 2\tilde B_k + \tilde A_k^2 \big)^{1/2}. \]
Since $\tilde A_i$ and $\tilde B_i$ are increasing sequences, we also have for $i \leq k$:
\begin{align*}
\|v_i - x^*\| &\leq \tilde A_i + \big( \|x_0 - x^*\|^2 + 2\tilde B_i + \tilde A_i^2 \big)^{1/2} \\
&\leq \tilde A_k + \big( \|x_0 - x^*\|^2 + 2\tilde B_k + \tilde A_k^2 \big)^{1/2} \\
&\leq \|x_0 - x^*\| + 2\tilde A_k + \sqrt{2\tilde B_k}.
\end{align*}

6.2.2 Bounding the function values

Dropping $\frac{L\theta_k^2}{2}\|v_k - x^*\|^2$ in Eq. (14) since it is positive, we thus have
\begin{align*}
f(x_k) - f(x^*) &\leq \frac{L\theta_k^2}{2}\|x_0 - x^*\|^2 + L\theta_k^2 \tilde B_k + L\theta_k^2 \tilde A_k \big( \|x_0 - x^*\| + 2\tilde A_k + \sqrt{2\tilde B_k} \big) \\
&= \frac{L\theta_k^2}{2}\Big( \|x_0 - x^*\|^2 + 2\tilde B_k + 4\tilde A_k^2 + 2\tilde A_k\|x_0 - x^*\| + 2\tilde A_k\sqrt{2\tilde B_k} \Big) \\
&\leq \frac{L\theta_k^2}{2}\big( \|x_0 - x^*\| + 2\tilde A_k + \sqrt{2\tilde B_k} \big)^2,
\end{align*}
and
\[ \frac{f(x_k) - f(x^*)}{\theta_k^2} \leq \frac{L}{2}\big( \|x_0 - x^*\| + 2\tilde A_k + \sqrt{2\tilde B_k} \big)^2. \]
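The accelerated scheme analysed above is easy to simulate numerically. Below is a minimal sketch (not from the paper) of the inexact accelerated proximal-gradient method on an $\ell_1$-regularized least-squares problem: a synthetic gradient error with $\|e_k\| = 1/k^{2+\delta}$ is injected at every step, which keeps $\tilde A_k$ bounded as required above; the problem instance, function names and parameter values are all illustrative assumptions.

```python
import numpy as np

def inexact_accelerated_prox_grad(A, b, lam, iters=200, delta=0.5, seed=0):
    """Accelerated proximal-gradient iteration for
    min_x 0.5*||Ax - b||^2 + lam*||x||_1,
    with a decaying error e_k added to the gradient of the smooth term."""
    rng = np.random.default_rng(seed)
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of A^T(A y - b)
    x = np.zeros(A.shape[1])
    y = x.copy()
    for k in range(1, iters + 1):
        grad = A.T @ (A @ y - b)
        e = rng.standard_normal(x.size)
        e *= 1.0 / (k ** (2 + delta)) / np.linalg.norm(e)   # ||e_k|| = 1/k^(2+delta)
        z = y - (grad + e) / L
        # exact proximal step: soft-thresholding at level lam/L
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        # momentum corresponding to theta_k = 2/(k+1)
        y = x_new + (k - 1) / (k + 2) * (x_new - x)
        x = x_new
    return x

# small synthetic instance (illustrative)
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
x_hat = inexact_accelerated_prox_grad(A, b, lam=0.5)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * np.abs(x).sum()
```

With this error schedule, the iterates behave essentially like the error-free method; making $\|e_k\|$ constant instead (replacing the decay by a fixed level) visibly degrades the attainable accuracy, in line with Proposition 2.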

6.3 Basic proximal-gradient method with errors in the strongly convex case

Below is the proof of Proposition 3.

Proof. Again, there exists $f_k$ such that $\|f_k\| \leq \sqrt{2\varepsilon_k/L}$ and
\[ L(x_{k-1} - x_k) - g'(x_{k-1}) - e_k - L f_k \in \partial_{\varepsilon_k} h(x_k). \]
Since $x^*$ is optimal, we have that $x^* = \mathrm{prox}_L\big[ x^* - \frac{1}{L} g'(x^*) \big]$. Moreover, since the proximal objective is $L$-strongly convex, $\varepsilon_k$-optimality of $x_k$ allows us to write $x_k = \mathrm{prox}_L\big[ x_{k-1} - \frac{1}{L} g'(x_{k-1}) - \frac{1}{L} e_k \big] + f_k$. We first separate $f_k$, the error in the proximal, from the rest:
\begin{align*}
\|x_k - x^*\|^2 &= \Big\| \mathrm{prox}_L\Big[ x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k \Big] + f_k - \mathrm{prox}_L\Big[ x^* - \tfrac{1}{L} g'(x^*) \Big] \Big\|^2 \\
&= \Big\| \mathrm{prox}_L\Big[ x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k \Big] - \mathrm{prox}_L\Big[ x^* - \tfrac{1}{L} g'(x^*) \Big] \Big\|^2 + \|f_k\|^2 \\
&\qquad + 2\Big\langle f_k,\ \mathrm{prox}_L\Big[ x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k \Big] - \mathrm{prox}_L\Big[ x^* - \tfrac{1}{L} g'(x^*) \Big] \Big\rangle \\
&\leq \Big\| \mathrm{prox}_L[\,\cdot\,] - \mathrm{prox}_L[\,\cdot\,] \Big\|^2 + \frac{2\varepsilon_k}{L} + 2\sqrt{\frac{2\varepsilon_k}{L}}\, \Big\| \mathrm{prox}_L[\,\cdot\,] - \mathrm{prox}_L[\,\cdot\,] \Big\| \\
&\qquad \text{(using Cauchy-Schwarz and $\|f_k\| \leq \sqrt{2\varepsilon_k/L}$)} \\
&\leq \Big\| x_{k-1} - x^* - \tfrac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) - \tfrac{1}{L} e_k \Big\|^2 + \frac{2\varepsilon_k}{L} + 2\sqrt{\frac{2\varepsilon_k}{L}}\, \Big\| x_{k-1} - x^* - \tfrac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) - \tfrac{1}{L} e_k \Big\| \\
&\qquad \text{(using the non-expansiveness of the proximal operator)} \\
&\leq \Big( \Big\| x_{k-1} - x^* - \tfrac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) \Big\| + \frac{\|e_k\|}{L} \Big)^2 + \frac{2\varepsilon_k}{L} + 2\sqrt{\frac{2\varepsilon_k}{L}} \Big( \Big\| x_{k-1} - x^* - \tfrac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) \Big\| + \frac{\|e_k\|}{L} \Big),
\end{align*}
using the triangle inequality.

We continue this computation, but now grouping the two error terms together. The previous bound is exactly
\[ \|x_k - x^*\|^2 \leq \Big( \Big\| x_{k-1} - x^* - \frac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) \Big\| + \frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}} \Big)^2. \]
We now need to bound $\big\| x_{k-1} - x^* - \frac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) \big\|$ to get the final result. We have:
\begin{align*}
\Big\| x_{k-1} - x^* - \frac{1}{L}\big( g'(x_{k-1}) - g'(x^*) \big) \Big\|^2 &= \|x_{k-1} - x^*\|^2 + \frac{1}{L^2}\big\| g'(x_{k-1}) - g'(x^*) \big\|^2 - \frac{2}{L}\big\langle g'(x_{k-1}) - g'(x^*),\ x_{k-1} - x^* \big\rangle \\
&\leq \|x_{k-1} - x^*\|^2 + \frac{1}{L^2}\big\| g'(x_{k-1}) - g'(x^*) \big\|^2 \\
&\qquad - \frac{2}{L}\Big( \frac{1}{\mu + L}\big\| g'(x_{k-1}) - g'(x^*) \big\|^2 + \frac{\mu L}{\mu + L}\|x_{k-1} - x^*\|^2 \Big) \\
&\qquad \text{(using Theorem 2.1.12 of [30])} \\
&= \Big( 1 - \frac{2\mu}{\mu + L} \Big)\|x_{k-1} - x^*\|^2 + \Big( \frac{1}{L^2} - \frac{2}{L(\mu + L)} \Big)\big\| g'(x_{k-1}) - g'(x^*) \big\|^2 \\
&\leq \Big( 1 - \frac{2\mu}{\mu + L} \Big)\|x_{k-1} - x^*\|^2 + \Big( \frac{1}{L^2} - \frac{2}{L(\mu + L)} \Big)\mu^2\|x_{k-1} - x^*\|^2 \\
&\qquad \text{(using the negativity of $\tfrac{1}{L^2} - \tfrac{2}{L(\mu+L)}$ and the strong convexity of $g$)} \\
&= \Big( 1 - \frac{\mu}{L} \Big)^2 \|x_{k-1} - x^*\|^2.
\end{align*}
Thus
\begin{align*}
\|x_k - x^*\|^2 &\leq \Big( 1 - \frac{\mu}{L} \Big)^2 \|x_{k-1} - x^*\|^2 + \Big( \frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}} \Big)^2 + 2\Big( \frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}} \Big)\Big( 1 - \frac{\mu}{L} \Big)\|x_{k-1} - x^*\| \\
&= \Bigg[ \Big( 1 - \frac{\mu}{L} \Big)\|x_{k-1} - x^*\| + \frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}} \Bigg]^2.
\end{align*}

Taking the square root of both sides and applying the bound recursively yields
\[ \|x_k - x^*\| \leq \Big( 1 - \frac{\mu}{L} \Big)^k \|x_0 - x^*\| + \sum_{i=1}^k \Big( 1 - \frac{\mu}{L} \Big)^{k-i} \Big( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \Big). \]

6.4 Accelerated proximal-gradient method with errors in the strongly convex case

We now give the proof of Proposition 4.

Proof. Following [30], we have, in the error-free case,
\[ x_k = y_{k-1} - \frac{1}{L} g'(y_{k-1}). \]
We define
\[ \alpha_k^2 = (1 - \alpha_k)\alpha_{k-1}^2 + \frac{\mu}{L}\alpha_k, \qquad v_k = x_{k-1} + \frac{1}{\alpha_{k-1}}(x_k - x_{k-1}), \qquad \theta_k = \frac{\alpha_k}{1 + \alpha_k}, \qquad y_k = x_k + \theta_k (v_k - x_k). \]
If we choose $\alpha_0 = \sqrt{\mu/L}$, then $\alpha_k = \sqrt{\mu/L}$ for all $k$, and this yields
\[ y_k = x_k + \frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}}\,(x_k - x_{k-1}). \]
We can bound $g(x_k)$ with
\begin{align*}
g(x_k) &\leq g(y_{k-1}) + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \quad \text{(using the $L$-Lipschitz gradient of $g$)} \\
&\leq g(z) + \langle g'(y_{k-1}), y_{k-1} - z \rangle + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\|y_{k-1} - z\|^2 \quad \text{(using the $\mu$-strong convexity of $g$).}
\end{align*}
Using Lemma 2, we have that $L(y_{k-1} - x_k) - g'(y_{k-1}) - e_k - L f_k \in \partial_{\varepsilon_k} h(x_k)$. Hence, we have for any $z$ that
\begin{align*}
h(x_k) &\leq \varepsilon_k + h(z) + \big\langle L(y_{k-1} - x_k) - g'(y_{k-1}) - e_k - L f_k,\ x_k - z \big\rangle \\
&= \varepsilon_k + h(z) + \langle g'(y_{k-1}), z - x_k \rangle + L\langle x_k - y_{k-1}, z - x_k \rangle + \langle e_k + L f_k, z - x_k \rangle.
\end{align*}

Adding these two bounds, we get for any $z$
\[ f(x_k) \leq \varepsilon_k + f(z) + L\langle x_k - y_{k-1}, z - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\|y_{k-1} - z\|^2 + \langle e_k + L f_k, z - x_k \rangle. \]
Using $z = \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1}$, we get
\begin{align*}
f(x_k) &\leq \varepsilon_k + f\big( \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} \big) + L\big\langle x_k - y_{k-1},\ \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \big\rangle \\
&\qquad + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\big\| y_{k-1} - \alpha_{k-1} x^* - (1 - \alpha_{k-1}) x_{k-1} \big\|^2 + \big\langle e_k + L f_k,\ \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \big\rangle \\
&\leq \varepsilon_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x^* - x_{k-1}\|^2 \\
&\qquad + L\big\langle x_k - y_{k-1},\ \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \big\rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \\
&\qquad - \frac{\mu}{2}\big\| y_{k-1} - \alpha_{k-1} x^* - (1 - \alpha_{k-1}) x_{k-1} \big\|^2 + \big\langle e_k + L f_k,\ \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \big\rangle,
\end{align*}
using the $\mu$-strong convexity of $f$. We can replace $x_k - y_{k-1}$ using $y_{k-1} = x_{k-1} + \theta_{k-1}(v_{k-1} - x_{k-1})$ and $x_k - x_{k-1} = \alpha_{k-1}(v_k - x_{k-1})$, which give
\[ x_k - y_{k-1} = \alpha_{k-1}(v_k - x^*) - \theta_{k-1}(v_{k-1} - x^*) - (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*). \]
We also have
\[ \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k = \alpha_{k-1}(x^* - v_k) \]
and
\[ y_{k-1} - \alpha_{k-1} x^* - (1 - \alpha_{k-1}) x_{k-1} = \theta_{k-1}(v_{k-1} - x^*) + (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*). \]

Thus,
\begin{align*}
f(x_k) &\leq \varepsilon_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x_{k-1} - x^*\|^2 \\
&\qquad - L\alpha_{k-1}\big\langle \alpha_{k-1}(v_k - x^*) - \theta_{k-1}(v_{k-1} - x^*) - (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*),\ v_k - x^* \big\rangle \\
&\qquad + \frac{L}{2}\big\| \alpha_{k-1}(v_k - x^*) - \theta_{k-1}(v_{k-1} - x^*) - (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*) \big\|^2 \\
&\qquad - \frac{\mu}{2}\big\| \theta_{k-1}(v_{k-1} - x^*) + (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*) \big\|^2 + \alpha_{k-1}\langle e_k + L f_k, x^* - v_k \rangle.
\end{align*}
To avoid unnecessary clutter, we shall denote by $E_k$ the additional term induced by the errors, i.e.
\[ E_k = \varepsilon_k + \alpha_{k-1}\langle e_k + L f_k, x^* - v_k \rangle. \]
With a bit of well-needed cleaning, we may now expand the squared norms and inner products above and compute the coefficients of the terms $\|v_k - x^*\|^2$, $\|v_{k-1} - x^*\|^2$, $\|x_{k-1} - x^*\|^2$, $\langle v_{k-1} - x^*, v_k - x^* \rangle$, $\langle x_{k-1} - x^*, v_k - x^* \rangle$ and $\langle x_{k-1} - x^*, v_{k-1} - x^* \rangle$ (writing $\alpha$ for $\alpha_{k-1}$ and $\theta$ for $\theta_{k-1}$). For $\|v_k - x^*\|^2$, we have $-L\alpha^2 + \frac{L\alpha^2}{2} = -\frac{L\alpha^2}{2}$. For $\langle v_{k-1} - x^*, v_k - x^* \rangle$, we get $L\alpha\theta - L\alpha\theta = 0$. For $\langle x_{k-1} - x^*, v_k - x^* \rangle$, we get $L\alpha(\alpha - \theta) - L\alpha(\alpha - \theta) = 0$. For $\|v_{k-1} - x^*\|^2$, there is only the coefficient $\frac{(L - \mu)\theta^2}{2}$. For $\langle x_{k-1} - x^*, v_{k-1} - x^* \rangle$, we get $(L - \mu)\theta(\alpha - \theta)$. For $\|x_{k-1} - x^*\|^2$, we get $\frac{(L - \mu)(\alpha - \theta)^2}{2} - \frac{\mu}{2}\alpha(1 - \alpha)$.

Hence, we have
\begin{align*}
f(x_k) &\leq E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) - \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \\
&\qquad + \frac{L - \mu}{2}\Big[ \theta_{k-1}^2\|v_{k-1} - x^*\|^2 + 2\theta_{k-1}(\alpha_{k-1} - \theta_{k-1})\langle v_{k-1} - x^*, x_{k-1} - x^* \rangle + (\alpha_{k-1} - \theta_{k-1})^2\|x_{k-1} - x^*\|^2 \Big] \\
&\qquad - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x_{k-1} - x^*\|^2,
\end{align*}
the middle line allowing us to complete the square. We may now factor it to get
\begin{align*}
f(x_k) &\leq E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) - \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \\
&\qquad + \frac{L - \mu}{2}\big\| \theta_{k-1}(v_{k-1} - x^*) + (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*) \big\|^2 - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x_{k-1} - x^*\|^2.
\end{align*}
Since $\theta_{k-1} = \frac{\alpha_{k-1}}{1 + \alpha_{k-1}}$ implies $\alpha_{k-1} - \theta_{k-1} = \alpha_{k-1}\theta_{k-1}$, we have
\[ \big\| \theta_{k-1}(v_{k-1} - x^*) + (\alpha_{k-1} - \theta_{k-1})(x_{k-1} - x^*) \big\|^2 \leq \theta_{k-1}^2(1 + \alpha_{k-1})\Big( \|v_{k-1} - x^*\|^2 + \alpha_{k-1}\|x_{k-1} - x^*\|^2 \Big), \]
and, since $\alpha_{k-1} = \sqrt{\mu/L}$, the resulting terms depending on $\|x_{k-1} - x^*\|^2$ cancel exactly while the coefficient of $\|v_{k-1} - x^*\|^2$ is at most $(1 - \alpha_{k-1})\frac{L\alpha_{k-1}^2}{2}$. Discarding the term depending on $\|x_{k-1} - x^*\|^2$ and regrouping the terms depending on $\|v_{k-1} - x^*\|^2$, we have
\[ f(x_k) \leq E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) - \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 + (1 - \alpha_{k-1})\frac{L\alpha_{k-1}^2}{2}\|v_{k-1} - x^*\|^2. \]

Reordering the terms, we have
\[ f(x_k) - f(x^*) + \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \leq (1 - \alpha_{k-1})\Big( f(x_{k-1}) - f(x^*) + \frac{L\alpha_{k-1}^2}{2}\|v_{k-1} - x^*\|^2 \Big) + E_k. \tag{15} \]
Using $\alpha_k = \sqrt{\mu/L}$ for all $k$, so that $\frac{L\alpha_k^2}{2} = \frac{\mu}{2}$, and denoting
\[ \delta_k = f(x_k) - f(x^*) + \frac{\mu}{2}\|v_k - x^*\|^2, \tag{16} \]
we get the following recursion:
\[ \delta_k \leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)\delta_{k-1} + E_k. \tag{17} \]
Applying this relationship recursively, we get
\[ \delta_k \leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^k \delta_0 + \sum_{t=1}^k \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^{k-t} E_t. \tag{18} \]
Since $E_k = \varepsilon_k + \alpha_{k-1}\langle e_k + L f_k, x^* - v_k \rangle$, we can bound it by
\[ E_k \leq \varepsilon_k + \big( \|e_k\| + \sqrt{2L\varepsilon_k} \big)\|v_k - x^*\|, \tag{19} \]
using $\|f_k\| \leq \sqrt{2\varepsilon_k/L}$ and $\alpha_{k-1} \leq 1$. Plugging Eq. (19) into Eq. (18), we get
\[ \delta_k \leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^k \Bigg[ \delta_0 + \sum_{t=1}^k \frac{\varepsilon_t + \big( \|e_t\| + \sqrt{2L\varepsilon_t} \big)\|v_t - x^*\|}{\big( 1 - \sqrt{\mu/L} \big)^t} \Bigg]. \tag{20} \]
Again, we shall use Eq. (20) to first bound the values of $\|v_i - x^*\|$, then the function values themselves.

6.4.1 Bounding $\|v_i - x^*\|$

We will now use Lemma 1 to bound the value of $\|v_k - x^*\|$. Since $\frac{\mu}{2}\|v_k - x^*\|^2$ is bounded by $\delta_k$ using Eq. (16), we can use Eq. (20) to get
\[ \frac{\mu}{2}\|v_k - x^*\|^2 \leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^k \Bigg[ \delta_0 + \sum_{t=1}^k \frac{\varepsilon_t + \big( \|e_t\| + \sqrt{2L\varepsilon_t} \big)\|v_t - x^*\|}{\big( 1 - \sqrt{\mu/L} \big)^t} \Bigg]. \]
Multiplying both sides by $\frac{2}{\mu}\big( 1 - \sqrt{\mu/L} \big)^{-k}$ yields
\[ \frac{\|v_k - x^*\|^2}{\big( 1 - \sqrt{\mu/L} \big)^{k}} \leq \frac{2}{\mu}\Bigg[ \delta_0 + \sum_{t=1}^k \frac{\varepsilon_t}{\big( 1 - \sqrt{\mu/L} \big)^{t}} \Bigg] + \frac{2}{\mu}\sum_{t=1}^k \frac{\|e_t\| + \sqrt{2L\varepsilon_t}}{\big( 1 - \sqrt{\mu/L} \big)^{t/2}} \cdot \frac{\|v_t - x^*\|}{\big( 1 - \sqrt{\mu/L} \big)^{t/2}}. \]
Using Lemma 1 with $u_k = \big( 1 - \sqrt{\mu/L} \big)^{-k/2}\|v_k - x^*\|$,
\[ S_k = \frac{2}{\mu}\big( \delta_0 + \hat B_k \big), \qquad \lambda_t = \frac{2}{\mu}\cdot\frac{\|e_t\| + \sqrt{2L\varepsilon_t}}{\big( 1 - \sqrt{\mu/L} \big)^{t/2}}, \qquad \hat B_k = \sum_{t=1}^k \frac{\varepsilon_t}{\big( 1 - \sqrt{\mu/L} \big)^{t}}, \]
we get
\[ \frac{\|v_k - x^*\|}{\big( 1 - \sqrt{\mu/L} \big)^{k/2}} \leq \frac{1}{2}\sum_{t=1}^k \lambda_t + \Bigg( \frac{2}{\mu}\big( \delta_0 + \hat B_k \big) + \Big( \frac{1}{2}\sum_{t=1}^k \lambda_t \Big)^2 \Bigg)^{1/2}. \]
Since $\hat B_t$ is an increasing sequence and the $\lambda_t$ are positive, we have for $i \leq k$
\[ \frac{\|v_i - x^*\|}{\big( 1 - \sqrt{\mu/L} \big)^{i/2}} \leq \frac{1}{2}\sum_{t=1}^k \lambda_t + \Bigg( \frac{2}{\mu}\big( \delta_0 + \hat B_k \big) + \Big( \frac{1}{2}\sum_{t=1}^k \lambda_t \Big)^2 \Bigg)^{1/2} \leq \sum_{t=1}^k \lambda_t + \sqrt{\frac{2}{\mu}\big( \delta_0 + \hat B_k \big)}. \]

Hence,
\[ \|v_i - x^*\| \leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^{i/2}\Bigg[ \sum_{t=1}^k \lambda_t + \sqrt{\frac{2}{\mu}\big( \delta_0 + \hat B_k \big)} \Bigg]. \tag{23} \]

6.4.2 Bounding the function values

Denoting $\hat A_k = \frac{\mu}{2}\sum_{t=1}^k \lambda_t = \sum_{t=1}^k \frac{\|e_t\| + \sqrt{2L\varepsilon_t}}{\big( 1 - \sqrt{\mu/L} \big)^{t/2}}$, we obtain after plugging Eq. (23) into Eq. (20):
\begin{align*}
\delta_k &\leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^k \Bigg[ \delta_0 + \hat B_k + \hat A_k\Big( \frac{2}{\mu}\hat A_k + \sqrt{\frac{2}{\mu}\big( \delta_0 + \hat B_k \big)} \Big) \Bigg] \\
&\leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^k \Big( \sqrt{\delta_0 + \hat B_k} + \sqrt{\frac{2}{\mu}}\,\hat A_k \Big)^2.
\end{align*}
Using the $\mu$-strong convexity of $f$ and the fact that $v_0 = x_0$, we have
\[ \delta_0 = f(x_0) - f(x^*) + \frac{\mu}{2}\|v_0 - x^*\|^2 \leq 2\big( f(x_0) - f(x^*) \big). \]
Hence, discarding the term $\frac{\mu}{2}\|v_k - x^*\|^2$ of $\delta_k$, we have
\[ f(x_k) - f(x^*) \leq \Big( 1 - \sqrt{\frac{\mu}{L}} \Big)^k \Big( \sqrt{2\big( f(x_0) - f(x^*) \big) + \hat B_k} + \sqrt{\frac{2}{\mu}}\,\hat A_k \Big)^2. \tag{24} \]

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[2] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers, 2007/76, 2007.
[3] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
[4] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[5] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
[6] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S.J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.
[7] J. Fadili and G. Peyré. Total variation projection with first order schemes. IEEE Transactions on Image Processing, 20(3):657–669, 2011.
[8] X. Chen, S. Kim, Q. Lin, J.G. Carbonell, and E.P. Xing. Graph-structured multi-task regression and an efficient optimization method for general fused Lasso. arXiv preprint, 2010.
[9] J.-F. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4), 2010.
[10] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1):321–353, 2011.
[11] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. ICML, 2009.
[12] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. JMLR, 12:2297–2334, 2011.
[13] A. Barbero and S. Sra. Fast Newton-type methods for total variation regularization. ICML, 2011.
[14] J. Liu and J. Ye. Fast overlapping group Lasso. arXiv preprint, 2010.
[15] M. Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. AISTATS, 2010.
[16] M. Patriksson. A unified framework of descent algorithms for nonlinear programs and variational inequalities. PhD thesis, Department of Mathematics, Linköping University, Sweden.
[17] P.L. Combettes. Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization, 53(5-6):475–504, 2004.
[18] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. JMLR, 10:2899–2934, 2009.
[19] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. JMLR, 10:777–801, 2009.
[20] A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
[21] M. Baes. Estimate sequence methods: extensions and approximations. IFOR internal report, ETH Zurich.

[22] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. CORE Discussion Papers, 2011/02, 2011.
[23] A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, 2000.
[24] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46:157–178, 1993.
[25] M.P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. arXiv preprint, 2011.
[26] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[27] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
[28] S. Villa, S. Salzo, L. Baldassarre, and A. Verri. Accelerated and inexact forward-backward algorithms. Optimization Online, 2011.
[29] K. Jiang, D. Sun, and K.C. Toh. An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. Optimization Online, 2011.
[30] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[31] D.P. Bertsekas. Convex Optimization Theory. Athena Scientific, 2009.
[32] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization, 2008.
[33] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. JMLR, 12:2681–2720, 2011.
[34] H.H. Bauschke and P.L. Combettes. A Dykstra-like algorithm for two monotone operators. Pacific Journal of Optimization, 4(3):383–391, 2008.
[35] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[36] P.L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H.H. Bauschke, R.S. Burachik, P.L. Combettes, V. Elser, D.R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
[37] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. AISTATS, 2003.
[38] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
[39] A. d'Aspremont. Subsampling algorithms for semidefinite programming. arXiv preprint, 2009.
[40] M. Schmidt, D. Kim, and S. Sra. Projected Newton-type methods in machine learning. In S. Sra, S. Nowozin, and S.J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.
[41] D.P. Bertsekas, A. Nedić, and A.E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.


More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Methylation-associated PHOX2B gene silencing is a rare event in human neuroblastoma.

Methylation-associated PHOX2B gene silencing is a rare event in human neuroblastoma. Methylation-associated PHOX2B gene silencing is a rare event in human neuroblastoma. Loïc De Pontual, Delphine Trochet, Franck Bourdeaut, Sophie Thomas, Heather Etchevers, Agnes Chompret, Véronique Minard,

More information

Analysis of Boyer and Moore s MJRTY algorithm

Analysis of Boyer and Moore s MJRTY algorithm Analysis of Boyer and Moore s MJRTY algorithm Laurent Alonso, Edward M. Reingold To cite this version: Laurent Alonso, Edward M. Reingold. Analysis of Boyer and Moore s MJRTY algorithm. Information Processing

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

On Poincare-Wirtinger inequalities in spaces of functions of bounded variation

On Poincare-Wirtinger inequalities in spaces of functions of bounded variation On Poincare-Wirtinger inequalities in spaces of functions of bounded variation Maïtine Bergounioux To cite this version: Maïtine Bergounioux. On Poincare-Wirtinger inequalities in spaces of functions of

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

On the Griesmer bound for nonlinear codes

On the Griesmer bound for nonlinear codes On the Griesmer bound for nonlinear codes Emanuele Bellini, Alessio Meneghetti To cite this version: Emanuele Bellini, Alessio Meneghetti. On the Griesmer bound for nonlinear codes. Pascale Charpin, Nicolas

More information

Some tight polynomial-exponential lower bounds for an exponential function

Some tight polynomial-exponential lower bounds for an exponential function Some tight polynomial-exponential lower bounds for an exponential function Christophe Chesneau To cite this version: Christophe Chesneau. Some tight polynomial-exponential lower bounds for an exponential

More information

Topographic Dictionary Learning with Structured Sparsity

Topographic Dictionary Learning with Structured Sparsity Topographic Dictionary Learning with Structured Sparsity Julien Mairal 1 Rodolphe Jenatton 2 Guillaume Obozinski 2 Francis Bach 2 1 UC Berkeley 2 INRIA - SIERRA Project-Team San Diego, Wavelets and Sparsity

More information

About partial probabilistic information

About partial probabilistic information About partial probabilistic information Alain Chateauneuf, Caroline Ventura To cite this version: Alain Chateauneuf, Caroline Ventura. About partial probabilistic information. Documents de travail du Centre

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Proximal Methods for Optimization with Spasity-inducing Norms

Proximal Methods for Optimization with Spasity-inducing Norms Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology

More information

Convergence Rates of Biased Stochastic Optimization for Learning Sparse Ising Models

Convergence Rates of Biased Stochastic Optimization for Learning Sparse Ising Models Convergence Rates of Biased Stochastic Optimization for Learning Sparse Ising Models Jean Honorio Stony Broo University, Stony Broo, NY 794, USA jhonorio@cs.sunysb.edu Abstract We study the convergence

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Analyzing large-scale spike trains data with spatio-temporal constraints

Analyzing large-scale spike trains data with spatio-temporal constraints Analyzing large-scale spike trains data with spatio-temporal constraints Hassan Nasser, Olivier Marre, Selim Kraria, Thierry Viéville, Bruno Cessac To cite this version: Hassan Nasser, Olivier Marre, Selim

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Full-order observers for linear systems with unknown inputs

Full-order observers for linear systems with unknown inputs Full-order observers for linear systems with unknown inputs Mohamed Darouach, Michel Zasadzinski, Shi Jie Xu To cite this version: Mohamed Darouach, Michel Zasadzinski, Shi Jie Xu. Full-order observers

More information

Cutwidth and degeneracy of graphs

Cutwidth and degeneracy of graphs Cutwidth and degeneracy of graphs Benoit Kloeckner To cite this version: Benoit Kloeckner. Cutwidth and degeneracy of graphs. IF_PREPUB. 2009. HAL Id: hal-00408210 https://hal.archives-ouvertes.fr/hal-00408210v1

More information

On size, radius and minimum degree

On size, radius and minimum degree On size, radius and minimum degree Simon Mukwembi To cite this version: Simon Mukwembi. On size, radius and minimum degree. Discrete Mathematics and Theoretical Computer Science, DMTCS, 2014, Vol. 16 no.

More information

Symmetric Norm Inequalities And Positive Semi-Definite Block-Matrices

Symmetric Norm Inequalities And Positive Semi-Definite Block-Matrices Symmetric Norm Inequalities And Positive Semi-Definite lock-matrices Antoine Mhanna To cite this version: Antoine Mhanna Symmetric Norm Inequalities And Positive Semi-Definite lock-matrices 15

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

On the convergence of a regularized Jacobi algorithm for convex optimization

On the convergence of a regularized Jacobi algorithm for convex optimization On the convergence of a regularized Jacobi algorithm for convex optimization Goran Banjac, Kostas Margellos, and Paul J. Goulart Abstract In this paper we consider the regularized version of the Jacobi

More information

A non-commutative algorithm for multiplying (7 7) matrices using 250 multiplications

A non-commutative algorithm for multiplying (7 7) matrices using 250 multiplications A non-commutative algorithm for multiplying (7 7) matrices using 250 multiplications Alexandre Sedoglavic To cite this version: Alexandre Sedoglavic. A non-commutative algorithm for multiplying (7 7) matrices

More information

Widely Linear Estimation with Complex Data

Widely Linear Estimation with Complex Data Widely Linear Estimation with Complex Data Bernard Picinbono, Pascal Chevalier To cite this version: Bernard Picinbono, Pascal Chevalier. Widely Linear Estimation with Complex Data. IEEE Transactions on

More information

New estimates for the div-curl-grad operators and elliptic problems with L1-data in the half-space

New estimates for the div-curl-grad operators and elliptic problems with L1-data in the half-space New estimates for the div-curl-grad operators and elliptic problems with L1-data in the half-space Chérif Amrouche, Huy Hoang Nguyen To cite this version: Chérif Amrouche, Huy Hoang Nguyen. New estimates

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Case report on the article Water nanoelectrolysis: A simple model, Journal of Applied Physics (2017) 122,

Case report on the article Water nanoelectrolysis: A simple model, Journal of Applied Physics (2017) 122, Case report on the article Water nanoelectrolysis: A simple model, Journal of Applied Physics (2017) 122, 244902 Juan Olives, Zoubida Hammadi, Roger Morin, Laurent Lapena To cite this version: Juan Olives,

More information

Impedance Transmission Conditions for the Electric Potential across a Highly Conductive Casing

Impedance Transmission Conditions for the Electric Potential across a Highly Conductive Casing Impedance Transmission Conditions for the Electric Potential across a Highly Conductive Casing Hélène Barucq, Aralar Erdozain, David Pardo, Victor Péron To cite this version: Hélène Barucq, Aralar Erdozain,

More information

A Simple Proof of P versus NP

A Simple Proof of P versus NP A Simple Proof of P versus NP Frank Vega To cite this version: Frank Vega. A Simple Proof of P versus NP. 2016. HAL Id: hal-01281254 https://hal.archives-ouvertes.fr/hal-01281254 Submitted

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

A new simple recursive algorithm for finding prime numbers using Rosser s theorem

A new simple recursive algorithm for finding prime numbers using Rosser s theorem A new simple recursive algorithm for finding prime numbers using Rosser s theorem Rédoane Daoudi To cite this version: Rédoane Daoudi. A new simple recursive algorithm for finding prime numbers using Rosser

More information

Recent Advances in Structured Sparse Models

Recent Advances in Structured Sparse Models Recent Advances in Structured Sparse Models Julien Mairal Willow group - INRIA - ENS - Paris 21 September 2010 LEAR seminar At Grenoble, September 21 st, 2010 Julien Mairal Recent Advances in Structured

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

Inexact proximal stochastic gradient method for convex composite optimization

Inexact proximal stochastic gradient method for convex composite optimization Comput Optim Appl DOI 10.1007/s10589-017-993-7 Inexact proximal stochastic gradient method for convex composite optimization Xiao Wang 1 Shuxiong Wang Hongchao Zhang 3 Received: 9 September 016 Springer

More information

On the link between finite differences and derivatives of polynomials

On the link between finite differences and derivatives of polynomials On the lin between finite differences and derivatives of polynomials Kolosov Petro To cite this version: Kolosov Petro. On the lin between finite differences and derivatives of polynomials. 13 pages, 1

More information

Dispersion relation results for VCS at JLab

Dispersion relation results for VCS at JLab Dispersion relation results for VCS at JLab G. Laveissiere To cite this version: G. Laveissiere. Dispersion relation results for VCS at JLab. Compton Scattering from Low to High Momentum Transfer, Mar

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

A Distributed Newton Method for Network Utility Maximization, I: Algorithm

A Distributed Newton Method for Network Utility Maximization, I: Algorithm A Distributed Newton Method for Networ Utility Maximization, I: Algorithm Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract Most existing wors use dual decomposition and first-order

More information

Influence of a Rough Thin Layer on the Potential

Influence of a Rough Thin Layer on the Potential Influence of a Rough Thin Layer on the Potential Ionel Ciuperca, Ronan Perrussel, Clair Poignard To cite this version: Ionel Ciuperca, Ronan Perrussel, Clair Poignard. Influence of a Rough Thin Layer on

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT Linear and Nonlinear Analysis Volume 1, Number 1, 2015, 1 PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT KAZUHIRO HISHINUMA AND HIDEAKI IIDUKA Abstract. In this

More information