arxiv: v1 [math.na] 19 Jan 2018

The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression

Ademir Alves Ribeiro (a,1,*), Peter Richtárik (b,2)

arXiv:1801.06354v1 [math.NA] 19 Jan 2018

(a) Department of Mathematics, Federal University of Paraná, CP 19081, Curitiba, PR, Brazil
(b) School of Mathematics, University of Edinburgh, United Kingdom

Keywords: Unconstrained minimization, primal-dual methods, ridge regression, fixed-point methods.
2010 MSC: 65K05, 49M37, 90C30.

1. Introduction

Given matrices $A_1, \dots, A_n \in \mathbb{R}^{d \times m}$ encoding $n$ observations (examples), and vectors $y_1, \dots, y_n \in \mathbb{R}^m$ encoding associated responses (labels), one is often interested in finding a vector $w \in \mathbb{R}^d$ such that, in some precise sense, the product $A_i^T w$ is a good approximation of $y_i$ for all $i$. A fundamental approach to this problem, used in all areas of computational practice, is to formulate the problem as an $L_2$-regularized least-squares problem, also known as ridge regression. In particular, we consider the primal ridge regression problem
\[ \min_{w \in \mathbb{R}^d} P(w) \stackrel{\text{def}}{=} \frac{1}{2n} \sum_{i=1}^n \|A_i^T w - y_i\|^2 + \frac{\lambda}{2} \|w\|^2 = \frac{1}{2n} \|A^T w - y\|^2 + \frac{\lambda}{2} \|w\|^2, \tag{1} \]
where $\lambda > 0$ is a regularization parameter and $\|\cdot\|$ denotes the standard Euclidean norm. In the second and more concise expression we have concatenated the observation matrices and response vectors to form a single observation matrix $A = [A_1, A_2, \dots, A_n] \in \mathbb{R}^{d \times N}$ and a single response vector $y = (y_1, y_2, \dots, y_n) \in \mathbb{R}^N$, where $N = nm$.

The results of this paper were obtained between October 2014 and March 2015, during AR's affiliation with the University of Edinburgh. First version of this paper was available in August 2017. This revision: January 2018.
(*) Corresponding author. Email addresses: ademir.ribeiro@ufpr.br (Ademir Alves Ribeiro), peter.richtarik@ed.ac.uk (Peter Richtárik)
(1) Supported by CNPq, Brazil, Grants 285/24-3 and 39437/
(2) Supported by the EPSRC Grant EP/K02325X/1, Accelerated Coordinate Descent Methods for Big Data Optimization.
Preprint submitted to Linear Algebra and its Applications, August 2018

With each observation $(A_i, y_i)$ we now associate a dual variable $\alpha_i \in \mathbb{R}^m$. The Fenchel dual of (1) is also a ridge regression problem:
\[ \max_{\alpha \in \mathbb{R}^N} D(\alpha) \stackrel{\text{def}}{=} -\frac{1}{2\lambda n^2} \|A\alpha\|^2 + \frac{1}{n} \alpha^T y - \frac{1}{2n} \|\alpha\|^2, \tag{2} \]
where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_n) \in \mathbb{R}^N$.

Optimality conditions. The starting point of this work is the observation that the optimality conditions for the primal and dual ridge regression problems can be written in several different ways, in the form of a linear system involving the primal and dual variables. In particular, we find several different matrix-vector pairs $(M, b)$, where $M \in \mathbb{R}^{(d+N) \times (d+N)}$ and $b \in \mathbb{R}^{d+N}$, such that the optimality conditions can be expressed in the form of a linear system as
\[ x = Mx + b, \tag{3} \]
where $x = (w, \alpha) \in \mathbb{R}^{d+N}$.

Fixed point methods. With each system (3) one can naturally associate a fixed point method performing the iteration $x^{k+1} = Mx^k + b$. However, unless the spectrum of $M$ is contained in the open unit disc, that is, unless the spectral radius of $M$ is less than one, such a method will not converge in general [1]. To overcome this drawback, we utilize the idea of relaxation. In particular, we pick a relaxation parameter $\theta$ and replace (3) with the equivalent system
\[ x = G_\theta x + b_\theta, \]
where $G_\theta = (1-\theta) I + \theta M$ and $b_\theta = \theta b$. The choice $\theta = 1$ recovers (3). We then study the convergence of the primal-dual fixed point methods
\[ x^{k+1} = G_\theta x^k + b_\theta \]
through a careful study of the spectra of the iteration matrices $G_\theta$.

Our work starts with the following observation: while all these formulations are necessarily algebraically equivalent, they give rise to different fixed-point algorithms, with different convergence properties.

1.1. Contributions and literature review

It is well known that duality plays a very important role in optimization and machine learning, not only from the theoretical point of view but also computationally [2, 3, 4]. However, a more recent idea that has generated many contributions is the usage of the primal and dual problems together. Primal-dual methods have been employed in convex optimization problems where strong duality holds, with success on several types of nonlinear and nonsmooth functions that arise in various application fields, such as image processing, machine learning and inverse problems, among others [5, 6, 7].

On the other hand, fixed-point-type algorithms are classical tools for solving some structured linear systems. In particular, we have the iterative schemes developed by the mathematical economists Arrow, Hurwicz and Uzawa for solving saddle point problems [8, 9].

In this paper we develop several primal-dual fixed point methods for the ridge regression problem. Ridge regression was introduced by Hoerl and Kennard [10, 11] as a regularization method for solving least squares problems with highly correlated predictors. The goal is to reduce the standard errors of regression coefficients by imposing a penalty, in the $L_2$ norm, on their size.

Since then, numerous papers have been devoted to the study of ridge regression, or to more general formulations of which ridge regression is a particular case. Some of these works have considered its dual formulation, proposing deterministic and stochastic algorithms that can be applied to the dual problem [12, 13, 14, 2, 15, 4].

To the best of our knowledge, the only work that considers a primal-dual fixed point approach to deal with ridge regression is [16], where the authors deal with ill-conditioned problems. They present an algorithm based on the gradient method and an accelerated version of this algorithm.

Here we propose methods based on the optimality conditions for the problem of minimizing the duality gap between the ridge regression problems (1) and (2), written in different and equivalent ways by means of linear systems involving structured matrices. We also study the complexity of the proposed methods and prove that our main method achieves the optimal accelerated Nesterov rate. This theoretical property is supported by numerical experiments indicating that our main algorithm is competitive with the conjugate gradient method.

1.2. Outline

In Section 2 we formulate the optimality conditions for the problem of minimizing the duality gap between (1) and (2) in two different, but equivalent, ways by means of linear systems involving structured matrices. We also establish the duality relationship between the problems (1) and (2). In Section 3 we describe a family of parameterized fixed point methods applied to the reformulations of the optimality conditions. We present the convergence analysis and complexity results for these methods. Section 4 brings the main contribution of this work, with an accelerated version of the methods described in Section 3. In Section 5 we discuss some variants of our accelerated algorithm. In Section 6 we perform some numerical experiments. Finally, concluding remarks close our text in Section 7.

2. Separable and Coupled Optimality Conditions

Defining $x = (w, \alpha) \in \mathbb{R}^{d+N}$, our primal-dual problem consists of minimizing the duality gap between the problems (1) and (2), that is,
\[ \min_{x \in \mathbb{R}^{d+N}} f(x) \stackrel{\text{def}}{=} P(w) - D(\alpha). \tag{4} \]
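The objective in (4) can be evaluated directly from (1) and (2). The following is a minimal sketch in plain Python, not taken from the paper: the helper names (`matvec`, `primal`, `dual`, `gap`) and the small data set are assumptions made only for illustration, with $m = 1$ so that $N = n$.

```python
# Illustrative sketch: evaluate P(w), D(alpha) and the duality gap f(x) = P(w) - D(alpha).
# All names and the toy data below are hypothetical; A is stored as a d x N list of rows.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(A, y, lam, n, w):
    r = [a - b for a, b in zip(matvec(transpose(A), w), y)]   # A^T w - y
    return dot(r, r) / (2 * n) + lam / 2 * dot(w, w)

def dual(A, y, lam, n, alpha):
    Aa = matvec(A, alpha)                                      # A alpha
    return (-dot(Aa, Aa) / (2 * lam * n * n)
            + dot(alpha, y) / n
            - dot(alpha, alpha) / (2 * n))

def gap(A, y, lam, n, w, alpha):
    return primal(A, y, lam, n, w) - dual(A, y, lam, n, alpha)

A = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # d = 2, N = 3 (m = 1, n = 3)
y = [1.0, 0.0, 2.0]
lam, n = 0.5, 3
print(gap(A, y, lam, n, [0.0, 0.0], [0.0, 0.0, 0.0]))   # gap at the origin
print(gap(A, y, lam, n, [1.0, -1.0], [0.5, 0.5, 0.5]))  # gap at an arbitrary point
```

At the origin the gap reduces to $\|y\|^2/(2n)$, which gives a quick sanity check of the two formulas.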

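Problem (4) is a quadratic strongly convex problem and therefore admits a unique global solution $x^* \in \mathbb{R}^{d+N}$.

2.1. A separable system

Note that $\nabla f(x) = \begin{pmatrix} \nabla P(w) \\ -\nabla D(\alpha) \end{pmatrix}$, where
\[ \nabla P(w) = \frac{1}{n} A (A^T w - y) + \lambda w \quad \text{and} \quad \nabla D(\alpha) = -\frac{1}{\lambda n^2} A^T A \alpha - \frac{1}{n} \alpha + \frac{1}{n} y. \tag{5} \]
So, the first and natural way of writing the optimality conditions for problem (4) is just to set the expressions given in (5) equal to zero, which can be written as
\[ \begin{pmatrix} w \\ \alpha \end{pmatrix} = -\frac{1}{\lambda n} \begin{pmatrix} A A^T & 0 \\ 0 & A^T A \end{pmatrix} \begin{pmatrix} w \\ \alpha \end{pmatrix} + \begin{pmatrix} \frac{1}{\lambda n} A y \\ y \end{pmatrix}. \tag{6} \]

2.2. A coupled system

In order to derive the duality between (1) and (2), as well as to reformulate the optimality conditions for problem (4), note that
\[ P(w) = \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^T w) + \lambda g(w), \tag{7} \]
where $\phi_i(z) = \frac{1}{2} \|z - y_i\|^2$ and $g(w) = \frac{1}{2} \|w\|^2$.

Now, recall that the Fenchel conjugate of a convex function $\xi : \mathbb{R}^l \to \mathbb{R}$ is $\xi^* : \mathbb{R}^l \to \mathbb{R} \cup \{\infty\}$ defined by
\[ \xi^*(u) \stackrel{\text{def}}{=} \sup_{s \in \mathbb{R}^l} \{ s^T u - \xi(s) \}. \]
Note that if $\xi$ is strongly convex, then $\xi^*(u) < \infty$ for all $u \in \mathbb{R}^l$. Indeed, in this case $\xi$ is bounded below by a strongly convex quadratic function, implying that the sup above is in fact a max.

It is easily seen that $\phi_i^*(s) = \frac{1}{2} \|s\|^2 + s^T y_i$ and $g^*(u) = \frac{1}{2} \|u\|^2$. Furthermore, we have
\[ D(\alpha) = -\lambda g^*\Big( \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i \Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i). \tag{8} \]
If we write
\[ \bar{\alpha} \stackrel{\text{def}}{=} \frac{1}{\lambda n} A \alpha = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i, \tag{9} \]
the duality gap can be written as
\[ P(w) - D(\alpha) = \lambda \big( g(w) + g^*(\bar{\alpha}) - w^T \bar{\alpha} \big) + \frac{1}{n} \sum_{i=1}^n \big( \phi_i(A_i^T w) + \phi_i^*(-\alpha_i) + \alpha_i^T A_i^T w \big) \]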

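and the weak duality follows immediately from the fact that
\[ g(w) + g^*(\bar{\alpha}) - w^T \bar{\alpha} \geq 0 \quad \text{and} \quad \phi_i(A_i^T w) + \phi_i^*(-\alpha_i) + \alpha_i^T A_i^T w \geq 0. \]
Strong duality occurs when these quantities vanish, which is precisely the same as
\[ w = \nabla g^*(\bar{\alpha}) \quad \text{and} \quad \alpha_i = -\nabla \phi_i(A_i^T w), \]
or, equivalently,
\[ \bar{\alpha} = \nabla g(w) \quad \text{and} \quad A_i^T w = \nabla \phi_i^*(-\alpha_i). \]
Therefore, another way to see the optimality conditions for problem (4) is by the relations
\[ w = \bar{\alpha} = \frac{1}{\lambda n} A \alpha \quad \text{and} \quad \alpha = y - A^T w. \tag{10} \]
This is equivalent to
\[ \begin{pmatrix} w \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 & \frac{1}{\lambda n} A \\ -A^T & 0 \end{pmatrix} \begin{pmatrix} w \\ \alpha \end{pmatrix} + \begin{pmatrix} 0 \\ y \end{pmatrix}. \tag{11} \]

2.3. Compact form

Both reformulations of the optimality conditions, (6) and (11), can be viewed in the compact form
\[ x = Mx + b, \tag{12} \]
for some $M \in \mathbb{R}^{(d+N) \times (d+N)}$ and $b \in \mathbb{R}^{d+N}$. Let us denote by
\[ M_1 = -\frac{1}{\lambda n} \begin{pmatrix} A A^T & 0 \\ 0 & A^T A \end{pmatrix} \quad \text{and} \quad M_2 = \begin{pmatrix} 0 & \frac{1}{\lambda n} A \\ -A^T & 0 \end{pmatrix} \tag{13} \]
the matrices associated with the optimality conditions formulated as (6) and (11), respectively. Also, let
\[ b_1 = \begin{pmatrix} \frac{1}{\lambda n} A y \\ y \end{pmatrix} \quad \text{and} \quad b_2 = \begin{pmatrix} 0 \\ y \end{pmatrix}. \tag{14} \]
Thus, we can rewrite (6) and (11) as
\[ x = M_1 x + b_1 \quad \text{and} \quad x = M_2 x + b_2, \tag{15} \]
respectively.

3. Primal-Dual Fixed Point Methods

A method that arises immediately from the relation (12) is given by the scheme
\[ x^{k+1} = M x^k + b. \]
However, unless the spectral radius of $M$ is less than one, this scheme will not converge in general. To overcome this drawback, we utilize the idea of relaxation. More precisely, we consider a relaxation parameter $\theta \neq 0$ and replace (12) with the equivalent system
\[ x = (1-\theta) x + \theta (Mx + b). \]
Note that the choice $\theta = 1$ recovers (12). The proposed algorithm is then given by the following framework.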

Algorithm 3.1. (Primal-Dual Fixed Point Method)
  input: matrix $M \in \mathbb{R}^{(d+N) \times (d+N)}$, vector $b \in \mathbb{R}^{d+N}$, parameter $\theta > 0$
  starting point: $x^0 \in \mathbb{R}^{d+N}$
  repeat for $k = 0, 1, 2, \dots$
    set $x^{k+1} = (1-\theta) x^k + \theta (M x^k + b)$

As we shall see later, the use of the relaxation parameter $\theta$ enables us to prove convergence of Algorithm 3.1 with $M = M_1$ and $b = b_1$ or $M = M_2$ and $b = b_2$, chosen according to (13) and (14), independent of the spectral radius of these matrices.

Let us denote
\[ G(\theta) = (1-\theta) I + \theta M \tag{16} \]
and let $x^*$ be the solution of the problem (4). Then $x^* = M x^* + b$ with $M = M_1$ and $b = b_1$ or $M = M_2$ and $b = b_2$. Therefore, $x^* = G(\theta) x^* + \theta b$. Further, the iteration of Algorithm 3.1 can be written as $x^{k+1} = G(\theta) x^k + \theta b$. Thus,
\[ \|x^k - x^*\| \leq \|G(\theta)^k\| \, \|x^0 - x^*\| \tag{17} \]
and consequently the convergence of the algorithm depends on the spectrum of $G(\theta)$. More precisely, it converges if the spectral radius of $G(\theta)$ is less than 1, because in this case we have $G(\theta)^k \to 0$.

In fact, we will address the following questions:
  What is the range for $\theta$ so that this scheme converges?
  What is the best choice of $\theta$?
  What is the rate of convergence?
  How does the complexity of this algorithm compare with the known ones?

3.1. Convergence analysis

In this section we study the convergence of Algorithm 3.1 and answer the questions raised above. To this end we point out some properties of the iteration matrices and uncover interesting connections between the complexity bounds of the variants of the fixed point scheme we consider. These connections follow from a close link between the spectral properties of the associated matrices.

For this purpose, let
\[ A = U \Sigma V^T \tag{18} \]
be the singular value decomposition of $A$. That is, $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{N \times N}$ are orthogonal matrices and
\[ \Sigma = \begin{pmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{d \times N}, \tag{19} \]

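where $\tilde{\Sigma} = \mathrm{diag}(\sigma_1, \dots, \sigma_p) \in \mathbb{R}^{p \times p}$ contains the nonzero singular values $\sigma_1 \geq \dots \geq \sigma_p > 0$ of $A$, so that the block rows of $\Sigma$ have $p$ and $d-p$ rows and its block columns have $p$ and $N-p$ columns.

First, we state a basic linear algebra result (the proof is straightforward by induction).

Proposition 3.1. Let $Q_j \in \mathbb{R}^{l \times l}$, $j = 1, \dots, 4$, be diagonal matrices whose diagonal entries are components of $\alpha, \beta, \gamma, \delta \in \mathbb{R}^l$, respectively. Then
\[ \det \begin{pmatrix} Q_1 & Q_2 \\ Q_3 & Q_4 \end{pmatrix} = \prod_{j=1}^l (\alpha_j \delta_j - \beta_j \gamma_j). \]

The next result is crucial for the convergence analysis and complexity study of Algorithm 3.1.

Lemma 3.2. The characteristic polynomials of the matrices $M_1$ and $M_2$, defined in (13), are
\[ p_1(t) = t^{N+d-2p} \prod_{j=1}^p \Big( t + \frac{\sigma_j^2}{\lambda n} \Big)^2 \quad \text{and} \quad p_2(t) = t^{N+d-2p} \prod_{j=1}^p \Big( t^2 + \frac{\sigma_j^2}{\lambda n} \Big), \]
respectively.

Proof. Let $c = \frac{1}{\lambda n}$. From (18) and (19), we can write $M_1 = W \Sigma_1 W^T$ and $M_2 = W \Sigma_2 W^T$, where
\[ W = \begin{pmatrix} U & 0 \\ 0 & V \end{pmatrix}, \quad \Sigma_1 = -c \begin{pmatrix} \Sigma \Sigma^T & 0 \\ 0 & \Sigma^T \Sigma \end{pmatrix} \quad \text{and} \quad \Sigma_2 = \begin{pmatrix} 0 & c \Sigma \\ -\Sigma^T & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & c \tilde{\Sigma} & 0 \\ 0 & 0 & 0 & 0 \\ -\tilde{\Sigma} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \]
with diagonal blocks of sizes $p$, $d-p$, $p$ and $N-p$. The evaluation of $p_1(t) = \det(tI - M_1) = \det(tI - \Sigma_1)$ is straightforward and
\[ p_2(t) = \det(tI - M_2) = \det \begin{pmatrix} tI & 0 & -c \tilde{\Sigma} & 0 \\ 0 & tI & 0 & 0 \\ \tilde{\Sigma} & 0 & tI & 0 \\ 0 & 0 & 0 & tI \end{pmatrix} = \det \begin{pmatrix} tI & -c \tilde{\Sigma} & 0 & 0 \\ \tilde{\Sigma} & tI & 0 & 0 \\ 0 & 0 & tI & 0 \\ 0 & 0 & 0 & tI \end{pmatrix}. \]
The result then follows from Proposition 3.1.

The following result follows directly from Lemma 3.2 and the fact that $M_1$ is symmetric.

Corollary 3.3. The spectral radii of $M_1$ and $M_2$ are, respectively,
\[ \rho_1 = \|M_1\| = \frac{\sigma_1^2}{\lambda n} = \frac{\|A\|^2}{\lambda n} \quad \text{and} \quad \rho_2 = \frac{\sigma_1}{\sqrt{\lambda n}} = \frac{\|A\|}{\sqrt{\lambda n}}. \]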
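As a quick check of Lemma 3.2 and Corollary 3.3, consider the smallest possible case, which is our own illustration rather than an example from the paper: $d = N = 1$ and $A = (a)$ with $a \neq 0$, so that $p = 1$, $\sigma_1 = |a|$ and $N + d - 2p = 0$. Then
\[ M_1 = -\frac{a^2}{\lambda n} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} 0 & \frac{a}{\lambda n} \\ -a & 0 \end{pmatrix}, \]
\[ p_1(t) = \det(tI - M_1) = \Big( t + \frac{a^2}{\lambda n} \Big)^2, \qquad p_2(t) = \det(tI - M_2) = t^2 + \frac{a^2}{\lambda n}. \]
The eigenvalues of $M_2$ are therefore $\pm \mathrm{i}\, |a| / \sqrt{\lambda n}$, so $\rho_2 = |a| / \sqrt{\lambda n}$, while $\rho_1 = a^2 / (\lambda n) = \rho_2^2$, in agreement with the corollary.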

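From Corollary 3.3 we conclude that if $\sigma_1 < \sqrt{\lambda n}$, then $\rho_1 \leq \rho_2 < 1$. So, $M_1^k \to 0$ and $M_2^k \to 0$, which in turn implies that the pure fixed point method, that is, Algorithm 3.1 with $\theta = 1$, converges. However, if $\sigma_1 \geq \sqrt{\lambda n}$, we cannot guarantee convergence of the pure method.

Now we shall see that Algorithm 3.1 converges for a broad range of the parameter $\theta$, without any assumption on $\sigma_1$, $\lambda$ or $n$. We begin with the analysis of the framework that uses $M_1$ and $b_1$, defined in (13) and (14).

3.2. Fixed Point Method based on $M_1$

Algorithm 3.2. (Primal-Dual Fixed Point Method; $M = M_1$)
  input: $M = M_1$, $b = b_1$, parameter $\theta > 0$
  starting point: $x^0 \in \mathbb{R}^{d+N}$
  repeat for $k = 0, 1, 2, \dots$
    set $x^{k+1} = (1-\theta) x^k + \theta (M x^k + b)$

Theorem 3.4. Let $x^0 \in \mathbb{R}^{d+N}$ be an arbitrary starting point and consider the sequence $(x^k)_{k \in \mathbb{N}}$ generated by Algorithm 3.2 with $\theta \in \Big( 0, \frac{2 \lambda n}{\lambda n + \sigma_1^2} \Big)$. Then the sequence $(x^k)$ converges to the unique solution of the problem (4) at a linear rate of
\[ \rho_1(\theta) \stackrel{\text{def}}{=} \max \Big\{ \Big| 1 - \theta \Big( 1 + \frac{\sigma_1^2}{\lambda n} \Big) \Big|, \, 1 - \theta \Big\}. \]
Furthermore, if we choose $\theta_1^* \stackrel{\text{def}}{=} \frac{2 \lambda n}{2 \lambda n + \sigma_1^2}$, then the theoretical convergence rate is optimal and it is equal to
\[ \rho_1^* \stackrel{\text{def}}{=} \frac{\sigma_1^2}{2 \lambda n + \sigma_1^2} = 1 - \theta_1^*. \]

Proof. We claim that the spectral radius of $G_1(\theta) \stackrel{\text{def}}{=} (1-\theta) I + \theta M_1$ is $\rho_1(\theta)$ and also coincides with $\|G_1(\theta)\|$. Using Lemma 3.2, we conclude that the eigenvalues of this matrix are
\[ \Big\{ 1 - \theta - \frac{\theta \sigma_j^2}{\lambda n}, \; j = 1, \dots, p \Big\} \cup \{ 1 - \theta \}. \]
So, its spectral radius is
\[ \max \Big\{ \Big| 1 - \theta \Big( 1 + \frac{\sigma_1^2}{\lambda n} \Big) \Big|, \, |1 - \theta| \Big\} = \rho_1(\theta). \]
Since $G_1(\theta)$ is symmetric, this quantity coincides with $\|G_1(\theta)\|$. Furthermore, the admissible values for $\theta$, that is, the ones such that the eigenvalues have modulus less than one, can be found by solving
\[ \Big| 1 - \theta \Big( 1 + \frac{\sigma_1^2}{\lambda n} \Big) \Big| < 1, \]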

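which immediately gives $0 < \theta < \frac{2 \lambda n}{\lambda n + \sigma_1^2}$. So, the linear convergence of Algorithm 3.1 is guaranteed for any $\theta \in \Big( 0, \frac{2 \lambda n}{\lambda n + \sigma_1^2} \Big)$. Finally, note that the solution of the problem
\[ \min_{\theta > 0} \rho_1(\theta) \]
is achieved when $\theta \Big( 1 + \frac{\sigma_1^2}{\lambda n} \Big) - 1 = 1 - \theta$, yielding $\theta_1^* = \frac{2 \lambda n}{2 \lambda n + \sigma_1^2}$ and the optimal convergence rate $\rho_1^* = \frac{\sigma_1^2}{2 \lambda n + \sigma_1^2}$.

The top picture of Figure 1 illustrates the eigenvalues of $G_1(\theta)$ (magenta squares) together with the eigenvalues of $M_1$ (blue triangles), for a fixed value of the parameter $\theta$. The one farthest from the origin is $1 - \theta - \frac{\theta \sigma_1^2}{\lambda n}$ or $1 - \theta$. On the bottom we show the two largest (in absolute value) eigenvalues of $G_1(\theta)$ corresponding to the optimal choice of $\theta$.

Figure 1: Eigenvalues of $G_1(\theta)$ (magenta squares) and $M_1$ (blue triangles).

Now we analyze the fixed point framework that employs $M_2$ and $b_2$, defined in (13) and (14).

3.3. Fixed Point Method based on $M_2$

Algorithm 3.3. (Primal-Dual Fixed Point Method; $M = M_2$)
  input: $M = M_2$, $b = b_2$, parameter $\theta > 0$
  starting point: $x^0 \in \mathbb{R}^{d+N}$
  repeat for $k = 0, 1, 2, \dots$
    set $x^{k+1} = (1-\theta) x^k + \theta (M x^k + b)$

Theorem 3.5. Let $x^0 \in \mathbb{R}^{d+N}$ be an arbitrary starting point and consider the sequence $(x^k)_{k \in \mathbb{N}}$ generated by Algorithm 3.3 with $\theta \in \Big( 0, \frac{2 \lambda n}{\lambda n + \sigma_1^2} \Big)$. Then the sequence $(x^k)$ converges to the unique solution of the problem (4) at an asymptotic convergence rate of
\[ \rho_2(\theta) \stackrel{\text{def}}{=} \sqrt{ (1-\theta)^2 + \frac{\theta^2 \sigma_1^2}{\lambda n} }. \]
Furthermore, if we choose $\theta_2^* \stackrel{\text{def}}{=} \frac{\lambda n}{\lambda n + \sigma_1^2}$, then the theoretical convergence rate is optimal and it is equal to
\[ \rho_2^* \stackrel{\text{def}}{=} \frac{\sigma_1}{\sqrt{\lambda n + \sigma_1^2}} = \sqrt{1 - \theta_2^*}. \]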

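Proof. First, using Lemma 3.2, we conclude that the eigenvalues of $G_2(\theta) \stackrel{\text{def}}{=} (1-\theta) I + \theta M_2$ are
\[ \Big\{ 1 - \theta \pm \frac{\theta \sigma_j}{\sqrt{\lambda n}} \, \mathrm{i}, \; j = 1, \dots, p \Big\} \cup \{ 1 - \theta \}, \]
where $\mathrm{i} = \sqrt{-1}$. The two with largest modulus are $1 - \theta \pm \frac{\theta \sigma_1}{\sqrt{\lambda n}} \, \mathrm{i}$ (see Figure 2). So, the spectral radius of $G_2(\theta)$ is
\[ \sqrt{ (1-\theta)^2 + \frac{\theta^2 \sigma_1^2}{\lambda n} } = \rho_2(\theta). \]
Further, the values of $\theta$ for which the eigenvalues of $G_2(\theta)$ have modulus less than one can be found by solving
\[ (1-\theta)^2 + \frac{\theta^2 \sigma_1^2}{\lambda n} < 1, \]
giving $0 < \theta < \frac{2 \lambda n}{\lambda n + \sigma_1^2}$.

The asymptotic convergence follows from the fact that $\|G_2(\theta)^k\|^{1/k} \to \rho_2(\theta)$. Indeed, using (17) we conclude that
\[ \Big( \frac{\|x^k - x^*\|}{\|x^0 - x^*\|} \Big)^{1/k} \leq \|G_2(\theta)^k\|^{1/k} \to \rho_2(\theta). \]
This means that given $\gamma > 0$, there exists $k_0 \in \mathbb{N}$ such that
\[ \|x^k - x^*\| \leq (\rho_2(\theta) + \gamma)^k \|x^0 - x^*\| \]
for all $k \geq k_0$.

Finally, the optimal parameter $\theta_2^*$ and the corresponding optimal rate $\rho_2^*$ can be obtained directly by solving
\[ \min_{\theta > 0} \; (1-\theta)^2 + \frac{\theta^2 \sigma_1^2}{\lambda n}. \]

The left picture of Figure 2 illustrates, in the complex plane, the eigenvalues of $G_2(\theta)$ (magenta squares) together with the eigenvalues of $M_2$ (blue triangles), for a fixed value of the parameter $\theta$. On the right we show, for each $\theta \in (0, 1)$, one of the two eigenvalues of $G_2(\theta)$ farthest from the origin. The dashed segment corresponds to the admissible values for $\theta$, that is, the eigenvalues with modulus less than one. The square corresponds to the optimal choice of $\theta$.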
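The rate formulas of Theorems 3.4 and 3.5 are easy to tabulate. The sketch below is only an illustration: the function names and the sample values $\sigma_1 = 3$, $\lambda n = 1$ are our own assumptions, and `lam_n` stands for the product $\lambda n$.

```python
import math

# Illustrative sketch of the rates in Theorems 3.4 and 3.5 (names are hypothetical).

def theta_upper(sigma1, lam_n):
    """Right end of the admissible interval shared by Algorithms 3.2 and 3.3."""
    return 2 * lam_n / (lam_n + sigma1 ** 2)

def rho1(theta, sigma1, lam_n):
    """Spectral radius of G_1(theta) = (1 - theta) I + theta M_1."""
    return max(abs(1 - theta * (1 + sigma1 ** 2 / lam_n)), abs(1 - theta))

def rho2(theta, sigma1, lam_n):
    """Spectral radius of G_2(theta) = (1 - theta) I + theta M_2."""
    return math.sqrt((1 - theta) ** 2 + theta ** 2 * sigma1 ** 2 / lam_n)

def theta1_opt(sigma1, lam_n):
    return 2 * lam_n / (2 * lam_n + sigma1 ** 2)

def theta2_opt(sigma1, lam_n):
    return lam_n / (lam_n + sigma1 ** 2)

sigma1, lam_n = 3.0, 1.0
t1, t2 = theta1_opt(sigma1, lam_n), theta2_opt(sigma1, lam_n)
print("theta1* =", t1, "rho1* =", rho1(t1, sigma1, lam_n))
print("theta2* =", t2, "rho2* =", rho2(t2, sigma1, lam_n))
print("rates at the end of the interval:",
      rho1(theta_upper(sigma1, lam_n), sigma1, lam_n),
      rho2(theta_upper(sigma1, lam_n), sigma1, lam_n))
```

For these sample values both rates reach 1 at the right end of the admissible interval, which is why that end point is excluded.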

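Figure 2: Eigenvalues of $G_2(\theta)$ (magenta squares) and $M_2$ (blue triangles), represented in the complex plane.

3.4. Comparison of the rates

We summarize the discussion above in Table 1, which brings the comparison between the pure ($\theta = 1$) and optimal ($\theta = \theta_j^*$, $j = 1, 2$) versions of Algorithm 3.1. We can see that the convergence rate of the optimal version is $\frac{\lambda n}{2 \lambda n + \sigma_1^2}$ times the one of the pure version if $M_1$ is employed (Algorithm 3.2) and $\sqrt{\frac{\lambda n}{\lambda n + \sigma_1^2}}$ times the pure version when using $M_2$ (Algorithm 3.3). Moreover, in any case, employing $M_1$ provides faster convergence. This can be seen in Figure 3, where Algorithm 3.1 was applied to solve the problem (4). The dimensions considered were d = 2, m = and n = 5 so that the total dimension is d + N = d + nm = 52.

We also mention that the pure version does not require the knowledge of $\sigma_1$, but it may not converge. On the other hand, the optimal version always converges, but $\theta_j^*$ depends on $\sigma_1$.

\[ \begin{array}{lcc}
 & \text{PDFP1}(\theta) & \text{PDFP2}(\theta) \\
\text{Range of } \theta & \Big( 0, \dfrac{2 \lambda n}{\lambda n + \sigma_1^2} \Big) & \Big( 0, \dfrac{2 \lambda n}{\lambda n + \sigma_1^2} \Big) \\
\text{Pure } (\theta = 1) & \dfrac{\sigma_1^2}{\lambda n} & \dfrac{\sigma_1}{\sqrt{\lambda n}} \\
\text{Optimal } (\theta = \theta_j^*) & \dfrac{\sigma_1^2}{2 \lambda n + \sigma_1^2} = 1 - \theta_1^* & \dfrac{\sigma_1}{\sqrt{\lambda n + \sigma_1^2}} = \sqrt{1 - \theta_2^*}
\end{array} \]

Table 1: Ranges of convergence and convergence rates of pure and optimal versions of Algorithm 3.2 (the one that uses $M_1$, indicated by PDFP1($\theta$)) and Algorithm 3.3 (the one that uses $M_2$, PDFP2($\theta$)).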
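To make the comparison concrete, the relaxed iteration of Algorithm 3.1 can be run with both pairs $(M_1, b_1)$ and $(M_2, b_2)$ on a tiny problem. This is a minimal sketch under our own assumptions: the helper names, the dense list-of-lists storage and the toy data (for which $\sigma_1 = 2$ is known by construction and $\lambda n = 1$) are not from the paper.

```python
# Illustrative sketch of Algorithm 3.1 with (M1, b1) and (M2, b2) from (13)-(14).
# Helper names and the toy data are hypothetical; lam_n stands for lambda * n.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def build_systems(A, y, lam_n):
    """Return (M1, b1, M2, b2) for a dense d x N matrix A given as a list of rows."""
    d, N, c = len(A), len(A[0]), 1.0 / lam_n
    M1 = [[0.0] * (d + N) for _ in range(d + N)]
    M2 = [[0.0] * (d + N) for _ in range(d + N)]
    for i in range(d):
        for j in range(d):          # block -c * A A^T of M1
            M1[i][j] = -c * sum(A[i][k] * A[j][k] for k in range(N))
        for j in range(N):          # block  c * A of M2
            M2[i][d + j] = c * A[i][j]
    for i in range(N):
        for j in range(N):          # block -c * A^T A of M1
            M1[d + i][d + j] = -c * sum(A[k][i] * A[k][j] for k in range(d))
        for j in range(d):          # block -A^T of M2
            M2[d + i][j] = -A[j][i]
    b1 = [c * v for v in matvec(A, y)] + list(y)
    b2 = [0.0] * d + list(y)
    return M1, b1, M2, b2

def fixed_point(M, b, theta, iters):
    """x^{k+1} = (1 - theta) x^k + theta (M x^k + b), started from x^0 = 0."""
    x = [0.0] * len(b)
    for _ in range(iters):
        Mx = matvec(M, x)
        x = [(1 - theta) * xi + theta * (mi + bi) for xi, mi, bi in zip(x, Mx, b)]
    return x

A = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # singular values 2 and 1
y = [1.0, 2.0, 3.0]
lam_n, sigma1 = 1.0, 2.0
M1, b1, M2, b2 = build_systems(A, y, lam_n)
theta1 = 2 * lam_n / (2 * lam_n + sigma1 ** 2)   # optimal for M1 (Theorem 3.4)
theta2 = lam_n / (lam_n + sigma1 ** 2)           # optimal for M2 (Theorem 3.5)
x1 = fixed_point(M1, b1, theta1, 400)
x2 = fixed_point(M2, b2, theta2, 400)
print(x1)
print(x2)
```

Both runs should approach the same point $x^* = (w^*, \alpha^*)$; for this data $w^* = (0.4, 1)$ and $\alpha^* = (0.2, 1, 3)$ follow by hand from (10).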

[Figure 3 plot: objective value against iteration for PDFP1, PDFP2, PDFP1* and PDFP2*.]

Figure 3: Performance of pure and optimal versions of Algorithms 3.2 and 3.3 applied to solve the problem (4). The picture shows the objective values against the number of iterations. The dimensions considered were d = 2, m = and n = 5 so that the total dimension is d + N = d + nm = 52. The matrix $A \in \mathbb{R}^{d \times N}$ and the vector $y \in \mathbb{R}^N$ were randomly generated. For simplicity of notation we have denoted the pure and optimal versions of Algorithm 3.2 by PDFP1 and PDFP1*, respectively. Analogously, for Algorithm 3.3, we used PDFP2 and PDFP2* to denote the pure and optimal versions, respectively.

3.5. Direct relationship between the iterates of the two methods

Another relation regarding the employment of $M_1$ or $M_2$ in the pure version of Algorithm 3.1, which is also illustrated in Figure 3, is that one step of the method with $M_1$ corresponds exactly to two steps of the one with $M_2$. Indeed, note first that $M_2^2 = M_1$. Thus, denoting the current point by $x$ and the next iterate by $x^+_M$, in view of (15) we have
\[ x^{++}_{M_2} = M_2 x^+_{M_2} + b_2 = M_2 (M_2 x + b_2) + b_2 = M_1 x + M_2 b_2 + b_2 = M_1 x + b_1 = x^+_{M_1}. \]
In Section 4 we shall see how this behavior can invert with a small change in the computation of the dual variable.

3.6. Complexity results

In order to establish the complexity of Algorithm 3.1 we need to calculate the condition number of the objective function, defined in (4). Note that the Hessian of $f$ is given by
\[ \nabla^2 f = \begin{pmatrix} \frac{1}{n} A A^T + \lambda I & 0 \\ 0 & \frac{1}{\lambda n^2} A^T A + \frac{1}{n} I \end{pmatrix}. \]
Let us consider two cases:

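If $\lambda \geq \frac{1}{n}$, then $\frac{\sigma_1^2}{n} + \lambda \geq \frac{\sigma_1^2}{\lambda n^2} + \frac{1}{n}$, which in turn implies that the largest eigenvalue of $\nabla^2 f$ is $L = \frac{\sigma_1^2 + \lambda n}{n}$. The smallest eigenvalue is
\[ \begin{cases} \dfrac{1}{n}, & \text{if } d < N \\[2mm] \dfrac{\sigma_d^2}{\lambda n^2} + \dfrac{1}{n}, & \text{if } d = N \\[2mm] \min \Big\{ \lambda, \dfrac{\sigma_N^2}{\lambda n^2} + \dfrac{1}{n} \Big\}, & \text{if } d > N. \end{cases} \]
Therefore, if $d < N$, the condition number of $\nabla^2 f$ is
\[ \sigma_1^2 + \lambda n. \tag{20} \]

If $\lambda < \frac{1}{n}$, then $\frac{\sigma_1^2}{n} + \lambda < \frac{\sigma_1^2}{\lambda n^2} + \frac{1}{n}$, which in turn implies that the largest eigenvalue of $\nabla^2 f$ is $L = \frac{\sigma_1^2 + \lambda n}{\lambda n^2}$. The smallest eigenvalue is
\[ \begin{cases} \min \Big\{ \dfrac{\sigma_d^2}{n} + \lambda, \dfrac{1}{n} \Big\}, & \text{if } d < N \\[2mm] \dfrac{\sigma_d^2}{n} + \lambda, & \text{if } d = N \\[2mm] \lambda, & \text{if } d > N. \end{cases} \]
So, assuming that $d < N$, the condition number is
\[ \frac{\sigma_1^2 + \lambda n}{\lambda n^2 \min \Big\{ \dfrac{\sigma_d^2}{n} + \lambda, \dfrac{1}{n} \Big\}}. \tag{21} \]
If $A$ is rank deficient, then the condition number is $\frac{\sigma_1^2 + \lambda n}{(\lambda n)^2}$.

We stress that although the analysis was made in terms of the sequence $x^k = (w^k, \alpha^k)$, the linear convergence also applies to objective values. Indeed, since $f$ is $L$-smooth, we have
\[ f(x^k) \leq f(x^*) + \nabla f(x^*)^T (x^k - x^*) + \frac{L}{2} \|x^k - x^*\|^2 = \frac{L}{2} \|x^k - x^*\|^2, \]
where the equality follows from the fact that the optimal objective value is zero and $\nabla f(x^*) = 0$. Therefore, if we want to get $f(x^k) - f(x^*) < \varepsilon$ and we have linear convergence rate $\rho$ on the sequence $(x^k)$, then it is enough to enforce
\[ \frac{L}{2} \rho^{2k} \|x^0 - x^*\|^2 < \varepsilon, \]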

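or equivalently,
\[ k > \frac{-1}{2 \log \rho} \log \Big( \frac{\|x^0 - x^*\|^2 L}{2 \varepsilon} \Big). \tag{22} \]
Using the estimate $\log(1-\theta) \approx -\theta$, we can approximate the right hand side of (22) by
\[ \frac{1}{2 \theta_1^*} \log \Big( \frac{\|x^0 - x^*\|^2 L}{2 \varepsilon} \Big), \tag{23} \]
in the case $M_1$ is used and by
\[ \frac{1}{\theta_2^*} \log \Big( \frac{\|x^0 - x^*\|^2 L}{2 \varepsilon} \Big), \tag{24} \]
if we use $M_2$.

In order to estimate the above expressions in terms of the condition number, let us consider the more common case $\lambda \geq \frac{1}{n}$. Then the condition number of $\nabla^2 f$ is given by (20), that is,
\[ \kappa \stackrel{\text{def}}{=} \sigma_1^2 + \lambda n. \tag{25} \]
So, if we use $M_1$, the complexity is proportional to
\[ \frac{1}{2 \theta_1^*} = \frac{\sigma_1^2 + 2 \lambda n}{4 \lambda n} = \frac{\kappa + \lambda n}{4 \lambda n}. \tag{26} \]
If we use $M_2$, the complexity is proportional to
\[ \frac{1}{\theta_2^*} = \frac{\lambda n + \sigma_1^2}{\lambda n} = \frac{\kappa}{\lambda n}. \tag{27} \]

4. Accelerated Primal-Dual Fixed Point Method

Now we present our main contribution. When we employ Algorithm 3.3, the primal and dual variables are mixed in two equations. More precisely, in view of (9) the iteration in this case can be rewritten as
\[ \begin{cases} w^{k+1} = (1-\theta) w^k + \theta \bar{\alpha}^k \\ \alpha^{k+1} = (1-\theta) \alpha^k + \theta (y - A^T w^k). \end{cases} \]
The idea here is to apply block Gauss-Seidel to this system. That is, we use the freshest $w$ to update $\alpha$. Let us state formally the method by means of the following framework.

Algorithm 4.1. (Accelerated Fixed Point Method)
  input: matrix $A \in \mathbb{R}^{d \times N}$, vector $y \in \mathbb{R}^N$, parameter $\theta \in (0, 1]$
  starting points: $w^0 \in \mathbb{R}^d$ and $\alpha^0 \in \mathbb{R}^N$
  repeat for $k = 0, 1, 2, \dots$
    set $w^{k+1} = (1-\theta) w^k + \theta \bar{\alpha}^k$
    set $\alpha^{k+1} = (1-\theta) \alpha^k + \theta (y - A^T w^{k+1})$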
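Algorithm 4.1 needs only the products $A\alpha$ and $A^T w$, so it can be sketched without forming any block matrix. The code below is an illustration under our own assumptions: the function name, the zero starting points, the toy data and the hand-picked value $\theta = 0.5$ are not prescribed by the paper, and `lam_n` stands for $\lambda n$.

```python
# Illustrative sketch of Algorithm 4.1 (block Gauss-Seidel update of w, then alpha).
# The function name, starting points and toy data are hypothetical.

def accelerated_fixed_point(A, y, lam_n, theta, iters):
    d, N = len(A), len(A[0])
    w, alpha = [0.0] * d, [0.0] * N
    for _ in range(iters):
        # alpha_bar = (1 / (lambda n)) * A alpha, as in (9)
        alpha_bar = [sum(A[i][j] * alpha[j] for j in range(N)) / lam_n for i in range(d)]
        w = [(1 - theta) * wi + theta * ab for wi, ab in zip(w, alpha_bar)]
        # the dual update uses the freshly computed w
        Atw = [sum(A[i][j] * w[i] for i in range(d)) for j in range(N)]
        alpha = [(1 - theta) * aj + theta * (yj - t) for aj, yj, t in zip(alpha, y, Atw)]
    return w, alpha

A = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = [1.0, 2.0, 3.0]
w, alpha = accelerated_fixed_point(A, y, lam_n=1.0, theta=0.5, iters=100)
print(w, alpha)
```

With this data the iterates should approach $w^* = (0.4, 1)$ and $\alpha^* = (0.2, 1, 3)$, the point satisfying (10).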

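Due to this modification, we can achieve faster convergence. This algorithm is a deterministic version of a randomized primal-dual algorithm (Quartz) proposed and analyzed by Qu, Richtárik and Zhang [17].

4.1. Convergence analysis

In this section we study the convergence of Algorithm 4.1. We shall determine all values for the parameter $\theta$ for which this algorithm converges as well as the one giving the best convergence rate.

To this end, we start by showing that Algorithm 4.1 can be viewed as a fixed point scheme. Then we determine the dynamic spectral properties of the associated matrices, which are parameterized by $\theta$.

First, note that the iteration of our algorithm can be written as
\[ \begin{pmatrix} I & 0 \\ \theta A^T & I \end{pmatrix} \begin{pmatrix} w^{k+1} \\ \alpha^{k+1} \end{pmatrix} = \begin{pmatrix} (1-\theta) I & \frac{\theta}{\lambda n} A \\ 0 & (1-\theta) I \end{pmatrix} \begin{pmatrix} w^k \\ \alpha^k \end{pmatrix} + \begin{pmatrix} 0 \\ \theta y \end{pmatrix} \]
or in a compact way as
\[ x^{k+1} = G_3(\theta) x^k + f \tag{28} \]
with
\[ G_3(\theta) = (1-\theta) I + \theta \begin{pmatrix} 0 & \frac{1}{\lambda n} A \\ -(1-\theta) A^T & -\frac{\theta}{\lambda n} A^T A \end{pmatrix} \tag{29} \]
and $f = \begin{pmatrix} 0 \\ \theta y \end{pmatrix}$.

We know that if the spectral radius of $G_3(\theta)$ is less than 1, then the sequence defined by (28) converges. Indeed, in this case the limit point is just $x^*$, the solution of the problem (4). This follows from the fact that $x^* = G_3(\theta) x^* + f$.

The next lemma provides the spectrum of $G_3(\theta)$.

Lemma 4.1. The eigenvalues of the matrix $G_3(\theta)$, defined in (29), are given by
\[ \Big\{ \frac{1}{2} \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} \pm \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{ \frac{\theta^2 \sigma_j^2}{\lambda n} - 4(1-\theta) } \Big), \; j = 1, \dots, p \Big\} \cup \{ 1 - \theta \}. \]

Proof. Consider the matrix $M_3(\theta) \stackrel{\text{def}}{=} \begin{pmatrix} 0 & \frac{1}{\lambda n} A \\ -(1-\theta) A^T & -\frac{\theta}{\lambda n} A^T A \end{pmatrix}$. Using the singular value decomposition of $A$, given in (18), we can write
\[ M_3(\theta) = \begin{pmatrix} U & 0 \\ 0 & V \end{pmatrix} \begin{pmatrix} 0 & \frac{1}{\lambda n} \Sigma \\ -(1-\theta) \Sigma^T & -\frac{\theta}{\lambda n} \Sigma^T \Sigma \end{pmatrix} \begin{pmatrix} U^T & 0 \\ 0 & V^T \end{pmatrix}. \]
Therefore, the eigenvalues of $M_3(\theta)$ are the same as the ones of
\[ \begin{pmatrix} 0 & c \Sigma \\ -(1-\theta) \Sigma^T & -\theta c \Sigma^T \Sigma \end{pmatrix} = \begin{pmatrix} 0 & 0 & c \tilde{\Sigma} & 0 \\ 0 & 0 & 0 & 0 \\ -(1-\theta) \tilde{\Sigma} & 0 & -\theta c \tilde{\Sigma}^2 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \]
with diagonal blocks of sizes $p$, $d-p$, $p$ and $N-p$,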

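where $c = \frac{1}{\lambda n}$ and $\tilde{\Sigma}$ is defined in (19). The characteristic polynomial of this matrix is
\[ p_\theta(t) = \det \begin{pmatrix} tI & 0 & -c \tilde{\Sigma} & 0 \\ 0 & tI & 0 & 0 \\ (1-\theta) \tilde{\Sigma} & 0 & tI + \theta c \tilde{\Sigma}^2 & 0 \\ 0 & 0 & 0 & tI \end{pmatrix} = \det \begin{pmatrix} tI & -c \tilde{\Sigma} & 0 & 0 \\ (1-\theta) \tilde{\Sigma} & tI + \theta c \tilde{\Sigma}^2 & 0 & 0 \\ 0 & 0 & tI & 0 \\ 0 & 0 & 0 & tI \end{pmatrix}. \]
Using Proposition 3.1 and denoting $q = N + d - 2p$, we obtain
\[ p_\theta(t) = t^q \prod_{j=1}^p \big( t (t + \theta c \sigma_j^2) + c (1-\theta) \sigma_j^2 \big) = t^q \prod_{j=1}^p \Big( t^2 + \frac{\theta \sigma_j^2}{\lambda n} t + \frac{(1-\theta) \sigma_j^2}{\lambda n} \Big). \]
Thus, the eigenvalues of $M_3(\theta)$ are
\[ \Big\{ \frac{1}{2} \Big( -\frac{\theta \sigma_j^2}{\lambda n} \pm \frac{\sigma_j}{\sqrt{\lambda n}} \sqrt{ \frac{\theta^2 \sigma_j^2}{\lambda n} - 4(1-\theta) } \Big), \; j = 1, \dots, p \Big\} \cup \{ 0 \}, \]
so that the eigenvalues of $G_3(\theta) = (1-\theta) I + \theta M_3(\theta)$ are
\[ \Big\{ \frac{1}{2} \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} \pm \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{ \frac{\theta^2 \sigma_j^2}{\lambda n} - 4(1-\theta) } \Big), \; j = 1, \dots, p \Big\} \cup \{ 1 - \theta \}, \]
giving the desired result.

Figure 4 illustrates, in the complex plane, the spectrum of the matrix $G_3(\theta)$ for many different values of $\theta$. We used n = 25, d = 3, m = therefore N = 25, λ = .3 and a random matrix $A \in \mathbb{R}^{d \times N}$. The pictures point out the fact that for some range of $\theta$ the spectrum is contained in a circle and for other values of $\theta$ some of the eigenvalues remain in a circle while others are distributed along the real line, moving monotonically as this parameter changes. These statements will be proved in the sequel.

In what follows, let us consider the functions $\delta_j : [0, 1] \to \mathbb{R}$ defined by
\[ \delta_j(\theta) = \frac{\theta^2 \sigma_j^2}{\lambda n} - 4(1-\theta). \tag{30} \]
The following straightforward result brings some basic properties of them, illustrated in Figure 5.

Lemma 4.2. Each function $\delta_j$, $j = 1, \dots, p$, is strictly increasing, from $-4$ to $\frac{\sigma_j^2}{\lambda n}$ as $\theta$ goes from zero to 1. Furthermore, these functions are sorted in decreasing order, $\delta_1 \geq \delta_2 \geq \dots \geq \delta_p$, and their zeros,
\[ \bar{\theta}_j \stackrel{\text{def}}{=} \frac{2 \sqrt{\lambda n} \big( \sqrt{\lambda n + \sigma_j^2} - \sqrt{\lambda n} \big)}{\sigma_j^2}, \tag{31} \]
are sorted in increasing order: $0 < \bar{\theta}_1 \leq \bar{\theta}_2 \leq \dots \leq \bar{\theta}_p < 1$.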
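The statements of Lemma 4.2 can be checked numerically for any list of singular values. The following sketch uses our own function names and sample values ($\sigma = 3, 2, 0.5$ and $\lambda n = 1$), which are assumptions for illustration only.

```python
import math

# Illustrative check of Lemma 4.2 (function names and sample data are hypothetical).

def delta(theta, sigma, lam_n):
    """delta_j(theta) from (30); lam_n stands for lambda * n."""
    return theta ** 2 * sigma ** 2 / lam_n - 4 * (1 - theta)

def theta_bar(sigma, lam_n):
    """Zero of delta_j in (0, 1), formula (31)."""
    root = math.sqrt(lam_n)
    return 2 * root * (math.sqrt(lam_n + sigma ** 2) - root) / sigma ** 2

sigmas, lam_n = [3.0, 2.0, 0.5], 1.0          # sorted so that sigma_1 >= ... >= sigma_p
zeros = [theta_bar(s, lam_n) for s in sigmas]
for s, z in zip(sigmas, zeros):
    print("sigma =", s, "theta_bar =", z, "delta(theta_bar) =", delta(z, s, lam_n))
```

The printed zeros should increase as the singular values decrease, and each $\delta_j(\bar{\theta}_j)$ should vanish up to rounding.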

[Figure 4: fifteen panels showing the spectrum of $G_3(\theta)$ in the complex plane, one panel per value of $\theta$.]

Figure 4: The spectrum of $G_3(\theta)$ for many different values of $\theta$. The first 4 pictures satisfy the condition $\frac{\theta^2 \sigma_1^2}{\lambda n} - 4(1-\theta) < 0$; in the fifth picture we have $\frac{\theta^2 \sigma_1^2}{\lambda n} - 4(1-\theta) = 0$ and the remaining ones represent the case where $\frac{\theta^2 \sigma_1^2}{\lambda n} - 4(1-\theta) > 0$. The straight line represents (in a different scale) the interval $[0, 1]$, on which are plotted some specific values $\bar{\theta}_j$ (blue marks), defined in (31). The red diamond corresponds to the current value of $\theta$.

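Figure 5: The functions $\delta_j$, $j = 1, \dots, p$, and the properties stated in Lemma 4.2.

Now we shall study the spectrum of $G_3(\theta)$, given in Lemma 4.1. For this, let us denote
\[ \lambda_0(\theta) \stackrel{\text{def}}{=} 1 - \theta \tag{32} \]
and, for $j = 1, \dots, p$,
\[ \lambda_j^-(\theta) \stackrel{\text{def}}{=} \frac{1}{2} \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} - \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big), \tag{33} \]
\[ \lambda_j^+(\theta) \stackrel{\text{def}}{=} \frac{1}{2} \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} + \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big), \tag{34} \]
where $\delta_j$ is defined in (30).

Lemma 4.3. Consider $\bar{\theta}_1$ as defined in (31). If $\theta \in [0, \bar{\theta}_1]$, then $|\lambda_j^+(\theta)| = |\lambda_j^-(\theta)| = 1 - \theta$ for all $j = 1, \dots, p$, which in turn implies that the spectral radius of $G_3(\theta)$ is $1 - \theta$.

Proof. Note that in this case we have $\delta_j(\theta) \leq 0$ for all $j = 1, \dots, p$. So,
\[ |\lambda_j^+(\theta)|^2 = |\lambda_j^-(\theta)|^2 = \frac{1}{4} \Big( \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} \Big)^2 - \frac{\theta^2 \sigma_j^2}{\lambda n} \delta_j(\theta) \Big) = (1-\theta)^2, \]
yielding the desired result since $\theta \leq 1$.

It can be shown that the parameter $\theta = \theta_1^*$, defined in Theorem 3.4, satisfies the conditions of Lemma 4.3. So, the spectral radius of $G_3(\theta_1^*)$ is $1 - \theta_1^*$, exactly the same spectral radius of $G_1(\theta_1^*) = (1 - \theta_1^*) I + \theta_1^* M_1$. This is shown in Figure 6,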

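together with the spectrum of $G_2(\theta_2^*) = (1 - \theta_2^*) I + \theta_2^* M_2$. We also show in this figure (the right picture) the spectrum of $G_3(\theta_3^*)$, where $\theta_3^*$ is the optimal parameter. This parameter will be determined later, in Theorem 4.6.

Figure 6: On the left, the spectrum of $G_1(\theta_1^*)$ (magenta squares), $G_2(\theta_2^*)$ (blue triangles) and $G_3(\theta_1^*)$ (red diamonds). On the right, the spectrum of $G_3(\theta_3^*)$, where $\theta_3^*$ is the optimal parameter.

Lemma 4.4. Consider $\bar{\theta}_j$, $j = 1, \dots, p$, as defined in (31). If $\theta \in [\bar{\theta}_l, \bar{\theta}_{l+1}]$, then the eigenvalues $\lambda_j^+(\theta)$ and $\lambda_j^-(\theta)$, $j = 1, \dots, l$, are real numbers satisfying
\[ \lambda_1^-(\theta) \leq \dots \leq \lambda_l^-(\theta) \leq \theta - 1 \leq \lambda_l^+(\theta) \leq \dots \leq \lambda_1^+(\theta) \leq 0. \]
On the other hand, for $j = l+1, \dots, p$ we have $|\lambda_j^+(\theta)| = |\lambda_j^-(\theta)| = 1 - \theta$. Thus, the spectral radius of $G_3(\theta)$ is $-\lambda_1^-(\theta)$.

Proof. We have $\delta_j(\theta) \geq 0$ for all $j = 1, \dots, l$. So,
\[ \lambda_j^+(\theta) + 1 - \theta = \frac{1}{2} \Big( 4(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} + \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big) = \frac{1}{2} \Big( -\delta_j(\theta) + \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big) = \frac{\sqrt{\delta_j(\theta)}}{2} \Big( \frac{\theta \sigma_j}{\sqrt{\lambda n}} - \sqrt{\delta_j(\theta)} \Big) \geq 0. \]
Furthermore,
\[ \Big( \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big)^2 = \frac{\theta^2 \sigma_j^2}{\lambda n} \Big( \frac{\theta^2 \sigma_j^2}{\lambda n} - 4(1-\theta) \Big) \leq \Big( \frac{\theta^2 \sigma_j^2}{\lambda n} - 2(1-\theta) \Big)^2. \]
Since $\frac{\theta^2 \sigma_j^2}{\lambda n} - 2(1-\theta) = \delta_j(\theta) + 2(1-\theta) \geq 0$, we obtain
\[ \lambda_j^+(\theta) = \frac{1}{2} \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} + \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big) \leq 0. \]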

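Now, note that
\[ \lambda_j^-(\theta) + 1 - \theta = \frac{1}{2} \Big( 2(1-\theta) - \frac{\theta^2 \sigma_j^2}{\lambda n} - \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big) + 1 - \theta = \frac{1}{2} \Big( -\delta_j(\theta) - \frac{\theta \sigma_j}{\sqrt{\lambda n}} \sqrt{\delta_j(\theta)} \Big) \leq 0. \]
Moreover, from Lemma 4.2 and the definition of $\sigma_j$, we have $\delta_1(\theta) \geq \dots \geq \delta_l(\theta) \geq 0$ and $\sigma_1 \geq \dots \geq \sigma_l > 0$, which imply that $\lambda_1^-(\theta) \leq \dots \leq \lambda_l^-(\theta)$. The inequality $\lambda_l^+(\theta) \leq \dots \leq \lambda_1^+(\theta)$ follows from the fact that the function
\[ [\sqrt{a}, \infty) \ni s \mapsto -s^2 + s \sqrt{s^2 - a} \]
is increasing. Finally, for $j = l+1, \dots, p$ we have $\delta_j(\theta) \leq 0$ and, by the same argument used in Lemma 4.3, we conclude that $|\lambda_j^+(\theta)| = |\lambda_j^-(\theta)| = 1 - \theta$.

From Lemmas 4.3 and 4.4 we can conclude that $\bar{\theta}_1$ is the threshold value for $\theta$ after which the eigenvalues of $G_3(\theta)$ start departing the circle of radius $1 - \theta$. The next result presents the threshold after which the eigenvalues are all real.

Lemma 4.5. Consider $\bar{\theta}_p$ as defined in (31). If $\theta \geq \bar{\theta}_p$, then the eigenvalues $\lambda_j^+(\theta)$ and $\lambda_j^-(\theta)$, $j = 1, \dots, p$, are real numbers satisfying
\[ \lambda_1^-(\theta) \leq \dots \leq \lambda_p^-(\theta) \leq \theta - 1 \leq \lambda_p^+(\theta) \leq \dots \leq \lambda_1^+(\theta) \leq 0. \]
Thus, the spectral radius of $G_3(\theta)$ is $-\lambda_1^-(\theta)$.

Proof. The same as the one presented for Lemma 4.4.

Using the previous results, we can finally establish the convergence of Algorithm 4.1 (Deterministic Quartz).

Theorem 4.6. Let $w^0 \in \mathbb{R}^d$ and $\alpha^0 \in \mathbb{R}^N$ be arbitrary and consider the sequence $(w^k, \alpha^k)_{k \in \mathbb{N}}$ generated by Algorithm 4.1 with $\theta \in \Big( 0, \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1} \Big)$. Then the sequence $(w^k, \alpha^k)$ converges to the unique solution of the problem (4) at an asymptotic linear rate of
\[ \rho_3(\theta) \stackrel{\text{def}}{=} \begin{cases} 1 - \theta, & \text{if } \theta \in (0, \bar{\theta}_1] \\[1mm] \dfrac{1}{2} \Big( \dfrac{\theta \sigma_1}{\sqrt{\lambda n}} \sqrt{\delta_1(\theta)} + \dfrac{\theta^2 \sigma_1^2}{\lambda n} - 2(1-\theta) \Big), & \text{if } \theta \geq \bar{\theta}_1, \end{cases} \]
where $\bar{\theta}_1 = \frac{2 \sqrt{\lambda n} \big( \sqrt{\lambda n + \sigma_1^2} - \sqrt{\lambda n} \big)}{\sigma_1^2}$. Furthermore, if we choose $\theta_3^* \stackrel{\text{def}}{=} \bar{\theta}_1$, then the theoretical convergence rate is optimal and it is equal to $\rho_3^* \stackrel{\text{def}}{=} 1 - \theta_3^*$.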

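Proof. Since Algorithm 4.1 can be represented by (28), we need to show that $\rho(G_3(\theta))$, the spectral radius of $G_3(\theta)$, is less than 1. First, note that by Lemmas 4.3, 4.4 and 4.5, we have $\rho(G_3(\theta)) = \rho_3(\theta)$. Using Lemma 4.2 we conclude that the function $\theta \mapsto \rho_3(\theta)$ is increasing on the interval $[\bar{\theta}_1, \infty)$, while it is decreasing on $(0, \bar{\theta}_1]$, which means that its minimum is attained at $\bar{\theta}_1$. To finish the proof, it is enough to prove that $\rho_3(\theta) = 1$ if and only if $\theta = \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1}$. Note that
\[ \rho_3(\theta) = 1 \;\Rightarrow\; \frac{\theta \sigma_1}{\sqrt{\lambda n}} \sqrt{\delta_1(\theta)} = 2(2-\theta) - \frac{\theta^2 \sigma_1^2}{\lambda n} \;\Rightarrow\; \frac{\theta^2 \sigma_1^2}{\lambda n} \delta_1(\theta) = \Big( 2(2-\theta) - \frac{\theta^2 \sigma_1^2}{\lambda n} \Big)^2 \;\Rightarrow\; \frac{\theta \sigma_1}{\sqrt{\lambda n}} = 2 - \theta \;\Rightarrow\; \theta = \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1} \]
and
\[ \theta = \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1} \;\Rightarrow\; \frac{\theta^2 \sigma_1^2}{\lambda n} = (2-\theta)^2 \;\Rightarrow\; \frac{\theta \sigma_1}{\sqrt{\lambda n}} \sqrt{\delta_1(\theta)} + \frac{\theta^2 \sigma_1^2}{\lambda n} - 2(1-\theta) = 2, \]
completing the proof.

It is worth noting that if the spectral radius of $M_1$ is less than 1, that is, if $\sigma_1^2 < \lambda n$, then Algorithms 3.2, 3.3 and 4.1 converge for any choice of $\theta \in (0, 1]$. Indeed, in this case we have
\[ \frac{2 \lambda n}{\lambda n + \sigma_1^2} > \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1} > 1, \]
which implies that the set of admissible values for $\theta$ established in Theorems 3.4, 3.5 and 4.6 contains the whole interval $(0, 1]$. On the other hand, if $\sigma_1^2 \geq \lambda n$, the convergence of these algorithms is more restrictive. Moreover, in this case we have
\[ \frac{2 \lambda n}{\lambda n + \sigma_1^2} \leq \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1} \leq 1, \]
which means that Algorithm 4.1 has a broader range for $\theta$ than Algorithms 3.2 and 3.3.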
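The rate $\rho_3(\theta)$ of Theorem 4.6 can be evaluated straight from the eigenvalue formula of Lemma 4.1, using complex square roots so that both regimes ($\delta_j < 0$ and $\delta_j \geq 0$) are handled by the same expression. This is a sketch with assumed function names and sample values ($\sigma = 3, 2, 0.5$, $\lambda n = 1$), not code from the paper.

```python
import cmath
import math

# Illustrative evaluation of the spectral radius of G_3(theta) via Lemma 4.1.
# Function names and sample data are hypothetical; lam_n stands for lambda * n.

def g3_eigenpair(theta, sigma, lam_n):
    s = theta ** 2 * sigma ** 2 / lam_n
    u = theta * sigma / math.sqrt(lam_n)
    root = cmath.sqrt(s - 4 * (1 - theta))      # sqrt(delta_j(theta)), possibly imaginary
    base = 2 * (1 - theta) - s
    return (base + u * root) / 2, (base - u * root) / 2

def rho3(theta, sigmas, lam_n):
    radius = abs(1 - theta)
    for sigma in sigmas:
        radius = max(radius, *(abs(e) for e in g3_eigenpair(theta, sigma, lam_n)))
    return radius

sigmas, lam_n = [3.0, 2.0, 0.5], 1.0
sigma1 = max(sigmas)
root = math.sqrt(lam_n)
theta_bar1 = 2 * root * (math.sqrt(lam_n + sigma1 ** 2) - root) / sigma1 ** 2
theta_limit = 2 * root / (root + sigma1)
print("rate at theta_bar_1:", rho3(theta_bar1, sigmas, lam_n), "vs", 1 - theta_bar1)
print("rate at the end of the admissible range:", rho3(theta_limit, sigmas, lam_n))
```

For these sample values the first line should show two nearly equal numbers and the second a value of 1, matching the two claims of the theorem.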

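4.2. Complexity results

Taking into account (22), (25), the relation $\log(1-\theta) \approx -\theta$ and Theorem 4.6, we conclude that the complexity of our Accelerated Fixed Point Method, Algorithm 4.1, is proportional to
\[ \frac{1}{2 \theta_3^*} = \frac{\sigma_1^2}{4 \sqrt{\lambda n} \big( \sqrt{\lambda n + \sigma_1^2} - \sqrt{\lambda n} \big)} \overset{(25)}{=} \frac{\kappa - \lambda n}{4 \sqrt{\lambda n} \big( \sqrt{\kappa} - \sqrt{\lambda n} \big)}. \tag{35} \]
Note that in the case when $\lambda = 1/n$, as is typical in machine learning applications, we can write
\[ \frac{1}{2 \theta_3^*} \overset{(35)}{=} \frac{\kappa - 1}{4 (\sqrt{\kappa} - 1)} = \frac{\sqrt{\kappa} + 1}{4}. \tag{36} \]
This is very surprising as it means that we are achieving the optimal accelerated Nesterov rate $\tilde{O}(\sqrt{\kappa})$.

5. Extensions

In this section we discuss some variants of Algorithm 4.1. The first one consists of switching the order of the computations, updating the dual variable first and then the primal one. The second approach updates the primal variable enforcing the first relation of the optimality conditions given by (10) and using the relaxation parameter $\theta$ only to update the dual variable.

5.1. Switching the update order

This approach updates the dual variable $\alpha$ first and then updates the primal variable $w$ using the new information about $\alpha$. This is summarized in the following scheme.
\[ \begin{cases} \alpha^{k+1} = (1-\theta) \alpha^k + \theta (y - A^T w^k) \\ w^{k+1} = (1-\theta) w^k + \theta \frac{1}{\lambda n} A \alpha^{k+1}. \end{cases} \tag{37} \]
As we shall see now, this scheme provides the same complexity results as Algorithm 4.1. To see this, note that the iteration (37) is equivalent to
\[ \begin{pmatrix} I & -\frac{\theta}{\lambda n} A \\ 0 & I \end{pmatrix} \begin{pmatrix} w^{k+1} \\ \alpha^{k+1} \end{pmatrix} = \begin{pmatrix} (1-\theta) I & 0 \\ -\theta A^T & (1-\theta) I \end{pmatrix} \begin{pmatrix} w^k \\ \alpha^k \end{pmatrix} + \begin{pmatrix} 0 \\ \theta y \end{pmatrix} \]
or in a compact way,
\[ x^{k+1} = G(\theta) x^k + f \]
with
\[ G(\theta) = (1-\theta) I + \theta \begin{pmatrix} -\frac{\theta}{\lambda n} A A^T & \frac{1-\theta}{\lambda n} A \\ -A^T & 0 \end{pmatrix}. \]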

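It can be shown that the matrix $G(\theta)$ has exactly the same spectrum as $G_3(\theta)$, defined in (29). So, the convergence result is also the same, which we state again for convenience.

Theorem 5.1. Let $w^0 \in \mathbb{R}^d$ and $\alpha^0 \in \mathbb{R}^N$ be arbitrary and consider the sequence $(w^k, \alpha^k)_{k \in \mathbb{N}}$ defined by (37) with $\theta \in \Big( 0, \frac{2 \sqrt{\lambda n}}{\sqrt{\lambda n} + \sigma_1} \Big)$. Then the sequence $(w^k, \alpha^k)$ converges to the unique solution of the problem (4) at an asymptotic linear rate of
\[ \rho_3(\theta) = \begin{cases} 1 - \theta, & \text{if } \theta \in (0, \bar{\theta}_1] \\[1mm] \dfrac{1}{2} \Big( \dfrac{\theta \sigma_1}{\sqrt{\lambda n}} \sqrt{\delta_1(\theta)} + \dfrac{\theta^2 \sigma_1^2}{\lambda n} - 2(1-\theta) \Big), & \text{if } \theta \geq \bar{\theta}_1, \end{cases} \]
where $\bar{\theta}_1 = \frac{2 \sqrt{\lambda n} \big( \sqrt{\lambda n + \sigma_1^2} - \sqrt{\lambda n} \big)}{\sigma_1^2}$. Furthermore, if we choose $\theta_3^* = \bar{\theta}_1$, then the theoretical convergence rate is optimal and it is equal to $\rho_3^* = 1 - \theta_3^*$.

5.2. Maintaining primal-dual relationship

The second approach updates the primal variable enforcing the first relation of the optimality conditions given by (10) and uses the relaxation parameter $\theta$ only to update the dual variable, as described in the following scheme.
\[ \begin{cases} w^{k+1} = \frac{1}{\lambda n} A \alpha^k \\ \alpha^{k+1} = (1-\theta) \alpha^k + \theta (y - A^T w^{k+1}). \end{cases} \tag{38} \]
Differently from the previous case, this scheme cannot achieve accelerated convergence. Indeed, note first that the scheme (38) can be written as
\[ \begin{pmatrix} I & 0 \\ \theta A^T & I \end{pmatrix} \begin{pmatrix} w^{k+1} \\ \alpha^{k+1} \end{pmatrix} = \begin{pmatrix} 0 & \frac{1}{\lambda n} A \\ 0 & (1-\theta) I \end{pmatrix} \begin{pmatrix} w^k \\ \alpha^k \end{pmatrix} + \begin{pmatrix} 0 \\ \theta y \end{pmatrix} \]
or in a compact way, $x^{k+1} = G(\theta) x^k + f$ with
\[ G(\theta) = \begin{pmatrix} 0 & \frac{1}{\lambda n} A \\ 0 & (1-\theta) I - \frac{\theta}{\lambda n} A^T A \end{pmatrix}. \]
We can conclude that the eigenvalues of this matrix are
\[ \Big\{ 1 - \theta - \frac{\theta \sigma_j^2}{\lambda n}, \; j = 1, \dots, p \Big\} \cup \{ 1 - \theta \}, \]
exactly the same as those of the matrix $G_1(\theta)$, the iteration matrix of Algorithm 3.1 with employment of $M_1$. So, the complexity analysis here is the same as the one established in Theorem 3.4.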

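5.3. Maintaining primal-dual relationship 2

For the sake of completeness, we present next the method where we keep the second relationship intact and include $\theta$ in the first relationship. This leads to

$$\begin{aligned} \alpha^{k+1} &= y - A^T w^k, \\ w^{k+1} &= (1-\theta)w^k + \theta A\alpha^{k+1}. \end{aligned} \qquad (39)$$

Here we obtain the same convergence results as the ones described in Section 5.2. In fact, the relations above can be written as

$$\begin{pmatrix} 0 & I \\ I & -\theta A \end{pmatrix} \begin{pmatrix} w^{k+1} \\ \alpha^{k+1} \end{pmatrix} = \begin{pmatrix} -A^T & 0 \\ (1-\theta)I & 0 \end{pmatrix} \begin{pmatrix} w^k \\ \alpha^k \end{pmatrix} + \begin{pmatrix} y \\ 0 \end{pmatrix}$$

or in a compact way, $x^{k+1} = G(\theta)x^k + f$ with

$$G(\theta) = \begin{pmatrix} (1-\theta)I - \theta AA^T & 0 \\ -A^T & 0 \end{pmatrix}.$$

Since $G(\theta)$ is block lower triangular, its eigenvalues are 0 (from the zero diagonal block) together with those of $(1-\theta)I - \theta AA^T$, namely

$$\left\{ 1-\theta-\theta\sigma_j^2,\ j = 1, \ldots, p \right\} \cup \{1-\theta\},$$

which are exactly the eigenvalues of the matrix $G_1(\theta)$, the iteration matrix of Algorithm 3.1 with employment of $M_1$. So, the complexity analysis here is the same as the one established in Theorem 3.4.

Observe that in (38) we have $w^{k+1} = \phi_1(\alpha^k)$ and $\alpha^{k+1} = \phi_2(\theta, \alpha^k, w^{k+1})$. On the other hand, in (39) we have $\alpha^{k+1} = \phi_3(w^k)$ and $w^{k+1} = \phi_4(\theta, w^k, \alpha^{k+1})$. It is worth noting that if we update the variables as

$$\alpha^{k+1} = \phi_2(\theta, \alpha^k, w^k) \quad \text{and} \quad w^{k+1} = \phi_1(\alpha^{k+1})$$

or

$$w^{k+1} = \phi_4(\theta, w^k, \alpha^k) \quad \text{and} \quad \alpha^{k+1} = \phi_3(w^{k+1}),$$

we obtain

$$\begin{pmatrix} -\theta AA^T & (1-\theta)A \\ -\theta A^T & (1-\theta)I \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} (1-\theta)I & \theta A \\ -(1-\theta)A^T & -\theta A^T A \end{pmatrix}$$

as the associated iteration matrices, respectively. Moreover, we can conclude that, apart from zero eigenvalues, they also have the same spectrum as $G_1(\theta)$. So, the complexity analysis is the same as the one established in Theorem 3.4.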
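Schemes (38) and (39) can be run side by side to confirm that both reach the same primal-dual solution. The sketch below is illustrative only: the function names are hypothetical, the data are random, the matrix is rescaled so that $\sigma_1 = 1$, and the value $\theta = 2/(2+\sigma_1^2)$ is an assumption chosen here because it balances $|1-\theta|$ and $|1-\theta-\theta\sigma_1^2|$. The reference solution comes from eliminating $\alpha$ in $w = A\alpha$, $\alpha = y - A^T w$, which gives $(I + AA^T)w = Ay$.

```python
import numpy as np

def step_38(w, alpha, A, y, theta):
    """Scheme (38): primal variable from w = A alpha, relaxed dual update."""
    w_new = A @ alpha
    alpha_new = (1 - theta) * alpha + theta * (y - A.T @ w_new)
    return w_new, alpha_new

def step_39(w, alpha, A, y, theta):
    """Scheme (39): dual variable from alpha = y - A^T w, relaxed primal update."""
    alpha_new = y - A.T @ w
    w_new = (1 - theta) * w + theta * (A @ alpha_new)
    return w_new, alpha_new

def run(step, A, y, theta, iters):
    """Apply `step` repeatedly, starting from the zero vector."""
    d, N = A.shape
    w, alpha = np.zeros(d), np.zeros(N)
    for _ in range(iters):
        w, alpha = step(w, alpha, A, y, theta)
    return w, alpha

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 8))
A /= np.linalg.norm(A, 2)            # hypothetical scaling: sigma_1 = 1
y = rng.standard_normal(8)
sigma1 = np.linalg.norm(A, 2)

# Reference solution of the coupled optimality conditions.
w_star = np.linalg.solve(np.eye(5) + A @ A.T, A @ y)
alpha_star = y - A.T @ w_star

theta = 2.0 / (2.0 + sigma1**2)      # assumed parameter choice, see lead-in
errors = {}
for name, step in (("(38)", step_38), ("(39)", step_39)):
    w, alpha = run(step, A, y, theta, iters=200)
    errors[name] = max(np.linalg.norm(w - w_star),
                       np.linalg.norm(alpha - alpha_star))
print(errors)
```

In both schemes only one of the two variables is relaxed, so the iteration effectively acts on a single variable through $(1-\theta)I - \theta A^T A$ or $(1-\theta)I - \theta AA^T$, which is why the spectrum matches the non-accelerated method.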

6. Numerical Experiments

In this section we present a comparison among the methods discussed in this work. Besides a table with the convergence rates and complexity bounds, we show some numerical tests performed to illustrate the properties of Algorithms 3.2, 3.3 and 4.1 as well as of the extensions (37) and (38) applied to solve the primal-dual ridge regression problem stated in (4). We refer to Algorithm 4.1 as Quartz and to the extensions (37) and (38) as New Quartz and Modified Quartz, respectively. The name Quartz is due to the fact that Algorithm 4.1 is a deterministic version of a randomized primal-dual algorithm proposed and analyzed by Qu, Richtárik and Zhang [7].

We summarize the main features of these methods in Table 2, which lists the range of the parameter that ensures convergence, the optimal convergence rate, the complexity and the cost per iteration of each method. For instance, the two versions of Algorithm 3.1 have the same range for $\theta$. Using $M_1$ provides a better convergence rate than using $M_2$. However, it requires more calculations per iteration: the major computational tasks are the matrix-vector products $AA^T w$ and $A^T A\alpha$, while the use of $M_2$ only needs the computation of $A\alpha$ and $A^T w$. Surprisingly, Algorithm 4.1 turned out to be the best both from the theoretical point of view and in the numerical experiments, with the same cost as the computation of $A\alpha$ and $A^T w$. We also point out that Modified Quartz, (38), did not perform here as well as the randomized version studied in [7].
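The three Quartz variants named above can be compared on a small synthetic problem. The following is a rough sketch under several assumptions: the Quartz update is written here as the primal-first counterpart of scheme (37), which is inferred from the description of the "switched" variant rather than quoted from the algorithm itself; the parameter formulas are the ones that balance the eigenvalue moduli of each iteration matrix; the function names, problem sizes, tolerance, and the scaling $\sigma_1 = 3$ are all hypothetical.

```python
import numpy as np

def quartz_step(w, alpha, A, y, theta):
    # Assumed form of Quartz: primal update first, then dual with the new w.
    w = (1 - theta) * w + theta * (A @ alpha)
    alpha = (1 - theta) * alpha + theta * (y - A.T @ w)
    return w, alpha

def new_quartz_step(w, alpha, A, y, theta):
    # Scheme (37): dual update first, then primal with the new alpha.
    alpha = (1 - theta) * alpha + theta * (y - A.T @ w)
    w = (1 - theta) * w + theta * (A @ alpha)
    return w, alpha

def modified_quartz_step(w, alpha, A, y, theta):
    # Scheme (38): primal variable set to A alpha, relaxed dual update.
    w = A @ alpha
    alpha = (1 - theta) * alpha + theta * (y - A.T @ w)
    return w, alpha

def iterations_to_tol(step, A, y, theta, tol=1e-10, max_iter=10_000):
    """Count steps until both optimality conditions hold up to `tol`."""
    d, N = A.shape
    w, alpha = np.zeros(d), np.zeros(N)
    for k in range(1, max_iter + 1):
        w, alpha = step(w, alpha, A, y, theta)
        residual = max(np.linalg.norm(w - A @ alpha),
                       np.linalg.norm(alpha - (y - A.T @ w)))
        if residual <= tol:
            return k
    return max_iter

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 50))
A *= 3.0 / np.linalg.norm(A, 2)      # hypothetical scaling: sigma_1 = 3
y = rng.standard_normal(50)
sigma1 = np.linalg.norm(A, 2)

theta3 = 2.0 * (np.sqrt(1.0 + sigma1**2) - 1.0) / sigma1**2   # Quartz variants
theta1 = 2.0 / (2.0 + sigma1**2)                              # Modified Quartz

counts = {
    "QTZ": iterations_to_tol(quartz_step, A, y, theta3),
    "NQTZ": iterations_to_tol(new_quartz_step, A, y, theta3),
    "MQTZ": iterations_to_tol(modified_quartz_step, A, y, theta1),
}
print(counts)
```

On such a problem one would expect Quartz and New Quartz to need a similar number of steps and Modified Quartz to need noticeably more, in line with the rates $1-\theta_3^*$ and $\sigma_1^2/(2+\sigma_1^2)$.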
Method      Range of θ                     Optimal rate                         Complexity   Cost/iteration
PDFP1(θ)    $(0,\ 2/(1+\sigma_1^2))$       $\sigma_1^2/(2+\sigma_1^2)$          (26)         … dN + 5d + 9N
PDFP2(θ)    $(0,\ 2/(1+\sigma_1^2))$       $\sigma_1/\sqrt{1+\sigma_1^2}$       (27)         6dN + 5d + 9N
QTZ(θ)      $(0,\ 2/(1+\sigma_1))$         $1-\theta_3^*$                       (36)         6dN + 5d + 9N
NQTZ(θ)     $(0,\ 2/(1+\sigma_1))$         $1-\theta_3^*$                       …            … dN + 5d + 9N
MQTZ(θ)     $(0,\ 2/(1+\sigma_1^2))$       $\sigma_1^2/(2+\sigma_1^2)$          …            … dN + 3d + 9N

Table 2: Comparison between the ranges of θ to ensure convergence, optimal convergence rates, complexity and cost per iteration (number of arithmetic operations) of the algorithms proposed in this paper: Algorithm 3.2, indicated by PDFP1(θ); Algorithm 3.3, denoted by PDFP2(θ); Algorithm 4.1, QTZ(θ); and the extensions (37) (New Quartz) and (38) (Modified Quartz), indicated by NQTZ(θ) and MQTZ(θ), respectively.

Figure 7 illustrates these features, showing the primal-dual objective values
against the number of iterations. The dimensions considered were d =, m = and n = 5. We adopted the optimal parameters associated with each method, namely, $\theta_1^*$, $\theta_2^*$ and $\theta_3^*$ for Algorithms 3.2, 3.3 and 4.1, respectively, $\theta_3^*$ for the algorithm given by (37) and $\theta_1^*$ for the algorithm given by (38). These parameters are defined in Theorems 3.4, 3.5 and 4.6, and the cost of computing them is the same as the cost of computing $\sigma_1$, the largest singular value of $A$. The left picture of Figure 7 compares Algorithms 3.2, 3.3 and 4.1, while the right one shows the performance of Algorithm 3.2 and the three variants of Quartz. We can see the equivalence between Quartz and New Quartz and also the equivalence between Modified Quartz and Algorithm 3.2. Note that, besides the advantage of QTZ* in terms of number of iterations, it does not need more arithmetic operations per iteration, as we have seen in Table 2.

[Figure 7: two panels plotting objective value against iteration. Left panel: PDFP1*, PDFP2*, QTZ*. Right panel: PDFP1*, QTZ*, NQTZ*, MQTZ*.]

Figure 7: Performance of the optimal versions of the algorithms proposed in this paper applied to solve the problem (4). The pictures show the objective values against the number of iterations. The dimensions considered were d =, m = and n = 5. The matrix $A \in \mathbb{R}^{d \times N}$ and the vector $y \in \mathbb{R}^N$ were randomly generated. For simplicity of notation we have denoted Algorithm 3.2 by PDFP1*, Algorithm 3.3 by PDFP2*, Algorithm 4.1 by QTZ* and the extensions (37) (New Quartz) and (38) (Modified Quartz) by NQTZ* and MQTZ*, respectively.

Although the main goal of this work is a theoretical study of the convergence and complexity of various fixed point type methods, for the sake of completeness we also compare our methods with the classical method for solving quadratic optimization problems: the conjugate gradient algorithm (CG). Figure 8 shows the performance of the optimal versions of the algorithms proposed in this paper compared with CG, applied to solve the problem (4). On the top we have plotted the objective values against the number of iterations, while the bottom pictures show the objective values against the CPU time. The numerical experiments indicate that Quartz is competitive with CG. While Quartz needs more iterations than CG to converge, it is faster in runtime. This is due to the big difference between the effort per iteration of these two algorithms: $6dN + 5d + 9N$ arithmetic operations per iteration for Quartz compared to $4d^2 + 4N^2 + 4dN + 4d + 7N$ for CG.
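The iteration-count side of this comparison can be reproduced in miniature. This is a sketch under explicit assumptions, not the experiment reported above: CG is applied here to the reduced primal system $(I + AA^T)w = Ay$ obtained by eliminating $\alpha$, which may differ from the system used in the paper; the Quartz update is the assumed primal-first form; the names `conjugate_gradient` and `quartz`, the sizes, and the scaling $\sigma_1 = 3$ are hypothetical. Timings are not measured.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite system; returns (x, iterations)."""
    x = np.zeros_like(b)
    r = b.copy()                     # residual b - matvec(x) for the start x = 0
    p = r.copy()
    rs = r @ r
    for k in range(max_iter):
        if np.sqrt(rs) <= tol:
            return x, k
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

def quartz(A, y, theta, tol=1e-10, max_iter=10_000):
    """Assumed primal-first fixed point iteration; returns (w, alpha, iterations)."""
    d, N = A.shape
    w, alpha = np.zeros(d), np.zeros(N)
    for k in range(1, max_iter + 1):
        w = (1 - theta) * w + theta * (A @ alpha)
        alpha = (1 - theta) * alpha + theta * (y - A.T @ w)
        if (np.linalg.norm(w - A @ alpha) <= tol
                and np.linalg.norm(alpha - (y - A.T @ w)) <= tol):
            return w, alpha, k
    return w, alpha, max_iter

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 40))
A *= 3.0 / np.linalg.norm(A, 2)      # hypothetical scaling: sigma_1 = 3
y = rng.standard_normal(40)
sigma1 = np.linalg.norm(A, 2)
theta3 = 2.0 * (np.sqrt(1.0 + sigma1**2) - 1.0) / sigma1**2

w_cg, cg_iters = conjugate_gradient(lambda v: v + A @ (A.T @ v), A @ y)
w_qtz, alpha_qtz, qtz_iters = quartz(A, y, theta3)
w_direct = np.linalg.solve(np.eye(10) + A @ A.T, A @ y)
print(cg_iters, qtz_iters)
```

Each Quartz step costs two matrix-vector products with $A$ and $A^T$ plus vector updates, so a lower iteration count for CG does not by itself decide which method is faster in wall-clock time.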

[Figure 8: four panels comparing the methods with CG. Top row: objective value against iteration. Bottom row: objective value against CPU seconds. Left panels: PDFP1*, PDFP2*, QTZ*, CG. Right panels: QTZ*, CG.]

Figure 8: Performance of the optimal versions of the algorithms proposed in this paper compared with the conjugate gradient algorithm, applied to solve the problem (4). The dimensions considered were d = 2, m = and n = 5. The matrix $A \in \mathbb{R}^{d \times N}$ and the vector $y \in \mathbb{R}^N$ were randomly generated. For simplicity of notation we have denoted Algorithm 3.2 by PDFP1*, Algorithm 3.3 by PDFP2*, Algorithm 4.1 by QTZ* and conjugate gradient by CG. The pictures on the top show the objective values against the number of iterations, while the bottom ones show the objective values against the CPU time. The right pictures present the results of QTZ* and CG from the left ones with the horizontal axis rescaled. Note that although QTZ* took more iterations than CG, its computational time for solving the problem was less than that of CG.

7. Conclusion

In this paper we have proposed and analyzed several algorithms for solving the ridge regression problem and its dual. We have developed a parameterized family of fixed point methods applied to various equivalent reformulations of the optimality conditions. We have performed a convergence analysis and obtained complexity results for these methods, revealing interesting geometrical connections between the convergence speed and the spectral properties of the iteration matrices. Our main method achieves the optimal accelerated rate of Nesterov. We have performed some numerical experiments to illustrate the properties of our algorithms, as well as a comparison with the conjugate gradient algorithm. The numerical experiments indicate that our main algorithm is competitive with the conjugate gradient algorithm.

References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, 2003.
[2] S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss, J. Mach. Learn. Res.
[3] S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, Math. Program., Ser. A.
[4] T. Zhang, On the dual formulation of regularized linear systems with convex risks, Machine Learning.
[5] A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J. Math. Imaging Vis.
[6] N. Komodakis, J. C. Pesquet, Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems, IEEE Signal Process. Mag.
[7] Z. Qu, P. Richtárik, T. Zhang, Quartz: Randomized dual coordinate ascent with arbitrary sampling, in: Adv. Neural Inf. Process. Syst. 28, 2015.
[8] K. J. Arrow, L. Hurwicz, Gradient method for concave programming I: Local results, in: K. J. Arrow, L. Hurwicz, H. Uzawa (Eds.), Studies in Linear and Nonlinear Programming, Stanford University Press, Stanford, 1958.
[9] H. Uzawa, Iterative methods for concave programming, in: K. J. Arrow, L. Hurwicz, H. Uzawa (Eds.), Studies in Linear and Nonlinear Programming, Stanford University Press, Stanford, 1958.
[10] A. E. Hoerl, Application of ridge analysis to regression problems, Chem. Eng. Prog.
[11] A. E. Hoerl, R. W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics.
[12] M. El-Dereny, N. I. Rashwan, Solving multicollinearity problem using ridge regression models, Int. J. Contemp. Math. Sci.
[13] D. M. Hawkins, X. Yin, A faster algorithm for ridge regression of reduced rank data, Comput. Statist. Data Anal.
[14] C. Saunders, A. Gammerman, V. Vovk, Ridge regression learning algorithm in dual variables, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., 1998.
[15] H. D. Vinod, A survey of ridge regression and related techniques for improvements over ordinary least squares, The Review of Economics and Statistics.
[16] T. C. Silva, A. A. Ribeiro, G. A. Periçaro, A new accelerated algorithm for ill-conditioned ridge regression problems, Comp. Appl. Math. To appear.

The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression

The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression Ademir Alves Ribeiro Peter Richtárik August 28, 27 Abstract We study the ridge regression L 2 regularized least squares problem and

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

arxiv: v2 [math.na] 28 Jan 2016

arxiv: v2 [math.na] 28 Jan 2016 Stochastic Dual Ascent for Solving Linear Systems Robert M. Gower and Peter Richtárik arxiv:1512.06890v2 [math.na 28 Jan 2016 School of Mathematics University of Edinburgh United Kingdom December 21, 2015

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization Iranian Journal of Operations Research Vol. 4, No. 1, 2013, pp. 88-107 Research Note A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization B. Kheirfam We

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Some definitions. Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization. A-inner product. Important facts

Some definitions. Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization. A-inner product. Important facts Some definitions Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization M. M. Sussman sussmanm@math.pitt.edu Office Hours: MW 1:45PM-2:45PM, Thack 622 A matrix A is SPD (Symmetric

More information

A PREDICTOR-CORRECTOR PATH-FOLLOWING ALGORITHM FOR SYMMETRIC OPTIMIZATION BASED ON DARVAY'S TECHNIQUE

A PREDICTOR-CORRECTOR PATH-FOLLOWING ALGORITHM FOR SYMMETRIC OPTIMIZATION BASED ON DARVAY'S TECHNIQUE Yugoslav Journal of Operations Research 24 (2014) Number 1, 35-51 DOI: 10.2298/YJOR120904016K A PREDICTOR-CORRECTOR PATH-FOLLOWING ALGORITHM FOR SYMMETRIC OPTIMIZATION BASED ON DARVAY'S TECHNIQUE BEHROUZ

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

A Review of Linear Programming

A Review of Linear Programming A Review of Linear Programming Instructor: Farid Alizadeh IEOR 4600y Spring 2001 February 14, 2001 1 Overview In this note we review the basic properties of linear programming including the primal simplex

More information

Symmetric Matrices and Eigendecomposition

Symmetric Matrices and Eigendecomposition Symmetric Matrices and Eigendecomposition Robert M. Freund January, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 Symmetric Matrices and Convexity of Quadratic Functions

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Primal-dual relationship between Levenberg-Marquardt and central trajectories for linearly constrained convex optimization

Primal-dual relationship between Levenberg-Marquardt and central trajectories for linearly constrained convex optimization Primal-dual relationship between Levenberg-Marquardt and central trajectories for linearly constrained convex optimization Roger Behling a, Clovis Gonzaga b and Gabriel Haeser c March 21, 2013 a Department

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Convergence of a Class of Stationary Iterative Methods for Saddle Point Problems

Convergence of a Class of Stationary Iterative Methods for Saddle Point Problems Convergence of a Class of Stationary Iterative Methods for Saddle Point Problems Yin Zhang 张寅 August, 2010 Abstract A unified convergence result is derived for an entire class of stationary iterative methods

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Here is an example of a block diagonal matrix with Jordan Blocks on the diagonal: J

Here is an example of a block diagonal matrix with Jordan Blocks on the diagonal: J Class Notes 4: THE SPECTRAL RADIUS, NORM CONVERGENCE AND SOR. Math 639d Due Date: Feb. 7 (updated: February 5, 2018) In the first part of this week s reading, we will prove Theorem 2 of the previous class.

More information

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012. Math 5620 - Introduction to Numerical Analysis - Class Notes Fernando Guevara Vasquez Version 1990. Date: January 17, 2012. 3 Contents 1. Disclaimer 4 Chapter 1. Iterative methods for solving linear systems

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

Contraction Methods for Convex Optimization and monotone variational inequalities No.12

Contraction Methods for Convex Optimization and monotone variational inequalities No.12 XII - 1 Contraction Methods for Convex Optimization and monotone variational inequalities No.12 Linearized alternating direction methods of multipliers for separable convex programming Bingsheng He Department

More information

1 Number Systems and Errors 1

1 Number Systems and Errors 1 Contents 1 Number Systems and Errors 1 1.1 Introduction................................ 1 1.2 Number Representation and Base of Numbers............. 1 1.2.1 Normalized Floating-point Representation...........

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

arxiv: v1 [math.oc] 23 May 2017

arxiv: v1 [math.oc] 23 May 2017 A DERANDOMIZED ALGORITHM FOR RP-ADMM WITH SYMMETRIC GAUSS-SEIDEL METHOD JINCHAO XU, KAILAI XU, AND YINYU YE arxiv:1705.08389v1 [math.oc] 23 May 2017 Abstract. For multi-block alternating direction method

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

A Quick Tour of Linear Algebra and Optimization for Machine Learning

A Quick Tour of Linear Algebra and Optimization for Machine Learning A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators

More information

Lecture 8 : Eigenvalues and Eigenvectors

Lecture 8 : Eigenvalues and Eigenvectors CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

arxiv: v1 [math.oc] 21 Apr 2016

arxiv: v1 [math.oc] 21 Apr 2016 Accelerated Douglas Rachford methods for the solution of convex-concave saddle-point problems Kristian Bredies Hongpeng Sun April, 06 arxiv:604.068v [math.oc] Apr 06 Abstract We study acceleration and

More information

A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS

A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS Working Paper 01 09 Departamento de Estadística y Econometría Statistics and Econometrics Series 06 Universidad Carlos III de Madrid January 2001 Calle Madrid, 126 28903 Getafe (Spain) Fax (34) 91 624

More information

The speed of Shor s R-algorithm

The speed of Shor s R-algorithm IMA Journal of Numerical Analysis 2008) 28, 711 720 doi:10.1093/imanum/drn008 Advance Access publication on September 12, 2008 The speed of Shor s R-algorithm J. V. BURKE Department of Mathematics, University

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Lecture 8 Plus properties, merit functions and gap functions. September 28, 2008

Lecture 8 Plus properties, merit functions and gap functions. September 28, 2008 Lecture 8 Plus properties, merit functions and gap functions September 28, 2008 Outline Plus-properties and F-uniqueness Equation reformulations of VI/CPs Merit functions Gap merit functions FP-I book:

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information
