Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018


Instructor: Quoc Tran-Dinh
Scribe: Quoc Tran-Dinh

Lecture 4: Selected first-order methods for large-scale convex optimization

Index Terms: standard gradient methods; accelerated gradient methods; proximal gradient methods; FISTA and enhancements; smoothing techniques; conditional gradient and Frank-Wolfe methods; coordinate descent methods; stochastic gradient descent methods; mirror descent method.

Copyright: This lecture is released under a Creative Commons license (see the full text of the license).

1 Gradient-type methods for unconstrained convex problems

We start by describing some numerical methods for a simple unconstrained convex minimization problem, which will later be generalized to composite problems and constrained settings.

1.1 Brief introduction

We begin this lecture by studying a classical gradient descent method for solving the following unconstrained convex optimization problem:
\[ f^\star := \min_{x\in\mathbb{R}^p} f(x), \tag{1} \]
where $f : \mathbb{R}^p \to \mathbb{R}\cup\{+\infty\}$ is a proper, closed, and convex function.

Assume that problem (1) has a solution. Then $f^\star$ is finite, and the optimal solution set of (1) can be written as
\[ \mathcal{X}^\star := \{ x^\star \in \mathrm{dom}(f) : f(x^\star) = f^\star \}. \tag{2} \]
Under the assumption that $f \in \Gamma_0(\mathbb{R}^p)$, we can show that $\mathcal{X}^\star$ is closed and convex. Since $f$ is convex, any point $x^\star \in \mathcal{X}^\star$ is a global solution of (1), i.e., $f(x) \ge f(x^\star)$ for all $x \in \mathrm{dom}(f)$.

1.1.1 Representative examples

Let us consider some common examples to motivate this topic.

Example 1 (Least squares). The first and simplest example is the least-squares problem of the form
\[ f^\star := \min_{x\in\mathbb{R}^p} \Big\{ f(x) := \tfrac{1}{2}\|Ax - b\|^2 \Big\}, \tag{3} \]
where $A\in\mathbb{R}^{n\times p}$ and $b\in\mathbb{R}^n$. This problem has many applications, which we will discuss in the next sections.

Example 2 (Empirical risk minimization). The second example is the following empirical risk convex minimization problem:
\[ f^\star := \min_{x\in\mathbb{R}^p} \Big\{ f(x) := \tfrac{1}{n}\sum_{i=1}^n f_i(x) \Big\}, \tag{4} \]

where $f_i : \mathbb{R}^p \to \mathbb{R}\cup\{+\infty\}$ are $n$ proper, closed, and convex functions. The function $f_i$ is often induced by a loss function that depends on a given dataset. For instance, $f_i(x) = \log\big(1 + e^{-y_i(a_i^\top x + \mu_i)}\big)$ in logistic regression, where $a_i\in\mathbb{R}^p$, $\mu_i\in\mathbb{R}$, and $y_i\in\{-1,1\}$ are given data. Several loss functions have been used in statistics, machine learning, and data analysis. More details on applications can be found in Lecture 1.

Example 3 (Min-max problem). Another example is the following structured convex optimization problem:
\[ f^\star := \min_{x\in\mathbb{R}^p} \Big\{ f(x) := \max_{u\in\mathcal{U}}\big\{ \langle x, Au\rangle - \varphi(u)\big\} \Big\}, \tag{5} \]
where $\varphi\in\Gamma_0(\mathbb{R}^n)$, $\mathcal{U}$ is a nonempty, closed, and convex set in $\mathbb{R}^n$, and $A\in\mathbb{R}^{p\times n}$. This problem is also called a min-max problem or a saddle-point problem. As a special example, we can consider the Chebyshev approximation problem $\min_{x\in\mathbb{R}^p}\|Ax - b\|_\infty$, which can be written in the form (5).

1.1.2 Optimality condition

In order to solve problem (1), we must rely on some intermediate characterization of $f$ to develop algorithms. The most important characterization is the first-order condition (also called Fermat's rule).

Lemma 1.1 (Fermat's rule). Let $f\in\Gamma_0(\mathbb{R}^p)$ in (1). Then, the following condition is necessary and sufficient for $x^\star\in\mathrm{dom}(f)$ to be an optimal solution of (1):
\[ 0 \in \partial f(x^\star). \tag{6} \]

Proof. We note that $x^\star\in\mathrm{dom}(f)$ is an optimal solution of (1) iff $f(x) \ge f(x^\star) = f(x^\star) + \langle 0, x - x^\star\rangle$ for all $x$, which exactly shows that $0\in\partial f(x^\star)$ by the definition of the subdifferential.

If $f$ is smooth, then condition (6) reduces to the following nonlinear equation:
\[ \nabla f(x^\star) = 0. \tag{7} \]

Unfortunately, in practice, we cannot find an exact solution $x^\star$ of (1), but only an approximation in the following sense:

Definition 1. Given an accuracy $\varepsilon > 0$, we say that $\bar{x}$ is an $\varepsilon$-solution of (1) if $f(\bar{x}) - f^\star \le \varepsilon$.

Here, we note that $f(\bar{x}) \ge f^\star$, which already gives us a lower bound. In the sequel, we aim at finding an $\varepsilon$-solution $\bar{x}$ of (1). In some methods, we may be able to find $\bar{x}$ such that $\|\bar{x} - x^\star\| \le \varepsilon$ in a given norm. When $f$ is differentiable, we can also find $\bar{x}$ such that $\|\nabla f(\bar{x})\| \le \varepsilon$.

1.2 Gradient methods

We study different gradient methods for solving the unconstrained convex minimization problem (1). Apart from convexity, we also require the following Lipschitz gradient continuity assumption:

Assumption 1. The objective function $f$ of (1) is smooth and convex, and its gradient $\nabla f$ is Lipschitz continuous on $\mathrm{dom}(f)$ with Lipschitz constant $L_f\in[0,\infty)$, i.e.,
\[ \|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|, \quad \forall x, y \in \mathrm{dom}(f). \]

Assumption 1 restricts the applicability of (1), but still covers many important problems. For instance, $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ and the logistic loss satisfy this assumption. There are several ways of deriving gradient methods. We show two approaches below.
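Before deriving the methods, here is a quick numerical illustration of the optimality condition (7) on the least-squares example (3). This is a minimal Python sketch, assuming only numpy; the problem sizes are arbitrary choices for illustration.

```python
import numpy as np

# Random least-squares instance: f(x) = 0.5 * ||A x - b||^2
rng = np.random.default_rng(0)
n, p = 50, 20                      # arbitrary sizes for illustration
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

# A minimizer of f (here via numpy's least-squares solver)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Fermat's rule (7): the gradient must vanish at x_star
grad = A.T @ (A @ x_star - b)
print("||grad f(x*)|| =", np.linalg.norm(grad))   # ~1e-13: optimality holds
```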

1.2.1 Deriving the gradient method from the fixed-point principle

From the optimality condition (7), we have $\nabla f(x^\star) = 0$. We can rewrite this equation as
\[ x^\star = x^\star - \tfrac{1}{L}\nabla f(x^\star) =: \mathrm{Grad}_L(x^\star), \]
for some $L\in(0,+\infty)$. This formulation shows that $x^\star$ is a fixed point of the mapping $\mathrm{Grad}_L := \mathbb{I} - \tfrac{1}{L}\nabla f$. Hence, we can apply the Picard iterative scheme $x^{k+1} := \mathrm{Grad}_L(x^k)$ to approximate this fixed point, which reads
\[ x^{k+1} := x^k - \tfrac{1}{L}\nabla f(x^k), \tag{8} \]
where $\alpha := \tfrac{1}{L} > 0$ is a given step-size, which can be fixed or can be adaptively updated as $\alpha_k := \tfrac{1}{L_k} > 0$.

Question: The question is how to choose the step-size $\alpha_k$ such that the sequence $\{x^k\}$ generated by scheme (8) converges to $x^\star$. Key to this is Assumption 1. Under this assumption, we can show that, with a proper choice of $\alpha$, the sequence $\{x^k\}$ generated by the gradient scheme (8) converges to a solution of (1). More practically, we can show that $x^k$ approximates $x^\star$ in the sense of Definition 1.

1.2.2 Deriving the gradient method from a quadratic surrogate of f

Another view of the gradient method is as follows. Let $x^k$ be the current iterate. We consider the quadratic model
\[ Q_L(x; x^k) := f(x^k) + \langle\nabla f(x^k), x - x^k\rangle + \tfrac{L}{2}\|x - x^k\|^2, \tag{9} \]
where $L > 0$ is a given constant. Clearly, $Q_L(x^k; x^k) = f(x^k)$, and $f(x) \le Q_{L_f}(x; x^k)$ for any $x\in\mathbb{R}^p$ (see Lecture 3). Our objective is to build a surrogate $Q_L$ of $f$ such that $Q_L$ is easy to minimize. By minimizing $Q_L(\cdot; x^k)$ over $x$, we obtain the solution $x^{k+1}$ such that $Q_L(x^{k+1}; x^k) = \min_{x\in\mathbb{R}^p} Q_L(x; x^k)$, where
\[ x^{k+1} := \arg\min_{x\in\mathbb{R}^p} Q_L(x; x^k) = x^k - \tfrac{1}{L}\nabla f(x^k). \tag{10} \]
Clearly, (10) is exactly the same as (8). The main idea is illustrated in Figure 1.

[Figure 1: An illustration of the gradient method for solving (1): the surrogate $Q_L(x; x^k)$ lies above $f$, the linearization $\ell(x; x^k)$ lies below, and $x^{k+1}$ minimizes the surrogate.]

The goal is to find $L > 0$ such that $f(x^{k+1}) \le Q_L(x^{k+1}; x^k)$. Under Assumption 1, an obvious choice is $L \ge L_f$. Clearly, the step-size $\alpha$ in (8) relates to $L$ as $\alpha = \tfrac{1}{L}$. If $L \ge L_f$, then we have $\alpha\in(0, \tfrac{1}{L_f}]$. But, in practice, we do not need to choose $L \ge L_f$ as long as $f(x^{k+1}) \le Q_L(x^{k+1}; x^k)$ holds.

1.2.3 Descent lemma

We prove the following key lemma, which will be used to analyze the convergence of the gradient scheme (8) and of the accelerated gradient methods below.

Lemma 1.2. Let $f\in\mathcal{S}^{1,1}_{L_f,\mu_f}(\mathbb{R}^p)$ (i.e., $f$ satisfies Assumption 1 and is $\mu_f$-strongly convex). Then, $x^{k+1}$ generated by (8) satisfies the following estimate:
\[ f(x^{k+1}) \le f(x) + \langle\nabla f(x^k), x^k - x\rangle - \Big(1 - \tfrac{L_f}{2L}\Big)\tfrac{1}{L}\|\nabla f(x^k)\|^2 - \tfrac{\mu_f}{2}\|x^k - x\|^2, \quad \forall x\in\mathrm{dom}(f). \tag{11} \]

Proof. By the $\mu_f$-strong convexity of $f$, we have $f(x) \ge f(x^k) + \langle\nabla f(x^k), x - x^k\rangle + \tfrac{\mu_f}{2}\|x - x^k\|^2$, which implies $f(x^k) \le f(x) + \langle\nabla f(x^k), x^k - x\rangle - \tfrac{\mu_f}{2}\|x - x^k\|^2$. On the other hand, by the Lipschitz gradient continuity of $f$, we have $f(x^{k+1}) \le f(x^k) + \langle\nabla f(x^k), x^{k+1} - x^k\rangle + \tfrac{L_f}{2}\|x^{k+1} - x^k\|^2$. Summing up the last two inequalities and using $x^{k+1} - x^k = -\tfrac{1}{L}\nabla f(x^k)$, we have
\[ f(x^{k+1}) \le f(x) + \langle\nabla f(x^k), x^{k+1} - x\rangle + \tfrac{L_f}{2}\|x^{k+1} - x^k\|^2 - \tfrac{\mu_f}{2}\|x^k - x\|^2 = f(x) + \langle\nabla f(x^k), x^k - x\rangle - \Big(1 - \tfrac{L_f}{2L}\Big)\tfrac{1}{L}\|\nabla f(x^k)\|^2 - \tfrac{\mu_f}{2}\|x^k - x\|^2, \]
which proves (11).

1.2.4 The algorithm and its convergence

Now, we can state the gradient descent method (8) formally as in Algorithm 1 below. Here, we incorporate a relative stopping criterion and a line-search procedure (described below) to make it practical. If we skip the line-search step and simply choose $L = L_f$, then we obtain a vanilla gradient method.

Algorithm 1 (Gradient descent algorithm (GDA))
1: Inputs: Choose an arbitrary initial point $x^0\in\mathbb{R}^p$ and a desired accuracy $\varepsilon > 0$.
2: Output: An $\varepsilon$-solution $x^k$ of (1).
3: For $k = 0, 1, \dots, k_{\max}$, perform:
4:   Compute the gradient $\nabla f(x^k)$. If $\tfrac{\|\nabla f(x^k)\|}{\max\{1, \|\nabla f(x^0)\|\}} \le \varepsilon$, then Terminate.
5:   Perform a line-search procedure to find $L$ such that $0.5 L_f < L \le L_f$ (see Subsection 1.6).
6:   Update $x^{k+1} := x^k - \tfrac{1}{L}\nabla f(x^k)$.
7: End for

(At Step 4, we use $\max\{1, \|\nabla f(x^0)\|\}$ instead of $\|\nabla f(x^0)\|$ to guard against $\|\nabla f(x^0)\|$ being zero or small.)

The main computation of Algorithm 1 (i.e., its per-iteration complexity) is:
- the evaluation of $\nabla f$, and
- the evaluation of $L$: if we know the Lipschitz constant $L_f$, then we take $L = L_f$ (the optimal value in the worst-case context); if we use a line-search procedure, then, at each iteration, we need to evaluate approximately two function values $f(x)$ on average by using a bisection procedure (i.e., $L \to 2L$).

A minimal implementation of Algorithm 1 with a fixed step-size is sketched below.
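The following Python sketch implements Algorithm 1 with the fixed step-size $\alpha = 1/L_f$ (no line-search), using the relative stopping criterion of Step 4, on the least-squares example (3). The function names and problem sizes are illustrative choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, Lf, eps=1e-6, k_max=10000):
    """Algorithm 1 (GDA) with the fixed step-size 1/Lf and the relative
    stopping criterion ||grad f(x^k)|| <= eps * max(1, ||grad f(x^0)||)."""
    x = x0.copy()
    g0 = np.linalg.norm(grad_f(x0))
    for k in range(k_max):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps * max(1.0, g0):
            break
        x = x - g / Lf            # x^{k+1} := x^k - (1/L) grad f(x^k)
    return x, k

# Least-squares instance: f(x) = 0.5 ||A x - b||^2, grad f(x) = A^T (A x - b)
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 30)); b = rng.standard_normal(100)
Lf = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant ||A^T A||
x, iters = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(30), Lf)
print(iters, np.linalg.norm(A.T @ (A @ x - b)))
```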

Using Lemma 1.2, we can prove the convergence of Algorithm 1 as follows.

Theorem 1.3. Assume that $f$ in (1) satisfies Assumption 1 and $x^\star$ is an optimal solution of (1). Let $\{x^k\}$ be the sequence generated by scheme (8) (or by Algorithm 1) with $L > \tfrac{L_f}{2}$. Then the following bound holds:
\[ f(x^k) - f^\star \le \frac{2L^2(f(x^0) - f^\star)\,\|x^0 - x^\star\|^2}{2L^2\|x^0 - x^\star\|^2 + k(2L - L_f)(f(x^0) - f^\star)}, \quad \forall k \ge 1. \tag{12} \]
The right-hand side is optimized by choosing $L := L_f$, in which case (12) becomes $f(x^k) - f^\star \le \tfrac{2L_f\|x^0 - x^\star\|^2}{k + 4}$.

From Theorem 1.3, we can also see that $f(x^k) - f^\star \le \varepsilon$ once $k + 4 \ge \tfrac{2L_f\|x^0 - x^\star\|^2}{\varepsilon}$. Let $R_0 := \|x^0 - x^\star\|$. Then, the worst-case complexity of Algorithm 1 to achieve an $\varepsilon$-solution $x^k$ such that $f(x^k) - f^\star \le \varepsilon$ is $\mathcal{O}\big(\tfrac{L_f R_0^2}{\varepsilon}\big)$.

Proof. There are many ways to prove this theorem. We choose the argument from [30]. For simplicity, $\|\cdot\|$ means $\|\cdot\|_2$. Let us start with a simple derivation using (8):
\[ \|x^{k+1} - x^\star\|^2 = \|x^k - \tfrac{1}{L}\nabla f(x^k) - x^\star\|^2 = \|x^k - x^\star\|^2 - \tfrac{2}{L}\langle\nabla f(x^k), x^k - x^\star\rangle + \tfrac{1}{L^2}\|\nabla f(x^k)\|^2. \]
Using $\langle\nabla f(x) - \nabla f(y), x - y\rangle \ge \tfrac{1}{L_f}\|\nabla f(x) - \nabla f(y)\|^2$ from Lecture 2 with $x := x^k$, $y := x^\star$ and $\nabla f(x^\star) = 0$, we have $\langle\nabla f(x^k), x^k - x^\star\rangle \ge \tfrac{1}{L_f}\|\nabla f(x^k)\|^2$. Substituting this estimate into the last equality, we get
\[ \|x^{k+1} - x^\star\|^2 \le \|x^k - x^\star\|^2 - \tfrac{1}{L}\Big(\tfrac{2}{L_f} - \tfrac{1}{L}\Big)\|\nabla f(x^k)\|^2. \]
Clearly, if $L > \tfrac{L_f}{2}$, then $\|x^{k+1} - x^\star\| \le \|x^k - x^\star\|$, so the sequence $\{\|x^k - x^\star\|\}$ is monotonically decreasing. Let us define $r_k := \|x^k - x^\star\|$ and $\Delta_k := f(x^k) - f^\star$. Using the convexity of $f$, we have
\[ \Delta_k = f(x^k) - f^\star \le \langle\nabla f(x^k), x^k - x^\star\rangle \le \|\nabla f(x^k)\|\,\|x^k - x^\star\| \le \|\nabla f(x^k)\|\, r_0, \]
by the monotonicity of $\{r_k\}$. Hence, $\|\nabla f(x^k)\| \ge \tfrac{\Delta_k}{r_0}$. Next, using (11) with $x := x^k$ and $\mu_f = 0$, we get $f(x^{k+1}) \le f(x^k) - \tfrac{1}{L}\big(1 - \tfrac{L_f}{2L}\big)\|\nabla f(x^k)\|^2$. Substituting $\|\nabla f(x^k)\| \ge \tfrac{\Delta_k}{r_0}$ into this inequality, we obtain
\[ \Delta_{k+1} \le \Delta_k - \tfrac{\omega}{r_0^2}\Delta_k^2, \quad\text{where } \omega := \tfrac{2L - L_f}{2L^2} > 0. \]
This also implies $\Delta_{k+1} \le \Delta_k$. Dividing both sides by $\Delta_k\Delta_{k+1} > 0$, we get $\tfrac{1}{\Delta_{k+1}} \ge \tfrac{1}{\Delta_k} + \tfrac{\omega}{r_0^2}\tfrac{\Delta_k}{\Delta_{k+1}} \ge \tfrac{1}{\Delta_k} + \tfrac{\omega}{r_0^2}$. By induction, we have $\tfrac{1}{\Delta_{k+1}} \ge \tfrac{1}{\Delta_0} + \tfrac{(k+1)\omega}{r_0^2}$, hence $\Delta_k \le \tfrac{\Delta_0 r_0^2}{r_0^2 + k\omega\Delta_0}$. Using the definitions of $\omega$, $r_0$, and $\Delta_k$, we finally obtain (12). If we choose $L := L_f$, then this estimate becomes $f(x^k) - f^\star \le \tfrac{2L_f\|x^0 - x^\star\|^2}{k+4}$ by using the fact that $f(x^0) - f^\star \le \tfrac{L_f}{2}\|x^0 - x^\star\|^2$, which follows from the Lipschitz continuity of $\nabla f$ and $\nabla f(x^\star) = 0$.

Theorem 1.3 shows that the objective residual sequence $\{f(x^k) - f^\star\}$ converges to zero at a sublinear rate $\mathcal{O}(\tfrac{1}{k})$. Moreover, from the stopping criterion of Algorithm 1, we have $\|\nabla f(x^k)\| \le \varepsilon\max\{1, \|\nabla f(x^0)\|\}$, which also certifies that $x^k$ is an approximate solution of (1).

1.2.5 Strongly convex case

If $f$ in (1) is strongly convex, then how do we exploit this in Algorithm 1? The following theorem guides the choice of step-size in Algorithm 1 and shows what convergence rate we can achieve.

Theorem 1.4. Under the assumptions of Theorem 1.3, let $f$ be $\mu_f$-strongly convex with $\mu_f > 0$, and let $\{x^k\}$ be generated by Algorithm 1 with $L$ chosen such that $2L \ge L_f + \mu_f$. Then, we have
\[ \|x^k - x^\star\|^2 \le \omega^k\|x^0 - x^\star\|^2 = \Big(1 - \tfrac{2\mu_f L_f}{L(\mu_f + L_f)}\Big)^k\|x^0 - x^\star\|^2. \]
If we choose $L := \tfrac{L_f + \mu_f}{2}$ and set $\kappa := \tfrac{L_f}{\mu_f}$ (the condition number of $f$), then
\[ \|x^k - x^\star\| \le \Big(1 - \tfrac{2}{\kappa + 1}\Big)^k\|x^0 - x^\star\| \quad\text{and}\quad f(x^k) - f^\star \le \tfrac{L_f}{2}\Big(\tfrac{\kappa - 1}{\kappa + 1}\Big)^{2k}\|x^0 - x^\star\|^2. \]
Hence, the worst-case iteration-complexity to achieve an $\varepsilon$-solution of (1) is $\mathcal{O}\big(\kappa\log(\tfrac{1}{\varepsilon})\big)$.

Proof. Using the estimate $\langle\nabla f(x) - \nabla f(y), x - y\rangle \ge \tfrac{\mu_f L_f}{\mu_f + L_f}\|x - y\|^2 + \tfrac{1}{\mu_f + L_f}\|\nabla f(x) - \nabla f(y)\|^2$ from Lecture 2 (for strongly convex functions with Lipschitz gradient) with $x := x^k$, $y := x^\star$ and $\nabla f(x^\star) = 0$, we can derive
\[ \|x^{k+1} - x^\star\|^2 = \|x^k - x^\star\|^2 - \tfrac{2}{L}\langle\nabla f(x^k), x^k - x^\star\rangle + \tfrac{1}{L^2}\|\nabla f(x^k)\|^2 \le \Big(1 - \tfrac{2\mu_f L_f}{L(\mu_f + L_f)}\Big)\|x^k - x^\star\|^2 - \Big(\tfrac{2}{L(\mu_f + L_f)} - \tfrac{1}{L^2}\Big)\|\nabla f(x^k)\|^2. \]
If $2L \ge L_f + \mu_f$, then $\tfrac{2}{L(\mu_f + L_f)} \ge \tfrac{1}{L^2}$, and hence $\|x^{k+1} - x^\star\|^2 \le \big(1 - \tfrac{2\mu_f L_f}{L(\mu_f + L_f)}\big)\|x^k - x^\star\|^2$. Moreover, $\omega := 1 - \tfrac{2\mu_f L_f}{L(\mu_f + L_f)}\in(0, 1)$. By induction, we obtain $\|x^k - x^\star\|^2 \le \omega^k\|x^0 - x^\star\|^2$, which proves the first statement of Theorem 1.4.

If we choose $L := \tfrac{L_f + \mu_f}{2}$ and set $\kappa := \tfrac{L_f}{\mu_f}$, then $\omega = \big(\tfrac{\kappa - 1}{\kappa + 1}\big)^2$, hence $\|x^k - x^\star\| \le \big(\tfrac{\kappa - 1}{\kappa + 1}\big)^k\|x^0 - x^\star\|$ and $f(x^k) - f^\star \le \tfrac{L_f}{2}\big(\tfrac{\kappa - 1}{\kappa + 1}\big)^{2k}\|x^0 - x^\star\|^2$, where the last inequality follows from the first one and the fact that $f(x^k) - f^\star \le \tfrac{L_f}{2}\|x^k - x^\star\|^2$ by the Lipschitz gradient continuity of $f$. This proves the second part of Theorem 1.4. For the last statement, we note that $1 - \tfrac{2}{\kappa + 1} = \tfrac{\kappa - 1}{\kappa + 1}$, and hence $f(x^k) - f^\star \le \tfrac{L_f}{2}\big(\tfrac{\kappa - 1}{\kappa + 1}\big)^{2k}R_0^2 \le \varepsilon$ holds once $k \ge \tfrac{\log(L_f R_0^2/(2\varepsilon))}{2\log(\frac{\kappa+1}{\kappa-1})} = \mathcal{O}\big(\kappa\log(\tfrac{1}{\varepsilon})\big)$. Here, we use $\big(\log\tfrac{\kappa+1}{\kappa-1}\big)^{-1} = \mathcal{O}(\kappa)$.

Remark 1. In the strongly convex case, the optimal step-size is $\alpha := \tfrac{1}{L} = \tfrac{2}{L_f + \mu_f}$, which depends on both $\mu_f$ and $L_f$. While $L_f$ can be estimated relatively cheaply by a power method (see Appendix A.1 and the sketch below), evaluating $\mu_f$ is in general expensive. Hence, if we do not know $\mu_f$, computing this optimal step-size remains a challenge.
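For the quadratic objective $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, we have $L_f = \|A^\top A\|$, which a power method estimates cheaply. The following is a minimal Python sketch of such a power iteration; the iteration count and tolerance are illustrative choices.

```python
import numpy as np

def power_method(A, n_iter=100, tol=1e-8):
    """Estimate L_f = ||A^T A|| (largest eigenvalue of A^T A)
    by power iteration on the map v -> A^T (A v)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        w = A.T @ (A @ v)           # one matrix-vector "power" step
        lam_new = v @ w             # Rayleigh-quotient estimate of ||A^T A||
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) <= tol * max(1.0, abs(lam_new)):
            break
        lam = lam_new
    return lam_new

A = np.random.default_rng(2).standard_normal((200, 50))
print(power_method(A), np.linalg.norm(A.T @ A, 2))  # the two values should agree
```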

1.3 Lower-bound complexity of first-order methods

The main information used in Algorithm 1 is the value of the gradient $\nabla f$ and of the function $f$ at each iteration. Any algorithm that uses only this information is called a first-order (gradient) method. Mathematically, any iterative first-order method for solving (1) generates a sequence $\{x^k\}$ based on the following scheme:
\[ x^{k+1} \in x^0 + \mathrm{span}\{\nabla f(x^0), \nabla f(x^1), \dots, \nabla f(x^k)\}, \tag{M-FO} \]
where $\mathrm{span}$ is the linear span of a collection of vectors in $\mathbb{R}^p$.

Informally, a first-order method $\mathcal{M}_{FO}$ is optimal for the class of problems (1) satisfying Assumption 1 if no other first-order gradient method can solve every problem in this class with a better complexity than $\mathcal{M}_{FO}$ (up to a constant factor). In order to show a lower bound on the convergence rate of first-order gradient methods, we construct a specific problem in this class and establish a target lower-bound complexity; then, we show that no first-order gradient algorithm can beat this lower bound. We proceed as follows.

Nesterov showed in [30] that there exists an instance of $f$ satisfying Assumption 1 such that any first-order scheme (M-FO) exhibits a lower-bound $\Omega(\tfrac{1}{k^2})$ rate. More formally, we state this result in the following theorem.

Theorem 1.5. For any integer $k$ with $1 \le k \le \tfrac{p-1}{2}$ and any $x^0\in\mathbb{R}^p$, there exists a convex function $f$ for problem (1) satisfying Assumption 1 such that
\[ f(x^k) - f^\star \ge \frac{3L_f\|x^0 - x^\star\|^2}{32(k + 1)^2}, \tag{13} \]
for any sequence $\{x^k\}$ generated by (M-FO) starting from $x^0$.

Proof. Nesterov constructed the following function:
\[ f(x) := \frac{L_f}{4}\Big(\frac{1}{2}\Big(x_1^2 + \sum_{i=1}^{p-1}(x_i - x_{i+1})^2 + x_p^2\Big) - x_1\Big). \]
This is a quadratic function, which can be written as $f(x) = \tfrac{L_f}{4}\big(\tfrac{1}{2}x^\top Qx - \langle e_1, x\rangle\big)$, where $Q$ is the tridiagonal matrix with $2$ on the diagonal and $-1$ on the off-diagonals, and $e_1 = (1, 0, \dots, 0)^\top$. Clearly, this function is convex, and $\nabla f(x) = \tfrac{L_f}{4}(Qx - e_1)$, which is Lipschitz continuous with constant $L_f$, since $0 \preceq \tfrac{L_f}{4}Q \preceq L_f\mathbb{I}$. From the optimality condition of (1), we have $Qx^\star = e_1$, which leads to the unique optimal solution $x^\star$ with $x^\star_i = 1 - \tfrac{i}{p+1}$, and the optimal value $f^\star = f(x^\star) = \tfrac{L_f}{4}\big(\tfrac{1}{2}\langle Qx^\star, x^\star\rangle - \langle e_1, x^\star\rangle\big) = -\tfrac{L_f}{8}\langle e_1, x^\star\rangle = -\tfrac{L_f}{8}\,\tfrac{p}{p+1}$. Moreover, we can show that $\|x^\star\|^2 \le \tfrac{p+1}{3}$.

Without loss of generality, we assume that $x^0 = 0$ (otherwise, we can shift the function by a linear term to get $x^0 = 0$). With $x^0 = 0$, we have $\nabla f(x^0) = -\tfrac{L_f}{4}e_1$. Since we apply the scheme (M-FO) to solve (1), we have $x^1\in\mathrm{span}\{e_1\}$. By the tridiagonal form of $Q$, we have $\nabla f(x^1)\in\mathrm{span}\{e_1, e_2\}$, which shows that $x^2\in\mathrm{span}\{e_1, e_2\}$. By induction, $x^k\in\mathrm{span}\{e_1, e_2, \dots, e_k\}$, where $e_i$ is the $i$-th unit vector of $\mathbb{R}^p$. Therefore,
\[ f(x^k) \ge \inf\{f(x) : x_{k+1} = \cdots = x_p = 0\} = -\frac{L_f}{8}\,\frac{k}{k+1}, \]
since the restriction of $f$ to the first $k$ coordinates is the $k$-dimensional instance of the same function. Taking $p = 2k + 1$, we have
\[ f(x^k) - f^\star \ge \frac{L_f}{8}\Big(\frac{p}{p+1} - \frac{k}{k+1}\Big) = \frac{L_f}{8}\,\frac{p - k}{(p+1)(k+1)} = \frac{L_f}{16(k+1)}. \]
Finally, using this inequality and $\|x^0 - x^\star\|^2 = \|x^\star\|^2 \le \tfrac{p+1}{3} = \tfrac{2(k+1)}{3}$, we have
\[ \frac{f(x^k) - f^\star}{\|x^0 - x^\star\|^2} \ge \frac{L_f}{16(k+1)}\cdot\frac{3}{2(k+1)} = \frac{3L_f}{32(k+1)^2}, \]
which is exactly (13).

As we have seen from Theorem 1.3, the convergence rate of the gradient method (8) is $\mathcal{O}(\tfrac{1}{k})$. This method is an instance of the first-order scheme (M-FO). But Theorem 1.5 shows that the lower-bound rate is $\Omega(\tfrac{1}{k^2})$. Therefore, we can conclude that the standard gradient method is suboptimal, which motivates the development of methods that match the lower bound. Note that, in the worst-case complexity sense, the lower-bound iteration-complexity corresponding to Theorem 1.5 is $\mathcal{O}(\tfrac{1}{\sqrt{\varepsilon}})$, while the worst-case iteration-complexity of the gradient method (8) is $\mathcal{O}(\tfrac{1}{\varepsilon})$. A numerical illustration of this worst-case instance follows.
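The following Python sketch builds Nesterov's tridiagonal quadratic and runs the gradient method (8) on it; the printed residuals stay above the span-restriction bound $\tfrac{L_f}{8}\big(\tfrac{p}{p+1} - \tfrac{k}{k+1}\big)$ from the proof, illustrating the slow worst-case behavior. Sizes and iteration counts are illustrative.

```python
import numpy as np

p, Lf = 201, 1.0
e1 = np.zeros(p); e1[0] = 1.0
# Nesterov's worst-case quadratic: f(x) = (Lf/4) * (0.5 x^T Q x - x_1)
Q = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
f = lambda x: (Lf / 4) * (0.5 * x @ (Q @ x) - x[0])
grad_f = lambda x: (Lf / 4) * (Q @ x - e1)

f_star = f(1.0 - np.arange(1, p + 1) / (p + 1))   # x*_i = 1 - i/(p+1)

x = np.zeros(p)                                   # x^0 = 0
for k in range(1, 101):                           # k <= (p-1)/2
    x = x - grad_f(x) / Lf                        # gradient step, alpha = 1/Lf
    if k in (10, 50, 100):
        # any first-order method satisfies this span-restriction bound
        bound = (Lf / 8) * (p / (p + 1) - k / (k + 1))
        print(k, f(x) - f_star, ">=", bound)
```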

1.4 The accelerated gradient algorithm: Nesterov's first optimal method

The accelerated gradient method was introduced by Yurii Nesterov in 1983 in [29] (this paper was originally in Russian, but was later translated to English). This method is also called an optimal gradient method (or Nesterov's optimal scheme) since it matches the lower-bound iteration complexity $\mathcal{O}(\tfrac{1}{\sqrt{\varepsilon}})$ presented in Theorem 1.5. There exist many variants of Nesterov's optimal gradient method. In this subsection, we present some variants and recent improvements.

1.4.1 Standard accelerated gradient scheme

To keep our presentation simple, we first directly present the standard accelerated gradient scheme; then, we analyze its convergence guarantee. Later on, we will explain why this method actually accelerates the standard gradient method and provide different views of it. Mathematically, the standard accelerated gradient scheme for solving (1) simply performs three steps:
\[ \begin{cases} x^{k+1} := y^k - \tfrac{1}{L_f}\nabla f(y^k), \\ t_{k+1} := \tfrac{1}{2}\big(1 + \sqrt{1 + 4t_k^2}\big), \\ y^{k+1} := x^{k+1} + \tfrac{t_k - 1}{t_{k+1}}(x^{k+1} - x^k), \end{cases} \tag{AGDA} \]
where $x^0\in\mathrm{dom}(f)$ is a given starting point, $y^0 := x^0$, and $t_0 := 1$.

There are different ways of presenting (AGDA). For example, if we define $\tau_k := \tfrac{1}{t_k}$ and $z^k := \tfrac{1}{\tau_k}\big(y^k - (1 - \tau_k)x^k\big)$, we can write (AGDA) in four steps as
\[ \begin{cases} y^k := (1 - \tau_k)x^k + \tau_k z^k, \\ x^{k+1} := y^k - \tfrac{1}{L_f}\nabla f(y^k), \\ z^{k+1} := z^k - \tfrac{1}{\tau_k}(y^k - x^{k+1}) = z^k - \tfrac{1}{\tau_k L_f}\nabla f(y^k), \\ \tau_{k+1} := \tfrac{\tau_k}{2}\big(\sqrt{\tau_k^2 + 4} - \tau_k\big), \end{cases} \tag{14} \]
where $\tau_0 := 1$ and $z^0 := x^0$.

We note that, from (AGDA), if we define $\beta_k := \tfrac{t_k - 1}{t_{k+1}}$, then we can rewrite (AGDA) in one line as
\[ x^{k+1} := x^k + \beta_{k-1}(x^k - x^{k-1}) - \tfrac{1}{L_f}\nabla f\big(x^k + \beta_{k-1}(x^k - x^{k-1})\big). \tag{15} \]
The last expression shows that the accelerated gradient step has a similar form to the so-called heavy-ball method [39]: $x^{k+1} = x^k - \tfrac{1}{L}\nabla f(x^k) + \gamma_k(x^k - x^{k-1})$, which adds a momentum term $\gamma_k(x^k - x^{k-1})$ to the gradient step. However, in (15) the gradient is evaluated at the extrapolated point $x^k + \beta_{k-1}(x^k - x^{k-1})$ instead of at $x^k$ as in the heavy-ball method. In addition, the convergence rate of the heavy-ball method is only $\mathcal{O}(\tfrac{1}{k})$.

1.4.2 The algorithm

We can state the standard accelerated gradient method algorithmically as in Algorithm 2 below. Each iteration of this algorithm requires:
- the evaluation of one gradient $\nabla f(y^k)$;
- the evaluation of $L$: if we choose $L := L_f$, then we just need to evaluate $L_f$ once at the beginning of the algorithm, e.g., by a power method (Appendix A.1); if we use a line-search procedure, then additional computation is required, see Subsection 1.6 for more details;
- the update of $y^k$: just one subtraction and one addition of two vectors, and one scalar-vector multiplication; the computational complexity of this step is $\mathcal{O}(p)$.

Algorithm 2 (Accelerated gradient descent algorithm)
1: Inputs: Choose an initial point $x^0\in\mathbb{R}^p$ and an accuracy $\varepsilon > 0$. Set $y^0 := x^0$ and $t_0 := 1$.
2: Output: An $\varepsilon$-solution $x^k$ of (1).
3: For $k = 0, 1, \dots, k_{\max}$, perform:
4:   Compute the gradient $\nabla f(y^k)$. If $\tfrac{\|x^{k+1} - x^k\|}{\max\{1, \|x^k\|\}} \le \varepsilon$, then Terminate.
5:   Update $t_{k+1} := \tfrac{1}{2}\big(1 + \sqrt{1 + 4t_k^2}\big)$ and
\[ \begin{cases} x^{k+1} := y^k - \tfrac{1}{L_f}\nabla f(y^k), \\ y^{k+1} := x^{k+1} + \tfrac{t_k - 1}{t_{k+1}}(x^{k+1} - x^k). \end{cases} \]
6: End for

Clearly, the per-iteration complexity of Algorithm 2 is nearly the same as that of Algorithm 1 except for the last step. However, as we will see in the next subsection, Algorithm 2 achieves a much better convergence rate.

1.4.3 Convergence and complexity analysis

The following theorem shows the convergence of Algorithm 2, whose proof relies on Lemma 1.2.

Theorem 1.6. Assume that $f$ in (1) satisfies Assumption 1 and $x^\star$ is an optimal solution of (1). Let $\{x^k\}$ be the sequence generated by (AGDA). Then, we have the following bound:
\[ f(x^k) - f(x) \le \frac{2L_f}{(k+1)^2}\|x^0 - x\|^2, \quad \forall x\in\mathbb{R}^p,\ k \ge 0. \tag{16} \]
Hence, the worst-case iteration-complexity to achieve an $\varepsilon$-solution $x^k$ of (1) such that $f(x^k) - f^\star \le \varepsilon$ is $\mathcal{O}\Big(\sqrt{\tfrac{L_f\|x^0 - x^\star\|^2}{\varepsilon}}\Big)$. This method is optimal: it matches the lower-bound complexity in Theorem 1.5 (up to a constant factor).

Proof. Using (11) from Lemma 1.2 with $x^k := y^k$, $\mu_f = 0$, and $L := L_f$, we get
\[ f(x^{k+1}) \le f(x) + \langle\nabla f(y^k), y^k - x\rangle - \tfrac{1}{2L_f}\|\nabla f(y^k)\|^2, \quad \forall x\in\mathbb{R}^p. \tag{17} \]
Using this estimate with $x = x^k$, we get
\[ f(x^{k+1}) \le f(x^k) + \langle\nabla f(y^k), y^k - x^k\rangle - \tfrac{1}{2L_f}\|\nabla f(y^k)\|^2. \tag{18} \]
Multiplying (17) by $\tau_k$ and (18) by $(1 - \tau_k)$ and summing the results, we get
\[ f(x^{k+1}) \le (1 - \tau_k)f(x^k) + \tau_k f(x) + \langle\nabla f(y^k), y^k - (1 - \tau_k)x^k - \tau_k x\rangle - \tfrac{1}{2L_f}\|\nabla f(y^k)\|^2. \]
Using $\tau_k z^k = y^k - (1 - \tau_k)x^k$, we derive from this inequality that
\[ \begin{aligned} f(x^{k+1}) - f(x) &\le (1 - \tau_k)\big[f(x^k) - f(x)\big] + \tau_k\langle\nabla f(y^k), z^k - x\rangle - \tfrac{1}{2L_f}\|\nabla f(y^k)\|^2 \\ &= (1 - \tau_k)\big[f(x^k) - f(x)\big] + \tfrac{\tau_k^2 L_f}{2}\Big[\|z^k - x\|^2 - \big\|z^k - \tfrac{1}{\tau_k L_f}\nabla f(y^k) - x\big\|^2\Big] \\ &= (1 - \tau_k)\big[f(x^k) - f(x)\big] + \tfrac{\tau_k^2 L_f}{2}\big[\|z^k - x\|^2 - \|z^{k+1} - x\|^2\big]. \end{aligned} \tag{19} \]
Noting that $\tau_k = \tfrac{1}{t_k}$, it is easy to check that $t_{k+1} := \tfrac{1}{2}\big(1 + \sqrt{1 + 4t_k^2}\big)$ is equivalent to $\tfrac{1 - \tau_{k+1}}{\tau_{k+1}^2} = \tfrac{1}{\tau_k^2}$. Rearranging (19) and using this fact, we get
\[ \tfrac{1}{\tau_k^2}\big[f(x^{k+1}) - f(x)\big] + \tfrac{L_f}{2}\|z^{k+1} - x\|^2 \le \tfrac{1 - \tau_k}{\tau_k^2}\big[f(x^k) - f(x)\big] + \tfrac{L_f}{2}\|z^k - x\|^2 = \tfrac{1}{\tau_{k-1}^2}\big[f(x^k) - f(x)\big] + \tfrac{L_f}{2}\|z^k - x\|^2. \]

By induction and using $\tau_0 = 1$ (so that $\tfrac{1 - \tau_0}{\tau_0^2} = 0$), we get
\[ \tfrac{1}{\tau_k^2}\big[f(x^{k+1}) - f(x)\big] + \tfrac{L_f}{2}\|z^{k+1} - x\|^2 \le \tfrac{1 - \tau_0}{\tau_0^2}\big[f(x^0) - f(x)\big] + \tfrac{L_f}{2}\|z^0 - x\|^2 = \tfrac{L_f}{2}\|z^0 - x\|^2. \]
This inequality implies $f(x^k) - f(x) \le \tfrac{\tau_{k-1}^2 L_f}{2}\|z^0 - x\|^2$, or equivalently (since $z^0 = x^0$):
\[ f(x^k) - f(x) \le \tfrac{\tau_{k-1}^2 L_f}{2}\|x^0 - x\|^2 = \tfrac{L_f}{2t_{k-1}^2}\|x^0 - x\|^2. \tag{20} \]
Now, we note that $t_{k+1} = \tfrac{1}{2}\big(1 + \sqrt{1 + 4t_k^2}\big) \ge t_k + \tfrac{1}{2}$, so $t_k \ge t_0 + \tfrac{k}{2} = \tfrac{k+2}{2}$, and hence $t_{k-1} \ge \tfrac{k+1}{2}$. Using this in (20), we obtain the bound (16). The worst-case complexity is a consequence of (16) when we substitute $x$ by $x^\star$. The optimality of the rate is due to Theorem 1.5.

To compare Algorithm 1 and Algorithm 2, assume that we can estimate a tight upper bound $R_0 = 1$ for $\|x^0 - x^\star\|$ and that $L_f = 100$. Let us take a tolerance $\varepsilon = 10^{-3}$ and aim at an approximate solution $x^k$ such that $f(x^k) - f^\star \le \varepsilon$.
- If we apply Algorithm 1 to solve (1), then we need at most $k_1 \approx \tfrac{2L_f R_0^2}{\varepsilon} = 200{,}000$ iterations.
- If we apply Algorithm 2 to solve (1), then we only need at most $k_2 \approx \sqrt{\tfrac{2L_f R_0^2}{\varepsilon}} \approx 448$ iterations.
Clearly, Algorithm 2 outperforms Algorithm 1 in terms of iterations, while having almost the same per-iteration complexity.

1.4.4 Strongly convex case

We consider an accelerated gradient method to solve (1) for an $L_f$-smooth and $\mu_f$-strongly convex function $f\in\mathcal{S}^{1,1}_{L_f,\mu_f}(\mathbb{R}^p)$. In this case, we can present (AGDA) in the simple form
\[ \begin{cases} x^{k+1} := y^k - \tfrac{1}{L_f}\nabla f(y^k), \\ y^{k+1} := x^{k+1} + \tfrac{\sqrt{L_f} - \sqrt{\mu_f}}{\sqrt{L_f} + \sqrt{\mu_f}}\big(x^{k+1} - x^k\big), \end{cases} \tag{21} \]
where $y^0 := x^0$. We do not provide a formal proof for this scheme, but let us briefly explain the idea. Using (11) with $x^k := y^k$ and $L := L_f$, we have
\[ f(x^{k+1}) \le f(x) + \langle\nabla f(y^k), y^k - x\rangle - \tfrac{1}{2L_f}\|\nabla f(y^k)\|^2 - \tfrac{\mu_f}{2}\|y^k - x\|^2. \]
In this case, we can choose the step-size $\alpha := \tfrac{1}{L_f}$ in the first line and the momentum coefficient $\tfrac{\sqrt{L_f} - \sqrt{\mu_f}}{\sqrt{L_f} + \sqrt{\mu_f}}$ in the second line of (21) (i.e., at Step 5 of Algorithm 2) to achieve the following convergence rate:
\[ f(x^k) - f^\star \le \frac{L_f + \mu_f}{2}\Big(1 - \sqrt{\tfrac{\mu_f}{L_f}}\Big)^k\|x^0 - x^\star\|^2. \]
This shows that the scheme (21) has an $\mathcal{O}\big(\sqrt{\tfrac{L_f}{\mu_f}}\log(\tfrac{1}{\varepsilon})\big)$ worst-case iteration-complexity. However, the step-size of this algorithm requires knowing the strong convexity parameter $\mu_f$, which is often very hard to estimate in practice. Comparing this complexity with Theorem 1.4, we see that (21) achieves a factor of $\sqrt{\kappa}$, while Theorem 1.4 only achieves a factor of $\kappa$, where $\kappa := \tfrac{L_f}{\mu_f}$ is the condition number of $f$. As proved in [28, 30], (21) is also optimal in the sense of first-order gradient methods for strongly convex $f$. We skip the details. A minimal implementation of (AGDA) is sketched below.
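Below is a minimal Python sketch of Algorithm 2 (the scheme (AGDA)) on the least-squares example, compared against plain gradient descent with the same budget; the accelerated iterates attain a smaller objective, as the analysis above predicts. All names and sizes are illustrative.

```python
import numpy as np

def agda(grad_f, x0, Lf, n_iter=500):
    """Accelerated gradient scheme (AGDA): gradient step at the
    extrapolated point y^k with the FISTA-type momentum t_{k+1}."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_next = y - grad_f(y) / Lf                     # x^{k+1} := y^k - (1/Lf) grad f(y^k)
        t_next = 0.5 * (1 + np.sqrt(1 + 4 * t**2))      # t_{k+1}
        y = x_next + ((t - 1) / t_next) * (x_next - x)  # momentum step
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 100)); b = rng.standard_normal(200)
grad = lambda x: A.T @ (A @ x - b)
Lf = np.linalg.norm(A.T @ A, 2)

x_acc = agda(grad, np.zeros(100), Lf, n_iter=300)
x_gd = np.zeros(100)
for _ in range(300):
    x_gd -= grad(x_gd) / Lf
f = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2
print(f(x_acc), f(x_gd))   # the accelerated method attains a smaller objective
```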

1.5 Nesterov's second optimal method: the estimate sequence approach

The accelerated scheme (AGDA) achieves the optimal rate, but it cannot easily be extended to handle non-Euclidean norms. We present another fast gradient scheme to solve the following constrained convex optimization problem, which covers (1) as a special case when $\mathcal{X} = \mathbb{R}^p$, and which can work with non-Euclidean norms under certain conditions:
\[ f^\star := \min_{x\in\mathcal{X}} f(x), \tag{22} \]
where $f\in\mathcal{F}^{1,1}_{L}(\mathbb{R}^p)$ and $\mathcal{X}$ is a simple, nonempty, closed, and convex subset of $\mathbb{R}^p$. The objective function $f$ in (22) still satisfies Assumption 1. Here, we say that $\mathcal{X}$ is simple if there is a proximity function $d_{\mathcal{X}}$ (see Lecture 2) such that
\[ \min_{x\in\mathcal{X}}\big\{ \langle v, x\rangle + d_{\mathcal{X}}(x) \big\} \]
can be solved efficiently (i.e., with a closed-form solution or with a low-order polynomial-time algorithm). Without loss of generality, we assume that $d_{\mathcal{X}}$ is $\mu_d$-strongly convex with $\mu_d = 1$.

1.5.1 The construction of Nesterov's optimal scheme

Given a point $y^k\in\mathrm{dom}(f)$, let $l_k(x) := f(y^k) + \langle\nabla f(y^k), x - y^k\rangle$, and let $a_k > 0$. We define
\[ \varphi_k(x) := \sum_{i=0}^k a_i l_i(x) = \sum_{i=0}^k a_i\big[f(y^i) + \langle\nabla f(y^i), x - y^i\rangle\big]. \tag{23} \]
Let $\{A_k\}$ be the sequence defined by $A_0 := a_0 > 0$ and $A_{k+1} := A_k + a_{k+1}$; hence, $A_k = \sum_{i=0}^k a_i$. We consider the following problem:
\[ \psi_k^\star := \min_{x\in\mathcal{X}}\big\{ \varphi_k(x) + L_f d_{\mathcal{X}}(x) \big\}, \tag{24} \]
where $L_f$ is the Lipschitz constant of $\nabla f$ and $d_{\mathcal{X}}$ is the proximity function of $\mathcal{X}$. Clearly, by the convexity of $f$, we have $l_k(x) \le f(x)$ and hence
\[ \varphi_k(x) = \sum_{i=0}^k a_i l_i(x) \le \sum_{i=0}^k a_i f(x) = A_k f(x), \quad \forall x\in\mathrm{dom}(f). \tag{25} \]
Our goal is to construct a sequence $\{x^k\}$ in $\mathcal{X}$ such that
\[ A_k f(x^k) \le \psi_k^\star := \min_{x\in\mathcal{X}}\big\{\varphi_k(x) + L_f d_{\mathcal{X}}(x)\big\}. \tag{R_k} \]
Combining this condition and (25), we obtain
\[ f(x^k) - f(x) \le \frac{L_f d_{\mathcal{X}}(x)}{A_k},\ \forall x\in\mathcal{X} \quad\Longrightarrow\quad f(x^k) - f^\star \le \frac{L_f d_{\mathcal{X}}(x^\star)}{A_k}. \tag{26} \]
Let us define $v^k$ to be the solution of (24), i.e.,
\[ v^k := \arg\min_{x\in\mathcal{X}}\big\{\varphi_k(x) + L_f d_{\mathcal{X}}(x)\big\}. \tag{27} \]
By the $1$-strong convexity of $d_{\mathcal{X}}$, we have $\psi_k^\star + \tfrac{L_f}{2}\|x - v^k\|^2 \le \varphi_k(x) + L_f d_{\mathcal{X}}(x)$. We construct $\{x^k\}$ recursively as follows. First, we compute the parameters $a_{k+1}$ and $A_{k+1} := A_k + a_{k+1}$ such that
\[ A_{k+1} \ge a_{k+1}^2. \tag{28} \]
Then, with $\tau_k := \tfrac{a_{k+1}}{A_{k+1}}$, we update
\[ y^{k+1} := (1 - \tau_k)x^k + \tau_k v^k. \tag{29} \]

Finally, we define
\[ x^{k+1} := T_{\mathcal{X}}(y^{k+1}) := \arg\min_{x\in\mathcal{X}}\Big\{ f(y^{k+1}) + \langle\nabla f(y^{k+1}), x - y^{k+1}\rangle + \tfrac{L_f}{2}\|x - y^{k+1}\|^2 \Big\}. \tag{30} \]
The following lemma shows that $(y^k, a_k, A_k)$ satisfies the condition (R_k) above.

Lemma 1.7. Let $(y^k, a_k, A_k)$ be generated by (28), (29), and (30), respectively. Then, (R_k) holds for all $k \ge 0$.

Proof. We prove this lemma by induction. For $k = 0$, we have $\varphi_0(x) = a_0\big(f(y^0) + \langle\nabla f(y^0), x - y^0\rangle\big)$. By (27) and the fact that $d_{\mathcal{X}}(x) \ge \tfrac{1}{2}\|x - y^0\|^2$ (as $d_{\mathcal{X}}$ is $1$-strongly convex with center $y^0$ and $\min_{x\in\mathcal{X}} d_{\mathcal{X}}(x) = 0$), we have, for any $a_0\in(0, 1]$,
\[ \psi_0^\star \ge a_0\Big( f(y^0) + \langle\nabla f(y^0), T_{\mathcal{X}}(y^0) - y^0\rangle + \tfrac{L_f}{2}\|T_{\mathcal{X}}(y^0) - y^0\|^2 \Big) \ge a_0 f(T_{\mathcal{X}}(y^0)) = a_0 f(x^0) = A_0 f(x^0) \]
(see Lecture 3; more precisely, this holds since $a_0\big(f(y^0) + \langle\nabla f(y^0), x - y^0\rangle\big) + L_f d_{\mathcal{X}}(x)$ is strongly convex and $a_0 \le 1$).

Assume that (R_k) holds for some $k \ge 0$. We start from the definition of $\varphi_k$ in (23) to derive $\varphi_{k+1}(x) = \sum_{i=0}^{k+1} a_i l_i(x) = \varphi_k(x) + a_{k+1}\big[f(y^{k+1}) + \langle\nabla f(y^{k+1}), x - y^{k+1}\rangle\big]$. Hence, we have
\[ \begin{aligned} \varphi_{k+1}(x) + L_f d_{\mathcal{X}}(x) &= \varphi_k(x) + L_f d_{\mathcal{X}}(x) + a_{k+1}\big[f(y^{k+1}) + \langle\nabla f(y^{k+1}), x - y^{k+1}\rangle\big] \\ &\ge \psi_k^\star + \tfrac{L_f}{2}\|x - v^k\|^2 + a_{k+1}\big[f(y^{k+1}) + \langle\nabla f(y^{k+1}), x - y^{k+1}\rangle\big] \\ &\overset{(R_k)}{\ge} A_k f(x^k) + a_{k+1}\big[f(y^{k+1}) + \langle\nabla f(y^{k+1}), x - y^{k+1}\rangle\big] + \tfrac{L_f}{2}\|x - v^k\|^2 \\ &\overset{(a)}{\ge} A_k f(y^{k+1}) + a_{k+1}f(y^{k+1}) + A_k\langle\nabla f(y^{k+1}), x^k - y^{k+1}\rangle + a_{k+1}\langle\nabla f(y^{k+1}), x - y^{k+1}\rangle + \tfrac{L_f}{2}\|x - v^k\|^2 \\ &\overset{(b)}{=} A_{k+1}f(y^{k+1}) + A_{k+1}\Big\langle\nabla f(y^{k+1}),\ \tfrac{A_k}{A_{k+1}}(x^k - y^{k+1}) + \tfrac{a_{k+1}}{A_{k+1}}(x - y^{k+1})\Big\rangle + \tfrac{L_f}{2}\|x - v^k\|^2. \end{aligned} \tag{31} \]
Here, we use $f(x^k) \ge f(y^{k+1}) + \langle\nabla f(y^{k+1}), x^k - y^{k+1}\rangle$ in (a) due to the convexity of $f$, and $A_{k+1} = A_k + a_{k+1}$ in (b).

Next, recall $\tau_k := \tfrac{a_{k+1}}{A_{k+1}}$; then $\tau_k\in(0, 1)$ and $\tfrac{A_k}{A_{k+1}} = 1 - \tau_k$. We have $\tfrac{A_k}{A_{k+1}}(x^k - y^{k+1}) + \tfrac{a_{k+1}}{A_{k+1}}(x - y^{k+1}) = (1 - \tau_k)x^k + \tau_k x - y^{k+1}$. Hence, consider $\hat{x} := (1 - \tau_k)x^k + \tau_k x$; then $\hat{x}\in\mathcal{X}$. Since $y^{k+1}$ is defined by (29), we have $\hat{x} - y^{k+1} = \tau_k(x - v^k)$, i.e., $x - v^k = \tfrac{1}{\tau_k}(\hat{x} - y^{k+1})$. Substituting these relations into (31), and using the fact that $\tfrac{1}{\tau_k^2} = \tfrac{A_{k+1}^2}{a_{k+1}^2} \ge A_{k+1}$ from (28), we can derive
\[ \begin{aligned} \varphi_{k+1}(x) + L_f d_{\mathcal{X}}(x) &\ge A_{k+1}f(y^{k+1}) + A_{k+1}\langle\nabla f(y^{k+1}), \hat{x} - y^{k+1}\rangle + \tfrac{L_f}{2\tau_k^2}\|\hat{x} - y^{k+1}\|^2 \\ &\ge A_{k+1}\Big[ f(y^{k+1}) + \langle\nabla f(y^{k+1}), \hat{x} - y^{k+1}\rangle + \tfrac{L_f}{2}\|\hat{x} - y^{k+1}\|^2 \Big] \\ &\ge A_{k+1}\min_{x'\in\mathcal{X}}\Big\{ f(y^{k+1}) + \langle\nabla f(y^{k+1}), x' - y^{k+1}\rangle + \tfrac{L_f}{2}\|x' - y^{k+1}\|^2 \Big\} \ge A_{k+1}f(x^{k+1}), \end{aligned} \]
where $x^{k+1}$ is computed by (30), and the last inequality holds since $f(x^{k+1}) \le f(y^{k+1}) + \langle\nabla f(y^{k+1}), x^{k+1} - y^{k+1}\rangle + \tfrac{L_f}{2}\|x^{k+1} - y^{k+1}\|^2$ by the Lipschitz continuity of $\nabla f$. Now, taking the minimum over $x\in\mathcal{X}$ on the left-hand side, we obtain $\psi_{k+1}^\star := \min_{x\in\mathcal{X}}\{\varphi_{k+1}(x) + L_f d_{\mathcal{X}}(x)\} \ge A_{k+1}f(x^{k+1})$, which shows that $x^{k+1}$ satisfies (R_{k+1}).

1.5.2 The algorithm and its convergence guarantee

Putting together all the ingredients analyzed above, we obtain Algorithm 3.

Algorithm 3 (Nesterov's optimal gradient method)
1: Inputs: Choose $y^0 := \arg\min_{x\in\mathcal{X}} d_{\mathcal{X}}(x)$ and $a_0\in(0, 1]$. Set $A_0 := a_0$ and $x^0 := T_{\mathcal{X}}(y^0)$.
2: Output: An $\varepsilon$-solution $x^k$ of (22).
3: For $k = 0, 1, \dots, k_{\max}$, perform:
4:   Find $a_{k+1}$ and $A_{k+1}$ such that $A_{k+1} := A_k + a_{k+1}$ and $A_{k+1} \ge a_{k+1}^2$.
5:   Compute $v^k := \arg\min_{x\in\mathcal{X}}\Big\{\sum_{i=0}^k a_i\big[f(y^i) + \langle\nabla f(y^i), x - y^i\rangle\big] + L_f d_{\mathcal{X}}(x)\Big\}$.
6:   Update $y^{k+1} := (1 - \tau_k)x^k + \tau_k v^k$ with $\tau_k := \tfrac{a_{k+1}}{A_{k+1}}$.
7:   Update $x^{k+1} := T_{\mathcal{X}}(y^{k+1})$ from (30).
8: End for

Next, we analyze the update rule (28). If we choose $a_k := \tfrac{k+1}{2}$, then $A_k = \sum_{i=0}^k\tfrac{i+1}{2} = \tfrac{(k+1)(k+2)}{4}$. Clearly, $a_{k+1}^2 = \tfrac{(k+2)^2}{4} \le \tfrac{(k+2)(k+3)}{4} = A_{k+1}$, which satisfies (28). Hence, $\tau_k = \tfrac{2}{k+3}$. Clearly, $\tau_k\in(0, 1)$ and $a_0 = \tfrac{1}{2} < 1$. We can summarize the update rule of $a_k$ and $A_k$ at Step 4 and of $\tau_k$ at Step 6 as
\[ a_k := \frac{k+1}{2}, \quad A_k := \frac{(k+1)(k+2)}{4}, \quad\text{and}\quad \tau_k := \frac{2}{k+3}. \tag{32} \]
In general, we can choose $a_{k+1}$ by solving $a^2 - a - A_k \le 0$, which leads to $a_{k+1} \le \tfrac{1 + \sqrt{1 + 4A_k}}{2}$.

Finally, we summarize the convergence of Algorithm 3 in the following theorem, which is a direct consequence of (26) and (32); we omit a separate proof.

Theorem 1.8. Let $\{x^k\}$ be the sequence generated by Algorithm 3 for solving (22). Then, one has
\[ f(x^k) - f^\star \le \frac{4L_f d_{\mathcal{X}}(x^\star)}{(k+1)(k+2)}. \]
Consequently, the convergence rate of Algorithm 3 is $\mathcal{O}\Big(\tfrac{L_f d_{\mathcal{X}}(x^\star)}{k^2}\Big)$.

Here, we make a few remarks on Algorithm 3.
- Algorithm 3 requires only one gradient $\nabla f(y^k)$ at each iteration. It has the flexibility to choose the proximity function $d_{\mathcal{X}}$ that best describes the feasible domain $\mathcal{X}$. We will see later that this is really important in some applications. It also works with non-Euclidean norms.
- Algorithm 3 requires two projection-type subproblems on $\mathcal{X}$: one at Step 5 to compute $v^k$, and one at Step 7 to update $x^{k+1}$. This can be a disadvantage compared to Algorithm 2 if the projection onto $\mathcal{X}$ is expensive.
- Algorithm 3 can be viewed as a combination of a primal gradient method and a dual averaging scheme (see the next lectures).
- The sequence of objective values $\{f(x^k)\}$ in Algorithm 3 is nonmonotone (i.e., $f(x^{k+1}) \le f(x^k)$ may fail for some $k \ge 0$). To impose monotone behavior, we can modify Step 7 by selecting $x^{k+1}$ among $\{T_{\mathcal{X}}(y^{k+1}), y^{k+1}, x^k\}$ such that $f(x^{k+1}) = \min\big\{f(T_{\mathcal{X}}(y^{k+1})), f(y^{k+1}), f(x^k)\big\}$. However, in this case, we need to evaluate the objective value at two additional points per iteration.

1.6 Implementation discussion and examples

We now discuss some implementation aspects of the gradient and fast gradient methods presented previously. We focus on three techniques: stopping criteria, line-search, and restarting. These techniques do not change (or only slightly change) the theoretical guarantees of the algorithms, but they can significantly enhance practical performance in terms of computational time and number of iterations.

1.6.1 Stopping criteria

For the standard gradient method (8), by its optimality condition $\nabla f(x^\star) = 0$, we can use
\[ \|\nabla f(x^k)\| \le \varepsilon\max\{1, \|\nabla f(x^0)\|\} \tag{33} \]
as a stopping criterion, where $\varepsilon$ is a given tolerance and $x^0$ is the starting point. For the fast gradient method, however, the situation is more complicated, since we have a convergence guarantee on the sequence $\{x^k\}$ while the gradient is computed at $y^k$. Hence, we cannot directly use the same condition as in the standard gradient method. Of course, we can occasionally evaluate the gradient $\nabla f(x^k)$ every few iterations (e.g., every 5 iterations) to check this condition. As indicated in [11], the sequence $\{x^k\}$ converges to a solution of (1). Hence, one strategy to terminate this algorithm is as follows. We compute the relative change of $x^k$ and check
\[ \frac{\|x^{k+1} - x^k\|}{\max\{\|x^k\|, 1\}} \le \varepsilon. \]
If this condition holds, we can additionally compute $\nabla f(x^k)$ (just once) to check the stopping criterion (33) above.

While gradient-type methods can converge to a solution $x^\star$ from any initial point in $\mathrm{dom}(f)$, we should note that the complexity bound of both the gradient and fast gradient methods depends on the distance $\|x^0 - x^\star\|$ from the initial point $x^0$ to the solution set. Hence, if we have any prior information about the solution of (1), we can exploit it to choose a good initial point $x^0$ and thereby reduce the number of iterations.

1.6.2 Line-search

Sometimes, evaluating the Lipschitz constant $L_f$ is expensive. Even when we know $L_f$, the step-size $\alpha := \tfrac{1}{L_f}$ reflects the worst case and may not perform as well as adaptive step-sizes. Hence, using a line-search strategy may give us better performance.

What is a line-search procedure? To illustrate the idea, assume that we want to solve the unconstrained convex problem (1). Given a point $x\in\mathrm{dom}(f)$, a vector $d\in\mathbb{R}^p$, $d \ne 0$, is called a descent direction of $f$ at $x$ if $\nabla f(x)^\top d < 0$. Indeed, by the assumption $f\in\mathcal{F}^{1,1}_{L}(\mathbb{R}^p)$ and the upper bound from Lecture 2, we have
\[ f(x + td) \le f(x) + t\nabla f(x)^\top d + \tfrac{t^2 L_f}{2}\|d\|^2, \quad \forall t > 0. \]
Since $\nabla f(x)^\top d < 0$, we can choose $t > 0$ sufficiently small (e.g., $0 < t < -2\nabla f(x)^\top d/(L_f\|d\|^2)$) such that $t\nabla f(x)^\top d + \tfrac{t^2 L_f}{2}\|d\|^2 < 0$. Hence, we have $f(x + td) < f(x)$, which shows that if we move along the direction $d$, the objective function $f$ decreases.

The cosine $\cos(d, -\nabla f(x))$ between $-\nabla f(x)$ and $d$ measures the relative slope of the direction $d$ with respect to $f$. It is clear that $\cos(d, -\nabla f(x)) = \tfrac{-d^\top\nabla f(x)}{\|\nabla f(x)\|\,\|d\|} = 1$ if $d = -\nabla f(x)$. Therefore, $d = -\nabla f(x)$ is called the steepest descent direction.

Now, given a descent direction $d$ of $f$ at $x$, how can we find a step-size $t > 0$ such that $f(x + td) < f(x)$? There are several ways of doing this. For instance, if we know $L_f$, we can take $0 < t < -2\nabla f(x)^\top d/(L_f\|d\|^2)$.

In particular, if $d = -\nabla f(x)$, then $t < \tfrac{2}{L_f}$, and the best choice in the worst case is $t = \tfrac{1}{L_f}$. In general, we can find $t$ by solving the one-variable minimization problem $\min_{t>0}\varphi(t)$, where $\varphi(t) := f(x + td)$. This technique is called an exact line-search procedure. In practice, we often only approximate $t$. The simplest way is a bisection scheme: starting from $t := t_0$, at each iteration we halve $t$ (i.e., $t \to t/2$) and compare the objective values; if $f(x + td) < f(x)$, then we terminate.

Now, we present a simple backtracking line-search using the well-known Armijo-type condition, applied with $d = -\nabla f(x^k)$, as follows:
1. At iteration $k$, given an estimate $L_0$ of $L_f$ such that $L_0 < L_f$, set $L := L_0$ as a starting value.
2. At each line-search iteration, if the condition
\[ f\big(x^k - \tfrac{1}{L}\nabla f(x^k)\big) \le f(x^k) - \tfrac{c_1}{2L}\|\nabla f(x^k)\|^2 \]
holds, then terminate, where $c_1\in(0, 1]$ is a fixed constant; otherwise, set $L := 2L$ and repeat this step.
3. Once this loop terminates, set $\alpha_k := \tfrac{1}{L}$ as the step-size at iteration $k$ of Algorithm 1.

In the fast gradient method, Algorithm 2, we can use the same line-search procedure, but we need to guarantee $L_{k+1} \ge L_k$ for every iteration $k$ in order to preserve its convergence guarantee; see [6] for more details. Hence, at iteration $k$, we set $L := L_{k-1}$ as a starting value instead of using $L_0$.

Question: How do we estimate an initial value $L_0$ in this line-search procedure? We note that
\[ L_f := \sup_{x\ne y}\frac{\|\nabla f(x) - \nabla f(y)\|}{\|x - y\|} \ge \frac{\|\nabla f(x^1) - \nabla f(x^0)\|}{\|x^1 - x^0\|} \]
for any given $x^0, x^1\in\mathrm{dom}(f)$ with $x^0\ne x^1$. Hence, we can estimate $L_0$ as
\[ L_0 := c_L\frac{\|\nabla f(x^1) - \nabla f(x^0)\|}{\|x^1 - x^0\|} \]
by taking two arbitrary points $x^0, x^1\in\mathrm{dom}(f)$ with $x^0\ne x^1$. Here, the constant $c_L$ can be set, e.g., to $c_L := 1/4$ to damp the estimate so that $L_0 < L_f$ is likely.

If $\nabla f$ is $L_f$-Lipschitz, then the backtracking line-search procedure always terminates after a finite number of iterations $i_k$. We can estimate $i_k$ as follows. At every line-search iteration, we double $L$ starting from its initial value $L_0$. Hence, after $i_k$ iterations, we have $L = 2^{i_k}L_0 \le 2L_f$ (since it may overshoot by at most one doubling). The maximum number of line-search iterations is therefore
\[ i_k := \Big\lfloor \log_2\Big(\frac{L_f}{L_0}\Big) \Big\rfloor + 1. \]
This number of iterations corresponds to the number of function evaluations $f(x^k + (1/L)d^k)$ (with $d^k := -\nabla f(x^k)$). If the cost of evaluating $f(x^k + (1/L)d^k)$ is high, then this line-search procedure may incur significant computational cost. Note that we can use a different rule for increasing $L$ instead of doubling. For example, we can choose $\beta > 1$ and update $L \to \beta L$; in this case, the maximum number of line-search iterations is $i_k := \lfloor\log_\beta(\tfrac{L_f}{L_0})\rfloor + 1$. The sketch below implements this backtracking procedure.
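The following Python sketch implements the backtracking procedure above for the gradient method: it starts from an estimate of $L$ and doubles it until the Armijo-type sufficient-decrease condition holds. The constant c1 and the initialization are illustrative choices.

```python
import numpy as np

def backtracking_gradient_step(f, grad_f, x, L0, c1=1.0, max_doublings=60):
    """One gradient step with backtracking: find L (by doubling) such that
    f(x - g/L) <= f(x) - (c1 / (2L)) ||g||^2, then step with alpha = 1/L."""
    g = grad_f(x)
    fx, g2 = f(x), g @ g
    L = L0
    for _ in range(max_doublings):
        x_new = x - g / L
        if f(x_new) <= fx - (c1 / (2 * L)) * g2:   # Armijo-type condition
            return x_new, L
        L *= 2                                     # overshoot is at most one doubling
    return x_new, L

# usage on the least-squares objective f(x) = 0.5 ||A x - b||^2
rng = np.random.default_rng(4)
A = rng.standard_normal((80, 40)); b = rng.standard_normal(80)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2
grad = lambda x: A.T @ (A @ x - b)
x, L = np.zeros(40), 1e-3                          # deliberately small L0
for k in range(200):
    x, L = backtracking_gradient_step(f, grad, x, L)
print(L, f(x))
```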

1.6.3 Restarting strategies

The accelerated gradient method exhibits an oscillation behavior, as we can see in Figure 2, when the objective function $f$ is strongly convex (or restricted strongly convex). This figure shows the convergence behavior of the objective residual $f(x^k) - f^\star$ for a quadratic function $f(x) = \tfrac{1}{2}x^\top Qx$ with different estimates $q$ of the inverse of the condition number, $q = \tfrac{\mu_f}{L_f}$. In this test, we use the momentum coefficient $\beta = \tfrac{1 - \sqrt{q}}{1 + \sqrt{q}}$. The oscillation is due to the non-monotonicity of the objective residual sequence $\{f(x^k) - f^\star\}$.

[Figure 2: An oscillation behavior of accelerated gradient methods for minimizing $f(x) = \tfrac{1}{2}x^\top Qx$ with different estimates of $q$; reproduced from [37].]

In order to reduce this oscillation, one can inject a restarting procedure, performed in Algorithm 2 whenever the oscillation starts. There are different conditions under which to trigger a restart. We refer the reader to [10, 37, 45] for some heuristic strategies, and to [15] for some theoretical aspects behind this enhancement. Here, we present a strategy from [10] which uses a condition on the iterative sequence to restart. After the update of $x^{k+1}$ and $y^{k+1}$ at Step 5 of Algorithm 2, we add the restarting condition:
\[ \text{if } \langle y^k - x^{k+1}, x^{k+1} - x^k\rangle > 0, \text{ then set } y^{k+1} := x^{k+1} \text{ and } t_{k+1} := 1. \]
We can also replace the condition $\langle y^k - x^{k+1}, x^{k+1} - x^k\rangle > 0$ by a fixed restarting condition: if $\mathrm{mod}(k, k_{rs}) = 0$, then $y^{k+1} := x^{k+1}$ and $t_{k+1} := 1$, where $k_{rs}$ can be chosen, e.g., as $k_{rs} = 50, 100, 200$, etc. A sketch combining (AGDA) with the gradient-based restart is given below.
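Below is a minimal Python sketch of (AGDA) with the restart condition above (the test $\langle y^k - x^{k+1}, x^{k+1} - x^k\rangle > 0$); on an ill-conditioned strongly convex quadratic it suppresses the oscillation and converges roughly linearly. The test problem is an illustrative choice.

```python
import numpy as np

def agda_restart(grad_f, x0, Lf, n_iter=1000):
    """AGDA with a gradient-based restart: whenever
    <y^k - x^{k+1}, x^{k+1} - x^k> > 0, reset the momentum (t := 1)."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_next = y - grad_f(y) / Lf
        if np.dot(y - x_next, x_next - x) > 0:         # restart condition
            y, t = x_next.copy(), 1.0
        else:
            t_next = 0.5 * (1 + np.sqrt(1 + 4 * t * t))
            y = x_next + ((t - 1) / t_next) * (x_next - x)
            t = t_next
        x = x_next
    return x

# strongly convex quadratic f(x) = 0.5 x^T Q x, Q diagonal and ill-conditioned
d = np.logspace(-3, 0, 100)                            # eigenvalues in [1e-3, 1]
grad = lambda x: d * x
x = agda_restart(grad, np.ones(100), Lf=1.0, n_iter=500)
print(0.5 * np.sum(d * x * x))   # small objective: restart suppresses oscillation
```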
1.7 Applications and numerical examples

We consider two simple problems to illustrate the performance of gradient methods and their accelerated variants. The first example is simply a least-squares problem, while the second one is a logistic regression.

Example 4 (Least-squares problem). We consider the following least-squares problem:
\[ f^\star := \min_{x\in\mathbb{R}^p}\Big\{ f(x) := \tfrac{1}{2}\|Ax - b\|^2 \Big\}, \tag{34} \]
where $A\in\mathbb{R}^{m\times p}$ and $b\in\mathbb{R}^m$. If $m < p$, then we have an underdetermined case; if $m > p$, then it is an overdetermined case. If $m = p$ and $A$ is full-rank, the optimal value is zero and (34) has a unique solution $x^\star$, which solves the linear system $Ax = b$. The function $f$ is convex, and its gradient $\nabla f(x) = A^\top(Ax - b)$ is Lipschitz continuous with $L_f := \|A^\top A\|$. This norm can be computed efficiently by a power method (see Appendix A.1).

Figure 3 shows the convergence behavior of three different algorithms, Algorithm 1, Algorithm 2, and the restarting variant of Algorithm 2, on a synthetic instance with $n = 800$ and $p = 1000$. The y-axes show the relative objective residual $\tfrac{f(x^k) - f^\star}{\max\{1, |f^\star|\}}$ and the relative error $\tfrac{\|x^{k+1} - x^k\|}{\max\{\|x^k\|, 1\}}$, respectively. While Algorithm 2 is clearly faster than Algorithm 1, the restarting variants further outperform both methods. If we inject a backtracking line-search procedure into both algorithms, then we obtain a better performance, as observed in Figure 3.

Now, let us add a regularizer $\tfrac{\mu_f}{2}\|x\|^2$ with $\mu_f = 0.1$ to $f$ to obtain the strongly convex objective function $f(x) := \tfrac{1}{2}\|Ax - b\|^2 + \tfrac{\mu_f}{2}\|x\|^2$. If we test on the same setting as in the first case, we obtain the convergence behavior of these six variants shown in Figure 4. Here, we can clearly observe the oscillation behavior of the objective residuals $\{f(x^k) - f^\star\}$ of Algorithm 2, as shown in Figure 4. The restarting variants achieve an approximate solution at machine precision within a few iterations, which is much faster than in the previous case shown in Figure 3.

[Figure 3: The convergence of six variants of the gradient method for solving (34): Gradient, LS-Gradient, Fast gradient, LS-Fast gradient, RS-Fast gradient, and RS-LS-Fast gradient; left: relative objective residual, right: relative error, both versus the number of iterations.]

[Figure 4: The convergence of six variants of the gradient method for solving (34) with $f(x) := \tfrac{1}{2}\|Ax - b\|^2 + \tfrac{\mu_f}{2}\|x\|^2$; left: relative objective residual, right: relative error, both versus the number of iterations.]

Example 5 (Logistic regression). We consider another problem arising from logistic regression:
\[ \min_{x\in\mathbb{R}^p}\Big\{ f(x) := \frac{1}{n}\sum_{i=1}^n\log\big(1 + \exp(-y_i(a_i^\top x + \mu))\big) \Big\}, \tag{35} \]
where $\{(a_i, y_i)\}_{i=1}^n$ is a given dataset and $\mu$ is a given intercept. Let $A$ be the matrix whose columns are formed from the $a_i$ for $i = 1, \dots, n$. Then, we can easily compute the gradient of $f$ as
\[ \nabla f(x) = -\frac{1}{n}\sum_{i=1}^n\frac{\exp(-y_i(a_i^\top x + \mu))}{1 + \exp(-y_i(a_i^\top x + \mu))}\,y_i a_i, \]
which is Lipschitz continuous with Lipschitz constant $L_f := \tfrac{1}{4n}\|A^\top A\|$, as shown in Lecture 2. A minimal implementation of this objective and gradient is sketched below.
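The following Python sketch implements the logistic objective (35) and its gradient, and checks the gradient numerically by central finite differences; the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, mu = 200, 30, 0.1
A = rng.standard_normal((n, p))                     # rows a_i
y = rng.choice([-1.0, 1.0], size=n)

def f(x):
    # f(x) = (1/n) sum_i log(1 + exp(-y_i (a_i^T x + mu)))
    return np.mean(np.logaddexp(0.0, -y * (A @ x + mu)))

def grad_f(x):
    s = 1.0 / (1.0 + np.exp(y * (A @ x + mu)))      # = e^{-z} / (1 + e^{-z})
    return -(A.T @ (s * y)) / n

# finite-difference check of the gradient at a random point
x = rng.standard_normal(p); e = np.zeros(p); e[0] = 1.0; h = 1e-6
print(grad_f(x)[0], (f(x + h * e) - f(x - h * e)) / (2 * h))  # should match
```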

We test six variants of the gradient method (both non-accelerated and accelerated) on a dataset called w4a, downloaded from the LIBSVM collection (cjlin/libsvmtools/datasets/binary.html); the result is shown in Figure 5. Here, the number of data points is $n = 7366$ and the problem dimension is $p = 300$.

[Figure 5: The convergence behavior of six variants of the gradient method for solving (35): Gradient, LS-Gradient, Fast gradient, LS-Fast gradient, RS-Fast gradient, and LS-RS-Fast gradient; left: relative objective residual, right: relative error, both versus the number of iterations.]

In this figure, we see that restarting does not help to enhance the performance at all. This happens due to the lack of strong convexity of $f$. However, the line-search procedure still slightly helps to improve the performance in terms of iterations, though it requires additional computational time for evaluating $f$ in the line-search routine.

Now, to see the effect of strong convexity on the performance of the gradient methods, we slightly modify the objective function $f$ in (35) by adding a small quadratic term:
\[ f(x) := \frac{1}{n}\sum_{i=1}^n\log\big(1 + \exp(-y_i(a_i^\top x + \mu))\big) + \frac{\mu_f}{2}\|x\|^2. \]
Let us take $\mu_f := 0.01$. The convergence of the six variants is shown in Figure 6.

[Figure 6: The convergence behavior of six algorithmic variants for solving a strongly convex instance of (35); left: relative objective residual, right: relative error, both versus the number of iterations.]

In this case, all the variants perform better than in Figure 5. Moreover, the restarting strategy now makes a difference by significantly improving performance.

2 Proximal gradient-type methods

We have seen from Lecture 1 that several convex optimization models can be reformulated as composite convex minimization problems. In this section, we study a class of gradient-type methods to solve them. Let us recall this problem here for convenience of presentation:
\[ F^\star := \min_{x\in\mathbb{R}^p}\big\{ F(x) := f(x) + g(x) \big\}, \tag{36} \]
where $f$ and $g$ are both proper, closed, and convex (see Lecture 1 for definitions). Several practical models in statistics, machine learning, computer science, and engineering can be reformulated as the minimization of the sum of two convex functions. Here, the first objective term usually characterizes a loss, risk, or cost function, or a data-fidelity term, while the second one can be used to regularize or to promote desired structures of the optimal solution [6, 34, 47]. The numerical methods we design in this section rely on the following blanket assumption:

Assumption 2. The objective terms $f$ and $g$ are proper, closed, and convex. Moreover, they satisfy:
a) $f$ is a smooth function with Lipschitz gradient, i.e., there exists a constant $L_f\in[0, +\infty)$ such that
\[ \|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|, \quad \forall x, y\in\mathrm{dom}(f). \tag{37} \]
b) $g$ is a (possibly) nonsmooth function equipped with a tractable proximal operator (see Lecture 2):
\[ \mathrm{prox}_g(x) := \arg\min_{z\in\mathrm{dom}(g)}\Big\{ g(z) + \tfrac{1}{2}\|z - x\|^2 \Big\}. \tag{38} \]
We denote the class of all functions satisfying a) by $\mathcal{F}^{1,1}_{L}(\mathbb{R}^p)$ and of those satisfying b) by $\mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$, respectively.

Let us denote by $x^\star$ an optimal solution of (36), i.e., $F(x^\star) \le F(x)$ for all $x\in\mathbb{R}^p$. The set of all optimal solutions of (36) is denoted by
\[ \mathcal{S}^\star := \{ x^\star\in\mathrm{dom}(F) \mid F(x^\star) = F^\star \}, \]
where $F^\star$ is the optimal value of (36), which is finite, and $\mathrm{dom}(F) := \mathrm{dom}(f)\cap\mathrm{dom}(g)$. To avoid trivial cases, we assume that $\mathrm{dom}(F)$ is nonempty and $\mathcal{S}^\star$ is nonempty.

2.1 Motivating examples

The following applications can be cast into the composite convex minimization problem of the form (36). These problems were described in Lecture 1; we recall them here for completeness.

Example 6 (LASSO problem). We are given an observed measurement vector $y\in\mathbb{R}^n$ and a sensing/measurement matrix $\Phi\in\mathbb{R}^{n\times q}$, where $n \ll p$. In signal processing, an input signal $z\in\mathbb{R}^q$ produces the measurement vector $y$ via the linear model $y = \Phi z + n$, where $n$ is some noise corruption vector. Very often, the noise vector is assumed to be Gaussian, while the signal vector $z$ can be transformed into a sparse vector $x\in\mathbb{R}^p$ via some transformation $z = Zx$ (e.g., FFT or wavelets). Plugging this into our model, we finally get $y = \Phi Zx + n \equiv Ax + n$ with $A := \Phi Z$.

[Figure: a schematic of the compressive measurement model $y = Ax + n$ with $A := \Phi Z$.]

To measure the sparsity of $x$, we can use $\|x\|_0 := \mathrm{card}(x)$, the cardinality of $x$. However, to get a convex relaxation of $\|\cdot\|_0$, we use the $\ell_1$-norm $\|x\|_1$ as a convex envelope of $\|\cdot\|_0$ and reformulate the problem of recovering a sparse signal vector $x$ as
\[ \min_{x\in\mathbb{R}^p}\Big\{ \tfrac{1}{2}\|Ax - y\|^2 + \lambda\|x\|_1 \Big\}, \tag{39} \]
where $\lambda > 0$ is a regularization parameter. As we can see, problem (39) has exactly the same form as (36), where $f(x) := \tfrac{1}{2}\|Ax - y\|^2$ is convex and has the Lipschitz gradient $\nabla f(x) = A^\top(Ax - y)$, and $g(x) := \lambda\|x\|_1$ is convex but nonsmooth. However, the prox-operator of $g$ can be computed in closed (analytical) form, componentwise, as
\[ \big(\mathrm{prox}_{\lambda\|\cdot\|_1}(x)\big)_i = \mathrm{sign}(x_i)\max\{|x_i| - \lambda, 0\} = \begin{cases} 0 & \text{if } |x_i| \le \lambda, \\ (1 - \lambda/|x_i|)\,x_i & \text{otherwise}. \end{cases} \tag{40} \]
This operator is known as the soft-thresholding operator; a short numerical sketch is given below.
Example 7 (Sparse logistic regression). We are given a dataset $\mathcal{D} := \{(w^{(1)}, y_1), (w^{(2)}, y_2), \dots, (w^{(n)}, y_n)\}$, where $w^{(i)}\in\mathbb{R}^p$ and $y_i\in\{-1, +1\}$ for $i = 1, \dots, n$. The conditional probability of a label $y$ given $w$ is defined as $\mathbb{P}(y\,|\,w) = 1/\big(1 + e^{-y(x^\top w + \mu)}\big)$, where $x\in\mathbb{R}^p$ is a weight vector and $\mu$ is the intercept. The aim is to find a sparse weight vector $x$ via the maximum log-likelihood principle. This problem can be formulated as the following composite convex minimization problem:
\[ \min_{x\in\mathbb{R}^p}\Big\{ \underbrace{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i((w^{(i)})^\top x + \mu)\big)}_{f(x)} + \underbrace{\lambda\|x\|_1}_{g(x)} \Big\}, \tag{41} \]
where $w^{(i)}$ is the $i$-th row of the data matrix $W\in\mathbb{R}^{n\times p}$ ($n \ll p$), $\lambda > 0$ is a regularization parameter, and $\ell$ is the logistic loss function given by $\ell(\tau) := \log(1 + e^{-\tau})$. As seen previously, both $f$ and $g$ satisfy Assumption 2.

Example 8 (Image deconvolution). Given a noisy or blurred image $y\in\mathbb{R}^{n\times p}$, the aim is to recover a clean image $x$ via $y = \mathcal{A}(x) + n$, where $\mathcal{A}$ is a decoding or convolution operator and $n$ is Gaussian noise. This problem can be reformulated in the form (36) as
\[ \min_{x\in\mathbb{R}^{n\times p}}\Big\{ \underbrace{\tfrac{1}{2}\|\mathcal{A}(x) - y\|_F^2}_{f(x)} + \underbrace{\lambda\|x\|_{TV}}_{g(x)} \Big\}, \tag{42} \]
where $\lambda > 0$ is a regularization parameter and $\|\cdot\|_{TV}$ is the total variation (TV) norm, which promotes noise removal while preserving sharp-edge structures:
\[ \|x\|_{TV} := \begin{cases} \sum_{i,j}\big(|x_{i,j+1} - x_{i,j}| + |x_{i+1,j} - x_{i,j}|\big) & \text{for the anisotropic case}, \\ \sum_{i,j}\sqrt{|x_{i,j+1} - x_{i,j}|^2 + |x_{i+1,j} - x_{i,j}|^2} & \text{for the isotropic case}. \end{cases} \]
By solving this problem, we can approximately recover the original image $x$. Figure 7 illustrates how the optimization model (42) can be used to de-blur an image; a small sketch of the TV norm computation follows.
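The following Python sketch computes the anisotropic and isotropic TV norms of a 2-D image array, matching the definition above; zero differences at the last row/column are one common boundary convention, assumed here.

```python
import numpy as np

def tv_norm(x, isotropic=False):
    """Total variation of a 2-D array x. Forward differences; the last
    row/column contribute zero difference (a common boundary convention)."""
    dh = np.diff(x, axis=1)            # horizontal differences x[i, j+1] - x[i, j]
    dv = np.diff(x, axis=0)            # vertical differences   x[i+1, j] - x[i, j]
    if isotropic:
        dh2 = np.zeros_like(x); dh2[:, :-1] = dh
        dv2 = np.zeros_like(x); dv2[:-1, :] = dv
        return np.sum(np.sqrt(dh2**2 + dv2**2))
    return np.abs(dh).sum() + np.abs(dv).sum()

img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0   # a sharp square: small TV
noisy = img + 0.1 * np.random.default_rng(6).standard_normal(img.shape)
print(tv_norm(img), tv_norm(noisy))           # noise inflates the TV norm
```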

[Figure 7: Image deconvolution with model (42): the original image $x^\natural$, the observed noisy/blurred image $y$, and the clean image $x$ recovered after deconvolution.]

Example 9 (Sparse inverse covariance estimation). Given a dataset $\mathcal{D} := \{z^{(1)}, \dots, z^{(N)}\}$ generated from a Gaussian Markov random field $z$ in $\mathbb{R}^p$, let $\Sigma$ be the covariance matrix corresponding to the graphical model of this Gaussian Markov random field. The aim is to learn a sparse matrix $\Theta$ that approximates the inverse $\Sigma^{-1}$ of the covariance matrix $\Sigma$. See Figure 8 for an illustration.

[Figure 8: An illustration of the graphical model that leads to (43): a sparse dependency graph among the variables $x_1, \dots, x_5$ corresponds to a sparse inverse covariance matrix.]

This problem can be formulated as follows (see Lecture 1):
\[ \min_{\Theta\succ 0}\Big\{ \underbrace{\mathrm{trace}(\hat\Sigma\Theta) - \log\det(\Theta)}_{f(\Theta)} + \underbrace{\lambda\|\mathrm{vec}(\Theta)\|_1}_{g(\Theta)} \Big\}, \tag{43} \]
where $\Theta\succ 0$ means that $\Theta$ is symmetric and positive definite, $\lambda > 0$ is a regularization parameter, and $\mathrm{vec}$ is the vectorization operator. More details on this problem can be found in [3, 4, 12, 40, 46]. Under some appropriate assumptions on the data, one can show that $f(\cdot)$ in (43) also satisfies Assumption 2.

2.2.1 Optimality condition

From the well-known Moreau-Rockafellar theorem [4] for two proper, closed, and convex functions $f$ and $g$ in $\Gamma_0(\mathbb{R}^p)$ such that $\mathrm{ri}(\mathrm{dom}(f))\cap\mathrm{ri}(\mathrm{dom}(g))\ne\emptyset$, we have
\[ \partial F(x) = \partial(f + g)(x) \equiv \partial f(x) + \partial g(x), \quad x\in\mathrm{dom}(F) := \mathrm{dom}(f)\cap\mathrm{dom}(g). \]
Now, we can write the optimality condition for (36) as
\[ 0\in\partial F(x^\star) \equiv \partial f(x^\star) + \partial g(x^\star), \quad x^\star\in\mathrm{dom}(F). \tag{44} \]
If $f\in\mathcal{F}^{1,1}_{L}(\mathbb{R}^p)$ (i.e., $\nabla f$ is Lipschitz continuous), then (44) reduces to
\[ 0\in\partial F(x^\star) \equiv \nabla f(x^\star) + \partial g(x^\star), \quad x^\star\in\mathrm{dom}(F). \tag{45} \]
The following lemma gives a necessary and sufficient condition for a point $x^\star$ to be an optimal solution of (36).

Lemma 2.1. A necessary and sufficient condition for a point $x^\star\in\mathrm{dom}(F)$ to be globally optimal for (36) is (44) (or (45) if $f\in\mathcal{F}^{1,1}_{L}(\mathbb{R}^p)$).

Proof. By the definition of the subdifferential of $F$, we have $F(x) - F(x^\star) \ge \langle\xi^\star, x - x^\star\rangle$ for any $\xi^\star\in\partial F(x^\star)$ and $x\in\mathrm{dom}(F)$. If (44) (or (45)) is satisfied, then taking $\xi^\star = 0$ gives $F(x) - F(x^\star) \ge 0$; consequently, $x^\star$ is a global solution of (36). Conversely, if $x^\star$ is globally optimal for (36), then $F(x) \ge F(x^\star)$ for all $x\in\mathrm{dom}(F)$, i.e., $F(x) - F(x^\star) \ge \langle 0, x - x^\star\rangle$ for all $x\in\mathbb{R}^p$. This leads to $0\in\partial F(x^\star)$, i.e., (44) (or (45)).

2.2.2 Properties of proximal operators

Let $\mathrm{prox}_g$ be the prox-operator of a proper, closed, and convex function $g\in\Gamma_0(\mathbb{R}^p)$. We recall some basic properties of $\mathrm{prox}_g$ from Lecture 2.

Lemma 2.2. Given a proper, closed, and convex function $g\in\Gamma_0(\mathbb{R}^p)$, let $\mathrm{prox}_g$ be the proximal operator of $g$ defined by (38). Then, the following properties hold:
a) $\mathrm{prox}_g$ is well-defined and single-valued for any $x\in\mathbb{R}^p$.
b) $\mathrm{prox}_g(x)$ satisfies the following inclusion:
\[ x\in\mathrm{prox}_g(x) + \partial g(\mathrm{prox}_g(x)), \quad \forall x\in\mathbb{R}^p. \tag{46} \]
c) $x^\star$ is a fixed point of $\mathrm{prox}_g$ iff $x^\star = \arg\min_{x\in\mathbb{R}^p} g(x)$, i.e.,
\[ x^\star = \arg\min_{x\in\mathbb{R}^p} g(x) \iff x^\star = \mathrm{prox}_g(x^\star). \]
d) $\mathrm{prox}_g$ is a non-expansive operator, i.e.,
\[ \|\mathrm{prox}_g(x) - \mathrm{prox}_g(y)\| \le \|x - y\|, \quad \forall x, y\in\mathbb{R}^p. \]

Proof. Since $g$ is convex, $g(\cdot) + \tfrac{1}{2}\|\cdot - x\|^2$ is strongly convex with parameter $\mu = 1$ for any $x\in\mathbb{R}^p$. Hence, there exists an optimal solution $z^\star(x) = \mathrm{prox}_g(x)$ of $\min_{z\in\mathbb{R}^p}\{g(z) + \tfrac{1}{2}\|z - x\|^2\}$ for any $x\in\mathbb{R}^p$. Moreover, since this function is strongly convex, $z^\star(x)$ is unique. Statement a) is proved.

Writing down the optimality condition of (38), we obtain $0\in\partial g(\mathrm{prox}_g(x)) + \mathrm{prox}_g(x) - x$. Rearranging this expression, we obtain (46). Statement b) is proved.

A point $x^\star = \arg\min_{x\in\mathbb{R}^p} g(x)$ is equivalent to $0\in\partial g(x^\star)$. This is equivalent to $0\in\partial g(x^\star) + x^\star - x^\star$, which shows that $x^\star$ satisfies the optimality condition of (38) with $x = x^\star$. Therefore, $x^\star = \mathrm{prox}_g(x^\star)$. Conversely, by using (46), we also obtain $0\in\partial g(x^\star)$, which means that $x^\star = \arg\min_{x\in\mathbb{R}^p} g(x)$. Statement c) is proved.

Finally, we prove d). Let $u := \mathrm{prox}_g(x)$ and $v := \mathrm{prox}_g(y)$. By (46), we have $x - u\in\partial g(u)$ and $y - v\in\partial g(v)$. Since $g$ is convex, we have $\langle u - v, p - q\rangle \ge 0$ for $p\in\partial g(u)$ and $q\in\partial g(v)$ (the monotonicity of $\partial g$; see, e.g., [30]). Using this relation with $p = x - u$ and $q = y - v$, we have $\langle u - v, (x - u) - (y - v)\rangle \ge 0$. This implies $\langle u - v, x - y\rangle \ge \|u - v\|^2$. Finally, by the Cauchy-Schwarz inequality, we have $\|u - v\|\,\|x - y\| \ge \langle u - v, x - y\rangle \ge \|u - v\|^2$, which leads to d).
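As a quick numerical illustration of Lemma 2.2, the sketch below uses the soft-thresholding operator from Example 6 as prox_g and checks the non-expansiveness d) on random pairs, as well as the fixed-point property c) at the minimizer $x^\star = 0$ of $g = \lambda\|\cdot\|_1$. This is only a sanity check, not a proof.

```python
import numpy as np

prox = lambda x, lam=0.5: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(7)
# d) non-expansiveness: ||prox(x) - prox(y)|| <= ||x - y|| on random pairs
for _ in range(5):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    assert np.linalg.norm(prox(x) - prox(y)) <= np.linalg.norm(x - y) + 1e-12
print("non-expansiveness holds on sampled pairs")

# c) fixed point: the minimizer of g = lam * ||.||_1 is x* = 0, and prox(0) = 0
print(prox(np.zeros(10)))   # all zeros
```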

2.2.3 Optimality condition vs. fixed-point formulation

The optimality condition (44) is equivalent to

    x* ∈ prox_{λg}( x* − λ∂f(x*) ) =: T_λ(x*),   (47)

for any λ > 0, which says that x* is a fixed point of the mapping T_λ. Alternatively, the optimality condition (45) is equivalent to

    x* = prox_{λg}( x* − λ∇f(x*) ) =: S_λ(x*),   (48)

for any λ > 0, which says that x* is a fixed point of the mapping S_λ. While T_λ is a set-valued operator, S_λ is a single-valued operator.

Proof. We prove (48); the proof of (47) is done similarly. From (45), we can write

    0 ∈ ∇f(x*) + ∂g(x*)  ⇔  x* − λ∇f(x*) ∈ x* + λ∂g(x*) = (I + λ∂g)(x*),

where I is the identity mapping, i.e., I(x) = x. Using the basic property b) of prox_{λg} from Lemma 2.2, we have x* ∈ prox_{λg}(x* − λ∇f(x*)). Since prox_{λg} and ∇f are single-valued, we obtain (48).

2.3 Proximal-gradient methods

Now, we present the first algorithm for solving (36), called the proximal-gradient method [6, 34]. This algorithm is sometimes called ISTA [6], in reference to the Iterative Shrinkage-Thresholding Algorithm. In the context of monotone inclusions [4], this algorithm is known as the forward-backward splitting (FBS) method. It relies on the following assumptions, which restate Assumption 2:

Assumption 3.
a) f ∈ F_L^{1,1}(R^p) and g ∈ F_prox(R^p) (i.e., ∇f is Lipschitz continuous and g has a tractable proximal operator).
b) Oracle: proximal-gradient algorithms typically use F(·), ∇f(·), and prox_{λg}(·) at each iteration.

2.3.1 Derivation of the proximal-gradient method

Since f ∈ F_L^{1,1}(R^p), for any x ∈ R^p we can approximate it by the following quadratic model:

    Q_L(y, x) := f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖²,   y ∈ R^p.   (49)

One can prove the following lower and upper bounds (see a previous lecture or [30]):

    f(x) + ∇f(x)^T(y − x) ≤ f(y)                        (by the convexity of f),
    f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L_f/2)‖y − x‖²      (by the Lipschitz continuity of ∇f),   (50)

for all x, y ∈ R^p. Now, combining (49) and g, for a given point x^k ∈ R^p and L > 0, we can define a quadratic convex model of F at x^k as follows:

    P_L(x, x^k) := Q_L(x, x^k) + g(x) = f(x^k) + ∇f(x^k)^T(x − x^k) + (L/2)‖x − x^k‖² + g(x).

Since P_L(·, x^k) is strongly convex for any L > 0, the following problem is well-defined and has a unique solution:

    S_L(x^k) := argmin_{x ∈ dom(F)} P_L(x, x^k) = prox_{(1/L)g}( x^k − (1/L)∇f(x^k) ).   (51)
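The single-valued map S_λ in (48) (equivalently, S_L in (51) with λ = 1/L) can be coded generically. The following sketch is ours, not from the notes; it assumes a calling convention prox_g(v, lam) that returns prox_{λg}(v). The residual ‖x − S_λ(x)‖ then serves as an optimality measure:

import numpy as np

def S(x, grad_f, prox_g, lam):
    # Fixed-point map of (48): S_lam(x) = prox_{lam*g}(x - lam * grad_f(x)).
    # With lam = 1/L this is exactly S_L in (51).
    return prox_g(x - lam * grad_f(x), lam)

def fixed_point_residual(x, grad_f, prox_g, lam):
    # ||x - S_lam(x)|| is zero exactly at solutions x* of (36),
    # so it can be used as a termination measure.
    return np.linalg.norm(x - S(x, grad_f, prox_g, lam))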

Definition 2 (Proximal-gradient mapping). The proximal-gradient mapping of F is defined as:

    G_L(x^k) := L( x^k − S_L(x^k) ).   (52)

In particular, if g ≡ 0, then G_L(x^k) ≡ ∇f(x^k).

Figure 9: An illustration of the lower and upper bounds for F: the quadratic convex model P_L(x, x^k) = f(x^k) + ∇f(x^k)^T(x − x^k) + (L/2)‖x − x^k‖² + g(x) lies above F(x) = f(x) + g(x), while f(x^k) + ∇f(x^k)^T(x − x^k) + g(x) lies below it, and S_L(x^k) minimizes the upper model.

The following lemma shows an optimality condition for (36), which will be used to terminate the algorithm.

Lemma 2.3. If G_L(x*) = 0, then x* is an optimal solution of (36).

Proof. If G_L(x*) = 0, then L(x* − S_L(x*)) = 0, which leads to x* = S_L(x*) since L > 0. Now, using the definition of S_L, we have x* = prox_{λg}(x* − λ∇f(x*)) for λ := 1/L. By (48), we can see that x* is an optimal solution of (36).

2.3.2 The algorithm

Now, using the fixed-point formulation (48), we can eventually design Algorithm 4 for solving (36).

Algorithm 4 (Basic proximal-gradient scheme (ISTA))
1: Inputs: Choose an arbitrary initial point x^0 ∈ dom(F) and a desired accuracy ε > 0.
2: Output: An ε-solution x^k of (36).
3: For k = 0, 1, ..., k_max, perform:
4:   Compute the gradient ∇f(x^k) and update
         x^{k+1} := prox_{αg}( x^k − α∇f(x^k) ),
     where α := 1/L_f, or α is determined by a line search.
5:   If ‖x^{k+1} − x^k‖ ≤ ε max{1, ‖x^k‖}, then terminate.
6: End for

Per-iteration complexity: Each iteration of Algorithm 4 requires one gradient ∇f(x^k) of f, one proximal operation prox_{(1/L)g} of g, and the evaluation of L. If L_f is available, then we set L := L_f. Otherwise, if we use a line-search procedure as in the earlier algorithms, then we need to evaluate F at each line-search iteration. Based on Lemma 2.3, we can terminate this algorithm using the stopping condition at Step 5.
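To make Algorithm 4 concrete, here is a minimal NumPy sketch (ours, not from the notes) on the lasso instance F(x) = (1/2)‖Ax − b‖² + λ‖x‖_1 of (36), using the step size α := 1/L_f with L_f = ‖A‖²_2 and soft-thresholding as the prox step:

import numpy as np

def ista_lasso(A, b, lam, eps=1e-6, k_max=5000):
    # A sketch of Algorithm 4 on the lasso problem
    #   F(x) = 0.5*||A x - b||^2 + lam*||x||_1,
    # where grad f(x) = A^T (A x - b), L_f = ||A||_2^2, and the prox of
    # alpha*lam*||.||_1 is soft-thresholding.
    x = np.zeros(A.shape[1])                     # initial point x^0
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2      # step size alpha := 1/L_f
    for _ in range(k_max):
        v = x - alpha * A.T @ (A @ x - b)        # gradient step
        x_new = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)  # prox step
        if np.linalg.norm(x_new - x) <= eps * max(1.0, np.linalg.norm(x)):
            return x_new                         # stopping test of Step 5
        x = x_new
    return x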
