
Selected Methods for Modern Optimization in Data Analysis
Department of Statistics and Operations Research, UNC-Chapel Hill, Fall 2018
Instructor: Quoc Tran-Dinh. Scribe: Quoc Tran-Dinh.

Lecture 4: Selected first-order methods for large-scale convex optimization

Index Terms: Standard gradient methods; accelerated gradient methods; proximal gradient methods; FISTA and enhancements; smoothing techniques; conditional gradient and Frank-Wolfe methods; coordinate descent methods; stochastic gradient descent methods; mirror descent method.

Copyright: This lecture is released under a Creative Commons License and Full Text of the License.

1 Gradient-type methods for unconstrained convex problems

We start by describing some numerical methods for a simple unconstrained convex minimization problem, which will later be generalized to composite problems and constrained settings.

1.1 Brief introduction

We start this lecture by studying a classical gradient descent method to solve the following unconstrained convex optimization problem:

    f^* := min_{x ∈ R^p} f(x),    (1)

where f : R^p → R ∪ {+∞} is a proper, closed, and convex function.

Assume that problem (1) has a solution. Then f^* is finite, and the optimal solution set of (1) can be written as

    X^* := { x^* ∈ dom(f) : f(x^*) = f^* }.    (2)

Under the assumption that f ∈ Γ_0(R^p), we can show that X^* is closed and convex. Since f is convex, any point x^* ∈ X^* is a global solution of (1), i.e., f(x) ≥ f(x^*) for all x ∈ dom(f).

1.1.1 Representative examples

Let us consider some common examples to motivate this topic.

Example 1 (Least-squares problem). The first and simplest example is the least-squares problem of the form

    f^* := min_{x ∈ R^p} { f(x) := (1/2)‖Ax − b‖² },    (3)

where A ∈ R^{n×p} and b ∈ R^n. This problem has many applications, which we will discuss in the next sections.
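As a running illustration (not part of the original notes), here is a minimal Python sketch of the objective and gradient for the least-squares example (3); the helper name make_least_squares and the toy data are our own assumptions:

```python
import numpy as np

def make_least_squares(A, b):
    """Objective f(x) = 0.5*||Ax - b||^2 of (3) and its gradient."""
    def f(x):
        r = A @ x - b
        return 0.5 * (r @ r)
    def grad_f(x):
        return A.T @ (A @ x - b)
    return f, grad_f

# a small toy instance
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
f, grad_f = make_least_squares(A, b)
```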

Example 2 (Empirical risk minimization). The second example is the following empirical convex risk minimization problem:

    f^* := min_{x ∈ R^p} { f(x) := (1/n) Σ_{i=1}^n f_i(x) },    (4)

where f_i : R^p → R ∪ {+∞} are n proper, closed, and convex functions. The function f_i is often induced by a loss function that depends on a given dataset. For instance, f_i(x) = log(1 + e^{−y_i(a_i^T x + μ_i)}) in logistic regression, where a_i ∈ R^p, μ_i ∈ R, and y_i ∈ {−1, 1} are given data. Several loss functions have been used in statistics, machine learning, and data analysis; more details and applications can be found in Lecture 1.

Example 3 (Min-max problem). Another example is the following structured convex optimization problem:

    f^* := min_{x ∈ R^p} { f(x) := max_{u ∈ U} { ⟨x, Au⟩ − φ(u) } },    (5)

where φ ∈ Γ_0(R^n), U is a nonempty, closed, and convex set in R^n, and A ∈ R^{p×n}. This problem is also called a min-max problem or a saddle-point problem. As a special example, consider the Chebyshev approximation problem min_{x∈R^p} ‖Ax − b‖_∞, which can be written in the form (5).

1.1.2 Optimality condition

In order to solve problem (1), we must rely on some intermediate characterization of f to develop algorithms. The most important characterization is the first-order condition (also called Fermat's rule).

Lemma 1.1 (Fermat's rule). Let f ∈ Γ_0(R^p) in (1). Then the following condition is necessary and sufficient for x^* ∈ dom(f) to be an optimal solution of (1):

    0 ∈ ∂f(x^*).    (6)

Proof. We note that x^* ∈ dom(f) is an optimal solution of (1) iff f(x) ≥ f(x^*) = f(x^*) + ⟨0, x − x^*⟩ for all x, which is exactly 0 ∈ ∂f(x^*) by the definition of the subdifferential. □

If f is smooth, then condition (6) reduces to the following nonlinear equation:

    ∇f(x^*) = 0.    (7)

Unfortunately, in practice we cannot find an exact solution x^* of (1), but only an approximation in the following sense:

Definition 1. Given an accuracy ε > 0, we say that x̄ is an ε-solution of (1) if f(x̄) − f^* ≤ ε.

Here, we note that f(x̄) ≥ f^*, which already gives us a lower bound. In the sequel, we aim at finding an ε-solution x̄ of (1). In some methods, we may be able to find x̄ such that ‖x̄ − x^*‖ ≤ ε in a given norm. When f is differentiable, we can also find x̄ such that ‖∇f(x̄)‖ ≤ ε.

1.2 Gradient methods

We study different gradient methods for solving the unconstrained convex minimization problem (1). Apart from convexity, we also require the following Lipschitz gradient continuity assumption:

Assumption 1. The objective function f of (1) is smooth and convex, and its gradient ∇f is Lipschitz continuous on dom(f) with Lipschitz constant L_f ∈ [0, +∞), i.e.,

    ‖∇f(x) − ∇f(y)‖ ≤ L_f ‖x − y‖,  for all x, y ∈ dom(f).

Assumption 1 restricts the applicability of (1), but still covers many important problems. For instance, f(x) = (1/2)‖Ax − b‖² and the logistic loss satisfy this assumption. There are several ways of deriving gradient methods. We show two approaches below.

1.2.1 Deriving the gradient method from the fixed-point principle

From the optimality condition (7), we have ∇f(x^*) = 0. We can rewrite this equation as

    x^* = x^* − (1/L)∇f(x^*) =: Grad_L(x^*),

for some L ∈ (0, +∞). This formulation shows that x^* is a fixed point of the mapping Grad_L := I − (1/L)∇f. Hence, we can apply the Picard iterative scheme x^{k+1} := Grad_L(x^k) to approximate this fixed point, which reads

    x^{k+1} := x^k − (1/L)∇f(x^k),    (8)

where α := 1/L > 0 is a given step size, which can be fixed or adaptively updated as α_k := 1/L_k > 0.

Question: How do we choose the step size α_k so that the sequence {x^k} generated by scheme (8) converges to x^*? The key is Assumption 1. Under this assumption, we can show that, with a proper choice of α, the sequence {x^k} generated by the gradient scheme (8) converges to a solution of (1). More practically, we can show that x^k approximates x^* in the sense of Definition 1.

1.2.2 Deriving the gradient method from a quadratic surrogate of f

Another view of the gradient method is as follows. Let x^k be the current iterate. We consider the quadratic model

    Q_L(x; x^k) := f(x^k) + ⟨∇f(x^k), x − x^k⟩ + (L/2)‖x − x^k‖²,    (9)

where L > 0 is a given constant. Clearly, Q_L(x^k; x^k) = f(x^k), and f(x) ≤ Q_{L_f}(x; x^k) for any x ∈ R^p (see Lecture 3). Our objective is to build a surrogate Q_L of f that is easy to minimize. Minimizing Q_L(·; x^k) over x, we obtain the solution x^{k+1} with Q_L(x^{k+1}; x^k) = min_{x∈R^p} Q_L(x; x^k), where

    x^{k+1} := arg min_{x ∈ R^p} Q_L(x; x^k) = x^k − (1/L)∇f(x^k).    (10)

Clearly, (10) is exactly the same as (8). The main idea is illustrated in Figure 1.

Figure 1: An illustration of the gradient method for solving (1): the surrogate Q_L(·; x^k) majorizes f, and its minimizer x^{k+1} decreases the objective.

The goal is to find L > 0 such that f(x^{k+1}) ≤ Q_L(x^{k+1}; x^k). Under Assumption 1, an obvious choice is L ≥ L_f. Clearly, the step size α in (8) relates to L as α = 1/L; if L ≥ L_f, then α ∈ (0, 1/L_f]. But, in practice, we do not need to choose L ≥ L_f as long as f(x^{k+1}) ≤ Q_L(x^{k+1}; x^k) holds.
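To make the surrogate view concrete, here is a small numerical check (our own illustration, reusing the least-squares toy setup from above): it verifies that the gradient step (10) satisfies the majorization f(x^{k+1}) ≤ Q_L(x^{k+1}; x^k) whenever L ≥ L_f, while the inequality may fail for smaller L.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20)); b = rng.standard_normal(50)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
L_f = np.linalg.norm(A.T @ A, 2)            # Lipschitz constant of grad f

def Q(x, xk, L):
    """Quadratic surrogate (9) of f around xk."""
    d = x - xk
    return f(xk) + grad_f(xk) @ d + 0.5 * L * (d @ d)

xk = rng.standard_normal(20)
for L in (0.5 * L_f, L_f, 2.0 * L_f):
    x_next = xk - grad_f(xk) / L            # gradient step (10) = argmin of Q_L
    print(L / L_f, f(x_next) <= Q(x_next, xk, L) + 1e-12)
# the majorization is guaranteed for every L >= L_f
```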

1.2.3 Descent lemma

We prove the following key lemma, which will be used to analyze the convergence of the gradient scheme (8) and of the accelerated gradient methods below.

Lemma 1.2. Let f ∈ S^{1,1}_{L_f,μ_f}(R^p) (i.e., f satisfies Assumption 1 and is μ_f-strongly convex, possibly with μ_f = 0). Then x^{k+1} generated by (8) satisfies the following estimate for all x ∈ dom(f):

    f(x^{k+1}) ≤ f(x) + ⟨∇f(x^k), x^k − x⟩ − (1/L)(1 − L_f/(2L))‖∇f(x^k)‖² − (μ_f/2)‖x^k − x‖².    (11)

Proof. By the μ_f-strong convexity of f, we have f(x) ≥ f(x^k) + ⟨∇f(x^k), x − x^k⟩ + (μ_f/2)‖x − x^k‖², which implies f(x^k) ≤ f(x) − ⟨∇f(x^k), x − x^k⟩ − (μ_f/2)‖x − x^k‖². On the other hand, by the Lipschitz gradient continuity of f, we have f(x^{k+1}) ≤ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L_f/2)‖x^{k+1} − x^k‖². Summing up the last two inequalities and using x^{k+1} − x^k = −(1/L)∇f(x^k), we have

    f(x^{k+1}) ≤ f(x) + ⟨∇f(x^k), x^{k+1} − x⟩ + (L_f/2)‖x^{k+1} − x^k‖² − (μ_f/2)‖x^k − x‖²
             = f(x) + ⟨∇f(x^k), x^k − x⟩ − (1/L)(1 − L_f/(2L))‖∇f(x^k)‖² − (μ_f/2)‖x^k − x‖²,

which proves (11). □

1.2.4 The algorithm and its convergence

Now, we can state the gradient descent method (8) formally as Algorithm 1 below. Here, we incorporate a relative stopping criterion and a line-search procedure (described below) to make it practical. If we skip the line-search step and simply choose L = L_f, then we obtain the vanilla gradient method.

Algorithm 1 (Gradient descent algorithm (GDA))
1: Inputs: Choose an arbitrary initial point x^0 ∈ R^p and a desired accuracy ε > 0.
2: Output: An ε-solution x^k of (1).
3: For k = 0, 1, …, k_max, perform:
4:   Compute the gradient ∇f(x^k). If ‖∇f(x^k)‖ / max{1, ‖∇f(x^0)‖} ≤ ε, then terminate.
5:   Perform a line-search procedure to find L such that 0.5·L_f < L ≤ L_f (see Subsection 1.6).
6:   Update x^{k+1} := x^k − (1/L)∇f(x^k).
7: End for

Note that at Step 4 we use max{1, ‖∇f(x^0)‖} instead of ‖∇f(x^0)‖ to guard against the case where ‖∇f(x^0)‖ is zero or very small.

The main computation of Algorithm 1 (i.e., its per-iteration complexity) consists of:
- the evaluation of ∇f(x^k); and
- the evaluation of L: if we know the Lipschitz constant L_f, we take L = L_f (the optimal value in the worst-case sense); if we use a line-search procedure with bisection, a few additional evaluations of f per iteration are needed on average.

From Theorem 1.3 below, we can also see that f(x^k) − f^* ≤ ε once k + 4 ≥ 2L_f‖x^0 − x^*‖²/ε. Let R_0 := ‖x^0 − x^*‖. Then the worst-case complexity of Algorithm 1 to achieve an ε-solution x^k such that f(x^k) − f^* ≤ ε is O(L_f R_0²/ε).
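A minimal Python sketch of Algorithm 1 (our own illustration, skipping the line-search step and fixing L := L_f), applied to the toy least-squares instance:

```python
import numpy as np

def gradient_descent(grad_f, x0, L, eps=1e-6, k_max=10000):
    """Vanilla gradient method (8): x^{k+1} = x^k - (1/L)*grad_f(x^k),
    with the relative stopping rule of Algorithm 1."""
    x = x0.copy()
    g0 = np.linalg.norm(grad_f(x0))
    for k in range(k_max):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps * max(1.0, g0):   # Step 4
            break
        x = x - g / L                                  # Step 6 with L = L_f
    return x, k

# usage on a least-squares instance
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20)); b = rng.standard_normal(50)
grad_f = lambda x: A.T @ (A @ x - b)
L_f = np.linalg.norm(A.T @ A, 2)
x_hat, iters = gradient_descent(grad_f, np.zeros(20), L_f)
```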

Using Lemma 1.2, we can prove the convergence of Algorithm 1 as follows.

Theorem 1.3. Assume that f in (1) satisfies Assumption 1 and x^* is an optimal solution of (1). Let {x^k} be the sequence generated by scheme (8) (or by Algorithm 1) with L > L_f/2. Then the following bound holds for all k ≥ 1:

    f(x^k) − f^* ≤ 2L²(f(x^0) − f^*)‖x^0 − x^*‖² / ( 2L²‖x^0 − x^*‖² + k(2L − L_f)(f(x^0) − f^*) ).    (12)

The right-hand side is optimal if we choose L := L_f, and (12) then becomes f(x^k) − f^* ≤ 2L_f‖x^0 − x^*‖²/(k + 4).

Proof. There are many ways to prove this theorem; we choose the argument from [30]. For simplicity, ‖·‖ denotes the Euclidean norm. Let us start with a simple derivation using (8):

    ‖x^{k+1} − x^*‖² = ‖x^k − (1/L)∇f(x^k) − x^*‖² = ‖x^k − x^*‖² − (2/L)⟨∇f(x^k), x^k − x^*⟩ + (1/L²)‖∇f(x^k)‖².

Using ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L_f)‖∇f(x) − ∇f(y)‖² from Lecture 3 with x := x^k, y := x^*, and ∇f(x^*) = 0, we have ⟨∇f(x^k), x^k − x^*⟩ ≥ (1/L_f)‖∇f(x^k)‖². Substituting this estimate into the last equality, we get

    ‖x^{k+1} − x^*‖² ≤ ‖x^k − x^*‖² − (1/L)(2/L_f − 1/L)‖∇f(x^k)‖².

Clearly, if L > L_f/2, then ‖x^{k+1} − x^*‖ ≤ ‖x^k − x^*‖: the sequence {‖x^k − x^*‖} is monotonically nonincreasing. Let us define r_k := ‖x^k − x^*‖ and Δ_k := f(x^k) − f^*. Using the convexity of f, we have

    Δ_k = f(x^k) − f^* ≤ ⟨∇f(x^k), x^k − x^*⟩ ≤ ‖∇f(x^k)‖ r_k ≤ ‖∇f(x^k)‖ r_0,

by the monotonicity of {r_k}. Hence, ‖∇f(x^k)‖ ≥ Δ_k/r_0. Next, using (11) with x := x^k (and μ_f = 0), we get

    f(x^{k+1}) ≤ f(x^k) − (1/L)(1 − L_f/(2L))‖∇f(x^k)‖².

Substituting ‖∇f(x^k)‖ ≥ Δ_k/r_0 into this inequality, we obtain

    Δ_{k+1} ≤ Δ_k − (ω/r_0²)Δ_k²,  where ω := (1/L)(1 − L_f/(2L)) = (2L − L_f)/(2L²) > 0.

This inequality also implies Δ_{k+1} ≤ Δ_k. Dividing both sides by Δ_kΔ_{k+1} > 0 gives

    1/Δ_{k+1} ≥ 1/Δ_k + (ω/r_0²)(Δ_k/Δ_{k+1}) ≥ 1/Δ_k + ω/r_0².

By induction, 1/Δ_k ≥ 1/Δ_0 + kω/r_0². Hence, using the definitions of ω and Δ_k, we finally obtain

    Δ_k ≤ Δ_0 r_0²/(r_0² + kωΔ_0) = 2L²(f(x^0) − f^*)‖x^0 − x^*‖² / ( 2L²‖x^0 − x^*‖² + k(2L − L_f)(f(x^0) − f^*) ),

which is (12). If we choose L := L_f, this estimate becomes f(x^k) − f^* ≤ 2L_f‖x^0 − x^*‖²/(k + 4), using the fact that f(x^0) − f^* ≤ (L_f/2)‖x^0 − x^*‖², which follows from the Lipschitz continuity of ∇f and ∇f(x^*) = 0. □

Theorem 1.3 shows that the objective residual f(x^k) − f^* converges to zero at the sublinear rate O(1/k). Moreover, from the stopping criterion of Algorithm 1 we have ‖∇f(x^k)‖ ≤ ε max{1, ‖∇f(x^0)‖}, which also certifies that x^k is an approximate solution of (1).

1.2.5 Strongly convex case

If f in (1) is strongly convex, how do we exploit this in Algorithm 1? The following theorem guides the choice of step size in Algorithm 1 and shows what convergence we can achieve.

Theorem 1.4. Under the assumptions of Theorem 1.3, let f be μ_f-strongly convex with μ_f > 0, and let {x^k} be generated by Algorithm 1 with L chosen such that L ≥ (L_f + μ_f)/2. Then we have

    ‖x^k − x^*‖² ≤ ω^k ‖x^0 − x^*‖² = (1 − 2μ_f L_f/(L(μ_f + L_f)))^k ‖x^0 − x^*‖².

If we choose L := (L_f + μ_f)/2 and set κ := L_f/μ_f (the condition number of f), then

    ‖x^k − x^*‖ ≤ ((κ−1)/(κ+1))^k ‖x^0 − x^*‖  and  f(x^k) − f^* ≤ (L_f/2)((κ−1)/(κ+1))^{2k} ‖x^0 − x^*‖².

Hence, the worst-case iteration-complexity to achieve an ε-solution of (1) is O(κ log(1/ε)).

Proof. Using the estimate ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (μ_f L_f/(μ_f + L_f))‖x − y‖² + (1/(μ_f + L_f))‖∇f(x) − ∇f(y)‖² from Lecture 3 (valid for strongly convex functions with Lipschitz gradient) with x := x^k, y := x^*, and ∇f(x^*) = 0, we can derive

    ‖x^{k+1} − x^*‖² = ‖x^k − x^*‖² − (2/L)⟨∇f(x^k), x^k − x^*⟩ + (1/L²)‖∇f(x^k)‖²
        ≤ (1 − 2μ_f L_f/(L(μ_f + L_f)))‖x^k − x^*‖² + (1/L)(1/L − 2/(μ_f + L_f))‖∇f(x^k)‖².

If L ≥ (L_f + μ_f)/2, the last term is nonpositive, so ‖x^{k+1} − x^*‖² ≤ ω‖x^k − x^*‖² with ω := 1 − 2μ_f L_f/(L(μ_f + L_f)) ∈ (0, 1). By induction, ‖x^k − x^*‖² ≤ ω^k‖x^0 − x^*‖², which proves the first statement of Theorem 1.4.

If we choose L := (L_f + μ_f)/2 and set κ := L_f/μ_f, then ω = ((L_f − μ_f)/(L_f + μ_f))² = ((κ−1)/(κ+1))², which gives ‖x^k − x^*‖ ≤ ((κ−1)/(κ+1))^k‖x^0 − x^*‖; the objective bound follows from f(x^k) − f^* ≤ (L_f/2)‖x^k − x^*‖², by the Lipschitz gradient continuity of f. This proves the second part of Theorem 1.4. For the last statement, note that (κ−1)/(κ+1) = (L_f − μ_f)/(L_f + μ_f) = 1 − 2μ_f/(L_f + μ_f) = 1 − 2/(κ+1). Hence, f(x^k) − f^* ≤ (L_f/2)(1 − 2/(κ+1))^{2k}R_0² ≤ ε once k ≥ log(L_f R_0²/(2ε)) / (2 log((κ+1)/(κ−1))); since log((κ+1)/(κ−1)) ≥ 2/(κ+1), this number of iterations is O(κ log(1/ε)). □

Remark 1. In the strongly convex case, the optimal step size is α := 1/L = 2/(L_f + μ_f), which depends on both μ_f and L_f. While L_f can be estimated relatively cheaply by a power method (see Appendix A.1), evaluating μ_f is in general expensive. Hence, if we do not know μ_f, computing this optimal step size remains a challenge.
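The Remark mentions estimating L_f by a power method (Appendix A.1 of the notes). As an illustration under our own naming, here is a minimal power-method sketch for L_f = ‖AᵀA‖ in the least-squares case:

```python
import numpy as np

def estimate_lipschitz(A, n_iter=50, seed=0):
    """Power method to estimate L_f = ||A^T A||_2 for f(x) = 0.5*||Ax - b||^2."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A.T @ (A @ v)            # apply A^T A without forming it
        v = w / np.linalg.norm(w)
    return v @ (A.T @ (A @ v))       # Rayleigh quotient ~ largest eigenvalue

A = np.random.default_rng(0).standard_normal((50, 20))
print(estimate_lipschitz(A), np.linalg.norm(A.T @ A, 2))  # should nearly match
```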

1.3 Lower-bound complexity of first-order methods

The main information used in Algorithm 1 is the gradient ∇f and the objective value f evaluated at each iteration. Any algorithm that uses only this information is called a first-order gradient method. Mathematically, any iterative first-order method for solving (1) generates a sequence {x^k} based on the scheme

    x^{k+1} ∈ x^0 + span{ ∇f(x^0), ∇f(x^1), …, ∇f(x^k) },    (M_FO)

where span denotes the linear span of a collection of vectors in R^p. Informally, a first-order method M_FO is optimal for the class of problems (1) satisfying Assumption 1 if no other first-order gradient method can solve every problem in this class with a better complexity than M_FO (up to a constant factor). In order to show a lower bound on the convergence rate of first-order gradient methods, we can construct a specific problem in this class and establish a target lower-bound complexity; then we show that there exists no first-order gradient algorithm that beats this lower bound. We proceed as follows.

Nesterov showed in [30] that there exists an instance of f satisfying Assumption 1 on which the first-order scheme (M_FO) exhibits at best an O(1/k²) rate. More formally, we state this result in the following theorem.

Theorem 1.5. For any integer k with 1 ≤ k ≤ (p−1)/2 and any x^0 ∈ R^p, there exists a convex function f for problem (1) satisfying Assumption 1 such that

    f(x^k) − f^* ≥ 3L_f‖x^0 − x^*‖² / (32(k+1)²),    (13)

for any sequence {x^k} generated by (M_FO) starting from x^0.

Proof. Nesterov constructed the following function:

    f(x) := (L_f/4)[ (1/2)( x_1² + Σ_{i=1}^{p−1}(x_i − x_{i+1})² + x_p² ) − x_1 ].

This is a quadratic function, which can be written as f(x) = (L_f/4)((1/2)xᵀQx − e_1ᵀx), where Q is the tridiagonal matrix with 2 on the diagonal and −1 on the sub- and super-diagonals, and e_1 := (1, 0, …, 0)ᵀ. Clearly, this function is convex, and ∇f(x) = (L_f/4)(Qx − e_1), which is Lipschitz continuous with constant L_f since 0 ⪯ (L_f/4)Q ⪯ L_f I. From the optimality condition of (1), we have Qx^* = e_1, which leads to the unique optimal solution x^* = (1 − 1/(p+1), 1 − 2/(p+1), …, 1 − p/(p+1))ᵀ and the optimal value

    f^* = f(x^*) = (L_f/4)((1/2)x^{*T}Qx^* − e_1ᵀx^*) = −(L_f/8) e_1ᵀx^* = −L_f p/(8(p+1)).

Moreover, we can show that

    ‖x^*‖² = Σ_{i=1}^p (1 − i/(p+1))² = (1/(p+1)²) Σ_{i=1}^p i² ≤ (p+1)/3.

Without loss of generality, we assume that x^0 = 0; otherwise, we can shift the function so that x^0 = 0. With x^0 = 0, we have ∇f(x^0) = −(L_f/4)e_1. Since we apply the scheme (M_FO) to solve (1), we have x^1 ∈ span{e_1}. By the tridiagonal form of Q, ∇f(x^1) ∈ span{e_1, e_2}, which shows x^2 ∈ span{e_1, e_2}. By induction, x^k ∈ span{e_1, e_2, …, e_k}, where e_i is the i-th unit vector of R^p. Therefore,

    f(x^k) ≥ inf{ f(x) : x_{k+1} = ⋯ = x_p = 0 } = −(L_f/8)·k/(k+1),

since f restricted to the first k coordinates is the same construction in dimension k. By our assumption, taking p = 2k+1 ≥ 2k+1,

    f(x^k) − f^* ≥ (L_f/8)( p/(p+1) − k/(k+1) ) = (L_f/8)·1/(2(k+1)) = L_f/(16(k+1)).

Finally, using this inequality and ‖x^0 − x^*‖² = ‖x^*‖² ≤ (p+1)/3 = 2(k+1)/3 from above, we have

    (f(x^k) − f^*)/‖x^0 − x^*‖² ≥ [L_f/(16(k+1))]·[3/(2(k+1))] = 3L_f/(32(k+1)²),

which is exactly (13). □

As we have seen from Theorem 1.3, the convergence rate of the gradient method (8) is O(1/k). This method is an instance of the first-order scheme (M_FO). But Theorem 1.5 shows that the lower-bound rate is O(1/k²).

Therefore, we can conclude that the standard gradient method is suboptimal, and this motivated researchers to develop new methods that match the lower bound. Note that, in the worst-case complexity sense, the lower-bound iteration-complexity corresponding to Theorem 1.5 is O(1/√ε), while the worst-case iteration-complexity of the gradient method (8) is O(1/ε).

1.4 The accelerated gradient algorithm: Nesterov's first optimal method

The accelerated gradient method was introduced by Yurii Nesterov in 1983 in [9] (this paper was originally in Russian, but was later translated into English). The method is also called an optimal gradient method (or Nesterov's optimal scheme), since it matches the lower-bound iteration-complexity O(1/√ε) presented in Theorem 1.5. There exist many variants of Nesterov's optimal gradient methods. In this subsection, we present some variants and recent improvements.

1.4.1 Standard accelerated gradient scheme

To keep our presentation simple, we first state the standard accelerated gradient scheme directly, then analyze its convergence guarantee. Later on, we will explain why this method actually accelerates the standard gradient method and provide different views of it. Mathematically, the standard accelerated gradient scheme for solving (1) simply performs three steps:

    x^{k+1} := y^k − (1/L_f)∇f(y^k),
    t_{k+1} := 0.5(1 + √(1 + 4t_k²)),
    y^{k+1} := x^{k+1} + ((t_k − 1)/t_{k+1})(x^{k+1} − x^k),    (AGDA)

where x^0 ∈ dom(f) is a given starting point, y^0 := x^0, and t_0 := 1.

There are different ways of presenting (AGDA). For example, if we define τ_k := 1/t_k and z^k := (1/τ_k)(y^k − (1 − τ_k)x^k), we can write (AGDA) in four steps as

    y^k := (1 − τ_k)x^k + τ_k z^k,
    x^{k+1} := y^k − (1/L_f)∇f(y^k),
    z^{k+1} := z^k − (1/τ_k)(y^k − x^{k+1}) = z^k − (1/(τ_k L_f))∇f(y^k),
    τ_{k+1} := (τ_k/2)(√(τ_k² + 4) − τ_k),    (14)

where τ_0 := 1 and z^0 := x^0.

We note that, defining β_k := (t_k − 1)/t_{k+1}, we can rewrite (AGDA) in one line as

    x^{k+1} := x^k + β_{k−1}(x^k − x^{k−1}) − (1/L_f)∇f( x^k + β_{k−1}(x^k − x^{k−1}) ).    (15)

The last expression shows that the accelerated gradient scheme has a similar form to the so-called Heavy-ball method [39]:

    x^{k+1} = x^k − (1/L)∇f(x^k) + γ_k(x^k − x^{k−1}),

which adds a momentum term γ_k(x^k − x^{k−1}) to the gradient step. However, in (15) the gradient is evaluated at the extrapolated point x^k + β_{k−1}(x^k − x^{k−1}) instead of at x^k as in the Heavy-ball method. In addition, the convergence rate of the Heavy-ball method is only O(1/k).
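Before turning to the formal algorithm, here is a minimal Python sketch of the three-step scheme (AGDA) (our own illustration; the function names and toy data are assumptions, not part of the notes):

```python
import numpy as np

def accelerated_gradient(grad_f, x0, L_f, k_max=1000):
    """Nesterov's accelerated scheme (AGDA) for smooth convex f."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(k_max):
        x_next = y - grad_f(y) / L_f                      # gradient step at y^k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum step
        x, t = x_next, t_next
    return x

# usage on the least-squares toy problem
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20)); b = rng.standard_normal(50)
grad_f = lambda x: A.T @ (A @ x - b)
L_f = np.linalg.norm(A.T @ A, 2)
x_acc = accelerated_gradient(grad_f, np.zeros(20), L_f)
```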

1.4.2 The algorithm

We can specify the standard accelerated gradient method algorithmically as in Algorithm 2 below.

Algorithm 2 (Accelerated gradient descent algorithm)
1: Inputs: Choose an initial point x^0 ∈ R^p and an accuracy ε > 0. Set y^0 := x^0 and t_0 := 1.
2: Output: An ε-solution x^k of (1).
3: For k = 0, 1, …, k_max, perform:
4:   Compute ∇f(y^k) and x^{k+1} := y^k − (1/L_f)∇f(y^k). If ‖x^{k+1} − x^k‖ / max{1, ‖x^k‖} ≤ ε, then terminate.
5:   Update t_{k+1} := 0.5(1 + √(1 + 4t_k²)) and y^{k+1} := x^{k+1} + ((t_k − 1)/t_{k+1})(x^{k+1} − x^k).
6: End for

Per-iteration complexity. Each iteration of this algorithm requires:
- the evaluation of one gradient ∇f(y^k);
- the evaluation of L: if we choose L := L_f, we only need to evaluate L_f once at the beginning of the algorithm, e.g., by the power method (Appendix A.1); if we use a line-search procedure, additional computation is required (see Subsection 1.6 for more details);
- the update of y^k: just one subtraction and one addition of two vectors, and one scalar-vector multiplication; the computational complexity of this step is O(p).

Clearly, the per-iteration complexity of Algorithm 2 is nearly the same as that of Algorithm 1 except for the last step. However, as we will see in the next subsection, Algorithm 2 achieves a much better convergence rate.

1.4.3 Convergence and complexity analysis

The following theorem shows the convergence of Algorithm 2; its proof relies on Lemma 1.2.

Theorem 1.6. Assume that f in (1) satisfies Assumption 1 and x^* is an optimal solution of (1). Let {x^k} be the sequence generated by (AGDA). Then we have the following bound:

    f(x^k) − f(x) ≤ 2L_f‖x^0 − x‖² / (k+1)²,  ∀x ∈ R^p, k ≥ 0.    (16)

Hence, the worst-case iteration-complexity to achieve an ε-solution x^k of (1) such that f(x^k) − f^* ≤ ε is O(√(L_f‖x^0 − x^*‖²/ε)). This method is optimal: it matches the lower-bound complexity of Theorem 1.5.

Proof. Using (11) from Lemma 1.2 with x^k := y^k, μ_f = 0, and L := L_f, we get

    f(x^{k+1}) ≤ f(x) + ⟨∇f(y^k), y^k − x⟩ − (1/(2L_f))‖∇f(y^k)‖²,  ∀x ∈ R^p.    (17)

Using this estimate with x = x^k, we get

    f(x^{k+1}) ≤ f(x^k) + ⟨∇f(y^k), y^k − x^k⟩ − (1/(2L_f))‖∇f(y^k)‖².    (18)

Multiplying (17) by τ_k and (18) by (1 − τ_k) and summing up the results, we get

    f(x^{k+1}) ≤ (1 − τ_k)f(x^k) + τ_k f(x) + ⟨∇f(y^k), y^k − (1 − τ_k)x^k − τ_k x⟩ − (1/(2L_f))‖∇f(y^k)‖².

Using τ_k z^k = y^k − (1 − τ_k)x^k, we derive from this inequality that

    f(x^{k+1}) − f(x) ≤ (1 − τ_k)[f(x^k) − f(x)] + τ_k⟨∇f(y^k), z^k − x⟩ − (1/(2L_f))‖∇f(y^k)‖²
        = (1 − τ_k)[f(x^k) − f(x)] + (τ_k²L_f/2)[ ‖z^k − x‖² − ‖z^k − (1/(τ_k L_f))∇f(y^k) − x‖² ]
        = (1 − τ_k)[f(x^k) − f(x)] + (τ_k²L_f/2)[ ‖z^k − x‖² − ‖z^{k+1} − x‖² ].    (19)

Since τ_k = 1/t_k, it is easy to check that t_{k+1} := 0.5(1 + √(1 + 4t_k²)) is equivalent to (1 − τ_{k+1})/τ_{k+1}² = 1/τ_k². Rearranging (19) and using this identity, we get

    (1/τ_k²)[f(x^{k+1}) − f(x)] + (L_f/2)‖z^{k+1} − x‖² ≤ ((1 − τ_k)/τ_k²)[f(x^k) − f(x)] + (L_f/2)‖z^k − x‖²
        = (1/τ_{k−1}²)[f(x^k) − f(x)] + (L_f/2)‖z^k − x‖².

By induction, and using τ_0 = 1 (so that (1 − τ_0)/τ_0² = 0), we get

    (1/τ_k²)[f(x^{k+1}) − f(x)] + (L_f/2)‖z^{k+1} − x‖² ≤ ((1 − τ_0)/τ_0²)[f(x^0) − f(x)] + (L_f/2)‖z^0 − x‖² = (L_f/2)‖z^0 − x‖².

This inequality implies f(x^k) − f(x) ≤ (τ_{k−1}²L_f/2)‖z^0 − x‖², or equivalently (since z^0 = x^0):

    f(x^k) − f(x) ≤ (τ_{k−1}²L_f/2)‖x^0 − x‖² = (L_f/(2t_{k−1}²))‖x^0 − x‖².    (20)

Now, note that t_{k+1} = 0.5(1 + √(1 + 4t_k²)) ≥ 0.5 + t_k, so t_k ≥ 0.5k + t_0 = (k+2)/2, and hence t_{k−1} ≥ (k+1)/2. Using this in (20), we obtain the bound (16). The worst-case complexity is a consequence of (16) when we substitute x by x^*. The optimality of the rate is due to Theorem 1.5. □

To compare Algorithm 1 and Algorithm 2, assume that we can estimate a tight upper bound R_0 = 1 for ‖x^0 − x^*‖, that L_f = 100, and that we take a tolerance ε = 10⁻³ and wish to achieve an approximate solution x^k such that f(x^k) − f^* ≤ ε.
- If we apply Algorithm 1 to solve (1), then we need at most k_1 := 2L_f R_0²/ε = 200,000 iterations.
- If we apply Algorithm 2 to solve (1), then we only need at most k_2 := ⌈√(2L_f R_0²/ε)⌉ = ⌈√200,000⌉ ≈ 447 iterations.

Clearly, Algorithm 2 outperforms Algorithm 1 in terms of iterations, while having almost the same per-iteration complexity.

1.4.4 Strongly convex case

We consider an accelerated gradient method to solve (1) for an L_f-smooth and μ_f-strongly convex function f ∈ S^{1,1}_{L_f,μ_f}(R^p). In this case, we can present (AGDA) in the simpler form

    x^{k+1} := y^k − (1/L_f)∇f(y^k),
    y^{k+1} := x^{k+1} + ((√L_f − √μ_f)/(√L_f + √μ_f))(x^{k+1} − x^k),    (21)

where y^0 := x^0. We do not provide a formal proof for this scheme, but let us briefly explain the idea. Using (11) with x^k := y^k and L := L_f, we have

    f(x^{k+1}) ≤ f(x) + ⟨∇f(y^k), y^k − x⟩ − (1/(2L_f))‖∇f(y^k)‖² − (μ_f/2)‖y^k − x‖².

In this case, we can choose the step size α := 1/L_f in the first line and use the constant momentum coefficient (√L_f − √μ_f)/(√L_f + √μ_f) in the second line of Step 5 of Algorithm 2 to achieve the convergence rate

    f(x^k) − f^* ≤ ((L_f + μ_f)/2)(1 − √(μ_f/L_f))^k ‖x^0 − x^*‖².

This shows that the scheme (21) has O(√(L_f/μ_f) log(1/ε)) worst-case iteration-complexity. However, the momentum of this algorithm requires knowing the strong convexity parameter μ_f, which is often very hard to evaluate in practice. Comparing this complexity with Theorem 1.4, we see that (21) achieves a factor of √κ, while in Theorem 1.4 we only achieve a factor of κ, where κ := L_f/μ_f is the condition number of f. As proved in [8, 30], (21) is also optimal in the class of first-order gradient methods for strongly convex f. We skip the details.
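A one-line change to the earlier AGDA sketch gives the strongly convex variant (21) with its constant momentum coefficient (again our own illustration, not part of the notes):

```python
import numpy as np

def accelerated_gradient_sc(grad_f, x0, L_f, mu_f, k_max=1000):
    """Accelerated scheme (21): constant momentum
    beta = (sqrt(L_f) - sqrt(mu_f)) / (sqrt(L_f) + sqrt(mu_f))."""
    beta = (np.sqrt(L_f) - np.sqrt(mu_f)) / (np.sqrt(L_f) + np.sqrt(mu_f))
    x, y = x0.copy(), x0.copy()
    for _ in range(k_max):
        x_next = y - grad_f(y) / L_f       # gradient step at y^k
        y = x_next + beta * (x_next - x)   # constant-momentum extrapolation
        x = x_next
    return x
```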

1.5 Nesterov's second optimal method: the estimate-sequence approach

The accelerated scheme (AGDA) achieves the optimal rate, but it cannot be extended to handle non-Euclidean norms. We present another fast gradient scheme to solve the following constrained convex optimization problem, which covers (1) as a special case when X = R^p, and which can work with non-Euclidean norms under certain conditions:

    f^* := min_{x ∈ X} f(x),    (22)

where f ∈ F^{1,1}_{L_f}(R^p) and X is a simple, nonempty, closed, and convex subset of R^p. The objective function f in (22) still satisfies Assumption 1. Here, we say that X is simple if there is a proximity function d_X (see Lecture 2) such that

    min_{x ∈ X} { ⟨v, x⟩ + d_X(x) }

can be solved efficiently (i.e., with a closed-form solution or with a low-order polynomial-time algorithm). Without loss of generality, we assume that d_X is μ_d-strongly convex with μ_d = 1.

1.5.1 The construction of Nesterov's optimal scheme

Given a point y^k ∈ dom(f), let ℓ_k(x) := f(y^k) + ⟨∇f(y^k), x − y^k⟩, and let a_k > 0. We define

    φ_k(x) := Σ_{i=0}^k a_i ℓ_i(x) = Σ_{i=0}^k a_i [ f(y^i) + ⟨∇f(y^i), x − y^i⟩ ].    (23)

Let {A_k} be the sequence defined by A_0 := a_0 > 0 and A_{k+1} := A_k + a_{k+1}; hence A_k = Σ_{i=0}^k a_i. We consider the following problem:

    ψ_k := min_{x ∈ X} { φ_k(x) + L_f d_X(x) },    (24)

where L_f is the Lipschitz constant of ∇f and d_X is the proximity function of X. Clearly, by convexity of f, we have ℓ_k(x) ≤ f(x), and hence

    φ_k(x) = Σ_{i=0}^k a_i ℓ_i(x) ≤ Σ_{i=0}^k a_i f(x) = A_k f(x),  ∀x ∈ dom(f).    (25)

Our goal is to construct a sequence {x^k} in X such that

    (R_k):  A_k f(x^k) ≤ ψ_k := min_{x ∈ X} { φ_k(x) + L_f d_X(x) }.

Combining this condition and (25), we obtain

    f(x^k) − f(x) ≤ L_f d_X(x)/A_k, ∀x ∈ X  ⟹  f(x^k) − f^* ≤ L_f d_X(x^*)/A_k.    (26)

Let us define v^k to be the solution of (24), i.e.,

    v^k := arg min_{x ∈ X} { φ_k(x) + L_f d_X(x) }.    (27)

By the 1-strong convexity of d_X, we have ψ_k + (L_f/2)‖x − v^k‖² ≤ φ_k(x) + L_f d_X(x) for all x ∈ X. We construct {x^k} recursively as follows. First, we compute the parameters a_{k+1} and A_{k+1} := A_k + a_{k+1} such that

    A_{k+1} ≥ a_{k+1}².    (28)

Then, with τ_k := a_{k+1}/A_{k+1}, we update

    y^{k+1} := (1 − τ_k)x^k + τ_k v^k.    (29)

Finally, we define

    x^{k+1} := T_X(y^{k+1}) := arg min_{x ∈ X} { f(y^{k+1}) + ⟨∇f(y^{k+1}), x − y^{k+1}⟩ + (L_f/2)‖x − y^{k+1}‖² }.    (30)

The following lemma shows that (y^k, a_k, A_k) satisfies the condition (R_k) above.

Lemma 1.7. Let (y^k, a_k, A_k) be generated by (28), (29), and (30), respectively. Then (R_k) holds for all k ≥ 0.

Proof. We prove the lemma by induction. For k = 0, we have φ_0(x) = a_0[f(y^0) + ⟨∇f(y^0), x − y^0⟩]. Since y^0 := arg min_{x∈X} d_X(x) and d_X is 1-strongly convex, d_X(x) ≥ (1/2)‖x − y^0‖² on X (normalizing d_X(y^0) = 0). Hence, by (24), for any a_0 ∈ (0, 1],

    ψ_0 = min_{x∈X}{ a_0[f(y^0) + ⟨∇f(y^0), x − y^0⟩] + L_f d_X(x) }
        ≥ a_0 min_{x∈X}{ f(y^0) + ⟨∇f(y^0), x − y^0⟩ + (L_f/2)‖x − y^0‖² }
        ≥ a_0 f(T_X(y^0)) = a_0 f(x^0) = A_0 f(x^0),

where the last inequality follows from Lecture 3 (the model minimized at T_X(y^0) majorizes f at T_X(y^0), since ∇f is L_f-Lipschitz). This establishes (R_0).

Assume that (R_k) holds for some k ≥ 0. From the definition of φ_k in (23),

    φ_{k+1}(x) = Σ_{i=0}^{k+1} a_i ℓ_i(x) = φ_k(x) + a_{k+1}[ f(y^{k+1}) + ⟨∇f(y^{k+1}), x − y^{k+1}⟩ ].

Hence, we have

    φ_{k+1}(x) + L_f d_X(x) = φ_k(x) + L_f d_X(x) + a_{k+1}[ f(y^{k+1}) + ⟨∇f(y^{k+1}), x − y^{k+1}⟩ ]
        ≥ ψ_k + (L_f/2)‖x − v^k‖² + a_{k+1}[ f(y^{k+1}) + ⟨∇f(y^{k+1}), x − y^{k+1}⟩ ]
        ≥ A_k f(x^k) + a_{k+1}[ f(y^{k+1}) + ⟨∇f(y^{k+1}), x − y^{k+1}⟩ ] + (L_f/2)‖x − v^k‖²    [by (R_k)]
        ≥ A_k f(y^{k+1}) + A_k⟨∇f(y^{k+1}), x^k − y^{k+1}⟩ + a_{k+1}f(y^{k+1}) + a_{k+1}⟨∇f(y^{k+1}), x − y^{k+1}⟩ + (L_f/2)‖x − v^k‖²    (a)
        = A_{k+1}f(y^{k+1}) + A_{k+1}⟨∇f(y^{k+1}), (A_k/A_{k+1})(x^k − y^{k+1}) + (a_{k+1}/A_{k+1})(x − y^{k+1})⟩ + (L_f/2)‖x − v^k‖².    (b)  (31)

Here, we use f(x^k) ≥ f(y^{k+1}) + ⟨∇f(y^{k+1}), x^k − y^{k+1}⟩ in (a) due to the convexity of f, and A_{k+1} = A_k + a_{k+1} in (b).

Next, define τ_k := a_{k+1}/A_{k+1}. Then τ_k ∈ (0, 1] and A_k/A_{k+1} = 1 − τ_k. We have

    (A_k/A_{k+1})(x^k − y^{k+1}) + (a_{k+1}/A_{k+1})(x − y^{k+1}) = (1 − τ_k)x^k + τ_k x − y^{k+1} = x̂ − y^{k+1},

where x̂ := (1 − τ_k)x^k + τ_k x ∈ X. Since y^{k+1} is defined by (29), x̂ − y^{k+1} = τ_k(x − v^k); hence x − v^k = (1/τ_k)(x̂ − y^{k+1}). Substituting these relations into (31), and using the fact that τ_k² ≤ 1/A_{k+1} from (28), we can derive

    φ_{k+1}(x) + L_f d_X(x) ≥ A_{k+1}[ f(y^{k+1}) + ⟨∇f(y^{k+1}), x̂ − y^{k+1}⟩ + (L_f/(2A_{k+1}τ_k²))‖x̂ − y^{k+1}‖² ]
        ≥ A_{k+1}[ f(y^{k+1}) + ⟨∇f(y^{k+1}), x̂ − y^{k+1}⟩ + (L_f/2)‖x̂ − y^{k+1}‖² ]
        ≥ A_{k+1} min_{x ∈ X}{ f(y^{k+1}) + ⟨∇f(y^{k+1}), x − y^{k+1}⟩ + (L_f/2)‖x − y^{k+1}‖² }
        ≥ A_{k+1} f(x^{k+1}),

where x^{k+1} is computed by (30) and the last inequality uses the L_f-Lipschitz continuity of ∇f. Now, taking the minimum over x ∈ X on the left-hand side, we obtain ψ_{k+1} := min_{x∈X}{ φ_{k+1}(x) + L_f d_X(x) } ≥ A_{k+1} f(x^{k+1}), which shows that x^{k+1} satisfies (R_{k+1}). □

1.5.2 The algorithm and its convergence guarantee

Putting all the ingredients analyzed above together, we obtain Algorithm 3.

Algorithm 3 (Nesterov's second optimal gradient method)
1: Inputs: Choose y^0 := arg min_{x∈X} d_X(x) and a_0 ∈ (0, 1]. Set A_0 := a_0 and x^0 := T_X(y^0).
2: Output: An ε-solution x^k of (22).
3: For k = 0, 1, …, k_max, perform:
4:   Find a_{k+1} and A_{k+1} such that A_{k+1} := A_k + a_{k+1} and A_{k+1} ≥ a_{k+1}².
5:   Compute v^k := arg min_{x∈X} { Σ_{i=0}^k a_i[ f(y^i) + ⟨∇f(y^i), x − y^i⟩ ] + L_f d_X(x) }.
6:   Update y^{k+1} := (1 − τ_k)x^k + τ_k v^k with τ_k := a_{k+1}/A_{k+1}.
7:   Update x^{k+1} := T_X(y^{k+1}) from (30).
8: End for

Next, we analyze the update rule (28). If we choose a_k := (k+1)/2, then A_k = Σ_{i=0}^k (i+1)/2 = (k+1)(k+2)/4. Clearly, a_{k+1}² = (k+2)²/4 ≤ (k+2)(k+3)/4 = A_{k+1}, which satisfies (28). Hence τ_k = 2/(k+3); clearly, τ_k ∈ (0, 1) and a_0 = 1/2 ≤ 1. We can summarize the update rules of a_k and A_k at Step 4 and of τ_k at Step 6 as

    a_k := (k+1)/2,  A_k := (k+1)(k+2)/4,  and  τ_k := 2/(k+3).    (32)

In general, we can choose a_{k+1} by solving a² − a − A_k ≤ 0, which leads to a_{k+1} ≤ (1 + √(1 + 4A_k))/2.

Finally, we summarize the convergence of Algorithm 3 in the following theorem, which is a direct consequence of (26) and (32) and is stated without proof.

Theorem 1.8. Let {x^k} be the sequence generated by Algorithm 3 for solving (22). Then

    f(x^k) − f^* ≤ 4L_f d_X(x^*) / ((k+1)(k+2)).

Consequently, the convergence rate of Algorithm 3 is O(L_f d_X(x^*)/k²).

Here, we make a few remarks on Algorithm 3.
- Algorithm 3 requires only one gradient ∇f(y^k) per iteration. It has the flexibility to choose the proximity function d_X that best describes the feasible domain X; we will see later that this is really important in some applications. It also works with non-Euclidean norms.
- Algorithm 3 requires two projection-type computations onto X per iteration: one at Step 5 to compute v^k, and one at Step 7 to update x^{k+1}. This can be a disadvantage compared to Algorithm 2 when the projection π_X is expensive.
- Algorithm 3 can be viewed as a combination of a primal gradient method and a dual averaging scheme (see the next lectures).
- The sequence of objective values {f(x^k)} in Algorithm 3 is nonmonotone (i.e., f(x^{k+1}) ≤ f(x^k) may fail for some k ≥ 0). To impose monotone behavior, we can modify Step 7 by selecting x^{k+1} among {T_X(y^{k+1}), y^{k+1}, x^k} such that f(x^{k+1}) = min{ f(T_X(y^{k+1})), f(y^{k+1}), f(x^k) }. However, in this case, we need to evaluate the objective value at two additional points per iteration.
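To make Algorithm 3 concrete, here is a sketch under simplifying assumptions of our own (not part of the notes): we take X to be a box [lo, hi]^p with the Euclidean proximity function d_X(x) = ½‖x − c‖², c the box center (so it is 1-strongly convex, y^0 = c, and Steps 5 and 7 reduce to closed-form clippings), and we use the rule (32).

```python
import numpy as np

def nesterov_second_box(grad_f, lo, hi, L_f, k_max=500):
    """Sketch of Algorithm 3 on X = [lo, hi]^p with d_X(x) = 0.5*||x - c||^2."""
    proj = lambda z: np.clip(z, lo, hi)
    c = 0.5 * (lo + hi)                  # y^0 = argmin_X d_X
    y = c.copy()
    g = grad_f(y)
    x = proj(y - g / L_f)                # x^0 = T_X(y^0)
    g_acc = np.zeros_like(y)             # accumulates sum_i a_i * grad f(y^i)
    for k in range(k_max):
        g_acc += 0.5 * (k + 1) * g       # a_k = (k+1)/2, rule (32)
        v = proj(c - g_acc / L_f)        # Step 5 in closed form for this d_X
        tau = 2.0 / (k + 3)              # Step 6: tau_k = 2/(k+3)
        y = (1.0 - tau) * x + tau * v
        g = grad_f(y)                    # one new gradient per iteration
        x = proj(y - g / L_f)            # Step 7: x^{k+1} = T_X(y^{k+1})
    return x
```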

1.6 Implementation discussion and examples

We now discuss some implementation aspects of the gradient and fast gradient methods presented previously. We focus on three techniques: stopping criteria, line search, and restarting. These techniques do not change (or only slightly change) the theoretical guarantees of the algorithms, but they can significantly enhance practical performance in terms of computational time and number of iterations.

1.6.1 Stopping criteria

For the standard gradient method (8), by its optimality condition ∇f(x^*) = 0, we can use

    ‖∇f(x^k)‖ ≤ ε max{1, ‖∇f(x^0)‖}    (33)

as a stopping criterion, where ε is a given tolerance and x^0 is the starting point. For the fast gradient method, the situation is more complicated: we have a convergence guarantee on the sequence {x^k}, while the gradient is computed at y^k. Hence, we cannot use exactly the same condition as in the standard gradient method. Of course, we can occasionally evaluate the gradient ∇f(x^k) (e.g., every 5 iterations) to check this condition. The sequence {x^k} converges to a solution of (1), so one practical strategy to terminate the algorithm is as follows: compute the relative change of x^k and check

    ‖x^{k+1} − x^k‖ / max{‖x^k‖, 1} ≤ ε.

If this condition holds, we can additionally compute ∇f(x^k) just once to check the stopping criterion (33) above.

While gradient-type methods converge to a solution x^* from any initial point in dom(f), we should note that the complexity bounds of both the gradient and fast gradient methods depend on the distance ‖x^0 − x^*‖ from the initial point x^0 to the solution set. Hence, if we have any prior information about the solution of (1), we can exploit it to choose a good initial point x^0 and thereby reduce the number of iterations.

1.6.2 Line search

Sometimes, evaluating the Lipschitz constant L_f is expensive. Even when we know L_f, the step size α := 1/L_f reflects worst-case behavior and may not perform as well as adaptive step sizes. Hence, a line-search strategy may give better practical performance.

What is a line-search procedure? To illustrate the idea, assume that we want to solve the unconstrained convex problem (1). Given a point x ∈ dom(f), a vector d ≠ 0 in R^p is called a descent direction of f at x if ⟨∇f(x), d⟩ < 0. Indeed, since f ∈ F^{1,1}_L(R^p), the bound from Lecture 3 gives

    f(x + td) ≤ f(x) + t⟨∇f(x), d⟩ + (t²L_f/2)‖d‖²,  t > 0.

Since ⟨∇f(x), d⟩ < 0, we can choose t > 0 sufficiently small (e.g., 0 < t < −2⟨∇f(x), d⟩/(L_f‖d‖²)) such that ⟨∇f(x), d⟩ + (tL_f/2)‖d‖² < 0. Hence, f(x + td) < f(x), which shows that moving along the direction d decreases the objective f. The cosine cos(d, −∇f(x)) between −∇f(x) and d measures the relative slope of the direction d with respect to f. Clearly, cos(d, −∇f(x)) = −⟨d, ∇f(x)⟩/(‖∇f(x)‖‖d‖) = 1 if d = −∇f(x); therefore, d = −∇f(x) is called the steepest descent direction.

Now, given a descent direction d of f at x, how can we find a step size t > 0 such that f(x + td) < f(x)? There are several ways of doing this. For instance, if we know L_f, we can take t < −2⟨∇f(x), d⟩/(L_f‖d‖²).

In particular, if d = −∇f(x), then any t < 2/L_f works, and t = 1/L_f is the optimal choice in the worst case. In general, we can find t by solving the one-variable minimization problem min_{t>0} φ(t), where φ(t) := f(x + td). This technique is called an exact line-search procedure. In practice, we often only approximate t. The simplest way is a bisection scheme: starting from t := t_0, at each iteration we halve t and compare the objective values; once f(x + td) < f(x), we terminate.

Now, we present a simple backtracking line search using the well-known Armijo-type condition on f as follows:
1. At iteration k, given an estimate L_0 of L_f with L_0 < L_f, set L := L_0 as a starting value.
2. Iterate: if the condition

       f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (c/L)‖∇f(x^k)‖²

   holds, where c ∈ (0, 1] is a fixed constant, then terminate. Otherwise, set L := 2L and repeat this step.
3. Once this loop terminates, set α_k := 1/L as the step size at iteration k of Algorithm 1.

In the fast gradient method, Algorithm 2, we can use the same line-search procedure, but we need to guarantee that L_{k+1} ≥ L_k for every iteration k in order to preserve the convergence guarantee; see [6] for more details. Hence, at iteration k, we start the search from L := L_k instead of L_0.

Question: How do we estimate an initial value L_0 for this line-search procedure? We note that

    L_f := sup_{x ≠ y} ‖∇f(x) − ∇f(y)‖/‖x − y‖ ≥ ‖∇f(x^1) − ∇f(x^0)‖/‖x^1 − x^0‖

for given points x^0, x^1 ∈ dom(f) with x^0 ≠ x^1. Hence, we can estimate L_0 as

    L_0 := c_L ‖∇f(x^1) − ∇f(x^0)‖/‖x^1 − x^0‖

by taking two arbitrary points x^0 ≠ x^1 in dom(f). Here, the constant c_L can be set to c_L := 1/4 to damp the estimate and make it likely that L_0 < L_f.

If ∇f is L_f-Lipschitz continuous, then the backtracking line-search procedure always terminates after a finite number of iterations i_k, which we can estimate as follows. Each line-search iteration doubles L, starting from its initial value L_0. Hence, after i_k iterations, L = 2^{i_k}L_0 ≤ 2L_f (the search may overshoot L_f by at most one doubling), so i_k ≤ log₂(L_f/L_0) + 1. The maximum number of line-search iterations is therefore

    i_k := ⌈log₂(L_f/L_0)⌉ + 1.

This number of iterations corresponds to the number of evaluations of f(x^k + (1/L)d^k) with d^k := −∇f(x^k). If the cost of evaluating f is high, this line-search procedure may incur a significant computational cost. Note that we can use a different rule for increasing L instead of doubling: for example, choose β > 1 and update L := βL. In this case, the maximum number of line-search iterations is i_k := ⌈log_β(L_f/L_0)⌉ + 1.
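A compact Python sketch of the gradient method with this doubling backtracking line search (our own illustration; the constant c and the starting value L_0 are inputs):

```python
import numpy as np

def gd_backtracking(f, grad_f, x0, L0, c=0.5, k_max=1000, eps=1e-8):
    """Gradient method with the backtracking line search of Subsection 1.6.2:
    accept step 1/L once f(x - g/L) <= f(x) - (c/L)*||g||^2 holds."""
    x = x0.copy()
    for _ in range(k_max):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        L = L0                       # restart the search (for Algorithm 2, start from L_k)
        while f(x - g / L) > f(x) - (c / L) * (g @ g):
            L *= 2.0                 # double L until the descent condition holds
        x = x - g / L
    return x
```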

1.6.3 Restarting strategies

The accelerated gradient method exhibits an oscillating behavior, as we can see in Figure 2, when the objective function f is strongly convex (or restricted strongly convex). This figure shows the convergence behavior of the objective residual f(x^k) − f^* for a quadratic function f(x) = ½xᵀQx under different estimates q of the inverse condition number q = μ_f/L_f. In this test, the momentum coefficient β = (1 − √q)/(1 + √q) is used. The oscillating behavior is due to the nonmonotonicity of the objective residual sequence {f(x^k) − f^*}.

Figure 2: An oscillating behavior of accelerated gradient methods for minimizing f(x) = ½xᵀQx [37].

In order to reduce this oscillation, one can inject a restarting procedure: restarting means starting the algorithm again, taking the current iterate as the new starting point, which erases the memory of previous iterations and resets the momentum back to zero. This step is performed in Algorithm 2 whenever the oscillation starts. There are different conditions for implementing restarting strategies; we refer the reader to [37, 45] for some heuristic strategies, and to [5] for some theoretical aspects behind this enhancement. Here, we present a simple strategy from [37], which uses a condition on the iterative sequence to restart. After Step 4 of Algorithm 2, we add the restarting condition:

    if ⟨y^k − x^{k+1}, x^{k+1} − x^k⟩ > 0, then y^{k+1} := x^{k+1} and t_{k+1} := 1.

(Note that y^k − x^{k+1} = (1/L_f)∇f(y^k), so this is a gradient-based restart condition.) We can also replace this condition by a fixed restarting condition: if mod(k, k_rs) = 0, then y^{k+1} := x^{k+1} and t_{k+1} := 1, where the restart period k_rs can be chosen as, e.g., k_rs = 50, 100, or 200.
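A sketch of Algorithm 2 with the gradient-based restart above (our own illustration, not part of the notes):

```python
import numpy as np

def agda_restart(grad_f, x0, L_f, k_max=1000):
    """Accelerated gradient with the adaptive restart of Subsection 1.6.3:
    reset the momentum whenever <y^k - x^{k+1}, x^{k+1} - x^k> > 0."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(k_max):
        x_next = y - grad_f(y) / L_f
        if (y - x_next) @ (x_next - x) > 0.0:   # restart condition
            y, t = x_next.copy(), 1.0           # erase the momentum
        else:
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x_next + ((t - 1.0) / t_next) * (x_next - x)
            t = t_next
        x = x_next
    return x
```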
1.7 Applications and numerical examples

We consider two simple problems to illustrate the performance of gradient methods and their accelerated variants. The first example is simply a least-squares problem, while the second one is a logistic regression.

Example 4 (Least-squares problem). We consider the following least-squares problem:

    f^* := min_{x∈R^p} { f(x) := ½‖Ax − b‖² },    (34)

where A ∈ R^{m×p} and b ∈ R^m. If m < p, we have an underdetermined case; if m > p, an overdetermined case. If m = p and A is full-rank, the optimal value is zero and (34) has a unique solution x^*, which solves the linear system Ax = b. The function f is convex, and its gradient ∇f(x) = Aᵀ(Ax − b) is Lipschitz continuous with L_f := ‖AᵀA‖. This norm can be computed efficiently by a power method (see Appendix A.1).

Figure 3 shows the convergence behavior of three different algorithms — Algorithm 1, Algorithm 2, and the restarting variant of Algorithm 2 — on a synthetic instance with n = 800 and p = 1000. The y-axis shows the relative objective residual (f(x^k) − f^*)/max{1, |f^*|} and the relative error ‖x^{k+1} − x^k‖/max{‖x^k‖, 1}, respectively. While Algorithm 2 is clearly faster than Algorithm 1, its restarting variants further outperform both methods. If we inject a backtracking line-search procedure into both algorithms, we obtain even better performance, as observed in Figure 3.

Now, let us add a regularizer (μ_f/2)‖x‖² with μ_f = 0.1 to f to obtain the strongly convex objective f(x) := ½‖Ax − b‖² + (μ_f/2)‖x‖². Testing on the same setting as in the first case, we obtain the convergence behavior of the six variants shown in Figure 4.

Figure 3: The convergence of six variants of the gradient method (gradient, line-search gradient, fast gradient, line-search fast gradient, restarting fast gradient, and line-search restarting fast gradient) for solving (34).

Figure 4: The convergence of six variants of the gradient method for solving (34) with f(x) := ½‖Ax − b‖² + (μ_f/2)‖x‖².

Here, we can clearly observe the oscillating behavior of the objective residuals f(x^k) − f^* of Algorithm 2 in Figure 4. The restarting variants achieve an approximate solution at machine precision within a few iterations, which is much faster than in the previous case shown in Figure 3.

Example 5 (Logistic regression). We consider another problem arising from logistic regression:

    min_{x∈R^p} { f(x) := (1/n) Σ_{i=1}^n log(1 + exp(−y_i(a_iᵀx + μ))) },    (35)

where {(a_i, y_i)}_{i=1}^n is a given dataset and μ is a given intercept. Let A be the matrix whose columns are formed from the a_i for i = 1, …, n. Then, we can easily compute the gradient of f as

    ∇f(x) = −(1/n) Σ_{i=1}^n [ exp(−y_i(a_iᵀx + μ)) / (1 + exp(−y_i(a_iᵀx + μ))) ] y_i a_i,
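For completeness (our own illustration, not part of the notes), here is a numerically stable evaluation of the loss (35) and its gradient, with the data points a_i stored as the rows of A:

```python
import numpy as np

def logistic_loss_grad(x, A, y, mu=0.0):
    """Loss (35), f(x) = (1/n)*sum_i log(1 + exp(-y_i*(a_i^T x + mu))), and its gradient."""
    n = A.shape[0]                         # rows of A are the data points a_i
    z = -y * (A @ x + mu)
    loss = np.mean(np.logaddexp(0.0, z))   # log(1 + e^z), numerically stable
    s = 1.0 / (1.0 + np.exp(-z))           # = e^z / (1 + e^z)
    grad = -(A.T @ (s * y)) / n
    return loss, grad
```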

which is Lipschitz continuous with the Lipschitz constant L_f := (1/(4n))‖AᵀA‖, as shown in Lecture 3. We test six variants of the gradient method (both non-accelerated and accelerated algorithms) on the dataset w4a downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html; the result is shown in Figure 5.

Figure 5: The convergence behavior of six variants of the gradient method for solving (35). Here, the number of data points is n = 7366 and the problem dimension is p = 300.

In this figure, we see that restarting does not help to enhance the performance at all. This happens due to the lack of strong convexity of f. However, the line-search procedure still slightly improves the performance in terms of iterations, although it requires additional computation time for evaluating f in the line-search routine.

Now, to see the effect of strong convexity on the performance of the gradient method, we slightly modify the objective f in (35) by adding a small quadratic term:

    f(x) := (1/n) Σ_{i=1}^n log(1 + exp(−y_i(a_iᵀx + μ))) + (μ_f/2)‖x‖².

Let us take μ_f := 0.01. The convergence of the six variants is shown in Figure 6.

Figure 6: The convergence behavior of six algorithmic variants for solving a strongly convex instance of (35).

In this case, all the variants perform better than in Figure 5. Moreover, the restarting strategy now makes a difference, significantly improving performance.

2 Proximal gradient-type methods

We have seen from Lecture 1 that several convex optimization models can be reformulated as composite convex minimization problems. In this section, we study a class of gradient-type methods to solve them. Let us recall this problem here for convenience of presentation:

    F^* := min_{x∈R^p} { F(x) := f(x) + g(x) },    (36)

where f and g are both proper, closed, and convex (see Lecture 1 for definitions). Several practical models in statistics, machine learning, computer science, and engineering can be reformulated as the minimization of the sum of two convex functions. Here, the first objective term usually characterizes a loss, risk, or cost function, or a data-fidelity term, while the second one can be used to regularize or to promote some desired structure in the optimal solution [6, 34, 47]. The numerical methods we design in this section rely on the following blanket assumption:

Assumption 2. The objective terms f and g are proper, closed, and convex. Moreover, they satisfy:
a) f is a smooth function with Lipschitz gradient, i.e., there exists a constant L_f ∈ [0, +∞) such that

    ‖∇f(x) − ∇f(y)‖ ≤ L_f‖x − y‖,  ∀x, y ∈ dom(f).    (37)

b) g is a nonsmooth function but is equipped with a tractable proximal operator (see Lecture 2):

    prox_g(x) := arg min_{z ∈ dom(g)} { g(z) + ½‖z − x‖² }.    (38)

We denote the classes of all functions satisfying a) and b) by F^{1,1}_L(R^p) and F_prox(R^p), respectively. Let us denote by x^* an optimal solution of (36), i.e., F(x^*) ≤ F(x) for all x ∈ R^p. The set of all optimal solutions of (36) is denoted by

    S^* := { x^* ∈ dom(F) : F(x^*) = F^* },

where F^* is the optimal value of (36), which is finite, and dom(F) := dom(f) ∩ dom(g). To avoid trivial cases, we assume that dom(F) is nonempty and S^* is nonempty.

2.1 Motivating examples

The following applications can be cast into the composite convex minimization problem of the form (36). These problems were described in Lecture 1; we recall them here for completeness.

Example 6 (LASSO problem). Given an observation or measurement vector y ∈ R^n and a sensing/measurement matrix Φ ∈ R^{n×q}, where n ≪ p, suppose that, as in signal processing, an input signal z ∈ R^q produces the measurement vector y via the linear model y = Φz + n, where n is a noise vector, very often assumed to be Gaussian. The signal vector z can typically be represented by a sparse vector x ∈ R^p via some transformation z = Zx (e.g., FFT or wavelets). Plugging this into our model, we finally get y = ΦZx + n =: Ax + n with A := ΦZ.

[Figure: illustration of the linear measurement model y = Ax, with a wide matrix A acting on a sparse vector x.]
Example 7 (Sparse logistic regression). Given a dataset $\mathcal{D} := \{(w^{(1)}, y^{(1)}), (w^{(2)}, y^{(2)}), \dots, (w^{(n)}, y^{(n)})\}$, where $w^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \{-1, +1\}$ for $i = 1, \dots, n$, the conditional probability of a label $y$ given $w$ is defined as $P(y \mid w) = 1/\big(1 + e^{-y(x^{\top}w + \mu)}\big)$, where $x \in \mathbb{R}^p$ is a weight vector and $\mu$ is the intercept. The aim is to find a sparse weight vector $x$ via the maximum log-likelihood principle. This problem can be formulated as the following composite convex minimization problem:
\[
\min_{x\in\mathbb{R}^p} \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(y^{(i)}(w^{(i)\top}x + \mu)\big)}_{f(x)} + \underbrace{\lambda\|x\|_1}_{g(x)}, \tag{41}
\]
where $w^{(i)}$ is the $i$-th row of the data matrix $W \in \mathbb{R}^{n\times p}$, $\lambda > 0$ is a regularization parameter, and $\ell$ is the logistic loss function given by $\ell(\tau) := \log(1 + e^{-\tau})$. As seen previously, both $f$ and $g$ satisfy Assumption 1.

Example 8 (Image deconvolution). Given a noisy or blurred image $y \in \mathbb{R}^{n\times p}$, the aim is to recover a clean image $x$ via $y = A(x) + n$, where $A$ is a convolution (blurring) operator and $n$ is Gaussian noise. This problem can be reformulated in the form of (36) as follows:
\[
\min_{x\in\mathbb{R}^{n\times p}} \underbrace{\tfrac{1}{2}\|A(x) - y\|_F^2}_{f(x)} + \underbrace{\lambda\|x\|_{\mathrm{TV}}}_{g(x)}, \tag{42}
\]
where $\lambda > 0$ is a regularization parameter and $\|\cdot\|_{\mathrm{TV}}$ is the total variation norm (TV-norm), which promotes noise removal while preserving sharp-edge structures:
\[
\|x\|_{\mathrm{TV}} :=
\begin{cases}
\sum_{i,j} \big(|x_{i,j+1} - x_{i,j}| + |x_{i+1,j} - x_{i,j}|\big) & \text{for the anisotropic case},\\[4pt]
\sum_{i,j} \sqrt{|x_{i,j+1} - x_{i,j}|^2 + |x_{i+1,j} - x_{i,j}|^2} & \text{for the isotropic case}.
\end{cases}
\]
By solving this problem, we can approximately recover the original image $x$. Figure 7 illustrates how to use the optimization model (42) to de-blur an image. A code sketch of the two TV norms is given below.
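To make the two TV norms concrete, here is a minimal NumPy sketch (the helper names are ours). Note that, unlike the $\ell_1$-norm, the TV-norm has no simple closed-form prox; in practice $\mathrm{prox}_{\lambda\|\cdot\|_{\mathrm{TV}}}$ is computed by an inner iterative routine.

```python
import numpy as np

def tv_anisotropic(x):
    """Anisotropic TV norm: sum of absolute horizontal and vertical differences."""
    return np.abs(np.diff(x, axis=1)).sum() + np.abs(np.diff(x, axis=0)).sum()

def tv_isotropic(x):
    """Isotropic TV norm: sum over pixels of the Euclidean norm of the discrete
    gradient; boundary differences are padded with zeros so both arrays have
    the image's shape."""
    dh = np.diff(x, axis=1, append=x[:, -1:])  # x_{i,j+1} - x_{i,j}
    dv = np.diff(x, axis=0, append=x[-1:, :])  # x_{i+1,j} - x_{i,j}
    return np.sqrt(dh**2 + dv**2).sum()
```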

Figure 7: Original image $x$, observed (noisy or blurred) image $y$, and the clean image $x$ recovered after deconvolution.

Example 9 (Sparse inverse covariance estimation). Given a dataset $\mathcal{D} := \{z^{(1)}, \dots, z^{(N)}\}$ generated from a Gaussian Markov random field $z$ in $\mathbb{R}^p$, let $\Sigma$ be the covariance matrix corresponding to the graphical model of this Gaussian Markov random field. The aim is to learn a sparse matrix $\Theta$ that approximates the inverse $\Sigma^{-1}$ of the covariance matrix $\Sigma$. See Figure 8 for an illustration.

Figure 8: An illustration of the graphical model that leads to (43).

This problem can be formulated as follows (see a previous lecture):
\[
\min_{\Theta \succ 0} \underbrace{\mathrm{Tr}(\Sigma\Theta) - \log\det(\Theta)}_{f(\Theta)} + \underbrace{\lambda\|\mathrm{vec}(\Theta)\|_1}_{g(\Theta)}, \tag{43}
\]
where $\Theta \succ 0$ means that $\Theta$ is symmetric and positive definite, $\lambda > 0$ is a regularization parameter, and $\mathrm{vec}$ is the vectorization operator. More details of this problem can be found in [3, 4, 40, 46]. Under some appropriate assumptions on the data, one can show that $f(\cdot)$ in (43) also satisfies Assumption 1; a sketch of the gradient computation for $f$ is given below.
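For intuition on why $f$ in (43) can serve as the smooth part: by the standard matrix-calculus identities $\nabla_\Theta \mathrm{Tr}(\Sigma\Theta) = \Sigma$ and $\nabla_\Theta \log\det(\Theta) = \Theta^{-1}$ (for $\Theta \succ 0$), the gradient is $\nabla f(\Theta) = \Sigma - \Theta^{-1}$. A minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def f_logdet(Theta, Sigma):
    """Smooth part of (43): trace(Sigma @ Theta) - log det(Theta)."""
    sign, logabsdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return np.trace(Sigma @ Theta) - logabsdet

def grad_f_logdet(Theta, Sigma):
    """Gradient of the smooth part: Sigma - inv(Theta)."""
    return Sigma - np.linalg.inv(Theta)
```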
2.2.1 Optimality condition

From the well-known Moreau-Rockafellar theorem [4], for two proper, closed, and convex functions $f$ and $g$ in $\Gamma_0(\mathbb{R}^p)$ such that $\mathrm{ri}(\mathrm{dom}(f)) \cap \mathrm{ri}(\mathrm{dom}(g)) \neq \emptyset$, we have
\[
\partial F(x) = \partial(f + g)(x) = \partial f(x) + \partial g(x), \quad \forall x \in \mathrm{dom}(F) := \mathrm{dom}(f) \cap \mathrm{dom}(g).
\]
Now, we can write the optimality condition for (36) as
\[
0 \in \partial F(x^\star) \equiv \partial f(x^\star) + \partial g(x^\star), \quad x^\star \in \mathrm{dom}(F). \tag{44}
\]
If $f \in \mathcal{F}_L^{1,1}(\mathbb{R}^p)$ (i.e., $\nabla f$ is Lipschitz continuous), then (44) reduces to
\[
0 \in \partial F(x^\star) \equiv \nabla f(x^\star) + \partial g(x^\star), \quad x^\star \in \mathrm{dom}(F). \tag{45}
\]
The following lemma gives a necessary and sufficient condition for a point $x^\star$ to be an optimal solution of (36).
Lemma.. The necessary and sufficient condition for a point x dom F ) to be globally optimal to 36) is 44) or 45) if f F, L Rp ). Proof. By definition of the subdifferential of F, we have F x) F x ) ξ x x ), for any ξ F x ), x dom f). If 44) or 45)) is satisfied, then F x) F x ) 0. Consequently, x is a global solution to 36). Conversely, if x is a globally optimal of 36) then F x) F x ), x dom F ) F x) F x ) 0 x x ), x R p. This leads to 0 F x ) or 44) or 45))... Properties of proximal operators Let prox g be a prox-operator of a proper, closed and convex function g Γ 0 R p ). We recall some basic properties of prox g from Lecture. Lemma.. Given a proper, closed, and convex function g Γ 0 R p ). Let prox g be the proximal operator of g defined by 38). Then, the following properties hold: a) prox g is well-defined and single-valued for any x R p. b) prox g x) satisfies the following inclusion x prox g x) + gprox g x)), x R p. 46) c) x is a fixed point of prox g, iff x = arg min x R p gx), i.e.: d) prox g is a non-expansive operator, i.e.: x = arg min x R p gx) x = prox g x ). prox g x) prox g y) x y, x, y R p. Proof. Since g is convex, g ) + x is strongly convex with the parameter µ = for any x R p. There exists an optimal solution z x) = prox g x) of min z R p gz) + z x for any x R p. Moreover, since the function is strongly convex, this z x) is unique. The statement a) is proved. Writing down the optimality condition of 38), we obtain 0 gprox g x)) + prox g x) x. Rearranging this expression, we obtain 46). The statement b) is proved. A point x = arg min x R p gx) is equivalent to 0 gx ). This is equivalent to 0 gx ) + x x, which shows that x satisfies the optimality condition of 38) with x = x. Therefore, x = prox g x ). Conversely, by using 46), we also obtain 0 gx ), which means that x = arg min x R p gx). The statement c) is proved. Finally, we prove d). Let u := prox g x) and v := prox g y). By 46) we have x u gu) and y v gv). However, since g is convex, we have u v) p q) 0, where p gu) and q gv) the monotonicity of g, see, e.g., [30]). Using this relation with p = x u and q = y v, we have u v) x u y + v) 0. This implies u v) x y) u v. Finally, by using the Cauchy-Schwarz inequality, we have u v x y u v) x y) u v, which leads to d). /78

2.2.3 Optimality condition vs. fixed-point formulation

The optimality condition (44) is equivalent to
\[
x^\star \in \mathrm{prox}_{\lambda g}\big(x^\star - \lambda \partial f(x^\star)\big) := T_\lambda(x^\star), \tag{47}
\]
for any $\lambda > 0$, which says that $x^\star$ is a fixed point of the mapping $T_\lambda$. Alternatively, the optimality condition (45) is equivalent to
\[
x^\star = \mathrm{prox}_{\lambda g}\big(x^\star - \lambda \nabla f(x^\star)\big) := S_\lambda(x^\star), \tag{48}
\]
for any $\lambda > 0$, which says that $x^\star$ is a fixed point of the mapping $S_\lambda$. While $T_\lambda$ is a set-valued operator, $S_\lambda$ is a single-valued operator.

Proof. We prove (48); the proof of (47) is done similarly. From (45), we can write
\[
0 \in \nabla f(x^\star) + \partial g(x^\star) \iff x^\star - \lambda\nabla f(x^\star) \in x^\star + \lambda\partial g(x^\star) \equiv (\mathbb{I} + \lambda\partial g)(x^\star),
\]
where $\mathbb{I}$ is the identity mapping, i.e., $\mathbb{I}(x) = x$. Using the basic property b) of $\mathrm{prox}_{\lambda g}$ from Lemma 2.2, we have $x^\star = \mathrm{prox}_{\lambda g}(x^\star - \lambda\nabla f(x^\star))$. Since $\mathrm{prox}_{\lambda g}$ and $\nabla f$ are single-valued, we obtain (48). $\square$

2.3 Proximal-gradient methods

Now, we present the first algorithm for solving (36), called the proximal-gradient method [6, 34]. This algorithm is sometimes called ISTA [6], referring to the Iterative Shrinkage-Thresholding Algorithm. In the context of monotone inclusions [4], this algorithm is known as the forward-backward splitting (FBS) method. The algorithm relies on the following assumptions, which have been stated in Assumption 1:

Assumption 3.
a) $f \in \mathcal{F}_L^{1,1}(\mathbb{R}^p)$ and $g \in \mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$ (i.e., $f$ has a Lipschitz gradient and $g$ has a tractable proximity operator).
b) Oracle: proximal-gradient algorithms typically use $F(\cdot)$, $\nabla f(\cdot)$, and $\mathrm{prox}_{\lambda g}(\cdot)$ at each iteration.

2.3.1 Derivation of the proximal-gradient method

Since $f \in \mathcal{F}_L^{1,1}(\mathbb{R}^p)$, for any $x \in \mathbb{R}^p$, we can approximate it by the following quadratic model:
\[
Q_L(y, x) := f(x) + \nabla f(x)^{\top}(y - x) + \tfrac{L}{2}\|y - x\|^2, \quad y \in \mathbb{R}^p. \tag{49}
\]
One can prove the following lower and upper bounds (see a previous lecture or [30]):
\[
\begin{array}{ll}
f(x) + \nabla f(x)^{\top}(y - x) \le f(y) & \text{by the convexity of } f,\\
f(y) \le f(x) + \nabla f(x)^{\top}(y - x) + \tfrac{L_f}{2}\|y - x\|^2 & \text{by the Lipschitz continuity of } \nabla f,
\end{array} \tag{50}
\]
for all $x, y \in \mathbb{R}^p$. Now, combining (49) and $g$, for a given point $x^k \in \mathbb{R}^p$ and $L > 0$, we can define a quadratic convex model of $F$ at $x^k$ as follows:
\[
P_L(x, x^k) := Q_L(x, x^k) + g(x) = f(x^k) + \nabla f(x^k)^{\top}(x - x^k) + \tfrac{L}{2}\|x - x^k\|^2 + g(x).
\]
Since $P_L(\cdot, x^k)$ is strongly convex for any $L > 0$, the following problem is well-defined and has a unique solution:
\[
S_L(x^k) := \arg\min_{x\in\mathrm{dom}(F)} P_L(x, x^k) \equiv \mathrm{prox}_{(1/L)g}\big(x^k - (1/L)\nabla f(x^k)\big). \tag{51}
\]
A code sketch of this step for the lasso problem (39) is given below.
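Concretely, for the lasso problem (39) the step (51) is a gradient step on $f$ followed by soft-thresholding; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def prox_grad_step(x, A, y, lam, L):
    """One step S_L(x) of (51) for f(x) = 0.5*||Ax - y||^2, g(x) = lam*||x||_1.
    The prox of (1/L)*g is soft-thresholding at level lam/L."""
    z = x - A.T @ (A @ x - y) / L                              # forward (gradient) step
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # backward (prox) step
```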

Definition (Proximal-gradient mapping). The proximal-gradient mapping of $F$ is defined as
\[
G_L(x^k) := L\big(x^k - S_L(x^k)\big). \tag{52}
\]
In particular, if $g \equiv 0$, then $G_L(x^k) \equiv \nabla f(x^k)$. Figure 9 illustrates the lower and upper bounds of $F$.

Figure 9: An illustration of the lower bound $f(x^k) + \nabla f(x^k)^{\top}(x - x^k) + g(x)$ and the upper bound $P_L(x, x^k) := f(x^k) + \nabla f(x^k)^{\top}(x - x^k) + \tfrac{L}{2}\|x - x^k\|^2 + g(x)$ of $F(x) = f(x) + g(x)$, together with the points $x^k$, $S_L(x^k) = x^{k+1}$, and $x^\star$.

The following lemma shows an optimality condition of (36), which will be used to terminate the algorithm.

Lemma 2.3. If $G_L(x^\star) = 0$, then $x^\star$ is an optimal solution of (36).

Proof. If $G_L(x^\star) = 0$, then we have $L(x^\star - S_L(x^\star)) = 0$, which leads to $x^\star = S_L(x^\star)$ due to $L > 0$. Now, using the definition of $S_L$, we have $x^\star = \mathrm{prox}_{\lambda g}(x^\star - \lambda\nabla f(x^\star))$ for $\lambda := 1/L$. By using (48), we can see that $x^\star$ is an optimal solution of (36). $\square$

2.3.2 The algorithm

Now, using the fixed-point formulation (48), we can eventually design Algorithm 4 for solving (36).

Algorithm 4 (Basic proximal-gradient scheme (ISTA))
1: Inputs: Choose an arbitrary initial point $x^0 \in \mathrm{dom}(F)$ and a desired accuracy $\varepsilon > 0$.
2: Output: An $\varepsilon$-solution $x^k$ of (36).
3: For $k = 0, 1, \dots, k_{\max}$, perform:
4: Compute the gradient $\nabla f(x^k)$. If $\|x^{k+1} - x^k\| \le \varepsilon\max\{1, \|x^k\|\}$, then terminate.
5: Update $x^{k+1} := \mathrm{prox}_{\alpha g}\big(x^k - \alpha\nabla f(x^k)\big)$, where $\alpha := 1/L_f$ or $\alpha$ is determined by line-search.
6: End for.

Per-iteration complexity: Each iteration of Algorithm 4 requires one gradient $\nabla f(x^k)$ of $f$, one proximal operator $\mathrm{prox}_{(1/L)g}$ of $g$, and the evaluation of $L$. If $L_f$ is available, then we set $L := L_f$. Otherwise, if we use a line-search procedure (as in an earlier algorithm), then we need to evaluate $F$ at each line-search iteration. Based on Lemma 2.3, we can terminate the algorithm using a condition such as the one at Step 4. A complete code sketch of Algorithm 4 for the lasso problem (39) follows below.
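Putting the pieces together, below is a NumPy sketch of Algorithm 4 specialized to the lasso problem (39), using $L_f = \|A\|^2$ (the largest squared singular value of $A$, which is the Lipschitz constant of $\nabla f$) and the stopping rule of Step 4. All names and test data are ours, for illustration only.

```python
import numpy as np

def ista_lasso(A, y, lam, eps=1e-6, k_max=5000):
    """Algorithm 4 (ISTA) for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L_f = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of grad f
    alpha = 1.0 / L_f                         # fixed step size, Step 5
    x = np.zeros(A.shape[1])                  # x^0 = 0 is in dom(F)
    for _ in range(k_max):
        grad = A.T @ (A @ x - y)              # Step 4: gradient of f at x^k
        z = x - alpha * grad
        x_new = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
        if np.linalg.norm(x_new - x) <= eps * max(1.0, np.linalg.norm(x)):
            return x_new                      # Step 4: stopping criterion
        x = x_new
    return x

# Usage on a random sparse-recovery instance:
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 200))
x_true = np.zeros(200)
x_true[:5] = rng.standard_normal(5)
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista_lasso(A, y, lam=0.1)
```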