A Second-Order Method for Strongly Convex l 1 -Regularization Problems


Noname manuscript No. (will be inserted by the editor)

A Second-Order Method for Strongly Convex l1-Regularization Problems

Kimon Fountoulakis and Jacek Gondzio

Technical Report ERGO, June 2013

Abstract In this paper a second-order method for solving large-scale strongly convex l1-regularized problems is developed. The proposed method is a Newton-CG (Conjugate Gradients) algorithm with backtracking line-search embedded in a doubly-continuation scheme. Worst-case iteration complexity of the proposed Newton-CG is established. Based on the analysis of Newton-CG, worst-case iteration complexity of the doubly-continuation scheme is obtained. Numerical results are presented on large-scale problems for the doubly-continuation Newton-CG algorithm, which show that the proposed second-order method competes favourably with state-of-the-art first-order methods. In addition, l1-regularized Sparse Least-Squares problems are discussed for which a parallel block coordinate descent method stagnates.

Keywords l1-regularization · Strongly convex optimization · Second-order methods · Iteration complexity · Continuation · Newton Conjugate-Gradients method

Mathematics Subject Classification 68W40, 65K05, 90C06, 90C25, 90C30, 90C51

J. Gondzio is supported by EPSRC Grant EP/I1717/1

Kimon Fountoulakis, School of Mathematics and Maxwell Institute, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, United Kingdom. K.Fountoulakis@sms.ed.ac.uk

Jacek Gondzio, School of Mathematics and Maxwell Institute, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, United Kingdom. J.Gondzio@ed.ac.uk

1 Introduction

We are concerned with the solution of the following optimization problem

minimize f(x) := τ‖x‖₁ + φ(x), (1)

where x ∈ Rᵐ, τ > 0 and ‖·‖₁ is the l1-norm. The following assumptions are made:

– The function φ(x) is twice differentiable and strongly convex, which implies that at any x its second derivative ∇²φ(x) is uniformly bounded:

λ_m I ⪯ ∇²φ(x) ⪯ λ₁ I, (2)

with 0 < λ_m ≤ λ₁, where I is the identity matrix of appropriate dimension.

– The second derivative of φ(x) is Lipschitz continuous:

‖∇²φ(y) − ∇²φ(x)‖ ≤ L_φ ‖y − x‖, (3)

for any x, y, where L_φ is the Lipschitz constant of ∇²φ(x) and ‖·‖ is the l2-norm.

Recently there has been an increased interest in the solution of strongly convex problems such as (1). Such problems can be found in well-established scientific fields, for example, Regression [39] and Machine Learning [49]. Additionally, a number of efficient methods have been proposed in the optimization field [6,19,35,37,4 4,45,47] for the solution of problem (1). Observe that all approaches proposed in the previously cited papers are first-order methods. There has also been a series of interesting papers which describe the adaptation of second-order methods to such problems [4,1,13,16,1 4,36], but despite their authors' efforts they do not seem to compete favourably with state-of-the-art first-order methods [48].

First-order methods rely on properties of the l1-norm to obtain the new direction at each iteration. In particular, very often the direction is obtained by minimizing exactly an upper bound of the objective function in problem (1),

d := arg min_p τ‖x + p‖₁ + φ(x) + ∇φ(x)ᵀp + (L_φ/2)‖p‖²,

where x is the current iterate and d is the direction [7]. Other first-order methods use the decomposability of the former problem and solve it only for some chosen coordinates [35]. In this case, the Lipschitz constant is replaced by partial Lipschitz constants for each chosen coordinate.
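The exact minimization above has the well-known closed form given by soft-thresholding. As a minimal illustration (our own sketch, not code from the paper; the function names are ours, and the Lipschitz constant L and parameter tau stand for the L_φ and τ of the text), the first-order direction can be computed as:

```python
import numpy as np

def soft_threshold(z, t):
    # Component-wise closed-form minimizer of t*|u| + 0.5*(u - z)^2.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def first_order_direction(x, grad_phi, L, tau):
    # d = argmin_p  tau*||x + p||_1 + grad_phi' p + (L/2)||p||^2,
    # i.e. a proximal-gradient step of length 1/L from the current iterate x.
    return soft_threshold(x - grad_phi / L, tau / L) - x
```

At a minimizer of τ‖x‖₁ + φ(x) the computed direction is zero, which is the standard fixed-point property of proximal-gradient steps.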
Another approach replaces the l1-norm with a first-order differentiable parameterized function which approximates it [8]. These efficient techniques for handling the non-smooth part, the l1-norm, in combination with the strong convexity of the problem, have established first-order methods as the prevalent approach. For second-order methods, on the contrary, the most commonly used techniques are the following.

– The approach of [11,36], in which the l1-norm ‖x‖₁ is replaced with Σᵢ₌₁ᵐ (uᵢ + vᵢ), subject to uᵢ ≥ 0, vᵢ ≥ 0, i = 1,2,…,m, and then the optimal x is recovered by x = u − v.
– The approach of [1], in which the direction at every iteration is obtained by approximately solving

d := arg min_p τ‖x + p‖₁ + φ(x) + ∇φ(x)ᵀp + ½ pᵀ∇²φ(x)p,

using a coordinate descent algorithm.

The drawback of these approaches is that in the first case the dimension of the problem is doubled and additional simple constraints are introduced, whilst in the second case the convergence of the coordinate descent algorithm is sensitive to the spectral properties of the matrix ∇²φ(x). In this paper, our goals are:

1. To keep the unconstrained nature of the problem and maintain its size unchanged, while the consequences of the non-smoothness of the l1-norm are weakened.
2. To employ a Newton-CG algorithm with backtracking line-search which is embedded in a doubly-continuation scheme for further acceleration.
3. To give a complete analysis of Newton-CG with backtracking line-search, i.e. a proof of global convergence, global and local convergence rate results, a local region of fast convergence rate and worst-case iteration complexity, as in the standard analysis of the Newton algorithm with backtracking line-search in [ ]. Based on a range of published papers and a book [3,7,8,6,31,3], from various fields in which Newton-CG with backtracking line-search is applied, there has been no similar analysis. To be precise, the analysis in the previous citations provides global convergence guarantees for various globalization techniques of Newton-CG. Additionally, local convergence rate results are obtained assuming that the method is initialized in a neighbourhood of the optimal solution which is not explicitly defined.
4. To provide worst-case iteration complexity of the doubly-continuation scheme based on the iteration complexity of Newton-CG with backtracking line-search.
Continuation schemes are generally well-known among optimizers [1,1,14,15,5,43,44,46]: they can speed up a good solver but, unfortunately, they are difficult to analyze. In particular, the previously cited papers contain no iteration complexity results. In order to meet the first goal, the l1-norm is approximated by a smooth function which has derivatives of all degrees. Hence, problem (1) is replaced by

minimize f_µ(x) := τψ_µ(x) + φ(x),

where ψ_µ(x) denotes the smooth function which replaces the l1-norm and µ is a parameter which controls the quality of the approximation. Then, a Newton-CG algorithm embedded in a doubly-continuation scheme is applied to approximately solve a sequence of the above approximate problems. In what follows in this section we give a brief introduction of the proposed approach. In Section 2, necessary basic results are given which will be used to support the theoretical results in Sections 4 and 5. In Section 3, the proposed doubly-continuation scheme is discussed in detail. In Section 4, the convergence analysis of Newton-CG with backtracking line-search is studied. In Section 5, the worst-case iteration complexity of doubly-continuation Newton-CG with backtracking line-search is presented. Finally, in Section 6, numerical results are presented, in which we compare the proposed method with state-of-the-art first-order methods on large-scale Sparse Least-Squares and Logistic Regression problems.

1.1 Pseudo-Huber regularization

The non-smoothness of the l1-norm makes a straightforward application of a second-order method to problem (1) impossible. In this subsection we focus on approximating the non-smooth l1-norm by a smooth function. To meet such a goal, the first-order methods community replaces the l1-norm with the so-called Huber penalty function Σᵢ₌₁ᵐ φ_µ(xᵢ) [1], where

φ_µ(xᵢ) = { xᵢ²/(2µ),      if |xᵢ| ≤ µ
          { |xᵢ| − µ/2,    if |xᵢ| > µ,       i = 1,2,…,m,

and µ > 0. The smaller the parameter µ of the Huber function is, the better the function approximates the l1-norm. Observe that the Huber function is only first-order differentiable; therefore, this approximation trick is not applicable to second-order methods. Fortunately, there is a smooth version of the Huber function, the pseudo-Huber function, which has derivatives of all degrees [17]. The pseudo-Huber function parameterized with µ > 0 is

ψ_µ(x) = µ Σᵢ₌₁ᵐ ( √(1 + xᵢ²/µ²) − 1 ).
(4)

A comparison of the three functions (l1-norm, Huber and pseudo-Huber) is presented in Figure 1. Fig. 1a shows the quality of the approximation for the Huber and pseudo-Huber functions; Fig. 1b shows how the pseudo-Huber function converges to the l1-norm as µ → 0. The pseudo-Huber function is employed in the design of an efficient second-order method in this paper. In particular, the l1-regularization problem (1) is replaced with the following approximation

minimize f_µ(x) := τψ_µ(x) + φ(x). (5)

The advantages of such an approach are listed below.

[Fig. 1 Comparison of the approximation functions, Huber and pseudo-Huber, with the l1-norm in one-dimensional space. Fig. 1a: l1-norm, Huber and pseudo-Huber functions; Fig. 1b: the pseudo-Huber function for decreasing µ.]

– Availability of second-order information owed to the differentiability of the pseudo-Huber function.
– Avoiding the need to double the problem dimensions.
– Opening the door to using iterative methods to compute descent directions.

There is an obvious cost which comes along with the above benefits, and that is the approximate nature of the pseudo-Huber function. There is a concern that in case a very accurate solution is required, the pseudo-Huber function may be unable to deliver it. In theory, since the quality of the approximation is controlled by the parameter µ in (4), see Figure 1, the pseudo-Huber function can recover any level of accuracy under the condition that a sufficiently small µ is chosen. In practice a very small parameter µ might cause instabilities in the linear algebra of the solver. However, we shall show in Section 6 that even when µ is set to small values, e.g. 1.0e−8, the proposed method is very efficient. By replacing the initial problem (1) with the approximation (5) we obtain a smooth strongly convex problem. Since problem (5) is unconstrained, the application of a Newton-CG algorithm [9,6,33] is a good fit for purpose. In particular, a Newton-CG direction is calculated by solving the Newton linear system

∇²f_µ(x) d = −∇f_µ(x) (6)

using the CG algorithm [18,,38]. The CG algorithm is terminated prematurely; therefore, an approximation of the actual Newton direction is obtained.
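The smoothing just described can be made concrete numerically. The sketch below (our own illustration, not the authors' code) evaluates the pseudo-Huber function (4) and shows that its gap to the l1-norm shrinks as µ → 0:

```python
import numpy as np

def pseudo_huber(x, mu):
    # psi_mu(x) = mu * sum_i (sqrt(1 + x_i^2/mu^2) - 1): a smooth approximation
    # of ||x||_1 with derivatives of all degrees.
    return mu * np.sum(np.sqrt(1.0 + (x / mu) ** 2) - 1.0)

x = np.array([1.5, -0.2, 0.0, 3.0])
l1 = np.sum(np.abs(x))
# The gap to the l1-norm is positive and decreases as mu is reduced.
gaps = [l1 - pseudo_huber(x, mu) for mu in (1.0, 1e-2, 1e-4)]
```

Each nonzero component contributes a gap of roughly µ, so the overall approximation error vanishes linearly with µ, consistent with Fig. 1b.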
Moreover, a backtracking line-search approach is used which guarantees global convergence of the Newton-CG algorithm. To achieve maximum practical efficiency the Newton-CG algorithm is embedded into a doubly-continuation scheme. By the term doubly-continuation, it is meant that continuation is applied on the parameters τ and µ of problem (5) simultaneously. Details of this method will be discussed in Section 3.

2 Preliminaries

Throughout the paper the standard dot product ⟨y, x⟩ = yᵀx is used. Occasionally, the local norm ‖h‖_x := ⟨h, ∇²f_µ(x)h⟩^{1/2} will be employed. The norm ‖·‖_∞ denotes the infinity norm. The operator diag(·) returns a diagonal matrix with the diagonal components of the input square matrix, or takes as input a vector and creates a diagonal matrix with the input vector on the diagonal. The operator [·]_{ij} returns the element at row i and column j of the input matrix; for vectors, the operator [·]_j is used instead. Finally, ⌈·⌉ is the ceiling function.

2.1 Properties of the pseudo-Huber function

The gradient of the pseudo-Huber function ψ_µ(x) in (4) is given by

∇ψ_µ(x) = (1/µ) [ x₁(1 + x₁²/µ²)^{−1/2}, …, x_m(1 + x_m²/µ²)^{−1/2} ]ᵀ, (7)

and the Hessian is given by

∇²ψ_µ(x) = (1/µ) diag( (1 + x₁²/µ²)^{−3/2}, …, (1 + x_m²/µ²)^{−3/2} ). (8)

The following lemma shows that the gradient of the function ψ_µ(x) is bounded.

Lemma 1 The gradient ∇ψ_µ(x) satisfies −1_m ≤ ∇ψ_µ(x) ≤ 1_m, where 1_m is a vector of ones of length m.

Proof Since µ² + xᵢ² ≥ xᵢ² for i = 1,2,…,m, we get −1 ≤ xᵢ(µ² + xᵢ²)^{−1/2} ≤ 1. The proof is complete.

The next lemma guarantees that the Hessian of the pseudo-Huber function ψ_µ(x) is bounded.

Lemma 2 The Hessian matrix ∇²ψ_µ(x) satisfies 0 ≺ ∇²ψ_µ(x) ⪯ (1/µ) I, where I is the identity matrix of appropriate dimension.

Proof The result follows easily by observing that 0 < (1 + xᵢ²/µ²)^{−3/2} ≤ 1 for any xᵢ, i = 1,…,m. The proof is complete.

According to Lemma 2 all eigenvalues of the Hessian matrix of the pseudo-Huber function are positive; thus, the function is strictly convex. The next lemma shows that the Hessian matrix of the pseudo-Huber function is Lipschitz continuous.

Lemma 3 The Hessian matrix ∇²ψ_µ(x) is Lipschitz continuous:

‖∇²ψ_µ(y) − ∇²ψ_µ(x)‖ ≤ (1/µ²) ‖y − x‖.

Proof

‖∇²ψ_µ(y) − ∇²ψ_µ(x)‖ = ‖ ∫₀¹ (d/ds)∇²ψ_µ(x + s(y − x)) ds ‖ ≤ ∫₀¹ ‖ (d/ds)∇²ψ_µ(x + s(y − x)) ‖ ds, (9)

where (d/ds)∇²ψ_µ(x + s(y − x)) is a diagonal matrix with diagonal components, i = 1,2,…,m, given by

[ (d/ds)∇²ψ_µ(x + s(y − x)) ]_{ii} = −3(xᵢ + s(yᵢ − xᵢ))(yᵢ − xᵢ) / ( µ³ (1 + (xᵢ + s(yᵢ − xᵢ))²/µ²)^{5/2} ). (10)

Using the previous observation we have that

‖ (d/ds)∇²ψ_µ(x + s(y − x)) ‖ = max_{i=1,2,…,m} | [ (d/ds)∇²ψ_µ(x + s(y − x)) ]_{ii} |. (11)

Moreover, we have

| [ (d/ds)∇²ψ_µ(x + s(y − x)) ]_{ii} | = [ 3|xᵢ + s(yᵢ − xᵢ)| / ( µ³ (1 + (xᵢ + s(yᵢ − xᵢ))²/µ²)^{5/2} ) ] |yᵢ − xᵢ|,

where the first factor attains its maximum at (xᵢ + s(yᵢ − xᵢ))² = µ²/4, which gives

3|xᵢ + s(yᵢ − xᵢ)| / ( µ³ (1 + (xᵢ + s(yᵢ − xᵢ))²/µ²)^{5/2} ) < 1/µ². (12)

Combining (11) and (12) we get

| [ (d/ds)∇²ψ_µ(x + s(y − x)) ]_{ii} | ≤ (1/µ²) |yᵢ − xᵢ|. (13)

Replacing (13) in (11) and using the fact that ‖y − x‖_∞ ≤ ‖y − x‖, we get

‖ (d/ds)∇²ψ_µ(x + s(y − x)) ‖ ≤ (1/µ²) ‖y − x‖.

Replacing the above expression in (9) and calculating the integral we arrive at the desired result. The proof is complete.

Now consider the gradient of the pseudo-Huber function at a particular (fixed) point x and consider a function of the two parameters τ and µ defined as follows: ξ_x(τ, µ) = τ∇ψ_µ(x).

Lemma 4 Let ξ_x(τ, µ) = τ∇ψ_µ(x). If τ̄ ≤ τ and µ > µ̄, the following bound holds:

‖ξ_x(τ̄, µ̄) − ξ_x(τ, µ)‖ ≤ ω(τ, µ, τ̄, µ̄) √m,

where

ω(τ, µ, τ̄, µ̄) = (τ − τ̄) + ((µτ̄ − τµ̄)/(µ − µ̄)) log(µ/µ̄).

In case µ̄ = µ, the following holds:

‖ξ_x(τ̄, µ) − ξ_x(τ, µ)‖ ≤ (τ − τ̄) √m.

Proof For simplicity of notation, let us define variables ζ = (τ, µ) and ζ̄ = (τ̄, µ̄). Then

‖ξ_x(ζ̄) − ξ_x(ζ)‖ = ‖ ∫₀¹ (d/ds)ξ_x(ζ + s(ζ̄ − ζ)) ds ‖ ≤ ∫₀¹ ‖ (d/ds)ξ_x(ζ + s(ζ̄ − ζ)) ‖ ds, (14)

where, writing µ_s = µ + s(µ̄ − µ), the components of the derivative are

[ (d/ds)ξ_x(ζ + s(ζ̄ − ζ)) ]ᵢ = [∇ψ_{µ_s}(x)]ᵢ (τ̄ − τ) − γᵢ(s) [∇ψ_{µ_s}(x)]ᵢ (µ̄ − µ), (15)

with

γᵢ(s) = (τ + s(τ̄ − τ))(µ + s(µ̄ − µ)) / ( (µ + s(µ̄ − µ))² + xᵢ² ).

Using Lemma 1 and (15), together with γᵢ(s) ≤ γ(s), we get

‖ (d/ds)ξ_x(ζ + s(ζ̄ − ζ)) ‖ ≤ ( (τ − τ̄) + γ(s)(µ − µ̄) ) √m, (16)

where

γ(s) = (τ + s(τ̄ − τ)) / (µ + s(µ̄ − µ)).

By replacing (16) in (14), using τ̄ ≤ τ and µ > µ̄, and calculating the integral, we get the first result. For the case µ̄ = µ, using Lemma 1 we get

‖ξ_x(τ̄, µ) − ξ_x(τ, µ)‖ = (τ − τ̄) ‖∇ψ_µ(x)‖ ≤ (τ − τ̄) √m.

The proof is complete.

2.2 Properties of the function f_µ(x)

Having all necessary properties of the pseudo-Huber function (4) we can present basic properties of the function f_µ(x) defined in (5). The gradient of f_µ(x) is given by

∇f_µ(x) = τ∇ψ_µ(x) + ∇φ(x),

where ∇ψ_µ(x) has been defined in (7). The Hessian matrix of f_µ(x) is

∇²f_µ(x) = τ∇²ψ_µ(x) + ∇²φ(x),

where ∇²ψ_µ(x) has been defined in (8). Using (2) and Lemma 2 we get the following bounds on the Hessian matrix of f_µ(x):

λ_m I ⪯ ∇²f_µ(x) ⪯ (τ/µ + λ₁) I, (17)

where I is the identity matrix of appropriate dimension. Hence, the equivalence of the Euclidean and the local norm is given by the following inequality:

λ_m^{1/2} ‖d‖ ≤ ‖d‖_x ≤ (τ/µ + λ₁)^{1/2} ‖d‖. (18)

Lemma 5 For any x and x*, the minimizer of f_µ(x), the following hold:

(1/(2(τ/µ + λ₁))) ‖∇f_µ(x)‖² ≤ f_µ(x) − f_µ(x*) ≤ (1/(2λ_m)) ‖∇f_µ(x)‖²

and

‖x − x*‖ ≤ (2/λ_m) ‖∇f_µ(x)‖.

Proof The right hand side of the first inequality is proved on page 46 of [ ]. The left hand side of the first inequality is proved by using the upper bound on the Hessian in (17), which gives

f_µ(y) ≤ f_µ(x) + ∇f_µ(x)ᵀ(y − x) + ((τ/µ + λ₁)/2) ‖y − x‖²,

and defining ỹ = x − (1/(τ/µ + λ₁)) ∇f_µ(x), we get

f_µ(x) − f_µ(x*) ≥ f_µ(x) − f_µ(ỹ) ≥ (1/(2(τ/µ + λ₁))) ‖∇f_µ(x)‖².

The last inequality is proved on page 46 of [ ]. The proof is complete.

The following lemma guarantees that the Hessian matrix ∇²f_µ(x) is Lipschitz continuous. In this lemma, L_φ is defined in (3).

Lemma 6 The Hessian ∇²f_µ(x) is Lipschitz continuous:

‖∇²f_µ(y) − ∇²f_µ(x)‖ ≤ L_{f_µ} ‖y − x‖, where L_{f_µ} := τ/µ² + L_φ.

Proof Using Lemma 3 and (3) we have

‖∇²f_µ(y) − ∇²f_µ(x)‖ ≤ τ‖∇²ψ_µ(y) − ∇²ψ_µ(x)‖ + ‖∇²φ(y) − ∇²φ(x)‖ ≤ (τ/µ² + L_φ) ‖y − x‖.

The Lipschitz constant of ∇²f_µ(x) is therefore L_{f_µ} := τ/µ² + L_φ.

The next lemma gives a useful bound for the gradient of f_µ(x) as a function of the parameters τ and µ, with x fixed. To make the dependence on τ explicit we write f_µ^τ(x) := τψ_µ(x) + φ(x).

Lemma 7 If τ̄ ≤ τ and µ > µ̄, then the gradients satisfy

‖∇f_{µ̄}^{τ̄}(x) − ∇f_{µ}^{τ}(x)‖ ≤ ω(τ, µ, τ̄, µ̄) √m,

with ω(τ, µ, τ̄, µ̄) as in Lemma 4. In case µ̄ = µ, the following holds:

‖∇f_{µ}^{τ̄}(x) − ∇f_{µ}^{τ}(x)‖ ≤ (τ − τ̄) √m.

Proof The proof follows easily from Lemma 4 because ∇f_{µ̄}^{τ̄}(x) − ∇f_{µ}^{τ}(x) = ξ_x(τ̄, µ̄) − ξ_x(τ, µ). The proof is complete.

Using the reverse triangle inequality on ∇f_µ^τ(x) as a function of the parameters τ and µ, we get ‖∇f_{µ̄}^{τ̄}(x)‖ − ‖∇f_{µ}^{τ}(x)‖ ≤ ‖∇f_{µ̄}^{τ̄}(x) − ∇f_{µ}^{τ}(x)‖, and by applying Lemma 7 in the former we get

‖∇f_{µ̄}^{τ̄}(x)‖ ≤ ‖∇f_{µ}^{τ}(x)‖ + ω(τ, µ, τ̄, µ̄) √m, (19)

and

‖∇f_{µ}^{τ̄}(x)‖ ≤ ‖∇f_{µ}^{τ}(x)‖ + (τ − τ̄) √m, (20)

respectively. The next lemma shows how well the second-order Taylor expansion of f_µ(x) approximates the function f_µ(x).

Lemma 8 If q_µ(y) is the quadratic approximation of the function f_µ(x) at x,

q_µ(y) := f_µ(x) + ∇f_µ(x)ᵀ(y − x) + ½ (y − x)ᵀ∇²f_µ(x)(y − x),

then

|f_µ(y) − q_µ(y)| ≤ (1/6) L_{f_µ} ‖y − x‖³.

Proof Using a corollary in [34] and Lemma 6 we have

|f_µ(y) − q_µ(y)| ≤ ∫₀¹ ∫₀ᵗ ‖∇²f_µ(x + s(y − x)) − ∇²f_µ(x)‖ ‖y − x‖² ds dt
≤ ∫₀¹ ∫₀ᵗ s L_{f_µ} ‖y − x‖³ ds dt = (1/6) L_{f_µ} ‖y − x‖³.

The proof is complete.

2.3 Approximate Newton directions via the Conjugate Gradients algorithm

When pseudo-Huber regularization is used, the direction at every iteration is calculated by a Newton-CG algorithm. This implies that the Newton system (6) is solved iteratively using the CG algorithm and the process is truncated when a required accuracy is obtained. The result is an approximate solution of the Newton system, which means that an inexact Newton direction is calculated. For Newton-CG, the CG algorithm is always initialized with the zero solution and the termination criterion is set to

‖r_µ(x)‖ ≤ η ‖∇f_µ(x)‖, (21)

where r_µ(x) = ∇²f_µ(x)d + ∇f_µ(x) is the residual of equation (6), 0 ≤ η < 1 is a user-defined constant and d ∈ Rᵐ is the Newton-CG direction. In the literature there has been extensive theoretical work regarding the termination parameter η. The most efficient of the proposed approaches for setting η suggest that the parameter should be varied from iteration to iteration, starting from a well-defined large value and steadily decreasing. According to [31], page 167, if k is the iteration index of Newton-CG, then by setting

η^k = min{ 1/2, ‖∇f_µ(x^k)‖^c }, (22)

local super-linear convergence of the Newton-CG algorithm is obtained for c = 1/2, whilst local quadratic convergence is obtained for c = 1. Although in practice we have observed that setting η^k = 1.0e−1 results in very fast convergence of the Newton-CG algorithm, it will be analyzed for η^k set as in (22) with c = 1. The next lemma determines bounds on the norm of the Newton-CG direction d as a function of ‖∇f_µ(x)‖.

Lemma 9 Let d ∈ Rᵐ be the Newton-CG direction calculated by the CG algorithm terminated according to criterion (21) with η < 1. Then the following holds:

((1 − η²)/(2(τ/µ + λ₁))) ‖∇f_µ(x)‖ ≤ ‖d‖ ≤ (2/λ_m) ‖∇f_µ(x)‖.

Proof By squaring (21) and making simple rearrangements of it we get

dᵀ(∇²f_µ(x))²d + 2∇f_µ(x)ᵀ∇²f_µ(x)d + (1 − η²)‖∇f_µ(x)‖² ≤ 0.

Using (17) in the above inequality we get

λ_m²‖d‖² − 2(τ/µ + λ₁)‖∇f_µ(x)‖‖d‖ + (1 − η²)‖∇f_µ(x)‖² ≤ 0.

By dropping the quadratic term λ_m²‖d‖² from the previous inequality and making appropriate rearrangements we get

‖d‖ ≥ ((1 − η²)/(2(τ/µ + λ₁))) ‖∇f_µ(x)‖,

which proves the left hand side of the result. For the right hand side, we simply use the equality d = ∇²f_µ(x)^{−1}(r_µ(x) − ∇f_µ(x)). By taking norms and using (21) and ‖∇²f_µ(x)^{−1}‖ ≤ 1/λ_m we obtain the right hand side of the result. The proof is complete.

The following property is shown in the proof of Lemma 2.4.1 in [ ]. For a CG algorithm that is initialised with the zero solution p⁰ = 0, the returned direction dᵢ satisfies

dᵢ := arg min{ ∇f_µ(x)ᵀp + ½ pᵀ∇²f_µ(x)p : p ∈ Eᵢ },

where

Eᵢ := span( ∇f_µ(x), ∇²f_µ(x)∇f_µ(x), …, (∇²f_µ(x))^{i−1}∇f_µ(x) ).

Therefore, for every p ∈ Eᵢ, at t = 0 we get

(d/dt)[ ∇f_µ(x)ᵀ(dᵢ + tp) + ½(dᵢ + tp)ᵀ∇²f_µ(x)(dᵢ + tp) ] = (∇f_µ(x) + ∇²f_µ(x)dᵢ)ᵀp = 0.

Since dᵢ ∈ Eᵢ, then

(∇f_µ(x) + ∇²f_µ(x)dᵢ)ᵀdᵢ = 0 ⟺ dᵢᵀ∇²f_µ(x)dᵢ = −∇f_µ(x)ᵀdᵢ. (23)
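The truncated CG solve just described, started from the zero vector and stopped once the residual ‖∇²f_µ(x)d + ∇f_µ(x)‖ falls below η‖∇f_µ(x)‖, can be sketched as follows. This is a minimal illustration in our own notation (the helper name is ours; H and g stand for the Hessian and gradient of f_µ at the current point):

```python
import numpy as np

def truncated_cg(H, g, eta):
    # Approximately solve H d = -g by conjugate gradients started from d = 0,
    # stopping as soon as the residual r = H d + g satisfies ||r|| <= eta*||g||.
    d = np.zeros_like(g)
    r = g.copy()                       # residual H d + g at d = 0
    p = -r
    target = eta * np.linalg.norm(g)
    while np.linalg.norm(r) > target:
        Hp = H @ p
        alpha = (r @ r) / (p @ Hp)     # exact line-search along p
        d = d + alpha * p
        r_new = r + alpha * Hp
        p = -r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d
```

Because d lies in the Krylov subspace, the identity dᵀHd = −gᵀd holds at termination, so the local norm ‖d‖²_x can be read off as −∇f_µ(x)ᵀd at no extra cost.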

As a termination criterion of the Newton-CG algorithm we will use the quantity ‖d‖_x, which is related to the Newton decrement ‖d_N‖_x, where d_N is the exact Newton direction at a given point x:

‖d_N‖_x = inf{ c₁ : |∇f_µ(x)ᵀh| ≤ c₁‖h‖_x for all h ∈ Rᵐ }.

It is shown in [3], pages 15-16, that the optimal c₁ = ‖d_N‖_x of the previous minimization problem is attained at h = d_N. Therefore, for h = d given by CG, and due to (23), we have ‖d‖_x ≤ ‖d_N‖_x. The measure ‖d‖_x is inexpensive to calculate because of the relation (23), and it will be used as a termination criterion of the Newton-CG algorithm. For details about the Newton decrement see page 15 in [3].

3 Doubly-Continuation Newton-CG

The pseudo-Huber regularization problem (5) to be solved is a smooth and strongly convex problem. Therefore, one could directly apply an efficient and simple Newton-CG algorithm. In this paper, we make a step further and embed a Newton-CG algorithm with backtracking line-search in a doubly-continuation scheme. Continuation schemes define a sequence of minimizers as a function of a parameter. For l1-regularization problems as in (1), it is common that this sequence depends on the parameter τ. In such cases the continuation path is constructed for values of τ decreasing to a user-defined constant. For the proposed method, the continuation path consists of minimizers of problem (5) for simultaneously decreasing parameters τ and µ, until τ_l → τ̃ and µ_l → µ̃, where τ̃ and µ̃ are pre-defined target parameters. This explains the term doubly-continuation. The following two observations justify the purpose of the doubly-continuation scheme. It is shown in [ ] that problem (1) has the zero solution for any τ ≥ ‖∇φ(0)‖_∞. In case of problem (5), by setting µ to a relatively large value, e.g. µ ≥ 1, the regularization effect of the l1-norm is weakened, see Figure 1b. Therefore, τ ≥ ‖∇φ(0)‖_∞ is not a sufficient condition for the subproblem (5) to have the zero solution.
However, our empirical experience shows that Newton-CG is warm-started, i.e. Newton-CG converges in few iterations, when the initial solution is set to zero and τ ≥ ‖∇φ(0)‖_∞. The linear algebra is favoured when τ = O(µ), since in this case the Hessian matrices ∇²f_µ(x) are tightly bounded, see (17). Briefly, large values of τ favour warm-starting, while a small ratio τ/µ assures good numerical properties of the linear systems to be solved. However, the parameters τ and µ have to be decreased to their pre-defined values eventually. We propose a continuation approach which consists of gradually decreasing τ and µ from their large initial values to the ones which determine an accurate enough solution to the original problem (1). We expect that solving a sequence of easier problems is much more efficient than giving one shot to the actual

problem itself. The process of decreasing the parameters τ and µ follows the rule

τ_{l+1} = max{β₁τ_l, τ̃} and µ_{l+1} = max{β₂µ_l, µ̃}, (24)

where 0 < β₁, β₂ < 1. Observe in (24) that the parameters τ and µ can be decreased at every iteration arbitrarily. This implies that the proposed method is long-step with respect to these parameters: long-step in the sense that the reduction parameters β₁ and β₂ are not limited in order to guarantee convergence of the doubly-continuation scheme. The pseudo-codes of the proposed doubly-continuation Newton-CG and the backtracking line-search algorithms follow. In these pseudo-codes we redefine the notation of the local norm to distinguish among continuation iterations: ‖h‖_{x,l} = ⟨h, ∇²f_{µ_l}^{τ_l}(x)h⟩^{1/2}.

Algorithm dcNCG
1: Input Choose x⁰ and τ₀, µ₀, τ̃, µ̃, ε̃ > 0, 0 < β₁, β₂ < 1.
2: Set flag = 0.
3: For l = 0, 1, 2, … and k = 0, 1, 2, … generate τ_{l+1}, µ_{l+1} from τ_l, µ_l and x^{k+1} from x^k, respectively, according to the iteration:
4: while flag < 2 do
5:   for maximum Newton-CG iterations do
6:     Use the CG algorithm to approximately solve ∇²f_{µ_l}^{τ_l}(x^k)d^k = −∇f_{µ_l}^{τ_l}(x^k), until (21) is satisfied with η as in (22) and c = 1.
7:     If ‖d^k‖_{x^k,l} ≤ ε_l break the for-loop.
8:     Calculate 0 < α^k ≤ 1 which satisfies f_{µ_l}^{τ_l}(x^k + α^k d^k) ≤ f_{µ_l}^{τ_l}(x^k) − c₂α^k‖d^k‖²_{x^k,l}, where c₂ ∈ (0, 1/2), by using backtracking line-search.
9:     Set x^{k+1} = x^k + α^k d^k and k = k + 1.
10:  end for
11:  τ_{l+1} = max{β₁τ_l, τ̃} and µ_{l+1} = max{β₂µ_l, µ̃}
12:  If (τ_{l+1}, µ_{l+1}) = (τ̃, µ̃), then set ε_l = ε̃ and flag = flag + 1
13:  l = l + 1
14: end while

The outer loop of the above algorithm is the doubly-continuation scheme on the parameters τ and µ. The inner loop is the Newton-CG algorithm with backtracking line-search that minimizes f_{µ_l}^{τ_l}(x) for the fixed parameters τ_l, µ_l given by the l-th outer iteration. Although the termination criterion of the CG algorithm is set as in (22), in practice it can be relaxed to any η < 1. Step 8 is the backtracking line-search algorithm given below.

Algorithm backtracking line-search
1: Input Given τ_l, µ_l, an inexact Newton direction d for the point x calculated by CG in step 6 of algorithm dcNCG, and c₂ ∈ (0, 1/2), c₃ ∈ (0, 1), calculate a step-size α:
2: Set α = 1
3: while f_{µ_l}^{τ_l}(x + αd) > f_{µ_l}^{τ_l}(x) − c₂α‖d‖²_{x,l} do
4:   α = c₃α
5: end while

For more details about backtracking line-search the reader is referred to Chapter 9 in [ ]. The termination criterion of the inner loop in step 7 can be relaxed, for example to ε₁ ≥ ε̃ and ε_l ≤ ε_{l−1} for every l > 0, since the subproblems (5) for τ_l ≥ τ̃ and µ_l ≥ µ̃ do not have to be solved to high accuracy. The analysis of the Newton-CG algorithm, steps 5-10, without the continuation scheme is presented in the next section. This analysis will provide us with useful results that will be used for the proof of the worst-case iteration complexity of the doubly-continuation scheme.

4 Convergence analysis of Newton-CG with backtracking line-search

In this section we analyze the Newton-CG with backtracking line-search, steps 5-10 of algorithm dcNCG. In particular, we prove global convergence, we study the global and local convergence rates and we explicitly define a region in which Newton-CG has a fast convergence rate. Additionally, a worst-case iteration complexity result for this Newton-CG is presented. Since Newton-CG in this section does not depend on the continuation scheme, the indices τ and µ of the function f_µ^τ(x) are dropped. Moreover, the upper bound on the largest eigenvalue of ∇²f_µ(x), shown in (17), is denoted by λ̄₁ = τ/µ + λ₁, and the Lipschitz constant L_{f_µ} defined in Lemma 6 is denoted by L. Finally, for this section, an upper bound on the condition number of the matrix ∇²f(x) for any x will be used, which is denoted by κ = λ̄₁/λ_m.
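Before the analysis, the overall dcNCG scheme of Section 3 can be sketched on a toy instance. The sketch below is our own schematic, not the authors' implementation: φ(x) = ½‖x − b‖² is an assumed test function (so the Hessian of f is diagonal and λ_m = λ₁ = 1), the exact diagonal Newton solve stands in for the truncated CG of step 6, a single tolerance eps replaces the per-stage ε_l, and the backtracking loop follows steps 2-5 of the line-search algorithm above.

```python
import numpy as np

def dcncg_toy(b, tau_t=0.5, mu_t=1e-6, beta=0.1, c2=0.25, c3=0.5, eps=1e-8):
    # Toy dcNCG: minimize tau*psi_mu(x) + 0.5*||x - b||^2 while driving
    # (tau, mu) from (1, 1) down to the targets (tau_t, mu_t) via rule (24).
    def f(x, tau, mu):
        return tau * np.sum(np.sqrt(mu**2 + x**2) - mu) + 0.5 * np.sum((x - b) ** 2)
    def grad(x, tau, mu):
        return tau * x / np.sqrt(mu**2 + x**2) + (x - b)
    def hess_diag(x, tau, mu):
        return tau * mu**2 / (mu**2 + x**2) ** 1.5 + 1.0
    x, tau, mu = np.zeros_like(b), 1.0, 1.0
    for _ in range(100):                       # outer continuation loop
        for _ in range(100):                   # inner Newton loop (steps 5-10)
            g = grad(x, tau, mu)
            d = -g / hess_diag(x, tau, mu)     # stands in for the CG solve
            dec2 = -g @ d                      # ||d||_x^2, since d'Hd = -g'd
            if np.sqrt(dec2) <= eps:
                break
            alpha = 1.0                        # backtracking line-search
            while f(x + alpha * d, tau, mu) > f(x, tau, mu) - c2 * alpha * dec2:
                alpha *= c3
            x = x + alpha * d
        if (tau, mu) == (tau_t, mu_t):
            break
        tau, mu = max(beta * tau, tau_t), max(beta * mu, mu_t)
    return x
```

With µ̃ as small as 10⁻⁶ the computed point essentially matches the soft-thresholding solution of the original non-smooth problem, illustrating why a small µ recovers (1) accurately while the continuation keeps every subproblem well conditioned.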
4.1 Global convergence rate of Newton-CG with backtracking line-search

In the following lemma the decrease of the objective function at every iteration of Newton-CG with backtracking line-search is calculated. In this lemma the constants c₂ and c₃ are defined in the backtracking line-search algorithm given in Section 3.

Lemma 10 Let x ∈ Rᵐ be the current iterate of Newton-CG, and let d ∈ Rᵐ be the Newton-CG direction calculated using the truncated CG algorithm described in Subsection 2.3. The parameter η of the termination criterion (21) of the CG

algorithm is set to η < 1. If x is not the minimizer of problem (5), i.e. ∇f(x) ≠ 0, then the backtracking line-search algorithm will calculate a step-size ᾱ such that

ᾱ ≥ c₃ λ_m/λ̄₁.

For this step-size ᾱ the following holds:

f(x) − f(x(ᾱ)) > c₄ ‖d‖²_x,

where c₄ = c₂c₃/κ and x(ᾱ) = x + ᾱd.

Proof For x(α) = x + αd, from the upper Hessian bound in (17) we have

f(x(α)) ≤ f(x) + α∇f(x)ᵀd + (α²λ̄₁/2)‖d‖².

If ∇f(x) ≠ 0, due to positive definiteness of ∇²f(x), see (17), the CG algorithm terminated at the i-th iteration returns a vector dᵢ which according to (23) satisfies

dᵢᵀ∇²f(x)dᵢ = −∇f(x)ᵀdᵢ.

Therefore, by setting d := dᵢ we get

f(x(α)) ≤ f(x) − α‖d‖²_x + (α²λ̄₁/2)‖d‖².

Using (18) we get

f(x(α)) ≤ f(x) − α‖d‖²_x + (α²λ̄₁/(2λ_m))‖d‖²_x.

The right hand side of the above inequality is minimized for ᾱ = λ_m/λ̄₁, which gives

f(x(ᾱ)) ≤ f(x) − (λ_m/(2λ̄₁))‖d‖²_x.

Observe that for this step-size the exit condition of the backtracking line-search algorithm is satisfied, since

f(x(ᾱ)) ≤ f(x) − (λ_m/(2λ̄₁))‖d‖²_x < f(x) − c₂(λ_m/λ̄₁)‖d‖²_x.

Therefore, the step-size ᾱ returned by the backtracking line-search algorithm is in the worst case bounded by ᾱ ≥ c₃λ_m/λ̄₁, which results in the following decrease of the objective function:

f(x) − f(x(ᾱ)) > c₂c₃(λ_m/λ̄₁)‖d‖²_x = (c₂c₃/κ)‖d‖²_x.

The proof is complete.

Standard convergence analysis treats the Newton algorithm with backtracking line-search as a two-phase method. For the first phase, which is of interest in this subsection, a linear convergence rate is proved. In particular, convergence is measured by the decrease of the objective function at every iteration, which is proved to be at least c₄‖d_N‖²_x, where d_N is the Newton direction; see page 49 in [ ]. For the Newton-CG algorithm with backtracking line-search studied in this paper, it is proved in Lemma 10 that the decrease of the objective function at every iteration of Newton-CG in the first phase is c₄‖d‖²_x, where d is the inexact Newton direction returned by CG. Hence, Newton-CG with backtracking line-search enjoys a global linear convergence rate analogous to that of the Newton algorithm with backtracking line-search.

4.2 Region of fast convergence rate of Newton-CG

In this subsection we define a region based on ‖d‖_x in which, by setting the parameter η as in (22) with c = 1, the Newton-CG algorithm converges with a fast rate. By fast rate it is meant that when Newton-CG is initialized in this region, the worst-case iteration complexity result is of the form c·log log(1/ε), where c is a constant and ε is the required accuracy. We will show in Subsection 4.4 that the rate of convergence which is proved in this subsection for the explicitly defined region satisfies the previously stated condition. The lemma below is crucial for the analysis in this subsection; it shows the behaviour of the function f(x) when a step along the Newton-CG direction is made.

Lemma 11 Let x ∈ Rᵐ be the current iterate of Newton-CG, and let d ∈ Rᵐ be the Newton-CG direction calculated using the CG algorithm described in Subsection 2.3, with the parameter η of the termination criterion (21) set to η < 1. Then

f(x) − f(x(α)) ≥ α‖d‖²_x − (α²/2)‖d‖²_x − (α³/6)(L/λ_m^{3/2})‖d‖³_x,

where x(α) = x + αd and α > 0.
Proof Using Lemma 8 and setting y = x(α) = x + αd we get

f(x(α)) ≤ f(x) + α∇f(x)ᵀd + (α²/2) dᵀ∇²f(x)d + (α³/6) L‖d‖³.

Using (18) and (23) we get

f(x(α)) ≤ f(x) − α‖d‖²_x + (α²/2)‖d‖²_x + (α³/6)(L/λ_m^{3/2})‖d‖³_x.

The result is obtained by rearrangement of terms. The proof is complete.
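As a quick sanity check of the mechanics of Lemma 11 (our own illustration, not part of the paper): for a quadratic objective the Hessian is constant, so its Lipschitz constant is L = 0, the cubic term vanishes, and the bound holds with equality for the exact Newton direction.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])     # constant Hessian => L = 0
b = np.array([1.0, -2.0])

def f(x):
    # Strongly convex quadratic f(x) = 0.5 x'Ax - b'x.
    return 0.5 * x @ A @ x - b @ x

x = np.array([2.0, 1.0])
g = A @ x - b
d = np.linalg.solve(A, -g)                 # exact Newton direction
dec2 = -g @ d                              # ||d||_x^2, since d'Ad = -g'd
# For any alpha: f(x) - f(x + alpha*d) = alpha*dec2 - 0.5*alpha^2*dec2,
# i.e. the inequality of Lemma 11 is tight when L = 0.
```

This also makes visible why α = 1 is the natural step once the cubic term is negligible: the decrease α·dec2 − ½α²·dec2 is maximized exactly at α = 1.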

Based on Lemma 11, a region is defined in the following lemma in which unit step-sizes are calculated by the backtracking line-search algorithm. Additionally, for this region, ‖d^{k+1}‖_{x^{k+1}} is bounded as a function of ‖d^k‖_{x^k}. In this lemma the constants c₂ and c₃ have been defined in the backtracking line-search algorithm given in Section 3; moreover, x^{k+1} = x^k + d^k.

Lemma 12 If ‖d^k‖_{x^k} ≤ 3(1 − 2c₂)λ_m^{3/2}/L, then the backtracking line-search algorithm calculates unit step-sizes. Moreover, if the parameter η^k of the termination criterion (21) of the CG algorithm is set as in (22) with c = 1, then for two consecutive directions d^k and d^{k+1} at points x^k and x^{k+1} of the Newton-CG algorithm, the following holds:

((16λ̄₁² + L)/(2λ_m^{3/2})) ‖d^{k+1}‖_{x^{k+1}} ≤ ( ((16λ̄₁² + L)/(2λ_m^{3/2})) ‖d^k‖_{x^k} )².

Proof By setting α = 1 in Lemma 11 we get

f(x^k) − f(x^{k+1}) ≥ ½‖d^k‖²_{x^k} − (L/(6λ_m^{3/2}))‖d^k‖³_{x^k} = ( ½ − (L/(6λ_m^{3/2}))‖d^k‖_{x^k} ) ‖d^k‖²_{x^k}.

If ‖d^k‖_{x^k} ≤ 3(1 − 2c₂)λ_m^{3/2}/L we get

f(x^k) − f(x^{k+1}) ≥ c₂‖d^k‖²_{x^k},

which implies that α = 1 satisfies the exit condition of the backtracking line-search algorithm. Let us define the quantity ∇f(x(t))ᵀh, where x(t) = x^k + td^k and h ∈ Rᵐ. Then we have

∇f(x(t))ᵀh = ∇f(x^k)ᵀh + t(d^k)ᵀ∇²f(x^k)h + ∫₀ᵗ∫₀ᵘ ∇³f(x(δ))[d^k, d^k, h] dδ du
≤ ∇f(x^k)ᵀh + t(d^k)ᵀ∇²f(x^k)h + ∫₀ᵗ∫₀ᵘ L‖d^k‖²‖h‖ dδ du
= ∇f(x^k)ᵀh + t(d^k)ᵀ∇²f(x^k)h + (t²/2) L‖d^k‖²‖h‖,

where the third-order term was bounded by interpreting ∇³f(x(δ))[d^k, d^k, h] as lim_{δ̄→0} (1/δ̄)(d^k)ᵀ(∇²f(x(δ + δ̄)) − ∇²f(x(δ)))h and using the Lipschitz continuity of ∇²f (Lemma 6).

By taking absolute values and setting $t = 1$ we get

$$\left| \nabla f(x^{k+1})^T h \right| \le \left| \nabla f(x^k)^T h + (d^k)^T \nabla^2 f(x^k) h \right| + \frac{1}{2} L \|d^k\|^2 \|h\|.$$

From (18) and (1), by setting $r^k = \nabla^2 f(x^k) d^k + \nabla f(x^k)$, we get

$$\left| \nabla f(x^{k+1})^T h \right| \le \left( \frac{\eta^k \|\nabla f(x^k)\|}{\lambda_m^{1/2}} + \frac{L}{2\lambda_m^{3/2}}\|d^k\|_{x^k}^2 \right) \|h\|_{x^{k+1}}.$$

Since $\eta^k$ is set according to ( ), by using (18) and Lemma 9 we get

$$\left| \nabla f(x^{k+1})^T h \right| \le \left( \frac{4\lambda_1}{\lambda_m^{3/2}(1-(\eta^k)^2)^{1/2}} + \frac{L}{2\lambda_m^{3/2}} \right) \|d^k\|_{x^k}^2 \|h\|_{x^{k+1}} \le \frac{1}{2}\cdot\frac{16\lambda_1 + L}{\lambda_m^{3/2}}\|d^k\|_{x^k}^2 \|h\|_{x^{k+1}}.$$

The previous result holds for every $h \in \mathbb{R}^m$; hence, by setting $h = d^{k+1}$ and by using (3) we prove the second part of this lemma. The proof is complete.

The following corollary states the region of fast convergence rate of Newton-CG.

Corollary 1 If the parameter $\eta^k$ in the termination criterion (1) of the CG algorithm is set as in ( ) with $c = 1/2$ and $\|d^k\|_{x^k} < \varpi$, $0 < \varpi \le c_5$, where

$$c_5 = \min\left\{ \frac{3(1-c_2)\lambda_m^{3/2}}{L},\ \frac{\lambda_m^{3/2}}{2(16\lambda_1 + L)} \right\},$$

then according to Lemma 12 Newton-CG converges with a fast rate.

4.3 Global convergence of Newton-CG with backtracking line-search

The global convergence result for the Newton-CG algorithm with backtracking line-search, steps 5-12 of the dcNCG algorithm, follows.

Theorem 1 Let $\{x^k\}$ be a sequence generated by the Newton-CG algorithm with step-sizes calculated by backtracking line-search, where the parameter $\eta$ of the termination criterion (1) of the CG algorithm is set to a value $\eta < 1$. The sequence $\{x^k\}$ converges to $x^*$, which is the minimizer of $f(x)$ in problem (5).

Proof Let $\{x^k\}$ be a sequence generated by the Newton-CG algorithm with step-sizes $\bar\alpha^k$ calculated by the backtracking line-search algorithm presented in Section 3. The positive definiteness of $\nabla^2 f(x)$ at any $x$, see (17), and the fact that $\eta < 1$ in (1), imply that CG returns $d^k = 0$ at a point $x^k$ if and only if $\nabla f(x^k) = 0$. Hence, only at optimality will CG return a zero direction. Moreover, from Lemma 10 we get that if $\nabla f(x^k) \ne 0$, then $\bar\alpha^k$ is bounded away from zero and the function $f(x)$ is monotonically decreasing

when the step $\bar\alpha^k d^k$ is applied. The monotonic decrease of the objective function implies that $\{f(x^k)\}$ converges to a limit; thus $\{f(x^k) - f(x^{k+1})\} \to 0$. Therefore, the sequence $\{x^k\}$ converges to a point $\bar{x}$. Using Lemma 10 and $\{f(x^k) - f(x^{k+1})\} \to 0$ we get that $\|d^k\|_{x^k} \to 0$; hence, due to strong convexity of $f(x)$, $\|d^k\| \to 0$. Moreover, from Lemma 9 we deduce that if $\|d^k\| \to 0$, then $\nabla f(x^k) \to 0$; therefore, $\bar{x}$ is a stationary point of the function $f(x)$. Strong convexity of $f(x)$ guarantees that a stationary point must be the minimizer. The proof is complete.

4.4 Worst-case iteration complexity of Newton-CG with backtracking line-search

The following theorem shows the worst-case iteration complexity of Newton-CG with backtracking line-search in order to enter the region of fast convergence rate, i.e. $\|d\|_x < \varpi$, where $0 < \varpi \le c_5$ and $c_5$ has been defined in Corollary 1. In this theorem the constant $c_4$ has been defined in Lemma 10, and $c_2$ and $c_3$ are constants of the backtracking line-search algorithm given in Section 3. Moreover, $x^*$ denotes the minimizer of $f(x)$.

Theorem 2 Starting from an initial point $x^0$ such that $\|d^0\|_{x^0} \ge \varpi$, Newton-CG with backtracking line-search and $\eta < 1$ in the termination criterion (1) of CG requires at most

$$K_1 = c_6 \log\left( \frac{f(x^0) - f(x^*)}{c_7 \varpi^2} \right)$$

iterations to obtain a solution $x^k$, $k > 0$, such that $\|d^k\|_{x^k} < \varpi$, where $c_6 = \frac{\kappa^3}{(1-\eta^2) c_2 c_3}$ and $c_7 = \frac{1}{8\kappa}$.

Proof Let us assume an iteration index $k > 0$. Then from (18), Lemmas 5 and 9 we get

$$f(x^k) - f(x^*) \ge \frac{1}{8\kappa}\|d^k\|_{x^k}^2, \qquad (25)$$

and

$$f(x^{k-1}) - f(x^*) \le \frac{\kappa^2}{1-\eta^2}\|d^{k-1}\|_{x^{k-1}}^2. \qquad (26)$$

From Lemma 10 we have

$$f(x^k) \le f(x^{k-1}) - c_4\|d^{k-1}\|_{x^{k-1}}^2. \qquad (27)$$

Combining (26), (27) and subtracting $f(x^*)$ from both sides we get

$$f(x^k) - f(x^*) < \left( 1 - \frac{(1-\eta^2) c_4}{\kappa^2} \right)(f(x^{k-1}) - f(x^*)) < \left( 1 - \frac{(1-\eta^2) c_4}{\kappa^2} \right)^k (f(x^0) - f(x^*)) = \left( 1 - \frac{(1-\eta^2) c_2 c_3}{\kappa^3} \right)^k (f(x^0) - f(x^*)).$$

From the last inequality and (25) we get

$$\frac{1}{8\kappa}\|d^k\|_{x^k}^2 < \left( 1 - \frac{(1-\eta^2) c_2 c_3}{\kappa^3} \right)^k (f(x^0) - f(x^*)).$$

Using the definitions of the constants $c_6$ and $c_7$ we have

$$\|d^k\|_{x^k}^2 < \left( 1 - \frac{1}{c_6} \right)^k \frac{1}{c_7}(f(x^0) - f(x^*)).$$

Hence, we conclude that after at most $K_1$ iterations, as defined in the preamble of this theorem, the algorithm produces $\|d^k\|_{x^k} < \varpi$. The proof is complete.

The following theorem presents the worst-case iteration complexity result for Newton-CG to obtain a solution $x^l$ of accuracy $f(x^l) - f(x^*) < \epsilon$ when it is initialized at a point inside the region of fast convergence. The index $l$ has been used as an iteration counter of the doubly-continuation loop; however, only for the next theorem will it be used as an iteration counter of the Newton-CG algorithm.

Theorem 3 Suppose that there is an iteration index $k$ such that $\|d^k\|_{x^k} < \varpi$. If $\eta$ in (1) is set as in ( ) with $c = 1/2$, then Newton-CG needs at most

$$K_2 = \log_2 \log_2 \left( \frac{c_8}{\epsilon} \right)$$

iterations to obtain a solution $x^l$, $l > k$, such that $f(x^l) - f(x^*) < \epsilon$, where $c_8 = \frac{8\kappa\lambda_m^3}{(16\lambda_1 + L)^2}$.

Proof Suppose that there is an iteration index $k$ such that $\|d^k\|_{x^k} < \varpi$. Then, for an index $l > k$, by applying Lemma 12 recursively we get

$$\frac{16\lambda_1 + L}{\lambda_m^{3/2}}\|d^l\|_{x^l} \le \left( \frac{16\lambda_1 + L}{\lambda_m^{3/2}}\|d^k\|_{x^k} \right)^{2^{l-k}} < \left( \frac{1}{2} \right)^{2^{l-k}}. \qquad (28)$$

From (18), Lemmas 5 and 9 and the setting of $\eta^k$ in (1) we get

$$f(x^l) - f(x^*) \le \frac{\kappa}{1-\eta^2}\|d^l\|_{x^l}^2 \le 8\kappa\|d^l\|_{x^l}^2.$$

By replacing (28) in the above inequality we get

$$f(x^l) - f(x^*) < \frac{8\kappa\lambda_m^3}{(16\lambda_1 + L)^2}\left( \frac{1}{2} \right)^{2^{l-k+1}}.$$

Hence, in order to obtain a solution $x^l$ such that $f(x^l) - f(x^*) < \epsilon$, Newton-CG requires at most as many iterations as in the preamble of this theorem. The proof is complete.
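The doubly-logarithmic bound of Theorem 3 reflects quadratic convergence: inside the fast region the number of correct digits roughly doubles per iteration. The sketch below illustrates this on an assumed strongly convex test function $f(x) = \sum_i \log\cosh(x_i) + \tfrac{1}{2}\|x\|^2$ (our own stand-in, with exact Newton steps in place of Newton-CG; the minimizer is $x^* = 0$).

```python
import numpy as np

# Illustration of the log log(1/eps) phase: once inside the fast-convergence
# region, the error of a (here: exact) Newton iteration is roughly squared at
# every step.  f(x) = sum(log cosh x_i) + 0.5||x||^2 is a hypothetical
# strongly convex stand-in with minimizer x* = 0.
grad = lambda x: np.tanh(x) + x
hess_diag = lambda x: (1.0 - np.tanh(x) ** 2) + 1.0   # diagonal Hessian

x = np.full(5, 0.5)                   # start inside the fast region
errors = []
for _ in range(4):
    x = x - grad(x) / hess_diag(x)    # unit-step Newton (diagonal solve)
    errors.append(np.linalg.norm(x, np.inf))

# the error at step k+1 is bounded by the square of the error at step k
assert errors[1] <= errors[0] ** 2
assert errors[2] <= errors[1] ** 2
```

The squaring of the error per step is the same contraction pattern as the $(1/2)^{2^{l-k}}$ factor in the proof of Theorem 3, which is what produces the $\log_2\log_2(c_8/\epsilon)$ iteration count.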

The following theorem summarizes the complexity result of Newton-CG with backtracking line-search. The constants $c_6$, $c_7$ and $c_8$ in this theorem are defined in Theorems 2 and 3, respectively. A worst-case iteration complexity result of the form in Theorem 4 below has first been obtained in [5], for an inexact Newton method based on cubic regularization.

Theorem 4 Starting from an initial point $x^0$ such that $\|d^0\|_{x^0} \ge \varpi$, Newton-CG with backtracking line-search requires at most

$$K_3 = c_6 \log\left( \frac{f(x^0) - f(x^*)}{c_7 \varpi^2} \right) + \log_2 \log_2 \left( \frac{c_8}{\epsilon} \right)$$

iterations to converge to a solution $x^k$, $k > 0$, of accuracy $f(x^k) - f(x^*) < \epsilon$.

5 Worst-case iteration complexity of dcNCG with backtracking line-search

First, let us remind the reader that in Section 4 the Newton-CG algorithm has been analyzed independently of the doubly-continuation scheme; hence, for simplicity of notation, the parameters $\tau$ and $\mu$ have been dropped. In this section it is necessary to reintroduce the dependence on the parameters $\tau_l$ and $\mu_l$ for each continuation iteration $l$. In particular, when a reference to results from Section 4 is made, the usual notation which includes these parameters will be used. Moreover, $k$ denotes the iteration index of the dcNCG algorithm, and $x^l$ is used to denote the initial point of the $l$-th inner loop, or an approximate minimizer of $f_{\mu_{l-1}}^{\tau_{l-1}}(x)$ for $l > 0$. For example, if the termination criterion in step 7 is satisfied for $x^k$ and $d^k$ at the $(l-1)$-th continuation iteration, then $x^l := x^k$. Finally, $z^l$ is the minimizer of $f_{\mu_l}^{\tau_l}(x)$ and $\tilde{z}$ is the minimizer of $f_{\tilde\mu}^{\tilde\tau}(x)$.

We start with an intuitive explanation of the worst-case iteration complexity result. For this, the following summary of the doubly-continuation scheme is helpful.

Summary of algorithm dcNCG
1: Outer loop: For $l = 0, 1, 2, \ldots, \vartheta$, produce the sequences $\{\tau_l\}$ and $\{\mu_l\}$, where $\vartheta$ is the number of continuation iterations.
2: Inner loop: Approximately solve the subproblem $\mathrm{minimize}\ f_{\mu_l}^{\tau_l}(x)$ using Newton-CG with backtracking line-search.

From the summary of the continuation scheme it is clear that the total number of required steps is the sum of the steps required for the approximate

solution of each subproblem. In the previous section we established the worst-case iteration complexity of the inner loop, i.e. of the approximate solution of a subproblem by using Newton-CG with backtracking line-search. In particular, to obtain $f_{\mu_l}^{\tau_l}(x^{l+1}) - f_{\mu_l}^{\tau_l}(z^l) < \epsilon$ for the $l$-th problem, from Theorem 4 we have that at most $K_3^l$ iterations of Newton-CG with backtracking line-search are required. Hence, if every $l$-th problem is solved to accuracy $f_{\mu_l}^{\tau_l}(x^{l+1}) - f_{\mu_l}^{\tau_l}(z^l) < \epsilon$, then the worst-case iteration complexity of the continuation scheme is $\sum_{l=1}^{\vartheta} K_3^l$. However, it would be preferable to have a bound which is independent of the index $l$. For this reason, in this section we will show that

$$\sum_{l=1}^{\vartheta} K_3^l \le \vartheta \max_{0 \le l \le \vartheta} K_3^l < \vartheta K_4,$$

where $K_4$ is an upper bound of the worst-case $K_3^l$ over all $l$. For convenience, let us explicitly state $K_3^l$:

$$K_3^l = c_6^l \log\left( \frac{f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^l)}{c_7^l (\varpi^l)^2} \right) + \log_2 \log_2 \left( \frac{c_8^l}{\epsilon} \right), \qquad (29)$$

where

$$c_6^l = \frac{(\tau_l/\mu_l + \lambda_1)^3}{(1-\eta^2)\lambda_m^3 c_2 c_3}, \qquad c_7^l = \frac{\lambda_m}{8(\tau_l/\mu_l + \lambda_1)}, \qquad c_8^l = \frac{8(\tau_l/\mu_l + \lambda_1)\lambda_m^2}{\left( 16(\tau_l/\mu_l + \lambda_1) + \tau_l/\mu_l^2 + L_\varphi \right)^2},$$

and $c_2$ and $c_3$ are constants of the backtracking line-search algorithm, which do not depend on the continuation iteration index $l$. Moreover, according to Corollary 1, the region of fast convergence of Newton-CG for the $l$-th subproblem is defined by $\|d^k\|_{x^k,l} < \varpi^l$, where $\varpi^l \le c_5^l$ and

$$c_5^l = \min\left\{ \frac{3(1-c_2)\lambda_m^{3/2}}{\tau_l/\mu_l^2 + L_\varphi},\ \frac{\lambda_m^{3/2}}{2\left( 16(\tau_l/\mu_l + \lambda_1) + \tau_l/\mu_l^2 + L_\varphi \right)} \right\}.$$

In order to obtain the upper bound $K_4$ one needs to calculate upper bounds for the quantities $c_6^l$, $c_8^l$ and $f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^l)$, and a lower bound for $c_7^l$, over all $l$. Additionally, a lower bound of the quantity $c_5^l$ is required, which will be used to obtain a tighter upper bound for $\varpi^l$ over all $l$, as required by Corollary 1. The following remark will be used for this purpose.
Remark 1 The parameters $\beta_1$, $\beta_2$ and $\tilde\tau$, $\tilde\mu$ are set such that $0 < \beta_1 \le \beta_2 < 1$ and $\log_{\beta_2}(\tilde\mu/\mu^0) \le \log_{\beta_1}(\tilde\tau/\tau^0)$. Briefly, we require that the sequence of iterates $\{\mu_l\}$ produced by (4) converges to $\tilde\mu$ before the sequence $\{\tau_l\}$ converges to $\tilde\tau$,

although $\{\tau_l\}$ has a faster or similar rate of convergence. Then, according to rule (4), the sequences $\{\beta_1^l\}$ and $\{\beta_2^l\}$ are generated, for which $0 < \beta_1^l \le \beta_2^l \le 1$ and

$$\tau_{l+1} = \beta_1^l \tau_l \quad \text{and} \quad \mu_{l+1} = \beta_2^l \mu_l. \qquad (30)$$

Even though this assumption is not necessary for convergence of the doubly-continuation scheme, it facilitates the proof of monotonic decrease of $f_{\mu_l}^{\tau_l}(x)$ over $l$, and it simplifies the calculation of some non-essential results.

Using Remark 1 it is easy to find upper bounds for the quantities $c_6^l$, $c_8^l$ and lower bounds for $c_5^l$ and $c_7^l$. In particular, from Remark 1 we have $\tilde\tau \le \tau_l \le \tau^0$ and $\tilde\mu \le \mu_l \le \mu^0$ for all $l$. Hence, the quantities $c_5^l$, $c_6^l$, $c_7^l$, $c_8^l$ are replaced with

$$\bar{c}_5 = \min\left\{ \frac{3(1-c_2)\lambda_m^{3/2}}{\tau^0/\tilde\mu^2 + L_\varphi},\ \frac{\lambda_m^{3/2}}{2\left( 16(\tau^0/\tilde\mu + \lambda_1) + \tau^0/\tilde\mu^2 + L_\varphi \right)} \right\}$$

and

$$\bar{c}_6 = \frac{(\tau^0/\tilde\mu + \lambda_1)^3}{(1-\eta^2)\lambda_m^3 c_2 c_3}, \qquad \bar{c}_7 = \frac{\lambda_m}{8(\tau^0/\tilde\mu + \lambda_1)}, \qquad \bar{c}_8 = \frac{8(\tau^0/\tilde\mu + \lambda_1)\lambda_m^2}{\left( 16(\tilde\tau/\mu^0 + \lambda_1) + \tilde\tau/(\mu^0)^2 + L_\varphi \right)^2}.$$

Having a lower bound of $c_5^l$ over all $l$, we can tighten the conditions of Corollary 1, which are replaced with

$$\|d^k\|_{x^k,l} < \varpi \quad \text{and} \quad 0 < \varpi \le \bar{c}_5 \quad \text{for all } l. \qquad (31)$$

Another consequence of Remark 1 is the following lemma.

Lemma 13 For any $x$, if Remark 1 is satisfied, then $f_{\mu^0}^{\tau^0}(x) > f_{\mu_1}^{\tau_1}(x)$ and $f_{\mu_{l-1}}^{\tau_{l-1}}(x) > f_{\mu_l}^{\tau_l}(x)$ for all $l > 1$.

Proof By (30) we can rewrite the functions $f_{\mu_1}^{\tau_1}(x)$ and $f_{\mu_l}^{\tau_l}(x)$ as $f_{\beta_2^0 \mu^0}^{\beta_1^0 \tau^0}(x)$ and $f_{\beta_2^{l-1}\mu_{l-1}}^{\beta_1^{l-1}\tau_{l-1}}(x)$, respectively. It is easy to show that $f_{\mu^0}^{\tau^0}(x) > f_{\beta_2^0 \mu^0}^{\beta_1^0 \tau^0}(x)$ and $f_{\mu_{l-1}}^{\tau_{l-1}}(x) > f_{\beta_2^{l-1}\mu_{l-1}}^{\beta_1^{l-1}\tau_{l-1}}(x)$. The proof is complete.

Therefore, using Lemma 13, the fact that $z^l$ is the minimizer of the function $f_{\mu_l}^{\tau_l}(x)$, and that $x^0$ is the initial point of the continuation scheme, we get

$$f_{\mu^0}^{\tau^0}(x^0) > f_{\mu^0}^{\tau^0}(z^0) > f_{\mu_1}^{\tau_1}(z^0) > f_{\mu_1}^{\tau_1}(z^1) > \cdots > f_{\tilde\mu}^{\tilde\tau}(\tilde{z}). \qquad (32)$$
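To make the update rule $\tau_{l+1} = \beta_1^l \tau_l$, $\mu_{l+1} = \beta_2^l \mu_l$ and the ordering requirement of Remark 1 concrete, here is a hypothetical continuation schedule in Python. All factors, targets and names are illustrative choices of ours, not the authors' settings.

```python
import math

# Hypothetical doubly-continuation schedule: tau_{l+1} = beta1 * tau_l and
# mu_{l+1} = beta2 * mu_l, with the Remark 1 requirement that {mu_l} reaches
# its target mu~ no later than {tau_l} reaches tau~.  Values are placeholders.
beta1, beta2 = 0.5, 0.7               # 0 < beta1 <= beta2 < 1
tau0, mu0 = 1.0, 1.0
tau_t, mu_t = 1e-3, 1e-1              # targets tau~ and mu~

# number of outer iterations each sequence needs to reach its target
steps_tau = math.ceil(math.log(tau_t / tau0, beta1))
steps_mu = math.ceil(math.log(mu_t / mu0, beta2))
assert steps_mu <= steps_tau          # mu converges first (Remark 1)

tau, mu, schedule = tau0, mu0, []
for _ in range(steps_tau):
    tau = max(beta1 * tau, tau_t)     # geometric decrease, clamped at target
    mu = max(beta2 * mu, mu_t)
    schedule.append((tau, mu))

assert schedule[-1] == (tau_t, mu_t)  # both sequences end at their targets
```

Clamping at the targets corresponds to the second scenario above, where a factor becomes equal to one once its sequence has converged.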

It remains now to find a bound for the distance $\rho^l = f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^l)$. For $l = 0$, using (32) we get

$$\rho^0 < f_{\mu^0}^{\tau^0}(x^0) - f_{\tilde\mu}^{\tilde\tau}(\tilde{z}). \qquad (33)$$

For $l > 0$ we proceed by adding and subtracting the term $f_{\mu_l}^{\tau_l}(z^{l-1})$ in $\rho^l$:

$$\rho^l = f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^{l-1}) + f_{\mu_l}^{\tau_l}(z^{l-1}) - f_{\mu_l}^{\tau_l}(z^l).$$

We shall analyze independently the two quantities

$$\rho_1^l = f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^{l-1}) \quad \text{and} \quad \rho_2^l = f_{\mu_l}^{\tau_l}(z^{l-1}) - f_{\mu_l}^{\tau_l}(z^l).$$

First we start with the bound on $\rho_1^l$. From strong convexity of $f_{\mu_l}^{\tau_l}(x)$, we have that

$$\rho_1^l \le \nabla f_{\mu_l}^{\tau_l}(x^l)^T (x^l - z^{l-1}) - \frac{\lambda_m}{2}\|x^l - z^{l-1}\|^2 \le \|\nabla f_{\mu_l}^{\tau_l}(x^l)\|\,\|x^l - z^{l-1}\| - \frac{\lambda_m}{2}\|x^l - z^{l-1}\|^2. \qquad (34)$$

It is obvious that we need tight bounds for $\|\nabla f_{\mu_l}^{\tau_l}(x^l)\|$ and $\|x^l - z^{l-1}\|$. From Remark 1, we distinguish two scenarios for the sequences $\{\beta_1^l\}$ and $\{\beta_2^l\}$: first, $0 < \beta_1^l \le \beta_2^l < 1$; second, $0 < \beta_1^l \le 1$ and $\beta_2^l = 1$. For the first case, by using (30) and (19) in (34) we get

$$\rho_1^l \le \left( \|\nabla f_{\mu_{l-1}}^{\tau_{l-1}}(x^l)\| + \tau_{l-1}\frac{-\log(\beta_2^{l-1})}{1-\beta_2^{l-1}}\sqrt{m} \right)\|x^l - z^{l-1}\| - \frac{\lambda_m}{2}\|x^l - z^{l-1}\|^2,$$

while if $0 < \beta_1^l \le 1$ and $\beta_2^l = 1$, then using (30) and (20) in (34) we get

$$\rho_1^l \le \left( \|\nabla f_{\mu_{l-1}}^{\tau_{l-1}}(x^l)\| + \tau_{l-1}\sqrt{m} \right)\|x^l - z^{l-1}\| - \frac{\lambda_m}{2}\|x^l - z^{l-1}\|^2.$$

The upper bound of $\rho_1^l$ based on these two cases is given by the first bound, by replacing $\tau_{l-1}$ with $\tau^0$ and $\beta_2^{l-1}$ with $\bar\beta = \min_l \beta_2^l$:

$$\rho_1^l \le \left( \|\nabla f_{\mu_{l-1}}^{\tau_{l-1}}(x^l)\| + \tau^0 \omega \sqrt{m} \right)\|x^l - z^{l-1}\| - \frac{\lambda_m}{2}\|x^l - z^{l-1}\|^2, \qquad (35)$$

where $\omega = \frac{-\log(\bar\beta)}{1-\bar\beta}$. Using (18), Lemmas 5 and 9, and setting $\eta < 1$, we have that

$$\|\nabla f_{\mu_{l-1}}^{\tau_{l-1}}(x^l)\| \le \frac{\tau_{l-1}/\mu_{l-1} + \lambda_1}{\left( (1-\eta^2)\lambda_m \right)^{1/2}}\|d^l\|_{x^l,l-1} \qquad (36)$$

and

$$\|x^l - z^{l-1}\| \le \frac{\|\nabla f_{\mu_{l-1}}^{\tau_{l-1}}(x^l)\|}{\lambda_m} \le \frac{4(\tau_{l-1}/\mu_{l-1} + \lambda_1)}{(1-\eta^2)^{1/2}\lambda_m^{3/2}}\|d^l\|_{x^l,l-1}. \qquad (37)$$

By replacing (36) and (37) in (35) we get

$$\rho_1^l \le \left( \frac{\tau_{l-1}/\mu_{l-1} + \lambda_1}{\left( (1-\eta^2)\lambda_m \right)^{1/2}}\|d^l\|_{x^l,l-1} + \tau^0 \omega \sqrt{m} \right)\frac{4(\tau_{l-1}/\mu_{l-1} + \lambda_1)}{(1-\eta^2)^{1/2}\lambda_m^{3/2}}\|d^l\|_{x^l,l-1}.$$

Notice that at termination of the $(l-1)$-th inner loop of the dcNCG algorithm, step 7, we have that $\|d^l\|_{x^l,l-1} \le \epsilon^{l-1}$. By considering a sequence of termination tolerances such that $\epsilon^1 \le \epsilon^0$ and $\epsilon^l \le \epsilon^{l-1}$ for all $l > 1$, we get $\|d^l\|_{x^l,l-1} \le \epsilon^0$; hence,

$$\rho_1^l \le \frac{4\omega\sqrt{m}\,(\tau^0/\tilde\mu + \lambda_1)}{(1-\eta^2)\lambda_m^{3/2}}\,\epsilon^0. \qquad (38)$$

Regarding $\rho_2^l$, using (32) we get

$$\rho_2^l < f_{\mu^0}^{\tau^0}(x^0) - f_{\tilde\mu}^{\tilde\tau}(\tilde{z}), \qquad (39)$$

which is a finite constant. We can now calculate the bound of $\rho^l = \rho_1^l + \rho_2^l$, for $l > 0$, through the bounds (38) and (39) of $\rho_1^l$ and $\rho_2^l$, respectively:

$$\rho^l = \rho_1^l + \rho_2^l < \frac{4\omega\sqrt{m}\,(\tau^0/\tilde\mu + \lambda_1)}{(1-\eta^2)\lambda_m^{3/2}}\,\epsilon^0 + f_{\mu^0}^{\tau^0}(x^0) - f_{\tilde\mu}^{\tilde\tau}(\tilde{z}) = \bar\rho, \qquad (40)$$

which holds if every point $x^l$ satisfies $\|d^l\|_{x^l,l-1} \le \epsilon^0$.

In the following theorem the worst-case iteration complexity of the doubly-continuation scheme is given. Let us remind the reader that in this theorem the constants $\bar{c}_5$, $\bar{c}_6$, $\bar{c}_7$, $\bar{c}_8$ and $\bar\rho$ have been defined at the beginning of this section, the constant $\varpi$ satisfies (31), and $\eta < 1$ is the tolerance of the termination criterion (1) of the CG algorithm.

Theorem 5 If $\beta_1$, $\beta_2$ are set according to Remark 1 and $\epsilon^1 \le \epsilon^0$, $\epsilon^l \le \epsilon^{l-1}$ for all $l > 1$ in step 7 of the dcNCG algorithm, then the dcNCG algorithm initialized at $x^0$ requires at most

$$K_5 = \left( \bar{c}_6 \log\left( \frac{\bar\rho}{\bar{c}_7 \varpi^2} \right) + \log_2 \log_2 \left( \frac{\bar{c}_8}{\tilde\epsilon} \right) \right)\log_{\beta_1}\left( \frac{\tilde\tau}{\tau^0} \right)$$

iterations to converge to a solution $\tilde{x}$ of accuracy $f_{\tilde\mu}^{\tilde\tau}(\tilde{x}) - f_{\tilde\mu}^{\tilde\tau}(\tilde{z}) < \tilde\epsilon$, where

$$\tilde\epsilon = \frac{\lambda_m}{8(\tilde\tau/\tilde\mu + \lambda_1)}(\epsilon^0)^2.$$

Proof According to Remark 1, the number of continuation iterations is $\vartheta = \log_{\beta_1}(\tilde\tau/\tau^0)$. A worst-case $K_3^l$ over all $l$, denoted at the beginning of this section by $K_4$, is given by replacing the quantities $\varpi^l$, $c_5^l$, $c_6^l$, $c_7^l$, $c_8^l$ and $f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^l)$ in (29) with appropriate bounds over all continuation iterations $l$, as has been shown previously in this section. First, $c_5^l$, $c_7^l$ and $c_6^l$, $c_8^l$ are replaced with their lower and upper bounds $\bar{c}_5$, $\bar{c}_7$ and $\bar{c}_6$, $\bar{c}_8$, respectively, over all $l$. The conditions $\|d^k\|_{x^k,l} < \varpi^l$ and $\varpi^l \le c_5^l$ for all $l$, of Corollary 1, are replaced by the tighter conditions (31). For $l = 0$ we bound $f_{\mu_l}^{\tau_l}(x^l) - f_{\mu_l}^{\tau_l}(z^l)$ by (33), while for $l > 0$ we bound this distance by (40), which holds if $\|d^k\|_{x^k,l-1} \le \epsilon^0$ in step 7 of dcNCG. To guarantee the previous condition, sufficiently many iterations of Newton-CG are performed at continuation step $l-1$. In order to calculate a worst-case iteration complexity such that $\|d^k\|_{x^k,l-1} \le \epsilon^0$, two cases are distinguished for the magnitude of $\epsilon^0$: first $\epsilon^0 < \varpi$, second $\epsilon^0 \ge \varpi$. The worst case is given for $\epsilon^0 < \varpi$. If $l = 0$, then, according to Theorem 4,

$$\bar{c}_6 \log\left( \frac{f_{\mu^0}^{\tau^0}(x^0) - f_{\tilde\mu}^{\tilde\tau}(\tilde{z})}{\bar{c}_7 \varpi^2} \right) + \log_2 \log_2 \left( \frac{\bar{c}_8}{\tilde\epsilon^l} \right)$$

iterations are needed for Newton-CG to converge to a solution $x^{l+1}$ such that $f_{\mu_l}^{\tau_l}(x^{l+1}) - f_{\mu_l}^{\tau_l}(z^l) < \tilde\epsilon^l$ and $\|d^{l+1}\|_{x^{l+1},l} < \epsilon^0$, where

$$\tilde\epsilon^l = \frac{\lambda_m}{8(\tau_l/\mu_l + \lambda_1)}(\epsilon^0)^2.$$

The previous bound on $\|d^{l+1}\|_{x^{l+1},l}$ is obtained by using (18), Lemmas 5 and 9, which give us

$$\frac{\lambda_m}{8(\tau_l/\mu_l + \lambda_1)}\|d^{l+1}\|_{x^{l+1},l}^2 \le f_{\mu_l}^{\tau_l}(x^{l+1}) - f_{\mu_l}^{\tau_l}(z^l) < \tilde\epsilon^l.$$

Hence from (40) we get that $\rho^1 < \bar\rho$. If the previous statement holds, then for $l = 1$, according to Theorem 4 and (40), by performing $\bar{c}_6 \log(\bar\rho/(\bar{c}_7 \varpi^2)) + \log_2\log_2(\bar{c}_8/\tilde\epsilon^l)$ iterations of Newton-CG it is guaranteed that $\|d^{l+1}\|_{x^{l+1},l} < \epsilon^0$ and $\rho^2 < \bar\rho$. In a similar way it is guaranteed that $\rho^l < \bar\rho$ for all $l > 0$. Additionally, $\rho^0 < \bar\rho$, because the upper bound of $\rho^0$ in (33) is smaller than $\bar\rho$; therefore $\rho^l < \bar\rho$ for all $l$.
From Remark 1 we have $\tilde\mu \le \mu_l \le \mu^0$ and $\tilde\tau \le \tau_l \le \tau^0$; hence $\tilde\epsilon \le \tilde\epsilon^l$ for all $l$. Notice that the condition $\|d^{l+1}\|_{x^{l+1},l} < \epsilon^0$ is preserved if we require accuracy $\tilde\epsilon$ for all continuation iterations instead of $\tilde\epsilon^l$. We conclude that

$$\max_{0 \le l \le \vartheta} K_3^l \le K_4 = \bar{c}_6 \log\left( \frac{\bar\rho}{\bar{c}_7 \varpi^2} \right) + \log_2 \log_2 \left( \frac{\bar{c}_8}{\tilde\epsilon} \right).$$

If $K_4$ iterations of Newton-CG are performed for every continuation iteration $l$, then $\|d^{l+1}\|_{x^{l+1},l} < \epsilon^0$ and $f_{\mu_l}^{\tau_l}(x^{l+1}) - f_{\mu_l}^{\tau_l}(z^l) < \tilde\epsilon^l$. By multiplying $K_4$ by $\vartheta$ we obtain the result. The proof is complete.
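Putting the two loops together, the following is a compact runnable sketch of the dcNCG idea on an $\ell_1$-regularized least-squares instance, with the $\ell_1$ term smoothed by the pseudo-Huber function $\psi_\mu(t) = \sqrt{\mu^2 + t^2} - \mu$. This is our own illustrative reconstruction, not the authors' implementation: the continuation schedule, the tolerances and the fixed CG forcing term `eta` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 15
A = rng.standard_normal((m, n))        # tall full-rank A gives strongly convex phi
b = rng.standard_normal(m)

def f(x, tau, mu):                     # smoothed objective: tau*psi_mu + least squares
    r = A @ x - b
    return tau * np.sum(np.sqrt(mu**2 + x**2) - mu) + 0.5 * (r @ r)

def grad(x, tau, mu):
    return tau * x / np.sqrt(mu**2 + x**2) + A.T @ (A @ x - b)

def hess(x, tau, mu):
    return np.diag(tau * mu**2 / (mu**2 + x**2) ** 1.5) + A.T @ A

def cg(H, g, eta):
    """Conjugate gradients for H d = -g, stopped once ||H d + g|| <= eta ||g||."""
    d = np.zeros_like(g)
    r = -g.copy()                      # residual -g - H d (d = 0 initially)
    p = r.copy()
    for _ in range(2 * g.size):
        if np.linalg.norm(r) <= eta * np.linalg.norm(g):
            break
        Hp = H @ p
        a = (r @ r) / (p @ Hp)
        d = d + a * p
        r_new = r - a * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

def newton_cg(x, tau, mu, tol=1e-8, eta=0.1, c2=1e-4, c3=0.5):
    """Inner loop: inexact Newton with backtracking (Armijo) line-search."""
    for _ in range(100):
        g = grad(x, tau, mu)
        if np.linalg.norm(g) <= tol:
            break
        d = cg(hess(x, tau, mu), g, eta)
        alpha = 1.0
        while f(x + alpha * d, tau, mu) > f(x, tau, mu) + c2 * alpha * (g @ d):
            alpha *= c3
        x = x + alpha * d
    return x

# Outer loop: doubly continuation, driving (tau_l, mu_l) to their targets.
tau, mu, x = 1.0, 1.0, np.zeros(n)
for _ in range(20):
    x = newton_cg(x, tau, mu)
    tau, mu = max(0.5 * tau, 1e-2), max(0.5 * mu, 1e-4)
x = newton_cg(x, tau, mu)              # final subproblem at the target values

assert np.linalg.norm(grad(x, tau, mu)) < 1e-6
```

Each subproblem is warm-started from the approximate solution of the previous one, which mirrors what the bound $\rho^l < \bar\rho$ exploits in the analysis above.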

Edinburgh Research Explorer
Citation for published version: Fountoulakis, K. & Gondzio, J. 2016, 'A Second-Order Method for Strongly Convex L1-Regularization Problems'.


More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Convergence Analysis of Inexact Infeasible Interior Point Method. for Linear Optimization

Convergence Analysis of Inexact Infeasible Interior Point Method. for Linear Optimization Convergence Analysis of Inexact Infeasible Interior Point Method for Linear Optimization Ghussoun Al-Jeiroudi Jacek Gondzio School of Mathematics The University of Edinburgh Mayfield Road, Edinburgh EH9

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Linear algebra issues in Interior Point methods for bound-constrained least-squares problems

Linear algebra issues in Interior Point methods for bound-constrained least-squares problems Linear algebra issues in Interior Point methods for bound-constrained least-squares problems Stefania Bellavia Dipartimento di Energetica S. Stecco Università degli Studi di Firenze Joint work with Jacek

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

6. Iterative Methods for Linear Systems. The stepwise approach to the solution...

6. Iterative Methods for Linear Systems. The stepwise approach to the solution... 6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse

More information

Sparse Approximation via Penalty Decomposition Methods

Sparse Approximation via Penalty Decomposition Methods Sparse Approximation via Penalty Decomposition Methods Zhaosong Lu Yong Zhang February 19, 2012 Abstract In this paper we consider sparse approximation problems, that is, general l 0 minimization problems

More information

Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method

Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method Robert M. Freund March, 2004 2004 Massachusetts Institute of Technology. The Problem The logarithmic barrier approach

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005 3 Numerical Solution of Nonlinear Equations and Systems 3.1 Fixed point iteration Reamrk 3.1 Problem Given a function F : lr n lr n, compute x lr n such that ( ) F(x ) = 0. In this chapter, we consider

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Complexity of gradient descent for multiobjective optimization

Complexity of gradient descent for multiobjective optimization Complexity of gradient descent for multiobjective optimization J. Fliege A. I. F. Vaz L. N. Vicente July 18, 2018 Abstract A number of first-order methods have been proposed for smooth multiobjective optimization

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

Search Directions for Unconstrained Optimization

Search Directions for Unconstrained Optimization 8 CHAPTER 8 Search Directions for Unconstrained Optimization In this chapter we study the choice of search directions used in our basic updating scheme x +1 = x + t d. for solving P min f(x). x R n All

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

December 30, 2014, revised in July 2, 2015

December 30, 2014, revised in July 2, 2015 A PRECONDITIONER FOR A PRIMAL-DUAL NEWTON CONJUGATE GRADIENTS METHOD FOR COMPRESSED SENSING PROBLEMS December 30, 2014, revised in July 2, 2015 IOANNIS DASSIOS, KIMON FOUNTOULAKIS, AND JACEK GONDZIO Abstract.

More information

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with Travis Johnson, Northwestern University Daniel P. Robinson, Johns

More information

Gradient Methods Using Momentum and Memory

Gradient Methods Using Momentum and Memory Chapter 3 Gradient Methods Using Momentum and Memory The steepest descent method described in Chapter always steps in the negative gradient direction, which is orthogonal to the boundary of the level set

More information

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University Convex Optimization 9. Unconstrained minimization Prof. Ying Cui Department of Electrical Engineering Shanghai Jiao Tong University 2017 Autumn Semester SJTU Ying Cui 1 / 40 Outline Unconstrained minimization

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Gradient Descent, Newton-like Methods Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them in class/help-session/tutorial this

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Introduction to Nonlinear Optimization Paul J. Atzberger

Introduction to Nonlinear Optimization Paul J. Atzberger Introduction to Nonlinear Optimization Paul J. Atzberger Comments should be sent to: atzberg@math.ucsb.edu Introduction We shall discuss in these notes a brief introduction to nonlinear optimization concepts,

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

An Inexact Newton Method for Nonlinear Constrained Optimization

An Inexact Newton Method for Nonlinear Constrained Optimization An Inexact Newton Method for Nonlinear Constrained Optimization Frank E. Curtis Numerical Analysis Seminar, January 23, 2009 Outline Motivation and background Algorithm development and theoretical results

More information

Lecture 22. r i+1 = b Ax i+1 = b A(x i + α i r i ) =(b Ax i ) α i Ar i = r i α i Ar i

Lecture 22. r i+1 = b Ax i+1 = b A(x i + α i r i ) =(b Ax i ) α i Ar i = r i α i Ar i 8.409 An Algorithmist s oolkit December, 009 Lecturer: Jonathan Kelner Lecture Last time Last time, we reduced solving sparse systems of linear equations Ax = b where A is symmetric and positive definite

More information

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)

Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL) Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Lecture 14: Newton s Method

Lecture 14: Newton s Method 10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Infeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization

Infeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization Infeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with James V. Burke, University of Washington Daniel

More information

Physics 202 Laboratory 5. Linear Algebra 1. Laboratory 5. Physics 202 Laboratory

Physics 202 Laboratory 5. Linear Algebra 1. Laboratory 5. Physics 202 Laboratory Physics 202 Laboratory 5 Linear Algebra Laboratory 5 Physics 202 Laboratory We close our whirlwind tour of numerical methods by advertising some elements of (numerical) linear algebra. There are three

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information