A Second-Order Method for Strongly Convex l 1 -Regularization Problems

Size: px

Start display at page:

Download "A Second-Order Method for Strongly Convex l 1 -Regularization Problems"

Augustine Hill
5 years ago
Views:

1 Noname manuscript No. (will be inserted by the editor) A Second-Order Method for Strongly Convex l 1 -Regularization Problems Kimon Fountoulakis and Jacek Gondzio Technical Report ERGO June, 13 Abstract In this paper a second-order method for solving large-scale strongly convex l 1 -regularized problems is developed. The proposed method is a Newton- CG (Conjugate Gradients) algorithm with backtracking line-search embedded in a doubly-continuation scheme. Worst-case iteration complexity of the proposed Newton-CG is established. Based on the analysis of Newton-CG, worstcase iteration complexity of the doubly-continuation scheme is obtained. Numerical results are presented on large-scale problems for the doublycontinuation Newton-CG algorithm, which show that the proposed secondorder method competes favourably with state-of-the-art first-order methods. In addition, l 1 -regularized Sparse Least-Squares problems are discussed for which a parallel block coordinate descent method stagnates. Keywords l 1 -regularization Strongly convex optimization Second-order methods Iteration Complexity Continuation Newton Conjugate-Gradients method Mathematics Subject Classification () 68W4, 65K5, 9C6, 9C5, 9C3, 9C51 J. Gondzio is supported by EPSRC Grant EP/I1717/1 Kimon Fountoulakis School of Mathematics and Maxwell Institute, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, United Kingdom. K.Fountoulakis@sms.ed.ac.uk Tel.: Jacek Gondzio School of Mathematics and Maxwell Institute, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, United Kingdom. J.Gondzio@ed.ac.uk Tel.: , Fax:

2 Kimon Fountoulakis and Jacek Gondzio 1 Introduction We are concerned with the solution of the following optimization problem minimize f (x) := x 1 + ϕ(x), (1) where x R m, > and 1 is the l 1 -norm. The following three assumptions are made: The function ϕ(x) is twice differentiable, and strongly convex, which implies that at any x its second derivative ϕ(x) is uniformly bounded λ m I ϕ(x) λ 1 I, () with < λ m λ 1, where I is the identity matrix in appropriate dimension. The second derivative of ϕ(x) is Lipschitz continuous ϕ(y) ϕ(x) L ϕ y x, (3) for any x, y, where L ϕ is the Lipschitz constant of ϕ(x), is the l -norm. Recently there has been an increased interest in the solution of strongly convex problems such as (1). Such problems can be found in well-established scientific fields, for example, Regression [39] and Machine Learning [49]. Additionally, there has been a number of efficient methods in the optimization field [6,19,35,37,4 4,45,47] for the solution of problem (1). Observe, that all approaches proposed in the previously cited papers are first-order methods. There has been also a series of interesting papers which describe the adaptation of second-order methods to such problems [4,1,13,16,1 4,36] but despite their author s efforts they do not seem to compete favourably with state-of-the-art first-order methods [48]. First-order methods rely on properties of the l 1 -norm to obtain the new direction at each iteration. In particular, very often the direction of the solver is obtained by minimizing exactly an upper bound of the objective function in problem (1), d := arg min p x + p 1 + ϕ(x) + ϕ(x) p + L ϕ p, where x is the current iteration and d is the direction [7]. Other first-order methods use the decomposability of the former problem and solve it only for some chosen coordinates [35]. In this case, the Lipschitz constant is replaced by partial Lipschitz constants for each chosen coordinate. Another approach exploits the replacement of the l 1 -norm by a first-order differentiable parameterized function which approximates the former [8]. These efficient techniques of manipulating the non-smooth part, the l 1 -norm, in combination with the strongly convex properties of the problem have settled first-order methods as the prevalent approach. On the contrary, for second-order methods, examples of the most commonly used techniques are

3 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 3 [11,36] in which the l 1 -norm, x 1, is replaced with m u i + v i, subject to: u i, v i i=1 i = 1,,..., m and then the optimal x is recovered by x = u v. [1] in which the direction at every iteration is obtained by approximately solving d := arg min p x + p 1 + ϕ(x) + ϕ(x) p + 1 dt ϕ(x)d, using a coordinate descent algorithm. The drawback of these approaches is that in the first case the dimension of the problem is doubled and additional simple constraints are introduced. Whilst, in the second case, the convergence of the coordinate descent algorithm is sensitive to the spectral properties of matrix ϕ(x). In this paper, our goals are: 1. Keep the unconstrained nature of the problem, maintain the size of it unchanged, while the consequences of the non-smoothness of the l 1 -norm are weakened.. Employ a Newton-CG algorithm with backtracking line-search which is embedded in a doubly-continuation scheme for further acceleration. 3. Give a complete analysis of Newton-CG with backtracking line-search, i.e. proof of global convergence, global and local convergence rates results, local region of fast convergence rate and worst-case iteration complexity, as in standard analysis of the Newton algorithm with backtracking line-search in []. Based on a range of published papers and a book [3,7,8,6,31,3], from various fields in which Newton-CG with backtracking line-search is applied, there has been no similar analysis. To be precise, the analysis in the previous citations provides global convergence guarantees for various globalization techniques of Newton-CG. Additionally, local convergence rate results are obtained assuming that the method is initialized in a neighbourhood of the optimal solution which is not explicitly defined, 4. Provide worst-case iteration complexity of the doubly-continuation scheme based on the iteration complexity of Newton-CG with backtracking linesearch. Continuation schemes are generally well-known among optimizers [1,1,14,15,5,43,44,46], they can speed-up a good solver, unfortunately, they are difficult to analyze. In particular, in the previously cited papers there has been no iteration complexity results. In order to meet the first goal, the l 1 -norm is approximated by a smooth function which has derivatives of all degrees. Hence, problem (1) is replaced by minimize f µ (x) := ψ µ (x) + ϕ(x).

4 4 Kimon Fountoulakis and Jacek Gondzio where ψ µ (x) denotes the smooth function which replaces the l 1 -norm and µ is a parameter which controls the quality of approximation. Then, a Newton- CG algorithm embedded in a doubly-continuation scheme is applied to solve approximately a sequence of the above approximate problems. In what follows in this section we give a brief introduction of the proposed approach. In Section, necessary basic results are given which will be used to support theoretical results in Sections 4 and 5. In Section 3, the proposed doubly-continuation scheme is discussed in details. In Section 4, the convergence analysis of Newton-CG with backtracking line-search is studied. In Section 5, worst-case iteration complexity of doubly-continuation Newton- CG with backtracking line-search is presented. Finally, in Section 6, numerical results are presented, in which we compare the proposed method with stateof-the-art first-order methods on large-scale Sparse Least-Squares and Logistic Regression problems. 1.1 Pseudo-Huber regularization The non-smoothness of the l 1 -norm makes a straightforward application of the second-order method to problem (1) impossible. In this subsection, we focus on approximating the non-smooth l 1 -norm by a smooth function. To meet such a goal, the first-order methods community replaces the l 1 -norm with the so-called Huber penalty function m i=1 φ µ(x i ) [1], where { 1 x i φ µ (x i ) = µ, if x i µ x i 1 µ, if x i = 1,,..., m i µ and µ >. The smaller the parameter µ of the Huber function is, the better the function approximates the l 1 -norm. Observe that the Huber function is only first-order differentiable, therefore, this approximation trick is not applicable to second-order methods. Fortunately, there is a smooth version of the Huber function, the pseudo-huber function which has derivatives of all degrees [17]. The pseudo-huber function parameterized with µ > is ψ µ (x) = µ m i=1 ( 1 + x i µ 1 ). (4) A comparison of the three functions l 1 -norm, Huber and Pseudo-Huber function can be seen in Figure 1. The pseudo-huber function is employed in design of an efficient secondorder method in this paper. In particular, the l 1 -regularization problem in (1) is replaced with the following approximation minimize f µ (x) := ψ µ (x) + ϕ(x). (5) The advantages of such an approach are listed below.

5 A Second-Order Method for Strongly Convex l 1 -Regularization Problems µ= µ= µ= L1 norm Huber Pseudo Huber (a) l 1 -norm, Huber, Pseudo-Huber functions.1 L1 norm Pseudo Huber µ= (b) Pseudo-Huber function for µ Fig. 1 Comparison of the approximation functions, Huber and pseudo-huber, with the l 1 - norm in one dimensional space. Fig.1a shows the quality of approximation for the Huber and pseudo-huber functions. Fig.1b shows how pseudo-huber function converges to the l 1 -norm as µ Availability of second-order information owed to the differentiability of the pseudo-huber function. Avoiding the need to double problem dimensions. Opening the door to using iterative methods to compute descent directions. There is an obvious cost which comes along with the above benefits, and that is the approximate nature of the pseudo-huber function. There is a concern that in case that a very accurate solution is required, the pseudo-huber function may be unable to deliver it. In theory, since the quality of the approximation is controlled by parameter µ in (4), see Figure 1, then the pseudo-huber function can recover any level of accuracy under the condition that sufficiently small µ is chosen. In practise a very small parameter µ might cause instabilities in the linear algebra of the solver. However, we shall show in Section 6 that even when µ is set to small values, i.e. 1.e 8, the proposed method is very efficient. By replacing the initial problem (1) with the approximation (5) we obtain a smooth strongly convex problem. Since problem (5) is unconstrained the application of a Newton-CG algorithm [9,6,33] is a good fit for purpose. In particular, a Newton-CG direction is calculated by solving the Newton linear system f µ (x)d = f µ (x) (6) using the CG algorithm [18,, 38]. The CG algorithm is terminated prematurely, therefore, an approximation of the actual Newton direction is obtained. Moreover, a backtracking line-search approach is used which guarantees global convergence of the Newton-CG algorithm. To achieve a maximum practical efficiency the Newton-CG algorithm is embedded into a doubly-continuation scheme. By the term doubly-continuation, it is meant that continuation is ap-

6 6 Kimon Fountoulakis and Jacek Gondzio plied on the parameters and µ of problem (5) simultaneously. Details for this method will be discussed in Section 3. Preliminaries Throughout the paper the standard dot product, i.e. y, x = y x is used. Occasionally, the local norm x :=, f µ (x) will be employed. The denotes the infinity norm. The operator diag( ) returns a diagonal matrix with the diagonal components of the input square matrix or takes as input a vector and creates a diagonal matrix with the input vector on the diagonal. The operator [ ] ij returns the element at row i and column j of the input matrix, for vectors, the operator [ ] j is used instead. Finally, is the ceil function..1 Properties of pseudo-huber function The gradient of the pseudo-huber function ψ µ (x) in (4) is given by ψ µ (x) = 1 µ and the Hessian is given by [x 1 ( 1 + x 1 µ ) 1 ( ),..., x m 1 + x 1 ] m µ, (7) ( 1 ψ µ (x) = diag (1 + µ[ x 1 µ ) 3 x ]),..., (1 + m µ ) 3. (8) The following lemma shows that the gradient of the function ψ µ (x) is bounded. Lemma 1 The gradient ψ µ (x) satisfies 1 m ψ µ (x) 1 m. where 1 m is a vector of ones of length m. Proof Since µ + x i x i µ + x i i = 1,,..., m, we get 1 ) 1 x i (µ + x i 1. The proof is complete. The next lemma guarantees that the Hessian of the pseudo-huber function ψ µ (x) is bounded. Lemma The Hessian matrix ψ µ (x) satisfies I ψ µ (x) 1 µ I where I is the identity matrix in appropriate dimension.

7 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 7 Proof The result follows easily by observing that < x i, i = 1,..., m. The proof is complete. ( ) x i µ 1 for any According to Lemma all eigenvalues of the Hessian matrix of the pseudo- Huber function are positive, thus, the function is strictly convex. The next lemma shows that the Hessian matrix of the pseudo-huber function is Lipschitz continuous. Lemma 3 The Hessian matrix ψ µ (x) is Lipschitz continuous ψ µ (y) ψ µ (x) 1 y x. µ Proof ψ µ (y) ψ µ (x) = 1 1 d ψ µ (x + s(y x)) ds ds d ψ µ (x + s(y x)) ds (9) ds where d ψ µ(x+s(y x)) ds is a diagonal matrix with each diagonal component, i = 1,,... m, given by [ d ψ µ (x + s(y x)) ] ds Using the previous observation we have that d ψ µ (x + s(y x)) = ds Moreover, we have [ d ψ µ (x + s(y x)) ] ds = ii ii = 3(x i + s(y i x i ))(y i x i ). µ 3 (1 + (xi+s(yi xi)) µ ) 5 max i=1,,...,m [ d ψ µ (x + s(y x)) ] ds 3(x i + s(y i x i ))(y i x i ) = µ 3 (1 + (xi+s(yi xi)) µ ) 5 3(x i + s(y i x i )) µ 3 (1 + (xi+s(yi xi)) µ ) 5 where the first absolute value in (11) has a maximum at 3(x i + s(y i x i )) µ 3 (1 + (xi+s(yi xi)) µ ) 5 Combining (11) and (1) we get (1) ii (y i x i ) (11) µ x i (y i x i), which gives µ < 1 µ. (1) [ d ψ µ (x + s(y x)) ] 1 ds ii µ y i x i. (13)

8 8 Kimon Fountoulakis and Jacek Gondzio Replacing (13) in (1) and using the fact that we get d ψ µ (x + s(y x)) 1 y x. ds µ Replacing the above expression in (9) and calculating the integral we arrive at the desired result. The proof is complete. Now consider the gradient of pseudo-huber function of a particular (fixed) point x and consider a function of two parameters an µ defined as follows ξ x (, µ) = ψ µ (x). Lemma 4 Let ξ x (, µ) = ψ µ (x), then if and µ > µ, the following bound holds ξ x (, µ) ξ x (, µ) ω(, µ,, µ) m where ω(, µ,, µ) = ( ) + In case that µ = µ, the following holds µ µ µ µ ξ x (, µ) ξ x (, µ) ( ) m. log µ µ. Proof For simplicity of notation, let us define variables ζ = (, µ) and ζ = (, µ). Then where ξ x ( ζ) ξ x (ζ) = 1 1 dξ x (ζ + s( ζ ζ)) ds ds dξ x(ζ + s( ζ ζ)) ds (14) ds dξ x (ζ + s( ζ ζ)) ds ψ µ+s( µ µ) (x) = dξ x(ζ + s( ζ ζ)) d(ζ + s( ζ ζ)) = d(ζ + s( ζ ζ)) ds γ 1 (s)[ ψ µ+s( µ µ) (x)] 1 γ (s)[ ψ µ+s( µ µ) (x)]. γ m (s)[ ψ µ+s( µ µ) (x)] m [ µ µ ] (15) where γ i (s) = Using Lemma 1 and (15) we get that ( + s( ))(µ + s( µ µ)) (µ + s( µ µ)) + x i dξ x(ζ + s( ζ ζ)) ( + γ(s) µ µ ) m, (16) ds

9 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 9 where γ(s) = + s( ) µ + s( µ µ) By replacing (16) in (14) and using and µ > µ, we get the first result. For the case that µ = µ, using Lemma 1 we get ξ x (, µ) ξ x (, µ) = ( ) ψ µ (x) ( ) m. The proof is complete.. Properties of function f µ (x) Having all necessary properties of pseudo-huber function (4) we can present basic properties of function f µ (x) defined in (5). The gradient of f µ (x) is given by f µ (x) = ψ µ (x) + ϕ(x) where ψ µ (x) has been defined in (7). The Hessian matrix of f µ (x) is f µ (x) = ψ µ (x) + ϕ(x). where ψ µ (x) has been defined in (8). Using () and Lemma we get the following bounds on the Hessian matrix of f µ (x) ( ) λ m I f µ (x) µ + λ 1 I, (17) where I is the identity matrix in appropriate dimension. Hence, the equivalence of the Euclidean and the local norm is given by the following inequality λ 1 m d d x ( µ + λ 1 ) 1 d. (18) Lemma 5 For any x and x, the minimizer of f µ (x), the following hold 1 ( ) f µ + λ µ (x) f µ (x) f µ (x ) 1 f µ (x) λ m 1 and x x λ m f µ (x). Proof The right hand side of the first inequality is proved in page 46 of []. The left hand side of the first inequality is proved by using strong convexity of f µ (x), f µ (y) f µ (x) + f µ (x) (y x) + µ + λ 1 y x

10 1 Kimon Fountoulakis and Jacek Gondzio and defining ỹ = x 1 f µ µ +λ1 (x), we get f µ (x) f µ (x ) f µ (x) f µ 1 (ỹ) ( ) f µ + λ µ (x). 1 The last inequality is proved in page 46 of []. The proof is complete. The following lemma guarantees that the Hessian matrix f µ (x) is Lipschitz continuous. In this lemma, L ϕ is defined in (3). Lemma 6 The function f µ (x) is Lipschitz continuous where L f µ := µ + L ϕ. f µ (y) f µ (x) L f µ y x, Proof Using Lemma 3 and (3) we have f µ (y) f µ (x) ψ µ (y) ψ µ (x) + ϕ(y) ϕ(x) ( ) µ + L ϕ y x. The Lipschitz constant of f µ (x) is therefore L f µ := µ + L ϕ. The next lemma gives a useful bound for the function f µ (x) where its independent variables are the parameters, µ and x is a constant. Lemma 7 If and µ > µ, then the gradient f µ (x) satisfies f µ (x) f µ (x) ω(, µ,, µ) m with ω(, µ,, µ) as in Lemma 4. In case that µ = µ, the following holds f µ (x) f µ (x) ( ) m. Proof The proof follows easily from Lemma 4 because f µ (x) f µ (x) = ξ x (, µ) ξ x (, µ). The proof is complete. Using the reverse triangular inequality on f µ (x) as a function of parameters and µ we get f µ (x) f µ (x) f µ (x) f µ (x) and by applying Lemma 7 in the former we get and f µ (x) f µ (x) + ω(, µ,, µ) m, (19) f µ (x) f µ (x) + ( ) m, () respectively. The next lemma shows how well the second-order Taylor expansion of f µ (x) approximates the function f µ (x).

11 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 11 Lemma 8 If q µ (y) is a quadratic approximation of the function f µ (x) at x q µ (y) := f µ (x) + f µ (x) (y x) + 1 (y x) f µ (x)(y x), then f µ (y) q µ (y) 1 6 L f µ y x 3. Proof Using corollary in [34] and Lemma 6 we have f µ (y) q µ (y) y x 1 y x 1 t t = 1 6 L f µ y x 3. The proof is complete. f µ (x + s(y x)) f µ (x) dsdt sl f µ y x dsdt.3 Approximate Newton directions via Conjugate Gradients algorithm In the case that pseudo-huber regularization is performed, the direction at every iteration is calculated by a Newton-CG algorithm. This implies that the Newton system (6) is solved iteratively using the CG algorithm and the process is truncated when a required accuracy is obtained. The result is an approximate solution of the Newton system, which means that an inexact Newton direction is calculated. For Newton-CG, the CG algorithm is always initialized with the zero solution and the termination criterion is set to r µ (x) η f µ (x), (1) where r µ (x) = f µ (x)d + f µ (x) is the residual in equation (6), η < 1 is a user-defined constant and d R m is the Newton-CG direction. In the literature there has been extensive theoretical work regarding the termination parameter η. The most efficient of the proposed approaches of setting η suggest that the parameter should be varied from iteration to iteration starting from a well-defined large value and steadily decreasing. According to [31], page 167, if k is the iteration index of Newton-CG, then by setting η k = min{ 1, f µ (x k ) c }, () local super-linear convergence of Newton-CG algorithm is obtained for c = 1, whilst, local quadratic convergence is obtained for c = 1. Although in practice we have observed that by setting η k =1.e 1 results in very fast convergence of the Newton-CG algorithm, it will be analyzed for η k set as in () with c = 1. The next lemma determines bounds on the norm of the Newton-CG direction d as a function of f µ (x).

12 1 Kimon Fountoulakis and Jacek Gondzio Lemma 9 Let d R m be the Newton-CG direction calculated by CG algorithm which is terminated according to criterion (1) with η < 1. Then the following holds 1 η ( µ + λ 1) f µ (x) d λ m f µ (x) Proof By squaring (1) and making simple rearrangements of it we get d ( f µ (x)) d + f µ (x) f µ (x)d + (1 η ) f µ (x). Using (17) in the above inequality we get ) λ m d ( µ + λ 1 f µ (x) d + (1 η ) f µ (x). By dropping the quadratic term λ m d from the previous inequality and making appropriate rearrangements we get d 1 η ( µ + λ 1) f µ (x). Which proves the left hand side of the result. For the right hand side, we simply use the equality d = f µ (x) 1 ( f µ (x) + r µ (x)). By taking norms and using (1) and f µ (x) 1 = 1 hand side of the result. The proof is complete. λ m we obtain the right The following property is shown in the proof of Lemma.4.1 in []. For a CG algorithm that is initialised with the zero solution p = the returned direction d i satisfies where d i := arg min{ f µ (x) p + 1 p p f µ (x)p p E i }, E i := span( f µ (x), f µ (x) f µ (x),..., f µ (x) i 1 f µ (x)). Therefore for every p E i, at t =, we get Since, d i E i, then d( f µ (x) (d i + tp) + 1 (d i + tp) f µ (x)(d i + tp)) = dt ( f µ (x) + f µ (x)d i ) p =. ( f µ (x) + f µ (x)d i ) d i = d i f µ (x)d i = f µ (x) d i. (3)

13 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 13 As a termination criterion of the Newton-CG algorithm we will use an upper bound of the Newton decrement d N x, where d N is the Newton direction at a given point x, d N x = inf{c 1 f µ (x) h c 1 h x h R m }. It is shown in [3], pages 15-16, that the optimal c 1 = d N x of the previous minimization problem is given when h = d N. Therefore, for h = d, given by CG, and due to (3) we have d N x d x. The measure d x is inexpensive to be calculated, because of the relation (3) and will be used as a termination criterion of Newton-CG algorithm. For details about the Newton decrement see page 15 in [3]. 3 Doubly-Continuation Newton-CG The pseudo-huber regularization problem (5) to be solved is a smooth and strongly convex problem. Therefore, one could directly apply an efficient and simple Newton-CG algorithm. In this paper, we make a step further and embed a Newton-CG algorithm with backtracking line-search in a doublycontinuation scheme. Continuation schemes define a sequence of minimizers as a function of a parameter. For l 1 -regularization problems as in (1), it is common that this sequence depends on parameter. In such cases the continuation path, is constructed for decreasing values of to a user defined constant. For the proposed method, the continuation path, consists of minimizers of problem (5) for simultaneously decreasing parameters and µ, till and µ µ, where and µ are pre-defined parameters. This explains the term doubly-continuation. The following two observations justify the purpose of the doubly-continuation scheme. It is shown in [] that problem (1) has the zero solution for any ϕ(). In case of problem (5), by setting µ to a relatively large value, i.e. µ 1, the regularization effect of the l 1 -norm is weakened, see Figure 1b. Therefore, ϕ() is not a sufficient condition for the subproblem (5) to have the zero solution. However, our empirical experience shows that Newton-CG is warm-started, i.e. Newton-CG converges in few iterations, when the initial solution is set to zero and ϕ(). The linear algebra is favoured when = O(µ), since in this case the Hessian matrices f µ (x) are tightly bounded, see (17). Briefly, large values of favour warm-starting, while a small ratio µ assures good numerical properties of linear systems to be solved. However, the parameters and µ have to be decreased to their pre-defined values eventually. We propose a continuation approach which consists of gradual decreasing and µ from their large initial values to the ones which determine an accurate enough solution to the original problem (1). We expect that solving a sequence of easier problems is much more efficient than giving one-shot to the actual

14 14 Kimon Fountoulakis and Jacek Gondzio problem itself. The process of decreasing the, µ parameters follows the rule l+1 = max{β 1 l, } and µ l+1 = max{β µ l, µ}, (4) where < β 1, β < 1. Observe in (4) that the parameters and µ can be decreased at every iteration arbitrarily. The former implies that the proposed method is long-step depending on these parameters. Long-step, in the sense that the reduction parameters β 1 and β is not limited in order to guarantee convergence of the doubly-continuation scheme. The pseudo-code of the proposed doubly-continuation Newton-CG and the backtracking line-search algorithms follow. In these pseudo-codes we redefine the notation of the local inner product to distinguish among continuation iterations, x,l =, f µl (x). l Algorithm dcncg 1: Input Choose x and, µ,, µ, ɛ >, < β 1, β < 1. : Set flag=. 3: For l =, 1,,... and k =, 1,,... generate l+1, µ l+1 from l, µ l and x k+1 from x k, respectively, according to the iteration: 4: while flag< do 5: for maximum Newton-CG iterations do 6: Use CG algorithm to approximately solve f µl l (xk )d k = f µl l (xk ), till (1) is satisfied with η as in () and c = 1. 7: If d k x k,l ɛl break for-loop. 8: Calculate < α k 1 which satisfies f µl l (xk + α k d k ) f µl l (xk ) c α k d k x k,l, where c (, 1 ), by using backtracking line-search. 9: set x k+1 = x k + α k d k and k = k + 1 1: end for 11: l+1 = max{β 1 l, } and µ l+1 = max{β µ l, µ} 1: If ( l+1, µ l+1 ) = (, µ), then set ɛ l = ɛ and flag=flag+1 13: l = l : end while The outer loop of the above algorithm is the doubly-continuation scheme on the parameters, µ. The inner loop is the Newton-CG algorithm with backtracking line-search that minimizes f µl (x) for fixed parameters l, µ l given from the l th l outer loop. Although, the termination criterion of CG algorithm is set as in (), in practise it can be relaxed to any η < 1. Step 8 is the backtracking line-search algorithm given below.

15 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 15 Algorithm backtracking line-search 1: Input Given l, µ l and an inexact Newton direction d for point x, calculated by CG in step 6 of algorithm dcncg and c (, 1 ), c 3 (, 1), then calculate a step-size α: : set α = 1 3: while f µl µl l (x + αd) > f (x) c α d l x,l do 4: α = c 3 α 5: end while For more details about backtracking line-search the reader is referred to Chapter 9 in []. The termination criteria of the inner loop in step 7 can be relaxed, for example, ɛ 1 ɛ and ɛ l ɛ l 1 l >, since the subproblems (5) for given l and µ l µ do not have to be solved in high accuracy. The analysis of Newton-CG algorithm, steps 5-1, without the continuation scheme is presented in the next section. This analysis will provide us with useful results that will be used for the proof of the worst-case iteration complexity of the doubly-continuation scheme. 4 Convergence analysis of Newton-CG with backtracking line-search In this section we analyze the Newton-CG with backtracking line-search, steps 5-1, of algorithm dcncg. In particular, we prove global convergence, we study the global and local convergence rates and we explicitly define a region in which Newton-CG has fast convergence rate. Additionally, worst-case iteration complexity result of this Newton-CG is presented. Since Newton-CG in this section does not depend on the continuation scheme, the indexes and µ from function f µ (x) are dropped. Moreover, the upper bound of the largest eigenvalue of f µ (x), shown in (17), is denoted by λ 1 = ( µ + λ 1 ), and the Lipschitz constant L f µ defined in Lemma 6, is denoted by L. Finally, for this section, an upper bound of the condition number of matrix f(x) for any x will be used, which is denoted by κ = λ 1 λ m. 4.1 Global convergence rate of Newton-CG with backtracking line-search In the following lemma the decrease of the objective function at every iteration of Newton-CG with backtracking line-search is calculated. In this lemma the constants c and c 3 are defined in the backtracking line-search algorithm given in Section 3. Lemma 1 Let x R m be the current iteration of Newton-CG, d R m be the Newton-CG direction calculated using the truncated CG algorithm described in subsection.3. The parameter η of the termination criterion (1) of the CG

16 16 Kimon Fountoulakis and Jacek Gondzio algorithm is set to η < 1. If x is not the minimizer of problem (5), i.e. f(x), then, the backtracking line-search algorithm will calculate a step-size ᾱ such that For this step-size ᾱ the following holds where c 4 = c c 3 1 κ ᾱ c 3 λ m λ 1. f(x) f(x(ᾱ)) > c 4 d x, and x(ᾱ) = x + ᾱd. Proof For x(α) = x + αd and from strong convexity of f(x) we have f(x(α)) f(x) + α f(x) d + α λ 1 d. If f(x) =, due to positive definiteness of f(x), see (17), the CG algorithm terminated at the i th iteration returns the vector d i which according to (3) satisfies Therefore, by setting d := d i we get d i f(x)d i = f(x) d i. Using (18) we get f(x(α)) f(x) α d x + α λ 1 d. f(x(α)) f(x) α d x + α λ 1 λ m d x. The right hand side of the above inequality is minimized for ᾱ = λm, which λ 1 gives f(x(ᾱ)) f(x) 1 λ m d λ x. 1 Observe, that for this step-size the exit condition of the backtracking linesearch algorithm is satisfied, since f(x(ᾱ)) f(x) 1 λ m d λ λ m x < f(x) c d 1 λ x. 1 Therefore, the step-size ᾱ returned by the backtracking line-search algorithm is in worst-case bounded by ᾱ c 3 λ m λ 1, which results in the following decrease of the objective function The proof is complete. f(x) f(x(ᾱ)) > c c 3 λ m λ 1 d x = c c 3 1 κ d x.

17 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 17 Standard convergence analysis treats the Newton algorithm with backtracking line-search as a two phase method. For the first phase, which is of our interest in this subsection, linear convergence rate is proved. In particular, convergence is measured according to the decrease of the objective function at every iteration, which is proved to be c 4 d N x, where d N is the Newton direction, see page 49 in []. For the Newton-CG algorithm with backtracking line-search, which is studied in this paper, it is proved in Lemma 1 that the decrease of the objective function at every iteration of Newton-CG for the first phase is c 4 d x, where d is the inexact Newton direction returned by CG. Hence, Newton-CG with backtracking line-search enjoys analogous global linear convergence rate to that of Newton algorithm with backtracking line-search. 4. Region of fast convergence rate of Newton-CG In this section we define a region based on d x, in which, by setting parameter η as in () with c = 1, the Newton-CG algorithm converges with fast rate. By fast rate it is meant that when Newton-CG is initialized in this region, then the constant required accuracy. worst-case iteration complexity result is of the form log log We will show in Subsection 4.4 that the rate of convergence which is proved in this subsection for the explicitly defined region satisfies the previously stated condition. The lemma below is crucial for the analysis in this subsection, it shows the behaviour of the function f(x) when a step along the Newton-CG direction is made. Lemma 11 Let x R m be the current iteration of Newton-CG, d R m be the Newton-CG direction calculated using the CG algorithm described in subsection.3. The parameter η of the termination criterion (1) of the CG algorithm is set to η < 1. Then f(x) f(x(α)) α d x α d x α3 L d 3 6 x, λ 3 m where x(α) = x + αd and α >. Proof Using Lemma 8 and setting y = x(α) = x + αd we get f(x(α)) f(x) + α f(x) d + α d f(x)d + α3 6 L d 3. Using (18) and (3) we get f(x(α)) f(x) α d x + α d x + α3 L d 3 6 x. λ 3 m The result is obtained by rearrangement of terms. The proof is complete.

18 18 Kimon Fountoulakis and Jacek Gondzio Based on Lemma 11, a region is defined in the following lemma, in which unit-step sizes are calculated by the backtracking line-search algorithm. Additionally, for this region, d k+1 x k+1 is bounded as a function of d k x k. In this lemma the constants c and c 3 have been defined in the backtracking line-search algorithm, given in Section 3, moreover, x k+1 = x k + d k. 3 Lemma 1 If d k x k 3(1 c ) λ ml, then the backtracking line-search algorithm calculates unit step-sizes. Moreover, if the parameter η k of the termination criterion (1) of the CG algorithm is set as in () with c = 1, then for two consequent directions d k and d k+1 and points x k and x k+1 of Newton-CG algorithm, the following holds 1 16 λ 1 + L ( 1 d k+1 λ 3 x k+1 m 16 λ 1 + L ). d k x k Proof By setting ᾱ = 1 in Lemma 11 we get f(x k ) f(x k+1 ) 1 dk x 1 L ( 1 d k 3 k 6 λ 3 x = k m 1 L ) d k 6 λ 3 x k d k x k. m if d k x k 3(1 c ) λ 3 ml we get λ 3 m f(x k ) f(x k+1 ) c d k x k, which implies that ᾱ = 1 satisfies the exist condition of the backtracking linesearch algorithm. Let us define the quantity f(x(t)) h, where x(t) = x k +td k and h R m, then we have f(x(t)) h = f(x k ) h + t(d k ) f(x k )h + t u 3 f(x(δ))[d k, d k, h]dδdu f(x k ) h + t(d k ) f(x k )h t u + 3 f(x(δ))[d k, d k, h] dδdu = f(x k ) h + t(d k ) f(x k )h t u + lim (dk ) ( f(x(δ)) f(x k ))h dδdu δ δ f(x k ) h + t(d k ) f(x k )h t u + d k h lim 1 δ ( f(x(δ)) f(x k )) dδdu δ f(x k ) h + t(d k ) f(x k )h + d k h t u = f(x k ) h + t(d k ) f(x k )h + t L dk h. L d k dδdu

19 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 19 By taking absolute values and setting t = 1 we get f(x k+1 ) h f(x k ) h + (d k ) f(x k )h + 1 L dk h From (18) and (1), by setting r k = f(x k )d k + f(x k ), we get ( η f(x k+1 ) k h f(x k ) + 1 L ) d k λ 1 m λ 3 x h k x k+1. m Since, η k is set according to (), by using (18) and Lemma 9 we get ( f(x k+1 ) 4 λ 1 h + 1 L λ 3 m (1 (η k ) ) 1 ( 16 λ 1 + L λ 3 m λ 3 m ) d k x k h x k+1. ) d k x k h x k+1 The previous result holds for every h R m, hence, by setting h = d k+1 and by using (3) we prove the second part of this lemma. The proof is complete. The following corollary states the region of fast convergence rate of Newton- CG. Corollary 1 If the parameter η k in the termination criterion (1) of the CG algorithm is set as in () with c = 1 and d k x k < ϖ, < ϖ c 5, where { c 5 = min 3(1 c ) λ 3 m L, 3 λ } m 16 λ 1 + L, then according to Lemma 1 Newton-CG convergences with fast rate. 4.3 Global Convergence of Newton-CG with backtracking line-search The global convergence result of Newton-CG algorithm with backtracking linesearch, steps 5-1 of dcncg algorithm, follows. Theorem 1 Let {x k } be a sequence generated by Newton-CG algorithm with step-sizes calculated by backtracking line-search. The parameter η of the termination criterion (1) of the CG algorithm is set to a value η < 1. The sequence {x k } converges to x, which is the minimizer of f(x) in problem (5). Proof Let {x k } be a sequence generated by Newton-CG algorithm with stepsizes ᾱ k calculated by the backtracking line-search algorithm presented in Section 3. The positive definiteness of f(x) at any x, see (17), and the fact that η < 1 in (1), imply that CG returns d k = at a point x k if and only if f(x k ) =. Hence, only at optimality CG will return a zero direction. Moreover, from Lemma 1 we get that if f(x k ) =, then ᾱ k is bounded away from zero and the function f(x) is monotonically decreasing

20 Kimon Fountoulakis and Jacek Gondzio when the step ᾱ k d k is applied. The monotonic decrease of the objective function implies that {f(x k )} converges to a limit, thus, {f(x k ) f(x k+1 )}. Therefore, the sequence {x k } converges to a point x. Using Lemma 1 and {f(x k ) f(x k+1 )} we get that d k x, hence, due to strong convexity of f(x), d k. Moreover, from Lemma 9 we deduce that if d k, then f(x k ), therefore, x is a stationary point of function f(x). Strong convexity of f(x) guarantees that a stationary point must be a minimizer. The proof is complete. 4.4 Worst-case iteration complexity of Newton-CG with backtracking line-search The following theorem shows the worst-case iteration complexity of Newton- CG with backtracking line-search in order to enter the region of fast convergence rate, i.e. d x < ϖ, where < ϖ c 5 and c 5 has been defined in Corollary 1. In this theorem the constant c 4 has been defined in Lemma 1, c and c 3 are constants of the backtracking line-search algorithm given in Section 3. Moreover, x denotes the minimizer of f(x). Theorem Starting from an initial point x, such that d x ϖ, Newton- CG with backtracking line-search and η < 1 in the termination criterion (1) of CG, requires at most ( f(x ) f(x ) ) K 1 = c 6 log c 7 ϖ, iterations to obtain a solution x k, k >, such that d k x k < ϖ, where c 6 = κ 3 (1 η ) c c 3 and c 7 = 1 8κ. Proof Let us assume an iteration index k >, then from (18), Lemmas 5 and 9 we get f(x k ) f(x ) 1 8κ dk xk, (5) and From Lemma 1 we have f(x k 1 ) f(x ) κ (1 η ) dk 1 xk 1. (6) f(x k ) < f(x k 1 ) c 4 d k 1 xk 1. (7) Combining (6), (7) and subtracting f(x ) from both sides we get ( f(x k ) f(x ) < 1 (1 η ) c ) 4 κ (f(x k 1 ) f(x )) ( < 1 (1 η ) c ) 4 k(f(x κ ) f(x )) ( = 1 (1 η ) c c ) 3 k(f(x κ 3 ) f(x ))

21 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 1 From the last inequality and (5) we get 1 ( 8κ dk x < 1 (1 η ) c c ) 3 k(f(x k κ 3 ) f(x )). Using the definitions of constants c 6 and c 7 we have d k x k < (1 1 c 6 ) k 1 c 7 (f(x ) f(x )). Hence, we conclude that after at most K 1 iterations as defined in the preamble of this theorem, the algorithm produces d k x k < ϖ. The proof is complete. The following theorem presents the worst-case iteration complexity result of Newton-CG to obtain a solution x l, of accuracy f(x l ) f(x ) < ɛ, when initialized at a point inside the region of fast convergence. The index l has been used as an iteration counter of the doubly-continuation loop, however, only for the next theorem it will be used as an iteration counter of Newton-CG algorithm. Theorem 3 Suppose that there is an iteration index k, such that d k x k < ϖ. If η in (1) is set as in () with c = 1, then Newton-CG needs at most ( c8 ) K = log log ɛ iterations to obtain a solution x l, l > k, such that f(x l ) f(x ) < ɛ, where c 8 = 3κ λ 3 m (16 λ 1 + L). Proof Suppose that there is an iteration index k such that d k x k for an index l > k, by applying Lemma 1 recursively we get < ϖ, then 1 16 λ 1 + L ( 1 d l λ 3 x l m 16 λ 1 + L d k x k λ 3 m From (18), Lemmas 5 and 9 and η k in (1) we get f(x l ) f(x ) ) l k < κ (1 η ) dl x l 8κ d l x l, By replacing (8) in the above inequality we get f(x l ) f(x ) < 3κ λ 3 ( m 1 ) l k+1 (16 λ 1 +. L) ( 1 ) l k. (8) Hence, in order to obtain a solution x l, such that f(x l ) f(x ) < ɛ, Newton- CG requires at most as many iterations as in the preamble of this theorem. The proof is complete.

22 Kimon Fountoulakis and Jacek Gondzio The following theorem summarizes the complexity result of Newton-CG with backtracking line-search. The constants c 6, c 7 and c 8 in this theorem are defined in Theorems and 3, respectively. A worst-case iteration complexity result of the form in Theorem 4 below has been first obtained in [5], for an inexact Newton method based on cubic regularization. Theorem 4 Starting from an initial point x, such that d x ϖ, Newton- CG with backtracking line-search requires at most ( f(x ) f(x ) ) ( c8 ) K 3 = c 6 log c 7 ϖ + log log ɛ iterations to converge to a solution x k, k >, of accuracy f(x k ) f(x ) < ɛ. 5 Worst-case iteration complexity of dcncg with backtracking line-search First, let us remind the reader that in Section 4 the Newton-CG algorithm has been analyzed independently of the doubly-continuation scheme, hence, for simplicity of notation, the parameters and µ have been dropped. In this section, it is necessary to introduce back the dependence on the parameters l and µ l for each continuation iteration l. In particular, when a reference to results from Section 4 is made, then the usual notation which includes these parameters will be used. Moreover, k denotes the iteration index of dcncg algorithm, x l is used to denote the initial solution of the l th inner loop or an approximate minimizer of f µl 1 (x) for l >. For example, if the termination l 1 criterion in step 7 is satisfied for x k and d k at the (l 1) th continuation iteration, then x l := x k. Finally, z l is the minimizer of f µl l (x) and z is the minimizer of f µ ( z). We start by an intuitive explanation of the worst-case iteration complexity result. For this, the following summary of doubly-continuation scheme is helpful. Summary of algorithm dcncg 1: Outer loop: For l =, 1,,..., ϑ, produce sequences { l } and {µ l }, where ϑ is the number of continuation iterations. : Inner loop: Approximately solve the subproblem minimize f µl l (x) using Newton-CG with backtracking line-search. From the summary of the continuation scheme it is clear that the total number of required steps is the sum of steps required for the approximate

23 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 3 solution of each subproblem. In the previous section, we established the worstcase iteration complexity of the inner loop, i.e. the approximate solution of the subproblem by using Newton-CG with backtracking line-search. In particular, to obtain f µl (x l+1 ) f µl (z l ) < ɛ for the l th problem, then from Theorem 4 we l l have that at most K3 l iterations of Newton-CG with backtracking line-search are required. Hence, if every l th problem is solved to accuracy f µl (x l+1 ) l f µl (z l ) < ɛ then the worst-case iteration complexity of the continuation scheme l is ϑ l=1 Kl 3. However, it would be preferable to have a bound of this result independent of the indexing l. For this reason, in this section we will show that ϑ l=1 K l 3 ϑ max l ϑ Kl 3 < ϑk 4, where K 4 is an upper bound of the worst-case K l 3 l. For convenience, let us explicitly state K l 3, where ( f µ l K3 l = c l (x l ) f µl (z l )) ( 6 log l c l ) l c l + log 7 (ϖl ) log 8, (9) ɛ c l 6 = c l 8 = ( ) 3 l + λ µ l 1 (1 η ) λ 3, c l 7 = mc c 3 ( ( 16 l ( ) λm 3 l + λ µ l 1 µ l + λ 1 ) + ( λ m ( ), 8 l + λ µ l 1 )) l + L (µ l ) ϕ and c and c 3 are constants of the backtracking line-search algorithm, which do not depend on the continuation iteration index l. Moreover, according to Corollary 1, the region of fast convergence of Newton-CG for the l th subproblem is defined for d k x k,l < ϖ l, where ϖ l c l 5 and { c l 5 = min 3(1 c ) ( λ 3 m ), l + L (µ l ) ϕ ( 16 l λ 3 m µ l + λ 1 ) + ( ) l + L (µ l ) ϕ In order to obtain the upper bound K 4 one needs to calculate upper bounds for the quantities c l 6, c l 8 and f µl (x l ) f µl (z l ), and a lower bound for c l l l 7 over all l. Additionally, a lower bound of the quantity c l 5 is required, which will be used to obtain a tighter upper bound for ϖ l over all l, as required by Corollary 1. The following remark will be used for this purpose. Remark 1 The parameters β 1, β and, µ are set such that < β 1 β < 1 µ and log β µ < log β1. Briefly, we require that the sequence of iterates {µ l } produced by (4) converges to µ before the sequence { l } converges to, }.

24 4 Kimon Fountoulakis and Jacek Gondzio although, { l } has a faster or similar rate of convergence. Then, according to rule (4) the sequences {β l 1} and {β l } are generated for which < β l 1 β l 1 and l+1 = β l 1 l and µ l+1 = β l µ l. (3) Even though this assumption is not necessary for convergence of the doublycontinuation scheme, it facilitates the proof of monotonic decrease of f µl (x) l l, and it simplifies the calculation of some non-essential results. Using Remark 1 it is easy to find upper bounds for the quantities c l 6, c l 8 and lower bounds for c l 5 and c l 7. In particular, from Remark 1 we have µ l µ l µ l. Hence, the quantities c l 5, c l 6, c l 7, c l 8 are replaced with and { c 5 = min 3(1 c ) ( λ 3 m ), µ µ + L ϕ λ 3 m ( ) ( 16 µ + λ 1 + µ µ + L ϕ ( ) 3 µ + λ c 6 = 1 λ m (1 η ) λ 3, c 7 = ( ) mc c, 3 8 µ + λ 1 ( ) λm 3 µ + λ c 8 = 1 ( ( ) ( )). 16 µ + λ 1 + µµ + L ϕ } ), Having a lower bound of c l 5 l, we can tighten the conditions of Corollary 1, which are replaced with d k xk,l < ϖ and ϖ c 5 l. (31) Another consequence of Remark 1 is the following lemma. Lemma 13 For any x, if Remark 1 is satisfied, then f µ (x) > f µ1 1 (x) and f µl 1 l 1 (x) > f µl l (x) l >. Proof For (3) we can rewrite functions f µ1 (x) and f µl (x) as f β 1 µ l β1 (x) and f βl 1 µ l 1 β l 1 1 f µl µ (x), respectively. It is easy to show that f l 1 (x) > f β µ β1 (x) and µ l 1 β l 1 1 l (x) > f βl 1 l 1 (x). The proof is complete. Therefore using Lemma 13, the fact that z l is the minimizer of function f µl l (x) and x is the initial point of the continuation scheme we get f µ (x ) > f µ (z ) > f µ1 1 (z ) > f µ1 1 (z 1 ) >... > f µ ( z). (3)

25 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 5 It remains now to find a bound for the distance ρ l = f µl (x l ) f µl (z l ). For l l l = using (3) we get ρ < f µ (x ) f µ ( z). (33) For l > we proceed by adding and subtracting the term f µl l (z l 1 ) in ρ l, ρ l = f µl l (x l ) f µl l (z l 1 ) + f µl l (z l 1 ) f µl l (z l ). We shall analyze independently the following two quantities ρ l 1 = f µl l (x l ) f µl l (z l 1 ) and ρ l = f µl l (z l 1 ) f µl l (z l ) First we start with the bound on ρ l 1. From strong convexity of f µl l (x), we have that ρ l 1 f µl l (x l ) (x l z l 1 ) λ m xl z l 1 f µl l (x l ) x l z l 1 λ m xl z l 1. (34) It is obvious that we need tight bounds for f µl l (x l ) and x l z l 1. From Remark 1, we distinguish two scenarios for the sequences {β l 1} and {β l }, first < β l 1 β l < 1, second < β l 1 1 and β l = 1. For the first case by using (3) and (19) in (34) we get ρ l 1 ( f µl 1 (x l ) + l 1( ) log(βl ) m) x l l 1 1 β l z l 1 λ m xl z l 1. While if < β l 1 1 and β l = 1, then using (3) and () in (34) we get ρ l 1 ( f µl 1 l 1 (x l ) + l 1 m) x l z l 1 λ m xl z l 1. The upper bound of ρ l based on these two cases is given by the first bound and by replacing l with and β l with β = min l β l, ρ l 1 ( f µl 1 l 1 (x l ) + ω m) x l z l 1 λ m xl z l 1, (35) ( where ω = log(β) 1 β ). Using (18), Lemmas 5 and 9 and setting η < 1 we have that ( ) ( ) l 1 + λ f µl 1 (x l µ ) l 1 1 l 1 + λ l 1 1 η d l µ l 1 1 d l (1 η )λ 1 x l,l 1 (36) m and x l z l 1 f µ l 1 (x l ) λ l 1 m ( ) 4 l 1 + λ µ l 1 1 (1 η )λ 3 m d l xl,l 1. (37)

26 6 Kimon Fountoulakis and Jacek Gondzio By replacing (36) and (37) in (35) we get ( ) ( ) ( l 1 + λ ρ l µ 1 l 1 1 d l (1 η )λ 1 x l,l 1 + ω )4 l 1 + λ µ m l 1 1 d l m (1 η )λ 3 x l,l 1 m ( ) 8 l 1 + λ µ l 1 1 (1 η )λ d l x l,l 1 m 4 ω ( ) m l 1 + λ µ = l ω ( ) m d l µ + λ (1 η )λ 3 x l,l 1 1 d l m (1 η )λ 3 x,l 1. l m Notice that at termination of the (l 1) th inner loop of dcncg algorithm, step 7, we have that d l xl,l 1 ɛ l 1. By considering a sequence of termination tolerances ɛ 1 ɛ and ɛ l ɛ l 1 l >, then d l xl,l 1 ɛ, hence, we get 4 ω ( ) m ρ l µ + λ 1 1 ɛ. (38) (1 η )λ 3 m Regarding ρ l, using (3) we get ρ l < f µ (x ) f µ ( z), (39) which is a finite constant. We can now calculate the bound of ρ l = ρ l 1 + ρ l, for l >, through the bounds of ρ l 1 and ρ l, (38) and (39), respectively, 4 ω ( ) m ρ l = ρ l 1 + ρ l µ + λ < 1 ɛ + f µ (1 η )λ 3 (x ) f µ ( z) = ρ, (4) m which holds if every x l point satisfies d l x l,l 1 ɛ. In the following theorem the worst-case iteration complexity of the doublycontinuation scheme is given. Let us remind the reader that in this theorem, the constants c 5, c 6, c 7, c 8 and ρ have been defined in the beginning of this section, the constant ϖ satisfies (31), η < 1 is the tolerance of the termination criterion (1) of the CG algorithm. Theorem 5 If β 1, are set according to Remark 1 and ɛ 1 ɛ, ɛ l ɛ l 1 l > in step 7 of dcncg algorithm, then the dcncg algorithm initialized at x requires at most ( ( ρ ) ( )) K 5 = c 6 log c 7 ϖ + log log log c8 ɛ β1 iterations to converge at a solution x of accuracy λ m f µ ( x) f µ ( z) < ɛ, where ɛ = ( ) (ɛ ). 8 µ + λ 1

27 A Second-Order Method for Strongly Convex l 1 -Regularization Problems 7 Proof According to Remark 1 the number of continuation iterations is ϑ = log β1. A worst-case K l 3 over all l, denoted at the beginning of this section with K 4, is given by replacing the quantities ϖ l, c l 5, c l 6, c l 7, c l 8 and f µl (x l ) l f µl (z l ) in (9) with appropriate bounds over all continuation iterations l, as it l has been shown previously in this section. First, c l 5, c l 7 and c l 6, c l 8 are replaced with their lower and upper bounds c 5, c 7 and c 6, c 8, respectively, over all l. The conditions d k x k,l < ϖ l and ϖ l c l 5 l, of Corollary 1, are replaced by the tighter conditions (31). For l = we bound f µl (x l ) f µl (z l ) by (33), while for l l l > we bound this distance by (4), which holds if d k xk,l 1 ɛ in step 7 of dcncg. To guarantee the previous condition, sufficient iterations of Newton- CG are performed at the continuation step l 1. In order to calculate a worstcase iteration complexity such that d k x k,l 1 ɛ two cases are distinguished for the magnitude of ɛ, first ɛ < ϖ, second ɛ ϖ. The worst case is ( f µ (x ) f µ ( z) ) given for ɛ < ϖ. If l =, according to Theorem 4, c 6 log c 7 ϖ + ( ) c log log 8 ɛ l iterations are needed for Newton-CG to converge to a solution x l+1 such that f µl (x l+1 ) f µl (z l ) < ɛ l and d l+1 l l x l+1,l < ɛ, where ɛ l = λ ( m ) (ɛ ). The previous bound on d l+1 xl+1,l is obtained by using (18), 8 l µ l +λ1 Lemmas 5 and 9, which give us λ m ( ) d l+1 x l+1,l f µl (x l+1 ) f µl (z l ) < ɛ l 8 l l l + λ µ l 1 l. Hence from (4) we get that ρ 1 < ρ. If the previous statement holds, ( then ) for l = 1, according to Theorem 4 and (4) by performing c 6 log ρ c 7 ϖ + ( ) c log log 8 ɛ l iterations of Newton-CG, it is guaranteed that d l+1 x l+1,l < ɛ and ρ < ρ. In a similar way it is guaranteed that ρ l < ρ l >. Additionally, ρ < ρ, because the upper bound of ρ in (33) is smaller than ρ, therefore ρ l < ρ l. From Remark 1 we have µ l µ l µ l, hence, ɛ ɛ l l. Notice that the condition d l+1 x l+1,l < ɛ is preserved if we require accuracy ɛ for all l continuation iterations instead of ɛ l. We conclude that ( ρ ) ( ) max l ϑ Kl 3 K 4 = c 6 log c 7 ϖ + log log. c8 ɛ If K 4 iterations of Newton-CG are performed for every l th continuation iteration, then d l+1 x l+1,l < ɛ and f µl (x l+1 ) f µl (z l ) < ɛ l. By multiplying K l l 4 with ϑ we obtain the result. The proof is complete.

Edinburgh Research Explorer

Edinburgh Research Explorer A Second-Order Method for Strongly Convex L1-Regularization Problems Citation for published version: Fountoulakis, K & Gondzio, J 216, 'A Second-Order Method for Strongly Convex