CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION


SAHAR KARIMI AND STEPHEN VAVASIS

Abstract. In this paper we present a variant of the conjugate gradient (CG) algorithm in which we invoke a subspace minimization subproblem on each iteration. We call this algorithm CGSO, for conjugate gradient with subspace optimization. It is related to earlier work by Nemirovsky and Yudin. We apply the algorithm to solve unconstrained strictly convex problems. As with other CG algorithms, the update step on each iteration is a linear combination of the last gradient and last update. Unlike some other conjugate gradient methods, our algorithm attains a theoretical complexity bound of $O(\sqrt{L/\ell}\,\log(1/\epsilon))$, where the ratio $L/\ell$ characterizes the strong convexity of the objective function. In practice, CGSO competes with other CG-type algorithms by incorporating some second order information in each iteration.

Key words. Conjugate Gradient, First Order Algorithms, Convex Optimization

AMS subject classifications.

1. Introduction. The method of conjugate gradients (CG) was introduced by Hestenes and Stiefel in 1952 [8] to minimize quadratic objective functions of the form $f(x) = x^t A x/2 - b^t x$ (or equivalently, to solve $Ax = b$), where $A$ is a symmetric positive definite matrix. We refer to this algorithm as linear conjugate gradient. Later, the algorithm was generalized to nonlinear conjugate gradient, which addresses unconstrained minimization of arbitrary differentiable objective functions, by Fletcher and Reeves [3], Polak and Ribière [13], and others. Nonlinear CG algorithms all have the property that when applied to a quadratic objective function and coupled with an exact line search, they reduce to linear CG. None of these algorithms, however, has a known complexity bound when applied to nonquadratic functions. Indeed, Nemirovsky and Yudin [9] argue that their worst-case complexity for strictly convex functions is quite poor.

Nemirovsky and Yudin, on the other hand, found a different generalization of CG. Actually, their algorithm, which we refer to as NYCG, is not a generalization of the CG algorithm itself but rather of its derivation from some equations and inequalities that underlie it. Their algorithm achieves a worst-case complexity bound of $O(\sqrt{L/\ell}\,\ln(1/\epsilon))$. Here $\epsilon$ is the desired relative accuracy, that is, $\epsilon = (f(x^n) - f(x^*))/(f(x^0) - f(x^*))$, where $x^0$ is the starting point, $x^*$ is the optimizer, $x^n$ is the final iterate, and $L$ and $\ell$ characterize the convexity of $f$. In particular, we assume that for all $x, y$ lying in the level set of $x^0$,

  $\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$,   (1.1)

  $f(y) \ge f(x) + \langle\nabla f(x),\, y - x\rangle + \frac{\ell}{2}\|y - x\|^2$.   (1.2)

For example, in the case of a convex quadratic function $f(x) = x^t A x/2 - b^t x$, $L/\ell$ is the condition number of $A$. From inequality (1.1), it follows that

  $f(y) \le f(x) + \langle\nabla f(x),\, y - x\rangle + \frac{L}{2}\|y - x\|^2$,   (1.3)

which will be useful in our analysis.

One drawback of the NYCG algorithm is that in the case of quadratic objective functions, it does not reduce to linear CG (and in fact, is much slower in practice). A second drawback is that it requires prior knowledge of the ratio $L/\ell$. This is because

the algorithm needs to be restarted every $O(\sqrt{L/\ell})$ iterations in order to achieve the above-mentioned complexity bound. A further disadvantage of NYCG is that it has a relatively expensive computation on every iteration, namely, one must solve a two-dimensional convex optimization problem using the ellipsoid method.

This drawback was later remedied by a different algorithm due to Nesterov [10]. Nesterov's algorithm generates two sequences of iterates; the first sequence of iterates is generated such that a sufficient reduction in the objective function is achieved, and the new iterate in the second sequence is an affine combination of the last two iterates of the first sequence. Nesterov's algorithm, however, still has the two disadvantages mentioned in the previous paragraph.

The purpose of this paper is to develop an algorithm that we call conjugate gradient with subspace optimization (CGSO) that mainly avoids the disadvantages mentioned above. CGSO is most closely related to NYCG. In particular, we seek a conjugate-gradient-like algorithm with the following properties.
1. The algorithm should reduce to linear CG when the objective function is quadratic.
2. The algorithm should have the complexity bound of $O(\sqrt{L/\ell}\,\ln(1/\epsilon))$ iterations for convex functions satisfying (1.1) and (1.2).
3. The algorithm should not require prior knowledge of any parameters describing $f$.
4. The cost per iteration should not be excessive.

The algorithm we propose achieves goals 1-3, and mostly achieves goal 4. Like NYCG, our algorithm must solve a convex optimization subproblem on every iteration. The dimension of the subproblem is always at least 2 and is determined adaptively by the algorithm. The worst-case upper bound we are able to prove for the dimension of this subproblem is $O(\log j)$, where $j$ is the number of iterations so far. However, in our testing, the dimension of the subproblem was 2 in almost every case; in one test case it reached the value 3 for a few iterations, but it never exceeded 3. Furthermore, we find that solving the subproblem is usually fairly efficient because we usually apply Newton's method, using automatic differentiation to obtain the necessary first and second derivatives.

Goal 3, the condition of no prior knowledge of parameters, has the obvious benefit of making the algorithm easier to apply in practice. It also has a second, subtler benefit. For some convex problems, e.g., minimization of a log-barrier function, there is no prior bound on $L/\ell$ over the domain of the function since the derivatives tend to infinity at the boundaries. However, as the minimizer is approached, the bad behavior at the boundaries becomes irrelevant, and there is a new, smaller ratio $L/\ell$ relevant for level sets in the neighborhood of the minimizer. In this case, CGSO automatically adapts to the improved value of $L/\ell$. Such adaptation is also possible with NYCG, and Nesterov's algorithm too, but, as far as we know, the adaptation must be done by the user and cannot be easily automated.

Methods reviewed in this paper are among techniques that are generally referred to as first-order algorithms, because they use only first derivative information of the function in each iteration. Due to the successful theory of the NYCG and Nesterov techniques, first-order algorithms have attracted many researchers during the last decade and have been extended to solving different classes of problems. Nesterov in [11] proposed a variation of his earlier algorithms for minimizing a nonsmooth function.
In addition to nonsmooth optimization, Nesterov's algorithm has been adapted for constrained problems with simple enough feasible regions so that a projection on these

sets can be easily computed. One may refer to [15] and references therein for a more elaborate discussion on different adaptations of Nesterov's algorithm. The focus of this paper, however, is more on the CG algorithm and not on first-order techniques in general.

We now provide further background on conjugate gradient. The original linear CG has the following form:

  $x^{j+1} = x^j - \frac{(r^j)^t d^j}{(d^j)^t A d^j}\, d^j$,   (1.4a)

  $d^{j+1} = -r^{j+1} + \frac{(r^{j+1})^t A d^j}{(d^j)^t A d^j}\, d^j$.   (1.4b)

In the above equations $r^j$ is $\nabla f(x^j) = A x^j - b$ and $d^0 = -r^0$. It is possible to show that the number of iterations in linear CG is bounded by the dimension of the problem, $n$. For more details on linear CG, one may refer to [5] or [12].

Nonlinear CG was proposed by Fletcher and Reeves [3] as an adaptation of the above algorithm for minimizing a general nonlinear function. The general form of this algorithm is as follows:

  $x^{j+1} = x^j + \alpha_j d^j$,   (1.5a)

  $d^{j+1} = -g^{j+1} + \beta_j d^j$.   (1.5b)

Here, $d^j$ is the search direction at each iteration, $g^{j+1}$ is the gradient of the function at the $(j+1)$th iterate, i.e., $\nabla f(x^{j+1})$, and $\alpha_j$ is the step size, usually determined by a line search. Different updating rules for $\beta_j$ give us different variants of nonlinear CG. The most common formulas for computing $\beta_j$ are:

  Fletcher-Reeves (1964): $\beta^{FR} = \frac{\|g^{j+1}\|^2}{\|g^j\|^2}$,

  Polak-Ribière (1969): $\beta^{PR} = \frac{(g^{j+1})^t (g^{j+1} - g^j)}{\|g^j\|^2}$.

Hager and Zhang [6] present a complete list of all updating rules in their survey on nonlinear CG. The convergence of nonlinear CG is highly dependent on the line search; for some variants the exact line search is crucial. There are numerous papers devoted to the study of global convergence of nonlinear CG algorithms, most of which discuss variants of nonlinear CG that do not rely on exact line search to be globally convergent. Al-Baali [1] discusses the convergence of the Fletcher-Reeves algorithm with inexact line search. Gilbert and Nocedal [4] study the convergence of nonlinear CG algorithms with no restart and no exact line search. Dai and Yuan [2] present a nonlinear CG for which the standard Wolfe condition suffices. A recent variant of CG has been proposed by Hager and Zhang [7] that relies on a line search satisfying the Wolfe conditions. Furthermore this algorithm has the advantage that every search direction is a descent direction, which is not necessarily the case in nonlinear CG.

From Yuan and Stoer's perspective [16], CG is a technique in which the search direction $d^{j+1}$ lies in the subspace $\mathrm{Sp}\{g^j, d^j\}$. In the algorithm they propose, they compute the new search direction by minimizing a quadratic approximation of the objective function over the mentioned subspace. A more generalized

form of CG, called the Heavy Ball Method, was introduced by Polyak [14], in which $x^{j+1}$ is $x^j + \alpha(-g^j) + \beta(x^j - x^{j-1})$. He proved a geometric progression rate for this algorithm when $\alpha$ and $\beta$ belong to a specific range. In the same work, Polyak reviews the CG method; our simplest form of CGSO, which is given in section 2.1, coincides with his presentation of CG.

The remainder of this paper is devoted to CGSO and its properties. In section 2 we describe the algorithm. Our main result on the convergence of the algorithm is presented in section 2.3. The implementation of the algorithm is discussed in section 3; furthermore, the results and a comparison of CGSO with CG are presented in that section.

2. CGSO. In this section we present CGSO for solving the problem

  $\min_{x \in R^n} f(x)$,   (2.1)

where $f(x)$ is a strictly convex function characterized by parameters $L$ and $\ell$. We follow standard notation throughout this paper. $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors of proper dimension, and $\|\cdot\|$ stands for the 2-norm of a vector unless otherwise stated. Bold lower case characters and upper case characters are used for vectors and matrices, respectively, and their superscript states the iteration count.

2.1. The Algorithm. Let $g(x)$ denote the derivative of $f(x)$; CGSO, in its preliminary form, is as follows:

  $x^0$ = arbitrary;
  for $j = 1, 2, \ldots$
    $x^{j+1} = x^j + \alpha_j g(x^j) + \beta_j d^j$ where $d^j = x^j - x^{j-1}$ and
    $\alpha_j, \beta_j = \mathrm{argmin}_{\alpha,\beta}\, f(x^j + \alpha g(x^j) + \beta d^j)$

As we shall see, some modifications are necessary to achieve the desired complexity bound. The above algorithm is certainly a generalization of nonlinear CG, since each search direction is a combination of previous gradients. Notice that the above algorithm is a descent method (i.e. $f(x^{j+1}) \le f(x^j)$). Furthermore, the above algorithm performs at least as well as steepest descent in each iteration; hence it converges to a stationary point, which, for the class of convex problems, coincides with the minimizer of the function.

For convenience, let us represent $g(x^j)$ by $g^j$, and let $v_f(x)$ denote the residual of the function, i.e. $f(x) - f(x^*)$. By strong convexity of the function, we get the following properties for the above algorithm:
(a) $f(x^{j+1}) \le f(x^j) - \frac{1}{2L}\|g^j\|^2$
(b) $\langle g^j, x^* - x^j\rangle \le f(x^*) - f(x^j)$
(c) $v_f(x^0) = f(x^0) - f(x^*) \ge \frac{\ell}{2}\|x^* - x^0\|^2$

Property (a) follows from (1.3) and the fact that $f(x^{j+1}) \le f(x^j - \frac{1}{L} g^j)$. Property (b) is true by convexity of the function, and property (c) is a direct derivation from inequality (1.2). We are now ready to present the main lemma on the complexity bound of the algorithm.

Lemma 2.1. Suppose $m \ge 8\rho\sqrt{L/\ell}$ and

  $\bigl(f(x^m) - f(x^0)\bigr)\sum_{j=0}^{m-1}\lambda^j + \sum_{j=0}^{m-1}\lambda^j\langle g^j, x^j - x^0\rangle < 0$,   (2.2)

and

  $\sum_{j=0}^{m-1}\lambda^j\|g^j\| \le \rho\sqrt{\sum_{j=0}^{m-1}(\lambda^j)^2\|g^j\|^2}$,   (2.3)

are satisfied, where $\rho$ is a constant $\ge 1$, and $\lambda^j = \frac{\sqrt{f(x^j) - f(x^{j+1})}}{\|g^j\|}$. Then the residual of the function is divided in half after $m$ iterations; i.e. $v_f(x^m) \le \frac{1}{2} v_f(x^0)$.

Proof. Our proof is an extension of the proof in section 7.3 in [9]. Suppose by contradiction that $m \ge 8\rho\sqrt{L/\ell}$, (2.2) and (2.3) are satisfied, but $v_f(x^m) > \frac{v_f(x^0)}{2}$. By definition of $\lambda^j$,

  $f(x^{j+1}) = f(x^j) - (\lambda^j)^2\|g^j\|^2$,

hence

  $v_f(x^{j+1}) = v_f(x^j) - (\lambda^j)^2\|g^j\|^2$.

Summing these over $j = 0, \ldots, m-1$, we get

  $0 \le v_f(x^m) = v_f(x^0) - \sum_{j=0}^{m-1}(\lambda^j)^2\|g^j\|^2$,

or equivalently,

  $\sum_{j=0}^{m-1}(\lambda^j)^2\|g^j\|^2 \le v_f(x^0)$.   (2.4)

By convexity of the function we have

  $\langle g^j, x^j - x^*\rangle \ge -f(x^*) + f(x^j) = v_f(x^j)$,

and so

  $\langle g^j, x^* - x^0\rangle - \langle g^j, x^j - x^0\rangle \le -v_f(x^j) \le -v_f(x^m) < -\frac{v_f(x^0)}{2}$.

Let us consider the weighted sum of all the above inequalities for $j = 0, \ldots, m-1$ with weights $\lambda^j$ to get:

  $\sum_{j=0}^{m-1}\lambda^j\langle g^j, x^* - x^0\rangle - \sum_{j=0}^{m-1}\lambda^j\langle g^j, x^j - x^0\rangle < -\frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j$,

which can be rearranged to the following form,

  $\sum_{j=0}^{m-1}\lambda^j\langle g^j, x^* - x^0\rangle < -\frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j + \sum_{j=0}^{m-1}\lambda^j\langle g^j, x^j - x^0\rangle$.

Equivalently we can rewrite the above inequality as:

  $\sum_{j=0}^{m-1}\lambda^j\langle g^j, x^* - x^0\rangle < \frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j + \bigl(f(x^*) - f(x^0)\bigr)\sum_{j=0}^{m-1}\lambda^j + \sum_{j=0}^{m-1}\lambda^j\langle g^j, x^j - x^0\rangle$.

Using inequality (2.2) along with the facts that $f(x^*) \le f(x^j)$ and $\lambda^j \ge 0$ for all $j$, we get:

  $\sum_{j=0}^{m-1}\lambda^j\langle g^j, x^* - x^0\rangle < -\frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j$.   (2.5)

By the Cauchy-Schwarz inequality we have

  $-\sum_{j=0}^{m-1}\lambda^j\|g^j\|\,\|x^* - x^0\| \le \sum_{j=0}^{m-1}\lambda^j\langle g^j, x^* - x^0\rangle < -\frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j$,

hence

  $\sum_{j=0}^{m-1}\lambda^j\|g^j\|\,\|x^* - x^0\| > \frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j$.   (2.6)

By property (c) we have

  $\|x^* - x^0\| \le \sqrt{\frac{2\,v_f(x^0)}{\ell}}$.   (2.7)

Furthermore, by inequalities (2.3) and (2.4) we get:

  $\sum_{j=0}^{m-1}\lambda^j\|g^j\| \le \rho\sqrt{\sum_{j=0}^{m-1}(\lambda^j)^2\|g^j\|^2} \le \rho\sqrt{v_f(x^0)}$.   (2.8)

Replacing inequalities (2.7) and (2.8) in inequality (2.6), we get

  $\rho\sqrt{v_f(x^0)}\,\sqrt{\frac{2\,v_f(x^0)}{\ell}} > \frac{v_f(x^0)}{2}\sum_{j=0}^{m-1}\lambda^j$.   (2.9)

Notice that by definition of $\lambda^j$ and property (a), $\lambda^j \ge \frac{1}{\sqrt{2L}}$ for all $j$, so $\sum_{j=0}^{m-1}\lambda^j \ge \frac{m}{\sqrt{2L}}$. Using this fact in inequality (2.9), we get

  $\rho\sqrt{v_f(x^0)}\,\sqrt{\frac{2\,v_f(x^0)}{\ell}} > \frac{v_f(x^0)}{2}\left(\frac{1}{2L}\right)^{1/2} m$,

therefore

  $m < 8\rho\sqrt{L/\ell}$,   (2.10)

which contradicts our assumption on the value of $m$.

Lemma 2.1 shows that under conditions (2.2) and (2.3), the residual of the function is divided in half every $m = O(\sqrt{L/\ell})$ iterations. For the next sequence of $m$ iterations, a further reduction of $\frac{1}{2}$ is achieved provided (2.2) and (2.3) hold, with $x^m$ substituted in place of $x^0$. Hence by letting $x^m$ be the new $x^0$ and repeating the same algorithm, we can find the $\epsilon$-optimal solution in $8\rho\sqrt{L/\ell}\,\log_2\frac{1}{\epsilon}$ iterations. Of course, $m$ is not known to our algorithm, a point to which we return in the next section.

2.2. Restarting. As mentioned before, the algorithm should use first $x^0$, then $x^m$, then $x^{2m}$, and so on when checking (2.2) and (2.3) in order to achieve the desired complexity bound. We use the term restarting to refer to the process of replacing $x^0$ in inequalities (2.2) and (2.3) with $x^m$ for some $m > 0$. Notice that this process does not change any of the iterates; it only changes the interval of indices over which we check inequalities (2.2) and (2.3). If we know the parameters of the function, i.e. $L$ and $\ell$, we can compute $m$ directly; hence we would be able to determine exactly when we need to restart the algorithm. In many cases, however, finding the parameters of the function is a nontrivial problem itself. In order to have an algorithm which does not rely on any prior knowledge about the function, we propose the following technique. Since $2^p \le m \le 2^{p+1}$ for some $p$, by restarting the algorithm every $2^{p+1}$ iterations, we are guaranteed that the residual of the function is divided in half between every two consecutive restarts, which contain at most $2m = 16\rho\sqrt{L/\ell}$ iterations, where the last statement follows from the fact that $2^{p+1} \le 2m$. Since the exact value of $p$ is usually unknown, we repeat the above procedure for every value of $p \in \{P_l, \ldots, P_u\}$. In the section on the implementation of the algorithm we will discuss the range of $p$ we used; however, note that $P_u$ does not need to exceed $\log_2 j$, where $j$ is the current iteration count. This is because inequalities (2.2) and (2.3) are the same for all $p \ge \log_2 j$.

We can now state the CGSO algorithm in a more complete form:

  $x^0$ = arbitrary; $\rho$: given;
  For $j = 1, 2, \ldots$
    $x^{j+1} = \mathrm{argmin}_{x \in E^j} f(x)$ where $E^j = x^j + \mathrm{Sp}\{g^j, d^j\}$
    let $\lambda^j = \frac{\sqrt{f(x^j) - f(x^{j+1})}}{\|g^j\|}$
    For $p = P_l, \ldots, \log_2 j$
      If $j + 1 = k_p 2^p$ for some integer $k_p$
        let $r_p = (k_p - 1)\,2^p$; check
          $\bigl(f(x^{j+1}) - f(x^{r_p})\bigr)\sum_{i=r_p}^{j}\lambda^i + \sum_{i=r_p}^{j}\lambda^i\langle g^i, x^i - x^{r_p}\rangle < 0$   and
          $\sum_{i=r_p}^{j}\lambda^i\|g^i\| \le \rho\sqrt{\sum_{i=r_p}^{j}(\lambda^i)^2\|g^i\|^2}$
        If any of the above inequalities fails, take the correction step (which will be defined in the subsequent section).
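The block bookkeeping in the loop above can be organized in a few lines. The following Python/NumPy sketch is illustrative only; the list-of-arrays storage and the function names are assumptions for exposition (an actual implementation would keep running sums, as the paper notes in section 3.1), and it is not the authors' MATLAB code.

```python
import numpy as np

def block_checks_ok(lams, gs, xs, f_vals, r_p, j, rho):
    """Evaluate the two checks of CGSO on the block of iterates r_p..j.

    lams[i], gs[i], xs[i] hold lambda^i, g^i, x^i; f_vals[i] = f(x^i),
    with index j+1 available. Returns True if both inequalities hold.
    """
    idx = range(r_p, j + 1)
    lam_sum = sum(lams[i] for i in idx)
    # first check: (f(x^{j+1}) - f(x^{r_p})) * sum(lambda)
    #              + sum lambda^i <g^i, x^i - x^{r_p}>  <  0
    lhs1 = (f_vals[j + 1] - f_vals[r_p]) * lam_sum
    lhs1 += sum(lams[i] * np.dot(gs[i], xs[i] - xs[r_p]) for i in idx)
    # second check: sum lambda^i ||g^i||
    #               <= rho * sqrt(sum (lambda^i)^2 ||g^i||^2)
    lhs2 = sum(lams[i] * np.linalg.norm(gs[i]) for i in idx)
    rhs2 = rho * np.sqrt(sum((lams[i] * np.linalg.norm(gs[i])) ** 2 for i in idx))
    return (lhs1 < 0.0) and (lhs2 <= rhs2)

def restart_boundaries(j, p_lo, p_hi):
    """Yield (p, r_p) for every block size 2^p that ends at iterate j+1."""
    for p in range(p_lo, p_hi + 1):
        size = 2 ** p
        if (j + 1) % size == 0:
            k_p = (j + 1) // size
            yield p, (k_p - 1) * size
```

Inside the main loop one would iterate over `restart_boundaries(j, P_l, P_u)` and call `block_checks_ok` for each block returned; any failure marks that value of $p$ for a correction step over the next block.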

2.3. Correction Step. Let us refer to the set of iterates between two consecutive restarts as a block of iterates; in other words, for any $p$, the set of iterates $x^0, x^1, \ldots, x^{2^p-1}$ is the first block of size $2^p$, $x^{2^p}, x^{2^p+1}, \ldots, x^{2(2^p)-1}$ is the second block of size $2^p$, and so on. At the end of each block we check inequalities (2.2) and (2.3). If they are satisfied and $2^p \ge 8\rho\sqrt{L/\ell}$, then by Lemma 2.1 we know that the residual of the function is divided in half; however, if any of these inequalities fails, then as mentioned in the previous section we need to take a correction step for the next block of iterates. The correction step is basically computing the next block of iterates in a way that satisfaction of inequalities (2.2) and (2.3) is guaranteed at the end of this block. Then the correction step is omitted in the subsequent blocks until the inequalities are violated again.

Recall that in an ordinary step, the new iterate $x^{j+1}$ is calculated by a 2-dimensional search on the plane passing through $x^j$ and parallel to the 2-dimensional subspace spanned by $g^j$ and $d^j$. Suppose at least one of the inequalities (2.2) and (2.3) is violated for the $k$th block of $p$, i.e. for the block of iterates $x^{r_p}, \ldots, x^{r_p+2^p-1}$ where $r_p = (k-1)2^p$. Then for the next block we search for the new iterate $x^{j+1}$ on the space

  $x^j + \mathrm{Sp}\{g^j,\, d^j,\, q_p^j,\, x^j - x^{r_p}\}$,

where $q_p^j = \sum_{i=r_p}^{j}\lambda^i g^i$. Finding the new iterate $x^{j+1}$ through a search on the space that, in addition to $g^j$ and $d^j$, includes $q_p^j$ and $x^j - x^{r_p}$ is what we refer to as the correction step. Notice that for each $p$ with violated constraints we increase the dimension of the search space by 2. However, the dimension of the search space never exceeds $O(P_u) = 2 + 2\log_2 j$, which happens to be the case when the inequalities are violated for all possible values of $p$.

It is quite easy to see that inequalities (2.2) and (2.3) are satisfied for the $(k+1)$th block of $p$ when we take the correction step throughout it. By the KKT condition, we have $\langle g^j, x^j - x^{r_p}\rangle = 0$ for all $j$ in this block. Using this, along with the fact that $f(x^j) < f(x^{r_p})$ and non-negativity of $\lambda^j$ for all $j$, we derive (2.2). Similarly, one can argue that by KKT $\langle g^j, q_p^{j-1}\rangle = 0$ for all $j$, hence

  $\Bigl\|\sum_{i=r_p}^{r_p+2^p-1}\lambda^i g^i\Bigr\|^2 = \sum_{i=r_p}^{r_p+2^p-1}(\lambda^i)^2\|g^i\|^2$,

which means inequality (2.3) is satisfied. After finding the iterates of one block through a correction step, the algorithm switches back to taking a regular step until the next failure of the inequalities. We can now present the algorithm in its entirety.

Algorithm 1.
  $x^0$ = arbitrary; $\rho$: given; $S = \emptyset$.
  for $j = 1, 2, \ldots$
    $x^{j+1} = \mathrm{argmin}_{x \in E^j} f(x)$ where $E^j = x^j + \mathrm{Sp}\{g^j,\, d^j,\, \{q_p^j\}_{p\in S},\, \{x^j - x^{r_p}\}_{p\in S}\}$
    let $\lambda^j = \frac{\sqrt{f(x^j) - f(x^{j+1})}}{\|g^j\|}$
    for $p = P_l, \ldots, \log_2 j$
      if $j + 1 = k_p 2^p$ for some integer $k_p$
        let $r_p = (k_p - 1)\,2^p$
        if $p \in S$
          $S = S \setminus \{p\}$
        else (i.e. if $p \notin S$), check

          $\bigl(f(x^{j+1}) - f(x^{r_p})\bigr)\sum_{i=r_p}^{j}\lambda^i + \sum_{i=r_p}^{j}\lambda^i\langle g^i, x^i - x^{r_p}\rangle < 0$   and
          $\sum_{i=r_p}^{j}\lambda^i\|g^i\| \le \rho\sqrt{\sum_{i=r_p}^{j}(\lambda^i)^2\|g^i\|^2}$
          if any of the above inequalities fails, $S = S \cup \{p\}$
        end
      end
    end
  end

Theorem 2.2. Suppose $m \ge 8\rho\sqrt{L/\ell}$, and $x^j$ is a sequence generated by Algorithm 1 for solving problem (2.1); then $v_f(x^{(n+4)m}) \le \frac{1}{2} v_f(x^{nm})$.

Proof. Let $p \ge P_l$ be the integer for which $2^{p-1} \le m \le 2^p$, and let $s_p$ stand for $2^p$. Using Algorithm 1, we are guaranteed that for at least one of any two consecutive blocks of size $s_p$ inequalities (2.2) and (2.3) are satisfied. The size of this block is $s_p \ge m \ge 8\rho\sqrt{L/\ell}$, and hence by Lemma 2.1 we have

  $v_f(x^{nm+2s_p}) \le \frac{1}{2} v_f(x^{nm})$.   (2.11)

Since $2s_p \le 4m$, we have $f(x^{nm+4m}) \le f(x^{nm+2s_p})$; hence

  $v_f(x^{nm+4m}) \le v_f(x^{nm+2s_p})$.   (2.12)

(2.11) and (2.12) give us the result we wanted to show.

In our experiments with the algorithm, however, we only include the $x^j - x^{r_p}$ term in the correction step. We will discuss this and a few other remarks on the implementation of the algorithm in the following section.

3. Implementation.

3.1. Solving Each Iteration. Obviously the most important part of CGSO is solving the subspace optimization problem in each iteration. In our implementation of the algorithm, we used Newton's method for this task unless it fails to rapidly converge to the optimum. In the case of failure of Newton's algorithm, the ellipsoid method carries out the task of solving the optimization problem. In other words, we assign an upper bound to the number of iterations that Newton's method may take, and if it fails to converge within the given number of iterations, the algorithm switches to the ellipsoid method for finding the next iterate.

Recall that at the $(j+1)$th iterate, we search for $x^{j+1}$ in the space of vectors $x = x^j + \alpha g^j + \beta d^j + Qa + Rb$, where $Q \in R^{n\times|S|}$ is the matrix formed by columns $q_p^j$ for all $p \in S$; $R$ is the matrix of the same dimension with columns $x^j - x^{r_p}$ for all $p \in S$; and $\alpha, \beta \in R$ and $a, b \in R^{|S|}$ are the coefficients that we want to find. Let $y$ denote the variable of the subspace optimization problem, i.e., $y = [\alpha, \beta, a^t, b^t]^t$; in addition, let $B = [g^j, d^j, Q, R]$ and $K = 2|S| + 2$. We can now state the formal presentation of the subspace optimization problem,

  $\min_{y \in R^K} f(x^j + By)$.   (3.1)

As mentioned above, we solve problem (3.1) with Newton's method. Letting $f(y) = f(x^j + By)$ and using the chain rule, we get the following formulas for the gradient

and Hessian of each Newton iteration,

  $\nabla f(y) = B^t \nabla f(x)$,   (3.2)

  $\nabla^2 f(y) = B^t \nabla^2 f(x)\, B$.   (3.3)

Notice that some second order information of the function comes into play in equation (3.3), and we believe that this is one of the reasons that CGSO stands out in practice. We compute $\nabla f(x)$ and $\nabla^2 f(x)$ directly when $f(x)$ is simple enough. For more complicated functions we use automatic differentiation (AD) in backward mode to compute $\nabla f(x)$ and $\nabla^2 f(x) B$. Let $B^{(k)}$ denote the $k$th column of matrix $B$. Backward AD enables us to keep the computational cost of $\nabla f(x)$ within a constant factor of the objective function evaluation cost, and the cost of computing $\nabla^2 f(x) B^{(k)}$ within a constant factor of the computational cost of a gradient evaluation. The storage space required in backward AD, however, is more than the required storage in forward AD, and in the worst case it can be proportional to the number of operations required for computing $f(x)$. In spite of that, we are able to keep the storage required by backward AD for the class of problems we studied in $O(s)$, where $s$ is the space required to compute $f(x)$. Details on our test problems are presented in section 3.2. For more information on AD, one may refer to [12].

In addition to the storage required by AD, we need to store $x^j$ and the matrix $B$; we also need to update and store $x^{r_p}$, $\sum_{i=r_p}^{j}\lambda^i$, $\sum_{i=r_p}^{j}\lambda^i\langle g^i, x^i - x^{r_p}\rangle$, $\sum_{i=r_p}^{j}\lambda^i g^i$, $\sum_{i=r_p}^{j}\lambda^i\|g^i\|$, and $\sum_{i=r_p}^{j}(\lambda^i)^2\|g^i\|^2$ for all $p \in \{P_l, \ldots, \log_2 j\}$. In our experiments with CGSO we take $P_l$ to be 4. We find that $K$ is equal to 2 in every case except one, in which it reaches the value of 3. Hence the required storage space for the above elements is in $O(n\,\max\{K, \log_2 j\})$.

A difficulty we need to overcome in solving a problem with CGSO, especially when we are trying to achieve high accuracy, is round-off error. In particular, $f(x^j) - f(x^{j+1})$, which is required for computing $\lambda^j$, gets more and more inaccurate as the iterates approach the optimum. To overcome this, we took advantage of the second order Taylor series expansion. We have a subroutine that analyzes the absolute error of $f(x^j) - f(x^{j+1})$ computed directly and through the Taylor series, i.e.

  $f(x^j) - f(x^{j+1}) \approx -\nabla f(x^j)^t (x^{j+1} - x^j) - \tfrac{1}{2}(x^{j+1} - x^j)^t \nabla^2 f(x^j)(x^{j+1} - x^j)$;

the one with smaller error is accepted. In some cases the error analysis is not easy, hence a heuristic is used to choose the preferred formula. The difference of two objective values appears in inequality (2.2) as well, in $f(x^m) - f(x^0)$; the same subroutine is used to compute this term.

3.2. Numerical Results. We have tested our algorithm on the following classes of problems:

  $f_1(x) = -\sum_{i=1}^{m}\log(a_i^t x - b_i)$

  $f_2(x) = c^t x - \log(\det(C - \mathrm{Diag}(x)))$

  $f_3(x) = \sum_{i=1}^{m}(a_i^t x - b_i)^d$, where $d$ is a given even integer.

Functions $f_1$ and $f_2$ are log-barrier functions, and $f_3$ is an approximation to the infinity norm of the vector $Ax - b$. Note that we need to restrict the degree $d$ to even numbers to preserve the convexity of $f_3$. All the codes for CGSO, the ellipsoid method, and automatic differentiation are written in MATLAB. The Hessians of $f_1$ and $f_3$ can be derived easily in closed form. The Hessian of $f_2$ was obtained by applying backward mode AD by hand to a program that computes $f_2$.
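As an illustration of how the closed-form derivatives of one of these test problems feed the reduced quantities (3.2)-(3.3) and the Newton solve of section 3.1, here is a short Python/NumPy sketch for $f_1$. It is an expository re-implementation under assumed interfaces, not the authors' MATLAB code; the cap of 15 Newton iterations mirrors the rule, mentioned below, after which CGSO switches to the ellipsoid method.

```python
import numpy as np

def f1(x, A, b):
    """Log-barrier test function f1(x) = -sum_i log(a_i^t x - b_i)."""
    s = A @ x - b
    return -np.sum(np.log(s)) if np.all(s > 0) else np.inf

def grad_f1(x, A, b):
    s = A @ x - b
    return -A.T @ (1.0 / s)                      # -sum_i a_i / (a_i^t x - b_i)

def hess_f1(x, A, b):
    s = A @ x - b
    return A.T @ ((1.0 / s**2)[:, None] * A)     # sum_i a_i a_i^t / (a_i^t x - b_i)^2

def newton_on_subspace(x_j, B, A, b, tol, max_newton=15):
    """Approximately solve min_y f1(x_j + B y), i.e. problem (3.1)."""
    y = np.zeros(B.shape[1])
    for _ in range(max_newton):
        x = x_j + B @ y
        g_red = B.T @ grad_f1(x, A, b)           # reduced gradient, eq. (3.2)
        if np.linalg.norm(g_red) <= tol:
            break
        H_red = B.T @ hess_f1(x, A, b) @ B       # reduced Hessian, eq. (3.3)
        step = np.linalg.solve(H_red, -g_red)    # K-by-K system; K is tiny
        t = 1.0                                   # damp the step to stay feasible
        while not np.isfinite(f1(x_j + B @ (y + t * step), A, b)) and t > 1e-12:
            t *= 0.5
        y = y + t * step
    return x_j + B @ y
```

The step halving that merely restores feasibility stands in for the more careful safeguards (and the ellipsoid fallback) that an actual implementation would need.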

The stopping criteria we used for both CGSO and its subproblems are $\|\nabla f(x)\| \le \epsilon$ and $\|\nabla f(y)\| \le \epsilon_N$, respectively. In our implementation, $\epsilon_N$ at iteration $j$ is $\frac{\|B^t \nabla f(x^j)\|}{100}$; however, if the subproblem is solved by Newton's method, the obtained solution is usually more accurate due to the fast local convergence of Newton's method. Notice that for $f_1$ and $f_2$ we have some hidden constraints, namely $a_i^t x - b_i > 0$ in $f_1$ and $C - \mathrm{Diag}(x) \succ 0$ in $f_2$. Therefore CGSO switches to the ellipsoid method if Newton's method does not converge in 15 iterations or if the iterates get infeasible.

In [7], Hager and Zhang compared their variant of CG with L-BFGS and PRP+, and established the superior performance of their variant of CG. Hence, we compare our algorithm with their variant of CG, represented by CG_HZ in Table 3.1, which summarizes this comparison. The top portion of the table describes the instances, which we generated randomly. Parameters $m$, $n$, $d$, and $\epsilon$ are defined as above; $m_f$ for $f_2$ represents $c = m_f e$, where $e$ is the all-ones vector. The point of this parameter is that larger values of $m_f$ force $C - \mathrm{Diag}(x^*)$, for the optimizer $x^*$, closer to the boundary of the feasible region; hence the problem becomes more ill-conditioned. Let $A$ be the $m$ by $n$ matrix formed by the $a_i^t$ as its rows; ds. indicates the density, i.e. the percentage of nonzero entries of $A$ for $f_1$ and $f_3$, and the density of $C$ for $f_2$. It. count for both CG_HZ and CGSO indicates the total number of major iterations the algorithm takes before convergence. LS count is the total number of iterations performed by the line search routine in CG_HZ; for each iterate of LS we need one gradient evaluation. In CGSO, one gradient and one Hessian evaluation are required for each Newton iterate, and one gradient is calculated in each iteration of the ellipsoid algorithm.

As Table 3.1 shows, the number of iterations CGSO takes is less than that of CG_HZ in all instances, and the number of iterations for solving the subproblems is significantly less than the number of line search iterations; especially for instances 3 and 4, this difference is considerable. The last row of the table implies that no correction step was taken in any of these instances, except one. Actually, the purpose of Ins. 7 is to show that the correction step might be required for some problems. In this instance $A$ is a square matrix with an $O(10^3)$ condition number. The correction step is taken throughout 7 blocks corresponding to $P_l$, for which the size of the subproblems reaches 3. This, however, is consistent with the theory, because the larger the condition number is, the larger $\sqrt{L/\ell}$ is; therefore $2^{P_l}$ may not be a good approximation of $8\rho\sqrt{L/\ell}$.

3.3. Parameter $\rho$. Recall that $\rho$ is a parameter of CGSO required for checking inequality (2.3). As mentioned before, we do not check inequality (2.3) in our implementation of the algorithm; instead we gather the values of $\rho$ and study their trend. Table 3.2 summarizes the maximum value $\rho$ reached in each instance over all values of $p$. Figures 3.1 and 3.2 depict $\rho$ for Ins. 1 and Ins. 3, respectively. We chose these instances because they have higher $\rho_{max}$ and iteration count in each category. Our observations suggest that a reasonably small bound for $\rho$ should suffice. Notice that the plotted values imply that $\rho$ reaches its maximum for some $p$ and then decreases for higher values of $p$, so it does not grow with $p$. Furthermore, as we get closer to the optimum, the parameter $\rho$ gets closer to 1. This is a common pattern for all test problems we had. Instead of fixing $\rho$, one may adaptively update this parameter.
In other words, we can assign a value to $\rho$, and if inequality (2.3) fails for more than a certain number of blocks, then we increase $\rho$; as the algorithm progresses towards the optimum, we can decrease $\rho$.
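One plausible reading of this adaptive scheme is sketched below in Python; the thresholds, factors, and parameter names are illustrative assumptions, not values prescribed by the paper.

```python
def update_rho(rho, recent_block_failures, near_optimum,
               fail_limit=3, grow=2.0, shrink=0.9, rho_min=1.0):
    """Illustrative adaptive update of the CGSO parameter rho.

    Increase rho when inequality (2.3) keeps failing across blocks;
    gently shrink it back toward 1 as the iterates approach the optimum.
    """
    if recent_block_failures > fail_limit:
        return rho * grow
    if near_optimum:
        return max(rho_min, rho * shrink)
    return rho
```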

Table 3.1. Comparison of CGSO and CG_HZ on randomly generated instances Ins. 1-7 of the test problems $f_1$, $f_2$, and $f_3$. The top portion of the table lists the instance parameters $m$, $n$, $m_f$, $d$, ds., and $\epsilon$ (the $\epsilon$ values are 1e-8, 1e-8, 1e-3, 1e-8, 1e-18, 1e-18, 1e-8 for Ins. 1-7); the lower portion reports It. count and LS count for CG_HZ, and It. count, Newton, ellipsoid, and Corr. counts for CGSO.

Table 3.2. Parameter $\rho$: the value of $\rho_{max}$ for Ins. 1-7.

4. Conclusion. We presented CGSO for solving unconstrained strongly convex functions. CGSO is a variant of CG, since the update step in each iteration is a combination of previous gradients, and it reduces to linear CG when applied to a quadratic function. The coefficients of the previous gradients, however, do not follow an updating rule and are found by a subspace optimization subproblem. We have discussed that these subproblems are mostly two dimensional; in some cases the dimension of the subproblems may reach slightly higher values, but it certainly never exceeds $O(\log j)$, where $j$ is the iteration count. We have also shown that CGSO benefits from the optimal complexity bound of $O(\sqrt{L/\ell}\,\log(1/\epsilon))$ in theory. CGSO does not depend on any prior information about the function and can easily be implemented. In practice, it outperforms the variant of CG proposed by Hager and Zhang [7], which is known to be stronger than other techniques such as L-BFGS or PRP+. The practical efficiency of CGSO can be improved by incorporating some second order information of the function in solving the subproblems with Newton's method, as discussed in the previous section.

Fig. 3.1. Parameter $\rho$ for Ins. 1.

Fig. 3.2. Parameter $\rho$ for Ins. 3.

REFERENCES

[1] M. Al-Baali. Descent property and global convergence of the Fletcher-Reeves method with inexact line search. IMA Journal of Numerical Analysis, 5:121-124, 1985.
[2] Y. H. Dai and Y. Yuan. A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on Optimization, 10(1):177-182, 1999.
[3] R. Fletcher and C. M. Reeves. Function minimization by conjugate gradients. Computer Journal, 7:149-154, 1964.
[4] J. C. Gilbert and J. Nocedal. Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization, 2(1):21-42, 1992.
[5] G. H. Golub and C. F. Van Loan. Matrix Computations. Hopkins Fulfillment Service.
[6] W. W. Hager and H. Zhang. A survey of nonlinear conjugate gradient methods.
[7] W. W. Hager and H. Zhang. A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16(1):170-192, 2005.
[8] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal

of Research of the National Bureau of Standards, 49:409-436, 1952.
[9] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[10] Y. E. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372-376, 1983.
[11] Y. E. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127-152, 2005.
[12] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Science.
[13] E. Polak and G. Ribière. Note sur la convergence de méthodes de directions conjuguées. Revue Française d'Informatique et de Recherche Opérationnelle, 16:35-43, 1969.
[14] Boris T. Polyak. Introduction to Optimization. Optimization Software Inc., 1987.
[15] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization.
[16] Y. Yuan and J. Stoer. A subspace study on conjugate gradient algorithms. ZAMM (Zeitschrift für angewandte Mathematik und Mechanik), 75(1):69-77, 1995.


Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

Integrating Factor Methods as Exponential Integrators

Integrating Factor Methods as Exponential Integrators Integrating Factor Methods as Exponentia Integrators Borisav V. Minchev Department of Mathematica Science, NTNU, 7491 Trondheim, Norway Borko.Minchev@ii.uib.no Abstract. Recenty a ot of effort has been

More information

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING 8 APPENDIX 8.1 EMPIRICAL EVALUATION OF SAMPLING We wish to evauate the empirica accuracy of our samping technique on concrete exampes. We do this in two ways. First, we can sort the eements by probabiity

More information

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ).

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ). Bourgain s Theorem Computationa and Metric Geometry Instructor: Yury Makarychev 1 Notation Given a metric space (X, d) and S X, the distance from x X to S equas d(x, S) = inf d(x, s). s S The distance

More information

On Non-Optimally Expanding Sets in Grassmann Graphs

On Non-Optimally Expanding Sets in Grassmann Graphs ectronic Cooquium on Computationa Compexity, Report No. 94 (07) On Non-Optimay xpanding Sets in Grassmann Graphs Irit Dinur Subhash Khot Guy Kinder Dor Minzer Mui Safra Abstract The paper investigates

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Approximation and Fast Calculation of Non-local Boundary Conditions for the Time-dependent Schrödinger Equation

Approximation and Fast Calculation of Non-local Boundary Conditions for the Time-dependent Schrödinger Equation Approximation and Fast Cacuation of Non-oca Boundary Conditions for the Time-dependent Schrödinger Equation Anton Arnod, Matthias Ehrhardt 2, and Ivan Sofronov 3 Universität Münster, Institut für Numerische

More information

MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES

MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES ANDRZEJ DUDEK AND ANDRZEJ RUCIŃSKI Abstract. For positive integers k and, a k-uniform hypergraph is caed a oose path of ength, and denoted by

More information

arxiv: v1 [math.fa] 23 Aug 2018

arxiv: v1 [math.fa] 23 Aug 2018 An Exact Upper Bound on the L p Lebesgue Constant and The -Rényi Entropy Power Inequaity for Integer Vaued Random Variabes arxiv:808.0773v [math.fa] 3 Aug 08 Peng Xu, Mokshay Madiman, James Mebourne Abstract

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

Paper presented at the Workshop on Space Charge Physics in High Intensity Hadron Rings, sponsored by Brookhaven National Laboratory, May 4-7,1998

Paper presented at the Workshop on Space Charge Physics in High Intensity Hadron Rings, sponsored by Brookhaven National Laboratory, May 4-7,1998 Paper presented at the Workshop on Space Charge Physics in High ntensity Hadron Rings, sponsored by Brookhaven Nationa Laboratory, May 4-7,998 Noninear Sef Consistent High Resoution Beam Hao Agorithm in

More information

Math 124B January 31, 2012

Math 124B January 31, 2012 Math 124B January 31, 212 Viktor Grigoryan 7 Inhomogeneous boundary vaue probems Having studied the theory of Fourier series, with which we successfuy soved boundary vaue probems for the homogeneous heat

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

STABILITY OF A PARAMETRICALLY EXCITED DAMPED INVERTED PENDULUM 1. INTRODUCTION

STABILITY OF A PARAMETRICALLY EXCITED DAMPED INVERTED PENDULUM 1. INTRODUCTION Journa of Sound and Vibration (996) 98(5), 643 65 STABILITY OF A PARAMETRICALLY EXCITED DAMPED INVERTED PENDULUM G. ERDOS AND T. SINGH Department of Mechanica and Aerospace Engineering, SUNY at Buffao,

More information

Wave Equation Dirichlet Boundary Conditions

Wave Equation Dirichlet Boundary Conditions Wave Equation Dirichet Boundary Conditions u tt x, t = c u xx x, t, < x 1 u, t =, u, t = ux, = fx u t x, = gx Look for simpe soutions in the form ux, t = XxT t Substituting into 13 and dividing

More information

Noname manuscript No. (will be inserted by the editor) Can Li Ignacio E. Grossmann

Noname manuscript No. (will be inserted by the editor) Can Li Ignacio E. Grossmann Noname manuscript No. (wi be inserted by the editor) A finite ɛ-convergence agorithm for two-stage convex 0-1 mixed-integer noninear stochastic programs with mixed-integer first and second stage variabes

More information

King Fahd University of Petroleum & Minerals

King Fahd University of Petroleum & Minerals King Fahd University of Petroeum & Mineras DEPARTMENT OF MATHEMATICAL SCIENCES Technica Report Series TR 369 December 6 Genera decay of soutions of a viscoeastic equation Saim A. Messaoudi DHAHRAN 3161

More information

Effect of transport ratio on source term in determination of surface emission coefficient

Effect of transport ratio on source term in determination of surface emission coefficient Internationa Journa of heoretica & Appied Sciences, (): 74-78(9) ISSN : 975-78 Effect of transport ratio on source term in determination of surface emission coefficient Sanjeev Kumar and Apna Mishra epartment

More information